Other Research

In our lab, we are open to conducting a range of research that doesn’t fit into our otherwise broad portfolio. Here you will find information about publications from these projects.

Blockchain

Bitcoin’s success has led to significant interest in its underlying components, particularly blockchain technology. Over 10 years after Bitcoin’s initial release, the community still suffers from a lack of clarity regarding what properties define blockchain technology, its relationship to similar technologies, and which of its proposed use-cases are tenable and which are little more than hype. In our research,1 we have answered four common questions regarding blockchain technology:

  1. What exactly is blockchain technology?
  2. What capabilities does it provide?
  3. What are good applications for blockchain technology?
  4. How does it relate to other distributed technologies (e.g., distributed databases)?

Our findings show that blockchain technology is most appropriate under three conditions: (a) a need for shared governance and operation (i.e., no trusted central party can carry out these responsibilities), (b) a need for auditable state, and (c) a need for resilience to data loss.
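The three conditions can be read as a simple checklist. The sketch below is only an illustration of that reading, not a tool from our research: the `UseCase` type, its field names, and the example use cases are hypothetical, and it assumes that all three conditions must hold together.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UseCase:
    """Hypothetical description of a candidate application."""
    name: str
    needs_shared_governance: bool     # (a) no trusted central party can govern or operate it
    needs_auditable_state: bool       # (b) participants must be able to audit the state
    needs_data_loss_resilience: bool  # (c) data must survive the loss of individual copies


def unmet_conditions(use_case: UseCase) -> list[str]:
    """Return the conditions (a)-(c) that the use case does not satisfy."""
    missing = []
    if not use_case.needs_shared_governance:
        missing.append("shared governance and operation")
    if not use_case.needs_auditable_state:
        missing.append("auditable state")
    if not use_case.needs_data_loss_resilience:
        missing.append("resilience to data loss")
    return missing


def blockchain_is_appropriate(use_case: UseCase) -> bool:
    """Assumption: blockchain fits only when all three conditions hold."""
    return not unmet_conditions(use_case)


if __name__ == "__main__":
    # Illustrative inputs only; the flags are assumptions, not research results.
    candidates = [
        UseCase("multi-organization record sharing", True, True, True),
        UseCase("single-company internal database", False, True, True),
    ]
    for candidate in candidates:
        missing = unmet_conditions(candidate)
        verdict = "candidate for blockchain" if not missing else "missing: " + ", ".join(missing)
        print(f"{candidate.name}: {verdict}")
```

A use case that fails condition (a), such as a system run by one trusted organization, is flagged as missing shared governance, which is where a conventional distributed database is usually the simpler choice.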

We are currently exploring applications of blockchain technology in two areas: distributed document management for health care, and support for multi-organizational humanitarian aid and disaster relief. In both settings, no central authority handles governance and operation, yet the organizations involved need to work with one another. Our research aims to build tools that support both use cases, including in situations with limited Internet connectivity, which are critical for both applications but are not well supported by existing blockchain techniques.


Publications

Conferences

Abstract:  Crowdsourcing platforms have traditionally been designed with a focus on workstation interfaces, restricting the flexibility that crowdworkers need. Recognizing this limitation and the need for more adaptable platforms, prior research has highlighted the diverse work processes of crowdworkers, influenced by factors such as device type and work stage. However, these variables have largely been studied in isolation. Our study is the first to explore the interconnected variabilities among these factors within the crowdwork community. Through a survey involving 150 Amazon Mechanical Turk crowdworkers, we uncovered three distinct groups characterized by their interrelated variabilities in key work aspects. The largest group exhibits a reliance on traditional devices, showing limited interest in integrating smartphones and tablets into their work routines. The second-largest group also primarily uses traditional devices but expresses a desire for supportive tools and scripts that enhance productivity across all devices, particularly smartphones and tablets. The smallest group actively uses and strongly prefers non-workstation devices, especially smartphones and tablets, for their crowdworking activities. We translate our findings into design insights for platform developers, discussing the implications for creating more personalized, flexible, and efficient crowdsourcing environments. Additionally, we highlight the unique work practices of these crowdworker clusters, offering a contrast to those of more traditional and established worker groups.
Abstract:  Modern websites rely on various client-side web resources, such as JavaScript libraries, to provide end-users with rich and interactive web experiences. Unfortunately, anecdotal evidence shows that improperly managed client-side resources could open up attack surfaces that adversaries can exploit. However, there is still no comprehensive understanding of the updating practices among web developers or of the potential impact of inaccuracies in Common Vulnerabilities and Exposures (CVE) information on the security of the web ecosystem. In this paper, we conduct a longitudinal (four-year) measurement study of the security practices and implications of client-side resources (e.g., JavaScript libraries and Adobe Flash) across the Web. Specifically, we first collect a large-scale dataset of 157.2M webpages of Alexa Top 1M websites over four years in the wild. Analyzing the dataset, we find that, on average, 41.2% of websites in each of the four years carry at least one vulnerable client-side resource (e.g., JavaScript or Adobe Flash). We also reveal that vulnerable JavaScript library versions are frequently observed in the wild, suggesting a concerning lag in update practices. On average, the window of vulnerability caused by unpatched client-side resources lasts 531.2 days from the release of security patches and covers 25,337 websites. Furthermore, we manually investigate the fidelity of CVE reports on client-side resources, leveraging PoC (Proof of Concept) code. We find that 13 CVE reports (out of 27) have incorrect vulnerable version information, which may impact security-related tasks such as security updates.
Abstract:  Decompilation is a crucial capability in forensic analysis, facilitating analysis of unknown binaries. The recent rise of Python malware has brought attention to Python decompilers that aim to obtain a source code representation from a Python binary. However, Python decompilers fail to handle various binaries, limiting their capabilities in forensic analysis. This paper proposes a novel solution that transforms a decompilation error-inducing Python binary into a decompilable binary. Our key intuition is that we can resolve the decompilation errors by transforming error-inducing code blocks in the input binary into another form. The core of our approach is the concept of Forensically Equivalent Transformation (FET), which allows non-semantics-preserving transformations in the context of forensic analysis. We carefully define the FETs to minimize their undesirable consequences while fixing various error-inducing instructions that are difficult to resolve when preserving the exact semantics. We evaluate the prototype of our approach with 17,117 real-world Python malware samples that cause decompilation errors in five popular decompilers. It successfully identifies and fixes 77,022 errors. Our approach also handles anti-analysis techniques, including opcode remapping, and helps migrate Python 3.9 binaries to 3.8 binaries.
Abstract:  Packet stream analysis is essential for identifying attack connections early, while they are still in progress, enabling timely responses to protect system resources. However, implementing effective analysis poses several challenges, including out-of-order packet sequences introduced by network dynamics and class imbalance, with only a small fraction of attack connections available to characterize. To overcome these challenges, we present two deep sequence models: (i) a bidirectional recurrent structure designed for resilience to out-of-order packets, and (ii) a pre-training-enabled sequence-to-sequence structure designed to better handle unbalanced class distributions using self-supervised learning. We evaluate the presented models using a real network dataset created from month-long real traffic traces collected from backbone links with the associated intrusion log. The experimental results support the feasibility of the presented models, which achieve an F1 score of up to 94.8% with the first five packets (k=5), outperforming baseline deep learning models.
Abstract:  Software systems may contain critical program components such as patented program logic or sensitive data. When those components are reverse-engineered by adversaries, it can cause significant damage (e.g., financial loss or operational failures). While protecting critical program components (e.g., code or data) in software systems is of utmost importance, existing approaches, unfortunately, have two major weaknesses: (1) they can be reverse-engineered via various program analysis techniques and (2) when an adversary obtains a legitimate-looking critical program component, he or she can be sure that it is genuine. In this paper, we propose Ambitr, a novel technique that hides critical program components. The core of Ambitr is the Ambiguous Translator, which generates the critical program components when the input is the correct secret key. The translator is ambiguous in that it accepts any input and produces a number of legitimate-looking outputs, making it difficult to know whether an input is the correct secret key or not. The executions of the translator when it processes the correct secret key and when it processes other inputs are also indistinguishable, making the analysis inconclusive. Our evaluation results show that static, dynamic, and symbolic analysis techniques fail to identify the hidden information in Ambitr. We also demonstrate that manual analysis of Ambitr is extremely challenging.
Abstract:  Server-side malware is one of the prevalent threats that can affect a large number of clients who visit the compromised server. In this paper, we propose Dazzle-attack, a new advanced server-side attack that is resilient to forensic analysis such as reverse-engineering. Dazzle-attack retrieves typical (and non-suspicious) contents from benign and uncompromised websites to avoid detection and to mislead the investigation into erroneously associating the attacks with benign websites. Dazzle-attack leverages a specialized state machine that accepts any inputs and produces outputs with respect to the inputs, which substantially enlarges the input-output space and makes reverse-engineering significantly more difficult. We develop a prototype of Dazzle-attack and conduct an empirical evaluation to show that it imposes significant challenges on forensic analysis.
Abstract:  This research explores the possibility of a new anti-analysis technique, carefully designed to attack weaknesses of existing program analysis approaches. It encodes the program code snippet to be hidden, and its decoding process is implemented by a sophisticated state machine that produces multiple outputs depending on its inputs. The key idea of the proposed technique is to decode the program code ambiguously, resulting in multiple decoded code snippets that are challenging to distinguish from each other. Our approach is stealthier than previous similar approaches because its execution does not behave differently when it decodes correctly than when it decodes incorrectly. This paper also presents analyses of the weaknesses of existing techniques and discusses potential improvements. We implement and evaluate a proof-of-concept of the approach, and our preliminary results show that the proposed technique imposes various new and unique challenges on program analysis techniques.
Abstract:  Outlier detection has been shown to be a promising machine learning technique for a diverse array of fields and problem areas. However, traditional, supervised outlier detection is not well suited for problems such as network intrusion detection, where properly labelled data is scarce. This has created a focus on extending these approaches to be unsupervised, removing the need for explicit labels, but at the cost of poorer performance compared to their supervised counterparts. Recent work has explored ways of making up for this, such as creating ensembles of diverse models, or even diverse learning algorithms, to jointly classify data. While using unsupervised, heterogeneous ensembles of learning algorithms has been proposed as a viable next step for research, the implications of how these ensembles are built and used have not been explored.

Journals and Magazines

Abstract:  Bitcoin's success has led to significant interest in its underlying components, particularly blockchain technology. Over 10 years after Bitcoin's initial release, the community still suffers from a lack of clarity regarding what properties define blockchain technology, its relationship to similar technologies, and which of its proposed use-cases are tenable and which are little more than hype. In this paper we answer four common questions regarding blockchain technology: (1) what exactly is blockchain technology, (2) what capabilities does it provide, (3) what are good applications for blockchain technology, and (4) how does it relate to other distributed technologies (e.g., distributed databases). We accomplish this goal by using grounded theory (a structured approach to gathering and analyzing qualitative data) to thoroughly analyze a large corpus of literature on blockchain technology. This method enables us to answer the above questions while limiting researcher bias, separating thought leadership from peddled hype, and identifying open research questions related to blockchain technology. The audience for this paper is broad, as it aims to help researchers in a variety of areas come to a better understanding of blockchain technology and identify whether it may be of use in their own research.

Ph.D. Dissertations

Abstract:  Crowdworkers are drawn to the profession in part by the flexibility it affords, but the current design of crowdsourcing platforms limits this flexibility. It is therefore important to support the overall flexibility of crowdworkers. Incorporating a variety of device types into the workflow plays an important role in supporting this flexibility; however, each device type requires a tailored workflow. The standard workflow of crowdworkers consists of stages of work such as managing and completing tasks. We hypothesize that the factors and characteristics of task completion and task management differ across device types and that a tailored workflow must account for these differences. This dissertation therefore aims to explore and understand the factors and characteristics of task completion and task management on different devices in order to support the overall flexibility of crowdworkers. To achieve this, the dissertation makes four contributions: (1) it characterizes task completion on smartphones and the factors affecting it, to support a tailored smartphone workflow in crowdwork; (2) it examines crowdworkers’ current task completion and task management practices and expectations when working on smartphones, tablets, speakers, and smartwatches, to support flexibility on all of these devices; (3) building on this broad understanding of practices and expectations across devices, it identifies systematic differences among crowdworkers in order to develop customizable support that depends on workers’ individual need for flexibility in crowdsourcing platforms; and (4) it studies another popular crowdsourcing platform, Prolific, to understand the work practices of Prolific workers and to compare Prolific with Amazon MTurk, yielding a comprehensive understanding of the factors and characteristics that support flexibility in different crowdsourcing environments.

Technical Reports

Abstract:  Bitcoin's success has led to significant interest in its underlying components, particularly blockchain technology. Over 10 years after Bitcoin's initial release, the community still suffers from a lack of clarity regarding what properties define blockchain technology, its relationship to similar technologies, and which of its proposed use-cases are tenable and which are little more than hype. In this paper we answer four common questions regarding blockchain technology: (1) what exactly is blockchain technology, (2) what capabilities does it provide, (3) what are good applications for blockchain technology, and (4) how does it relate to other distributed technologies (e.g., distributed databases). We accomplish this goal by using grounded theory (a structured approach to gathering and analyzing qualitative data) to thoroughly analyze a large corpus of literature on blockchain technology. This method enables us to answer the above questions while limiting researcher bias, separating thought leadership from peddled hype, and identifying open research questions related to blockchain technology. The audience for this paper is broad, as it aims to help researchers in a variety of areas come to a better understanding of blockchain technology and identify whether it may be of use in their own research.