Doowon Kim

Co-Director of the USER Lab
Assistant Professor

Email: doowon@utk.edu
Address: Min H. Kao Building, Room 345
1520 Middle Drive
Knoxville, TN 37996-2250
Phone: 865-974-8061

Education

  • Ph.D. in Computer Science, The University of Maryland, College Park, 2020

Bio

I’m an Assistant Professor in the Department of Electrical Engineering and Computer Science at the University of Tennessee, Knoxville. I received my Ph.D. in Computer Science from the University of Maryland, College Park, in May 2020, where I worked with Prof. Tudor Dumitras.

My research interests include computer security (data-driven security and usable security) and computer networks (Internet measurement). I am interested in identifying the root causes of security threats by understanding the actors involved (e.g., adversaries and end users) from data-driven and human-centered perspectives (i.e., usability studies).

I have been awarded the 5th annual NSA Best Scientific Cybersecurity Paper (2017) and the Ann G. Wylie Dissertation Fellowship (2019). My work on the Code-Signing PKI has been featured in Ars Technica, The Register, Schneier on Security, Threatpost, and other outlets.


Selected Recent Publications

Abstract:  Decompilation is a crucial capability in forensic analysis, facilitating analysis of unknown binaries. The recent rise of Python malware has brought attention to Python decompilers that aim to obtain source code representation from a Python binary. However, Python decompilers fail to handle various binaries, limiting their capabilities in forensic analysis. This paper proposes a novel solution that transforms a decompilation error-inducing Python binary into a decompilable binary. Our key intuition is that we can resolve the decompilation errors by transforming error-inducing code blocks in the input binary into another form. The core of our approach is the concept of Forensically Equivalent Transformation (FET), which allows non-semantics-preserving transformation in the context of forensic analysis. We carefully define the FETs to minimize their undesirable consequences while fixing various error-inducing instructions that are difficult to solve when preserving the exact semantics. We evaluate the prototype of our approach with 17,117 real-world Python malware samples causing decompilation errors in five popular decompilers. It successfully identifies and fixes 77,022 errors. Our approach also handles anti-analysis techniques, including opcode remapping, and helps migrate Python 3.9 binaries to 3.8 binaries.
Abstract:  The packet stream analysis is essential for the early identification of attack connections while in progress, enabling timely responses to protect system resources. However, there are several challenges for implementing effective analysis, including out-of-order packet sequences introduced due to network dynamics and class imbalance with a small fraction of attack connections available to characterize. To overcome these challenges, we present two deep sequence models: (i) a bidirectional recurrent structure designed for resilience to out-of-order packets, and (ii) a pre-training-enabled sequence-to-sequence structure designed for better dealing with unbalanced class distributions using self-supervised learning. We evaluate the presented models using a real network dataset created from month-long real traffic traces collected from backbone links with the associated intrusion log. The experimental results support the feasibility of the presented models with up to 94.8% in F1 score with the first five packets (k=5), outperforming baseline deep learning models.
Abstract:  Software systems may contain critical program components such as patented program logic or sensitive data. When those components are reverse-engineered by adversaries, it can cause significant damage (e.g., financial loss or operational failures). While protecting critical program components (e.g., code or data) in software systems is of utmost importance, existing approaches, unfortunately, have two major weaknesses: (1) they can be reverse-engineered via various program analysis techniques and (2) when an adversary obtains a legitimate-looking critical program component, he or she can be sure that it is genuine. In this paper, we propose Ambitr, a novel technique that hides critical program components. The core of Ambitr is the Ambiguous Translator, which can generate the critical program components when the input is the correct secret key. The translator is ambiguous as it can accept any input and produce a number of legitimate-looking outputs, making it difficult to know whether an input is the correct secret key or not. The executions of the translator when it processes the correct secret key and other inputs are also indistinguishable, making the analysis inconclusive. Our evaluation results show that static, dynamic, and symbolic analysis techniques fail to identify the hidden information in Ambitr. We also demonstrate that manual analysis of Ambitr is extremely challenging.
Abstract:  Server-side malware is one of the prevalent threats that can affect a large number of clients who visit the compromised server. In this paper, we propose Dazzle-attack, a new advanced server-side attack that is resilient to forensic analysis such as reverse-engineering. Dazzle-attack retrieves typical (and non-suspicious) contents from benign and uncompromised websites to avoid detection and mislead the investigation to erroneously associate the attacks with benign websites. Dazzle-attack leverages a specialized state machine that accepts any inputs and produces outputs with respect to the inputs, which substantially enlarges the input-output space and makes reverse-engineering efforts significantly more difficult. We develop a prototype of Dazzle-attack and conduct an empirical evaluation of Dazzle-attack to show that it imposes significant challenges to forensic analysis.
Abstract:  This research explores the possibility of a new anti-analysis technique, carefully designed to attack weaknesses of the existing program analysis approaches. It encodes a program code snippet to hide, and its decoding process is implemented by a sophisticated state machine that produces multiple outputs depending on inputs. The key idea of the proposed technique is to ambiguously decode the program code, resulting in multiple decoded code snippets that are challenging to distinguish from each other. Our approach is stealthier than previous similar approaches as its execution does not exhibit different behaviors between when it decodes correctly or incorrectly. This paper also presents analyses of weaknesses of existing techniques and discusses potential improvements. We implement and evaluate the proof of concept approach, and our preliminary results show that the proposed technique imposes various new unique challenges to the program analysis technique.

Conferences

Abstract:  Decompilation is a crucial capability in forensic analysis, facilitating analysis of unknown binaries. The recent rise of Python malware has brought attention to Python decompilers that aim to obtain source code representation from a Python binary. However, Python decompilers fail to handle various binaries, limiting their capabilities in forensic analysis. This paper proposes a novel solution that transforms a decompilation error-inducing Python binary into a decompilable binary. Our key intuition is that we can resolve the decompilation errors by transforming error-inducing code blocks in the input binary into another form. The core of our approach is the concept of Forensically Equivalent Transformation (FET), which allows non-semantics-preserving transformation in the context of forensic analysis. We carefully define the FETs to minimize their undesirable consequences while fixing various error-inducing instructions that are difficult to solve when preserving the exact semantics. We evaluate the prototype of our approach with 17,117 real-world Python malware samples causing decompilation errors in five popular decompilers. It successfully identifies and fixes 77,022 errors. Our approach also handles anti-analysis techniques, including opcode remapping, and helps migrate Python 3.9 binaries to 3.8 binaries.
Abstract:  The packet stream analysis is essential for the early identification of attack connections while in progress, enabling timely responses to protect system resources. However, there are several challenges for implementing effective analysis, including out-of-order packet sequences introduced due to network dynamics and class imbalance with a small fraction of attack connections available to characterize. To overcome these challenges, we present two deep sequence models: (i) a bidirectional recurrent structure designed for resilience to out-of-order packets, and (ii) a pre-training-enabled sequence-to-sequence structure designed for better dealing with unbalanced class distributions using self-supervised learning. We evaluate the presented models using a real network dataset created from month-long real traffic traces collected from backbone links with the associated intrusion log. The experimental results support the feasibility of the presented models with up to 94.8% in F1 score with the first five packets (k=5), outperforming baseline deep learning models.
Abstract:  Software systems may contain critical program components such as patented program logic or sensitive data. When those components are reverse-engineered by adversaries, it can cause significant damage (e.g., financial loss or operational failures). While protecting critical program components (e.g., code or data) in software systems is of utmost importance, existing approaches, unfortunately, have two major weaknesses: (1) they can be reverse-engineered via various program analysis techniques and (2) when an adversary obtains a legitimate-looking critical program component, he or she can be sure that it is genuine. In this paper, we propose Ambitr, a novel technique that hides critical program components. The core of Ambitr is the Ambiguous Translator, which can generate the critical program components when the input is the correct secret key. The translator is ambiguous as it can accept any input and produce a number of legitimate-looking outputs, making it difficult to know whether an input is the correct secret key or not. The executions of the translator when it processes the correct secret key and other inputs are also indistinguishable, making the analysis inconclusive. Our evaluation results show that static, dynamic, and symbolic analysis techniques fail to identify the hidden information in Ambitr. We also demonstrate that manual analysis of Ambitr is extremely challenging.
Abstract:  Server-side malware is one of the prevalent threats that can affect a large number of clients who visit the compromised server. In this paper, we propose Dazzle-attack, a new advanced server-side attack that is resilient to forensic analysis such as reverse-engineering. Dazzle-attack retrieves typical (and non-suspicious) contents from benign and uncompromised websites to avoid detection and mislead the investigation to erroneously associate the attacks with benign websites. Dazzle-attack leverages a specialized state machine that accepts any inputs and produces outputs with respect to the inputs, which substantially enlarges the input-output space and makes reverse-engineering efforts significantly more difficult. We develop a prototype of Dazzle-attack and conduct an empirical evaluation of Dazzle-attack to show that it imposes significant challenges to forensic analysis.
Abstract:  This research explores the possibility of a new anti-analysis technique, carefully designed to attack weaknesses of the existing program analysis approaches. It encodes a program code snippet to hide, and its decoding process is implemented by a sophisticated state machine that produces multiple outputs depending on inputs. The key idea of the proposed technique is to ambiguously decode the program code, resulting in multiple decoded code snippets that are challenging to distinguish from each other. Our approach is stealthier than previous similar approaches as its execution does not exhibit different behaviors between when it decodes correctly or incorrectly. This paper also presents analyses of weaknesses of existing techniques and discusses potential improvements. We implement and evaluate the proof of concept approach, and our preliminary results show that the proposed technique imposes various new unique challenges to the program analysis technique.
Abstract:  To provide secure content delivery, Transport Layer Security (TLS) has become a de facto standard over a couple of decades. However, TLS has a long history of security weaknesses and drawbacks. Thus, the security of TLS has been enhanced by addressing security problems through continuous version upgrades. Meanwhile, to provide fast content delivery globally, websites (or origin web servers) need to deploy and administer many machines in globally distributed environments. They often delegate the management of machines to web hosting services or content delivery networks (CDNs), where the security configurations of distributed servers may vary spatially depending on the managing entities or locations. Based on these spatial differences in TLS security, we find that the security level of TLS connections (and their web services) can be lowered. After collecting the information of (web) domains that exhibit different TLS versions and cryptographic options depending on clients' locations, we show that it is possible to redirect TLS handshake messages to weak TLS servers, which both the origin server and the client may not be aware of. We investigate 7M domains with these spatial differences of security levels in the wild and conduct the analyses to better understand the root causes of this phenomenon. We also measure redirection delays at various locations in the world to see whether there are noticeable delays in redirections.
Abstract:  Phishing attacks are causing substantial damage despite extensive efforts in academia and industry. Recently, a large volume of phishing attacks have shifted toward adopting HTTPS, leveraging TLS certificates issued by Certificate Authorities (CAs), to make the attacks more effective. In this paper, we present a comprehensive study on the security practices of CAs in the HTTPS phishing ecosystem. We focus on the CAs, critical actors under-studied in previous literature, to better understand the importance of the security practices of CAs and thwart the proliferating HTTPS phishing. In particular, we first present the current landscape and effectiveness of HTTPS phishing attacks compared to traditional HTTP ones. Then, we conduct an empirical experiment on the CAs' security practices in terms of the issuance and revocation of the certificates. Our findings highlight serious conflicts between the expected security practices of CAs and reality, raising significant security concerns. We further validate our findings using a longitudinal dataset of abusive certificates used for real phishing attacks in the wild. We confirm that the security concerns of CAs prevail in the wild and these concerns can be one of the main contributors to the recent surge of HTTPS phishing attacks.
Abstract:  Transport Layer Security (TLS) has become the norm for secure communication over the Internet. In August 2018, TLS 1.3, the latest version that improves security and performance of the previous TLS version, was approved. In this paper, we take a closer look at TLS 1.3 deployments in practice regarding adoption rate, security, performance, and implementation by applying temporal, spatial, and platform-based approaches on 687M connections. Overall, TLS 1.3 has rapidly been adopted, mainly because third-party platforms such as Content Delivery Networks (CDNs) make a significant contribution to the Internet. In fact, it deprecates vulnerable cryptographic primitives and substantially reduces the time required to perform the TLS 1.3 full handshake compared to the TLS 1.2 handshake. We quantify these aspects and show TLS 1.3 is beneficial to websites that do not rely on the third-party platforms. We also review Common Vulnerabilities and Exposures (CVE) regarding TLS libraries and show that many recent vulnerabilities can be easily addressed by upgrading to TLS 1.3. However, some websites exhibit unstable support for TLS 1.3 due to multiple platforms with different TLS versions or migration to other platforms, which means that a website can show a lower TLS version at a certain time or from a certain region. Furthermore, we find that most of the implementations (including TLS libraries) do not fully support the new features of TLS 1.3 such as downgrade protection and certificate extensions.
Abstract:  As the COVID-19 pandemic started triggering widespread lockdowns across the globe, cybercriminals did not hesitate to take advantage of users' increased usage of the Internet and their reliance on it. In this paper, we carry out a comprehensive measurement study of online social engineering attacks in the early months of the pandemic. By collecting, synthesizing, and analyzing DNS records, TLS certificates, phishing URLs, phishing website source code, phishing emails, web traffic to phishing websites, news articles, and government announcements, we track trends of phishing activity between January and May 2020 and seek to understand the key implications of the underlying trends. We find that phishing attack traffic in March and April 2020 skyrocketed up to 220% of its pre-COVID-19 rate, far exceeding typical seasonal spikes. Attackers exploited victims' uncertainty and fear related to the pandemic through a variety of highly targeted scams, including emerging scam types against which current defenses are not sufficient as well as traditional phishing which outpaced the ecosystem's collective response.
Abstract:  Recent measurement studies have highlighted security threats against the code-signing public key infrastructure (PKI), such as certificates that had been compromised or issued directly to the malware authors. The primary mechanism for mitigating these threats is to revoke the abusive certificates. However, the distributed yet closed nature of the code-signing PKI makes it difficult to evaluate the effectiveness of revocations in this ecosystem. In consequence, the magnitude of the signed malware threat is not fully understood. In this paper, we collect seven datasets, including the largest corpus of code-signing certificates, and we combine them to analyze the revocation process from end to end. Effective revocations rely on three roles: (1) discovering the abusive certificates, (2) revoking the certificates effectively, and (3) disseminating the revocation information for clients. We assess the challenge of discovering compromised certificates and the subsequent revocation delays. We show that erroneously setting revocation dates causes signed malware to remain valid even after the certificate has been revoked. We also report failures in disseminating the revocations, leading clients to continue trusting the revoked certificates.
Abstract:  Digitally signed malware can bypass system protection mechanisms that install or launch only programs with valid signatures. It can also evade anti-virus programs, which often forego scanning signed binaries. Known from advanced threats such as Stuxnet and Flame, this type of abuse has not been measured systematically in the broader malware landscape. In particular, the methods, effectiveness window, and security implications of code-signing PKI abuse are not well understood. We propose a threat model that highlights three types of weaknesses in the code-signing PKI. We overcome challenges specific to code-signing measurements by introducing techniques for prioritizing the collection of code-signing certificates that are likely abusive. We also introduce an algorithm for distinguishing among different types of threats. These techniques allow us to study threats that breach the trust encoded in the Windows code-signing PKI. The threats include stealing the private keys associated with benign certificates and using them to sign malware, or impersonating legitimate companies that do not develop software and, hence, do not own code-signing certificates. Finally, we discuss the actionable implications of our findings and propose concrete steps for improving the security of the code-signing ecosystem.
Abstract:  Potentially dangerous cryptography errors are well-documented in many applications. Conventional wisdom suggests that many of these errors are caused by cryptographic Application Programming Interfaces (APIs) that are too complicated, have insecure defaults, or are poorly documented. To address this problem, researchers have created several cryptographic libraries that they claim are more usable; however, none of these libraries have been empirically evaluated for their ability to promote more secure development. This paper is the first to examine both how and why the design and resulting usability of different cryptographic libraries affects the security of code written with them, with the goal of understanding how to build effective future libraries. We conducted a controlled experiment in which 256 Python developers recruited from GitHub attempt common tasks involving symmetric and asymmetric cryptography using one of five different APIs. We examine their resulting code for functional correctness and security, and compare their results to their self-reported sentiment about their assigned library. Our results suggest that while APIs designed for simplicity can provide security benefits - reducing the decision space, as expected, prevents choice of insecure parameters - simplicity is not enough. Poor documentation, missing code examples, and a lack of auxiliary features such as secure key storage, caused even participants assigned to simplified libraries to struggle with both basic functional correctness and security. Surprisingly, the availability of comprehensive documentation and easy-to-use code examples seems to compensate for more complicated APIs in terms of functionally correct results and participant reactions; however, this did not extend to security results. We find it particularly concerning that for about 20% of functionally correct tasks, across libraries, participants believed their code was secure when it was not.
Our results suggest that while new cryptographic libraries that want to promote effective security should offer a simple, convenient interface, this is not enough: they should also, and perhaps more importantly, ensure support for a broad range of common tasks and provide accessible documentation with secure, easy-to-use code examples.
Abstract:  Many critical communications now take place digitally, but recent revelations demonstrate that these communications can often be intercepted. To achieve true message privacy, users need end-to-end message encryption, in which the communications service provider is not able to decrypt the content. Historically, end-to-end encryption has proven extremely difficult for people to use correctly, but recently tools like Apple's iMessage and Google's End-to-End have made it more broadly accessible by using key-directory services. These tools (and others like them) sacrifice some security properties for convenience, which alarms some security experts, but little is known about how average users evaluate these tradeoffs. In a 52-person interview study, we asked participants to complete encryption tasks using both a traditional key-exchange model and a key-directory-based registration model. We also described the security properties of each (varying the order of presentation) and asked participants for their opinions. We found that participants understood the two models well and made coherent assessments about when different tradeoffs might be appropriate. Our participants recognized that the less-convenient exchange model was more secure overall, but found the security of the registration model to be “good enough” for many everyday purposes.
Abstract:  Vulnerabilities in Android code—including but not limited to insecure data storage, unprotected inter-component communication, broken TLS implementations, and violations of least privilege—have enabled real-world privacy leaks and motivated research cataloguing their prevalence and impact. Researchers have speculated that appification promotes security problems, as it increasingly allows inexperienced laymen to develop complex and sensitive apps. Anecdotally, Internet resources such as Stack Overflow are blamed for promoting insecure solutions that are naively copy-pasted by inexperienced developers. In this paper, we for the first time systematically analyzed how the use of information resources impacts code security. We first surveyed 295 app developers who have published in the Google Play market concerning how they use resources to solve security-related problems. Based on the survey results, we conducted a lab study with 54 Android developers (students and professionals), in which participants wrote security- and privacy-relevant code under time constraints. The participants were assigned to one of four conditions: free choice of resources, Stack Overflow only, official Android documentation only, or books only. Those participants who were allowed to use only Stack Overflow produced significantly less secure code than those using the official Android documentation or books, while participants using the official Android documentation produced significantly less functional code than those using Stack Overflow. To assess the quality of Stack Overflow as a resource, we surveyed the 139 threads our participants accessed during the study, finding that only 25% of them were helpful in solving the assigned tasks and only 17% of them contained secure code snippets.
In order to obtain ground truth concerning the prevalence of the secure and insecure code our participants wrote in the lab study, we statically analyzed a random sample of 200,000 apps from Google Play, finding that 93.6% of the apps used at least one of the API calls our participants used during our study. We also found that many of the security errors made by our participants also appear in the wild, possibly also originating in the use of Stack Overflow to solve programming problems. Taken together, our results confirm that API documentation is secure but hard to use, while informal documentation such as Stack Overflow is more accessible but often leads to insecurity. Given time constraints and economic pressures, we can expect that Android developers will continue to choose those resources that are easiest to use; therefore, our results firmly establish the need for secure-but-usable documentation.