Papers that received an ACM SIGSOFT Distinguished Paper Award:
List of accepted papers of the main research track:
Concolic Testing for Models of State-Based Systems
Reza Ahmadi and
Juergen Dingel
(Queen’s University, Canada)
Testing models of modern cyber-physical systems is not straightforward
due to timing constraints, numerous if not infinite possible
behaviors, and complex communications between components.
Software testing tools and approaches that can generate test cases
to test these systems are therefore important. Many of the existing
automatic approaches support testing at the implementation level
only. The existing model-level testing tools either treat the model
as a black box (e.g., random testing approaches) or have limitations
when it comes to generating complex test sequences (e.g., symbolic
execution). This paper presents a novel approach and tool support
for automatic unit testing of models of real-time embedded systems
by conducting concolic testing, a hybrid testing technique based
on concrete and symbolic execution. Our technique conducts automatic
concolic testing in two phases. In the first phase, the model is
isolated from its environment, transformed into a testable model,
and integrated with a test harness. In the second phase, the harness
tests the model concolically and reports the test execution
results. We describe an implementation of our approach in the
context of Papyrus-RT, an open source Model Driven Engineering
(MDE) tool based on the modeling language UML-RT, and report
the results of applying our concolic testing approach to a set of
standard benchmark models to validate our approach.
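The concrete-plus-symbolic loop that concolic testing combines can be sketched in a few lines. This is an illustrative Python toy, not the paper's UML-RT tooling: it runs a program concretely, records the path condition, and negates branch conditions to steer later runs; the brute-force `solve` stands in for the SMT solver a real engine would use.

```python
# Illustrative sketch of a concolic-testing loop (all names are ours,
# not the paper's): concrete execution + recorded path condition,
# with branch negation to explore new paths.

def program(x, trace):
    # Program under test; `trace` records (predicate, outcome) pairs.
    b1 = x > 10
    trace.append(("x > 10", b1))
    if b1:
        b2 = x % 2 == 0
        trace.append(("x % 2 == 0", b2))
        if b2:
            return "even-big"
        return "odd-big"
    return "small"

def solve(constraints):
    # Toy constraint solver: brute-force over a small integer domain
    # (a real concolic engine would call an SMT solver such as Z3).
    preds = {"x > 10": lambda x: x > 10, "x % 2 == 0": lambda x: x % 2 == 0}
    for x in range(-50, 51):
        if all(preds[p](x) == want for p, want in constraints):
            return x
    return None

def concolic(seed):
    inputs, seen, outputs = [seed], set(), set()
    while inputs:
        x = inputs.pop()
        trace = []
        outputs.add(program(x, trace))
        # Negate each branch along the path to reach unexplored paths.
        for i in range(len(trace)):
            flipped = trace[:i] + [(trace[i][0], not trace[i][1])]
            key = tuple(flipped)
            if key not in seen:
                seen.add(key)
                x2 = solve(flipped)
                if x2 is not None:
                    inputs.append(x2)
    return outputs

print(sorted(concolic(0)))  # all three program paths reached from one seed
```

Starting from the single concrete input 0, the loop discovers inputs for all three paths without enumerating the input domain blindly.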
Artifacts Reusable
Target-Driven Compositional Concolic Testing with Function Summary Refinement for Effective Bug Detection
Yunho Kim, Shin Hong, and Moonzoo Kim
(KAIST, South Korea; Handong Global University, South Korea)
Concolic testing is popular in unit testing because it can detect bugs quickly in a relatively small search space. However, in system-level testing, it suffers from symbolic path explosion and often misses bugs. To resolve this problem, we have developed a focused compositional concolic testing technique, FOCAL, for effective bug detection. Focusing on a target unit failure v (a crash or an assert violation) detected by concolic unit testing, FOCAL generates a system-level test input that validates v. This test input is obtained by building and solving symbolic path formulas that represent system-level executions raising v. FOCAL builds such formulas by combining function summaries one by one backward from the function that raised v to main. If a function summary φa of function a conflicts with the summaries of the other functions, FOCAL refines φa to φa′ by applying a refining constraint learned from the conflict. FOCAL showed high system-level bug detection ability by detecting 71 out of the 100 real-world target bugs in the SIR benchmark, while other relevant cutting-edge techniques (i.e., AFL-fast, KATCH, Mix-CCBSE) detected at most 40 bugs. FOCAL also detected 13 new crash bugs in popular file parsing programs.
Generating Automated and Online Test Oracles for Simulink Models with Continuous and Uncertain Behaviors
Claudio Menghi, Shiva Nejati, Khouloud Gaaloul, and
Lionel C. Briand
(University of Luxembourg, Luxembourg)
Test automation requires automated oracles to assess test outputs. For cyber-physical systems (CPS), oracles, in addition to being automated, should meet several key objectives: (i) they should check test outputs in an online manner, to stop expensive test executions as soon as a failure is detected; (ii) they should handle time- and magnitude-continuous CPS behaviors; (iii) they should provide a quantitative degree of satisfaction or failure measure instead of binary pass/fail outputs; and (iv) they should be able to handle uncertainties due to CPS interactions with the environment. We propose an automated approach to translate CPS requirements specified in a logic-based language into test oracles specified in Simulink, a widely used development and simulation language for CPS. Our approach achieves the objectives noted above through the identification of a fragment of Signal First Order Logic (SFOL) to specify requirements, the definition of a quantitative semantics for this fragment, and a sound translation of the fragment into Simulink. The results from applying our approach to 11 industrial case studies show that: (i) our requirements language can express all 98 requirements of our case studies; (ii) the time and effort required by our approach are acceptable, showing potential for the adoption of our work in practice; and (iii) for large models, our approach can dramatically reduce test execution time compared to checking test outputs in an offline manner.
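The flavor of objectives (i) and (iii), an online oracle that returns a degree of satisfaction rather than pass/fail, can be sketched for a single bound property. The semantics below (margin of "always(signal <= threshold)" over a sampled signal) is a simplified stand-in for the paper's SFOL fragment, and all names are illustrative.

```python
# Sketch of a quantitative, online oracle: compute a satisfaction
# margin for "always(signal <= threshold)" and abort the run as soon
# as the margin turns negative (a failure is detected).

def online_oracle(signal_samples, threshold):
    robustness = float("inf")
    for t, value in enumerate(signal_samples):
        margin = threshold - value          # > 0: satisfied with slack
        robustness = min(robustness, margin)
        if robustness < 0:                  # failure detected online:
            return robustness, t            # stop the expensive execution
    return robustness, len(signal_samples) - 1

# A simulated temperature trace that violates the 25.0 bound at step 3.
trace = [20.0, 21.5, 22.0, 25.5, 23.0]
degree, stopped_at = online_oracle(trace, threshold=25.0)
print(degree, stopped_at)  # -0.5 3: failed by 0.5 units, stopped at step 3
```

A binary oracle would only report "fail" after the full run; the margin quantifies how badly the bound was violated and where.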
Artifacts Available
Artifacts Reusable
Lifting Datalog-Based Analyses to Software Product Lines
Ramy Shahin, Marsha Chechik, and Rick Salay
(University of Toronto, Canada)
Applying program analyses to Software Product Lines (SPLs) has been a fundamental research problem at the intersection of Product Line Engineering and software analysis. Different attempts have been made to “lift” particular product-level analyses to run on the entire product line. In this paper, we tackle the class of Datalog-based analyses (e.g., pointer and taint analyses), study the theoretical aspects of lifting Datalog inference, and implement a lifted inference algorithm inside the Soufflé Datalog engine. We evaluate our implementation on a set of benchmark product lines. We show significant savings in processing time and fact database size (billions of times faster on one of the benchmarks) compared to brute-force analysis of each product individually.
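The core idea of lifting can be illustrated on the classic transitive-closure rule: each fact carries a presence condition (here simplified to the set of products it holds in), and a derived fact's condition is the conjunction, i.e. intersection, of its premises'. The paper works inside Soufflé; this standalone fixpoint is only a sketch of the idea.

```python
# Toy "lifted" Datalog inference: reach(a, d) :- reach(a, b), reach(b, d),
# where every fact is annotated with the set of products it exists in.

def lifted_transitive_closure(edges):
    # edges: dict (a, b) -> set of products in which the edge exists.
    reach = dict(edges)
    changed = True
    while changed:                       # naive fixpoint iteration
        changed = False
        for (a, b), pc1 in list(reach.items()):
            for (c, d), pc2 in list(reach.items()):
                if b == c:
                    pc = pc1 & pc2       # lifted conjunction of conditions
                    if pc and pc - reach.get((a, d), set()):
                        reach[(a, d)] = reach.get((a, d), set()) | pc
                        changed = True
    return reach

edges = {
    ("x", "y"): {"P1", "P2"},
    ("y", "z"): {"P2"},          # this edge only exists in product P2
}
closure = lifted_transitive_closure(edges)
print(closure[("x", "z")])  # {'P2'}: reachability holds only in P2
```

One lifted run replaces two product-level runs here; the savings grow combinatorially with the number of products, which is what the brute-force comparison in the abstract measures.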
An Empirical Study of Real-World Variability Bugs Detected by Variability-Oblivious Tools
Austin Mordahl, Jeho Oh, Ugur Koc, Shiyi Wei, and Paul Gazzillo
(University of Texas at Dallas, USA; University of Texas at Austin, USA; University of Maryland, USA; University of Central Florida, USA)
Many critical software systems developed in C utilize compile-time configurability. The many possible configurations of this software make bug detection through static analysis difficult. While variability-aware static analyses have been developed, there remains a gap between those and state-of-the-art static bug detection tools. In order to collect data on how such tools may perform and to develop real-world benchmarks, we present a way to leverage configuration sampling, off-the-shelf “variability-oblivious” bug detectors, and automatic feature identification techniques to simulate a variability-aware analysis. We instantiate our approach using four popular static analysis tools on three highly configurable, real-world C projects, obtaining 36,061 warnings, 80% of which are variability warnings. Analyzing the warnings collected from these experiments, we find that they cover a variety of kinds, such as NULL dereferences. We then manually investigate these warnings to produce a benchmark of 77 confirmed true bugs (52 of which are variability bugs) useful for future development of variability-aware analyses.
Artifacts Available
Artifacts Reusable
Principles of Feature Modeling
Damir Nešić, Jacob Krüger,
Ștefan Stănciulescu, and Thorsten Berger
(KTH, Sweden; University of Magdeburg, Germany; ABB, Switzerland; Chalmers University of Technology, Sweden; University of Gothenburg, Sweden)
Feature models are arguably one of the most intuitive and successful notations for modeling the features of a variant-rich software system. Feature models help developers to keep an overall understanding of the system, and also support scoping, planning, development, variant derivation, configuration, and maintenance activities that sustain the system’s long-term success.
Unfortunately, feature models are difficult to build and evolve. Features need to be identified, grouped, organized in a hierarchy, and mapped to software assets. Also, dependencies between features need to be declared. While feature models have been the subject of three decades of research, resulting in many feature-modeling notations together with automated analysis and configuration techniques, a generic set of principles for engineering feature models is still missing. It is not even clear whether feature models could be engineered using recurrent principles. Our work shows that such principles in fact exist. We analyzed feature-modeling practices elicited from ten interviews conducted with industrial practitioners and from 31 relevant papers. We synthesized a set of 34 principles covering eight different phases of feature modeling, from planning, through model construction, to model maintenance and evolution. Grounded in empirical evidence, these principles provide practical, context-specific advice on how to perform feature modeling, describe what information sources to consider, and highlight common characteristics of feature models. We believe that our principles can support researchers and practitioners in enhancing feature-modeling tooling, synthesis, and analysis techniques, as well as in scoping future research.
Understanding GCC Builtins to Develop Better Tools
Manuel Rigger, Stefan Marr,
Bram Adams, and Hanspeter Mössenböck
(JKU Linz, Austria; University of Kent, UK; Polytechnique Montréal, Canada)
C programs can use compiler builtins to provide functionality that the C language lacks. On Linux, GCC provides several thousand builtins that are also supported by other mature compilers, such as Clang and ICC. Maintainers of other tools lack guidance on whether and which builtins should be implemented to support popular projects. To assist tool developers who want to support GCC builtins, we analyzed builtin use in 4,913 C projects from GitHub. We found that 37% of these projects relied on at least one builtin. Supporting an increasing proportion of projects requires support of an exponentially increasing number of builtins; however, implementing only 10 builtins already covers over 30% of the projects. Since we found that many builtins in our corpus remained unused, the effort needed to support 90% of the projects is moderate, requiring about 110 builtins to be implemented. For each project, we analyzed the evolution of builtin use over time and found that the majority of projects mostly added builtins. This suggests that builtins are not a legacy feature and must be supported in future tools. Systematic testing of builtin support in existing tools revealed that many lacked support for builtins either partially or completely; we also discovered incorrect implementations in various tools, including the formally verified CompCert compiler.
Artifacts Available
Artifacts Reusable
Assessing the Quality of the Steps to Reproduce in Bug Reports
Oscar Chaparro,
Carlos Bernal-Cárdenas, Jing Lu,
Kevin Moran, Andrian Marcus,
Massimiliano Di Penta,
Denys Poshyvanyk, and
Vincent Ng
(College of William and Mary, USA; University of Texas at Dallas, USA; University of Sannio, Italy)
A major problem with user-written bug reports, indicated by developers and documented by researchers, is the often low quality of the reported steps to reproduce the bugs. Low-quality steps to reproduce lead to excessive manual effort spent on bug triage and resolution. This paper proposes Euler, an approach that automatically identifies and assesses the quality of the steps to reproduce in a bug report, providing feedback to the reporters, which they can use to improve the bug report. The feedback provided by Euler was assessed by external evaluators and the results indicate that Euler correctly identified 98% of the existing steps to reproduce and 58% of the missing ones, while 73% of its quality annotations are correct.
A Learning-Based Approach for Automatic Construction of Domain Glossary from Source Code and Documentation
Chong Wang,
Xin Peng, Mingwei Liu, Zhenchang Xing, Xuefang Bai, Bing Xie, and Tuo Wang
(Fudan University, China; Australian National University, Australia; Peking University, China)
A domain glossary that organizes domain-specific concepts and their aliases and relations is essential for knowledge acquisition and software development. Existing approaches use linguistic heuristics or term-frequency-based statistics to identify domain-specific terms from software documentation, and thus the accuracy is often low. In this paper, we propose a learning-based approach for automatic construction of a domain glossary from source code and software documentation. The approach uses a set of high-quality seed terms identified from code identifiers and natural language concept definitions to train a domain-specific prediction model to recognize glossary terms based on the lexical and semantic context of the sentences mentioning domain-specific concepts. It then merges the aliases of the same concepts to their canonical names, selects a set of explanation sentences for each concept, and identifies “is a”, “has a”, and “related to” relations between the concepts. We apply our approach to the deep learning and Hadoop domains and harvest 5,382 and 2,069 concepts together with 16,962 and 6,815 relations respectively. Our evaluation validates the accuracy of the extracted domain glossary and its usefulness for the fusion and acquisition of knowledge from different documents of different projects.
On Using Machine Learning to Identify Knowledge in API Reference Documentation
Davide Fucci, Alireza Mollaalizadehbahnemiri, and
Walid Maalej
(University of Hamburg, Germany)
Using API reference documentation like JavaDoc is an integral part of software development. Previous research introduced a grounded taxonomy that organizes API documentation knowledge in 12 types, including knowledge about the Functionality, Structure, and Quality of an API. We study how well modern text classification approaches can automatically identify documentation containing specific knowledge types. We compared conventional machine learning (k-NN and SVM) with deep learning approaches trained on manually-annotated Java and .NET API documentation (n = 5,574). When classifying the knowledge types individually (i.e., multiple binary classifiers) the best AUPRC was up to 87
Artifacts Available
Generating Query-Specific Class API Summaries
Mingwei Liu,
Xin Peng, Andrian Marcus, Zhenchang Xing, Wenkai Xie, Shuangshuang Xing, and Yang Liu
(Fudan University, China; University of Texas at Dallas, USA; Australian National University, Australia)
Source code summaries are concise representations, in the form of text and/or code, of complex code elements and are meant to help developers gain a quick understanding that in turn helps them perform specific tasks. Generation of summaries that are task-specific is still a challenge in the automatic code summarization field. We propose an approach for generating on-demand, extrinsic hybrid summaries for API classes, relevant to a programming task, formulated as a natural language query. The summaries include the most relevant sentences extracted from the API reference documentation and the most relevant methods.
External evaluators assessed the summaries generated for classes retrieved from JDK and Android libraries for several programming tasks. The majority found that the summaries are complete, concise, and readable.
A comparison with summaries produced by three baseline approaches revealed that the information present only in our summaries is more relevant than that present only in the baseline summaries. Finally, an extrinsic evaluation study showed that the summaries help users evaluate the correctness of API retrieval results faster and more accurately.
Semantic Relation Based Expansion of Abbreviations
Yanjie Jiang, Hui Liu, and
Lu Zhang
(Beijing Institute of Technology, China; Peking University, China)
Identifiers account for 70% of source code in terms of characters, and thus the quality of such identifiers is critical for program comprehension and software maintenance. For various reasons, however, many identifiers contain abbreviations, which reduces the readability and maintainability of source code. To this end, a number of approaches have been proposed to expand abbreviations in identifiers. However, such approaches are either inaccurate or confined to specific identifiers. To address this problem, in this paper we propose a generic and accurate approach to expand identifier abbreviations. The key insight of the approach is that abbreviations in the name of a software entity e have a good chance of finding their full terms in the names of software entities that are semantically related to e. Consequently, the proposed approach builds a knowledge graph to represent such entities and their relationships with e, and searches the graph for full terms. The optimal searching strategy for the graph could be learned automatically from a corpus of manually expanded abbreviations. We evaluate the proposed approach on nine well-known open-source projects. Results of our k-fold evaluation suggest that the proposed approach improves the state of the art. It improves precision significantly from 29% to 85%, and recall from 29% to 77%. Evaluation results also suggest that the proposed generic approach is even better than the state-of-the-art parameter-specific approach in expanding parameter abbreviations, improving F1 score significantly from 75% to 87%.
Diversity-Based Web Test Generation
Matteo Biagiola, Andrea Stocco, Filippo Ricca, and Paolo Tonella
(Fondazione Bruno Kessler, Italy; USI Lugano, Switzerland; University of Genoa, Italy)
Existing web test generators derive test paths from a navigational model of the web application, completed with either manually or randomly generated input values. However, manual test data selection is costly, while random generation often results in infeasible input sequences, which are rejected by the application under test. Random and search-based generation can achieve the desired level of model coverage only after a large number of test execution attempts, each slowed down by the need to interact with the browser during test execution. In this work, we present a novel web test generation algorithm that pre-selects the most promising candidate test cases based on their diversity from previously generated tests. As such, only the test cases that explore diverse behaviours of the application are considered for in-browser execution. We have implemented our approach in a tool called DIG. Our empirical evaluation on six real-world web applications shows that DIG achieves higher coverage and fault detection rates significantly earlier than crawling-based and search-based web test generators.
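Diversity-driven pre-selection of the kind described above can be sketched with a greedy farthest-first loop: among candidate test paths, repeatedly pick the one farthest from those already selected, so only diverse candidates pay the cost of in-browser execution. The distance metric (Jaccard over exercised actions) and the greedy scheme are our illustrative choices, not necessarily DIG's.

```python
# Sketch: pre-select diverse test candidates so that only tests
# exploring different behaviours are run in the (slow) browser.

def jaccard_distance(a, b):
    a, b = set(a), set(b)
    return 1.0 - len(a & b) / len(a | b)

def select_diverse(candidates, budget):
    selected = [candidates[0]]           # start from an arbitrary seed
    while len(selected) < budget:
        # Farthest-first: maximize the distance to the nearest selected test.
        best = max(
            (c for c in candidates if c not in selected),
            key=lambda c: min(jaccard_distance(c, s) for s in selected),
        )
        selected.append(best)
    return selected

candidates = [
    ("login", "browse"),
    ("login", "browse", "logout"),       # near-duplicate of the first
    ("search", "add_to_cart"),
    ("register",),
]
print(select_diverse(candidates, budget=2))
```

With a budget of two, the near-duplicate path is skipped in favor of a behaviourally different one, which is the effect the abstract attributes to diversity-based selection.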
Web Test Dependency Detection
Matteo Biagiola, Andrea Stocco,
Ali Mesbah, Filippo Ricca, and Paolo Tonella
(Fondazione Bruno Kessler, Italy; USI Lugano, Switzerland; University of British Columbia, Canada; University of Genoa, Italy)
E2E web test suites are prone to test dependencies due to the heterogeneous multi-tiered nature of modern web apps, which makes it difficult for developers to create isolated program states for each test case. In this paper, we present the first approach for detecting and validating test dependencies present in E2E web test suites. Our approach employs string analysis to extract an approximated set of dependencies from the test code. It then filters potential false dependencies through natural language processing of test names. Finally, it validates all dependencies, and uses a novel recovery algorithm to ensure no true dependencies are missed in the final test dependency graph. Our approach is implemented in a tool called TEDD and evaluated on the test suites of six open-source web apps. Our results show that TEDD can correctly detect and validate test dependencies up to 72% faster than the baseline with the original test ordering in which the graph contains all possible dependencies. The test dependency graphs produced by TEDD enable test execution parallelization, with a speed-up factor of up to 7×.
Testing Scratch Programs Automatically
Andreas Stahlbauer, Marvin Kreis, and Gordon Fraser
(University of Passau, Germany)
Block-based programming environments like Scratch foster engagement with computer programming and are used by millions of young learners. Scratch allows learners to quickly create entertaining programs and games, while eliminating syntactical program errors that could interfere with progress. However, functional programming errors may still lead to incorrect programs, and learners and their teachers need to identify and understand these errors. This is currently an entirely manual process. In this paper, we introduce a formal testing framework that describes the problem of Scratch testing in detail. We instantiate this formal framework with the Whisker tool, which provides automated and property-based testing functionality for Scratch programs. Empirical evaluation on real student and teacher programs demonstrates that Whisker can successfully test Scratch programs, and automatically achieves an average of 95.25% code coverage. Although well-known testing problems such as test flakiness also exist in the scenario of Scratch testing, we show that automated and property-based testing can accurately reproduce and replace the manually and laboriously produced grading efforts of a teacher, and opens up new possibilities to support learners of programming in their struggles.
A Large-Scale Empirical Study of Compiler Errors in Continuous Integration
Chen Zhang, Bihuan Chen, Linlin Chen,
Xin Peng, and Wenyun Zhao
(Fudan University, China)
Continuous Integration (CI) is a widely-used software development practice to reduce risks. CI builds often break, and a large amount of effort is put into troubleshooting broken builds. Although compiler errors have been recognized as one of the most frequent types of build failures, little is known about the common types, fix effort, and fix patterns of compiler errors that occur in CI builds of open-source projects. To fill such a gap, we present a large-scale empirical study on 6,854,271 CI builds from 3,799 open-source Java projects hosted on GitHub. Using the build data, we measured the frequency of broken builds caused by compiler errors, investigated the ten most common compiler error types, and reported their fix time. We manually analyzed 325 broken builds to summarize fix patterns of the ten most common compiler error types. Our findings help to characterize and understand compiler errors during CI and provide practical implications to developers, tool builders and researchers.
A Statistics-Based Performance Testing Methodology for Cloud Applications
Sen He, Glenna Manns, John Saunders, Wei Wang,
Lori Pollock, and Mary Lou Soffa
(University of Texas at San Antonio, USA; University of Virginia, USA; University of Delaware, USA)
The low cost of resource ownership and flexibility have led users to increasingly port their applications to the cloud. To fully realize the cost benefits of cloud services, users usually need to reliably know the execution performance of their applications. However, due to the random performance fluctuations experienced by cloud applications, the black-box nature of public clouds and the cloud usage costs, testing on clouds to acquire accurate performance results is extremely difficult. In this paper, we present a novel cloud performance testing methodology called PT4Cloud. By employing non-parametric statistical approaches of likelihood theory and the bootstrap method, PT4Cloud provides reliable stop conditions to obtain highly accurate performance distributions with confidence bands. These statistical approaches also allow users to specify intuitive accuracy goals and easily trade between accuracy and testing cost. We evaluated PT4Cloud with 33 benchmark configurations on Amazon Web Service and Chameleon clouds. When compared with performance data obtained from extensive performance tests, PT4Cloud provides testing results with 95.4% accuracy on average while reducing the number of test runs by 62%. We also propose two test execution reduction techniques for PT4Cloud, which can reduce the number of test runs by 90.1% while retaining an average accuracy of 91%. We compared our technique to three other techniques and found that our results are much more accurate.
Artifacts Available
Artifacts Reusable
How Bad Can a Bug Get? An Empirical Analysis of Software Failures in the OpenStack Cloud Computing Platform
Domenico Cotroneo, Luigi De Simone, Pietro Liguori, Roberto Natella, and Nematollah Bidokhti
(Federico II University of Naples, Italy; FutureWei Technologies, USA)
Cloud management systems provide abstractions and APIs for programmatically configuring cloud infrastructures. Unfortunately, residual software bugs in these systems can potentially lead to high-severity failures, such as prolonged outages and data losses. In this paper, we investigate the impact of failures in the context of the widespread OpenStack cloud management system, by performing fault injection and by analyzing the impact of the resulting failures in terms of fail-stop behavior, failure detection through logging, and failure propagation across components. The analysis points out that most of the failures are not timely detected and notified; moreover, many of these failures can silently propagate over time and through components of the cloud management system, which calls for more thorough run-time checks and fault containment.
Artifacts Available
Artifacts Reusable
Towards More Efficient Meta-heuristic Algorithms for Combinatorial Test Generation
Jinkun Lin, Shaowei Cai, Chuan Luo, Qingwei Lin, and Hongyu Zhang
(Institute of Software at Chinese Academy of Sciences, China; Microsoft Research, China; University of Newcastle, Australia)
Combinatorial interaction testing (CIT) is a popular approach to detecting faults in highly configurable software systems. The core task of CIT is to generate a small test suite called a t-way covering array (CA), where t is the covering strength. Many meta-heuristic algorithms have been proposed to solve the constrained covering array generating (CCAG) problem. A major drawback of existing algorithms is that they usually need considerable time to obtain a good-quality solution, which hinders the wider applications of such algorithms. We observe that the high time consumption of existing meta-heuristic algorithms for CCAG is mainly due to the procedure of score computation. In this work, we propose a much more efficient method for score computation. The score computation method is applied to a state-of-the-art algorithm TCA, showing significant improvements. The new score computation method opens a way to utilize algorithmic ideas relying on scores which were not affordable previously. We integrate a gradient descent search step to further improve the algorithm, leading to a new algorithm called FastCA. Experiments on a broad range of real-world benchmarks and synthetic benchmarks show that FastCA significantly outperforms state-of-the-art algorithms for CCAG, in terms of both the size of the obtained covering arrays and the run time.
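The covering-array task itself, and the scoring that dominates run time, can be illustrated with a brute-force greedy builder for t = 2: repeatedly add the test whose score (number of still-uncovered parameter-value pairs it covers) is highest. Real CCAG solvers such as TCA and FastCA instead do local search with the efficient score computation the paper contributes; this sketch only shows what is being scored.

```python
# Sketch: greedy construction of a 2-way (pairwise) covering array.
# The `key` lambda is exactly the "score" of a candidate test: how many
# uncovered parameter-value pairs it would cover.

from itertools import combinations, product

def pairwise_suite(domains):
    # domains: one list of possible values per parameter.
    uncovered = {
        ((i, vi), (j, vj))
        for i, j in combinations(range(len(domains)), 2)
        for vi in domains[i] for vj in domains[j]
    }
    suite = []
    while uncovered:
        best = max(
            product(*domains),
            key=lambda t: sum(
                (((i, t[i]), (j, t[j])) in uncovered)
                for i, j in combinations(range(len(t)), 2)
            ),
        )
        suite.append(best)
        uncovered -= {
            ((i, best[i]), (j, best[j]))
            for i, j in combinations(range(len(best)), 2)
        }
    return suite

# 3 binary options: 8 exhaustive configurations, but pairwise needs only 4.
suite = pairwise_suite([[0, 1], [0, 1], [0, 1]])
print(len(suite), suite)
```

Because the score is recomputed for every candidate in every round, it dominates the cost even in this toy; making that computation cheap is precisely what enables FastCA's extra search steps.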
Compiler Bug Isolation via Effective Witness Test Program Generation
Junjie Chen, Jiaqi Han, Peiyi Sun, Lingming Zhang,
Dan Hao, and
Lu Zhang
(Tianjin University, China; Peking University, China; University of Texas at Dallas, USA)
Compiler bugs are extremely harmful, but are notoriously difficult to debug because compiler bugs usually produce little debugging information. Given a bug-triggering test program for a compiler, hundreds of compiler files are usually involved during compilation, and thus are suspect buggy files. Although many automated bug isolation techniques exist, they are not applicable to compilers due to scalability or effectiveness problems. To solve this problem, in this paper, we transform the compiler bug isolation problem into a search problem, i.e., searching for a set of effective witness test programs that are able to eliminate innocent compiler files from suspects. Based on this intuition, we propose an automated compiler bug isolation technique, DiWi, which (1) proposes a heuristic-based search strategy to generate such a set of effective witness test programs via applying our designed witnessing mutation rules to the given failing test program, and (2) compares their coverage to isolate bugs following the practice of spectrum-based bug isolation. The experimental results on 90 real bugs from popular GCC and LLVM compilers show that DiWi effectively isolates 66.67%/78.89% bugs within Top-10/Top-20 compiler files, significantly outperforming state-of-the-art bug isolation techniques.
Concolic Testing with Adaptively Changing Search Heuristics
Sooyoung Cha and Hakjoo Oh
(Korea University, South Korea)
We present Chameleon, a new approach for adaptively changing search heuristics during concolic testing. Search heuristics play a central role in concolic testing as they mitigate the path-explosion problem by focusing on particular program paths that are likely to increase code coverage as quickly as possible. A variety of techniques for search heuristics have been proposed over the past decade. However, existing approaches are limited in that they use the same search heuristics throughout the entire testing process, which is inherently insufficient to exercise various execution paths. Chameleon overcomes this limitation by adapting search heuristics on the fly via an algorithm that learns new search heuristics based on the knowledge accumulated during concolic testing. Experimental results show that the transition from the traditional non-adaptive approaches to ours greatly improves the practicality of concolic testing in terms of both code coverage and bug-finding.
Symbolic Execution-Driven Extraction of the Parallel Execution Plans of Spark Applications
Luciano Baresi,
Giovanni Denaro, and Giovanni Quattrocchi
(Politecnico di Milano, Italy; University of Milano-Bicocca, Italy)
The execution of Spark applications is based on the execution order and parallelism of the different jobs, given data and available resources. Spark reifies these dependencies in a graph that we refer to as the (parallel) execution plan of the application. All the approaches that have studied the estimation of the execution times and the dynamic provisioning of resources for this kind of application have always assumed that the execution plan is unique, given the computing resources at hand. This assumption is at least simplistic for applications that include conditional branches or loops and limits the precision of the prediction techniques.
This paper introduces SEEPEP, a novel technique based on symbolic execution and search-based test generation, that: i) automatically extracts the possible execution plans of a Spark application, along with dedicated launchers with properly synthesized data that can be used for profiling, and ii) tunes the allocation of resources at runtime based on the knowledge of the execution plans for which the path conditions hold. The assessment we carried out shows that SEEPEP can effectively complement dynaSpark, an extension of Spark with dynamic resource provisioning capabilities, to help predict the execution duration and the allocation of resources.
Generating Effective Test Cases for Self-Driving Cars from Police Reports
Alessio Gambi, Tri Huynh, and Gordon Fraser
(University of Passau, Germany; Saarland University, Germany; CISPA, Germany)
Autonomous driving carries the promise to drastically reduce the number of car accidents; however, recently reported fatal crashes involving self-driving cars show that such an important goal is not yet achieved. This calls for better testing of the software controlling self-driving cars, which is difficult because it requires producing challenging driving scenarios. To better test self-driving car software, we propose to specifically test car crash scenarios, which are critical par excellence. Since real car crashes are difficult to test in field operation, we recreate them as physically accurate simulations in an environment that can be used for testing self-driving car software. To cope with the scarcity of sensory data collected during real car crashes which does not enable a full reproduction, we extract the information to recreate real car crashes from the police reports which document them. Our extensive evaluation, consisting of a user study involving 34 participants and a quantitative analysis of the quality of the generated tests, shows that we can generate accurate simulations of car crashes in a matter of minutes. Compared to tests which implement non critical driving scenarios, our tests effectively stressed the test subject in different ways and exposed several shortcomings in its implementation.
Article Search
Preference-Wise Testing for Android Applications
Yifei Lu, Minxue Pan, Juan Zhai, Tian Zhang, and Xuandong Li
(Nanjing University, China)
Preferences, the setting options provided by Android, are an essential part of Android apps. Preferences allow users to change app features and behaviors dynamically, and therefore need to be thoroughly tested. Unfortunately, the specific preferences used in test cases are typically not explicitly specified, forcing testers to manually set options or blindly try different option combinations. To effectively test the impacts of different preference options, this paper presents PREFEST, a preference-wise enhanced automatic testing approach for Android apps. Given a set of test cases, PREFEST can locate the preferences that may affect the test cases with a combined static and dynamic analysis of the app under test, and execute these test cases only under the necessary option combinations. The evaluation shows that PREFEST improves code coverage by 6.8% and branch coverage by 12.3%, and finds five more real bugs compared to testing with the original test cases. The test cost is reduced by 99%, in both the number of test cases and the testing time, compared to testing under pairwise combination of options.
Article Search
Bisecting Commits and Modeling Commit Risk during Testing
Armin Najafi, Peter C. Rigby, and
Weiyi Shang
(Concordia University, Canada)
Software testing is one of the costliest stages in the software development life cycle. One approach to reducing the test execution cost is to group changes and test them as a batch (i.e. batch testing). However, when tests fail in a batch, commits in the batch need to be re-tested to identify the cause of the failure, i.e. the culprit commit. The re-testing is typically done through bisection (i.e. a binary search through the commits in a batch). Intuitively, the effectiveness of batch testing highly depends on the size of the batch. Larger batches require fewer initial test runs, but have a higher chance of a test failure that can lead to expensive test re-runs to find the culprit. We are unaware of research that investigates and simulates the impact of batch sizes on the cost of testing in industry.
In this work, we first conduct empirical studies on the effectiveness of batch testing in three large-scale industrial software systems at Ericsson. Using 9 months of testing data, we simulate batch sizes from 1 to 20 and find the most cost-effective batch size for each project. Our results show that batch testing saves 72% of test executions compared to testing each commit individually. In a second simulation, we incorporate flaky tests that pass and fail on the same commit, as they are a significant source of additional test executions on large projects. We model the degree of flakiness for each project and find that test flakiness reduces the cost savings to 42%. In a third simulation, we guide bisection to reduce the likelihood of batch-testing failures. We model the riskiness of each commit in a batch using a bug model and a test execution history model. The risky commits are tested individually, while the less risky commits are tested in a single larger batch. Culprit predictions with our approach reduce test executions by up to 9% compared to Ericsson’s current bisection approach.
The results have been adopted by developers at Ericsson and a tool to guide bisection is in the process of being added to Ericsson’s continuous integration pipeline.
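The batch-versus-bisection trade-off can be sketched with a back-of-the-envelope cost model (a simplification assuming at most one culprit per failing batch and an idealized binary search; the function names are illustrative, not the paper's simulator):

```python
import math

def runs_per_commit(n_commits):
    # Baseline: every commit is tested individually.
    return n_commits

def runs_batched(n_commits, batch_size, batch_failure_rate):
    # One run per batch; each failing batch triggers a bisection that
    # needs about log2(batch_size) extra runs to isolate the culprit.
    n_batches = math.ceil(n_commits / batch_size)
    failing = n_batches * batch_failure_rate
    return n_batches + failing * math.ceil(math.log2(batch_size))

# With 1000 commits, batches of 8, and 10% of batches failing,
# batching needs far fewer test runs than per-commit testing:
baseline = runs_per_commit(1000)          # 1000 runs
batched = runs_batched(1000, 8, 0.10)     # 125 + 12.5 * 3 = 162.5 runs
```

Larger batches shrink the first term but inflate the failure rate (and hence the bisection term), which is why the paper searches for the most cost-effective batch size per project.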
Article Search
White-Box Testing of Big Data Analytics with Complex User-Defined Functions
Muhammad Ali Gulzar, Shaghayegh Mardani, Madanlal Musuvathi, and Miryung Kim
(University of California at Los Angeles, USA; Microsoft Research, USA)
Data-intensive scalable computing (DISC) systems such as Google’s MapReduce, Apache Hadoop, and Apache Spark are being leveraged to process massive quantities of data in the cloud. Modern DISC applications pose new challenges for exhaustive, automatic testing because, unlike SQL queries, they combine dataflow operators with prevalent, complex user-defined functions (UDFs). We design a new white-box testing approach, called BigTest, to reason about the internal semantics of UDFs in tandem with the equivalence classes created by each dataflow and relational operator.
Our evaluation shows that, despite ultra-large-scale input data sizes, real-world DISC applications are often significantly skewed and inadequate in terms of test coverage, leaving 34% of Joint Dataflow and UDF (JDU) paths untested. BigTest shows the potential to reduce the data size for local testing by a factor of 10^5 to 10^8 while revealing 2X more manually-injected faults than the previous approach. Our experiment shows that only a few data records (on the order of tens) are actually required to achieve the same JDU coverage as the entire production data. The reduction in test data also provides a CPU time saving of 194X on average, demonstrating that interactive and fast local testing is feasible for big data analytics, obviating the need to test applications on huge production data.
Article Search
Empirical Review of Java Program Repair Tools: A Large-Scale Experiment on 2,141 Bugs and 23,551 Repair Attempts
Thomas Durieux, Fernanda Madeiral, Matias Martinez, and Rui Abreu
(University of Lisbon, Portugal; INESC-ID, Portugal; Federal University of Uberlândia, Brazil; Polytechnic University of Hauts-de-France, France)
In the past decade, research on test-suite-based automatic program repair has grown significantly. Each year, new approaches and implementations are featured in major software engineering venues. However, most of those approaches are evaluated on a single benchmark of bugs, and are rarely reproduced by other researchers. In this paper, we present a large-scale experiment using 11 Java test-suite-based repair tools and 2,141 bugs from 5 benchmarks. Our goal is to better understand the current state of automatic program repair tools on a large diversity of benchmarks. Our investigation is guided by the hypothesis that the repairability of repair tools might not generalize across different benchmarks. We found that the 11 tools 1) are able to generate patches for 21% of the bugs from the 5 benchmarks, and 2) have better performance on Defects4J compared to other benchmarks, generating patches for 47% of the bugs from Defects4J compared to 10-30% of the bugs from the other benchmarks. Our experiment comprises 23,551 repair attempts, which we used to find causes of non-patch generation. These causes are reported in this paper and can help repair tool designers improve their approaches and tools.
Preprint
Info
Artifacts Available
Artifacts Reusable
iFixR: Bug Report driven Program Repair
Anil Koyuncu, Kui Liu, Tegawendé F. Bissyandé, Dongsun Kim, Martin Monperrus,
Jacques Klein, and Yves Le Traon
(University of Luxembourg, Luxembourg; Furiosa A.I., South Korea; KTH, Sweden)
Issue tracking systems are commonly used in modern software development for collecting feedback from users and developers. An ultimate automation target of software maintenance is then the systematization of patch generation for user-reported bugs. Although this ambition is aligned with the momentum of automated program repair, the literature has, so far, mostly focused on generate-and-validate setups where fault localization and patch generation are driven by a well-defined test suite. On the one hand, however, the common (yet strong) assumption of the existence of relevant test cases does not hold in practice for most development settings: many bugs are reported without the available test suite being able to reveal them. On the other hand, for many projects, the number of bug reports generally outstrips the resources available to triage them. Towards increasing the adoption of patch generation tools by practitioners, we investigate a new repair pipeline, iFixR, driven by bug reports: (1) bug reports are fed to an IR-based fault localizer; (2) patches are generated from fix patterns and validated via regression testing; (3) a prioritized list of generated patches is proposed to developers. We evaluate iFixR on the Defects4J dataset, which we enriched (i.e., faults are linked to bug reports) and carefully reorganized (i.e., the timeline of test cases is naturally split). iFixR generates genuine/plausible patches for 21/44 Defects4J faults with its IR-based fault localizer. iFixR accurately places a genuine/plausible patch among its top-5 recommendations for 8/13 of these faults (without using future test cases in generation-and-validation).
Article Search
Artifacts Available
Artifacts Reusable
Exploring and Exploiting the Correlations between Bug-Inducing and Bug-Fixing Commits
Ming Wen, Rongxin Wu, Yepang Liu, Yongqiang Tian, Xuan Xie,
Shing-Chi Cheung, and Zhendong Su
(Hong Kong University of Science and Technology, China; Xiamen University, China; Southern University of Science and Technology, China; Sun Yat-sen University, China; ETH Zurich, Switzerland)
Bug-inducing commits provide important information for understanding when and how bugs were introduced. They have therefore been extensively investigated by existing studies and frequently leveraged to facilitate bug fixing in industrial practice. Motivated by the importance of bug-inducing commits in software debugging, we conduct the first systematic empirical study to explore the correlations between bug-inducing and bug-fixing commits in terms of code elements and modifications. To facilitate the study, we collected the inducing and fixing commits for 333 bugs from seven large open-source projects. The empirical findings reveal important and significant correlations between a bug’s inducing and fixing commits. We further exploit the usefulness of such correlation findings from two aspects. First, they explain why the SZZ algorithm, the most widely adopted approach to collecting bug-inducing commits, is imprecise. In view of SZZ’s imprecision, we revisited the findings of previous studies based on SZZ, and found that 8 out of 10 previous findings are significantly affected by it. Second, the correlations shed light on the design of automated debugging techniques. For demonstration, we designed approaches that exploit the correlations with respect to statements and change actions. Our experiments on Defects4J show that our approaches can significantly boost the performance of fault localization and also advance existing APR techniques.
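One source of the SZZ imprecision the study examines is easy to see in a minimal sketch of the classic SZZ idea (a toy model, not the authors' implementation; the commit names and the flattened blame history are hypothetical):

```python
def szz_inducers(deleted_lines, blame):
    """Classic SZZ: blame each line the fix deleted or changed; the
    commits that last touched those lines are flagged as bug-inducing."""
    return {blame[line] for line in deleted_lines}

# Hypothetical history: commit c1_feature introduced the bug, and a
# later commit c2_refactor reformatted the buggy line without
# changing its behaviour.
blame = {"buggy_line": "c2_refactor", "other_line": "c1_feature"}
fix_deleted = ["buggy_line"]  # lines removed by the bug-fixing commit

inducers = szz_inducers(fix_deleted, blame)
# SZZ blames the cosmetic refactor c2_refactor, not the true inducer
# c1_feature; correlations between inducing and fixing commits can
# help filter such false positives.
```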
Article Search
Info
Effects of Explicit Feature Traceability on Program Comprehension
Jacob Krüger, Gül Çalıklı, Thorsten Berger, Thomas Leich, and
Gunter Saake
(University of Magdeburg, Germany; Chalmers University of Technology, Sweden; University of Gothenburg, Sweden; Harz University of Applied Sciences, Germany; METOP, Germany)
Developers spend a substantial amount of their time on program comprehension. To improve their comprehension and refresh their memory, developers need to communicate with other developers, read the documentation, and analyze the source code. Many studies show that developers focus primarily on the source code and that small improvements can have a strong impact. As such, it is crucial to bring the code itself into a more comprehensible form. One technique for this purpose is the use of explicit feature traces, which make it easy to identify a program’s functionalities. To improve our empirical understanding of the effects of feature traces, we report an online experiment with 49 professional software developers. We studied the impact of explicit feature traces, namely annotations and decomposition, on program comprehension and compared them to the same code without traces. Besides this experiment, we also asked our participants about their opinions in order to combine quantitative and qualitative data. Our results indicate that, as opposed to purely object-oriented code: (1) annotations can have positive effects on program comprehension; (2) decomposition can have a negative impact on bug localization; and (3) our participants perceive both techniques as beneficial. Moreover, none of the three code versions yields significant improvements in task completion time. Overall, our results indicate that lightweight traceability, such as using annotations, provides immediate benefits to developers during software development and maintenance without extensive training or tooling; and can improve current industrial practices that rely on heavyweight traceability tools (e.g., DOORS) and retroactive fulfillment of standards (e.g., ISO-26262, DO-178B).
Preprint
Artifacts Available
What the Fork: A Study of Inefficient and Efficient Forking Practices in Social Coding
Shurui Zhou,
Bogdan Vasilescu, and
Christian Kästner
(Carnegie Mellon University, USA)
Forking and pull requests have been widely used in open-source communities as a uniform development and contribution mechanism, giving developers the flexibility to modify their own fork without affecting others before attempting to contribute back. However, not all projects use forks efficiently; many experience lost and duplicate contributions and fragmented communities. In this paper, we explore how open-source projects on GitHub differ with regard to forking inefficiencies. First, we observed that different communities experience these inefficiencies to widely different degrees and interviewed practitioners to understand why. Then, using multiple regression modeling, we analyzed which context factors correlate with fewer inefficiencies. We found that better modularity and centralized management are associated with more contributions and a higher fraction of accepted pull requests, suggesting specific best practices that project maintainers can adopt to reduce forking-related inefficiencies in their communities.
Article Search
Info
ServDroid: Detecting Service Usage Inefficiencies in Android Applications
Wei Song, Jing Zhang, and Jeff Huang
(Nanjing University of Science and Technology, China; Texas A&M University, USA)
Services in Android applications are frequently-used components for performing time-consuming operations in the background. While services play a crucial role in the app performance, our study shows that service uses in practice are not as efficient as expected, e.g., they tend to cause unnecessary resource occupation and/or energy consumption. Moreover, as service usage inefficiencies do not manifest with immediate failures, e.g., app crashes, existing testing-based approaches fall short in finding them. In this paper, we identify four anti-patterns of such service usage inefficiency bugs, including premature create, late destroy, premature destroy, and service leak, and present a static analysis technique, ServDroid, to automatically and effectively detect them based on the anti-patterns. We have applied ServDroid to a large collection of popular real-world Android apps. Our results show that, surprisingly, service usage inefficiencies are prevalent and can severely impact the app performance.
Preprint
Info
Together Strong: Cooperative Android App Analysis
Felix Pauck and Heike Wehrheim
(University of Paderborn, Germany)
Recent years have seen the development of numerous tools for the analysis of taint flows in Android apps. Taint analyses aim at detecting data leaks, accidentally or by purpose programmed into apps. Often, such tools specialize in the treatment of specific features impeding precise taint analysis (like reflection or inter-app communication). This multitude of tools, their specific applicability and their various combination options complicate the selection of a tool (or multiple tools) when faced with an analysis instance, even for knowledgeable users, and hence hinders the successful adoption of taint analyses.
In this work, we thus present CoDiDroid, a framework for cooperative Android app analysis. CoDiDroid (1) allows users to ask questions about flows in apps in varying degrees of detail, (2) automatically generates subtasks for answering such questions, (3) distributes tasks onto analysis tools (currently DroidRA, FlowDroid, HornDroid, IC3 and two novel tools) and (4) at the end merges tool answers on subtasks into an overall answer. Thereby, users are freed from having to learn about the use and functionality of all these tools while still being able to leverage their capabilities. Moreover, we experimentally show that cooperation among tools pays off with respect to effectiveness, precision and scalability.
Article Search
Info
Artifacts Available
A Framework for Writing Trigger-Action Todo Comments in Executable Format
Pengyu Nie, Rishabh Rai, Junyi Jessy Li, Sarfraz Khurshid, Raymond J. Mooney, and
Milos Gligoric
(University of Texas at Austin, USA)
Natural language elements, e.g., todo comments, are frequently used to communicate among developers and to describe tasks that need to be performed (actions) when specific conditions hold on artifacts related to the code repository (triggers), e.g., from the Apache Struts project: “remove expectedJDK15 and if() after switching to Java 1.6”. As projects evolve, development processes change, and development teams reorganize, these comments, because of their informal nature, frequently become irrelevant or forgotten.
We present the first framework, dubbed TrigIt, to specify trigger-action todo comments in executable format. Thus, actions are executed automatically when triggers evaluate to true. TrigIt specifications are written in the host language (e.g., Java) and are evaluated as part of the build process. The triggers are specified as query statements over abstract syntax trees, abstract representation of build configuration scripts, issue tracking systems, and system clock time. The actions are either notifications to developers or code transformation steps. We implemented TrigIt for the Java programming language and migrated 44 existing trigger-action comments from several popular open-source projects. Evaluation of TrigIt, via a user study, showed that users find TrigIt easy to learn and use. TrigIt has the potential to enforce more discipline in writing and maintaining comments in large code repositories.
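The trigger-action idea can be approximated in a few lines of host-language code (a Python sketch of the concept only; TrigIt itself targets Java and expresses triggers as query statements over ASTs and build scripts, so the helper below is hypothetical):

```python
# Hypothetical trigger-action "todo": the trigger is a predicate over
# repository state, the action a notification or code transformation.
def todo(trigger, action):
    """Run as part of the build: fire the action when the trigger holds."""
    if trigger():
        action()

# The informal comment "remove expectedJDK15 and if() after switching
# to Java 1.6" becomes executable: when the build targets Java 1.6 or
# later, the developer is notified instead of the comment rotting.
java_version = (1, 6)  # stand-in for a value read from the build config
notifications = []
todo(trigger=lambda: java_version >= (1, 6),
     action=lambda: notifications.append("remove expectedJDK15 workaround"))
```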
Article Search
Decomposing the Rationale of Code Commits: The Software Developer’s Perspective
Khadijah Al Safwan and Francisco Servant
(Virginia Tech, USA)
Communicating the rationale behind decisions is essential for the success of software engineering projects. In particular, understanding the rationale of code commits is an important and often difficult task. We posit that part of such difficulty lies in rationale often being treated as a single piece of information. In this paper, we set to discover the breakdown of components in which developers decompose the rationale of code commits in the context of software maintenance, and to understand their experience with it and with its individual components. For this goal, we apply a mixed-methods approach, interviewing 20 software developers to ask them how they decompose rationale, and surveying an additional 24 developers to understand their experiences needing, finding, and recording those components. We found that developers decompose the rationale of code commits into 15 components, each of which is differently needed, found, and recorded. These components are: goal, need, benefits, constraints, alternatives, selected alternative, dependencies, committer, time, location, modifications, explanation of modifications, validation, maturity stage, and side effects. Our findings provide multiple implications. Educators can now disseminate the multiple dimensions and importance of the rationale of code commits. For practitioners, our decomposition of rationale defines a “common vocabulary” to use when discussing rationale of code commits, which we expect to strengthen the quality of their rationale sharing and documentation process. For researchers, our findings enable techniques for automatically assessing, improving, and generating rationale of code commits to specifically target the components that developers need.
Article Search
Info
Artifacts Available
Monitoring-Aware IDEs
Jos Winter,
Maurício Aniche, Jürgen Cito, and Arie van Deursen
(Adyen, Netherlands; Delft University of Technology, Netherlands; Massachusetts Institute of Technology, USA)
Engineering modern large-scale software requires software developers to not solely focus on writing code, but also to continuously examine monitoring data to reason about the dynamic behavior of their systems. These additional monitoring responsibilities for developers have only emerged recently, in the light of DevOps culture. Interestingly, software development activities happen mainly in the IDE, while reasoning about production monitoring happens in separate monitoring tools. We propose an approach that integrates monitoring signals into the development environment and workflow. We conjecture that an IDE with such capability improves the performance of developers as time spent continuously context switching from development to monitoring would be eliminated. This paper takes a first step towards understanding the benefits of a possible monitoring-aware IDE. We implemented a prototype of a Monitoring-Aware IDE, connected to the monitoring systems of Adyen, a large-scale payment company that performs intense monitoring in their software systems. Given our results, we firmly believe that monitoring-aware IDEs can play an essential role in improving how developers perform monitoring.
Preprint
Going Big: A Large-Scale Study on What Big Data Developers Ask
Mehdi Bagherzadeh and
Raffi Khatchadourian
(Oakland University, USA; City University of New York, USA)
Software developers are increasingly required to write big data code. However, they find big data software development challenging. To help these developers, it is necessary to understand the big data topics they are interested in and the difficulty of finding answers to questions on these topics. In this work, we conduct a large-scale study on Stack Overflow to understand the interests and difficulties of big data developers. To conduct the study, we develop a set of big data tags to extract big data posts from Stack Overflow; use topic modeling to group these posts into big data topics; group similar topics into categories to construct a topic hierarchy; analyze the popularity and difficulty of topics and their correlations; and discuss the implications of our findings for the practice, research, and education of big data software development, investigating their agreement with the findings of previous work.
Article Search
Why Aren’t Regular Expressions a Lingua Franca? An Empirical Study on the Re-use and Portability of Regular Expressions
James C. Davis, Louis G. Michael IV, Christy A. Coghlan, Francisco Servant, and Dongyoon Lee
(Virginia Tech, USA)
This paper explores the extent to which regular expressions (regexes) are portable across programming languages. Many languages offer similar regex syntaxes, and it would be natural to assume that regexes can be ported across language boundaries. But can regexes be copy/pasted across language boundaries while retaining their semantic and performance characteristics?
In our survey of 158 professional software developers, most indicated that they re-use regexes across language boundaries, and about half reported that they believe regexes are a universal language. We experimentally evaluated the riskiness of this practice using a novel polyglot regex corpus: 537,806 regexes from 193,524 projects written in JavaScript, Java, PHP, Python, Ruby, Go, Perl, and Rust. Using this corpus, we explored two hitherto-unstudied regex portability problems: logic errors due to semantic differences, and security vulnerabilities due to performance differences.
We report that developers’ belief in a regex lingua franca is understandable but unfounded. Though most regexes compile across language boundaries, 15% exhibit semantic differences and 10% exhibit performance differences across languages. We explained these differences using regex documentation, and further illuminated our findings by investigating regex engine implementations. Along the way, we found bugs in the regex engines of JavaScript-V8, Python, Ruby, and Rust, and potential semantic and performance regex bugs in thousands of modules.
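A concrete instance of such a semantic difference, checkable from Python (the JavaScript behavior is stated in a comment rather than executed; this is an illustration in the spirit of the study, not one of its reported cases):

```python
import re

# In Python 3, \d matches any Unicode decimal digit by default,
# while JavaScript's /\d/ matches ASCII 0-9 only. Copy/pasting this
# regex across the two languages silently changes what it accepts.
pattern = re.compile(r"^\d+$")

assert pattern.match("42")       # matches in both languages
assert pattern.match("٤٢")       # Arabic-Indic digits: Python yes, JS no
assert not pattern.match("forty")
```

Python offers `re.ASCII` to opt back into ASCII-only classes, which is one documented way to restore portable behavior.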
Preprint
Artifacts Available
Artifacts Reusable
Nodest: Feedback-Driven Static Analysis of Node.js Applications
Benjamin Barslev Nielsen, Behnaz Hassanshahi, and François Gauthier
(Oracle Labs, Australia; Aarhus University, Denmark)
Node.js provides the ability to write JavaScript programs for the server side and has become a popular language for developing web applications. Node.js allows direct access to the underlying filesystem, operating system resources, and databases, but does not provide any security mechanism such as sandboxing of untrusted code, and injection vulnerabilities are now commonly reported in Node.js modules. Existing static dataflow analysis techniques do not scale to Node.js applications when searching for injection vulnerabilities because even small Node.js web applications typically depend on many third-party modules. We present a new feedback-driven static analysis that scales well to detect injection vulnerabilities in Node.js applications. The key idea behind our new technique is that not all third-party modules need to be analyzed to detect an injection vulnerability. Results of running our analysis, Nodest, on real-world Node.js applications show that the technique scales to large applications and finds previously known as well as new vulnerabilities. In particular, Nodest finds 63 true positive taint flows in a set of our benchmarks, whereas a state-of-the-art static analysis reports only 3. Moreover, our analysis scales to Express, the most popular Node.js web framework, and reports non-trivial injection vulnerabilities.
Article Search
DeepStellar: Model-Based Quantitative Analysis of Stateful Deep Learning Systems
Xiaoning Du, Xiaofei Xie, Yi Li, Lei Ma,
Yang Liu, and Jianjun Zhao
(Nanyang Technological University, Singapore; Kyushu University, Japan; Zhejiang Sci-Tech University, China)
Deep Learning (DL) has achieved tremendous success in many cutting-edge applications. However, state-of-the-art DL systems still suffer from quality issues. While some recent progress has been made on the analysis of feed-forward DL systems, little study has been done on Recurrent Neural Network (RNN)-based stateful DL systems, which are widely used in audio, natural language, and video processing. In this paper, we take the first step towards the quantitative analysis of RNN-based DL systems. We model an RNN as an abstract state transition system to characterize its internal behaviors. Based on the abstract model, we design two trace similarity metrics and five coverage criteria that enable the quantitative analysis of RNNs. We further propose two algorithms powered by these quantitative measures for adversarial sample detection and coverage-guided test generation.
We evaluate DeepStellar on four RNN-based systems covering image classification and automated speech recognition. The results demonstrate that the abstract model is useful in capturing the internal behaviors of RNNs, and confirm that (1) the similarity metrics can effectively capture the differences between samples even with very small perturbations (achieving 97% accuracy for detecting adversarial samples) and (2) the coverage criteria are useful in revealing erroneous behaviors (generating three times more adversarial samples than random testing and hundreds of times more than the unrolling approach).
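The state abstraction can be illustrated with a toy discretization (a sketch assuming a simple per-dimension grid over hidden-state vectors; DeepStellar's actual abstraction and its metrics are richer than this):

```python
# Map a concrete hidden-state vector to an abstract state by binning
# each dimension into grid cells of a fixed width.
def abstract_state(hidden, cell=0.5):
    return tuple(int(h // cell) for h in hidden)

# An execution trace of hidden states induces abstract transitions.
def transitions(trace):
    states = [abstract_state(h) for h in trace]
    return set(zip(states, states[1:]))

# Two 2-D hidden-state traces from a toy stateful model:
t1 = [(0.1, 0.2), (0.6, 0.2), (0.6, 0.9)]
t2 = [(0.1, 0.1), (0.6, 0.2), (0.6, 0.9)]

# A transition-coverage-style measure: how many distinct abstract
# transitions the test set exercised.
covered = transitions(t1) | transitions(t2)
```

Trace similarity can likewise be defined over the sequences of visited abstract states, and coverage over the fraction of abstract states or transitions exercised.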
Article Search
REINAM: Reinforcement Learning for Input-Grammar Inference
Zhengkai Wu, Evan Johnson, Wei Yang, Osbert Bastani, Dawn Song, Jian Peng, and
Tao Xie
(University of Illinois at Urbana-Champaign, USA; University of Texas at Dallas, USA; University of Pennsylvania, USA; University of California at Berkeley, USA)
Program input grammars (i.e., grammars encoding the language of valid program inputs) facilitate a wide range of applications in software engineering, such as symbolic execution and delta debugging. Grammars synthesized by existing approaches can cover only a small part of the valid input space, mainly due to unanalyzable code (e.g., native code) in programs and the lack of high-quality, high-variety seed inputs. To address these challenges, we present REINAM, a reinforcement-learning approach for synthesizing probabilistic context-free program input grammars without any seed inputs. REINAM uses an industrial symbolic execution engine to generate an initial set of inputs for the given target program, and then uses an iterative process of grammar generalization to proactively generate additional inputs and infer grammars generalized from the initial seeds. To efficiently search the huge space of candidate generalization operators, REINAM formulates the search as a reinforcement learning problem. Our evaluation on eleven real-world benchmarks shows that REINAM outperforms an existing state-of-the-art approach in precision and recall of the synthesized grammars, and that fuzz testing based on REINAM substantially increases the coverage of the space of valid inputs. For some benchmarks, REINAM synthesizes a grammar covering the entire valid input space without decreasing the accuracy of the grammar.
Article Search
Info
Boosting Operational DNN Testing Efficiency through Conditioning
Zenan Li, Xiaoxing Ma,
Chang Xu, Chun Cao, Jingwei Xu, and Jian Lü
(Nanjing University, China)
With the increasing adoption of Deep Neural Network (DNN) models as integral parts of software systems, efficient operational testing of DNNs is much in demand to ensure these models’ actual performance in field conditions. A challenge is that the testing often needs to produce precise results with a very limited budget for labeling data collected in field.
Viewing software testing as a practice of reliability estimation through statistical sampling, we re-interpret the idea behind conventional structural coverages as conditioning for variance reduction. With this insight, we propose an efficient DNN testing method based on conditioning on the representation learned by the DNN model under test. The representation is defined by the probability distribution of the outputs of the neurons in the last hidden layer of the model. To sample from this high-dimensional distribution, in which the operational data are sparsely distributed, we design an algorithm leveraging cross-entropy minimization.
Experiments with various DNN models and datasets were conducted to evaluate the general efficiency of the approach. The results show that, compared with simple random sampling, this approach requires only about half of the labeled inputs to achieve the same level of precision.
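The variance-reduction argument behind conditioning can be illustrated with a textbook stratified-sampling computation (a toy example with hypothetical numbers, not the paper's cross-entropy algorithm):

```python
# Estimate a model's accuracy (the mean of 0/1 correctness values)
# from a budget of n labeled samples.

def srs_variance(values, n):
    # Variance of the simple-random-sampling estimator of the mean.
    m = sum(values) / len(values)
    var = sum((v - m) ** 2 for v in values) / len(values)
    return var / n

def stratified_variance(strata, n):
    # Proportional allocation: stratum h gets n_h = n * w_h samples,
    # contributing w_h^2 * s_h^2 / n_h to the estimator variance.
    total = sum(len(s) for s in strata)
    var = 0.0
    for s in strata:
        w = len(s) / total
        m = sum(s) / len(s)
        s_var = sum((v - m) ** 2 for v in s) / len(s)
        var += (w ** 2) * s_var / (n * w)
    return var

population = [1] * 80 + [0] * 20   # 80% accuracy overall
strata = [[1] * 80, [0] * 20]      # conditioned on a learned representation

# With the same budget n=10, conditioning drives the estimator
# variance to zero here because each stratum is homogeneous.
```

Conditioning on the last-hidden-layer representation plays the role of the strata: the more homogeneous each conditioned region is, the fewer labels are needed for the same precision.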
Article Search
A Comprehensive Study on Deep Learning Bug Characteristics
Md Johirul Islam, Giang Nguyen, Rangeet Pan, and Hridesh Rajan
(Iowa State University, USA)
Deep learning has gained substantial popularity in recent years. Developers mainly rely on libraries and tools to add deep learning capabilities to their software. What kinds of bugs are frequently found in such software? What are the root causes of such bugs? What impacts do such bugs have? Which stages of the deep learning pipeline are more bug-prone? Are there any antipatterns? Understanding such characteristics of bugs in deep learning software has the potential to foster the development of better deep learning platforms, debugging mechanisms, and development practices, and to encourage the development of analysis and verification frameworks. Therefore, we study 2716 high-quality posts from Stack Overflow and 500 bug-fix commits from GitHub about five popular deep learning libraries (Caffe, Keras, TensorFlow, Theano, and Torch) to understand the types of bugs, their root causes and impacts, the bug-prone stages of the deep learning pipeline, and whether there are common antipatterns in this buggy software. The key findings of our study include: data bugs and logic bugs are the most severe bug types in deep learning software, appearing more than 48% of the time; the major root causes of these bugs are Incorrect Model Parameter (IPS) and Structural Inefficiency (SI), showing up more than 43% of the time. We have also found that bugs in the usage of deep learning libraries follow some common antipatterns.
Article Search
Just Fuzz It: Solving Floating-Point Constraints using Coverage-Guided Fuzzing
Daniel Liew,
Cristian Cadar,
Alastair F. Donaldson, and J. Ryan Stinnett
(Imperial College London, UK; Mozilla, USA)
We investigate the use of coverage-guided fuzzing as a means of proving satisfiability of SMT formulas over finite variable domains, with specific application to floating-point constraints. We show how an SMT formula can be encoded as a program containing a location that is reachable if and only if the program’s input corresponds to a satisfying assignment to the formula. A coverage-guided fuzzer can then be used to search for an input that reaches the location, yielding a satisfying assignment. We have implemented this idea in a tool, Just Fuzz-it Solver (JFS), and we present a large experimental evaluation showing that JFS is both competitive with and complementary to state-of-the-art SMT solvers with respect to solving floating-point constraints, and that the coverage-guided approach of JFS provides significant benefit over naive fuzzing in the floating-point domain. Applied in a portfolio manner, the JFS approach thus has the potential to complement traditional SMT
solvers for program analysis tasks that involve reasoning about floating-point constraints.
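The encoding idea is simple to sketch. JFS itself emits a C program explored by a coverage-guided fuzzer; the sketch below is a hypothetical Python analogue with a plain random-search loop standing in for the fuzzer, and the formula is made up.

```python
import random
import struct

# A hypothetical SMT formula over a double-precision variable x:
#     (x > 1.0) AND (x * x < 10.0)
# encoded as a program with a target location that is reachable
# if and only if the input bytes decode to a satisfying assignment.
def encoded_program(data: bytes) -> bool:
    (x,) = struct.unpack("<d", data[:8])  # raw fuzzer bytes -> free variable
    if x > 1.0:           # each conjunct becomes a branch, so partial
        if x * x < 10.0:  # progress shows up as new branch coverage
            return True   # target reached <=> formula is satisfiable
    return False

# Naive random search standing in for the coverage-guided fuzzer;
# JFS's coverage feedback would typically reach the target far faster.
random.seed(1)
witness = None
for _ in range(200000):
    data = struct.pack("<d", random.uniform(-100.0, 100.0))
    if encoded_program(data):
        (witness,) = struct.unpack("<d", data)
        break
```

Any input that reaches the target decodes directly into a model of the formula, which is exactly how a fuzzer-found input becomes a satisfying assignment.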
Article Search
Artifacts Available
Cerebro: Context-Aware Adaptive Fuzzing for Effective Vulnerability Detection
Yuekang Li, Yinxing Xue, Hongxu Chen, Xiuheng Wu, Cen Zhang, Xiaofei Xie, Haijun Wang, and
Yang Liu
(University of Science and Technology of China, China; Nanyang Technological University, Singapore; Zhejiang Sci-Tech University, China)
Existing greybox fuzzers mainly utilize program coverage as the goal to guide the fuzzing process.
To maximize their outputs, coverage-based greybox fuzzers need to evaluate the quality of seeds properly, which involves making two decisions: 1) which is the most promising seed to fuzz next (seed prioritization), and 2) how much effort should be spent on the current seed (power scheduling).
In this paper, we present our fuzzer, Cerebro, to address the above challenges.
For the seed prioritization problem, we propose an online multi-objective based algorithm to balance various metrics such as code complexity, coverage, execution time, etc.
To address the power scheduling problem, we introduce the concept of input potential to measure the complexity of uncovered code and propose a cost-effective algorithm to update it dynamically.
Unlike previous approaches where the fuzzer evaluates an input solely based on the execution traces that it has covered, Cerebro is able to foresee the benefits of fuzzing the input by adaptively evaluating its input potential.
We perform a thorough evaluation of Cerebro on 8 different real-world programs.
The experiments show that Cerebro can find more vulnerabilities and achieve better coverage than state-of-the-art fuzzers such as AFL and AFLFast.
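Cerebro’s actual prioritization is an online multi-objective algorithm; as a rough illustration only, a scalarized version of the seed-scoring idea might look like the following (all metric names and numbers are hypothetical).

```python
# Hypothetical seed queue: each seed carries the kinds of metrics a
# Cerebro-style fuzzer might weigh (numbers are made up for illustration).
seeds = [
    {"id": "s1", "new_cov": 12, "complexity": 3, "exec_ms": 5},
    {"id": "s2", "new_cov": 12, "complexity": 9, "exec_ms": 50},
    {"id": "s3", "new_cov": 2,  "complexity": 1, "exec_ms": 2},
]

def score(seed):
    # Crude scalarization of the multi-objective trade-off: reward
    # coverage and code complexity, discount slow executions.
    return (seed["new_cov"] + seed["complexity"]) / seed["exec_ms"]

# Seed prioritization: fuzz the highest-scoring seed next.
next_seed = max(seeds, key=score)

# Power scheduling: give the chosen seed energy proportional to its score
# (a stand-in for dynamically updated "input potential").
energy = int(10 * score(next_seed))
```

The point of the sketch is the shape of the decision, not the weights: a fast seed with decent coverage beats a slow seed with slightly better metrics.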
Article Search
iFixFlakies: A Framework for Automatically Fixing Order-Dependent Flaky Tests
August Shi,
Wing Lam, Reed Oei,
Tao Xie, and
Darko Marinov
(University of Illinois at Urbana-Champaign, USA)
Regression testing provides important pass or fail signals that developers use to make decisions after code changes. However, flaky tests,
which pass or fail even when the code has not changed, can mislead
developers. A common kind of flaky tests is order-dependent tests,
which pass or fail depending on the order in which the tests are run.
Fixing order-dependent tests is often tedious and time-consuming.
We propose iFixFlakies, a framework for automatically fixing
order-dependent tests. The key insight in iFixFlakies is that test
suites often already have tests, which we call helpers, whose logic
resets or sets the states for order-dependent tests to pass. iFixFlakies
searches a test suite for helpers that make the order-dependent tests
pass and then recommends patches for the order-dependent tests
using code from these helpers. Our evaluation on 110 truly order-dependent tests from a public dataset shows that 58 of them have
helpers, and iFixFlakies can fix all 58. We opened pull requests for 56
order-dependent tests (2 of 58 were already fixed), and developers
have already accepted pull requests for 21 of them, with all the
remaining ones still pending.
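The helper idea can be illustrated with a toy test suite (all names are hypothetical; iFixFlakies works on real Java/JUnit suites and recommends patches built from the helper’s code, which this sketch omits).

```python
# Shared state that produces an order-dependent test (hypothetical).
state = {"count": 0}

def test_polluter():        # mutates shared state and leaves it dirty
    state["count"] += 1

def test_helper():          # its logic happens to reset the shared state
    state["count"] = 0

def test_victim():          # order-dependent: fails if run after the polluter
    assert state["count"] == 0

def run_order(tests):
    """Run a test order in a fresh state; report whether it passes."""
    state["count"] = 0
    try:
        for t in tests:
            t()
        return True
    except AssertionError:
        return False

# iFixFlakies-style search: find an existing test whose body makes the
# failing order pass; its code becomes the basis of the recommended patch.
failing_order = [test_polluter, test_victim]
candidates = [test_polluter, test_helper]
helper = next(t for t in candidates
              if run_order([test_polluter, t, test_victim]))
```

Inserting the discovered helper’s logic into the victim’s setup is what turns the search result into a concrete fix.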
Article Search
AggrePlay: Efficient Record and Replay of Multi-threaded Programs
Ernest Pobee and W. K. Chan
(City University of Hong Kong, China)
Deterministic replay is challenging and often incurs high memory and runtime overheads. Previous techniques often reproduce program outputs deterministically only after several replay iterations, or may produce a non-deterministic sequence of outputs to external sources. In this paper, we propose AggrePlay, a deterministic replay technique based on recording read-write interleavings, leveraging thread-local determinism and summarized read values. During the record phase, AggrePlay records a read count vector clock for each thread on each memory location. In the replay phase, each thread checks the logged vector clock against the current read count before a write event. We present an experiment and analyze the results using the Splash2x benchmark suite as well as two real-world applications. The experimental results show that, on average, AggrePlay achieves a greater reduction in compressed log size and a 56% lower runtime slowdown during the record phase, as well as a 41.58% higher probability of successful replay than existing work.
Article Search
The Review Linkage Graph for Code Review Analytics: A Recovery Approach and Empirical Study
Toshiki Hirao, Shane McIntosh, Akinori Ihara, and Kenichi Matsumoto
(NAIST, Japan; McGill University, Canada; Wakayama University, Japan)
Modern Code Review (MCR) is a pillar of contemporary quality assurance approaches, where developers discuss and improve code changes prior to integration. Since review interactions (e.g., comments, revisions) are archived, analytics approaches like reviewer recommendation and review outcome prediction have been proposed to support the MCR process. These approaches assume that reviews evolve and are adjudicated independently; yet in practice, reviews can be interdependent.
In this paper, we set out to better understand the impact of review linkage on code review analytics. To do so, we extract review linkage graphs where nodes represent reviews, while edges represent recovered links between reviews. Through a quantitative analysis of six software communities, we observe that (a) linked reviews occur regularly, with linked review rates of 25% in OpenStack, 17% in Chromium, and 3%–8% in Android, Qt, Eclipse, and LibreOffice; and (b) linkage has become more prevalent over time. Through qualitative analysis, we discover that links span 16 types that belong to five categories. To automate link category recovery, we train classifiers to label links according to the surrounding document content. Those classifiers achieve F1-scores of 0.71–0.79, at least doubling the F1-scores of a ZeroR baseline. Finally, we show that the F1-scores of reviewer recommenders can be improved by 37%–88% (5–14 percentage points) by incorporating information from linked reviews that is available at prediction time. Indeed, review linkage should be exploited by future code review analytics.
Preprint
Artifacts Available
Mitigating Power Side Channels during Compilation
Jingbo Wang, Chungha Sung, and Chao Wang
(University of Southern California, USA)
The code generation modules inside modern compilers, which use a
limited number of CPU registers to store a large number of program
variables, may introduce side-channel leaks even in software equipped
with state-of-the-art countermeasures. We propose a program analysis
and transformation based method to eliminate such leaks. Our method
has a type-based technique for detecting leaks, which leverages
Datalog-based declarative analysis and domain-specific optimizations
to achieve high efficiency and accuracy.
It also has a mitigation technique for the compiler’s backend, more
specifically the register allocation modules, to ensure that leaky
intermediate computation results are stored in different CPU registers
or memory locations.
We have implemented and evaluated our method in LLVM for the x86
instruction set architecture. Our experiments on cryptographic
software show that the method is effective in removing the side
channel while being efficient, i.e., our mitigated code is more
compact and runs faster than code mitigated using state-of-the-art
techniques.
Preprint
Maximal Multi-layer Specification Synthesis
Yanju Chen, Ruben Martins, and Yu Feng
(University of California at Santa Barbara, USA; Carnegie Mellon University, USA)
There has been significant interest in applying programming-by-example to automate repetitive and tedious tasks. However, due to the incomplete nature of input-output examples, a synthesizer may generate programs that pass the examples but do not match the user intent. In this paper, we propose MARS, a novel synthesis framework that takes as input a multi-layer specification composed of input-output examples, a textual description, and partial code snippets that capture the user intent. To accurately capture the user intent from the noisy and ambiguous description, we propose a hybrid model that combines the power of an LSTM-based sequence-to-sequence model with the Apriori algorithm for mining association rules through unsupervised learning. We reduce multi-layer specification synthesis to a Max-SMT problem, where hard constraints encode well-typed concrete programs and soft constraints encode the user intent learned by the hybrid model. We instantiate our hybrid model in the data wrangling domain and compare its performance against Morpheus, a state-of-the-art synthesizer for data wrangling tasks. Our experiments demonstrate that our approach outperforms Morpheus in terms of running time and solved benchmarks. For challenging benchmarks, our approach can suggest candidates with rankings that are an order of magnitude better than those of Morpheus, which leads to running times that are 15x faster than Morpheus.
Article Search
Phoenix: Automated Data-Driven Synthesis of Repairs for Static Analysis Violations
Rohan Bavishi, Hiroaki Yoshida, and Mukul R. Prasad
(University of California at Berkeley, USA; Fujitsu Labs, USA)
Traditional automatic program repair (APR) tools rely on a test suite as a repair specification. But test suites, even when available, are not of specification quality, limiting the performance, and hence the viability, of test-suite-based repair. On the other hand, static analysis-based bug finding tools are seeing increasing adoption in industry but still face challenges, since the reported violations are viewed as not easily actionable. We propose a novel solution that addresses both of these challenges through a technique for automatically generating high-quality patches for static analysis violations by learning from examples. Our approach uses the static analyzer as an oracle and does not require a test suite. We realize our solution in a system, Phoenix, that implements a fully automated pipeline that mines and cleans patches for static analysis violations from the wild, learns generalized executable repair strategies as programs in a novel Domain Specific Language (DSL), and then instantiates concrete repairs from them on new unseen violations. Using Phoenix we mine a corpus of 5,389 unique violations and patches from 517 GitHub projects. In a cross-validation study on this corpus, Phoenix successfully produced 4,596 bug fixes, with a recall of 85% and a precision of 54%. When applied to the latest revisions of a further 5 GitHub projects, Phoenix produced 94 correct patches for previously unknown bugs, 19 of which have already been accepted and merged by the development teams. To the best of our knowledge, this constitutes by far the largest application of any automatic patch generation technology to large-scale real-world systems.
Article Search
Black Box Fairness Testing of Machine Learning Models
Aniya Aggarwal, Pranay Lohia, Seema Nagar, Kuntal Dey, and
Diptikalyan Saha
(IBM Research, India)
Any given AI system cannot be accepted unless its trustworthiness is proven. An important characteristic of a trustworthy AI system is the absence of algorithmic bias. ‘Individual discrimination’ exists when a given individual, differing from another only in ‘protected
attributes’ (e.g., age, gender, race), receives a different decision outcome from a given machine learning (ML) model than the other individual. The current work addresses the problem of detecting the presence of individual discrimination in
given ML models. Detection of individual discrimination is test-intensive in a black-box setting, which is not feasible for
non-trivial systems. We propose a methodology for the automatic generation of test inputs for the task of detecting individual discrimination. Our approach combines two well-established techniques, symbolic execution and local explainability, for effective test case generation.
We empirically show that our approach to generating test cases is very effective compared to the best-known benchmark systems that we examine.
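The notion of individual discrimination being tested can be sketched directly (the model, attributes, and thresholds below are hypothetical; the paper’s contribution is generating such test pairs via symbolic execution and local explainability, which this sketch omits).

```python
# A hypothetical loan-decision model that leaks the protected attribute.
def model(age, gender, income):
    return income > 50000 or (gender == "m" and income > 40000)

def individual_discrimination(model, sample, protected, values):
    """Flip only the protected attribute; any change in outcome is
    individual discrimination for this sample."""
    outcomes = {v: model(**{**sample, protected: v}) for v in values}
    return len(set(outcomes.values())) > 1

# A generated test input: two individuals identical except for gender.
sample = {"age": 30, "gender": "f", "income": 45000}
biased = individual_discrimination(model, sample, "gender", ["m", "f"])
```

The hard part, which the paper addresses, is finding samples like this one efficiently rather than enumerating the input space.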
Article Search
Java Reflection API: Revealing the Dark Side of the Mirror
Felipe Pontes,
Rohit Gheyi, Sabrina Souto, Alessandro Garcia, and
Márcio Ribeiro
(Federal University of Campina Grande, Brazil; State University of Paraíba, Brazil; PUC-Rio, Brazil; Federal University of Alagoas, Brazil)
Developers of widely used Java Virtual Machines (JVMs) implement and test the Java Reflection API based on its Javadoc, which is specified in natural language. However, there is limited knowledge on whether Java Reflection API developers are able to systematically reveal i) underdetermined specifications; and ii) non-conformances between their implementation and the Javadoc. Moreover, current automatic test suite generators cannot be used to detect them. To better understand the problem, we analyze the test suites of two widely used JVMs, and we conduct a survey with 130 developers who use the Java Reflection API to see whether the Javadoc impacts their understanding. We also propose a technique to detect underdetermined specifications and non-conformances between the Javadoc and the implementations of the Java Reflection API. It automatically creates test cases and executes them using different JVMs. Then, we manually execute some steps to identify underdetermined specifications and to confirm whether a non-conformance candidate is indeed a bug. We evaluate our technique on 439 input programs. Our technique identifies underdetermined specification and non-conformance candidates in 32 Java Reflection API public methods of 7 classes. We report underdetermined specification candidates in 12 Java Reflection API methods. Java Reflection API specifiers accept 3 underdetermined specification candidates (25%). We also report 24 non-conformance candidates to the Eclipse OpenJ9 JVM, and 7 to the Oracle JVM. Eclipse OpenJ9 JVM developers accept and fix 21 candidates (87.5%), and Oracle JVM developers accept 5 and fix 4 non-conformance candidates.
Article Search
A Conceptual Replication of Continuous Integration Pain Points in the Context of Travis CI
David Gray Widder, Michael Hilton,
Christian Kästner, and
Bogdan Vasilescu
(Carnegie Mellon University, USA)
Continuous integration (CI) is an established software quality assurance practice, and the focus of much prior research with a diverse range of methods and populations. In this paper, we first conduct a literature review of 37 papers on CI pain points. We then conduct a conceptual replication study on results from these papers using a triangulation design consisting of a survey with 132 responses, 12 interviews, and two logistic regressions predicting Travis CI abandonment and switching on a dataset of 6,239 GitHub projects. We report and discuss which past results we were able to replicate, those for which we found conflicting evidence, those for which we did not find evidence, and the implications of these findings.
Article Search
Info
Ethnographic Research in Software Engineering: A Critical Review and Checklist
He Zhang, Xin Huang, Xin Zhou, Huang Huang, and Muhammad Ali Babar
(Nanjing University, China; University of Adelaide, Australia)
The Software Engineering (SE) community has recently been investing a significant amount of effort in qualitative research to study the human and social aspects of SE processes, practices, and technologies. Ethnography is one of the major qualitative research methods, and is based on a constructivist paradigm that differs from the hypothetico-deductive research model usually used in SE. Hence, the adoption of the ethnographic research method in SE can present significant challenges in terms of sufficient understanding of its methodological requirements and the logistics of its application. It is important to systematically identify and understand the various aspects of adopting ethnography in SE and to provide effective guidance. We carried out an empirical inquiry by integrating a systematic literature review and a confirmatory survey. By reviewing the ethnographic studies reported in 111 identified papers and 26 doctoral theses, and analyzing the authors’ responses for 29 of those papers, we revealed several unique insights. These identified insights were then transformed into a preliminary checklist that helps improve the state of the practice of using ethnography in SE. This study also identifies the areas where methodological improvements of ethnography are needed in SE.
Article Search
Achilles’ Heel of Plug-and-Play Software Architectures: A Grounded Theory Based Approach
Joanna C. S. Santos, Adriana Sejfia, Taylor Corrello, Smruthi Gadenkanahalli, and Mehdi Mirakhorli
(Rochester Institute of Technology, USA)
Through a set of well-defined interfaces, plug-and-play architectures enable additional functionalities to be added to or removed from a system at runtime. However, plug-ins can also increase the application’s attack surface or introduce untrusted behavior into the system. In this paper, we (1) use a grounded theory-based approach to conduct an empirical study of common vulnerabilities in plug-and-play architectures; (2) conduct a systematic literature survey and evaluate the extent to which the results of the empirical study are novel or supported by the literature; (3) evaluate the practicality of the findings by interviewing practitioners with several years of experience in plug-and-play systems. By analyzing Chromium, Thunderbird, Firefox, Pidgin, WordPress, Apache OFBiz, and OpenMRS, we found a total of 303 vulnerabilities rooted in extensibility design decisions and observed that these plugin-related vulnerabilities fell into 16 different vulnerability types. For these 16 vulnerability types, we identified 19 mitigation procedures for fixing them. The literature review supported 12 vulnerability types and 8 mitigation techniques discovered in our empirical study, and indicated that 5 mitigation techniques were not covered in our empirical study. Furthermore, it indicated that 4 vulnerability types and 11 mitigation techniques discovered in our empirical study were not covered in the literature. The interviews with practitioners confirmed the relevance of the findings and highlighted ways that the results of this empirical study can have an impact in practice.
Preprint
Info
Latent Error Prediction and Fault Localization for Microservice Applications by Learning from System Trace Logs
Xiang Zhou,
Xin Peng,
Tao Xie,
Jun Sun, Chao Ji, Dewei Liu, Qilin Xiang, and Chuan He
(Fudan University, China; University of Illinois at Urbana-Champaign, USA; Singapore Management University, Singapore)
In the production environment, a large part of microservice failures are related to the complex and dynamic interactions and runtime environments, such as those related to multiple instances, environmental configurations, and asynchronous interactions of microservices. Due to the complexity and dynamism of these failures, it is often hard to reproduce and diagnose them in testing environments. It is desirable yet still challenging that these failures can be detected and the faults located at runtime in the production environment to allow developers to resolve them efficiently. To address this challenge, in this paper, we propose MEPFL, an approach to latent error prediction and fault localization for microservice applications by learning from system trace logs. Based on a set of features defined on the system trace logs, MEPFL trains prediction models at both the trace level and the microservice level using the system trace logs collected from automatic executions of the target application and its faulty versions produced by fault injection. The prediction models thus can be used in the production environment to predict latent errors, faulty microservices, and fault types for trace instances captured at runtime. We implement MEPFL based on the infrastructure systems of container orchestrator and service mesh, and conduct a series of experimental studies with two open-source microservice applications (one of them being the largest open-source microservice application to the best of our knowledge). The results indicate that MEPFL can achieve high accuracy in intra-application prediction of latent errors, faulty microservices, and fault types, and outperforms a state-of-the-art approach of failure diagnosis for distributed systems. The results also show that MEPFL can effectively predict latent errors caused by real-world fault cases.
Preprint
The Importance of Accounting for Real-World Labelling When Predicting Software Vulnerabilities
Matthieu Jimenez, Renaud Rwemalika, Mike Papadakis,
Federica Sarro, Yves Le Traon, and
Mark Harman
(University of Luxembourg, Luxembourg; University College London, UK; Facebook, UK)
Previous work on vulnerability prediction assumes that predictive models are trained with respect to perfect labelling information (including labels from future, as yet undiscovered vulnerabilities). In this paper we present results from a comprehensive empirical study of 1,898 real-world vulnerabilities reported in 74 releases of three security-critical open source systems (Linux Kernel, OpenSSL and Wireshark). Our study investigates the effectiveness of three previously proposed vulnerability prediction approaches, in two settings: with and without the unrealistic labelling assumption. The results reveal that the unrealistic labelling assumption can profoundly mislead the scientific conclusions drawn; the highly effective and deployable prediction results it suggests vanish when we fully account for realistically available labelling in the experimental methodology. More precisely, MCC mean values of predictive effectiveness drop from 0.77, 0.65 and 0.43 to 0.08, 0.22 and 0.10 for Linux Kernel, OpenSSL and Wireshark, respectively. Similar results are also obtained for precision, recall and other assessments of predictive efficacy. The community therefore needs to upgrade its experimental and empirical methodology for vulnerability prediction evaluation and development to ensure robust and actionable scientific findings.
Article Search
Detecting Concurrency Memory Corruption Vulnerabilities
Yan Cai, Biyun Zhu, Ruijie Meng, Hao Yun,
Liang He, Purui Su, and Bin Liang
(Institute of Software at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Renmin University of China, China)
Memory corruption vulnerabilities can occur in multithreaded executions, known as concurrency vulnerabilities in this paper. Due to non-deterministic multithreaded executions, they are extremely difficult to detect. Recently, researchers have tried to apply data race detectors to detect concurrency vulnerabilities. Unfortunately, these detectors are ineffective at detecting concurrency vulnerabilities. For example, most (90%) data races are benign. However, concurrency vulnerabilities are harmful and can usually be exploited to launch attacks. Techniques based on the maximal causal model rely on constraint solvers to predict schedulings; they can miss concurrency vulnerabilities in practice. Our insight is that a concurrency vulnerability is more related to the orders of events that can be reversed in different executions, no matter whether the corresponding accesses form data races. We then define exchangeable events to identify pairs of events whose execution orders can probably be reversed in different executions. We further propose algorithms to detect three major kinds of concurrency vulnerabilities. To overcome the potential imprecision of exchangeable events, we also adopt a validation step to isolate real vulnerabilities. We implemented our algorithms in a tool, ConVul, and applied it to 10 known concurrency vulnerabilities and the MySQL database server. Compared with three widely used race detectors and one detector based on the maximal causal model, ConVul was significantly more effective, detecting 9 of the 10 known vulnerabilities and 6 zero-day vulnerabilities in MySQL (four of which have been confirmed). The other detectors detected at most 3 of these 16 known and zero-day vulnerabilities.
Article Search
Locating Vulnerabilities in Binaries via Memory Layout Recovering
Haijun Wang, Xiaofei Xie,
Shang-Wei Lin,
Yun Lin, Yuekang Li, Shengchao Qin,
Yang Liu, and Ting Liu
(Shenzhen University, China; Nanyang Technological University, Singapore; National University of Singapore, Singapore; Teesside University, UK; Xi’an Jiaotong University, China)
Locating vulnerabilities is an important task for security auditing, exploit writing, and code hardening. However, it is challenging to locate vulnerabilities in binary code, because most program semantics (e.g., the boundaries of an array) is missing after compilation. Without program semantics, it is difficult to determine whether a memory access exceeds its valid boundaries in binary code. In this work, we propose an approach to locate vulnerabilities based on memory layout recovery. First, we collect a set of passed executions and one failed execution. Then, for the passed and failed executions, we restore their program semantics by recovering fine-grained memory layouts based on the memory addressing model. With the memory layouts recovered in passed executions as a reference, we can locate vulnerabilities in the failed execution by memory layout identification and comparison. Our experiments show that the proposed approach is effective at locating vulnerabilities in 24 out of 25 of DARPA’s CGC programs (96%), and can effectively classify 453 program crashes (in 5 Linux programs) into 19 groups based on their root causes.
Article Search
Storm: Program Reduction for Testing and Debugging Probabilistic Programming Systems
Saikat Dutta, Wenxian Zhang, Zixin Huang, and Sasa Misailovic
(University of Illinois at Urbana-Champaign, USA)
Probabilistic programming languages offer an intuitive way to model uncertainty by representing complex probability models as simple probabilistic programs. Probabilistic programming systems (PP systems) hide the complexity of inference algorithms away from the program developer. Unfortunately, if a failure occurs during the run of a PP system, a developer typically has very little support in finding the part of the probabilistic program that causes the failure in the system.
This paper presents Storm, a novel general framework for reducing probabilistic programs. Given a probabilistic program (with associated data and inference arguments) that causes a failure in
a PP system, Storm finds a smaller version of the program, data, and arguments that cause the same failure. Storm leverages both generic code and data transformations from compiler testing and
domain-specific, probabilistic transformations. The paper presents new transformations that reduce the complexity of statements and expressions, reduce data size, and simplify inference arguments
(e.g., the number of iterations of the inference algorithm).
We evaluated Storm on 47 programs that caused failures in two popular probabilistic programming systems, Stan and Pyro. Our experimental results show Storm’s effectiveness. For Stan, our minimized programs have 49% less code, 67% less data, and 96% fewer iterations. For Pyro, our minimized programs have 58% less code, 96% less data, and 99% fewer iterations. We also show the benefits of Storm when debugging probabilistic programs.
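The core reduce-while-preserving-the-failure loop can be sketched as follows. This is a hypothetical, greedy statement-deletion pass only; Storm’s real transformations are richer (data reduction, inference-argument simplification, probabilistic rewrites) and its oracle is a rerun of the PP system.

```python
def fails(program):
    # Hypothetical oracle: the PP system crashes whenever these two
    # statements appear together (a stand-in for rerunning Stan/Pyro).
    return "sample_stmt" in program and "bad_prior" in program

def reduce_program(program):
    """Greedy one-statement-at-a-time reduction: keep deleting any
    statement whose removal still reproduces the original failure."""
    changed = True
    while changed:
        changed = False
        for i in range(len(program)):
            candidate = program[:i] + program[i + 1:]
            if fails(candidate):    # failure preserved -> accept deletion
                program = candidate
                changed = True
                break
    return program

original = ["data_decl", "sample_stmt", "loop", "bad_prior", "print"]
minimal = reduce_program(original)
```

The invariant is the important part: every accepted transformation must still trigger the same failure, so the final program is a smaller witness of the same bug.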
Article Search
NullAway: Practical Type-Based Null Safety for Java
Subarno Banerjee, Lazaro Clapp, and Manu Sridharan
(University of Michigan, USA; Uber Technologies, USA; University of California at Riverside, USA)
NullPointerExceptions (NPEs) are a key source of crashes in modern Java programs. Previous work has shown how such errors can be prevented at compile time via code annotations and pluggable type checking. However, such systems have been difficult to deploy on large-scale software projects, due to significant build-time overhead and/or a high annotation burden. This paper presents NullAway, a new type-based null safety checker for Java that overcomes these issues. NullAway has been carefully engineered for low overhead, so it can run as part of every build. Further, NullAway reduces annotation burden through targeted unsound assumptions, aiming for no false negatives in practice on checked code. Our evaluation shows that NullAway has significantly lower build-time overhead (1.15×) than comparable tools (2.8–5.1×). Further, on a corpus of production crash data for widely used Android apps built with NullAway, the remaining NPEs were due to unchecked third-party libraries (64%), deliberate error suppressions (17%), or reflection and other forms of post-checking code modification (17%), and never due to NullAway’s unsound assumptions for checked code.
Article Search
Artifacts Available
Artifacts Reusable
Automatically Detecting Missing Cleanup for Ungraceful Exits
Zhouyang Jia, Shanshan Li,
Tingting Yu, Xiangke Liao, and Ji Wang
(National University of Defense Technology, China; University of Kentucky, USA)
Software encounters ungraceful exits due to either bugs in the interrupt/signal handler code or the intention of developers to debug the software. Users may suffer from “weird” problems caused by the leftovers of ungraceful exits. A common practice to fix these problems is rebooting, which wipes away the stale state of the software. This solution, however, is heavyweight and often leads to poor user experience because it requires restarting other normal processes. In this paper, we design SafeExit, a tool that can automatically detect and pinpoint the root causes of problems caused by ungraceful exits, which can help users fix the problems using lightweight solutions. Specifically, SafeExit checks the program’s exit behaviors in the case of an interrupted execution against its expected exit behaviors to detect the missing cleanup behaviors required for avoiding an ungraceful exit. The expected behaviors are obtained by monitoring the program’s exit under a normal execution. We apply SafeExit to 38 programs across 10 domains. SafeExit finds 133 types of cleanup behaviors in 36 programs and detects 2861 missing behaviors in 292 interrupted executions. To predict missing behaviors for unseen input scenarios, SafeExit trains prediction models using a set of sampled input scenarios. The results show that SafeExit is accurate, with an average F-measure of 92.5%.
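The expected-versus-interrupted comparison at the heart of this approach reduces to a set difference, sketched here with hypothetical action names (SafeExit itself monitors real process exits; this only shows the diffing step).

```python
# Cleanup actions observed when the program exits normally vs. when it is
# interrupted (hypothetical traces a SafeExit-like monitor might record).
normal_exit = ["flush_buffers", "remove_lockfile", "restore_terminal"]
interrupted_exit = ["flush_buffers"]

def missing_cleanup(expected, observed):
    """Expected exit behaviors absent from the interrupted run are the
    root-cause candidates for the leftover 'weird' state."""
    return [action for action in expected if action not in observed]

leftovers = missing_cleanup(normal_exit, interrupted_exit)
```

Each flagged action points the user at a lightweight fix (e.g., deleting a stale lock file) instead of a full reboot.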
Article Search
Finding and Understanding Bugs in Software Model Checkers
Chengyu Zhang, Ting Su, Yichen Yan, Fuyuan Zhang, Geguang Pu, and Zhendong Su
(East China Normal University, China; ETH Zurich, Switzerland; MPI-SWS, Germany)
Software Model Checking (SMC) is a well-known automatic program
verification technique and frequently adopted for checking
safety-critical software. Thus, the reliability of SMC tools
themselves (i.e., software model checkers) is critical. However, little
work exists on validating software model checkers, an important
problem that this paper tackles by introducing a practical,
automated fuzzing technique. For its simplicity and
generality, we focus on control-flow reachability (e.g., whether or how
many times a branch is reached) and address two specific challenges
for effective fuzzing: oracle and scalability. Given a
deterministic program, we (1) leverage its concrete executions to
synthesize valid branch reachability properties (thus solving the
oracle problem) and (2) fuse such individual properties into a single
safety property (thus improving the scalability of fuzzing and
reducing manual inspection). We have realized our approach as the
MCFuzz tool and applied it to extensively test three state-of-the-art C
software model checkers, CPAchecker, CBMC, and SeaHorn. MCFuzz has
found 62 unique bugs in all three model checkers; 58 have been confirmed,
and 20 have been fixed. We have further analyzed and
categorized these diverse bugs, and summarized several
lessons for building reliable and robust model checkers. Our testing
effort has been well received by the model checker developers, and has
also led to improved tool usability and documentation.
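The oracle idea above can be paraphrased as: run the deterministic program concretely, record how often each branch is taken, and fuse those observations into one safety property that the model checker must agree with. A hypothetical sketch of that fusing step (trace format and function names are ours):

```python
from collections import Counter

def observed_branch_counts(trace):
    """Count how often each branch id appears in a concrete execution trace."""
    return Counter(trace)

def fused_property_holds(trace, claimed_counts):
    """The fused safety property: every per-branch reachability count the
    checker claims must match the concrete run. A single combined check
    replaces one property per branch, which is the scalability point."""
    return observed_branch_counts(trace) == Counter(claimed_counts)

trace = ["b1", "b2", "b1"]                               # b1 taken twice, b2 once
print(fused_property_holds(trace, {"b1": 2, "b2": 1}))   # True
print(fused_property_holds(trace, {"b1": 1, "b2": 1}))   # False: a checker
# claiming b1 is reached once contradicts the concrete execution
```

In the actual tool the fused property is encoded back into the C program (e.g., via counters and a single assertion) so each model checker can be tested against it directly.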
Article Search
A Segmented Memory Model for Symbolic Execution
Timotej Kapus and
Cristian Cadar
(Imperial College London, UK)
Symbolic execution is an effective technique for exploring paths in a
program and reasoning about all possible values on those paths.
However, the technique still struggles with code that uses complex
heap data structures, in which a pointer is allowed to refer to more
than one memory object. In such cases, symbolic execution typically
forks execution into multiple states, one for each object to which the
pointer could refer.
In this paper, we propose a technique that avoids this expensive
forking by using a segmented memory model. In this model,
memory is split into segments, so that each symbolic pointer refers to
objects in a single segment. The size of each segment is bounded by a
threshold, in order to avoid expensive constraints. This results in a
memory model in which forking due to symbolic pointer dereferences is
significantly reduced, and often eliminated entirely.
We evaluate our segmented memory model on a mix of whole program benchmarks
(such as m4 and make) and library benchmarks (such as SQLite), and
observe significant decreases in execution time and memory usage.
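The benefit of the segmented model can be seen by counting states: a flat model forks one state per object a symbolic pointer may reference, while the segmented model forks once per segment. A toy illustration (the grouping function and names are invented; the real implementation's details are omitted):

```python
def states_forked_flat(pointees):
    """Flat memory model: one forked state per object the symbolic
    pointer may refer to."""
    return len(pointees)

def states_forked_segmented(pointees, segment_of):
    """Segmented model: one state per *segment*; objects a pointer may
    alias are co-located in one segment, so the common case is one state."""
    return len({segment_of[obj] for obj in pointees})

# A symbolic pointer that may point to any node of a 4-node linked list:
pointees = ["node0", "node1", "node2", "node3"]
segment_of = {obj: "list_segment" for obj in pointees}  # all co-located
print(states_forked_flat(pointees))                   # 4
print(states_forked_segmented(pointees, segment_of))  # 1
```

The pointer's possible targets become a constraint over offsets within one segment rather than a case split over states, which is where the reported time and memory savings come from.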
Article Search
Artifacts Available
Artifacts Reusable
Releasing Fast and Slow: An Exploratory Case Study at ING
Elvan Kula, Ayushi Rastogi, Hennie Huijgens, Arie van Deursen, and
Georgios Gousios
(Delft University of Technology, Netherlands; ING Bank, Netherlands)
The appeal of delivering new features faster has led many software projects to adopt rapid releases. However, the effects of this practice are not well understood. This paper presents an exploratory case study of rapid releases at ING, a large banking company that develops software solutions in-house, to characterize rapid releases. Since 2011, ING has shifted to a rapid release model. This switch has resulted in a mixed environment of 611 teams releasing relatively fast and slow. We followed a mixed-methods approach in which we conducted a survey with 461 participants and corroborated their perceptions with 2 years of code quality data and 1 year of release delay data. Our research shows that: rapid releases are more commonly delayed than their non-rapid counterparts, but their delays are shorter; rapid releases can be beneficial in terms of reviewing and user-perceived quality; rapidly released software tends to have higher code churn, higher test coverage, and lower average complexity; and the challenges of rapid releases relate to managing dependencies and certain code aspects, e.g., design debt.
Preprint
SAR: Learning Cross-Language API Mappings with Little Knowledge
Nghi D. Q. Bui,
Yijun Yu, and Lingxiao Jiang
(Singapore Management University, Singapore; Open University, UK)
To save effort, developers often translate programs from one programming language to another instead of implementing them from scratch. Translating application programming interfaces (APIs) used in one language to functionally equivalent ones available in another language is an important aspect of program translation. Existing approaches facilitate the translation by automatically identifying the API mappings across programming languages. However, these approaches still require a large amount of parallel corpora, ranging from pairs of APIs or code fragments that are functionally equivalent to similar code comments.
To minimize the need for parallel corpora, this paper aims at an automated approach that can map APIs across languages with much less a priori knowledge than other approaches. The approach is based on a realization of the notion of domain adaptation, combined with code embedding, to better align two vector spaces. Taking as input large sets of programs, our approach first generates numeric vector representations of the programs (including the APIs used in each language), and it adapts generative adversarial networks (GANs) to align the vectors across the spaces of the two languages. For a better alignment, we initialize the GAN with parameters derived from API mapping seeds that can be identified accurately with a simple automatic signature-based matching heuristic. Cross-language API mappings can then be identified via nearest-neighbor queries in the aligned vector spaces. We have implemented the approach (SAR, named after its three main technical components) in a prototype for mapping APIs across Java and C# programs. Our evaluation on about 2 million Java files and 1 million C# files shows that the approach achieves 54% and 82% mapping accuracy in its top-1 and top-10 API mapping results with only 174 automatically identified seeds, which is more accurate than other approaches that use the same number of seeds or many more.
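Once the two embedding spaces are aligned, the final step described above is an ordinary nearest-neighbor query. A small sketch of such a query by cosine similarity (the API names and vectors below are made up for illustration, not from SAR's data):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def top_k_mappings(java_vec, cs_api_vectors, k=2):
    """Rank candidate C# APIs by cosine similarity to a Java API's
    vector in the aligned space."""
    ranked = sorted(cs_api_vectors,
                    key=lambda api: cosine(java_vec, cs_api_vectors[api]),
                    reverse=True)
    return ranked[:k]

# Toy aligned vectors (invented):
cs_apis = {
    "StringBuilder.Append": [0.9, 0.1],
    "List.Add":             [0.1, 0.9],
    "Console.WriteLine":    [0.5, 0.5],
}
java_vec = [0.8, 0.2]   # e.g., the aligned vector of Java's StringBuilder.append
print(top_k_mappings(java_vec, cs_apis))
# ['StringBuilder.Append', 'Console.WriteLine']
```

The heavy lifting in the paper is producing the alignment itself (GAN initialization from signature-matched seeds); the query step shown here is deliberately simple.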
Preprint
Info
Artifacts Available
Artifacts Reusable
Robust Log-Based Anomaly Detection on Unstable Log Data
Xu Zhang, Yong Xu, Qingwei Lin, Bo Qiao, Hongyu Zhang, Yingnong Dang, Chunyu Xie, Xinsheng Yang, Qian Cheng, Ze Li, Junjie Chen, Xiaoting He, Randolph Yao, Jian-Guang Lou, Murali Chintalapati, Furao Shen, and Dongmei Zhang
(Microsoft Research, China; Nanjing University, China; University of Newcastle, Australia; Microsoft, USA; Tianjin University, China)
Logs are widely used by large and complex software-intensive systems for troubleshooting, and there have been many studies on log-based anomaly detection. To detect anomalies, existing methods mainly construct a detection model using log event data extracted from historical logs. However, we find that these methods do not work well in practice. They make a closed-world assumption: that log data is stable over time and that the set of distinct log events is known. Our empirical study shows that, in practice, log data often contains previously unseen log events or log sequences. The instability of log data comes from two sources: 1) the evolution of logging statements, and 2) processing noise in log data. In this paper, we propose a new log-based anomaly detection approach, called LogRobust. LogRobust extracts semantic information from log events and represents them as semantic vectors. It then detects anomalies by utilizing an attention-based Bi-LSTM model, which can capture the contextual information in log sequences and automatically learn the importance of different log events. In this way, LogRobust is able to identify and handle unstable log events and sequences. We have evaluated LogRobust using logs collected from the Hadoop system and an actual online service system of Microsoft. The experimental results show that the proposed approach effectively addresses the problem of log instability and achieves accurate and robust results on real-world, ever-changing log data.
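The key robustness idea is representing a log event by the semantics of its words rather than by an opaque event id, so an unseen-but-similar event lands near a known one. A toy sketch with hand-made two-dimensional word vectors (the real approach uses pretrained word embeddings and weighted aggregation, not this invented table):

```python
import math

# Invented toy word vectors; a real system would use pretrained embeddings.
word_vec = {
    "connection": [1.0, 0.0],
    "failed":     [0.0, 1.0],
    "error":      [0.1, 0.9],
    "job":        [-1.0, 0.0],
    "completed":  [0.0, -1.0],
}

def event_vector(template):
    """Semantic vector of a log event: the average of its word vectors."""
    words = template.lower().split()
    return [sum(word_vec[w][i] for w in words) / len(words) for i in range(2)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

seen   = event_vector("Connection failed")   # event present in training logs
unseen = event_vector("Connection error")    # new wording, same meaning
other  = event_vector("Job completed")
print(cosine(seen, unseen) > cosine(seen, other))   # True
```

Because the unseen event's vector is close to a known one, the downstream sequence model can treat it sensibly instead of failing on an out-of-vocabulary event id.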
Article Search
Pinpointing Performance Inefficiencies in Java
Pengfei Su, Qingsen Wang, Milind Chabbi, and Xu Liu
(College of William and Mary, USA; Scalable Machines Research, USA)
Many performance inefficiencies, such as inappropriate choice of algorithms or data structures, developers’ inattention to performance, and missed compiler optimizations, show up as wasteful memory operations. Wasteful memory operations are those that produce/consume data to/from memory that could have been avoided. We present JXPerf, a lightweight performance analysis tool for pinpointing wasteful memory operations in Java programs. Traditional bytecode instrumentation for such analysis (1) introduces prohibitive overheads and (2) misses inefficiencies in machine code generation. JXPerf overcomes both of these problems. JXPerf uses hardware performance monitoring units to sample memory locations accessed by a program and uses hardware debug registers to monitor subsequent accesses to the same memory. The result is a lightweight measurement at the machine code level with attribution of inefficiencies to their provenance: machine and source code within full calling contexts. JXPerf introduces only 7% runtime overhead and 7% memory overhead, making it useful in production. Guided by JXPerf, we optimize several Java applications by improving code generation and choosing superior data structures and algorithms, which yield significant speedups.
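One classic instance of a wasteful memory operation is a dead store: a write that is overwritten before it is ever read. JXPerf finds such patterns with hardware sampling and debug registers; here is a purely illustrative software sketch of the same detection rule over an access trace (the trace format is invented):

```python
def dead_stores(trace):
    """Indices of stores whose value is overwritten with no intervening
    read of the same address. Trace entries: ("load" | "store", addr)."""
    last_store = {}   # addr -> index of the most recent not-yet-read store
    dead = []
    for i, (op, addr) in enumerate(trace):
        if op == "store":
            if addr in last_store:
                dead.append(last_store[addr])   # previous store never read
            last_store[addr] = i
        else:  # load
            last_store.pop(addr, None)          # that store was consumed
    return dead

trace = [
    ("store", 0x10),   # index 0: dead, overwritten at index 2 without a read
    ("store", 0x20),
    ("store", 0x10),
    ("load",  0x10),
]
print(dead_stores(trace))   # [0]
```

The tool's contribution is doing this kind of check at the machine-code level with ~7% overhead via sampling, instead of tracing every access as this sketch does.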
Article Search
Understanding Flaky Tests: The Developer’s Perspective
Moritz Eck, Fabio Palomba, Marco Castelluccio, and Alberto Bacchelli
(University of Zurich, Switzerland; Mozilla, UK)
Flaky tests are software tests that exhibit a seemingly random outcome (pass or fail) despite exercising unchanged code. In this work, we examine the perceptions of software developers about the nature, relevance, and challenges of flaky tests.
We asked 21 professional developers to classify 200 flaky tests they previously fixed, in terms of the nature and the origin of the flakiness, as well as of the fixing effort. We also examined developers’ fixing strategies.
Subsequently, we conducted an online survey with 121 developers with a median industrial programming experience of five years.
Our research shows that: flakiness is due to several different causes, four of which have never been reported before, despite being the most costly to fix; flakiness is perceived as significant by the vast majority of developers, regardless of their team’s size and project’s domain, and it can affect resource allocation, scheduling, and the perceived reliability of the test suite; and the challenges developers report facing mostly concern reproducing the flaky behavior and identifying the cause of the flakiness.
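A canonical flakiness pattern in this literature is the async wait: a test assumes a background task has finished after a fixed sleep. A self-contained sketch of the fragile pattern and the usual fix, blocking on an explicit completion signal instead of sleeping (the example is ours, not from the paper's dataset):

```python
import threading

done = threading.Event()

def background_task():
    # Stands in for work whose duration varies from run to run.
    done.set()

# Flaky pattern (shown as a comment, not executed):
#   time.sleep(0.001); assert done.is_set()
# fails whenever the thread happens to be scheduled late.

# Robust pattern: block on the completion signal with a generous timeout.
t = threading.Thread(target=background_task)
t.start()
finished = done.wait(timeout=5.0)   # waits only as long as actually needed
t.join()
print(finished)   # True
```

The fix removes the timing assumption entirely, which matches the survey finding that reproducing timing-dependent behavior is the hardest part of debugging flaky tests.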
Public preprint [http://arxiv.org/abs/1907.01466], data and materials [https://doi.org/10.5281/zenodo.3265785].
Article Search
SEntiMoji: An Emoji-Powered Learning Approach for Sentiment Analysis in Software Engineering
Zhenpeng Chen, Yanbin Cao, Xuan Lu, Qiaozhu Mei, and Xuanzhe Liu
(Peking University, China; University of Michigan, USA)
Sentiment analysis has various application scenarios in software engineering (SE), such as detecting developers’ emotions in commit messages and identifying their opinions on Q&A forums. However, commonly used out-of-the-box sentiment analysis tools cannot obtain reliable results on SE tasks, and misunderstanding of technical jargon has been shown to be the main reason. Researchers therefore have to utilize labeled SE-related texts to customize sentiment analysis for SE tasks via a variety of algorithms. However, such scarce labeled data can cover only very limited expressions and thus cannot guarantee the analysis quality. To address this problem, we turn to easily available emoji usage data for help. More specifically, we employ emotional emojis as noisy labels of sentiments and propose a representation learning approach that uses both Tweets and GitHub posts containing emojis to learn sentiment-aware representations for SE-related texts. These emoji-labeled posts can not only supply the technical jargon, but also incorporate more general sentiment patterns shared across domains. They, together with the labeled data, are used to learn the final sentiment classifier. Compared to the existing sentiment analysis methods used in SE, the proposed approach achieves significant improvements on representative benchmark datasets. Through further contrast experiments, we find that the Tweets make a key contribution to the power of our approach. This finding informs future research not to pursue domain-specific resources unilaterally, but to transfer knowledge from the open domain through ubiquitous signals such as emojis.
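The distant-supervision step described above can be pictured simply: any post containing a clearly emotional emoji receives that emoji's polarity as a noisy label. A toy sketch (the emoji-to-polarity table and function are invented for illustration, not SEntiMoji's actual labeling scheme):

```python
# Invented mapping from emotional emojis to noisy sentiment labels.
EMOJI_POLARITY = {"😄": "positive", "🎉": "positive",
                  "😞": "negative", "😡": "negative"}

def noisy_label(post):
    """Label a post by the first emotional emoji it contains, else None.
    Labels are noisy by construction: sarcasm and mixed emotions introduce
    errors that the downstream representation learning must tolerate."""
    for ch in post:
        if ch in EMOJI_POLARITY:
            return EMOJI_POLARITY[ch]
    return None

print(noisy_label("Finally fixed that race condition 🎉"))   # positive
print(noisy_label("This API keeps breaking 😡"))             # negative
print(noisy_label("Refactored the parser today"))            # None
```

The appeal is scale: emoji-bearing posts are abundant on Twitter and GitHub, so the noisy labels cover far more expressions, including technical jargon, than any manually labeled SE dataset.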
Preprint