We seek to construct a corpus of bugs that is representative and sufficiently large to support statistical inference. As always, achieving representativeness is the main difficulty, which we address by uniform sampling. We cannot sample bugs directly; instead, we sample commits, which we must classify into fixes and non-fixes. Why fixes? A fix is often labelled as such, its parent commit is almost certainly buggy, and the fix identifies the region in the parent that a developer deemed relevant to the bug. To identify bug-fixing commits, we consider only projects that use issue trackers, then look for bug report references in commit messages and for commit ids (SHAs) in bug reports. This heuristic is not only noisy; it must also contend with bias in project selection and bias introduced by missing links.
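This linking step can be sketched as a pattern match over commit messages. The regex and function name below are illustrative assumptions, not the study's actual tooling:

```typescript
// Illustrative sketch only: flag commits whose message references an issue
// in a "fixes #N"-style form. Both the regex and the function name are
// hypothetical, not drawn from the study.
const ISSUE_REF = /\b(?:fix(?:e[sd])?|close[sd]?|resolve[sd]?)\s*:?\s*#(\d+)/i;

// Returns the referenced issue number, or null when the message has no
// recognisable reference.
function referencedIssue(commitMessage: string): number | null {
  const m = ISSUE_REF.exec(commitMessage);
  return m ? parseInt(m[1], 10) : null;
}

console.log(referencedIssue("Fixes #42: guard against null callback"));  // 42
console.log(referencedIssue("Refactor build scripts"));                  // null
```

A real linker would also scan bug reports for commit SHAs and reconcile the two directions, which is where the missing-link bias mentioned above arises.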
The full list of the 400 studied bugs accompanies this article.
Procedure 1 defines our manual type
annotation procedure. Because we annotate each bug twice, once
for each type system, our experiment is a within-subject
repeated measure experiment. As such, a phenomenon known as
learning effects may come into play, as knowledge gained from
creating the annotations for one type checker may speed
annotating the other. To mitigate learning effects, for a bug
\(b\) in \(B\), we first pick a type system \(ts\) from Flow and
TypeScript uniformly at random, so that, on average, we consider
as many bugs for the first time for each type system. If \(b\)
is, “beyond a shadow of a doubt”, not type-related, e.g. it stems from a misunderstanding of the specification, we label it as undetectable
under \(ts\) and categorise it, skipping the annotation process.
Otherwise, we read the bug report and the fix to identify the
patched region: the set of lexical scopes the fix changes.
Using a read-eval-print loop (REPL), e.g. Node.js, we attempt to understand the intended
behavior of the program and add consistent, minimal annotations
that cause \(ts\) to report an error on \(b\). We are not experts in type
systems, nor in any project in our corpus. To compensate, we have
striven to be conservative: we annotate variables whose types
are difficult to infer with any. We then type check
the resulting program, ignoring type errors that we consider
unrelated to the bug. We repeat this process until we confirm
that \(b\) is \(ts\)-detectable, because \(ts\) reports an error
within the patched region and the added annotations are
consistent; we deem \(b\) not \(ts\)-detectable; or we
exceed the time budget \(M\).
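As an illustration of the annotation step, consider the following invented bug; every identifier here is hypothetical, and the snippet only sketches how one minimal annotation can force the checker to error inside the patched region:

```typescript
// Invented bug for illustration: readCount should return a number but
// returns a string (all names here are hypothetical).
function readCountBuggy(raw: string) {
  return raw.trim();   // bug: forgot to parse, returns a string
}

// One minimal annotation at the use site makes the checker report an error
// inside the patched region:
// const n: number = readCountBuggy(" 42 ");  // error: type 'string' is not
//                                            // assignable to type 'number'

// The shipped fix, with the same annotation remaining consistent:
function readCount(raw: string): number {
  return parseInt(raw.trim(), 10);
}
const total: number = readCount(" 42 ") + 1;
console.log(total);   // 43
```

In the procedure above, such an error counts toward \(ts\)-detectability only because it falls within the patched region and the added annotation is consistent with the program's intended behavior.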
[Figure: Histogram of TC-Detectable Bugs]
[Figure: Histogram of Undetectability]
Of the 80 uniformly sampled bugs that we used to calculate inter-rater agreement, each rater made 160 decisions in total: 80 on TypeScript-preventability and 80 on Flow-preventability. The raters were unanimous on 138 of these 160 decisions. We define a strong disagreement as one in which one rater deems a bug preventable while another deems it unpreventable. Of the 22 disagreements, 12 are strong.
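The arithmetic behind these figures, using only the numbers stated above, works out as follows:

```typescript
// Worked check of the agreement figures above (numbers from the text).
const decisions = 80 * 2;                        // 160 decisions per rater
const unanimous = 138;
const disagreements = decisions - unanimous;     // 22
const strongDisagreements = 12;

const observedAgreement = unanimous / decisions; // fraction of unanimous labels
console.log(disagreements, strongDisagreements, observedAgreement.toFixed(4));
// 22 12 0.8625
```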
Though they share a similar annotation syntax, Flow and TypeScript differ along several dimensions. We compared Flow and TypeScript in terms of their ability to prevent public bugs had they been used when those bugs were introduced. Both catch a nontrivial portion of public bugs. In our dataset, the sets of bugs they can prevent largely overlap, with 6 exceptions: 3 bugs are only Flow-preventable and 3 are only TypeScript-preventable.
TypeScript 2.0 was released during this study, giving
us the opportunity to measure the effectiveness of its
stricter treatment of null and undefined. Prior to 2.0, all types were
nullable in TypeScript. TypeScript 2.0 added the
--strictNullChecks compiler flag, which
makes most types nonnullable. We reviewed our corpus and
found that 22 bugs, an increase of 58%, are preventable
under TypeScript 2.0 but not under TypeScript 1.8.
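As a minimal, invented sketch (not drawn from our corpus) of the kind of defect --strictNullChecks rejects:

```typescript
// Invented example: Array.find returns undefined when nothing matches.
const ports = [80, 443, 8080];
const match = ports.find(p => p > 1000);   // inferred type: number | undefined

// Under TypeScript 1.8 the next line type checked, because every type was
// nullable; under 2.0 with --strictNullChecks it is rejected:
// match.toFixed(0);   // error: Object is possibly 'undefined'

// Narrowing satisfies the checker and handles the missing case explicitly.
const safe = match !== undefined ? match.toFixed(0) : "none";
console.log(safe);   // 8080
```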
University College London, Gower Street, London, UK
z.gao.12 (at) ucl.ac.uk