Description

Flaky tests are tests whose outcome changes even though neither the test nor the code under test has changed. Such tests are unreliable, and they have been found to affect test suites in large industrial projects. As a result, flaky tests have become an active area of research. Most of this research, however, focuses on traditional unit tests in non-interactive, non-UI projects, leaving flaky tests in UI-based projects less examined. For this reason, we study UI flaky tests to identify their root causes, how they manifest, and how they are fixed.

Dataset

Raw Data

Tables

Table 2 shows the total number of results returned for each keyword used in the crawling procedure for web projects. It also breaks the results down further by showing how many remain after applying keyword filtering. From the projects found, we manually inspected the commits to identify those containing real flaky tests.

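| UI Topic | Projects | Commits | Flaky Keyword Filtering | UI Keyword Filtering |
|---|---|---|---|---|
| web | 999 | 772901 | 2553 | 210 |
| angular | 998 | 407434 | 222 | 19 |
| vue | 998 | 344526 | 52 | 1 |
| react | 997 | 1110993 | 603 | 30 |
| svg | 995 | 135563 | 24 | 1 |
| bootstrap | 995 | 98264 | 112 | 0 |
| d3 | 980 | 106160 | 82 | 1 |
| emberjs | 629 | 3961 | 1 | 0 |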
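The two filtering columns in Table 2 suggest a two-stage keyword filter over commit messages. The sketch below illustrates that idea only: the keyword lists, the `Commit` type, and the `filter_commits` function are hypothetical, since the actual keywords and tooling used in the crawl are not listed in this section.

```python
import re
from dataclasses import dataclass

# Hypothetical keyword lists, chosen only for illustration.
FLAKY_KEYWORDS = {"flaky", "flakey", "intermittent"}
UI_KEYWORDS = {"ui", "dom", "render", "animation"}


@dataclass
class Commit:
    sha: str
    message: str


def matches_any(message, keywords):
    """Return True if any whole word of the message is one of the keywords."""
    words = set(re.findall(r"[a-z0-9]+", message.lower()))
    return bool(words & keywords)


def filter_commits(commits):
    """Stage 1 keeps flaky-related commits; stage 2 keeps the UI-related ones among them."""
    flaky = [c for c in commits if matches_any(c.message, FLAKY_KEYWORDS)]
    ui_flaky = [c for c in flaky if matches_any(c.message, UI_KEYWORDS)]
    return flaky, ui_flaky


commits = [
    Commit("a1", "Fix flaky test for DOM render"),
    Commit("b2", "Fix flaky unit test in parser"),
    Commit("c3", "Update build scripts"),
]
flaky, ui_flaky = filter_commits(commits)
print(len(flaky), len(ui_flaky))
```

Matching on whole words rather than substrings avoids counting words such as "build" or "unit" as hits for "ui". Commits that pass both stages would still need the manual inspection described above.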


Revised Table 4 presents the revised breakdown of root cause categories and their counts, adjusted to align better with existing work. The high-level categories are further broken down into subcategories. The largest share of flaky tests falls under Async Wait, which accounts for 106 of the 235 tests.

In this table, the cells marked in yellow have been changed to reflect new category names and counts. The cells in blue are categories that are unique to UI flakiness.

| Root Cause Categories | Root Cause Subcategories | Web | Mobile | Total |
|---|---|---|---|---|
| Async Wait | Network Resource Loading | 15 | 4 | 19 |
| | Resource Rendering | 47 | 14 | 61 |
| | Animation Timing Issue | 17 | 9 | 26 |
| Environment | Platform Issue | 16 | 18 | 34 |
| | Layout Difference | 9 | 1 | 10 |
| Test Runner API Issue | DOM Selector Issue | 13 | 3 | 16 |
| | Incorrect Test Runner Interaction | 10 | 14 | 24 |
| Test Script Logic Issue | Unordered Collections | 5 | 0 | 5 |
| | Time | 1 | 0 | 1 |
| | Incorrect Resource Load Order | 11 | 11 | 22 |
| | Test Order Dependency | 6 | 6 | 12 |
| | Randomness | 2 | 3 | 5 |
| Total | | 152 | 83 | 235 |
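To compare the high-level categories directly, the subcategory counts can be summed per category. The following is a small sketch that copies the rows of Revised Table 4 verbatim; the names `ROWS` and `totals_by_category` are introduced here for illustration and are not part of the dataset.

```python
# (category, subcategory, web, mobile) rows copied from Revised Table 4.
ROWS = [
    ("Async Wait", "Network Resource Loading", 15, 4),
    ("Async Wait", "Resource Rendering", 47, 14),
    ("Async Wait", "Animation Timing Issue", 17, 9),
    ("Environment", "Platform Issue", 16, 18),
    ("Environment", "Layout Difference", 9, 1),
    ("Test Runner API Issue", "DOM Selector Issue", 13, 3),
    ("Test Runner API Issue", "Incorrect Test Runner Interaction", 10, 14),
    ("Test Script Logic Issue", "Unordered Collections", 5, 0),
    ("Test Script Logic Issue", "Time", 1, 0),
    ("Test Script Logic Issue", "Incorrect Resource Load Order", 11, 11),
    ("Test Script Logic Issue", "Test Order Dependency", 6, 6),
    ("Test Script Logic Issue", "Randomness", 2, 3),
]


def totals_by_category(rows):
    """Sum web and mobile counts for each high-level root cause category."""
    totals = {}
    for category, _subcategory, web, mobile in rows:
        totals[category] = totals.get(category, 0) + web + mobile
    return totals


totals = totals_by_category(ROWS)
grand_total = sum(totals.values())

# Print categories from most to least frequent, with their share of all tests.
for category, count in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{category}: {count} ({count / grand_total:.1%})")
```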


Table 4 is the original breakdown of tests by the root cause categories found in our work.

| Root Cause Categories | Root Cause Subcategories | Web | Mobile | Total |
|---|---|---|---|---|
| Timing Issue | Network Resource Loading | 15 | 4 | 19 |
| | Resource Rendering | 47 | 14 | 61 |
| | Animation Timing Issue | 17 | 9 | 26 |
| Environment | Platform Issue | 16 | 18 | 34 |
| | Layout Difference | 8 | 1 | 9 |
| Test Runner API Issue | DOM Selector Issue | 13 | 3 | 16 |
| | Incorrect Test Runner Interaction | 10 | 14 | 24 |
| Test Script Logic Issue | Strict Comparison Checks | 8 | 1 | 9 |
| | Incorrect Resource Load Order | 11 | 11 | 22 |
| | Stale Data | 5 | 5 | 10 |
| | Random Data Edge Case | 2 | 3 | 5 |
| Total | | 152 | 83 | 235 |


Table 5 shows the breakdown of tests by the manifestation strategy used when reporting the flaky test.

| Manifestation Category | Web | Mobile | Total |
|---|---|---|---|
| Unspecified | 101 | 40 | 141 |
| Specify Problematic Platform | 21 | 17 | 38 |
| Reorder/Prune Test Suite | 9 | 3 | 12 |
| Reset Configuration Between Tests | 2 | 7 | 9 |
| Provide Code Snippet | 14 | 6 | 20 |
| Force Environment Conditions | 5 | 10 | 15 |
| Total | 152 | 83 | 235 |


Table 6 shows the breakdown of the fixing strategies used to mitigate the flaky behavior in the tests.

| Categories | Subcategories | Web | Mobile | Total |
|---|---|---|---|---|
| Delay | Add/Increase Delay | 14 | 7 | 21 |
| | Fix Await Mechanism | 35 | 8 | 43 |
| Dependency | Fix API Access | 1 | 12 | 13 |
| | Change Library Version | 1 | 6 | 7 |
| Refactor Test | Refactor Logic Implementation | 49 | 26 | 75 |
| Disable Features | Disable Animations | 1 | 3 | 4 |
| Remove Test | Remove Test | 51 | 22 | 73 |
| Total | | 152 | 84 | 236 |


Figures

We also map root causes to their manifestation and fixing strategies. Based on our observations, flaky tests under each root cause category are reproduced by multiple manifestation strategies, while different fixing strategies are commonly applied to different root causes. In our dataset, each manifestation strategy is linked to every root cause category, so we omit these links from the figure. Figure 1 shows the relationship between root causes and fixing strategies.



Figure 2 is an example of a Network Resource Loading issue found in the influxdb project. The test fails because it attempts to interact with the updated UI before the network response has completed.
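A common remedy for this kind of failure is to wait explicitly for the condition the test depends on instead of assuming the response has already arrived. The sketch below illustrates that general idea with a polling helper; `wait_until` and `FakePage` are hypothetical names used only for this example and do not reproduce the influxdb test or its fix.

```python
import time


def wait_until(condition, timeout=5.0, interval=0.05):
    """Poll condition() until it returns a truthy value or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(interval)


class FakePage:
    """Stand-in for a UI whose rows appear only after a simulated network delay."""

    def __init__(self, response_delay):
        self._ready_at = time.monotonic() + response_delay

    def rows(self):
        # Before the simulated response arrives, the page has nothing to show.
        return ["row-1", "row-2"] if time.monotonic() >= self._ready_at else []


page = FakePage(response_delay=0.2)

# Reading page.rows() immediately would return an empty list (the flaky path).
# Waiting on the condition makes the test independent of response timing.
rows = wait_until(page.rows, timeout=2.0)
print(rows)
```

Unlike a fixed sleep, the helper returns as soon as the condition holds and fails with a clear error if it never does.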



Figure 5 presents an Animation Timing issue within the plotly.js project, where the transition effect applied to a bar chart behaves incorrectly.



Figure 7 shows the log of a Continuous Integration system from within the react-jsonschema-form project. It shows a Time issue where the comparison between two timestamps is too strict.
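To illustrate why a strict timestamp comparison can be flaky, the sketch below contrasts exact equality with a tolerance-based check. The helper `timestamps_close` and the one-second tolerance are assumptions made for this example; they are not taken from the react-jsonschema-form project.

```python
from datetime import datetime, timedelta


def timestamps_close(actual, expected, tolerance=timedelta(seconds=1)):
    """Return True if two timestamps differ by no more than the tolerance."""
    return abs(actual - expected) <= tolerance


expected = datetime(2020, 1, 1, 12, 0, 0)
# Simulate a value captured a few milliseconds later, as can happen on a slow CI machine.
actual = expected + timedelta(milliseconds=3)

# A strict equality check fails whenever the clock ticks between the two reads.
print(actual == expected)

# A tolerance-based check accepts the small, expected drift.
print(timestamps_close(actual, expected))
```

The tolerance should be chosen to cover expected scheduling delays while still catching genuinely wrong values.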