Systematic Analysis of Testing-Related Publications Concerning Reproducibility and Comparability



Bachelor's Thesis Defense by Artur Solomonik

Referees: Prof. Dr. Norbert Siegmund, Prof. Dr. Martin Potthast

Software Testing

Software Testing Life Cycle


Software Testing Research


  • Generating test suites
    • Exploration principles
    • Mutation testing
    • Executing generated test suites
    • Prioritization and Reduction of Test Cases (see the sketch after this list)
  • Automating test case creation, selection and execution
  • Finding new approaches on organizing testing processes
    • Testing Workflow
    • Decision Making Process
    • When and What to Automate?
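
As an illustration of the prioritization topic above, here is a minimal sketch of greedy, coverage-based test case prioritization; the test names and coverage sets are made up, and this is not any specific paper's algorithm.

```python
# Minimal sketch: greedy coverage-based test case prioritization.
# Test names and coverage data are illustrative, not taken from any specific paper.

def prioritize(coverage: dict[str, set[str]]) -> list[str]:
    """Order test cases so that each pick covers the most not-yet-covered elements."""
    remaining = dict(coverage)
    covered: set[str] = set()
    order: list[str] = []
    while remaining:
        # Pick the test that adds the largest number of newly covered elements.
        best = max(remaining, key=lambda t: len(remaining[t] - covered))
        if not remaining[best] - covered:
            # No remaining test adds new coverage; append the rest in any order.
            order.extend(remaining)
            break
        covered |= remaining[best]
        order.append(best)
        del remaining[best]
    return order

if __name__ == "__main__":
    coverage = {
        "test_parser": {"stmt1", "stmt2", "stmt3"},
        "test_lexer": {"stmt2"},
        "test_eval": {"stmt3", "stmt4"},
    }
    print(prioritize(coverage))  # ['test_parser', 'test_eval', 'test_lexer']
```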

Software Testing Research


  • Testing Levels
      • Data-Flow Testing, Static Code Analysis | Unit Testing
      • Backbone-, Client-Server-, Bottom-Up | Integration Testing
      • GUI Testing, End-To-End Testing | System Testing
      • Reliability and Stability, Chaos Testing | Acceptance Testing
  • Execution Paradigms

Test Execution Paradigms


How do we know the testing system is working?


Empirical Software Evaluations

Evaluating result data


  • Present the result data set and identify significant values
  • Connect hypotheses and results
  • Compare related work and their findings
  • Argue the improvement or benefits of the approach
  • Apply suitable metrics
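
Two metrics that recur throughout the classified evaluations are mutation score and statement coverage. A minimal sketch of how they are typically computed, with illustrative numbers:

```python
# Minimal sketch of two widely reported testing metrics; numbers are illustrative.

def mutation_score(killed: int, total: int, equivalent: int = 0) -> float:
    """Fraction of non-equivalent mutants killed by the test suite."""
    return killed / (total - equivalent)

def statement_coverage(executed: set[str], all_statements: set[str]) -> float:
    """Fraction of statements executed at least once by the test suite."""
    return len(executed & all_statements) / len(all_statements)

print(mutation_score(killed=42, total=60, equivalent=5))     # ~0.76
print(statement_coverage({"s1", "s2"}, {"s1", "s2", "s3"}))  # ~0.67
```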

Reproducibility


Goal: Provide the reader with all information and resources necessary to recreate the findings presented in the paper

Reproducibility Attributes


  • Reproduction score influenced by data set attributes
    • Identification: Explanation of where the data is and what it is called
    • Description: Level of detail of the explanation regarding the element
    • Availability: Ease of accessing or obtaining the research elements
    • Persistence: Confidence in future state and availability of the elements
    • Flexibility: Adaptability of the elements to new environments
  • Varying data sources mean the attributes are not applicable to every element
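
A hedged sketch of how these five attributes could be recorded per research element; the 0-2 scale and the plain sum are illustrative assumptions, not the scoring scheme used in the thesis.

```python
# Illustrative sketch: recording the five reproducibility attributes per research element.
# The 0-2 scale and the simple sum are assumptions, not the thesis's actual scoring scheme.
from dataclasses import dataclass, asdict

@dataclass
class ReproducibilityRating:
    identification: int  # 0 = unnamed, 1 = named, 2 = named and located
    description: int     # level of detail of the explanation
    availability: int    # ease of obtaining the element
    persistence: int     # confidence in its future availability
    flexibility: int     # adaptability to new environments

    def score(self) -> int:
        return sum(asdict(self).values())

benchmark = ReproducibilityRating(2, 1, 2, 0, 1)
print(benchmark.score())  # 6 out of a possible 10
```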

Comparability


Goal: Assess papers on whether the empirical comparisons in their evaluations are appropriate or present at all.

  • Criteria for comprehensible evaluations
  • Strategies of Comparison
  • Connectivity to related work

How can we understand the research strategies of software testing publications in terms of reproducibility and comparability?

Paper Classification

Data Source


  • Papers from 10 popular software engineering conferences (ASE, ICSE, ISSTA, ...)
  • Additional publications from two journals (ESE, TOSEM)
  • Frequently mentioned publications
  • Papers from modification / refinement phases

Processed Data Set


Raw Data Set

Spreadsheet with 8060 registered papers, of which 360 are classified by 23 columns

205 documented benchmarks

Over 15000 bibliographic and semantic connections between records

Classification Parameters

  • Availability [open/closed]
  • Data Set State [vanilla/modified]
  • Selection Cause [...]
  • Modification Cause [...]
  • Sub-Check Systems [single/multiple] [named/unnamed]
  • Contribution [...]
  • Choice of Metric [functionality/performance/both]
  • Metrics [...]
  • Error Creation [generation/real world/both]
  • Error Annotation [TRUE/FALSE]
  • Comparison [TRUE/FALSE] [former/foreign/parallel] [exclusive/inclusive]
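
For illustration, the classification columns above could be captured as a typed record like the following sketch; the value sets mirror the slides, while the field names and the treatment of open-ended [...] columns as free text are assumptions.

```python
# Sketch of the classification parameters as a typed record.
# Value sets follow the slides; open-ended columns ([...]) are kept as free text.
from dataclasses import dataclass
from typing import Literal, Optional

@dataclass
class PaperClassification:
    availability: Literal["open", "closed"]
    data_set_state: Literal["vanilla", "modified"]
    selection_cause: str                      # open-ended [...] column
    modification_cause: str                   # open-ended [...] column
    sub_check_system_count: Literal["single", "multiple"]
    sub_check_system_naming: Literal["named", "unnamed"]
    contribution: str                         # open-ended [...] column
    choice_of_metric: Literal["functionality", "performance", "both"]
    metrics: list[str]                        # concrete metrics used in the evaluation
    error_creation: Literal["generation", "real world", "both"]
    error_annotation: bool
    comparison: bool
    comparison_target: Optional[Literal["former", "foreign", "parallel"]] = None
    comparison_scope: Optional[Literal["exclusive", "inclusive"]] = None

# Made-up example record, purely for illustration.
example = PaperClassification(
    availability="open", data_set_state="modified",
    selection_cause="popularity", modification_cause="size reduction",
    sub_check_system_count="multiple", sub_check_system_naming="named",
    contribution="test generation tool", choice_of_metric="both",
    metrics=["mutation score", "branch coverage"],
    error_creation="both", error_annotation=True, comparison=True,
    comparison_target="foreign", comparison_scope="inclusive",
)
```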

Open Source vs. Closed Source

Software Testing Evaluation Metrics

Choice of Metric and Error Annotation

Selection and modification causes of benchmarks

Bibliographic Networks

Goal: Visualizing large amounts of bibliographic data, increasing interactivity with a set of publications, and creating dynamic, time-based insight into the network's evolution.

Current implementations of paper networks


  • Visualize the connection and influence between authors
  • Give insight rather than specific values
  • Connect publications via citations, bibliographic coupling, co-citations, or co-authorship relations (see the sketch after this list)
  • Color- and size-coding node information
  • Geographic hierarchies
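
As referenced in the list above, a minimal sketch of two citation-based relations, bibliographic coupling and co-citation; the paper IDs and reference lists are made up.

```python
# Minimal sketch: bibliographic coupling strength = number of shared references.
# Paper IDs and reference lists are made up for illustration.

def bibliographic_coupling(refs_a: set[str], refs_b: set[str]) -> int:
    """Two papers are coupled if they cite the same work; strength = size of the overlap."""
    return len(refs_a & refs_b)

def co_citation(citing: dict[str, set[str]], a: str, b: str) -> int:
    """Co-citation strength: number of later papers that cite both a and b."""
    return sum(1 for cited in citing.values() if {a, b} <= cited)

refs = {"P1": {"R1", "R2", "R3"}, "P2": {"R2", "R3", "R4"}}
print(bibliographic_coupling(refs["P1"], refs["P2"]))  # 2 shared references
```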

Additions and Improvements


  • Benchmarks and software systems as their own entities in a network
  • More insight on reproducibility
  • Multidimensional graph data visualization without clutter
  • Tailoring the visualization to a certain aspect of a publication (e.g. the evaluation)

Visualizing bibliographic networks

TeLO-S

D3 visualization of testing publications in a node-link force-directed graph

Cypher Query Input and Configuration

Selecting specific nodes from the Neo4j graph database and manipulating the layout and color-coding
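
A hedged sketch of how such a selection could be issued against a Neo4j instance from Python; the node labels, relationship type, properties, URI, and credentials are assumptions, not the actual TeLO-S schema.

```python
# Sketch only: querying a Neo4j database for testing publications and their benchmarks.
# Labels (Paper, Benchmark), the USES relationship, properties, URI, and credentials
# are assumed for illustration and do not reflect the TeLO-S schema.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

query = """
MATCH (p:Paper)-[:USES]->(b:Benchmark)
WHERE b.availability = 'open'
RETURN p.title AS title, b.name AS benchmark
"""

with driver.session() as session:
    for record in session.run(query):
        print(record["title"], "->", record["benchmark"])

driver.close()
```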

Contribution Plot

Immediate assessment of the proportions of contribution representatives

Node analysis

Additional information on a selected node concerning its references

Findings

Patterns

Vanishing Point Pattern


Outsider Pattern


  • Loose nodes in a subgraph without any connection to other queried nodes
  • Nodes might imply a connection to other unqueried research fields
  • Misclassifications or special cases
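
One way to surface outsider candidates automatically is to look for nodes without edges inside the queried subgraph; a minimal networkx sketch with made-up node IDs.

```python
# Minimal sketch: outsider candidates = nodes with no edges inside the queried subgraph.
# Node IDs are made up; in the thesis's network they would be papers, benchmarks, or systems.
import networkx as nx

graph = nx.Graph()
graph.add_edges_from([("P1", "P2"), ("P2", "B1")])  # connected part of the subgraph
graph.add_node("P7")                                # a loose node: candidate outsider

outsiders = [n for n in graph.nodes if graph.degree(n) == 0]
print(outsiders)  # ['P7'] - possibly a misclassification, a special case,
                  # or a link to a research field outside the query
```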

Familiar Foreigner Pattern


Chain Pattern


Conclusion

  • Most evaluations conducted similarly
  • Choice of benchmark varies significantly
  • Availability as a major reproducibility issue
  • Solution: Dedicated sub-check systems (possibly provided by conferences)
  • Mutation scores and coverage metrics widely used
  • Findings of closely related papers rarely mentioned
  • Bibliographic networks benefit from sub-check system nodes and different relation types
  • Comparability enables continuous improvement of research
  • Comparing evaluations is unfortunately very uncommon, yet beneficial

Future Work

  • Adding referencing patterns to the visualization
  • Classifiers for testing paper classification
  • Multiple refinement cycles of the data set using relevant citations
  • Implementation of author nodes, citation scores and bibliographic coupling
  • Hierarchical edge bundling regarding relevancy, geography or popularity
  • Generalization for other research topics aside from software testing

Thank you for your attention.