6. Regression Testing

6.1. Overview

VisIt has a large and continually growing test suite. VisIt’s test suite involves a combination of Python scripts in src/test, raw data and data generation sources in src/testdata, and, of course, the VisIt sources themselves. Regression tests are run on a nightly basis. Testing exercises VisIt’s viewer, mdserver, engine and cli, but not the GUI.

6.2. Running regression tests

6.2.1. Where nightly regression tests are run

The regression suite is run on LLNL’s Pascal Cluster. Pascal runs the TOSS3 operating system, which is a flavor of Linux. If you are going to run the regression suite yourself, you should run it on a similar system or there will be differences due to numeric precision issues.

The regression suite is run on Pascal using a cron job that checks out VisIt source code, builds it, and then runs the tests.

6.2.2. How to run the regression tests manually

The regression suite relies on having a working VisIt build and the test data available on your local computer. Our test data and baselines are stored using git lfs, so you need to set up git lfs and pull to have all the necessary files.

The test suite is written in Python and its source is in src/test. When you configure VisIt, a bash script is generated in the build directory that you can use to run the test suite out of source with all the proper data and baseline directory arguments.

cd visit-build/test/
./run_visit_test_suite.sh

Here is an example of the contents of the generated run_visit_test_suite.sh script:

/Users/harrison37/Work/github/visit-dav/visit/build-mb-develop-darwin-10.13-x86_64/thirdparty_shared/third_party/python/2.7.14/darwin-x86_64/bin/python2.7 \
/Users/harrison37/Work/github/visit-dav/visit/src/test/visit_test_suite.py \
   -d /Users/harrison37/Work/github/visit-dav/visit/build-mb-develop-darwin-10.13-x86_64/build-debug/testdata/  \
   -b /Users/harrison37/Work/github/visit-dav/visit/src/test/../../test/baseline/   \
   -o output \
   -e /Users/harrison37/Work/github/visit-dav/visit/build-mb-develop-darwin-10.13-x86_64/build-debug/bin/visit "$@"

Once the test suite has run, the results can be found in the output/html directory. Open output/html/index.html in a web browser to view the test suite results.

6.2.3. Accessing regression test results

The nightly test suite results are posted to: http://portal.nersc.gov/project/visit/.

6.2.4. In the event of failure on the nightly run

If any tests fail, all developers who updated the code since the last time all tests successfully passed will receive an email indicating what failed. In addition, failed results should be available on the web.

6.3. How regression testing works

The workhorse script that manages the testing is visit_test_suite.py in src/test. Tests can be run in a variety of ways, called modes. For example, VisIt’s nightly testing is run in serial, parallel and scalable,parallel modes. Each of these modes represents a fundamental and relatively global change in the way VisIt operates under the covers during its testing. For example, the difference between the parallel and scalable,parallel modes is whether the scalable renderer is used to render images. In parallel mode, rendering is done in the viewer. In scalable,parallel mode, it is done, in parallel, on the engine and the images from each processor are composited. Typically, the entire test suite is run in each mode specified by the regression test policy.

There are a number of command-line options to the test suite. Running ./run_visit_test_suite.sh -help will give you details about these options. Until we are able to re-baseline on the systems available outside of LLNL firewalls, the options enabling some filtering of image differences will be very useful. Using these options on platforms other than the currently adopted testing platform (pascal.llnl.gov) will help separate big differences (and probably real bugs that have been introduced) from differences due to the platform on which the tests are run. See the section on filtering image differences.

There are a number of different categories of tests. The test categories are the names of the directories under src/test/tests. The .py files in this directory tree are the actual test driver files that drive VisIt’s CLI and generate images and text to compare with baselines. In addition, the src/test/visit_test_main.py file defines a number of helper Python functions that facilitate testing, including two key functions: Test() for testing image outputs and TestText() for testing text outputs. Of course, all the .py files in the src/test/tests subtree are excellent examples of test scripts.
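
To make the structure of a test driver concrete, here is a minimal sketch in the style of the scripts under src/test/tests. The database, variable and case names are illustrative, and silo_data_path() is assumed to be one of the data path helpers provided by visit_test_main.py; consult existing scripts for the exact helpers and signatures available.

TurnOffAllAnnotations()                       # avoid machine-specific annotations in images

OpenDatabase(silo_data_path("curv2d.silo"))   # silo_data_path() assumed from visit_test_main.py
AddPlot("Pseudocolor", "d")
DrawPlots()

Test("pc_curv2d_01")                          # saves the active window, compares it to a baseline .png

Query("MinMax")
TestText("pc_curv2d_02", GetQueryOutputString())  # compares the query text to a baseline .txt

DeleteAllPlots()
CloseDatabase(silo_data_path("curv2d.silo"))
Exit()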

When the test suite finishes, it will have created a web-browseable HTML tree in the html directory. The actual image and text raw results will be in the current directory and difference images will be in the diff directory. The difference images are essentially binary bitmaps of the pixels that are different and not the actual pixel differences themselves. This is to facilitate identifying the location and cause of the differences.

Adding a test involves a) adding a .py file to the appropriate subdirectory of src/test/tests, b) adding the expected baselines to test/baseline and, depending on the test, c) adding any necessary input data files to src/testdata. The test suite will find your added .py files the next time it runs, so you don’t have to do anything special other than adding the .py file.

One subtlety about the current test modality is what we call mode specific baselines. In theory, it should not matter what mode VisIt is run in to produce an image; the image should be identical across modes. In practice, there is a long list of things that can contribute to a handful of pixel differences in the same test image run in different modes. This has led to mode specific baselines. In the baseline directory, there are subdirectories with names corresponding to the modes we currently run. When it becomes necessary to add a mode specific baseline, the baseline file should be added to the appropriate baseline subdirectory.

In some cases, we skip a test in one mode but not in others. Or, we temporarily disable a test by skipping it until a given problem in the code is resolved. This is handled by the --skiplist argument to the test suite. We maintain a list of the tests we currently skip and update it as necessary. The default skip list file is src/test/skip.json.

6.3.1. Three Types of Test Results

VisIt’s testing system, visit_test_main.py, uses three different methods to process and check results.

  • Test() which processes .png image files.
  • TestText() which processes .txt text files.
  • TestValueXX() (where XX is one of EQ, NE, LT, LE, GT, GE, or IN) which processes no files and simply checks actual and expected values passed as arguments.

The Test() and TestText() methods both take the name of a file. To process a test result, these methods write a file for the current test run and then compare it to a blessed baseline file stored in test/baseline. When they can be used, the TestValueXX() methods are a little more convenient because they do not involve storing data in files and maintaining separate baseline files. Instead, the TestValueXX() methods take both an actual (current) and an expected (baseline) result as arguments coded directly in the calling .py file.

As VisIt testing has evolved over the past twenty years, understanding and improving productivity related to test design has not been a priority. As a result, there are likely far more image test results than are truly needed to fully vet all of VisIt’s plotting features. Or, image tests are used unnecessarily to confirm non-visual behavior, such as whether a given database reader is working. Some text tests are better handled as TestValueXX() tests, and other text tests often contain 90% noise text unrelated to the functionality being tested. This has made maintaining and ensuring the portability of the test suite more laborious.

Because image tests tend to be the most difficult to make portable, a better design would minimize image tests to only those needed to validate visual behaviors, text tests would involve only the essential text of the test, and the majority of tests would be value type tests.

The above explanation is offered as a rationale for the recommendation that, whenever possible, new tests added to the test suite should use the TestValueXX() approach as much as practical.

6.3.2. More About TestValueXX Type Tests

The TestValueXX() methods are similar in spirit to Test() and TestText() except that they operate on Python values passed as arguments for both the current (actual) and the baseline (expected) results. The values can be any Python object. When they are floats or ints, strings of floats or ints, or lists/tuples of the same, these methods round the arguments to the desired precision and do the comparisons numerically. Otherwise, they compare them as strings.

TestValueEQ(case_name, actual, expected, prec=5) :
Passes if actual == expected within the specified precision, otherwise fails.
TestValueNE(case_name, actual, expected, prec=5) :
Passes if actual != expected within the specified precision, otherwise fails.
TestValueLT(case_name, actual, expected, prec=5) :
Passes if actual < expected within the specified precision, otherwise fails.
TestValueLE(case_name, actual, expected, prec=5) :
Passes if actual <= expected within the specified precision, otherwise fails.
TestValueGT(case_name, actual, expected, prec=5) :
Passes if actual > expected within the specified precision, otherwise fails.
TestValueGE(case_name, actual, expected, prec=5) :
Passes if actual >= expected within the specified precision, otherwise fails.
TestValueIN(case_name, bucket, expected, eqoper=operator.eq, prec=5) :
Passes if bucket contains expected according to the eqoper equality operator, otherwise fails.

For some examples, see test_values_simple.py.
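
As a quick illustration of the signatures above, here is a hedged sketch of a few value-based checks; the database, query and expected value are placeholders rather than real baselines, and silo_data_path() is assumed to come from visit_test_main.py.

OpenDatabase(silo_data_path("curv2d.silo"))
AddPlot("Pseudocolor", "d")
DrawPlots()

Query("MinMax")
minval, maxval = GetQueryOutputValue()       # MinMax returns a (min, max) pair

TestValueEQ("minmax_min", minval, 1.23456)   # expected value is a placeholder, compared to 5 decimals
TestValueLE("min_le_max", minval, maxval)    # relational check, no baseline file needed
TestValueIN("max_in_pair", [minval, maxval], maxval)

Exit()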

6.3.3. Filtering Image Differences

There are many alternative ways for both compiling and even running VisIt to produce any given image or textual output. Nonetheless, we expect results to be nearly if not perfectly identical. For example, we expect VisIt running on two different implementations of the GL library to produce by and large the same images. We expect VisIt running in serial or parallel to produce the same images. We expect VisIt running on Ubuntu Linux to produce the same images as it would running on Mac OSX. We expect VisIt running in client-server mode to produce the same images as VisIt running entirely remotely.

In many cases, we expect outputs produced by these alternative approaches to be nearly the same but not always bit-for-bit identical. Minor variations such as single pixel shifts in position or slight variations in color are inevitable and ultimately unremarkable.

When testing, it would be nice to be able to ignore variations in results attributable to these causes. On the other hand, we would like to be alerted to variations in results attributable to changes made to the source code.

To satisfy both of these goals, we use bit-for-bit identical matching to track the impact of changes to source code but fuzzy matching for anything else. We maintain a set of several thousand version-controlled, baseline results computed for a specific, fixed configuration and test mode of VisIt. Nightly testing of key branches of development reveals any results that are not bit-for-bit identical to their baseline.

These failures are then corrected in one of two ways. Either the new result is wrong and additional source code changes are required to ensure VisIt continues to produce the original baseline. Or, the original baseline is wrong and it must be updated to the new result. In this latter situation, it is also prudent to justify the new result with a plausible explanation as to why it is expected, better or acceptable as well as to include such explanation in the commit comments.

6.3.3.1. Mode specific baselines

VisIt testing can be run in a variety of modes: serial, parallel, scalable-parallel, scalable-parallel-icet, client-server, etc. For a fixed configuration, baseline results computed in one mode in most cases agree bit-for-bit with those from the other modes. However, this is not always true. About 2% of results vary with the execution mode. To handle these cases, we also maintain mode-specific baseline results as the need arises.

The need for a mode-specific baseline is discovered as new tests are added. When testing reveals that VisIt computes slightly different results in different modes, a single mode-agnostic baseline will fail to match in all test modes. At that time, mode-specific baselines are added.

6.3.3.2. Changing Baseline Configuration

One weakness with this approach to testing is revealed when it becomes necessary to change the configuration used to compute the baselines. For example, moving VisIt’s testing system to a different hardware platform or updating to a newer compiler or third-party library such as VTK, may result in a slew of minor variations in the results. Under these circumstances, we are confronted with having to individually assess possibly thousands of minor image differences to rigorously determine whether the new result is in fact good or whether some kind of issue or bug is being revealed.

In practice, we use fuzzy matching (see below) to filter out minor variations from major ones and then focus our efforts only on fully understanding the major cases. We summarily accept all minor variations as the new baselines.

6.3.3.3. Promise of Machine Learning

In theory, we should be able to develop a machine-learning approach to filtering VisIt’s test results that enables us to more effectively attribute variations in results to various causes. A challenge here is developing a sufficiently large and fully labeled set of example results to prime the machine learning. This would make for a great summer project.

6.3.3.4. Fuzzy Matching Metrics

Image difference metrics are reported on terminal output and in HTML reports.

Total Pixels (#pix) :
Count of all pixels in the test image
Non-Background (#nonbg) :
Count of all pixels which are not background, determined either by comparison to a constant background color or, if a non-constant color background is used, by comparison to the same pixel in a background image produced by drawing with all plots hidden. Note that if a plot produces a pixel which coincidentally winds up being the same color as the background, our accounting logic will count it as background. We think this situation is rare enough not to cause serious issues.
Different (#diff) :
Count of all pixels that are different from the current baseline image.
% Diff. Pixels (~%diff) :
The percentage of different pixels, computed as 100.0*#diff/#nonbg
Avg. Diff (avgdiff) :
The average luminance (gray-scale, obtained by weighting RGB channels by 1/3rd and summing) difference. This is the sum of all pixel luminance differences divided by #diff.
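
For readers who want a concrete picture of these metrics, here is a minimal sketch of how they could be computed with numpy. It is an illustration of the definitions above, not VisIt’s actual implementation, and it only handles the constant background color case.

import numpy as np

def fuzzy_metrics(current, baseline, background_color=(255, 255, 255)):
    """current, baseline: uint8 arrays of shape (H, W, 3) for the same test case."""
    npix = current.shape[0] * current.shape[1]            # Total Pixels (#pix)
    nonbg = np.any(current != background_color, axis=-1)  # pixels not equal to the background color
    diff = np.any(current != baseline, axis=-1)           # pixels that differ from the baseline
    n_nonbg = int(nonbg.sum())                            # Non-Background (#nonbg)
    n_diff = int(diff.sum())                              # Different (#diff)
    pct_diff = 100.0 * n_diff / max(n_nonbg, 1)           # % Diff. Pixels (~%diff)
    # luminance approximated by weighting the R, G and B channels by 1/3 each and summing
    lum_diff = np.abs(current.astype(float).mean(axis=-1) - baseline.astype(float).mean(axis=-1))
    avgdiff = float(lum_diff[diff].sum() / max(n_diff, 1))  # Avg. Diff (avgdiff)
    return npix, n_nonbg, n_diff, pct_diff, avgdiff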

6.3.3.5. Fuzzy Matching Thresholds

There are some command-line arguments to the test suite that control fuzzy matching. When a computed result matches the baseline bit-for-bit, a PASS is reported and it is colored green in the HTML reports. When a computed result fails the bit-for-bit match but passes the fuzzy match, a PASS is reported on the terminal and it is colored yellow in the HTML reports.

Pixel Difference Threshold (--pixdiff) :
Specifies the acceptable threshold for the #diff metric as a percent. Default is zero which implies bit-for-bit identical results.
Average Difference Threshold (--avgdiff) :
Specifies the acceptable threshold for the avgdiff metric. Note that this threshold applies only if the --pixdiff threshold is non-zero. If a test is above the pixdiff threshold but below the avgdiff threshold, it is considered a PASS. The avgdiff option allows one to specify a second tolerance for the case when the pixdiff tolerance is exceeded.
Numerical (textual) Difference Threshold (--numdiff) :
Specifies the acceptable relative numerical difference threshold in computed, non-zero numerical results. The relative difference is computed as the ratio of the magnitude of the difference between the current and baseline results and the minimum magnitude value of the two results.

A command-line with --pixdiff=0.5 --avgdiff=0.1 means that any result with fewer than 0.5% of its pixels different is a PASS, and any result with more than 0.5% of its pixels different but an average pixel gray-scale difference of less than 0.1 is still a PASS.
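
The following is a hedged sketch, in Python, of the pass/fail logic these thresholds describe. The function names are hypothetical and this is not the harness’s actual code; it simply restates the rules above.

def fuzzy_image_pass(pct_diff, avg_diff, pixdiff_tol=0.0, avgdiff_tol=0.0):
    if pct_diff == 0.0:
        return True                     # bit-for-bit identical: green PASS
    if pixdiff_tol > 0.0 and pct_diff < pixdiff_tol:
        return True                     # within --pixdiff tolerance: yellow PASS
    # --avgdiff gives a second chance, but only when --pixdiff is non-zero
    return pixdiff_tol > 0.0 and avgdiff_tol > 0.0 and avg_diff < avgdiff_tol

def numeric_text_pass(current, expected, numdiff_tol=0.0):
    if current == expected:
        return True
    # relative difference per the --numdiff description above; assumes non-zero values
    reldiff = abs(current - expected) / min(abs(current), abs(expected))
    return reldiff < numdiff_tol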

6.3.3.6. Testing on Non-Baseline Configurations

When running the test suite on platforms other than the currently adopted baseline platform or when running tests in modes other than the standard modes, the --pixdiff and --avgdiff command-line options will be very useful.

For numerical textual results, there is also a --numdiff command-line option that specifies a relative numerical difference tolerance. For example, --numdiff=0.01 means that if a numerical result is different, but the magnitude of the difference divided by the magnitude of the expected value is less than 0.01, it is considered a PASS.

When specified on the command-line of a test suite run, the above tolerances are applied to all test results computed during that run. It is also possible to specify these tolerances for specific tests by passing them as arguments, for example Test(pixdiff=4.5) and TestText(numdiff=0.01), to the methods used to check test outputs.
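
For example, a test script might pass per-test tolerances like the fragment below; the case names are hypothetical, this is an isolated piece of a larger script, and the exact keyword handling is defined by visit_test_main.py.

Test("offscreen_save_01", pixdiff=4.5)                             # per-image pixel tolerance for one case
TestText("query_output_01", GetQueryOutputString(), numdiff=0.01)  # per-text numeric tolerance for one case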

Finally, it may make sense for developers to generate (though never commit) a complete and validated set of baselines on their target development platform and then use those (uncommitted) baselines to run tests and track code changes using an exact match methodology.

6.3.4. Tips on writing regression tests

  • Whenever possible, add only new TestValueXX() type tests.
  • Test images in which plots occupy a small portion of the total image are fraught with peril and should be avoided. Images with poor coverage are more likely to produce false positives (i.e., passes that should have failed) or to exhibit somewhat random differences as the test scenario is varied.
  • Except in cases where annotations are being specifically tested, remember to call TurnOffAllAnnotations() as one of the first actions in your test script. Otherwise, you can wind up producing images containing machine-specific annotations which will produce differences on other platforms.
  • When setting plot and operator options, take care to decide whether you need to work from default or current attributes. The methods that obtain plot and operator attributes optionally take an additional argument of 1 to indicate that current, rather than default, attributes are desired. For example, CurveAttributes() returns the default Curve plot attributes, whereas CurveAttributes(1) returns the current Curve plot attributes. The current attributes come from the currently active plot if it is a Curve plot, or else from the first Curve plot in the plot list of the currently active window, whether it is active or hidden. If there is no Curve plot available, the default attributes are returned. See the sketch following the table below.
  • When writing tests involving text differences and file pathnames, be sure that all pathnames in the text strings passed to TestText() are absolute. Internally, VisIt’s testing system will filter these and replace the machine-specific part of each path with VISIT_TOP_DIR to facilitate comparison with baseline text. In fact, the .txt files that get generated in the current directory will have been filtered so that all pathnames contain VISIT_TOP_DIR.
  • Here is a table of Python test scripts which serve as examples of some interesting and lesser-known VisIt/Python scripting practices:
tests/faulttolerant/savewindow.py :
  • uses Python exceptions
tests/databases/itaps.py :
  • uses OpenDatabase() with a specific plugin
  • uses SIL restriction via names of sets
tests/databases/silo.py :
  • uses OpenDatabase() with a virtual database and a specific timestep
tests/rendering/scalable.py :
  • uses OpenComputeEngine() to launch a parallel engine
tests/rendering/offscreensave.py :
  • uses Test() with alternate save window options
tests/databases/xform_precision.py :
  • uses test-specific environment variable settings
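
As referenced in the tip on default versus current attributes above, here is a small illustrative sketch. It assumes an existing Curve plot in the active window, and the attribute field name is only an example.

ca_default = CurveAttributes()    # default Curve plot attributes
ca_current = CurveAttributes(1)   # attributes of the existing Curve plot, if one exists
ca_current.lineWidth = 3          # illustrative field; modify starting from the current state
SetPlotOptions(ca_current)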

6.3.5. Rebaselining Test Results

A Python script, rebase.py, in the test/baseline dir can be used to rebaseline large numbers of results. In particular, this script enables a developer to rebaseline test results without requiring access to the platform where testing is performed. This is because the PNG files uploaded (e.g. posted) to VisIt’s test results dashboard are suitable for use as baseline results. To learn how to use this script, run ./rebase.py --help. Once you’ve finished using rebase.py to update image baselines, don’t forget to commit your changes back to the repository.

6.4. Using VisIt Test Suite for Sim Code Testing

VisIt’s testing infrastructure can also be used from a VisIt install by simulation codes that want to write their own VisIt-based tests. For more details, see: Leveraging VisIt in Sim Code Regression Testing