6. Regression Testing
VisIt has a large and continually growing test suite. The suite
combines Python scripts in
src/test, raw data and data generation sources,
and of course the VisIt sources themselves. Regression tests are
run on a nightly basis. Testing exercises VisIt’s viewer,
mdserver, engine and cli but not the GUI.
6.2. Running regression tests
6.2.1. Where nightly regression tests are run
The regression suite is run on LLNL’s Pascal Cluster. Pascal runs the TOSS3 operating system, which is a flavor of Linux. If you are going to run the regression suite yourself, you should run on a similar system; otherwise there will be differences due to numeric precision issues.
The regression suite is run on Pascal using a cron job that checks out VisIt source code, builds it, and then runs the tests.
6.2.2. How to run the regression tests manually
The regression suite relies on having a working VisIt build and test data available on your local computer. Our test data and baselines are stored using git lfs, so you need to set up git lfs and pull to have all the necessary files.
The test suite is written in Python and its source is in src/test.
When you configure VisIt, a bash script is generated in the build directory that you can use to run the test
suite out of source with all the proper data and baseline directory arguments.
cd visit-build/test/
./run_visit_test_suite.sh
Here is an example of the contents of the generated run_visit_test_suite.sh script:
/Users/harrison37/Work/github/visit-dav/visit/build-mb-develop-darwin-10.13-x86_64/thirdparty_shared/third_party/python/2.7.14/darwin-x86_64/bin/python2.7 /Users/harrison37/Work/github/visit-dav/visit/src/test/visit_test_suite.py \
  -d /Users/harrison37/Work/github/visit-dav/visit/build-mb-develop-darwin-10.13-x86_64/build-debug/testdata/ \
  -b /Users/harrison37/Work/github/visit-dav/visit/src/test/../../test/baseline/ \
  -o output \
  -e /Users/harrison37/Work/github/visit-dav/visit/build-mb-develop-darwin-10.13-x86_64/build-debug/bin/visit "$@"
Once the test suite has run, the results can be found in the
output/html directory. Open
output/html/index.html in a web browser to view the test suite results.
6.2.3. Accessing regression test results
The nightly test suite results are posted to: http://portal.nersc.gov/project/visit/.
6.2.4. In the event of failure on the nightly run
If any tests fail, all developers who updated the code since the last time all tests passed will receive an email indicating what failed. In addition, failed results should be available on the web.
6.3. How regression testing works
The workhorse script that manages the testing is
visit_test_suite.py in src/test. Tests can be run in a variety of ways called modes.
For example, VisIt’s nightly testing is run in several modes, one of which is
scalable,parallel. Each of these modes represents a fundamental and
relatively global change in the way VisIt is doing business
under the covers during its testing. For example, what sets
scalable,parallel mode apart is whether the scalable
renderer is being used to render images. Without the scalable renderer,
rendering is done in the viewer. In
scalable,parallel mode, it
is done in parallel on the engine, and the images from each processor
are composited. Typically, the entire test suite is run in each
mode specified by the regression test policy.
There are a number
of command-line options to the test suite; its help output
gives details about these options. Until we are
able to get re-baselined on the systems available outside of LLNL firewalls,
options enabling some filtering of image differences will be very useful.
Using these options on platforms other than the currently adopted testing
platform (pascal.llnl.gov) helps separate big
differences (and probably real bugs that have been introduced)
from differences due to the platform where tests are run. See the section on
filtering image differences.
There are a number of different categories of tests. The test
categories are the names of all the directories under
src/test/tests. The .py files in this directory tree are
the actual test driver files that drive VisIt’s CLI and
generate images and text to compare with baselines. In addition, the
src/test/visit_test_main.py file defines a number of helper Python
functions that facilitate testing, including two key functions:
Test() for testing image outputs and
TestText() for testing text
outputs. Of course, all the .py files in
src/test/tests are excellent examples of test scripts.
When the test suite finishes, it will have created a web-browseable HTML tree in the html directory. The actual image and text raw results will be in the current directory and difference images will be in the diff directory. The difference images are essentially binary bitmaps of the pixels that are different and not the actual pixel differences themselves. This is to facilitate identifying the location and cause of the differences.
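To illustrate what a "binary bitmap of the pixels that are different" means, here is a minimal sketch in plain Python. It is not VisIt's implementation; the function name `binary_diff_mask` and the image representation (lists of rows of RGB tuples) are assumptions made only for this example.

```python
def binary_diff_mask(current, baseline):
    """Return a 0/1 mask with 1 wherever the two images differ.

    Hypothetical helper: images are lists of rows, each row a list of
    (r, g, b) tuples. The mask records only *where* pixels differ,
    not by how much.
    """
    if len(current) != len(baseline) or any(
        len(crow) != len(brow) for crow, brow in zip(current, baseline)
    ):
        raise ValueError("images must have the same dimensions")
    return [
        [0 if c == b else 1 for c, b in zip(crow, brow)]
        for crow, brow in zip(current, baseline)
    ]


current = [[(0, 0, 0), (255, 0, 0)], [(0, 0, 255), (0, 0, 0)]]
baseline = [[(0, 0, 0), (250, 0, 0)], [(0, 0, 255), (0, 0, 0)]]
mask = binary_diff_mask(current, baseline)
```

A mask like this makes a five-level color change and a complete color change look the same, which is why it is good for locating differences but not for judging their size.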
Adding a test involves a) adding a .py file to the appropriate subdirectory of
src/test/tests, b) adding the expected baselines to
test/baseline and, depending on the test, c) adding
any necessary input data files.
The test suite will find your added .py files the next time it runs,
so you don’t have to do anything special other than adding the .py file.
One subtlety about the current test modality is what we call mode specific baselines. In theory, it should not matter what mode VisIt is run in to produce an image. The image should be identical across modes. In practice, there is a long list of things that can contribute to a handful of pixel differences in the same test images run in different modes. This has led to mode specific baselines. In the baseline directory, there are subdirectories with names corresponding to the modes we currently run. When it becomes necessary to add a mode specific baseline, the baseline file should be added to the appropriate baseline subdirectory.
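The lookup implied by mode specific baselines can be sketched as follows. This is only an illustration under stated assumptions: the function `find_baseline`, the directory layout (`<baseline>/<mode>/<category>/<file>` with a fallback to `<baseline>/<category>/<file>`), and the mode and category names used are all hypothetical, not taken from the VisIt sources.

```python
from pathlib import Path


def find_baseline(baseline_dir, mode, category, filename):
    """Prefer a mode specific baseline; otherwise use the shared one.

    Hypothetical layout: mode specific files live in a subdirectory of
    the baseline directory named after the mode.
    """
    root = Path(baseline_dir)
    mode_specific = root / mode / category / filename
    if mode_specific.is_file():
        return mode_specific
    return root / category / filename
```

With this kind of fallback, a mode specific file only needs to exist for the small number of results that actually differ between modes.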
In some cases, we skip a test in one mode but
not in others. Or, we temporarily disable a test by skipping it
until a given problem in the code is resolved. This is handled by the
--skiplist argument to the test suite. We maintain a list of the
tests we currently skip and update it as necessary.
The default skip list file is
6.3.1. Three Types of Test Results
VisIt’s testing system,
visit_test_main.py, uses three different methods
to process and check results:
Test() for image outputs, TestText() for text outputs, and TestValueXX() (where XX is EQ, LT,
LE, etc.), which processes no files and simply checks actual and expected values passed as arguments.
The Test() and
TestText() methods both take the name of a file. To process a
test result, these methods output a file produced by the current test run and
then compare it to a blessed baseline file stored in
test/baseline.
When they can be used, the
TestValueXX() methods are a little more convenient because
they do not involve storing data in files and having to maintain separate
baseline files. Instead, the
TestValueXX() methods take both an actual
(current) and expected (baseline) result as arguments coded directly in the
test’s .py file.
As VisIt testing has evolved over the past twenty years, understanding and
improving productivity related to test design has not been a priority. As a
result, there are likely far more image test results than are truly needed to
fully vet all of VisIt’s plotting features. In other cases, image tests are used
unnecessarily to confirm non-visual behavior, such as whether a given database reader
is working. Some text tests are better handled as
TestValueXX() tests and
other text tests often contain 90% noise text unrelated to the functionality
being tested. This has made maintaining and ensuring portability of the test
suite more laborious.
Because image tests tend to be the most difficult to make portable, a better design would minimize image tests to only those needed to validate visual behaviors, text tests would involve only the essential text of the test, and a majority of tests would be value type tests.
The above explanation is offered as a rationale for why, whenever possible,
new tests added to the test suite should use the
TestValueXX() approach as
much as practical.
6.3.2. More About TestValueXX Type Tests
TestValueXX() methods are similar in spirit to
TestText() except that they operate on Python values passed as args both for the
current (actual) and the baseline (expected) results. The values can be any
Python object. When they are floats or ints or strings of floats or ints or
lists/tuples of the same, these methods will round the arguments to the desired
precision and do the comparisons numerically. Otherwise they will compare them non-numerically.
TestValueEQ(case_name, actual, expected, prec=5):
- Passes if
actual == expected within the specified precision; otherwise fails.
TestValueNE(case_name, actual, expected, prec=5):
- Passes if
actual != expected within the specified precision; otherwise fails.
TestValueLT(case_name, actual, expected, prec=5):
- Passes if
actual < expected within the specified precision; otherwise fails.
TestValueLE(case_name, actual, expected, prec=5):
- Passes if
actual <= expected within the specified precision; otherwise fails.
TestValueGT(case_name, actual, expected, prec=5):
- Passes if
actual > expected within the specified precision; otherwise fails.
TestValueGE(case_name, actual, expected, prec=5):
- Passes if
actual >= expected within the specified precision; otherwise fails.
TestValueIN(case_name, bucket, expected, eqoper=operator.eq, prec=5):
- Passes if bucket contains expected according to the
eqoper equality operator. Fails otherwise.
For some examples, see test_values_simple.py.
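The "round to the desired precision, then compare" behavior described above can be sketched in plain Python. This is not the code in visit_test_main.py; the helper names `round_value` and `value_check` are hypothetical, and treating non-numeric values by direct comparison is an assumption of this sketch.

```python
import operator


def round_value(value, prec):
    """Round numbers, numeric strings, and lists/tuples of them.

    Anything that cannot be interpreted as a number is returned as-is.
    """
    if isinstance(value, (list, tuple)):
        return [round_value(v, prec) for v in value]
    if isinstance(value, bool):
        return value
    if isinstance(value, (int, float)):
        return round(float(value), prec)
    if isinstance(value, str):
        try:
            return round(float(value), prec)
        except ValueError:
            return value
    return value


def value_check(actual, expected, oper=operator.eq, prec=5):
    """Apply a comparison operator after rounding both sides."""
    return oper(round_value(actual, prec), round_value(expected, prec))


same = value_check(3.1415926, 3.14159, prec=5)       # equal at 5 digits
from_string = value_check("1.50", 1.5)               # numeric string
smaller = value_check(1.0, 1.1, oper=operator.lt)    # less-than variant
```

Passing the operator as an argument mirrors how one family of checks (EQ, NE, LT, and so on) can share the same rounding step.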
6.3.3. Filtering Image Differences
There are many alternative ways for both compiling and even running VisIt to produce any given image or textual output. Nonetheless, we expect results to be nearly if not perfectly identical. For example, we expect VisIt running on two different implementations of the GL library to produce by and large the same images. We expect VisIt running in serial or parallel to produce the same images. We expect VisIt running on Ubuntu Linux to produce the same images as it would running on Mac OSX. We expect VisIt running in client-server mode to produce the same images as VisIt running entirely remotely.
In many cases, we expect outputs produced by these alternative approaches to be nearly the same but not always bit-for-bit identical. Minor variations such as single pixel shifts in position or slight variations in color are inevitable and ultimately unremarkable.
When testing, it would be nice to be able to ignore variations in results attributable to these causes. On the other hand, we would like to be alerted to variations in results attributable to changes made to the source code.
To satisfy both of these goals, we use bit-for-bit identical matching to track the impact of changes to source code but fuzzy matching for anything else. We maintain a set of several thousand version-controlled, baseline results computed for a specific, fixed configuration and test mode of VisIt. Nightly testing of key branches of development reveals any results that are not bit-for-bit identical to their baseline.
These failures are then corrected in one of two ways. Either the new result is wrong and additional source code changes are required to ensure VisIt continues to produce the original baseline. Or, the original baseline is wrong and it must be updated to the new result. In this latter situation, it is also prudent to justify the new result with a plausible explanation as to why it is expected, better or acceptable as well as to include such explanation in the commit comments.
6.3.3.1. Mode specific baselines
VisIt testing can be run in a variety of modes: serial, parallel, scalable-parallel, scalable-parallel-icet, client-server, etc. For a fixed configuration, in most cases baseline results computed in one mode agree bit-for-bit with the other modes. However, this is not always true. About 2% of results vary with the execution mode. To handle these cases, we also maintain mode-specific baseline results as the need arises.
The need for a mode-specific baseline is discovered as new tests are added. When testing reveals that VisIt computes slightly different results in different modes, a single mode-agnostic baseline will fail to match in all test modes. At that time, mode-specific baselines are added.
6.3.3.2. Changing Baseline Configuration
One weakness with this approach to testing is revealed when it becomes necessary to change the configuration used to compute the baselines. For example, moving VisIt’s testing system to a different hardware platform or updating to a newer compiler or third-party library such as VTK, may result in a slew of minor variations in the results. Under these circumstances, we are confronted with having to individually assess possibly thousands of minor image differences to rigorously determine whether the new result is in fact good or whether some kind of issue or bug is being revealed.
In practice, we use fuzzy matching (see below) to filter out minor variations from major ones and then focus our efforts only on fully understanding the major cases. We summarily accept all minor variations as the new baselines.
6.3.3.3. Promise of Machine Learning
In theory, we should be able to develop a machine-learning approach to filtering VisIt’s test results that enables us to more effectively attribute variations in results to various causes. A challenge here is in developing a sufficiently large and fully labeled set of example results to prime the machine learning. This would make for a great summer project.
6.3.3.4. Fuzzy Matching Metrics
Image difference metrics are reported on terminal output and in HTML reports.
- Total Pixels
- Count of all pixels in the test image.
- Non-Background
- Count of all pixels that are not background, determined either by comparison to the constant background color or, if a non-constant color background is used, by comparison to the same pixel in a background image produced by drawing with all plots hidden. Note that if a plot produces a pixel which coincidentally winds up being the same color as the background, our accounting logic would count it as background. We think this situation is rare enough as to not cause serious issues.
- Different (#diff)
- Count of all pixels that are different from the current baseline image.
- % Diff. Pixels
- The percentage of pixels that are different.
- Avg. Diff (avgdiff)
- The average luminance difference, where luminance is the gray-scale value obtained by weighting the RGB channels by 1/3 each
and summing. This is based on the sum of all pixel luminance differences.
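As a rough illustration of how metrics like these could be computed, here is a small sketch in plain Python. It is not VisIt's code. The names `luminance` and `diff_metrics` are hypothetical, and two normalization choices are assumptions of this sketch rather than statements about VisIt: the percentage is taken relative to the non-background pixel count of the test image, and the average is taken over the differing pixels only.

```python
def luminance(rgb):
    """Gray-scale value: each RGB channel weighted by 1/3 and summed."""
    r, g, b = rgb
    return (r + g + b) / 3.0


def diff_metrics(current, baseline, background):
    """Compute illustrative difference metrics for two equal-sized images.

    current, baseline: flat lists of (r, g, b) tuples.
    background: the constant background color as an (r, g, b) tuple.
    """
    total = len(current)
    non_bg = sum(1 for p in current if p != background)
    lum_diffs = [
        abs(luminance(c) - luminance(b))
        for c, b in zip(current, baseline)
        if c != b
    ]
    n_diff = len(lum_diffs)
    # Assumed normalizations (see the lead-in above).
    pct_diff = 100.0 * n_diff / non_bg if non_bg else 0.0
    avg_diff = sum(lum_diffs) / n_diff if n_diff else 0.0
    return {
        "total": total,
        "non_background": non_bg,
        "different": n_diff,
        "pct_diff": pct_diff,
        "avg_diff": avg_diff,
    }


current = [(0, 0, 0), (255, 0, 0), (0, 255, 0), (0, 0, 0)]
baseline = [(0, 0, 0), (255, 0, 0), (0, 252, 0), (0, 0, 0)]
metrics = diff_metrics(current, baseline, background=(0, 0, 0))
```

In this tiny example one of the two non-background pixels differs, and it differs by a single gray level, so the count-based metric is large while the luminance-based metric is small.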
6.3.3.5. Fuzzy Matching Thresholds
There are some command-line arguments to run tests that control fuzzy matching. When computed results match bit-for-bit with the baseline, a PASS is reported and it is colored green in the HTML reports. When a computed result fails the bit-for-bit match but passes the fuzzy match, a PASS is reported on the terminal and it is colored yellow in the HTML reports.
- Pixel Difference Threshold (--pixdiff)
- Specifies the acceptable threshold for the
#diff metric as a percent. Default is zero, which implies bit-for-bit identical results.
- Average Difference Threshold (--avgdiff)
- Specifies the acceptable threshold for the
avgdiff metric. Note that this threshold applies only if the
--pixdiff threshold is non-zero. If a test is above the
pixdiff threshold but below the
avgdiff threshold, it is considered a PASS. The
avgdiff option allows one to specify a second tolerance for the case when the
pixdiff tolerance is exceeded.
- Numerical (textual) Difference Threshold (--numdiff)
- Specifies the acceptable relative numerical difference threshold in computed, non-zero numerical results. The relative difference is computed as the ratio of the magnitude of the difference between the current and baseline results and the minimum magnitude value of the two results.
A command line with
--pixdiff=0.5 --avgdiff=0.1 means that any result with fewer
than 0.5% of pixels different is a PASS, and any result with more than 0.5% of
pixels different but an average pixel gray-scale difference of less than 0.1 is
still a PASS.
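The two-stage decision described above can be sketched as a small function. This is an illustration only, not the test suite's code: the name `classify_image_result`, the returned labels, and the exact handling of values that land precisely on a threshold are assumptions.

```python
def classify_image_result(pct_diff, avg_diff, pixdiff=0.0, avgdiff=0.0):
    """Classify an image comparison as 'exact', 'fuzzy', or 'fail'.

    pct_diff: percentage of differing pixels for this result.
    avg_diff: average gray-scale difference for this result.
    pixdiff, avgdiff: tolerances, as given by --pixdiff and --avgdiff.
    """
    if pct_diff == 0:
        return "exact"   # bit-for-bit match (green in the HTML report)
    if pct_diff < pixdiff:
        return "fuzzy"   # within the pixel tolerance (yellow)
    # The second tolerance only applies when --pixdiff is non-zero.
    if pixdiff > 0 and avg_diff < avgdiff:
        return "fuzzy"
    return "fail"


few_pixels = classify_image_result(0.3, 5.0, pixdiff=0.5, avgdiff=0.1)
faint_change = classify_image_result(2.0, 0.05, pixdiff=0.5, avgdiff=0.1)
real_change = classify_image_result(2.0, 0.5, pixdiff=0.5, avgdiff=0.1)
```

With the default tolerances of zero, any differing pixel yields "fail", which corresponds to the bit-for-bit matching used for nightly testing.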
6.3.3.6. Testing on Non-Baseline Configurations
When running the test suite on platforms other than the currently adopted baseline
platform or when running tests in modes other than the standard modes, the
--pixdiff and
--avgdiff command-line options will be very useful.
For numerical textual results, there is also a
--numdiff command-line option
that specifies a relative numerical difference tolerance. The command-line option
--numdiff=0.01 means that if a numerical
result is different but the magnitude of the difference divided by the magnitude of
the expected value is less than
0.01, it is considered a PASS.
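The relative tolerance check can be sketched as follows. This is a minimal illustration of the rule as worded in the paragraph above (difference divided by the magnitude of the expected value), not the test suite's implementation; the function name `within_numdiff` and the treatment of a zero expected value are assumptions.

```python
def within_numdiff(actual, expected, numdiff=0.0):
    """Return True if actual matches expected within a relative tolerance."""
    if actual == expected:
        return True
    if expected == 0:
        # Relative difference is undefined here; this sketch treats it
        # as a mismatch.
        return False
    return abs(actual - expected) / abs(expected) < numdiff


close_enough = within_numdiff(100.5, 100.0, numdiff=0.01)   # 0.5% off
too_far = within_numdiff(102.0, 100.0, numdiff=0.01)        # 2% off
```

Because the tolerance is relative, the same setting accepts an absolute difference of 0.5 near 100 but only about 0.005 near 1.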
When specified on the command line for a test suite run, the above tolerances
are applied to all test results computed during that run. It is
also possible to specify these tolerances in specific tests by passing them as
arguments to
the methods used to check test outputs.
Finally, it may make sense for developers to generate (though not ever commit) a complete and validated set of baselines on their target development platform and then use those (uncommitted) baselines to enable them to run tests and track code changes using an exact match methodology.
6.3.4. Tips on writing regression tests
- Whenever possible, add only new TestValueXX() type tests.
- Test images in which plots occupy a small portion of the total image are fraught with peril and should be avoided. Images with poor coverage are more likely to produce false positives (e.g. passes that should have failed) or to exhibit somewhat random differences as the test scenario is varied.
- Except in cases where annotations are being specifically tested, remember to call TurnOffAllAnnotations() as one of the first actions in your test script. Otherwise, you can wind up producing images containing machine-specific annotations which will produce differences on other platforms.
- When setting plot and operator options, take care to decide whether you need to work from default or current attributes.
Methods to obtain plot and operator attributes optionally take an additional
1 argument to indicate that current, rather than default, attributes are desired. For example,
CurveAttributes() returns default Curve plot attributes whereas
CurveAttributes(1) returns current Curve plot attributes, which will be those of the currently active plot, if it is a Curve plot, or of the first Curve plot in the plot list of the currently active window, whether it is active or hidden. If there is no Curve plot available, it will return the default attributes.
- When writing tests involving text differences and file pathnames, be sure that all pathnames in the text strings passed to
TestText() are absolute. Internally, VisIt’s testing system will filter these out and replace the machine-specific part of the path with
VISIT_TOP_DIR to facilitate comparison with baseline text. In fact, the .txt files that get generated in the current dir will have been filtered, with all pathnames modified in this way.
- Here is a table of Python test scripts which serve as examples of some interesting and lesser known VisIt/Python scripting practices:
|Script|What it demonstrates|
6.3.5. Rebaselining Test Results
A Python script,
rebase.py, in the
test/baseline dir can be used to rebaseline large numbers of results.
In particular, this script enables a developer to rebase test results without requiring access to the
platform where testing is performed. This is because the PNG files uploaded (i.e. posted) to VisIt’s test
results dashboard are suitable for use as baseline results. To use this script, run
Once you’ve completed using
rebase.py to update image baselines, don’t forget to commit your changes back
to the repository.