4. Regression Testing¶
VisIt has a large and continually growing test suite. VisIt’s test suite involves a combination python scripts in src/test, raw data and data generation sources in src/testdata and of course the VisIt sources themselves. Regression tests are run on a nightly basis. Testing exercises VisIt’s viewer, mdserver, engine and cli but not the GUI.
4.2. Running regression tests¶
4.2.1. Where nightly regression tests are run¶
The regression suite is run on LLNL’s Pascal Cluster. Pascal runs the TOSS3 operating system, which is a flavor of Linux. If you are going to run the regression suite yourself you should run on a similar system or there will be differences due to numeric precision issues.
The regression suite is run on Pascal using a cron job that checks out VisIt source code, builds it, and then runs the tests.
4.2.2. How to run the regression tests manually¶
The regression suite relies on having a working VisIt build and test data available on your local computer. Our test data and baselines are stored using git lfs, so you need to setup git lfs and pull to have all the necessary files.
The test suite is written in python and to source is in src/test. When you configure VisIt, a bash script is generated in the build directory that you can use to run the test suite out of source with all the proper data and baseline directory arguments.
cd visit-build/test/ ./run_visit_test_suite.sh
Here is an example of the contents of the generated run_visit_test_suite.sh script
/Users/harrison37/Work/github/visit-dav/visit/build-mb-develop-darwin-10.13-x86_64/thirdparty_shared/third_party/python/2.7.14/darwin-x86_64/bin/python2.7 /Users/harrison37/Work/github/visit-dav/visit/src/test/visit_test_suite.py \ -d /Users/harrison37/Work/github/visit-dav/visit/build-mb-develop-darwin-10.13-x86_64/build-debug/testdata/ \ -b /Users/harrison37/Work/github/visit-dav/visit/src/test/../../test/baseline/ \ -o output \ -e /Users/harrison37/Work/github/visit-dav/visit/build-mb-develop-darwin-10.13-x86_64/build-debug/bin/visit "$@"
Once the test suite has run, the results can be found in the output/html directory. Open output/html/index.html in a web browser to view the test suite results.
4.2.3. Accessing regression test results¶
The nightly test suite results are posted to: http://portal.nersc.gov/project/visit/.
4.2.4. In the event of failure on the nightly run¶
If any tests fail, ‘’all’’ developers who updated the code from the last time all tests successfully passed will receive an email indicating what failed. In addition, failed results should be available on the web.
If the results fail to post, the visit group on NERSC’s systems may be over quota. If you have a NERSC account you can check usage by sshing to NERSC and running the following command:
4.2.5. How regression testing works¶
The workhorse script that manages the testing is visit_test_suite.py in src/test. Tests can be run in a variety of ways called modes. For example, VisIt’s nightly testing is run in serial, parallel and scalable,parallel modes. Each of these modes represents a fundamental and relatively global change in the way VisIt is doing business under the covers during its testing. For example, the difference between parallel and scalable,parallel modes is whether the scalable renderer is being used to render images. In the parallel mode, rendering is done in the viewer. In scalable,parallel mode, it is done, in parallel, on the engine and images from each processor are composited. Typically, the entire test suite is run in each mode specified by the regression test policy.
There are a number of command-line options to the test suite. ./run_visit_test_suite.sh -help will give you details about these options. Until we are able to get re-baselined on the systems available outside of LLNL firewalls, options enabling some filtering of image differences will be very useful. Use of these options on platforms other than the currently adopted testing platform (pascal.llnl.gov) will facilitate filtering big differences (and probably real bugs that have been introduced) from differences due to platform where tests are run. See the section on filtering image differences.
There are a number of different categories of tests. The test categories are the names of all the directories under src/test/tests. The .py files in this directory tree are all the actual test driver files that drive VisIt’s CLI and generate images and text to compare with baselines. In addition, the src/test/visit_test_main.py file defines a number of helper Python functions that facilitate testing including two key functions; Test() for testing image outputs and TestText() for testing text outputs. Of course, all the .py files in src/test/tests subtree are excellent examples of test scripts.
When the test suite finishes, it will have created a web-browseable HTML tree in the html directory. The actual image and text raw results will be in the current directory and difference images will be in the diff directory. The difference images are essentially binary bitmaps of the pixels that are different and not the actual pixel differences themselves. This is to facilitate identifying the location and cause of the differences.
Adding a test involves a) adding a .py file to the appropriate subdirectory in src/test/tests, b) adding the expected baselines to test/baselines and, depending on the test, c) adding any necessary input data files to src/testdata. The test suite will find your added .py files the next time it runs. So, you don’t have to do anything special other than adding the .py file.
One subtlety about the current test modality is what we call mode specific baselines. In theory, it should not matter what mode VisIt is run in to produce an image. The image should be identical across modes. In practice there is a long list of things that can contribute to a handful of pixel differences in the same test images run in different modes. This has lead to mode specific baselines. In the baseline directory, there are subdirectories with names corresponding to modes we currently run. When it becomes necessary to add a mode specific baseline, the baseline file should be added to the appropriate baseline subdirectory.
In some cases, we skip a test in one mode but
not in others. Or, we temporarily disable a test by skipping it
until a given problem in the code is resolved. This is handled
--skiplist argument to the test suite. We maintained list of the
tests we currently skip and update it as necessary.
The default skip list file is src/test/skip.json.
4.2.6. Filtering Image Differences¶
There are many ways of both compiling and running VisIt to produce image and textual outputs. In many cases, we expect the image or textual outputs to be about the same (though not always bit-wise exact matches) even if the manner in which they are generated varies dramatically. For example, we expect VisIt running on two different implementations of the GL library to produce by and large the same images. Or, we expect VisIt running in serial or parallel to produce the same images. Or we expect VisIt running on Ubuntu Linux to produce the same images as it would running on Mac OSX. We expect and therefore wish to ignore ‘’minor variations’‘. But, we want to be alerted to ‘’major variations’‘. So when any developer runs a test, we require some means of filtering out image differences we expect from those we are not expecting.
On the other hand, as we make changes to VisIt source code, we may either expect or not expect image outputs for specific testing scenarios to change in either minor or dramatic ways. For example, if we fix a bug leading to a serious image artifact that just happened to be overlooked when the original baseline image was committed, we could improve the image dramatically implying a large image difference and still expect such a difference. For example, maybe the Mesh plot had a bug where it doesn’t obey the Mesh line color setting. If we fix that bug, the mesh line color will likely change dramatically. But, the resultant image is expected to change too. Therefore, have a set of baselines from which we compute exact differences is also important in tracking impact of code changes on VisIt behavior.
These two goals, running VisIt tests to confirm correct behavior in a wide variety of conditions where we expect minor but not major variations in outputs and running VisIt tests to confirm behavior as code is changed where we may or may not expect minor or major variations are somewhat complimentary.
It may make sense for developers to generate (though not ever commit) a complete and valid set of baselines on their target development platform and then use those (uncommitted) baselines to enable them to run tests and track code changes using an exact match methodology.
total pixels- count of all pixels in the test image
plot pixels- count of all pixels touched by plot(s) in the test image
coverage- percent of all pixels that are plot pixels (plot pixels / total pixels). Test images in which plots occupy a small portion of the total image are fraught with peril and should be avoided to begin with. Images with poor coverage are more likely to produce false positives (e.g. passes that should have failed) or to exhibit somewhat random differences as test scenario is varied.
dmax / dmaxp- maximum raw numerical / human perceptual difference in any color (R,G or B) channel or intensity (average of R, G, B colors). A good first try in filtering image differences is a dmax setting of 1. That will admit variations of 1 in any R, G or B channel or in intensity. However, for line-based plots like the mesh plot, due to differences in the way lines of the plot get scanned into pixels, this metric can fail miserably.
dmed / dmedp- median value of raw numerical / human perceptual differences over all color channels and intensity
When running the test suite on platforms other than the currently adopted baseline platform or when running tests in modes other than the standard modes, a couple of options will be very useful; -pixdiff and -avgdiff. The pixdiff option allows one to specify a tolerance on the percentage of non*background pixels that are different. The avgdiff option allows one to specify a second tolerance for the case when the pixdiff tolerance is exceeded. The avgdiff option specifies the maximum average (intensity) difference difference allowed averaged over all pixels that are different.
4.2.7. Tips on writing regression tests¶
- Except in cases where annotations are being specifically tested, remember to call TurnOffAllAnnotations() as one of the first actions in your test script. Otherwise, you can wind up producing images containing machine-specific annotations which will produce differences on other platforms.
- When writing tests involving text differences and file pathnames, be sure that all pathnames in the text strings passed to TestText() are absolute. Internally, VisIt testing system will filter these out and replace the machine-specific part of the path with VISIT_TOP_DIR to facilitate comparison with baseline text. In fact, the .txt files that get generated in the current dir will have been filtered and all pathnames modified to have VISIT_TOP_DIR in them.
- Here is a table of python tests scripts which serve as examples of some interesting and lesser known VisIt/Python scripting practices:
|Script||What it demonstrates|
4.2.8. Rebaselining Test Results¶
A python script, rebase.py, at src/tests dir can be used to rebaseline large numbers of results. In particular, this script enables a developer to rebase test results without requiring access to the test platform where testing is performed. This is becase the PNG files uploaded (e.g. posted) to VisIt’s test results dashboard are suitable for using as baseline results. To use this script, run ./rebase.py –help. Once you’ve completed using rebase.py to update image baselines, don’t forget to commit your changes back to the repository.