Guidelines for creating test programs in the CISST-ERC development library
Ofri Sadowsky
Department of Computer Science (NEB B26), Whiting School of
Engineering,
Johns Hopkins University 3400 N. Charles st., Baltimore, MD 21218
May 2003
This document is aimed at defining guidelines for creating test
programs in the CISST-ERC development library. Software testing is an
inexact science, whose main goal is to foresee the
unforeseeable. Basic computability theorems show that it is impossible
to design a 100% proof test program. The best we can hope for in the
dynamic environment of the ERC development library is to be able to
collect over time a test suite for various features of the
system. Therefore, this document does not try to teach how to write
good test programs, but rather is an organizational document, whose
goal is to set rules that will facilitate control over the collection
of test programs, extensibility of the collection, and an automation
of the testing process.
To meet this goal, we first have to analyze the behavior of test
programs, set definitions of inputs and outputs, and to classify the
test programs to cover sufficient variety. The first part of the
document makes this analysis. The second part sets the rules based on
the analysis results.
2 Classfying test programs
The ERC development library includes a broad spectrum of
features. Among them: 2D and 3D vectors and matrices, general linear
algebra, optimization, computational geometry, communications and
distributed computing, robot control and sensory systems, computer
graphics and image processing, user interface, and possibly
more. While each sub-package has its unique testing
requirements, and needs a specific set-up protocol for
the testing, we can still try to identify features which are common
to all test programs. To achieve that, we start by listing the
requirements for testing different features of the library. The list
is sorted by the features to be tested, and may grow or change in the future.
- Object I/O
- General execution-control issues
- Memory leaks.
- Time-bounded execution.
- Exception handling.
- Concurrency.
- Algorithmic correctness and computational precision
- Expected output for a given input.
- Independent verification.
- Control groups.
- Supervised tests
- Interactive tests.
- Distributed objects.
- Interfacing external devices
- Abstraction layer tests and diagnostics.
- Sensory systems.
- Robot motion tests.
Below, we discuss the requirements for testing each of the above
categories, and identify needed resources.
An important feature of the ERC development library is storing and
retrieving object information and configuration data. While the
requirement identification and design of serialization and other types
of I/O are planned for elsewhere, we can still outline some possible
tests for I/O operations.
We should also note that having an object I/O interface is a useful
feature for the test framework as well. For example, object I/O can
be used to import test cases into the framework and to dump the
configuration of a failed test for debugging purposes. Therefore, we
discuss the tests for this feature set first.
We would like to be able to serialize objects in and out of our
application. The serialized format can take many shapes. However,
the following seems to be the most reliable test for
serialization:
- Get an object (from an input source, through serialization, hard
coded, random, etc.).
- Serialize it out.
- Deserialize from the same stream.
- Compare the result to the original object.
The test may use temporary files to store the serialized object(s),
and may use external input to fetch the original object. Note that
the test requires an effective method for comparing instances of the
same class, and that the deserialized object must be
identical to the original object.
Also note that a process of serializing an object x,
deserializing into object y, then serializing y again
and comparing only the serialized output from x and y is
an insufficient test. That is because if information is lost during
the serialization of x, it is not recovered in y, and we
cannot detect this by comparing the serialized y with the
serialized x.
The term ``parsing'' usually applies to machine processing of data in
human readable format. While in essence it is equivalent to
serialization, it is more loosely associated with single
objects. Parsing always requires external input to be parsed. Of
course, one could hard-code the input to the parsing as well, but it
seems to miss the point.
When dealing with parsing of text files, we should pay attention to the
following items:
- There is no direct way to verify the correctness of parsing a text file,
when it is the sole source of input. To overcome this problem, we can adopt
one or more of the following strategies:
- Import objects from two files (one is text, and the other may
be text or binary) and compare them. This tests the validity of one
I/O method against another.
- Add checksums at the end of sections in the text file, and use
them to verify the correctness of the parsing. The checksums must
be optional, so that a human composer can create the text file
without worrying about them. Then, we may provide a small utility
program to add the checksums to a text file.
- Due to the limited precision of both binary and text floating
point representation, we may have trouble in restoring
identical objects from a text file. This affectes directly the
ability to calculate checksums, for example. If we want to use
checksum as part of our parsing format, we have to be sure that it is
robust to perturbations of low-order bits in a floating-point
representation.
Here we list some issues that relate to testing any piece of software,
regardless of correctness or dependencies on other packages. In other
words, we test if a package is ``well behaved'' regarding the criteria
below.
Our test corpus should include tests for memory leaks, for example, in
containers or library objects with dynamic memory allocation. There
are existing test tools to detect memory leaks, such as pstack
and gmmprof, or the MFC internal classes for memory monitoring.
We should find a way to incorporate them into our development
environment, preferably in a portable manner.
It remains to be decided if all the test programs should run in a
memory monitoring environment, or only some should have memory leak
detectors invoked. Running all the tests with memory monitoring
probably takes longer and may lead to unexpected behavior resulting
from possible bugs in the memory monitor rather than in our code.
Running only some tests in a monitored environment leads to questions
of which tests are monitored and which are not, and how we can invoke
the monitor for some tests and not for others. These should be
answered better when we learn more about memory monitors.
A possible programming error is writing code that does not terminate.
It is not necessarily an infinite loop, but may be the result of an
error in the termination conditions, wrong assumptions on convergence,
etc. In addition, modifications to an algorithm or a data structure
can significantly increase the runtime of the algorithm. We would
like to be able to handle such cases in an automatic testing module.
It is therefore suggested that each test program should have a runtime
specification in the form of estimated average runtime, estimated
worst-case runtime, and estimated time to decide it's an infinite loop
(some or all of the values may be identical). The test program or the
test framework should issue a warning message in the test log for
exceeding each of these times, and the test framework should kill the
test program if it exceeds the time bound to determine it's an
infinite loop.
Since running the same program on different machines takes different
times, we should use a normalized time unit to measure program
runtimes. For example, we can have one program that iterates a simple
loop (not empty –- it may get optimized and got rid of by the
compiler)
times, and state that this takes
normalized time
units. By measuring its runtime and dividing it by
, we immediately
get the normalized time unit. Then, we may be able to run each test
procedure in a ``diagnostic'' mode to provide an estimate of its
runtime for future tests.
The subject of time-bounded execution raises the issue of providing
some query interface to each test program, that enables the user to
find which tests are included, run them selectively and in different
modes, and so on. This will be discussed with more details later.
It is important to verify that our exception handling mechanisms work.
For that, we try to generate artificially exception conditions or
invalid input. We also need a methodology of specifying which
exceptions are generated during the test. Possibly, the easiest way
is to write one test program for each exception case, but that can be
tedious. The subject should be thought over more when exception
handling scheme is developed. In addition to normal functionality, if
we want to ensure that our software is robust, it should also be
tested for exception conditions. As we have chosen to use the CppUnit
package as our test framework, we can also use its exception testing
features. See more about it in this document and in the CppUnit
documentation.
Additionally, when performing automatic testing, we need to have tools
to handle program crashes, record them, and overcome them to continue
the test. We would not like a single bug in one of the library classes
or test programs to stop an automatic test process.
Concurrency is among the hardest targets for testing. The reason is, a
context switch between concurrent processes can generally occur at any
time. It is possible to insert breakpoints or controlled preemptions
(CPU yield) in strategic locations in the code to control context
switches, but I suspect it turns out to be an infinite amount of
work.
Hope can rise from the fact that in most cases, the cycle of a
concurrent thread is short enough in time to be executed many times in
a relatively small number of seconds. By probability theory, and
especially by Murphy's Theorem, if anything can go wrong in a minute
or so of concurrent running, it will. As of now, it seems that the
best testing strategy for concurrency is to write normal test programs
which involve concurrent execution, let them run for a while, and see
if they crash, fails to meet a goal, etc. As of (August 2003)
our code is not designed for thread safety, and is not meant to run
concurrently. The best we can do to detect concurrency errors is to
treat them like any other error, and enforce the strategies listed
under ``Time-bounded execution'' and ``Exception handling''.
We must be able to verify that our computational modules perform
according to the specifications. The simplest way to do that is by
knowing in advance what is the output of an algorithm for a given
input. However, it may not be as simple as that in most cases, as
discussed below.
A straightforward example for testing expected output is the
following. Take two known vectors, add them, and compare the result
to a third vector that was evaluated in advance. The test fails if
the addition result is not equal to the expected. The operands and
the known result can be hard coded or can be stored in an input file.
In any case, they are determined before the actual execution of the
test.
This test strategy has several drawbacks. Firstly, when our
operations involve floating point multiplications, there is an
accumulated error. Different computation environments (machines,
compilers, etc.) may accumulate the errors in a different way, making
it hard to predict the exact expected result. Therefore, in
many cases we can only specify the precision of the output within a
certain tolerance.
Secondly, sometimes the computation of the known value is complicated
enough, and cannot be carried out without running the same algorithm.
This means, that if we use a computer to evaluate the expected result,
we basically let the algorithm test itself, which defeats the idea of
testing. On the other hand, carrying out the computation manually may
be too tedious and error prone to start with.
Thirdly, testing against known inputs lacks coverage of the problem
domain. Usually, we cannot even run through the entire domain, and
would like to generate a reasonable random sample of it. But then, we
do not want to evaluate the results for each and every element of the
domain, just to be able to draw random samples.
This leads us to a more sophisticated strategy of testing, that
verifies internal consistency of implemented methods, and thus
indirectly indicates to the correctness of more than one method at the
same time. In other words, we verify the correctness of an output
indirectly by using other methods and checking the output consistency.
We assume that the chance that all the methods demonstrate consistent
behavior while being in fact incorrect is small enough. Thus, we can
use random inputs and with a high likelihood detect errors in the
code. The strategy is as follows.
Independent verification is a sophisticated test, that can test
several functions at the same time. For example, evaluate the cross
product of two vectors, then verify that the cross product is
perpendicular to both and has an appropriate magnitude. Here we test
cross product, dot product, vector norm and vector angle. If all the
results are consistent, we assume that the functions are evaluated
correctly. Another example: solve a linear equation system, then see
that the result is a solution by multiplying it by the original
matrix.
Note that for independent verification we can choose a random input,
which usually makes the test more robust. However, attention should be
paid to numerical precision issues. In many cases, the examples above
included, one cannot simply test for equality of independent results,
due to roundoff errors. In such cases, a reasonable error tolerance
should be spcified in the test, and the test should fail if the
results do not agree with each other within that tolerance.
Another aspect to consider is the generation of random input. There
is a number of different randomization methods, for example in
[Numerical Recipes], and they differ in runtime and randomization
quality. Although all these methods are in fact pseudo-randomized,
and not randomized in the strong sense of unpredictability, we can, in
fact, take advantage of the pseudo-random property. That is because
being able to repeat a certain pseudo-random sequence may enable us to
reproduce inputs that were detected as error-generating. Therefore,
when reporting about a failed test, we may wish to dump out not only
the objects for which the error was detected, but also the
randomization seeds.
In some cases, such as computational learning, we have no direct way
to verify that our result is accurate. Instead, we rank the accuracy
of the result by using a control group. For example, to evaluate the
robustness of a registration algorithm to noisy input, we can select a
training group of points, inject random noise, and compute the
registration. The result depends on our random noise, and often cannot
be predicted with enough precision. However, we can use another group
of points –- the control group –- and measure the precision of
applying the computed transformation to the contol group by comparing
it with application of the originally known transformation.
In a test of this kind, the pass/fail criterion is fuzzy, and depends
on statistical significance rules, the magnitude of the noise, etc. In
general, the decision if the test was successful should be left to a
human observer. However, the process of running the test can be
automated, and at the very least it can be used to verify successful
termination of algorithms, that is, termination with no
exceptions.
Many of our tests, such as the ones listed below, require a supervised
set-up, and cannot be run automatically. For these, a detailed
instruction sheet should be attached to the test, so that the test
operator can run it smoothly. Some of the tests listed below fall
outside the scope of our planned automated test framework. Yet we
list them to be sure we don't miss important questions.
Automated testing of interactive behavior is impossible. The best we
can do is simulate input sequences by fixed or randomized input. However,
in some cases, and especially with GUI, we can at least check that the
interactive modules run without exceptions for certain correct and exceptional
input sequences.
We can consider using some scripts to generate artificial events in a sequence
spcified by an external resource file. This seems to go beyond the scope of
this document as of (August 2003), and may be dealt with later.
There are several difficulties in testing the functionality of
distributed objects. One is setting up the environment: do we test
them as independent processes on a single machine, or do we actually
run them on separate hosts. If we choose the second option, we need a
way to remotely launch a remote object from the main test machine.
Another problem is testing for exceptions and failures, which requires
an even more complicated set-up. A third is the scope of the test. To
test algorithmic correctness, we would generally prefer a direct
method rather than involving distibuted computation. Therefore, we
are left with testing the networking functionality, which simplifies
our work, but still requires special set-up.
Currently, we do not plan to add distributed objects to the features
of our test framework.
External devices increase the complexity of the system by requiring an
interface abstraction layer, increasing the number of potential
failures of the system, requiring additional set-up for testing, and
exposing the system to the unexpected behavior of the physical
environment. Special attention should be paid to testing robot
interfaces, as they are an essential part of our development. Note
that as the device tests require the device being connected to the
computer, these tests usually cannot be run unsupervised. These
subjects are discussed in more detail next.
Any device abstraction layer should have a standard diagnostics
package that's capable of performing elementary tests for the
device. These can include: initialization, sending commands, running
device-internal diagnostics, and testing simple configurations or
instruction sets. The diagnostics can use fixed configurations, which
are either hard-coded or stored in input files.
Diagnostics should be simple to run and require little more than
connecting the external device to the computer, switching it on, and
running the test. Therefore, they are likely to fit into our test framework,
even if they would not be run automatically.
For many external hardware items, a diagnostics package is provided by
the vendor. In the simplest case, our own diagnostics can wrap the
vendor's. In more sophisticated cases, our diagnostics may involve
invoking API commands. Yet, they should be simple and short.
Sensory systems can include trackers, digitizers, encoders, force
sensors, cameras, imaging systems, temperature sensors, and any other
device that reports to the computer of an external physical event
(except for standard input devices, such as keyboard, mouse, etc.).
Robots represent a higher level of complexity, and perhaps only robot
encoder testing falls into this category. Under normal circumstances,
we are not taking responsibility for calibrating sensors, unless the
calibration is part of the standard protocol of using them. However,
we may still wish to have at least a diagnostics package for the
correctness of the sensory input. These may require supervision and
should have an appropriate instruction sheet.
As part of our diagnostics, we may wish to verify that robot motion
commands are executed correctly. For example, that a certain
positioning command brings the end effector to the desired pose.
Motion tests can be more complicated, and test also the dynamics of
the robot. These can include velocity control, real-time avoidance of
forbidden configurations, responding to exception conditions, and
possibly more. The entire subject is left for the developers of the
MRC package, and will not be discussed further here.
We can now make a list of requirements from our test framework. These
will be classified as input requirements, output requirements, and
test characteristics.
- Object input. We should be able to import objects in
text or binary formats from external sources (e.g., files, standard
input) into the test program.
- Randomization. We need a reliable pseudo-random
generator, which when initialized with the same seed yields the same
sequence. We may want to record randomization seeds for debugging
purposes.
- Configuration files. To test external device
interfaces, output LOD, etc., we may depend on common configuration
files. These files should be located easily by the test program.
- Time control. We need to be able to measure the
execution time of a program, either externally (from a launching
script) or internally. It is good to have each program able to
measure its runtime in ``standard'' units, that is, compared to some
small common denominator.
- Object output. We should be able to export objects from
the test program in a way that enables to reproduce an error condition.
- Test log. A test program should be able to store
output directly to a test log, and flush it immediately. It cannot
rely on accumulating failed tests into some buffer like JUnit, for
example, does, as the test program may crash and leave the buffer
unflushed.
- Termination conditions. We need to a way identify
that a program terminated successfully, terminated unsuccessfully,
crashed, or was killed from the outside, and record these conditions
to the test log.
All test programs should have a fixed set of command line options to
provide standard operations. Below is a list of suggested standard operations
for a test program.
- Test query. The program should list the tests it performs.
The list should include the name of the test and optionally a description.
- Test selection. The program should let the user
select through a command-line option a single test to be executed.
The test may be described by its name or a number, and should be
consistent with the query output. If no test is selected, the program
should run all the tests.
- Time diagnostics. The program should be able to
estimate the time needed to run each of the tests, in ``standard''
units. The estimation does not have to be accurate, but should
provide a reasonable upper bound. With more sophisticated work, we
can try to deduce tighter bounds for the execution time of a test, but
for now we will suffice with that.
- Randomization seed. The program should be
initializable with an external randomization seed to reproduce
identical random sequences.
In addition, a test program may or may not require specifications of
the number of iterations to perform, test-specific inputs, and so on.
These cannot, in general, be part of the standard command-line
interface. However, when running in time diagnostics mode, the
program may ask the user for numbers of iterations, test-specific
input, etc., in order to estimate the runtime. Therefore, the program
may have the same interaction with the user in diagnostic mode as in
test mode.
The following features are optional for our test suite. We may or may not
implement them in different stages of the development.
- Tcl/Tk graphical user interface.
- Automated scripts to find and run tests.
- Implementation of test modules as DLLs/shared objects that can
be dynamically loaded into a standard tesing ``shell'' or GUI.
Based on the analysis in Section
, we seek
for a software framework to satisfy our requirements. We would like
to explore existing solutions first, then propose an organizational
method and framework for our tests.
There is a large number of unit-test frameworks available from Gnu. Examples
include:
Another large collection of testing frameworks, with design similar to
the JUnit package, is available at
http://www.xprogramming.com/software.htm.
One particular test framework to be discussed is the CppUnit -
http://cppunit.sourceforge.net.
There is a decision to be made if we want to develop out internal test
framework, or to rely on external packages. For any option, there is
likely to be a learning curve.
The CppUnit project has combined and built on previous works of
Michael Feathers and Jerome Lacoste. It is released under the
GNU Lesser General Public License. The main design and usage
features of the CppUnit framework include:
- Creating classes derived from CppUnit::Test and
defining member functions of the signature void
TestFunction() for specific tests.
- Creating ``fixtures'' which are instance set-ups for test
cases. A fixture sets up member variables of a Test class,
and the variables are to be used by the member functions of the class.
It is important to note that a new instance of the test object is used
for each of the cases tested (member functions). Therefore, each
invocation of a test member function uses an independent set-up.
- A set of helper macros to assert conditions for a successful
test. They include general assertion, comparison assertions,
exception-thrown assertion, and possibly more.
- A set of helper macros and interfaces to construct a table of
available test cases in a test class, and instantiate the test cases
separately.
- A group of ``Runner'' classes which wrap the
Test classes with a common interface to execute the tests by
the user.
Examining these features suggests that we can use the CppUnit as a
basis for our test framework, even when it does not provide what we
need by default. For example, to satisfy the requirement for time
control, we can create a Runner class that times the cases and outputs
the times. We should probably be able to solve the other requirements
one way or another. I believe it is worthwhile to begin the
development of our test framework with CppUnit, and extend it towards
our needs. If during the work we discover that CppUnit cannot handle
one of our requirements, we can try to build on top of the existing
test cases a new framework which does. But for the meanwhile, I
believe CppUnit can be a good start.
CppUnit provides a test framework, but does not say much about the
practical programming rules for employing it. As CppUnit is an
adaptation of JUnit, we may be able to learn something from the JUnit
documentation. However, JUnit is not very specific about the rules
either, as much as I could tell. For the most part, JUnit conventions
rely on the use of the Java reflection mechanism and the
well-developed (compared to C++) type information mechanism.
For what it's worth, the JUnit convention seems to be as follows:
- For each library class that you want to test, create one test
class. The name of the test class for library class
SomeClass is SomeClassTest.
- In the test class, write as many test functions you like. The
name of a test function should start with test.
- The JUnit Cookbook (on
http://www.junit.org)
does not use reflection for automatic detection of test classes and
test functions. However, it does use reflection to locate a
function when the function's explicit name is given. A similar
approach is taken in CppUnit, that is, the programmer specifies the
names of functions which are included in the test class. But unlike
Java, it has to use macros as a substitute for reflection.
We can adopt similar rules in our test framework, with name changes to
match our own naming conventions. I would also suggest that every
test function (with a signature void TestSomething();) will
be a simple wrapper for another function that performs the actual
test. The actual test class can look something like the following.
class TriangleTest : public CpUnit::TestCase
{
public:
void TestArea()
{
TestArea(TheTriangleInstance);
}
// This is the actual test
static void TestArea(Triangle & t);
private:
Triangle TheTriangleInstance;
CPPUNIT_TEST_SUITE(TriangleTest);
CPPUNIT_TEST(TestArea);
CPPUNIT_TEST_SUITE_END();
};
The CPPUNIT_ macros create an interface to the
TriangleTest class that enables it to create a list of
available tests. An additional step enables us to register
TriangleTest in a global test registry. When we come to
executing the tests, we do not even have to explicitly instantiate a
TriangleTest object, as the CppUnit framework can do it for us.
This section will most likely be revised, and is presented here
as an initial suggestion only.
We can use an automated script to run any test program that does not
require manual set-up.
- A test program that requires input should have its main input
from stdin. That is,
- If a program only uses sequential input, it will take all its
input from stdin.
- If a program requires multiple input files, and can use
alternating combinations of input files, it will read all the names
of the input files from stdin, then continue to process the
information in these files as needed.
- A test program that produces output should send all the the
output which is pertinent to the test results to stdout, and should
not send any other output to stdout. That is, user prompts, debug
messages, etc., should, if needed, be sent through stderr.
- We need to determine a way for the test program to report
success or failure. One option is to check the program's exit status.
Depending on the use of other packages, such as CppUnit, we may decide
on a different reporting method. For a partly-standardized list of
common exit codes, see the manpage for exit(3), and the
Linux header file /usr/include/sysexits.h.
- Find a way to detect program crashes.
The rules below will be revised. Currently, a description of the
directory structure for the whole CISST development environment is
presented elsewhere.
Each test program should have its own project directory, and be
configured to build using CMake.
The name of the executable should be the same as the name of the
project directory.
The name of the source file where the main() function is
should be xxxMain.cpp (for a C++ file), where xxx is
the name of the project directory.
The names of alternative main input files should be
xxx-##.in, where xxx is the name of the project
and ## indicates the sequential number of the input file
(use leading zero??).
Standard configuration files will be refered to in the config-files
directory as specified in the general programming guidelines. The setting
of path variables for the local installation can be done using CMake.
Any additional data files required for the tests should be in a
directory named Data under the project directory.
The test program should be accompanied by a brief and meaningful
explanation of what is being tested, including an explanation about
each input file. The explanation should be in a file named
README in the project directory.
Names of project directories for test programs should start with
Test. If a test program is a diagnostic for a specific
class, the project name will be TestClassname, where
Classname is the name of the class without any prefix. For
example, for the class vctVec3 the name of a test program
will be TestVec3.
Guidelines for creating test programs in the CISST-ERC development library
This document was generated using the
LaTeX2HTML translator Version 2002 (1.62)
Copyright © 1993, 1994, 1995, 1996,
Nikos Drakos,
Computer Based Learning Unit, University of Leeds.
Copyright © 1997, 1998, 1999,
Ross Moore,
Mathematics Department, Macquarie University, Sydney.
The command line arguments were:
latex2html -nofootnode -show_section_numbers -split 0 TestPlan.tex
The translation was initiated by Ofri Sadowsky on 2003-08-27
Ofri Sadowsky
2003-08-27