MrGeo Unit Testing

ericwood73 edited this page Aug 10, 2016 · 4 revisions

Unit testing is critical to ensuring the stability of any code base and the proper functioning of the units that comprise it. In a system like MrGeo, which relies on external services such as the Hadoop ecosystem to provide critical capabilities, verifying the proper behavior of the units is complicated by that dependency. Ideally, unit tests should be able to run with only the units under test; no external dependencies should be required. This is not to be confused with integration tests, which are intended to verify dependencies as well as proper configuration and operation with integrated services.

The typical strategy for isolating the units under test is to stub out or mock the external dependencies and services. This can be done by creating explicit mock objects that behave like the external service. However, this can be tedious and may not be feasible for third-party services. A more common approach is to mock only the behavior of the external systems at the points where the unit under test interfaces with them. Java unit testing frameworks that support mocking such behaviors make this possible. MrGeo already makes use of one such framework, Mockito. We will discuss Mockito-based approaches to unit testing as well as another framework called Spock.

Mockito

Mockito is a framework that works with JUnit to provide mocking capabilities to JUnit tests. It is powerful and supports both full and partial mocking of objects, but it has limitations. Mockito cannot mock private methods. In practice this means that we need to test a given unit through its accessible, ideally public, interface. This is actually a good practice because it verifies the unit's behavior without requiring any knowledge of the implementation and can therefore detect changes to that behavior caused by code changes (one of the primary purposes of unit tests).

Mockito can mock methods on the subject under test (called spying), but it cannot mock a constructor. Therefore, we need to create the test subject via its real constructors. This becomes complicated when a constructor takes object parameters, because we need to mock out enough of the passed-in instances to allow construction to complete without error. It is always a good idea to provide a default no-argument constructor, even if we make it package visible so that it can only be used by test fixtures. Testing via the public interface also means that the only way to verify behavior is through the public methods that expose the state. We therefore need to mock out enough of the objects passed in to the test subject to reach all of the states that we want to verify.

By way of a very simple example to introduce the constructs we'll be using, here are a couple of lines of code that mock a TiledInputSplit and prescribe the value returned by its getStartTileId method:

TiledInputSplit mockSplit = mock(TiledInputSplit.class);
when(mockSplit.getStartTileId()).thenReturn(1L);

Consider the class HdfsMrsPyramidRecordReader. This class has a few public methods (getProgress, initialize, and nextKeyValue) that should be tested. All of these methods depend on a SequenceFile.Reader, which is created in initialize. The initialize method presents the greatest challenge to mock. The following lines from it involve calls to multiple objects that return other objects, all of which need to be mocked.

FileSplit fileSplit = (FileSplit)((TiledInputSplit)split).getWrappedSplit();
Configuration conf = context.getConfiguration();
Path path = fileSplit.getPath();
FileSystem fs = path.getFileSystem(conf);

The code to mock these objects and dependencies could quickly get ugly. Enter the Builder pattern. In this pattern, we construct a class that collects everything needed to build an object (e.g. its associations) and then builds the object in one call. Builders work well with mock objects because the creation of the mock can be decoupled from the definition of the mock's behavior. For example, here is a builder for the TaskAttemptContext:

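import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class TaskAttemptContextBuilder {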

    private TaskAttemptContext taskAttemptContext;
    private Configuration configuration;

    public TaskAttemptContextBuilder() {
        taskAttemptContext = mock(org.apache.hadoop.mapreduce.TaskAttemptContext.class);
    }

    public TaskAttemptContextBuilder configuration(Configuration configuration) {
        this.configuration = configuration;

        return this;
    }

    public TaskAttemptContext build() {
        when(taskAttemptContext.getConfiguration()).thenReturn(configuration);

        return taskAttemptContext;
    }
}

This builder allows us to specify a configuration to be returned by the mock, which itself could be a mock returned by another builder. Builders can be nested in the same way to describe a chain of related mocks. For example:

mockTiledInputSplit = new TiledInputSplitBuilder().wrappedSplit(
        new FileSplitBuilder().path(
                new PathBuilder().fileSystem(
                        new FileSystemBuilder().build()).build()
        ).build()
).build();

The TaskAttemptContextBuilder is simple because we care about only a single relationship, the one with the Configuration instance. But the pattern is powerful and can be used to prescribe values to be returned as well as relationships to other objects:

private TileIdWritable[] tileIds = {new TileIdWritable(1L), new TileIdWritable(2L), new TileIdWritable(3L)};
private RasterWritable[] rasters = {new RasterWritable("0x01,0x02".getBytes()),
                                    new RasterWritable("0x03,0x04".getBytes()),
                                    new RasterWritable("0x05,0x06".getBytes())};

mockReader = new SequenceFileReaderBuilder()
    .keyClass(TileIdWritable.class)
    .valueClass(RasterWritable.class)
    .keys(tileIds)
    .values(rasters)
    .build();

The next challenge is in the construction of the sequence file reader.

this.reader = new SequenceFile.Reader(fs, path, conf);

This challenge arises because the reader's constructor is invoked inside the initialize method, which gives us no opportunity to intercept the creation and inject our mock. There are a few different ways to deal with this.

The first is to delegate the creation to a method:

this.reader = makeReader(fs, path, conf);
...
// Package visible rather than private so that a test in the same package can stub it on a spy.
SequenceFile.Reader makeReader(FileSystem fs, Path path, Configuration config) throws IOException {
    return new SequenceFile.Reader(fs, path, config);
}

and then use Mockito's spy functionality to mock only the makeReader method on the class under test (subsequent calls in the test must go through the spy, not the original instance):

// Instance under test
HdfsMrsPyramidRecordReader subject = new HdfsMrsPyramidRecordReader();
HdfsMrsPyramidRecordReader spySubject = spy(subject);
doReturn(mockReader).when(spySubject).makeReader(fs, path, config);
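
The same seam can also be exercised without Mockito by overriding the creation method in a test subclass. The following is a minimal sketch under stated assumptions: Reader and RecordReader are hypothetical stand-ins, not MrGeo or Hadoop classes, and the reader is reduced to a single describe method. It illustrates why the creation method cannot be private if a test is to replace it.

```java
// Stand-in types only: nothing here is a MrGeo or Hadoop class.
final class OverrideSeamSketch {

    /** Hypothetical stand-in for SequenceFile.Reader. */
    interface Reader {
        String describe();
    }

    /** Hypothetical stand-in for the class under test. */
    static class RecordReader {
        private Reader reader;

        void initialize(String path) {
            // Creation is delegated to an overridable method instead of calling `new` inline.
            this.reader = makeReader(path);
        }

        // Package visible (not private) so that a test can replace it.
        Reader makeReader(String path) {
            return () -> "real reader for " + path;
        }

        String readerDescription() {
            return reader.describe();
        }
    }

    public static void main(String[] args) {
        // The test subclass swaps in a fake reader; initialize() is otherwise untouched.
        RecordReader subject = new RecordReader() {
            @Override
            Reader makeReader(String path) {
                return () -> "fake reader";
            }
        };
        subject.initialize("/some/path");
        System.out.println(subject.readerDescription());
    }
}
```

A spy achieves the same effect with less ceremony, but the hand-written subclass makes the visibility requirement on the overridden method explicit.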

Another option is to delegate creation to a factory instance and then inject the factory via a package-visible constructor used for testing. The original constructor creates a default factory instance.

public HdfsMrsPyramidRecordReader() {
    this.readerFactory = new ReaderFactory();
}

// Package visible: intended for tests that inject their own factory.
HdfsMrsPyramidRecordReader(ReaderFactory readerFactory) {
    this.readerFactory = readerFactory;
}
...
this.reader = readerFactory.createReader(fs, path, config);

Modifying code in this way to support testing might seem like a bad idea, but the prevailing view is that designing for testability often improves the design of the code itself. For example, injecting instances rather than creating them directly is useful if we want to manage the creation of those instances externally (e.g. pooling or singletons) or if we want to extend the class to add behavior or instrumentation.
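
The factory-injection option can be sketched end to end without Hadoop or Mockito. In the sketch below, Reader, ReaderFactory, and RecordReader are hypothetical stand-ins with simplified signatures (a path string instead of a FileSystem, Path, and Configuration), so it shows the shape of the design rather than MrGeo's actual code.

```java
// Stand-in types only: nothing here is a MrGeo or Hadoop class.
final class FactoryInjectionSketch {

    /** Hypothetical stand-in for SequenceFile.Reader. */
    interface Reader {
        boolean next();
    }

    /** Hypothetical factory that hides how the reader is created. */
    interface ReaderFactory {
        Reader createReader(String path);
    }

    /** Hypothetical stand-in for the class under test. */
    static class RecordReader {
        private final ReaderFactory readerFactory;
        private Reader reader;

        // Production constructor: wires in a default factory (here, a reader with no records).
        public RecordReader() {
            this(path -> () -> false);
        }

        // Package-visible constructor: lets a test inject its own factory.
        RecordReader(ReaderFactory readerFactory) {
            this.readerFactory = readerFactory;
        }

        void initialize(String path) {
            this.reader = readerFactory.createReader(path);
        }

        boolean nextKeyValue() {
            return reader.next();
        }
    }

    public static void main(String[] args) {
        // Inject a factory whose reader reports exactly two records.
        int[] remaining = {2};
        RecordReader subject = new RecordReader(path -> () -> remaining[0]-- > 0);
        subject.initialize("/some/path");
        while (subject.nextKeyValue()) {
            System.out.println("got a record");
        }
    }
}
```

In a Mockito-based test the injected factory would itself be a mock whose createReader method returns the mock reader.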

The next challenge arises from the API of the SequenceFile.Reader class. This class has a next(key, value) method that writes the next key and value found in the file into the key and value objects passed to it. Side effects such as this cannot be provided by simply specifying a return value. Fortunately, Mockito allows us to implement complex behavior on method invocation through its Answer interface. The developer provides a custom implementation of Answer when the mock behavior is defined, and the same instance is then used for all subsequent invocations of that method on that mock. In our mock implementation, we would like to specify an array of keys and an array of values, as shown in the builder example above, and have the reader iterate over them with each call to next. The following code snippet shows how this might be accomplished.

public SequenceFile.Reader build() throws IOException {
    when(sequenceFileReader.getKeyClass()).thenReturn(keyClass);
    when(sequenceFileReader.getValueClass()).thenReturn(valueClass);

    when(sequenceFileReader.next(any(Writable.class), any(Writable.class))).thenAnswer(new Answer<Boolean>() {
        private int index = 0;
        @Override
        public Boolean answer(InvocationOnMock invocationOnMock) throws Throwable {
            if (index >= keys.length) return false;

            // Get the key and value
            Object[] args = invocationOnMock.getArguments();
            Writable key = (Writable)args[0];
            Writable value = (Writable)args[1];
            copyData(keys[index], key);
            copyData(values[index], value);
            logger.info("Read: key: " + keys[index] + " value: " + values[index] + " Wrote: key: " + key + " value: " + value);
            ++index;
            return true;
        }

        // Round-trip through a byte array; a same-thread pipe can block once its buffer fills.
        private void copyData(Writable src, Writable tgt) throws IOException {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            src.write(new DataOutputStream(bytes));
            tgt.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        }
    });

    return sequenceFileReader;
}

By utilizing the Answer capability, we are able to keep and increment an index into the arrays of keys and values. We are also able to write the data from the array elements to the arguments passed in to the call to next.
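
The two ideas at work here, a stub that keeps an index between calls and a copy that populates the caller's object through serialization, can be seen in isolation in the following sketch. It uses only the JDK: SimpleWritable, LongKey, and FakeReader are hypothetical stand-ins for Hadoop's Writable, a key type, and the mocked reader, and only keys are copied to keep it short. The copy goes through an in-memory byte array, which is an assumption of this sketch rather than something the original code prescribes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Stand-in types only: nothing here is a MrGeo or Hadoop class.
final class StatefulReaderSketch {

    /** Hypothetical stand-in for Hadoop's Writable contract. */
    interface SimpleWritable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    /** Hypothetical key type holding a single long. */
    static final class LongKey implements SimpleWritable {
        long value;
        LongKey() { }
        LongKey(long value) { this.value = value; }
        @Override public void write(DataOutput out) throws IOException { out.writeLong(value); }
        @Override public void readFields(DataInput in) throws IOException { value = in.readLong(); }
    }

    /** Fake reader: each call to next() copies the next prepared key into the caller's object. */
    static final class FakeReader {
        private final SimpleWritable[] keys;
        private int index = 0; // state kept between invocations, like the Answer's index

        FakeReader(SimpleWritable... keys) { this.keys = keys; }

        boolean next(SimpleWritable key) throws IOException {
            if (index >= keys.length) return false;
            copyData(keys[index], key);
            ++index;
            return true;
        }
    }

    // Serialize src to a byte array, then populate tgt from those bytes.
    static void copyData(SimpleWritable src, SimpleWritable tgt) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        src.write(new DataOutputStream(bytes));
        tgt.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
    }

    public static void main(String[] args) throws IOException {
        FakeReader reader = new FakeReader(new LongKey(1L), new LongKey(2L));
        LongKey key = new LongKey();
        while (reader.next(key)) {
            System.out.println("read key " + key.value);
        }
    }
}
```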

MapOp Testing

Testing of MrGeo map operations requires creating instances of RDDs and MapOp subclasses. The Spark RDD is the fundamental operand for all operations. As such, nearly all of its capability would need to be mocked in order to test map operations. This is not practical, nor is it good practice to mock out so much of an external object. For testing of MapOps, we instead rely on the ability to run Spark in a local context. When Spark is run in a local context, the parallelize method can be used to create an RDD from a collection of items (in MrGeo, most often TileId, Raster pairs). These pairs can be prepopulated with any test data needed and run through the mapop under test, and the resulting RDD can be analyzed for the expected transformation.

To avoid writing the code to set up and verify the results for each test, the RasterMapOpTestSupport and RasterMapOpTestVerifySupport Scala traits have been created. RasterMapOpTestSupport contains functions for creating RasterMapOps with either a specified value for each band or, optionally, a supplied raster generator function to create the raster. RasterMapOpTestVerifySupport extends RasterMapOpTestSupport and adds several functions for verifying rasters that can use a supplied verification function for each tile. It also includes functions for verifying that a raster is unchanged and for verifying that a raster has no data outside of some bounds and a constant value within the bounds. These traits can be mixed in with a Scala TestSuite to provide the suite with functions for creating and verifying test data.

Creating a basic RasterMapOp to use as input requires only the tile IDs, zoom level, and tile size, although there are optional arguments to override the default nodata value for each band, the initial sample value for each band, and the function used to generate the Raster.

val tileIds: Array[Long] = Array(11, 12, 19, 20)
val zoomLevel = 3
val tileSize = 512
inputRaster = createRasterMapOp(tileIds, zoomLevel, tileSize)

Behind the scenes this creates a local SparkContext, creates a raster for each tile, uses the parallelize method on the tile raster pairs to generate an RDD, and then creates a RasterMapOp instance wrapping that RDD. This RasterMapOp can then be used as input to another mapop:

it should "keep all tiles when the bounds includes all tiles" in {
  subject = CropMapOp.create(inputRaster, -44.0, -44.0, 44.0, 44.0).asInstanceOf[RasterMapOp]
  subject.execute(subject.context())
  val transformedRDD = subject.rdd().get
  assertResult(4) {
    transformedRDD.count
  }
  verifyRastersAreUnchanged(transformedRDD, tileIds)
}

The resulting RDD can be obtained and then passed to a verify method to check the expected results. For most mapops, a custom verification function will need to be used in conjunction with the verifyRasters function on RasterMapOpTestVerifySupport, as shown in the following example. In this example the same verification function is used for each tile. The verifyRasters function has an optional argument, defaulting to true, that causes it to fail the test if a tile is associated with a verifier but is not found in the RDD.

val verifier: RasterVerifier = verifyRastersAreTheSame(generatedRasters.values.head) _
val verifiers = tileIds.map(t => (t, verifier)).toMap
verifyRasters(rdd, verifiers)

Verifiers can make use of RasterMapOpTestVerifySupport's forEachSampleInRaster function to apply the verification function to each sample.

forEachSampleInRaster(actual, (b, x, y, sample) => {
    Assertions.assertResult(expected.getSample(x, y, b), s"Sample at x: $x y: $y band: $b") {sample}
})
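
For readers more at home in Java, the per-sample visiting idea can be sketched with the JDK's java.awt.image.Raster. This is an assumed analogue, not MrGeo's implementation: SampleVisitor, forEachSample, countMismatches, and constantRaster are hypothetical names, the rasters are integer typed, and both rasters are assumed to have the same dimensions and band count.

```java
import java.awt.image.DataBuffer;
import java.awt.image.Raster;
import java.awt.image.WritableRaster;

// Hypothetical helper names; only the java.awt.image calls are real JDK API.
final class SampleVerifierSketch {

    /** Callback receiving the band, pixel coordinates, and sample value. */
    interface SampleVisitor {
        void visit(int band, int x, int y, int sample);
    }

    // Visit every sample of every band in the raster.
    static void forEachSample(Raster raster, SampleVisitor visitor) {
        for (int b = 0; b < raster.getNumBands(); b++) {
            for (int y = raster.getMinY(); y < raster.getMinY() + raster.getHeight(); y++) {
                for (int x = raster.getMinX(); x < raster.getMinX() + raster.getWidth(); x++) {
                    visitor.visit(b, x, y, raster.getSample(x, y, b));
                }
            }
        }
    }

    // Count the samples where actual differs from expected (0 means identical).
    static int countMismatches(Raster expected, Raster actual) {
        int[] mismatches = {0};
        forEachSample(actual, (b, x, y, sample) -> {
            if (expected.getSample(x, y, b) != sample) {
                mismatches[0]++;
            }
        });
        return mismatches[0];
    }

    // Build a single-band integer raster filled with one value.
    static WritableRaster constantRaster(int width, int height, int value) {
        WritableRaster raster = Raster.createBandedRaster(DataBuffer.TYPE_INT, width, height, 1, null);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                raster.setSample(x, y, 0, value);
            }
        }
        return raster;
    }

    public static void main(String[] args) {
        WritableRaster expected = constantRaster(4, 4, 7);
        WritableRaster actual = constantRaster(4, 4, 7);
        actual.setSample(2, 1, 0, 9); // introduce one differing sample
        System.out.println("mismatches: " + countMismatches(expected, actual));
    }
}
```

A real verifier would fail the test at the first mismatch, with a message naming the coordinates and band as the Scala example does, rather than counting mismatches.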

Some tests may require a second mapop to be used as input to a mapop, such as the Crop mapop, which can take another RasterMapOp to use for the crop bounds. The createRasterMapOpWithBounds function can be used to create a raster with explicit bounds. The createRasterMapOp functions can be called multiple times; they will reuse the existing SparkContext if one is available.

it should "be able to use the bounds of another Raster as the bounds for cropping" in {
  // Create a new raster map op.
  val rasterMapOpForBounds = createRasterMapOpWithBounds(tileIds = tileIds, zoomLevel = 3, tileSize = 512,
                                                         bounds = new Bounds(10.0, 10.0, 35.0, 35.0))
  subject = CropMapOp.create(inputRaster, rasterMapOpForBounds).asInstanceOf[RasterMapOp]
  subject.execute(subject.context())
  val transformedRDD = subject.rdd().get
  assertResult(1) {
    transformedRDD.count
  }
  verifyRastersAreUnchanged(transformedRDD, Array(20))
}

Speaking of context, it is important to clean up the context after a test using RasterMapOpTestSupport.stopSparkContext. This is because an error is thrown if a SparkContext is created while another one is still active in the same JVM.

Python MapOp Testing

It is recommended to do full coverage testing for mapops in Scala. However, given that the Python mapops are generated from the Scala mapops, it is a good idea to verify the Python interface, especially when making any changes to the mapops' create methods. It is only necessary to test a single case for each interface to confirm that the mapop was generated correctly, though the correct passing of all arguments, including optional ones, should be verified. To minimize the number of variables in the testing, the PythonGateway can be run within the same JVM, using a local Spark context. It is also possible to make use of the functions in the RasterMapOpTestSupport trait through the StandaloneRasterMapOpTestSupport class. For convenience, the interface to this class has been wrapped in a Python class, RasterMapOpTestSupport. This class is used in much the same way as the Scala test support, as the following example shows.

def test_crop_with_explicit_bounds(self):
    cropresult = self._inputRaster.crop(w=-44.0, s=-44.0, e=44.0, n=44.0)
    croppedRDD = self._rasterMapOpTestSupport.getRDD(cropresult)
    count = croppedRDD.count()
    self.assertEqual(4, count)
    self._rasterMapOpTestSupport.verifyRastersAreUnchanged(croppedRDD, [11, 12, 19, 20])

The key difference is that the SparkContext is managed externally when using the Python test support classes. This should be done in the setUp method for the test:

def setUp(self):
    mrgeo = MrGeo()
    # Get the JVM.  This will create the gateway
    self._jvm = mrgeo._get_jvm()
    self._createSparkContext()
    mrgeo.usedebug()
    mrgeo.start(self._sparkContext)
    self._mrgeo = mrgeo
    rasterMapOpTestSupport = RasterMapOpTestSupport(self._mrgeo)
    rasterMapOpTestSupport.useSparkContext(self._sparkContext)
    self._rasterMapOpTestSupport = rasterMapOpTestSupport
    self.createDefaultRaster()

The order of operations is important. The SparkContext must be created before the gateway is started, and the context should be passed to the mrgeo.start method to prevent MrGeo from forking a new process. mrgeo.start must be called before calling any methods on the RasterMapOpTestSupport instance, and the context must be set on that instance before calling any of its other methods.