Testing data processing applications

We’re testing a data processing application. It has multiple pipeline stages whose boundaries (i.e. inputs and outputs) are well defined. The bulk of the code reshapes data from one form to another; there is very little functional logic. Testing should therefore focus mainly on how the code reacts to:

  • Null data – when certain fields have a null/nil/None value
  • “Empty” data – when certain fields are populated with the empty string “”, which is distinct from null/nil/None
  • Missing fields – when not just the value is missing, but the field (or column) is absent from the input entirely
  • Improperly formatted data – anything from fields where the data is slightly off (e.g. a different date string format) all the way to fuzz testing
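
One way to cover the four cases above is to generate them mechanically from a known-good record. This is only a sketch: BASE_RECORD, its field names, and make_variants are made up for illustration, and a real pipeline might work on rows or data frames rather than dicts.

```python
import copy

# Hypothetical known-good input; the field names are illustrative only.
BASE_RECORD = {"id": 1, "name": "Ada", "signup_date": "2020-01-31"}


def make_variants(base, field, malformed_values=()):
    """Return (label, record) pairs covering the edge cases for one field.

    Each variant is a deep copy, so the base record is never mutated.
    """
    variants = []

    null_rec = copy.deepcopy(base)
    null_rec[field] = None            # null: the field is present, the value is None
    variants.append(("null", null_rec))

    empty_rec = copy.deepcopy(base)
    empty_rec[field] = ""             # empty: present, but the empty string
    variants.append(("empty", empty_rec))

    missing_rec = copy.deepcopy(base)
    missing_rec.pop(field, None)      # missing: the key itself is gone
    variants.append(("missing", missing_rec))

    for value in malformed_values:    # malformed: caller-supplied bad values
        bad_rec = copy.deepcopy(base)
        bad_rec[field] = value
        variants.append((f"malformed:{value!r}", bad_rec))

    return variants


variants = make_variants(BASE_RECORD, "signup_date", ["31/01/2020", "not a date"])
```

Each (label, record) pair can then be fed to the stage under test, with the label used as the test case name so a failure says which kind of bad input caused it.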

Also to test:

  • Validate the input schema – especially if you have no control over it
  • Validate the output schema – this is basically a regression test
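
A schema check does not need much machinery. The sketch below assumes records are plain dicts and describes a schema as a mapping from field name to (type, nullable); OUTPUT_SCHEMA and schema_errors are hypothetical names, and a dedicated schema library could take their place.

```python
# Hypothetical schema: field name -> (expected type, whether None is allowed).
OUTPUT_SCHEMA = {
    "id": (int, False),
    "name": (str, False),
    "signup_date": (str, True),
}


def schema_errors(record, schema):
    """Return a list of human-readable schema violations (empty if valid)."""
    errors = []
    for field, (expected_type, nullable) in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
            continue
        value = record[field]
        if value is None:
            if not nullable:
                errors.append(f"null not allowed: {field}")
        elif not isinstance(value, expected_type):
            errors.append(
                f"wrong type for {field}: expected {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
    # Fields that are not in the schema are reported too, in sorted order.
    for field in sorted(record.keys() - schema.keys()):
        errors.append(f"unexpected field: {field}")
    return errors
```

The same function works for both directions: run it on inputs to catch upstream changes early, and on outputs so that an accidental change in shape fails a test instead of surprising a downstream consumer.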

I think a test harness for this should:

  • Make it easy to maintain/refresh test data. This may involve pulling inputs from your data sources, but testing shouldn’t be interrupted if the refresh fails.
  • Have designated “base” input objects
  • Have API calls for modifying input and then validating the output without having to manually reset the input object
  • Make sure the output schema is valid
  • Make sure the output values fall within a valid range
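
As a rough sketch of what such a harness could look like: it keeps a designated base input, applies modifications to a fresh copy on every call, and offers checks for the output's fields and value ranges. StageHarness, to_report_row, and all field names here are invented for the example, and refreshing test data from real sources is left out.

```python
import copy


class StageHarness:
    """Runs one pipeline stage against modified copies of a base input."""

    def __init__(self, stage, base_input, expected_fields, range_checks):
        self._stage = stage
        self._base = copy.deepcopy(base_input)
        self._expected_fields = set(expected_fields)
        self._range_checks = range_checks  # field name -> predicate on the value

    def run_with(self, drop=(), **overrides):
        """Run the stage on a fresh copy of the base input.

        overrides replace field values; drop removes fields entirely.
        The base input is never modified, so no manual reset is needed.
        """
        record = copy.deepcopy(self._base)
        record.update(overrides)
        for field in drop:
            record.pop(field, None)
        return self._stage(record)

    def schema_ok(self, output):
        """True if the output has exactly the expected fields."""
        return set(output) == self._expected_fields

    def range_errors(self, output):
        """Return the names of fields whose values fail their range check."""
        return [
            field
            for field, is_valid in self._range_checks.items()
            if field in output and not is_valid(output[field])
        ]


def to_report_row(record):
    # Hypothetical stage: reshapes a record, defaulting a missing or null age to 0.
    return {"user_id": record["id"], "age": record.get("age") or 0}


harness = StageHarness(
    stage=to_report_row,
    base_input={"id": 1, "age": 30},
    expected_fields=["user_id", "age"],
    range_checks={"age": lambda value: 0 <= value <= 150},
)

null_age = harness.run_with(age=None)          # null value
missing_age = harness.run_with(drop=["age"])   # field absent entirely
too_old = harness.run_with(age=200)            # out-of-range value
```

Because every call starts from a copy of the base input, tests stay independent of each other and can be run in any order.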
