How to test(unit test) on ETL process?

You can unit test ETLs.

End-to-end tests are good, but slow, expensive and difficult to construct and keep stable.

Unit testing ETLs is highly desired to be able to test all data permutations but generally put into the too-hard basket. However it is possible to write true unit tests for ETLs that can run quickly and reliably.

We have found that the key is to decompose the ETL into two separate sections. Since an ETL is an Extract-Transform-Load the key is to separate the T from the E&L. Make a pure Transform function that transforms an input dataset to an output dataset, then call this function from the Extract and Load module.

The Extract and Load module isn't suitable for unit testing because it will generally involve external data sources and sinks, access tokens and user permissions, etc.

But all of the testable logic should be in the Transform component. Test this function from any unit testing framework - you will be able to pass in predefined datasets and test the transformed output against expected results. With some thinking we have even managed to create unit tests that test multi-stage updates of datasets onto each other.

Our particular implementation was done on Databricks in Scala, but the concept should work on any platform.


We've set up a system where for each ETL procedure we have defined an input dataset and an expected result dataset. Then we have created a system which, utilizing Robot Framework, runs three-part tests for each ETL procedure where the first part inserts the input dataset into the source data tables, the second part runs the ETL, and the third part compares the actual results with our expected results.

This works pretty well for us, but there are a couple of downsides: first of all, we create the test datasets manually for each ETL procedure which takes some work, and secondly, this means that testing for "unexpected" inputs is not done.

For the automated unit testing we have a separate environment in which we can install builds of our entire DW automatically.


testing of an ETL is usually a problem. More precisely, testing isnt problem, problem is how to get reasonable test data. ETL is typically tested on production data. Aside of the security issue, the problem with production data is that does not cover functionality of ETL sufficiently (typically about 40% of business rules isnt covered by production data sample) and it takes too much of time to process.

Recently we have developed a test data generator (for more detail, please look for GTL QAceGen: Business Logic Driven Data Generator on Informatica Market Place) which generate test data based into source tables/files on business rule specification. Tool takes into consideration any foreign key applied and it works for any major ETL and/databases.

This tool helps to speed up testing cycle by at least 50% (compared to manual testing) an covers 100% of all business rules. It also generates quite detailed reports and more importantly, these tests can be repeated at any time (ie regression tests).


We recently worked on a project where the governance board demanded 'You must have Unit Tests' and so we tried our best.

What worked for us was have each ETL solution start and end with a QA/Test package.

Anything unexpected discovered by these packages was logged into an audit table and a Fail Package event was then raised to stop the entire Job - We figured it was better to run with yesterdays good data than risk reporting against possible bad 'today' data.

The starting package would do db schema and data sanity checks. Data Sanity involved checking for duplicate or missing data caused by a lack of Referential Integrity in the source systems. Schema checks ensured that any schema changes that did not get applied during Continuous integration were detected.

The end package would check the results of any transformations. These included:

  • Comparing record counts between source|destination
  • Checking specific transforms (eg: all date values changed to appropriate SK value, all string values RTrimed)
  • Ensuring all SK fields were populated (-1 instead of nulls)

Most of these tests were SQL statements the used the built in schema objects of our database, so they were not to onerous to create.

In addition, as part of our development process we would create views that had the end result of any transformations we were doing. We would make use of these views to validate our package transformations.

Each of these checks created a record in our special audit table. That way we could provide a comprehensive list of all the tests and checks we had done each running of the process to satisfy the governance peoples.

(We also had a separate set of packages that would unit test each QA test by means of creating dummy tables, populating them, running the test then confirming the appropriate audit record was written. As Nick stated, this was a lot of work and of little real value)