Until recently I would have followed the party line and told clients that you shouldn’t build a Data Warehouse (DW) if your data quality is too poor; ‘rubbish in, rubbish out’, after all. I may have even told them that they should establish an Information Management (IM) group that tackles issues like data quality, governance, mastering, stewardship, etc. Once the IM group has had some success, then commence the DW build.
This is reasonable advice; however, given the right set of circumstances, it is overly conservative.
So what made me change? I built a greenfield DW for a client that had:
- Suspicions that their data was poor.
- No existing reporting, beyond Excel.
- No master data lists but many disparate sources.
- Poor or inconsistent business processes.
So with appropriate trepidation, I started the project but took the following additional steps:
- Created simple Data Quality (DQ) reports, even though the client hadn’t asked for them.
- During testing, limited access to a select few. The message of data quality issues is lost on some people, and they will blame the new DW instead.
- Controlled the DQ message and communicated widely and often. I had to be careful not to appear defensive or to invite accusations of finger pointing.
- Tracked DQ directly and indirectly (e.g. counting missing surrogate keys is an easy indirect measure). The following is an example:
Each line reflects a different surrogate key; I have removed the legend to preserve the anonymity of the client.
As you can see, it tells a story: lots of issues at the beginning, some success, but still a long way to go.
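The indirect tracking mentioned above can be sketched in a few lines. This is a minimal illustration, not the actual project code: it assumes the common convention where a failed dimension lookup assigns a default ‘unknown member’ surrogate key of -1, and the table and column names are made up for the example.

```python
# Sketch: track DQ indirectly by counting fact rows whose dimension lookup
# failed, i.e. rows that fell back to the 'unknown member' surrogate key.
# Column names and the -1 convention are illustrative assumptions.
from collections import Counter

UNKNOWN_KEY = -1  # conventional default surrogate key for a failed lookup

def missing_key_counts(fact_rows, key_columns):
    """Count, per surrogate key column, how many fact rows failed the lookup."""
    counts = Counter({col: 0 for col in key_columns})
    for row in fact_rows:
        for col in key_columns:
            if row.get(col, UNKNOWN_KEY) == UNKNOWN_KEY:
                counts[col] += 1
    return counts

# Example: two fact rows, one with a failed customer dimension lookup.
facts = [
    {"customer_sk": 101, "product_sk": 7},
    {"customer_sk": -1, "product_sk": 7},
]
print(missing_key_counts(facts, ["customer_sk", "product_sk"]))
```

Running the counts per load and plotting each key column over time produces exactly the kind of trend chart described above, one line per surrogate key.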
The outcome was surprisingly positive; the project delivered:
- Visibility of the DQ issues in a business-process-specific manner; after all, that’s what the facts are modeled on. It is very hard for the sponsors to make sweeping statements about reporting accuracy when you can easily demonstrate their DQ issues.
- Tangible evidence of how DQ is changing over time, enabling targeted remediation in the source systems.
- A set of ‘bonus’ DQ reports that can be used in the future for regression testing.
- A new DW and 28 reports, on time and on budget. There is of course additional work to be done refining business rules as the DQ improves.
So what circumstances made this possible?
- A supportive project manager who understood the business rules and the impact of DQ issues.
- Small data sets that enabled full DW reloads in a short amount of time (i.e. less than 10 minutes).
- Source data that is transactional.
- A DW model that did not need snapshot facts or slowly changing dimensions.
Obviously this is not a typical scenario, but hopefully this post gives you some options to consider if you find yourself in the same situation.