Implementing Data Quality for Enterprise Data Warehouses

Implementing Data Quality for Enterprise Data Warehouses 2019-10-11T10:13:19+00:00

Implementing Data Quality for Enterprise Data Warehouse Solutions

Data quality implementation is important in the context of Data Warehouse and Business Intelligence. An Enterprise Data Warehouse Solutions is an integral part of those enterprises which want to have clear business insights from customer and operational data. It has been observed that fundamental problems arise in populating a warehouse with quality data from various multiple sources systems.

GrayMatter’s experts have ensured data quality for enterprise data warehouse solutions for global customers using varied technology stacks. Let us see the below DWBI Architecture, one without Data Quality implemented and other with Data quality implemented.

DWBI Architecture without Data quality implemented

Data from source is extracted as it is and loaded into stage tables (Landing layer) without any transformation using ETL. Post which data from Stage tables will be extracted and loaded to Data Store layer (Persistent layer). As a final process, data from DS layer will be extracted, transformed and loaded into Data Marts/DW.

The problem with this approach is, there are no Data quality checks implemented in the source data and because of which Enterprise Data Warehouse Solutions might get populated with poor quality data which could lead into wrong analysis and faulty decisions.

Enterprise Data Warehouse Solutions

DWBI Architecture with Data quality implemented

Once the data is loaded into stage tables, ETL will trigger Data quality check process on the stage table data. For each stage table, ETL picks up the DQ Rules (Query) defined in the DQ meta data table and executes to identify which all are the records violating the rules. Accordingly, this process will mark/update those bad quality records with the respective DQ rule IDs. Single record can be failed due to multiple rules and those DQ rule IDs will be updated against those records.

Once the DQ process is completed, there is another ETL/Process that picks up all the bad quality records from Stage table and moves to Reject tables. These records can be identified by applying a filter “.DQ_Rule_ID is NOT NULL”. Same way another ETL/process that picks up good quality data from Stage to Data Quality stage and from there to Data Store and DW/Data Marts.

As an illustration of Data Warehouse Implementation, following section details how this can be achieved using Pentaho Data Integration (PDI).

Enterprise Data Warehouse Solutions

Technical Implementation of Data Quality using Pentaho Data Integration (PDI)

For this implementation using Pentaho Data Integration, we need to have some user defined meta data tables as below:

  • ETL_RT_DAILY_JOBS – This table will have the information such as ETL Job name, Module Name (Same as module name) and some other audit information that gets populated when the ETL job is triggered
  • ETL_DQ_MASTER – This table will have all the Data Quality – Enterprise Data Warehouse Solutions rules for each Stage table. DQ rules are maintained in the form of Database query as shown below. We can add ‘n’ number of DQ rules for each source entity (Table)
  • Note
    • Stage table is a temporary table that gets loaded as it is from Source. Hence it will hold the same data as source with no change
    • Refer Fig-3 for the ETL_DQ_MASTER table structure and data. We can have any kind of rule such as query for De dupe check (Query to identify Duplicate records), PK column null check, Foreign Key constraint violation, date formats etc.

1. De-Dupe Check

Let us create an ETL job as below. Also it is very important that you need to define the De-Dupe check query in the ETL_DQ_MASTER table for the respective stage table. Also, we will have to add a column IS_DUPLICATE to the stage table which will have values 0 or 1. 1 indicates Duplicate records.

big data analytics companies

“GENERIC SQL QUERY” job is the step that will call a job as below which is used to identify the duplicate records in the stage tables based on the rule defined in the DQ rule Meta data tables and RULE Type such as “De_Dupe_Check”, “Row_Level_DQ_Reject” etc.

big data analytics companies

Below are the steps used in “Apply De Dupe Rule” transformation.

big data analytics companies

Get variables step is being used to get the variable values for ETL Job Name, App Group Name (Same as module name) etc.

big data analytics companies

Then we will do a look up to ETL_RT_DAILY_JOBS to get the stage table name

big data analytics companies
big data analytics companies

Then the stage table name, DQ rule type is being passed to ETL_DQ_MASTER to get the DQ Rule ID, Rule for each stage table

big data analytics companies

Once the stage table name, DQ Rule ID, DQ rule is obtained, a transformation is being call with below values are parameters.

big data analytics companies

In this transformation, we will get all the variables using “Get Variables” and an “Execute SQL Script” step that executes the DQ rule (Query defined in ETL_DQ_MASTER) as shown below. The highlighted is the way of passing variable and the parameter is the DQ_RULE field which is the query defined in ETL_DQ_MASTER.

big data analytics companies
big data analytics companies

Once this Job “GENERIC SQL QUERY” is completed, all the duplicate records in the respective stage table will be marked as IS_DUPLICATE – 1. The records with value IS_DUPLICATE -0 will be good quality records for further processing

2. Row Level DQ quality check implementation

Once the De-Dupe job is completed, another job “Apply Data Rejection Rules” being called with steps as given below.

big data analytics companies
big data analytics companies
big data analytics companies

Here for this Rule type, unlike De-Dup DQ rule column value fro ETL_DQ_MASTER for each stage table will be appended with the Update statement of below “Execute SQL Script”.

big data analytics companies
big data analytics companies

Once the Job is completed, all the stage table records which are violating the rules will be marked with respective DQ rules (Append the DQ rule IDs if single record is breaking multiple rules)

big data analytics companies
At GrayMatter, Your Business Matters!
Contact Us
At GrayMatter, Your Business Matters!
Contact Us
Contact us
close slider










I would like to receive information about GrayMatter products & services