PDI Best Practices – Avoid Insert/Update Step that Slow Down PDI (Pentaho Data Integration)

PDI Best Practices – Avoid Insert/Update Step that Slow Down PDI (Pentaho Data Integration)

Are you aware that the process which may slow down the PDI (Pentaho Data Integration) is the insert/update step?

We GrayMatter Software Service are glad to describe the process to avoid in PDI (Pentaho Data Integration), namely, insert/update step, which is the reason for slow down in PDI process. Explanation is given below in brief.

Get a Free Quote for Pentaho Consulting


Pentaho Data Integration (PDI) Insert/Update step by step process slows down the PDI process as mentioned below

Let us take an example of loading a target table. Assume that there is a daily load of 100k records into a target table with 10million records and every incoming row from the source table looks up against all the 10 million records in the target table. This process continues for all the 100k input records.

Quick Enquiry










Are you aware that the process which may slow down the PDI (Pentaho Data Integration) is the insert/update step?

We GrayMatter Software Service are glad to describe the process to avoid in PDI (Pentaho Data Integration), namely, insert/update step, which is the reason for slow down in PDI process. Explanation is given below in brief.

Get a Free Quote for Pentaho Consulting


Pentaho Data Integration (PDI) Insert/Update step by step process slows down the PDI process as mentioned below

Let us take an example of loading a target table. Assume that there is a daily load of 100k records into a target table with 10million records and every incoming row from the source table looks up against all the 10 million records in the target table. This process continues for all the 100k input records.

Quick Enquiry










Insert/Update step involves multiple round trips to the database depending on the commit size.

Performance of most steps depends heavily upon the number of round trips, speed of round trip which is a combination of speed of network, the latency on the network and the performance of database.

Find below the wiki link to explain the Insert/Update step

https://wiki.pentaho.com/display/EAI/Insert+-+Update

As there is an extra lookup process involved in this step, it definitely slows down the process as it needs to run through the entire lot of records to check for a matching record ( to update ) and if none matches ( to perform insert ).This step is also slower than the regular “Table Output step”.

Addressing the above situation, on the network latency side, there is very little one can do. In the lookout for other possible solutions, reduction of number of round trips to the database is the first thing to be considered. How this can be accomplished is by having a mechanism to load lookup data in memory (cache)

In PDI, most Lookup steps have options to cache data. Steps such as “Stream Lookup” & “Merge Join” are some of them.

In the above scenario, one can design the transformation with Merge join” step to perform operations within a single statement, therefore, minimizing the number of times data in the source and target tables are processed. This helps in less memory consumption and good performance. “Merge join” requires input data for the step to be sorted.

Find below the wiki link to explain the Merge Join step.

https://wiki.pentaho.com/display/EAI/Merge+Join

Our Innovative Pentaho Projects by Service Line

75+

BUSINESS ANALYTICS

65+

DATA INTEGRATION

16+

EMBED PENTAHO

43+

PREMIUM CONSULTING

10+

BIG DATA

15+

MIGRATION

Take a Deep Dive

Projects by service line

75+

BUSINESS ANALYTICS

65+

DATA INTEGRATION

16+

EMBED PENTAHO

43+

PREMIUM CONSULTING

10+

BIG DATA

15+

MIGRATION

Take a Deep Dive
At GrayMatter, Your Business Matters!
Contact Us
At GrayMatter, Your Business Matters!
Contact Us
2019-12-20T06:59:01+00:00
Contact us
close slider










I would like to receive information about GrayMatter products & services