Pentaho PDI Best Practices – When looking at a transformation designed to use the Insert/Update step to load data into a target table, it helps to understand how the PDI Insert/Update step technically works, as described below.
Let us take an example of loading a target table. Assume a daily load of 100k records into a target table that already holds 10 million records: every incoming row from the source is looked up against all 10 million records in the target table to decide whether it should update an existing row or insert a new one, and this process repeats for all 100k input records.
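Conceptually, the per-row logic resembles the JDBC sketch below. This is a minimal illustration only, not PDI's actual implementation; the table and column names (target_table, id, name) are hypothetical.

```java
import java.sql.*;

public class InsertUpdateSketch {
    // Sketch of what Insert/Update does per incoming row: look up the
    // target table by key, then update on a match or insert otherwise.
    static void upsertRow(Connection conn, long id, String name) throws SQLException {
        try (PreparedStatement lookup = conn.prepareStatement(
                "SELECT id FROM target_table WHERE id = ?")) {
            lookup.setLong(1, id);
            try (ResultSet rs = lookup.executeQuery()) {
                if (rs.next()) {
                    // Key found: update the existing row.
                    try (PreparedStatement upd = conn.prepareStatement(
                            "UPDATE target_table SET name = ? WHERE id = ?")) {
                        upd.setString(1, name);
                        upd.setLong(2, id);
                        upd.executeUpdate();
                    }
                } else {
                    // No match: insert a new row.
                    try (PreparedStatement ins = conn.prepareStatement(
                            "INSERT INTO target_table (id, name) VALUES (?, ?)")) {
                        ins.setLong(1, id);
                        ins.setString(2, name);
                        ins.executeUpdate();
                    }
                }
            }
        }
    }
}
```

With 100k incoming rows, this pattern issues at least 100k lookup queries plus 100k inserts or updates against the 10-million-row table, and that cost is what the rest of this section tries to avoid.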
The Insert/Update step involves multiple round trips to the database, the number of which depends on the commit size. The performance of most database-bound steps depends heavily on the number of round trips and the speed of each round trip, which in turn is a combination of network speed, network latency, and database performance.
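As a rough illustration of how commit size caps the number of commit round trips, the hedged JDBC sketch below batches statements and commits once per 1,000 rows, so a 100k-row load commits about 100 times instead of 100,000. Table and column names are again hypothetical.

```java
import java.sql.*;
import java.util.List;

public class CommitSizeSketch {
    // Sketch: statements are batched and committed once per COMMIT_SIZE
    // rows, so a 100k-row load with COMMIT_SIZE = 1000 commits ~100 times.
    static final int COMMIT_SIZE = 1000;

    static void load(Connection conn, List<long[]> rows) throws SQLException {
        conn.setAutoCommit(false);
        try (PreparedStatement ins = conn.prepareStatement(
                "INSERT INTO target_table (id, value) VALUES (?, ?)")) {
            int pending = 0;
            for (long[] row : rows) {
                ins.setLong(1, row[0]);
                ins.setLong(2, row[1]);
                ins.addBatch();
                if (++pending == COMMIT_SIZE) {
                    ins.executeBatch();  // one round trip for the whole batch
                    conn.commit();
                    pending = 0;
                }
            }
            if (pending > 0) {           // flush the final partial batch
                ins.executeBatch();
                conn.commit();
            }
        }
    }
}
```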
Find below the wiki link explaining the Insert/Update step.
As this step involves an extra lookup, it definitely slows down the process: for every incoming row it must search the target records for a match (to update) and, if none matches, perform an insert. This also makes it slower than the regular “Table Output” step, which only inserts.
Addressing the above situation, there is very little one can do on the network latency side. Among the other possible solutions, reducing the number of round trips to the database is the first thing to consider, and this can be accomplished by loading the lookup data into memory (a cache).
In PDI, most lookup steps have options to cache data; “Stream Lookup” and “Merge Join” are two such steps.
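A minimal sketch of the caching idea, independent of any particular PDI step: read the target table's keys into a HashMap once, then classify each incoming row in memory, so the 10-million-row table is scanned once rather than queried 100k times. The names here are hypothetical.

```java
import java.sql.*;
import java.util.HashMap;
import java.util.Map;

public class CachedLookupSketch {
    // Sketch of the cache idea behind steps like Stream Lookup:
    // read the target table's keys once, then decide insert vs. update
    // in memory instead of querying the database per incoming row.
    static Map<Long, String> loadCache(Connection conn) throws SQLException {
        Map<Long, String> cache = new HashMap<>();
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT id, name FROM target_table")) {
            while (rs.next()) {
                cache.put(rs.getLong("id"), rs.getString("name"));
            }
        }
        return cache;  // one full scan instead of 100k lookup queries
    }

    // Classify an incoming row using the cache: no database round trip needed.
    static boolean isUpdate(Map<Long, String> cache, long id) {
        return cache.containsKey(id);
    }
}
```

The trade-off is memory: caching millions of keys costs RAM, so this approach suits lookup sets that fit comfortably in memory.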
In the above scenario, one can design the transformation with the “Merge Join” step so that the source and target data sets are matched in a single pass, minimizing the number of times the data in the source and target tables is processed. This results in lower memory consumption and good performance. Note that “Merge Join” requires the input data on both sides of the step to be sorted on the join keys.
Find below the wiki link explaining the Merge Join step.
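To make the single-pass idea concrete, below is a minimal sketch of a sorted merge join over two key-sorted inputs: source keys that match a target key would be routed to an update branch, and unmatched ones to an insert branch. The keys and routing shown are illustrative assumptions, not the step's actual code.

```java
import java.util.Iterator;
import java.util.List;

public class MergeJoinSketch {
    // Sketch of the sorted-merge idea behind the Merge Join step:
    // both inputs MUST be sorted on the join key. Each stream is read
    // exactly once, advancing whichever side has the smaller key.
    static void merge(List<Long> sourceKeys, List<Long> targetKeys) {
        Iterator<Long> src = sourceKeys.iterator();
        Iterator<Long> tgt = targetKeys.iterator();
        Long s = src.hasNext() ? src.next() : null;
        Long t = tgt.hasNext() ? tgt.next() : null;
        while (s != null) {
            if (t == null || s < t) {
                System.out.println("INSERT key " + s);   // no match in target
                s = src.hasNext() ? src.next() : null;
            } else if (s.equals(t)) {
                System.out.println("UPDATE key " + s);   // match: route to update
                s = src.hasNext() ? src.next() : null;
                t = tgt.hasNext() ? tgt.next() : null;
            } else {
                t = tgt.hasNext() ? tgt.next() : null;   // target-only key: skip
            }
        }
    }

    public static void main(String[] args) {
        merge(List.of(1L, 3L, 5L), List.of(2L, 3L, 4L));
        // Prints: INSERT key 1, UPDATE key 3, INSERT key 5
    }
}
```

Because each input is consumed in order exactly once, this pattern avoids repeated scans of the target table entirely, which is why the sorted-input requirement is worth the extra sort step.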