For anyone who doesn’t know what preceding loads are, the following blog posts are a great introduction:
The authors mention that after discovering this functionality for the first time, they wondered how they had ever managed without it. I fully agree: many problems in load scripts can be solved elegantly, simply, and very readably with the help of preceding loads.
Preceding Load Overhead
You have to be aware, though, that the handoff from one load layer to the next comes with a certain overhead. Let’s consider the following fictitious code fragment:
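The original fragment isn’t reproduced here, but a minimal Qlik load script matching the description might look like this (the table and field names are my own assumptions, not from the original post):

```
// Copy Table1 into Table2 in a single load pass,
// renaming each column on the way (illustrative names).
Table2:
LOAD
    Column1 AS NewColumn1,
    Column2 AS NewColumn2,
    Column3 AS NewColumn3,
    Column4 AS NewColumn4,
    Column5 AS NewColumn5
RESIDENT Table1;
```

For the timings, variants of this load with one to five columns would be measured.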
It simply copies columns from one table to another, renaming them at the same time. The time required grows linearly, from one column on the left to five columns on the right. Roughly what we would expect.
With an additional preceding load level, it would look like this (Table1 only has those two columns in this case):
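Sketched with the same hypothetical names as above, the two-level version might read:

```
// Same copy, but with an additional preceding load layer:
// the lower LOAD reads from Table1, the upper LOAD renames.
Table2:
LOAD
    Column1 AS NewColumn1,
    Column2 AS NewColumn2;
LOAD
    Column1,
    Column2
RESIDENT Table1;
```

Note that the preceding (upper) LOAD has no source clause of its own; it implicitly consumes the output of the LOAD directly below it.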
In this example, the growth is linear as well, but the fifth column already needs 183% more time, whereas it was only 67% more in the example above.
But only when we let the two options compete directly against each other do we see the difference in its entirety.
One column with a preceding load already takes 347% more time than without it. At five columns, the premium has reached 656% (1,264% versus 167%).
Table1 contains 1,000 rows of randomly generated numbers. This table is then copied 100 times into another table (Table2_1, Table2_2, etc.) over 100 runs (10,000 copies in total). The time it takes to perform the 100 copies is measured and then broken down to a single copy. In each of the 100 runs, we randomly decide which option to run. With all this, I hope to have created enough randomness and a long enough measuring period to get robust data.
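A sketch of such a test setup in Qlik script could look like the following; all names and details are my assumptions, not the original benchmark code:

```
// Generate 1,000 rows of random numbers (assumed field names).
Table1:
LOAD
    Rand() AS Column1,
    Rand() AS Column2
AUTOGENERATE 1000;

// Time 100 copies and break the result down to a single copy.
// Now(1) is evaluated at script execution; the difference of two
// timestamps is in days, so multiply by 86,400 to get seconds.
LET vStart = Now(1);
FOR i = 1 TO 100
    Table2_$(i):
    LOAD
        Column1 AS NewColumn1,
        Column2 AS NewColumn2
    RESIDENT Table1;
NEXT
LET vSecondsPerCopy = (Now(1) - vStart) * 86400 / 100;
```

The random choice between the plain load and the preceding-load variant in each run could be driven by a simple IF Rand() < 0.5 check before the loop.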
Once the test setup has stabilized further, I’ll publish another post on it.
Without a doubt, preceding loads are extremely handy and elegant, and every developer should have them in her toolkit. You also shouldn’t necessarily lose sleep over the performance impact right away. At 1,000 rows and 5 columns, we are talking about a difference of 0.01 seconds on my machine. But as the number of rows rises, so does the absolute impact on performance.
10,000 rows: 0.09 sec
100,000 rows: 0.91 sec
1,000,000 rows: 9.11 sec
And we all know that 1,000,000 rows in a QlikView application is the norm rather than the exception.
Next week, we’ll dig further into this and see whether the performance impact depends on the “data type” (integer vs. string; probably not) or on the number of distinct values in the field (more likely).
If you found this post interesting and don’t want to miss the next one, why not subscribe to my RSS feed on the left?