Preceding loads, data types and distinct values

Intro

In my last post I have looked at the overhead using a preceding load causes. In this post, I am going to look closer at whether the data type (or what comes closest to a datatype in QlikView) or the number of distinct values have a bigger influence on the performance. The result surprised me…

Number of distinct values

We’ll compare the performance when copying one table into another (with an additional preceding load layer) mit fields of different numbers of distinct values. Each time we copy 1 column and 1000 rows, but these contains either 100, 200, 300, 500, 800 or 1000 distinct values. Which result did I expect? According to a post on the QlikView Design Blog on Symbol tables and Bit-stuffed pointers I expected fewer bits to be read and written with the fewer distinct values, as the pointer in the symbol table is respectively smaller (100 values = 7 Bits, 1000 values = 10 Bits).

Here is the result:

PL with Float, 1c, 1000r, 100dv to 1000dv

To my surprise, I could hardly find any difference between the variations.

Different data types

Let’s now look at the comparison of different data types. Here we copy (again with an additional preceding load layer) 1 column and 1000 rows. But now with a constant 100 distinct values and different data types.

Integer: 2, 4, 6, etc. (no closed sequence of Integers to avoid the sequential Integer optimization)

Float: 0.9, 1.9, 2.9, etc.

Text: 1Text, 2Text, 3Text, etc.

According to above logic, they should all behave the same, as the number of bits in the symbol table is the same. Here is the result:

PL with different data types, 1c, 1000r, 100dv

The Float and Integer processes are at roughly the same level. But they are both about 80% slower, than the Text datatype. Again a result, which I didn’t expect.

Copying with a Preceding Load

Out of curiosity, I repeated above scenarios again without the preceding load layer. First again the variation over the number of distinct values:

Copy with Float, 1c, 1000r, 100dv to 1000dv

And now the variation of the datatype:

Copy with different data types, 1c, 1000r, 100dv

As we can see, the result of the variation over number of distinct values is the same as above (there seems to be no dependence). But what is interesting here is, that we also don’t see a dependence on the data type anymore.

Schlussfolgerung

Finding a conclusion is difficult on this one so far. Either I have not understood something about the way QlikView processes the data or I have made a mistake somewhere. But what’s the use of doing experiments, when the results are always what you expect, right? ;-)

If you can contribute anything to the demystification of the above, I’d be happy to receive a comment either here in the blog or on the Qlik Community thread I created for the discussion. I’ll update this post, once we have created some clarity on this.

If you found this post interesting and don’t want to miss the next, why not subscribe to my RSS feed on the left.

Sandro

Welcome to my new blog

Welcome to my new blog!

My name is Sandro Krumbein and I would like to use this platform to help bridge several information gaps that have been bothering me while working with QlikView. A QlikView developer is continually confronted with the following two questions:

  1. How do I achieve functionality x?
  2. What is the best way to make functionality x available?

With this blog, I want to help you find the fasted approach. Sometimes, one solution is clearly preferable to another, but in some situations, differences are not that apparent, or the best approach depends on many other factors. Finding the “optimal” solution is thus not always easy. Establishing universally-applicable rules is often difficult. Of course, you have to consider if optimization is even worth the effort. For example, if we are dealing with only 10,000 records, often any solution will do. However, when we are dealing with 10,000,000 records, things might look a little different.

For your future decision-making, you will find information presented here that will help in the search for an appropriate solution by providing you with a better sense of the results certain design decisions can achieve.

The blog should shed light on questions such as:

  • is it better to use applymap or stay with inner join?
  • preceding load at any cost to keep the code uncluttered, or is it better to duplicate certain parts but thereby increase performance?
  • is a join using a numeric field faster than using a text field?
    • what about when a field contains more/less distinct values?
  • etc.

If the answers to these types of questions are of interest to you, subscribe to the RSS Feed or visit regularly. I am always happy to receive questions, constructive criticism, ideas, or suggestions regarding what questions need answering for you. You can use the comment section of a post, or send an e-mail to feedback@ks-qlik.com.

Best regards,
Sandro

Beware of preceding loads!

Intro

For all, who don’t know what preceding loads are, the following blog posts are a great introduction:

https://community.qlik.com/blogs/qlikviewdesignblog/2013/03/04/preceding-load
http://www.quickintelligence.co.uk/preceding-load-qlikview/
http://qlikviewcookbook.com/2014/08/preceding-load-is-elegant/

The Authors mention, that after they got to know that functionality for the first time, they wondered, how they ever did without it. I totally agree: many problems in loadscripts can be realized elegantly, simply and with great readability with the help of preceding loads.

Preceding load Overhead

You have to be aware though, that the handoff from one layer to the next comes with a certain overhead. Let’s consider the following ficticious code fragement:

No preceding load, several columns, 2 columns

It simply copies columns from one table to another and renames them at the same time. The time required to do this develops linearly from 1 column on the left, to 5 columns on the right. Kind of what we expect.

result - no preceding load, x columns

With an additional preceding load level, it would look like this (Table1 only has those two columns in this case):

Preceding load, several columns, 2 columns

For this example, we have a linear development as well, but the fifth column already needs 183% more time, while it had only been 67% more in the example above.

result - preceding load, x columns

But only when we now let the two options compete against each other, we see the entire the difference in its entirety.

result - no preceding load vs preceding load, x columns

One column with a preceding load already needs 347% more time, than without it. At five columns we have already reached a premium of 656% (1.264% against 167%).

Test Setup

Table 1 contains 1.000 rows of randomly generated numbers. Then this table is copied 100 times into another (copy to Table2_1, Table 2_2 etc.) over 100 runs (10.000 copies in total). The time it takes to copy 100 times is measured and then broken down to one copy. In each of the 100 runs, we randomly decide, which option we want to run. With all this, I hope to have created enough randomness and increased the measuring period enough to get robust data.

Once the test setup has further stabilized, I’ll publish another post on that.

Conclusion

Without a doubt, preceding loads are extremely handy and elegant and every developer should have them in her toolkit. You also shouldn’t necessarily loose sleep over the performance impact right away. At 1.000 rows and 5 columns, we are talking about a difference of 0,01 sec on my machine. But when the number of rows rises, so does the absolute impact on performance.

10.000 rows: 0,09
100.000 rows: 0,91
1.000.000 rows: 9,11

And we all know, that 1.000.000 rows in a QlikView application is the norm, rather than the exception.

Outlook

Next week, we’ll dig further into this and see, if the performance impact depends on the “data type” (integer vs. string, probably not) or on the number of distinct items in the field (more likely).

If you found this post interesting and don’t want to miss the next, why not subscribe to my RSS feed on the left.

Sandro