Pentaho: lookup component observations
The lookup, when caching, loads every row into memory. The whole result set is loaded by getRows(), which returns a List<Object[]>. Each list entry is a row; each element of the Object[] is a column.
The row Object[] is allocated with (resultset.getColumnCount + 10) slots, so for our 4 columns an Object[14] is allocated for each row. This costs (16 + 14*4) = 72 bytes per row. For 24 mln rows: 1.647 GB
On top of that comes the storage for the column values themselves. For the bookingdetail line this is:
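A rough sketch of the arithmetic above. The class and method names are made up for illustration; the 16-byte array base and 4-byte reference are the values from the sizes table further down, and the row count is rounded to 24 million, so the total is approximate.

```java
final class RowArrayEstimate {
    // Assumed sizes, taken from the sizes table in these notes.
    static final long ARRAY_BASE_BYTES = 16;
    static final long REFERENCE_BYTES = 4;

    /** Bytes for one row array holding (columns + padding) references. */
    static long rowArrayBytes(int columns, int padding) {
        return ARRAY_BASE_BYTES + (long) (columns + padding) * REFERENCE_BYTES;
    }

    public static void main(String[] args) {
        long padded = rowArrayBytes(4, 10);  // Object[14], as allocated now
        long exact = rowArrayBytes(4, 0);    // Object[4], no spare slots
        System.out.println(padded);                // 72
        System.out.println(exact);                 // 32
        System.out.println(padded * 24_000_000L);  // 1728000000
    }
}
```

If these sizes hold, the 10 spare slots account for 40 of the 72 bytes per row.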
What | Type | Example | Calculation | Size (bytes) |
---|---|---|---|---|
key | String | GM\|76614952\|KOSTENPLAATS\|101720375W | 40 + 35*2 | 110 |
organization | Long | | 24 | 24 |
exists_in_dv | Long | | 24 | 24 |
surrogate_key | Long | | 24 | 24 |
Total | | | | 182 |
For 24 mln rows this means 4.165 GB of memory.
Total for the whole cache is 5.812 GB (1.647 GB for the row arrays plus 4.165 GB for the values).
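The per-row value storage can be recomputed from the table above. This is only a sketch: the names are hypothetical, the String formula (40 + 2*n) and the 24-byte Long are the ones used in these notes, and the row count is rounded to 24 million, so the byte total is approximate.

```java
final class RowPayloadEstimate {
    static final long LONG_WRAPPER_BYTES = 24;  // from the sizes table
    static final long ROW_ARRAY_BYTES = 72;     // Object[14], see above

    /** String(n) as estimated in these notes: 40 bytes fixed plus 2 bytes per char. */
    static long stringBytes(int chars) {
        return 40 + 2L * chars;
    }

    /** One String key plus a number of Long columns. */
    static long payloadBytes(String key, int longColumns) {
        return stringBytes(key.length()) + longColumns * LONG_WRAPPER_BYTES;
    }

    public static void main(String[] args) {
        String key = "GM|76614952|KOSTENPLAATS|101720375W";
        long payload = payloadBytes(key, 3);
        long perRow = ROW_ARRAY_BYTES + payload;
        System.out.println(key.length());          // 35
        System.out.println(payload);               // 182
        System.out.println(perRow);                // 254
        System.out.println(perRow * 24_000_000L);  // 6096000000
    }
}
```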
Data collected while loading cache in new streaming code:
After load of 8.6 mln recs:
After 9.3 mln recs:
Experimental code changes
Rewrote lookup cache load:
- Load rows streaming instead of making a zillion copies.
- Force use of the read-only cache, which caches badly but is less horrible than DefaultCache
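One possible shape for a streaming load, as a sketch only. This is not the actual Pentaho code: the Cache class, the load method and the Iterator source are all hypothetical stand-ins (in practice the iterator would wrap the database result set). Each row is split into its key and result columns as it arrives, so no intermediate list of all rows is built.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

final class StreamingCacheLoadSketch {
    /** Hypothetical cache: parallel arrays of key columns and result columns. */
    static final class Cache {
        Object[][] keys = new Object[16][];
        Object[][] values = new Object[16][];
        int size;

        void add(Object[] key, Object[] value) {
            if (size == keys.length) {  // grow both arrays together
                keys = Arrays.copyOf(keys, size * 2);
                values = Arrays.copyOf(values, size * 2);
            }
            keys[size] = key;
            values[size] = value;
            size++;
        }
    }

    /** Consumes rows one at a time; only the trimmed key/value arrays are kept. */
    static Cache load(Iterator<Object[]> rows, int keyColumns, int valueColumns) {
        Cache cache = new Cache();
        while (rows.hasNext()) {
            Object[] row = rows.next();
            cache.add(Arrays.copyOfRange(row, 0, keyColumns),
                      Arrays.copyOfRange(row, keyColumns, keyColumns + valueColumns));
        }
        return cache;
    }

    public static void main(String[] args) {
        List<Object[]> source = new ArrayList<>();
        source.add(new Object[] {"A", 1L, 2L, 3L});
        source.add(new Object[] {"B", 4L, 5L, 6L});
        Cache cache = load(source.iterator(), 1, 3);
        System.out.println(cache.size);          // 2
        System.out.println(cache.keys[1][0]);    // B
        System.out.println(cache.values[1][2]);  // 6
    }
}
```

Note that this still creates two Object[] per row (one for the key, one for the result).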
Loading 10 million rows now shows:
The 20 million Object[] instances are caused by the expensive split into lookup data and result data (one Object[][] array for each).
The number of Long objects is caused by very sad storage of the cached data: 2 of the columns hold a small int value, but they are stored as Long instances by reference (at a cost of about 460 MB).
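To put the Long cost in perspective, here is a back-of-the-envelope sketch. It uses the 24-byte Long wrapper and 16-byte array base from the sizes table below; the comparison with primitive int columns is only an illustration of what a denser layout could cost, not something that was implemented, and all names are hypothetical.

```java
final class BoxedLongCost {
    static final long LONG_WRAPPER_BYTES = 24;  // from the sizes table
    static final long ARRAY_BASE_BYTES = 16;
    static final long INT_BYTES = 4;

    /** Bytes spent on Long instances alone (references not counted). */
    static long boxedBytes(long rows, int columns) {
        return rows * columns * LONG_WRAPPER_BYTES;
    }

    /** Bytes if each column were instead one int[] with an entry per row. */
    static long primitiveIntColumnBytes(long rows, int columns) {
        return columns * (ARRAY_BASE_BYTES + rows * INT_BYTES);
    }

    public static void main(String[] args) {
        System.out.println(boxedBytes(10_000_000L, 2));               // 480000000
        System.out.println(primitiveIntColumnBytes(10_000_000L, 2));  // 80000032
    }
}
```

480,000,000 bytes is roughly the 460 MB mentioned above when expressed in MiB.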
Sizes of Java structures on a 64bit JVM
The following seem to be the sizes of Java objects. Please remember that complete Java objects (instances) are always 8-byte aligned, so an object's size is always a multiple of 8 bytes.
Size (bytes) | Rounded size | What |
---|---|---|
12 | 16 | An instance of Object (rounded size is 16 bytes due to alignment) |
4 | | Size of an int in an object |
8 | | Size of a long or double in an object |
4 | | Object pointer (surprising, but probably due to pointer compression) |
16 | 16 | Array base size (12 bytes object, 4 bytes length). Followed by length * datatype size |
40 + 2*n | | String(n): 24 bytes for the String object (object, char[] reference, start, end), 16 bytes for the char[] plus 2*n bytes for the characters |
24 | 24 | Long wrapper |
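A small sketch of how the rounded sizes follow from 8-byte alignment. The field sizes are the ones listed in the table; the class and method names are hypothetical, and real layouts depend on the JVM and its settings.

```java
final class AlignedSizes {
    static final int HEADER = 12;    // plain Object
    static final int REFERENCE = 4;  // compressed object pointer
    static final int INT = 4;
    static final int LONG = 8;

    /** Rounds a size up to the next multiple of 8. */
    static int align8(int bytes) {
        return (bytes + 7) & ~7;
    }

    /** Array: header plus int length, followed by the elements. */
    static int arrayBytes(int length, int elementBytes) {
        return align8(HEADER + INT + length * elementBytes);
    }

    public static void main(String[] args) {
        System.out.println(align8(HEADER));                          // 16, Object
        System.out.println(align8(HEADER + LONG));                   // 24, Long wrapper
        System.out.println(align8(HEADER + REFERENCE + INT + INT));  // 24, String object
        System.out.println(arrayBytes(14, REFERENCE));               // 72, Object[14]
        System.out.println(arrayBytes(35, 2));                       // 88, char[35]
    }
}
```

If alignment applies to the char[] as well, a 35-character string's char[] takes 88 bytes rather than the unrounded 16 + 70 = 86, so the 40 + 2*n formula slightly underestimates whenever 2*n is not a multiple of 8.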