June 2003 | Boosting your e3000 productivity

Zero Secondaries

If secondary entries slow down a master dataset, would zero secondaries be ideal? No! Just the opposite. If your
master dataset has no secondaries, it means that the key values are
not hashing randomly. Instead, they are hashing into some regular
pattern (usually into adjacent locations). Figure 1 below shows what
a properly random dataset looks like, with P for primary and
S for secondary, and underneath it what a dataset with no
secondaries usually looks like:
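To see why a healthy hashed master dataset always contains some secondaries, here is a toy simulation in Python. This is not IMAGE's actual hash algorithm; the key format, capacity, and placement of secondaries are invented for illustration only:

```python
# Toy model of a hashed master dataset (NOT IMAGE's real hash function).
# Each key hashes to a "primary" location; a key whose location is already
# taken becomes a "secondary", placed in a nearby free slot.
import hashlib

CAPACITY = 20

def toy_hash(key: str) -> int:
    """Pseudo-random primary location for an ASCII key (illustrative only)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % CAPACITY

slots = {}                              # location -> 'P' or 'S'
for i in range(12):
    key = f"CUST{i:04d}"                # hypothetical ASCII key values
    loc = toy_hash(key)
    if loc not in slots:
        slots[loc] = 'P'                # primary: its hashed location was free
    else:
        # secondary: scan forward for a free slot (simplified placement)
        probe = (loc + 1) % CAPACITY
        while probe in slots:
            probe = (probe + 1) % CAPACITY
        slots[probe] = 'S'

# Print the dataset layout: P = primary, S = secondary, . = empty
print(''.join(slots.get(i, '.') for i in range(CAPACITY)))
```

With truly random hashing, the entries scatter across the dataset and a few of them inevitably collide and become secondaries; that is the normal, healthy picture.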
In the second example, all the entries are primaries and they are all clustered in one area, with no empty space between them. How could this happen?

The reason for this pattern is usually binary keys and transaction numbers. For example, consider typical invoice numbers that start at 030001 and increase by one: 030002, 030003, 030004, etc. If these invoice numbers are stored in a binary field (J2, I2, P8, etc.) instead of an ASCII field (X, Z), IMAGE disables random hashing and calculates the primary location as follows instead: divide the rightmost 32 bits of the binary value by the dataset capacity, and use the remainder as the primary location (dividing by the capacity ensures that you produce an entry number between 0 and the capacity).

If you try a few examples, you will see that monotonically increasing key values produce adjacent entry numbers. Assume that the capacity is 1000:

030001 => 1
030002 => 2
030003 => 3

What is so bad about that? It seems perfect. Each DBFIND or keyed DBGET will take a maximum of one disk read to find each key value! Very fast.
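The placement rule for binary keys can be checked with a few lines of Python. This is a sketch of the rule as described above, not IMAGE's source code:

```python
# Sketch of the placement rule described above for binary keys:
# primary location = (rightmost 32 bits of the key) mod capacity.
CAPACITY = 1000

def primary_location(binary_key: int, capacity: int = CAPACITY) -> int:
    return (binary_key & 0xFFFFFFFF) % capacity

# Monotonically increasing invoice numbers land in adjacent locations:
for invoice in (30001, 30002, 30003, 30004):
    print(invoice, "=>", primary_location(invoice))   # 1, 2, 3, 4
```

Note that 040001 mod 1000 is also 1, which is exactly the collision problem discussed next.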
Unfortunately, the bad part happens when you change the sequence of invoice numbers to 040001, 040002, 040003, ...

040001 => 1

It is quite possible that the new invoice numbers will hash to the same locations as the last batch. This causes a collision. The primary location is already occupied, and DBPUT must find a free space for a secondary entry. But the next physical entry is also occupied. In fact, there may be no free entries for quite a while. The result of this wraparound collision is that DBPUT may have to search thousands of disk pages. And the next DBPUT will look at the original location + 1, then do the long search completely over again! If this happens to you, it can freeze your database and your entire application.

What can you do?

1. Enable Master Dataset Expansion (MDX) on this dataset. I have never actually done this, but it should help quite a bit, because after searching a short time for an empty space, DBPUT will then go to the overflow area instead. However, this solution is not permanent. The overflow entries are on a single chain that will get longer and longer as you add entries, slowing down inquiries. Before that happens, you should implement one of the other solutions below.

2. Increase the capacity of the master dataset and hope that the clusters do not overlap at the new capacity.

3. Change the key field from a binary type to an ASCII type, such as X or Z. This of course means changing all the COBOL copylibs and recompiling all the programs, which may be quite time-consuming.

4. Convert the application to Eloquence on HP-UX. Since it does not use hashing, Eloquence does not have these problems.

For more
articles on TurboIMAGE, go to www.robelle.com/tips/image.html
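The wraparound collision described above, where each DBPUT must scan a long run of occupied entries, can be sketched in Python. This is a simplified model: real DBPUT searches disk pages rather than a Python set, and the cluster size here is invented for the example:

```python
# Simplified model of a wraparound collision (NOT real DBPUT internals).
CAPACITY = 1000

def primary_location(key: int) -> int:
    return (key & 0xFFFFFFFF) % CAPACITY

# A first batch of 500 sequential invoice numbers fills locations 1..500.
occupied = {primary_location(k) for k in range(30001, 30501)}

def insert(key: int) -> int:
    """Place a key, returning how many locations had to be examined."""
    loc = primary_location(key)
    probes = 1
    while loc in occupied:
        loc = (loc + 1) % CAPACITY   # linear search, wrapping at capacity
        probes += 1
    occupied.add(loc)
    return probes

# The first key of the new batch hashes to location 1, which is taken,
# so the search must walk past the entire cluster before finding a free slot:
print(insert(40001))   # 501 locations examined
print(insert(40002))   # the next insert repeats almost the whole search
```

Each new insert starts back at its own primary location and re-scans the cluster, which is why the slowdown compounds as the batch continues; MDX would cap this search by diverting to the overflow area instead.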
Copyright The 3000 NewsWire. All rights reserved.