|
|
A Practical Guide to
Making Use of B-trees
By Wirt Atmar
Editor's Note: The Express 3 release of
MPE/iX 5.5 , now shipping
to HP customers on support, has the first
built-in indexes for
TurboIMAGE included as a free enhancement.
Known informally as
b-trees, the upgrade promises radically
faster database access
for HP 3000 systems.
Youre going to like the new
b-trees a lot if you write queries
into IMAGE datasets.
The b-trees are attached only to master
datasets (automatic or
manual, it makes no difference). But the
masters are, of course,
themselves attached to detail datasets.
What used to be a two-level
structure is now three-level. But you
dont have to think about
it in that manner. You cant directly
access the information in
the b-tree itself. The b-trees are there in
support of queries
that you couldnt ask before in IMAGE
without a serial search,
such as:
find name is GOR@
find city > Chic
find amount ib 50.00,99.99
Each individual query language that will
use the new b-tree feature
may use slightly different syntaxes, but
the basic behaviors will
be the same, nonetheless. Each one of these
queries is now a chained
search, rather than a serial search,
although they may not look
like it.
If the data items NAME, CITY, and AMOUNT
are search items in a
detail dataset (or key items in a master
dataset), and if a b-tree
is attached to the master
dataset appropriate for each of these
search items, then each of the query
formats above will go out
and first create what is now called a
superchain. That is, the
process first searches the b-tree attached
to the master dataset
for all of the names that begin with GOR
and creates a list of
all of the fully-specified names that exist
in the master dataset
that meet this criterion (GORDON, GORE,
GORFINKLE, GORTON, GORVEY,
etc.). Each one of these names chain
is then searched in turn
and every qualifying entry is marked,
exactly as it always has
been, as if it were just a single name you
were looking for. All
of this is done completely automatically,
without you having to
do anything special at all.
Getting started
As for turning the b-tree feature on, that
too is surprisingly
simple. Three new commands have been added
to DBUTIL. They are:
ADDINDEX, DROPINDEX, and REBUILDINDEX. The
commands are fully
explained in DBUTILs help. I
personally recommend adding b-tree
indexes to every master dataset in your
databases unless you
have some overwhelming reason to argue that
no benefit would be
obtained from one or another specific
master dataset having a
b-tree attached.
B-trees also offer the possiblity of
changing the manner by which
IMAGE databases have been used in the past.
In the examples I
gave above, AMOUNT is clearly a real
number. Ordinarily, it would
have been considered a little declasse to
put a hashing index
on a real number, but I think that changes
now. An automatic master
could be attached to a real number field,
such as AMOUNT, and
with a b-tree attached, you can now ask
range questions and get
answers back almost immediately.
What performance increase should you expect
with b-trees over
serial searches? In our in-house
measurements, were only seeing
about a 100x increase, but thats
simply because our databases
are too small. Were not a production
shop, so we dont have enormously
large datasets. But if the NAME field
existed in a four-million
record dataset and the query above only had
thirty records in
total for all of the names that began with
GOR, you could expect
potentially a 5000x increase in retrieval
speed, as compared to
an MR-NOBUF, high-speed serial search.
Thats not a bad bit of
performance enchancement especially
given the price of the enhancement.
You almost dont have to do any
benchmarks to understand how well
the new b-trees are implemented. Since the
b-tree is attached
to the master dataset, the b-tree is
automatically connected to
all of the detail datasets that any single
master has paths to.
The KSAM b-tree maintains an exact
one-to-one relationship with
the values held in the master dataset, data
item value-for-data
item value. Thus, for most transactions
because most transactions
do not cause a master datasets search
value to be either added
or deleted there will be absolutely
no performance hit due to
the presence of the b tree. You simply
cant do better than that.
The addition or deletion of records from
the detail datasets will
proceed as they always have in the past.
The appropriate search
chains for that value in that dataset are
simply made longer or
shorter. The master and its attached b-tree
are untouched.
The only time the b-tree is (automatically)
modified is when a
new indexing value is added to the master
dataset or when it is
deleted (which occurs only when it
disappears from the last detail
dataset of the sets which the master
dataset indexes). Both of
these conditions are relatively rare events
for most of the indexing
values held in most databases. The only
possible exception would
be for search fields such as order numbers
that are uniquely created
on the fly. In this case, the first dataset
to add the order number
would pay a small price the addition
of one new search value
to the master dataset, as it always has
been in the past, and
one new search value to the b-tree.
Thats a new additional cost,
but it is very small. What you get for that
small cost is that
every subsequent detail dataset to which
this new order number
value is written during this transaction is
free, without penalty.
The single largest performance hit will
occur when you first turn
b trees on and build the KSAM indexes for
all of your masters.
Due to a variety of circumstances
the size of your database
being the primary one its impossible
to estimate how long that
will take, but it aint long. I drop
and add all of the b-tree
indexes to our databases on a regular
basis. Ive been pleased
with how fast it goes and we did
almost all of our testing on
a Series 922. No matter long it takes, in a
production circumstance,
it only has to be done once.
When you should index
IMAGEs R4 and E4 number types are
64-bit numeric formats where
most of the information about the value in
a real number resides
in the left-most bits. For small, integer
numbers stored in R4/E4
formats, the right-most 32 bits will all be
zero and that represents
a catastrophe for the hashing algorithm
that IMAGE uses. All of
these small, integer R4 numbers will hash
to the same address
in the master dataset, leading to one
incredibly long synonym
chain (and abyssmal performance)!
IMAGE uses the right-most 32 bits because
that is where most of
the information lies for an integer number.
Integer-based formats
begin counting in the right-most (least
significant) bit. Real
numbers, on the other hand, begin counting
at the other end, more
or less at the left-most (most significant)
bits and thats
where the greatest amount of
information (unexpectedness) lies
for a real number. IMAGE, unfortunately,
doesnt discriminate
between reals and integers and always
hashes its address values
based on the right-most bits, leaving us
with the R4 problem.
However, if the real numbers are stored as
R2/E2 (32-bit) numbers,
then there is no real problem (pun
intended), so long as we expect
to see a fair diversity in the numeric
values in the dataset.
If, by the worst possible luck, the
real-numbered values in the
dataset predominately represent a harmonic
series of the current
capacity of the master dataset, then
well again be saddled with
a substantial amount of clustering and long
synonym chains, but
the cure for such a condition is, as it
always has been, to simply
change the capacity of the master.
Nonetheless, such resonances
are an extremely unlikely condition, by any
measure.
So, the bottom-line moral is: dont go
about indexing your R4/E4
numeric fields without forethought.
Absolutely dont index your
R4/E4 fields if they are likely to contain
only small number,
integer values. However, R2/E2 numbers
shouldnt give you too
many problems.
What b-trees can mean to you
The mental model that we preach to every
one of our users, CEO
and DP manager alike, is to think of IMAGE
as a set of steel filing
cabinets, the detail datasets being the
large steel drawers, marked
with little white cards on the front,
saying things such as JOBS,
INVOICES, CHECKS, etc., and the master
datasets as being 3x5 indexing
cards that sit on top of the filing
cabinets.
While we actually do explain how b-trees
work to everyone, most
people simply sit there and politely
listen. Its all interesting,
of course, in the same way that its
interesting for you to know
how the electrical capacitance changes in
your car engines cylinders
as the piston nears the top of its stroke,
but thats all it is:
interesting. Most people merely want to
know how the feature affects
their work and how might they take
advantage of it.
The important take-home lesson of b-trees
is that you no longer
need to know the search value in IMAGE in
its entirety in order
to create a high-speed, indexed search.
While we debated for some
time what to call the new, b-tree-enhanced
search items, we eventually
settled on labeling them as
full and simple search items,
with simple being a standard,
hashed IMAGE search item, where
you have to know the value in its entirety,
and a full search
being one that you can ask for either a
high speed range or a
generic search (if you know the beginning
of the value and its
a text field).
CSY is to be sincerely congratulated for
the quality of work theyve
done on b-trees.
Wirt Atmar is a member of the SIGIMAGE
Executive Committee and
has been leading the effort for IMAGE
b-tree access since the
1980s.
|
|