Sunday, November 28, 2010

Database Index - B tree, B+ tree

Why database index are used
Database index helps improve the data retrieval speed in response to certain search conditions.

Drawbacks with database index
slower write, index needs to be updated on sql updates
storage space, index may grow based on the column values

Type of Single level Database index

Fig. Sequential indexes

Ordering key field is used to physically order records on disk

Clustered index dictates the order of physical data in a table.
  • can define only single cluster index on the table though it may have multiple attributes.
  • clustering index has one entry for each distinct value hence non dense index
  • when data is read in some sequential order or for some range it is efficient because it requires less data block reads.
  • Some database define primary key as clustered index.
  • If clustered index is defined on the primary key it is called Primary index.
  • index entry is stored for only first record (anchor record)in each block, so its sparse index.
  • Problem with it is insertion and deletion of records. Here we have to move the physical records as well as change the index entries also
Non Clustered index Data is present in random order in the table but ordering is specified by database index.
On the ordered index we can perform the binary search.
good for table whose values modify frequently.


Secondary index are dense and so one entry for each record and hence a secondary index needs more storage. There can be number of secondary index on table
 
Search Tree
Index sequential file has disadvantage that performance degrades as file grows.
the goal is to reduce the height of tree leading to less disk access
  • search tree of order p has at most p children or
  • (p-1) key values,ordered set of values, and p pointers, pointing to child nodes
  • p1 k1 p2 k2 ....pq-1 kq-1 pq
  • if any node has k children it has k-1 key values

Fig. search tree of order 4
Following condition for ordering of the keys
  • each left node has key values less than parent node key value and each  right node key value greater than parent node key value
  • keys in node are in ascending order

  • Used for searching records on disk
  • Each key value in tree is associated with a block or record pointer.
  • the search tree itself can be stored on disk with each node stored in blocks
  • insertion algorithms do not guarantee search tree is balanced which is important otherwise nodes at very high level would require many block access.
  • deletion may leave nodes empty and may lead to more levels.

B Tree
B Tree has additional constraints that ensures balanced tree and not much space wastage on deletions
insertion and deletion are handled especially when node is full or when node becomes less than half empty on deletion
B Tree of order p is a search tree with order p with additional constraint that
  • all leaves are on bottom level
  • all internal nodes(except root) have at least ceil(p/2) nodes or it says each node should have at least half the maximum number of children
  • root node can have as few as 2 children
  • each leaf node must contain at least ceil(p/2) -1 keys

Fig. B tree of order 5
Here all internal nodes has at least ceil(5/2)= 3 children or 2 key values and each leaf node has at least 2 keys.


Any index structure other than B tree is subjected to overflow. Overflow is where any changes made to tables will not have records added into the original index structure, but rather tacked on the end.


Problem 1 Consider search field V= 9bytes block size B= 512bytes record pointer Pr=7bytes and block pointer Pb is 6bytes
How many childrens can a node have or order of tree so that each node can come in single block
Solution
let p be child pointer
a node consists of
key value + data pointer + child pointer
total size of key values ( p-1 ) * V
total size for data pointers (p-1)*Pr
total size for child pointers p*Pb
i.e (p*Pb) + (p-1)(V+Pr) <= B to fit in single block
p= 23


B+ Tree
variant of B tree datastructure
A B+ tree of order m is a tree where each internal node contains up to m branches (children nodes) and thus store up to m-1 search key values
order of tree measures the capacity of node or number of children nodes in internal node
B+ tree combines features of index sequential file organization with B tree
most widely used multilvel index implementation
For internal node
, where q <= p
For leaf node


where Pnext points to next leaf node
Requirements for maintaining B+ -tree
every node must contain at least ceil(n/2) pointers except for the root (which should have at least 2)
balanced: for ensuring good performance

Searching for key field value K

1) visit the root node, looking for the smallest key value greater than K. Suppose the value is Ki .
2) follow pointer Pi to another node
  - if K < K1 , then follow P1
    if K > Kmax , then follow Pmax
3) repeat step 2 until reaching a leaf node
Differences of B+-tree from B-tree

1. In B+ tree, data pointers are stored only at the leaf nodes

- more entires can be packed into internal (non-leaf) nodes of a B+ -tree than for a similar B tree meaning
B+ tree has larger fan out
- for the same block (node) size, the order p will be larger for the B+ -tree than for the B tree meaning   improved search time
- B tree eliminates redundant storage of search key values
- faster search in some cases to find desired search key values before reading a leaf node in B tree

2. Leaf and non-leaf nodes are of the same size in B+ tree, while in B tree, non-leaf nodes are larger
- complication in storage management for index structures

3. Deletion in B tree is more complicated
- in B+ tree, deleted entry always appears in a leaf
- in B tree, it can be a non-leaf node, requiring replacement by the proper value from the subtree of the node containing the deleted entry.

Problem 2 If maximum number of keys in b+ tree is 5 what is minimum number of keys in internal node and in root node.
Solution
Maximum number of childrens that a node can have(order of tree) = maximum number of keys + 1
so order of tree(b) = 5 + 1 =6
number of childrens in internal nodes(m)  is constrained by
ceil(b/2) <= m < =b
minimum number of internal nodes =  ceil(b/2) = ceil(6/2) = 3
so number of keys in that node = 3-1 = 2
 number of childrens of root node in b+tree is constrained by
2 <= m <=b
number of childrens in leaf node = no childrens of leaf node
number of keys in leaf node is constrained by
ceil(b-1/2) <= number of keys in leaf < = b-1

Problem 3
Consider a relation with 2M tuples stored in heap structure If we create a B+ tree index on this relation  with block filled to maximum capacity what would be size of tree . Assume following data
Disk page size B= 1000byte
key value(order key of relation) size V= 40bytes
disk page pointer Pb=10byte

Solution

If we were to calculate the maximum that can be stored we approach it top to bottom but if we have the entries that has to stored we approach bottom to top

Data record pointer will be stored in leaf pages since there are 2M tuples and its on order key so we will need 2M data record pointers but how many entries can a leaf page have

1) Find Number of entries a leaf page can have
leaf node stores same number of key values and record or data pointer along with a single next block pointer
so if we assume number of leaf entries to be n
n(V+ Pb) + Pb <= (size of page)  --since maximum number of entries should be allowed in a page
n = floor(size of page/(V+Pb))  -- ignoring the next block pointer for simplicity
n= 20

2) Number of leaf pages required to store 2M records = (Number of required index entries )/ (Number of index entries per leaf page)
= 2M/20 = 100000
3) Now we need to find number of entries that can come in internal nodes or branching factor m
consider number of index entries in internal node as n
each internal node will have a n-1 key values and n child pointers
(n-1)* V + n * Pb <= (B)
n =floor((B +V)/(Pb+V))
n=20

4) No we calculate Number of pages at level above leaf
there are 100000 leaf pages so we need this much index entries at level above it
so number of pages required at level above leaf = 100000 / 20 = 5000
Similarly we need to get to root that is when page is = 1
ceil(10000/20) ,ceil( 5000/20) ,ceil( 250 / 20), ceil(13 / 20) respectively for level above leaf
which is log20(100000) = 4
so we have total 5 level of pages and access to record would take 5 b tree block access  + 1 data block access = 6 disk access


References:
http://cis.stvincent.edu/carlsond/swdesign/btree/btree.html
http://www.cecs.csulb.edu/~monge/classes/share/ http://www.cs.sjsu.edu/~lee/cs157b/B_Trees_And_B__Trees.ppt
http://www.cs.virginia.edu/~son/662.pdffiles/662.physical.pdf

No comments:

Post a Comment