Mon 5 Jun 2006
TreeMaps are not Hashes
Posted by dkaz under Java, Python, Programming
One of my co-workers recently called TreeMaps “hashes”, which is imprecise at best, but can be easily excused as a by-product of his Pythonista upbringing.
java.util.TreeMap, in fact, is implemented as a Red-Black tree - a data structure that has not had a definition in my brain for years. As a refresher course, I decided to Wikipedia the B-Tree family and blog the highlights.
First, the BINARY TREE:
A binary tree is a tree data structure in which each node has at most two children.
- A directed edge connects the parent to the child.
- A node that has no children is called a leaf.
- The depth of a node n is the length of the path from the root to the node. The set of all nodes at a given depth is sometimes called a level of the tree.
- The height of a node n is the length of the path from the node n to its furthest leaf.
- Nodes that share parents are called siblings.
- If a path exists from node p to node q, then p is an ancestor of q and q is a descendant of p.
- The size of a node is the number of descendants it has including itself.
Second, a BINARY SEARCH TREE:
In computer science, a binary search tree (BST) is a binary tree which has the following properties:
- Each node has a value.
- A total order is defined on these values.
- The left subtree of a node contains only values less than or equal to the node’s value.
- The right subtree of a node contains only values greater than or equal to the node’s value.
The major advantage of binary search trees is that the related sorting algorithms and search algorithms such as in-order traversal can be very efficient.
Binary search trees are a fundamental data structure used to construct more abstract data structures such as sets, multisets, and associative arrays.
If a BST allows duplicate values, then it represents a multiset. This kind of tree uses non-strict inequalities, so everything in the left subtree of a node is less than or equal to the value of the node, and everything in the right subtree is greater than or equal to the value of the node.
If a BST doesn’t allow duplicate values, then the tree represents a set with unique values, like the mathematical set. Trees without duplicate values use strict inequalities, meaning that the left subtree of a node only contains nodes with values that are less than the value of the node, and the right subtree only contains values that are greater.
Third, a SELF-BALANCING BINARY SEARCH TREE:
In computer science, a self-balancing binary search tree or height-balanced binary search tree is a binary search tree that attempts to keep its height, or the number of levels of nodes beneath the root, as small as possible at all times, automatically. They are one of the most efficient ways of implementing associative arrays, sets, and other data structures.
Most operations on a binary search tree take time directly proportional to the tree’s height, so it is desirable to keep the height small. Ordinary binary search trees have the primary disadvantage that they can attain very large heights in rather ordinary situations, such as when the keys are inserted in order. The result is a data structure similar to a linked list, making all operations on the tree expensive. If we know all the data ahead of time, we can keep the height small on average by adding values in a random order, but we don’t always have this luxury, particularly in on-line algorithms.
Self-balancing binary trees instead solve this problem by performing transformations on the tree, such as tree rotations, at key times, in order to reduce the height. Although a certain overhead is involved, it is justified in the long run by drastically decreasing the time of later operations. Note that the height must always be at least the ceiling of log n, since there are at most 2k nodes on the kth level; a complete or full binary tree has exactly this many levels. Balanced BSTs are not always so precisely balanced, since it can be expensive to keep a tree at minimum height at all times; instead, they keep the height within a constant factor of this lower bound.
Finally, a RED-BLACK TREE:
A red-black tree is a binary search tree where each node has a color attribute, the value of which is either red or black. In addition to the ordinary requirements imposed on binary search trees, we make the following additional requirements of any valid red-black tree:
- A node is either red or black.
- The root is black.
- All leaves are black.
- Both children of every red node are black. (i.e. Every red node must have a black parent.)
- All paths from any given node to its leaf nodes contain the same number of black nodes.
These constraints enforce a critical property of red-black trees: that the longest possible path from the root to a leaf is no more than twice as long as the shortest possible path. The result is that the tree is roughly balanced. Since operations such as inserting, deleting, and finding values requires worst-case time proportional to the height of the tree, this theoretical upper bound on the height allows red-black trees to be efficient in the worst-case, unlike ordinary binary search trees.
To see why these properties guarantee this, it suffices to note that no path can have two red nodes in a row, due to property 4. The shortest possible path has all black nodes, and the longest possible path alternates between red and black nodes. Since all maximal paths have the same number of black nodes, by property 5, this shows that no path is more than twice as long as any other path.
