Recently was talking to a friend who said that big internet companies like Yahoo! and others have proprietary databases to be able to scale for huge customer base. Obviously with more than 50% of traffic for Yahoo! coming from it’s email application (refer http://www.alexa.com/data/details/traffic_details?url=yahoo.com), it needs to support huge customer base.
While there are benefits having proprietary data formats I personally feel going for such schemes is not really a good idea. Mainly because, many of these schemes which sacrifice the ACID properties of the RDBMS databases to achieve their additional speed fall apart big time to generate aggregate reporting. Unless you are a Yahoo! or Google, the cost of writing and maintaining such code is usually not justified.
There are ways to use traditional databases and achieve good performance. One of the techniques is using data partitioning (table partitioning). Table partitioning is a feature where the data in the table is partitioned into multiple segments each of which can potentially reside in a separate disk there by giving a better IO throughput.
There are 3 types of table partitioning. They are Hash Partitioning, Range Partitioning and List Partitioning. Below I will go in detail about each of these and which one is best used for what purpose.
List Partitioning: If the set of values of a column is fixed, then list partitioning is useful. For example, one can assume that the names of customers can only start from A to Z and hence have 26 different partitions. Think of this when you can write your column values as (a, b, c, d … fixed-list) and there are several records for each of these values.
Range Partitioning: If the set of values is large and the queries deal with a subset of these values, then range partitioning is usually the right choice. For example, order date is a good range partitioning candidate since usually one is interested in all the orders placed in the last one month, last one quarter or last one year. So, depending on the volume of orders, the size of the partitioning can be by month, quarter or year. Also, range partitioning is the only possibility to be able to keep adding partitioning as needed. Think of this when you can write your column values as (a-b, c-d, e-f …) and there are several records between each of these ranges.
Hash Partitioning: Finally, Hash partitioning is useful when the values of the column can be too many but range is not really the right choice. For example, an account number, though has ranges, since accessed randomly is a good candidate for hash partitioning. The idea here is, using a hashing function, each value is mapped to one of the hash buckets.
MySQL defines another partitioning called Key partitioning, but it’s a lot similar to Hash Partitioning.
Using the right partitioning scheme and the right set of where clause in the queries can cut down the amount of IO quite a bit. And given that each partition can reside in a separate disk, it also gives scalability as concurrent IO is possible.
Another benefit from partitioning is the data gets clustered into the appropriate bucket. In one scenario with 4 valued list partitioning scheme, I was able to take advantage of this clustering factor to get even more optimized IO.