Today I just came to know about Amazon SimpleDB webservices from slashdot. Read the documentation which didn’t really give details about the internals of this technology. So, using my knowledge of information retrieval and databases, I am going to do a few comparisons. First some terminology,
Amazon SimpleDB => Databases
Domains => Tables
Items => Records
Attributes => Columns
A few key differences are
1) there is no need to define the data type of the attribute. This is because, the storage mechanism is mostly inverse indexing treating each value as a token.
2) values of an attribute can be both a string and a number. The token logic goes for this as well.
3) an attribute can have multiple values. Some databases like Oracle support user defined data types and varrays making it possible to capture multiple values.
4) different items can have different attributes. However, the doc indicates “256 total attribute name-value pairs per item”. It wasn’t clear if the union of all attributes across the items in a domain is 256 or for each item individually it is 256 (but much more across items). If it’s across the items, then this is no different than having a static table with 256 character columns and using a bit-map index on all of them.
Bitmap indexes are extremely fast. They work best when the ratio of the distinct set of values to the total records is low (fewer distinct values). Due to their vector storage nature, the query criteria are resolved individually (unlike in case of other types of indexes where either nested-loops, hash-join or merge-join is used), which is nothing but just fetching the index which is a vector and set-operations are performed on the bit vectors. This works well for “=” operator.
For example, there are a million records of people and Gender is indexed as a bit-map index, then getting the list of all the Males in this set of population is no more than loading the bitmap vector and picking all the records whose bits are set to 1. The “!=” operator is also not hard to implement it.
Dealing with numeric types and the related operators such as >= and and < is a bit expensive. The way it works is, the distinct values possible between the specified range and available among the distinct indexed values is identified and for each of these values, the index is loaded and a set union operation is performed. Though, a CBO database may be smart enough to go directly for a FTS (Full Table Scan) than loading all the indexes and then resolving the table.
From the SimpleDB documentation which mentions
“# Positive integers should be zero-padded to match the largest number of digits in your data set. For example, if the largest number you are planning to use in a range is 1,000,000, every number that you store in SimpleDB should be zero-padded to at least 7 digits. You would store 25 as 0000025, 4597 as 0004597, and so on.
# Negative integers should be offset and turned into positive numbers and zero-padded. For example, if the smallest negative integer in your data set is -500, your application should add at least 500 to every number that you store. This ensures that every number is now positive and enables you to use the zero-padding technique.”
it appears that the SimpleDB has been optimized to deal with constant-length positive numbers represented as string tokens. This is perhaps understandable since this technology, most likely developed to support Amazon’s highly scalable product catalog has most numeric attributes that are only positive (I mean, when was the last time you got an item at a -ve price? I once did after the manufacturer rebate and the retailer rebate both got cleared for an item).