I recently posted an article about using BINARY(16) for storing MD5’s as unique identifiers instead of simple integer ID’s (usually auto increment); in that article I touched on one of the benefits, reducing JOIN’s, but there are other reasons for doing it too, so I thought I’d post an article discussing purely the reasons behind using BINARY(16).
As I discussed in my previous article, an MD5 string is actually a hexadecimal number capable of storing values as large as 340,282,366,920,938,463,463,374,607,431,768,211,456. MySQL doesn’t have any efficient integer field for storing numbers this big so you have two choices for storage, a CHAR(32) or a BINARY(16). If you convert a hexadecimal MD5 into a unhexed character string, it will become 16 bytes rather than 32. MySQL handily has a feature built in for this called UNHEX.
So, why use binary(16) as a unique field for data storage? Databases like MySQL have superb functionality such as JOIN, allowing you to query one table and “join” the results of that query to another table. However, when you get to 10’s, 100’s or even 1000’s of millions of rows of data, JOIN’s become expensive, especially when the join only exists because you need an ID field from one table to query against on another. From tests at work, replacing a JOIN by using a binary(16) unique identifier has seen noticeable improvements to speed, noticeable here being human noticeable, not iterate it a million times and you’ll see 1.5 as opposed to 1.9 seconds noticeable.
The main benefits include:
- Fast queries against any table where you know the formula that was used to create the MD5 binary(16) using human-readable English and no integers.
- Complete disassociation of relational data values
- Ability to use INSERT IGNORE to avoid duplicate data without having to use overly large indexes
- More unique values than even a BIGINT.
The main drawbacks include:
- 12 bytes more storage for the ID (INT is 4 bytes)
- No auto-incrementation
- Completely unreadable to humans when the data is in BINARY(16) form.
One thing I just mentioned was disassociation of relational data values. What does this mean exactly? Well it means exactly the same as what people do now with MySQL and unique integer ID’s to be honest! The difference here is you can query against it without those pesky JOIN’s a lot of the time. For example, say you are storing every town in the UK in a database and how they link together (i.e. if there is a direct route from one to another.) You’d have a table named towns probably, with a unique ID and the town name. You’d then have a separate table with 2 columns, both columns would store a town ID which would basically mean “this town has a direct route to this town.” If you were to use integers as the town’s unique ID, every time you wanted to get the town’s linked to said town, you’d have to query against the towns table first to get the town ID you want to get links to, then again to get the names of the towns that link to it.
If you were to use a binary(16) representation of the town you could scrap the first join, instead you could query by saying “get me any towns that link to UNHEX(MD5(‘Town Name’))”. You’d still have to do the second join to get the town names, but you’ve instantly dropped a JOIN and simplified the whole experience as you can now query more naturally.
Basically, all you’re doing is replacing any place in your database that is a string that is usually more than 16 characters in length with a binary(16) of it, then storing the strings elsewhere for when you actually need to read the output. This effectively gives you a look-up table that can contain any string whatsoever and a database that stores relationships of strings without requiring special tables and integers for every string.
As a note, a table with 100 million rows of data with two columns – BINARY(16), TEXT – to look-up the textual value of a binary(16) string takes 0.0019 seconds for us and having that table of text has meant we’ve severely de-duped our database as the data we store often is identical, even when the source is completely different. Even if we do a WHERE BINARY(16) IN (list,of,values), the time sticks at 0.0019 up to the maximum test I’ve done so far which is 100 MD5’s.