Over the past few months at work, we’ve watched our database grow from big to really rather big. It still has a long way to go to reach the size of the big boys such as Facebook, but it’s a MySQL database of a size that most day-to-day PHP programmers would run a mile from.
One of the great things about using something like MySQL (and any other “real” database) is the ability to cross-query data, i.e. to grab data from one data-set (table) and join it to another data-set (table) to get a single set of results, either as a combination of the data or the result of an exclusion due to the join. *
However, as tables grow, the time taken to perform queries, particularly in the realm of joins, grows rather quickly. So for example take this query:
SELECT * FROM table2 LEFT JOIN table1 ON table1.columnB = table2.columnA WHERE table1.columnC = 'John.Doe';
Let’s say table1 is a list of all employees in a small business and table2 is a list of their days off, so it’s a one-to-many relationship. Running the above query to get the days off for ‘John.Doe’ would be pretty quick, and most developers would be happy with that. Even if the columns weren’t indexed, the performance of that query (it’s a small business, therefore a small dataset) would be more than suitable for any real-world application.
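For reference, here’s a minimal sketch of what those two tables might look like. Only the column names (columnA, columnB, columnC) come from the query above; the types, the day_off column, and the engine choice are assumptions for illustration:

```sql
-- Hypothetical schema for the example; column names are taken
-- from the query above, everything else is assumed.
CREATE TABLE table1 (
    columnB INT UNSIGNED NOT NULL AUTO_INCREMENT,  -- employee ID
    columnC VARCHAR(64) NOT NULL,                  -- username, e.g. 'John.Doe'
    PRIMARY KEY (columnB)
) ENGINE=InnoDB;

CREATE TABLE table2 (
    columnA INT UNSIGNED NOT NULL,  -- references table1.columnB
    day_off DATE NOT NULL,          -- assumed payload column
    KEY (columnA)
) ENGINE=InnoDB;
```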
Now imagine a table where, rather than a couple of hundred rows, you have millions or (as in our case) billions of rows of data; why we have that much data is a topic for another day. That join could result in a rather painful execution time. The problem is that MySQL effectively has to query table1 first to find the ID of user ‘John.Doe’, then use that ID against table2 to get the actual data.
So how can you optimise this? You’ve got three choices. The first is two queries: one to grab the user’s ID from table1, then another to grab the user’s data from table2; but now that’s two hits to MySQL. In a lot of places that wouldn’t matter, but here we want speed and fewer round trips. The second is to store the user’s name in table2 against each day off. That duplicates data, though, and because (in this case) it’s a string, it’s not the fastest lookup and it creates rather large indexes when people’s usernames are quite long.
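Sketched out, the first option looks like this (the ID value 5 in the second statement is just an assumed result of the first query):

```sql
-- Option one: two round trips instead of one join.
SELECT columnB FROM table1 WHERE columnC = 'John.Doe';
-- ...then, using the ID the first query returned (say it was 5):
SELECT * FROM table2 WHERE columnA = 5;
```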
The third option? A unique hash associated with that user. In this case, MD5 the username and store it as BINARY(16). MD5 is, after all, basically a 128-bit number. Most people are used to seeing it as a 32-character hexadecimal string, e.g. 7ecb9bba8130abe56cfd9a8430ca969c. That is just a hexadecimal number though, albeit a very, very big one: 128 bits can represent 2^128 = 340,282,366,920,938,463,463,374,607,431,768,211,456 distinct values. MySQL doesn’t have an INT type anywhere near big enough for a number like that, so it’s best to store it either as a 32-character string (the hexadecimal MD5) or, better yet, as a binary string of 16 bytes.
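You can see the two representations side by side in MySQL itself; UNHEX() turns the 32 hex characters into 16 raw bytes, halving the storage per value:

```sql
SELECT MD5('John.Doe');                  -- the familiar 32-character hex string
SELECT LENGTH(UNHEX(MD5('John.Doe')));  -- 16: the same value as raw bytes
```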
So how does that change our query now?
SELECT * FROM table2 WHERE table2.columnA = UNHEX(MD5('John.Doe'));
No more join and only one SELECT. It means you can look up days off for any user simply by knowing the username. MySQL’s UNHEX(MD5()) hashes a string and converts the result to its binary equivalent. In PHP you’d use md5('string', true) or pack('H*', md5('string'));
In all honesty, this isn’t the best use of BINARY(16), but it’s a relatively simple example to follow. For us, though, moving away from auto-incrementing IDs towards binary hashes has allowed us to do blind inserts (INSERT IGNORE) and lightning-fast selects where they used to take minutes or even hours. INSERT IGNORE has to be one of the biggest benefits we’ve seen. By setting the primary key to the BINARY(16) column, you can easily guarantee unique data without wasted extra index space, and you only need to query that table when you actually need the data associated with that unique hash; the rest of the time, you can query other tables that relate to that hash without having to do a join.
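As a sketch of that pattern (the table and column names here are made up for illustration, not our real schema):

```sql
-- Hypothetical days-off table keyed directly on the user hash.
CREATE TABLE days_off (
    user_hash BINARY(16) NOT NULL,  -- UNHEX(MD5(username))
    day_off   DATE       NOT NULL,
    PRIMARY KEY (user_hash, day_off)
) ENGINE=InnoDB;

-- Blind insert: if that (hash, date) pair already exists, MySQL
-- silently skips the row instead of erroring; no SELECT-first needed.
INSERT IGNORE INTO days_off (user_hash, day_off)
VALUES (UNHEX(MD5('John.Doe')), '2012-01-01');
```

The compound primary key both enforces uniqueness and serves as the lookup index, so there’s no separate index to maintain.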
* I would like to point out that I’m fully aware some people store data without a dedicated database and use MapReduce due to the sheer size of it; however, databases like MySQL let a quick line of text get you the results you want, with no further effort involved.