The other day I had to write a 60GB JSON file to a MySQL database. The data was already clean and deduped, so all that had to be done was to write it into MySQL. But with the file being so large, I couldn't read it into memory. I usually program in Python at my job, and when dealing with small files I typically use json.loads to deserialize the JSON into a Python object.
So I naively read the file line by line and wrote each record to the database as its own insert. Using Sequel Pro to visually monitor the writes, I could see it writing approximately 50-150 rows every second or so. Horribly slow! We're talking kilobytes a minute, and for a 60GB file that was useless.
Initial Solution/Idea: I initially wanted to jump right into writing a multi-threaded or multi-process program in Python, which I'd never done before, and I needed this script quickly. Being more comfortable in Java, I decided to go that route instead. After some research on Google, I stumbled upon a blog post by Viral Patel titled Batch Insert In Java. It was an eye opener: I never realized that such a slight change, building the queries and pooling them into batches, would add this much performance when writing these files. That one modification to how the queries were built and written was a drastic improvement to the write speed. So instead of taking days, weeks, or months, it took about half a day to write the data into MySQL.
My solution: Write to MySQL in batches of 10,000 rows. That seemed to be the sweet spot; past it, I stopped noticing any increase in pooled queries per second. Disclaimer: this was based solely on my visual reading of Sequel Pro's Table Information row count, which is an approximation that updates as the view refreshes and rows are written to the database. Even at that cap, the throughput was roughly 100-fold better than the average of 100 rows/second I was getting by writing queries line by line. Overall, it was a cool lesson in bulk inserting data into MySQL. Here's a version of the code I wrote to feed the data into MySQL.
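What follows is a minimal sketch of that batching pattern rather than the exact script: it assumes the file holds one JSON object per line, and it uses a hypothetical records table with a single payload column plus placeholder connection settings. The rewriteBatchedStatements=true flag asks MySQL Connector/J to rewrite the pooled inserts into multi-row INSERT statements, which helps on top of batching.

```java
import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class BatchLoader {

    // Placeholder connection settings -- adjust for your own database.
    private static final String URL =
            "jdbc:mysql://localhost:3306/mydb?rewriteBatchedStatements=true";
    private static final String USER = "user";
    private static final String PASS = "password";
    private static final int BATCH_SIZE = 10_000;

    public static void main(String[] args) throws Exception {
        String path = args[0]; // the big JSON file, assumed one object per line

        try (Connection conn = DriverManager.getConnection(URL, USER, PASS);
             BufferedReader reader = Files.newBufferedReader(Paths.get(path))) {

            // Turn off auto-commit so each batch is one round trip and one commit.
            conn.setAutoCommit(false);

            // For illustration the whole JSON line goes into one column;
            // in practice you'd parse out the fields you actually need.
            String sql = "INSERT INTO records (payload) VALUES (?)";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                String line;
                int count = 0;

                while ((line = reader.readLine()) != null) {
                    ps.setString(1, line);
                    ps.addBatch();          // pool the query instead of executing it

                    if (++count % BATCH_SIZE == 0) {
                        ps.executeBatch();  // flush 10,000 pooled inserts at once
                        conn.commit();
                    }
                }

                ps.executeBatch();          // flush whatever is left over
                conn.commit();
            }
        }
    }
}
```

Turning off auto-commit and committing once per batch means each round trip to the server carries 10,000 rows' worth of work instead of one, which is where the speedup comes from.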
Disclaimer: I am aware that (1) I could have done the bulk insertion in Python, and (2) this could definitely be optimized to be faster and more efficient; I may return to this program eventually to do so. I needed to come up with a solution fast, and this was the result. The key takeaway of this post: try bulk insertions before assuming that concurrency or parallelism is the best solution for your needs.