Getting 60 GB of JSON Data into MySQL
The other day I had to write a 60 GB JSON file to a MySQL database. The data was already clean and deduped, so all that had to be done was to write it into MySQL. But with the file being so large, I couldn't read it into memory all at once. I usually program in Python at my job, and when dealing with small files I typically use json.loads to deserialize the JSON into a Python object.
So I naively read the file line by line and wrote each line to the database as I went. I used Sequel Pro to visually monitor the writes, and I observed that it was writing approximately 50-150 rows to the database every second or so. Horribly slow! We're talking kilobytes a minute. For a 60 GB file this was useless.
Initial Solution/Idea: I initially wanted to jump right to writing a multi-threaded or multi-process program in Python, which I had never done before, and I needed to make this script quickly. Being more comfortable in Java, I decided to go that route. But after doing some research on Google, I stumbled upon a blog post by Viral Patel titled Batch Insert In Java. This was an eye opener: I had never realized that such a slight change in how the queries are built, pooling them into batches, would add this much performance. It drastically improved the speed of writing to the database. Instead of taking days, weeks or months, it took about half a day to write this data to MySQL.
My solution: Write to MySQL in batches of 10,000. This is the batch size that seemed best; beyond it I stopped noticing an increase in rows written per second. Disclaimer: this was based solely on my visual perception of Sequel Pro's Table Information for rows, which is an approximation of the current row count that updates as it refreshes and rows are written to the database. Even with this limit of 10,000, the increase was 100-fold compared to the average of 100 rows/second I was getting by writing queries line by line. Overall, it was a cool lesson in bulk inserting data into MySQL. Here's a version of the code I wrote to feed the data into MySQL.
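As an illustration of the batching pattern described above (a minimal sketch, not the original program), the Java below streams a file line by line and flushes every N lines through JDBC's addBatch/executeBatch. The class name BatchInsertSketch, the BatchSink interface, and the table and column names (records, payload) are all hypothetical, and it assumes one JSON document per line.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class BatchInsertSketch {

    /** Receives one full batch (or the final partial batch) of raw JSON lines. */
    public interface BatchSink {
        void write(List<String> batch) throws Exception;
    }

    /**
     * Reads the input one line at a time, so the whole file is never held in
     * memory, and hands lines to the sink in groups of at most batchSize.
     * Returns the number of non-blank lines processed.
     */
    public static long streamInBatches(BufferedReader reader, int batchSize, BatchSink sink)
            throws Exception {
        List<String> batch = new ArrayList<>(batchSize);
        long total = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            batch.add(line);
            total++;
            if (batch.size() == batchSize) {
                sink.write(batch);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            sink.write(batch); // flush the final partial batch
        }
        return total;
    }

    /**
     * A sink that writes each batch with one executeBatch call and one commit.
     * ASSUMPTION: the table "records" and column "payload" are made up.
     */
    public static BatchSink jdbcSink(Connection conn) throws SQLException {
        conn.setAutoCommit(false); // commit once per batch, not once per row
        return batch -> {
            String sql = "INSERT INTO records (payload) VALUES (?)";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                for (String json : batch) {
                    ps.setString(1, json);
                    ps.addBatch();
                }
                ps.executeBatch();
            }
            conn.commit();
        };
    }

    public static void main(String[] args) throws Exception {
        // Demo with an in-memory sink so the sketch runs without a database.
        List<Integer> sizes = new ArrayList<>();
        BufferedReader reader = new BufferedReader(
                new StringReader("{\"id\":1}\n{\"id\":2}\n{\"id\":3}\n"));
        long n = streamInBatches(reader, 2, b -> sizes.add(b.size()));
        System.out.println(n + " lines written in batches of sizes " + sizes);
    }
}
```

Against a real database, the call would look like streamInBatches(Files.newBufferedReader(path), 10_000, jdbcSink(conn)). With MySQL Connector/J, the rewriteBatchedStatements=true connection property may help further, since it lets the driver combine batched inserts into multi-row statements; whether it helps a given workload is worth measuring rather than assuming.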
Disclaimer: I am aware that (1) I could have done the bulk insertion using Python and (2) this could definitely be optimized to be faster and more efficient, and I may return to this program to do so eventually. I needed to come up with a solution fast, and this was the result. The key takeaway of this blog post is to try bulk insertions before assuming that concurrency or parallelism is the best solution for your needs.
Jeremy L. Morris's Blog
A young professional's blog on various software engineering, computer science and programming topics.
Sunday, June 25, 2017
Wednesday, August 19, 2015
Using FreeMarker to assign Agility VM properties to a PowerShell variable.
It's been about a month since I last posted. Well, I've been a little busy. I've been starting the interview process for a few companies. Since this upcoming Fall semester is my last, it is imperative that I have a software engineering job lined up for January 2016. And my week is spent full-time as a tech engineer working in the automation department at some company where I create PowerShell scripts leveraging FreeMarker that are utilized within CSC's Agility platform.
So...... to the point of this blog. I've been working on a script lately that validates the properties of an instance and then compares them to the properties listed in Agility using a Java template engine called FreeMarker. I just want to show an example of extracting a property from a VM in Agility and then assigning it to a variable in PowerShell.
*NOTE* Scripts leveraging FreeMarker can only be run in Agility with the extensions button enabled when adding/creating the script in Agility. They will not execute properly in PowerShell ISE, etc. The FreeMarker portion of the script will only be seen as a multi-line comment in a traditional PowerShell editor.
Here's an example code snippet that extracts a designated VM asset property from an Agility template and assigns it to a PowerShell variable...
<#assign agilityOSProperty = vmAssetProperty(this, "template.domain")>
$agilityOSProperty = '${agilityOSProperty.stringValue}'
$agilityOSProperty = $agilityOSProperty.ToUpper()
link that may be of interest ---> http://freemarker.org/
Monday, July 20, 2015
What is Python good for..
-Web and Internet Development
* Frameworks like Django and Pyramid
* Micro-frameworks like Flask and Bottle
* Advanced content management systems such as Plone and django CMS
* supports HTML and XML
* supports JSON
* supports E-mail processing
* support for FTP and IMAP as well as other internet protocols
* supports socket interface
* Requests, a powerful HTTP client library
* BeautifulSoup, an HTML parser that can handle all sorts of oddball HTML
* Feedparser for parsing RSS/Atom feeds
* Paramiko, implementing the SSH2 protocol
* Twisted Python, a framework for asynchronous network programming
-Scientific and Numeric
* SciPy is a collection of packages for mathematics, science, and engineering
* Pandas is a data analysis and modeling library
* IPython for editing and recording work sessions, supporting visualizations and parallel computing
* Software Carpentry Course teaches basic skills for scientific computing, running bootcamps and providing open-access teaching materials.
-Education
* python.org
* docs.python
* hundreds of videos on youtube
-Desktop GUIs
* Tk
* wxWidgets
* Kivy, for multitouch applications
* Qt via PyQt and PySide
* GTK+
-Software Development
* Build Control ---> SCons
* Automated Continuous Compilation and Testing ---> Buildbot and Apache Gump
* Bug Tracking and Project Management ---> Roundup or Trac
This information was adapted from the Python Software Foundation, which describes itself as follows: "The Python Software Foundation (PSF) is a 501(c)(3) non-profit corporation that holds the intellectual property rights behind the Python programming language. We manage the open source licensing for Python version 2.1 and later and own and protect the trademarks associated with Python. We also run the North American PyCon conference annually, support other Python conferences around the world, and fund Python related development with our grants program and by funding special projects". https://www.python.org/about/apps/
Wednesday, July 8, 2015
Java != JavaScript
This topic is somewhat of a pet peeve of mine. I tend to get slightly annoyed when people or job descriptions act as if Java is JavaScript, and vice versa. They are not related! Yes, they are both prevalent in our everyday lives, from the phones we use to the websites we access. A lot of the software that an average person interacts with regularly will have been influenced partially, if not entirely, by one of these two languages: everything from your mobile device and websites to the apps on your smart TV. Here's one example of a line of smart TVs that are in fact populated with apps programmed in Java (FYI: Android apps are typically programmed with a combination of Java and Android XML) ----> http://www.theverge.com/2015/1/5/7497383/sony-new-smart-tv-run-android-tv-ces-2015
Here are some quick facts..
Java:
-Launched by Sun Microsystems in 1995.
-Developed by James Gosling.
-Object-Oriented Programming Language (objects represent instances of a class).
-Strongly-Typed language.
-Runs on the Java Virtual Machine(JVM).
-The language was originally named Oak.
-Considered a general-purpose programming language.
-Able to run on any OS where JVM is available.
-Java supports multi-threaded programming.
(multiple threads running concurrently within one process)
-De facto language used to develop Android apps.
-Filename extension is .java
HELLO WORLD IN JAVA:
class HelloWorldApp {
    public static void main(String[] args) {
        System.out.println("Hello, World!");
    }
}
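To make the multi-threading point in the list above concrete, here is a minimal sketch (the class name ThreadsSketch and the method countWithThreads are made up for illustration). Several threads inside one JVM process increment a shared counter, and AtomicInteger keeps the concurrent updates from being lost.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ThreadsSketch {

    /** Starts the given number of threads, each incrementing a shared counter. */
    public static int countWithThreads(int threads, int incrementsPerThread)
            throws InterruptedException {
        AtomicInteger counter = new AtomicInteger(); // safe for concurrent updates
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < incrementsPerThread; j++) {
                    counter.incrementAndGet();
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            worker.join(); // wait for every thread to finish
        }
        return counter.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("Total: " + countWithThreads(4, 1000));
    }
}
```

A plain int field in place of AtomicInteger could lose updates when threads interleave, which is why the sketch uses the atomic class.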
JavaScript:
-Developed at Netscape and released after Java in 1995
-Originally called LiveScript. Sun Microsystems gave Netscape
an exclusive license to allow the switch from LiveScript to JavaScript.
-Uses prototypes rather than classes for inheritance (the class keyword added in ECMAScript 2015 is syntax over prototypes).
-Typically embedded in HTML.
-Interpreted and run by the client's browser.
-Loosely-typed language (you don't have to declare a variable's data type before using it).
-Supported by most, if not all, browsers.
-Prototype-based scripting language with object-oriented aspects.
-Filename extension is .js
HELLO WORLD IN JAVASCRIPT:
<!DOCTYPE HTML>
<html>
<body>
<p>Header...</p>
<script>
alert('Hello, World!')
</script>
<p>...Footer</p>
</body>
</html>
Both awesome and productive languages? YES! The same? NO!
15 Sorting Algorithms in 6 Minutes
Here's a cool video I found on YouTube that illustrates sorting algorithms with the use of visualization and audio. Pretty cool stuff.
Monday, July 6, 2015
SCRUM Meetings and Agile Methodology; an Intern's perspective!
SCRUM meetings, three times a week. What's it like? Well, imagine running a marathon (26.2 miles) with a checkpoint (sprint release) every 3 miles and water breaks every 5 minutes (the weekly SCRUM meetings). Now picture, for explanation purposes, that you have to reach each checkpoint in about 30 minutes. The first checkpoint (sprint 1) goes fine. Later on, as you're running toward the next checkpoint (sprint 2) filled with confidence and optimism, your shoe comes undone and you trip and slam hard onto the ground, scraping your knee. By the time you get up, assess the damage and adapt to this new obstacle (a scraped knee), you've added about 5 minutes to your sprint. Put that into the context of software development: you're a week or two behind on features that should've been released by the planned sprint/release date, and now the date has to be pushed back a week to adapt and change some things in the project. But no worries, the next checkpoint (sprint) is coming up, and you'll get a band-aid and some water and be ready to go! This is what it's like developing software with Agile methodology.
This may seem annoying, but remember that a large part of Agile methodology is a project team continuously setting expectations, trying to adhere to them and adapting to obstacles as they occur. It means fixing issues when you run into them rather than waiting until the end of a project to realize that a feature set won't work out or that you don't meet the required security specifications. In software development there are always road bumps. Agile methodology helps to minimize the blowback by continuously addressing problems as they occur, instead of letting them worsen and allowing a potential bug to become an integral part of the project that compromises most of what has been done.
Ultimately, SCRUM allows for an increased capability to adapt to change, an expectation of these changes, more control over the direction of development, and increased quality in each release.