Snippet – Small sample from large CSV

January 16th, 2016 | Application Lifecycle Management (ALM)

When doing initial data modeling, the input data files for machine learning algorithms often run to multiple gigabytes, and files that large can be difficult to work with.

This quick Python pandas script grabs the first x rows from a large CSV and saves them to a new file. In this case, x is 5000, just enough for getting a quick understanding of the data. You could then pull a million or so rows to do your initial data analysis.

# Import the necessary items
import pandas as pd

# Read the CSV data, passing in nrows as a parameter to get just that number of rows (from the beginning)
df = pd.read_csv("HugeFile.csv", sep=",", low_memory=False, nrows=5000)

# Save a new copy in CSV
df.to_csv("SmallFile.csv", sep=",", index=False)

UPDATE: As soon as I wrote this, I heard Stan Dotloe’s voice saying “Steve, that’s not even close to a representative sample, you’re going to run into all sorts of problems if you try to understand the data from a sample that only takes the first pieces of data. You really need to randomly select the values from the file.” He’s right, of course, so here’s the update that includes grabbing a random subset of data from a large CSV file.
As a side note, this is relatively slow. Rather than milliseconds, it can take minutes. As an example, on my non-SSD drive, just counting the rows in my 20 million row table took nearly a minute. Then it took nearly another minute to create the 19,995,000 non-duplicating random numbers and stuff them into an array, and finally nearly another minute to read the randomly selected rows. Still, very easy to do and not too shabby time-wise. If this were going into an Azure ML data pipeline, I’d surely use one of the faster approaches hinted at in the gist, but this is great for manually trimming down a file.

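If the pandas read_csv method had a takerows parameter, instead of just a skiprows parameter, then the only thing required would be to generate 5,000 non-duplicating random numbers. And that would be easy. Because it only has skiprows, you have to generate the row numbers to leave out instead, which is every row except the 5,000 you want to keep (19,995,000 of them for my 20 million row file). Thankfully, numpy is very, very fast, so it’s not too bad.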
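The gist itself is not reproduced in this text, so here is a minimal sketch of the approach described above rather than the original code. The function name `sample_csv_rows` and its parameters are hypothetical. It assumes the file has a single header line and that every record sits on exactly one physical line (no quoted fields containing line breaks), since the row count comes from counting lines.

```python
import numpy as np
import pandas as pd


def sample_csv_rows(src_path, dest_path, sample_size, seed=None):
    """Save a random sample of `sample_size` data rows from a CSV (hypothetical helper)."""
    # Step 1: count the data rows. Line 0 is assumed to be the header,
    # so the number of data rows is the line count minus one.
    with open(src_path, "r") as f:
        total_rows = sum(1 for _ in f) - 1

    if sample_size >= total_rows:
        # Asking for at least as many rows as exist: keep everything.
        skip = []
    else:
        # Step 2: choose the rows to SKIP, not the rows to keep.
        # choice() draws distinct values from 0..total_rows-1; adding 1 shifts
        # them to line numbers 1..total_rows so the header is never skipped.
        rng = np.random.RandomState(seed)
        skip = rng.choice(total_rows, size=total_rows - sample_size, replace=False) + 1
        skip = skip.tolist()

    # Step 3: read only the rows that were not selected for skipping, then save.
    df = pd.read_csv(src_path, skiprows=skip)
    df.to_csv(dest_path, index=False)
    return df
```

Calling `sample_csv_rows("HugeFile.csv", "SmallFile.csv", 5000)` would mirror the file names used in the first script. Passing a `seed` makes the sample repeatable, which helps when you want to compare runs.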

python 3.4 64-bit
pandas 0.17.1
