On my GitHub here, there are now a few Python scripts for splitting up CSV files. I've had to work with some huge 60-70GB CSV files from some source systems, and the little dev environment hasn't got enough scale to load these files, so I knocked up a few items to split them into smaller, testable chunks.
There are three items:
- CSV Splitter – splits a file into chunks of a definable row count
- CSV Sampler – samples the top n rows
- CSV Random Sampler – samples n random rows of data
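For context, the splitter works along these lines. This is a minimal sketch rather than the exact script from the repo, and the function name, `chunk` prefix, and header handling are my own assumptions:

```python
import csv
import itertools

def split_csv(in_path, rows_per_chunk, out_prefix="chunk"):
    # Stream the file once, writing rows_per_chunk rows per output file
    # and repeating the header at the top of each chunk.
    with open(in_path, "r", newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        for n in itertools.count():
            chunk = list(itertools.islice(reader, rows_per_chunk))
            if not chunk:
                break
            with open(f"{out_prefix}_{n}.csv", "w", newline="") as out:
                writer = csv.writer(out)
                writer.writerow(header)
                writer.writerows(chunk)
```

Because `itertools.islice` pulls rows straight off the reader, only one chunk is ever held in memory at a time, which is the whole point when the source file is tens of gigabytes.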
Here’s a sample of the sampler one. No pun intended!
# CSV Random Sampler
import csv
import random

sample_size = 10000
filename = "file.csv"
output_filename = "C:/filelocation/samplefile_random_" + str(sample_size) + ".csv"

# Read every row into memory, then pick a random sample.
# Note: if the file has a header row, strip it first or it may end up in the sample.
with open(filename, "r", newline="") as file:
    reader = csv.reader(file)
    rows = list(reader)

selected_rows = random.sample(rows, sample_size)

with open(output_filename, "w", newline="") as new_file:
    writer = csv.writer(new_file)
    writer.writerows(selected_rows)
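One caveat with the snippet above: `list(reader)` loads the whole file into memory, which won't fly for the really big files. A reservoir sampling pass keeps memory bounded to the sample size while still giving every row an equal chance of selection. A rough sketch, with the function name and header handling being my own choices:

```python
import csv
import random

def reservoir_sample_csv(in_path, out_path, sample_size, has_header=True):
    # Algorithm R: stream the file once, keeping a running "reservoir"
    # of sample_size rows; each incoming row replaces a reservoir entry
    # with probability sample_size / (rows seen so far).
    with open(in_path, "r", newline="") as f:
        reader = csv.reader(f)
        header = next(reader) if has_header else None
        reservoir = []
        for i, row in enumerate(reader):
            if i < sample_size:
                reservoir.append(row)
            else:
                j = random.randint(0, i)
                if j < sample_size:
                    reservoir[j] = row
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        if header is not None:
            writer.writerow(header)
        writer.writerows(reservoir)
```

Memory use is proportional to the sample, not the file, so a 10,000-row sample from a 70GB file is no heavier than one from a 70MB file.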
Hope they help someone!