Over on my GitHub there are now a few Python scripts for splitting up CSV files. I’ve had to work with some huge 60-70GB CSV files from source systems, and the little dev environment hasn’t got enough scale to load them, so I knocked up a few scripts to split them into smaller, testable chunks.
There are three items:
- CSV Splitter – splits a file into chunks of a definable row count
- CSV Sampler – samples the top n rows
- CSV Random Sampler – samples n random rows of data
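The splitter idea can be sketched roughly like this (this is a minimal illustration of the technique, not the exact code in the repo, and the function and file names are my own placeholders): stream the input line by line, start a new output file every n rows, and repeat the header in each chunk so every piece loads on its own.

```python
# Sketch of row-count CSV splitting: stream the input, write a new
# chunk file every `rows_per_file` rows, repeat the header per chunk.
# Function name and output naming are illustrative, not the repo's.
import csv

def split_csv(filename, rows_per_file, output_prefix="chunk"):
    with open(filename, "r", newline="") as f:
        reader = csv.reader(f)
        header = next(reader)  # keep the header for every chunk
        out, writer, part, count = None, None, 0, 0
        for row in reader:
            if count % rows_per_file == 0:
                if out:
                    out.close()
                part += 1
                out = open(f"{output_prefix}_{part}.csv", "w", newline="")
                writer = csv.writer(out)
                writer.writerow(header)
            writer.writerow(row)
            count += 1
        if out:
            out.close()
    return part  # number of chunk files written
```

Because it streams one row at a time, memory use stays flat no matter how big the source file is.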
Here’s a sample of the sampler one. No pun intended!
```python
# CSV Random Sampler
import csv
import random

sample_size = 10000
filename = "file.csv"
output_filename = "C:/filelocation/samplefile_random_" + str(sample_size) + ".csv"

with open(filename, "r") as file:
    reader = csv.reader(file)
    rows = list(reader)

selected_rows = random.sample(rows, sample_size)

with open(output_filename, "w", newline="") as new_file:
    writer = csv.writer(new_file)
    for row in selected_rows:
        writer.writerow(row)
```
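One caveat: the script above builds the whole file as a list in memory, which won’t fly on a genuine 60-70GB file. A single-pass alternative is reservoir sampling (Algorithm R), which keeps only the sample in memory. A minimal sketch, with the function name my own:

```python
# Reservoir sampling (Algorithm R): pick `sample_size` random rows in
# one pass, holding only the sample in memory - no list(reader).
import csv
import random

def reservoir_sample(filename, sample_size):
    with open(filename, "r", newline="") as f:
        reader = csv.reader(f)
        header = next(reader)     # keep the header row aside
        reservoir = []
        for i, row in enumerate(reader):
            if i < sample_size:
                reservoir.append(row)       # fill the reservoir first
            else:
                # Row i+1 replaces a random pick with probability
                # sample_size / (i + 1), keeping the sample uniform.
                j = random.randint(0, i)
                if j < sample_size:
                    reservoir[j] = row
    return header, reservoir
```

Memory use is proportional to the sample size rather than the file size, so the same script works whether the source is 10MB or 70GB.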
Hope they help someone!