How to work with large CSV files efficiently
Category: Data Science
Short answer
Large CSV files can exceed available RAM; processing them efficiently requires chunking, type optimization, or distributed frameworks.
Steps
- Estimate file size and memory requirements before loading.
- Use chunksize in pd.read_csv to process files in manageable fragments.
- Specify dtypes explicitly to prevent pandas from inferring expensive object types.
- Load only necessary columns using the usecols parameter.
- Consider Dask or Polars for out-of-core or parallel processing.
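The chunking, dtype, and usecols steps above can be combined in one read loop. The following is a minimal sketch, not a definitive recipe: the column names (region, sales, notes), the chunk size of 4, and the in-memory CSV text are all made up for illustration, and a real file would be passed as a path with a much larger chunksize.

```python
import io

import pandas as pd

# Hypothetical stand-in for a large file; in practice pass a file path instead.
csv_text = "region,sales,notes\n" + "".join(
    f"{'north' if i % 2 else 'south'},{i},free text {i}\n" for i in range(10)
)

totals = {}  # running aggregate, so no chunk has to stay in memory

reader = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["region", "sales"],                      # skip the unused notes column
    dtype={"region": "category", "sales": "int32"},   # avoid inferred object/int64
    chunksize=4,                                      # rows per chunk (tiny for the demo)
)

for chunk in reader:
    # Reduce each chunk to a small partial result before moving on.
    partial = chunk.groupby("region", observed=True)["sales"].sum()
    for region, value in partial.items():
        totals[region] = totals.get(region, 0) + int(value)

print(totals)
```

Aggregating per chunk keeps peak memory close to the size of one chunk, which is usually preferable to collecting every chunk and concatenating at the end.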
Tips
- Compress output files to reduce disk I/O and storage costs.
- Set frequently filtered columns as the index (for example with set_index) after loading to speed up repeated lookups.
- Profile peak memory usage with tools like memory_profiler.
- Convert to Parquet after initial cleaning for faster future reads.
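As a rough illustration of the compression tip, the sketch below writes the same small DataFrame as plain and gzip-compressed CSV and compares the file sizes. The data, file names, and temporary directory are assumptions made for the demo. The Parquet line is left as a comment because to_parquet needs an optional engine such as pyarrow or fastparquet to be installed.

```python
import os
import tempfile

import pandas as pd

# Hypothetical cleaned data standing in for the result of a real pipeline.
df = pd.DataFrame({"region": ["north", "south"] * 500, "sales": range(1000)})

with tempfile.TemporaryDirectory() as tmp:
    plain_path = os.path.join(tmp, "cleaned.csv")
    gz_path = os.path.join(tmp, "cleaned.csv.gz")

    df.to_csv(plain_path, index=False)
    df.to_csv(gz_path, index=False, compression="gzip")

    plain_size = os.path.getsize(plain_path)
    gz_size = os.path.getsize(gz_path)

    # read_csv infers gzip compression from the .gz extension.
    roundtrip = pd.read_csv(gz_path)

    # With pyarrow or fastparquet installed, a columnar copy could be written as:
    # df.to_parquet(os.path.join(tmp, "cleaned.parquet"))

print(plain_size, gz_size, roundtrip.shape)
```

How much space compression saves depends heavily on the data; repetitive columns like the one here compress far better than high-entropy values.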
Common issues
- Parser errors due to malformed rows or inconsistent delimiters.
- Memory spikes when concatenating many chunks without freeing intermediates.
- Slow performance from repeated string parsing in object columns.
- Encoding issues causing UnicodeDecodeError on large files.
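Two of the issues above, malformed rows and encoding errors, can often be handled through read_csv parameters. This is a hedged sketch under stated assumptions: the sample bytes are invented, the file is assumed to be Latin-1 encoded, and on_bad_lines requires pandas 1.3 or newer. Skipping bad rows silently discards data, so logging or inspecting them may be the better choice in a real pipeline.

```python
import io

import pandas as pd

# Hypothetical Latin-1 encoded data with one malformed row (an extra field).
raw = (
    "id,name,amount\n"
    "1,Zoë,10\n"
    "2,Bob,20,unexpected\n"
    "3,Ana,30\n"
).encode("latin-1")

df = pd.read_csv(
    io.BytesIO(raw),
    encoding="latin-1",     # assumed encoding; reading these bytes as UTF-8 would fail
    on_bad_lines="skip",    # drop rows with too many fields instead of raising
)

print(df)
```

When the encoding is unknown, it is generally safer to detect or confirm it first than to guess, since a wrong but permissive encoding can load garbled text without raising an error.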
Example
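import pandas as pd
import numpy as np

# Small in-memory DataFrame with one missing value
df = pd.DataFrame({'sales': [100, 150, 200, np.nan]})

# Replace the missing value with the column median (NaN is ignored when computing it)
df['sales'] = df['sales'].fillna(df['sales'].median())

# Summary statistics: count, mean, std, min, quartiles, max
print(df.describe())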
This snippet creates a small in-memory DataFrame, fills the missing value with the column median (150), and prints the summary statistics commonly used in exploratory analysis.