How to work with large CSV files efficiently

· Category: Data Science

Short answer

Large CSV files can exceed available RAM; processing them efficiently requires chunked reading, dtype optimization, or an out-of-core or parallel framework such as Dask or Polars.

Steps

  1. Estimate file size and memory requirements before loading.
  2. Pass chunksize to pd.read_csv to get an iterator of DataFrames and process the file in manageable fragments.
  3. Specify dtypes explicitly with the dtype parameter so pandas does not fall back to memory-heavy object columns or default 64-bit numeric types.
  4. Load only necessary columns using the usecols parameter.
  5. Consider Dask or Polars for out-of-core or parallel processing.
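
Steps 2 to 4 can be combined in a single read_csv call. The following is a minimal sketch, not a definitive implementation: the column names (region, units, price, notes), the in-memory sample data, and the helper total_units_by_region are hypothetical, and io.StringIO stands in for a real file path.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for a large file on disk.
csv_text = "region,units,price,notes\n" + "\n".join(
    f"{'north' if i % 2 else 'south'},{i},{i * 1.5},free text {i}"
    for i in range(1, 11)
)

def total_units_by_region(source, chunksize=4):
    """Stream a CSV in chunks and accumulate per-region unit totals."""
    totals = {}
    reader = pd.read_csv(
        source,
        usecols=["region", "units"],                      # step 4: skip unneeded columns
        dtype={"region": "category", "units": "int32"},   # step 3: explicit dtypes
        chunksize=chunksize,                              # step 2: iterate over fragments
    )
    for chunk in reader:
        # Reduce each chunk to a small result, then fold it into the running
        # totals so that no more than one chunk is held in memory at a time.
        per_region = chunk.groupby("region", observed=True)["units"].sum()
        for region, units in per_region.items():
            totals[region] = totals.get(region, 0) + int(units)
    return totals

totals = total_units_by_region(io.StringIO(csv_text))
print(totals)
```

Aggregating per chunk rather than collecting chunks in a list keeps peak memory close to the size of one chunk; the right chunksize depends on row width and available RAM.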

Tips

  • Compress output files to reduce disk I/O and storage costs.
  • Set an index on frequently queried columns after loading (for example with set_index) to speed up downstream lookups.
  • Profile peak memory usage with tools like memory_profiler.
  • Convert to Parquet after initial cleaning for faster future reads.
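
As a rough illustration of the memory and compression tips, the sketch below measures a DataFrame's footprint before and after dtype conversion and compares plain and gzip-compressed CSV output. It uses pandas' built-in memory_usage(deep=True) as a lightweight stand-in for memory_profiler; the columns city and visits and the file names are hypothetical.

```python
import os
import tempfile
import pandas as pd

# Hypothetical data: a low-cardinality text column and a small-range integer column.
df = pd.DataFrame({
    "city": ["Lima", "Oslo", "Lima", "Oslo"] * 2500,
    "visits": range(10_000),
})

# Measure the in-memory footprint before and after shrinking the dtypes.
before = df.memory_usage(deep=True).sum()
df["city"] = df["city"].astype("category")
df["visits"] = pd.to_numeric(df["visits"], downcast="integer")
after = df.memory_usage(deep=True).sum()
print(f"memory: {before} bytes -> {after} bytes")

# Write the same data as plain and gzip-compressed CSV and compare sizes.
with tempfile.TemporaryDirectory() as tmp:
    plain_path = os.path.join(tmp, "out.csv")
    gz_path = os.path.join(tmp, "out.csv.gz")
    df.to_csv(plain_path, index=False)
    df.to_csv(gz_path, index=False, compression="gzip")
    plain_size = os.path.getsize(plain_path)
    gz_size = os.path.getsize(gz_path)
    # read_csv infers gzip compression from the .gz extension by default.
    round_trip = pd.read_csv(gz_path)

print(f"disk: {plain_size} bytes plain, {gz_size} bytes gzip")
```

The savings shown here come from highly repetitive sample data, so real files may compress less. memory_usage reports the size of the finished DataFrame, not the peak reached while loading, which is what a profiler such as memory_profiler is for.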

Common issues

  • Parser errors due to malformed rows or inconsistent delimiters.
  • Memory spikes when concatenating many chunks without freeing intermediates.
  • Slow performance from repeated string parsing in object columns.
  • Encoding issues causing UnicodeDecodeError on large files.
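
The parser and encoding issues above can be handled at read time. This is a minimal sketch under stated assumptions: the sample bytes are invented, Latin-1 is assumed to be the file's actual encoding, and on_bad_lines requires pandas 1.3 or later.

```python
import io
import pandas as pd

# Hypothetical Latin-1 encoded input with one malformed row (an extra field).
raw = "id,name\n1,Zoë\n2,Ana,extra\n3,José\n".encode("latin-1")

# Decoding the bytes as UTF-8 fails, which is the error seen when the wrong
# encoding is assumed for a file.
try:
    raw.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Pass the actual encoding and skip rows with too many fields
# instead of aborting the whole read.
df = pd.read_csv(io.BytesIO(raw), encoding="latin-1", on_bad_lines="skip")
print(df)
```

Skipping bad lines silently discards data, so on_bad_lines="warn" may be the safer choice while investigating how many rows are affected.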

Example

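import pandas as pd
import numpy as np

# Small DataFrame with one missing value in the 'sales' column
df = pd.DataFrame({'sales': [100, 150, 200, np.nan]})

# Replace the missing value with the median of the observed values
df['sales'] = df['sales'].fillna(df['sales'].median())

# Print count, mean, standard deviation, min, quartiles, and max
print(df.describe())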

This snippet creates a small DataFrame, fills the missing value with the column median, and prints summary statistics, a cleaning step common in exploratory analysis.