How to transform data types in pandas
Category: Data Science
Short answer
Inspect each column's dtype, then convert with pd.to_numeric, pd.to_datetime, or astype. Correct data types reduce memory usage, enable numeric and date operations, and prevent subtle bugs caused by implicit type coercion.
Steps
- Inspect current types using dtypes and identify columns that need conversion.
- Convert numeric strings to integers or floats using pd.to_numeric, setting errors to control whether invalid values raise an exception or become NaN.
- Parse date strings into datetime objects using pd.to_datetime with an explicit format.
- Convert low-cardinality object columns to category type for memory efficiency.
- Validate conversions by checking dtypes again and sampling values.
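The steps above can be sketched roughly as follows. The DataFrame and its column names (price, signup, plan) are made up for illustration and stand in for whatever data you have loaded; treat this as a minimal sketch, not a complete recipe.

```python
import pandas as pd

# Hypothetical sample data standing in for a freshly loaded CSV.
df = pd.DataFrame({
    "price": ["10.5", "20", "n/a", "7.25"],
    "signup": ["2024-01-05", "2024-02-11", "2024-03-20", "2024-04-02"],
    "plan": ["free", "pro", "free", "free"],
})

# 1. Inspect: all three columns start out as strings
#    (object dtype, or str in newer pandas versions).
print(df.dtypes)

# 2. Numeric strings -> floats; unparseable values such as "n/a" become NaN.
df["price"] = pd.to_numeric(df["price"], errors="coerce")

# 3. Date strings -> datetime64, with an explicit format.
df["signup"] = pd.to_datetime(df["signup"], format="%Y-%m-%d")

# 4. Low-cardinality strings -> category.
df["plan"] = df["plan"].astype("category")

# 5. Validate: re-check dtypes and look at a few rows.
print(df.dtypes)
print(df.head())
```

Using errors="coerce" here is a choice made for the sketch; the default errors="raise" is safer when invalid values should stop the pipeline.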
Tips
- Specify formats in to_datetime to speed up parsing and catch anomalies.
- Use nullable integer types like Int64 so integer columns with missing values are not upcast to float.
- Downcast numeric columns to the smallest sufficient type.
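- Passing copy=False to astype can avoid an extra copy when the column already has the target dtype; in newer pandas versions with Copy-on-Write the keyword is ignored, so do not rely on it.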
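As a rough illustration of the nullable-integer and downcasting tips, the sketch below uses two hypothetical columns (count and small) and compares memory before and after downcasting. The exact byte counts depend on the pandas version, so none are shown.

```python
import pandas as pd

# Hypothetical columns chosen to illustrate each tip.
df = pd.DataFrame({
    "count": [1.0, 2.0, None, 4.0],    # whole numbers stored as float because of the gap
    "small": [10, 200, 3000, 40000],   # int64 by default
})

# Nullable Int64 keeps the missing entry as <NA> and the rest as integers.
df["count"] = df["count"].astype("Int64")

# Downcast to the smallest signed integer type that holds every value.
before = df["small"].memory_usage(deep=True)
df["small"] = pd.to_numeric(df["small"], downcast="integer")
after = df["small"].memory_usage(deep=True)

print(df.dtypes)
print(f"bytes before: {before}, after: {after}")
```

Downcasting is only safe when future values will also fit in the smaller type, so it tends to suit columns with a known, bounded range.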
Common issues
- Locale-specific date formats causing parsing errors.
- Mixed types within a single column preventing vectorized conversion.
- Silent coercion of invalid values to NaN (or NaT for dates) when errors is set to coerce.
- Object columns consuming excessive memory on large datasets.
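One way to surface the first three issues is sketched below, assuming a hypothetical mixed-type column and day-first date strings. It compares missing-value masks before and after conversion to find values that were coerced silently, and uses an explicit format to remove day/month ambiguity.

```python
import pandas as pd

# Hypothetical column mixing numbers, numeric strings, junk, and a real gap.
raw = pd.Series([1, "2", "three", None, "4.5"], dtype="object")

converted = pd.to_numeric(raw, errors="coerce")

# Present before, missing after: these values were coerced to NaN silently.
coerced = raw.notna() & converted.isna()
print(raw[coerced])

# Day-first dates: an explicit format avoids guessing the day/month order.
dates = pd.Series(["03/04/2024", "25/12/2024"])
parsed = pd.to_datetime(dates, format="%d/%m/%Y")
print(parsed.dt.month.tolist())
```

Reviewing raw[coerced] before discarding it helps distinguish genuine junk from fixable formatting, such as thousands separators or currency symbols.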
Example
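import pandas as pd
# Load the raw data; assumes data.csv contains a 'date' column.
df = pd.read_csv('data.csv')
# Drop rows with missing values, then exact duplicate rows.
df = df.dropna().drop_duplicates()
# Parse the date strings into datetime64 values.
df['date'] = pd.to_datetime(df['date'])
# Confirm the new dtype and preview the result.
print(df.dtypes)
print(df.head())
This example loads a CSV, removes rows with missing values and duplicate rows, and parses the date column into datetime64, which is the only dtype conversion it performs.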