How to use groupby for aggregations
Category: Data Science
Short answer
The pandas groupby operation splits a DataFrame into groups based on the values of one or more key columns, applies a computation within each group, and combines the results into a summary.
Steps
- Choose grouping columns that define the meaningful partitions for your analysis.
- Select aggregation functions such as sum, mean, count, or custom lambdas.
- Apply multiple aggregations simultaneously with agg for comprehensive summaries.
- Use transform to return results aligned with the original DataFrame index.
- Call reset_index after aggregation to move the group keys from the (possibly hierarchical) index back into regular columns.
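The steps above can be sketched as follows. This is a minimal illustration, not a prescribed workflow; the DataFrame and its column names (region, sales) are made up for the example.

```python
import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
df = pd.DataFrame({
    "region": ["east", "east", "west", "west", "west"],
    "sales": [100, 150, 200, 50, 300],
})

# Single aggregation: total sales per region (region becomes the index).
totals = df.groupby("region")["sales"].sum()

# Custom aggregation with a lambda: range of sales within each region.
spread = df.groupby("region")["sales"].agg(lambda s: s.max() - s.min())

# Several aggregations at once with agg.
summary = df.groupby("region")["sales"].agg(["sum", "mean", "count"])

# transform returns one value per original row, aligned to df's index.
df["region_total"] = df.groupby("region")["sales"].transform("sum")
df["share"] = df["sales"] / df["region_total"]

# reset_index moves the group key from the index back into a column.
flat = summary.reset_index()
print(flat)
```

Here totals and summary have one row per region, while region_total has one value per original row, which is what makes the per-row share calculation possible.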
Tips
- Use as_index=False to keep grouping columns as regular columns.
- Apply named aggregations for cleaner column names in the output.
- Filter groups with filter based on group-level properties.
- Use size instead of count when you want total rows including NaNs.
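A short sketch of these tips, again on made-up data (the region and sales columns and the 200 threshold are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing sales value.
df = pd.DataFrame({
    "region": ["east", "east", "west", "west", "west"],
    "sales": [100.0, np.nan, 200.0, 50.0, 300.0],
})

# as_index=False keeps "region" as an ordinary column in the result.
by_region = df.groupby("region", as_index=False)["sales"].sum()

# Named aggregation: output_name=(input_column, function).
named = df.groupby("region").agg(
    total_sales=("sales", "sum"),
    avg_sales=("sales", "mean"),
)

# filter keeps the rows of groups that satisfy a group-level condition.
big = df.groupby("region").filter(lambda g: g["sales"].sum() > 200)

# size counts rows per group; count counts non-null values per column.
rows = df.groupby("region").size()
non_null = df.groupby("region")["sales"].count()
print(rows, non_null, sep="\n")
```

For the east group, size reports 2 rows while count reports 1, because one of its sales values is NaN.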
Common issues
- Slow performance when grouping on high-cardinality columns.
- Unexpected multi-index structures complicating downstream operations.
- Missing groups due to filtering before aggregation.
- Memory spikes from intermediate group objects in complex pipelines.
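Two of these issues have simple workarounds, sketched below on hypothetical data. Joining column levels with an underscore and reindexing against the full key list are one possible approach each, not the only one.

```python
import pandas as pd

# Hypothetical sample data; "north" has no positive sales.
df = pd.DataFrame({
    "region": ["east", "east", "west", "north"],
    "sales": [100, 150, 200, 0],
    "units": [1, 2, 3, 0],
})

# agg with a dict of lists yields two-level (MultiIndex) column labels.
multi = df.groupby("region").agg({"sales": ["sum", "mean"], "units": ["sum"]})

# Flatten them by joining the two levels into single strings.
flat = multi.copy()
flat.columns = ["_".join(col) for col in multi.columns]
flat = flat.reset_index()

# Filtering rows first removes the "north" group from the result entirely.
filtered = df[df["sales"] > 0].groupby("region")["sales"].sum()

# Reindex against the full set of keys to restore it with a fill value.
all_regions = sorted(df["region"].unique())
restored = filtered.reindex(all_regions, fill_value=0)
print(flat)
print(restored)
```

Whether a missing group should reappear as 0, as NaN, or not at all depends on the analysis, so the fill value here is only a placeholder choice.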
Example
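import pandas as pd
import numpy as np
# Small DataFrame with one missing sales value
df = pd.DataFrame({'sales': [100, 150, 200, np.nan]})
# Replace the missing value with the column median (NaN is skipped when computing it)
df['sales'] = df['sales'].fillna(df['sales'].median())
# Print count, mean, std, min, quartiles, and max
print(df.describe())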
This snippet creates a DataFrame, fills the missing value with the column median, and prints summary statistics. It is a common cleaning step in exploratory analysis before grouping; it does not call groupby itself.
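The same median fill can be done per group with groupby and transform. The sketch below assumes a region column, which is not in the original data and is added only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a "region" column is added for illustration.
df = pd.DataFrame({
    "region": ["east", "east", "east", "west", "west", "west"],
    "sales": [100, 150, np.nan, 200, np.nan, 400],
})

# Median of each region, broadcast back to every row of that region.
group_median = df.groupby("region")["sales"].transform("median")

# Fill each missing value with its own region's median.
df["sales"] = df["sales"].fillna(group_median)

# Per-region summary after filling.
summary = df.groupby("region")["sales"].agg(["count", "mean"])
print(summary)
```

Filling with the group median rather than the overall median keeps each region's imputed values close to that region's typical level.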