How to use groupby for aggregations

· Category: Data Science

Short answer

Groupby splits a DataFrame into groups of rows that share the same values in one or more key columns, applies a computation within each group, and combines the per-group results into a single summary.
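
As a minimal sketch of this split-apply-combine idea, the snippet below computes the same per-group sum twice: once with an explicit loop over the key values, and once with groupby. The column names team and points are made up for illustration.

```python
import pandas as pd

# Hypothetical data: the column names are illustrative only.
df = pd.DataFrame({
    "team": ["a", "b", "a", "b", "a"],
    "points": [1, 2, 3, 4, 5],
})

# Manual version: split the rows by key, sum each piece, combine the results.
manual = {}
for key in df["team"].unique():
    manual[key] = df.loc[df["team"] == key, "points"].sum()
manual = pd.Series(manual, name="points")

# The same computation expressed as a single groupby call.
grouped = df.groupby("team")["points"].sum()

print(grouped)
```

Both versions yield one value per distinct key; groupby simply does the splitting and combining for you.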

Steps

  1. Choose grouping columns that define the meaningful partitions for your analysis.
  2. Select aggregation functions such as sum, mean, count, or custom lambdas.
  3. Apply multiple aggregations simultaneously with agg for comprehensive summaries.
  4. Use transform to return results aligned with the original DataFrame index.
  5. Call reset_index after aggregation to turn the group keys from index levels back into regular columns.
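
The steps above can be sketched roughly as follows. The data and the column names (region, product, sales, region_total) are assumptions chosen for illustration, not part of any particular dataset.

```python
import pandas as pd

# Hypothetical data: the column names are illustrative only.
df = pd.DataFrame({
    "region": ["east", "east", "west", "west"],
    "product": ["x", "y", "x", "x"],
    "sales": [100.0, 150.0, 200.0, 300.0],
})

# Steps 1-3: group by two columns and apply several aggregations at once.
summary = df.groupby(["region", "product"])["sales"].agg(["sum", "mean", "count"])

# Step 4: transform returns one value per original row, aligned to df.index,
# so the result can be assigned straight back as a new column.
df["region_total"] = df.groupby("region")["sales"].transform("sum")

# Step 5: reset_index moves the (region, product) index levels into columns.
flat = summary.reset_index()

print(flat)
```

Note that agg shrinks the data to one row per group, while transform keeps the original row count.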

Tips

  • Use as_index=False to keep grouping columns as regular columns.
  • Apply named aggregations for cleaner column names in the output.
  • Use filter to keep or drop whole groups based on group-level properties.
  • Use size instead of count when you want the number of rows per group including NaNs; count only counts non-null values.
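
A short sketch of these tips, using made-up data. The output column names total_sales and avg_sales are hypothetical labels chosen here for the named aggregations.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing sales value.
df = pd.DataFrame({
    "region": ["east", "east", "west", "west", "north"],
    "sales": [100.0, np.nan, 200.0, 300.0, 50.0],
})

# as_index=False keeps "region" as a regular column; named aggregation
# (new_name=(column, function)) sets the output column names directly.
named = df.groupby("region", as_index=False).agg(
    total_sales=("sales", "sum"),
    avg_sales=("sales", "mean"),
)

# filter keeps all rows of the groups for which the function returns True.
multi_row = df.groupby("region").filter(lambda g: len(g) >= 2)

# size counts every row in a group; count skips rows where the value is NaN.
sizes = df.groupby("region").size()
counts = df.groupby("region")["sales"].count()

print(named)
print(sizes["east"], counts["east"])
```

Here the east group has two rows but only one non-null sales value, which is exactly the case where size and count disagree.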

Common issues

  • Slow performance when grouping on high-cardinality columns.
  • Unexpected multi-index structures complicating downstream operations.
  • Missing groups because rows were filtered out before aggregation, or because rows with NaN keys are dropped by default.
  • Memory spikes from intermediate group objects in complex pipelines.
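
The sketch below illustrates possible workarounds for two of these issues: multi-level column labels and missing groups. It assumes pandas 1.1 or later for the dropna parameter. The data, the column names, and the all_regions list are made up for illustration.

```python
import pandas as pd

# Hypothetical data; the last row has a missing grouping key.
df = pd.DataFrame({
    "region": ["east", "east", "west", None],
    "sales": [100.0, 150.0, 200.0, 50.0],
    "units": [1, 2, 3, 4],
})

# Several functions over several columns give two-level column labels.
wide = df.groupby("region").agg({"sales": ["sum", "mean"], "units": ["sum"]})
# One way to flatten them: join each (column, function) pair with "_".
wide.columns = ["_".join(col) for col in wide.columns]

# Rows with a missing key are dropped by default; dropna=False keeps them.
default_groups = df.groupby("region")["sales"].sum()
with_missing = df.groupby("region", dropna=False)["sales"].sum()

# Filtering rows first can remove a whole group from the result.
filtered = df[df["sales"] > 160].groupby("region")["sales"].sum()
# If the full set of keys is known, reindex restores the missing groups.
all_regions = ["east", "west"]  # assumed to be known up front
restored = filtered.reindex(all_regions, fill_value=0.0)

print(wide)
print(restored)
```

Performance and memory behaviour depend heavily on the data, so they are not shown here; profiling on a realistic sample is usually the safer guide.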

Example

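import pandas as pd
import numpy as np

df = pd.DataFrame({
    'region': ['east', 'east', 'west', 'west'],
    'sales': [100, 150, 200, np.nan],
})
# Fill the missing value with the median of its own group.
df['sales'] = df['sales'].fillna(df.groupby('region')['sales'].transform('median'))
# Summarize sales per region with several aggregations at once.
print(df.groupby('region')['sales'].agg(['sum', 'mean', 'count']))

This snippet creates a DataFrame, fills a missing value with the median of its group using transform, and prints per-region summary statistics using agg.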