Python Data Analysis with Pandas — From Raw Data to Insights
Pandas is the backbone of data analysis in Python. Whether you're cleaning messy CSV exports, analyzing sales data, or building dashboards — pandas is where the work happens. This guide takes you from loading raw data to producing actionable insights, with patterns you'll use daily.
Setup and Loading Data
Install
pip install pandas numpy openpyxl sqlalchemy
Loading from different sources
import pandas as pd
# CSV — the most common
df = pd.read_csv("sales.csv")
# CSV with options
df = pd.read_csv(
"sales.csv",
parse_dates=["date"], # auto-parse date columns
dtype={"zip_code": str}, # force string (don't lose leading zeros)
na_values=["N/A", "null", ""], # treat as NaN
usecols=["date", "product", "amount", "region"], # only load what you need
)
# Excel
df = pd.read_excel("report.xlsx", sheet_name="Q1 Sales")
# JSON
df = pd.read_json("api_response.json")
# From SQL database
from sqlalchemy import create_engine
engine = create_engine("sqlite:///analytics.db")
df = pd.read_sql("SELECT * FROM orders WHERE year = 2026", engine)
# From clipboard (great for quick analysis)
df = pd.read_clipboard()
# From dict (for testing / building manually)
df = pd.DataFrame({
"product": ["Widget A", "Widget B", "Widget C"],
"price": [29.99, 49.99, 19.99],
"units_sold": [150, 80, 300],
})
💡 Large files: For files >1GB, use pd.read_csv(..., chunksize=10000) to process in chunks, or switch to polars (faster for big data). For SQL, always filter in the query — don't load everything.
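A minimal chunked-aggregation sketch (the filename and column names are illustrative):
import pandas as pd
# Aggregate a large CSV without loading it all into memory
totals = None
for chunk in pd.read_csv("big_sales.csv", chunksize=100_000):
    partial = chunk.groupby("region")["amount"].sum()
    totals = partial if totals is None else totals.add(partial, fill_value=0)
print(totals.sort_values(ascending=False))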
First Look: Understanding Your Data
# Shape and types
print(f"Rows: {len(df):,}, Columns: {df.shape[1]}")
print(df.dtypes)
# Quick stats
df.describe() # numeric summary (mean, std, min, max, quartiles)
df.describe(include="O") # categorical summary (count, unique, top, freq)
# Missing values
print(df.isnull().sum())
print(f"Missing: {df.isnull().sum().sum()} total ({df.isnull().mean().mean():.1%})")
# Sample rows
df.head(10)
df.sample(5) # random sample — better than head for spotting patterns
df.tail(3)
# Unique values
df["region"].value_counts()
df["product"].nunique() # count of unique values
# Memory usage
df.info(memory_usage="deep")
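These first-look checks are worth wrapping in a small reusable helper. A sketch (quick_profile is my name, not a pandas API):
def quick_profile(df: pd.DataFrame) -> None:
    """Print a compact first look at any dataframe."""
    print(f"Rows: {len(df):,}, Columns: {df.shape[1]}")
    print(f"Memory: {df.memory_usage(deep=True).sum() / 1e6:.1f} MB")
    missing = df.isnull().sum()
    if missing.any():
        print("Missing values:")
        print(missing[missing > 0].sort_values(ascending=False))
    print("Dtype counts:")
    print(df.dtypes.value_counts())
quick_profile(df)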
Data Cleaning
Real data is messy. Here's how to fix the most common problems.
Missing values
# Check what's missing
df.isnull().sum().sort_values(ascending=False)
# Drop rows where critical columns are missing
df = df.dropna(subset=["product", "amount"])
# Fill numeric with median (better than mean — outlier-resistant)
df["price"] = df["price"].fillna(df["price"].median())
# Fill categorical with mode
df["region"] = df["region"].fillna(df["region"].mode()[0])
# Forward-fill time series (carry last known value)
df["stock_price"] = df["stock_price"].ffill()
# Interpolate (linear between known points)
df["temperature"] = df["temperature"].interpolate(method="linear")
Type conversion
# Strings to datetime
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d")
df["date"] = pd.to_datetime(df["date"], format="mixed") # mixed formats
# Strings to numeric (handles "$1,234.56")
df["revenue"] = (
df["revenue"]
.str.replace("$", "", regex=False)
.str.replace(",", "", regex=False)
.astype(float)
)
# Categorical (saves memory, enables ordering)
df["priority"] = pd.Categorical(
df["priority"],
categories=["low", "medium", "high"],
ordered=True,
)
# Downcast numeric types (save memory)
df["quantity"] = pd.to_numeric(df["quantity"], downcast="integer")
String cleaning
# Standardize text
df["name"] = df["name"].str.strip().str.title()
df["email"] = df["email"].str.lower().str.strip()
df["phone"] = df["phone"].str.replace(r"[^\d+]", "", regex=True)
# Extract patterns
df["domain"] = df["email"].str.extract(r"@(\w+\.\w+)")
df["year"] = df["date_string"].str.extract(r"(\d{4})")
# Replace values
df["status"] = df["status"].replace({
"cancelled": "canceled",
"ACTIVE": "active",
"inactive ": "inactive",
})
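After cleaning, it helps to flag rows that still look wrong. A sketch using a deliberately simple email pattern (tighten it for your own data):
# Flag rows whose email still doesn't look valid after cleaning
email_pattern = r"^[\w.+-]+@[\w-]+\.[\w.-]+$"
df["valid_email"] = df["email"].str.match(email_pattern, na=False)
print(f"Suspect emails: {(~df['valid_email']).sum()}")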
Deduplication
# Find duplicates
dupes = df[df.duplicated(subset=["email", "date"], keep=False)]
print(f"Found {len(dupes)} duplicate rows")
# Remove duplicates (keep first occurrence)
df = df.drop_duplicates(subset=["email", "date"], keep="first")
# Remove based on all columns
df = df.drop_duplicates()
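Keeping the first occurrence is arbitrary. To keep the most recent record per key instead, sort before deduplicating (assuming a date column):
# Keep the newest row per email
df = (
    df.sort_values("date")
    .drop_duplicates(subset=["email"], keep="last")
)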
Filtering and Selecting
# Boolean filtering
high_value = df[df["amount"] > 1000]
recent = df[df["date"] >= "2026-01-01"]
# Multiple conditions (use & for AND, | for OR, ~ for NOT)
target = df[
(df["region"] == "LATAM") &
(df["amount"] > 500) &
(~df["product"].str.contains("legacy", case=False))
]
# .query() — cleaner for complex filters
target = df.query("region == 'LATAM' and amount > 500")
# isin — filter by list
top_products = df[df["product"].isin(["Widget A", "Widget B", "Premium"])]
# between
mid_range = df[df["price"].between(20, 100)]
# String matching
python_jobs = df[df["title"].str.contains("python", case=False, na=False)]
# Select columns
subset = df[["product", "amount", "date"]]
subset = df.loc[:, "product":"amount"] # slice by column name (inclusive of both ends)
# Conditional assignment
df.loc[df["amount"] > 10000, "tier"] = "enterprise"
df.loc[df["amount"] <= 10000, "tier"] = "standard"
# np.where for vectorized if-else
import numpy as np
df["size"] = np.where(df["amount"] > 5000, "large", "small")
Transformations
Creating new columns
# Arithmetic
df["total"] = df["price"] * df["quantity"]
df["margin"] = (df["revenue"] - df["cost"]) / df["revenue"] * 100
# Date parts
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["weekday"] = df["date"].dt.day_name()
df["quarter"] = df["date"].dt.quarter
df["is_weekend"] = df["date"].dt.dayofweek >= 5
# Binning continuous values
df["price_bucket"] = pd.cut(
df["price"],
bins=[0, 25, 50, 100, float("inf")],
labels=["budget", "mid", "premium", "luxury"],
)
# Ranking
df["rank"] = df["revenue"].rank(ascending=False, method="dense")
# Cumulative calculations (sort by date first so running values follow time order)
df = df.sort_values("date")
df["cumulative_sales"] = df["amount"].cumsum()
df["running_avg"] = df["amount"].expanding().mean()
# Rolling window (7-day moving average)
df["ma_7d"] = df["amount"].rolling(window=7).mean()
Apply for custom logic
# apply — for complex per-row logic
def categorize_customer(row):
if row["total_spent"] > 10000:
return "VIP"
elif row["orders"] > 20:
return "loyal"
elif row["last_order_days"] > 90:
return "at_risk"
return "regular"
df["segment"] = df.apply(categorize_customer, axis=1)
# Vectorized is 10-100x faster than apply — prefer it
# Instead of apply, use np.select for multi-condition:
conditions = [
df["total_spent"] > 10000,
df["orders"] > 20,
df["last_order_days"] > 90,
]
choices = ["VIP", "loyal", "at_risk"]
df["segment"] = np.select(conditions, choices, default="regular")
⚡ Performance rule: apply() is a Python loop in disguise. For numeric operations, use vectorized pandas/numpy operations. They're 10-100x faster. Use apply() only when logic is too complex to vectorize.
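To see the gap on your own machine, time both versions with the standard library's timeit (exact numbers will vary by hardware and data size):
import timeit
import numpy as np
import pandas as pd
# A throwaway frame just for the benchmark
demo = pd.DataFrame({"a": np.random.rand(100_000), "b": np.random.rand(100_000)})
slow = timeit.timeit(lambda: demo.apply(lambda r: r["a"] + r["b"], axis=1), number=1)
fast = timeit.timeit(lambda: demo["a"] + demo["b"], number=1)
print(f"apply: {slow:.3f}s vs vectorized: {fast:.5f}s ({slow / fast:.0f}x faster)")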
Grouping and Aggregation
This is where pandas really shines — turning raw rows into business insights.
# Basic groupby
by_region = df.groupby("region")["amount"].sum()
by_product = df.groupby("product")["amount"].agg(["sum", "mean", "count"])
# Multiple aggregations
summary = df.groupby("region").agg(
total_revenue=("amount", "sum"),
avg_order=("amount", "mean"),
order_count=("amount", "count"),
unique_products=("product", "nunique"),
first_order=("date", "min"),
last_order=("date", "max"),
).sort_values("total_revenue", ascending=False)
# Group by multiple columns
monthly = df.groupby([df["date"].dt.to_period("M"), "region"]).agg(
revenue=("amount", "sum"),
orders=("amount", "count"),
).reset_index()
# Percentage of total
df["pct_of_total"] = df["amount"] / df["amount"].sum() * 100
# Percentage within group
df["pct_in_region"] = df.groupby("region")["amount"].transform(
lambda x: x / x.sum() * 100
)
# Top N per group
top_3_per_region = (
df.sort_values("amount", ascending=False)
.groupby("region")
.head(3)
)
Pivot tables
# Pivot: rows=region, columns=product, values=revenue
pivot = df.pivot_table(
values="amount",
index="region",
columns="product",
aggfunc="sum",
fill_value=0,
margins=True, # add row/column totals
)
# Cross-tabulation (frequency table)
ct = pd.crosstab(
df["region"],
df["customer_type"],
normalize="index", # show as percentages
)
# Melt (unpivot): wide → long format
long = pd.melt(
df,
id_vars=["date", "region"],
value_vars=["q1_sales", "q2_sales", "q3_sales", "q4_sales"],
var_name="quarter",
value_name="sales",
)
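Going back from long to wide is the reverse operation; a pivot_table sketch on the melted frame above:
wide = long.pivot_table(
    index=["date", "region"],
    columns="quarter",
    values="sales",
    aggfunc="first", # one value per cell expected; "first" keeps it as-is
).reset_index()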
Merging and Joining
# Inner join (only matching rows)
merged = pd.merge(orders, customers, on="customer_id", how="inner")
# Left join (keep all orders, add customer info where available)
merged = pd.merge(orders, customers, on="customer_id", how="left")
# Join on different column names
merged = pd.merge(
orders, products,
left_on="product_code",
right_on="sku",
how="left",
)
# Multiple join keys
merged = pd.merge(orders, inventory, on=["product_id", "warehouse_id"])
# Concatenate (stack vertically)
all_months = pd.concat([jan_df, feb_df, mar_df], ignore_index=True)
# Concatenate with source tracking
all_months = pd.concat(
[jan_df, feb_df, mar_df],
keys=["jan", "feb", "mar"],
names=["month", "row"],
)
# Anti-join: rows in A not in B
orders_without_customer = orders[
~orders["customer_id"].isin(customers["customer_id"])
]
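The same anti-join can be done with merge's indicator flag, which also works for multi-column keys:
# Left-merge, then keep rows that found no match on the right
flagged = orders.merge(
    customers[["customer_id"]], on="customer_id", how="left", indicator=True
)
orders_without_customer = flagged[flagged["_merge"] == "left_only"].drop(columns="_merge")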
Time Series Analysis
# Set datetime index
df = df.set_index("date").sort_index()
# Resample (change frequency)
monthly = df["amount"].resample("M").sum()
weekly = df["amount"].resample("W").mean()
quarterly = df["amount"].resample("Q").agg(["sum", "mean", "count"])
# Period-over-period change (assumes one row per month)
df["mom_change"] = df["revenue"].pct_change() # month-over-month
df["yoy_change"] = df["revenue"].pct_change(12) # year-over-year
# Rolling calculations
df["sma_30"] = df["price"].rolling(30).mean() # 30-day simple moving average
df["ema_30"] = df["price"].ewm(span=30).mean() # exponential moving average
df["volatility"] = df["price"].rolling(30).std() # 30-day volatility
# Shift (lag/lead)
df["prev_month"] = df["revenue"].shift(1)
df["next_month"] = df["revenue"].shift(-1)
# Date range and reindex (fill gaps)
full_range = pd.date_range(df.index.min(), df.index.max(), freq="D")
df = df.reindex(full_range).ffill() # forward-fill missing days
Exporting Results
# CSV
df.to_csv("output.csv", index=False)
# Excel (multiple sheets)
with pd.ExcelWriter("report.xlsx", engine="openpyxl") as writer:
summary.to_excel(writer, sheet_name="Summary")
df.to_excel(writer, sheet_name="Raw Data", index=False)
pivot.to_excel(writer, sheet_name="Pivot")
# JSON
df.to_json("output.json", orient="records", indent=2)
# SQL
df.to_sql("analysis_results", engine, if_exists="replace", index=False)
# Clipboard (paste into spreadsheet)
df.to_clipboard(index=False)
# Markdown (for reports)
print(df.head(10).to_markdown(index=False)) # needs the tabulate package
Real-World Analysis Pattern
Here's a complete analysis workflow — the kind you'd do for a weekly business report:
import pandas as pd
import numpy as np
def analyze_sales(filepath: str) -> dict:
"""Complete sales analysis pipeline."""
# 1. Load and clean
df = pd.read_csv(
filepath,
parse_dates=["order_date"],
dtype={"zip_code": str},
)
df = df.dropna(subset=["order_date", "amount"])
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
df = df[df["amount"] > 0] # remove refunds/errors
# 2. Enrich
df["month"] = df["order_date"].dt.to_period("M")
df["weekday"] = df["order_date"].dt.day_name()
df["is_weekend"] = df["order_date"].dt.dayofweek >= 5
# 3. Key metrics
total_revenue = df["amount"].sum()
avg_order = df["amount"].mean()
median_order = df["amount"].median()
order_count = len(df)
# 4. Monthly trend
monthly = df.groupby("month").agg(
revenue=("amount", "sum"),
orders=("amount", "count"),
aov=("amount", "mean"),
)
monthly["growth"] = monthly["revenue"].pct_change() * 100
# 5. Top performers
top_products = (
df.groupby("product")["amount"]
.agg(["sum", "count"])
.sort_values("sum", ascending=False)
.head(10)
)
top_regions = (
df.groupby("region")["amount"]
.sum()
.sort_values(ascending=False)
)
# 6. Customer segmentation
customer_stats = df.groupby("customer_id").agg(
total_spent=("amount", "sum"),
orders=("amount", "count"),
avg_order=("amount", "mean"),
first_order=("order_date", "min"),
last_order=("order_date", "max"),
)
conditions = [
customer_stats["total_spent"] > customer_stats["total_spent"].quantile(0.9),
customer_stats["orders"] > 10,
customer_stats["orders"] == 1,
]
segments = ["VIP", "loyal", "one-time"]
customer_stats["segment"] = np.select(conditions, segments, default="regular")
# 7. Weekend vs weekday
weekend_pct = df[df["is_weekend"]]["amount"].sum() / total_revenue * 100
return {
"total_revenue": total_revenue,
"order_count": order_count,
"avg_order": avg_order,
"monthly_trend": monthly,
"top_products": top_products,
"top_regions": top_regions,
"customer_segments": customer_stats["segment"].value_counts(),
"weekend_revenue_pct": weekend_pct,
}
# Run
results = analyze_sales("sales_2026.csv")
print(f"Revenue: ${results['total_revenue']:,.2f}")
print(f"Orders: {results['order_count']:,}")
print(f"\nMonthly trend:\n{results['monthly_trend']}")
print(f"\nTop products:\n{results['top_products']}")
print(f"\nCustomer segments:\n{results['customer_segments']}")
Performance Tips
- Use vectorized operations — df["a"] + df["b"] is often 100x faster than df.apply(lambda r: r.a + r.b, axis=1)
- Filter early — reduce dataframe size before heavy operations
- Use categories — df["status"] = df["status"].astype("category") can cut memory 90%+ for low-cardinality columns (see the sketch after this list)
- Avoid iterrows — it's orders of magnitude slower than vectorized code. Use itertuples() if you must iterate
- Use read_csv options — usecols, dtype, and chunksize control memory at load time
- Consider polars — for datasets >1GB, polars is typically 5-10x faster than pandas
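A quick way to verify the category savings on one of your own columns (the column name is illustrative):
# Measure memory before and after converting a low-cardinality column
before = df["status"].memory_usage(deep=True)
df["status"] = df["status"].astype("category")
after = df["status"].memory_usage(deep=True)
print(f"{before:,} -> {after:,} bytes ({1 - after / before:.0%} saved)")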
Related Articles
- Data Visualization with Python — Matplotlib, Seaborn & Plotly — visualize your pandas analysis
- Build a Data Pipeline in Python — ETL from Scratch — automate data loading and transformation
- Python Database Operations — load data from SQL databases
- Python Performance Optimization — speed up your analysis code
- Web Scraping with Python — collect data to analyze
Need custom data analysis or automated reporting? I build Python data pipelines, dashboards, and analytics tools. Reach out on Telegram →