Pandas in 30 Minutes for NUS DSA1101 and Statistics Modules
DSA1101 throws students into pandas in week 3. Most of you have never seen a DataFrame before, and the lecture notes assume you'll figure it out. By week 5 you're expected to do groupbys, merges, and basic plots without help.
This is the 30-minute version. Read once, keep open during your next tutorial, and you'll be ahead of half the cohort.
Why pandas confuses people at first
Pandas is built on top of NumPy, and most pandas weirdness traces back to two ideas:
- A DataFrame is a table where every column is a Series. A Series is a NumPy array with labels. Almost every operation works on whole columns at once, not row-by-row.
- Indexing in pandas is overloaded.
df[x] means different things depending on whether x is a string, a list, a slice, or a boolean array. This is why df["age"] returns a column but df[0:5] returns rows.
If you internalise those two ideas, the rest is mostly memorising method names.
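Both behaviours are easy to see on a toy frame (the data here is invented purely for the demo):

```python
import pandas as pd

# Tiny made-up frame, just to show what each indexing form returns
df = pd.DataFrame({"age": [21, 34, 18], "country": ["SG", "MY", "SG"]})

col = df["age"]                 # string key -> one column, a Series
rows = df[0:2]                  # slice -> rows, a DataFrame
cols = df[["age", "country"]]   # list of names -> several columns, a DataFrame

print(type(col).__name__)   # Series
print(rows.shape)           # (2, 2)
```

Same bracket syntax, three different results, decided entirely by what you put inside the brackets.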
Setup, the minimum
import pandas as pd
import numpy as np
# Read a CSV, the most common DSA1101 entry point
df = pd.read_csv("data.csv")
# First look, do this every time
df.head() # first 5 rows
df.shape # (rows, columns)
df.columns # column names
df.dtypes # types per column
df.describe() # summary stats, numeric columns only
df.info() # types + non-null counts, useful for finding missing data
If you only memorise one habit from this whole post, make it df.head() after every transformation. Catching shape changes early saves you from the silent bugs that DSA1101 graders love to find.
Selecting columns and rows
The four patterns that cover 95% of selection:
# Single column, returns a Series
ages = df["age"]
# Multiple columns, returns a DataFrame, note the double brackets
subset = df[["age", "height", "weight"]]
# Rows by position (like Python slicing)
first_ten = df.iloc[:10]
# Rows by condition, "boolean masking"
adults = df[df["age"] >= 21]
The fifth pattern, combining row filter with column select, uses .loc:
# Adults, only their age and height columns
df.loc[df["age"] >= 21, ["age", "height"]]
df.loc is label-based (uses column names and index labels), df.iloc is position-based (integer offsets). Mix them up and you get wrong rows or shape errors.
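The difference only bites once the index labels stop matching positions. A minimal sketch, with index labels invented for the demo:

```python
import pandas as pd

# Hypothetical frame whose index labels are not 0, 1, 2, ...
df = pd.DataFrame({"age": [25, 30, 35]}, index=[10, 20, 30])

by_label = df.loc[10, "age"]   # .loc uses the index *label* 10 -> 25
by_pos = df.iloc[0]["age"]     # .iloc uses *position* 0 -> also 25, for a different reason

# Slicing differs too: label slices include the endpoint
inclusive = df.loc[10:20]      # rows labelled 10 and 20 -> 2 rows
exclusive = df.iloc[0:1]       # position 0 only -> 1 row

print(by_label, len(inclusive), len(exclusive))  # 25 2 1
```

After a filter or a groupby your index is rarely 0, 1, 2, ... any more, which is exactly when mixing the two starts returning the wrong rows.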
Filtering with multiple conditions
Boolean masks combine with & and |. The brackets are not optional:
# Adults from a specific country
mask = (df["age"] >= 21) & (df["country"] == "SG")
df[mask]
# Either young or local
df[(df["age"] < 18) | (df["country"] == "SG")]
Common mistake: writing and or or instead of & or |. Pandas raises a confusing error about ambiguous truth value. Always use the bitwise operators with parentheses.
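You can see the error for yourself on a two-row frame (invented here just to trigger it):

```python
import pandas as pd

df = pd.DataFrame({"age": [15, 25]})  # tiny made-up frame

# Correct: bitwise & with parentheses
ok = df[(df["age"] > 10) & (df["age"] < 30)]

# Wrong: Python's `and` tries to collapse a whole Series into one bool
try:
    df[(df["age"] > 10) and (df["age"] < 30)]
except ValueError as e:
    print(e)  # "The truth value of a Series is ambiguous. ..."
```

Once you've seen that message once, you'll recognise it instantly in your own tracebacks.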
GroupBy, the part DSA1101 loves
GroupBy splits the data, applies a function, and combines the results. It's the bread and butter of the module.
# Average age by country
df.groupby("country")["age"].mean()
# Multiple aggregations at once
df.groupby("country")["age"].agg(["mean", "median", "std", "count"])
# Aggregations on multiple columns
df.groupby("country")[["age", "income"]].mean()
# Group by two columns
df.groupby(["country", "gender"])["age"].mean()
The output of a groupby is itself a DataFrame or Series with a meaningful index. If you want a flat result with the group keys back as columns, add .reset_index():
result = df.groupby("country")["age"].mean().reset_index()
# Now you can do result["country"] and result["age"] as normal columns
.reset_index() is what trips up most DSA1101 students. The graders often want a flat table for the next step, not a multi-indexed object.
Adding new columns
# Simple arithmetic on existing columns
df["bmi"] = df["weight"] / (df["height"] ** 2)
# Conditional column with np.where
df["adult"] = np.where(df["age"] >= 21, "yes", "no")
# Multi-condition with np.select
conditions = [df["age"] < 13, df["age"] < 21, df["age"] >= 21]
labels = ["child", "teen", "adult"]
df["age_group"] = np.select(conditions, labels, default="unknown")
# default catches rows no condition matches, e.g. a NaN age
# Apply, slow but flexible, last resort
df["full_name"] = df.apply(lambda row: f"{row['first']} {row['last']}", axis=1)
Avoid apply if you can vectorise the operation. It's roughly 100x slower and DSA1101's larger datasets will time out under it.
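A quick sanity check that the vectorised version really does produce the same thing, on names invented for the demo:

```python
import pandas as pd

# Toy names, made up for the comparison
df = pd.DataFrame({"first": ["Ada", "Alan"], "last": ["Lovelace", "Turing"]})

# Vectorised: string concatenation on whole columns at once
df["full_name"] = df["first"] + " " + df["last"]

# Same result via apply, row by row -- noticeably slower on large frames
via_apply = df.apply(lambda row: f"{row['first']} {row['last']}", axis=1)

print((df["full_name"] == via_apply).all())  # True
```

Whenever an apply only does arithmetic or string concatenation on columns, the column-level version is the one to hand in.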
Handling missing data
Real datasets have NaN, and DSA1101 specifically tests whether you handle it correctly:
df.isna().sum() # Count missing per column
df.dropna() # Drop rows with any NaN
df.dropna(subset=["age"]) # Drop only if age is missing
df["age"].fillna(0) # Replace NaN with 0
df["age"].fillna(df["age"].mean()) # Replace with column mean
The exam favourite: a column has 5% missing values and you're told to "handle it appropriately". Dropping is fine if random. Filling with mean is fine if numeric and roughly normal. Filling with mode is fine for categorical. The wrong answer is ignoring it.
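Putting those rules together on an invented dataset with one numeric and one categorical gap:

```python
import pandas as pd
import numpy as np

# Made-up data: a missing age (numeric) and a missing country (categorical)
df = pd.DataFrame({
    "age": [21, np.nan, 30, 25],
    "country": ["SG", "SG", None, "MY"],
})

# Numeric and roughly symmetric -> fill with the column mean
df["age"] = df["age"].fillna(df["age"].mean())

# Categorical -> fill with the mode (most frequent value)
df["country"] = df["country"].fillna(df["country"].mode()[0])

print(df.isna().sum().sum())  # 0 missing values remain
```

The point is matching the fill strategy to the column type, and being able to say one sentence about why you chose it.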
Merging DataFrames
The merge method is pandas' SQL JOIN. You'll see it in any DSA1101 dataset that comes split into two CSVs.
# Inner join, default, only keeps rows present in both
combined = pd.merge(students, grades, on="student_id")
# Left join, keeps all rows from students
combined = pd.merge(students, grades, on="student_id", how="left")
# Different column names per side
combined = pd.merge(students, grades,
left_on="id", right_on="student_id")
Always check combined.shape after a merge. If your row count exploded, you have duplicate keys and a many-to-many join. If it shrunk, you lost data on a join that shouldn't have lost any.
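A sketch of both checks on a made-up split dataset, including the validate= option pandas offers for catching bad key relationships early:

```python
import pandas as pd

# Invented split dataset: one row per student, several grades each
students = pd.DataFrame({"student_id": [1, 2, 3], "name": ["An", "Bee", "Cai"]})
grades = pd.DataFrame({"student_id": [1, 1, 2], "score": [80, 90, 70]})

combined = pd.merge(students, grades, on="student_id", how="left")

# Duplicate keys on the grades side duplicate student rows: 3 students -> 4 rows
print(combined.shape)  # (4, 3)

# validate= makes pandas raise immediately if the relationship is wrong
pd.merge(students, grades, on="student_id", validate="one_to_many")  # passes here
```

Here the row count growing from 3 to 4 is expected (student 1 has two grades); the habit is noticing when it isn't.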
Sorting and ranking
# Sort by one column, descending
df.sort_values("age", ascending=False)
# Sort by multiple columns, second one breaks ties
df.sort_values(["country", "age"], ascending=[True, False])
# Top 10 by some metric
df.nlargest(10, "income")
df.nsmallest(10, "age")
nlargest is faster than sort_values followed by head for big datasets. Remember the difference, the exam might test it.
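The two give identical rows, which you can verify on a small invented column (on real data the speed gap is the point):

```python
import pandas as pd

# Tiny made-up frame; the equivalence is what matters here
df = pd.DataFrame({"income": [50, 90, 70, 30, 80]})

top3_fast = df.nlargest(3, "income")
top3_slow = df.sort_values("income", ascending=False).head(3)

# Same rows either way; nlargest just avoids sorting the whole column
print(top3_fast["income"].tolist())  # [90, 80, 70]
```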
Plotting, the bare minimum
DSA1101 expects matplotlib via pandas, not seaborn. The defaults are ugly but functional:
import matplotlib.pyplot as plt
# Histogram of one column
df["age"].hist(bins=20)
plt.xlabel("Age")
plt.ylabel("Count")
plt.title("Age distribution")
plt.show()
# Scatter of two columns
df.plot.scatter(x="age", y="income")
plt.show()
# Bar chart from groupby
df.groupby("country")["income"].mean().plot.bar()
plt.show()
# Line plot, common for time series
df.plot(x="date", y="value")
plt.show()
Always label your axes and add a title. Marks are awarded for "is the plot self-explanatory". Default unlabelled plots lose easy points.
The 5 mistakes I see in DSA1101 submissions
- Iterating over rows with for loops. Slow, error-prone. Use vectorised operations, or apply if you must.
- Forgetting .reset_index() after groupby. The next operation expects a flat DataFrame and silently breaks.
- Chained indexing like df[df["age"] > 21]["income"] = 0. Pandas warns you about this, listen. Use df.loc[mask, "income"] = 0 instead.
- Ignoring NaN. A .mean() over a column with NaN skips them by default, but other operations don't. Check isna().sum() first.
- Trusting df.head() to represent the full dataset. The first 5 rows can look tidy while the rest isn't. Use df.sample(10) if you want a real look.
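The chained-indexing mistake in particular deserves a concrete before/after, on a toy frame invented for the demo:

```python
import pandas as pd

# Made-up frame to show the chained-indexing trap
df = pd.DataFrame({"age": [25, 17, 30], "income": [50, 10, 60]})

# Wrong: df[df["age"] > 21]["income"] = 0 writes to a temporary copy,
# so the original frame may be left untouched (pandas warns about this).

# Right: one .loc call does the filter and the assignment together
df.loc[df["age"] > 21, "income"] = 0

print(df["income"].tolist())  # [0, 10, 0]
```

One bracket pair, one assignment: if you ever write `][` on the left of an `=`, stop and rewrite it with .loc.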
Where this falls short
This guide covers the procedural bits. DSA1101 also tests:
- Statistical interpretation. Knowing what a histogram means, not just how to draw one.
- Choice of method. When to drop NaN versus impute, when to use mean versus median, when log-transform helps.
- Communicating results. Your final answer matters as much as your code.
Pandas is a tool. The module grades whether you use it for the right reason.
When to ask for help
If after a tutorial session you can't:
- Reproduce the example from scratch with the slides closed
- Explain what groupby().agg() returns and why
- Read a stack trace and understand which line broke
You're not behind, you're stuck on a foundational concept. Send me your code on Telegram with the specific exercise you're trying. A 60-minute session usually fixes the foundation, and the rest of the module flows easier from there.