Pandas in 30 Minutes for NUS DSA1101 and Statistics Modules
DSA1101 throws students into pandas in week 3. Most of you have never seen a DataFrame before, and the lecture notes assume you'll figure it out. By week 5 you're expected to do groupbys, merges, and basic plots without help.
This is the 30-minute version. Read once, keep open during your next tutorial, and you'll be ahead of half the cohort.
Why pandas confuses people at first
Pandas is built on top of NumPy, and most pandas weirdness traces back to two ideas:
- A DataFrame is a table where every column is a Series. A Series is a NumPy array with labels. Almost every operation works on whole columns at once, not row-by-row.
- Indexing in pandas is overloaded. df[x] means different things depending on whether x is a string, a list, a slice, or a boolean array. This is why df["age"] returns a column but df[0:5] returns rows.
If you internalise those two ideas, the rest is mostly memorising method names.
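The two ideas above can be seen side by side in a small sketch. The DataFrame here is made up for illustration (the names, ages and heights are not from any DSA1101 dataset); the point is only what each form of df[x] hands back.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Chen", "Devi"],
    "age": [19, 22, 17, 30],
    "height": [1.62, 1.75, 1.70, 1.58],
})

col = df["age"]                # string -> one column, as a Series
cols = df[["age", "height"]]   # list of strings -> several columns, as a DataFrame
rows = df[0:2]                 # slice -> rows by position
older = df[df["age"] >= 21]    # boolean Series -> rows where the mask is True

# Whole-column operation: every age is incremented, no loop needed
age_next_year = df["age"] + 1

print(type(col).__name__, type(cols).__name__)
print(rows.shape, older.shape)
```

If you are unsure which case you are in, printing type(...) and .shape of the result usually settles it.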
Setup, the minimum
import pandas as pd
import numpy as np
# Read a CSV, the most common DSA1101 entry point
df = pd.read_csv("data.csv")
# First look, do this every time
df.head() # first 5 rows
df.shape # (rows, columns)
df.columns # column names
df.dtypes # types per column
df.describe() # summary stats, numeric columns only
df.info() # types + non-null counts, useful for finding missing data
If you only memorise one habit from this whole post, make it df.head() after every transformation. Catching shape changes early saves you from the silent bugs that DSA1101 graders love to find.
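As a rough sketch of that habit, the snippet below checks shape and head before and after one transformation. The CSV text is a hypothetical stand-in for data.csv, read from memory with io.StringIO so the example runs without a file on disk.

```python
import io
import pandas as pd

# Hypothetical CSV contents standing in for data.csv
csv_text = """name,age,country
Ana,19,SG
Ben,22,MY
Chen,,SG
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)   # (rows, columns) before the transformation
print(df.head())

# One transformation: drop rows where age is missing
cleaned = df.dropna(subset=["age"])
print(cleaned.shape)   # check again straight after
print(cleaned.head())
```

Comparing the two shapes tells you immediately how many rows the step removed.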
Selecting columns and rows
The four patterns that cover 95% of selection:
# Single column, returns a Series
ages = df["age"]
# Multiple columns, returns a DataFrame, note the double brackets
subset = df[["age", "height", "weight"]]
# Rows by position (like Python slicing)
first_ten = df.iloc[:10]
# Rows by condition, "boolean masking"
adults = df[df["age"] >= 21]
The fifth pattern, combining row filter with column select, uses .loc:
# Adults, only their age and height columns
df.loc[df["age"] >= 21, ["age", "height"]]
df.loc is label-based (uses column names and index labels), df.iloc is position-based (integer offsets). Mix them up and you get wrong rows or shape errors.
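A minimal sketch of the difference, using a made-up DataFrame whose index labels are deliberately not 0, 1, 2, 3 so that labels and positions disagree:

```python
import pandas as pd

# Hypothetical data with a non-default index to make the difference visible
df = pd.DataFrame(
    {"age": [19, 22, 17, 30], "height": [1.62, 1.75, 1.70, 1.58]},
    index=[10, 20, 30, 40],
)

by_label = df.loc[20]       # the row whose index label is 20
by_position = df.iloc[1]    # the second row, whatever its label is

# Slices differ too: .loc includes the end label, .iloc excludes the end position
loc_slice = df.loc[10:30]   # labels 10, 20 and 30
iloc_slice = df.iloc[0:2]   # positions 0 and 1

# Filtering keeps the original labels, so position 0 is no longer label 0
adults = df[df["age"] >= 21]
first_adult = adults.iloc[0]   # adults.loc[0] would raise KeyError here

print(len(loc_slice), len(iloc_slice))
print(list(adults.index))
```

This is the usual source of "wrong rows" after a filter or a sort: the labels travel with the rows, the positions do not.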
Filtering with multiple conditions
Boolean masks combine with & and |. The brackets are not optional:
# Adults from a specific country
mask = (df["age"] >= 21) & (df["country"] == "SG")
df[mask]
# Either young or local
df[(df["age"] < 18) | (df["country"] == "SG")]
Common mistake: writing the Python keywords and / or instead of the operators & / |. Pandas raises "ValueError: The truth value of a Series is ambiguous". Always use the bitwise operators, with each comparison wrapped in its own parentheses.
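Here is a small sketch of the failure and the fix on made-up data. The parentheses matter because & binds more tightly than >= and ==, so without them Python groups the expression differently from what you meant.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "age": [16, 25, 40, 17],
    "country": ["SG", "SG", "MY", "MY"],
})

# Python's `and` needs a single True/False from a whole Series, which pandas refuses to give
try:
    df[(df["age"] >= 21) and (df["country"] == "SG")]
except ValueError as err:
    print("ValueError:", err)

# The element-wise version, one comparison per pair of parentheses
adults_sg = df[(df["age"] >= 21) & (df["country"] == "SG")]

# ~ negates a mask element-wise
not_sg = df[~(df["country"] == "SG")]

print(adults_sg)
print(not_sg)
```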
GroupBy, the part DSA1101 loves
GroupBy splits the data, applies a function, and combines the results. It's the bread and butter of the module.
# Average age by country
df.groupby("country")["age"].mean()
# Multiple aggregations at once
df.groupby("country")["age"].agg(["mean", "median", "std", "count"])
# Aggregations on multiple columns
df.groupby("country")[["age", "income"]].mean()
# Group by two columns
df.groupby(["country", "gender"])["age"].mean()
The output of a groupby is itself a DataFrame or Series with a meaningful index. If you want a flat result with the group keys back as columns, add .reset_index():
result = df.groupby("country")["age"].mean().reset_index()
# Now you can do result["country"] and result["age"] as normal columns
Forgetting .reset_index() is what trips up most DSA1101 students. The graders often want a flat table for the next step, not a multi-indexed object.
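To make the index behaviour concrete, here is a sketch on made-up data showing where the group keys end up before and after flattening. It also shows as_index=False, a groupby parameter that asks for the flat form directly; whether your tutor prefers one style over the other is not something this post can promise.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "country": ["SG", "SG", "MY", "MY"],
    "gender": ["F", "M", "F", "M"],
    "age": [20, 30, 40, 50],
})

grouped = df.groupby("country")["age"].mean()
print(grouped.index.name)   # the group key lives in the index, not in a column

flat = grouped.reset_index()
print(list(flat.columns))   # the key is back as an ordinary column

# Two keys give a MultiIndex; reset_index flattens both levels at once
flat2 = df.groupby(["country", "gender"])["age"].mean().reset_index()

# as_index=False skips the index step entirely
flat3 = df.groupby("country", as_index=False)["age"].mean()
print(flat3)
```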
Adding new columns
# Simple arithmetic on existing columns
df["bmi"] = df["weight"] / (df["height"] ** 2)
# Conditional column with np.where
df["adult"] = np.where(df["age"] >= 21, "yes", "no")
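# Multi-condition with np.select, the first matching condition wins
conditions = [df["age"] < 13, df["age"] < 21, df["age"] >= 21]
labels = ["child", "teen", "adult"]
# default covers rows where no condition matches, e.g. a missing age
df["age_group"] = np.select(conditions, labels, default="unknown")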
# Apply, slow but flexible, last resort
df["full_name"] = df.apply(lambda row: f"{row['first']} {row['last']}", axis=1)
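Avoid apply if you can vectorise the operation. With axis=1 it calls a Python function once per row, which is often orders of magnitude slower than the vectorised equivalent, and DSA1101's larger datasets can time out under it.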
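For the full_name case specifically, a vectorised version is just string concatenation on whole columns. The sketch below uses made-up names and checks that both routes give the same values; it makes no timing claim.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "first": ["Ana", "Ben", "Chen"],
    "last": ["Lim", "Tan", "Wei"],
})

# Row-by-row: one Python function call per row
slow = df.apply(lambda row: f"{row['first']} {row['last']}", axis=1)

# Vectorised: the + operator works on whole string columns
fast = df["first"] + " " + df["last"]

print(fast.tolist() == slow.tolist())
```

One difference to keep in mind: if either column has a missing value, the + version gives NaN for that row, while the f-string version produces text containing "nan".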
Handling missing data
Real datasets have NaN, and DSA1101 specifically tests whether you handle it correctly:
df.isna().sum() # Count missing per column
df.dropna() # Drop rows with any NaN
df.dropna(subset=["age"]) # Drop only if age is missing
df["age"].fillna(0) # Replace NaN with 0 (returns a new Series, assign it back to keep it)
df["age"].fillna(df["age"].mean()) # Replace with column mean (same, returns a new Series)
The exam favourite: a column has 5% missing values and you're told to "handle it appropriately". Dropping is fine if the values are missing at random. Filling with the mean is fine if the column is numeric and roughly normal. Filling with the mode is fine for a categorical column. The wrong answer is ignoring it.
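The options above might look like this in code. The data is invented, and the median fill is an extra I have added for the skewed case, not something the paragraph above prescribes. Each fill is written to a new column so the results can be compared.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column and one outlier (age 100)
df = pd.DataFrame({
    "age": [20.0, np.nan, 30.0, 100.0],
    "country": ["SG", "SG", None, "MY"],
})

# Fraction of missing values per column, a quick way to see the "5%"
missing_share = df.isna().mean()
print(missing_share)

# fillna returns a new Series, so the result has to be assigned somewhere
df["age_mean"] = df["age"].fillna(df["age"].mean())
df["age_median"] = df["age"].fillna(df["age"].median())   # less affected by the outlier

# mode() returns a Series because there can be ties; [0] takes the first mode
df["country_filled"] = df["country"].fillna(df["country"].mode()[0])

print(df)
```

Note that the original age column still contains its NaN, because none of the fills were assigned back to it.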
Merging DataFrames
The merge method is pandas' SQL JOIN. You'll see it in any DSA1101 dataset that comes split into two CSVs.
# Inner join, default, only keeps rows present in both
combined = pd.merge(students, grades, on="student_id")
# Left join, keeps all rows from students
combined = pd.merge(students, grades, on="student_id", how="left")
# Different column names per side
combined = pd.merge(students, grades,
left_on="id", right_on="student_id")
Always check combined.shape after a merge. If your row count exploded, you have duplicate keys on at least one side (a one-to-many or many-to-many join). If it shrank, the join dropped rows that had no match on the other side, which an inner join does by default.
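Beyond eyeballing the shape, pd.merge has two parameters that help with this check: indicator=True adds a _merge column saying where each row came from, and validate raises an error when the keys are not as unique as you claimed. The sketch below uses invented students and grades tables.

```python
import pandas as pd

# Hypothetical tables: student 1 has two grades, student 3 has none, grade for id 4 has no student
students = pd.DataFrame({"student_id": [1, 2, 3], "name": ["Ana", "Ben", "Chen"]})
grades = pd.DataFrame({"student_id": [1, 1, 2, 4], "grade": ["A", "B", "C", "A"]})

inner = pd.merge(students, grades, on="student_id")
print(students.shape, inner.shape)   # compare row counts before and after

# indicator=True labels each row as both, left_only or right_only
left = pd.merge(students, grades, on="student_id", how="left", indicator=True)
unmatched = left[left["_merge"] == "left_only"]
print(unmatched)

# validate turns a silent duplicate-key problem into an error
try:
    pd.merge(students, grades, on="student_id", validate="one_to_one")
except pd.errors.MergeError as err:
    print("MergeError:", err)
```

Here the inner join both duplicates student 1 and drops student 3, so the row count alone (3 in, 3 out) would have hidden two problems.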
Sorting and ranking
# Sort by one column, descending
df.sort_values("age", ascending=False)
# Sort by multiple columns, second one breaks ties
df.sort_values(["country", "age"], ascending=[True, False])
# Top 10 by some metric
df.nlargest(10, "income")
df.nsmallest(10, "age")
nlargest is faster than sort_values followed by head for big datasets, because it does not need to sort every row. Remember the difference; the exam might test it.
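A quick sketch showing that the two approaches select the same rows. The incomes are invented and all distinct; with tied values the order of the tied rows may differ between the two methods.

```python
import pandas as pd

# Hypothetical toy data with no tied incomes
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Chen", "Devi", "Eli"],
    "income": [3200, 5100, 4700, 2900, 6000],
})

top3 = df.nlargest(3, "income")
top3_sorted = df.sort_values("income", ascending=False).head(3)

print(top3["name"].tolist() == top3_sorted["name"].tolist())
```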
Plotting, the bare minimum
DSA1101 expects matplotlib via pandas, not seaborn. The defaults are ugly but functional:
import matplotlib.pyplot as plt
# Histogram of one column
df["age"].hist(bins=20)
plt.xlabel("Age")
plt.ylabel("Count")
plt.title("Age distribution")
plt.show()
# Scatter of two columns
df.plot.scatter(x="age", y="income")
plt.show()
# Bar chart from groupby
df.groupby("country")["income"].mean().plot.bar()
plt.show()
# Line plot, common for time series
df.plot(x="date", y="value")
plt.show()
Always label your axes and add a title. Marks are awarded for "is the plot self-explanatory". Default unlabelled plots lose easy points.
The 5 mistakes I see in DSA1101 submissions
- Iterating over rows with for loops. Slow and error-prone. Use vectorised operations, or apply if you must.
- Forgetting .reset_index() after groupby. The next operation expects a flat DataFrame and silently breaks.
- Chained indexing like df[df["age"] > 21]["income"] = 0. Pandas warns you about this, so listen. Use df.loc[df["age"] > 21, "income"] = 0 instead.
- Ignoring NaN. A .mean() over a column with NaN skips them by default, but other operations don't. Check df.isna().sum() first.
- Trusting df.head() to represent the full dataset. The first 5 rows are only the top of the file, and if the file is sorted they can look nothing like the rest. Use df.sample(10) if you want a real look.
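Two of the mistakes above, the chained assignment and the NaN handling, are easy to demonstrate. This is a sketch on invented data, not taken from any actual submission.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "age": [18, 25, 40],
    "income": [1000.0, np.nan, 3000.0],
})

# One .loc call selects rows and column together, so the assignment changes df itself
mask = df["age"] > 21
df.loc[mask, "income"] = 0
print(df)

# NaN handling depends on the operation
scores = pd.Series([1.0, np.nan, 3.0])
print(scores.mean())               # pandas skips NaN by default
print(scores.mean(skipna=False))   # NaN propagates when told not to skip
print(scores.to_numpy().mean())    # plain NumPy mean does not skip NaN
```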
Where this falls short
This guide covers the procedural bits. DSA1101 also tests:
- Statistical interpretation. Knowing what a histogram means, not just how to draw one.
- Choice of method. When to drop NaN versus impute, when to use mean versus median, when log-transform helps.
- Communicating results. Your final answer matters as much as your code.
Pandas is a tool. The module grades whether you use it for the right reason.
When to ask for help
If after a tutorial session you can't:
- Reproduce the example from scratch with the slides closed
- Explain what groupby().agg() returns and why
- Read a stack trace and understand which line broke
You're not behind, you're stuck on a foundational concept. Send me your code on Telegram with the specific exercise you're trying. A 60-minute session usually fixes the foundation, and the rest of the module flows easier from there.