Pandas in 30 Minutes for NUS DSA1101 and Statistics Modules
DSA1101 throws students into pandas in week 3. Most of you have never seen a DataFrame before, and the lecture notes assume you'll figure it out. By week 5 you're expected to do groupbys, merges, and basic plots without help.
This is the 30-minute version. Read once, keep open during your next tutorial, and you'll be ahead of half the cohort.
Why pandas confuses people at first
Pandas is built on top of NumPy, and most pandas weirdness traces back to two ideas:
- A DataFrame is a table where every column is a Series. A Series is a NumPy array with labels. Almost every operation works on whole columns at once, not row-by-row.
- Indexing in pandas is overloaded.
df[x] means different things depending on whether x is a string, a list, a slice, or a boolean array. This is why df["age"] returns a column but df[0:5] returns rows.
If you internalise those two ideas, the rest is mostly memorising method names.
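Both behaviours are easy to see on a toy frame (the data here is invented purely for the demo):

```python
import pandas as pd

# Tiny made-up frame, just to show what each indexing form returns
df = pd.DataFrame({"age": [21, 34, 18], "country": ["SG", "MY", "SG"]})

col = df["age"]                 # string key -> one column, a Series
rows = df[0:2]                  # slice -> rows, a DataFrame
cols = df[["age", "country"]]   # list of names -> several columns, a DataFrame

print(type(col).__name__)   # Series
print(rows.shape)           # (2, 2)
```

Same bracket syntax, three different results, decided entirely by what you put inside the brackets.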
Setup, the minimum
import pandas as pd
import numpy as np
# Read a CSV, the most common DSA1101 entry point
df = pd.read_csv("data.csv")
# First look, do this every time
df.head() # first 5 rows
df.shape # (rows, columns)
df.columns # column names
df.dtypes # types per column
df.describe() # summary stats, numeric columns only
df.info() # types + non-null counts, useful for finding missing data
If you only memorise one habit from this whole post, make it df.head() after every transformation. Catching shape changes early saves you from the silent bugs that DSA1101 graders love to find.
Selecting columns and rows
The four patterns that cover 95% of selection:
# Single column, returns a Series
ages = df["age"]
# Multiple columns, returns a DataFrame, note the double brackets
subset = df[["age", "height", "weight"]]
# Rows by position (like Python slicing)
first_ten = df.iloc[:10]
# Rows by condition, "boolean masking"
adults = df[df["age"] >= 21]
The fifth pattern, combining row filter with column select, uses .loc:
# Adults, only their age and height columns
df.loc[df["age"] >= 21, ["age", "height"]]
df.loc is label-based (uses column names and index labels), df.iloc is position-based (integer offsets). Mix them up and you get wrong rows or shape errors.
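The difference only bites once the index labels stop matching positions. A minimal sketch, with index labels invented for the demo:

```python
import pandas as pd

# Hypothetical frame whose index labels are not 0, 1, 2, ...
df = pd.DataFrame({"age": [25, 30, 35]}, index=[10, 20, 30])

by_label = df.loc[10, "age"]   # .loc uses the index *label* 10 -> 25
by_pos = df.iloc[0]["age"]     # .iloc uses *position* 0 -> also 25, for a different reason

# Slicing differs too: label slices include the endpoint
inclusive = df.loc[10:20]      # rows labelled 10 and 20 -> 2 rows
exclusive = df.iloc[0:1]       # position 0 only -> 1 row

print(by_label, len(inclusive), len(exclusive))  # 25 2 1
```

After a filter or a groupby your index is rarely 0, 1, 2, ... any more, which is exactly when mixing the two starts returning the wrong rows.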
Filtering with multiple conditions
Boolean masks combine with & and |. The brackets are not optional:
# Adults from a specific country
mask = (df["age"] >= 21) & (df["country"] == "SG")
df[mask]
# Either young or local
df[(df["age"] < 18) | (df["country"] == "SG")]
Common mistake: writing and or or instead of & or |. Pandas raises a confusing error about ambiguous truth value. Always use the bitwise operators with parentheses.
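You can see the error for yourself on a two-row frame (invented here just to trigger it):

```python
import pandas as pd

df = pd.DataFrame({"age": [15, 25]})  # tiny made-up frame

# Correct: bitwise & with parentheses
ok = df[(df["age"] > 10) & (df["age"] < 30)]

# Wrong: Python's `and` tries to collapse a whole Series into one bool
try:
    df[(df["age"] > 10) and (df["age"] < 30)]
except ValueError as e:
    print(e)  # "The truth value of a Series is ambiguous. ..."
```

Once you've seen that message once, you'll recognise it instantly in your own tracebacks.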
GroupBy, the part DSA1101 loves
GroupBy splits the data, applies a function, and combines the results. It's the bread and butter of the module.
# Average age by country
df.groupby("country")["age"].mean()
# Multiple aggregations at once
df.groupby("country")["age"].agg(["mean", "median", "std", "count"])
# Aggregations on multiple columns
df.groupby("country")[["age", "income"]].mean()
# Group by two columns
df.groupby(["country", "gender"])["age"].mean()
The output of a groupby is itself a DataFrame or Series with a meaningful index. If you want a flat result with the group keys back as columns, add .reset_index():
result = df.groupby("country")["age"].mean().reset_index()
# Now you can do result["country"] and result["age"] as normal columns
.reset_index() is what trips up most DSA1101 students. The graders often want a flat table for the next step, not a multi-indexed object.
Adding new columns
# Simple arithmetic on existing columns
df["bmi"] = df["weight"] / (df["height"] ** 2)
# Conditional column with np.where
df["adult"] = np.where(df["age"] >= 21, "yes", "no")
# Multi-condition with np.select
conditions = [df["age"] < 13, df["age"] < 21, df["age"] >= 21]
labels = ["child", "teen", "adult"]
df["age_group"] = np.select(conditions, labels, default="unknown")
# default catches rows no condition matches, e.g. a NaN age
# Apply, slow but flexible, last resort
df["full_name"] = df.apply(lambda row: f"{row['first']} {row['last']}", axis=1)
Avoid apply if you can vectorise the operation. It's roughly 100x slower and DSA1101's larger datasets will time out under it.
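A quick sanity check that the vectorised version really does produce the same thing, on names invented for the demo:

```python
import pandas as pd

# Toy names, made up for the comparison
df = pd.DataFrame({"first": ["Ada", "Alan"], "last": ["Lovelace", "Turing"]})

# Vectorised: string concatenation on whole columns at once
df["full_name"] = df["first"] + " " + df["last"]

# Same result via apply, row by row -- noticeably slower on large frames
via_apply = df.apply(lambda row: f"{row['first']} {row['last']}", axis=1)

print((df["full_name"] == via_apply).all())  # True
```

Whenever an apply only does arithmetic or string concatenation on columns, the column-level version is the one to hand in.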
Handling missing data
Real datasets have NaN, and DSA1101 specifically tests whether you handle it correctly:
df.isna().sum() # Count missing per column
df.dropna() # Drop rows with any NaN
df.dropna(subset=["age"]) # Drop only if age is missing
df["age"].fillna(0) # Replace NaN with 0
df["age"].fillna(df["age"].mean()) # Replace with column mean
The exam favourite: a column has 5% missing values and you're told to "handle it appropriately". Dropping is fine if random. Filling with mean is fine if numeric and roughly normal. Filling with mode is fine for categorical. The wrong answer is ignoring it.
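Putting those rules together on an invented dataset with one numeric and one categorical gap:

```python
import pandas as pd
import numpy as np

# Made-up data: a missing age (numeric) and a missing country (categorical)
df = pd.DataFrame({
    "age": [21, np.nan, 30, 25],
    "country": ["SG", "SG", None, "MY"],
})

# Numeric and roughly symmetric -> fill with the column mean
df["age"] = df["age"].fillna(df["age"].mean())

# Categorical -> fill with the mode (most frequent value)
df["country"] = df["country"].fillna(df["country"].mode()[0])

print(df.isna().sum().sum())  # 0 missing values remain
```

The point is matching the fill strategy to the column type, and being able to say one sentence about why you chose it.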
Merging DataFrames
The merge method is pandas' SQL JOIN. You'll see it in any DSA1101 dataset that comes split into two CSVs.
# Inner join, default, only keeps rows present in both
combined = pd.merge(students, grades, on="student_id")
# Left join, keeps all rows from students
combined = pd.merge(students, grades, on="student_id", how="left")
# Different column names per side
combined = pd.merge(students, grades,
left_on="id", right_on="student_id")
Always check combined.shape after a merge. If your row count exploded, you have duplicate keys and a many-to-many join. If it shrunk, you lost data on a join that shouldn't have lost any.
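A sketch of both checks on a made-up split dataset, including the validate= option pandas offers for catching bad key relationships early:

```python
import pandas as pd

# Invented split dataset: one row per student, several grades each
students = pd.DataFrame({"student_id": [1, 2, 3], "name": ["An", "Bee", "Cai"]})
grades = pd.DataFrame({"student_id": [1, 1, 2], "score": [80, 90, 70]})

combined = pd.merge(students, grades, on="student_id", how="left")

# Duplicate keys on the grades side duplicate student rows: 3 students -> 4 rows
print(combined.shape)  # (4, 3)

# validate= makes pandas raise immediately if the relationship is wrong
pd.merge(students, grades, on="student_id", validate="one_to_many")  # passes here
```

Here the row count growing from 3 to 4 is expected (student 1 has two grades); the habit is noticing when it isn't.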
Sorting and ranking
# Sort by one column, descending
df.sort_values("age", ascending=False)
# Sort by multiple columns, second one breaks ties
df.sort_values(["country", "age"], ascending=[True, False])
# Top 10 by some metric
df.nlargest(10, "income")
df.nsmallest(10, "age")
nlargest is faster than sort_values followed by head for big datasets. Remember the difference, the exam might test it.
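The two give identical rows, which you can verify on a small invented column (on real data the speed gap is the point):

```python
import pandas as pd

# Tiny made-up frame; the equivalence is what matters here
df = pd.DataFrame({"income": [50, 90, 70, 30, 80]})

top3_fast = df.nlargest(3, "income")
top3_slow = df.sort_values("income", ascending=False).head(3)

# Same rows either way; nlargest just avoids sorting the whole column
print(top3_fast["income"].tolist())  # [90, 80, 70]
```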
Plotting, the bare minimum
DSA1101 expects matplotlib via pandas, not seaborn. The defaults are ugly but functional:
import matplotlib.pyplot as plt
# Histogram of one column
df["age"].hist(bins=20)
plt.xlabel("Age")
plt.ylabel("Count")
plt.title("Age distribution")
plt.show()
# Scatter of two columns
df.plot.scatter(x="age", y="income")
plt.show()
# Bar chart from groupby
df.groupby("country")["income"].mean().plot.bar()
plt.show()
# Line plot, common for time series
df.plot(x="date", y="value")
plt.show()
Always label your axes and add a title. Marks are awarded for "is the plot self-explanatory". Default unlabelled plots lose easy points.
The 5 mistakes I see in DSA1101 submissions
- Iterating over rows with for loops. Slow, error-prone. Use vectorised operations, or apply if you must.
- Forgetting .reset_index() after groupby. The next operation expects a flat DataFrame and silently breaks.
- Chained indexing like df[df["age"] > 21]["income"] = 0. Pandas warns you about this, listen. Use df.loc[mask, "income"] = 0 instead.
- Ignoring NaN. A .mean() over a column with NaN skips them by default, but other operations don't. Check isna().sum() first.
- Trusting df.head() to represent the full dataset. The first 5 rows can look tidy while the rest isn't. Use df.sample(10) if you want a real look.
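The chained-indexing mistake in particular deserves a concrete before/after, on a toy frame invented for the demo:

```python
import pandas as pd

# Made-up frame to show the chained-indexing trap
df = pd.DataFrame({"age": [25, 17, 30], "income": [50, 10, 60]})

# Wrong: df[df["age"] > 21]["income"] = 0 writes to a temporary copy,
# so the original frame may be left untouched (pandas warns about this).

# Right: one .loc call does the filter and the assignment together
df.loc[df["age"] > 21, "income"] = 0

print(df["income"].tolist())  # [0, 10, 0]
```

One bracket pair, one assignment: if you ever write `][` on the left of an `=`, stop and rewrite it with .loc.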
Where this falls short
This guide covers the procedural bits. DSA1101 also tests:
- Statistical interpretation. Knowing what a histogram means, not just how to draw one.
- Choice of method. When to drop NaN versus impute, when to use mean versus median, when log-transform helps.
- Communicating results. Your final answer matters as much as your code.
Pandas is a tool. The module grades whether you use it for the right reason.
When to ask for help
If after a tutorial session you can't:
- Reproduce the example from scratch with the slides closed
- Explain what groupby().agg() returns and why
- Read a stack trace and understand which line broke
You're not behind, you're stuck on a foundational concept. Send me your code on Telegram with the specific exercise you're trying. A 60-minute session usually fixes the foundation, and the rest of the module flows easier from there.