Master the three core tools every data analyst uses daily. Written for complete beginners — no experience assumed. Each section explains what you are learning, why it matters in real work, and shows you exactly what each line of code does. Every lesson ends with 5 exercises where you try it yourself before seeing the answer.
What is SQL? SQL (Structured Query Language) is how you talk to a database. Almost every company stores its data in one — customer records, sales, products, web clicks. SQL is the tool that lets you ask questions of that data and get answers back as a table of results. It reads almost like plain English, and most analysts use it every single day. These are the exact commands you will use from day one.
Group 1 — The Big Six Clauses (the skeleton of every SQL query)
How the Big Six work together — and the order they run in
Every SQL query is built from the same six clauses. You do not need to use all six every time, but when you do, SQL always processes them in this internal order: FROM first (find the table), then WHERE (filter the rows), then GROUP BY (collapse into groups), then HAVING (filter the groups), then SELECT (choose the columns), then ORDER BY (sort the output). This order matters when you are deciding which clause to use — for example, WHERE cannot see the results of GROUP BY because WHERE runs before grouping happens.
Think of it like a recipe: gather your ingredients (FROM), throw away what you do not need (WHERE), sort them into bowls (GROUP BY), discard bowls that are too small (HAVING), plate what is left (SELECT), and arrange it nicely (ORDER BY).
Request: “Show me the customer ID, name, and country for all customers”
Outcome: A table with exactly 3 columns and every row from the customers table. SELECT picks the columns. FROM identifies the table. That is all you need for the simplest query.
SQL — SELECT and FROM
SELECT customer_id, first_name, country
FROM customers;
-- SELECT lists the columns you want to see, separated by commas
-- FROM says which table those columns live in
-- Together these two words are the minimum for any SQL query
-- The semicolon ; ends the query (required in most databases)
-- Result: every row from the customers table, showing only those 3 columns
Request: “Show me only the customers from the UK who signed up after 2023”
Outcome: Only rows that meet BOTH conditions. Every row that fails even one condition is excluded entirely.
SQL — WHERE
SELECT customer_id, first_name, signup_date
FROM customers
WHERE country = 'UK'AND signup_date > '2023-12-31';
-- WHERE filters rows BEFORE any grouping or calculation
-- country = 'UK' checks each row: is this row's country equal to UK?
-- YES -> keep the row | NO -> discard the row entirely
-- AND connects two conditions: BOTH must be true for a row to survive
-- OR would keep a row if EITHER condition is true
-- Text values always go in single quotes: 'UK' not UK
-- Numbers do NOT need quotes: WHERE price > 50 not WHERE price > '50'
Watch out: To check for missing values you must write IS NULL — never = NULL. NULL means 'no value exists at all' and in SQL it is not equal to anything, not even itself. WHERE phone IS NULL finds missing phone numbers. WHERE phone = NULL always returns zero rows with no error — a silent wrong answer.
Request: “How many orders came from each country, and what was the total revenue per country?”
Outcome: One row per country. Each row shows the country name, how many orders came from there, and the total revenue. The country with the most revenue appears first.
SQL — GROUP BY, HAVING, ORDER BY
SELECT
country,
COUNT(*) AS total_orders,
SUM(order_value) AS total_revenue
FROM orders
WHERE status = 'completed'-- WHERE runs first, before groupingGROUP BY country -- collapses all rows with the same country into one groupHAVINGCOUNT(*) > 50-- HAVING filters AFTER grouping (WHERE cannot do this)ORDER BY total_revenue DESC; -- DESC = largest first | ASC = smallest first (default)-- GROUP BY: every row with the same country becomes one group
-- COUNT(*) then counts how many rows were in each group
-- SUM(order_value) adds up all order_values within each group
-- HAVING vs WHERE:
-- WHERE filters individual rows BEFORE grouping (use for raw column values)
-- HAVING filters groups AFTER grouping (use for aggregate results like COUNT, SUM)
-- "Only show countries with more than 50 orders" -> HAVING COUNT(*) > 50
-- You cannot write WHERE COUNT(*) > 50 because COUNT does not exist until after GROUP BY
Request: “Show me the top 10 highest-value orders”
Outcome: Exactly 10 rows — the 10 orders with the largest order_value, with the largest at the top.
SQL — ORDER BY and LIMIT
SELECT order_id, customer_id, order_value
FROM orders
ORDER BY order_value DESC-- sort largest firstLIMIT10; -- stop after 10 rows-- Without LIMIT: every row comes back sorted by order_value
-- LIMIT 10: stop as soon as 10 rows have been returned
-- Always use LIMIT when exploring large tables -- it stops you
-- accidentally pulling millions of rows into your screen
-- In SQL Server / MS Access: use TOP 10 instead of LIMIT
-- SELECT TOP 10 order_id, order_value FROM orders ORDER BY order_value DESC
Group 2 — Aggregation Functions (turning many rows into one number)
The five aggregation functions every analyst uses daily
Aggregation functions collapse many rows into a single summary value. They are always used inside a SELECT statement, and when combined with GROUP BY, they calculate that summary separately for each group. The five you will use constantly: COUNT() — how many rows, SUM() — add them all up, AVG() — find the mean, MIN() — the smallest value, MAX() — the largest value.
DISTINCT is a modifier that removes duplicate values before counting or displaying. Without DISTINCT, COUNT() counts every row including duplicates. With DISTINCT, it only counts unique values — for example, counting distinct customer_ids tells you how many unique customers placed orders, not how many total orders there were.
Request: “Give me a complete summary: total orders, unique customers, total revenue, average order, smallest order, and largest order”
Outcome: A single summary row showing six different calculated values across the entire orders table.
SQL — All Five Aggregation Functions + DISTINCT
SELECTCOUNT(*) AS total_rows,
COUNT(DISTINCT customer_id) AS unique_customers,
SUM(order_value) AS total_revenue,
ROUND(AVG(order_value), 2) AS avg_order_value,
MIN(order_value) AS smallest_order,
MAX(order_value) AS largest_order
FROM orders
WHERE status = 'completed';
-- COUNT(*): counts ALL rows including duplicates -- the total row count
-- COUNT(customer_id): counts rows where customer_id is NOT NULL
-- COUNT(DISTINCT customer_id): counts only UNIQUE customer_ids
-- If customer 42 placed 5 orders: COUNT(*) adds 5, COUNT(DISTINCT ...) adds 1
-- SUM(order_value): adds up every order_value in the result
-- AVG(order_value): total divided by count -- ROUND(..., 2) limits to 2 decimal places
-- MIN(order_value): the single smallest value in the entire column
-- MAX(order_value): the single largest value in the entire column
-- Result: one row with all six summary numbers
Request: “Show the top 5 countries by number of unique customers who placed orders”
Outcome: Five rows, one per country, showing how many distinct customers ordered from each. The country with the most unique customers appears first.
SQL — DISTINCT inside GROUP BY
SELECT
country,
COUNT(DISTINCT customer_id) AS unique_customers,
COUNT(*) AS total_orders
FROM orders
GROUP BY country
ORDER BY unique_customers DESCLIMIT5;
-- Without DISTINCT: COUNT(customer_id) counts every order including repeat customers
-- A customer who placed 10 orders would count as 10
-- With DISTINCT: COUNT(DISTINCT customer_id) counts each customer only once
-- A customer who placed 10 orders counts as 1
-- Having both columns lets you see: are repeat orders common in this country?
-- UK: unique_customers=500, total_orders=2000 means 4 orders per customer on average
Tip: You can use DISTINCT without an aggregation function to show unique values from a column: SELECT DISTINCT country FROM customers; — this returns one row per unique country, like a list of all countries you have customers in. No duplicates.
The WHERE Toolkit — Every Operator Explained
You already know = and AND — but WHERE has a full toolkit
The basic WHERE clause uses = to match a value exactly. But real data questions need more precision: a range of dates, a list of acceptable values, partial text matches. Each operator below is a different tool for a different type of filter. Knowing which one to reach for — and exactly how it behaves — is what separates analysts who write clean SQL from those who write clunky workarounds.
The operators you will use most often: IN (match any value from a list), BETWEEN (match a range), LIKE (partial text match), NOT (reverse any condition), IS NULL / IS NOT NULL (check for missing values).
The IN Operator — matching a list of values
IN: the clean alternative to writing five OR conditions
IN checks whether a column's value appears anywhere in a list you provide. It is exactly equivalent to writing x = 'A' OR x = 'B' OR x = 'C' but far more readable and easier to extend. You can put as many values as you need inside the parentheses.
When to use IN: Any time you have a specific set of known values you want to match. Country names, status codes, product categories, customer IDs. If you find yourself writing three or more OR conditions in a row, IN is almost always the cleaner choice.
NOT IN: The reverse — returns rows where the value is NOT in your list. Useful for exclusions: 'show me all orders except cancelled and refunded ones'. Important caveat: if the column contains any NULL values, NOT IN behaves unexpectedly (see the warning below).
Request: “Show orders from the UK, Germany, or France — three countries that are part of a regional report”
Outcome: Only rows where country is one of those three values. All other countries are excluded. Much cleaner than three separate OR conditions.
SQL — IN
-- Without IN: three separate OR conditions -- gets messy fastSELECT * FROM orders
WHERE country = 'UK'OR country = 'Germany'OR country = 'France';
-- With IN: same result, much cleaner -- easy to add more valuesSELECT * FROM orders
WHERE country IN ('UK', 'Germany', 'France');
-- NOT IN: exclude specific values instead of including them
-- "Show all orders except cancelled and refunded ones"SELECT order_id, order_value, status
FROM orders
WHERE status NOT IN ('cancelled', 'refunded', 'pending');
-- IN with numbers: no quotes needed for numbers
-- "Show me orders for customer IDs 101, 204, and 387"SELECT * FROM orders
WHERE customer_id IN (101, 204, 387);
-- IN can also take a subquery (covered in Advanced SQL):
-- WHERE customer_id IN (SELECT id FROM vip_customers)
-- This checks: "is this customer_id in the result of that inner query?"
Watch out:NOT IN and NULL: If the column contains even a single NULL value, NOT IN returns zero rows — silently. This is SQL's most surprising gotcha. The reason: SQL cannot determine whether NULL is 'not in the list' because NULL equals nothing. The safe fix: add AND column IS NOT NULL to your WHERE clause, or use NOT EXISTS (covered in Advanced SQL) which handles NULLs correctly.
The BETWEEN Operator — filtering a range of values
BETWEEN: the readable way to filter a range
BETWEEN checks whether a value falls within a range, inclusive of both endpoints. It is exactly equivalent to writing x >= low_value AND x <= high_value but far more readable. It works on numbers, dates, and text.
Critical detail — BETWEEN is always inclusive of BOTH endpoints. BETWEEN 1 AND 10 includes both 1 and 10, not just the values in between. This catches beginners out when working with dates: BETWEEN '2024-01-01' AND '2024-12-31' includes both those exact dates. But if your dates include times (e.g. '2024-12-31 23:59:59'), BETWEEN '2024-01-01' AND '2024-12-31' would EXCLUDE the last day's evening orders. For date ranges with times, it is safer to write >= start AND < the_day_after_end.
NOT BETWEEN returns rows outside the range — values strictly below the low end or strictly above the high end.
Request: “Show orders where the order value is between £50 and £200, and placed during 2024”
Outcome: Only orders where order_value is 50, 200, or anything between (inclusive), AND the order_date falls within 2024.
SQL — BETWEEN
-- BETWEEN with numbers: inclusive of both endpointsSELECT order_id, order_value, order_date
FROM orders
WHERE order_value BETWEEN50AND200;
-- This is identical to: WHERE order_value >= 50 AND order_value <= 200
-- order_value = 50 -> INCLUDED (>= 50)
-- order_value = 200 -> INCLUDED (<= 200)
-- order_value = 49 -> EXCLUDED (below 50)
-- order_value = 201 -> EXCLUDED (above 200)
-- BETWEEN with dates: the safe way to filter a full yearSELECT order_id, order_date, order_value
FROM orders
WHERE order_date BETWEEN'2024-01-01'AND'2024-12-31';
-- Works for dates stored as DATE type
-- If dates include timestamps, use >= and < instead:
WHERE order_date >= '2024-01-01'AND order_date < '2025-01-01'
-- This safely captures ALL of 2024 including 2024-12-31 23:59:59
-- NOT BETWEEN: values outside the rangeSELECT order_id, order_value
FROM orders
WHERE order_value NOT BETWEEN50AND200;
-- Returns orders below £50 OR above £200
-- order_value = 30 -> INCLUDED (below the range)
-- order_value = 100 -> EXCLUDED (inside the range)
-- order_value = 500 -> INCLUDED (above the range)
The LIKE Operator — matching patterns in text
LIKE: searching for text patterns, not exact matches
LIKE checks whether a text column matches a pattern. Two wildcard characters do the heavy lifting: % (percent) matches zero or more of any characters — it is a stand-in for 'anything can go here, including nothing'. _ (underscore) matches exactly one character — useful when you need to match a specific number of characters.
LIKE is case-insensitive in most databases (MySQL, SQL Server, PostgreSQL with ILIKE). In some PostgreSQL configurations LIKE is case-sensitive — use ILIKE instead for case-insensitive matching. When in doubt, combine LIKE with LOWER() to be safe: LOWER(email) LIKE '%gmail%'.
When to use LIKE: Partial name searches, finding email domains, searching product descriptions for keywords, matching ID patterns (e.g. all IDs starting with 'ORD-2024'). For simple exact matching, use = instead — LIKE is slower because the database cannot use an index efficiently when the pattern starts with %.
Request: “Find customers whose email is from Gmail, whose name starts with 'J', and whose customer code follows the pattern 'UK-????-2024'”
Outcome: Three separate examples showing the % and _ wildcards in action.
SQL — LIKE with % and _ wildcards
-- % matches ZERO OR MORE of any character
-- Starts with: 'J%' matches any text that begins with JSELECT first_name FROM customers WHERE first_name LIKE'J%';
-- 'James' -> MATCH | 'jessica' -> MATCH (case-insensitive in most DBs)
-- 'Alice' -> NO MATCH (does not start with J)
-- Ends with: '%gmail.com' matches any text that ends with gmail.comSELECT email FROM customers WHERE email LIKE'%gmail.com';
-- 'alice@gmail.com' -> MATCH
-- 'bob@hotmail.com' -> NO MATCH
-- Contains: '%keyword%' matches text with the keyword ANYWHERE insideSELECT product_name FROM products WHERE product_name LIKE'%laptop%';
-- 'Gaming Laptop Pro' -> MATCH ('laptop' is in the middle)
-- 'laptop stand' -> MATCH ('laptop' is at the start)
-- 'desktop computer' -> NO MATCH ('laptop' not found)
-- _ matches EXACTLY ONE character -- useful for fixed-length patternsSELECT customer_code FROM customers
WHERE customer_code LIKE'UK-____-2024';
-- Each _ matches exactly ONE character (any character)
-- 'UK-ABCD-2024' -> MATCH (4 characters between the dashes)
-- 'UK-AB-2024' -> NO MATCH (only 2 characters, need exactly 4)
-- 'UK-ABCDE-2024'-> NO MATCH (5 characters, need exactly 4)
-- NOT LIKE: exclude pattern matchesSELECT email FROM customers WHERE email NOT LIKE'%test%';
-- Excludes all emails containing "test" -- removes test accounts
-- Case-safe pattern: use LOWER() to guarantee case-insensitive matchingSELECT email FROM customers
WHERELOWER(email) LIKE'%@gmail.com';
-- LOWER() converts the stored value to lowercase before comparing
-- Matches: alice@gmail.com, ALICE@GMAIL.COM, Alice@Gmail.Com -- all included
Tip: If LIKE starts with % (e.g. '%gmail.com'), the database cannot use an index and must scan every row — this is called a full table scan and is slow on large tables. If LIKE starts without % (e.g. 'gmail%'), the database can use an index and is fast. Whenever possible, anchor your pattern to the start of the string.
The NOT Operator — reversing any condition
NOT: flipping any condition to its opposite
NOT reverses the result of any condition. You have already seen NOT IN and NOT LIKE and NOT BETWEEN. You can also use NOT with IS NULL (IS NOT NULL) and with parenthesised groups of conditions. NOT comes before the condition it reverses.
When to use NOT: When it is easier to describe what you want to exclude than what you want to include. 'Show me everything except test data' is cleaner to write with NOT than with a long list of what IS acceptable.
SQL — NOT with various operators
-- IS NOT NULL: only rows where a value EXISTS (opposite of IS NULL)SELECT customer_id, phone
FROM customers
WHERE phone IS NOT NULL;
-- Returns only customers who HAVE a phone number
-- IS NOT NULL is the correct way -- != NULL would return zero rows
-- NOT with a grouped condition using parenthesesSELECT * FROM orders
WHERENOT (country = 'UK'AND order_value < 10);
-- Returns everything EXCEPT UK orders under £10
-- De Morgan's law: NOT(A AND B) = NOT A OR NOT B
-- Equivalent: WHERE country != 'UK' OR order_value >= 10
-- Combining NOT operatorsSELECT * FROM customers
WHERE email IS NOT NULLAND email NOT LIKE'%test%'AND country NOT IN ('Unknown', 'N/A');
-- All three conditions must be true:
-- 1. customer must have an email
-- 2. that email must not contain "test"
-- 3. their country must not be Unknown or N/A
-- Useful for cleaning test data out of a report
Comparison Operators — the complete reference
Every comparison operator and exactly what it does
These are the building blocks of every WHERE condition. Knowing all of them — including the ones beginners often miss — means you can express any filtering logic precisely.
SQL — All Comparison Operators
-- Equality and inequalityWHERE status = 'completed'-- exactly equalsWHERE status != 'cancelled'-- not equal to (also written as <>)WHERE status <> 'cancelled'-- same as != (traditional SQL syntax)-- Numeric comparisonsWHERE price > 100-- strictly greater than 100WHERE price >= 100-- 100 or more (includes 100)WHERE price < 50-- strictly less than 50WHERE price <= 50-- 50 or less (includes 50)-- Range operatorsWHERE price BETWEEN10AND50-- 10 to 50 inclusive (>= 10 AND <= 50)WHERE price NOT BETWEEN10AND50-- outside the range (< 10 OR > 50)-- List matchingWHERE country IN ('UK', 'DE') -- matches any value in the listWHERE country NOT IN ('UK') -- NOT in the list (careful with NULLs)-- Pattern matchingWHERE name LIKE'A%'-- starts with AWHERE name LIKE'%son'-- ends with "son"WHERE name LIKE'%an%'-- contains "an" anywhereWHERE code LIKE'__-___'-- exactly 2 chars, dash, exactly 3 charsWHERE name NOT LIKE'%test%'-- does not contain "test"-- NULL checks (always use IS -- never use = or != with NULL)WHERE phone IS NULL-- the phone column has no value at allWHERE phone IS NOT NULL-- a value exists in the phone column-- Combining conditionsWHERE a AND b -- BOTH must be trueWHERE a OR b -- EITHER must be trueWHERENOT a -- a must be FALSEWHERE (a OR b) AND c -- use parentheses to control evaluation order-- Without parentheses: AND is evaluated BEFORE OR
-- WHERE a OR b AND c is read as: WHERE a OR (b AND c)
-- Always use parentheses when mixing AND and OR to be explicit
End-of-Lesson Exercises
How to use these: Read the request and attempt it yourself first. Then expand the answer to compare. Read the explanation so you understand what every line does before moving on.
🖊 Exercise 1 of 5 - Show order_id, customer_id, order_value for all orders. Show the 5 most recent first.
Hint: Which table has orders? Which columns? How do you sort newest first? How do you limit to 5?
SQL
SELECT order_id, customer_id, order_value, order_date
FROM orders
ORDER BY order_date DESCLIMIT5;
-- ORDER BY order_date DESC: most recent date appears first
-- LIMIT 5: stop after 5 rows - gives us only the top 5
-- Result: the 5 most recent orders in the entire table
🧠 What is this doing? ORDER BY order_date DESC sorts with the largest (most recent) date at the top. DESC means descending - largest first. LIMIT 5 stops the query after returning 5 rows. Without LIMIT every order would come back sorted by date.
🖊 Exercise 2 of 5 - Show only customers who signed up in 2024 and are from Germany.
Hint: You need two conditions both true at the same time. How do you filter a date range?
SQL
SELECT customer_id, first_name, email, signup_date
FROM customers
WHERE country = 'Germany'AND signup_date >= '2024-01-01'AND signup_date < '2025-01-01';
-- WHERE country = 'Germany': only German customers pass
-- AND signup_date >= '2024-01-01': on or after 1 Jan 2024
-- AND signup_date < '2025-01-01': before 1 Jan 2025
-- The two date conditions together = the whole of year 2024
-- AND means ALL conditions must be true for a row to appear
🧠 What is this doing? WHERE filters rows one by one checking each condition. country = 'Germany' excludes all non-German rows. The two date conditions create a range covering exactly 2024. AND means every condition must be true - a row failing any one condition is excluded.
🖊 Exercise 3 of 5 - Total revenue and number of orders per country, for completed orders only.
Hint: Filter first with WHERE, then group with GROUP BY, then calculate totals per group.
SQL
SELECT
country,
COUNT(*) AS completed_orders,
SUM(order_value) AS total_revenue
FROM orders
WHERE status = 'completed'GROUP BY country
ORDER BY total_revenue DESC;
-- WHERE runs BEFORE GROUP BY - it removes non-completed orders first
-- GROUP BY country groups remaining rows into one group per country
-- COUNT(*) counts how many rows are in each country group
-- SUM(order_value) adds up all values within each country group
-- Result: one row per country showing completed order stats only
🧠 What is this doing? WHERE status = 'completed' removes all pending or cancelled rows before any grouping happens. GROUP BY country then collapses remaining rows into one row per country. COUNT(*) and SUM(order_value) each calculate separately per country group.
🖊 Exercise 4 of 5 - Find all customers whose email contains the word 'test' anywhere.
Hint: Partial text match. LIKE with % on both sides matches text appearing anywhere in the string.
SQL
SELECT customer_id, first_name, email
FROM customers
WHERE email LIKE'%test%';
-- LIKE '%test%' matches any email containing 'test' anywhere
-- First %: anything can appear before 'test'
-- 'test': this exact text must appear somewhere in the string
-- Second %: anything can appear after 'test'
-- 'test@example.com' -> INCLUDED
-- 'alice@testing.org' -> INCLUDED ('test' is inside 'testing')
-- 'bob@gmail.com' -> EXCLUDED (no 'test' anywhere)
🧠 What is this doing? LIKE '%test%' is a pattern match. The % wildcard matches zero or more of any character. Putting % on both sides means 'test' can appear at the start, middle, or end. test@example.com matches. alice@testing.org matches. bob@gmail.com does not match.
🖊 Exercise 5 of 5 - Show only countries where the average order value is above 75. Include the average and order count.
Hint: GROUP BY country first. Then filter on the RESULT of AVG() - which clause filters after grouping?
SQL
SELECT
country,
COUNT(*) AS order_count,
ROUND(AVG(order_value), 2) AS avg_order_value
FROM orders
WHERE status = 'completed'GROUP BY country
HAVINGAVG(order_value) > 75ORDER BY avg_order_value DESC;
-- GROUP BY creates one group per country
-- AVG(order_value) calculates the average for each group
-- HAVING AVG(order_value) > 75 removes groups whose average is 75 or below
-- HAVING must be used here - not WHERE - because the average only
-- exists AFTER grouping. WHERE runs before grouping so it cannot
-- see aggregated values. HAVING filters groups after they are created.
🧠 What is this doing? GROUP BY country creates one group per country. HAVING filters groups after aggregation - it is the only clause that can filter on calculated values like AVG(). WHERE cannot be used here because the average does not exist until after GROUP BY runs.
You can query one table. Intermediate SQL is about combining tables, stacking result sets, writing readable multi-step queries, doing calculations that keep every row, adding conditional logic, handling missing values, and cleaning data types. These topics are what employers test at interview for mid-level analyst roles.
Group 3 — Joining and Combining Tables
Why data lives in separate tables and how to bring it back together
In any real database, data is split across multiple tables deliberately — to avoid repetition and keep information consistent. Customer details in one table. Orders in another. Products in a third. Each table connects to others via a shared ID column. A JOIN uses that shared column to stitch tables back together when you need data from more than one.
INNER JOIN — returns only rows that have a match in BOTH tables. A customer with no orders will not appear. An order with no matching customer will not appear. LEFT JOIN — returns EVERY row from the left table whether or not a match exists on the right. Unmatched right-side columns come back as NULL. Use LEFT JOIN when you want to keep all your data and just enrich it with extra information where available.
Request: “Show every order with the customer's name and country alongside it”
Outcome: A combined table. Each row shows order details AND the matching customer's name and country. Only orders that have a matching customer record appear (INNER JOIN behaviour).
SQL — INNER JOIN
SELECT
o.order_id,
o.order_value,
c.first_name AS customer_name,
c.country
FROM orders o
INNER JOIN customers c ON o.customer_id = c.customer_id;
-- "orders o" gives the table the alias "o" -- a shorthand used as a prefix
-- "customers c" gives the table the alias "c"
-- ON o.customer_id = c.customer_id: the matching condition
-- SQL checks every order row and finds the customer row where the IDs match
-- You MUST prefix columns with their alias when both tables share a column name
-- Both tables have customer_id -- "o.customer_id" vs "c.customer_id" is unambiguous
-- An order with no matching customer: EXCLUDED from INNER JOIN results entirely
Request: “Show ALL customers and how many orders each placed — including customers who never ordered”
Outcome: Every customer appears. Customers with orders show their count. Customers who never ordered show 0 (thanks to COALESCE). With INNER JOIN those zero-order customers would silently vanish.
SQL — LEFT JOIN
SELECT
c.customer_id,
c.first_name,
COUNT(o.order_id) AS order_count,
COALESCE(SUM(o.order_value), 0) AS lifetime_spend
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.first_name
ORDER BY lifetime_spend DESC;
-- LEFT JOIN: every row from the LEFT table (customers) is kept
-- If a customer has no orders: order columns come back as NULL
-- COUNT(o.order_id): counts non-NULL order IDs -> gives 0 for no-order customers
-- COUNT(*) would give 1 even for unmatched customers -- use COUNT(o.order_id) here
-- COALESCE(SUM(o.order_value), 0): replaces NULL with 0
-- COALESCE(value, fallback): returns the first argument that is NOT NULL
-- Without it: customers with no orders show NULL instead of 0
UNION and UNION ALL — stacking result sets vertically
While JOINs combine tables horizontally (adding columns), UNION combines result sets vertically (stacking rows on top of each other). Both SELECT statements must return the same number of columns with compatible data types.
UNION removes duplicate rows from the combined result. UNION ALL keeps every row including duplicates — it is faster because it skips the deduplication step. Use UNION ALL unless you specifically need to remove duplicates.
Request: “Combine a list of active customers and a list of archived customers into one single list”
Outcome: All rows from both queries stacked into one result. UNION removes any rows that appear in both. UNION ALL keeps everything including duplicates.
SQL — UNION and UNION ALL
-- UNION: combines two SELECT results into one, removing duplicate rowsSELECT customer_id, first_name, 'active' AS source
FROM customers_active
UNIONSELECT customer_id, first_name, 'archived' AS source
FROM customers_archived;
-- Both SELECT statements must return the same number of columns
-- Column names in the result come from the FIRST SELECT
-- The literal 'active' and 'archived' add a label so you know which table each row came from
-- UNION ALL: same but keeps ALL rows including duplicates (faster -- use this by default)SELECT customer_id, email FROM customers_uk
UNION ALLSELECT customer_id, email FROM customers_eu;
-- When to use UNION vs UNION ALL:
-- UNION ALL (default choice): faster, keeps everything, no hidden deduplication
-- UNION: use only when you specifically need to remove duplicates between the two sets
Group 4 — Advanced Analytical Tools
CASE WHEN — adding if-then-else logic inside a query
CASE WHEN evaluates conditions in order and returns the first matching result. It works exactly like an if-then-else statement: if this condition is true, return this value, else check the next condition. This is how you create labels, tiers, flags, and categories from raw data values.
CASE WHEN can also be used inside aggregate functions — for example, SUM(CASE WHEN status='completed' THEN order_value ELSE 0 END) sums only the completed order values without needing a WHERE clause. This lets you calculate multiple conditional totals in one single query.
Request: “Label every customer as Gold, Silver, Bronze, or New based on their total spend”
Outcome: Every row from customer_totals with a new 'tier' column. Conditions are checked in order — the first true condition wins.
SQL — CASE WHEN
SELECT
customer_id,
total_spent,
CASEWHEN total_spent >= 1000THEN'Gold'WHEN total_spent >= 300THEN'Silver'WHEN total_spent >= 50THEN'Bronze'ELSE'New'ENDAS customer_tier
FROM customer_totals;
-- CASE starts the block. Each WHEN checks a condition in order.
-- First TRUE condition wins and SQL stops checking the rest
-- Customer with 1500: first WHEN (>=1000) is TRUE -> 'Gold', done
-- Customer with 400: first FALSE, second (>=300) TRUE -> 'Silver'
-- ELSE catches everything that did not match any WHEN
-- END closes the CASE block. AS customer_tier names the new column.
-- CASE WHEN inside SUM: count completed vs cancelled in one query
SELECTSUM(CASE WHEN status='completed'THEN order_value ELSE0END) AS completed_revenue,
SUM(CASE WHEN status='cancelled'THEN order_value ELSE0END) AS cancelled_revenue
FROM orders;
WITH (CTEs) — naming your query steps so humans can read them
A CTE (Common Table Expression) lets you give a sub-query a name and reference it later like a temporary table. Instead of one giant nested query nobody can read, you write named steps — each building on the last. This is the most important code quality habit in SQL.
CTEs do not store data permanently. They recalculate fresh every time the query runs. You can chain as many CTEs as you need — each one can reference all previous ones. You can also test each CTE independently by temporarily making it the final SELECT, which makes debugging enormously easier.
Request: “In readable steps: calculate each customer's total spend, then summarise by country”
Outcome: The same result as a complex nested query — but written in two clearly named steps that any analyst can follow, debug, and modify.
SQL — WITH (CTEs)
-- Step 1: name this result "customer_totals" and give it a meaningful definitionWITH customer_totals AS (
SELECT
c.customer_id,
c.country,
COALESCE(SUM(o.order_value), 0) AS total_spent
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.country
),
-- Step 2: now summarise customer_totals by country (reference Step 1 by name)
country_summary AS (
SELECT
country,
COUNT(*) AS total_customers,
ROUND(AVG(total_spent), 2) AS avg_spend_per_customer,
SUM(total_spent) AS country_revenue
FROM customer_totals
GROUP BY country
)
-- Final SELECT: output the result we actually wantSELECT * FROM country_summary
ORDER BY country_revenue DESC;
-- WITH starts the CTE section. Each named block is one step.
-- Separate multiple CTEs with a comma (only one WITH keyword at the top)
-- Later steps can reference earlier steps by their name
-- The final SELECT is the actual output of the whole query
-- CTEs do not store data -- they recalculate fresh each time the query runs
Window Functions — calculations that keep every row
GROUP BY collapses rows. If you GROUP BY country you get one row per country — the individual order detail is gone. Window functions perform calculations across rows WITHOUT collapsing them. Every original row stays in your output, and the calculated value appears as a new column alongside it.
The OVER() clause is what makes something a window function. PARTITION BY inside OVER() defines the group (like GROUP BY but without collapsing). ORDER BY inside OVER() defines the sequence within that group. The three most used window functions: ROW_NUMBER() — unique sequential rank, RANK() — rank with gaps for ties, LAG() — look at the previous row's value.
Request: “For each order: show the order value, a running total, yesterday's revenue, and the customer's rank within their country — without losing any individual order rows”
Outcome: Every order row preserved, with four new calculated columns alongside each one. No rows are collapsed. This is impossible with GROUP BY alone.
SQL — Window Functions: ROW_NUMBER, RANK, LAG, running total
-- Running total: SUM that grows as you move through datesSELECT
order_date,
order_value,
SUM(order_value) OVER(ORDER BY order_date)
AS running_total,
-- LAG: the value from the previous rowLAG(order_value, 1) OVER(ORDER BY order_date)
AS previous_day_value,
order_value - LAG(order_value, 1) OVER(ORDER BY order_date)
AS day_on_day_change
FROM daily_sales;
-- SUM() OVER(ORDER BY order_date): adds up as it moves through rows in date order
-- Row 1 (Jan 1): running_total = Jan 1 value
-- Row 2 (Jan 2): running_total = Jan 1 + Jan 2 values
-- Every individual row is preserved -- nothing is collapsed
-- LAG(order_value, 1): looks back 1 row and returns that row's order_value
-- First row has no previous row -> returns NULL
-- ROW_NUMBER vs RANK within a partitionSELECT
customer_id, country, total_spent,
-- ROW_NUMBER: unique number 1,2,3,4 -- no ties, always uniqueROW_NUMBER() OVER(PARTITION BY country ORDER BY total_spent DESC) AS row_num,
-- RANK: assigns same rank to ties, then skips numbers (1,1,3,4)RANK() OVER(PARTITION BY country ORDER BY total_spent DESC) AS rank_in_country
FROM customer_totals;
-- PARTITION BY country: rankings restart from 1 for each new country
-- UK customers: rank 1,2,3... independently from Germany customers: rank 1,2,3...
-- ROW_NUMBER: two customers with the same spend still get different numbers (1 and 2)
-- RANK: two customers with the same spend both get rank 1, next customer gets rank 3
-- DENSE_RANK (not shown): two ties both get 1, next gets 2 (no gaps)
Tip: A quick way to remember OVER(): think of it as defining the 'window' — the set of rows the function looks at to do its calculation. Empty OVER() = look at all rows. OVER(PARTITION BY x) = look at rows within the same x group. OVER(ORDER BY y) = look at rows in y order, building up cumulatively.
Group 5 — Data Cleaning Commands
COALESCE — replacing NULL with a useful default
NULL means 'no value exists'. It causes problems: NULL + 5 = NULL, COUNT(column) ignores NULLs, and showing NULL in a report looks unprofessional. COALESCE takes a list of values and returns the first one that is NOT NULL. This is the standard way to replace missing values with a sensible default.
You have already seen COALESCE with SUM(). It appears constantly in analytics: replacing NULL revenue with 0, NULL country with 'Unknown', NULL category with 'Uncategorised' — anywhere a missing value would break a calculation or look wrong.
Request: “Show each order with a clean country label — use 'Unknown' if country is NULL, and use the region column as a backup if country is missing”
Outcome: Every row shows a country value. NULL is gone. COALESCE checks each argument in order and returns the first non-NULL one.
SQL — COALESCE
SELECT
order_id,
order_value,
-- Try country first. If NULL, try region. If still NULL, use 'Unknown'COALESCE(country, region, 'Unknown') AS location
FROM orders;
-- COALESCE(a, b, c): returns the first argument that is NOT NULL
-- country = 'UK', region = 'Europe' -> returns 'UK' (first non-NULL)
-- country = NULL, region = 'Europe' -> returns 'Europe' (second argument)
-- country = NULL, region = NULL -> returns 'Unknown' (third argument)
-- Common patterns:
COALESCE(SUM(order_value), 0) -- replace NULL sum with 0COALESCE(discount_pct, 0) -- treat missing discount as 0%COALESCE(email, phone, 'No contact') -- use phone if no email
CAST and CONVERT — changing a column's data type
Databases store data in types: INTEGER, VARCHAR (text), DATE, DECIMAL, etc. Sometimes a column is stored as the wrong type — for example, a date stored as text, or a price stored as text because someone exported it with currency symbols. CAST changes the type of a value so you can do the right kind of calculation on it.
CAST is the SQL standard and works in all databases. CONVERT is the SQL Server / MySQL alternative with slightly different syntax — in SQL Server, CONVERT puts the data type first; in MySQL, it puts it second. When working with a new database, check which it supports.
Request: “Convert a text date column to a real date type, and convert a text price column to a decimal number”
Outcome: Columns that were previously stored as text now behave like their correct types — dates can be compared and sorted, numbers can be added and averaged.
SQL — CAST and CONVERT
-- CAST syntax: CAST(value AS new_type)SELECT
order_id,
CAST(order_date AS DATE) AS proper_date,
CAST(price_text AS DECIMAL(10,2)) AS proper_price,
CAST(customer_id AS VARCHAR(20)) AS id_as_text
FROM orders;
-- CAST(order_date AS DATE): converts a text string like '2024-03-15' to a real date
-- Real dates can be sorted correctly, compared, and subtracted from each other
-- Text dates sort alphabetically: '2024-02' sorts before '2024-10' which is correct
-- But '2024-9' sorts after '2024-10' alphabetically -- which is WRONG
-- CAST(price_text AS DECIMAL(10,2)): converts '42.5' text to the number 42.50
-- DECIMAL(10,2): up to 10 total digits with 2 after the decimal point
-- Without this: SUM('42.5') would fail or give wrong results
-- SQL Server alternative: CONVERT(new_type, value)CONVERT(DATE, order_date) -- SQL Server syntaxCONVERT(DECIMAL(10,2), price_text) -- SQL Server syntax
TRIM and REPLACE — cleaning messy text directly in SQL
Real data is full of text inconsistencies: extra spaces at the start or end of values, currency symbols mixed into number columns, inconsistent capitalisation. TRIM and REPLACE let you fix these directly in your SQL query without needing Python or Excel.
TRIM removes leading and trailing spaces — the most common text quality problem in exported data. REPLACE finds a specific piece of text and swaps it for something else — useful for stripping currency symbols (£, $), removing commas from numbers, or standardising abbreviations.
Request: “Clean the country column (remove spaces) and clean the price column (remove the £ symbol so it can be used as a number)”
Outcome: Rows with clean values. 'UK ' becomes 'UK'. '£42.50' becomes '42.50' which can be CASTed to a number.
SQL — TRIM and REPLACE
SELECT
order_id,
-- TRIM removes spaces from both the start and end of the textTRIM(country) AS country_clean,
-- REPLACE(column, find_this, replace_with_this)REPLACE(price_text, '£', '') AS price_no_symbol,
-- Chain REPLACE calls to remove multiple unwanted charactersREPLACE(REPLACE(price_text, '£', ''), ',', '') AS price_digits_only,
-- CAST the cleaned text to a proper number for calculationsCAST(REPLACE(price_text, '£', '') AS DECIMAL(10,2)) AS price_numeric
FROM orders;
-- TRIM(country): ' UK ' -> 'UK' | 'Germany ' -> 'Germany'
-- Removes spaces from BOTH ends. Does not remove spaces in the middle.
-- LTRIM removes only left spaces. RTRIM removes only right spaces.
-- REPLACE(price_text, '£', ''): replace '£' with nothing (empty string)
-- '£42.50' -> '42.50'
-- REPLACE(REPLACE(...)): chain two REPLACE calls to clean two characters
-- '£1,250.00' -> '£1250.00' (inner) -> '1250.00' (outer)
-- CAST(REPLACE(...) AS DECIMAL): the full pipeline: strip symbol, convert to number
CREATE TABLE and DROP TABLE — building and removing tables
While analysts primarily query existing data, you will sometimes need to create a temporary working table to store intermediate results — especially when working with large datasets where CTEs would be recalculated repeatedly. CREATE TABLE builds a new table with defined columns and types. DROP TABLE removes a table entirely.
A common pattern is to CREATE a temporary results table, load analysis into it, query from it multiple times, then DROP it when you are done. Most databases also support CREATE TEMPORARY TABLE (or CREATE TEMP TABLE) which automatically disappears when your session ends — safer than a permanent table.
Request: “Create a clean working table for customer spend analysis, populate it, query from it, then remove it”
Outcome: A new table called customer_spend_summary is created, filled with data, queried, then removed. The original source tables are never touched.
SQL — CREATE TABLE and DROP TABLE
-- Step 1: create a new empty table with defined columns and typesCREATE TABLE customer_spend_summary (
customer_id INTEGER,
country VARCHAR(100),
total_spent DECIMAL(12, 2),
order_count INTEGER,
customer_tier VARCHAR(20)
);
-- Step 2: fill it with data using INSERT INTO ... SELECTINSERT INTO customer_spend_summary
SELECT
customer_id,
country,
SUM(order_value),
COUNT(*),
CASE WHENSUM(order_value) >= 1000THEN'Gold'ELSE'Standard'ENDFROM orders GROUP BY customer_id, country;
-- Step 3: query from it like any other tableSELECT * FROM customer_spend_summary
WHERE customer_tier = 'Gold'ORDER BY total_spent DESC;
-- Step 4: remove the table when done (CANNOT be undone -- be careful)DROP TABLE customer_spend_summary;
-- CREATE TEMPORARY TABLE: safer alternative -- auto-deleted when session ends
CREATE TEMPORARY TABLE temp_results ( ... );
-- In most databases: CREATE TEMP TABLE also works
-- IF EXISTS: prevents an error if the table does not exist
DROP TABLE IF EXISTS customer_spend_summary; -- safer version
End-of-Lesson Exercises
How to use these: Read the request and attempt it yourself first. Then expand the answer to compare. Read the explanation so you understand what every line does before moving on.
🖊 Exercise 1 of 5 - Show every order with the customer's name and country. Only include orders with a matching customer record.
Hint: Which JOIN gives only rows that exist in both tables? Which column links orders to customers?
SQL
SELECT
o.order_id,
o.order_value,
o.order_date,
c.first_name AS customer_name,
c.country
FROM orders o
INNER JOIN customers c ON o.customer_id = c.customer_id
ORDER BY o.order_date DESC;
-- orders o: gives the orders table the alias "o"
-- customers c: gives the customers table the alias "c"
-- INNER JOIN: only rows with a match in BOTH tables appear
-- ON o.customer_id = c.customer_id: the column that links them
-- An order with no matching customer record is EXCLUDED entirely
-- o. and c. prefixes clarify which table each column comes from
🧠 What is this doing? INNER JOIN returns only rows where a matching customer_id exists in both tables. Orders with no matching customer are excluded. Table aliases (o, c) are required here because both tables have a customer_id column - without the prefix SQL does not know which one you mean.
🖊 Exercise 2 of 5 - Using a CTE: calculate each customer's total spend, then show only those who spent more than 500.
Hint: Write WITH customer_spend AS (...) first. Then SELECT from it with a WHERE filter on total_spent.
SQL
WITH customer_spend AS (
SELECT
customer_id,
SUM(order_value) AS total_spent
FROM orders
WHERE status = 'completed'GROUP BY customer_id
)
SELECT *
FROM customer_spend
WHERE total_spent > 500ORDER BY total_spent DESC;
-- WITH customer_spend AS (...) creates a named temporary result
-- The CTE calculates total completed spend per customer_id
-- The final SELECT treats customer_spend like a normal table
-- WHERE total_spent > 500 then filters that named result
-- Result: only customers who have spent more than 500 total
🧠 What is this doing? The CTE creates a named intermediate result: one row per customer with their total spend. The final SELECT then filters it with WHERE total_spent > 500. Without the CTE you would need to nest the GROUP BY as a subquery - much harder to read and debug.
🖊 Exercise 3 of 5 - For every order, show the order value AND how it compares to the overall average across all orders.
Hint: Use AVG() OVER() with an empty OVER clause - this averages across ALL rows without grouping.
SQL
SELECT
order_id,
order_value,
ROUND(AVG(order_value) OVER(), 2) AS overall_avg,
ROUND(order_value - AVG(order_value) OVER(), 2) AS vs_average
FROM orders
WHERE status = 'completed';
-- AVG(order_value) OVER() with empty OVER: average across ALL rows
-- Every row gets the SAME overall_avg value
-- vs_average = this row's value minus the overall average
-- positive = this order is above average
-- negative = this order is below average
-- All individual order rows are kept - no rows are collapsed
🧠 What is this doing? AVG(order_value) OVER() with an empty OVER() calculates the average across the entire result set. Every row gets the same overall average as a new column. Subtracting it from order_value tells you above or below average per row. The key difference from GROUP BY: all individual rows are preserved.
🖊 Exercise 4 of 5 - Label every customer as High Spender (over 500), Mid Spender (100-500), or Low Spender (under 100).
Hint: Use CASE WHEN with a CTE for totals. Check the highest threshold first.
SQL
WITH totals AS (
SELECT customer_id, SUM(order_value) AS total_spent
FROM orders WHERE status = 'completed'GROUP BY customer_id
)
SELECT
customer_id, total_spent,
CASEWHEN total_spent > 500THEN'High Spender'WHEN total_spent >= 100THEN'Mid Spender'ELSE'Low Spender'ENDAS spend_tier
FROM totals;
-- CASE checks conditions in order - first TRUE wins and stops
-- Customer with 600: first WHEN (>500) is TRUE -> 'High Spender', done
-- Customer with 200: first FALSE, second (>=100) TRUE -> 'Mid Spender'
-- Customer with 40: both FALSE -> ELSE -> 'Low Spender'
-- Checking >500 FIRST is critical: checking >=100 first would
-- label a 600-spend customer as 'Mid Spender' incorrectly
🧠 What is this doing? The CTE calculates each customer's total spend. CASE WHEN evaluates conditions top to bottom - the first true condition wins and the rest are skipped. Checking > 500 first is critical: if >= 100 were checked first, a customer who spent 600 would be labelled Mid Spender incorrectly.
🖊 Exercise 5 of 5 - Show all customers, total spend, and rank within their country. Include customers with no orders (show 0).
Hint: Need LEFT JOIN, COALESCE for null spend, ROW_NUMBER() OVER(PARTITION BY country ORDER BY ...) for ranking.
SQL
WITH customer_totals AS (
SELECT
c.customer_id, c.first_name, c.country,
COALESCE(SUM(o.order_value), 0) AS total_spent
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.first_name, c.country
)
SELECT
customer_id, first_name, country, total_spent,
ROW_NUMBER() OVER(
PARTITION BY country
ORDER BY total_spent DESC
) AS country_rank
FROM customer_totals;
-- LEFT JOIN keeps ALL customers even those with zero orders
-- COALESCE(SUM(...), 0): replaces NULL spend with 0
-- ROW_NUMBER() OVER(PARTITION BY country ...): ranking per country
-- PARTITION BY country: ranking restarts from 1 for each country
-- ORDER BY total_spent DESC: Rank 1 = highest spender in that country
-- UK and Germany each have their own independent Rank 1
🧠 What is this doing? LEFT JOIN keeps all customer rows even when no orders match - those customers get NULL in order columns. COALESCE converts that NULL to 0. ROW_NUMBER() OVER(PARTITION BY country ORDER BY total_spent DESC) assigns a rank starting from 1 within each country group independently.
Advanced SQL: Subqueries A subquery is a complete SQL query written inside another query. It lets you use the result of one query as input to another. Subqueries can appear in the WHERE clause, the FROM clause, the SELECT clause, and more. Start with simple scalar subqueries and work through to correlated subqueries — these are the patterns that appear in real analytics work and in senior analyst interviews.
1 — Scalar Subquery in WHERE: a query that returns one value
The simplest subquery: getting one number to use as a filter value
A scalar subquery returns exactly one value — one row, one column. You can use it anywhere you would normally type a number or text value. The outer query runs first conceptually, but the inner query (the subquery) executes first and its result is then used by the outer query.
When to use it: Any time your filter condition involves a calculated value you do not know in advance. 'Find all orders above the average order value' — you do not know what the average is until you calculate it, so you embed that calculation as a subquery.
Request: “Show all orders where the order value is above the average order value for all completed orders”
Outcome: Only orders that beat the average. The average is calculated automatically by the inner query — you do not need to know the number in advance or run two separate queries.
SQL — Scalar Subquery in WHERE
-- The inner query calculates one number: the average order valueSELECT order_id, customer_id, order_value
FROM orders
WHERE order_value > (
SELECTAVG(order_value)
FROM orders
WHERE status = 'completed'
)
AND status = 'completed'ORDER BY order_value DESC;
-- What happens step by step:
-- 1. The inner query runs: SELECT AVG(order_value) FROM orders WHERE status='completed'
-- Result: one number, e.g. 87.50
-- 2. The outer query then runs as if you had written: WHERE order_value > 87.50
-- Result: all completed orders above the average
-- You can use the subquery in the SELECT to show the average alongside each rowSELECT
order_id,
order_value,
ROUND((
SELECTAVG(order_value) FROM orders WHERE status = 'completed'
), 2) AS overall_average,
order_value - (
SELECTAVG(order_value) FROM orders WHERE status = 'completed'
) AS above_average_by
FROM orders
WHERE status = 'completed';
-- The same subquery appears twice here -- in practice use a CTE to avoid repetition
-- Covered below: WITH avg_val AS (...) then reference avg_val.avg_order in SELECT
2 — Subquery with IN: a query that returns a list
Using a subquery to build the list for an IN operator
Instead of typing a fixed list into IN ('A', 'B', 'C'), you can put a complete SELECT statement inside IN. The subquery returns a list of values, and the outer query checks whether each row's column value appears in that list. The subquery must return exactly one column.
When to use it: 'Find all orders placed by customers who are in the VIP programme' — you first need to find the list of VIP customer IDs (inner query), then find orders for those customers (outer query). This is a natural two-step thought process expressed as one query.
Request: “Find all orders placed by customers who live in countries where we have more than 1,000 customers”
Outcome: Only orders from 'large market' customers. The inner query calculates which countries qualify, and the outer query finds orders from customers in those countries.
SQL — Subquery with IN
-- Inner query: find which countries have more than 1,000 customersSELECT o.order_id, o.order_value, c.country
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE c.country IN (
SELECT country
FROM customers
GROUP BY country
HAVINGCOUNT(*) > 1000
)
ORDER BY o.order_value DESC;
-- Step by step:
-- 1. Inner query groups customers by country and returns countries with > 1000 customers
-- Result: a list like ('UK', 'Germany', 'France')
-- 2. Outer query uses that list in the IN operator
-- Same result as: WHERE c.country IN ('UK', 'Germany', 'France')
-- But this is dynamic -- if the data changes, the list updates automatically
-- NOT IN with a subquery: the opposite -- exclude those customersSELECT * FROM orders
WHERE customer_id NOT IN (
SELECT customer_id FROM blacklisted_customers
);
-- WARNING: if blacklisted_customers contains any NULL customer_id,
-- NOT IN returns ZERO rows -- use NOT EXISTS instead (shown below)
3 — Subquery in FROM: treating a query result as a table
The derived table: using a subquery as if it were a real table
You can put a subquery in the FROM clause and treat its result like a regular table. This is called a derived table or inline view. You must give it an alias (a name) so the outer query can reference it.
This pattern is useful when you need to filter or aggregate on the result of an aggregation — which you cannot do with a simple WHERE or HAVING. For example: 'find customers whose order count is in the top 10% of all customers' — you first need to calculate order counts per customer (inner query), then filter on those counts (outer query).
CTEs are usually cleaner than derived tables for the same task. Both do the same thing, but CTEs are named at the top and reusable. Use derived tables when you need a quick inline calculation in a one-off query.
Request: “Find the average revenue per customer tier — but the tier calculation itself is complex”
Outcome: Each tier shows its average revenue. The inner query calculates the tier for every customer. The outer query then averages revenue by tier.
SQL — Subquery in FROM (derived table)
-- Inner query: calculate each customer's total spend and assign their tierSELECT
tier,
COUNT(*) AS customers_in_tier,
ROUND(AVG(total_spent), 2) AS avg_spend_for_tier
FROM (
-- This inner query runs first and becomes a temporary tableSELECT
customer_id,
SUM(order_value) AS total_spent,
CASEWHENSUM(order_value) >= 1000THEN'Gold'WHENSUM(order_value) >= 300THEN'Silver'ELSE'Bronze'ENDAS tier
FROM orders
WHERE status = 'completed'GROUP BY customer_id
) AS customer_tiers -- the alias is required -- gives this derived table a nameGROUP BY tier
ORDER BY avg_spend_for_tier DESC;
-- Step by step:
-- 1. Inner query runs: calculates total_spent and tier for every customer
-- Result: a temporary table with columns (customer_id, total_spent, tier)
-- 2. Outer query uses that table as if it were real:
-- SELECT tier, COUNT(*), AVG(total_spent) FROM that_result GROUP BY tier
-- The alias AS customer_tiers is mandatory in most databases
Tip: Rewrite derived tables as CTEs for better readability: WITH customer_tiers AS (... inner query ...) SELECT ... FROM customer_tiers GROUP BY tier. CTEs and derived tables produce identical results — CTEs are just easier to read and debug.
4 — EXISTS and NOT EXISTS: checking whether rows exist
EXISTS: a TRUE/FALSE test for whether a subquery returns any rows
EXISTS checks whether the subquery returns at least one row. It does not care what columns the subquery returns — just whether any row exists. If the subquery returns any rows at all, EXISTS is TRUE. If the subquery returns zero rows, EXISTS is FALSE.
NOT EXISTS is the safest way to check for non-membership — it handles NULLs correctly where NOT IN fails. Use NOT EXISTS instead of NOT IN whenever the inner query might return NULLs. EXISTS is also often faster than IN because the database can stop as soon as it finds one matching row.
When to use EXISTS: 'Show customers who have placed at least one order' — EXISTS is perfect here. 'Show customers who have NEVER placed an order' — use NOT EXISTS.
Request: “Find customers who have placed at least one order in 2024 — and separately, customers who have never ordered at all”
Outcome: EXISTS returns the first group (have ordered). NOT EXISTS returns the second (have never ordered). Both handle NULLs correctly, unlike IN and NOT IN.
SQL — EXISTS and NOT EXISTS
-- EXISTS: customer has placed AT LEAST ONE order in 2024SELECT c.customer_id, c.first_name, c.email
FROM customers c
WHEREEXISTS (
SELECT1-- SELECT 1 is conventional -- we only care IF a row exists, not WHAT it containsFROM orders o
WHERE o.customer_id = c.customer_id -- this links inner to outer queryAND o.order_date >= '2024-01-01'AND o.order_date < '2025-01-01'
);
-- The database checks for each customer:
-- "Does any row exist in orders where customer_id matches AND date is in 2024?"
-- YES -> customer is included in results
-- NO -> customer is excluded
-- NOT EXISTS: customers who have NEVER placed any orderSELECT c.customer_id, c.first_name
FROM customers c
WHERENOT EXISTS (
SELECT1FROM orders o
WHERE o.customer_id = c.customer_id
);
-- This is the NULL-safe replacement for NOT IN:
-- NOT IN version (unsafe if orders.customer_id has any NULLs):
-- WHERE customer_id NOT IN (SELECT customer_id FROM orders)
-- NOT EXISTS version (always correct even with NULLs):
-- WHERE NOT EXISTS (SELECT 1 FROM orders WHERE customer_id = c.customer_id)
5 — Correlated Subquery: re-evaluating for every row
The most powerful — and most misunderstood — type of subquery
All the subqueries above run once and produce a result used by the outer query. A correlated subquery is different: it runs once for EVERY row in the outer query, and it references columns from the outer query. The result changes depending on which row the outer query is currently processing.
Think of it like a manager checking each employee's performance against their own team's average (not the company average) — for each employee, you need to calculate a different average based on that employee's team. The inner query has to run separately for each row.
Performance warning: Correlated subqueries can be slow on large tables because they re-run for every outer row. Window functions (covered in intermediate SQL) often achieve the same result much faster. Use correlated subqueries when the logic is clearer than the window function equivalent, and move to window functions when performance matters.
Request: “For each order, show whether it is above or below that specific customer's own average — not the overall average”
Outcome: Each row shows its order value compared to the average for just that customer. The inner query calculates a different average for each customer as the outer query moves through rows.
SQL — Correlated Subquery
-- The inner query references o.customer_id from the OUTER query
-- This means the inner query re-executes for every row in the outer querySELECT
o.order_id,
o.customer_id,
o.order_value,
ROUND((
SELECTAVG(order_value)
FROM orders inner_o
WHERE inner_o.customer_id = o.customer_id -- references outer query's customer_idAND inner_o.status = 'completed'
), 2) AS this_customers_avg,
CASEWHEN o.order_value > (
SELECTAVG(order_value)
FROM orders inner_o
WHERE inner_o.customer_id = o.customer_id
) THEN'Above their average'ELSE'Below their average'ENDAS vs_personal_avg
FROM orders o
WHERE o.status = 'completed';
-- What happens for Customer 42 who has 10 orders:
-- Outer query processes order row 1 (customer_id=42, order_value=150)
-- Inner query runs: SELECT AVG(order_value) FROM orders WHERE customer_id=42
-- Result: average for customer 42 = e.g. 120
-- Comparison: 150 > 120 -> "Above their average"
-- Outer query moves to order row 2 (customer_id=42, order_value=80)
-- Inner query runs AGAIN with customer_id=42: average is still 120
-- Comparison: 80 < 120 -> "Below their average"
-- WINDOW FUNCTION EQUIVALENT (much faster on large tables):
SELECT order_id, customer_id, order_value,
AVG(order_value) OVER(PARTITION BY customer_id) AS this_customers_avg
FROM orders WHERE status = 'completed';
You can nest subqueries as deeply as you need, though more than three levels usually signals the query should be rewritten as a CTE for readability. Multi-level subqueries let you answer questions that require several sequential calculations where each step depends on the result of the previous one.
Rule of thumb: If your subquery nesting exceeds two levels, convert it to a CTE. The query becomes dramatically easier to read, test, and debug when each level has a name.
Request: “Find customers whose lifetime spend puts them in the top 25% of all customers — three levels deep”
Outcome: A three-level chain: first calculate each customer's total spend, then find the 75th percentile threshold, then find customers whose spend exceeds that threshold.
SQL — Multi-Level Subquery (then rewritten as CTE)
-- THREE-LEVEL nested subquery -- works but hard to readSELECT customer_id, total_spent
FROM (
-- Level 2: calculate total spend per customerSELECT customer_id, SUM(order_value) AS total_spent
FROM orders WHERE status='completed'GROUP BY customer_id
) AS customer_totals
WHERE total_spent > (
-- Level 3: find the 75th percentile of those totalsSELECTPERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY total_spent)
FROM (
-- Level 1 (innermost): same customer totals calculationSELECTSUM(order_value) AS total_spent
FROM orders WHERE status='completed'GROUP BY customer_id
) AS all_totals
);
-- SAME RESULT as CTE -- much more readableWITH customer_totals AS (
SELECT customer_id, SUM(order_value) AS total_spent
FROM orders WHERE status='completed'GROUP BY customer_id
),
p75_threshold AS (
SELECTPERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY total_spent)
AS threshold
FROM customer_totals
)
SELECT ct.customer_id, ct.total_spent
FROM customer_totals ct
CROSS JOIN p75_threshold -- makes the threshold value available in every rowWHERE ct.total_spent > p75_threshold.threshold
ORDER BY ct.total_spent DESC;
-- CROSS JOIN here is intentional: the threshold table has 1 row,
-- so CROSS JOIN just adds that single threshold value to every customer row
7 — Common Subquery Patterns: the ones you will use again and again
Six patterns worth memorising — these cover most real analytics subquery needs
Once you recognise these patterns you can combine them to answer almost any complex business question.
SQL — Six Common Subquery Patterns
-- PATTERN 1: Find rows above/below a calculated thresholdWHERE value > (SELECTAVG(value) FROM table)
-- PATTERN 2: Filter by membership in a dynamic listWHERE id IN (SELECT id FROM other_table WHERE condition)
-- PATTERN 3: Safe anti-join (rows NOT in another table)WHERENOT EXISTS (SELECT1FROM other_table WHERE other.id = outer.id)
-- PATTERN 4: Aggregate the result of an aggregate (double aggregation)SELECTAVG(order_count) FROM
(SELECT customer_id, COUNT(*) AS order_count
FROM orders GROUP BY customer_id) AS counts
-- PATTERN 5: Running the same filter at two levels (outer AND inner)WITH filtered AS (SELECT * FROM orders WHERE status='completed')
SELECT * FROM filtered WHERE order_value > (SELECTAVG(order_value) FROM filtered)
-- PATTERN 6: Find the latest record per group (common in event logs)SELECT * FROM events e
WHERE event_date = (
SELECTMAX(event_date)
FROM events
WHERE customer_id = e.customer_id -- correlated: uses the outer row's customer_id
);
-- Returns only the most recent event for each customer
-- Window function equivalent: ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY event_date DESC)
End-of-Lesson Exercises
How to use these: Each exercise builds on the subquery patterns above. Attempt the query yourself first. The explanation after the answer tells you which pattern applies and why.
🖊 Exercise 1 of 5 — Find all products whose price is above the average price across all products.
Hint: Scalar subquery in WHERE. The inner query calculates one number (AVG). The outer query compares each row to it.
SQL
SELECT product_id, product_name, price
FROM products
WHERE price > (
SELECTAVG(price) FROM products
)
ORDER BY price DESC;
-- The inner query runs first and returns one number: the average price
-- The outer query then filters: keep only products where price > that average
-- Every row in the products table is compared to the same calculated average
🧠 What is this doing? Pattern 1 — scalar subquery in WHERE. The inner SELECT AVG(price) FROM products returns a single number. The outer query uses that number as the right side of the > comparison. You could also write WHERE price > 47.50 if you knew the average in advance — but the subquery calculates it dynamically so the query always reflects current data.
🖊 Exercise 2 of 5 — Find all orders placed by customers who are in the 'premium' segment table.
Hint: Subquery with IN. The inner query returns a list of customer IDs from the premium_customers table.
SQL
SELECT o.order_id, o.customer_id, o.order_value
FROM orders o
WHERE o.customer_id IN (
SELECT customer_id
FROM premium_customers
WHERE segment = 'premium'
)
ORDER BY o.order_value DESC;
-- Inner query returns a list of customer_ids who are premium
-- Outer query keeps only orders where the customer_id is in that list
-- Dynamic: if a customer is added to premium_customers, they appear automatically
🧠 What is this doing? Pattern 2 — subquery with IN. The inner query returns a list of customer_ids from premium_customers. The outer query checks whether each order's customer_id appears in that list. This is equivalent to joining orders to premium_customers, but the subquery approach is often more readable when you only need to filter, not display, the inner data.
🖊 Exercise 3 of 5 — Find customers who have NEVER placed any order. Use NOT EXISTS to handle NULLs safely.
Hint: NOT EXISTS with a correlated subquery. The inner query checks for the existence of any order per customer.
SQL
SELECT c.customer_id, c.first_name, c.email
FROM customers c
WHERENOT EXISTS (
SELECT1FROM orders o
WHERE o.customer_id = c.customer_id
);
-- For each customer row, the inner query asks:
-- "Does any row exist in orders with this customer_id?"
-- NOT EXISTS: keep the customer only if the answer is NO
-- SELECT 1 is conventional -- we only care whether a row exists, not what it contains
-- The link o.customer_id = c.customer_id makes this a correlated subquery:
-- it re-executes for each customer using THAT customer's ID
🧠 What is this doing? Pattern 3 — NOT EXISTS anti-join. NOT EXISTS is preferred over NOT IN here because if orders.customer_id ever contains a NULL, NOT IN would return zero rows silently. NOT EXISTS handles NULLs correctly: it simply checks whether any matching row exists and is immune to NULL values in the inner query.
🖊 Exercise 4 of 5 — Calculate the average number of orders per customer (average of a per-customer count).
Hint: Derived table in FROM. You cannot AVG(COUNT(*)) directly — you need to count first, then average the counts.
SQL
SELECTROUND(AVG(order_count), 2) AS avg_orders_per_customer,
MIN(order_count) AS min_orders_for_any_customer,
MAX(order_count) AS max_orders_for_any_customer
FROM (
SELECT customer_id, COUNT(*) AS order_count
FROM orders
WHERE status = 'completed'GROUP BY customer_id
) AS customer_order_counts;
-- You cannot write SELECT AVG(COUNT(*)) directly -- SQL does not allow this
-- Solution: calculate the COUNT per customer in an inner query (derived table)
-- Then in the outer query, AVG those counts
-- The alias AS customer_order_counts is required for derived tables
🧠 What is this doing? Pattern 4 — double aggregation using a derived table. SQL does not allow AVG(COUNT(*)) in one step because you cannot aggregate an aggregate in the same query level. The solution: first GROUP BY customer_id to get counts (inner), then AVG those counts in the outer query. This pattern appears constantly: 'average revenue per order per customer', 'average number of items per basket', etc.
🖊 Exercise 5 of 5 — For each customer, show their most recent order date and the value of that specific order — using a correlated subquery.
Hint: Correlated subquery. The inner query finds the MAX order_date for EACH specific customer.
SQL
SELECT
c.customer_id,
c.first_name,
o.order_date AS last_order_date,
o.order_value AS last_order_value
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_date = (
SELECTMAX(order_date)
FROM orders
WHERE customer_id = c.customer_id -- correlated: uses outer query's customer_id
)
ORDER BY o.order_date DESC;
-- For EACH customer row, the inner query finds the MAX order_date for that customer
-- The WHERE then keeps only the order whose date equals that maximum
-- Result: one row per customer showing only their most recent order
-- Window function equivalent (faster on large tables):WITH ranked AS (
SELECT *, ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY order_date DESC) AS rn
FROM orders
)
SELECT customer_id, order_date, order_value FROM ranked WHERE rn = 1;
🧠 What is this doing? Pattern 6 — correlated subquery to find the latest record per group. The inner query's WHERE customer_id = c.customer_id is the correlation: it references the outer query's current customer_id, so it re-executes for each customer. The window function version using ROW_NUMBER() PARTITION BY customer_id is faster on large datasets and should be preferred in production environments.
What is Python and why do analysts use it? Python is a programming language — a way of giving step-by-step instructions to a computer. SQL asks questions of a database. Excel works with spreadsheets. Python does everything else: cleaning messy files, combining datasets, building charts, automating repetitive tasks, and eventually building machine learning models. You write instructions in a plain text file, run it, and the computer follows every step exactly. You do not need to learn all of Python — just two libraries well: pandas for manipulating data, and matplotlib for charts.
1 — Storing values: variables and basic data types
What a variable is — and why you need one
When Python runs your code, it needs somewhere to hold information so it can use it again later. A variable is simply a name you give to a piece of data. Think of it as a labelled box: you put something in the box, you give the box a name, and later you can open that box by referring to its name. You can store text, numbers, true/false flags, lists of items, and more.
Unlike some languages, Python works out the type of data automatically — you do not have to say "this is a number" or "this is text". You just write the value and Python figures it out.
Why it matters: Variables are how you hold onto results from one step so you can use them in the next. If you calculate total revenue in line 5, you store it in a variable so line 20 can reference it. Without variables, every piece of data would vanish the moment it was calculated.
Request: "Store a company name, the current year, the total revenue, and a flag for whether the company is profitable"
Outcome: Four variables are created in memory. You can reference them by name at any point later in your code. Python automatically knows that year is a whole number, revenue is a decimal number, etc.
Python
# The # symbol starts a comment — Python ignores everything after it
# Comments are notes for humans reading the code, not instructions for Python
company_name = "DataPath Analytics"# str — text always goes in quotes
year = 2024# int — a whole number, no quotes
revenue = 1_250_000.50# float — a decimal number (underscores just make it readable)
is_profitable = True# bool — only two options: True or False# Print a variable's value to see what is stored inside itprint(company_name) # Output: DataPath Analyticsprint(year) # Output: 2024print(revenue) # Output: 1250000.5# f-strings let you embed variables directly inside a sentence
# Put an f before the opening quote, then wrap variable names in { }print(f"In {year}, {company_name} made £{revenue:,.0f} in revenue")
# Output: In 2024, DataPath Analytics made £1,250,001 in revenue
# The :,.0f part means: format as a number with commas, 0 decimal places
Request: "Store a list of countries and their revenues so I can work with them together"
Outcome: Two lists are created. A list holds multiple values in order, like a column in a spreadsheet. Each value has a position number starting from 0 (not 1). You can read any item by its position.
Python
# A list holds multiple values in square brackets, separated by commas
# Think of it like a single column of a spreadsheet
countries = ["UK", "Germany", "France", "Spain"]
revenues = [42000, 31000, 28000, 19000]
# Access a single item using its position number in square brackets
# Python counts from 0 — the first item is at position 0, not 1print(countries[0]) # Output: UK (position 0 = first item)print(countries[1]) # Output: Germany (position 1 = second item)print(countries[-1]) # Output: Spain (position -1 = last item, always)print(len(countries)) # Output: 4 (how many items are in the list)# A dictionary stores key-value pairs — like a row with named columns
# Keys are the column names, values are the data
customer = {
"id": 1042,
"name": "Alice Chen",
"country": "UK",
"orders": 7
}
# Access a value using its key name in square bracketsprint(customer["name"]) # Output: Alice Chenprint(customer["orders"]) # Output: 7# Print a sentence about this customer using f-stringprint(f"Customer {customer['name']} from {customer['country']} has made {customer['orders']} orders")
# Output: Customer Alice Chen from UK has made 7 orders
2 — Loading and inspecting data: pandas basics
What pandas is and what a DataFrame looks like
pandas is a library — a collection of tools someone else has written that you can use in your own code. You "import" it at the top of your script to make those tools available. pandas gives you the DataFrame — a 2-dimensional table with labelled rows and columns, just like a spreadsheet, but programmable.
When you load a CSV file with pandas, it becomes a DataFrame in memory. From that point, you can filter it, sort it, calculate statistics, add new columns, and visualise it — all without touching the original file.
The first five things to always do with any new dataset: check its shape (how many rows and columns), look at the first few rows, check the data types of each column, check for missing values, and look at basic statistics. Do these every single time, no exceptions. They will catch errors before they corrupt your analysis.
Request: "Load a CSV file and find out: how big is it, what does it look like, are there any missing values?"
Outcome: Five different pieces of information about the dataset are printed to the screen. You will know the exact size, the column names and sample values, each column's data type, how many values are missing per column, and the basic statistics for every numeric column.
Python
import pandas as pd # "as pd" means you can type pd instead of pandas each time# Load the CSV file — this creates a DataFrame called df
# df is the standard variable name analysts use for a DataFrame
df = pd.read_csv('sales_data.csv')
# ── Step 1: How big is the dataset? ──────────────────────────print(df.shape)
# Output: (150000, 12)
# This means 150,000 rows and 12 columns
# .shape is a property — no brackets needed — it just tells you the dimensions# ── Step 2: What does the data look like? ────────────────────print(df.head())
# Output: The first 5 rows with all column names and their values
# This shows you whether the data loaded correctly and what columns exist
# df.head(10) would show 10 rows instead of the default 5# ── Step 3: What data type is each column? ───────────────────print(df.dtypes)
# Output example:
# order_date object ← BAD: date stored as plain text — needs fixing
# order_value float64 ← GOOD: stored as a proper decimal number
# customer_id int64 ← GOOD: stored as a whole number
# country object ← OK: text columns always show as "object"
# If a number column shows "object", it means some rows have text in them# ── Step 4: Are there missing values? ────────────────────────print(df.isnull().sum())
# Output example:
# order_id 0 ← perfect: no missing values at all
# customer_id 0 ← perfect
# phone 8234 ← 8,234 rows have no phone number
# order_value 3 ← 3 rows have no value — investigate these!
# .isnull() creates a True/False for every cell
# .sum() counts how many True values (i.e. missing) in each column# ── Step 5: What are the basic statistics? ───────────────────print(df.describe())
# Output: count, mean, min, max, std, quartiles for every numeric column
# Scan this for anything suspicious:
# → A negative order_value would be visible here immediately
# → A max value of 999999 when everything else is under 500 signals a bad row
3 — Selecting and filtering: getting the data you need
Picking specific columns and specific rows
Once data is loaded, you will almost never want to work with the entire DataFrame at once. You will want specific columns (like SELECT in SQL) or specific rows that meet a condition (like WHERE in SQL). In pandas, you select columns using square brackets with the column name. You filter rows by putting a condition inside square brackets.
The condition works by checking every row and returning True or False. Only rows where the condition is True are kept. This is identical in logic to WHERE in SQL — it just looks different.
Request: "Show me only the order_id, country, and order_value columns — nothing else"
Outcome: The same DataFrame but with only 3 columns visible. All rows are still there. This is useful when a dataset has 50 columns but you only need 3 for your analysis.
Python
import pandas as pd
df = pd.read_csv('sales_data.csv')
# Single column — use one set of square brackets
# Returns a Series (a single column of data, like a list with an index)
country_col = df['country']
print(country_col.head()) # Shows the first 5 values in the country column# Multiple columns — use TWO sets of square brackets (outer = pandas filter, inner = list)
# Returns a DataFrame (a table) with only those columns
small_df = df[['order_id', 'country', 'order_value']]
print(small_df.head())
# Output: a table with exactly 3 columns and all rows
# The original df is unchanged — small_df is a new separate view
Request: "Show me only the rows where the country is UK and the order value is over £100"
Outcome: A filtered DataFrame containing only rows that meet BOTH conditions simultaneously. All other rows are excluded. The original df is not modified — the filtered result is stored in a new variable.
Python
# Filter rows: put the condition inside square brackets
# The condition df['country'] == 'UK' checks every row:
# → True if the country column equals 'UK'
# → False otherwise
# Only rows where the result is True are returned
uk_orders = df[df['country'] == 'UK']
print(f"UK orders: {len(uk_orders)}") # shows how many rows passed the filter# Two conditions together: use & for AND, | for OR
# Each condition must be wrapped in its own brackets ( )
# This is different from SQL — in Python the brackets are required
uk_high_value = df[(df['country'] == 'UK') & (df['order_value'] > 100)]
print(f"UK orders over £100: {len(uk_high_value)}")
# .isin() is the pandas equivalent of SQL's IN operator
# Much cleaner than writing three separate == conditions with |
europe = df[df['country'].isin(['UK', 'Germany', 'France'])]
print(f"European orders: {len(europe)}")
Request: "Add a new column that calculates revenue after 20% VAT, and another that flags orders as high value if over £200"
Outcome: The DataFrame gains two new columns. Every row now has a revenue_inc_vat value (order_value × 1.20) and a True/False is_high_value flag. This is how you create derived metrics from existing data.
Python
# Add a new column by assigning a calculation to a new column name
# The calculation runs on every row simultaneously — no loop needed
df['revenue_inc_vat'] = df['order_value'] * 1.20# Every row now has: revenue_inc_vat = its own order_value multiplied by 1.20# A boolean flag column — True if condition is met, False otherwise
# This is useful for grouping, filtering, or counting later
df['is_high_value'] = df['order_value'] > 200# order_value of 350 → is_high_value = True
# order_value of 80 → is_high_value = False# Count how many rows are True in a boolean column
# .sum() counts True values (True = 1, False = 0)
high_value_count = df['is_high_value'].sum()
print(f"High value orders: {high_value_count}")
print(f"Percentage: {high_value_count / len(df) * 100:.1f}%")
# Sort the DataFrame by order_value, largest first
df_sorted = df.sort_values('order_value', ascending=False)
print(df_sorted[['order_id', 'order_value', 'country']].head(5))
# Output: the 5 rows with the highest order values
4 — Summarising data: groupby
groupby: the pandas equivalent of GROUP BY in SQL
groupby collapses rows that share the same value in a column into a single group, then applies an aggregation function (sum, count, mean, etc.) to each group separately. The result is one row per unique value in the grouping column — just like GROUP BY in SQL.
The most powerful form is .agg() — which lets you apply multiple different aggregation functions to different columns at the same time and give each result a meaningful name. This is how you build the summary tables that end up in reports.
Request: "Calculate total revenue, number of orders, average order value, and largest single order — for each country"
Outcome: One row per country. Each row shows all four calculated metrics for that country. The country with the highest revenue appears first. This is ready to paste into a report or chart.
Python
import pandas as pd
df = pd.read_csv('sales_data.csv')
# .groupby('country') means: treat all rows with the same country as one group
# .agg() then applies different calculations to different columns
# Format inside agg: new_column_name = ('source_column', 'function')
summary = df.groupby('country').agg(
total_orders = ('order_id', 'count'), # count how many rows per group
total_revenue = ('order_value', 'sum'), # sum all order_values per group
avg_order_value = ('order_value', 'mean'), # mean average per group
largest_order = ('order_value', 'max'), # maximum value per group
unique_customers= ('customer_id', 'nunique'), # count distinct customer IDs per group
).round(2).reset_index()
# .round(2) rounds every number to 2 decimal places
# .reset_index() moves 'country' from the index back into a regular column
# (groupby makes the grouping column the index by default — this reverses that)
summary_sorted = summary.sort_values('total_revenue', ascending=False)
print(summary_sorted.to_string(index=False)) # print without the row numbers
5 — Your first charts
Turning numbers into visuals that anyone can understand instantly
A column of numbers communicates data. A chart communicates insight. Patterns that are invisible in a table — a December spike, one country growing while all others decline, an outlier order 50 times larger than the average — become instantly obvious the moment you plot the data visually.
matplotlib is Python's core charting library. The basic pattern is always the same: (1) create a figure with a size, (2) tell it what type of chart to draw and what data to use, (3) add a title and axis labels, (4) show or save it. Once you know this pattern, you can produce any chart by changing just a few parameters.
Request: "Show me a bar chart of revenue by country (top 10 only), and a line chart of how monthly revenue changes over time"
Outcome: Two separate chart windows open (or two chart images are saved). The bar chart shows which countries generate the most revenue at a glance. The line chart shows whether revenue is growing, declining, or seasonal over time. These are the two most common chart types in business analytics.
Python
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv('sales_data.csv')
# ── Chart 1: Horizontal bar chart — revenue by country ───────# Step 1: calculate what to plot (total revenue per country, top 10 only)
country_rev = (df
.groupby('country')['order_value']
.sum() # add up all order values per country
.sort_values() # sort ascending so the longest bar appears at the top
.tail(10) # keep only the top 10 (last 10 after ascending sort)
)
# Step 2: create the chart
plt.figure(figsize=(10, 6)) # width=10 inches, height=6 inches — controls chart size
country_rev.plot(kind='barh', color='#4db8ff')
# kind='barh' means horizontal bar chart (kind='bar' would be vertical)
# color sets the bar colour — any hex colour code works here
plt.title('Top 10 Countries by Revenue', fontsize=14) # chart title
plt.xlabel('Total Revenue (£)') # x-axis label
plt.tight_layout() # automatically adjusts spacing so labels are not cut off
plt.savefig('revenue_by_country.png', dpi=150) # save to file (150 dpi = high quality)
plt.show() # display the chart on screen# ── Chart 2: Line chart — monthly revenue trend ──────────────# Step 1: convert the order_date column to a proper datetime type
# Then extract just the month-year period (e.g. "2024-01") for grouping
df['order_date'] = pd.to_datetime(df['order_date'])
df['month'] = df['order_date'].dt.to_period('M') # 'M' = month level# Step 2: sum revenue by month
monthly_rev = df.groupby('month')['order_value'].sum()
# Step 3: draw the line chart
plt.figure(figsize=(12, 4))
monthly_rev.plot(color='#3dffa0', linewidth=2, marker='o')
# linewidth=2 makes the line thicker and easier to read
# marker='o' adds a dot at each data point so individual months are visible
plt.title('Monthly Revenue Trend')
plt.ylabel('Revenue (£)')
plt.grid(axis='y', alpha=0.3) # light horizontal grid lines — helps reading values
plt.tight_layout()
plt.show()
6 — Making decisions with if, elif, and else
What if/else is and why you need it
Sometimes your code needs to make a decision: do one thing in one situation, something different in another. That is exactly what if, elif, and else are for. Think of it like a flowchart — Python checks the condition, and depending on whether it is True or False, it follows a different path.
This comes up constantly in data work: 'if the value is above 100, label it High, otherwise label it Low'. 'If the country is UK, apply one tax rate, else apply another'. 'If the email is missing, skip this row, else process it'. Once you understand if/else, you can add logic to any calculation.
Request: “Check whether a customer's order value qualifies for free shipping (orders over £50 get free shipping)”
Outcome: The code evaluates the condition and prints a different message depending on the result. Python runs the code inside the matching block and skips all the others.
Python — if / elif / else
# A single value to test
order_value = 75# if: check the first conditionif order_value >= 50:
print("Free shipping! Order qualifies.")
else:
print("Standard shipping applies.")
# Output: Free shipping! Order qualifies.# Multiple conditions with elif (short for "else if")# Python checks each condition in order and runs the FIRST one that is True
order_value = 120if order_value >= 200:
tier = "Platinum"# only runs if order_value is 200 or aboveelif order_value >= 100:
tier = "Gold"# only runs if the first condition was False AND this is Trueelif order_value >= 50:
tier = "Silver"# only runs if both above were False AND this is Trueelse:
tier = "Standard"# runs if ALL conditions above were Falseprint(f"Order value {order_value} -> {tier} tier")
# Output: Order value 120 -> Gold tier# Important: the indentation (4 spaces) is not decoration -- Python uses it# to know which lines belong inside the if block vs outside it# Code at the same indentation level runs together as one block
Tip: In Python, indentation is the structure. Code inside an if block must be indented exactly 4 spaces. If the indentation is inconsistent, Python will raise an IndentationError. This is the most common error beginners hit — just make sure all lines in the same block line up perfectly.
7 — Repeating actions with for loops
What a loop is — and when you need one
A loop tells Python to repeat the same block of code for each item in a collection. Without loops, if you wanted to print 100 values you would have to write 100 print statements. With a loop, you write the print statement once and Python handles the repetition.
In data analytics, loops appear when you need to process multiple files, apply the same calculation to multiple columns, or print a formatted summary for each row. However — in pandas, loops are often unnecessary. pandas is designed to work on entire columns at once (called vectorised operations), which is much faster than looping row by row. You will learn both so you know when each is appropriate.
Request: “Print a formatted summary line for each country in a list”
Outcome: Python goes through the list one item at a time, runs the code block for each, and moves to the next. The variable country holds the current item each time.
Python — for loops
# A list of countries to process
countries = ["UK", "Germany", "France", "Spain"]
# for loop: run the indented block once for each item# "country" is just a variable name -- it holds the current item each passfor country in countries:
print(f"Processing data for: {country}")
# Output:# Processing data for: UK# Processing data for: Germany# Processing data for: France# Processing data for: Spain# range(): generate a sequence of numbersfor i inrange(5): # i goes: 0, 1, 2, 3, 4print(f"Row {i}")
# Loop through rows of a small DataFrame# NOTE: in pandas, avoid row-by-row loops on large data -- use vectorised ops insteadimport pandas as pd
df = pd.DataFrame({"country":["UK","DE"], "revenue":[42000,31000]})
for index, row in df.iterrows(): # iterrows() gives (index, row) pairsprint(f"{row['country']}: £{row['revenue']:,}")
# UK: £42,000# DE: £31,000# FASTER alternative without a loop -- apply to entire column at once:
df["revenue_formatted"] = df["revenue"].apply(lambda x: ff"£{x:,}")
# This does the same thing but processes the entire column in one go
8 — Writing your own functions
What a function is and why you should write them
A function is a named block of reusable code. You define it once and call it as many times as you need. If you find yourself writing the same 5 lines of code in three different places in your script, that code should be a function.
Functions make your code dramatically easier to read, test, and fix. If the logic needs to change, you change it in one place and every call to the function automatically gets the update. Without functions, you have to find and change every copy separately — and you will inevitably miss one.
Anatomy of a function:def starts the definition, then the function name, then parentheses containing any inputs (called parameters), then a colon. The indented block below is the function body. return sends a value back to whoever called the function.
Request: “Write a function that classifies a customer into a tier based on their spend — so you can reuse it anywhere in your code”
Outcome: Once the function is defined, you call it by name and pass in a value. The function does the calculation and gives you back the result. You can call it on a single value or apply it to an entire DataFrame column.
Python — def — defining and calling functions
# def keyword starts a function definition# "spend" is the parameter -- a placeholder for whatever value you pass indefclassify_customer(spend):
"""Classify a customer into a tier based on their total spend. The triple-quote string (docstring) explains what the function does. """if spend >= 1000:
return"Gold"# return sends this value back to the callerelif spend >= 300:
return"Silver"elif spend >= 50:
return"Bronze"else:
return"New"# Call the function with different valuesprint(classify_customer(1500)) # Goldprint(classify_customer(250)) # Bronzeprint(classify_customer(10)) # New# Apply the function to a whole DataFrame column in one lineimport pandas as pd
df = pd.read_csv("customers.csv")
df["tier"] = df["total_spent"].apply(classify_customer)
# .apply() runs classify_customer on every value in the total_spent column# This is much cleaner than writing if/elif 4 times in a loop# Functions with multiple parameters and a default valuedefcalculate_vat(price, vat_rate=0.20):
"""Return price with VAT added. Default rate is 20% if not specified."""return price * (1 + vat_rate)
print(calculate_vat(100)) # 120.0 -- uses default 20%print(calculate_vat(100, 0.05)) # 105.0 -- uses 5% rate
9 — Working with dates and times
Why dates are tricky in Python — and how to handle them properly
Dates look simple but they cause more bugs than almost anything else in data work. When you load a CSV, dates are stored as plain text like '2024-03-15'. Python cannot do date arithmetic on text — you cannot subtract '2024-01-01' from '2024-03-15' and get 74 days. You first have to convert the text to a real datetime object.
Once converted, datetime objects unlock a huge range of operations: calculate the number of days between two dates, extract the month or year for grouping, filter to a specific time window, identify which day of the week something happened. These operations are central to almost every time-based analysis.
Request: “Convert a date column from text to real dates, then extract month and year, calculate days since order, and filter to recent orders”
Outcome: After conversion, the date column behaves like a real date — you can do arithmetic, extract components, and filter by date range.
Python — Working with dates
import pandas as pd
from datetime import date
df = pd.read_csv("orders.csv")
# Step 1: convert text dates to real datetime objects# BEFORE: "2024-03-15" is just text -- you cannot do maths with it
df["order_date"] = pd.to_datetime(df["order_date"])
# AFTER: 2024-03-15 00:00:00 is a real datetime -- you can now do maths# Step 2: extract date components using .dt accessor
df["year"] = df["order_date"].dt.year# 2024
df["month"] = df["order_date"].dt.month# 3 (March)
df["month_name"] = df["order_date"].dt.strftime("%B") # "March"
df["day_of_week"] = df["order_date"].dt.day_name() # "Friday"
df["quarter"] = df["order_date"].dt.quarter# 1, 2, 3, or 4# Step 3: calculate days since each order was placed
today = pd.Timestamp("today") # todays date as a Timestamp
df["days_ago"] = (today - df["order_date"]).dt.days# subtracts dates, extracts days# "today - order_date" gives a timedelta (a duration)# .dt.days converts that duration into a plain integer number of days# Step 4: filter to only orders from 2024
orders_2024 = df[
(df["order_date"] >= "2024-01-01") &
(df["order_date"] < "2025-01-01")
]
# You can compare datetime columns directly to text date strings# pandas converts the string to a datetime automatically for the comparison# Group by month to see monthly revenue trend
df["month_period"] = df["order_date"].dt.to_period("M") # "2024-03"
monthly = df.groupby("month_period")["order_value"].sum()
print(monthly)
10 — Handling errors gracefully with try/except
What happens when your code hits a problem — and how to control it
When Python encounters something it cannot handle — dividing by zero, opening a file that does not exist, converting 'abc' to a number — it raises an error (also called an exception) and stops immediately. If you are processing 100,000 rows and one has bad data, your entire script crashes on that row.
The try/except block lets you catch errors and decide what to do instead. Put the code that might fail inside try. Put the fallback behaviour inside except. If no error occurs, the except block is skipped entirely. If an error occurs, Python jumps to the except block instead of crashing.
When to use it in data work: Converting messy text columns to numbers (some rows might have '£42.50' or 'N/A' instead of a clean number), reading files that might not exist, calling APIs that might time out.
Python — try / except
# Without try/except: one bad value crashes the whole script
values = ["42", "75", "N/A", "28"]
# This would crash on "N/A" -- ValueError: invalid literal for int()# for v in values:# print(int(v))# With try/except: handle the bad value, continue processing the restfor v in values:
try:
result = int(v) # try to convert to integerprint(f"Converted: {result}")
except ValueError: # if conversion fails (ValueError), do this insteadprint(f"Could not convert '{v}' -- skipping")
# Output:# Converted: 42# Converted: 75# Could not convert 'N/A' -- skipping# Converted: 28# Practical data use: safely convert a messy price column to floatdefsafe_to_float(value):
"""Try to convert value to float. Return None if it fails."""try:
returnfloat(str(value).replace("£", "").replace(",", ""))
except (ValueError, TypeError):
returnNone# return None for values we cannot convertimport pandas as pd
df = pd.read_csv("messy_prices.csv")
df["price_clean"] = df["price_raw"].apply(safe_to_float)
# "£42.50" -> 42.5# "1,250" -> 1250.0# "N/A" -> None (can then fill with df.fillna(0))
11 — Reading and writing different file types
CSV is not the only file format you will work with
In a real analytics job you will receive data in many formats: CSV, Excel (.xlsx), JSON, and sometimes plain text files. pandas handles all of these with different read functions, and can write to all of them too. Knowing which function to use for each format saves significant time.
You will also frequently need to combine multiple files into one DataFrame — for example, 12 monthly CSV files that each contain one month of orders. The pattern for this is simple: read each file into a list, then use pd.concat() to stack them all together.
Python — Reading and writing files
import pandas as pd
import glob # glob finds files matching a pattern# ── Reading different formats ──────────────────────────────# CSV (most common)
df = pd.read_csv("sales_data.csv")
df = pd.read_csv("data.csv", encoding="latin-1") # for files with special characters
df = pd.read_csv("data.csv", sep=";") # for semicolon-separated files# Excel (.xlsx or .xls)
df = pd.read_excel("report.xlsx") # reads the first sheet
df = pd.read_excel("report.xlsx", sheet_name="Sales") # specific sheet by name
df = pd.read_excel("report.xlsx", sheet_name=1) # second sheet by position# JSON (common from APIs and web data)
df = pd.read_json("data.json")
# ── Combining multiple CSV files into one DataFrame ─────────# Scenario: 12 monthly files named jan_2024.csv, feb_2024.csv, etc.
all_files = glob.glob("*.csv") # find all .csv files in the folderprint(f"Found {len(all_files)} files")
dataframes = [] # empty list to collect each file's DataFramefor file in all_files:
df_single = pd.read_csv(file)
df_single["source_file"] = file # add a column showing which file each row came from
dataframes.append(df_single)
combined = pd.concat(dataframes, ignore_index=True)
# ignore_index=True: reset row numbers from 0 instead of repeating 0,1,2 for each fileprint(f"Combined shape: {combined.shape}")
# ── Writing results back out ─────────────────────────────────
combined.to_csv("combined_output.csv", index=False) # save as CSV
combined.to_excel("combined_output.xlsx", index=False) # save as Excel# index=False: prevents pandas writing the row numbers as a column in the file
12 — Cleaning and working with text data
String operations you will use constantly on real messy data
Text data is almost always messy. Extra spaces, inconsistent capitalisation, punctuation mixed into numeric fields, email addresses in different formats. Python has a rich set of string methods for fixing these issues, and pandas exposes them all through the .str accessor so you can apply them to an entire column at once.
This section focuses on the operations that come up most in real data work, explained simply with before/after examples so you can see exactly what each one does.
Python — String cleaning operations
import pandas as pd
df = pd.read_csv("customers.csv")
# ── Basic text cleaning ────────────────────────────────────# Remove extra spaces from both ends: " UK " -> "UK"
df["country"] = df["country"].str.strip()
# Change case: "united kingdom" -> "United Kingdom"
df["country"] = df["country"].str.title() # Title Case
df["email"] = df["email"].str.lower() # all lowercase
df["code"] = df["code"].str.upper() # ALL UPPERCASE# Replace text: "£42.50" -> "42.50"
df["price"] = df["price"].str.replace("£", "", regex=False)
df["price"] = df["price"].str.replace(",", "", regex=False) # remove commas# Check if text contains a pattern
df["is_gmail"] = df["email"].str.contains("@gmail.com", na=False)
df["is_test"] = df["email"].str.contains("test", case=False, na=False)
# na=False: treat missing values as False (not True, not an error)# case=False: match regardless of capitalisation (TEST, test, Test all match)# Extract parts of text
df["email_domain"] = df["email"].str.split("@").str[1]
# split("@") splits "alice@gmail.com" into ["alice", "gmail.com"]# .str[1] takes the second item (index 1): "gmail.com"
df["first_name"] = df["full_name"].str.split(" ").str[0] # first word
df["last_name"] = df["full_name"].str.split(" ").str[-1] # last word# Check string length (useful for validation)
df["name_length"] = df["first_name"].str.len()
suspicious = df[df["first_name"].str.len() < 2] # single-letter names = data quality issue
13 — Creating pivot tables with pandas
Pivot tables in Python — just like Excel but programmable
A pivot table reshapes data: you take a table of individual rows and summarise it by two dimensions simultaneously — one along the rows, one along the columns. You have used this concept in Excel; pd.pivot_table() does the same thing in Python.
The advantage over Excel: your pivot table is reproducible. Any time the source data is updated, you re-run the script and get a fresh pivot instantly. No manual drag-and-drop required. No 'I forgot to refresh the pivot' errors in a report.
Python — pivot_table
import pandas as pd
df = pd.read_csv("orders.csv")
# Basic pivot table: revenue by region (rows) and product category (columns)
pivot = pd.pivot_table(
df,
values="order_value", # what number to summarise
index="region", # what goes in the rows
columns="category", # what goes in the columns
aggfunc="sum", # how to summarise: sum, mean, count, min, max
fill_value=0# replace NaN (no orders) with 0
)
print(pivot)
# Result looks like this:# category Electronics Clothing Homeware# region# North 45000 12000 8000# South 38000 15000 6000# East 52000 9000 7000# Multiple aggregations at once
pivot2 = pd.pivot_table(
df,
values="order_value",
index="region",
columns="category",
aggfunc=["sum", "count", "mean"], # list of functions
fill_value=0
)
# Add row and column totals (margins)
pivot3 = pd.pivot_table(
df,
values="order_value",
index="region",
columns="category",
aggfunc="sum",
margins=True, # adds a "All" row and column with totals
margins_name="Total"# label for the totals row/column
)
print(pivot3.round(0))
End-of-Lesson Exercises
How to use these: Read the request and attempt it yourself first. Then expand the answer to compare. Read the explanation so you understand what every line does before moving on.
🖊 Exercise 1 of 5 - Load sales_data.csv and print: how many rows and columns, the first 3 rows, and which columns have missing values.
Hint: Three pandas tools: .shape, .head(3), .isnull().sum(). Print each one.
Python
import pandas as pd
df = pd.read_csv('sales_data.csv')
print("Shape:", df.shape)
# Output: (150000, 12) -- 150,000 rows and 12 columnsprint("\nFirst 3 rows:")
print(df.head(3))
# Shows the first 3 rows with all column names and sample valuesprint("\nMissing values per column:")
print(df.isnull().sum())
# isnull() marks True where a value is missing
# .sum() then counts how many Trues in each column
# A column showing 0 has no missing values
# A column showing 8234 has 8,234 empty cells that need attention
🧠 What is this doing? df.shape returns (rows, columns) as a tuple - no brackets needed, it is a property. df.head(3) shows the first 3 rows. df.isnull().sum() chains two steps: isnull() marks each cell True/False, then sum() counts how many True values per column.
🖊 Exercise 2 of 5 - Show only completed orders from the UK worth more than 50. Print how many rows match and show the first 5.
Hint: Three conditions with &. Each must be in its own brackets (). len() counts, .head(5) previews.
Python
import pandas as pd
df = pd.read_csv('sales_data.csv')
# Each condition goes in its own () brackets -- required by Python
filtered = df[
(df['status'] == 'completed') &
(df['country'] == 'UK') &
(df['order_value'] > 50)
]
print(f"Rows matching all conditions: {len(filtered)}")
# len() counts the rows that passed all three conditionsprint(filtered.head(5))
# Shows the first 5 rows that passed
🧠 What is this doing? The & operator means AND - all conditions must be True. Each condition must be in () because Python evaluates & before == without them. The outer [] keeps only rows where the combined condition is True. The original df is not changed - filtered is a new separate DataFrame.
🖊 Exercise 3 of 5 - Add two columns: revenue_after_discount (10% off) and is_europe (True if UK, Germany, or France).
Hint: Multiply by 0.90 for the discount. Use .isin([...]) for the Europe check.
Python
import pandas as pd
df = pd.read_csv('sales_data.csv')
# 10% discount means keeping 90% -- multiply by 0.90
df['revenue_after_discount'] = df['order_value'] * 0.90# Runs on every row at once: row1: 100*0.9=90, row2: 200*0.9=180# .isin() checks each value against the list -- True if in list, False if not
df['is_europe'] = df['country'].isin(['UK', 'Germany', 'France'])
# UK -> True | Spain -> False | Germany -> True | US -> Falseprint(df[['order_id', 'order_value', 'revenue_after_discount', 'country', 'is_europe']].head())
🧠 What is this doing? Assigning a calculation to df['new_column_name'] creates a new column across all rows simultaneously. The multiplication runs on every row at once. .isin(['UK','Germany','France']) checks each country value against the list and returns True or False - equivalent to SQL's IN operator.
🖊 Exercise 4 of 5 - Group by country: total revenue, order count, average order value. Show top 5 countries by revenue.
Hint: Use .groupby('country').agg(...) with three calculations. Chain .sort_values() and .head(5).
Python
import pandas as pd
df = pd.read_csv('sales_data.csv')
summary = df.groupby('country').agg(
total_revenue = ('order_value', 'sum'), # sum all order_values per country
order_count = ('order_id', 'count'), # count rows per country
avg_order_value = ('order_value', 'mean'), # mean average per country
).round(2).reset_index()
# .round(2) rounds all numbers to 2 decimal places
# .reset_index() moves country from the index back to a regular column
top5 = summary.sort_values('total_revenue', ascending=False).head(5)
print(top5.to_string(index=False))
🧠 What is this doing? groupby('country') creates one group per unique country. agg() applies different functions to different columns at the same time. .round(2) tidies the output. .reset_index() moves country back to a normal column. sort_values descending puts the biggest revenue first. head(5) keeps only the top 5.
🖊 Exercise 5 of 5 - Create a horizontal bar chart of revenue by country for the top 8 countries. Add a title, axis label, and save as PNG.
Hint: Group by country first. sort_values().tail(8) gets the top 8. plt.figure(), .plot(kind='barh'), title, xlabel, savefig.
Python
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('sales_data.csv')
# Step 1: calculate what to plot
country_rev = (
df.groupby('country')['order_value']
.sum() # total revenue per country
.sort_values() # sort ascending so top country is last
.tail(8) # last 8 after ascending = top 8
)
# Step 2: draw the chart
plt.figure(figsize=(10, 5)) # canvas: 10 inches wide, 5 tall
country_rev.plot(kind='barh', color='#4db8ff') # barh = horizontal bars
plt.title('Top 8 Countries by Revenue', fontsize=14)
plt.xlabel('Total Revenue')
plt.tight_layout() # prevents labels from being cut off
plt.savefig('top_countries.png', dpi=150) # save as high-res image
plt.show()
🧠 What is this doing? sort_values() ascending then .tail(8) keeps the 8 largest values, with the largest last. Horizontal bars are then drawn bottom to top so the largest appears at the top. dpi=150 creates a high-resolution image. tight_layout() auto-adjusts spacing.
You can now load, filter, group and chart data. Intermediate Python is about handling the messy reality of real-world data — fixing types, filling gaps, removing duplicates — then combining data from multiple sources and building multi-panel dashboards. These skills close the gap between "I can run a notebook" and "I can deliver a complete analysis."
1 — Cleaning messy data
Why cleaning exists and what you are actually doing when you clean
Every dataset you receive in a real job will have problems. Dates stored as text that will not sort correctly. Numbers with currency symbols that Python thinks are text. Duplicate rows from a broken export. Country names written as "UK", "uk", "U.K." and "United Kingdom" all meaning the same place. The customer_id column that has some blank rows.
These are not rare edge cases — they are the default state of raw data. Cleaning is not a boring formality. It is the difference between an analysis that is right and one that is confidently wrong. A revenue sum that includes duplicate rows overstates revenue. An average calculated on a column that includes text values returns an error. Patterns discovered in data that includes test accounts are fiction.
Clean in a specific order: first understand what is missing, then remove obvious garbage (duplicates, test rows), then fix data types, then handle missing values, then standardise text formats. Doing it in the wrong order creates new problems — for example, trying to fill missing values before converting a text column to numeric type will fail silently.
Request: "Clean a messy sales dataset — remove duplicates, fix types, fill missing values, and standardise text"
Outcome: A clean DataFrame where every row is unique, every column has the right data type, missing values are handled, and text is consistent. The script also prints a summary at each step so you can see exactly what changed.
Python
import pandas as pd
import numpy as np # numpy provides mathematical tools and the NaN value
df = pd.read_csv('messy_data.csv')
print(f"Starting shape: {df.shape}") # record the starting size# ── Step 1: understand what is missing ───────────────────────# Calculate the % of each column that is empty or null
missing_pct = (df.isnull().sum() / len(df) * 100).sort_values(ascending=False)
print("\nMissing values (% of column):")
print(missing_pct[missing_pct > 0]) # only show columns that have some missing# Decision rule: if > 60% missing → drop the column entirely
# if < 10% missing → safe to fill with a default
# if in between → investigate why it is missing before deciding# ── Step 2: remove duplicate rows ────────────────────────────
n_before = len(df)
df = df.drop_duplicates(subset=['order_id'], keep='first')
# subset=['order_id'] means: two rows are duplicates if they share the same order_id
# keep='first' means: keep the first occurrence, drop all others
# If you omit subset, ALL columns must match for a row to be considered a duplicateprint(f"\nDuplicates removed: {n_before - len(df)}")
# ── Step 3: fix data types ───────────────────────────────────# Dates stored as text will not sort correctly or allow date arithmetic
df['order_date'] = pd.to_datetime(df['order_date'])
# "2024-01-15" as text becomes a real date object that you can add days to, group by month, etc.# Numbers stored as text (e.g. "£42.50" or "1,250") need two steps:
# first strip the non-numeric characters, then convert to float
df['order_value'] = pd.to_numeric(
df['order_value'].str.replace('£', '').str.replace(',', ''),
errors='coerce'# errors='coerce' turns any unparseable value into NaN instead of crashing
)
# ── Step 4: fill missing values with sensible defaults ───────# Drop rows where critical columns are missing — we cannot use a row with no order value
df = df.dropna(subset=['order_id', 'order_value'])
# For optional columns, fill with a default that makes sense
df['country'] = df['country'].fillna('Unknown') # 'Unknown' is better than a blank
df['phone'] = df['phone'].fillna('Not provided')
# ── Step 5: standardise text formatting ──────────────────────# .str.strip() removes spaces from both ends: " UK " → "UK"
# .str.upper() converts all letters to uppercase: "uk" → "UK"
# Together they ensure "UK", "uk", " Uk " all become "UK"
df['country'] = df['country'].str.strip().str.upper()
df['email'] = df['email'].str.strip().str.lower() # emails should always be lowercaseprint(f"\nClean dataset shape: {df.shape}") # compare to starting shape
2 — Combining DataFrames: merge
Joining tables in Python the same way SQL JOINs work
In pandas, .merge() is the equivalent of a SQL JOIN. You have two DataFrames with a shared column (the key), and you want to combine them so information from both appears in the same row. The how parameter controls which rows are kept — exactly like the JOIN type in SQL. After every merge, check that the number of rows is what you expected. More rows than the left DataFrame means duplicate keys exist — one row matched multiple rows in the other, creating extra copies.
Request: "Combine the orders and customers files so each order row shows the customer's name and country alongside it"
Outcome: A single DataFrame containing all columns from both files, joined on the customer_id column. Every order row now shows the matching customer's details. The script also checks that the merge did not accidentally multiply rows.
Python
import pandas as pd
orders = pd.read_csv('orders.csv')
customers = pd.read_csv('customers.csv')
# how='inner' → only rows where customer_id exists in BOTH DataFrames
# Orders with no matching customer record will be EXCLUDED
merged_inner = orders.merge(customers, on='customer_id', how='inner')
# how='left' → ALL rows from orders, customer data added where it exists
# Orders with no matching customer show NaN in customer columns
merged_left = orders.merge(customers, on='customer_id', how='left')
# Always check merge quality — did the row count change unexpectedly?print(f"Orders before merge: {len(orders)}")
print(f"Orders after merge: {len(merged_left)}")
# If merged has MORE rows than orders → there are duplicate customer_ids in customers
# If merged has FEWER rows (inner join) → some orders had no matching customer# Merge three tables by chaining two .merge() calls
products = pd.read_csv('products.csv')
full = (orders
.merge(customers, on='customer_id', how='left') # add customer info
.merge(products, on='product_id', how='left') # add product info
)
print(f"Columns after joining all three: {list(full.columns)}")
3 — Multi-panel analysis dashboard
Building the standard analyst deliverable: a four-panel chart
A four-panel chart layout is the most common format for delivering an exploratory data analysis. It answers four related questions at once and gives a stakeholder the full picture in a single image. The pattern is: create a 2×2 grid of subplots, fill each with a different type of chart, add titles to each, then save the whole thing as one high-resolution image file.
seaborn is a charting library built on top of matplotlib that produces better-looking statistical charts with less code. It is particularly good for heatmaps, box plots, and scatter plots — the types of charts used constantly in data analysis.
Request: "Build a four-panel dashboard showing: order value distribution, revenue by country, correlation heatmap, and order values by category"
Outcome: One image file containing four charts arranged in a 2×2 grid. A stakeholder looking at this image can immediately understand the distribution of order sizes, which countries generate the most revenue, which variables are related to each other, and how order values compare across product categories.
Python
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv('sales_data.csv')
# Create a 2-row, 2-column grid of charts
# fig = the overall canvas (the whole image)
# axes = a 2D array of the four individual chart areas
# axes[0,0] = top-left | axes[0,1] = top-right
# axes[1,0] = bottom-left | axes[1,1] = bottom-right
fig, axes = plt.subplots(2, 2, figsize=(14, 9))
# ── Top-left: histogram of order values ──────────────────────# Shows the SHAPE of the data: is it bell-shaped, right-skewed, bimodal?
# Bins=40 means split the range into 40 columns
axes[0, 0].hist(df['order_value'], bins=40, color='#4db8ff', edgecolor='none')
axes[0, 0].set_title('Order Value Distribution')
axes[0, 0].set_xlabel('Order Value (£)')
axes[0, 0].set_ylabel('Number of Orders')
# ── Top-right: horizontal bar chart — top 10 countries ────────# sort_values() ascending then .tail(10) gives the top 10 with largest at top
rev = df.groupby('country')['order_value'].sum().sort_values().tail(10)
rev.plot(kind='barh', ax=axes[0, 1], color='#3dffa0')
# ax=axes[0,1] tells matplotlib which subplot to draw this chart in
axes[0, 1].set_title('Top 10 Countries by Revenue')
axes[0, 1].set_xlabel('Total Revenue (£)')
# ── Bottom-left: correlation heatmap ─────────────────────────# Shows which numeric columns are related to each other
# +1 = perfect positive relationship | -1 = perfect inverse | 0 = no relationship
num_cols = df.select_dtypes(include='number') # only numeric columns
sns.heatmap(num_cols.corr(), annot=True, fmt='.2f',
cmap='RdYlGn', center=0, ax=axes[1, 0], linewidths=0.5)
# annot=True shows the correlation number inside each cell
# fmt='.2f' formats those numbers to 2 decimal places
# cmap='RdYlGn' = red (negative) to green (positive) colour scale
# center=0 ensures 0 correlation = yellow (neutral)
axes[1, 0].set_title('Correlation Matrix')
# ── Bottom-right: box plot — order values by product category ─# Shows the median, spread, and outliers for each category simultaneously
# Useful for spotting that one category has much larger or more variable orders
sns.boxplot(data=df, x='category', y='order_value',
ax=axes[1, 1], palette='Set2')
axes[1, 1].set_title('Order Value by Product Category')
axes[1, 1].tick_params(axis='x', rotation=30) # rotate labels so they do not overlap# Add an overall title to the whole figure (above all four charts)
plt.suptitle('Sales Analysis Dashboard', fontsize=16, y=1.01)
plt.tight_layout() # automatically adjusts spacing between the four subplots
plt.savefig('analysis_dashboard.png', dpi=150, bbox_inches='tight')
# dpi=150 = high resolution (suitable for presentations)
# bbox_inches='tight' ensures the overall title is not cut off
plt.show()
4 — Lambda functions: quick one-line functions
What a lambda is and when to use one
A lambda is a tiny anonymous function written in a single line. Instead of using def to define a named function, you write the logic inline wherever you need it. Lambda functions are used almost exclusively with .apply(), .map(), and sorted() — places where you need a simple transformation but it is not worth writing a full function for it.
When to use lambda vs def: If the function is one line and only used in one place, lambda is cleaner. If the function is more than one line, or used in multiple places, write a proper def function.
Python — lambda functions
import pandas as pd
# Regular function vs lambda -- both do the same thing# Regular function:defdouble(x):
return x * 2# Lambda: lambda [parameters]: [expression to return]
double_lambda = lambda x: x * 2print(double(5)) # 10print(double_lambda(5)) # 10# Most common use: inside .apply() to transform a column
df = pd.read_csv("orders.csv")
# Add VAT to every order value
df["price_with_vat"] = df["order_value"].apply(lambda x: x * 1.20)
# Classify orders into size buckets
df["size"] = df["order_value"].apply(
lambda x: "Large"if x > 200else ("Medium"if x > 50else"Small")
)
# The inline if/else (ternary expression) works inside a lambda:
# value_if_true if condition else value_if_false# Apply to multiple columns with .apply(axis=1)# axis=1 means: pass each ROW as a Series to the lambda
df["label"] = df.apply(
lambda row: f"{row['country']} - {row['category']}",
axis=1
)
# Creates: "UK - Electronics", "DE - Clothing", etc.
5 — Faster categorisation with np.where and pd.cut
Two tools that replace most if/elif chains applied to columns
np.where is the vectorised (fast) alternative to applying a simple if/else to every row. Instead of looping or using .apply(), it works on the entire column at once. Think of it as: 'for every row, if this condition is True use value A, else use value B'.
pd.cut automatically places numeric values into bins (ranges) you define. Instead of writing CASE WHEN age >= 18 AND age < 25 THEN '18-24' etc., you define the bin edges and labels once and pandas assigns every row to the right bin. Perfect for age groups, price brackets, score ranges — any time you need to convert a continuous number into a category.
Python — np.where and pd.cut
import pandas as pd
import numpy as np
df = pd.read_csv("orders.csv")
# np.where: fast binary condition (True = value A, False = value B)# np.where(condition, value_if_true, value_if_false)
df["is_high_value"] = np.where(df["order_value"] > 100, "High", "Standard")
# order_value = 150 -> "High"# order_value = 45 -> "Standard"# Nest np.where for multiple tiers (like CASE WHEN with 3 outcomes)
df["tier"] = np.where(
df["order_value"] >= 500, "Platinum", # if >= 500: Platinum
np.where(
df["order_value"] >= 100, "Gold", # elif >= 100: Gold"Standard"# else: Standard
)
)
# pd.cut: automatically bin a numeric column into labelled ranges# bins defines the EDGES of the ranges# labels defines the NAME for each range
df["order_band"] = pd.cut(
df["order_value"],
bins=[0, 25, 75, 150, float("inf")], # 4 edges = 3 bins
labels=["Small", "Medium", "Large", "XL"], # label for each bin
right=True# right=True means: 0 < x <= 25 is "Small"
)
# order_value = 10 -> "Small" (between 0 and 25)# order_value = 50 -> "Medium" (between 25 and 75)# order_value = 200 -> "XL" (above 150)# Check the distribution of the binsprint(df["order_band"].value_counts())
# pd.qcut: same idea but uses quantiles instead of fixed edges# pd.qcut(df["order_value"], q=4, labels=["Q1","Q2","Q3","Q4"])# Splits data into 4 equal-sized groups (quartiles) automatically
6 — Formatting numbers for professional output
Making your numbers readable for non-technical audiences
When you show analysis results to a manager, raw numbers like 1247853.6492 are hard to read. A properly formatted number — £1,247,854 — communicates instantly. Python's f-strings support formatting codes inside the curly braces that control how numbers are displayed: commas, decimal places, percentages, padding.
This is a small topic but it matters enormously for professional output. A report where revenue shows as 1247853.649132 looks like it came from a script. A report where it shows as £1,247,854 looks like it came from an analyst.
Python — Number formatting
# f-string format codes go after a colon inside the {}
revenue = 1247853.6492
rate = 0.1847
count = 42381# Comma separator for thousandsprint(ff"Revenue: {revenue:,.0f}") # 1,247,854 (0 decimal places)print(ff"Revenue: {revenue:,.2f}") # 1,247,853.65 (2 decimal places)print(ff"Revenue: £{revenue:,.0f}") # £1,247,854 (add currency symbol)# Percentage formattingprint(ff"Rate: {rate:.1%}") # 18.5% (multiplies by 100 and adds %)print(ff"Rate: {rate:.2%}") # 18.47%# Integer with commasprint(ff"Customers: {count:,}") # 42,381# Apply formatting to a whole DataFrame column for displayimport pandas as pd
df = pd.read_csv("summary.csv")
# Format revenue column for display (creates a new string column)
df["revenue_display"] = df["revenue"].apply(lambda x: ff"£{x:,.0f}")
df["rate_display"] = df["rate"].apply(lambda x: ff"{x:.1%}")
# Print a formatted summary tableprint("\n=== Monthly Summary ===")
for _, row in df.iterrows():
print(ff" {row['month']:<10} Revenue: {row['revenue']:>12,.0f} Growth: {row['growth']:>+.1%}")
# :<10 means: left-align in a field 10 characters wide# :>12 means: right-align in a field 12 characters wide# :>+.1% means: show + or - sign, 1 decimal place, percentage
End-of-Lesson Exercises
How to use these: Read the request and attempt it yourself first. Then expand the answer to compare. Read the explanation so you understand what every line does before moving on.
🖊 Exercise 1 of 5 - Load messy_data.csv. Remove duplicates by order_id, drop rows with missing order_value, convert order_date to datetime. Print shape before and after.
Hint: drop_duplicates(subset=['order_id']), dropna(subset=['order_value']), pd.to_datetime(). Print df.shape at start and end.
Python
import pandas as pd
df = pd.read_csv('messy_data.csv')
print(f"Before: {df.shape}")
df = df.drop_duplicates(subset=['order_id'], keep='first')
# Two rows with the same order_id are duplicates
# keep='first' keeps the first occurrence, drops all others
df = df.dropna(subset=['order_value'])
# Drops only rows where order_value is NaN
# Rows with NaN in other columns are kept
df['order_date'] = pd.to_datetime(df['order_date'])
# Converts text dates to proper datetime type
# Enables date sorting, arithmetic, and month/year extractionprint(f"After: {df.shape}")
print(f"order_date type: {df['order_date'].dtype}")
🧠 What is this doing? drop_duplicates(subset=['order_id']) checks only the order_id column - two rows sharing the same order_id are duplicates. dropna(subset=['order_value']) drops rows where order_value is NaN only. pd.to_datetime() converts text dates to proper datetime objects enabling date arithmetic and grouping by month.
🖊 Exercise 2 of 5 - Merge orders.csv with customers.csv (left join on customer_id). Print row count before and after to check no rows were lost or multiplied.
Hint: orders.merge(customers[['customer_id','country']], on='customer_id', how='left'). Check len() before and after.
Python
import pandas as pd
orders = pd.read_csv('orders.csv')
customers = pd.read_csv('customers.csv')
merged = orders.merge(
customers[['customer_id', 'country']], # only bring the columns we need
on='customer_id',
how='left'
)
# Quality check -- these should be equal for a left joinprint(f"Orders before: {len(orders)}")
print(f"Orders after: {len(merged)}")
# If after > before: duplicate customer_ids in the customers file
# If equal: merge worked correctlyprint(f"Missing country: {merged['country'].isnull().sum()}")
🧠 What is this doing? customers[['customer_id','country']] selects only two columns before merging to avoid adding unwanted columns. how='left' keeps all orders even if no matching customer exists. The row count check is critical: a left join should produce the same number of rows as the left DataFrame. More rows means duplicate customer_ids in the customers file.
🖊 Exercise 3 of 5 - Standardise the country column: strip spaces and convert to uppercase. Show top 5 countries before and after to confirm the effect.
Hint: Use .str.strip().str.upper(). Compare .value_counts().head(5) before and after.
Python
import pandas as pd
df = pd.read_csv('sales_data.csv')
print("BEFORE:")
print(df['country'].value_counts().head(5))
# Might show: "UK": 12000, " UK": 342, "uk": 89 as separate entries
df['country'] = df['country'].str.strip().str.upper()
# .str.strip() removes spaces from both ends: " UK " -> "UK"
# .str.upper() converts all letters to uppercase: "uk" -> "UK"
# These chain: strip runs first, then upper runs on the resultprint("\nAFTER:")
print(df['country'].value_counts().head(5))
# Now "UK", " UK", "uk" all become "UK" -- their counts combine
🧠 What is this doing? .str methods work on every row simultaneously. .str.strip() removes leading and trailing whitespace. .str.upper() converts all characters to uppercase. Comparing value_counts before and after shows 'UK', ' UK', and 'uk' merging into one consistent entry.
🖊 Exercise 4 of 5 - Build a 2-panel chart: left = histogram of order_value, right = bar chart of top 8 countries by revenue.
Hint: plt.subplots(1, 2) for 1 row 2 columns. axes[0]=left, axes[1]=right. Use ax= parameter in .plot().
Python
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('sales_data.csv')
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# 1 row, 2 columns. axes[0]=left panel, axes[1]=right panel# Left: histogram of order values
axes[0].hist(df['order_value'], bins=30, color='#4db8ff')
axes[0].set_title('Order Value Distribution')
axes[0].set_xlabel('Order Value')
# Right: top 8 countries by revenue
rev = df.groupby('country')['order_value'].sum().sort_values().tail(8)
rev.plot(kind='barh', ax=axes[1], color='#3dffa0')
# ax=axes[1] tells pandas which subplot to draw this chart into
axes[1].set_title('Top 8 Countries by Revenue')
plt.tight_layout()
plt.savefig('two_panel.png', dpi=150)
plt.show()
🧠 What is this doing? plt.subplots(1, 2) creates 1 row and 2 columns of subplots. For a 1D layout axes[0] is left and axes[1] is right. The ax= parameter in .plot() tells pandas which subplot area to draw into. tight_layout() prevents titles and labels from overlapping.
🖊 Exercise 5 of 5 - Full pipeline: load, clean country column, merge with customers for email, group by country for revenue, save to CSV.
🧠 What is this doing? This exercise chains every intermediate skill in one pipeline. Clean first so dirty data does not corrupt the analysis. Merge with only the needed columns to keep the result tidy. to_csv(index=False) prevents the row numbers from becoming an extra column in the saved file.
Why Excel still matters even if you know SQL and Python. Excel is the universal language of business. Your manager lives in it. Clients send you files in it. Finance, operations, and executives all work primarily in it. You will open Excel files every single week regardless of your other tools. The good news: 20 formulas and 3 features cover 80% of everything analysts actually do in Excel. This section teaches exactly those.
1 — Cell references: the most important thing to understand first
Why cell references behave differently when you copy a formula
Every formula in Excel points at cells. When you copy a formula to a different row, Excel has to decide: should the cell reference move with the formula, or should it stay fixed? The answer depends on what type of reference you used — and getting this wrong is the source of a large proportion of Excel errors in real workplaces.
A relative reference (like A2) moves when you copy. If you write =A2*B2 in row 2 and copy it to row 3, Excel automatically updates it to =A3*B3. This is what you want for per-row calculations where each row should calculate its own value.
An absolute reference (like $B$1) never moves. The dollar signs lock it in place. This is what you use when one cell holds a value — a tax rate, an exchange rate, a target — that every formula in the column should always point to. When that value changes, you update one cell and every formula recalculates. Without absolute references, you would have to update every formula manually.
Keyboard shortcut: Press F4 immediately after clicking a cell reference to cycle through: A2 → $A$2 → A$2 → $A2 → back to A2. Use this constantly — it is faster than typing dollar signs by hand.
Request: "Calculate revenue including 20% VAT for each order, where the VAT rate is stored in cell B1 so it can be changed in one place"
Outcome: Each row in the revenue column shows that row's order value multiplied by 1.20. If the VAT rate in B1 changes from 20% to 23%, all formulas update instantly. No formula needs to be edited individually.
Excel
-- Cell B1 contains: 0.20 (the VAT rate as a decimal)-- Column A contains: order values (e.g. A2=85, A3=120, A4=200)-- Formula in C2 (then copy down to C3, C4, C5...):
=A2*(1+$B$1)
-- A2 is RELATIVE — it adjusts as you copy down each row
-- In C3 it becomes A3*(1+$B$1)
-- In C4 it becomes A4*(1+$B$1)
-- $B$1 is ABSOLUTE — the $ signs lock both row and column
-- It stays $B$1 in every row no matter where you copy
-- Result: each row calculates its own order value × (1 + VAT rate)
-- Without the $ signs, copying down would give:
-- C2: =A2*(1+B1) → correct
-- C3: =A3*(1+B2) → WRONG — B2 is empty, result is just A3
-- C4: =A4*(1+B3) → WRONG — B3 is probably another order value
2 — Counting and summing with conditions: COUNTIF, SUMIF, SUMIFS
The most useful Excel formulas for day-to-day data analysis
SUM and COUNT work on entire ranges without conditions. But most real business questions have conditions attached: "how many orders from the UK?", "what was total revenue from completed orders?", "how many transactions were above £100 and from Germany?" The conditional variants — COUNTIF, SUMIF, and SUMIFS — answer these questions directly without needing a pivot table or helper columns.
Think of SUMIFS as the Excel equivalent of SQL's SELECT SUM(value) FROM table WHERE condition1 AND condition2. It is the formula you will reach for most often when someone asks you to slice a number by one or more criteria.
The difference between SUMIF (one condition) and SUMIFS (multiple conditions) is just the S at the end. Always use SUMIFS even for one condition — it is more flexible and the syntax is easier to extend later.
Request: "How many orders came from the UK?"
Outcome: A single number — the count of rows where the country column equals "UK". COUNTIF scans the entire country column and counts every cell that matches.
Excel — COUNTIF
-- COUNTIF syntax: =COUNTIF(range_to_check, what_to_look_for)
=COUNTIF(B:B, "UK")
-- B:B means the entire column B (all rows)
-- "UK" is the condition — only count rows where column B equals UK
-- Result: a single number like 12,450
-- The quote marks are required for text — =COUNTIF(B:B, UK) would fail
-- To count using a cell reference instead of typing the value:
=COUNTIF(B:B, E2)
-- This counts rows where column B equals whatever is in cell E2
-- Useful when you want to count for multiple criteria without editing the formula
Request: "What was the total revenue from UK orders?"
Outcome: A single number — the sum of all values in column C where the corresponding row in column B equals "UK". SUMIF adds up the values in one column based on a condition in a different column.
Excel — SUMIF
-- SUMIF syntax: =SUMIF(condition_range, condition, sum_range)
-- Read as: "look at condition_range, where condition is met, sum the values in sum_range"
=SUMIF(B:B, "UK", C:C)
-- B:B = the column to check (country column)
-- "UK" = the condition (only rows where country = UK)
-- C:C = the column to add up (revenue column)
-- Result: total of all values in column C where column B equals "UK"
Request: "Total revenue from UK orders that are also marked as 'completed' — two conditions at once"
Outcome: A single number — the sum of column C only for rows where column B = "UK" AND column D = "completed". Both conditions must be true for a row to be included in the sum.
Excel — SUMIFS (multiple conditions)
-- SUMIFS syntax: =SUMIFS(sum_range, condition1_range, condition1, condition2_range, condition2, ...)
-- Note: SUMIFS puts the sum_range FIRST (opposite order to SUMIF)
=SUMIFS(C:C, B:B, "UK", D:D, "completed")
-- C:C = column to add up (revenue)
-- B:B = first condition column (country)
-- "UK" = first condition value
-- D:D = second condition column (status)
-- "completed" = second condition value
-- A row is included only if BOTH B=UK AND D=completed
-- You can keep adding pairs: ..., E:E, "Gold", F:F, ">100" for more conditions
-- SUMIFS with a numeric comparison (orders over £100 from UK):
=SUMIFS(C:C, B:B, "UK", C:C, ">"&100)
-- The ">"&100 joins the > operator with the number 100
3 — IF logic: making decisions inside formulas
Adding conditional logic to create labels, flags, and categories
The IF function evaluates a condition — is it true or false? — and returns one value if true and a different value if false. This is how you create new categorisations from existing data: customer tiers from spend amounts, pass/fail flags from test scores, region labels from country names.
Nested IFs allow multiple tiers — if A then X, else if B then Y, else Z. The IFS function (available in Excel 2019 and later) does the same thing more cleanly without deeply nested brackets. IFERROR handles formula errors gracefully — essential for any dashboard where some cells might be empty or have unexpected values.
Request: "Label each customer as Gold, Silver, Bronze, or New based on their total spend in column C"
Outcome: A new column where each row contains a tier label. Customers who spent ≥£1000 get "Gold", ≥£300 get "Silver", ≥£50 get "Bronze", and everyone else gets "New". Conditions are evaluated in order — first match wins.
Excel — IF and IFS
-- IFS function (Excel 2019+): cleaner than nested IF
-- =IFS(condition1, result1, condition2, result2, ..., TRUE, fallback)
=IFS(C2>=1000,"Gold", C2>=300,"Silver", C2>=50,"Bronze", TRUE,"New")
-- Conditions are checked in order — the first TRUE condition wins
-- C2=1500: first condition (>=1000) is TRUE → result is "Gold", stops checking
-- C2=400: first condition FALSE, second (>=300) is TRUE → "Silver"
-- C2=60: first two FALSE, third (>=50) TRUE → "Bronze"
-- C2=10: all three FALSE, TRUE is always TRUE → "New" (the catch-all)
-- Always check the HIGHEST value first — if you put >=50 first,
-- a customer with £1500 would be labelled "Bronze" incorrectly
-- Older nested IF version (same result, harder to read):
=IF(C2>=1000,"Gold",IF(C2>=300,"Silver",IF(C2>=50,"Bronze","New")))
Request: "Calculate the profit margin percentage, but show 0 instead of an error if revenue is zero"
Outcome: Each row shows the profit margin as a percentage. Rows where revenue is zero — which would normally cause a divide-by-zero error (#DIV/0!) — show 0 instead of a red error message. This is what makes dashboards look professional.
Excel — IFERROR
-- Without IFERROR: shows #DIV/0! error when revenue (column C) is zero
=(B2-C2)/C2
-- With IFERROR: shows 0 instead of an error
-- Syntax: =IFERROR(formula_that_might_fail, what_to_show_if_it_fails)
=IFERROR((B2-C2)/C2, 0)
-- If (B2-C2)/C2 works normally → shows the result (the margin %)
-- If it produces any error (#DIV/0!, #N/A, #VALUE!, etc.) → shows 0 instead
-- Use IFERROR on any formula that might fail due to missing or bad data
-- Common use: XLOOKUP with a friendly fallback message
=IFERROR(XLOOKUP(A2, ProductTable[SKU], ProductTable[Price]), "Not found")
-- If the SKU exists → shows its price
-- If the SKU is missing from the table → shows "Not found" instead of #N/A
4 — Pivot Tables: summarising thousands of rows instantly
The single most powerful Excel feature for data analysts — and it requires zero formulas
A Pivot Table takes a table of raw data — which might be 50,000 rows — and summarises it into a compact interactive table. You drag column names into four areas and Excel does all the aggregation automatically. No formulas. No helper columns. The whole operation takes under a minute, and when your source data is updated you just right-click and Refresh.
Understanding what each area does: Rows — what you want to group by (e.g. Country, Month, Product Category). Values — what number you want to calculate (e.g. Sum of Revenue, Count of Orders, Average of Order Value). Columns — an optional second grouping that creates a cross-tabulation (e.g. Year, Quarter). Filters — a dropdown that filters the entire table (e.g. show only one region at a time).
To create: click anywhere in your data → Insert → PivotTable → New Worksheet → OK. Drag fields from the field list on the right into the four areas.
Key Pivot Table operations every analyst uses
Change the calculation type: Right-click any value in the Values area → Summarise Values By → choose Sum, Count, Average, Max, or Min
Show as percentage: Right-click any value → Show Values As → % of Grand Total (or % of Row Total, % of Column Total)
Sort largest to smallest: Click any value in the data area → right-click → Sort → Largest to Smallest
Filter to top 10: Click the dropdown arrow on a Row label → Value Filters → Top 10
Refresh when data changes: Right-click anywhere in the Pivot Table → Refresh
Group dates by month or quarter: Right-click a date field in the Rows area → Group → select Months or Quarters
🧠 Quick Check
You want total revenue (column C) only for rows where column B = "UK" AND column D = "completed". Which formula is correct?
End-of-Lesson Exercises
How to use these: Read the request and attempt it yourself first. Then expand the answer to compare. Read the explanation so you understand what every line does before moving on.
🖊 Exercise 1 of 5 - Count how many rows in column D have the status 'completed'.
Hint: COUNTIF takes two arguments: the range to scan, and the value to look for.
Excel
=COUNTIF(D:D, "completed")
🧠 What is this doing? COUNTIF(D:D, "completed") scans every cell in column D and counts how many equal 'completed'. D:D means the entire column. Text conditions must be in quotes. The result is a single number representing the count.
🖊 Exercise 2 of 5 - Calculate total revenue from column C only for rows where column B is 'Germany'.
Hint: SUMIF takes three arguments: column to check, what to look for, column to add up.
Excel
=SUMIF(B:B, "Germany", C:C)
🧠 What is this doing? SUMIF(condition_range, condition, sum_range). Column B is checked, 'Germany' is the condition, column C is what gets added up. Every row where B equals 'Germany' contributes its C value to the total. Note: SUMIF puts the sum_range last, opposite to SUMIFS.
🖊 Exercise 3 of 5 - Label each row as 'High' if column C is above 200, 'Medium' if above 50, otherwise 'Low'.
Hint: Use IFS. Check the highest threshold first. The last condition TRUE is the catch-all.
Excel
=IFS(C2>200,"High", C2>50,"Medium", TRUE,"Low")
🧠 What is this doing? IFS evaluates conditions in order and returns the result for the first true condition. C2>200 is checked first - if true, 'High' is returned and remaining conditions are ignored. TRUE at the end is always true, acting as the catch-all. Checking >200 first is critical - checking >50 first would mislabel a 300-value as 'Medium'.
🖊 Exercise 4 of 5 - Calculate profit margin as (revenue - cost) / revenue. Show 0 instead of an error if revenue is zero.
Hint: Write the division formula first, then wrap the whole thing in IFERROR(..., 0).
Excel
=IFERROR((B2-C2)/B2, 0)
🧠 What is this doing? (B2-C2)/B2 calculates the margin. When B2 is zero this produces a #DIV/0! error. IFERROR catches any error the inner formula produces and returns 0 instead. IFERROR catches all error types so it works as a general safety net around any formula that might fail.
🖊 Exercise 5 of 5 - Total revenue (column C) where column B = 'France' AND column D = 'completed' AND column C > 100.
Hint: SUMIFS starts with sum_range. Then pairs of (condition_range, condition). For numbers: ">"&100.
🧠 What is this doing? SUMIFS always starts with the sum_range, then condition pairs. Three pairs here: country=France, status=completed, value>100. For numeric comparisons the operator is text in quotes joined to the number with &. A row is included only if all three conditions are simultaneously true.
You can now reference cells, count, sum, and apply IF logic. Intermediate Excel is where you go from using spreadsheets to building tools others rely on. XLOOKUP replaces VLOOKUP permanently. Power Query automates cleaning tasks that used to take hours. These skills are what make colleagues ask "how did you build this so quickly?"
1 — XLOOKUP: the modern replacement for VLOOKUP
Why VLOOKUP breaks and what to use instead
VLOOKUP is one of the most widely used Excel formulas in history — and one of the most fragile. Its core design flaw: it references the return column by its position number (the 3rd column, the 4th column) rather than by name. If someone inserts a new column into your lookup table, all position numbers shift and your formula silently returns the wrong data. No error appears. You just get wrong answers.
XLOOKUP (available in Excel 2019 and Microsoft 365) fixes this entirely. You reference the return column directly by name, so inserting columns never breaks anything. It also works in any direction (left, right, up, down), returns a useful message when a value is not found instead of #N/A, and can return multiple columns at once.
Rule: If you are on Excel 2019 or Microsoft 365, never use VLOOKUP again. Use XLOOKUP for everything.
Request: "For each order in column A, look up the customer's country from the customer reference table and add it to column B"
Outcome: Column B fills with the country for each customer ID. If a customer ID does not exist in the reference table, "Not found" appears instead of an error. The formula works even if columns are inserted into the reference table later.
Excel — XLOOKUP
-- XLOOKUP syntax:
-- =XLOOKUP(what_to_find, where_to_look, what_to_return, if_not_found)
=XLOOKUP(A2, CustomerTable[CustomerID], CustomerTable[Country], "Not found")
-- A2 = the customer ID you are looking up (changes each row because it is relative)
-- CustomerTable[CustomerID] = the column to search through
-- CustomerTable[Country] = the column to return the value from
-- "Not found" = what appears if A2 is not in CustomerTable[CustomerID]
-- Why this beats VLOOKUP:
-- VLOOKUP: =VLOOKUP(A2, $B$2:$E$1000, 3, FALSE)
-- The "3" means "return the 3rd column" — if someone inserts a column
-- before the return column, "3" now points to the wrong place
-- No error appears — just wrong data
-- XLOOKUP uses the column NAME not position — inserting columns cannot break it
-- Returning multiple columns at once (result spills automatically):
=XLOOKUP(A2, CustomerTable[CustomerID], CustomerTable[[Name]:[Country]:[Email]])
-- Returns Name, Country and Email all at once from one formula
2 — Text and date functions for cleaning data directly in Excel
Fixing messy data without leaving Excel
When data is exported from different systems or entered manually, it is almost always inconsistent. Dates stored as text. Names with extra spaces. Phone numbers in different formats. Country names as "UK", " uk", "United Kingdom". Text functions let you standardise this directly in Excel, row by row, using formulas — without needing Python or any other tool.
Date functions are critical for analysts because so much of what we measure is time-based. YEAR(), MONTH(), and TEXT() extract parts of a date for grouping. DATEDIF() calculates how long between two dates — essential for customer tenure, days since last order, or age calculations.
Excel — Text and Date Functions
-- ── TEXT CLEANING FUNCTIONS ──────────────────────────────────
=TRIM(A2)
-- Removes all extra spaces from a cell: " United Kingdom " → "United Kingdom"
-- The most common data cleaning formula — use it on any text column from an export
=LOWER(A2) -- "Alice CHEN" → "alice chen" (all lowercase)
=UPPER(A2) -- "alice chen" → "ALICE CHEN" (all uppercase)
=PROPER(A2) -- "alice chen" → "Alice Chen" (first letter of each word capitalised)
=SUBSTITUTE(A2, "Ltd", "Limited")
-- Finds "Ltd" everywhere in the cell and replaces it with "Limited"
-- Case-sensitive: "ltd" would NOT be replaced, only "Ltd"
=LEFT(A2, 3) -- Returns the first 3 characters: "United Kingdom" → "Uni"
=RIGHT(A2, 2) -- Returns the last 2 characters: "United Kingdom" → "om"
=LEN(A2) -- Counts characters: "United Kingdom" → 14
=A2&" "&B2
-- Joins two cells with a space between them: "Alice" and "Chen" → "Alice Chen"
-- The & operator joins (concatenates) text — always wrap literal text in quotes-- ── DATE FUNCTIONS ──────────────────────────────────────────
=YEAR(A2) -- Extracts the year: "2024-03-15" → 2024
=MONTH(A2) -- Extracts the month: "2024-03-15" → 3
=DAY(A2) -- Extracts the day: "2024-03-15" → 15
=TEXT(A2,"mmm-yyyy")
-- Formats a date as text: 2024-03-15 → "Mar-2024"
-- mmm = 3-letter month | yyyy = 4-digit year
-- Use TEXT() when you want to GROUP dates by month in a chart or pivot
-- Note: TEXT() returns text — you cannot do date arithmetic on the result
=DATEDIF(A2, B2, "D")
-- Calculates the number of days between two dates
-- "D" = days | "M" = complete months | "Y" = complete years
-- Example: A2=order date, B2=delivery date → days to deliver
-- DATEDIF is hidden (not in autocomplete) but still works perfectly
=TODAY()
-- Returns today's date — updates automatically every day
-- Useful for calculating "days since last order": =TODAY()-A2
3 — Power Query: automating data cleaning with one-click refresh
The Excel feature that saves analysts hours every month
Power Query (found under Data → Get Data) lets you connect to a file or database, apply a series of cleaning and transformation steps, and load the result into Excel — and then refresh the entire process with one click whenever the source file is updated. Every step you apply is recorded and replayed automatically.
Without Power Query: you spend 2 hours every month manually cleaning the same monthly report. With Power Query: you set it up once in 30 minutes, and every subsequent month takes 5 seconds to refresh. This is one of the highest-return features in Excel for any analyst who works with regularly updated data.
When to use Power Query vs formulas: If you will receive an updated version of this data again next month, use Power Query. If it is a one-time calculation, formulas are faster. Power Query shines when the same cleaning steps need to run on new data regularly.
Key Power Query operations — what each one does and when to use it
Remove Duplicates (Home → Remove Rows → Remove Duplicates): removes rows where all selected columns have identical values. Use when data has been double-exported or a system retried a transaction. Equivalent to SQL's DISTINCT.
Filter Rows (click the dropdown arrow on any column header): keeps only rows matching your criteria. The filter is applied to every refresh automatically. Equivalent to SQL's WHERE.
Change Type (right-click column header → Change Type): converts the column to the correct data type — Text, Whole Number, Decimal, Date, etc. Always do this after loading so downstream calculations work correctly.
Replace Values (right-click column → Replace Values): finds a text value and replaces it with something else across the entire column. Use to standardise "UK", "U.K.", "United Kingdom" all to "UK".
Unpivot Columns (select columns → Transform → Unpivot Columns): converts wide-format data (months as columns) into long-format data (one row per month). Required before building most Pivot Tables and charts on data exported from financial systems.
Merge Queries (Home → Merge Queries): joins two queries by a shared column — the Power Query equivalent of a SQL JOIN. Use to combine data from two separate files.
Refresh All (Data → Refresh All): re-runs every transformation step on the latest version of the source file. This is the payoff — the entire pipeline runs in seconds.
🧠 Quick Check
VLOOKUP breaks when someone inserts a new column because it uses column position numbers. Which formula references columns by name instead, making it immune to this?
End-of-Lesson Exercises
How to use these: Read the request and attempt it yourself first. Then expand the answer to compare. Read the explanation so you understand what every line does before moving on.
🖊 Exercise 1 of 5 - Use XLOOKUP to find each product's price from PriceTable[ProductCode] and PriceTable[Price]. Show 'Unknown' if not found.
Hint: XLOOKUP(what to find, where to look, what to return, if not found). Four arguments.
🧠 What is this doing? A2 is the value to look up. PriceTable[ProductCode] is the column to search. PriceTable[Price] is the column to return. 'Unknown' appears if A2 is not found. Because the return column is referenced by name not position, inserting a column into PriceTable will never change what this formula returns - unlike VLOOKUP.
🖊 Exercise 2 of 5 - Names in column A are messy: ' alice chen', 'BOB SMITH '. Write one formula that removes spaces and shows Title Case.
Hint: Two functions: one removes spaces, one applies Title Case. Which runs first (inner)?
Excel
=PROPER(TRIM(A2))
🧠 What is this doing? Functions evaluate inside out - TRIM runs first because it is the inner function. TRIM(A2) removes all extra spaces: ' alice chen' becomes 'alice chen'. PROPER() then applies Title Case: 'alice chen' becomes 'Alice Chen'. Order matters: if PROPER ran first on ' alice chen' it would produce ' Alice Chen' with the leading space still there.
🖊 Exercise 3 of 5 - Column A has order dates. Write three formulas: extract the year, format as 'Mar-2024', and calculate days ago.
Hint: Three formulas: YEAR(), TEXT() with format code, and TODAY() minus the date.
Excel
=YEAR(A2)
=TEXT(A2,"mmm-yyyy")
=TODAY()-A2
🧠 What is this doing? YEAR(A2) extracts the numeric year - useful for pivot table grouping. TEXT(A2,"mmm-yyyy") formats the date as readable text like 'Jan-2024'. TODAY()-A2 subtracts the date from today giving the number of days between them. Format the result cell as Number not Date or Excel will display it as a date.
🖊 Exercise 4 of 5 - Column D has 'completed', 'pending', 'cancelled'. Convert to 'Done', 'In Progress', 'Void'. Show 'Unknown' for anything else.
Hint: IFS with one equality check per status. What is the catch-all last condition?
🧠 What is this doing? IFS checks D2='completed' first - if true returns 'Done' and stops. Then checks 'pending' for 'In Progress', then 'cancelled' for 'Void'. TRUE is the catch-all - since TRUE is always true, any value not matching the first three gets 'Unknown'. This is the standard pattern for converting coded values to readable labels.
🖊 Exercise 5 of 5 - Calculate SUMIFS for Germany completed orders over 50. Wrap it in IFERROR to show 0 if anything fails.
Hint: Write the SUMIFS in full first. Then wrap the whole formula in IFERROR(..., 0).
🧠 What is this doing? SUMIFS calculates total revenue where country=Germany AND status=completed AND value>50. IFERROR wraps the entire SUMIFS - if columns are missing, renamed, or empty, it returns 0 instead of an error. This is best practice for any formula in a dashboard that will be shared with others.
Choosing the right chart is half the analysis. The wrong chart type makes correct data misleading. A bar chart comparing 50 categories is unreadable. A pie chart with 8 slices communicates nothing. This section tells you exactly which chart to reach for based on the question you are answering — and which ones to avoid and why.
1 — Comparison Charts: showing how values differ across categories
📊 Bar Chart (Clustered and Stacked)
What it does: Compares values across distinct categories using rectangular bars. The length of each bar is proportional to the value it represents. Horizontal bar charts (where bars run left to right) are better when category names are long — they read naturally left to right rather than forcing the reader to tilt their head.
Use it when: You are comparing a metric across categories (countries, products, departments, time periods). "Which region had the highest sales?" "Which product category generates the most returns?" "How does each team's headcount compare?"
Clustered bar: Two or more bars side by side for each category — good for comparing two metrics at once (e.g. revenue vs target per region).
Stacked bar: Bars divided into segments showing sub-categories — good for showing composition alongside total (e.g. revenue broken down by product type per region). Use sparingly: more than 4 segments becomes hard to read.
Avoid it when: You have more than 15–20 categories (the bars become tiny and unreadable) or when you want to show change over continuous time (use a line chart instead).
Power BI tip: Use the Bar Chart visual for horizontal bars and Column Chart for vertical bars. Sort your bars by value (descending) by default — a sorted bar chart communicates the ranking at a glance, an unsorted one forces the reader to search.
Horizontal bar chart — sorted by value descending so ranking is immediately visible
📐 Column Chart
What it does: Identical to a bar chart but oriented vertically. Bars run bottom to top.
Use it when: You have short category names that fit comfortably below each column, or when you are showing data over time periods (months, quarters, years) where left-to-right naturally implies chronological order.
Avoid it when: Category names are long (they overlap diagonally and become unreadable — switch to a horizontal bar chart instead).
Column chart — vertical bars work well for monthly time-period comparisons
⏩ Bullet Chart
What it does: A compact bar chart that shows an actual value against a target, with a background showing performance zones (e.g. poor/acceptable/good). Designed specifically for KPI tracking.
Use it when: You need to show "are we hitting our target?" for multiple metrics in a small space. Far more information-dense than a bar chart. A sales dashboard showing 12 regional targets and actuals in one glance.
Power BI tip: Available as a custom visual. Use it on executive dashboards where space is limited and every metric needs a target context.
2 — Trend Charts: showing how values change over time
📈 Line Chart
What it does: Connects data points with a line to show change over a continuous time axis. The slope of the line makes trends immediately visible — rising, falling, flat, seasonal patterns.
Use it when: Your x-axis is time and you want to show how a metric evolves. "How has monthly revenue changed over the past 2 years?" "Is customer churn improving or worsening quarter by quarter?" "What does the weekly order volume trend look like?"
Multiple lines: Up to 4–5 lines on one chart is readable for comparing trends across categories (e.g. revenue by region over time). Beyond 5 lines, the chart becomes a "spaghetti chart" — use small multiples instead.
Avoid it when: Your x-axis is not time or an ordered sequence. A line connecting revenue from UK to Germany to France implies a meaningful progression between them — which does not exist. Use a bar chart for unordered categories.
Power BI tip: Add a constant line for targets, averages, or thresholds using Analytics pane → Constant Line. This immediately shows whether each time period is above or below benchmark.
Line chart — the slope makes the growth trend visible at a glance
📈 Area Chart
What it does: A line chart with the area below the line filled in. The filled area emphasises the total volume, not just the trend direction.
Use it when: You want to show both the trend AND the magnitude of a metric over time. Revenue over time where you want the reader to appreciate the total volume, not just whether it went up or down. Good for showing cumulative metrics (running total of signups, cumulative revenue).
Stacked area chart: Multiple series stacked on top of each other to show both total and composition over time. Useful but can be misleading — the middle series appear distorted because their baseline is not zero.
Avoid it when: You have multiple overlapping series — overlapping filled areas obscure each other. Use a line chart instead.
Area chart — filled region emphasises total volume alongside the trend direction
📈 Combo Chart (Line + Column)
What it does: Combines a bar/column chart and a line chart on the same visual, using two y-axes. One metric is shown as bars, another as a line.
Use it when: You want to show two related but differently-scaled metrics together. Revenue (bars) vs profit margin % (line). Order volume (bars) vs average order value (line). The dual axis lets both metrics use their natural scale.
Avoid it when: The two metrics are not meaningfully related. Putting unrelated metrics on a dual-axis chart implies a relationship that may not exist — a common way to create misleading visuals accidentally.
Dual-axis caution: Dual y-axis charts can be manipulated to make any correlation look strong or weak by changing the axis scales. Always check that both axes start at zero (or clearly label that they do not) and that the scale relationship is honest.
Combo chart — bars show volume (left axis), line shows margin % (right axis)
3 — Part-to-Whole Charts: showing composition and proportions
🆗 Pie Chart and Donut Chart
What it does: Shows each category's share of a total as a slice of a circle. The donut chart is the same but with a hole in the middle (which you can use to display the total or a key metric).
Use it when: You have 2–4 categories and want to show their proportional share of a whole. "Our revenue is 60% from direct sales, 30% from partnerships, 10% from other channels." Simple composition stories with few categories.
Avoid it when: You have more than 4–5 categories (slices become too thin to distinguish), when the values are similar in size (humans are bad at comparing angles — use a bar chart instead), or when exact values matter more than proportions.
The most overused chart in business: Pie charts are almost always better replaced by a simple sorted bar chart. Humans compare lengths far more accurately than angles. If your manager insists on a pie chart, use a donut with the total in the centre — at least that adds one useful number.
Donut chart — works for 2–4 segments; the centre can show the total value
🧴 Treemap
What it does: Displays hierarchical data as nested rectangles. Each rectangle's area is proportional to its value. Larger rectangles represent larger values. Colour can encode a second dimension (e.g. growth rate).
Use it when: You have hierarchical data with many categories and want to show relative size at a glance. "Which product subcategories contribute most to total revenue?" "How is headcount distributed across departments and teams?" Good for showing 10–50 categories where a bar chart would be too long.
Avoid it when: Values are similar in size (the rectangles look almost identical and communicate nothing) or when you need to compare precise values (area is harder to compare than length).
Treemap — rectangle area is proportional to value; good for 10–50 categories
📈 Waterfall Chart
What it does: Shows how a starting value changes through a series of positive and negative increments to reach a final value. Increases are shown in one colour (typically green), decreases in another (typically red), with a running total visible throughout.
Use it when: Explaining the components of a change. "We started the year at £2M revenue, added £500K from new customers, lost £200K from churn, added £300K from upsells — ending at £2.6M." Perfect for financial variance analysis, P&L walkthroughs, and explaining why a metric moved.
Power BI tip: Available as a native visual. Set the first and last bars as "total" bars to show the start and end values as full columns rather than floating increments.
When your manager asks "why did the number change?" — a waterfall chart is almost always the right answer. It is the single best chart for communicating causes of change.
Waterfall chart — green bars add to the total, red bars subtract; best for explaining variance
4 — Distribution Charts: showing how data is spread
📈 Histogram
What it does: Groups continuous data into ranges (bins) and shows how many values fall into each range. Unlike a bar chart, the bars touch each other — because the data is continuous, not categorical.
Use it when: You want to understand the shape of your data. "Are most orders small with a few very large ones (right-skewed)?" "Is customer age normally distributed?" "Are transaction amounts clustered around certain price points?" Histograms reveal patterns invisible in summary statistics like mean and median.
Avoid it when: Your data is categorical (use a bar chart). The choice of bin width significantly affects what the histogram shows — always try a few different bin sizes.
Histogram — bars touch because data is continuous; reveals the shape of your distribution
🌟 Scatter Plot
What it does: Plots individual data points along two numeric axes (x and y) to show the relationship between two variables. Each dot is one record. Clusters of dots reveal correlations, outliers appear as isolated points far from the main group.
Use it when: "Is there a relationship between customer age and purchase value?" "Do stores with higher marketing spend generate higher revenue?" "Which products have high volume but low margin (the danger zone)?" Scatter plots reveal correlations, clusters, and outliers that aggregate charts hide entirely.
Avoid it when: You have very few data points (fewer than 20 dots is just noise) or when your x-axis is categorical rather than numeric.
Bubble chart: A scatter plot where each dot's size represents a third variable. "Plot revenue vs margin, with bubble size = order volume." Adds one dimension of information but becomes cluttered with many data points.
Power BI tip: Add a trend line using the Analytics pane → Trend Line. This makes the correlation direction immediately visible without the viewer having to infer it from the dot pattern.
Scatter plot — each dot is one record; the dashed trend line reveals the correlation direction
🆕 Box Plot (Box and Whisker)
What it does: Shows the distribution of values through five summary statistics: minimum, 25th percentile (Q1), median, 75th percentile (Q3), and maximum. The box shows the middle 50% of data; the whiskers extend to the min and max; dots outside the whiskers are outliers.
Use it when: Comparing distributions across categories. "How does order value distribution compare across regions?" "Are delivery times more variable for one carrier than another?" Box plots show spread, skew, and outliers simultaneously in a compact space.
Power BI tip: Available as a custom visual. Particularly valuable for operations and quality analysis where variance matters as much as the average.
Box plot — box = middle 50% of data; line = median; whiskers = min/max; dots = outliers
5 — Relationship and Flow Charts
🌟 Scatter Plot Matrix
What it does: A grid of scatter plots showing every pairwise relationship between a set of numeric variables simultaneously. With 4 variables you get a 4×4 grid of 16 scatter plots.
Use it when: Early in an analysis when you want to quickly identify which variables are correlated before diving deeper. Good for exploratory data analysis presented to a technical audience.
🧭 Funnel Chart
What it does: Shows stages of a process where volume decreases at each step. Each stage is a bar that gets narrower, making drop-off rates visually obvious.
Use it when: Visualising conversion rates through a sequential process. Marketing funnel (impressions → clicks → leads → opportunities → won deals). E-commerce funnel (visits → add to cart → checkout → purchase). Support ticket funnel (opened → assigned → in progress → resolved).
Avoid it when: Steps do not have a clear sequential relationship or when the order of steps can vary per customer.
Power BI tip: Add data labels showing both the absolute count and the % drop from the previous stage. The % drop is the actionable insight — where is the biggest leakage?
Funnel chart — narrowing width shows drop-off at each stage of the process
🌍 Map Visuals (Choropleth and Bubble Map)
What it does: Plots data geographically. Choropleth maps shade regions (countries, states, postcodes) by intensity of a metric — darker = higher. Bubble maps place circles at locations where size represents a metric.
Use it when: Your data has a meaningful geographic dimension and location patterns matter. "Where are our customers concentrated?" "Which postcodes have the highest return rates?" "How does average order value vary by country?"
Avoid it when: Geography is not actually relevant to the insight — putting data on a map just because you have a country column adds complexity without clarity. Also avoid choropleth maps when large geographic areas with small populations dominate visually (e.g. Australia appears important simply because it is large).
Power BI tip: Power BI's built-in map visual uses Bing Maps. For postcode/region level analysis, use the Filled Map visual. Ensure your location column is properly categorised as a geographic data type in the data model.
What it does: Displays a single large number. Nothing else. The simplest and often the most impactful visual on a dashboard.
Use it when: You have a headline metric the audience needs to see immediately. Total revenue. Active customers. OTIF rate. Conversion rate. NPS score. Every executive dashboard should open with 3–5 card visuals showing the most important numbers before any charts appear.
Power BI tip: Use the KPI visual instead of a plain card when you want to show the current value AND whether it is above or below a target. The KPI visual adds a trend indicator (up/down arrow) and a comparison to goal automatically.
Card visuals — large numbers with RAG indicators; always the first thing on any dashboard
🔋 Gauge Chart
What it does: Shows a single value on a dial between a minimum and maximum, like a speedometer. The needle (or filled arc) shows where the current value sits within the range.
Use it when: Showing progress toward a single target where the min and max of the acceptable range are known. Capacity utilisation (0–100%). SLA performance (0–100%). Budget consumption (0% to over-budget threshold).
Avoid it when: You have multiple metrics to show — gauges take a lot of space for one number. A KPI card or bullet chart is almost always more space-efficient.
Gauge chart — needle shows current value within the acceptable range; green/amber/red zones
7 — Tables and Matrices
📄 Table and Matrix Visual
What it does: Displays raw or aggregated data in rows and columns. A table shows rows with multiple column values. A matrix is a cross-tabulation (pivot table) — rows on one axis, columns on another, values in the intersections.
Use it when: Your audience needs to look up specific values rather than understand patterns. "What were the exact sales figures for each product in each region?" "What is the conversion rate for each campaign in each country?" Analysts and finance teams often need tables; executives need charts.
Matrix with conditional formatting: Adding colour scales or data bars to a matrix turns a dense table into a visual heatmap — the most information-dense visual available. "Revenue by product × region" as a colour-scaled matrix immediately shows the hot and cold spots.
Power BI tip: Apply conditional formatting to the Values in a matrix (Format → Conditional Formatting → Background colour → Colour scale). This transforms a wall of numbers into a heatmap that communicates in seconds what a table takes minutes to read.
Matrix with conditional formatting — colour scale turns a table of numbers into an instant heatmap
A dashboard is not a collection of charts — it is a story. The best dashboards answer one clear question and guide the viewer from the headline metric to the explanation to the next action. This section covers the four main dashboard types analysts build, what each one is for, and the design principles that make the difference between a dashboard that gets used and one that gets ignored.
Dashboard Type 1 — Executive Summary Dashboard
Purpose: answer "how are we doing overall?" in under 30 seconds
Who uses it: C-suite, directors, senior managers. People who need the big picture fast and will drill into details only if something is wrong.
What it contains:
Row 1 — Headline KPIs: 3–5 large Card or KPI visuals showing the most critical metrics with vs-target indicators. Revenue, margin %, active customers, NPS, OTIF. These are the first thing the eye sees.
Row 2 — Trend context: 2–3 line charts showing those same KPIs over the past 12–24 months. "Is the number improving or declining?" is the question these answer.
Row 3 — Breakdown: 1–2 bar charts showing which dimension (region, product, segment) is driving the result. "Where is it coming from?"
Design rules for executive dashboards:
Maximum 7–9 visuals total. If you need more, you need two dashboards.
Every visual has a title that states the conclusion, not just the metric: "Revenue 12% above target" not "Revenue by month".
RAG (Red/Amber/Green) status on every KPI — the executive should not have to calculate whether something is good or bad.
No more than 2 click interactions required to reach any insight.
Data refreshes automatically — a stale executive dashboard destroys trust immediately.
Charts to use: Cards, KPI visuals, Line charts, Bar/Column charts, Gauge charts for capacity or quota metrics.
Charts to avoid: Pie charts with more than 3 slices, treemaps, scatter plots, anything that requires explanation before reading.
Dashboard Type 2 — Operational Monitoring Dashboard
Purpose: monitor daily/weekly operations and flag issues that need immediate action
Who uses it: Operations managers, team leads, anyone responsible for day-to-day performance. Typically viewed daily or multiple times per day.
What it contains:
Current status: Cards showing today's or this week's numbers against targets. OTIF rate, orders processed, tickets resolved, fill rate.
Alerts and exceptions: Tables or matrices highlighting specific rows that need action — orders overdue, SKUs below safety stock, accounts with open complaints. The visual should make the problem row stand out immediately (red conditional formatting, icons).
Trend vs recent average: Sparklines or small line charts showing whether today's performance is normal or anomalous.
Drill-through: Every metric should link to a detail view showing the individual records behind it so the operations team can take action.
Design rules:
The most important alerts go in the top-left — people scan dashboards like they scan a page: top-left first.
Use conditional formatting aggressively: red = needs action now, amber = monitor, green = on track.
Include a "data as of" timestamp — operational dashboards are useless if the user does not know how current the data is.
Prioritise speed of load — operational dashboards are checked frequently; a slow dashboard gets abandoned.
Charts to use: Cards, Tables with conditional formatting, Matrix with colour scales, Line charts (short rolling window), Funnel charts for process stages.
Dashboard Type 3 — Analytical Deep-Dive Dashboard
Purpose: explore data to answer a specific business question
Who uses it: Analysts, data scientists, product managers, and anyone trying to understand the "why" behind a metric. Used for investigation rather than monitoring.
What it contains:
Filters and slicers: Date range pickers, category selectors, segment filters — the user controls what they see. This dashboard is interactive by design.
Distribution visuals: Histograms and box plots showing the spread of key metrics, not just averages.
Correlation and relationship charts: Scatter plots showing relationships between variables. "Does marketing spend correlate with conversion rate by region?"
Decomposition: Waterfall charts and treemaps showing what drives a metric. "What explains the revenue variance vs last year?"
Drill-through to individual records: The ability to go from a summary metric down to the specific transactions behind it.
Design rules:
Analytical dashboards can have more visuals than executive ones — 12–15 is acceptable if they are logically organised.
Use page tabs in Power BI to organise related visuals: one tab per analytical question.
Add explanatory text boxes: "This scatter plot shows the relationship between X and Y. Points in the top-right quadrant are high performers on both dimensions."
Include a tooltip page — when users hover over a data point they see a detailed breakdown in a mini pop-up.
Charts to use: Scatter plots, histograms, box plots, waterfall charts, treemaps, decomposition tree (Power BI native), matrix with conditional formatting.
Dashboard Type 4 — Self-Service Reporting Dashboard
Purpose: let non-technical users answer their own questions without asking the analytics team
Who uses it: Sales teams, marketing managers, regional managers — people who need data regularly but should not need to request a new report every time their question changes slightly.
What it contains:
Prominent filter panel: Date range, region, product category, customer segment — the most common dimensions users want to slice by. Filters should be at the top or left side and visually obvious.
Consistent metric definitions: A glossary or tooltip explaining exactly how each metric is calculated. "Revenue = invoiced amount excluding VAT and returns." Self-service dashboards fail when users do not trust or understand the numbers.
Export capability: Users need to be able to export the filtered data to Excel. Build this in deliberately rather than having them screenshot the visual.
Parameterised views: Let users switch between views — "show me top 10" vs "show me bottom 10", "show by volume" vs "show by value".
Design rules:
Test with actual end users before publishing. What seems obvious to an analyst is often confusing to a sales manager.
Label every filter with what it controls: "Region filter (affects all charts)" not just a dropdown with no label.
Use bookmarks in Power BI to create pre-set views: "Weekly Summary View", "Monthly Board View", "Deep Dive View" — one click takes the user to a pre-configured state.
Publish to a clearly named workspace and set up automatic refresh. A self-service dashboard with stale data will be abandoned within a week.
Charts to use: All standard chart types with emphasis on simplicity. Bar charts, line charts, cards, simple tables. Avoid exotic chart types that require explanation.
When to Use Which Chart — Quick Reference
Chart selection guide — match your question to the right visual
Comparing values across categoriesBar chart (horizontal) or Column chart (vertical)
Showing change over timeLine chart — always use line for time series
Showing composition of a totalStacked bar (multiple categories) or Donut (2–4 categories only)
Explaining why a number changedWaterfall chart — shows positive and negative contributions
Showing a KPI vs targetKPI card visual or Bullet chart
Showing many categories by sizeTreemap — areas proportional to value
Showing conversion through stagesFunnel chart
Showing geographic patternsFilled map (regions) or Bubble map (point locations)
Showing relationship between two numbersScatter plot
Showing spread and outliersBox plot or Histogram
Showing one important numberCard visual — large, prominent, no decoration
Cross-tabulation with colour codingMatrix with conditional formatting (heatmap style)
Two metrics on different scales togetherCombo chart (bars + line, dual axis)
Showing 3 metrics simultaneouslyBubble chart (x, y position + bubble size)
The Five Most Common Dashboard Mistakes
1. Too many charts on one page
Every chart added to a dashboard reduces the impact of every other chart. A dashboard with 20 visuals communicates nothing because the viewer's attention is fragmented. Ask yourself: "If I had to remove half these charts, which half would I keep?" Those are the only ones that belong on the dashboard. The rest belong on a separate analytical page.
2. Chart titles that describe the data instead of the insight
"Revenue by Region" tells the reader what the chart contains. "North Region Revenue 23% Below Target" tells them what it means. Insight-driven titles are the single highest-return improvement most analysts can make to their dashboards. Spend 30 seconds writing a title that states the conclusion — your audience will thank you.
3. Using colour randomly instead of intentionally
In a well-designed dashboard, colour carries meaning: red = problem, green = on track, amber = monitor. When you use 8 different colours just to differentiate bar chart categories, you waste the colour channel on decoration. Use consistent colour conventions throughout all your dashboards so viewers learn the language once and read faster forever after.
4. Showing averages without showing distribution
"Average delivery time is 3.2 days" sounds fine. But if 80% of orders arrive in 1 day and 20% take 10+ days, the average is actively misleading. Always consider whether a distribution chart (histogram, box plot) tells a more honest story than a single average. The average can hide the problem your audience needs to see.
5. Building for the builder, not the reader
The most common dashboard mistake is building something technically impressive that the intended audience cannot use. Before publishing any dashboard, sit with one person from your target audience and watch them try to answer three specific questions using it. The places where they hesitate or get confused are the places that need redesigning. No amount of internal review catches what a real user test finds in the first five minutes.