Statistics Tutorial

Data Quality → Missing Data → Clean Data 🧹🗑️🔍

Last updated: 2026-01-02

Applies to version: 1.0+

Reading time: ~10 min

Prerequisite: Don't be a data hoarder

Quick Summary / TL;DR

Your data is trash. Don't take it personally, bruh: EVERYONE'S data is trash when they first get it. This tutorial is your garbage truck. We're gonna take out the trash so you can actually do statistics instead of staring at a spreadsheet full of lies, missing values, and people who claim to be 250 years old.

The workflow is simple but morons like you keep messing it up. So listen up:

1. Data Quality Checks

Find ALL the trash — outliers, wrong values, duplicates, formats, AND missing values (initial spot)

2. Missing Data Analysis

Focus JUST on the missing stuff — patterns, reasons, what to do about it

3. Clean Data

Actually fix everything — now your data isn't trash anymore (congrats!)

⚠️ THE CONNECTION: Data Quality Checks SPOT missing values → Missing Data Analysis DIVES DEEP into them

Table of Contents

  1. Why This Order? (Because Morons Mess It Up)
  2. Data Quality Checks: The Full Trash Inspection
  3. Missing Data Analysis: The Deep Dive
  4. Clean Data: Taking Out the Trash
  5. Real Example: Gym Bro Data Cleanup
  6. Summary: Your Data Cleaning Checklist

1. Why This Order? (Because Morons Mess It Up) 🤦‍♂️

Let me explain this with a gym analogy since your pea brain understands gains:

🛠️ The Gym Equipment Check Analogy

Data Quality Check = Full Gym Walkthrough

You walk in and scan everything:

  • Broken treadmill? (outliers)
  • Missing weights? (missing values - you spot them!)
  • Dumbbell labeled "50kg" but feels like 5kg? (wrong values)
  • Two identical squat racks? (duplicates)

You're noting ALL problems at once.

Missing Data Analysis = Focus JUST on Missing Weights

Now you zoom in on ONE problem:

  • Which weight plates are missing? (count per type)
  • Did someone steal them? (pattern analysis)
  • Can we borrow from another gym? (imputation strategy)

Specialized focus after the general inspection.

Clean Data = Actually Fixing Everything

You take action:

  • Replace missing 45lb plates
  • Fix the treadmill belt
  • Re-label the dumbbells
  • Now the gym is READY for gains

⚠️ MORON ALERT: If you do Missing Data Analysis ONLY, you might miss that half your data has negative ages or duplicate entries. You'd be MISSING THE POINT! Don't be that guy.


2. Data Quality Checks: The Full Trash Inspection 🗑️🔍

This is your comprehensive garbage scan. You're looking for EVERYTHING wrong. Here's your checklist, bruh:

📍 The Obvious Stuff (Even You Should Spot These)

  • Missing Values: Blank cells, "N/A", "NULL", "?"
  • Outliers: Dude who benches 500kg or 5kg
  • Wrong Data Types: Age stored as text "25 years old"

🤥 The Sneaky Lies (These Fool Everyone)

  • Impossible Values: Age = 250, BMI = -5, Negative height
  • Temporal Nonsense: Died before born, discharged before admitted
  • 3-year-old with a PhD: Education date vs birth date mismatch

🔁 Duplicate Disasters

  • Exact Duplicates: Same person entered twice
  • Fuzzy Duplicates: "John Smith" vs "Jon Smyth", same address
  • Same person, multiple IDs: Gym bro using different member numbers

📝 Format Fuckups

  • Phone numbers: (123)456-7890 vs 1234567890 vs 123-456-7890
  • Emails without @: brogmail.com (missing @ symbol)
  • Dates as text: "January 32, 2023" (that day doesn't exist, dumbass)

🤯 Logical Insanities

  • Pregnant males: Gender = "Male", Pregnancy = "Yes"
  • Sum ≠ Total: Item1 + Item2 = 100, Total column says 150
  • "Number of children" = 0 but "Child names" = ["Timmy", "Sally"]

📊 Distribution Disasters

  • All values identical: Every gym bro benches exactly 100kg (sure...)
  • 99% in one category: 99% "Male", 1% "Female" in a general population study
  • Sudden value shift: All weights in kg until row 500, then switch to lbs

🎯 Gym Data Example - Before Quality Check:

| MemberID | Name       | Age         | Max Bench (kg) | Join Date  | Gender |
|----------|------------|-------------|----------------|------------|--------|
| 101      | John Bro   | 25          | 120            | 2023-01-15 | Male   |
| 102      | [EMPTY]    | -5          | 500            | 2023-02-30 | Male   |
| 103      | Jane Smith | twenty five | 60             | 2023-03-10 | Female |
| 101      | John Bro   | 25          | 120            | 2023-01-15 | Male   |
| 104      | Mike       | 30          | NULL           | 2024-13-01 | Male   |

See the trash? Missing name, negative age, impossible bench, text age, duplicate, NULL, invalid date... This data is HOT GARBAGE. 🔥🗑️
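The whole checklist above can be run mechanically. Here's a minimal sketch of the scan using pandas, assuming the toy table above (column names like `MaxBenchKg` are my own shorthand, not a standard):

```python
import pandas as pd

# Hypothetical frame transcribing (most of) the table above
df = pd.DataFrame({
    "MemberID": [101, 102, 103, 101, 104],
    "Name": ["John Bro", None, "Jane Smith", "John Bro", "Mike"],
    "Age": ["25", "-5", "twenty five", "25", "30"],
    "MaxBenchKg": [120.0, 500.0, 60.0, 120.0, None],
    "JoinDate": ["2023-01-15", "2023-02-30", "2023-03-10", "2023-01-15", "2024-13-01"],
})

# 1. Missing values per column (the "initial spot")
print(df.isna().sum())

# 2. Ages that aren't numeric, or are numeric but impossible
age_num = pd.to_numeric(df["Age"], errors="coerce")  # "twenty five" -> NaN
bad_ages = (age_num.isna()) | (age_num < 0) | (age_num > 120)
print("bad ages:", bad_ages.sum())  # -5 and "twenty five" -> 2

# 3. Exact duplicate rows (the John Bro clone)
print("duplicates:", df.duplicated().sum())

# 4. Dates that don't exist (Feb 30, month 13) become NaT when coerced
dates = pd.to_datetime(df["JoinDate"], errors="coerce")
print("invalid dates:", dates.isna().sum())
```

Each check prints a count; anything above zero is a line item for your trash report.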


3. Missing Data Analysis: The Deep Dive 🧊🔬

OK, you found missing values in your quality check. Now let's FOCUS on them. This is surgical precision, not a general inspection.

Step 1: Count the Missing (How Bad Is It?)

Missing per variable:

  • Name: 1 missing (20%)
  • Age: 0 missing (0%)
  • Max Bench: 1 missing (20%)
  • Join Date: 0 missing (0%)
  • Gender: 0 missing (0%)
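Those counts take one line each in pandas. A sketch, assuming a toy frame with the same missingness as the table above:

```python
import pandas as pd

# Illustrative frame mirroring the counts above (values are made up)
df = pd.DataFrame({
    "Name": ["John Bro", None, "Jane Smith", "John Bro", "Mike"],
    "Age": [25, 27, 25, 25, 30],
    "MaxBenchKg": [120.0, 180.0, 60.0, 120.0, None],
})

# Count and percentage of missing values per column
report = pd.DataFrame({
    "missing": df.isna().sum(),
    "percent": df.isna().mean() * 100,  # mean of a boolean mask = fraction missing
})
print(report)
```

`isna().mean()` works because `True` counts as 1, so the mean of the missing-mask is exactly the fraction missing.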

Step 2: Pattern Analysis (Why Is It Missing?)

This is where you become a data detective. Three types of missingness:

MCAR 🎲 (Missing Completely At Random)

  • No pattern. Random like a dice roll.
  • Example: Random survey pages lost in the mail
  • Fix: Usually safe to delete

MAR 📊 (Missing At Random)

  • Missingness relates to OTHER variables you CAN see
  • Example: Women less likely to report weight (you know gender)
  • Fix: Can model and impute

MNAR 🕵️ (Missing Not At Random)

  • Missingness relates to THE VARIABLE ITSELF
  • Example: Rich people hide income (you DON'T know their true income)
  • Fix: Tricky. May need specialized methods

✅ Why It Matters

If you ignore why data is missing and just delete or impute blindly, your analysis can be biased, misleading, or wrong.

MCAR 🎲 → Safe to delete (if a small % is missing). Since it's completely random, deleting won't bias your results, as long as you don't lose too much data.

MAR 📊 → Can model/impute using other variables. Since missingness relates to variables you can see, you can use regression, K-NN, or multiple imputation.

MNAR 🕵️ → Tricky! Requires specialized methods: sensitivity analysis, selection models, or expert knowledge. Simple imputation will bias results.

⚠️ NOTE FOR MORONS 🤦‍♂️: Knowing WHY data is missing is more important than knowing HOW to impute it. A fancy imputation method on MNAR data will give you fancy wrong results.
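One quick MAR diagnostic you can actually run: check whether the missingness rate differs across groups of a variable you CAN see. A sketch with hypothetical survey data (column names are mine):

```python
import pandas as pd

# Hypothetical gym survey: is "Weight" missing more often for one gender?
df = pd.DataFrame({
    "Gender": ["F", "F", "F", "F", "M", "M", "M", "M"],
    "Weight": [None, None, 60.0, 58.0, 80.0, 85.0, 90.0, 78.0],
})

# Fraction of missing Weight per gender; a big gap hints at MAR w.r.t. Gender
rate_by_group = df["Weight"].isna().groupby(df["Gender"]).mean()
print(rate_by_group)
```

A gap like 50% vs 0% is a strong MAR clue. Note the limit: this can never PROVE MCAR or rule out MNAR, because MNAR depends on values you don't observe.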

Step 3: Decision Time (What to Do?)

Option A: Deletion 🗑️

  • Listwise: Delete entire row if ANY missing
  • Pairwise: Use available data for each calculation
  • When: MCAR & small % missing (<5%)

Option B: Imputation 🔧

  • Mean/Median/Mode: Simple but dumb
  • Regression: Predict missing from other variables
  • K-NN: Use similar rows' values
  • Multiple Imputation: Fancy stats bro method

Option C: Flagging 🚩

  • Add "was_missing" column
  • Keep original missing, but mark it
  • Good for MNAR situations
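All three options are one-liners in pandas. A minimal sketch (the column names and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"Bench": [120.0, None, 100.0, 110.0, None],
                   "Age": [25, 27, 25, 30, 28]})

# Option A: listwise deletion — drop any row with a missing value
dropped = df.dropna()

# Option B: simple imputation — fill holes with the column median
imputed = df.assign(Bench=df["Bench"].fillna(df["Bench"].median()))

# Option C: flagging — keep the hole's location on record
flagged = df.assign(bench_was_missing=df["Bench"].isna())

print(len(dropped), imputed["Bench"].tolist(), int(flagged["bench_was_missing"].sum()))
```

For the fancier Option B methods (K-NN, multiple imputation), scikit-learn's `KNNImputer` and `IterativeImputer` are common choices, but simple median fills are where most people should start.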

🧪 Self-Check: Is This MCAR, MAR, or MNAR?

Scenario: Gym survey data. The "Income" column has missing values. If high earners are the ones leaving it blank (like the rich-people example above), that's MNAR: the missingness depends on the value itself. If the missingness only tracks other variables you can see (say, age group), it's MAR. If it's just random skips, it's MCAR.


4. Clean Data: Taking Out the Trash 🧹✨

Now you actually FIX everything you found. This is the payoff, bruh.

🔄 Fix Format Issues

Before: "twenty five", "(123)456-7890", "2024-13-01"

After: 25, 1234567890, 2024-01-13
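A plain-Python sketch of those three fixes. The word-to-number lookup and the "day/month swapped" assumption for `2024-13-01` are my guesses about this toy data, not general rules:

```python
import re
from datetime import datetime

# Text age: a tiny lookup table; real data needs a proper word-number parser
word_ages = {"twenty five": 25}
age = word_ages["twenty five"]

# Phone: strip everything that isn't a digit
digits = re.sub(r"\D", "", "(123)456-7890")

# Date: month 13 doesn't exist, so assume day and month were swapped
date_text = "2024-13-01"
try:
    fixed = datetime.strptime(date_text, "%Y-%m-%d")
except ValueError:
    fixed = datetime.strptime(date_text, "%Y-%d-%m")  # fallback: swapped order

print(age, digits, fixed.date())  # 25 1234567890 2024-01-13
```

The try/except pattern matters: you only reinterpret the date AFTER the honest parse fails, and you should log every row where the fallback fired.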

🎯 Handle Outliers

Before: Bench: [60, 80, 120, 500, 90]

After (capped at a plausible max, 180 kg): [60, 80, 120, 180, 90]

500kg was bullshit. Note: with only five values, a 95th-percentile cap isn't meaningful (it would land near 500 anyway), so cap at a sane domain maximum instead. On a big dataset, percentile capping works fine.
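Both capping styles in pandas, as a sketch (the 180 kg "plausible human max" is an assumption for this toy data):

```python
import pandas as pd

bench = pd.Series([60, 80, 120, 500, 90])

# Big-data style: cap at the empirical 95th percentile
capped_pct = bench.clip(upper=bench.quantile(0.95))

# Tiny-data style: cap at an assumed domain maximum (180 kg here)
capped_domain = bench.clip(upper=180)
print(capped_domain.tolist())  # [60, 80, 120, 180, 90]
```

`clip(upper=...)` leaves everything below the cap untouched and replaces everything above it, which is exactly the winsorizing behavior described above.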

🔍 Impute Missing Values

Before: Bench: [120, NULL, 100, 110, NULL]

After (mean imputation): [120, 110, 100, 110, 110]

Mean = 110. Filled NULLs with mean.
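The same mean fill in pandas, as a sketch:

```python
import pandas as pd

bench = pd.Series([120, None, 100, 110, None])

# Mean of the observed values is (120 + 100 + 110) / 3 = 110
filled = bench.fillna(bench.mean())
print(filled.tolist())  # [120.0, 110.0, 100.0, 110.0, 110.0]
```

`fillna` only touches the missing entries; the observed values pass through unchanged. Swap `mean()` for `median()` when outliers might drag the mean around.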

🚫 Remove Duplicates

Before: John Bro appears twice (identical rows)

After: John Bro appears once

Kept first occurrence, removed exact duplicate.
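Exact dedup is one call; a sketch:

```python
import pandas as pd

df = pd.DataFrame({"MemberID": [101, 102, 101],
                   "Name": ["John Bro", "Jane Smith", "John Bro"]})

# Drop exact duplicate rows, keeping the first occurrence
deduped = df.drop_duplicates(keep="first")
print(len(deduped))  # 2
```

Fuzzy duplicates ("John Smith" vs "Jon Smyth") need more work: normalize the strings first (lowercase, strip punctuation) and then compare with a string-distance method; `drop_duplicates` alone won't catch them.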

✅ Validate Logic

Before: Gender = "Male", Pregnancy = "Yes"

After: Pregnancy = "No" (fixed impossible combo)
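A sketch of the logic check. Note it flags the impossible rows rather than auto-fixing them; deciding WHICH field is wrong (maybe Gender is the bad entry, not Pregnancy) is a human call:

```python
import pandas as pd

df = pd.DataFrame({"Gender": ["Male", "Female"],
                   "Pregnancy": ["Yes", "Yes"]})

# Flag impossible combinations for manual review
impossible = (df["Gender"] == "Male") & (df["Pregnancy"] == "Yes")
print(df[impossible])  # the rows a human needs to look at
```

The same boolean-mask pattern covers the other logical insanities above (sum ≠ total, child count vs child names): write the rule as a mask, then review or fix the flagged rows.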

📝 Document Everything

Before: No record of changes

After: "2026-01-02: Fixed negative age (-5 → 25 using median), capped outlier bench (500 → 180), imputed missing bench with the median"


5. Real Example: Gym Bro Data Cleanup 🏋️‍♂️✨

Let's walk through the ENTIRE workflow with our trash gym data:

🎯 Original Bad Gym Data

| MemberID | Name       | Age         | Max Bench (kg) | Join Date  | Gender |
|----------|------------|-------------|----------------|------------|--------|
| 101      | John Bro   | 25          | 120            | 2023-01-15 | Male   |
| 102      | [EMPTY]    | -5          | 500            | 2023-02-30 | Male   |
| 103      | Jane Smith | twenty five | 60             | 2023-03-10 | Female |
| 101      | John Bro   | 25          | 120            | 2023-01-15 | Male   |
| 104      | Mike       | 30          | NULL           | 2024-13-01 | Male   |

Same hot garbage from the quality-check section. Time to run the full workflow on it. 🔥🗑️

🔍 Data Quality Check Findings

  • ❌ Missing name (row 2)
  • ❌ Negative age (-5)
  • ❌ Impossible bench (500kg)
  • ❌ Age as text ("twenty five")
  • ❌ Duplicate row (John Bro appears twice)
  • ❌ Missing bench (NULL)
  • ❌ Invalid dates (Feb 30, month 13)

🧊 Missing Data Analysis (Just the Missing Stuff)

In reality, unless you have a way to recover the missing data (like asking the member for their name), you usually end up just deleting the incomplete record.

Think about it with your amoeba brain 🧬🧠: why keep records you can't work with?

Still, these methods exist, and this is how people use them. It's a case-by-case call, not a blanket rule.

  • Name missing: 1 (20%) - Probably MAR (new member forgot to write)
  • Bench missing: 1 (20%) - MCAR (random skip)
  • Decision: Impute name with "Unknown", impute bench with the median of the cleaned benches

✨ Clean Data (After Fixing Everything)

| MemberID | Name       | Age | Max Bench | Join Date  | Gender |
|----------|------------|-----|-----------|------------|--------|
| 101      | John Bro   | 25  | 120       | 2023-01-15 | Male   |
| 102      | Unknown    | 25  | 180       | 2023-02-28 | Male   |
| 103      | Jane Smith | 25  | 60        | 2023-03-10 | Female |
| 104      | Mike       | 30  | 120       | 2024-01-13 | Male   |

🔥 TRANSFORMATION COMPLETE: Garbage → Usable data. Ready for analysis!
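The whole walkthrough condenses into one short pandas pipeline. This is a sketch of the mechanical steps only; the specific choices (median fills, the 180 kg cap, "Unknown" as a name placeholder) are the judgment calls from the walkthrough, and the exact fill values you get depend on those choices:

```python
import pandas as pd

# Raw gym table (numeric columns only; dates need their own pass)
raw = pd.DataFrame({
    "MemberID": [101, 102, 103, 101, 104],
    "Name": ["John Bro", None, "Jane Smith", "John Bro", "Mike"],
    "Age": ["25", "-5", "twenty five", "25", "30"],
    "MaxBenchKg": [120.0, 500.0, 60.0, 120.0, None],
})

# 1. Remove the exact duplicate (the John Bro clone)
df = raw.drop_duplicates()

# 2. Fix formats and fill the missing name
df = df.assign(
    Name=df["Name"].fillna("Unknown"),
    Age=pd.to_numeric(df["Age"].replace({"twenty five": "25"}), errors="coerce"),
)

# 3. Replace the impossible age with the median of the valid ages
df.loc[df["Age"] < 0, "Age"] = df["Age"][df["Age"] >= 0].median()

# 4. Cap the outlier bench at an assumed domain max, then impute the
#    remaining hole with the median of the cleaned benches
df["MaxBenchKg"] = df["MaxBenchKg"].clip(upper=180)
df["MaxBenchKg"] = df["MaxBenchKg"].fillna(df["MaxBenchKg"].median())

print(df)
```

Order matters here: dedup before imputing (so the clone doesn't skew the median) and cap before imputing (so the 500 kg lie doesn't either).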


🎯 Summary: Your Data Cleaning Checklist

✅ Data Quality Checks (Do FIRST)

✅ Missing Data Analysis (Do SECOND)

✅ Clean Data (Do THIRD)

Ready for More? 🚀

Now that your data isn't trash, you can actually do statistics! Next up: Frequency Tables and Summary Statistics (they work MUCH better with clean data).

💡 Remember: Data quality isn't a one-time thing. Every time you get new data, run through this checklist. Your future self (and anyone who has to use your analysis) will thank you.

Now go clean some data, you beautiful data janitor. 🧹✨