Data Quality → Missing Data → Clean Data 🧹🗑️🔍
Last updated: 2026-01-02
Applies to version: 1.0+
Reading time: ~10 min
Prerequisite: Don't be a data hoarder
Quick Summary / TL;DR
Your data is trash. But it's not just you, bruh. EVERYONE'S data is trash when they first get it. This tutorial is your garbage truck. We're gonna take out the trash so you can actually do statistics instead of just staring at a spreadsheet full of lies, missing values, and people who claim to be 250 years old.
The workflow is simple but morons like you keep messing it up. So listen up:
1. Data Quality Checks
Find ALL the trash: outliers, wrong values, duplicates, bad formats, AND missing values (at this stage you only spot them)
2. Missing Data Analysis
Focus JUST on the missing stuff — patterns, reasons, what to do about it
3. Clean Data
Actually fix everything — now your data isn't trash anymore (congrats!)
⚠️ THE CONNECTION: Data Quality Checks SPOT missing values → Missing Data Analysis DIVES DEEP into them
1. Why This Order? (Because Morons Mess It Up) 🤦‍♂️
Let me explain this with a gym analogy since your pea brain understands gains:
🛠️ The Gym Equipment Check Analogy
Data Quality Check = Full Gym Walkthrough
You walk in and scan everything:
- Broken treadmill? (outliers)
- Missing weights? (missing values - you spot them!)
- Dumbbell labeled "50kg" but feels like 5kg? (wrong values)
- Two identical squat racks? (duplicates)
You're noting ALL problems at once.
Missing Data Analysis = Focus JUST on Missing Weights
Now you zoom in on ONE problem:
- Which weight plates are missing? (count per type)
- Did someone steal them? (pattern analysis)
- Can we borrow from another gym? (imputation strategy)
Specialized focus after the general inspection.
Clean Data = Actually Fixing Everything
You take action:
- Replace missing 45lb plates
- Fix the treadmill belt
- Re-label the dumbbells
- Now the gym is READY for gains
⚠️ MORON ALERT: If you do Missing Data Analysis ONLY, you might miss that half your data has negative ages or duplicate entries. You'd be MISSING THE POINT! Don't be that guy.
2. Data Quality Checks: The Full Trash Inspection 🗑️🔍
This is your comprehensive garbage scan. You're looking for EVERYTHING wrong. Here's your checklist, bruh:
📍 The Obvious Stuff (Even You Should Spot These)
- Missing Values: Blank cells, "N/A", "NULL", "?"
- Outliers: Dude who benches 500kg or 5kg
- Wrong Data Types: Age stored as text "25 years old"
🤥 The Sneaky Lies (These Fool Everyone)
- Impossible Values: Age = 250, BMI = -5, Negative height
- Temporal Nonsense: Died before born, discharged before admitted
- 3-year-old with a PhD: Education date vs birth date mismatch
🔁 Duplicate Disasters
- Exact Duplicates: Same person entered twice
- Fuzzy Duplicates: "John Smith" vs "Jon Smyth", same address
- Same person, multiple IDs: Gym bro using different member numbers
📝 Format Fuckups
- Phone numbers: (123)456-7890 vs 1234567890 vs 123-456-7890
- Emails without @: "brogmail.com" (should be "bro@gmail.com")
- Dates as text: "January 32, 2023" (that day doesn't exist, dumbass)
🤯 Logical Insanities
- Pregnant males: Gender = "Male", Pregnancy = "Yes"
- Sum ≠ Total: Item1 + Item2 = 100, Total column says 150
- "Number of children" = 0 but "Child names" = ["Timmy", "Sally"]
📊 Distribution Disasters
- All values identical: Every gym bro benches exactly 100kg (sure...)
- 99% in one category: 99% "Male", 1% "Female" in a general population study
- Sudden value shift: All weights in kg until row 500, then switch to lbs
🎯 Gym Data Example - Before Quality Check:
| MemberID | Name | Age | Max Bench (kg) | Join Date | Gender |
|---|---|---|---|---|---|
| 101 | John Bro | 25 | 120 | 2023-01-15 | Male |
| 102 | [EMPTY] | -5 | 500 | 2023-02-30 | Male |
| 103 | Jane Smith | twenty five | 60 | 2023-03-10 | Female |
| 101 | John Bro | 25 | 120 | 2023-01-15 | Male |
| 104 | Mike | 30 | NULL | 2024-13-01 | Male |
See the trash? Missing name, negative age, impossible bench, text age, duplicate, NULL, invalid date... This data is HOT GARBAGE. 🔥🗑️
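The checklist above can be sketched as a small scan over the sample rows. This is a minimal, stdlib-only illustration, not a full validator: the `quality_report` helper, the `None` stand-in for [EMPTY]/NULL, and the 300 kg plausibility threshold are all assumptions made for this example.

```python
from collections import Counter
from datetime import date

# The five sample rows from the table above; None stands in for [EMPTY]/NULL.
ROWS = [
    (101, "John Bro", "25", "120", "2023-01-15", "Male"),
    (102, None, "-5", "500", "2023-02-30", "Male"),
    (103, "Jane Smith", "twenty five", "60", "2023-03-10", "Female"),
    (101, "John Bro", "25", "120", "2023-01-15", "Male"),
    (104, "Mike", "30", None, "2024-13-01", "Male"),
]

MAX_PLAUSIBLE_BENCH_KG = 300  # assumed threshold, pick one that fits your data


def is_int(text):
    """True if text parses as an integer."""
    try:
        int(text)
        return True
    except (TypeError, ValueError):
        return False


def is_valid_date(text):
    """True if text is a real calendar date in YYYY-MM-DD form."""
    try:
        date.fromisoformat(text)
        return True
    except (TypeError, ValueError):
        return False


def quality_report(rows):
    """Return (row_number, problem) pairs; row numbers start at 1."""
    problems = []
    seen = Counter()
    for n, row in enumerate(rows, start=1):
        _member_id, name, age, bench, joined, _gender = row
        seen[row] += 1
        if seen[row] > 1:
            problems.append((n, "exact duplicate"))
        if name is None:
            problems.append((n, "missing name"))
        if bench is None:
            problems.append((n, "missing bench"))
        elif is_int(bench) and int(bench) > MAX_PLAUSIBLE_BENCH_KG:
            problems.append((n, "implausible bench"))
        if not is_int(age):
            problems.append((n, "age is not a number"))
        elif int(age) < 0:
            problems.append((n, "negative age"))
        if not is_valid_date(joined):
            problems.append((n, "invalid date"))
    return problems


for row_number, problem in quality_report(ROWS):
    print(f"row {row_number}: {problem}")
```

Note that the scan only reports problems. It fixes nothing, which matches the "spot first, fix later" order of the workflow.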
3. Missing Data Analysis: The Deep Dive 🧊🔬
OK, you found missing values in your quality check. Now let's FOCUS on them. This is surgical precision, not a general inspection.
Step 1: Count the Missing (How Bad Is It?)
Missing per variable:
Name: 1 missing (20%)
Age: 0 missing (0%)
Max Bench: 1 missing (20%)
Join Date: 0 missing (0%)
Gender: 0 missing (0%)
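A rough sketch of how those counts could be produced, assuming the rows are plain lists of strings and that the placeholders listed earlier ("N/A", "NULL", "?", plus the table's "[EMPTY]") all mean "missing". The `MISSING_TOKENS` set and the `missing_summary` helper are made up for this example.

```python
COLUMNS = ["MemberID", "Name", "Age", "Max Bench (kg)", "Join Date", "Gender"]

# Assumed set of placeholders that count as missing; extend it for your data.
MISSING_TOKENS = {"", "N/A", "NULL", "?", "[EMPTY]"}

ROWS = [
    ["101", "John Bro", "25", "120", "2023-01-15", "Male"],
    ["102", "[EMPTY]", "-5", "500", "2023-02-30", "Male"],
    ["103", "Jane Smith", "twenty five", "60", "2023-03-10", "Female"],
    ["101", "John Bro", "25", "120", "2023-01-15", "Male"],
    ["104", "Mike", "30", "NULL", "2024-13-01", "Male"],
]


def is_missing(cell):
    """Treat None and the known placeholder strings as missing."""
    return cell is None or cell.strip() in MISSING_TOKENS


def missing_summary(rows, columns):
    """Map each column name to (missing_count, missing_percent)."""
    summary = {}
    for i, col in enumerate(columns):
        count = sum(1 for row in rows if is_missing(row[i]))
        summary[col] = (count, 100 * count / len(rows))
    return summary


for col, (count, pct) in missing_summary(ROWS, COLUMNS).items():
    print(f"{col}: {count} missing ({pct:.0f}%)")
```

Bad-but-present values such as -5 or "twenty five" are deliberately not counted here. They belong to the quality check, not the missing-data count.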
Step 2: Pattern Analysis (Why Is It Missing?)
This is where you become a data detective. Three types of missingness:
MCAR 🎲
Missing Completely At Random
No pattern. Random like a dice roll.
Example: Random survey pages lost in mail
Fix: Usually safe to delete
MAR 📊
Missing At Random
Missingness relates to OTHER variables you CAN see
Example: Women less likely to report weight (you know gender)
Fix: Can model and impute
MNAR 🕵️
Missing Not At Random
Missingness relates to THE VARIABLE ITSELF
Example: Rich people hide income (you DON'T know their true income)
Fix: Tricky. May need specialized methods
✅ Why It Matters
If you ignore why data is missing and just delete or impute blindly, your analysis can be biased, misleading, or wrong.
- MCAR 🎲 → Safe to delete (if small % missing): Since it's completely random, deleting won't bias your results (as long as you don't lose too much data).
- MAR 📊 → Can model/impute using other variables: Since missingness relates to variables you can see, you can use regression, K-NN, or multiple imputation.
- MNAR 🕵️ → Tricky! Requires specialized methods: Need sensitivity analysis, selection models, or expert knowledge. Simple imputation will bias results.
⚠️ NOTES FOR MORONS 🤦‍♂️: Knowing WHY data is missing is more important than knowing HOW to impute it. A fancy imputation method on MNAR data will give you fancy wrong results.
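One cheap first look at the "why" is to compare missing rates across a variable you can see. The sketch below uses made-up survey rows (gender, reported weight) purely for illustration. A big gap between groups hints at MAR. It cannot rule out MNAR, and equal rates are consistent with MCAR without proving it.

```python
from collections import defaultdict

# Hypothetical survey rows: (gender, reported_weight_kg or None). Invented data.
SURVEY = [
    ("F", None), ("F", 62), ("F", None), ("F", None),
    ("M", 80), ("M", 91), ("M", None), ("M", 77),
]


def missing_rate_by_group(rows):
    """Return {group: share of rows in that group whose value is missing}."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for group, value in rows:
        totals[group] += 1
        if value is None:
            missing[group] += 1
    return {group: missing[group] / totals[group] for group in totals}


rates = missing_rate_by_group(SURVEY)
for group, rate in rates.items():
    print(f"{group}: {rate:.0%} missing")
```

With real data you would want far more rows than this before reading anything into the gap.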
Step 3: Decision Time (What to Do?)
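🧪 Try it: Is This MCAR, MAR, or MNAR?
Scenario: Gym survey data. "Income" column has missing values. Ask yourself: does the missingness track a variable you can see (MAR), the income itself (MNAR), or nothing at all (MCAR)?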
4. Clean Data: Taking Out the Trash 🧹✨
Now you actually FIX everything you found. This is the payoff, bruh.
🔄 Fix Format Issues
Before: "twenty five", "(123)456-7890", "2024-13-01"
After: 25, 1234567890, 2024-01-13
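A minimal sketch of those three format fixes. The helper names are invented for this example, the word-to-number table only covers two-word ages like "twenty five", and the date fix assumes you already know day and month were swapped in the source system.

```python
import re

# Tiny word-to-number tables, enough for the sample; not a general parser.
UNITS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
         "six": 6, "seven": 7, "eight": 8, "nine": 9}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}


def parse_age(text):
    """Return an int for '25', '25 years old', or 'twenty five'; None otherwise."""
    text = text.strip().lower()
    match = re.match(r"-?\d+", text)
    if match:
        return int(match.group())
    total = 0
    for word in text.replace("-", " ").split():
        if word in TENS:
            total += TENS[word]
        elif word in UNITS:
            total += UNITS[word]
        else:
            return None  # unknown word: refuse to guess
    return total or None


def normalize_phone(text):
    """Keep digits only, so every phone format collapses to one."""
    return re.sub(r"\D", "", text)


def swap_day_month(text):
    """Turn 'YYYY-DD-MM' into 'YYYY-MM-DD'. Only safe if the swap is confirmed."""
    year, first, second = text.split("-")
    return f"{year}-{second}-{first}"


print(parse_age("twenty five"))
print(normalize_phone("(123)456-7890"))
print(swap_day_month("2024-13-01"))
```

`parse_age` returns `None` instead of guessing when it hits a word it doesn't know, so unparseable ages end up as missing values rather than silent garbage.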
🎯 Handle Outliers
Before: Bench: [60, 80, 120, 500, 90]
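After (cap at a plausible max of 180 kg): [60, 80, 120, 180, 90]
500kg was bullshit. Capped at a reasonable max. (A 95th-percentile cap would not give 180 here: with only five values, the percentile is mostly decided by the outlier itself.)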
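A small sketch of the capping step, assuming a fixed domain maximum of 180 kg (the value from the After list; the constant name is made up). The second print shows why a percentile computed from a handful of rows is a shaky cap: the outlier pulls it far above anything reasonable.

```python
from statistics import quantiles

MAX_BENCH_KG = 180  # assumed domain cap, taken from the example above


def cap(values, upper):
    """Clip every value above `upper` down to `upper`."""
    return [min(v, upper) for v in values]


benches = [60, 80, 120, 500, 90]
print(cap(benches, MAX_BENCH_KG))

# 95th percentile of the same five values (linear interpolation).
p95 = quantiles(benches, n=20, method="inclusive")[-1]
print(p95)
```

On a large dataset a percentile cap behaves much better, because a single bogus row can't move it much.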
🔍 Impute Missing Values
Before: Bench: [120, NULL, 100, 110, NULL]
After (mean imputation): [120, 110, 100, 110, 110]
Mean = 110. Filled NULLs with mean.
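Mean imputation as a short sketch, with `None` standing in for NULL. The `impute_mean` name is an assumption made for this example.

```python
from statistics import mean


def impute_mean(values):
    """Replace each None with the mean of the observed (non-None) values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)  # for the sample: mean of 120, 100, 110
    return [fill if v is None else v for v in values]


print(impute_mean([120, None, 100, 110, None]))
```

Keep in mind that mean imputation shrinks the spread of the column, since every filled value sits exactly at the center.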
🚫 Remove Duplicates
Before: John Bro appears twice (identical rows)
After: John Bro appears once
Kept first occurrence, removed exact duplicate.
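Dropping exact duplicates while keeping the first occurrence can be sketched like this. It only catches rows that match in every field. Fuzzy duplicates ("John Smith" vs "Jon Smyth") need a different approach. The helper name is invented for this example.

```python
def drop_exact_duplicates(rows):
    """Return rows in original order, keeping only the first copy of each."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)  # tuples are hashable, lists are not
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


rows = [
    [101, "John Bro", 25, 120, "2023-01-15", "Male"],
    [103, "Jane Smith", 25, 60, "2023-03-10", "Female"],
    [101, "John Bro", 25, 120, "2023-01-15", "Male"],
]
for row in drop_exact_duplicates(rows):
    print(row)
```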
✅ Validate Logic
Before: Gender = "Male", Pregnancy = "Yes"
After: Pregnancy = "No" (fixed impossible combo; if you can't tell which field is wrong, flag the row for review instead of guessing)
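Logic checks like this one can be written as a list of named rules that flag a record without changing it. The field names (`gender`, `pregnant`, `num_children`, `child_names`, `item1`, `item2`, `total`) are hypothetical, chosen to mirror the examples in the Logical Insanities list.

```python
# Each rule is (label, function that returns True when the record is broken).
RULES = [
    ("pregnant male",
     lambda r: r["gender"] == "Male" and r["pregnant"] == "Yes"),
    ("children count mismatch",
     lambda r: r["num_children"] != len(r["child_names"])),
    ("sum does not match total",
     lambda r: r["item1"] + r["item2"] != r["total"]),
]


def violations(record):
    """Return the labels of every rule the record breaks."""
    return [label for label, is_broken in RULES if is_broken(record)]


bad_record = {
    "gender": "Male", "pregnant": "Yes",
    "num_children": 0, "child_names": ["Timmy", "Sally"],
    "item1": 60, "item2": 40, "total": 150,
}
print(violations(bad_record))
```

Flagging first and fixing second keeps a human in the loop for cases where the "obvious" fix might be the wrong field.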
📝 Document Everything
Before: No record of changes
After: "2026-01-02: Fixed negative age (-5 → 25 using median), capped outlier bench (500 → 180), imputed 2 missing benches with mean"
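One lightweight way to keep that record is to log each change at the moment you make it. This is a bare-bones sketch with an in-memory list. The `log_change` helper and its entry format are assumptions, not a standard.

```python
from datetime import date

change_log = []  # in a real project this would go to a file or a table


def log_change(field, old, new, reason, when=None):
    """Append one dated, human-readable entry and return it."""
    stamp = (when or date.today()).isoformat()
    entry = f"{stamp}: {field} {old!r} -> {new!r} ({reason})"
    change_log.append(entry)
    return entry


log_change("age", -5, 25, "negative age, replaced with median", when=date(2026, 1, 2))
log_change("max_bench_kg", 500, 180, "capped outlier", when=date(2026, 1, 2))
print("\n".join(change_log))
```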
5. Real Example: Gym Bro Data Cleanup 🏋️‍♂️✨
Let's walk through the ENTIRE workflow with our trash gym data:
🎯 Original Bad Gym Data
| MemberID | Name | Age | Max Bench (kg) | Join Date | Gender |
|---|---|---|---|---|---|
| 101 | John Bro | 25 | 120 | 2023-01-15 | Male |
| 102 | [EMPTY] | -5 | 500 | 2023-02-30 | Male |
| 103 | Jane Smith | twenty five | 60 | 2023-03-10 | Female |
| 101 | John Bro | 25 | 120 | 2023-01-15 | Male |
| 104 | Mike | 30 | NULL | 2024-13-01 | Male |
🔍 Data Quality Check Findings
- ❌ Missing name (row 2)
- ❌ Negative age (-5)
- ❌ Impossible bench (500kg)
- ❌ Age as text ("twenty five")
- ❌ Duplicate row (John Bro appears twice)
- ❌ Missing bench (NULL)
- ❌ Invalid dates (Feb 30, month 13)
🧊 Missing Data Analysis (Just the Missing Stuff)
In practice, unless you have a way to recover the missing data (say, looking up a member's name), you usually end up deleting the whole record.
Think about it with your amoeba brain 🧬🧠: why would you keep records that are incomplete? You can't work with the data.
Still, these methods exist, and this is how people use them. It's a case-by-case call, not a blanket rule:
- Name missing: 1 (20%) - Probably MAR (new member forgot to write it)
- Bench missing: 1 (20%) - MCAR (random skip)
- Decision: Impute name with "Unknown", impute bench with median
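A hedged sketch of the bench decision. One reading that yields a fill value of 90 is to drop the duplicate row and set aside the 500 kg outlier first, then take the median of what is left (60 and 120). The 300 kg plausibility cutoff is an assumption for illustration. Order matters: capping 500 to 180 before computing the median would give 120 instead.

```python
from statistics import median

PLAUSIBLE_MAX_KG = 300  # assumed cutoff, not from the tutorial

# Bench values after dropping the duplicate John Bro row; None is Mike's NULL.
benches = [120, 500, 60, None]

# Use only values that are present and plausible to compute the fill value.
trusted = [b for b in benches if b is not None and b <= PLAUSIBLE_MAX_KG]
fill = median(trusted)  # median of 120 and 60

imputed = [fill if b is None else b for b in benches]
print(fill, imputed)
```

The 500 is still in the list after this step. Capping it is a separate outlier fix.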
✨ Clean Data (After Fixing Everything)
| MemberID | Name | Age | Max Bench (kg) | Join Date | Gender |
|---|---|---|---|---|---|
| 101 | John Bro | 25 | 120 | 2023-01-15 | Male |
| 102 | Unknown | 25 | 180 | 2023-02-28 | Male |
| 103 | Jane Smith | 25 | 60 | 2023-03-10 | Female |
| 104 | Mike | 30 | 90 | 2024-01-13 | Male |
🔥 TRANSFORMATION COMPLETE: Garbage → Usable data. Ready for analysis!
🎯 Summary: Your Data Cleaning Checklist
✅ Data Quality Checks (Do FIRST)
✅ Missing Data Analysis (Do SECOND)
✅ Clean Data (Do THIRD)
💡 Remember: Data quality isn't a one-time thing. Every time you get new data, run through this checklist. Your future self (and anyone who has to use your analysis) will thank you.
Now go clean some data, you beautiful data janitor. 🧹✨