Statistics Tutorial

Data Quality → Missing Data → Clean Data 🧹🗑️🔍

Last updated: 2026-01-02

Applies to version: 1.0+

Reading time: ~10 min

Prerequisite: Don't be a data hoarder

Quick Summary / TL;DR

Your data is trash. Don't take it personally, bruh: EVERYONE'S data is trash when they first get it. This tutorial is your garbage truck. We're gonna take out the trash so you can actually do statistics instead of staring at a spreadsheet full of lies, missing values, and people who claim to be 250 years old.

The workflow is simple but morons like you keep messing it up. So listen up:

1. Data Quality Checks

Find ALL the trash — outliers, wrong values, duplicates, formats, AND missing values (initial spot)

2. Missing Data Analysis

Focus JUST on the missing stuff — patterns, reasons, what to do about it

3. Clean Data

Actually fix everything — now your data isn't trash anymore (congrats!)

⚠️ THE CONNECTION: Data Quality Checks SPOT missing values → Missing Data Analysis DIVES DEEP into them

Table of Contents

  1. Why This Order? (Because Morons Mess It Up)
  2. Data Quality Checks: The Full Trash Inspection
  3. Missing Data Analysis: The Deep Dive
  4. Clean Data: Taking Out the Trash
  5. Real Example: Gym Bro Data Cleanup
  6. Summary: Your Data Cleaning Checklist

1. Why This Order? (Because Morons Mess It Up) 🤦‍♂️

Let me explain this with a gym analogy since your pea brain understands gains:

🛠️ The Gym Equipment Check Analogy

Data Quality Check = Full Gym Walkthrough

You walk in and scan everything:

  • Broken treadmill? (outliers)
  • Missing weights? (missing values - you spot them!)
  • Dumbbell labeled "50kg" but feels like 5kg? (wrong values)
  • Two identical squat racks? (duplicates)

You're noting ALL problems at once.

Missing Data Analysis = Focus JUST on Missing Weights

Now you zoom in on ONE problem:

  • Which weight plates are missing? (count per type)
  • Did someone steal them? (pattern analysis)
  • Can we borrow from another gym? (imputation strategy)

Specialized focus after the general inspection.

Clean Data = Actually Fixing Everything

You take action:

  • Replace missing 45lb plates
  • Fix the treadmill belt
  • Re-label the dumbbells
  • Now the gym is READY for gains

⚠️ MORON ALERT: If you do Missing Data Analysis ONLY, you might miss that half your data has negative ages or duplicate entries. You'd be MISSING THE POINT! Don't be that guy.


2. Data Quality Checks: The Full Trash Inspection 🗑️🔍

This is your comprehensive garbage scan. You're looking for EVERYTHING wrong. Here's your checklist, bruh:

📍 The Obvious Stuff (Even You Should Spot These)

  • Missing Values: Blank cells, "N/A", "NULL", "?"
  • Outliers: Dude who benches 500kg or 5kg
  • Wrong Data Types: Age stored as text "25 years old"

🤥 The Sneaky Lies (These Fool Everyone)

  • Impossible Values: Age = 250, BMI = -5, Negative height
  • Temporal Nonsense: Died before born, discharged before admitted
  • 3-year-old with a PhD: Education date vs birth date mismatch

🔁 Duplicate Disasters

  • Exact Duplicates: Same person entered twice
  • Fuzzy Duplicates: "John Smith" vs "Jon Smyth", same address
  • Same person, multiple IDs: Gym bro using different member numbers

📝 Format Fuckups

  • Phone numbers: (123)456-7890 vs 1234567890 vs 123-456-7890
  • Emails without @: brogmail.com (missing @ symbol)
  • Dates as text: "January 32, 2023" (that day doesn't exist, dumbass)

🤯 Logical Insanities

  • Pregnant males: Gender = "Male", Pregnancy = "Yes"
  • Sum ≠ Total: Item1 + Item2 = 100, Total column says 150
  • "Number of children" = 0 but "Child names" = ["Timmy", "Sally"]

📊 Distribution Disasters

  • All values identical: Every gym bro benches exactly 100kg (sure...)
  • 99% in one category: 99% "Male", 1% "Female" in a general population study
  • Sudden value shift: All weights in kg until row 500, then switch to lbs

🎯 Gym Data Example - Before Quality Check:

| MemberID | Name       | Age         | Max Bench (kg) | Join Date  | Gender |
|----------|------------|-------------|----------------|------------|--------|
| 101      | John Bro   | 25          | 120            | 2023-01-15 | Male   |
| 102      | [EMPTY]    | -5          | 500            | 2023-02-30 | Male   |
| 103      | Jane Smith | twenty five | 60             | 2023-03-10 | Female |
| 101      | John Bro   | 25          | 120            | 2023-01-15 | Male   |
| 104      | Mike       | 30          | NULL           | 2024-13-01 | Male   |

See the trash? Missing name, negative age, impossible bench, text age, duplicate, NULL, invalid date... This data is HOT GARBAGE. 🔥🗑️
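The whole checklist above can be run mechanically. Here's a minimal sketch of the scan using pandas, assuming the toy table above (column names like `MaxBenchKg` are my own shorthand, not a standard):

```python
import pandas as pd

# Hypothetical frame transcribing (most of) the table above
df = pd.DataFrame({
    "MemberID": [101, 102, 103, 101, 104],
    "Name": ["John Bro", None, "Jane Smith", "John Bro", "Mike"],
    "Age": ["25", "-5", "twenty five", "25", "30"],
    "MaxBenchKg": [120.0, 500.0, 60.0, 120.0, None],
    "JoinDate": ["2023-01-15", "2023-02-30", "2023-03-10", "2023-01-15", "2024-13-01"],
})

# 1. Missing values per column (the "initial spot")
print(df.isna().sum())

# 2. Ages that aren't numeric, or are numeric but impossible
age_num = pd.to_numeric(df["Age"], errors="coerce")  # "twenty five" -> NaN
bad_ages = (age_num.isna()) | (age_num < 0) | (age_num > 120)
print("bad ages:", bad_ages.sum())  # -5 and "twenty five" -> 2

# 3. Exact duplicate rows (the John Bro clone)
print("duplicates:", df.duplicated().sum())

# 4. Dates that don't exist (Feb 30, month 13) become NaT when coerced
dates = pd.to_datetime(df["JoinDate"], errors="coerce")
print("invalid dates:", dates.isna().sum())
```

Each check prints a count; anything above zero is a line item for your trash report.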


3. Missing Data Analysis: The Deep Dive 🧊🔬

OK, you found missing values in your quality check. Now let's FOCUS on them. This is surgical precision, not a general inspection.

Step 1: Count the Missing (How Bad Is It?)

Missing per variable:

  • Name: 1 missing (20%)
  • Age: 0 missing (0%)
  • Max Bench: 1 missing (20%)
  • Join Date: 0 missing (0%)
  • Gender: 0 missing (0%)
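Those counts take one line each in pandas. A sketch, assuming a toy frame with the same missingness as the table above:

```python
import pandas as pd

# Illustrative frame mirroring the counts above (values are made up)
df = pd.DataFrame({
    "Name": ["John Bro", None, "Jane Smith", "John Bro", "Mike"],
    "Age": [25, 27, 25, 25, 30],
    "MaxBenchKg": [120.0, 180.0, 60.0, 120.0, None],
})

# Count and percentage of missing values per column
report = pd.DataFrame({
    "missing": df.isna().sum(),
    "percent": df.isna().mean() * 100,  # mean of a boolean mask = fraction missing
})
print(report)
```

`isna().mean()` works because `True` counts as 1, so the mean of the missing-mask is exactly the fraction missing.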

Step 2: Pattern Analysis (Why Is It Missing?)

This is where you become a data detective. Three types of missingness:

MCAR 🎲 (Missing Completely At Random)

  • No pattern. Random like a dice roll.
  • Example: Random survey pages lost in the mail
  • Fix: Usually safe to delete

MAR 📊 (Missing At Random)

  • Missingness relates to OTHER variables you CAN see
  • Example: Women less likely to report weight (you know gender)
  • Fix: Can model and impute

MNAR 🕵️ (Missing Not At Random)

  • Missingness relates to THE VARIABLE ITSELF
  • Example: Rich people hide income (you DON'T know their true income)
  • Fix: Tricky. May need specialized methods

✅ Why It Matters

If you ignore why data is missing and just delete or impute blindly, your analysis can be biased, misleading, or wrong.

MCAR 🎲 → Safe to delete (if a small % is missing). Since it's completely random, deleting won't bias your results, as long as you don't lose too much data.

MAR 📊 → Can model/impute using other variables. Since missingness relates to variables you can see, you can use regression, K-NN, or multiple imputation.

MNAR 🕵️ → Tricky! Requires specialized methods: sensitivity analysis, selection models, or expert knowledge. Simple imputation will bias results.

⚠️ NOTE FOR MORONS 🤦‍♂️: Knowing WHY data is missing is more important than knowing HOW to impute it. A fancy imputation method on MNAR data will give you fancy wrong results.
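One quick MAR diagnostic you can actually run: check whether the missingness rate differs across groups of a variable you CAN see. A sketch with hypothetical survey data (column names are mine):

```python
import pandas as pd

# Hypothetical gym survey: is "Weight" missing more often for one gender?
df = pd.DataFrame({
    "Gender": ["F", "F", "F", "F", "M", "M", "M", "M"],
    "Weight": [None, None, 60.0, 58.0, 80.0, 85.0, 90.0, 78.0],
})

# Fraction of missing Weight per gender; a big gap hints at MAR w.r.t. Gender
rate_by_group = df["Weight"].isna().groupby(df["Gender"]).mean()
print(rate_by_group)
```

A gap like 50% vs 0% is a strong MAR clue. Note the limit: this can never PROVE MCAR or rule out MNAR, because MNAR depends on values you don't observe.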

Step 3: Decision Time (What to Do?)

Option A: Deletion 🗑️

  • Listwise: Delete entire row if ANY missing
  • Pairwise: Use available data for each calculation
  • When: MCAR & small % missing (<5%)

Option B: Imputation 🔧

  • Mean/Median/Mode: Simple but dumb
  • Regression: Predict missing from other variables
  • K-NN: Use similar rows' values
  • Multiple Imputation: Fancy stats bro method

Option C: Flagging 🚩

  • Add "was_missing" column
  • Keep original missing, but mark it
  • Good for MNAR situations
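All three options are one-liners in pandas. A minimal sketch (the column names and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"Bench": [120.0, None, 100.0, 110.0, None],
                   "Age": [25, 27, 25, 30, 28]})

# Option A: listwise deletion — drop any row with a missing value
dropped = df.dropna()

# Option B: simple imputation — fill holes with the column median
imputed = df.assign(Bench=df["Bench"].fillna(df["Bench"].median()))

# Option C: flagging — keep the hole's location on record
flagged = df.assign(bench_was_missing=df["Bench"].isna())

print(len(dropped), imputed["Bench"].tolist(), int(flagged["bench_was_missing"].sum()))
```

For the fancier Option B methods (K-NN, multiple imputation), scikit-learn's `KNNImputer` and `IterativeImputer` are common choices, but simple median fills are where most people should start.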

🧪 Self-Check: Is This MCAR, MAR, or MNAR?

Scenario: Gym survey data. The "Income" column has missing values. If high earners are the ones leaving it blank (like the rich-people example above), that's MNAR: the missingness depends on the value itself. If the missingness only tracks other variables you can see (say, age group), it's MAR. If it's just random skips, it's MCAR.


4. Clean Data: Taking Out the Trash 🧹✨

Now you actually FIX everything you found. This is the payoff, bruh.

🔄 Fix Format Issues

Before: "twenty five", "(123)456-7890", "2024-13-01"

After: 25, 1234567890, 2024-01-13
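A plain-Python sketch of those three fixes. The word-to-number lookup and the "day/month swapped" assumption for `2024-13-01` are my guesses about this toy data, not general rules:

```python
import re
from datetime import datetime

# Text age: a tiny lookup table; real data needs a proper word-number parser
word_ages = {"twenty five": 25}
age = word_ages["twenty five"]

# Phone: strip everything that isn't a digit
digits = re.sub(r"\D", "", "(123)456-7890")

# Date: month 13 doesn't exist, so assume day and month were swapped
date_text = "2024-13-01"
try:
    fixed = datetime.strptime(date_text, "%Y-%m-%d")
except ValueError:
    fixed = datetime.strptime(date_text, "%Y-%d-%m")  # fallback: swapped order

print(age, digits, fixed.date())  # 25 1234567890 2024-01-13
```

The try/except pattern matters: you only reinterpret the date AFTER the honest parse fails, and you should log every row where the fallback fired.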

🎯 Handle Outliers

Before: Bench: [60, 80, 120, 500, 90]

After (capped at a plausible max, 180 kg): [60, 80, 120, 180, 90]

500kg was bullshit. Note: with only five values, a 95th-percentile cap isn't meaningful (it would land near 500 anyway), so cap at a sane domain maximum instead. On a big dataset, percentile capping works fine.
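Both capping styles in pandas, as a sketch (the 180 kg "plausible human max" is an assumption for this toy data):

```python
import pandas as pd

bench = pd.Series([60, 80, 120, 500, 90])

# Big-data style: cap at the empirical 95th percentile
capped_pct = bench.clip(upper=bench.quantile(0.95))

# Tiny-data style: cap at an assumed domain maximum (180 kg here)
capped_domain = bench.clip(upper=180)
print(capped_domain.tolist())  # [60, 80, 120, 180, 90]
```

`clip(upper=...)` leaves everything below the cap untouched and replaces everything above it, which is exactly the winsorizing behavior described above.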

🔍 Impute Missing Values

Before: Bench: [120, NULL, 100, 110, NULL]

After (mean imputation): [120, 110, 100, 110, 110]

Mean = 110. Filled NULLs with mean.
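The same mean fill in pandas, as a sketch:

```python
import pandas as pd

bench = pd.Series([120, None, 100, 110, None])

# Mean of the observed values is (120 + 100 + 110) / 3 = 110
filled = bench.fillna(bench.mean())
print(filled.tolist())  # [120.0, 110.0, 100.0, 110.0, 110.0]
```

`fillna` only touches the missing entries; the observed values pass through unchanged. Swap `mean()` for `median()` when outliers might drag the mean around.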

🚫 Remove Duplicates

Before: John Bro appears twice (identical rows)

After: John Bro appears once

Kept first occurrence, removed exact duplicate.
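Exact dedup is one call; a sketch:

```python
import pandas as pd

df = pd.DataFrame({"MemberID": [101, 102, 101],
                   "Name": ["John Bro", "Jane Smith", "John Bro"]})

# Drop exact duplicate rows, keeping the first occurrence
deduped = df.drop_duplicates(keep="first")
print(len(deduped))  # 2
```

Fuzzy duplicates ("John Smith" vs "Jon Smyth") need more work: normalize the strings first (lowercase, strip punctuation) and then compare with a string-distance method; `drop_duplicates` alone won't catch them.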

✅ Validate Logic

Before: Gender = "Male", Pregnancy = "Yes"

After: Pregnancy = "No" (fixed impossible combo)
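A sketch of the logic check. Note it flags the impossible rows rather than auto-fixing them; deciding WHICH field is wrong (maybe Gender is the bad entry, not Pregnancy) is a human call:

```python
import pandas as pd

df = pd.DataFrame({"Gender": ["Male", "Female"],
                   "Pregnancy": ["Yes", "Yes"]})

# Flag impossible combinations for manual review
impossible = (df["Gender"] == "Male") & (df["Pregnancy"] == "Yes")
print(df[impossible])  # the rows a human needs to look at
```

The same boolean-mask pattern covers the other logical insanities above (sum ≠ total, child count vs child names): write the rule as a mask, then review or fix the flagged rows.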

📝 Document Everything

Before: No record of changes

After: "2026-01-02: Fixed negative age (-5 → 25 using median), capped outlier bench (500 → 180), imputed missing bench with the median"


5. Real Example: Gym Bro Data Cleanup 🏋️‍♂️✨

Let's walk through the ENTIRE workflow with our trash gym data:

🎯 Original Bad Gym Data

| MemberID | Name       | Age         | Max Bench (kg) | Join Date  | Gender |
|----------|------------|-------------|----------------|------------|--------|
| 101      | John Bro   | 25          | 120            | 2023-01-15 | Male   |
| 102      | [EMPTY]    | -5          | 500            | 2023-02-30 | Male   |
| 103      | Jane Smith | twenty five | 60             | 2023-03-10 | Female |
| 101      | John Bro   | 25          | 120            | 2023-01-15 | Male   |
| 104      | Mike       | 30          | NULL           | 2024-13-01 | Male   |

Same hot garbage from the quality-check section. Time to run the full workflow on it. 🔥🗑️

🔍 Data Quality Check Findings

  • ❌ Missing name (row 2)
  • ❌ Negative age (-5)
  • ❌ Impossible bench (500kg)
  • ❌ Age as text ("twenty five")
  • ❌ Duplicate row (John Bro appears twice)
  • ❌ Missing bench (NULL)
  • ❌ Invalid dates (Feb 30, month 13)

🧊 Missing Data Analysis (Just the Missing Stuff)

In reality, unless you have a way to recover the missing data (like asking the member for their name), you usually end up just deleting the incomplete record.

Think about it with your amoeba brain 🧬🧠: why keep records you can't work with?

Still, these methods exist, and this is how people use them. It's a case-by-case call, not a blanket rule.

  • Name missing: 1 (20%) - Probably MAR (new member forgot to write)
  • Bench missing: 1 (20%) - MCAR (random skip)
  • Decision: Impute name with "Unknown", impute bench with the median of the cleaned benches

✨ Clean Data (After Fixing Everything)

| MemberID | Name       | Age | Max Bench | Join Date  | Gender |
|----------|------------|-----|-----------|------------|--------|
| 101      | John Bro   | 25  | 120       | 2023-01-15 | Male   |
| 102      | Unknown    | 25  | 180       | 2023-02-28 | Male   |
| 103      | Jane Smith | 25  | 60        | 2023-03-10 | Female |
| 104      | Mike       | 30  | 120       | 2024-01-13 | Male   |

🔥 TRANSFORMATION COMPLETE: Garbage → Usable data. Ready for analysis!
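The whole walkthrough condenses into one short pandas pipeline. This is a sketch of the mechanical steps only; the specific choices (median fills, the 180 kg cap, "Unknown" as a name placeholder) are the judgment calls from the walkthrough, and the exact fill values you get depend on those choices:

```python
import pandas as pd

# Raw gym table (numeric columns only; dates need their own pass)
raw = pd.DataFrame({
    "MemberID": [101, 102, 103, 101, 104],
    "Name": ["John Bro", None, "Jane Smith", "John Bro", "Mike"],
    "Age": ["25", "-5", "twenty five", "25", "30"],
    "MaxBenchKg": [120.0, 500.0, 60.0, 120.0, None],
})

# 1. Remove the exact duplicate (the John Bro clone)
df = raw.drop_duplicates()

# 2. Fix formats and fill the missing name
df = df.assign(
    Name=df["Name"].fillna("Unknown"),
    Age=pd.to_numeric(df["Age"].replace({"twenty five": "25"}), errors="coerce"),
)

# 3. Replace the impossible age with the median of the valid ages
df.loc[df["Age"] < 0, "Age"] = df["Age"][df["Age"] >= 0].median()

# 4. Cap the outlier bench at an assumed domain max, then impute the
#    remaining hole with the median of the cleaned benches
df["MaxBenchKg"] = df["MaxBenchKg"].clip(upper=180)
df["MaxBenchKg"] = df["MaxBenchKg"].fillna(df["MaxBenchKg"].median())

print(df)
```

Order matters here: dedup before imputing (so the clone doesn't skew the median) and cap before imputing (so the 500 kg lie doesn't either).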


🎯 Summary: Your Data Cleaning Checklist

✅ Data Quality Checks (Do FIRST)

✅ Missing Data Analysis (Do SECOND)

✅ Clean Data (Do THIRD)

Ready for More? 🚀

Now that your data isn't trash, you can actually do statistics! Next up: Frequency Tables and Summary Statistics (they work MUCH better with clean data).

💡 Remember: Data quality isn't a one-time thing. Every time you get new data, run through this checklist. Your future self (and anyone who has to use your analysis) will thank you.

Now go clean some data, you beautiful data janitor. 🧹✨