Statistics Tutorial

Data Tables – Aggregated vs Disaggregated 📊🧮

Last updated: 2026-01-02

Applies to version: 1.0+

Reading time: ~12 min

Quick Summary / TL;DR

Tables are where your data goes to get organized or get lost forever. 😾

You have two choices: show every single datapoint (disaggregated) or group them into intervals (aggregated) like a lazy ass.

Disaggregated tables are raw, honest, and painful to look at. Aggregated tables are tidy, clean, and hide the ugly truth.

Choose wisely, 'cause if you fuck this up, your graphs will lie and your professor will fail you. 🚫📉

Let's dive in, dumbass.

Table of Contents

  1. Trick Question – Discrete or Continuous?
  2. The Chess Dataset – Wins, Draws, and .5 Bullshit
  3. Table Types: Disaggregated vs Aggregated
  4. Disaggregated Table – Full Transparency
  5. Aggregated Table – Binned & Hidden
  6. Binning Methods – How Many Classes?
  7. Formulas Change – Know Your Math
  8. When to Use Which (Don’t Be Stupid)
  9. Cheat Sheet & Next Steps

1. Trick Question – Discrete or Continuous? 🤔

Alright, check this out:

Dataset values: 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5...

Trick question: Are these discrete or continuous?

If you said "continuous", go back to tutorial #1, bruh. You're falling behind. 😤

If you said "can be either", okay – you're thinking, but still wrong.

The Truth:

This is chess data. A win = 1 point, a draw = 0.5 points.

So 2.5 means: 2 wins + 1 draw. You're counting games, not measuring height or temperature.

Counting = Discrete. Even with decimal points, you're still counting outcomes, not measuring a continuum.

Remember: if you're not measuring (heights, temps, weights), you're counting. Discrete ≠ only integers. Discrete = distinct, separate outcomes.

💡 Memory tip: Decimals don't automatically mean continuous. Think: "Can I have 2.3 of this thing in real life?" For chess games? No. For temperature? Yes.

2. The Chess Dataset – Wins, Draws, and .5 Bullshit ♟️📈

Here's the raw dataset I gathered from analyzing chess player streaks:

| Xi | Fi |
|-----|-----|
| 2 | 72 |
| 2.5 | 9 |
| 3 | 41 |
| 3.5 | 20 |
| 4 | 28 |
| 4.5 | 12 |
| 5 | 29 |
| 5.5 | 14 |
| 6 | 23 |
| 6.5 | 11 |
| 7 | 23 |
| 7.5 | 8 |
| 8 | 8 |

(These rows only cover streaks up to 8. The full dataset keeps going — the stats later use n = 427 — so don't panic when the frequencies here sum to 298.)
What this means:

  • Xi = The streak value (e.g., 2 = two consecutive wins)
  • Fi = Frequency (e.g., 72 players had a streak of 2 wins)
  • So the dataset is: [2, 2, 2, ... (72 times), 2.5, 2.5, ... (9 times), 3, 3, ... (41 times), ...]

Note: X (without subscript) = entire dataset. Xi (with subscript) = individual value.

And as you know, you always order data from min to max before working with it. This shit's already ordered, so we're good.
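If you want to poke at this in Python, here's a quick sketch that expands the (Xi, Fi) pairs from the table back into the raw, ordered dataset (just the streak values shown above):

```python
# Expand the (Xi, Fi) pairs from the table back into the raw, ordered dataset.
streaks = [(2, 72), (2.5, 9), (3, 41), (3.5, 20), (4, 28), (4.5, 12),
           (5, 29), (5.5, 14), (6, 23), (6.5, 11), (7, 23), (7.5, 8), (8, 8)]

# Repeat each value Fi times; the pairs are sorted, so the output stays ordered.
data = [x for x, f in streaks for _ in range(f)]

print(len(data))    # 298 -- observations in the rows shown
print(data[:3])     # [2, 2, 2]
```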

3. Table Types: Disaggregated vs Aggregated 🧾↔️📦

Tables come in two flavors. Choose your poison:

1. Disaggregated Data / Raw Data Table

Every single datapoint is shown with its frequency.

Transparent, honest, and painful to look at if you have lots of data.

Used when you have distinct values (like chess streaks) or small datasets.

2. Aggregated Data Table

Data grouped into intervals (classes). You don't see individual values.

Tidy, clean, and hides the ugly details.

Two sub-flavors:

  • Same amplitude intervals: All classes have equal width
  • Different amplitude intervals: Classes have different widths

Now, why would you use aggregated? Because sometimes you're given data that way, or you have too much data to list individually. Or you're lazy. Probably lazy. 😴

4. Disaggregated Table – Full Transparency 🔍📋

Let's build the full table for our chess data. This is where we calculate everything.

Table Headers & Meanings:

  • Xi: Datapoint
  • Fi: Absolute Frequency
  • CumFi: Cumulative Absolute Frequency
  • fi: Relative Frequency (Fi ÷ total)
  • cumfi: Cumulative Relative Frequency
  • FiXi: For calculating mean
  • |Xi - μ|: Absolute deviation from mean
  • Fi|Xi - μ|: For Mean Absolute Deviation
  • (Xi - μ)²: Squared deviation
  • Fi(Xi - μ)²: For variance
  • Xi²: Squared value
  • FiXi²: Alternative variance calculation

Most of these are intermediate calculations you'll never look at again, but you need 'em to get the real stats.
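Here's a sketch of how those intermediate columns get computed from (value, frequency) pairs — toy numbers, NOT the chess data, so you can check the arithmetic by hand:

```python
# Sketch: building the table columns from (Xi, Fi) pairs. Toy data only.
pairs = [(2, 3), (3, 5), (4, 2)]               # (Xi, Fi)
n = sum(f for _, f in pairs)                   # total observations
mu = sum(x * f for x, f in pairs) / n          # mean from Σ(Fi·Xi) / n

rows, cum_f, cum_rel = [], 0, 0.0
for x, f in pairs:
    cum_f += f                                 # CumFi
    rel = f / n                                # fi
    cum_rel += rel                             # cumfi
    rows.append({"Xi": x, "Fi": f, "CumFi": cum_f, "fi": rel, "cumfi": cum_rel,
                 "FiXi": f * x,
                 "Fi|Xi-mu|": f * abs(x - mu),      # feeds the MAD
                 "Fi(Xi-mu)^2": f * (x - mu) ** 2,  # feeds the variance
                 "FiXi^2": f * x * x})              # alternative variance route

variance = sum(r["Fi(Xi-mu)^2"] for r in rows) / n
alt_variance = sum(r["FiXi^2"] for r in rows) / n - mu ** 2  # same number via FiXi²
print(round(mu, 4), round(variance, 4))   # 2.9 0.49
```

Both variance routes (ΣFi(Xi − μ)² ÷ n and ΣFiXi² ÷ n − μ²) land on the same number, which is exactly why the table carries both columns.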

The Full Disaggregated Table:

| Xi | Fi | CumFi | fi | cumfi | FiXi | \|Xi - μ\| | Fi\|Xi - μ\| | (Xi - μ)² | Fi(Xi - μ)² | Xi² | FiXi² |
|-----|-----|-------|--------|--------|----------|--------|----------|---------|-----------|---------|-----------|
| 2 | 72 | 72 | 0.1686 | 0.1686 | 144.0000 | 7.9801 | 574.5667 | 63.6819 | 4585.0964 | 4.0000 | 288.0000 |
| 2.5 | 9 | 81 | 0.0211 | 0.1897 | 22.5000 | 7.4801 | 67.3208 | 55.9518 | 503.5662 | 6.2500 | 56.2500 |
| 3 | 41 | 122 | 0.0960 | 0.2857 | 123.0000 | 6.9801 | 286.1838 | 48.7217 | 1997.5900 | 9.0000 | 369.0000 |
| 3.5 | 20 | 142 | 0.0468 | 0.3326 | 70.0000 | 6.4801 | 129.6019 | 41.9916 | 839.8323 | 12.2500 | 245.0000 |
| 4 | 28 | 170 | 0.0656 | 0.3981 | 112.0000 | 5.9801 | 167.4426 | 35.7615 | 1001.3226 | 16.0000 | 448.0000 |
| 4.5 | 12 | 182 | 0.0281 | 0.4262 | 54.0000 | 5.4801 | 65.7611 | 30.0314 | 360.3771 | 20.2500 | 243.0000 |
| 5 | 29 | 211 | 0.0679 | 0.4941 | 145.0000 | 4.9801 | 144.4227 | 24.8013 | 719.2387 | 25.0000 | 725.0000 |
| 5.5 | 14 | 225 | 0.0328 | 0.5269 | 77.0000 | 4.4801 | 62.7213 | 20.0712 | 280.9974 | 30.2500 | 423.5000 |
| 6 | 23 | 248 | 0.0539 | 0.5808 | 138.0000 | 3.9801 | 91.5422 | 15.8411 | 364.3464 | 36.0000 | 828.0000 |
| 6.5 | 11 | 259 | 0.0258 | 0.6066 | 71.5000 | 3.4801 | 38.2810 | 12.1111 | 133.2216 | 42.2500 | 464.7500 |
| 7 | 23 | 282 | 0.0539 | 0.6604 | 161.0000 | 2.9801 | 68.5422 | 8.8810 | 204.2620 | 49.0000 | 1127.0000 |
| 7.5 | 8 | 290 | 0.0187 | 0.6792 | 60.0000 | 2.4801 | 19.8407 | 6.1509 | 49.2069 | 56.2500 | 450.0000 |
| 8 | 8 | 298 | 0.0187 | 0.6979 | 64.0000 | 1.9801 | 15.8407 | 3.9208 | 31.3662 | 64.0000 | 512.0000 |
| … | … | … | … | … | … | … | … | … | … | … | … |

(Cut off at Xi = 8 — cumfi only reaches 0.6979 here. The longer streaks are omitted, which is why n = 427 in the stats below even though CumFi stops at 298.)

Key Observations:

  • Relative frequency (fi) tells you what percentage of the total each value represents. 2 appears 16.86% of the time.
  • Cumulative frequency shows running totals. By streak 5.5, you've seen 52.69% of all data.
  • This table lets you calculate everything: mean, variance, skewness, kurtosis, percentiles...

Descriptive Stats from This Table:

| Stat | Value | Stat | Value |
|------|-------|------|-------|
| n | 427.0000 | μ (mean) | 9.9801 |
| Mo (mode) | 2.0000 | Me (median) | 5.5000 |
| σ² (variance) | 242.2051 | σ (std dev) | 15.5629 |
| DAM (MAD) | 8.2875 | Q1 | 3.0000 |
| Q3 | 9.0000 | IQR | 6.0000 |
| CV | 1.5594 | Skewness (G) | 0.8636 |
| Kurtosis (K) | 0.2000 | Range | 111.5000 |

Interpretation: Positive skew (0.8636), leptokurtic (K < 0.263), mean > median > mode. Typical right-skewed distribution with a few long winning streaks pulling the mean up.

5. Aggregated Table – Binned & Hidden 📦🔢

Now for the lazy (or practical) approach: binning data into intervals.

The Problem:

You're given this table already aggregated. You don't know the individual values anymore.

Example: Interval [2-10] has 35 data points. Which 35 values? Fuck if I know. Could be 2, 3, 5.5, 9... anything in that range.

This is common in published data, surveys, or when someone already processed the data for you (probably poorly).

Example Aggregated Table:

| Class | Interval | Linf | Lsup | ai | Fi | CumFi | fi | cumfi | Ci | FiCi | fiCi | fiCi² | fi\|Ci - μ\| | hi = fi/ai | Hi = Fi/ai |
|-------|----------|------|------|----|----|-------|--------|--------|---------|----------|--------|----------|--------|--------|--------|
| 1 | [2-10] | 2 | 10 | 8 | 35 | 35 | 0.3500 | 0.3500 | 6.0000 | 210.0000 | 2.1000 | 12.6000 | 2.9138 | 0.0438 | 4.3750 |
| 2 | [10-15] | 10 | 15 | 5 | 25 | 60 | 0.2500 | 0.6000 | 12.5000 | 312.5000 | 3.1250 | 39.0625 | 0.4563 | 0.0500 | 5.0000 |
| 3 | [15-20] | 15 | 20 | 5 | 10 | 70 | 0.1000 | 0.7000 | 17.5000 | 175.0000 | 1.7500 | 30.6250 | 0.3175 | 0.0200 | 2.0000 |
| 4 | [20-25] | 20 | 25 | 5 | 20 | 90 | 0.2000 | 0.9000 | 22.5000 | 450.0000 | 4.5000 | 101.2500 | 1.6350 | 0.0400 | 4.0000 |
| 5 | [25-32] | 25 | 32 | 7 | 10 | 100 | 0.1000 | 1.0000 | 28.5000 | 285.0000 | 2.8500 | 81.2250 | 1.4175 | 0.0143 | 1.4286 |

Column Definitions:

  • Linf, Lsup: Lower/upper class limits
  • ai: Class width = Lsup - Linf
  • Ci: Class midpoint = (Linf + Lsup) ÷ 2
  • hi: Frequency density = fi ÷ ai (for histograms)
  • Hi: Absolute frequency density = Fi ÷ ai (not cumulative — check the table: 35 ÷ 8 = 4.3750)

Note: This is purely an academic exercise. In real life, why would you start with data already in a table? Because your professor is a sadist, that's why.
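The derived columns (ai, Ci, fi, hi) are purely mechanical, though. Here's a quick sketch recomputing them from nothing but the class limits and frequencies in the table above:

```python
# Recompute the derived columns of the aggregated table from (Linf, Lsup, Fi).
classes = [(2, 10, 35), (10, 15, 25), (15, 20, 10), (20, 25, 20), (25, 32, 10)]
n = sum(Fi for _, _, Fi in classes)            # total observations (100)

derived = []
for linf, lsup, Fi in classes:
    ai = lsup - linf                           # class width
    Ci = (linf + lsup) / 2                     # class midpoint
    fi = Fi / n                                # relative frequency
    derived.append((ai, Ci, fi, fi / ai))      # hi = fi / ai (histogram height)
    print(f"[{linf}-{lsup}]  ai={ai}  Ci={Ci}  fi={fi:.4f}  hi={fi / ai:.5f}")
```

The printed values should line up with the table columns (e.g., the first class gives ai = 8, Ci = 6, hi = 0.04375).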

6. Binning Methods – How Many Classes? 📏🔢

If you're aggregating raw data yourself, how many classes (intervals) should you create?

There are formulas. Take each with a grain of salt – they're mild suggestions, not divine truth.

Example Data:

  • N = 100 (sample size)
  • Range = 8
  • Q1 = 1, Q3 = 4, IQR = 3
  • Standard deviation = 3

Different Methods Give Different Answers:

| Method | Formula | Number of Classes |
|--------|---------|-------------------|
| Square Root Rule | √N | √100 = 10 |
| Sturges' Rule | 1 + 3.322·log₁₀(N) | 1 + 3.322·log₁₀(100) = 7.644 |
| Scott's Rule | Range ÷ (3.5·σ·N⁻¹/³) | 8 ÷ (3.5·3·100⁻¹/³) ≈ 3.536 |
| Freedman-Diaconis | Range ÷ (2·IQR·N⁻¹/³) | 8 ÷ (2·3·100⁻¹/³) ≈ 6.189 |

Which One to Use?

  • Square Root: Simple, okay for small datasets
  • Sturges: Classic, biased toward normal distributions
  • Scott: Good for normal data, uses standard deviation
  • Freedman-Diaconis: Robust, uses IQR (good for skewed data)

My advice: Try a few, see which gives meaningful intervals. Don't blindly follow formulas.
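If you'd rather let code do the arguing, here's a sketch of the four rules as functions. One heads-up: Scott and Freedman-Diaconis technically define a bin *width*; dividing the range by that width gives the class counts shown in the table.

```python
import math

# The four binning rules sketched as functions.
def sqrt_rule(n):
    return math.sqrt(n)

def sturges(n):
    return 1 + 3.322 * math.log10(n)

def scott_classes(rng, sigma, n):       # Range / bin width, width = 3.5·σ·N^(-1/3)
    return rng / (3.5 * sigma * n ** (-1 / 3))

def fd_classes(rng, iqr, n):            # Range / bin width, width = 2·IQR·N^(-1/3)
    return rng / (2 * iqr * n ** (-1 / 3))

# Plug in the example data (N=100, Range=8, sigma=3, IQR=3):
print(sqrt_rule(100))                       # 10.0
print(round(sturges(100), 3))               # 7.644
print(round(scott_classes(8, 3, 100), 3))   # 3.536
print(round(fd_classes(8, 3, 100), 3))      # 6.189
```

Four rules, four different answers — which is exactly why you treat them as suggestions, not gospel.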

7. Formulas Change – Know Your Math 🧮⚠️

PAY ATTENTION: The math changes when working with aggregated data.

You're not starting from raw data anymore – you're interpolating from binned data. Formulas are different.

Key Differences:

  • Mean: μ = Σ(fi·Ci), where Ci = class midpoint and fi = relative frequency (equivalently Σ(Fi·Ci) ÷ N)
  • Variance: σ² = Σ[fi·(Ci - μ)²] (using midpoints)
  • Percentiles/Median: Use linear interpolation within classes
  • Mode: Use formula based on adjacent class frequencies
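To make that concrete, here's a sketch computing the mean and variance from midpoints for the aggregated example table from section 5:

```python
# Mean and variance from class midpoints, per the formulas above.
# Classes (Linf, Lsup, Fi) come from the aggregated example table.
classes = [(2, 10, 35), (10, 15, 25), (15, 20, 10), (20, 25, 20), (25, 32, 10)]
n = sum(Fi for _, _, Fi in classes)

mids = [((lo + hi) / 2, Fi / n) for lo, hi, Fi in classes]   # (Ci, fi) pairs
mu = sum(Ci * fi for Ci, fi in mids)                         # μ = Σ(fi·Ci)
var = sum(fi * (Ci - mu) ** 2 for Ci, fi in mids)            # σ² = Σ[fi·(Ci-μ)²]
print(round(mu, 4), round(var, 4))   # 14.325 59.5569
```

Those match the μ and σ² in the stats table below — but remember they're approximations: every point in a class gets pretended onto its midpoint.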

Stats from Aggregated Table:

| Stat | Value | Interpretation |
|------|-------|----------------|
| N | 100.0000 | Total observations |
| μ | 14.3250 | Mean |
| Mo | 10.8621 | Mode (using interpolation formula) |
| Me | 13.0000 | Median |
| σ² | 59.5569 | Variance |
| σ | 7.7173 | Standard deviation |
| Q1 | 7.7143 | First quartile |
| Q3 | 21.2500 | Third quartile |
| Skewness (G1) | 0.4487 | Positive skew |
| Kurtosis (K) | 0.3267 | Platykurtic (K > 0.263) |

Percentile Calculation Example (Median):

P50 = 0.5 (50th percentile)

cum f(Me-1) = 0.3500 (cumulative frequency before median class)

f(Me) = 0.2500 (frequency of median class)

li(Me) = 10.0000 (lower limit of median class)

a(Me) = 5.0000 (width of median class)

Formula: Me = li(Me) + [(0.5 - cum f(Me-1)) ÷ f(Me)] × a(Me)

Me = 10 + [(0.5 - 0.35) ÷ 0.25] × 5 = 13.0000
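The same interpolation works for any percentile, not just the median. A sketch of the formula above as a function:

```python
# Linear-interpolation percentile for aggregated data, per the formula above.
# classes: (Linf, Lsup, fi) with fi = relative frequency, from the example table.
def percentile(classes, p):
    cum = 0.0                                  # cumulative fi before this class
    for linf, lsup, fi in classes:
        if cum + fi >= p:                      # p lands inside this class
            return linf + (p - cum) / fi * (lsup - linf)
        cum += fi
    return classes[-1][1]                      # p beyond the data: last upper limit

classes = [(2, 10, 0.35), (10, 15, 0.25), (15, 20, 0.10),
           (20, 25, 0.20), (25, 32, 0.10)]

print(round(percentile(classes, 0.50), 4))   # 13.0   (the median, as above)
print(round(percentile(classes, 0.25), 4))   # 7.7143 (Q1)
print(round(percentile(classes, 0.75), 4))   # 21.25  (Q3)
```

Q1 and Q3 here match the stats table — same formula, different p.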

Bottom line: If you use raw data formulas on aggregated data, you'll be wrong. Know which table type you have.

8. When to Use Which (Don't Be Stupid) 🤔✅

Use Disaggregated Tables When:

  • You have few distinct values (like chess streaks)
  • Data is already categorical or discrete
  • You need exact calculations (no approximation)
  • Sample size is small (< 100 observations)
  • You're presenting data transparently

Use Aggregated Tables When:

  • You have continuous data with many unique values
  • Sample size is large (> 100 observations)
  • Data is already given to you in intervals
  • You need to simplify for presentation
  • Creating histograms or density plots
  • You're lazy (most common reason)

Graph Choice Depends on Table Type:

  • Disaggregated discrete data: Bar charts, dot plots, stem-and-leaf
  • Aggregated continuous data: Histograms, frequency polygons, ogives
  • Mixed situations: Box plots work for both (using raw or binned data)

For our chess example (discrete): bar chart or stem-and-leaf. For the aggregated example: histogram or cumulative frequency graph.


🎯 Cheat Sheet & Next Steps

Key Concepts

  • Disaggregated Table: Shows every value with frequency
  • Aggregated Table: Groups data into intervals (classes)
  • Discrete ≠ Integers: Chess scores with .5 are still discrete
  • Binning Methods: √N, Sturges, Scott, Freedman-Diaconis
  • Formulas Change: Aggregated data uses different formulas

Table Headers

  • Xi: Datapoint
  • Fi: Absolute frequency
  • fi: Relative frequency
  • Ci: Class midpoint (aggregated)
  • ai: Class width

When to Use

  • Disaggregated: Small datasets, exact calculations, transparency
  • Aggregated: Large datasets, continuous data, simplification
  • Graphs: Match table type (bar vs histogram)

Common Mistakes

  • Using raw formulas on aggregated data
  • Blindly following binning rules
  • Confusing discrete decimal data with continuous
  • Forgetting to check cumulative columns

Next Steps

  • Learn: Histograms & frequency polygons
  • Practice: Convert raw data to aggregated and compare stats
  • Read: Exploratory Data Analysis (Tukey)

Ready for More? 🚀

Tables are just the beginning. Next, we'll visualize this shit with graphs that don't lie (unless you fuck up the table first).

Next tutorial: Histograms & Frequency Polygons – making your data look pretty or exposing its ugly truth.

Keep learning, and remember: Statistics is a superpower! 💪