Statistics Tutorial

Data Tables – Aggregated vs Disaggregated 📊🧮

Last updated: 2026-01-02

Applies to version: 1.0+

Reading time: ~12 min

Quick Summary / TL;DR

Tables are where your data goes to get organized or get lost forever. 😾

You have two choices: show every single datapoint (disaggregated) or group them into intervals (aggregated) like a lazy ass.

Disaggregated tables are raw, honest, and painful to look at. Aggregated tables are tidy, clean, and hide the ugly truth.

Choose wisely, 'cause if you fuck this up, your graphs will lie and your professor will fail you. 🚫📉

Let's dive in, dumbass.

Table of Contents

  1. Trick Question – Discrete or Continuous?
  2. The Chess Dataset – Wins, Draws, and .5 Bullshit
  3. Table Types: Disaggregated vs Aggregated
  4. Disaggregated Table – Full Transparency
  5. Aggregated Table – Binned & Hidden
  6. Binning Methods – How Many Classes?
  7. Formulas Change – Know Your Math
  8. When to Use Which (Don’t Be Stupid)
  9. Cheat Sheet & Next Steps

1. Trick Question – Discrete or Continuous? 🤔

Alright, check this out:

Dataset values: 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5...

Trick question: Are these discrete or continuous?

If you said "continuous", go back to tutorial #1, bruh. You're falling behind. 😤

If you said "can be either", okay – you're thinking, but still wrong.

The Truth:

This is chess data. A win = 1 point, a draw = 0.5 points.

So 2.5 means: 2 wins + 1 draw. You're counting games, not measuring height or temperature.

Counting = Discrete. Even with decimal points, you're still counting outcomes, not measuring a continuum.

Remember: if you're not measuring (heights, temps, weights), you're counting. Discrete ≠ only integers. Discrete = distinct, separate outcomes.

💡 Memory tip: Decimals don't automatically mean continuous. Think: "Can I have 2.3 of this thing in real life?" For chess games? No. For temperature? Yes.

2. The Chess Dataset – Wins, Draws, and .5 Bullshit ♟️📈

Here's the raw dataset I gathered from analyzing chess player streaks:

| Xi | Fi |
|-----|-----|
| 2 | 72 |
| 2.5 | 9 |
| 3 | 41 |
| 3.5 | 20 |
| 4 | 28 |
| 4.5 | 12 |
| 5 | 29 |
| 5.5 | 14 |
| 6 | 23 |
| 6.5 | 11 |
| 7 | 23 |
| 7.5 | 8 |
| 8 | 8 |

(These rows only cover streaks up to 8. The full dataset keeps going — the stats later use n = 427 — so don't panic when the frequencies here sum to 298.)
What this means:

  • Xi = The streak value (e.g., 2 = two consecutive wins)
  • Fi = Frequency (e.g., 72 players had a streak of 2 wins)
  • So the dataset is: [2, 2, 2, ... (72 times), 2.5, 2.5, ... (9 times), 3, 3, ... (41 times), ...]

Note: X (without subscript) = entire dataset. Xi (with subscript) = individual value.

And as you know, you always order data from min to max before working with it. This shit's already ordered, so we're good.
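If you want to poke at this in Python, here's a quick sketch that expands the (Xi, Fi) pairs from the table back into the raw, ordered dataset (just the streak values shown above):

```python
# Expand the (Xi, Fi) pairs from the table back into the raw, ordered dataset.
streaks = [(2, 72), (2.5, 9), (3, 41), (3.5, 20), (4, 28), (4.5, 12),
           (5, 29), (5.5, 14), (6, 23), (6.5, 11), (7, 23), (7.5, 8), (8, 8)]

# Repeat each value Fi times; the pairs are sorted, so the output stays ordered.
data = [x for x, f in streaks for _ in range(f)]

print(len(data))    # 298 -- observations in the rows shown
print(data[:3])     # [2, 2, 2]
```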

3. Table Types: Disaggregated vs Aggregated 🧾↔️📦

Tables come in two flavors. Choose your poison:

1. Disaggregated Data / Raw Data Table

Every single datapoint is shown with its frequency.

Transparent, honest, and painful to look at if you have lots of data.

Used when you have distinct values (like chess streaks) or small datasets.

2. Aggregated Data Table

Data grouped into intervals (classes). You don't see individual values.

Tidy, clean, and hides the ugly details.

Two sub-flavors:

  • Same amplitude intervals: All classes have equal width
  • Different amplitude intervals: Classes have different widths

Now, why would you use aggregated? Because sometimes you're given data that way, or you have too much data to list individually. Or you're lazy. Probably lazy. 😴

4. Disaggregated Table – Full Transparency 🔍📋

Let's build the full table for our chess data. This is where we calculate everything.

Table Headers & Meanings:

  • Xi: Datapoint
  • Fi: Absolute Frequency
  • CumFi: Cumulative Absolute Frequency
  • fi: Relative Frequency (Fi ÷ total)
  • cumfi: Cumulative Relative Frequency
  • FiXi: For calculating mean
  • |Xi - μ|: Absolute deviation from mean
  • Fi|Xi - μ|: For Mean Absolute Deviation
  • (Xi - μ)²: Squared deviation
  • Fi(Xi - μ)²: For variance
  • Xi²: Squared value
  • FiXi²: Alternative variance calculation

Most of these are intermediate calculations you'll never look at again, but you need 'em to get the real stats.
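Here's a sketch of how those intermediate columns get computed from (value, frequency) pairs — toy numbers, NOT the chess data, so you can check the arithmetic by hand:

```python
# Sketch: building the table columns from (Xi, Fi) pairs. Toy data only.
pairs = [(2, 3), (3, 5), (4, 2)]               # (Xi, Fi)
n = sum(f for _, f in pairs)                   # total observations
mu = sum(x * f for x, f in pairs) / n          # mean from Σ(Fi·Xi) / n

rows, cum_f, cum_rel = [], 0, 0.0
for x, f in pairs:
    cum_f += f                                 # CumFi
    rel = f / n                                # fi
    cum_rel += rel                             # cumfi
    rows.append({"Xi": x, "Fi": f, "CumFi": cum_f, "fi": rel, "cumfi": cum_rel,
                 "FiXi": f * x,
                 "Fi|Xi-mu|": f * abs(x - mu),      # feeds the MAD
                 "Fi(Xi-mu)^2": f * (x - mu) ** 2,  # feeds the variance
                 "FiXi^2": f * x * x})              # alternative variance route

variance = sum(r["Fi(Xi-mu)^2"] for r in rows) / n
alt_variance = sum(r["FiXi^2"] for r in rows) / n - mu ** 2  # same number via FiXi²
print(round(mu, 4), round(variance, 4))   # 2.9 0.49
```

Both variance routes (ΣFi(Xi − μ)² ÷ n and ΣFiXi² ÷ n − μ²) land on the same number, which is exactly why the table carries both columns.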

The Full Disaggregated Table:

| Xi | Fi | CumFi | fi | cumfi | FiXi | \|Xi - μ\| | Fi\|Xi - μ\| | (Xi - μ)² | Fi(Xi - μ)² | Xi² | FiXi² |
|-----|-----|-------|--------|--------|----------|--------|----------|---------|-----------|---------|-----------|
| 2 | 72 | 72 | 0.1686 | 0.1686 | 144.0000 | 7.9801 | 574.5667 | 63.6819 | 4585.0964 | 4.0000 | 288.0000 |
| 2.5 | 9 | 81 | 0.0211 | 0.1897 | 22.5000 | 7.4801 | 67.3208 | 55.9518 | 503.5662 | 6.2500 | 56.2500 |
| 3 | 41 | 122 | 0.0960 | 0.2857 | 123.0000 | 6.9801 | 286.1838 | 48.7217 | 1997.5900 | 9.0000 | 369.0000 |
| 3.5 | 20 | 142 | 0.0468 | 0.3326 | 70.0000 | 6.4801 | 129.6019 | 41.9916 | 839.8323 | 12.2500 | 245.0000 |
| 4 | 28 | 170 | 0.0656 | 0.3981 | 112.0000 | 5.9801 | 167.4426 | 35.7615 | 1001.3226 | 16.0000 | 448.0000 |
| 4.5 | 12 | 182 | 0.0281 | 0.4262 | 54.0000 | 5.4801 | 65.7611 | 30.0314 | 360.3771 | 20.2500 | 243.0000 |
| 5 | 29 | 211 | 0.0679 | 0.4941 | 145.0000 | 4.9801 | 144.4227 | 24.8013 | 719.2387 | 25.0000 | 725.0000 |
| 5.5 | 14 | 225 | 0.0328 | 0.5269 | 77.0000 | 4.4801 | 62.7213 | 20.0712 | 280.9974 | 30.2500 | 423.5000 |
| 6 | 23 | 248 | 0.0539 | 0.5808 | 138.0000 | 3.9801 | 91.5422 | 15.8411 | 364.3464 | 36.0000 | 828.0000 |
| 6.5 | 11 | 259 | 0.0258 | 0.6066 | 71.5000 | 3.4801 | 38.2810 | 12.1111 | 133.2216 | 42.2500 | 464.7500 |
| 7 | 23 | 282 | 0.0539 | 0.6604 | 161.0000 | 2.9801 | 68.5422 | 8.8810 | 204.2620 | 49.0000 | 1127.0000 |
| 7.5 | 8 | 290 | 0.0187 | 0.6792 | 60.0000 | 2.4801 | 19.8407 | 6.1509 | 49.2069 | 56.2500 | 450.0000 |
| 8 | 8 | 298 | 0.0187 | 0.6979 | 64.0000 | 1.9801 | 15.8407 | 3.9208 | 31.3662 | 64.0000 | 512.0000 |
| … | … | … | … | … | … | … | … | … | … | … | … |

(Cut off at Xi = 8 — cumfi only reaches 0.6979 here. The longer streaks are omitted, which is why n = 427 in the stats below even though CumFi stops at 298.)

Key Observations:

  • Relative frequency (fi) tells you what percentage of the total each value represents. 2 appears 16.86% of the time.
  • Cumulative frequency shows running totals. By streak 5.5, you've seen 52.69% of all data.
  • This table lets you calculate everything: mean, variance, skewness, kurtosis, percentiles...

Descriptive Stats from This Table:

| Stat | Value | Stat | Value |
|------|-------|------|-------|
| n | 427.0000 | μ (mean) | 9.9801 |
| Mo (mode) | 2.0000 | Me (median) | 5.5000 |
| σ² (variance) | 242.2051 | σ (std dev) | 15.5629 |
| DAM (MAD) | 8.2875 | Q1 | 3.0000 |
| Q3 | 9.0000 | IQR | 6.0000 |
| CV | 1.5594 | Skewness (G) | 0.8636 |
| Kurtosis (K) | 0.2000 | Range | 111.5000 |

Interpretation: Positive skew (0.8636), leptokurtic (K < 0.263), mean > median > mode. Typical right-skewed distribution with a few long winning streaks pulling the mean up.

5. Aggregated Table – Binned & Hidden 📦🔢

Now for the lazy (or practical) approach: binning data into intervals.

The Problem:

You're given this table already aggregated. You don't know the individual values anymore.

Example: Interval [2-10] has 35 data points. Which 35 values? Fuck if I know. Could be 2, 3, 5.5, 9... anything in that range.

This is common in published data, surveys, or when someone already processed the data for you (probably poorly).

Example Aggregated Table:

| Class | Interval | Linf | Lsup | ai | Fi | CumFi | fi | cumfi | Ci | FiCi | fiCi | fiCi² | fi\|Ci - μ\| | hi = fi/ai | Hi = Fi/ai |
|-------|----------|------|------|----|----|-------|--------|--------|---------|----------|--------|----------|--------|--------|--------|
| 1 | [2-10] | 2 | 10 | 8 | 35 | 35 | 0.3500 | 0.3500 | 6.0000 | 210.0000 | 2.1000 | 12.6000 | 2.9138 | 0.0438 | 4.3750 |
| 2 | [10-15] | 10 | 15 | 5 | 25 | 60 | 0.2500 | 0.6000 | 12.5000 | 312.5000 | 3.1250 | 39.0625 | 0.4563 | 0.0500 | 5.0000 |
| 3 | [15-20] | 15 | 20 | 5 | 10 | 70 | 0.1000 | 0.7000 | 17.5000 | 175.0000 | 1.7500 | 30.6250 | 0.3175 | 0.0200 | 2.0000 |
| 4 | [20-25] | 20 | 25 | 5 | 20 | 90 | 0.2000 | 0.9000 | 22.5000 | 450.0000 | 4.5000 | 101.2500 | 1.6350 | 0.0400 | 4.0000 |
| 5 | [25-32] | 25 | 32 | 7 | 10 | 100 | 0.1000 | 1.0000 | 28.5000 | 285.0000 | 2.8500 | 81.2250 | 1.4175 | 0.0143 | 1.4286 |

Column Definitions:

  • Linf, Lsup: Lower/upper class limits
  • ai: Class width = Lsup - Linf
  • Ci: Class midpoint = (Linf + Lsup) ÷ 2
  • hi: Frequency density = fi ÷ ai (for histograms)
  • Hi: Absolute frequency density = Fi ÷ ai (not cumulative — check the table: 35 ÷ 8 = 4.3750)

Note: This is purely an academic exercise. In real life, why would you start with data already in a table? Because your professor is a sadist, that's why.
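The derived columns (ai, Ci, fi, hi) are purely mechanical, though. Here's a quick sketch recomputing them from nothing but the class limits and frequencies in the table above:

```python
# Recompute the derived columns of the aggregated table from (Linf, Lsup, Fi).
classes = [(2, 10, 35), (10, 15, 25), (15, 20, 10), (20, 25, 20), (25, 32, 10)]
n = sum(Fi for _, _, Fi in classes)            # total observations (100)

derived = []
for linf, lsup, Fi in classes:
    ai = lsup - linf                           # class width
    Ci = (linf + lsup) / 2                     # class midpoint
    fi = Fi / n                                # relative frequency
    derived.append((ai, Ci, fi, fi / ai))      # hi = fi / ai (histogram height)
    print(f"[{linf}-{lsup}]  ai={ai}  Ci={Ci}  fi={fi:.4f}  hi={fi / ai:.5f}")
```

The printed values should line up with the table columns (e.g., the first class gives ai = 8, Ci = 6, hi = 0.04375).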

6. Binning Methods – How Many Classes? 📏🔢

If you're aggregating raw data yourself, how many classes (intervals) should you create?

There are formulas. Take each with a grain of salt – they're mild suggestions, not divine truth.

Example Data:

  • N = 100 (sample size)
  • Range = 8
  • Q1 = 1, Q3 = 4, IQR = 3
  • Standard deviation = 3

Different Methods Give Different Answers:

| Method | Formula | Number of Classes |
|--------|---------|-------------------|
| Square Root Rule | √N | √100 = 10 |
| Sturges' Rule | 1 + 3.322·log₁₀(N) | 1 + 3.322·log₁₀(100) = 7.644 |
| Scott's Rule | Range ÷ (3.5·σ·N⁻¹/³) | 8 ÷ (3.5·3·100⁻¹/³) ≈ 3.536 |
| Freedman-Diaconis | Range ÷ (2·IQR·N⁻¹/³) | 8 ÷ (2·3·100⁻¹/³) ≈ 6.189 |

Which One to Use?

  • Square Root: Simple, okay for small datasets
  • Sturges: Classic, biased toward normal distributions
  • Scott: Good for normal data, uses standard deviation
  • Freedman-Diaconis: Robust, uses IQR (good for skewed data)

My advice: Try a few, see which gives meaningful intervals. Don't blindly follow formulas.
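If you'd rather let code do the arguing, here's a sketch of the four rules as functions. One heads-up: Scott and Freedman-Diaconis technically define a bin *width*; dividing the range by that width gives the class counts shown in the table.

```python
import math

# The four binning rules sketched as functions.
def sqrt_rule(n):
    return math.sqrt(n)

def sturges(n):
    return 1 + 3.322 * math.log10(n)

def scott_classes(rng, sigma, n):       # Range / bin width, width = 3.5·σ·N^(-1/3)
    return rng / (3.5 * sigma * n ** (-1 / 3))

def fd_classes(rng, iqr, n):            # Range / bin width, width = 2·IQR·N^(-1/3)
    return rng / (2 * iqr * n ** (-1 / 3))

# Plug in the example data (N=100, Range=8, sigma=3, IQR=3):
print(sqrt_rule(100))                       # 10.0
print(round(sturges(100), 3))               # 7.644
print(round(scott_classes(8, 3, 100), 3))   # 3.536
print(round(fd_classes(8, 3, 100), 3))      # 6.189
```

Four rules, four different answers — which is exactly why you treat them as suggestions, not gospel.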

7. Formulas Change – Know Your Math 🧮⚠️

PAY ATTENTION: The math changes when working with aggregated data.

You're not starting from raw data anymore – you're interpolating from binned data. Formulas are different.

Key Differences:

  • Mean: μ = Σ(fi·Ci), where Ci = class midpoint and fi = relative frequency (equivalently Σ(Fi·Ci) ÷ N)
  • Variance: σ² = Σ[fi·(Ci - μ)²] (using midpoints)
  • Percentiles/Median: Use linear interpolation within classes
  • Mode: Use formula based on adjacent class frequencies
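To make that concrete, here's a sketch computing the mean and variance from midpoints for the aggregated example table from section 5:

```python
# Mean and variance from class midpoints, per the formulas above.
# Classes (Linf, Lsup, Fi) come from the aggregated example table.
classes = [(2, 10, 35), (10, 15, 25), (15, 20, 10), (20, 25, 20), (25, 32, 10)]
n = sum(Fi for _, _, Fi in classes)

mids = [((lo + hi) / 2, Fi / n) for lo, hi, Fi in classes]   # (Ci, fi) pairs
mu = sum(Ci * fi for Ci, fi in mids)                         # μ = Σ(fi·Ci)
var = sum(fi * (Ci - mu) ** 2 for Ci, fi in mids)            # σ² = Σ[fi·(Ci-μ)²]
print(round(mu, 4), round(var, 4))   # 14.325 59.5569
```

Those match the μ and σ² in the stats table below — but remember they're approximations: every point in a class gets pretended onto its midpoint.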

Stats from Aggregated Table:

| Stat | Value | Interpretation |
|------|-------|----------------|
| N | 100.0000 | Total observations |
| μ | 14.3250 | Mean |
| Mo | 10.8621 | Mode (using interpolation formula) |
| Me | 13.0000 | Median |
| σ² | 59.5569 | Variance |
| σ | 7.7173 | Standard deviation |
| Q1 | 7.7143 | First quartile |
| Q3 | 21.2500 | Third quartile |
| Skewness (G1) | 0.4487 | Positive skew |
| Kurtosis (K) | 0.3267 | Platykurtic (K > 0.263) |

Percentile Calculation Example (Median):

P50 = 0.5 (50th percentile)

cum f(Me-1) = 0.3500 (cumulative frequency before median class)

f(Me) = 0.2500 (frequency of median class)

li(Me) = 10.0000 (lower limit of median class)

a(Me) = 5.0000 (width of median class)

Formula: Me = li(Me) + [(0.5 - cum f(Me-1)) ÷ f(Me)] × a(Me)

Me = 10 + [(0.5 - 0.35) ÷ 0.25] × 5 = 13.0000
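The same interpolation works for any percentile, not just the median. A sketch of the formula above as a function:

```python
# Linear-interpolation percentile for aggregated data, per the formula above.
# classes: (Linf, Lsup, fi) with fi = relative frequency, from the example table.
def percentile(classes, p):
    cum = 0.0                                  # cumulative fi before this class
    for linf, lsup, fi in classes:
        if cum + fi >= p:                      # p lands inside this class
            return linf + (p - cum) / fi * (lsup - linf)
        cum += fi
    return classes[-1][1]                      # p beyond the data: last upper limit

classes = [(2, 10, 0.35), (10, 15, 0.25), (15, 20, 0.10),
           (20, 25, 0.20), (25, 32, 0.10)]

print(round(percentile(classes, 0.50), 4))   # 13.0   (the median, as above)
print(round(percentile(classes, 0.25), 4))   # 7.7143 (Q1)
print(round(percentile(classes, 0.75), 4))   # 21.25  (Q3)
```

Q1 and Q3 here match the stats table — same formula, different p.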

Bottom line: If you use raw data formulas on aggregated data, you'll be wrong. Know which table type you have.

8. When to Use Which (Don't Be Stupid) 🤔✅

Use Disaggregated Tables When:

  • You have few distinct values (like chess streaks)
  • Data is already categorical or discrete
  • You need exact calculations (no approximation)
  • Sample size is small (< 100 observations)
  • You're presenting data transparently

Use Aggregated Tables When:

  • You have continuous data with many unique values
  • Sample size is large (> 100 observations)
  • Data is already given to you in intervals
  • You need to simplify for presentation
  • Creating histograms or density plots
  • You're lazy (most common reason)

Graph Choice Depends on Table Type:

  • Disaggregated discrete data: Bar charts, dot plots, stem-and-leaf
  • Aggregated continuous data: Histograms, frequency polygons, ogives
  • Mixed situations: Box plots work for both (using raw or binned data)

For our chess example (discrete): bar chart or stem-and-leaf. For the aggregated example: histogram or cumulative frequency graph.


🎯 Cheat Sheet & Next Steps

Key Concepts

  • Disaggregated Table: Shows every value with frequency
  • Aggregated Table: Groups data into intervals (classes)
  • Discrete ≠ Integers: Chess scores with .5 are still discrete
  • Binning Methods: √N, Sturges, Scott, Freedman-Diaconis
  • Formulas Change: Aggregated data uses different formulas

Table Headers

  • Xi: Datapoint
  • Fi: Absolute frequency
  • fi: Relative frequency
  • Ci: Class midpoint (aggregated)
  • ai: Class width

When to Use

  • Disaggregated: Small datasets, exact calculations, transparency
  • Aggregated: Large datasets, continuous data, simplification
  • Graphs: Match table type (bar vs histogram)

Common Mistakes

  • Using raw formulas on aggregated data
  • Blindly following binning rules
  • Confusing discrete decimal data with continuous
  • Forgetting to check cumulative columns

Next Steps

  • Learn: Histograms & frequency polygons
  • Practice: Convert raw data to aggregated and compare stats
  • Read: Exploratory Data Analysis (Tukey)

Ready for More? 🚀

Tables are just the beginning. Next, we'll visualize this shit with graphs that don't lie (unless you fuck up the table first).

Next tutorial: Histograms & Frequency Polygons – making your data look pretty or exposing its ugly truth.

Keep learning, and remember: Statistics is a superpower! 💪