Sughosh Dixit
Sughosh P Dixit
2025-11-1516 min read

Day 15 — Percentiles as Thresholds: Drawing Lines in the Sand

Day 15 of 30
Continue from your last DS day in Learning Path
Article Header Image

TL;DR

Quick summary

Percentiles provide powerful, interpretable thresholds for decision-making without distributional assumptions. Learn how to use percentiles as cutoffs for loan approvals, performance rankings, and anomaly detection—turning any feature into a ranked decision rule.

Key takeaways
  • Day 15 — Percentiles as Thresholds: Drawing Lines in the Sand
Preview

Day 15 — Percentiles as Thresholds: Drawing Lines in the Sand

Percentiles provide powerful, interpretable thresholds for decision-making without distributional assumptions. Learn how to use percentiles as cutoffs for loan approvals, performance rankings, and anomaly detection—turning any feature into a ranked decision rule.

Day 15 — Percentiles as Thresholds: Drawing Lines in the Sand

Turn any feature into an interpretable, ranked decision threshold without complex modeling.

Percentiles provide powerful, interpretable thresholds for decision-making without distributional assumptions, perfect for quota-based decisions and relative rankings.

Note: This article uses technical terms and abbreviations. For definitions, check out the Key Terms & Glossary page.


The Decision-Making Problem

Imagine you're a bank loan officer looking at 10,000 credit applications:

The data:

Credit Scores: 450, 520, 580, 620, 650, 680, 720, 750, 780, 850...

Your job: Draw a line somewhere and decide:

  • Above the line → APPROVE

  • Below the line → REJECT

Questions:

  1. Where should that line be?

  2. How do we justify the cutoff?

  3. What if we want to approve the "top 20%"?

Enter: Percentiles as thresholds!


What Are Percentiles?

Percentile (also called quantile when expressed as proportion):

Definition: The p-th percentile is the value below which p% of the data falls.

Visual Intuition

Imagine 100 people sorted by height:

Show code (10 lines)
Shortest → Tallest

Person 1, 2, 3, ..., 50, ..., 75, ..., 90, ..., 100

10th percentile: Person #10's height (10% are shorter)

50th percentile: Person #50's height (50% are shorter) ← MEDIAN

90th percentile: Person #90's height (90% are shorter)

Key insight: Percentiles divide sorted data into "below" and "above" groups!


Common Percentiles You Know

The Classic Names

Show code (10 lines)
0th percentile   = Minimum

25th percentile  = Q₁ (First Quartile)

50th percentile  = Median (Q₂)

75th percentile  = Q₃ (Third Quartile)

100th percentile = Maximum

Business Examples

Income distribution:

10th percentile: $25,000 (10% earn less)

50th percentile: $55,000 (median income)

90th percentile: $150,000 (top 10% starts here)

99th percentile: $500,000 (the 1%!)

Test scores:

10th percentile: 45 (struggling students)

50th percentile: 72 (average)

90th percentile: 92 (top performers)

Website load times:

50th percentile: 1.2 seconds (typical user)

95th percentile: 3.5 seconds (slow experience)

99th percentile: 8.0 seconds (really bad!)

The Math: Quantile Function

Percentile Definition (Precise)

For data x₁, x₂, ..., xₙ (sorted in ascending order):

The p-th percentile (where p is between 0 and 100):

Show code (10 lines)
Position = (p/100) × (n + 1)

If position is an integer k:

Percentile = xₖ

If position = k + fraction:

Percentile = xₖ + fraction × (xₖ₊₁ - xₖ)  [Linear interpolation]

Example Calculation

Data: [10, 20, 30, 40, 50, 60, 70, 80, 90, 100] (n = 10)

Find 25th percentile (Q₁):

Show code (16 lines)
Position = (25/100) × (10 + 1) = 0.25 × 11 = 2.75

This is between position 2 and 3:

x₂ = 20

x₃ = 30

Q₁ = 20 + 0.75 × (30 - 20)

= 20 + 0.75 × 10

= 20 + 7.5

= 27.5

Find 50th percentile (Median):

Show code (14 lines)
Position = (50/100) × 11 = 5.5

Between position 5 and 6:

x₅ = 50

x₆ = 60

Median = 50 + 0.5 × (60 - 50)

= 50 + 5

= 55

Find 75th percentile (Q₃):

Show code (14 lines)
Position = (75/100) × 11 = 8.25

Between position 8 and 9:

x₈ = 80

x₉ = 90

Q₃ = 80 + 0.25 × (90 - 80)

= 80 + 2.5

= 82.5

Quantile Function Notation

Q(p) = quantile function at proportion p (where p ∈ [0, 1])

Q(0.25) = 25th percentile = Q₁

Q(0.50) = 50th percentile = Median

Q(0.75) = 75th percentile = Q₃

Key property: Q(p) is monotone increasing

If p₁ < p₂, then Q(p₁) ≤ Q(p₂)

Translation: Higher percentiles always give higher (or equal) values! ↗

Using Percentiles as Thresholds

The Decision Framework

Goal: Create decision rules based on data distribution

Method: Pick a percentile → That's your cutoff!

Example: Loan Approval

Data: 10,000 credit scores ranging from 300 to 850

Strategy 1: Approve top 20%

Show code (10 lines)
Threshold = 80th percentile

Calculate Q(0.80) = 720

Rule: If credit score ≥ 720 → APPROVE

If credit score < 720 → REJECT

Result: Exactly 20% of applicants approved

Strategy 2: Conservative approach (top 10%)

Show code (10 lines)
Threshold = 90th percentile

Calculate Q(0.90) = 780

Rule: If credit score ≥ 780 → APPROVE

If credit score < 780 → REJECT

Result: Only 10% approved (more selective!)

Strategy 3: Aggressive approach (top 50%)

Show code (10 lines)
Threshold = 50th percentile (median)

Calculate Q(0.50) = 650

Rule: If credit score ≥ 650 → APPROVE

If credit score < 650 → REJECT

Result: Half of applicants approved

Threshold Comparison
Threshold Comparison

Different threshold strategies yield different approval rates—choose based on your risk tolerance and business goals.

Try It: Interactive Threshold Tuner

Percentile Threshold Tuner

Pick a percentile to set a decision threshold. See how the acceptance rate changes if you approve scores above the threshold.

Provide any numeric scores: risk, anomaly, quality, etc.
0%90%100%
Decision:
Samples
20
Threshold @ 90%
88.20
Acceptance rate
10.0%
Distribution preview
56.065.074.083.092.0Values
Percentiles are monotone: increasing p raises the threshold for ascending data and typically reduces acceptance (when accepting ≥ threshold).

Visual: Percentile Ladder

Show code (22 lines)
Credit Score Distribution (10,000 applicants)

850  99th percentile (top 1%)

780  90th percentile (top 10%)

720  80th percentile (top 20%)

680  75th percentile (Q₃)

650  50th percentile (Median)

620  25th percentile (Q₁)

580  20th percentile

520  10th percentile

450  1st percentile (bottom 1%)

Pick your rung on the ladder = Pick your threshold!

Percentile Ladder
Percentile Ladder

The percentile ladder shows how different thresholds correspond to different approval rates—pick your rung!


Why Percentiles? The Benefits

1. Distribution-Free (Nonparametric)

No assumptions needed!

Show code (10 lines)
Don't need: Normality, specific distribution shape

Works for: Any distribution (skewed, multimodal, weird!)

Example: Income data (highly skewed)

→ Mean = $120K (pulled up by billionaires)

→ 80th percentile = $95K (robust, interpretable)

2. Interpretable

Everyone understands rankings!

"We approve the top 20%"

vs

"We approve scores ≥ 1.5 standard deviations above mean"

Which is clearer? The first!

3. Robust to Outliers

Extreme values don't affect percentiles much:

Show code (12 lines)
Data: [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

75th percentile = 82.5

Change 100 → 10,000 (outlier!):

Data: [10, 20, 30, 40, 50, 60, 70, 80, 90, 10000]

75th percentile = 82.5 (unchanged!)

The median and nearby percentiles are stable!

4. Direct Control Over Approval Rate

Want to approve 15% of applicants?

Simply set threshold = 85th percentile

Guaranteed to approve exactly 15%!

With parametric methods (z-scores):

z > 1.96 → What % does this approve?

Depends on the distribution!

Need calculations, assumptions, uncertainty...

5. Easy to Adjust

Market changes, need to approve more loans?

Before: 80th percentile (top 20%)

After: 70th percentile (top 30%)

Just slide the threshold down! Simple.

The Monotonicity Property

Theorem: For any dataset, Q(p) is a monotone increasing function.

Formal statement:

If p₁ < p₂, then Q(p₁) ≤ Q(p₂)

In words: Higher percentile → Higher (or equal) threshold value

Why This Matters

Consequence 1: Predictable behavior

If you increase the percentile, the cutoff can ONLY go up (or stay same)

Never goes down!

Consequence 2: Nested approval sets

Show code (10 lines)
People approved at 90th percentile

⊆ People approved at 80th percentile

⊆ People approved at 70th percentile

Top 10% ⊆ Top 20% ⊆ Top 30%

More selective thresholds are subsets!

Consequence 3: Safe threshold adjustment

Increasing percentile → Fewer approvals → Lower risk

Decreasing percentile → More approvals → Higher risk

Direction is guaranteed!

Exercise: Prove Monotonicity

Claim: For sorted data x₁ ≤ x₂ ≤ ... ≤ xₙ, if p₁ < p₂, then Q(p₁) ≤ Q(p₂)

Proof:

Case 1: Integer Positions

Setup:

Show code (10 lines)
Position for p₁: k₁ = ⌊(p₁/100) × (n+1)⌋

Position for p₂: k₂ = ⌊(p₂/100) × (n+1)⌋

Since p₁ < p₂:

(p₁/100) × (n+1) < (p₂/100) × (n+1)

Therefore: k₁ ≤ k₂

Since data is sorted (x₁ ≤ x₂ ≤ ... ≤ xₙ):

Show code (10 lines)
Q(p₁) = xₖ₁

Q(p₂) = xₖ₂

Since k₁ ≤ k₂ and data is sorted:

xₖ₁ ≤ xₖ₂

Therefore: Q(p₁) ≤ Q(p₂)

Case 2: Non-Integer Positions (Interpolation)

Setup:

Position for p₁: pos₁ = (p₁/100) × (n+1) = k₁ + f₁

Position for p₂: pos₂ = (p₂/100) × (n+1) = k₂ + f₂

Where k₁, k₂ are integer parts, f₁, f₂ are fractional parts

Interpolation formula:

Q(p₁) = xₖ₁ + f₁ × (xₖ₁₊₁ - xₖ₁)

Q(p₂) = xₖ₂ + f₂ × (xₖ₂₊₁ - xₖ₂)

Sub-case 2a: k₁ = k₂ (same interval)

Show code (14 lines)
If k₁ = k₂ and p₁ < p₂:

Then pos₁ < pos₂, so f₁ < f₂

Q(p₁) = xₖ + f₁ × (xₖ₊₁ - xₖ)

Q(p₂) = xₖ + f₂ × (xₖ₊₁ - xₖ)

Since f₁ < f₂ and (xₖ₊₁ - xₖ) ≥ 0 (sorted):

f₁ × (xₖ₊₁ - xₖ) ≤ f₂ × (xₖ₊₁ - xₖ)

Therefore: Q(p₁) ≤ Q(p₂)

Sub-case 2b: k₁ < k₂ (different intervals)

Show code (10 lines)
Since k₁ < k₂ and data sorted:

xₖ₁ ≤ xₖ₂

Also, since 0 ≤ f₁, f₂ < 1:

xₖ₁ ≤ xₖ₁ + f₁ × (xₖ₁₊₁ - xₖ₁) ≤ xₖ₁₊₁ ≤ ... ≤ xₖ₂

Therefore: Q(p₁) ≤ Q(p₂)

Conclusion: In all cases, Q(p₁) ≤ Q(p₂) when p₁ < p₂!

Practical Demonstration

Data: [10, 15, 20, 25, 30, 35, 40, 45, 50, 55] (n=10)

Test increasing percentiles:

Show code (9 lines)
import numpy as np

data = [10, 15, 20, 25, 30, 35, 40, 45, 50, 55]
percentiles = [10, 20, 30, 40, 50, 60, 70, 80, 90]

for p in percentiles:
threshold = np.percentile(data, p)
print(f"{p}th percentile: {threshold}")

Output:

Show code (20 lines)
10th percentile: 10.5

20th percentile: 13.0    (≥ 10.5)

30th percentile: 17.0    (≥ 13.0)

40th percentile: 23.0    (≥ 17.0)

50th percentile: 30.0    (≥ 23.0)

60th percentile: 37.0    (≥ 30.0)

70th percentile: 42.0    (≥ 37.0)

80th percentile: 47.0    (≥ 42.0)

90th percentile: 52.0    (≥ 47.0)

Every step up in percentile → threshold increases (or stays same)!

Monotonicity Demonstration
Monotonicity Demonstration

The monotonicity property ensures that higher percentiles always yield higher (or equal) threshold values—guaranteed!


Implementation: compute_percentile_thresholds

Show code (53 lines)
import numpy as np
import pandas as pd

def compute_percentile_thresholds(data, percentiles=[25, 50, 75, 90, 95, 99]):
"""
Compute threshold values at specified percentiles

Parameters:
- data: array-like, feature values
- percentiles: list of percentiles to compute

Returns:
- DataFrame with percentiles and corresponding thresholds
"""

# Sort data for visualization
sorted_data = np.sort(data)
n = len(data)

results = []

for p in percentiles:
# Compute threshold
threshold = np.percentile(data, p)

# Count values above and below
n_below = np.sum(data < threshold)
n_above = np.sum(data >= threshold)

# Percentage above (approval rate if used as cutoff)
pct_above = (n_above / n) * 100

results.append({
'percentile': p,
'threshold': threshold,
'n_below': n_below,
'n_above': n_above,
'approval_rate_%': pct_above
})

return pd.DataFrame(results)

# Example usage
np.random.seed(42)
credit_scores = np.random.normal(650, 100, 10000).clip(300, 850)

thresholds_df = compute_percentile_thresholds(
credit_scores,
percentiles=[50, 75, 80, 85, 90, 95, 99]
)

print(thresholds_df)

Output:

Show code (16 lines)
percentile  threshold  n_below  n_above  approval_rate_%

0          50     649.73     5000     5000            50.00

1          75     717.04     7500     2500            25.00

2          80     733.56     8000     2000            20.00

3          85     751.39     8500     1500            15.00

4          90     777.82     9000     1000            10.00

5          95     814.56     9500      500             5.00

6          99     882.19     9900      100             1.00

Interpretation:

  • Want to approve 20%? Set threshold at 733.56 (80th percentile)

  • Want top 5%? Set threshold at 814.56 (95th percentile)


Visualizing Percentile Thresholds

Percentile Ladder on Histogram

Show code (40 lines)
import matplotlib.pyplot as plt

def visualize_percentile_thresholds(data, percentiles=[50, 75, 90, 95]):
"""
Visualize data distribution with percentile thresholds marked
"""
fig, ax = plt.subplots(figsize=(12, 6))

# Histogram
ax.hist(data, bins=50, alpha=0.6, color='skyblue', edgecolor='black')

# Add percentile lines
colors = ['green', 'orange', 'red', 'darkred']

for p, color in zip(percentiles, colors):
threshold = np.percentile(data, p)
ax.axvline(threshold, color=color, linestyle='--', linewidth=2,
label=f'{p}th percentile: {threshold:.1f}')

# Add approval rate annotation
approval_rate = 100 - p
ax.text(threshold, ax.get_ylim()[1]*0.9,
f'Top {approval_rate:.0f}%',
rotation=90, verticalalignment='bottom',
color=color, fontweight='bold')

ax.set_xlabel('Credit Score', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)
ax.set_title('Credit Score Distribution with Percentile Thresholds',
fontsize=14, fontweight='bold')
ax.legend(loc='upper left')
ax.grid(alpha=0.3)

plt.tight_layout()
return fig

# Generate and plot
fig = visualize_percentile_thresholds(credit_scores, [50, 75, 85, 90, 95])
plt.show()

Sorted Values with Percentile Markers

Show code (38 lines)
def plot_sorted_with_percentiles(data, percentiles=[25, 50, 75, 90, 95]):
"""
Plot sorted data values with percentile positions marked
"""
sorted_data = np.sort(data)
n = len(data)

fig, ax = plt.subplots(figsize=(12, 6))

# Plot sorted values
ax.plot(range(n), sorted_data, color='blue', alpha=0.5, linewidth=1)

# Mark percentiles
colors = ['green', 'yellow', 'orange', 'red', 'darkred']

for p, color in zip(percentiles, colors):
threshold = np.percentile(data, p)
position = int((p/100) * n)

ax.scatter(position, threshold, color=color, s=200,
zorder=5, edgecolor='black', linewidth=2)
ax.axhline(threshold, color=color, linestyle=':', alpha=0.5)
ax.text(n*0.02, threshold, f'{p}th: {threshold:.1f}',
fontsize=10, color=color, fontweight='bold',
bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))

ax.set_xlabel('Rank (sorted position)', fontsize=12)
ax.set_ylabel('Credit Score', fontsize=12)
ax.set_title('Sorted Credit Scores with Percentile Markers',
fontsize=14, fontweight='bold')
ax.grid(alpha=0.3)

plt.tight_layout()
return fig

fig = plot_sorted_with_percentiles(credit_scores)
plt.show()

Visual shows:

Show code (22 lines)
Credit Score

850                                           • 99th

800                                     • 95th

750                              • 90th

700                       • 75th

650              • 50th

600        • 25th

•

300 → Rank

0         2500      5000      7500    10000

The curve shows monotonicity visually!

Advanced: Percentile-at-Value

Sometimes you want the reverse: "What percentile is this score?"

Show code (26 lines)
def compute_percentile_at_value(data, value):
"""
Compute what percentile a given value corresponds to

Parameters:
- data: array-like
- value: scalar value to find percentile for

Returns:
- percentile: where this value falls (0-100)
"""
n = len(data)
n_below = np.sum(data < value)
n_equal = np.sum(data == value)

# Midpoint method: count half of ties
percentile = ((n_below + n_equal/2) / n) * 100

return percentile

# Example
score = 720
percentile = compute_percentile_at_value(credit_scores, score)
print(f"A score of {score} is at the {percentile:.1f}th percentile")
print(f"This is better than {percentile:.1f}% of applicants")

Output:

A score of 720 is at the 75.3th percentile

This is better than 75.3% of applicants

Use cases:

  • "You scored 720, which beats 75% of applicants!"

  • "This transaction is at the 99.5th percentile for amount"

  • "Your response time is at the 5th percentile (very fast!)"


When to Use Percentile Thresholds

Perfect For:

Quota-based decisions

"Approve top 1,000 applicants"

→ Use 90th percentile (10% of 10,000)

Relative performance

"Bonus for top 20% performers"

→ Use 80th percentile of sales

Anomaly detection

"Flag transactions above 99th percentile"

→ Catches unusual amounts

Resource allocation

"Prioritize cases in top 30% of urgency"

→ Use 70th percentile of urgency score

SLA definitions

"95% of requests must complete under X seconds"

→ Use 95th percentile as threshold

Don't Use When:

Absolute standards matter

"Approve everyone with score > 700"

Not: "Approve top 20%"

If the standard is fixed, percentiles aren't appropriate!

Small sample sizes

n = 10: Percentiles unstable

Better: Use all data or simple cutoffs

Need theoretical justification

"Why 80th percentile?"

"Because... we decided?"

Percentiles are data-driven but somewhat arbitrary

Distribution is important

If you specifically need Normal(μ, σ) behavior

Percentiles don't preserve distributional properties

Combining Percentiles with Business Logic

Example: Risk-Adjusted Approval

Show code (31 lines)
def tiered_approval_thresholds(credit_score, income,
credit_percentiles, income_percentiles):
"""
Use percentiles from both features for tiered decisions

Tiers:
- Tier 1 (Auto-approve): Both top 25%
- Tier 2 (Manual review): One top 25%
- Tier 3 (Auto-reject): Neither top 25%
"""

credit_pct = compute_percentile_at_value(credit_scores_data, credit_score)
income_pct = compute_percentile_at_value(income_data, income)

if credit_pct >= 75 and income_pct >= 75:
return "APPROVED", "Tier 1: Excellent on both metrics"
elif credit_pct >= 75 or income_pct >= 75:
return "REVIEW", "Tier 2: Strong on one metric"
else:
return "REJECTED", "Tier 3: Below threshold on both"

# Example
decision, reason = tiered_approval_thresholds(
credit_score=750,  # 85th percentile
income=80000,      # 60th percentile
credit_percentiles=credit_scores,
income_percentiles=incomes
)
print(f"Decision: {decision}")
print(f"Reason: {reason}")

Output:

Decision: REVIEW

Reason: Tier 2: Strong on one metric

Common Pitfalls

1. Ties at Percentile

Show code (12 lines)
Data: [10, 20, 20, 20, 30, 40, 50, 60, 70, 80]

↑_____↑

Three 20's

50th percentile falls in the middle of ties!

Different software may handle differently.

Solution: Be consistent, document method

2. Small Sample Instability

Show code (14 lines)
n = 10:

99th percentile = ???

Position = 0.99 × 11 = 10.89

Interpolation between 10th and... 11th value?

There is no 11th value!

Solution: Use percentiles < 100 × (n-1)/n

For n=10: Max reliable percentile ≈ 90%

3. Assuming Equal Intervals

Show code (20 lines)
Wrong thinking:

"50th and 75th are 25 points apart

So 50th to 25th are also 25 points apart"

Reality:

Percentiles mark equal COUNTS, not equal VALUES

Data: [1, 2, 3, 100]

25th percentile: 1.75

50th percentile: 2.50

75th percentile: 51.25

Look at those gaps! Not equal!

4. Confusing Percentile and Percentage

"Score at 80th percentile = 80% correct"

"Score at 80th percentile = Better than 80% of people"

Very different!

5. Using Percentiles on Small Subgroups

Show code (16 lines)
Total data: n = 10,000

Subgroup: n = 50

Computing 99th percentile on subgroup:

Position = 0.99 × 51 = 50.49

You're asking for the 50th value in a group of 50!

Basically the maximum. Unreliable!

Solution: Use percentiles from full data,

apply to subgroups

Summary

Percentiles are powerful, interpretable threshold proposals that require no distributional assumptions!

Key Concepts:

Percentile definition: Value below which p% of data falls

p-th percentile = Q(p/100)

Quantile function Q(p): Inverse of CDF, maps [0,1] → data values

Monotonicity property:

If p₁ < p₂, then Q(p₁) ≤ Q(p₂)

Higher percentile → Higher (or equal) threshold ↗

As decision threshold:

Want top 20%? Use 80th percentile

Guaranteed to select exactly 20%!

Robust & nonparametric:

  • No normal assumption needed

  • Stable to outliers

  • Works with any distribution

The Monotonicity Proof:

For sorted data x₁ ≤ x₂ ≤ ... ≤ xₙ:

Show code (10 lines)
p₁ < p₂

→ Position(p₁) < Position(p₂)

→ xₖ₁ ≤ xₖ₂  (because sorted)

→ Q(p₁) ≤ Q(p₂)

Increasing percentile ALWAYS increases threshold!

When to Use:

Quota-based decisions ("top 10%")

Relative rankings ("better than 75%")

Anomaly detection ("above 99th percentile")

SLA definitions ("95% under threshold")

Data-driven threshold proposals

Fixed absolute standards ("must score > 700")

Small samples (n < 50)

Theoretical distribution requirements

Implementation:

Show code (9 lines)
# Compute thresholds
thresholds = compute_percentile_thresholds(data, [75, 90, 95])

# Find percentile of value
pct = compute_percentile_at_value(data, 720)

# Use as decision rule
approved = data >= np.percentile(data, 80)  # Top 20%

The power: Turn any feature into an interpretable, ranked decision threshold without complex modeling!


Note: This article uses technical terms like percentiles, quantiles, thresholds, monotonicity, and nonparametric. For definitions, check out the Key Terms & Glossary page.

Sughosh P Dixit
Sughosh P Dixit
Data Scientist & Tech Writer
Next
16 min read

Discussion