Why p95 and p99 aren’t used in the 2026 Benchmarks

TL;DR: Our 2026 load tests are orchestrated by Orderly Ape, which outputs metrics aggregated into 10-second windows. Computing the 95th or 99th percentile over those aggregated windows gives you a percentile of per-window summary values, not the 95th or 99th percentile of individual request response times. The two numbers can differ by orders of magnitude on a stressed server. So for 2026 we’ve dropped p95/p99 from the charts and from any ranking decisions. The averages (mean / cumulative average) remain accurate and are what the charts now use.

What “p95” is supposed to mean

When a load test fires a few hundred thousand HTTP requests at a server, each request has a response time. The p95 is the response time at the 95th percentile of that distribution, i.e., “95% of requests were faster than this”. It’s the standard way to talk about tail latency, because averages can hide spikes: a server that responds in 200 ms most of the time but takes 30 seconds for the slowest 1% of requests has an OK average and a terrible p99.
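
To make the gap between an average and a tail percentile concrete, here is a minimal Python sketch. The sample is hypothetical (not benchmark data): 2% of requests are slow rather than 1%, so that the nearest-rank p99 lands clearly inside the slow group. The percentile helper is our own illustration, not part of any load testing tool.

```python
import math
from statistics import mean

def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical sample: 980 fast requests and 20 very slow ones (times in ms).
latencies_ms = [200] * 980 + [30_000] * 20

avg = mean(latencies_ms)
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
print(f"mean={avg:.0f} ms  p50={p50} ms  p99={p99} ms")
```

The mean comes out under one second while the p99 is 30 seconds, which is the kind of spike an average alone would hide.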

This is why p95/p99 are usually the most important numbers in a load test report — they tell you about the bad days, not the typical day.

What we actually got from Orderly Ape this year

Orderly Ape collects per-request metrics from each k6 runner and then rolls them up into time-windowed aggregates — by default in 10-second buckets. The exported metric stream looks something like:

window_start       requests   avg_response   p95_in_window
00:00:00–00:00:10  4,812      181 ms         245 ms
00:00:10–00:00:20  4,901      183 ms         252 ms
00:00:20–00:00:30  4,778      189 ms         258 ms
...

That p95_in_window is the 95th percentile of the requests that happened inside that 10-second window. Useful for spotting one bad 10-second slice in a 60-minute test, but it’s a per-bucket value, not a global one.
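
The roll-up described above can be sketched roughly as follows. This is not Orderly Ape's actual code; the function name, field names, and sample values are assumptions chosen to mirror the columns in the table. The point it shows is that once the rows are built, the individual response times are gone.

```python
import math
from collections import defaultdict

WINDOW_SECONDS = 10  # bucket width described above

def percentile(values, p):
    """Nearest-rank percentile of a list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def aggregate(samples):
    """Roll (timestamp_s, duration_ms) samples into per-window summary rows."""
    buckets = defaultdict(list)
    for timestamp_s, duration_ms in samples:
        buckets[int(timestamp_s // WINDOW_SECONDS)].append(duration_ms)
    rows = []
    for index in sorted(buckets):
        durations = buckets[index]
        rows.append({
            "window_start_s": index * WINDOW_SECONDS,
            "requests": len(durations),
            "avg_response_ms": sum(durations) / len(durations),
            "p95_in_window_ms": percentile(durations, 95),
        })
    return rows

# Hypothetical samples: (seconds since test start, response time in ms).
samples = [(0.4, 170), (3.1, 180), (9.9, 250), (10.0, 175), (14.2, 185), (19.5, 260)]
rows = aggregate(samples)
for row in rows:
    print(row)
```

Anything computed later from these rows can only see one count, one average, and one p95 per window.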

When the downstream pipeline goes to compute a single “p95 for the whole test”, it does the only thing it can with what’s there: takes a percentile across the 360 bucket values (60 minutes × 6 buckets/min) instead of across the ~1.5 million individual request response times. That number has no useful mathematical relationship to the latency tail.

A concrete example of why this matters

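Imagine 60 minutes of testing, 1.5M requests total, distributed roughly evenly across 360 ten-second windows. Suppose 358 of those windows look normal and 2 are bad, with response times spiking while the server struggles.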

True per-request p95 across all 1.5M requests: somewhere around 280–320 ms (the slow windows pull the global tail noticeably).

p95 of the 360 bucket values: around 251 ms, barely different from the 358 “normal” buckets. Each bad window is collapsed into a single bucket value, and 2 out of 360 buckets (about 0.6%) is far too few to reach the 95th percentile of bucket values.

A server with a real, painful tail latency problem looks indistinguishable from a server with none. That’s why we can’t use these numbers to rank or award.
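
Here is a toy simulation of that effect in Python. It is a sketch under stated assumptions, not a reproduction of the figures above: the request counts and response times are made up, every window gets the same number of requests, and it uses p99 rather than p95 because two bad windows out of 360 are enough to move a per-request p99 in this setup.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

WINDOWS = 360                  # 60 minutes of 10-second windows
PER_WINDOW = 1_000             # hypothetical requests per window
FAST_MS, SLOW_MS = 180, 4_000  # hypothetical response times

def normal_window():
    # 0.5% of requests are slow: too few to move this window's own p99.
    slow = PER_WINDOW // 200
    return [FAST_MS] * (PER_WINDOW - slow) + [SLOW_MS] * slow

def bad_window():
    # Every request in the window is slow.
    return [SLOW_MS] * PER_WINDOW

def summarize(windows):
    """Return (per-request p99, p99 taken across the per-window p99 values)."""
    all_requests = [ms for window in windows for ms in window]
    bucket_p99s = [percentile(window, 99) for window in windows]
    return percentile(all_requests, 99), percentile(bucket_p99s, 99)

healthy = [normal_window() for _ in range(WINDOWS)]
stressed = [normal_window() for _ in range(WINDOWS - 2)] + [bad_window() for _ in range(2)]

healthy_true, healthy_bucketed = summarize(healthy)
stressed_true, stressed_bucketed = summarize(stressed)
print(f"healthy:  true p99={healthy_true} ms, p99 of bucket p99s={healthy_bucketed} ms")
print(f"stressed: true p99={stressed_true} ms, p99 of bucket p99s={stressed_bucketed} ms")
```

In this toy setup both servers report 180 ms when the percentile is taken across buckets, while the per-request p99 of the stressed server is 4,000 ms.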

What we use instead for 2026

For benchmark posts, awards, and tier rankings, the averages (mean and cumulative average response time) are the metrics we evaluate. p95 and p99 fields are still imported and stored on each result post (for future re-analysis if a fix lands), and are still displayed in the detailed results tables with an asterisk + footnote linking back here. They just aren’t charted, and they aren’t being used to award Top Tier or Honorable Mention status.
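
One reason averages survive windowing where percentiles don't: a mean can be rebuilt exactly from per-window averages as long as each window's request count is kept. The Python sketch below illustrates that with made-up numbers. It assumes the averages are combined weighted by request count, which is our illustration and not a description of the actual pipeline.

```python
from fractions import Fraction

# Hypothetical response times (ms), grouped by the window they fell into.
windows = [[100, 200, 300], [400], [150, 250]]

# Ground truth: mean over every individual request.
all_requests = [ms for window in windows for ms in window]
global_mean = Fraction(sum(all_requests), len(all_requests))

# What a windowed export keeps: a request count and an average per window.
summaries = [(len(window), Fraction(sum(window), len(window))) for window in windows]

# Count-weighted mean of the window averages recovers the global mean exactly.
total_requests = sum(count for count, _ in summaries)
weighted_mean = sum(count * avg for count, avg in summaries) / total_requests

# An unweighted mean of window averages would not, if window sizes differ.
unweighted_mean = sum(avg for _, avg in summaries) / len(summaries)

print(float(global_mean), float(weighted_mean), float(unweighted_mean))
```

No comparable shortcut exists for percentiles: a window's p95 does not carry enough information to rebuild the p95 of the combined requests.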

Comparing across years

If you’re looking at p95 values on a 2023 result and a 2026 result side-by-side, don’t. The 2023 numbers came from a different pipeline that computed p95 per-request; the 2026 numbers came from windowed aggregates. They aren’t the same quantity. We’ve kept the column visible across years rather than retiring it because the 2020–2023 numbers remain valid for historical comparison among themselves.


About the Author

Kevin Ohashi

Kevin Ohashi is the geek-in-charge at Review Signal. He is passionate about making data meaningful for consumers. Kevin is based in Washington, DC.
