The Code Gorilla

Friday, 15 January 2021

Approx Median for big data and distributed systems

 Its been so long since I posted to this blog, I almost forgot I had it.

Anyway - a lot of things have changed in all that time, I'm still a big fan of true BDD - something I successfully championed at the place I work so that its now considered a standard practice. I have used the word true in front of BDD due to the number of times I have seen BDD miss represented, it stands for Behavioural Driven Design, not a Test methodology, not a Specification Documentation, not a Unit Test Framework, not a ... and so on - Its a Design activity.

I'm still a fan of Scenario/Feature files, although I do occasionally write the scenarios in "code" frameworks rather than cucumber feature files with steps.

I mainly develop in Scala and Python, developing and maintaining big data (like) systems, using Apache Spark, HDFS, Kafka, Akka. Product work is mainly in Scala (with a slice of Java) and Python is used for automation work and deployment frameworks (ansible etc).

I hope to start updating this blog again and I have a few ideas I'd like to share with the community, the first of these is:

A Distributed (approx) Median Calculation - median calculations do not scale well and do not offer reaggregation optimizations (where you calculate the median for 1 hour, store results and use that to calculate the median for 1 day, a median of a median is not a median of the original data).

Using Bloom filters or HyperLogLog are techniques to allow you scale simple set calculations on a large scale (with some margin of error) such InSet (Bloom Filter) and CountDistinct (HyperLogLog) with reaggregation available via the merging of the storage or sketches that these techniques use to store the information about the data it has seen.

Sum, Counts and Means are easily implemented in reaggregate-able terms so the long term data/metrics can continue to be compacted down.

What I'm proposing here is for a Median that can scale and be reaggregated but does come with some limitations and prerequisites.

  1. The range of all possible values in the source material - this is not a solution for all values. But if the range is small enough then this can be an effective means. E.g. Measuring throughput on a network device, it has a finite boundary - lets say 0 to 10 GB.
  2. You have to define an acceptable bin range, this defined the size of the intermediate storage (and that needed to be stored for reaggregation). E.g. 1 MB  bins, on our range of 0 to 10 GB we have 10,000 bins. This will dictate the accuracy, the more bins the greater the accuracy but with a larger intermediate storage.
Each bin is a counter and it counts the number of input values that have been seen in that range,

Once all the data has been seen, this histogram can be used to calculate the approximate median value.

Some code and comparisons to true median will come in the next part.

A link to the code: Approx Median on github