The Code Gorilla

Friday, 15 January 2021

Approx Median for big data and distributed systems

It's been so long since I posted to this blog, I almost forgot I had it.

Anyway - a lot of things have changed in all that time. I'm still a big fan of true BDD - something I successfully championed at the place I work, so that it's now considered a standard practice. I have used the word true in front of BDD because of the number of times I have seen BDD misrepresented: it stands for Behaviour-Driven Design, not a test methodology, not specification documentation, not a unit test framework, not a ... and so on - it's a design activity.

I'm still a fan of scenario/feature files, although I do occasionally write the scenarios in "code" frameworks rather than Cucumber feature files with steps.

I mainly develop in Scala and Python, developing and maintaining big data (like) systems using Apache Spark, HDFS, Kafka and Akka. Product work is mainly in Scala (with a slice of Java), and Python is used for automation work and deployment frameworks (Ansible etc.).

I hope to start updating this blog again, and I have a few ideas I'd like to share with the community. The first of these is:

A Distributed (approx) Median Calculation - median calculations do not scale well and do not offer reaggregation optimisations (where you calculate the median for one hour, store the result and use it to calculate the median for one day; a median of medians is not the median of the original data).
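To see why, here's a quick Python illustration (the numbers are made up for the demonstration) of a median of medians diverging from the true median of the combined data:

```python
# Demonstration: a median of medians is not the median of the original data.
from statistics import median

hour1 = [1, 2, 9]         # median = 2
hour2 = [3, 9, 9]         # median = 9
combined = hour1 + hour2  # sorted: [1, 2, 3, 9, 9, 9]

median_of_medians = median([median(hour1), median(hour2)])  # (2 + 9) / 2 = 5.5
true_median = median(combined)                              # (3 + 9) / 2 = 6.0

print(median_of_medians, true_median)  # 5.5 6.0 - they disagree
```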

Bloom filters and HyperLogLog are techniques that let you scale simple set calculations (with some margin of error), such as set membership (Bloom filter) and distinct counts (HyperLogLog), with reaggregation available by merging the sketches these techniques use to store information about the data they have seen.

Sums, counts and means are easily implemented in reaggregatable terms, so long-term data/metrics can continue to be compacted down.
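As a sketch of what "reaggregatable" means here (function names and data are my own, not from any library): a mean is stored as a (sum, count) pair rather than as the mean itself, so pairs from any time windows can be merged and the mean derived at the end:

```python
# A reaggregatable mean: store (sum, count) pairs per window, merge them,
# and derive the overall mean from the merged pair.

def merge(a, b):
    """Combine two (sum, count) sketches into one."""
    return (a[0] + b[0], a[1] + b[1])

def mean(sketch):
    total, count = sketch
    return total / count

# Hourly aggregates: (sum of values, number of values)
hour1 = (10.0, 4)   # mean 2.5
hour2 = (30.0, 6)   # mean 5.0

day = merge(hour1, hour2)
print(mean(day))  # 40.0 / 10 = 4.0 - the true mean of all underlying values
```

Note that averaging the two hourly means (2.5 and 5.0 gives 3.75) would be wrong, because the hours contain different numbers of values; the (sum, count) pair avoids that.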

What I'm proposing here is a median that can scale and be reaggregated, but it does come with some limitations and prerequisites:

  1. The range of all possible values in the source material must be bounded - this is not a solution for all data. But if the range is small enough, this can be an effective approach. E.g. measuring throughput on a network device, which has a finite boundary - let's say 0 to 10 GB.
  2. You have to define an acceptable bin range; this defines the size of the intermediate storage (and what needs to be stored for reaggregation). E.g. with 1 MB bins on our range of 0 to 10 GB, we have 10,000 bins. This dictates the accuracy: more bins give greater accuracy, but at the cost of larger intermediate storage.
Each bin is a counter that counts the number of input values seen in that range.

Once all the data has been seen, this histogram can be used to calculate the approximate median value.
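Here's a minimal Python sketch of the technique (names are my own, and this is not the repo's code; a tiny range is used for brevity rather than the 10,000-bin example above). Merging two histograms is just element-wise addition, which is what makes the reaggregation work:

```python
# Binned approximate median: fixed-range counting bins that can be merged.

BIN_SIZE = 1        # width of each bin (e.g. 1 MB)
NUM_BINS = 10      # covers an assumed range of 0..10 for this demo

def new_sketch():
    return [0] * NUM_BINS

def add(sketch, value):
    # Clamp to the final bin so out-of-range values are still counted.
    idx = min(int(value // BIN_SIZE), NUM_BINS - 1)
    sketch[idx] += 1

def merge(a, b):
    # Reaggregation: element-wise addition of the bin counters.
    return [x + y for x, y in zip(a, b)]

def approx_median(sketch):
    # Walk the bins until we pass half the total count; report the midpoint
    # of the bin we stopped in as the approximate median.
    total = sum(sketch)
    seen = 0
    for idx, count in enumerate(sketch):
        seen += count
        if seen >= total / 2:
            return (idx + 0.5) * BIN_SIZE
    return None

# Two "hours" of data, sketched separately then merged for the "day".
s1, s2 = new_sketch(), new_sketch()
for v in [1, 2, 3, 4, 5]:
    add(s1, v)
for v in [6, 7, 8, 9]:
    add(s2, v)

merged = merge(s1, s2)
print(approx_median(merged))  # 5.5 - close to the true median of 5
```

The error is bounded by the bin width: the true median always falls inside the bin the walk stops in, so halving the bin size halves the worst-case error (at the price of doubling the intermediate storage).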

Some code and comparisons to true median will come in the next part.

A link to the code: Approx Median on github


Friday, 22 March 2013

Getting started with SpecFlow (Part 2)

This is part 2, if you've not read part 1 please do. Download source code

Quick recap


We have our solution, Dentist, with one project, Dentist.Specs, containing a single feature file called BookAnAppointment.feature.
It's now time to generate some code. If you have previously generated the steps, select the step file you created and delete it.

Part 2 - Time for code


Wednesday, 20 March 2013

Getting started with SpecFlow (Part 1)

This blog post will, hopefully, provide a good starting point for those interested in using SpecFlow and BDD. This example is developed using SpecFlow 1.9, MsTest and Visual Studio 2012.

I won't go into what BDD is, or try to explain the Gherkin language as used in Specflow. This will be a basic project run through to get you started with your first Specflow & BDD based projects.

This project will be based on developing a simple system for booking appointments at a dentist; there are multiple layers to this n-tier system.

Data Layer (DataModel)
          |
Domain Layer (DomainModel)
          |
Presentation Layer (UI)

Part 1 - Defining the Scenarios

In this example, we shall be using BDD to explain the behaviour of the business domain layer and not worry about the presentation layer for now (I wish to keep the example simple).

Monday, 18 March 2013

Entity framework, why the virtual on references?

I've been perplexed for a little while about the entity framework when using code first.

Consider the following:

A multiple choice database, one question with multiple answers:

public class Answer
{
    public int AnswerID { get; set; }
    public int QuestionID { get; set; }

    public string Text { get; set; }
    public bool IsCorrect { get; set; }

    public virtual Question Question { get; set; }
}

public class Question
{
    public int QuestionID { get; set; }
    public string Text { get; set; }

    public virtual ICollection<Answer> Answers { get; set; }
}

public class MultiChoiceContext : DbContext, IDataModel
{
    public DbSet<Question> Questions { get; set; }
}

The question that's always been in my mind is: why the virtual on the Question reference in Answer and on the ICollection<Answer> in Question? I've seen lots of examples on the internet that do and do not have the virtual keyword, and until recently I never understood why. So I'm here to enlighten you.

Sunday, 10 March 2013

Dependency Injection (DI)

I was late to the party on Dependency Injection, or so I thought. I'd been reading and trying to get my head around IoC (Inversion of Control) for some time when I decided to delve deeper and started reading the book Dependency Injection in .NET. After reading about 50 pages it dawned on me: I'd been designing my software with DI in mind for years.


My background in strongly typed, modern (at the time) C++ had always enforced reference (pointer) access via an interface (or pure virtual class), in conjunction with the pattern identified by Kevlin Henney called PFA (Parameterize From Above). Complemented by my own personal hatred of the global singleton "don't know how to solve a problem so I'll just cheat" anti-pattern, this had made me subconsciously an advocate of DI. Like many of the GoF (Gang of Four) patterns named in the book Design Patterns: Elements of Reusable Object-Oriented Software, I'd been using them for years; I just didn't know they had a name (remember, all this book really did/does is allow us to talk about a pattern using the same language; the patterns are just a result of good design and are pretty obvious). Now I understood, and now I could talk to other developers using the same language.

Saturday, 9 March 2013

Unit testing

If you're here, then you're thinking about testing. Please visit my articles on SpecFlow for BDD to bring something new to your testing.

There is a lot to be said about unit testing (or testing via code), and a lot of opinions on the right and wrong way to do it. The first thing I say about code testing is that it doesn't matter how you do it, how much you do, or how effective it is - we all need to start somewhere. As long as you are doing some form of code testing, you can build on it and address those areas where improvements can be made.

I follow three levels of code testing:
  1. Unit testing (strict AAA)
  2. Interaction testing (or integration testing)
  3. System testing (as much as can be done)

Friday, 8 March 2013