Friday, May 26, 2017

Introducing Kotlin Statistics

NOTE: Since the 1.0 release of Kotlin-Statistics, a large refactoring took place which deprecated some examples in this blog post. Please view the README of Kotlin-Statistics to learn the new API. 

Last fall I wrote an article, Kotlin for Data Science, in which I proposed Kotlin as a general programming language for data science and analytics. To my surprise, I later found out that a lot of folks had read the article and quietly started to investigate the idea. In Spring 2017 the conversation began to grow in the Kotlin community, but nothing set fire to the prospect quite like Google's announcement that it would support Kotlin. With that announcement, as well as the fact that Kotlin compiles to JavaScript and soon LLVM, it is clear that Kotlin is poised to gain adoption in multiple domains.

Since my previous article, the topic of integrating data science workflows with software engineering has gained traction across data science communities. O'Reilly posted an interesting article about using Go for data science, and a quick Google search for "DevOps for Data Science" reveals the same theme: closing the gap between data science and software engineering is a logical next step in progressing data science as a discipline.

I believe that Python and R will continue to be tools used for analytics, but they are not sustainable for workflows that require continuous delivery into production. Of course, libraries like Apache Spark and others are striving to support multiple language APIs, but Kotlin has a lot of potential to usher in a bigger picture. That is what I hope to show with the Kotlin Statistics library I released this week. It is not a silver bullet to all problems in data science, nor does it have advanced features like ML at the moment. Rather, I want Kotlin Statistics to show how Kotlin's inferred static typing and abstraction can make data science code simpler and more tactical, but also resilient and refactorable. Not to mention, the tooling for Kotlin is fantastic with IntelliJ IDEA.

A Quick Tour

I released the Kotlin Statistics library this week. It is not yet at a 1.0 version, but it should give you a good set of tools to start doing fundamental statistical analysis.

Take for example this Kotlin code below where I declare a Patient type, and I include the first name, last name, birthday, and white blood cell count. I also have an enum called Gender reflecting a MALE/FEMALE category. Of course, I could import this data from a text file, a database, or another source, but for now I am going to declare them in literal Kotlin code:

import java.time.LocalDate

data class Patient(val firstName: String,
                   val lastName: String,
                   val gender: Gender,
                   val birthday: LocalDate,
                   val whiteBloodCellCount: Int)

val patients = listOf(
        Patient("John", "Simone", Gender.MALE, LocalDate.of(1989, 1, 7), 4500),
        Patient("Sarah", "Marley", Gender.FEMALE, LocalDate.of(1970, 2, 5), 6700),
        Patient("Jessica", "Arnold", Gender.FEMALE, LocalDate.of(1980, 3, 9), 3400),
        Patient("Sam", "Beasley", Gender.MALE, LocalDate.of(1981, 4, 17), 8800),
        Patient("Dan", "Forney", Gender.MALE, LocalDate.of(1985, 9, 13), 5400),
        Patient("Lauren", "Michaels", Gender.FEMALE, LocalDate.of(1975, 8, 21), 5000),
        Patient("Michael", "Erlich", Gender.MALE, LocalDate.of(1985, 12, 17), 4100),
        Patient("Jason", "Miles", Gender.MALE, LocalDate.of(1991, 11, 1), 3900),
        Patient("Rebekah", "Earley", Gender.FEMALE, LocalDate.of(1985, 2, 18), 4600),
        Patient("James", "Larson", Gender.MALE, LocalDate.of(1974, 4, 10), 5100),
        Patient("Dan", "Ulrech", Gender.MALE, LocalDate.of(1991, 7, 11), 6000),
        Patient("Heather", "Eisner", Gender.FEMALE, LocalDate.of(1994, 3, 6), 6000),
        Patient("Jasper", "Martin", Gender.MALE, LocalDate.of(1971, 7, 1), 6000)
)

enum class Gender {
    MALE,
    FEMALE
}

If you find the LocalDate.of() or other parts of the declaration to be redundant and wordy, you can easily create functions or type aliases to make things more concise, but I am not going to digress into that right now.

Let's start with some basic analysis: what is the average and standard deviation of whiteBloodCellCount across all the patients? We can leverage some extension functions in Kotlin Statistics to find this quickly:

fun main(args: Array<String>) {

    val averageWbcc =
            patients.map { it.whiteBloodCellCount }.average()

    val standardDevWbcc =
            patients.map { it.whiteBloodCellCount }.standardDeviation()

    println("Average WBCC: $averageWbcc, Std Dev WBCC: $standardDevWbcc")
}

We should get this output:

Average WBCC: 5346.153846153846, Std Dev WBCC: 1412.2177503341948
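
For reference, both figures can be reproduced with the standard library alone. This is only a sketch: it assumes the library reports the sample standard deviation (dividing by n - 1), which is what the output above is consistent with, and sampleStdDev is a hypothetical helper rather than part of Kotlin Statistics.

```kotlin
// White blood cell counts of the 13 patients declared earlier, in the same order.
val wbccs = listOf(4500, 6700, 3400, 8800, 5400, 5000, 4100,
        3900, 4600, 5100, 6000, 6000, 6000)

// Hypothetical helper: sample standard deviation, dividing by n - 1.
fun sampleStdDev(values: List<Int>): Double {
    val mean = values.average()
    val sumOfSquaredDeviations = values.map { (it - mean) * (it - mean) }.sum()
    return Math.sqrt(sumOfSquaredDeviations / (values.size - 1))
}

fun main(args: Array<String>) {
    println("Average WBCC: ${wbccs.average()}, Std Dev WBCC: ${sampleStdDev(wbccs)}")
}
```

The result should agree with the figures above up to floating-point rounding. Dividing by n instead of n - 1 would give a noticeably smaller value, so it is worth knowing which variant a library uses.
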
However, we sometimes need to slice our data not only for more detailed insight but also to judge our sample. For example, did we get a representative sample with our patients for both male and female? We can use the countBy() operator in Kotlin Statistics to count a Collection or Sequence of items by a keySelector as shown here:

fun main(args: Array<String>) {

    val genderCounts = patients.countBy(
            keySelector = { it.gender }
    )

    println(genderCounts)
}

This returns a Map<Gender,Int>, reflecting the patient count by gender. Here is what it looks like in the output from our code above:

{MALE=8, FEMALE=5}

Okay, so our sample is a bit MALE-heavy, but let's move on. We can also find the average white blood cell count by gender using averageBy(). This accepts not only a keySelector lambda but also an intMapper to select an integer off each Patient (we could also use doubleMapper, bigDecimalMapper, etc). In this case, we are selecting the whiteBloodCellCount off each Patient and averaging it by Gender, as shown next:

fun main(args: Array<String>) {

    val averageWbccByGender = patients.averageBy(
            keySelector = { it.gender },
            intMapper = { it.whiteBloodCellCount }
    )

    println(averageWbccByGender)
}

{MALE=5475.0, FEMALE=5140.0}

So the average WBCC for MALE is 5475, and FEMALE is 5140.
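
To make the semantics concrete, here is a rough standard-library equivalent of the two calls above. It is only a sketch of what countBy() and averageBy() appear to compute, not the library's implementation, and it uses (gender, count) pairs in place of full Patient objects to stay short. The function names are hypothetical.

```kotlin
enum class Gender { MALE, FEMALE }

// (gender, whiteBloodCellCount) for the 13 patients declared earlier, in the same order.
val samples = listOf(
        Gender.MALE to 4500, Gender.FEMALE to 6700, Gender.FEMALE to 3400,
        Gender.MALE to 8800, Gender.MALE to 5400, Gender.FEMALE to 5000,
        Gender.MALE to 4100, Gender.MALE to 3900, Gender.FEMALE to 4600,
        Gender.MALE to 5100, Gender.MALE to 6000, Gender.FEMALE to 6000,
        Gender.MALE to 6000)

// Group by key, then reduce each group to its size.
fun countByGender(): Map<Gender, Int> =
        samples.groupBy { it.first }.mapValues { it.value.size }

// Group by key, then reduce each group to the mean of its counts.
fun averageWbccByGender(): Map<Gender, Double> =
        samples.groupBy { it.first }
                .mapValues { group -> group.value.map { it.second }.average() }

fun main(args: Array<String>) {
    println(countByGender())        // {MALE=8, FEMALE=5}
    println(averageWbccByGender())  // {MALE=5475.0, FEMALE=5140.0}
}
```

The library operators collapse the group-then-reduce pair into one call with named arguments, which is where most of the brevity comes from.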

What about age? Did we get a good sampling of younger and older patients? If you look at our Patient class, we only have a birthday to work with which is a Java 8 LocalDate. But using Java 8's date and time utilities, we can derive the age in years in the keySelector like this:

import java.time.temporal.ChronoUnit

fun main(args: Array<String>) {

    val patientCountByAge = patients.countBy(
            keySelector = { ChronoUnit.YEARS.between(it.birthday, LocalDate.now()) }
    )

    println(patientCountByAge)
}

And here is the output:

{28=1, 47=1, 37=1, 36=1, 31=2, 41=1, 25=2, 32=1, 43=1, 23=1, 45=1}
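
Keep in mind that ages computed this way depend on the day the code runs. As a sketch, pinning the reference date makes the result reproducible. The date used here (this post's date, 2017-05-26) is an assumption about when the output above was produced, the standard-library groupingBy().eachCount() stands in for countBy(), and ageOn and countByAge are hypothetical helpers.

```kotlin
import java.time.LocalDate
import java.time.temporal.ChronoUnit

// Birthdays of the 13 patients declared earlier, in the same order.
val birthdays = listOf(
        LocalDate.of(1989, 1, 7), LocalDate.of(1970, 2, 5), LocalDate.of(1980, 3, 9),
        LocalDate.of(1981, 4, 17), LocalDate.of(1985, 9, 13), LocalDate.of(1975, 8, 21),
        LocalDate.of(1985, 12, 17), LocalDate.of(1991, 11, 1), LocalDate.of(1985, 2, 18),
        LocalDate.of(1974, 4, 10), LocalDate.of(1991, 7, 11), LocalDate.of(1994, 3, 6),
        LocalDate.of(1971, 7, 1))

// Hypothetical helper: whole years elapsed between a birthday and a reference date.
fun ageOn(birthday: LocalDate, referenceDate: LocalDate): Long =
        ChronoUnit.YEARS.between(birthday, referenceDate)

// Count how many patients share each age on the given reference date.
fun countByAge(referenceDate: LocalDate): Map<Long, Int> =
        birthdays.groupingBy { ageOn(it, referenceDate) }.eachCount()

fun main(args: Array<String>) {
    // Assumption: the output above was produced on or around the post date.
    println(countByAge(LocalDate.of(2017, 5, 26)))
}
```

With that reference date the counts match the output above, including the two patients aged 31 and the two aged 25.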

If you look at our output for the code, it is not very meaningful to get a count by age. It would be better if we could count by age ranges, like 20-29, 30-39, and 40-49. We can do this using the binByXXX() operators. If we want to bin by an Int value such as age, we can define a BinModel that starts at 20 and gives each bin a binSize of 10. We also provide the value we are binning using binMapper, which is the patient's age as shown below:

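fun main(args: Array<String>) {

    val binnedPatients = patients.binByInt(
            // between() returns a Long, so convert it to the Int that binByInt() bins on
            binMapper = { ChronoUnit.YEARS.between(it.birthday, LocalDate.now()).toInt() },
            binSize = 10,
            rangeStart = 20
    )

    binnedPatients.forEach {
        println(it)
    }
}
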
And here is the output showing all our Patient items binned up in a BinModel by these age ranges:

Bin(range=20..29, value=[Patient(firstName=John, lastName=Simone, gender=MALE, birthday=1989-01-07, whiteBloodCellCount=4500), Patient(firstName=Jason, lastName=Miles, gender=MALE, birthday=1991-11-01, whiteBloodCellCount=3900), Patient(firstName=Dan, lastName=Ulrech, gender=MALE, birthday=1991-07-11, whiteBloodCellCount=6000), Patient(firstName=Heather, lastName=Eisner, gender=FEMALE, birthday=1994-03-06, whiteBloodCellCount=6000)])
Bin(range=30..39, value=[Patient(firstName=Jessica, lastName=Arnold, gender=FEMALE, birthday=1980-03-09, whiteBloodCellCount=3400), Patient(firstName=Sam, lastName=Beasley, gender=MALE, birthday=1981-04-17, whiteBloodCellCount=8800), Patient(firstName=Dan, lastName=Forney, gender=MALE, birthday=1985-09-13, whiteBloodCellCount=5400), Patient(firstName=Michael, lastName=Erlich, gender=MALE, birthday=1985-12-17, whiteBloodCellCount=4100), Patient(firstName=Rebekah, lastName=Earley, gender=FEMALE, birthday=1985-02-18, whiteBloodCellCount=4600)])
Bin(range=40..49, value=[Patient(firstName=Sarah, lastName=Marley, gender=FEMALE, birthday=1970-02-05, whiteBloodCellCount=6700), Patient(firstName=Lauren, lastName=Michaels, gender=FEMALE, birthday=1975-08-21, whiteBloodCellCount=5000), Patient(firstName=James, lastName=Larson, gender=MALE, birthday=1974-04-10, whiteBloodCellCount=5100), Patient(firstName=Jasper, lastName=Martin, gender=MALE, birthday=1971-07-01, whiteBloodCellCount=6000)])
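
Conceptually, binning by a fixed size amounts to mapping each value to the range that contains it. Here is a minimal standard-library sketch of that idea. It is not the library's BinModel implementation; binRangeFor and binAges are hypothetical helpers, and the ages are the ones from the count-by-age output earlier.

```kotlin
// Ages from the count-by-age output earlier, in patient order.
val ages = listOf(28, 47, 37, 36, 31, 41, 31, 25, 32, 43, 25, 23, 45)

// Hypothetical helper: the closed range of width binSize, aligned to rangeStart,
// that contains value. Assumes value >= rangeStart.
fun binRangeFor(value: Int, binSize: Int, rangeStart: Int): IntRange {
    val lower = rangeStart + ((value - rangeStart) / binSize) * binSize
    return lower..(lower + binSize - 1)
}

// Group the ages by their containing range, ordered by the start of each range.
fun binAges(binSize: Int, rangeStart: Int): Map<IntRange, List<Int>> =
        ages.groupBy { binRangeFor(it, binSize, rangeStart) }
                .toSortedMap(compareBy<IntRange> { it.first })

fun main(args: Array<String>) {
    for ((range, members) in binAges(binSize = 10, rangeStart = 20)) {
        println("$range -> $members")
    }
}
```

This yields three bins holding 4, 5, and 4 ages, matching the sizes of the bins in the output above.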

We can look up the bin for a given age using an accessor syntax. For example, we can retrieve the Bin for the age 25 like this, and it will return the 20-29 bin:

fun main(args: Array<String>) {

    val binnedPatients = patients.binByInt(
            binMapper = { ChronoUnit.YEARS.between(it.birthday, LocalDate.now()).toInt() },
            binSize = 10,
            rangeStart = 20
    )

    println(binnedPatients[25])
}

If we do not want to collect the items into bins but rather perform an aggregation on each one, we can do that by also providing a groupOp argument. This allows you to use a lambda specifying how to reduce each List<Patient> for each Bin. Below is the average white blood cell count by age range:

fun main(args: Array<String>) {

    val avgWbccByAgeRange = patients.binByInt(
            binMapper = { ChronoUnit.YEARS.between(it.birthday, LocalDate.now()).toInt() },
            binSize = 10,
            rangeStart = 20,
            groupOp = { group -> group.map { it.whiteBloodCellCount }.average() }
    )

    println(avgWbccByAgeRange)
}

Here is the output, showing that the average white blood cell count for each age range falls in the 5000s:

BinModel(bins=[Bin(range=20..29, value=5100.0), Bin(range=30..39, value=5260.0), Bin(range=40..49, value=5700.0)])
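
The role of groupOp can be sketched with the standard library as a group-then-reduce step. This is an illustration under stated assumptions rather than the library's code: binStart and averageWbccByBin are hypothetical helpers, each bin is identified only by its lower bound, and the ages are the ones from the count-by-age output earlier.

```kotlin
// (age, whiteBloodCellCount) pairs in patient order; ages come from the earlier output.
val samples = listOf(28 to 4500, 47 to 6700, 37 to 3400, 36 to 8800, 31 to 5400,
        41 to 5000, 31 to 4100, 25 to 3900, 32 to 4600, 43 to 5100,
        25 to 6000, 23 to 6000, 45 to 6000)

// Lower bound of the 10-year bin containing age, with bins starting at 20.
fun binStart(age: Int): Int = 20 + ((age - 20) / 10) * 10

// Group into bins, then reduce each group to its average count,
// which is the part groupOp plays in the call above.
fun averageWbccByBin(): Map<Int, Double> =
        samples.groupBy { binStart(it.first) }
                .mapValues { bin -> bin.value.map { it.second }.average() }

fun main(args: Array<String>) {
    println(averageWbccByBin().toSortedMap())  // {20=5100.0, 30=5260.0, 40=5700.0}
}
```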

Using let() for Multiple Calculations

There may be times you want to perform multiple aggregations to create reports of various metrics. This is usually achievable using Kotlin's let() operator. Say you wanted to find the 1st, 25th, 50th, 75th, and 100th percentiles by gender. We can tactically use a Kotlin extension function called wbccPercentileByGender() which will take a set of patients and separate a percentile calculation by gender. Then we can invoke it for the five desired percentiles and package them in a Map<Double,Map<Gender,Double>>, as shown below:

fun main(args: Array<String>) {

    // Assumes the library's percentileBy() operator; check the README for the current API
    fun Collection<Patient>.wbccPercentileByGender(percentile: Double) =
            percentileBy(
                    percentile = percentile,
                    keySelector = { it.gender },
                    doubleMapper = { it.whiteBloodCellCount.toDouble() }
            )

    val percentileQuadrantsByGender = patients.let {
        mapOf(1.0 to it.wbccPercentileByGender(1.0),
                25.0 to it.wbccPercentileByGender(25.0),
                50.0 to it.wbccPercentileByGender(50.0),
                75.0 to it.wbccPercentileByGender(75.0),
                100.0 to it.wbccPercentileByGender(100.0)
        )
    }

    percentileQuadrantsByGender.forEach { println(it) }
}

1.0={MALE=3900.0, FEMALE=3400.0}
25.0={MALE=4200.0, FEMALE=4000.0}
50.0={MALE=5250.0, FEMALE=5000.0}
75.0={MALE=6000.0, FEMALE=6350.0}
100.0={MALE=8800.0, FEMALE=6700.0}
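
These figures are consistent with a common percentile estimate: sort the values, compute the position p(n + 1)/100, and interpolate linearly between the two neighbouring values. The sketch below implements that rule with the standard library. Whether Kotlin Statistics computes percentiles exactly this way is an assumption, and percentile is a hypothetical helper.

```kotlin
// Hypothetical helper: percentile via position p * (n + 1) / 100 with linear interpolation.
fun percentile(values: List<Double>, p: Double): Double {
    require(values.isNotEmpty() && p > 0.0 && p <= 100.0)
    val sorted = values.sorted()
    val n = sorted.size
    val position = p * (n + 1) / 100.0
    if (position < 1.0) return sorted.first()  // below the first value
    if (position >= n) return sorted.last()    // at or beyond the last value
    val lowerRank = Math.floor(position).toInt()  // 1-based rank of the lower neighbour
    val fraction = position - lowerRank
    val lower = sorted[lowerRank - 1]
    val upper = sorted[lowerRank]
    return lower + fraction * (upper - lower)
}

// White blood cell counts from the patient list, split by gender.
val maleWbcc = listOf(4500.0, 8800.0, 5400.0, 4100.0, 3900.0, 5100.0, 6000.0, 6000.0)
val femaleWbcc = listOf(6700.0, 3400.0, 5000.0, 4600.0, 6000.0)

fun main(args: Array<String>) {
    for (p in listOf(1.0, 25.0, 50.0, 75.0, 100.0)) {
        println("$p={MALE=${percentile(maleWbcc, p)}, FEMALE=${percentile(femaleWbcc, p)}}")
    }
}
```

For example, the eight MALE values put the 25th percentile at position 2.25, a quarter of the way from 4100 to 4500, which gives 4200.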


This was a somewhat simple introduction to Kotlin Statistics and the functionality I have built so far. Be sure to read the project's README to see a more comprehensive set of operators available in the library. Over time, I plan on improving it with linear regression, charting, and other features. I am also thinking of putting in Bayesian model support after I finish scoping it out.

But more importantly, I hope this demonstrates Kotlin's efficacy in being tactical but robust. Kotlin is capable of rapid turnaround for quick ad hoc analysis, but you can take that statically-typed code and put it in production if you need to. While I am seeking to add more functionality to this, it would be awesome to see others contribute to the idea of using Kotlin for these kinds of purposes.

