
Friday, May 26, 2017

Introducing Kotlin Statistics




Last fall I wrote an article, "Kotlin for Data Science", where I proposed Kotlin as a general programming language for data science and analytics. To my surprise, I later found out a lot of folks read the article and quietly started to investigate the idea. In Spring 2017 the conversation began to grow in the Kotlin community, but nothing set fire to the prospect quite like Google's announcement that it would support Kotlin. With that announcement, as well as the fact that Kotlin compiles to JavaScript and soon LLVM, it is clear that Kotlin is poised to gain adoption in multiple domains.

Since my previous article, the topic of integrating data science workflows with software engineering has gained traction across data science communities. O'Reilly posted an interesting article about using Go for data science, and a quick Google search for "DevOps for Data Science" reveals the same theme: closing the gap between data science and software engineering is a logical next step in progressing data science as a discipline.

I believe that Python and R will continue to be tools used for analytics, but they are not sustainable for workflows that require continuous delivery into production. Of course, libraries like Apache Spark and others are striving to support multiple language APIs, but Kotlin has a lot of potential to usher in a bigger picture. That is what I hope to show with the Kotlin Statistics library I released this week. It is not a silver bullet for all problems in data science, nor does it have advanced features like ML at the moment. Rather, I want Kotlin Statistics to show how Kotlin's inferred static typing and abstraction can make data science code simpler and more tactical, but also resilient and refactorable. Not to mention, the tooling for Kotlin is fantastic with IntelliJ IDEA.


A Quick Tour


Kotlin Statistics is not yet at a 1.0 version, but it should give you a good set of tools to start doing fundamental statistical analysis.

Take for example this Kotlin code below where I declare a Patient type, and I include the first name, last name, birthday, and white blood cell count. I also have an enum called Gender reflecting a MALE/FEMALE category. Of course, I could import this data from a text file, a database, or another source, but for now I am going to declare them in literal Kotlin code:

import java.time.LocalDate

data class Patient(val firstName: String,
                   val lastName: String,
                   val gender: Gender,
                   val birthday: LocalDate,
                   val whiteBloodCellCount: Int)


val patients = listOf(
        Patient("John", "Simone", Gender.MALE, LocalDate.of(1989, 1, 7), 4500),
        Patient("Sarah", "Marley", Gender.FEMALE, LocalDate.of(1970, 2, 5), 6700),
        Patient("Jessica", "Arnold", Gender.FEMALE, LocalDate.of(1980, 3, 9), 3400),
        Patient("Sam", "Beasley", Gender.MALE, LocalDate.of(1981, 4, 17), 8800),
        Patient("Dan", "Forney", Gender.MALE, LocalDate.of(1985, 9, 13), 5400),
        Patient("Lauren", "Michaels", Gender.FEMALE, LocalDate.of(1975, 8, 21), 5000),
        Patient("Michael", "Erlich", Gender.MALE, LocalDate.of(1985, 12, 17), 4100),
        Patient("Jason", "Miles", Gender.MALE, LocalDate.of(1991, 11, 1), 3900),
        Patient("Rebekah", "Earley", Gender.FEMALE, LocalDate.of(1985, 2, 18), 4600),
        Patient("James", "Larson", Gender.MALE, LocalDate.of(1974, 4, 10), 5100),
        Patient("Dan", "Ulrech", Gender.MALE, LocalDate.of(1991, 7, 11), 6000),
        Patient("Heather", "Eisner", Gender.FEMALE, LocalDate.of(1994, 3, 6), 6000),
        Patient("Jasper", "Martin", Gender.MALE, LocalDate.of(1971, 7, 1), 6000)
)

enum class Gender {
    MALE,
    FEMALE
}
If you find the LocalDate.of() or other parts of the declaration to be redundant and wordy, you can easily create functions or type aliases to make things more concise, but I am not going to digress into that right now.
Let's start with some basic analysis: what is the average and standard deviation of whiteBloodCellCount across all the patients? We can leverage some extension functions in Kotlin Statistics to find this quickly:

fun main(args: Array<String>) {

    val averageWbcc =
            patients.map { it.whiteBloodCellCount }.average()

    val standardDevWbcc =
            patients.map { it.whiteBloodCellCount }.standardDeviation()

    println("Average WBCC: $averageWbcc, Std Dev WBCC: $standardDevWbcc")

}
We should get this output:
Average WBCC: 5346.153846153846, Std Dev WBCC: 1412.2177503341948
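
To see what these two numbers mean, they can be reproduced with the Kotlin standard library alone. The sketch below is not the library's code: `wbccValues` and `sampleStandardDeviation` are hypothetical names, and the n - 1 (sample) denominator is an assumption, chosen because it matches the printed value.

```kotlin
import kotlin.math.sqrt

// White blood cell counts of the 13 patients declared above, in declaration order.
val wbccValues = listOf(4500, 6700, 3400, 8800, 5400, 5000, 4100, 3900, 4600, 5100, 6000, 6000, 6000)

// Sample standard deviation: squared deviations from the mean, divided by n - 1.
// (Assumed form; a population standard deviation would divide by n instead.)
fun sampleStandardDeviation(values: List<Int>): Double {
    require(values.size > 1) { "need at least two values" }
    val mean = values.average()
    val sumOfSquares = values.map { (it - mean) * (it - mean) }.sum()
    return sqrt(sumOfSquares / (values.size - 1))
}

fun main() {
    println(wbccValues.average())                 // about 5346.15
    println(sampleStandardDeviation(wbccValues))  // about 1412.22
}
```

Dividing by n instead of n - 1 would give roughly 1356.8, so the distinction matters on a sample this small.
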
However, we sometimes need to slice our data not only for more detailed insight but also to judge our sample. For example, did we get a representative sample with our patients for both male and female? We can use the countBy() operator in Kotlin Statistics to count a Collection or Sequence of items by a keySelector as shown here:

fun main(args: Array<String>) {

    val genderCounts = patients.countBy(
            keySelector = { it.gender }
    )

    println(genderCounts)
}

This returns a Map<Gender,Int>, reflecting the patient count by gender. Here is what it looks like in the output from our code above:
{MALE=8, FEMALE=5}
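
For comparison, roughly the same count can be produced with the standard library's `groupingBy()` and `eachCount()`. This is a minimal sketch, not how the library implements countBy(); `countByKey` and `genders` are hypothetical names, and the data is reduced to the gender column only.

```kotlin
enum class Gender { MALE, FEMALE }

// Genders of the 13 patients declared earlier, in declaration order.
val genders: List<Gender> = listOf(
    Gender.MALE, Gender.FEMALE, Gender.FEMALE, Gender.MALE, Gender.MALE,
    Gender.FEMALE, Gender.MALE, Gender.MALE, Gender.FEMALE, Gender.MALE,
    Gender.MALE, Gender.FEMALE, Gender.MALE
)

// Hypothetical stand-in for countBy(): group items by a key and count each group.
fun <T, K> Iterable<T>.countByKey(keySelector: (T) -> K): Map<K, Int> =
    groupingBy(keySelector).eachCount()

fun main() {
    println(genders.countByKey { it }) // {MALE=8, FEMALE=5}
}
```
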
Okay, so our sample is a bit MALE-heavy, but let's move on. We can also find the average white blood cell count by gender using averageBy(). This accepts not only a keySelector lambda but also an intMapper to select an integer off each Patient (we could also use doubleMapper, bigDecimalMapper, etc). In this case, we are selecting the whiteBloodCellCount off each Patient and averaging it by Gender, as shown next:

fun main(args: Array<String>) {

    val averageWbccByGender = patients.averageBy(
            keySelector = { it.gender },
            intMapper = { it.whiteBloodCellCount }
    )

    println(averageWbccByGender)
}

{MALE=5475.0, FEMALE=5140.0}

So the average WBCC for MALE is 5475, and FEMALE is 5140.
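
The grouped average can be sketched the same way with `groupBy()` and `mapValues()` from the standard library. The `averageByKey` function and the `genderAndWbcc` pairs are hypothetical stand-ins for illustration, not the library's averageBy() itself.

```kotlin
enum class Gender { MALE, FEMALE }

// (gender, whiteBloodCellCount) for the 13 patients, in declaration order.
val genderAndWbcc = listOf(
    Gender.MALE to 4500, Gender.FEMALE to 6700, Gender.FEMALE to 3400,
    Gender.MALE to 8800, Gender.MALE to 5400, Gender.FEMALE to 5000,
    Gender.MALE to 4100, Gender.MALE to 3900, Gender.FEMALE to 4600,
    Gender.MALE to 5100, Gender.MALE to 6000, Gender.FEMALE to 6000,
    Gender.MALE to 6000
)

// Hypothetical stand-in for averageBy(): collect the mapped Int per key, then average.
fun <T, K> Iterable<T>.averageByKey(
    keySelector: (T) -> K,
    intMapper: (T) -> Int
): Map<K, Double> =
    groupBy(keySelector, intMapper).mapValues { (_, values) -> values.average() }

fun main() {
    val result = genderAndWbcc.averageByKey(
        keySelector = { it.first },
        intMapper = { it.second }
    )
    println(result) // {MALE=5475.0, FEMALE=5140.0}
}
```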

What about age? Did we get a good sampling of younger and older patients? If you look at our Patient class, we only have a birthday to work with which is a Java 8 LocalDate. But using Java 8's date and time utilities, we can derive the age in years in the keySelector like this:

import java.time.LocalDate
import java.time.temporal.ChronoUnit

fun main(args: Array<String>) {

    val patientCountByAge = patients.countBy(
            keySelector = { ChronoUnit.YEARS.between(it.birthday, LocalDate.now()) }
    )

    println(patientCountByAge)
}
And here is the output:

{28=1, 47=1, 37=1, 36=1, 31=2, 41=1, 25=2, 32=1, 43=1, 23=1, 45=1}
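
Note that the keySelector depends on LocalDate.now(), so these counts drift as the patients age. A minimal sketch of a reproducible variant passes the reference date in explicitly. It uses only the standard library and java.time; `countByAge` and `birthdays` are hypothetical names, and pinning the date to 2017-05-26 (the date of this post) is an assumption about when the output above was generated.

```kotlin
import java.time.LocalDate
import java.time.temporal.ChronoUnit

// Birthdays of the 13 patients, in declaration order.
val birthdays = listOf(
    LocalDate.of(1989, 1, 7), LocalDate.of(1970, 2, 5), LocalDate.of(1980, 3, 9),
    LocalDate.of(1981, 4, 17), LocalDate.of(1985, 9, 13), LocalDate.of(1975, 8, 21),
    LocalDate.of(1985, 12, 17), LocalDate.of(1991, 11, 1), LocalDate.of(1985, 2, 18),
    LocalDate.of(1974, 4, 10), LocalDate.of(1991, 7, 11), LocalDate.of(1994, 3, 6),
    LocalDate.of(1971, 7, 1)
)

// Count patients by whole years of age as of a fixed reference date.
fun countByAge(asOf: LocalDate): Map<Long, Int> =
    birthdays.groupingBy { ChronoUnit.YEARS.between(it, asOf) }.eachCount()

fun main() {
    // Prints the same counts as above when pinned to the post's date.
    println(countByAge(LocalDate.of(2017, 5, 26)))
}
```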

A raw count by age, as in the output above, is not very meaningful. It would be better if we could count by age ranges, like 20-29, 30-39, and 40-49. We can do this using the binByXXX() operators. If we want to bin by an Int value such as age, we can define a BinModel that starts at 20 (rangeStart) and spans 10 years per bin (binSize). We also provide the value we are binning using binMapper, which is the patient's age as shown below:

fun main(args: Array<String>) {

    val binnedPatients = patients.binByInt(
            binMapper = { ChronoUnit.YEARS.between(it.birthday, LocalDate.now()).toInt() },
            binSize = 10,
            rangeStart = 20
    )

    binnedPatients.forEach {
        println(it)
    }
}

And here is the output showing all our Patient items binned up in a BinModel by these age ranges:

Bin(range=20..29, value=[Patient(firstName=John, lastName=Simone, gender=MALE, birthday=1989-01-07, whiteBloodCellCount=4500), Patient(firstName=Jason, lastName=Miles, gender=MALE, birthday=1991-11-01, whiteBloodCellCount=3900), Patient(firstName=Dan, lastName=Ulrech, gender=MALE, birthday=1991-07-11, whiteBloodCellCount=6000), Patient(firstName=Heather, lastName=Eisner, gender=FEMALE, birthday=1994-03-06, whiteBloodCellCount=6000)])
Bin(range=30..39, value=[Patient(firstName=Jessica, lastName=Arnold, gender=FEMALE, birthday=1980-03-09, whiteBloodCellCount=3400), Patient(firstName=Sam, lastName=Beasley, gender=MALE, birthday=1981-04-17, whiteBloodCellCount=8800), Patient(firstName=Dan, lastName=Forney, gender=MALE, birthday=1985-09-13, whiteBloodCellCount=5400), Patient(firstName=Michael, lastName=Erlich, gender=MALE, birthday=1985-12-17, whiteBloodCellCount=4100), Patient(firstName=Rebekah, lastName=Earley, gender=FEMALE, birthday=1985-02-18, whiteBloodCellCount=4600)])
Bin(range=40..49, value=[Patient(firstName=Sarah, lastName=Marley, gender=FEMALE, birthday=1970-02-05, whiteBloodCellCount=6700), Patient(firstName=Lauren, lastName=Michaels, gender=FEMALE, birthday=1975-08-21, whiteBloodCellCount=5000), Patient(firstName=James, lastName=Larson, gender=MALE, birthday=1974-04-10, whiteBloodCellCount=5100), Patient(firstName=Jasper, lastName=Martin, gender=MALE, birthday=1971-07-01, whiteBloodCellCount=6000)])

We can look up the bin for a given age using an accessor syntax. For example, we can retrieve the Bin for the age 25 like this, and it will return the 20-29 bin:

fun main(args: Array<String>) {

    val binnedPatients = patients.binByInt(
            binMapper = { ChronoUnit.YEARS.between(it.birthday, LocalDate.now()).toInt() },
            binSize = 10,
            rangeStart = 20
    )

    println(binnedPatients[25])
}

If we do not want to collect the items into bins but rather perform an aggregation on each bin, we can do that by also providing a groupOp argument. This lambda specifies how to reduce the List<Patient> in each Bin. Below is the average white blood cell count by age range:

fun main(args: Array<String>) {

    val avgWbccByAgeRange = patients.binByInt(
            binMapper = { ChronoUnit.YEARS.between(it.birthday, LocalDate.now()).toInt() },
            binSize = 10,
            rangeStart = 20,
            groupOp = { group -> group.map { it.whiteBloodCellCount }.average() }
    )

    println(avgWbccByAgeRange)
}

Here is the output, showing that the average white blood cell count for each age range is within the 5000's:

BinModel(bins=[Bin(range=20..29, value=5100.0), Bin(range=30..39, value=5260.0), Bin(range=40..49, value=5700.0)])
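
The arithmetic behind this kind of binning is simple enough to sketch with the standard library. The code below is an illustration under stated assumptions, not the library's binByInt(): `binRange` and `binByIntSketch` are hypothetical names, the result is a plain Map rather than a BinModel, and the ages are hard-coded as of 2017-05-26 so the result does not depend on the current date.

```kotlin
// (age as of 2017-05-26, whiteBloodCellCount) for the 13 patients.
val ageAndWbcc = listOf(
    28 to 4500, 47 to 6700, 37 to 3400, 36 to 8800, 31 to 5400, 41 to 5000,
    31 to 4100, 25 to 3900, 32 to 4600, 43 to 5100, 25 to 6000, 23 to 6000,
    45 to 6000
)

// The bin containing `value`, for bins of width binSize starting at rangeStart.
fun binRange(value: Int, rangeStart: Int, binSize: Int): IntRange {
    require(binSize > 0) { "binSize must be positive" }
    require(value >= rangeStart) { "value lies below the first bin" }
    val lower = rangeStart + (value - rangeStart) / binSize * binSize
    return lower until lower + binSize
}

// Group items into bins, order the bins by lower bound, then reduce each bin with groupOp.
fun <T, R> Iterable<T>.binByIntSketch(
    binMapper: (T) -> Int,
    binSize: Int,
    rangeStart: Int,
    groupOp: (List<T>) -> R
): Map<IntRange, R> =
    groupBy { binRange(binMapper(it), rangeStart, binSize) }
        .toSortedMap(compareBy<IntRange> { it.first })
        .mapValues { (_, items) -> groupOp(items) }

fun averageWbccByAgeRange(): Map<IntRange, Double> =
    ageAndWbcc.binByIntSketch(
        binMapper = { it.first },
        binSize = 10,
        rangeStart = 20,
        groupOp = { items -> items.map { it.second }.average() }
    )

fun main() {
    val averages = averageWbccByAgeRange()
    println(averages)                        // {20..29=5100.0, 30..39=5260.0, 40..49=5700.0}
    println(averages[binRange(25, 20, 10)])  // 5100.0, the bin that contains age 25
}
```

Passing `groupOp = { it }` would keep the raw items per bin, which corresponds to the unaggregated form shown earlier.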

Using let() for Multiple Calculations


There may be times you want to perform multiple aggregations to create reports of various metrics. This is usually achievable using Kotlin's let() operator. Say you wanted to find the 1st, 25th, 50th, 75th, and 100th percentiles by gender. We can tactically use a Kotlin extension function called wbccPercentileByGender() which will take a set of patients and separate a percentile calculation by gender. Then we can invoke it for the five desired percentiles and package them in a Map<Double,Map<Gender,Double>>, as shown below:


fun main(args: Array<String>) {

    fun Collection<Patient>.wbccPercentileByGender(percentile: Double) =
            percentileBy(
                    percentile = percentile,
                    keySelector = { it.gender },
                    doubleMapper = { it.whiteBloodCellCount.toDouble() }
            )

    val percentileQuadrantsByGender = patients.let {
        mapOf(1.0 to it.wbccPercentileByGender(1.0),
                25.0 to it.wbccPercentileByGender(25.0),
                50.0 to it.wbccPercentileByGender(50.0),
                75.0 to it.wbccPercentileByGender(75.0),
                100.0 to it.wbccPercentileByGender(100.0)
        )
    }

    percentileQuadrantsByGender.forEach(::println)
}

OUTPUT:

1.0={MALE=3900.0, FEMALE=3400.0}
25.0={MALE=4200.0, FEMALE=4000.0}
50.0={MALE=5250.0, FEMALE=5000.0}
75.0={MALE=6000.0, FEMALE=6350.0}
100.0={MALE=8800.0, FEMALE=6700.0}
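
Percentiles can be estimated in several ways, so it helps to see one rule that reproduces the figures above. The sketch below interpolates at position p(n + 1)/100 in the sorted values and clamps to the minimum or maximum at the ends. Whether the library uses exactly this rule internally is an assumption; `percentile`, `percentileByGender`, and `genderAndWbcc` are hypothetical names.

```kotlin
import kotlin.math.floor

enum class Gender { MALE, FEMALE }

// (gender, whiteBloodCellCount) for the 13 patients, in declaration order.
val genderAndWbcc = listOf(
    Gender.MALE to 4500, Gender.FEMALE to 6700, Gender.FEMALE to 3400,
    Gender.MALE to 8800, Gender.MALE to 5400, Gender.FEMALE to 5000,
    Gender.MALE to 4100, Gender.MALE to 3900, Gender.FEMALE to 4600,
    Gender.MALE to 5100, Gender.MALE to 6000, Gender.FEMALE to 6000,
    Gender.MALE to 6000
)

// Percentile by linear interpolation at 1-based position p * (n + 1) / 100.
fun percentile(values: List<Double>, p: Double): Double {
    require(values.isNotEmpty()) { "no values" }
    require(p > 0.0 && p <= 100.0) { "p must be in (0, 100]" }
    val sorted = values.sorted()
    val n = sorted.size
    val pos = p * (n + 1) / 100.0
    if (pos < 1.0) return sorted.first()   // below the first rank: minimum
    if (pos >= n) return sorted.last()     // at or beyond the last rank: maximum
    val lowerRank = floor(pos).toInt()     // 1-based rank of the lower neighbour
    val fraction = pos - lowerRank
    val lower = sorted[lowerRank - 1]
    val upper = sorted[lowerRank]
    return lower + fraction * (upper - lower)
}

fun percentileByGender(p: Double): Map<Gender, Double> =
    genderAndWbcc
        .groupBy({ it.first }, { it.second.toDouble() })
        .mapValues { (_, values) -> percentile(values, p) }

fun main() {
    listOf(1.0, 25.0, 50.0, 75.0, 100.0).forEach { p ->
        println("$p=${percentileByGender(p)}")
    }
}
```

For example, the eight MALE values sorted are 3900, 4100, 4500, 5100, 5400, 6000, 6000, 8800; the 25th percentile sits at position 2.25, a quarter of the way from 4100 to 4500, which gives 4200.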

Summary

This was a somewhat simple introduction to Kotlin Statistics and the functionality I have built so far. Be sure to read the project's README to see a more comprehensive set of operators available in the library. Over time, I plan on improving with linear regression, charting, and other features. I am also thinking of putting in Bayesian model support after I finish scoping it out.

But more importantly, I hope this demonstrates Kotlin's efficacy in being tactical but robust. Kotlin is capable of rapid turnaround for quick ad hoc analysis, but you can take that statically-typed code and put it in production if you need to. While I am seeking to add more functionality to this, it would be awesome to see others contribute to the idea of using Kotlin for these kinds of purposes.

Saturday, May 20, 2017

System76 Galago Pro Review

I have owned a System76 Kudu since Fall of 2016 (which I reviewed here) and it has helped me be much more productive. However, I bought that as a desktop-replacement laptop, a role it excels in, and I needed something more mobile. Thankfully, I pre-ordered the System76 Galago Pro not long after it was announced, and I finally received it a week ago.

Although I have been ridiculously busy wrapping up my second book Learning RxJava, some folks asked if I could write a review. So here it goes:

The Galago Pro is thin, light, sturdy, and beautiful.


Design
  


The System76 Galago Pro has a discreet and ergonomic profile, and is only 0.56" in height. It is light and thin, and perfect for mobile use even if you have to walk and type with one hand. The aluminum casing is a nice touch and helps it feel sturdy.

Aluminum casing helps this laptop feel sturdy, and it looks cool

The keyboard has great response. The resistance on the keys feels just right. The placement and spacing between them does not feel cramped and it feels even better than my 17" System76 Kudu, so typing is pretty fluid. The trackpad is smooth and recognizes gestures without issue as well.

The keyboard is not cramped and its design feels optimized.


Hardware

I did not upgrade a lot of the hardware when I bought my Galago Pro. I kept it pretty modest as shown below. As a developer working on open-source projects and writing books, this configuration is plenty.

Ubuntu 17.04 (64-bit)    
1× 13.3" Anti-glare 3K HiDPI Display    
Intel® HD Graphics 620    
3.1 GHz i5-7200U (2.5 up to 3.1 GHz – 3MB Cache – 2 Cores – 4 Threads)    
4 GB DDR4 at 2133MHz (1× 4 GB)    
250 GB M.2 SSD ($59.00)
No 2nd Drive    
United States Keyboard    
WiFi up to 433 Mbps + Bluetooth    

Although it does not affect me, it is too bad international layouts are not available for the keyboard. I know a few folks in Europe who would like to order a System76 but are not satisfied using stickers on their keyboard. I understand System76 is working on this though.

Finishing the final chapter of Learning RxJava on my Galago Pro.

I upgraded to a 250 GB drive, but went with the default M.2 SSD instead of the much more expensive PCIe option. For my purposes, I found this to boot quickly and perform fast enough. I do not do a lot of video or picture editing where an ultra-fast drive can make a difference.

I wish the battery were more ambitious than 4-5 hours. You could probably squeeze more out of it by using airplane mode and lowering screen brightness. But I've found doing word processing with Internet gives me about 4-5 hours. If I'm using an intensive IDE like IntelliJ IDEA and writing Kotlin code (with Internet), it gravitates towards 3-4. I understand this is about the same performance as the current MacBook Pro, so this is not bad. But it would be awesome to see the boundaries of battery life pushed further with an ambitious machine like this.

Of course, the big selling point with the Galago Pro is the ports. It has plenty of them!

Lock, Ethernet, SD/MMC, HDMI, Mini DisplayPort, USB, USB-C (w/ Thunderbolt 3)


Power, SIM, USB, microphone, and headphone

Unlike the recent MacBook, you will likely not need any dongles here. It is impressive how many ports have been packed into such a thin device. What I found most intriguing is how System76 fit the Ethernet jack, which has a door that flips down to hold the Ethernet cable as shown below:


The Ethernet port has a clever collapsing door

I am glad System76 was not quick to slash the Ethernet port but rather found an innovative way to include it in the design of the laptop. While I would not deliberately test this, the door feels pretty sturdy against my everyday abuse of plugging a cable in and out. It is also level with the table top when a cable is inserted.

Having an Ethernet port is especially life-saving when you encounter WiFi driver issues and need an Internet connection to download the missing drivers.

Setup

System76 ships its computers with Ubuntu, but I prefer to use the Linux Mint distro. While Linux Mint is based on Ubuntu, I find Linux Mint to provide a much more fluid experience and "just works" when it comes to usability (although it is promising what System76 is doing with the GTK "Pop" theme). Normally, putting your own Linux distro on a System76 machine is a problem-free experience. Just make sure to install the System76 drivers.




The Galago Pro worked smoothly with the default Ubuntu installation. However, I ran into a driver problem when I installed Linux Mint 18.1 (the latest version at the time of writing). It has an older Linux kernel version that does not include drivers for the Galago Pro's new hardware, including the Intel wireless chip. This meant I had no wireless Internet to solve the problem, but thankfully the Ethernet port came in to save the day. I updated the Linux kernel and then everything worked.
 
System76's customer service is always stellar and, unlike many companies, they are helpful towards tinkerers and hackers. I did send a message to them suggesting that their System76 driver should check the Linux kernel version, and they were immediately responsive and forwarded that to their engineering team. They apologized that I had any difficulties in the first place, as they strive to have everything work even if you use a different Linux distro.

Summary

The System76 Galago Pro is a beautiful machine that feels highly productive for a 13" ultrabook. The keyboard and trackpad feel phenomenal, and the HiDPI screen is beautiful. But what really stands out are the many physical ports to get plugged in, including a clever Ethernet port for those of us who like to be wired.

The only place I wish the Galago Pro pushed the boundaries a bit more is battery life. I get about 4-5 hours with moderate screen brightness and doing everyday work. However, this sounds to be on par with the current MacBook Pro, so it is unfair to cite this as a downside. But in a perfect world, 8 hours would be nice.

If you are looking for a high-quality, mobile alternative to a MacBook, Surface, or other mobile productivity devices, the Galago Pro is great. It truly excels at the intersection between mobility and not cutting corners, and it just looks and feels cool.