Adventures in the TCP stack: Uncovering performance regressions in the TCP SACKs vulnerability fixes

Last month, we announced that the Databricks platform was experiencing network performance regressions due to Linux patches for the TCP SACKs vulnerabilities. The regressions were observed in less than 0.2% of cases when running the Databricks Runtime (DBR) on the Amazon Web Services (AWS) platform. In this post, we will dive deeper into our analysis that determined the TCP stack was the source of the degradation. We will discuss the symptoms we were seeing, walk through how we debugged the TCP connections, and explain the root cause in the Linux source.

As a quick note before we jump in, Canonical is working on an Ubuntu 16.04 image that resolves these performance regressions. We plan to update the Databricks platform once that image is available and has passed our regression tests.

A failing benchmark

We were first alerted when one of our benchmarks became 6x slower. The regression appeared after upgrading the Amazon Machine Image (AMI) we use to incorporate Ubuntu’s fixes for the TCP SACKs vulnerabilities.

The Databricks platform’s ami-2.102 image (red) includes the TCP SACKs performance regressions. It is 6x lower with runs averaging ~16 minutes. The above chart shows the cumulative time over 20 runs.

Sign up