
How to ensure a data scientist is never productive

We need to start placing a higher value on data scientists’ time than we do on machine time



Devin Petersohn


While data science tools are being optimized to perform well on microbenchmarks, they are becoming more and more difficult to use. Is the benchmark performance worth the human time it costs to get there? (Spoiler: it can take up to 200 years to recoup the upfront cost of learning a new tool, even if the new tool performs 10x faster.)

Time to recoup the cost of learning a new tool (see below for the detailed calculation)

Modin (https://github.com/modin-project/modin) is designed and optimized for data scientists' time, enabling performance without code changes.

Pushing complexity onto the data scientist

Let’s design a system. If we want to ensure data scientists are not productive, the first thing we probably want to do is force them to learn a lot of new and unnecessary concepts for tuning performance, like partitioning and resource management.

To further reduce data scientist productivity, let’s also introduce a completely new API. This has the nice side-effect of system lock-in, making it harder to leave once adopted. In any case, trading human time for machine time is the most effective way to ensure that data scientists are not productive.

I want to do a thought experiment to see exactly what the overhead of learning an entirely new ecosystem and acquiring distributed computing expertise actually costs. Then we can model how much computation a new system would need to save to begin to make returns on that time cost. This way we can see how much productivity the new tool actually costs the user.

Modeling the cost of learning a new tool (that does the same thing)

To model the user, we will treat “proficiency” as growing linearly with time. To keep things simple, let’s say it takes an average of 2 years for a data scientist to become as proficient with a new tool as they were with the existing one.

These 2 years include gaining an understanding of the system’s new requirements, like distributed computing, partitioning, etc. Let’s also say that proficiency and productivity are 1:1 correlated, so proficiency is a proxy for productivity.

Because of the linear relationship, the total productivity lost is half of the 2-year ramp: proficiency averages 50% over that period, so one full year of productivity is lost. According to Glassdoor, the average yearly salary of a data scientist in the United States is $113,000 USD as of this writing.

So by our back-of-the-envelope calculation, the estimated total productivity cost is $113,000 per data scientist; for a team of 5, the loss exceeds $500,000 USD.
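As a minimal sketch, here is the arithmetic so far in Python (the 2-year ramp and the salary figure are the assumptions stated above):

# Back-of-the-envelope productivity cost (assumptions from the text above)
RAMP_YEARS = 2                 # years to regain full proficiency
SALARY_USD_PER_YEAR = 113_000  # average US data scientist salary (Glassdoor)

# A linear ramp averages a 50% shortfall, so half the ramp time is lost.
lost_years = 0.5 * RAMP_YEARS                          # 1.0 year
cost_per_scientist = lost_years * SALARY_USD_PER_YEAR  # $113,000
print(cost_per_scientist, 5 * cost_per_scientist)      # $113,000 and $565,000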

How long will it take to recoup the $113,000 investment on compute?

For simplicity, let’s use the per-hour cost of an AWS m4.4xlarge instance, which as of this writing is $0.80 per hour. The m4.4xlarge has 16 CPU cores and 64GB of RAM.

To recoup the $113,000 cost of the one lost year of productivity, you would need to save, in aggregate, over 16 years’ worth of compute time on this instance ($113,000 ÷ $0.80/hour ≈ 141,250 hours ≈ 16 years). To convert that into CPU-core time, multiply: 16 years × 16 CPU cores = 256 CPU years.
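Continuing the sketch, the break-even compute calculation looks like this (instance price and core count as quoted above):

# How much m4.4xlarge time does $113,000 buy?
HOURLY_RATE_USD = 0.80     # m4.4xlarge on-demand price, as of writing
CORES = 16
HOURS_PER_YEAR = 24 * 365  # 8,760 hours

instance_hours = 113_000 / HOURLY_RATE_USD        # 141,250 hours
instance_years = instance_hours / HOURS_PER_YEAR  # ~16.1 years
print(instance_years * CORES)                     # ~258, i.e. ~256 CPU years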

How many compute years does the average data scientist use in a given day? If we assume a single CPU is running 50% of work hours (which it isn’t), we get 4 hours per workday: 20 of the week’s 168 hours, or roughly 12.5% of the time.

Extrapolating to the entire year, roughly 12.5% of the year is spent running compute with these numbers, so it takes about 8 real years to accumulate one CPU year of productive compute. Remember this number; it will be important shortly.

If we need to save 256 CPU years and the new system is 10x faster (or handles 10x more data in the same time), the same workloads will take about 25 CPU years in the new tool. But wait: it takes 8 real years to accumulate one CPU year. At a 10x improvement, 25 CPU years × 8 real years each means it would take about 200 years to recoup the upfront cost of losing 1 year of productivity!
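The final step of the sketch converts saved CPU years into calendar years (the ~12.5% utilization and the 10x speedup are the assumptions above):

# Converting saved CPU years into real (calendar) years
UTILIZATION = 0.125      # one CPU core busy ~12.5% of the time
SPEEDUP = 10             # assumed speedup of the new tool
CPU_YEARS_TO_SAVE = 256

new_tool_cpu_years = CPU_YEARS_TO_SAVE / SPEEDUP     # ~25.6 CPU years
real_years_per_cpu_year = 1 / UTILIZATION            # 8 real years
print(new_tool_cpu_years * real_years_per_cpu_year)  # ~205, i.e. ~200 years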

This simple calculation cannot possibly reflect every detail of every data scientist’s reality, but the goal is not to perfectly model reality. Its purpose is to demonstrate that the human time cost of coming up to speed on a new ecosystem is so much higher than any compute cost saved that arguments based purely on benchmark performance pale in comparison.

Improved performance does not translate 1:1 into improved productivity, and the benchmarks presented in blogs and at conferences always hide the upfront costs:

  • Do you have to learn a new API to do something you can already do?
  • Do you have to change file formats to get performance?
  • Do you have to tune performance to avoid being punished by a new tool?
  • Do you need to provision resources or request workers for the new tool?
  • How much human time does all of this cost?

Modin: Putting the focus back on the Data Scientist

Modin (https://github.com/modin-project/modin) is a data science platform designed around empowering data scientists without adding complexity and new requirements. It exposes the pandas API, with many other APIs and modes of interaction in the pipeline.

# import pandas as pd
import modin.pandas as pd  # a drop-in replacement!
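The rest of an existing pandas script runs unchanged; only the import differs. A small illustration (the CSV file name below is a hypothetical placeholder):

import modin.pandas as pd  # only the import changes

df = pd.read_csv("trips.csv")  # hypothetical input file
print(df.head())               # same pandas API, same results
print(df.describe())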

Suddenly, our typical data science setup goes from a patchwork of disjoint tools to a workflow without costly conversion between ecosystems.

Modin is disrupting the data science tooling space by prioritizing the data scientist’s time over hardware time. To this end, Modin has:

  1. No upfront cost of learning a new API
  2. Integration with the Python ecosystem
  3. Integration with Ray/Dask clusters (run on what you have; see the sketch below the chart)
  4. Scalability and performance with no changes to existing pandas code
Modin performance scales as the number of nodes increases (with no changes to existing pandas code). Maximum time to start up the cluster was 3 minutes in each case; data from NYC Taxi. No performance tuning was performed. A pandas baseline was not possible at this data scale.
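As a rough sketch of point 3 above, Modin can attach to an already-running Ray cluster; the engine can also be selected with the MODIN_ENGINE environment variable ("ray" or "dask"). The cluster address below is an assumption for illustration:

import os
os.environ["MODIN_ENGINE"] = "ray"  # or "dask"

import ray
ray.init(address="auto")  # attach to an existing Ray cluster

import modin.pandas as pd  # Modin now distributes pandas work across the cluster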

Remember, the goal of data scientists is not to execute individual queries as fast as possible; it is to extract as much value as possible from their data. Tools should work for the data scientist; data scientists shouldn’t have to work for their tools.
