Data Scientists Are Responsible For Setting Reasonable Expectations Of Machine Learning And AI Projects

Since 1956, when John McCarthy convinced attendees of the Dartmouth Conference to adopt the term “Artificial Intelligence”, research funding and enthusiasm for the subject have had peaks and troughs.

In particular, in the late 70s and 80s, there were two AI winters, with significantly reduced funding for research into AI (see this great article for an illustrated timeline of AI).

We now see AI in every part of our lives, from movie recommendations on our favourite streaming service to facial recognition at airport passport control. However, despite this, research by PwC has shown that only 4% of executives plan to deploy AI in 2020, compared to 20% in 2019.

And the reason for this is not surprising. Seven out of ten companies report “minimal or no impact from AI so far”, according to a survey of 2,500 executives, conducted as part of a report by BCG and MIT.

One of the reasons for this failure of AI to have an impact may be misaligned expectations between product owners and data scientists.

Instead of promising the world, it is the responsibility of data scientists to manage expectations, ensuring product owners understand the realistic potential of AI initiatives to have an impact and the requirements to reach productionisation.

In this article, I will highlight four crucial discussions data scientists should have with product owners of potential AI (or machine learning; I’ll use the terms interchangeably in this article) initiatives before beginning any development.

At the end of each section, I’ll include a short example of what the output of each conversation might look like for the very simple problem of solving a Rubik’s Cube.

Create a clear problem statement

Many data scientists (myself included) love their job. Maybe not all of it, but the thought of putting your fingers on your keyboard and beginning to develop a new machine learning model using state-of-the-art techniques is what really keeps us going.

However, this often means that data scientists ask themselves “how can we change current processes to include machine learning?”. But we all know the age-old saying “if it ain’t broke, don’t fix it”.

Instead, in the initial discussion with product owners, try to understand their processes and where there are “pain points”, points in their processes where there are problems. Then, write down each of these problems as a problem statement.

A problem statement has two parts: a concise definition of an issue that could be improved upon, and a goal for improving it.

With a problem statement in place, data scientists can now look at where machine learning might be able to help mitigate the issue and meet the goal. AI has now moved from being “an end” to “a means to an end”.

Example: Currently, I spend too much time solving my Rubik’s Cube, so I would like a way to solve it more quickly.

Discuss the required outputs

It may seem strange to talk about the outputs before the inputs (spoiler alert: we’ll talk about those next), but a discussion about the outputs of a proposed solution essentially elaborates the second half of the problem statement, the goal.

One of the best ways to do this is to work backwards. Ask the product owner what they will do with the outputs from the model first. Then you can gauge exactly what the output needs to look like in order to fulfil this.

And take it one step further: write up a schema indicating exactly what the output will look like. It may seem strange to do this so early on, but if a clear problem was identified in the problem statement stage, the solution should just be fitting into an existing process, which already has clearly defined outputs.

Example: The owner will use the instructions to solve the Rubik’s Cube, so will need to be provided with a sequential set of steps listing the face to twist, the direction and the angle. For example, “twist front face anti-clockwise 180 degrees”.
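To make the idea of an output schema concrete, here is a minimal sketch in Python of what one could look like for the Rubik’s Cube example. The names Face, Direction and Move are hypothetical and chosen only for illustration, and restricting the angle to 90 or 180 degrees is an assumption, not something the product owner has specified.

```python
from dataclasses import dataclass
from enum import Enum


class Face(Enum):
    FRONT = "front"
    BACK = "back"
    LEFT = "left"
    RIGHT = "right"
    TOP = "top"
    BOTTOM = "bottom"


class Direction(Enum):
    CLOCKWISE = "clockwise"
    ANTI_CLOCKWISE = "anti-clockwise"


@dataclass(frozen=True)
class Move:
    """One step of the solution: which face to twist, which way, and how far."""

    face: Face
    direction: Direction
    angle: int  # degrees; assumed here to be a quarter or half turn

    def __post_init__(self):
        # Reject anything outside the agreed schema as early as possible.
        if self.angle not in (90, 180):
            raise ValueError(f"angle must be 90 or 180, got {self.angle}")

    def describe(self) -> str:
        # Render the step in the wording the owner asked for.
        return f"twist {self.face.value} face {self.direction.value} {self.angle} degrees"


# The agreed output is an ordered list of steps.
solution = [Move(Face.FRONT, Direction.ANTI_CLOCKWISE, 180)]
for step in solution:
    print(step.describe())
```

Writing the schema down this early means that any later disagreement about the output format surfaces as a failed validation rather than as a surprise at delivery.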

Request the required inputs

Question for the data scientists: how often have you begun a project only to find the data you need doesn’t exist?

Once a problem statement is defined, the data scientist can go away and brainstorm potential solutions. They should also try to understand the problem better, including what data is available and what extra data they will need to implement their proposed solution.

This needs to be communicated with the owners of the data as well as the product owner. Data scientists cannot expect all the data they need to be sitting neatly in a database waiting for them if they never told anyone they needed it.

This is not to say that if we ask for it we will definitely get it, but if data requirements are not laid out early on it is very easy to get very far into development before realising that the proposed solution cannot be implemented as the data cannot be obtained or simply does not exist.

Example: To solve the Rubik’s Cube, we need the colour of each square at the start, for example “top face, right middle square is red”.
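A data requirement like this can also be written down as a small check that the data owner can run before handing anything over. The sketch below is one possible way to do that in Python. The function name validate_cube_state, the layout of a dictionary of 3x3 grids, and the six colour names are all assumptions made for illustration.

```python
from collections import Counter

FACES = ("top", "bottom", "front", "back", "left", "right")
COLOURS = ("white", "yellow", "red", "orange", "blue", "green")  # assumed colour scheme


def validate_cube_state(state):
    """Return a list of problems with the input; an empty list means it is usable."""
    problems = []

    # Every face must be present and given as 3 rows of 3 squares.
    for face in FACES:
        grid = state.get(face)
        if grid is None:
            problems.append(f"missing face: {face}")
        elif len(grid) != 3 or any(len(row) != 3 for row in grid):
            problems.append(f"{face} face is not a 3x3 grid")
    if problems:
        return problems

    # Each known colour must appear on exactly 9 squares.
    counts = Counter(sq for face in FACES for row in state[face] for sq in row)
    for colour in sorted(set(counts) - set(COLOURS)):
        problems.append(f"unknown colour: {colour}")
    for colour in COLOURS:
        if counts[colour] != 9:
            problems.append(f"expected 9 {colour} squares, found {counts[colour]}")
    return problems


# A solved cube, one colour per face, as a simple valid input.
solved = {face: [[colour] * 3 for _ in range(3)] for face, colour in zip(FACES, COLOURS)}
print(validate_cube_state(solved))  # []
```

In this layout, state["top"][1][2] is the right middle square of the top face. The check only confirms that the data is complete and well formed; it does not confirm that the colours describe a cube that can actually be solved.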

Define success

Problem statements are a great way to identify where AI initiatives can be implemented, but they shouldn’t contain any detail. Although they describe a goal, they may not provide a quantitative metric to measure when the goal has been reached.

If a measurable goal is not defined, then it is always possible to call the initiative a failure, as it becomes the target of subjective judgement by the product owner.

Therefore, once a problem statement is selected, the next step is to sit down with the product owner and flesh out the goal. Ask the following questions:

What metric can we use to measure performance?
What is the value of said metric with the current solution?
How much does this metric need to improve for the initiative to be considered a success?

With these three questions, data scientists can get a true understanding of what their solution needs to do to be a success and ensure that everyone has measurable expectations of the outcome.

Example: Currently, the owner takes 181 moves on average to solve a shuffled Rubik’s Cube, based on his 30 attempts. Success will be decreasing the average to 150 over 30 attempts with the proposed solution.
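Once the metric, baseline and target are agreed, the success criterion can be expressed as a check that anyone can run. The following is a minimal Python sketch. The function name is_success is hypothetical, and two interpretations are assumed: that an average of exactly 150 moves counts as success, and that at least 30 attempts must be recorded before the result is judged.

```python
from statistics import mean

BASELINE_AVERAGE = 181   # moves, from the owner's 30 attempts with the current approach
TARGET_AVERAGE = 150     # moves, the agreed definition of success
REQUIRED_ATTEMPTS = 30   # assumed minimum number of attempts before judging


def is_success(move_counts):
    """Return True if the average number of moves meets the agreed target."""
    if len(move_counts) < REQUIRED_ATTEMPTS:
        raise ValueError(
            f"need at least {REQUIRED_ATTEMPTS} attempts, got {len(move_counts)}"
        )
    # Assumption: reaching the target exactly counts as success.
    return mean(move_counts) <= TARGET_AVERAGE


# Hypothetical results for illustration only, not real measurements.
attempts = [148] * 15 + [152] * 15
print(is_success(attempts))  # True
```

The baseline is kept alongside the target so that the size of the improvement being asked for, 31 moves on average, stays visible to everyone involved.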

The above list is by no means all-encompassing. However, these conversations are ones that often occur too late, or not at all, leading to projects failing to meet expectations or failing altogether.

It is also worth noting that some projects will still fail; such is the experimental nature of data science. However, if this is the case, it should be down to limitations of the current state of the art, not bad communication or misaligned expectations.

This article was originally published in The Startup here. Other articles from Jonathan Davis you may enjoy include: