Photo by Jackson Simmer on Unsplash

To err is human and mechanical.

It was a cold winter day in Detroit, but the sun was shining bright. Robert Williams decided to spend some quality time rolling around on his front lawn with his two daughters.

His wife and mother-in-law lingered in a corner, sipping warm drinks while chatting away. Suddenly, police officers appeared from nowhere and brought a perfect family day to an abrupt halt.

Robert was ripped from the arms of his crying daughters without an explanation, and cold handcuffs now gripped his hands. The police took him away in no time! His family were left shaken in disbelief at the scene which had unfolded in front of their eyes.

What followed for Robert were 30 long hours in police custody. He was then arraigned on a theft charge with a $1,000 bond.

But no one bothered to ask Robert whether he had an alibi for the time of the alleged theft! Luckily for him, he had just published a video on Instagram from another location at the time of the robbery, thus giving him solid digital proof.

But what went wrong? The Michigan State Police collected all the footage related to the theft and used facial-recognition software to identify potential suspects from a database of 49 million images.

Robert Williams was the unlucky person identified by the algorithm. This collapse of justice highlights two crucial issues.

First of all, facial recognition is not perfect; in fact, it is only 95% accurate. Of course, most people would probably consider this percentage pretty high.

But what is the impact of that 5% error? That tiny error had a tragic effect on Robert Williams and his family because they are now scarred for life.

Furthermore, if Robert had not posted a video on Instagram at the time of the robbery, he might still be in jail today. Some people might argue that this is a small price to pay for the added protection these systems offer. After all, it is just one tiny error, or is it?

In 2009, China began deploying a nationwide social credit system. It set up a network of more than 200 million public cameras and an army of sensors.

This data, combined with all the information harvested from mobile devices, gave Chinese authorities precise information about their citizens. The idea behind the system is to promote “trustworthy” behaviour.

If a person contravenes a Chinese law (such as not paying taxes, driving through a red light or smoking on a train), that person’s social credit score is reduced.

If they act according to the law, their social credit score goes up. In fact, in 2018 alone, 24 million Chinese people were banned from travelling because their social credit score was too low.

The problem is compounded further when one considers China’s entire population, which is close to 1.4 billion people. An error rate of 5% in their camera system could misclassify and negatively affect 70 million people.
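The back-of-the-envelope figure above is easy to check. The sketch below is only an illustration: the helper name `expected_misclassified` is made up here, and it assumes the crudest possible reading of a 5% error rate, namely that every person or image is misjudged with the same probability.

```python
def expected_misclassified(population: int, error_percent: int) -> int:
    """Expected number of people misjudged at a uniform error rate."""
    # Integer arithmetic avoids floating-point rounding noise.
    return population * error_percent // 100


# Figures quoted in the text.
china_population = 1_400_000_000
michigan_database_images = 49_000_000

print(expected_misclassified(china_population, 5))          # 70000000
print(expected_misclassified(michigan_database_images, 5))  # 2450000
```

Under the same crude assumption, the 49-million-image database from the Williams case would contain well over two million images at risk of being mishandled, which suggests his arrest need not have been a one-off.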

So really and truly, even though AI systems have performed incredible feats in recent decades (in some cases even surpassing human levels), we are not there yet, and their impact can be disastrous if not managed well.

Bias is in the electronic eye of the beholder.

Errors are not the only things we should be concerned about. Bias is another problematic issue when we are creating AI algorithms.

An analysis of state-of-the-art self-driving car systems conducted by the Georgia Institute of Technology found that the sensors and cameras used in these cars are more likely to detect a person with a lighter skin tone.

This means that a self-driving vehicle is less likely to stop before crashing into a person with darker skin. This held true for all the systems tested, and the accuracy level decreased by 5% for people with darker skin.

Some people might argue that a machine cannot be racist because it does not understand the notion of race. That is true, and in fact, the problem doesn’t lie with our algorithms’ inherent nature.

The issues which emerge are a reflection of the underlying data which we use to train our algorithms. This was proven a few months ago when a new tool called Face Depixelizer was released.

The idea behind the program is to take a pixellated photograph of a person as input. The system will then process it using AI and return a sharp, accurate image of the face.

When the system was given a pixellated image of former US President Barack Obama as input, the algorithm turned him into a white man even though the pixellated picture was very recognisable with the naked eye.

The same algorithm was tested with pixellated images of Alexandria Ocasio-Cortez, the US congresswoman with Latina roots, and Lucy Liu, the famous Asian-American actress. In both cases, the algorithm reconstructed their faces to look white.

These algorithms are susceptible to the data used for training. As George Fuechsel, an early IBM programmer, once said: “garbage in, garbage out”. These algorithms are a blank slate; if they are trained on unbalanced datasets, they will automatically produce wrong classifications.

Because of this, data scientists have to go to great lengths to ensure that their dataset is representative of the domain being modelled. This is not easy when one considers that the world we live in is already biased.

A quick search on Google for nurses finds 800k references for female nurses and 600k for male nurses. Another search for doctors finds 2 million links to male doctors against 1.5 million for female doctors.

The numbers reflect the familiar gender stereotypes whereby nurses are typically portrayed as female and doctors as male.

If these are fed directly to an algorithm without first balancing the datasets, the algorithm will learn the biases and apply them in its judgement (thus skewing the results’ validity).
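One common (though by no means the only) way to balance a dataset is to oversample the under-represented group. The sketch below is a toy illustration: the records are invented and scaled down from the nurse search counts above, and `oversample` is a hypothetical helper written for this example, not part of any library.

```python
import random
from collections import Counter


def oversample(records, key, seed=0):
    """Duplicate randomly chosen records until every group is equally large."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    groups = {}
    for record in records:
        groups.setdefault(record[key], []).append(record)

    target = max(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group)
        # Top up smaller groups with random repeats of their own records.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


# Invented toy data: 8 female and 6 male nurse records.
nurses = [{"job": "nurse", "gender": "female"}] * 8 + [{"job": "nurse", "gender": "male"}] * 6

balanced_nurses = oversample(nurses, key="gender")
gender_counts = Counter(record["gender"] for record in balanced_nurses)
print(gender_counts)
```

After the call, both groups hold eight records each. Oversampling only repeats what is already there, so it evens out the counts but cannot add variety that the original data lacks.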

In 2015, Amazon fell victim to this issue. Since the company recruits thousands of people worldwide, its engineers decided to create an AI-powered hiring tool.

After using it a few times, it became evident that the tool was discriminating against women. The reason was the following. When the engineers designed the AI, they needed to train the algorithm on some data.

It dawned on them that the current employees of Amazon fitted the company culture perfectly, had the right qualifications, and performed to the company’s satisfaction.

The engineers reasoned that if the AI learns to recognise people similar to those employed already, it will successfully identify promising candidates.

Thus, they used the curricula vitae (CVs) of their current employees to train the algorithm. However, they did not take into account that the workforce in the tech industry is 75% male.

So the bias was inevitable! Amazon’s system taught itself that male candidates were preferable and penalised resumes that included the word “women”.

It also preferred candidates who littered their resumes with certain verbs which have a male connotation.

This was not an isolated case. Microsoft too worked on an AI, which developed many negative traits in the end. It all began as an experiment in conversational understanding.

Microsoft created a Twitter chatbot called Tay, whose task was to engage with people through casual conversation. The conversation’s playfulness didn’t last long, and Tay started hurling misogynistic and racist comments, sprinkled with swear words, at people.

The chatbot began as an almost blank slate, and in less than 15 hours, it managed to capture the negative sentiment of the people who interacted with it. It’s as if you raised a child in a jungle and expected them to emerge literate. It just doesn’t work like that.

Furthermore, the online world is populated with diverse opinions, and AI systems can quickly go astray if they are not guided.

In fact, Tay managed to process thousands of messages during its short lifetime and churn out almost 2 messages per second.

One can only imagine the extent of the damage which such a system can inflict on people. This event should also remind us that even large corporations like Microsoft sometimes forget to take preventive measures to safeguard us against rogue AI.

Time to explain yourself

To address these issues, a new subfield of AI called Explainable AI (XAI) has emerged in recent years. The purpose is to create algorithms that do not conceal their underlying workings but rather explain them fully and transparently.

We’ve already seen various examples where the results produced by AI algorithms proved to be problematic. But there were other cases where the problems were not so obvious.

Computer scientists created systems that seemed to work well, and people blindly trusted the results, but they were not always correct.

The amazing snow classifier

Distinguishing a husky from a wolf is not a straightforward task, yet algorithms manage to obtain 90% accuracy. Or do they?

A recent experiment at the University of Washington sought to create a classifier capable of distinguishing images of wolves from images of huskies.

The system was trained on several photos and tested on an independent set of pictures, as is usual in machine learning. Surprisingly, even though huskies are very similar to wolves, the system managed to obtain around 90% accuracy.

The researchers were ecstatic with such a result. However, on running an explainer function capable of justifying why the algorithm managed to obtain such good results, it became evident that the model was basing its decisions primarily on the background.

Wolf images usually depicted a snowy environment, while pictures of huskies rarely did. So rather than creating a wolves vs huskies classifier, the researchers had unwittingly created a great snow detector.
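The trap can be reproduced on a toy scale. The sketch below is not the model or the explainer used in the study; it is a minimal illustration with invented data, in which each photo is reduced to two made-up yes/no features (`snow` in the background and a `long_snout`), and the "classifier" is simply the single feature that scores best on the training photos.

```python
def make_photos(label, snow_count, snout_count, total=10):
    """Build `total` toy photos; the first few get each feature switched on."""
    return [
        {"label": label, "snow": int(i < snow_count), "long_snout": int(i < snout_count)}
        for i in range(total)
    ]


def rule_accuracy(photos, feature):
    """Accuracy of the one-line rule: 'it is a wolf if `feature` is present'."""
    hits = sum((photo[feature] == 1) == (photo["label"] == "wolf") for photo in photos)
    return hits / len(photos)


def best_feature(photos, features):
    """Pick the single feature whose rule scores highest on the given photos."""
    return max(features, key=lambda feature: rule_accuracy(photos, feature))


features = ["snow", "long_snout"]

# Training photos (invented): wolves mostly in snow, huskies mostly not.
training = make_photos("wolf", 9, 7) + make_photos("husky", 1, 3)
# Swapped backgrounds (invented): wolves on grass, huskies in snow.
swapped = make_photos("wolf", 0, 7) + make_photos("husky", 10, 3)

chosen = best_feature(training, features)
print(chosen)                                # snow
print(rule_accuracy(training, chosen))       # 0.9
print(rule_accuracy(swapped, chosen))        # 0.0
print(rule_accuracy(swapped, "long_snout"))  # 0.7
```

The snow rule scores 90% on photos that look like the training set and fails completely once the backgrounds are swapped, while the weaker but genuine feature keeps working. Inspecting which feature was chosen plays the role of the explainer here.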

This error was not evident just by looking at traditional performance measures like accuracy! Of course, this toy experiment was lab-based, but what would happen if a patient’s life depended on such an AI?

Missing link

Classifying a patient’s risk level is no simple feat, especially considering the responsibility that the decision carries.

In the 1990s, the University of Pittsburgh conducted a study to predict the risk of complications in pneumonia patients. The goal was to figure out which pneumonia patients were low-risk and which were high-risk.

Low-risk patients were sent home and prescribed a cocktail of antibiotics and chicken soup, whilst the rest were admitted to the hospital.

The system, designed around an Artificial Neural Network (ANN) architecture (a brain-inspired model), analysed no fewer than 750,000 patients in 78 hospitals across 23 states with a death rate of 11%. Surprisingly, it managed to obtain a high precision of around 86%.

When the system was tested with actual patients, the doctors noticed a severe fault. Patients with pneumonia who also suffered from asthma were being classified as low-risk.

The doctors immediately raised their eyebrows because the misclassification was severe.

They flagged the issue with the programmers, and the system was sent back to the drawing board. The computer scientists analysed it thoroughly, yet they could not find any faults either in the data or in the existing system.

When they tried to delve further into how the system reached such a conclusion, they immediately hit a wall. The ANN at the heart of the system is inspired by the inner workings of the brain, but it is usually considered a black box.

This means that we give it an input, we get an output, but we cannot clearly see the algorithm’s internal workings. This makes the task of finding an explanation extraordinarily complicated and, in some cases, impossible to achieve for a human.

To overcome this hurdle, they built a rule-based system on top of the ANN architecture. In so doing, they were able to read and understand the rules being generated by the system.

Surprisingly, the rule-based system reached the same conclusions as the ANN. However, the researchers further discovered that, according to the data, patients who suffered from pneumonia and were asthmatic had a higher recovery rate than the others.

What the algorithm missed was the reason why they were getting better. It was definitely not because they were asthmatic! The reason was that in the past, asthmatic patients were automatically flagged as high-risk by the doctors.

Because of this, they were immediately admitted to intensive care, which resulted in better recovery rates than those of regular patients.

This proves two things. First of all, human intuition is essential, since the doctors immediately flagged the issue when confronted with the results of the automated system.

Second, it should remind us that correlation does not imply causation.
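A few lines of arithmetic show how such a misleading pattern can arise. The cohort below is entirely invented for illustration (it is not the Pittsburgh data), and `death_rate` is a hypothetical helper; the only assumption carried over from the story is that almost all asthmatic patients were sent to intensive care.

```python
# (has_asthma, in_intensive_care) -> (patients, deaths); all numbers invented.
cohort = {
    (True, True): (95, 5),
    (True, False): (5, 2),
    (False, True): (100, 3),
    (False, False): (900, 99),
}


def death_rate(asthma, intensive_care=None):
    """Death rate for one group, optionally restricted to one kind of care."""
    patients = deaths = 0
    for (has_asthma, in_icu), (count, died) in cohort.items():
        if has_asthma != asthma:
            continue
        if intensive_care is not None and in_icu != intensive_care:
            continue
        patients += count
        deaths += died
    return deaths / patients


# Ignoring the treatment, asthma looks protective.
print(f"{death_rate(True):.1%} vs {death_rate(False):.1%}")                # 7.0% vs 10.2%
# Comparing like with like, asthma is the riskier condition.
print(f"{death_rate(True, True):.1%} vs {death_rate(False, True):.1%}")    # 5.3% vs 3.0%
print(f"{death_rate(True, False):.1%} vs {death_rate(False, False):.1%}")  # 40.0% vs 11.0%
```

A model that only sees the first comparison learns that asthma lowers risk. Once the level of care is taken into account, the apparent benefit disappears, because it came from the treatment and not from the condition.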

Too many choices lead to bad decisions.

The decisions we have seen so far are relatively straightforward, but what if things get infinitely more complex? Autonomous vehicles will have to make hard decisions, and we need to be able to understand the rationale behind those decisions.

Luckily for the patients in the pneumonia case, a human being was in control of that system, but what if we have autonomous systems like self-driving cars? Think about the famous trolley problem whereby a self-driving car has no way of avoiding an accident. And as a result, someone will die for sure! What should the AI choose?

Prioritise humans over animals?
Passengers over pedestrians?
Save more lives over fewer?
Prefer women over men?
Prioritise young over old?
Choose those with a higher social status over the others?
Reward law-abiders over criminals?
Or maybe stay on course and don’t intervene?

In most of these cases, there is no right or wrong. Someone will die as a consequence. In reality, we do not have an answer to these queries. But let’s place the argument aside because this is not the end of the issue.

After someone dies, a pertinent question to ask is: who will take responsibility for the accident? The programmer, the corporation that produced the car, or the owner of the vehicle?

The car owner has no direct say in the driving of the vehicle. They can decide the destination, but that’s pretty much it. So it is unlikely that the car owner is at fault.

The corporation that produced the vehicle used a program written by a software developer. It had no say over the code per se, and the software passed the rigorous testing imposed on car companies by national authorities before it was installed in the cars.

The programmer probably used the latest methodologies to develop correct code. Once the vehicle became operational, it started acquiring information based upon its own experiences on the road. In some cases, these cars even learn from the experiences of other vehicles.

Of course, the programmer has little control over what the machine actually knows once it is on the road. So we have come full circle: no one seems to be responsible, and we are none the wiser.

Conclusion

The challenge we are facing is to figure out the best way to push forward AI whilst creating solutions that we can trust and understand. Because of this, the move towards XAI is only natural.

Current AI models are opaque, non-intuitive, and too complex for people to understand, but if we want people to trust AI systems’ autonomous decisions, users must be able to understand those systems, have confidence in them, and manage them effectively.

Apart from the issues mentioned earlier, there are also other considerations such as:

Algorithms and automation are becoming more pervasive. Many business challenges are being addressed with AI-driven analytics, and we’re seeing inroads in domains where the human expert was traditionally uncontested (e.g. medical diagnosis). AI systems can process much more data than human operators; they are faster and don’t get tired. On the other hand, we have to be careful since biases and erroneous decisions can spread more quickly.
AI models are becoming huge with millions of parameters, making them much more difficult to understand by human users. Some systems are being trained on human operators, but this is not sufficient to gain people’s trust. Furthermore, some models offer no control over the logic behind the algorithm, and there is no simple way of correcting any of its errors.
With the introduction of the European General Data Protection Regulation (GDPR) on 25 May 2018, black-box approaches have become much more challenging to use in business since they cannot provide an explanation of their internal workings. GDPR also introduced the right to erasure, also known as the right to be forgotten. So anyone can make a request to any organisation and demand the deletion of their personal data. However, this is much more complex than it sounds. It is not just a matter of removing one’s personal data from search results. It has to be deleted from the knowledge base storing that data, from crash-recovery backups, and from data mirrored in other data centres, and it has to be physically deleted from the hard disk (rather than just removing the link). All of this makes deletion an extremely complicated task.

We are not there yet, and AI should not be considered a magical deity that we follow blindly.

On the other hand, we cannot ignore it and the advantages it can offer us. If we can manage to counter the challenges mentioned earlier, we can genuinely improve many people’s lives.