
Using NLTK and FuzzyWuzzy in Python to build failure histories and improve reliability

How NLP can be applied to component reliability analysis in aviation or other machines



Tealfeed Guest Blog


When operators or mechanics see a problem in a machine (a large truck, plane, or ship), they generally describe it and report it in historical fault books or in systems that store this type of information (MROs).

In aviation, every failure must be reported (by the pilot or a mechanic). These failures are usually written up manually and then stored in a computer system. But, being human, we do not always describe the same failure in the same way, and typing errors creep in, which makes it difficult to relate reports to one another or to build a chronic failure history.

Before we begin, let's define what a failure history is:

Imagine a timeline (Figure 1) in which each down arrow is a failure or problem on a machine and each up arrow is a solution to that failure; the color represents the same failure, even though the descriptions are lexically different.

Since there are many failures and solutions over time, there must be some relationship or pattern between them that we probably cannot see with the usual forms of classification (faults are generally grouped by machine system but are not always classified correctly). In this case, for the red fault, solution X is more effective than solution Y.

Figure 1: Failure history for Machine A

So there are two situations we can improve:

  • To be able to group faults in the same failure mode.
  • To be able to determine the best possible solution to this failure mode, which guarantees better reliability over time.

To start, we can define a problem with 3 machines, of which we will focus on only 3 systems; each system has independent failure modes (MF) with different corrective actions. We can see it in Figure 2:

Figure 2: Defined problem

A failure mode (MF) is the exact failure of a component. We will not go into the classification of MFs here, but the list can be assembled with an expert engineer or from component manuals. It is an important input that we need in order to focus the investigation. If we do not have it, we could build it by looking at word repeatability across the systems and then, supported by an expert engineer, defining the failure modes, as in the sketch below.
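As a minimal sketch of that idea (assuming the same faults_systems.csv file used later in the article, with its 'system' and 'fault_name' columns), we could simply count word frequency per system and let an expert turn the most repeated terms into candidate failure modes:

import pandas as pd
import nltk

# Assumed input: the faults_systems.csv used later, with 'system' and 'fault_name' columns
df = pd.read_csv('faults_systems.csv', usecols=['system', 'fault_name'])

# Count how often each word is repeated within each system
for system, group in df.groupby('system'):
    words = ' '.join(group['fault_name'].dropna().str.upper()).split()
    frequency = nltk.FreqDist(words)
    # The most repeated terms are candidate failure modes to review with an expert
    print(system, frequency.most_common(10))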

Noise elimination

Let’s continue with the problem posed in Figure 2. We have a database that contains MF information for the 3 machines in Figure 2 across their 3 systems:

  1. We will look for words that are not related to any of our failures but that the mechanics have written, for example, “Today I found the problem X”; in this text only “X” matters, and everything else must be eliminated. In the literature these kinds of words are known as “stop words”. For this, we will use the Python NLTK library.
  2. After cleaning all our text, it is worth doing a second review of the words we have classified as faults; for this second step it is good to rely on an expert engineer. We will use a technique called tokenization for the word processing.

The process can best be seen in Figure 3:

Figure 3: Text data processing

Therefore, the Python code for tokenization and word processing would be as follows:

Figure 4: Current situation

import pandas as pd
import nltk
from nltk.corpus import stopwords

# nltk.download('stopwords')  # required once if the stop word lists are not installed yet

# Load the fault reports (one row per reported failure)
df = pd.read_csv('/Users/jorgepontigo/text_mining_reliability/raw data/faults_systems.csv', header=0, sep=',', parse_dates=['failure_date'], usecols=[0,1,2,3], dayfirst=True)

text = df["fault_name"]

# Concatenate every fault description into one string
# (a space is added so words from consecutive rows do not fuse together)
cross = ""
for i in range(len(text)):
    cross = cross + " " + text[i]

# Tokenization: split the concatenated text into individual words
tokenization = [t for t in cross.split()]

clean_tokenization = tokenization[:]

# Stop word lists in upper and lower case, for English and Spanish
english_sw_up = [element.upper() for element in stopwords.words('english')]
spanish_sw_up = [element.upper() for element in stopwords.words('spanish')]
english_sw_low = [element.lower() for element in stopwords.words('english')]
spanish_sw_low = [element.lower() for element in stopwords.words('spanish')]

# Remove every token that is a stop word in either language
for token in tokenization:
    if token in english_sw_up:
        clean_tokenization.remove(token)
    elif token in spanish_sw_up:
        clean_tokenization.remove(token)
    elif token in english_sw_low:
        clean_tokenization.remove(token)
    elif token in spanish_sw_low:
        clean_tokenization.remove(token)

# Plot the 20 most frequent remaining words
frequency = nltk.FreqDist(clean_tokenization)
frequency.plot(20, cumulative=False)

In the previous code, we grouped everything written by the mechanics and used NLTK to eliminate the stop words. Even so, the chart in Figure 5 shows that we can keep removing words that are not related to an MF, for example, found, during, form, etc.

Figure 5: Word frequency after the Python cleaning

We will keep cleaning our data: we will eliminate all the words that we know are not failures (we can consult an expert), and we will also eliminate all the numbers, as in the sketch below.
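As a minimal sketch of that extra step, numbers can be stripped with the same pandas string methods used in the rest of the article (the sample rows here are hypothetical):

import pandas as pd

# Hypothetical fault text with numeric noise
df = pd.DataFrame({'fault_name': ['ENGINE 2 OIL LEAK 450 PSI', 'BRAKE WEAR 12MM']})

# Strip digits, then collapse the extra whitespace left behind
df['fault_name'] = df['fault_name'].str.replace(r'\d+', ' ', regex=True)
df['fault_name'] = df['fault_name'].str.replace(r'\s+', ' ', regex=True).str.strip()
print(df['fault_name'].tolist())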

Let’s start development!

Now, looking at the chart above, we can use pandas to write the real code (the previous block was only there to explain tokenization). Using the chart in Figure 5, we will clean our data, remove stop words, then use the FuzzyWuzzy matching library, and then comment on the results.

The FuzzyWuzzy match here is all-against-all; it would be interesting to do it both by system name and all-against-all, and then keep the best result (we can leave this as a next step).
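As a quick illustration of what the match score means before reading the full code, fuzz.ratio returns a 0-100 similarity between two strings, and process.extractOne is one way to keep only the best candidate (the strings and glossary below are hypothetical examples):

from fuzzywuzzy import fuzz, process

# 0-100 similarity between a cleaned fault text and one failure mode
print(fuzz.ratio('HYDRAULIC PUMP LEAK', 'HYDRAULIC LEAK'))

# Keep only the best-matching failure mode from a (hypothetical) glossary
modes = ['HYDRAULIC LEAK', 'PUMP OVERHEAT', 'FILTER CLOGGED']
best_mode, score = process.extractOne('HYDRAULIC PUMP LEAK', modes, scorer=fuzz.ratio)
print(best_mode, score)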

import pandas as pd
import nltk
from nltk.corpus import stopwords
from fuzzywuzzy import fuzz

#DATA LOAD

# Fault reports, one row per reported failure
df = pd.read_csv('/Users/jorgepontigo/text_mining_reliability/raw data/faults_systems.csv', header=0, sep=',', parse_dates=['failure_date'], usecols=[0,1,2,3,4], dayfirst=True)

# Failure-mode (MF) glossary per system
df2 = pd.read_csv('/Users/jorgepontigo/text_mining_reliability/raw data/mode_faults.csv', header=0, sep=',', usecols=[0,1])

# Stop word lists in upper and lower case, for English and Spanish
english_sw_up = [element.upper() for element in stopwords.words('english')]
spanish_sw_up = [element.upper() for element in stopwords.words('spanish')]
english_sw_low = [element.lower() for element in stopwords.words('english')]
spanish_sw_low = [element.lower() for element in stopwords.words('spanish')]

#DELETE STOPWORDS

df["fault_name"] = df['fault_name'].apply(lambda x: ' '.join([word for word in x.split() if word not in english_sw_up]))
df["fault_name"] = df['fault_name'].apply(lambda x: ' '.join([word for word in x.split() if word not in spanish_sw_up]))
df["fault_name"] = df['fault_name'].apply(lambda x: ' '.join([word for word in x.split() if word not in english_sw_low]))
df["fault_name"] = df['fault_name'].apply(lambda x: ' '.join([word for word in x.split() if word not in spanish_sw_low]))

df['fault_name'] = df['fault_name'].str.upper()

#DELETE THE NOISE WORDS IDENTIFIED IN THE CHART OF FIGURE 5

noise_words = ['FOUND', 'FAULT', 'PROBLEM', 'DEFECT', 'PERFORM', 'MAINTENANCE',
               'DISPLAYED', 'DAILY', 'CHECK', 'POSITION', 'REPORTA', 'MSG',
               'TEMPERATURA', 'INSPECTION', 'DUE']
df['fault_name'] = df['fault_name'].str.replace(r'\b(' + '|'.join(noise_words) + r')\b', ' ', regex=True)

df = df.reset_index(drop=True)
df2 = df2.reset_index(drop=True)

# Attach every failure mode of the corresponding system to each fault
df3 = df.merge(df2, on=['system'], how='left', indicator=False)
df3['match'] = 0
df4 = df3

#Fuzzy wuzzy match NLP

# Similarity (0-100) between the cleaned fault text and each candidate failure mode
for i in range(len(df3)):
    df4.loc[i, 'match'] = fuzz.ratio(df3['fault_name'][i], df3['mode_fault'][i])

# Keep only the best-matching failure mode per reported fault
df4['RN'] = df4.sort_values(['match'], ascending=[False]) \
    .groupby(['id_failure', 'fault_name']) \
    .cumcount() + 1

df5 = df4[(df4.RN == 1)]
df5.to_csv('export.csv', encoding='utf-8')

df5 = df5.drop(['RN', 'id_failure', 'failure_date', 'system', 'Machine'], axis=1)
print(df5)

Figure 6: Results

In Figure 6 we have the results of our problem.
The ‘fault_name_clean’ column is the fault text after all the noise has been removed.

The ‘fault_name_original’ column is the text exactly as the mechanic wrote it.
‘mode_fault’ is the best-matching failure mode from the MF glossary of each system.
Finally, the ‘match’ column is the match percentage between ‘fault_name_clean’ and ‘mode_fault’.

In green we can see the matches we obtained; for example, we managed to classify ‘problem vbreak’ (‘id_failure’ = 11) with a 62% match, which is very good! On the other hand, in red we can see that there is still a lot of noise we could remove.

Next step!

Systems can be very diverse on a large machine, and MFs even more so, so it would be useful to do staggered matches: tests within the same system and between systems, finally keeping the best of all the matches (a system has its own MFs, but sometimes the mechanic may pick the wrong system). A possible sketch of this idea follows.
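A minimal sketch of the all-against-all variant, assuming the df (cleaned faults) and df2 (failure-mode glossary with a 'mode_fault' column) frames from the code above: match each fault against every failure mode of every system and keep the overall best, so a wrongly chosen system does not block a good match.

from fuzzywuzzy import fuzz, process

# Every failure mode in the glossary, regardless of the system it belongs to
all_modes = df2['mode_fault'].dropna().unique().tolist()

# Best overall match for each cleaned fault description
results = [process.extractOne(fault, all_modes, scorer=fuzz.ratio)
           for fault in df['fault_name']]
df['best_mode'] = [r[0] for r in results]
df['best_score'] = [r[1] for r in results]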

It would also be nice to explore NLTK further and look at its synonym and antonym options. It is also worth sitting down with an expert to eliminate all the remaining noise, which always exists.
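As a small sketch of that direction, NLTK's WordNet corpus can list synonyms and antonyms of a term (the example word 'leak' is only illustrative), which could help group faults written with different words:

import nltk
from nltk.corpus import wordnet

nltk.download('wordnet')  # one-time download of the WordNet corpus

synonyms, antonyms = set(), set()
for synset in wordnet.synsets('leak'):
    for lemma in synset.lemmas():
        synonyms.add(lemma.name())
        for antonym in lemma.antonyms():
            antonyms.add(antonym.name())

print(synonyms)
print(antonyms)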

We still have to analyze the corrective actions in order to choose the best solution; with this we could have an MF ranking together with its best corrective action. This could be determined by seeing which of the actions lasts the longest before the same fault re-occurs on the same machine.
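A possible pandas sketch of that analysis, assuming df4 from the code above (which still carries the 'Machine' and 'failure_date' columns) filtered to the best matches and extended with a hypothetical 'corrective_action' column recording what was done for each failure:

# 'corrective_action' is a hypothetical column not present in the article's CSVs
history = df4[df4.RN == 1].sort_values(['Machine', 'mode_fault', 'failure_date'])

# Days until the same failure mode re-appears on the same machine
history['days_to_next_failure'] = (
    history.groupby(['Machine', 'mode_fault'])['failure_date'].shift(-1)
    - history['failure_date']
).dt.days

# Rank corrective actions by how long, on average, the machine ran before re-failing
ranking = (history
           .groupby(['mode_fault', 'corrective_action'])['days_to_next_failure']
           .mean()
           .sort_values(ascending=False))
print(ranking)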

We already have the first step, which is to create the failure histories; the remaining steps are what we still need in order to have a complete tool.

This article was originally published by Jorge Pontigo Burgos on Medium.
