# Classification Based On Data Analysis

# Confirmatory Data Analysis

RiteshPratap A. Singh

Studies can also be grouped based on the kind of data analysis that is performed after the collection of data.

John Tukey (1961) defined data analysis as “Procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data.”

In a sense, this classification is based on the purpose of collecting the data and conducting the study: is it to confirm an a priori hypothesis, or is it to observe general trends or patterns?

**Confirmatory Data Analysis**

Most experiments in science are conducted with a prior hypothesis in mind; the purpose of such studies is to confirm or disconfirm a priori hypotheses or models.

For example, “light intensity is linearly correlated with algal growth” could be a prior hypothesis. The investigator can then test whether the empirical relationship between the two variables (light intensity and algal growth) is indeed linear, using a model-fitting method such as regression analysis.
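As a minimal sketch of such a regression check, the snippet below fits a straight line by ordinary least squares and reports R² and the residuals. The measurements, units, and variable names are invented for illustration and do not come from any study.

```python
# Hypothetical measurements, invented for illustration (not from any study).
light_intensity = [50, 100, 150, 200, 250, 300]       # assumed arbitrary units
algal_growth = [0.21, 0.39, 0.62, 0.80, 1.01, 1.18]   # assumed arbitrary units


def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x.

    Returns (slope, intercept, r_squared).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return slope, intercept, 1 - ss_res / ss_tot


slope, intercept, r_squared = fit_line(light_intensity, algal_growth)

# Residuals with no systematic curve are consistent with a linear relationship.
residuals = [yi - (intercept + slope * xi)
             for xi, yi in zip(light_intensity, algal_growth)]

print(f"slope={slope:.5f}, intercept={intercept:.3f}, R^2={r_squared:.4f}")
print("residuals:", [round(r, 3) for r in residuals])
```

A high R² alone does not establish linearity, so it is worth inspecting the residuals as well. A formal confirmatory test would also need a significance test on the slope, which this sketch omits.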

Confirmatory data analysis leaves substantial scope for scientific misconduct. For example, an investigator might want to confirm a prior hypothesis that a certain plant extract has anti-cancer activity. In the effort to prove such a hypothesis, a number of investigators resort to misconduct.

For example, an investigator may deliberately pick the values from a large data set that agree with the prior hypothesis, while suppressing the data that disagree with it (a tactic known as ‘cherry picking’).

The researcher could also resort to the malpractice of ‘data massaging’: deliberately smoothing and manipulating the data, altering values, and manually removing extreme values (outliers) in order to bolster support for the argument.

**Exploratory Data Analysis**

In contrast, in exploratory data analysis there is no prior hypothesis to test. Data are collected with no a priori arguments or hypotheses. Observational studies mostly follow this approach.

The collected data is subjected to Exploratory Data Analysis to see general trends and patterns. For example, an investigator might simply collect flora from Antarctica out of sheer curiosity without any prior hypothesis.

Most big data analyses are exploratory in nature: for example, analysing a whole genome to find interesting genes, or analysing a customer’s purchase history to infer his or her consumption habits and lifestyle.

Exploratory data analysis typically starts with summarizing the data through tables, charts (data visualization), and descriptive statistics. Inferential statistics can also be used to reveal general trends and patterns.
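A first descriptive summary of this kind can be sketched with Python's standard `statistics` module. The leaf-length values and variable names below are hypothetical, chosen so that one extreme value is present.

```python
import statistics

# Hypothetical leaf lengths in millimetres (invented for illustration).
leaf_lengths_mm = [12.1, 13.4, 11.8, 14.0, 12.9, 13.1, 25.0, 12.5]

summary = {
    "n": len(leaf_lengths_mm),
    "mean": statistics.mean(leaf_lengths_mm),
    "median": statistics.median(leaf_lengths_mm),
    "stdev": statistics.stdev(leaf_lengths_mm),   # sample standard deviation
    "min": min(leaf_lengths_mm),
    "max": max(leaf_lengths_mm),
}

for name, value in summary.items():
    print(f"{name:>6}: {value:.2f}")
```

In this made-up sample the single large value pulls the mean above the median. Comparing the two is a quick way to notice skew or outliers before any further analysis.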

For example, to test whether two variables are correlated, a correlation analysis can be performed, and to test the association between an exposure and an outcome, an odds ratio can be used. In exploratory data analysis involving correlation and association, the problem of statistical confounding is rampant.
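Both measures can be sketched in a few lines. The paired values and the 2×2 counts below are hypothetical, and the function names are illustrative rather than taken from any particular library.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def odds_ratio(exposed_cases, exposed_noncases, unexposed_cases, unexposed_noncases):
    """Odds ratio from a 2x2 table: (a * d) / (b * c)."""
    return (exposed_cases * unexposed_noncases) / (exposed_noncases * unexposed_cases)


# Hypothetical paired measurements.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
r = pearson_r(x, y)

# Hypothetical 2x2 table: 30 of 100 exposed and 10 of 100 unexposed developed the outcome.
or_value = odds_ratio(30, 70, 10, 90)

print(f"Pearson r = {r:.3f}")
print(f"Odds ratio = {or_value:.2f}")
```

An odds ratio above 1 indicates that the outcome is more common among the exposed in this table. Neither number says anything about causation.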

**Statistical confounding**

A confounder (or confounding variable) is an extraneous variable in a statistical model that correlates (directly or inversely) with both the dependent variable and the independent variable. For example, a study found a high prevalence of breast cancer in developed countries.

What could have caused such a high prevalence of breast cancer in rich countries despite their quality of life? Such a result would naturally trigger a media debate, and a number of possible causative factors (or hypotheses/opinions/models) would be proposed, including the sedentary lifestyle of economically privileged people or their fast-food habits.

However, the real reason for this disparity was the fact that routine screening of the population through mammography was a standard practice in developed countries and this diagnostic screening would naturally reveal many more cases compared with no screening, as in the case of developing countries.

Consider yet another example: “a 1998 study found a strong association between ice cream sales and higher incidences of drowning.” In statistics, a correlation between two variables does not mean that one variable caused the other.

Concluding from the above study that ice cream sales caused drowning would be outrageous. The extraneous factor that influences both of the studied variables could simply have been the season: in the summer months, ice cream sales are high, and so is the prevalence of water sports such as swimming.
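The seasonal explanation can be illustrated with a small simulation. In the sketch below, both quantities depend only on the season and never on each other, yet they correlate strongly when the seasons are pooled. All means, spreads, sample sizes, and names are assumptions made for illustration.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Assumed seasonal averages: (season, mean ice cream sales, mean drowning index).
seasons = [("summer", 100.0, 10.0), ("winter", 40.0, 3.0)]

days = []
for season, mean_sales, mean_drowning in seasons:
    for _ in range(200):
        sales = rng.gauss(mean_sales, 10.0)        # driven by season only
        drowning = rng.gauss(mean_drowning, 2.0)   # driven by season only
        days.append((season, sales, drowning))


def correlation(rows):
    return pearson_r([row[1] for row in rows], [row[2] for row in rows])


r_all = correlation(days)
r_summer = correlation([d for d in days if d[0] == "summer"])
r_winter = correlation([d for d in days if d[0] == "winter"])

print(f"pooled r = {r_all:.2f}")
print(f"summer r = {r_summer:.2f}, winter r = {r_winter:.2f}")
```

Splitting the data by the suspected confounder (stratification) is one simple way to check whether an association survives once that variable is held fixed.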

It is the increase in water sports activity during summer, not the sale of ice cream, that might have led to more drowning incidents.

Consider yet another statement: “There are four times more fatal accidents by motorcycles in India compared with cars, and therefore motorcycles are not safe”. When interpreting this statement, it is necessary to know that the number of motorcycles plying the streets of India is almost ten times the number of cars. If the two kinds of vehicle were equally risky, one would therefore expect about ten times as many motorcycle accidents as car accidents. The fact that motorcycle accidents are only fourfold would, in fact, suggest that, per vehicle, motorcycles are safer than cars.
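The arithmetic behind this argument is a comparison of rates rather than raw counts. The sketch below uses only the two ratios quoted above, expressed in relative units; the variable names are illustrative. It compares risk per vehicle only and does not account for distance travelled or number of riders.

```python
# Relative units taken from the ratios in the text: cars are the baseline (1.0).
car_count = 1.0
motorcycle_count = 10.0          # about ten times as many motorcycles as cars

car_fatal_accidents = 1.0
motorcycle_fatal_accidents = 4.0  # four times as many fatal accidents

# Fatal accidents per vehicle for each type.
car_rate = car_fatal_accidents / car_count
motorcycle_rate = motorcycle_fatal_accidents / motorcycle_count

# Ratio below 1 means fewer fatal accidents per motorcycle than per car.
rate_ratio = motorcycle_rate / car_rate

print(f"fatal accidents per vehicle, motorcycles relative to cars: {rate_ratio:.1f}")
```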

**Meta-analysis**

Meta-analysis is the analysis of analyses; it compares data obtained from various independent investigations.

For example, one meta-analysis combined hundreds of investigations on whether butter is good for human health, and concluded that butter is moderately bad for humans. In a meta-analysis the data are collected from the published literature; no fresh data collection is involved.

Because a meta-analysis simply combines results from other studies, several biases are involved. Investigators can choose which papers to include in the analysis with the ulterior motive of confirming their own a priori opinion.

Because the published literature suffers from publication bias and positivity bias (it is very difficult to get negative results published in standard journals), meta-analyses can magnify this problem many times over.
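One common way to combine study results, shown here as a sketch rather than as the method of any meta-analysis mentioned above, is fixed-effect inverse-variance pooling: each study's effect estimate is weighted by the inverse of its squared standard error. The three studies below are hypothetical.

```python
import math

# Hypothetical study results: (effect estimate, standard error).
studies = [(0.10, 0.10), (0.30, 0.20), (0.20, 0.10)]

# More precise studies (smaller standard error) receive larger weights.
weights = [1 / se ** 2 for _, se in studies]

pooled_effect = sum(w * effect for w, (effect, _) in zip(weights, studies)) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))

# Approximate 95% confidence interval under a normal approximation.
ci_low = pooled_effect - 1.96 * pooled_se
ci_high = pooled_effect + 1.96 * pooled_se

print(f"pooled effect = {pooled_effect:.3f} (SE {pooled_se:.3f})")
print(f"95% CI: {ci_low:.3f} to {ci_high:.3f}")
```

The pooled estimate is only as good as the set of studies fed into it, so the choice of which studies to include matters as much as the arithmetic.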

Systematic reviews are similar, but involve no analysis of scientific data. They comprehensively review primary research articles (especially randomized controlled trials) and synthesize an overall picture of the efficacy of a new intervention (a drug, a procedure, etc.).

**Predictive analytics**

As the name suggests, predictive analytics makes predictions about future or unknown events based on prior information. A simple example is the generation of a calibration curve (standard curve) in analytical chemistry: spectroscopic absorbance is measured for solutions of various known solute concentrations, and the values are plotted to generate a curve. With this curve in hand, we can determine (with a margin of error) the solute concentration in a test sample from its absorbance value alone.

Predictive analytics using more complicated models (commonly regression models) has immense utility in personalized medicine. Data from various dimensions (genotype, blood results, BMI, age, various risk factors and so on) can be subjected to predictive analytics to make inferences about the risk of developing certain diseases in future, or about the expected efficacy of various drugs.
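The calibration-curve example above can be sketched as a straight-line fit followed by an inverse prediction. The standards, the unknown sample's absorbance, and all names are hypothetical, and the sketch assumes the response is linear over the calibrated range.

```python
# Hypothetical calibration standards (invented for illustration).
concentration_mg_per_l = [0, 2, 4, 6, 8, 10]
absorbance = [0.002, 0.101, 0.198, 0.304, 0.399, 0.502]


def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(concentration_mg_per_l, absorbance)


def predict_concentration(sample_absorbance):
    """Invert the calibration line to estimate concentration from absorbance."""
    return (sample_absorbance - intercept) / slope


# Absorbance of a hypothetical test sample, within the calibrated range.
estimated = predict_concentration(0.250)
print(f"calibration: A = {intercept:.3f} + {slope:.3f} * c")
print(f"estimated concentration: {estimated:.2f} mg/L")
```

Predictions are only trustworthy for absorbance values inside the range covered by the standards. A fuller treatment would also report the uncertainty of the estimate.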

**Summary**

- Data can be classified based on the purpose of collecting it, which in turn depends upon the kind of analysis that we would like to perform with it.
- The two major types of data analysis are confirmatory, which is used to confirm or disconfirm an a priori hypothesis, and exploratory, which is used to infer general trends or patterns in the data with no prior hypothesis.
- Confirmatory analyses are prone to a number of biases; for example, cherry picking and confirmation bias. Exploratory analyses are prone to the problem of statistical confounding.
- Meta-analysis is an analysis of analyses; it systematically analyses results from a number of independent studies to arrive at a conclusion. Systematic reviews are similar, but instead of analysing data, they synthesize the information reported in the reviewed studies.
- Predictive analytics is used to make predictions about future or unknown events from available data. Statistical regression models underlie many of the predictions made in biology.


