Observational Studies In Statistics
RiteshPratap A. Singh
There are broadly two types of study in statistics, classified according to how data is collected: observational studies and experimental studies. Similarly, there are two types of study according to how data is analysed: exploratory data analysis and confirmatory data analysis.
Observational or Exploratory studies
An observational study merely ‘observes’ and collects data from an existing situation without any intervention or manipulation. Most curiosity-driven basic scientific research (also called ‘blue-skies research’) involves this kind of study.
Examples include a taxonomist exploring Antarctic vistas to survey and collect ice and snow algal samples, or an astronomer observing the night sky to study various astronomical bodies. In both cases, the data collection does not intentionally interfere with the running of the system.
In observational studies there is usually no prior hypothesis. For example, when Charles Darwin embarked on HMS Beagle in 1831, he had no specific hypothesis to test; instead, he keenly observed natural phenomena and later developed a theory to explain them. Observational studies often rely on lucky finds.
For example, consider a bioprospecting pharmaceutical firm screening plant extracts for anticancer drugs. The firm needs to screen a very large number of extracts, on average 250,000, to develop one drug candidate that might turn into a blockbuster drug: a lucky find indeed. Instead of systematic scientific experimentation and hypothesis testing, observational studies depend on a trial-and-error strategy, which is slow and expensive.
Observational studies are also open to misinterpretation when the observer lacks knowledge of a given field. For example, a natural historian might dismiss the froth observed on an Antarctic lake as a trifle, although it might have been caused by a truly unusual phenomenon.
An intertidal biologist surveying a coastal location might intuitively take a soft and fluffy benthic organism for a soft coral or a sea anemone and miss an opportunity to collect a rare specimen of the red seaweed Galaxaura. Even a qualified entomologist sometimes misses interesting insect specimens in tropical rainforests because they show subtle camouflage and other adaptive traits.
Data collection from an existing situation still needs to be carefully planned. For example, for any type of bioprospecting work involving indigenous knowledge, investigators need to obtain free, prior and informed consent (FPIC) from the sample providers; otherwise their data collection would be considered a form of biopiracy. In epidemiological studies, investigators should obtain FPIC from the patients before they collect samples or data.
Merely because subjects are aware that their data is being collected, they might subconsciously modify some aspects of their behaviour: the Hawthorne effect, explained in the previous module. Data collection involving questionnaire surveys also suffers from a number of cognitive biases explained in the previous module.
For biodiversity and taxonomic surveys, sampling needs to be planned systematically in consultation with species distribution maps and with sample survey strategies such as the quadrat method.
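As a rough illustration of planning quadrat sampling in advance, the sketch below lays a grid of square quadrats over a rectangular survey area and picks some of them at random. The function name `random_quadrats`, the rectangular area and the dimensions used are assumptions made for this example; a real survey would adapt the layout to the site and to the species distribution maps.

```python
import random

def random_quadrats(area_width_m, area_height_m, quadrat_size_m, n_quadrats, seed=None):
    """Pick distinct, non-overlapping quadrat cells from a grid laid over the area.

    Returns the (x, y) coordinates, in metres, of each quadrat's lower-left corner.
    """
    cols = int(area_width_m // quadrat_size_m)
    rows = int(area_height_m // quadrat_size_m)
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    if n_quadrats > len(cells):
        raise ValueError("more quadrats requested than grid cells available")
    rng = random.Random(seed)  # a fixed seed makes the sampling plan reproducible
    chosen = rng.sample(cells, n_quadrats)  # sampling without replacement
    return [(c * quadrat_size_m, r * quadrat_size_m) for r, c in chosen]

# Hypothetical plan: five 1 m x 1 m quadrats in a 20 m x 10 m intertidal plot.
plan = random_quadrats(20, 10, 1, 5, seed=1)
for x, y in plan:
    print(f"quadrat corner at x={x} m, y={y} m")
```

Choosing cells from a fixed grid, rather than free random coordinates, guarantees that no two quadrats overlap.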
Cross-sectional studies
A cross-sectional study is a type of observational study in which data is collected from the study units at a single point in time. For example, a natural historian or a geologist exploring Antarctic landscapes collects samples at one point in time and brings them back to the lab for further detailed observation.
Contrast this with an ecophysiologist who studies a particular seaweed community at a coastal location: the investigator follows the development of the seaweeds over a period of time, which is an example of a longitudinal study. Most demographic surveys are cross-sectional studies; a national census is one example.
A census captures information about the population at a given point in time. However, the data obtained from several cross-sectional studies can be used at a later stage to make inferences about longitudinal trends; for example, a statistician can analyse successive censuses of India to infer trends in literacy rates over time.
Most questionnaire surveys are also cross-sectional studies and suffer from various cognitive biases, for instance social desirability bias and recall bias.
Cohort studies, longitudinal studies
A cohort, a term used in epidemiology, is a group of people whose membership in a set is clearly defined. For example, “all smokers in India” is a set, but its membership is not clearly defined.
Does ‘smoker’ mean someone who has smoked a cigarette at any time in their life? Or someone who has smoked a cigarette in the last year? A clear definition of the set is an essential prerequisite for avoiding ambiguities during data analysis.
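One way to remove this ambiguity is to write the membership rule down as an explicit test. The sketch below assumes, purely for illustration, that a smoker is someone who smoked within the 365 days before enrolment; the function name `is_current_smoker`, the window length and the records are all hypothetical.

```python
from datetime import date

def is_current_smoker(last_smoked, reference_date, window_days=365):
    """Return True if the person smoked within `window_days` before `reference_date`."""
    if last_smoked is None:  # the person has never smoked
        return False
    days_since = (reference_date - last_smoked).days
    return 0 <= days_since <= window_days

# Invented records: person id -> date of last cigarette (None = never smoked).
records = {
    "P1": date(2023, 11, 2),
    "P2": date(2015, 6, 30),
    "P3": None,
}
enrolment = date(2024, 1, 1)

cohort = sorted(pid for pid, last in records.items()
                if is_current_smoker(last, enrolment))
print(cohort)  # ['P1']
```

Changing `window_days` changes who belongs to the cohort, which is why the definition has to be fixed before the data is analysed.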
In longitudinal studies, cohorts are followed over time, often towards the future, until the occurrence of an endpoint. An endpoint is a clearly defined outcome of a study or observation: for example, algal eutrophication of a specified intensity, species extinction, presence of a disease, 5-year survival after radical mastectomy, death and so on.
An example of a longitudinal study is to identify a cohort (say, smokers) and follow its members for the next 10 years to study, for example, the development of lung cancer.
Many scientific insights that are today taken for granted as everyday knowledge, like “smoking causes cancer” or “trans-fats and lack of exercise cause coronary artery disease”, are inferences from decades-long, meticulous longitudinal epidemiological studies backed by rigorous statistics; for instance the famous “Framingham Heart Study”, which began in 1948 and is still ongoing.
Longitudinal studies are not restricted to epidemiology. An astronomer might study the characteristics of supernovae over time. As explained previously, ecophysiologists often follow an organism or a community towards the future. Investigators can also follow cohorts or other subjects longitudinally towards the past, in a so-called ‘retrospective study’, as explained later.
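The idea of following each cohort member until an endpoint can be sketched in a few lines. The example below is only an illustration under stated assumptions: the study end date, the member records and the function name `follow_up` are invented, and the endpoint is taken to be a diagnosis date recorded for each member (or `None` if no diagnosis occurred).

```python
from datetime import date

STUDY_END = date(2034, 1, 1)  # hypothetical end of a 10-year follow-up

def follow_up(enrolled, endpoint_date, study_end=STUDY_END):
    """Return (days followed, endpoint_reached) for one cohort member."""
    if endpoint_date is not None and endpoint_date <= study_end:
        # The endpoint occurred during the study: follow-up stops there.
        return (endpoint_date - enrolled).days, True
    # No endpoint observed before the study ended: follow-up stops at study end.
    return (study_end - enrolled).days, False

# Invented cohort: member id -> (enrolment date, endpoint date or None).
cohort = {
    "A": (date(2024, 1, 1), date(2029, 1, 1)),
    "B": (date(2024, 1, 1), None),
    "C": (date(2024, 1, 1), date(2035, 6, 1)),  # endpoint falls after the study end
}

results = {pid: follow_up(enrolled, endpoint)
           for pid, (enrolled, endpoint) in cohort.items()}
events = sum(reached for _, reached in results.values())
print(f"{events} of {len(results)} members reached the endpoint during follow-up")
```

Members who have not reached the endpoint when the study ends still contribute follow-up time, which is why the function returns both the duration and a flag.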
Crossover studies
In crossover studies, the same experimental unit receives more than one treatment or is investigated under more than one condition (or exposure) of the study. Different treatments are given during non-overlapping time periods.
For example, laboratory animals are treated sequentially with various drugs, and blood metabolites are measured for each drug. Apart from observational studies, crossover designs are also often used in clinical experiments.
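A crossover schedule can be sketched as a table in which every unit receives every treatment, one per period. The example below rotates the treatment order from one unit to the next so that each treatment appears in different periods; this rotation is an assumption made for the illustration, not something the text prescribes, and the names `crossover_schedule`, `animal_1` and `drug_A` are hypothetical. Washout intervals between periods are not modelled.

```python
def crossover_schedule(units, treatments):
    """Assign every treatment to every unit, one per period, rotating the order."""
    k = len(treatments)
    schedule = {}
    for i, unit in enumerate(units):
        shift = i % k
        # Period 1 gets treatments[shift], period 2 the next one, and so on.
        schedule[unit] = treatments[shift:] + treatments[:shift]
    return schedule

units = ["animal_1", "animal_2", "animal_3"]
treatments = ["drug_A", "drug_B", "drug_C"]

schedule = crossover_schedule(units, treatments)
for unit, order in schedule.items():
    print(unit, order)
# animal_1 ['drug_A', 'drug_B', 'drug_C']
# animal_2 ['drug_B', 'drug_C', 'drug_A']
# animal_3 ['drug_C', 'drug_A', 'drug_B']
```

Because each list position is a separate period, the treatments given to one unit never overlap in time.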
Prospective cohort studies
A prospective cohort study is a type of longitudinal study that follows cohorts towards the future. For example, a group of HIV-positive persons is identified and followed for the next 10 years to study, for example, the development of AIDS. By definition, prospective studies are always directed towards the future.
A biomedical geneticist might consult a twin registry to identify twins to follow into the future. An ecologist might study the community structure in a quadrat placed in a coastal benthic area for the next 5 years.
Retrospective studies
A retrospective study (a type of longitudinal study) follows the subjects back in time. For example, an epidemiologist can consult a death registry to identify people in a particular locality who have died of lung cancer and study their past to infer plausible causative agents (for example, did the subjects smoke cigarettes?).
Investigators can consult a cancer registry (or any other kind of epidemiological registry) to study occupational associations; for example, are occupations involving exposure to suspected carcinogens associated with higher cancer prevalence? All investigations in forensics are retrospective studies.
Case-control studies
Case-control studies are a special type of retrospective study in which two groups are clearly identified and selected for further study. One group, called cases, usually includes patients who meet certain criteria (for example, persons diagnosed with any form of cancer in the last year). The other group, called controls (often, negative controls), usually consists of healthy persons who serve as a comparison for the cases.
The cases and controls are then followed back in time to identify causative factors that might have led to the development of the disease. One example of a case-control design is the Genome-Wide Association Study (GWAS), in which genetic variants across the genomes of cases and controls are compared to spot candidate variants that are more frequent in cases than in controls and might be associated with cancer. Well-known genes involved in the development of cancer include those for p53 and BRCA1.
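A common way to summarise a case-control comparison is the odds ratio: the odds of past exposure among cases divided by the odds of past exposure among controls. The sketch below computes it from a 2x2 table, together with an approximate 95% confidence interval based on the standard error of the log odds ratio. The counts are invented for illustration and the function name `odds_ratio` is an assumption of this example.

```python
import math

def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls, z=1.96):
    """Return (odds ratio, lower bound, upper bound) for a 2x2 case-control table."""
    a, b, c, d = exposed_cases, unexposed_cases, exposed_controls, unexposed_controls
    if min(a, b, c, d) <= 0:
        raise ValueError("all four counts must be positive for this simple formula")
    estimate = (a * d) / (b * c)  # (a / b) divided by (c / d)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of log(OR)
    lower = math.exp(math.log(estimate) - z * se_log)
    upper = math.exp(math.log(estimate) + z * se_log)
    return estimate, lower, upper

# Invented counts: 40 of 100 cases and 20 of 100 controls were smokers.
estimate, lower, upper = odds_ratio(40, 60, 20, 80)
print(f"odds ratio = {estimate:.2f}, approx. 95% CI = ({lower:.2f}, {upper:.2f})")
```

An odds ratio above 1 suggests that the exposure is more common among cases than among controls; on its own it does not establish causation.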
Note that, apart from crossover designs used in experiments, none of the types of study explained above involves experimental manipulation of the subjects; they are, therefore, observational studies.
People selected by the investigators simply live their lives as usual; for example, they follow their usual exercise routine and diet and take only the drugs that their physicians prescribe. Plants or animals that ecologists choose to follow in time are not under any experimental manipulation (controlled nutrients, light exposure and so on).
Summary
- There are broadly two types of study in statistics, classified according to how data is collected: observational studies and experimental studies.
- An observational study merely ‘observes’ and collects data from an existing situation without any intervention or manipulation.
- A cross-sectional study is a type of observational study in which data is collected from many study units at a single point in time.
- A cohort, a term used in epidemiology, is a group of people whose membership in a set is clearly defined. An endpoint is a clearly defined outcome of a study or observation.
- In longitudinal studies, cohorts are followed over time, often towards the future (and at times towards the past), until the occurrence of an endpoint.
- In crossover studies, the same experimental unit receives more than one treatment or is investigated under more than one condition (or exposure) of the study.
- A prospective cohort study is a type of longitudinal study that follows the cohorts towards the future.
- A retrospective study (a type of longitudinal study) follows the subjects back in time.
- Case-control studies are a special type of retrospective study in which two groups, cases and controls, are clearly identified and selected for further study. A Genome-Wide Association Study is a type of case-control study.