Introduction to Data Summarization
RiteshPratap A. Singh
Data summarization refers to presenting a summary of generated data in an easily comprehensible and informative manner.
Presenting the raw data (that is, every individual measurement that was generated) is not practical in many cases.
Consider, for example, an epidemiological study involving blood glucose measurements from a lakh (100,000) samples, or the human genome (which, if printed in full, would occupy 130 volumes and take 95 years to read). Presenting such data in raw form would need a great many printed pages and would convey no easily comprehensible information.
A reader of the raw data could not readily answer basic questions. What are the general trends? Is the distribution roughly normal? Are the data unimodal or multimodal? Are there any obvious clusters? Is there any anomaly?
However, raw data should not be dismissed as useless. It should be carefully archived, either in physical form or in an online archival tool that generates a Digital Object Identifier (DOI), such as LabArchives (https://www.labarchives.com/), and should be available for inspection at any time.
A carefully chosen summary of raw data conveys many trends and patterns of the data in an easily accessible manner. The term ‘data mining’ refers to exactly this: extracting meaningful information from raw data.
For example, what are the genes in the human genome? The way data is presented is a very important, although often overlooked, aspect of statistics. Data summarization comes well before any statistical test; indeed, choosing an appropriate statistical test depends on the general trends of the data revealed in the summarization step.
Tabular vs. Graphical
In general, data can be summarized numerically as a table (tabular summarization), or visually as a graph (data visualization).
A raw dataset can be summarized as a table by grouping the individual measurements (the elements of the dataset) into appropriately labelled bins.
Such a table conveys the patterns of the data at a glance.
Examples of such tabular summarization include Empirical Frequency Distribution, Cumulative Frequency Distribution, Relative Frequency Distribution, contingency table, and a table containing values of various descriptive statistical assessments as explained in a later module.
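As a minimal sketch of the binning idea described above, the Python snippet below builds an empirical, relative, and cumulative frequency distribution in one pass. The function name `frequency_table`, the bin width, and the `readings` list are all hypothetical, made up purely for illustration (they are not data from any real study), and the code assumes integer measurements that are no smaller than `start`.

```python
from collections import Counter


def frequency_table(values, bin_width, start):
    """Group integer values (all >= start) into equal-width bins."""
    # Map each value to the index of the bin it falls in.
    counts = Counter((v - start) // bin_width for v in values)
    total = len(values)
    rows, cumulative = [], 0
    for index in range(max(counts) + 1):
        lower = start + index * bin_width
        frequency = counts.get(index, 0)
        cumulative += frequency
        rows.append({
            "bin": f"{lower}-{lower + bin_width - 1}",
            "frequency": frequency,              # empirical frequency
            "relative": frequency / total,       # relative frequency
            "cumulative": cumulative,            # cumulative frequency
        })
    return rows


# Hypothetical blood glucose readings in mg/dL (illustrative only).
readings = [82, 95, 101, 88, 110, 97, 123, 91, 105, 99, 134, 86]
table = frequency_table(readings, bin_width=10, start=80)

for row in table:
    print(f"{row['bin']:>8} {row['frequency']:>4} "
          f"{row['relative']:>6.3f} {row['cumulative']:>4}")
```

Twelve raw readings collapse into six rows here; with a lakh of samples the table would stay just as short, which is the point of tabular summarization.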
On the other hand, these tables can also be presented, perhaps in a more accessible manner, visually through graphs.
As the saying goes, “a picture is worth a thousand words”: a carefully chosen graph style can summarize the whole dataset so effectively that an investigator spends as little time as possible identifying the general trends.
Examples include stem-and-leaf diagrams, histograms, ogives, time-series line graphs, column/bar graphs, box-and-whisker plots, pie charts, heatmaps, tree representations, network graphs, bubble graphs, contour plots, area charts, scatterplots, dot plots, and so on.
Many of these chart types also have 3-dimensional variants, which are used when the values of three variables are plotted simultaneously.
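Of the graph styles listed above, the stem-and-leaf diagram is simple enough to sketch in plain text. The snippet below is one possible approach, assuming non-negative integer data; the helper name `stem_and_leaf`, the `stem_unit` parameter, and the `scores` list are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def stem_and_leaf(values, stem_unit=10):
    """Return {stem: sorted leaves} for non-negative integers."""
    plot = defaultdict(list)
    for value in sorted(values):
        # e.g. 63 with stem_unit=10 gives stem 6 and leaf 3
        stem, leaf = divmod(value, stem_unit)
        plot[stem].append(leaf)
    # Keep empty stems so that gaps in the data remain visible.
    return {s: plot.get(s, []) for s in range(min(plot), max(plot) + 1)}


# Hypothetical test scores (illustrative only).
scores = [47, 52, 58, 61, 63, 63, 69, 74, 75, 95]
diagram = stem_and_leaf(scores)

for stem, leaves in diagram.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

Unlike a histogram, this display keeps every individual value recoverable while still showing the shape of the distribution, including the empty stem between 75 and 95.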
The choice between a table and a graph is largely a matter of personal preference. However, which tables or graphs are appropriate depends on the level of measurement discussed in the previous module. The table below may be used as a rule of thumb for deciding on an appropriate summarization style:
Note that information conveyed through a table should not be presented again as a graph, or vice versa, as this leads to redundancy.
- Data summarization is the first step in statistics; it is aimed at extracting useful information and general trends from the raw data.
- The two methods of data summarization are tables and graphs.
- Although the choice between a table and a graph is mostly a matter of personal preference, which table or graph is appropriate depends on the level of measurement. For example, Gregor Mendel’s dihybrid cross data fall on the nominal scale; to summarize them, a contingency table or bar graph can be used, but not a scatterplot or a histogram.
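To make the nominal-scale example concrete, here is a rough sketch of how a contingency table could be built from paired categorical observations. The counts used are the figures commonly reported for the F2 generation of Mendel's dihybrid cross in peas (315, 108, 101, 32); the function name `contingency_table` and the layout of the output are assumptions made for this illustration.

```python
from collections import Counter


def contingency_table(pairs):
    """Cross-tabulate (row_category, column_category) observations."""
    counts = Counter(pairs)
    rows = sorted({r for r, _ in counts})
    cols = sorted({c for _, c in counts})
    return {r: {c: counts.get((r, c), 0) for c in cols} for r in rows}


# One (seed shape, seed colour) pair per seed; counts as commonly
# reported for Mendel's dihybrid cross, used here as illustrative input.
observations = (
    [("round", "yellow")] * 315
    + [("round", "green")] * 108
    + [("wrinkled", "yellow")] * 101
    + [("wrinkled", "green")] * 32
)

crosstab = contingency_table(observations)
row_totals = {r: sum(cells.values()) for r, cells in crosstab.items()}

for shape, cells in crosstab.items():
    print(shape, cells, "total:", row_totals[shape])
```

The categories have no numeric order or spacing, so counting co-occurrences is all that can meaningfully be done with them, which is why a histogram or scatterplot does not apply.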
Originally published on Medium