McMaster LibGuides: Data Visualization Guide: Introduction

What is Data Visualization?

Data visualization is a broad term for the use of graphical or pictorial representations of information and data, such as charts, graphs, and maps. Visualizations fall under two general umbrellas: exploration and communication.

Exploration: Analysis and Modelling

These types of visualizations are intended to help the user develop hypotheses about the patterns in the data. The user can visually explore the data to identify trends, notice relationships between variables, see the overall structure, spot anomalies and outliers, and examine the distribution of the values. All of this enables further analysis and yields insights about the data.

Communication: Description and Presentation

These types of visualizations present already known data in a visual form so that patterns in large quantities of information can be seen and understood. They convey the relevant findings or key insights of an analysis that has already been done, in a way that the audience can easily understand.

The benefits of both types of visualization are most evident when they are used at a large scale, where patterns are hard to see in the raw values alone.

These definitions are adapted from "Visualization and Interpretation" by Johanna Drucker.

Why Use Data Visualizations?

People have a huge capacity to take in information visually, so visualization is a way to absorb a lot of data and understand it quickly. It is also a great tool for spotting trends and outliers, exploring relationships, and prompting further questions.

Anscombe's Quartet

Anscombe’s Quartet is a group of four datasets that are nearly identical in their simple descriptive statistics, yet each has peculiarities that would mislead a regression model fitted to it. The datasets have very different distributions and look very different when plotted as scatter plots.

It was constructed in 1973 by the statistician Francis Anscombe to illustrate the importance of plotting data before analyzing it and building models, and to show the effect of outliers and other influential observations on statistical properties. The four datasets yield nearly the same summary statistics, including the mean and variance of the x and y values.

The quartet shows why it is important to visualise data before applying algorithms to build models from it. Plotting the data features reveals the distribution of the samples and helps you identify characteristics such as outliers, the diversity of the data, and its linear separability. In the visualizations you can see that linear regression is only useful for data with a linear relationship and is inaccurate for other kinds of datasets.
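The claim that the four datasets share nearly the same summary statistics can be checked directly. The following is a minimal sketch in Python using only the standard library. The data values are the ones published for the quartet, while the names `QUARTET`, `X_SHARED`, `summarize`, and `summaries` are illustrative choices made for this example and are not part of any library.

```python
from statistics import mean, variance

# Datasets I-III share the same x values; dataset IV has its own.
X_SHARED = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]

# The eleven (x, y) pairs of each dataset in Anscombe's Quartet.
QUARTET = {
    "I": (X_SHARED,
          [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_SHARED,
           [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_SHARED,
            [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return means, sample variances, and the least-squares line for one dataset."""
    x_bar, y_bar = mean(x), mean(y)
    # Sums of squares and cross-products around the means.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return {
        "mean_x": x_bar,
        "var_x": variance(x),
        "mean_y": y_bar,
        "var_y": variance(y),
        "slope": slope,
        "intercept": intercept,
    }


summaries = {name: summarize(x, y) for name, (x, y) in QUARTET.items()}
for name, stats in summaries.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

Rounded to two decimal places, the printed rows are almost indistinguishable, even though dataset II follows a curve, dataset III contains a single outlier, and dataset IV has one extreme x value that determines the fitted line. Those differences only become visible when each dataset is drawn as a scatter plot.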