We tend to infer relationships between the objects that we see, and these inferences bear on how we interpret graphical data. Arrangements of points and lines on a page can encourage us, sometimes quite unconsciously, to make inferences about similarities, clustering, distinctions, and causal relationships that might or might not be there in the numbers. Sometimes these perceptual tendencies can be honestly harnessed to make our graphics more effective. At other times, they will tend to lead us astray, and we must take care not to lean on them too much.
library(tidyverse)
library(glue)
library(gt)

# reshape anscombe data in tidy format
ans <- anscombe |>
  as_tibble() |>
  pivot_longer(x1:y4) |>
  mutate(dataset = case_when(
    str_detect(name, "1") ~ "dataset1",
    str_detect(name, "2") ~ "dataset2",
    str_detect(name, "3") ~ "dataset3",
    str_detect(name, "4") ~ "dataset4"
  )) |>
  mutate(xory = ifelse(str_detect(name, "x"), "x", "y")) |>
  select(-name) |>
  pivot_wider(names_from = xory, values_from = value, values_fn = list) |>
  unnest(cols = c("x", "y"))

# Define a function to make labels for plots
labler <- function(rsq, pval) glue("R = {rsq}, p = {pval}")

# generate linear models, extract p-vals and rsq; generate labels
ans_sum <- ans |>
  group_by(dataset) |>
  nest() |>
  mutate(
    linmod = map(data, \(data) lm(y ~ x, data = data)),
    linmod_s = map(linmod, summary),
    rsq = map_dbl(linmod_s, \(sum) round(sqrt(sum$r.squared), 3)),
    pval = map_chr(linmod_s, \(sum) scales::pvalue(sum$coefficients[2, 4])),
    label = map2_chr(rsq, pval, labler)
  )

ans_plot <- ans |>
  ggplot(aes(x = x, y = y)) +
  geom_point() +
  facet_wrap(. ~ dataset, nrow = 2) +
  theme_bw()

ans_plot
Anscombe’s quartet
ans_plot +
  geom_smooth(method = "lm", se = FALSE)
Anscombe’s quartet
ans_plot +
  geom_smooth(method = "lm", se = FALSE) +
  geom_text(
    inherit.aes = FALSE, data = ans_sum,
    aes(x = Inf, y = -Inf, label = label),
    hjust = 1.1, vjust = -1
  )
Illustrations like this demonstrate why it is worth looking at data. But that does not mean that looking at data is all one needs to do. Real datasets are messy, and while displaying them graphically is very useful, doing so presents problems of its own. As we will see below, there is considerable debate about what sort of visual work is most effective, when it can be superfluous, and how it can at times be misleading to researchers and audiences alike.
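The quartet is striking because the four datasets look so different when plotted yet are almost identical numerically. As a minimal sketch of that point, the base R snippet below computes a few summary statistics for each dataset. The object name `ans_stats` is introduced here purely for illustration and is not part of the original code.

```r
# Summary statistics for each of the four Anscombe datasets (base R only).
# `anscombe` ships with R's datasets package; its columns are x1..x4 and y1..y4.
ans_stats <- do.call(rbind, lapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  fit <- lm(y ~ x)
  data.frame(
    dataset   = paste0("dataset", i),
    mean_x    = mean(x),
    mean_y    = mean(y),
    r         = cor(x, y),
    intercept = unname(coef(fit)[1]),
    slope     = unname(coef(fit)[2])
  )
}))

# Round for display; the four rows come out nearly indistinguishable
print(cbind(ans_stats["dataset"], round(ans_stats[-1], 2)))
```

Only the plots reveal that one dataset is curved, one is driven by a single outlier, and one has almost no variation in x.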
Mapping features of the data onto aesthetics of a graphic
```{mermaid}
%%| collapse: true
%%| code-fold: true
flowchart LR
  A(Quantitative variable) --> B(Position on figure) --> F(Position on X-axis, Y-axis, ...)
  A --> C(Size of visual element)
  A --> D(Color of visual element)
```
```{mermaid}
%%| collapse: true
%%| code-fold: true
flowchart LR
  A(Categorical variable) --> B(Shape of visual element) --> C(Shape of point, stroke of line)
  A --> D(Color of visual element)
```
Categorical variables can be ordered or unordered. Examples?
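One way to make the distinction concrete in R is through factors. The sketch below uses invented values (the tree names and likelihood ratings are hypothetical) to show an unordered factor next to an ordered one.

```r
# Unordered (nominal) categories: the levels have no inherent ranking
species <- factor(c("oak", "pine", "birch", "pine"))

# Ordered (ordinal) categories: the levels have a meaningful order
likelihood <- factor(
  c("likely", "unlikely", "neutral", "likely"),
  levels  = c("unlikely", "neutral", "likely"),
  ordered = TRUE
)

is.ordered(species)     # FALSE
is.ordered(likelihood)  # TRUE

# Comparisons such as ">" are only meaningful for ordered factors
likelihood > "neutral"
```

The distinction matters for plotting because an ordered variable can sensibly use a sequential color scale, while an unordered one usually calls for distinct hues or shapes.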
Mapping features of the data onto aesthetics of a graphic
How to choose the right map?
The map should be capable of encoding the data
e.g. Shapes are a poor choice for encoding quantitative variables
The map should be effective at encoding the data
What makes a mapping effective is the subject of research on how brains work
Identifying the geometry for the plot
The geometry determines how the data are shown
In ggplot2, you can choose from a large number of geom_* functions, depending on whether you wish to show points, lines, boxplots, histograms, etc.
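The mapping and geometry steps can be sketched together in ggplot2. The example below is only an illustration: it uses the built-in `mtcars` dataset, and the object names `cars`, `p_points`, and `p_box` are assumptions made for this sketch.

```r
library(ggplot2)

# `mtcars` is a built-in dataset, used here purely for illustration.
# Treat the cylinder count as a categorical variable.
cars <- transform(mtcars, cyl = factor(cyl))

# Quantitative variables mapped to position and size,
# a categorical variable mapped to color; the geometry is points.
p_points <- ggplot(cars, aes(x = wt, y = mpg)) +
  geom_point(aes(size = hp, color = cyl))

# Same data, a different mapping and geometry: one boxplot per category
p_box <- ggplot(cars, aes(x = cyl, y = mpg)) +
  geom_boxplot()

p_points
p_box
```

The `aes()` call declares which variable drives which aesthetic, while the `geom_*` function decides what is actually drawn.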
Setting the scale for the aesthetics
```{mermaid}
%%| collapse: true
%%| code-fold: true
flowchart LR
  A(Quantitative variable) --> B(Position on figure)
  A --> C(Size of visual element)
  A --> D(Color of visual element) --> G(Species 1 should be blue, Species 2 should be green)
```
How to choose the right scale?
The scale should be capable of encoding the data you are presenting
e.g. If you have data on a dozen species, color might be a poor choice because human brains are not good at distinguishing between a dozen colors.
The scale should be efficient at encoding the data you are presenting
e.g. Diverging data (such as “least likely to most likely”) are best represented by a diverging palette
What makes a scale efficient is the subject of research on how brains work
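Setting a scale can be sketched in ggplot2 as follows. The data frame `obs` and its values are invented for illustration, and the specific colors are arbitrary choices rather than recommendations.

```r
library(ggplot2)

# Hypothetical two-species data, invented for illustration
obs <- data.frame(
  species = c("Species 1", "Species 2", "Species 1", "Species 2"),
  x       = c(1, 2, 3, 4),
  y       = c(2.1, 3.9, 6.2, 7.8),
  score   = c(-2, -1, 1, 2)  # e.g. "least likely" (-2) to "most likely" (+2)
)

# Categorical variable: a manual color scale with an explicit value per level
p_species <- ggplot(obs, aes(x, y, color = species)) +
  geom_point(size = 3) +
  scale_color_manual(values = c("Species 1" = "blue", "Species 2" = "green"))

# Diverging quantitative variable: a diverging palette centered on the midpoint
p_score <- ggplot(obs, aes(x, y, color = score)) +
  geom_point(size = 3) +
  scale_color_gradient2(
    low = "firebrick", mid = "grey90", high = "steelblue", midpoint = 0
  )

p_species
p_score
```

In both plots the mapping (a variable to color) is the same kind of choice; the scale is what decides which data values become which colors.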
More research on how humans digest visual information
Work of William Cleveland (The Elements of Graphing Data, 1985; Visualizing Data, 1993)