Kamil Slowikowski
About Archive

Compare distributions with box plots, not bar plots

In many scientific journals, authors use bar plots to compare two or more distributions. Often, the error bar is only present for the upper limit and not for the lower limit. Sometimes, the bar for the control group has no error bars due to data normalization. Here, I simulate a small experiment to illustrate why this normalization is problematic and why box plots are better than bar plots for comparing two distributions.


Let’s simulate a simple experiment where the control samples have a value centered around 1 and the experimental samples have a value centered around 2. We might imagine that these values represent gene expression or some other measure of interest.

# Set the random seed to reproduce the random numbers.

# Some technical variation is shared by control and experimental groups.
n <- 5
shared_variation <- rnorm(n, mean = 1, sd = 0.5)
dat <- melt(data.frame(
  # The value for controls is centered around 1.
  Control = rnorm(n, mean = 1, sd = 0.25) * shared_variation,
  # The value for experimental samples is centered around 2.
  Experiment = rnorm(n, mean = 2, sd = 0.25) * shared_variation,
  ID = 1:n
), id.vars = 'ID')
ID variable value
1 Control 1.2392120
2 Control 0.4608037
3 Control 1.6991868
4 Control 1.8845041
5 Control 0.7468247
1 Experiment 2.7597986
2 Experiment 1.1681772
3 Experiment 3.1028883
4 Experiment 3.5554806
5 Experiment 1.8724863


With normal distributions, we can use the t-test to compare the distribution of values from the control group and the experimental group. This helps us to determine if the difference between the two distributions is statistically significant.

# Test if the distribution of control values differs from experimental values.
t1 <- t.test(
  x = dat$value[dat$variable == "Control"],
  y = dat$value[dat$variable == "Experiment"]
estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high method alternative
-1.28566 1.206106 2.491766 -2.525504 0.0407433 6.739684 -2.498911 -0.0724094 Welch Two Sample t-test two.sided

After normalizing the data so that the controls are to equal 1.0, we no longer test for a difference between control and experimental distributions. Instead, we now test if the experimental distribution is different from 1.0:

# Normalize Controls to 1.0.
dat2 <- unsplit(lapply(split.data.frame(dat, dat$ID), function(x) {
  x$value <- x$value / x$value[1]
}), dat$ID)

# Test if the experimental distribution differs from 1.0.
t2 <- t.test(
  x = dat2$value[dat2$variable == "Control"],
  y = dat2$value[dat2$variable == "Experiment"]
estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high method alternative
-1.196441 1 2.196441 -8.018042 0.0013126 4 -1.610738 -0.7821436 Welch Two Sample t-test two.sided

In this second test, the mean value for the control samples is 1 instead of 1.2. We can see that the t-statistic -8 is inflated relative to the correct value of -2.5 and the p-value is lower than it should be.

Bar Plots

Journals often publish bar plot figures that do not clearly communicate the results to the reader:

  • The data is displayed as a bar plot of mean or median values.
  • The lower bounds of error bars are omitted.
  • Legends do not explain the meaning of the bars or error bars.

Below, I show the data represented as bars of mean values with error bars of standard deviations.

plot_bars <- function(dat, ttest) {
  # mean by group
  x <- by(dat$value, dat$variable, mean)
  x <- melt(as.data.frame(as.list(x)), id.vars = NULL)
  # sd by group
  x$sd <- as.vector(by(dat$value, dat$variable, sd))
  ggplot(x, aes(x = variable, y = value, group = variable)) +
    geom_errorbar(aes(ymax = value + sd, ymin = value - sd), width = .1) +
    geom_bar(aes(variable, value),
             stat = "identity", fill = "white", color = "black") +
    theme_bw(base_size = 18) +
    labs(x = "", y = "", title = paste("P = ", format.pval(ttest$p.value, 2)))

grid.arrange(plot_bars(dat, t1), plot_bars(dat2, t2), nrow = 1)

plot of chunk two-barplots

Notice that the figure on the right might appear to show that there is a very significant difference between the experimental and control groups.

Normalizing the data so that the controls are equal to 1.0 causes two effects:

  • The technical variation between experiments is hidden.
  • The difference between the control and the experimental groups is exaggerated. This is because we’re comparing the experimental distribution to 1.0 instead comparing it to the control distribution.

Box Plots

A box plot is a good way to compare two or more distributions. Here’s the anatomy of a boxplot created with ggplot2:

plot of chunk ggplot2-boxplot-anatomy

I would like to see data presented with box plots as shown below instead of bar plots. I think the scientific question, “Are the distributions different from each other?”, comes naturally from this kind of presentation.

# Show a box plot of the data.
p1 <- ggplot(dat) +
  geom_boxplot(aes(x = variable, y = value)) +
  theme_bw(base_size = 18) +
  labs(x = "", y = "", title = paste("P = ", format.pval(t1$p.value, 2)))

p2 <- ggplot(dat2) +
  geom_boxplot(aes(x = variable, y = value)) +
  theme_bw(base_size = 18) +
  labs(x = "", y = "", title = paste("P = ", format.pval(t2$p.value, 2)))

grid.arrange(p1, p2, nrow = 1)

plot of chunk two-boxplots

On the left, notice that the range of values in the control group overlaps with the range of values in the experimental group. The variation in the control group tells us something about the amount of technical variation between repeated experiments. If the technical variation is too high, we might have some reason to be skeptical about the reproducibility of the assay.

On the right, we hide the variation between control samples, so we can make no assessment of the technical variation in the experiment. This point is more obvious for the box plot than it is for the bar plot above.