Processing math: 100%
  • Objectives
  • Set-up
  • Center
    • 1. Mean
    • 2. Median
    • 3. Mode
  • Spread
    • 1. Range
    • 2. Variance and standard deviation
    • 3. Interquartile range
    • 4. Coefficient of variation
  • Visual comparison of center and spread

Objectives

It is always important to visually inspect your data with graphs, which we’ll learn more about in the coming weeks, but it’s equally important to describe your data numerically prior to any analysis. In this tutorial, we’ll focus on some commonly used statistics of center and spread. Statistics of center (e.g. mean, median, mode) describe the average or typical value of an individual or observation. Statistics of spread (e.g. variance, standard deviation, IQR) describe the variability of measurements from individual to individual or from observation to observation. This tutorial will teach you how to do the following:

  1. Describe the center and spread of data numerically.
  2. Visually compare the center and spread across groups.

Set-up

You’ll need to install (install.packages()) and load (library()) the following:

library(tidyverse)

Center

1. Mean

The sample mean (ˉY) is the average of the measurements in the sample (Yi) divided by the total number of observations, n. This is old news, you know this:

ˉY=ni=1Yin

This is very easy to do in R:

x <- c(1, 4, 3, 7, 1, 2, 3)
mean(x)
## [1] 3

Try running the following:

x <- c(1, 4, 3, 7, NA, 2, 3)
mean(x)
## [1] NA

Remember that if you have any missing data in your vector, the mean() function will return NA. This is true of any of the base R functions. While this may be frustrating at times, it serves as an important reminder that you have missing data, and that you should consider this when computing any statistics or interpreting any analyses. The work around, should you still want to compute the mean even with missing data, is pretty easy:

mean(x, na.rm=T)
## [1] 3.333333

2. Median

Also old news. The median is the middle observation in a set of data if you ordered them from smallest to largest. To calculate this, you need to sort the date and find the middle value. If there are two middle values because the total number of observations is even, then compute the average of the middle pair. You can compute the median in R as follows:

median(x, na.rm=T)
## [1] 3

3. Mode

The mode is the value that appears most often in a data. Note that unlike mean and median, mode can actually be applied to both numeric and character data.

mode(x)
## [1] "numeric"

Wait, why does the mode() function return "numeric"? Type ?mode in the console. The mode() function tells us the storage mode of an object (i.e. "interger", "double", "character"). Let’s make our own function to compute the mode of a vector{WHAT!? WE CAN BUILD FUNCTIONS!? Yes. We. Can. And this revolutionizes your programming experience. Read more here.]:

# here's one way to build a DIY function to compute the mode
getmode <- function(v) {  # v can be any vector
   uniqv <- unique(v) # compute unique values in that vector
   uniqv[which.max(tabulate(match(v, uniqv)))] 
   # use tabulate to count the number of each value in vector...
   # then find the max number, e.g. the value that appears the most in the vector
}

getmode(x)
## [1] 3

There is no base R mode function, so you have to build your own.^{WHAT!? WE CAN BUILD FUNCTIONS!? Yes. We. Can. And this revolutionizes your programming experience. Read more here.]

Note that our getmode() function will only return NA when NA is the most frequently occurring value in our variable:

x_MEGA_NA <- c(1, 2, 2, 3, 4, NA, NA, NA, 6)
getmode(x_MEGA_NA)
## [1] NA

Spread

Once we know the central tendency of each variable, it’s good to explore how far the other values of the variable are from the central value. This is known as the spread of the data.

1. Range

The simplest indicator of the spread of a variable is the range, or the difference between the maximum and minimum values of the variable. In the example above with our vector x, the range would be the highest value, 7 minus the lowest value, 1, or 6.

range(x, na.rm=T) # don't forget na.rm=T or your range will be NA to NA!
## [1] 1 7

2. Variance and standard deviation

Standard deviation (s) basically captures how far from the mean the observations are. A big standard deviation means that many observations are far from the mean; conversely, a small standard deviation means observations are closer to the mean. The standard deviation is the square root of the variance (s2). Both are easy to compute. Let’s start with the variance. The formula for the variance of a sample of data is:

s2=ni=1(YiˉY)2n1

So the variance formula is, for each observation, computing the deviation, or the difference between the observation and the mean (YiˉY). We square the difference so that the deviations above and below the mean (e.g positive and negative deviations) contribute equally to the variance. Together, the numerator of this formula is know as the sum of squares of Y (we’ll come back to this as we talk about correlation and regression). To compute the average deviation across observations, we divide by n1.1 To compute the standard deviation, which is more intuitive because it has the same units as the variable itself, we just take the square root:

s=ni=1(YiˉY)2n1 So say we have data describing hours spent studying per week by students in this class:

hours <- c(1, 4, 3, 0, 8, 3.5, 3, 4, 1, 2, 5, 7, 3, 2.5, 4, 3, 5)

So if we wanted to compute the variance manually, we could compute the mean of our observations (ybar), the deviations of each observation from this mean (deviations), and the squared deviations (sq_deviations). The variance would then be the sum of the squared deviations divided by the total number of observations (17) minus 1. Let’s try this.

ybar <- mean(hours)
deviations <- hours - ybar
sq_deviations <- deviations^2

variance <- sum(sq_deviations)/(length(hours) - 1)
print(variance)
## [1] 4.170956

You’ll be glad to hear that you don’t need to go through this each time you want to compute the variance. Check this out:

var(hours)
## [1] 4.170956

Thanks R! If you want to determine the standard deviation, you’d take the square root of the variance:

sqrt(variance)
## [1] 2.042292

Or use the base R shortcut function:

sd(hours)
## [1] 2.042292

NICE! Ok, but does this actually mean? The standard deviation tells us how far the measurements typically are from the mean, (3.4705882).2 Standard deviation also tells us something about the distribution of the data. If the distribution of our data is generally bell-shaped, then about two-third of the data (~67%) falls between ˉYs and ˉY+s. About 95% of the data will fall between ˉY2s and ˉY+2s. To visualize this, I’ve plotted a histogram which counts the number of observations across bins of the same size. This lets us quickly visualize the distribution of the variable of interest:

# We'll spend more time on ggplot() soon, but I bet you can figure out what's going on here!
ggplot() +
  geom_histogram(aes(hours)) +
  geom_vline(xintercept = mean(hours), color = "black", size = 1.5) +
  geom_vline(xintercept = (mean(hours) - sd(hours)), color = "blue") +
  geom_vline(xintercept = (mean(hours) + sd(hours)), color = "blue") +
  geom_vline(xintercept = (mean(hours) - 2*sd(hours)), color = "red", linetype = "dotted") +
  geom_vline(xintercept = (mean(hours) + 2*sd(hours)), color = "red", linetype = "dotted") +
  theme_bw() +
  labs(x = "Hours spent studying", 
       y= "Count", 
       title = "Your devotion to this class :)")

So in this case, the data suggests that 95% of you are studying between -0.6139954 and 7.5551719. One of you is particularly awesome and devoting 8 hours to the noble cause of this course.

3. Interquartile range

Quartiles are values that chunk the data (arranged from smallest to largest) into quarters. The first quartile is middle value of the measurements below the median. The third quartile is the middle value of the measurements above the median. The interquartile range is the range from the first to third quartile… this represents the span of the middle 50% of the data.

4. Coefficient of variation

We often want to compare the relative variation among individuals. Imagine that you wanted to compare the variability of hours studied across two different classes? Say, for example you wanted to compare EASY 101 with HARD 499? In general, students study much more for HARD 499 than for EASY 101, so studying an additional hour for HARD 499 doesn’t nearly as much as it does in EASY 101. The coefficient of variation calculates the standard deviation as a percentage of the mean:

CV=sˉY100

CV = (sd(hours)/mean(hours)) * 100
print(CV)
## [1] 58.8457

A higher coefficient of variation means more variability, so if the coefficient of variation for HARD 499 was 13.45%, this means that there is less variability in hours studied in this course.

We can also use the coefficient of variation to compare the variability of variables that don’t have the same units, for example the variability of height and weight in a particular group.


Visual comparison of center and spread

Let’s turn to some real data to make this more interesting. This week we will work with the 500 Cities data. To learn more about the data, check out the 500 Cities Metadata document on Canvas. For now, here’s a brief introduction to the data:

“The 500 Cities project is a collaboration between CDC, the Robert Wood Johnson Foundation, and the CDC Foundation. The purpose of the 500 Cities Project is to provide city- and census tract-level small area estimates for chronic disease risk factors, health outcomes, and clinical preventive service use for the largest 500 cities in the United States. These small area estimates will allow cities and local health departments to better understand the burden and geographic distribution of health-related variables in their jurisdictions, and assist them in planning public health interventions.”

The project website has fantastic mapping and visualization tools, so go check it out. You can find specific information about each of the health variables in the data here. I’ve cleaned and subset the original large data to only include states in the southeastern US, and this clean version is what we will work with in this tutorial.

First, let’s load the data. I saved my tidy version of the data as a .RDS file using saveRDS(). RDS files are an R-specific file structure that stores objects just as they were in R - i.e., if the variable you write out is a data.frame, then the variable you read back in will also be a data.frame. We can load the .RDS file using readRDS() and use the glimpse() function from the tidyverse set of packages to inspect the data:

health <- readRDS("./data/health_southeast.RDS") 
# the data lives in my data subfolder, so it's easy to load
# refer back to the first week for more on what the ./data thing is about
glimpse(health)
## Rows: 83
## Columns: 23
## $ CityName              <chr> "Albany", "Asheville", "Athens", "Atlanta", "...
## $ StateAbbr             <chr> "GA", "NC", "GA", "GA", "GA", "LA", "AL", "FL...
## $ PopulationCount       <dbl> 77434, 83393, 115452, 420003, 195844, 229493,...
## $ Arthritis             <dbl> 25.8, 22.6, 24.6, 20.2, 23.9, 25.9, 29.8, 20....
## $ BingeDrinking         <dbl> 11.5, 18.7, 13.6, 16.8, 13.7, 18.2, 11.5, 17....
## $ Cancer                <dbl> 5.9, 6.4, 6.2, 5.9, 6.0, 6.0, 5.8, 6.4, 6.2, ...
## $ CholesterolScreening  <dbl> 80.7, 79.0, 80.0, 82.7, 80.7, 79.2, 80.9, 82....
## $ CKD                   <dbl> 4.2, 2.8, 3.5, 3.5, 3.7, 3.6, 4.0, 2.4, 3.0, ...
## $ COPD                  <dbl> 9.0, 6.3, 7.6, 6.4, 8.2, 7.8, 9.2, 5.9, 8.2, ...
## $ HeartDisease          <dbl> 7.3, 5.3, 6.6, 6.0, 6.8, 6.7, 7.3, 5.2, 6.4, ...
## $ Asthma                <dbl> 11.1, 9.3, 9.5, 9.3, 10.2, 9.6, 12.1, 7.7, 9....
## $ NoHealthInsurance     <dbl> 26.5, 14.6, 21.8, 19.5, 23.9, 14.9, 20.6, 16....
## $ Smoking               <dbl> 24.3, 17.2, 20.0, 16.8, 22.4, 22.4, 22.2, 15....
## $ Diabetes              <dbl> 15.5, 8.7, 12.1, 12.2, 13.7, 13.4, 16.2, 7.3,...
## $ HighBloodPressure     <dbl> 42.2, 30.5, 34.9, 34.1, 38.0, 42.4, 45.2, 25....
## $ HighCholesterol       <dbl> 30.6, 28.8, 29.8, 28.3, 30.1, 34.1, 31.9, 28....
## $ MentalHealth          <dbl> 16.2, 12.8, 14.3, 12.0, 14.9, 16.6, 16.8, 11....
## $ PhysicalActivity      <dbl> 40.5, 22.9, 34.3, 29.5, 36.4, 35.3, 38.1, 23....
## $ Obesity               <dbl> 41.9, 28.1, 33.5, 31.1, 38.5, 37.7, 42.0, 21....
## $ PoorHealth            <dbl> 15.5, 11.8, 13.2, 11.3, 13.9, 15.7, 17.1, 10....
## $ Stroke                <dbl> 5.1, 2.9, 3.9, 4.1, 4.4, 4.2, 5.1, 2.5, 3.5, ...
## $ BloodPressureMedicine <dbl> 68.3, 60.1, 64.3, 65.7, 66.3, 65.4, 67.6, 52....
## $ DoctorVisits          <dbl> 78.0, 73.3, 74.8, 73.9, 78.4, 74.9, 74.9, 71....

So this data contains city-level observations (grouped by state) describing:

  • PopulationCount: pretty self-explanatory
  • BingeDrinking: percentage of adults over 18 who report having more than five or more drinks (men) or four or more drinks (women) on an occasion in the past 30 days (more info here).
  • Smoking: percentage of adults over 18 who report having smoked more than 100 cigarettes in their lifetime and currently smoke every day or some days (more info here)
  • MentalHealth: percent of adults reporting 14 or more days during the last 30 during which their mental health was not good (more info here)
  • Obesity: percentage of adults over 18 who have a BMI over 30 (more info here)
  • PoorHealth: percentage of adults over 18 who report 14 or more days of the last 40 during which their physical health was not good (more info here)
  • LowSleep: percentage of adults over 18 who report usually getting insufficient sleep (more info here)

Ok, now imagine that we have money to promote health in the southeastern US, but don’t have enough money for all states in the region. How would you begin to determine which state should receive support? Well, we could start by looking at measures of center for each of the variables by state, for example, by comparing the average value or PoorHealth across all monitored cities in each state. One way to do this is to use the dplyr::group_by() and dplyr::summarize() functions to group our data by state and summarize the average value of PoorHealth for each group:

aph <- health %>%
  group_by(StateAbbr) %>%
  summarize(MEAN = mean(PoorHealth))# added this to help with the visualization below
aph
## # A tibble: 8 x 2
##   StateAbbr  MEAN
##   <chr>     <dbl>
## 1 AL         14.9
## 2 FL         13.7
## 3 GA         12.2
## 4 LA         15.5
## 5 MS         15.9
## 6 NC         12.9
## 7 SC         12.2
## 8 TN         14.2

We’ve now created a data.frame called aph that stores the name of each state and the average value of PoorHealth for each state. Let’s use a bar chart to visualize these differences.3

ggplot(data = aph) +
  geom_col(aes(x = StateAbbr, y = MEAN)) +
  labs(x = "US State",
       y = "% adults reporting poor health (>14 days)",
       title = "Poor health in the Southeastern US") +
  theme_bw()

If you’re Type A about your visualizations and want to arrange the bar plot by value of PoorHealth, here’s one way to do that (plus add some color, cuz, why not4):

ggplot(data = aph) +
  geom_col(aes(x = reorder(StateAbbr, desc(MEAN)), y = MEAN, fill = StateAbbr)) +
  labs(x = "US State",
       y = "% adults reporting poor health (>14 days)",
       title = "Poor health in the Southeastern US") +
  theme_bw() +
  theme(legend.position = "none") + # remove the legend
  scale_fill_brewer(palette = "Pastel1") 

Ok,easy. Let’s give money to Mississippi, Louisiana, and Alabama. They seem to be doing much worse than anyone else (Bourbon Street and ’Bama football).

NOT SO FAST. The visualization above tells us nothing about (1) the number of observations (cities) in each state used to compute the mean or (2) the variability among cities in each state. It also tells us nothing about (3) all of the assumptions that went into creating this data (the way the sample was generated, how the factors were operationalized, who was likely excluded and the bias that entails, the population the sample is attempting to capture, etc..). If you were really going to make a big decision using this data, you’d need to explore all of these factors.

In this tutorial, we’ll focus on the first two points. Let’s start by looking at the number of observations (cities) in each state. One way we can do this is by adding a new variable that counts the number of cities in each state to our group_by() and summarize() pipe we wrote above:

aph <- health %>%
  group_by(StateAbbr) %>%
  summarize(MEAN = mean(PoorHealth),
            NCITIES = n()) # this n() function counts the number of observations in each group
aph
## # A tibble: 8 x 3
##   StateAbbr  MEAN NCITIES
##   <chr>     <dbl>   <int>
## 1 AL         14.9       6
## 2 FL         13.7      33
## 3 GA         12.2      11
## 4 LA         15.5       6
## 5 MS         15.9       2
## 6 NC         12.9      14
## 7 SC         12.2       5
## 8 TN         14.2       6

Let’s compute standard deviation to get a sense of how variable PoorHealth is across cities in each state:

aph <- health %>%
  group_by(StateAbbr) %>%
  summarize(APH = mean(PoorHealth),
            NCITIES = n(), 
            SD = sd(PoorHealth)) 
aph
## # A tibble: 8 x 4
##   StateAbbr   APH NCITIES    SD
##   <chr>     <dbl>   <int> <dbl>
## 1 AL         14.9       6 2.45 
## 2 FL         13.7      33 1.50 
## 3 GA         12.2      11 2.93 
## 4 LA         15.5       6 0.898
## 5 MS         15.9       2 0.141
## 6 NC         12.9      14 1.61 
## 7 SC         12.2       5 2.63 
## 8 TN         14.2       6 1.32

This lets us quickly see that the spread (here, measured as standard deviation) in Georgia is much higher than in other states. Tables like this are helpful, but staring at a big matrix of numbers is not the most effective way to quickly visualize these differences in center and spread across groups.

A very quick way to visualize the shape of a variable is to plot a histogram of that variable. You can do this using either the base R hist() function or using ggplot().

hist(health$PoorHealth, breaks = 100, xlab = "Poor health rates", ylab = "Frequency", main = "Histogram of poor health rates")

Changing the breaks() argument changes the number of bins into which the variable values are placed. Try it!

You can also build a nicer histogram using ggplot() as follows:

ggplot(data = health) +
  geom_histogram(aes(x = PoorHealth), bins = 50) +
  labs(x = "Rate of respondents with poor health", 
       y = "Frequency",
       title = "500 Cities Health Outcomes")  +
  theme_classic()

To compare distributions of a variable across groups, e.g. distributions of poor health observations across states, enter an extremely powerful data visualization tool: the box plot. These plots visualize the median, the first and third quartiles (the 25th and 75th percentiles), and any outliers that fall beyond the 95th percentile of the data in one elegant plot. They also automatically group data and compute summary statistics, so you can skip that step in your code! ANYTIME you want to quickly visualize difference across groups, the box and whisker should be your go-to visualization tool:

ggplot(data = health) +
  geom_boxplot(aes(x = StateAbbr, y = PoorHealth)) + # to add color, play with fill=StateAbbr and color=StateAbbr
  labs(x = "US State",
       y = "% adults reporting poor health (>14 days)",
       title = "Poor health in the Southeastern US") +
  theme_bw() 

What’s going on here? Think of the box plot (also referred to as a “box and whisker plot”) above as variable distributions (histograms) turned on their sides. The median of the distribution is shown by the horizontal bar dividing each box. The lower and upper edges of the box are the first and third quartiles. The “whiskers”, or the thin vertical lines, represent the area describing 95% of the distribution (technically 1.5 * IQR). The dots are outliers, or values that fall outside of where the bulk of the data lives. In our example, these points represent cities with much higher or lower rates of reported PoorHealth than the norm. Think back to our previous reflection on why outliers, particularly in this case, can be very important to understand.

This approach to plotting difference across groups is much more effective than a simple bar plot (so never ever use one of those again unless you’re comparing count data across groups!) - it communicates center and spread in one visualization. So now that we understand what this plot helps us visualize, let’s interpret what it means. First, let me reorder the plot and pretty-it-up a bit:

ggplot(data = health) +
  geom_boxplot(aes(x = reorder(StateAbbr, desc(PoorHealth)), y = PoorHealth, fill = StateAbbr)) + # to add color, play with fill=StateAbbr and color=StateAbbr
  labs(x = "",
       y = "% adults reporting poor health (>14 days)",
       title = "Poor health in the Southeastern US") +
  theme_bw() +
  theme(legend.position = "none") +
  scale_fill_brewer(palette = "Dark2") +
  coord_flip()

Boxes that are relatively tall indicate a state with cities that have many different rates of reported PoorHealth (hey Georgia!). Notice that Alabama, Mississippi, and Louisiana have much higher median PoorHealth values than other states. While this does suggest that these states have worse health than other states, it’s important to note that the range of reported values overlap with many other states (the boxes and/or whiskers overlap). While this makes it hard for us to say that there any compelling statistical difference in health across these states, we can say with some confidence that it is likely that reported PoorHealth is, for example, higher in urban areas in Alabama than in North Carolina. Note that another cool tool to use when comparing across groups is the violin plot.

ggplot(data = health) +
  geom_violin(aes(x = reorder(StateAbbr, desc(PoorHealth)), y = PoorHealth, fill = StateAbbr)) + # to add color, play with fill=StateAbbr and color=StateAbbr
  labs(x = "US State",
       y = "% adults reporting poor health (>14 days)",
       title = "Poor health in the Southeastern US") +
  theme_bw() +
  theme(legend.position = "none") +
  scale_fill_brewer(palette = "Dark2") +
  coord_flip()

This slices data into quartiles like a box and whisker plot, but also shows us the shape of the distribution! Cool! To make this plot, the only bit of code I changed was geom_boxplot() to geom_histogram().

Now, let’s zoom in on my favorite part of box plots… the outliers. In the case of our data, these are the cities with abnormally high or low rates of reported poor health. You’ll encounter published research where outliers are “clipped” (dropped) from analysis… but man, what a bummer… because in many ways, outliers present fantastic opportunities to learn. We’ll come back to this a bit later in the class.

We can use our dplyr skills to query our data and find our more about these outliers. For example, say we want to locate the cities in Florida with particularly high or low reported PoorHealth… here’s one way to do that:

health %>%
  filter(StateAbbr == "FL",
         PoorHealth > 16) # derived from visually inspecting the data, but you could also compute mean() +/- 1 SD
##     CityName StateAbbr PopulationCount Arthritis BingeDrinking Cancer
## 1    Hialeah        FL          224669      18.9          14.5    4.8
## 2 Lauderhill        FL           66887      24.1          13.5    5.6
##   CholesterolScreening CKD COPD HeartDisease Asthma NoHealthInsurance Smoking
## 1                 78.1 4.0  6.9          7.4    7.2              43.4    18.9
## 2                 80.8 3.9  9.6          7.2   10.9              28.2    22.8
##   Diabetes HighBloodPressure HighCholesterol MentalHealth PhysicalActivity
## 1     15.3              30.6            31.3         15.6             40.2
## 2     13.7              41.0            29.9         16.8             37.4
##   Obesity PoorHealth Stroke BloodPressureMedicine DoctorVisits
## 1    34.2       16.8    3.6                  56.3         71.0
## 2    34.4       16.6    5.1                  63.3         77.3

These cities are the two points above the Florida box plot. We could also use dplyr to find the cities in each state with the maximum (or minimum) value of reported PoorHealth:

health %>%
  group_by(StateAbbr) %>%
  filter(PoorHealth == max(PoorHealth))
## # A tibble: 8 x 23
## # Groups:   StateAbbr [8]
##   CityName StateAbbr PopulationCount Arthritis BingeDrinking Cancer
##   <chr>    <chr>               <dbl>     <dbl>         <dbl>  <dbl>
## 1 Birming~ AL                 212237      29.8          11.5    5.8
## 2 Gastonia NC                  71741      24.4          14.9    6.3
## 3 Hialeah  FL                 224669      18.9          14.5    4.8
## 4 Jackson  MS                 173514      25.2          11.9    5.9
## 5 Macon    GA                  91351      27.6          11.6    5.9
## 6 Memphis  TN                 646889      26.3          11.7    6  
## 7 North C~ SC                  97471      25.3          17.3    6.2
## 8 Shrevep~ LA                 199311      25.4          16.7    6  
## # ... with 17 more variables: CholesterolScreening <dbl>, CKD <dbl>,
## #   COPD <dbl>, HeartDisease <dbl>, Asthma <dbl>, NoHealthInsurance <dbl>,
## #   Smoking <dbl>, Diabetes <dbl>, HighBloodPressure <dbl>,
## #   HighCholesterol <dbl>, MentalHealth <dbl>, PhysicalActivity <dbl>,
## #   Obesity <dbl>, PoorHealth <dbl>, Stroke <dbl>, BloodPressureMedicine <dbl>,
## #   DoctorVisits <dbl>

Based on what you’ve seen so far, which state should receive the money? How might you look at other variables in the data to support your decision?


  1. "The reason is that the sample mean is itself calculated using each data point. Therefore, the measurements in the sample are slightly closer on average to the sample mean than they are to the true population mean. This causes a bias that is corrected by dividing by n1 instead of n. ABD, p. 69

  2. RMarkdown tip: You can write inline code using back ticks and the letter r, more here. In this case, I’m not actually typing out the numbers representing ybar, but computing it inline. If I ever decide to change the values of hours, then the actual numbers in the text will change in response to this change. Nice.

  3. Never ever do this again! As far as I’m concerned, bar charts are only to be used when comparing count data across groups, never to compare something like a mean… box and whiskers are the way to go here when comparing variable statistics across groups of data. Keep reading for more!

  4. More on the palettes available through RColorBrewer here