---
title: "Carnegie Library Reading Data"
output: html_notebook
---

The Carnegie Library of Pittsburgh has [released](https://github.com/carnegielibrary/summer-reading-data) a variety of data related to self-reported summer reading behavior. In the following exploratory analysis I will investigate a subset of this data.

This subset describes the number of times a given book has been read by logged time and location. For instance, a child reading a book 3 times before bed will show up as a single data point of 3 reads with a time corresponding to when they (or a parent) entered this data into the system.

To preserve patron privacy, CLPGH decided not to include the actual book titles in this subset of the data, as they could be used to deanonymize users.

My exploration of the data covers the following topics:

- Which days people read most
- The number of times books are read
- Reading trends per library over time
- Consistency in reading trends
- Most popular library locations
- Typical logging hours

# Analysis Format

This analysis was conducted using the R statistical programming language along with a variety of other tools. To keep the analysis reproducible, the code that produced each output (such as a graph or statistic) is included before that output. If you are only interested in the conclusions drawn from this data and not in the source code, please scroll to the top, click `Code` in the upper right corner, and then select `Hide All Code`.

```{r, include=FALSE}
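# library() stops with an error if a package is missing,
# whereas require() only warns and carries on
library(dplyr)
library(ggplot2)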
```

```{r}
ts <- read.csv("log-timestamps.csv")
ts <- ts[-28994, ]  # remove malformed row at the end
names(ts)[names(ts) == "log_value"] <- "times_read"
```

```{r}
# Add a column with the time rounded to the nearest hour and no date
times_only <- as.POSIXct(ts$log_timestamp, format="%m/%d/%Y %H:%M")
# Round each timestamp to the nearest hour (unit passed positionally)
times_only <- round(times_only, "hours")
ts <- mutate(ts, hour_rounded = format(times_only, "%H:%M"))
```

```{r}
# Convert column to Date objects
ts$date_read <- as.Date(strptime(ts$date_read, "%m/%d/%Y"))
```

# Introducing the Data

Let's begin by examining what data we have. Here are the first few entries:

```{r}
head(ts)
```

- `log_timestamp`: When the patron submitted their reading information. Note that this is not the time when they read the book.
- `date_read`: The date the book was read. Patrons can optionally provide a date that differs from the date on which they submitted their information. This column does not contain a time.
- `times_read`: The number of times the book was read. If a book was read for a group, each member of the group counts as one read.
- `library`: The patron's home library, even if the book was obtained through another branch.
- `hour_rounded`: A column that I created which has the contents of `log_timestamp`, but without a date and rounded to the nearest hour.
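
To make the rounding behind `hour_rounded` concrete, here is a minimal sketch on three made-up timestamps. The `demo_` names and values are illustrative assumptions, not rows from the dataset, and the time zone is fixed to UTC only to keep the example deterministic.

```{r}
# Hypothetical timestamps in the same "%m/%d/%Y %H:%M" layout as log_timestamp
demo_stamps <- as.POSIXct(
  c("06/15/2016 17:29", "06/15/2016 17:31", "06/15/2016 23:45"),
  format = "%m/%d/%Y %H:%M", tz = "UTC"
)

# Round to the nearest hour, then keep only the time of day
demo_hours <- format(round(demo_stamps, "hours"), "%H:%M")
demo_hours
```

Because the date is dropped after rounding, a log submitted at 23:45 ends up in the `00:00` bucket, so that bucket collects logs from roughly 23:30 to 00:29.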

# Most Common Reading Days

When looking at the data it's useful to think about columns that might be related to one another. We have a `date_read` column along with `times_read` indicating the number of times the book was read per log. Using R, we can convert `date_read` from dates into a new column of weekdays. From there we can see which days people tend to read most often.

```{r}
ts$weekday <- weekdays(ts$date_read)
# Convert to factor in order to impart a manual ordering
ts$weekday <- factor(ts$weekday, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
```


```{r}
weekday_summary <- summarise(group_by(ts, weekday),
                   reads = sum(times_read))

ggplot(na.omit(weekday_summary), aes(weekday, reads)) +
  geom_point() +
  geom_line(color="blue", alpha=0.2, aes(group=1)) +
  ggtitle("Reading by Weekday")
```

Surprisingly, it's more common to read on Tuesday through Thursday than on the weekend. Note that the light blue line is intended only to guide your eye to the changes in each black data point; we don't have values for the time between each weekday.

# Number of Times Read

Let's get an idea of the typical number of times books are read per log.

```{r, warning=FALSE}
ggplot(ts, aes(times_read)) +
  geom_histogram(binwidth = 1) +
  xlim(0, 20) +
  ggtitle("Count of Times Read")
```

I've limited the range of `times_read`, which drops a few outliers from the plot, but we can clearly see that most books are only read once per logged event.

However, we can get a better idea of the range of values per library using a [boxplot](https://en.wikipedia.org/wiki/Box_plot). A boxplot gives us an easy way to see the median and range of values for a category.

```{r}
sample_data <- data.frame(value=rnorm(100, mean=10, sd=3), type=c("sample1", "sample2"))

ggplot(sample_data, aes(type, value)) +
  geom_boxplot() +
  ggtitle("Sample Boxplots")
```

In the above sample boxplots, the bold line indicates the median of the data. The box spans the middle half of the values: there are as many values between the bottom edge of the box (the first quartile) and the median as there are between the median and the top edge of the box (the third quartile).

The vertical lines (whiskers) extend to the most extreme values that lie within 1.5 times the box height of the box. Any values beyond that are treated as unusual outliers and marked by solid dots (which you'll encounter below).
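
As a rough sketch of how those dots are decided, the following applies the 1.5 × IQR rule (the default used by `geom_boxplot()`) to a small made-up vector. The `demo_` names and values are hypothetical and chosen only for illustration.

```{r}
# Hypothetical values: mostly small counts plus one very large one
demo_values <- c(1, 1, 1, 1, 1, 2, 3, 15)

# First quartile, median, and third quartile (the box and its bold line)
demo_q <- quantile(demo_values, c(0.25, 0.5, 0.75), names = FALSE)
demo_iqr <- demo_q[3] - demo_q[1]

# Anything further than 1.5 * IQR from the box is drawn as a dot
demo_lower_fence <- demo_q[1] - 1.5 * demo_iqr
demo_upper_fence <- demo_q[3] + 1.5 * demo_iqr
demo_is_outlier <- demo_values < demo_lower_fence | demo_values > demo_upper_fence

demo_outliers <- demo_values[demo_is_outlier]
# The whiskers stop at the most extreme values that are not outliers
demo_whiskers <- range(demo_values[!demo_is_outlier])

demo_outliers
demo_whiskers
```

Here only the value 15 is flagged, and the whiskers run from 1 to 3.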

Let's take a look at our actual data.

```{r}
# Omit data that has no library specified.
with_library <- ts[ts$library != "",]

boxplot_read_library <- ggplot(with_library, aes(library, times_read)) +
  geom_boxplot() +
  scale_x_discrete(label=function(x) abbreviate(x, minlength=14)) +
  theme(axis.text.x = element_text(angle=45, hjust=1))

boxplot_read_library + ggtitle("Boxplot of Times Read by Library")
```

Wow, this looks strange! That's because some libraries have some very high outliers marked by dots. Let's zoom in a bit.

```{r}
boxplot_read_library +
  coord_cartesian(ylim = c(0,15)) +
  ggtitle("Boxplot of Times Read by Library (max 15)")
```

This is a bit better. We can see that for almost every library we don't even have a box to plot, since all non-outlier values are simply 1. This means that these libraries almost always have patrons who only read a book once per logged event; anything else is an unusual event marked by a dot.
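
A small sketch of why the box disappears: when the first quartile, median, and third quartile are all 1, the box has zero height, and every other value counts as an outlier. The branch below is made up for illustration and the `demo_` names are hypothetical.

```{r}
# Hypothetical branch: nine logs of a single read and one log of four reads
demo_branch <- c(rep(1, 9), 4)

demo_quartiles <- quantile(demo_branch, c(0.25, 0.5, 0.75), names = FALSE)
demo_box_height <- demo_quartiles[3] - demo_quartiles[1]

# Share of logs that record exactly one read
demo_share_single <- mean(demo_branch == 1)

demo_quartiles
demo_box_height
demo_share_single
```

The same `mean(x == 1)` calculation could be applied to `times_read` for each library to quantify how dominant single reads are.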

However, there are three libraries that have a larger range of values where we can actually see a box. For the sake of brevity, I am only including an analysis of the Woods Run location.

## Woods Run Times Read

```{r}
count_times_read <- function(l) {
  p <- ggplot(filter(ts, library == l), aes(times_read)) +
    geom_histogram(binwidth = 1) +
    ggtitle(sprintf("Times Read, %s", l))
  return(p)
}
```


```{r}
count_times_read("Woods Run") +
  coord_cartesian(xlim=c(0,30), ylim=c(0,75))
```

For this library I've already zoomed in, since the bar for single reads is so tall (about 850-900 logs). Even so, Woods Run shows a range of values for `times_read`. Let's see how the total number of reads per day changes over time.

## Woods Run Reads Over Time and Reading Consistency

```{r}
wr <- summarise(group_by(filter(ts, library == "Woods Run"), date_read),
          reads = sum(times_read)
          )
```


```{r}
ggplot(wr, aes(date_read, reads)) +
  geom_step() +
  scale_x_date(date_minor_breaks = "1 day") +
  ggtitle("Daily Reads over Time, Woods Run")
```

Two trends can be seen. First, there's not much data prior to June 6th. Second, once we start to have more data the number of reads per day appears to vary significantly. Let's get some statistics for Woods Run's data starting at June 6th.

```{r}
# Compare against an explicit Date rather than relying on coercion of a string
wr_june6 <- filter(wr, date_read >= as.Date("2016-06-06"))
summary(wr_june6$reads)
```

Woods Run has a mean of 94.33 reads/day, but the standard deviation is 76.55. If you're not familiar with standard deviation, it is a measurement of how widely the data are spread around the mean.

To illustrate this, look at the following. Here I've plotted two sets of data that have the same average value of 0; however, the red line has a standard deviation of 1 while the blue line has a standard deviation of 10.

```{r}
sample_data2 <- data.frame(o=seq(1, 100),
                           a=rnorm(100, mean=0, sd=1),
                           b=rnorm(100, mean=0, sd=10))

ggplot(sample_data2) +
  # Label each series by its SD so the legend entries match the lines
  geom_line(aes(o, a, color="1")) +
  geom_line(aes(o, b, color="10")) +
  # Assign the colors explicitly: red for SD 1, blue for SD 10
  scale_color_manual(name="SD",
                     values=c("1"="red", "10"="blue")) +
  theme(axis.title.x = element_blank(), axis.title.y = element_blank()) +
  ggtitle("Mean of 0, Different Standard Deviations")
```

From this we can see that while the mean tells us where the values are generally centered, the standard deviation captures how far they spread around that center.

For normally distributed data (think of a bell curve), a rule of thumb is that about 68% of the data fall within one standard deviation of the mean, 95% within two standard deviations, and 99.7% within three standard deviations.
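
Assuming normally distributed data, these percentages can be checked with R's built-in normal distribution function `pnorm()`. This is a small sketch, and the `demo_` names are hypothetical.

```{r}
# Number of standard deviations on either side of the mean
demo_k <- 1:3

# Probability that a standard normal value lies within k standard deviations
demo_coverage <- pnorm(demo_k) - pnorm(-demo_k)
round(demo_coverage, 4)
```

The result is roughly 0.6827, 0.9545, and 0.9973. Daily read counts are not necessarily normally distributed, so the rule is only a rough guide for the library data.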

Now back to our Woods Run data. Since the standard deviation of 76.55 is relatively large compared to the mean of 94.33, we do indeed have quite wide variability -- and thus inconsistency -- in reads/day for Woods Run.
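
One way to put a number on "relatively large" is the ratio of the standard deviation to the mean (the coefficient of variation). The sketch below uses the two figures quoted above, then repeats the calculation on a made-up vector of daily totals. The `demo_` names and the vector are assumptions for illustration; in the notebook itself the same ratio could presumably be computed from `wr_june6$reads`.

```{r}
# Summary figures quoted in the text for Woods Run from June 6th onward
demo_mean_reads <- 94.33
demo_sd_reads <- 76.55

# Standard deviation as a fraction of the mean
demo_ratio <- demo_sd_reads / demo_mean_reads
round(demo_ratio, 2)

# The same calculation on raw daily totals (hypothetical numbers)
demo_daily_reads <- c(5, 40, 80, 120, 280)
sd(demo_daily_reads) / mean(demo_daily_reads)
```

For Woods Run the ratio comes out at about 0.81, meaning the typical day-to-day spread is roughly four fifths of the average daily total.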

In addition, we can look at the quartiles for the data. We can visually examine them using the familiar boxplot.

```{r}
ggplot(wr_june6, aes(c("Woods Run"), reads)) +
  theme(axis.title.x = element_blank()) +
  coord_fixed(ratio = 0.01) +
  geom_boxplot()
```

There are as many values between the first quartile (36.75) and the median (80.50) as there are between the median and the third quartile (121.20). Visually this may be difficult to see, so this is a case where looking at the actual numeric ranges may be more helpful: the range between the first quartile and the median is greater than the range between the median and the third quartile.

One way to interpret this is that on days with fewer reads than the median we're more likely to see a wider variety of read amounts, while on days with more reads than the median we're more likely to see a narrower range of read amounts. However, we can't establish any causal connections with this limited data -- this is just an observed trend.
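
The comparison of the two ranges can be written out explicitly. This sketch takes the quartiles quoted above together with the median of 80.50 that `summary()` reports for the Woods Run data (94.33 is the mean). The `demo_` names are hypothetical.

```{r}
# Quartiles and median for Woods Run daily reads from June 6th onward
demo_q1 <- 36.75
demo_median <- 80.50
demo_q3 <- 121.20

# Width of the lower and upper halves of the box
demo_lower_span <- demo_median - demo_q1
demo_upper_span <- demo_q3 - demo_median
c(lower = demo_lower_span, upper = demo_upper_span)
```

The lower half spans 43.75 reads and the upper half 40.70, so the difference between the two is modest (about 3 reads per day).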

# Most Popular Locations

So what are the most popular locations? A naive approach is to simply see which locations had the highest total number of reads.

```{r}
reads_by_library <- summarise(group_by(ts[ts$library != "",], library),
  reads = sum(times_read)
)

popular <- arrange(reads_by_library, desc(reads))
top_3 <- popular[1:3,]

ggplot(popular, aes(library, reads)) +
  geom_bar(stat="identity") +
  scale_x_discrete(label=function(x) abbreviate(x, minlength=14)) +
  theme(axis.text.x = element_text(angle=45, hjust=1)) +
  ggtitle("Total Summer Reads per Library")
```

```{r}
top_3
```


It's clear that Squirrel Hill is the most popular library followed by Oakland. Third place is difficult to identify visually, but looking at the numbers we see that it is Woods Run.

## Reads/Day Across the Most Popular Locations

Because these three are the most popular libraries, they provide us with a lot of data points to analyze. Let's look at how total reads per day have varied over time across these libraries.

```{r}
popular_data <- group_by(ts[ts$library %in% top_3$library,],
                         library,
                         date_read) %>%
                summarise(reads = sum(times_read))

popular_trends <- ggplot(popular_data[popular_data$date_read >= "2016-06-06",],
         aes(date_read, reads, color=library)) +
    geom_point() +
    geom_line() +
    scale_x_date(date_minor_breaks = "1 day")

popular_trends + ggtitle("Reading Trends Across Most Popular Libraries")
```

We can see that Squirrel Hill has a huge spike on July 6th; for further analysis we may want to look into whether any special events occurred on that day. However, this data point makes the rest of the chart difficult to compare. As such, let's zoom in.

```{r}
popular_trends +
  coord_cartesian(ylim = c(0,750)) +
  ggtitle("Reading Trends Across Most Popular Libraries (max 750)")
```

From here we can see that even in the most popular libraries, the total number of reads per day over time is generally erratic.

Further analysis could look at measurements such as which libraries have the most consistent number of reads (using standard deviation). From there, an experiment could be conducted to try out programs and see whether they increase consistency and/or the total number of reads.
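
A minimal sketch of such a consistency comparison is shown below. It uses made-up daily totals for two hypothetical branches and base R's `tapply()`; the same could be done with `group_by()` and `summarise()`. Dividing the standard deviation by the mean is my own choice here, so that busy and quiet branches can be compared on the same scale.

```{r}
# Hypothetical daily totals for two made-up branches
demo_daily <- data.frame(
  library = rep(c("Branch A", "Branch B"), each = 4),
  reads   = c(10, 12, 11, 9, 2, 30, 5, 45)
)

# Mean and standard deviation of daily reads per branch
demo_mean_by_branch <- tapply(demo_daily$reads, demo_daily$library, mean)
demo_sd_by_branch <- tapply(demo_daily$reads, demo_daily$library, sd)

# Relative spread: a lower value means more consistent day-to-day reading
demo_cv_by_branch <- demo_sd_by_branch / demo_mean_by_branch
demo_most_consistent <- names(which.min(demo_cv_by_branch))

sort(demo_cv_by_branch)
demo_most_consistent
```

In this toy data "Branch A" is the more consistent branch, since its daily totals stay close to its mean.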

# Typical Logging Times

The dataset also includes the time when each book was logged. Note that this is different from when the book was actually read. Let's group these by hour and see what hours are most popular overall -- earlier on I amended the data with a new column of values containing the time rounded to the nearest hour.

```{r}
ggplot(ts, aes(x=hour_rounded)) +
  geom_bar() +
  theme(axis.text.x = element_text(angle=45, hjust=1)) +
  ggtitle("Logging by Hour")
```

It looks like most logging occurs between 3 PM and 9 PM, but it's interesting that there are more log events around midnight than around noon, when people might be on a lunch break.

Let's compare this data across the most popular locations:

```{r}
hours_by_location <- function(location) {
  ggplot(filter(ts, library == location), aes(x=hour_rounded)) +
    geom_bar() +
    theme(axis.text.x = element_text(angle=45, hjust=1)) +
    ggtitle(sprintf("Logging Hours: %s", location))
}
```

```{r}
for (l in top_3$library) {
  print(hours_by_location(l))
}
```

Although the count scale varies across these three libraries, the general shape is similar.

We can derive some practical insights from this data. Times with high logging counts would be good times to ensure that the libraries' technical support departments are available for computer help. Also, if CLPGH's IT department needs to perform maintenance on this system, they might not want to do it at a late hour -- in fact, performing it at the start of the workday or even at noon may affect fewer patrons.
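
As a sketch of how a quiet maintenance slot could be picked from the hourly counts, the following tallies made-up rounded log hours and looks for the least busy hours within an assumed staffed window of 08:00 to 17:00. The data, the window, and the `demo_` names are all hypothetical.

```{r}
# Hypothetical rounded log hours, in the same "%H:%M" format as hour_rounded
demo_log_hours <- c("09:00", "12:00", "12:00", "17:00", "17:00",
                    "17:00", "20:00", "20:00", "00:00", "00:00")

# Use all 24 hours as factor levels so hours with no logs count as zero
demo_all_hours <- sprintf("%02d:00", 0:23)
demo_counts <- table(factor(demo_log_hours, levels = demo_all_hours))

# Restrict to the assumed staffed window and find the quietest hours in it
demo_staffed <- demo_counts[sprintf("%02d:00", 8:17)]
demo_quietest <- names(demo_staffed)[demo_staffed == min(demo_staffed)]
demo_quietest
```

Counting against all 24 hours matters because an hour with no logs at all would otherwise be missing from the table rather than showing up as the quietest option.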

# Conclusion

This was an exploration of only a small portion of the released summer reading data. However, we've been able to come up with some interesting avenues for further investigation:

- Which days people read the most
- Library popularity
- Reading behavior over time
- Consistency in reading behavior
- How often books are read multiple times
- Times for IT staffing and maintenance
- A variety of bases for measuring the effectiveness of future programs

That being said, there is one main challenge with this data:

- The data are self-reported and thus biased towards those who are inclined to put forth the effort to record their activity.

It would be interesting to try alternative methods of recording reading activity that don't require the user to have access to a computer, particularly in low-income areas. In addition, it'd be nice if users could enter a time range in which they read the book, separate from when they log their reading (although users can optionally enter a separate date _without_ a time).
