Commit bcc659a

Analyze usage metrics from Nov 2019 to Oct 2020.
1 parent 8aa3435 commit bcc659a

2 files changed: 182 additions, 0 deletions

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -1,2 +1,3 @@
 .Rproj.user
 .Rhistory
+*.pdf

scripts/report-2020.Rmd

Lines changed: 181 additions & 0 deletions
@@ -0,0 +1,181 @@
---
title: "Workflowr usage report 2020"
subtitle: "November 2019 to October 2020"
author: "John Blischak"
date: "`r Sys.Date()`"
output:
  pdf_document:
    toc: true
    toc_depth: 3
urlcolor: blue
---

```{r setup, include=FALSE}
knitr::opts_knit$set(eval.after = "fig.cap", root.dir = "..")
knitr::opts_chunk$set(echo = FALSE, fig.width = 4, fig.height = 4,
                      fig.pos = "h", message = FALSE)
```

```{r packages}
library("dplyr")
library("ggplot2")
theme_set(theme_classic(base_size = 12))
library("reshape2")
```

```{r functions}
loadData <- function(filename, dateFilter = "2019-11-01", cumulate = TRUE) {
  stopifnot(file.exists(filename))
  x <- read.delim(filename, stringsAsFactors = FALSE)
  if (cumulate) {
    if (!"count" %in% colnames(x)) x$count <- 1
    x$cumulative <- cumsum(x$count)
  }
  x$date <- as.Date(x$date, format = "%Y-%m-%d", tz = "America/Chicago")
  x <- subset(x, date >= as.Date(dateFilter))
  x
}

plotData <- function(x, y = "count", title = "", smooth = FALSE) {
  ggplot(x, aes(x = date, y = .data[[y]])) +
    geom_point() +
    labs(x = "Date", y = "Count", title = title) +
    if (smooth) geom_smooth()
}
```

```{r data}
projects <- loadData("data/github-projects.txt")
stars <- loadData("data/github-stars.txt")
activity <- loadData("data/activity.txt", cumulate = FALSE)
downloads <- loadData("data/cranlogs.txt")
clones <- loadData("data/github-clones.txt")
views <- loadData("data/github-views.txt")
```

```{r prep}
colnames(activity) <- c("Date", "Weekly Projects", "Weekly Users",
                        "Monthly Projects", "Monthly Users")
activityLong <- melt(activity, id.vars = "Date", variable.name = "Metric",
                     value.name = "Count")
```

## Summary

Last year I started systematically measuring workflowr usage metrics. I set up a
script to run every Monday morning. This regular measurement is critical because
GitHub only keeps some metrics for a two-week period. Here's what I found:

### The Good

Both the number of workflowr projects on GitHub and the number of stars of the
workflowr repository continue to rise steadily. This indicates that people are
interested in workflowr and actively trying it out.

### The Bad

The number of active projects and users hasn't grown over the last year. This is
corroborated by the package downloads from CRAN and the views and clones of the
workflowr GitHub repository. This indicates that we are not retaining users
over the long term.

### The Ugly

I can no longer continue using this strategy to measure active projects and
users because the [GitHub Search API][gh-search-api] has a limit of 1,000 search
results.

While it is exciting that there are now over 1,000 workflowr
projects on GitHub, the complication is that this breaks my query to the GitHub
API. GitHub does not want the Search API to be used to catalog all the results,
but instead to find a few hits among the top results. From their
documentation:

[gh-search-api]: https://docs.github.com/en/free-pro-team@latest/rest/reference/search#about-the-search-api

> The Search API helps you search for the specific item you want to find. For
example, you can find a user or a specific file in a repository. Think of it the
way you think of performing a search on Google. It's designed to help you find
the one result you're looking for (or maybe the few results you're looking for).
Just like searching on Google, you sometimes want to see a few pages of search
results so that you can find the item that best meets your needs. To satisfy
that need, the GitHub Search API provides up to 1,000 results for each search.

The recommended workaround to this problem is to apply additional search
filters, e.g. perform a separate query for each year. Frustratingly, while this
works for many of the search features, it doesn't work when searching for the
existence of a specific file in a repository. I identify workflowr projects by
searching for the file `_workflowr.yml`.
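
To make the limit concrete, here is a minimal sketch of the arithmetic. It assumes the Search API returns at most 100 results per page and at most 1,000 results per query. The helper `searchCoverage()` and the total of 1,250 projects are hypothetical, chosen only for illustration; neither comes from the measurement script. Once the total exceeds the cap, every additional project is invisible to a single query, however many pages are requested.

```r
# Hypothetical helper, not part of the measurement script.
# Assumptions: at most 100 results per page and 1,000 results per query.
searchCoverage <- function(totalCount, perPage = 100, maxResults = 1000) {
  reachable <- min(totalCount, maxResults)
  data.frame(
    total = totalCount,
    reachable = reachable,
    pages = ceiling(reachable / perPage),
    missed = totalCount - reachable
  )
}

# Illustrative total only; a real total would come from the API response.
coverage <- searchCoverage(1250)
print(coverage)
```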

Thus, to continue obtaining these useful metrics, I'll need to switch to a
different source. I think I should be able to use the [Google BigQuery dataset
of public GitHub repositories][bigquery], but I have to figure out how to use it
first.

[bigquery]: https://console.cloud.google.com/bigquery?project=swift-fabric-269218&p=bigquery-public-data&d=github_repos&page=dataset
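
A first attempt at such a query might look like the sketch below. It is untested and rests on assumptions: that the public dataset exposes a `files` table with `repo_name` and `path` columns, and that a workflowr project can be identified by a `_workflowr.yml` file at the repository root. The helper `buildWorkflowrQuery()` is hypothetical and only builds the SQL string. Running it would require a BigQuery client and a billing project, neither of which is shown here. The dataset may also not cover every public repository, so its counts could differ from those of the Search API.

```r
# Hypothetical helper: builds (but does not run) a BigQuery Standard SQL query.
# Assumption: the table bigquery-public-data.github_repos.files has the
# columns repo_name and path.
buildWorkflowrQuery <- function(filename = "_workflowr.yml") {
  sprintf(
    paste(
      "SELECT COUNT(DISTINCT repo_name) AS n_projects",
      "FROM `bigquery-public-data.github_repos.files`",
      "WHERE path = '%s'",
      sep = "\n"
    ),
    filename
  )
}

query <- buildWorkflowrQuery()
cat(query, "\n")
```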

## Tables

```{r activity-table}
activitySum <- activityLong %>%
  group_by(Metric) %>%
  summarize(`November 2019` = Count[Date == "2019-11-11"],
            Median = median(Count),
            `June 2020` = Count[Date == "2020-06-29"],
            `October 2020` = Count[Date == "2020-10-05"])
knitr::kable(activitySum, caption = "Active projects and users on GitHub.")
```

```{r workshops}
workshopDates <- c("2020-07-20", "2020-07-27", "2020-08-03", # useR
                   "2020-08-10", "2020-08-17", "2020-08-31", # PSU
                   "2020-09-14", "2020-09-21", "2020-09-28"  # QBIO
                   )
workshopDates <- as.Date(workshopDates, format = "%Y-%m-%d", tz = "America/Chicago")
workshops <- activity %>%
  filter(Date %in% workshopDates)
rownames(workshops) <- c("Before useR", "useR", "After useR",
                         "Before PSU", "PSU", "After PSU",
                         "Before QBIO", "QBIO", "After QBIO")
knitr::kable(workshops,
             caption = "The recent increase in monthly projects and users is driven by the workshops we taught in July (useR!), August (Penn State), and September (QBIO).")
```

\newpage
## Figures

```{r projects, fig.cap=caption}
caption <- sprintf("The number of public workflowr projects on GitHub increased from %d to %d.",
                   projects$cumulative[1], projects$cumulative[nrow(projects)])
plotData(projects, y = "cumulative", title = "Cumulative GitHub projects")
```

```{r stars, fig.cap=caption}
caption <- sprintf("The number of stars of the workflowr GitHub repository increased from %d to %d.",
                   stars$cumulative[1], stars$cumulative[nrow(stars)])
plotData(stars, y = "cumulative", title = "Cumulative GitHub stars")
```

```{r activity, fig.cap=caption, fig.width=8, fig.height=8}
caption <- "Active projects and users on GitHub. An active project is any repository with at least one new commit in the previous time period (week or month). An active user is the owner of at least one of the repositories with a new commit in the previous time period (week or month). In other words, it is a count of the unique users."
ggplot(activityLong, aes(x = Date, y = Count)) +
  geom_point() +
  facet_wrap(~Metric) +
  labs(title = "Activity") +
  geom_smooth()
```
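
The definitions in the activity figure caption can be illustrated with a small sketch. Everything in it is hypothetical: the `repos` data frame, its column names, the helper `countActive()`, and the approximation of a month as 30 days. The actual measurement script is not part of this report.

```r
# Hypothetical data: one row per repository with its owner and last commit date.
repos <- data.frame(
  owner = c("alice", "alice", "bob", "carol"),
  repo = c("proj1", "proj2", "analysis", "thesis"),
  lastCommit = as.Date(c("2020-10-03", "2020-09-20", "2020-10-01", "2020-08-15")),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count repositories with a commit in the `days` days
# up to and including `asOf`, and the unique owners of those repositories.
countActive <- function(repos, asOf, days) {
  active <- repos[repos$lastCommit > asOf - days & repos$lastCommit <= asOf, ]
  c(projects = nrow(active), users = length(unique(active$owner)))
}

asOf <- as.Date("2020-10-05")
weekly <- countActive(repos, asOf, days = 7)
monthly <- countActive(repos, asOf, days = 30)  # assumption: month = 30 days
print(rbind(weekly, monthly))
```

In this toy data the monthly window contains three active projects but only two active users, because one owner has two repositories with recent commits.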

```{r downloads, fig.cap=caption}
caption <- "The daily downloads of the workflowr package from CRAN. The absolute number is not that informative since many downloads are from automated systems. The relative change over time (or lack of change) is more interpretable."
plotData(downloads, title = "Package downloads from CRAN", smooth = TRUE)
```

```{r views, fig.cap=caption}
caption <- "The daily views of the workflowr repository on GitHub."
plotData(views, title = "Views of GitHub repository", smooth = TRUE)
```

```{r clones, fig.cap=caption}
caption <- "The daily clones of the workflowr repository from GitHub."
plotData(clones, title = "Clones of GitHub repository", smooth = TRUE)
```
