maxjmartin.com

PredictionBook size predictions

To look at the likely growth of predictionbook.com, I scrape each page of the public predictions (50 per page) to retrieve the ID and creation date. This gives us an estimate of the number of public predictions created per day based on the page number, and the number of total predictions based on the prediction ID.

The predictions:

  • First prediction created June 1st 2017 UTC has an ID (see URL) >= 182180 ( 55% confidence )
  • 1st June 2017: >= 249 pages of predictions on the “View Predictions” page (12450ish public predictions, currently 238 pages) ( 60% confidence )
  • 1st June 2017: >= 246 pages of predictions on the “View Predictions” page (12300ish public predictions, currently 238 pages) ( 75% confidence )
  • 1st Jan 2018: >= 270 pages of predictions on the “View Predictions” page (13500ish public predictions, currently 238 pages) ( 55% confidence )
  • 1st Jan 2018: >= 267 pages of predictions on the “View Predictions” page (13350ish public predictions, currently 238 pages) ( 70% confidence )

The confidence in each prediction accounts for the likely possibility that the model will not hold in future, due to the small number of users on the site.

Public predictions

Public predictions graph

There looks to be some seasonal variance, and it would be interesting to verify if there are more predictions around January. A linear model with a coefficient of 5/day fits well.

Total Predictions

Total predictions graph

The increase was much faster in 2015, due to large number of automated private predictions maybe? (~300/day), it then leveled off in 2016+ (~10/day)

At first glance it looks like a Gompertz function, which would suggest a leveling off after an initial period of growth (would make sense for a website such as a wiki where there is a roughly finite number of articles to be written) but I think it is more likely a change in behaviour of a few prolific users (and scripts?). It is hard to know for sure without more detailed information on private predictions.

R code

library(rvest)
library(magrittr)

# https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html
# https://blog.rstudio.org/2014/11/24/rvest-easy-web-scraping-with-r/

lpg <- 239 # last page at time of crawl
#df <- data.frame()

#for(pg in c(1:lpg)) {
#	pb <- read_html(paste("http://predictionbook.com/predictions/page/",pg, sep=""))

#	p <- (1+lpg-pg)*50; # 50 per page, so first prediction on last page is roughly the 50th
#	dt <- pb %>%
#		html_node(".description .date.created_at") %>%
#		html_attr("title") %>%
#		as.Date()
#
#	df <- rbind(df, data.frame(x = dt, y = p))
#}

##### public predictions

df <- read.table("C:\\sync\\documents\\r\\pbook.csv", sep=",", header=TRUE)

# ignore pre-2013 data
df <- df[as.Date(df$x) > as.Date("2013-01-01"),]
model <- lm(formula = y ~ as.Date(x), data = df)

future <- as.Date(c("2017-06-01", "2018-01-01", "2018-06-01", "2019-01-01"))
new.df <- data.frame(x=future)

# level=0.95, 95% of pop falls within upr/lwr
# assumes error is independant of x, normal distribution
fitted <- predict(model, new.df, interval="predict")

png("C:\\sync\\documents\\r\\pbook_1.png", 800, 500, type="cairo")
plot(as.Date(df$x), df$y, type='l', 
	xlim=c(as.Date("2014-01-01"),as.Date("2018-06-01")), 
	ylim=c(7000,15000), 
	lwd=2, col="gray", xlab="date", ylab="public predictions")
abline(model, col="red")
lines(future,fitted[, "lwr"],lty="dotted", col="blue")
lines(future,fitted[, "upr"],lty="dotted", col="blue")
dev.off()

for(i in c(1:3)) {
	print(paste("Pages on PB on", future[i], "95% between", max(lpg,round(fitted[i,"lwr"]/50)), "and", round(fitted[i,"upr"]/50)))
}

##### total predictions

png("C:\\sync\\documents\\r\\pbook_2.png", 800, 500, type="cairo")
df <- read.table("C:\\sync\\documents\\r\\pbook_total.csv", sep=",", header=TRUE)
plot(as.Date(df$x), df$y, type='l', 
	lwd=2, col="gray", xlab="date", ylab="total predictions")
dev.off()

##### output

# "Pages on PB on 2017-06-01 95% between 241 and 260"
# "Pages on PB on 2018-01-01 95% between 262 and 281"
# "Pages on PB on 2018-06-01 95% between 277 and 296"