PredictionBook size predictions

modified 2018/11/02

To look at the likely growth of, I scrape each page of the public predictions (50 per page) to retrieve the ID and creation date. This gives us an estimate of the number of public predictions created per day based on the page number, and the number of total predictions based on the prediction ID.

The predictions:

The confidence in each prediction accounts for the likely possibility that the model will not hold in future, due to the small number of users on the site.

Public predictions

Public predictions graph

There looks to be some seasonal variance, and it would be interesting to verify if there are more predictions around January. A linear model with a coefficient of 5/day fits well.

Total Predictions

Total predictions graph

The increase was much faster in 2015, due to large number of automated private predictions maybe? (~300/day), it then leveled off in 2016+ (~10/day)

At first glance it looks like a Gompertz function, which would suggest a leveling off after an initial period of growth (would make sense for a website such as a wiki where there is a roughly finite number of articles to be written) but I think it is more likely a change in behaviour of a few prolific users (and scripts?). It is hard to know for sure without more detailed information on private predictions.

R code



lpg <- 239 # last page at time of crawl
#df <- data.frame()

#for(pg in c(1:lpg)) {
#	pb <- read_html(paste("",pg, sep=""))

#	p <- (1+lpg-pg)*50; # 50 per page, so first prediction on last page is roughly the 50th
#	dt <- pb %>%
#		html_node(".description .date.created_at") %>%
#		html_attr("title") %>%
#		as.Date()
#	df <- rbind(df, data.frame(x = dt, y = p))

##### public predictions

df <- read.table("C:\\sync\\documents\\r\\pbook.csv", sep=",", header=TRUE)

# ignore pre-2013 data
df <- df[as.Date(df$x) > as.Date("2013-01-01"),]
model <- lm(formula = y ~ as.Date(x), data = df)

future <- as.Date(c("2017-06-01", "2018-01-01", "2018-06-01", "2019-01-01"))
new.df <- data.frame(x=future)

# level=0.95, 95% of pop falls within upr/lwr
# assumes error is independant of x, normal distribution
fitted <- predict(model, new.df, interval="predict")

png("C:\\sync\\documents\\r\\pbook_1.png", 800, 500, type="cairo")
plot(as.Date(df$x), df$y, type='l', 
	lwd=2, col="gray", xlab="date", ylab="public predictions")
abline(model, col="red")
lines(future,fitted[, "lwr"],lty="dotted", col="blue")
lines(future,fitted[, "upr"],lty="dotted", col="blue")

for(i in c(1:3)) {
	print(paste("Pages on PB on", future[i], "95% between", max(lpg,round(fitted[i,"lwr"]/50)), "and", round(fitted[i,"upr"]/50)))

##### total predictions

png("C:\\sync\\documents\\r\\pbook_2.png", 800, 500, type="cairo")
df <- read.table("C:\\sync\\documents\\r\\pbook_total.csv", sep=",", header=TRUE)
plot(as.Date(df$x), df$y, type='l', 
	lwd=2, col="gray", xlab="date", ylab="total predictions")

##### output

# "Pages on PB on 2017-06-01 95% between 241 and 260"
# "Pages on PB on 2018-01-01 95% between 262 and 281"
# "Pages on PB on 2018-06-01 95% between 277 and 296"