Predicting when a SE user will reach a given rep
For what follows, rep
contains the reputation observations over the given sample (1.460 days).
Fitting a linear trend on the data eg lmf=LiearModelFit[rep,t,t]
produces the following fitted function (evaluating lmf["Function"][t]
produces):
250144. + 258.184 t
t
here are 'days' but that is not made explicit in the estimation; plugging in t=1460
returns 627092.
; that is the estimated reputation for the last point in the sample (the actual value is 622032
).
Also, according to this linear model, the reputation target trep=654321
will be obtained on day 1566
or 106
days after the end of our sample. We obtain the point estimate of a reputation value that is equal to or greater to the target figure using the follow snippet of code:
Reap[
Scan[
If[lmf[#] >= trep, Sow[#]; Return[]] &, 1460 + Range[365]]]
According to this, the target figure will be obtained on
DatePlus[{2018, 1, 5}, {106, "Day"}]
{2018, 4, 21}
or as of late April, 2018.
Now, all the above are great and all, with a minor glitch; take a look at the residuals from the estimated model:
This is what a major autocorrelation infestation looks like. The presence of autocorrelation in the residuals can be interpreted in different ways; for the present case we'll go with model mis-specification ie there is a systematic part in the data that (systematically?) eludes our (simple, linear trend) model.
The linear model has persistent positive residual autocorrelation.
In order to find a better specified model, we'll turn to Mathematica's TimeSeriesModelFit
. For the rest of this segment, tsrep
stands for
d1 = {2018, 1, 5};
delta = 1460;
d0 = DatePlus[d1, {-delta + 1, "Day"}]
tsrep = TimeSeries[rep, {d0, Automatic, "Day"},
MetaInformation -> {"User" -> "TeX.SE@egreg", "Units" -> "Reputation"}]
the TimeSeries
representation of rep
.
This is what the reputation data look like
xlbl = Row[{delta, " ", "days of data"}];
tspec = {"Year", "/", "Month", "/","Day"};
from = DateString[tsrep["FirstTime"], tspec];
to = DateString[tsrep["LastTime"], tspec];
txt = "data provided by user Mma.SE@anderstood";
DateListPlot[tsrep ,
Filling -> Axis,
FrameLabel -> {
{tsrep["Units"], Automatic},
{xlbl, Row[{from, "-", to}]}
},
PlotLegends -> tsrep["User"],
Epilog -> {Text[Style[txt, Darker@Red, Bold, Italic], Scaled[{0.5, 0.1}]]}]
The data display a definite trend-which we'll assume is deterministic; this means that the reputation series as it stands is a non-stationary series. This observation is confirmed by the outcome of performing appropriate unit-root tests:
headers = Style[#, Bold] & /@ {"Lags", "Test", "Test statistic",
"P-value", "Test conclusion"};
Prepend[Prepend[
UnitRootTest[rep,
{"Drift", #},
{"AutomaticTest", "TestStatistic", "PValue", "ShortTestConclusion"}
], #] & /@ {1, 2, 7, 15, 30}, headers] // Grid
The null of the presence of a unit root cannot be rejected for various specifications of the test.
We are going to split our data into two parts: the first one we'll use for estimation purposes and the later part will be set aside for testing purposes. We choose to split the data at {2017, 10, 8}
(last 90 days of data).
delta = 90;
d2 = DatePlus[d1, {-delta + 1, "Day"}]
excludeLast90dDays = TimeSeriesWindow[tsrep, {Automatic, d2}];
last90Days = TimeSeriesWindow[tsrep, {d2, Automatic}];
We let Mathematica select an appropriate model for the first part of our data
mx90 = TimeSeriesModelFit[excludeLast90dDays];
The selected model is an ARIMAProcess[-0.034365, {-0.0386807, -0.0263645,
0.0195861}, 2, {-0.964332}, 1528.1]
. The residuals obtained from this model as well as their scatter plots at various lags are reported below.
The ACF's, PACF's and Ljung-Box don't seem terribly bad also.
This time, the model predicts that trep
will be obtained
Reap[Scan[
If[mx90[#] >= trep, Sow[#]; Return[]] &,
DateRange[d2, DatePlus[d1, {365, "Day"}], "Day"]]
]
on
{2018, 5, 25}
.
The predicted date along with appropriate 95% bounds according to the ARIMA model is presented below
code
(* used to obtain lags of residuals *)
pair[list_, lag_] := {Drop[list, -lag], Drop[list, lag]}
Clear[u, t]
(* used to scatter-plot residuals *)
plot[list_, lag_] := ListPlot[
pair[list, lag] // Transpose,
Frame -> True,
FrameLabel -> {
{Subscript[u, t], Automatic},
{Subscript[u, t - lag],Row[{"lag = ", lag, " ", "day(s)"}]}
},
Axes -> False,
ImageSize -> Small,
RotateLabel -> False]
Data in the format {date, reputation}...:
timeddata = Transpose[{Range[Length[data]], data}];
A quadratic fit:
mymodel = Fit[timeddata, {1, x, x^2}, x]
(* 245614. + 276.773 x - 0.0127237 x^2 *)
Solve for criterion:
Solve[245614.0277534589` + 276.7728073237194` x -
0.012723685322656041` x^2 == 654321, x]
(* {{x -> 1593.41}, {x -> 20159.2}} *)
The first solution is the correct one, of course: 1593 days from the beginning of the data collection.
A graph:
Show[
ListPlot[hhh, PlotStyle -> Red,
PlotRange -> {{0, 2000}, {0, 700000}}],
Plot[{654321,
245614.0277534589` + 276.7728073237194` x -
0.012723685322656041` x^2}, {x, 1, 2000}],
Epilog -> {Dashed, Line[{{1593, 0}, {1593, 654321}}]}]
You can add cubic or quartic terms, but it doesn't change the predicted date enough to matter--the cubit term in the fit is less than $10^{-5}$.