Predicting when a SE user will reach a given rep

For what follows, rep contains the reputation observations over the given sample (1.460 days).

Fitting a linear trend on the data eg lmf=LiearModelFit[rep,t,t] produces the following fitted function (evaluating lmf["Function"][t] produces):

250144. + 258.184 t

t here are 'days' but that is not made explicit in the estimation; plugging in t=1460 returns 627092.; that is the estimated reputation for the last point in the sample (the actual value is 622032).

Also, according to this linear model, the reputation target trep=654321 will be obtained on day 1566 or 106 days after the end of our sample. We obtain the point estimate of a reputation value that is equal to or greater to the target figure using the follow snippet of code:

Reap[
  Scan[
    If[lmf[#] >= trep, Sow[#]; Return[]] &, 1460 + Range[365]]]

According to this, the target figure will be obtained on

DatePlus[{2018, 1, 5}, {106, "Day"}]

{2018, 4, 21}

or as of late April, 2018.

Now, all the above are great and all, with a minor glitch; take a look at the residuals from the estimated model:

enter image description here

This is what a major autocorrelation infestation looks like. The presence of autocorrelation in the residuals can be interpreted in different ways; for the present case we'll go with model mis-specification ie there is a systematic part in the data that (systematically?) eludes our (simple, linear trend) model.

The linear model has persistent positive residual autocorrelation.

enter image description here

In order to find a better specified model, we'll turn to Mathematica's TimeSeriesModelFit. For the rest of this segment, tsrep stands for

d1 = {2018, 1, 5};
delta = 1460;
d0 = DatePlus[d1, {-delta + 1, "Day"}]

tsrep = TimeSeries[rep, {d0, Automatic, "Day"}, 
  MetaInformation -> {"User" -> "TeX.SE@egreg", "Units" -> "Reputation"}]

the TimeSeries representation of rep.

This is what the reputation data look like

xlbl = Row[{delta, " ", "days of data"}];
tspec = {"Year", "/", "Month", "/","Day"};
from = DateString[tsrep["FirstTime"], tspec];
to = DateString[tsrep["LastTime"], tspec];
txt = "data provided by user Mma.SE@anderstood";

DateListPlot[tsrep , 
  Filling -> Axis, 
  FrameLabel -> {
    {tsrep["Units"], Automatic}, 
    {xlbl, Row[{from, "-", to}]}
   }, 
  PlotLegends -> tsrep["User"], 
  Epilog -> {Text[Style[txt, Darker@Red, Bold, Italic], Scaled[{0.5, 0.1}]]}]

enter image description here

The data display a definite trend-which we'll assume is deterministic; this means that the reputation series as it stands is a non-stationary series. This observation is confirmed by the outcome of performing appropriate unit-root tests:

headers = Style[#, Bold] & /@ {"Lags", "Test", "Test statistic", 
  "P-value", "Test conclusion"};
Prepend[Prepend[
  UnitRootTest[rep, 
    {"Drift", #}, 
    {"AutomaticTest", "TestStatistic", "PValue", "ShortTestConclusion"}
   ], #] & /@ {1, 2, 7, 15, 30}, headers] // Grid

enter image description here

The null of the presence of a unit root cannot be rejected for various specifications of the test.

We are going to split our data into two parts: the first one we'll use for estimation purposes and the later part will be set aside for testing purposes. We choose to split the data at {2017, 10, 8} (last 90 days of data).

delta = 90;
d2 = DatePlus[d1, {-delta + 1, "Day"}]
excludeLast90dDays = TimeSeriesWindow[tsrep, {Automatic, d2}];
last90Days = TimeSeriesWindow[tsrep, {d2, Automatic}];

We let Mathematica select an appropriate model for the first part of our data

mx90 = TimeSeriesModelFit[excludeLast90dDays];

The selected model is an ARIMAProcess[-0.034365, {-0.0386807, -0.0263645, 0.0195861}, 2, {-0.964332}, 1528.1]. The residuals obtained from this model as well as their scatter plots at various lags are reported below.

enter image description here

The ACF's, PACF's and Ljung-Box don't seem terribly bad also.

enter image description here

This time, the model predicts that trep will be obtained

Reap[Scan[
 If[mx90[#] >= trep, Sow[#]; Return[]] &, 
  DateRange[d2, DatePlus[d1, {365, "Day"}], "Day"]]
]

{2018, 5, 25}.

The predicted date along with appropriate 95% bounds according to the ARIMA model is presented below

enter image description here

code

(* used to obtain lags of residuals *)
pair[list_, lag_] := {Drop[list, -lag], Drop[list, lag]}

Clear[u, t]
(* used to scatter-plot residuals *)
plot[list_, lag_] := ListPlot[
  pair[list, lag] // Transpose, 
  Frame -> True, 
  FrameLabel -> {
    {Subscript[u, t], Automatic}, 
    {Subscript[u, t - lag],Row[{"lag = ", lag, " ", "day(s)"}]}
   }, 
  Axes -> False, 
  ImageSize -> Small, 
  RotateLabel -> False]

Data in the format {date, reputation}...:

timeddata = Transpose[{Range[Length[data]], data}];

A quadratic fit:

mymodel = Fit[timeddata, {1, x, x^2}, x]

(* 245614. + 276.773 x - 0.0127237 x^2 *)

Solve for criterion:

Solve[245614.0277534589` + 276.7728073237194` x - 
   0.012723685322656041` x^2 == 654321, x]

(* {{x -> 1593.41}, {x -> 20159.2}} *)

The first solution is the correct one, of course: 1593 days from the beginning of the data collection.

A graph:

Show[
 ListPlot[hhh, PlotStyle -> Red, 
  PlotRange -> {{0, 2000}, {0, 700000}}],
 Plot[{654321, 
   245614.0277534589` + 276.7728073237194` x - 
    0.012723685322656041` x^2}, {x, 1, 2000}],
 Epilog -> {Dashed, Line[{{1593, 0}, {1593, 654321}}]}]

enter image description here

You can add cubic or quartic terms, but it doesn't change the predicted date enough to matter--the cubit term in the fit is less than $10^{-5}$.

Predicting when a SE user will reach a given rep

Tags:

Probability Or Statistics

Fitting

Related

Recent Posts