# Motivation for this post

In time series modeling you typically run into a trade-off between complexity and utility. What I mean is that there may be questions you need answered, but you are afraid of the model complexity that comes with answering them. This fear of complexity relates to identifiability and the curse of dimensionality. Fortunately for us, probabilistic programming can handle these things neatly. In this post we’re going to look at a problem where we have a choice between a granular model and an aggregated one. We will use a proper probabilistic model and sample from it to get the posterior information we are looking for.

# The generating model

In order to do this exercise we need to know what we’re doing, and as such we will generate the data we need by simulating a stochastic process. I’m not a big fan of this since simulated data will always be, well, simulated, and as such not very realistic. Data in our real world is not random, people. This is worth remembering. But the clients I work with on a daily basis are not inclined to share their precious data, and academic data sets are pointless since they are almost exclusively too nice to represent any real challenge, so I resort to simulated data. It’s enough to make my point. So without further ado I give you the generating model.

\[ \begin{align} y_t &\sim N(\mu_t, 7)\\ \mu_t &= x_t + 7 z_t\\ x_t &\sim N(3, 1)\\ z_t &\sim N(1, 1) \end{align} \]

which is basically a linear Gaussian model: a noisy weighted sum of two Gaussian covariates. So that represents the ground truth. The generated time series looks like this

where time is on the x axis and the response variable on the y axis. The first few lines of the generated data are presented below.

| t | y | x | z |
|---|---|---|---|
| 0 | 15.0108339 | 2.318083 | 1.7065568 |
| 1 | 10.9367111 | 2.114033 | 0.5486860 |
| 2 | 34.8632957 | 2.539029 | 2.8635449 |
| 3 | 0.8558588 | 3.162928 | 0.4805269 |
| 4 | 0.0040576 | 2.135794 | -0.5653087 |
| 5 | 35.0282266 | 3.019179 | 1.7807739 |
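For readers who want to generate a comparable data set, here is a minimal sketch of the generating model in plain Python. It is not the code used for this post: the sample size, the seed and the function name `simulate` are assumptions, and the 7 in \(N(\mu_t, 7)\) is read as a standard deviation, as Stan would read it.

```python
import random
import statistics


def simulate(n, seed=0):
    """Draw n rows (t, y, x, z) from the generating model."""
    rng = random.Random(seed)
    rows = []
    for t in range(n):
        x = rng.gauss(3.0, 1.0)   # x_t ~ N(3, 1)
        z = rng.gauss(1.0, 1.0)   # z_t ~ N(1, 1)
        mu = x + 7.0 * z          # mu_t = x_t + 7 z_t, no intercept
        y = rng.gauss(mu, 7.0)    # y_t ~ N(mu_t, 7), with 7 assumed to be the sd
        rows.append((t, y, x, z))
    return rows


# n=200 and seed=42 are arbitrary choices for illustration
rows = simulate(n=200, seed=42)
for t, y, x, z in rows[:6]:
    print(f"{t} | {y:.4f} | {x:.4f} | {z:.4f}")

# Sample means should land near E[x] = 3 and E[y] = 3 + 7 * 1 = 10
print(statistics.fmean(r[2] for r in rows), statistics.fmean(r[1] for r in rows))
```

The numbers will not match the table above since the original seed and sample size are unknown.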

So we have three variables in this data set: the response variable \(y\), and the covariates \(x\) and \(z\) (\(t\) is just an indicator of a fake time). The real model is just a linear model of the two covariates. Now say that we want to go about solving this problem and we have two individuals arguing about the best solution. Let’s call them Mr. Granularity and Mr. Aggregation. Mr. Granularity is a fickle bastard as he always wants to split things into more fine-grained buckets. Mr. Aggregation on the other hand is more kissable by nature. By that I’m referring to the Occam’s razor version of kissable, meaning “Keep It Simple Sir” (KISS).

This means that Mr. Granularity wants to estimate a parameter for each of the two variables while Mr. Aggregation wants to estimate one parameter for the sum of \(x\) and \(z\).

# Mr. Granularity’s solution

So let’s start out with the more complex solution. Mathematically Mr. Granularity defines the probabilistic model like this

\[ \begin{align} y_t &\sim N(\mu_t, \sigma)\\ \mu_t &=\beta_x x_t + \beta_z z_t + \beta_0\\ \beta_x &\sim N(0, 5)\\ \beta_z &\sim N(0, 5)\\ \beta_0 &\sim N(0, 5)\\ \sigma &\sim U(0.01, \infty) \end{align} \]

which is implemented in Stan code below. There’s nothing funky or noteworthy going on here. Just a simple linear model.

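```
// Note: the array[N] syntax below requires Stan 2.26 or newer
data {
  int<lower=0> N;        // number of time points
  array[N] real x;       // covariate x
  array[N] real z;       // covariate z
  array[N] real y;       // response
}
parameters {
  real b0;               // intercept
  real bx;               // coefficient on x
  real bz;               // coefficient on z
  real<lower=0> sigma;   // noise sd; no explicit prior, so implicitly flat on (0, inf)
}
model {
  b0 ~ normal(0, 5);
  bx ~ normal(0, 5);
  bz ~ normal(0, 5);
  for (n in 1:N)
    y[n] ~ normal(bx * x[n] + bz * z[n] + b0, sigma);
}
generated quantities {
  // fitted mean per time point (no observation noise added)
  array[N] real y_pred;
  for (n in 1:N)
    y_pred[n] = x[n] * bx + z[n] * bz + b0;
}
```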

# Mr. Aggregation’s solution

So remember that Mr. Aggregation was concerned about over-fitting and didn’t want to split things up into the most granular pieces. As such, in his solution, we will add the two variables \(x\) and \(z\) together and quantify them as if they were one. The resulting model is given below followed by the implementation in Stan.

\[ \begin{align} y_t &\sim N(\mu_t, \sigma)\\ \mu_t &=\beta_r (x_t + z_t) + \beta_0\\ \beta_r &\sim N(0, 5)\\ \beta_0 &\sim N(0, 5)\\ \sigma &\sim U(0.01, \infty) \end{align} \]

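```
// Note: the array[N] syntax below requires Stan 2.26 or newer
data {
  int<lower=0> N;        // number of time points
  array[N] real x;       // covariate x
  array[N] real z;       // covariate z
  array[N] real y;       // response
}
parameters {
  real b0;               // intercept
  real br;               // single coefficient on the sum x + z
  real<lower=0> sigma;   // noise sd; no explicit prior, so implicitly flat on (0, inf)
}
model {
  b0 ~ normal(0, 5);
  br ~ normal(0, 5);
  for (n in 1:N)
    y[n] ~ normal(br * (x[n] + z[n]) + b0, sigma);
}
generated quantities {
  // fitted mean per time point (no observation noise added)
  array[N] real y_pred;
  for (n in 1:N)
    y_pred[n] = (x[n] + z[n]) * br + b0;
}
```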
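It helps to see what the aggregated model is implicitly aiming for. Mr. Aggregation’s model is Mr. Granularity’s model with the constraint \(\beta_x = \beta_z = \beta_r\). As a rough sketch, assuming \(x_t\) and \(z_t\) are drawn independently as in the generating model, and ignoring both the priors and finite-sample noise, the least-squares values that the pooled parameters drift towards are

\[ \begin{align} \beta_r^{*} &= \frac{\mathrm{Cov}(y_t, x_t + z_t)}{\mathrm{Var}(x_t + z_t)} = \frac{1 \cdot \mathrm{Var}(x_t) + 7 \cdot \mathrm{Var}(z_t)}{\mathrm{Var}(x_t) + \mathrm{Var}(z_t)} = \frac{1 + 7}{2} = 4\\ \beta_0^{*} &= E[y_t] - \beta_r^{*} E[x_t + z_t] = (3 + 7) - 4 \cdot (3 + 1) = -6 \end{align} \]

So even with unlimited data the pooled slope sits between the two true coefficients, 1 and 7, and matches neither, while the intercept has to absorb the mismatch.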

# Analysis

Now let’s have a look at the different solutions and what we end up with. This problem was intentionally made noisy to confuse even the granular approach as much as possible. We’ll start by inspecting the posteriors for the parameters of interest. They’re shown below in caterpillar plots where the parameters are on the y-axis and the posterior distribution is laid out along the x-axis.

It is clear that the only direct comparison we can make is the intercept \(b_0\) from both models. Now if you remember, the generating function doesn’t contain an intercept. It’s \(0\). Visually inspecting the graphs above will show you that something bad is happening to both models. Let’s put some numbers on this, shall we? The tables below will illuminate the situation.

### Parameter distributions - Granular model

| | mean | 2.5% | 25% | 50% | 75% | 97.5% |
|---|---|---|---|---|---|---|
| b0 | 2.82 | -2.15 | 1.13 | 2.70 | 4.52 | 8.05 |
| bx | 0.39 | -1.25 | -0.14 | 0.39 | 0.95 | 2.01 |
| bz | 5.68 | 3.93 | 5.11 | 5.70 | 6.27 | 7.36 |

Mr. Granularity has indeed identified a *possible* intercept with the current model. The mean of the posterior is 2.82 and, as you can see, 87% of the probability mass is above \(0\), which shows how confident the model is that there is an intercept. The model is far less certain that \(\beta_x\) is real, with only 69% of its mass above \(0\), while \(\beta_z\) is clearly real with 100% of its mass above \(0\). The signed errors of the posterior means are -0.61 and -1.32 for \(\beta_x\) and \(\beta_z\) respectively.
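The "probability mass above \(0\)" figures are simply the fraction of posterior draws that exceed zero, and the signed error is the posterior mean minus the true value. A minimal sketch of both summaries follows. The helper names and the toy draws are hypothetical and are not taken from the fit in this post; with real output you would pass in the sampled values of `b0`, `bx` or `bz`.

```python
import statistics


def prob_above(draws, threshold=0.0):
    """Fraction of posterior draws strictly above the threshold."""
    return sum(d > threshold for d in draws) / len(draws)


def signed_error(draws, truth):
    """Posterior mean minus the true parameter value."""
    return statistics.fmean(draws) - truth


# Toy stand-in for posterior draws, for illustration only
toy_draws = [-1.0, 0.5, 2.0, 3.5]

print(prob_above(toy_draws))         # 0.75, three of the four draws are above 0
print(signed_error(toy_draws, 1.0))  # 0.25, the mean of 1.25 minus a truth of 1.0
```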

### Parameter distributions - Aggregated model

| | mean | 2.5% | 25% | 50% | 75% | 97.5% |
|---|---|---|---|---|---|---|
| b0 | -1.35 | -6.15 | -2.93 | -1.38 | 0.23 | 3.42 |
| br | 2.80 | 1.62 | 2.40 | 2.79 | 3.20 | 3.97 |

Mr. Aggregation has also identified a *possible* intercept with the current model. The mean of the posterior is -1.35 and, as you can see, only 29% of the probability mass is above \(0\), so the model leans, without much conviction, towards a negative intercept. The model is certain that \(\beta_r\) is real, given that 100% of its mass is above \(0\). The signed errors of the posterior mean are 1.8 and -4.2 if you consider the distance from the true \(\beta_x\) and \(\beta_z\) respectively.

## Comparing the solutions

The table below quantifies the absolute differences between the posterior means and the parameters of the generating function. The columns are the true parameters of the generating function (\(\beta_0 = 0\), \(\beta_x = 1\) and \(\beta_z = 7\)) and the rows are the estimated parameters in Mr. A’s and Mr. G’s models respectively.

| | b0 | bx | bz |
|---|---|---|---|
| Mr. A b0 | 1.35 | | |
| Mr. A br | | 1.8 | 4.2 |
| Mr. G b0 | 2.82 | | |
| Mr. G bx | | 0.61 | |
| Mr. G bz | | | 1.32 |

As the table shows, Mr. Aggregation’s model is 180% off with respect to the true \(\beta_x\) coefficient, and 60% off with respect to the true \(\beta_z\) coefficient. That’s not very impressive and actually leads to the wrong conclusions when trying to discern the dynamics of \(x\) and \(z\) on \(y\).
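The relative errors in this comparison are just the absolute differences divided by the true coefficients. A small sketch of that arithmetic for both models, using the posterior means from the tables above (the helper name `relative_error` is only illustrative):

```python
# True coefficients from the generating model: mu_t = 1 * x_t + 7 * z_t
TRUE = {"bx": 1.0, "bz": 7.0}

# Posterior means taken from the parameter tables in this post
granular = {"bx": 0.39, "bz": 5.68}
aggregated_br = 2.80  # the single pooled slope, compared against both true values


def relative_error(estimate, truth):
    """Absolute distance from the truth as a percentage of the truth."""
    return 100.0 * abs(estimate - truth) / abs(truth)


for name, truth in TRUE.items():
    agg = relative_error(aggregated_br, truth)
    gran = relative_error(granular[name], truth)
    print(f"{name}: aggregated {agg:.0f}% off, granular {gran:.0f}% off")

# bx: aggregated 180% off, granular 61% off
# bz: aggregated 60% off, granular 19% off
```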

The corresponding analysis for the granular model gives us better results. Mr. Granularity’s model is 61% off with respect to the true \(\beta_x\) coefficient, and 19% off with respect to the true \(\beta_z\) coefficient. This seems a lot better. But still, if we have a granular model, why are we so off on the intercept? Well, if you remember the generating function from before, it looked like this

\[ \begin{align} y_t &\sim N(\mu_t, 7)\\ \mu_t &= x_t + 7 z_t\\ x_t &\sim N(3, 1)\\ z_t &\sim N(1, 1) \end{align} \]

which is statistically equivalent to the following formulation

\[ \begin{align} y_t &\sim N(\mu_t, 7)\\ \mu_t &= x_t + 7 z_t + 3\\ x_t &\sim N(0, 1)\\ z_t &\sim N(1, 1)\\ \end{align} \]

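which in turn would make the \(x_t\) variable nothing but zero-centred noise around a constant. Since \(x_t\) only varies by about one unit around its mean of 3, the data has a hard time distinguishing the contribution \(\beta_x x_t\) from a constant offset, so posterior mass can move between \(\beta_x\) and \(\beta_0\). This can indeed be confirmed if you simulate many times. This is one of the core problems behind some models: identifiability. It’s a tough thing and the very reason why maximum likelihood cannot be used in general. You need to sample!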
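One way to probe this by simulation is sketched below. It is a stand-in and not the analysis behind this post: it uses ordinary least squares instead of the Stan model, reads the 7 in \(N(\mu_t, 7)\) as a standard deviation, and the sample size, number of repetitions and seed are arbitrary assumptions.

```python
import random
import statistics


def simulate(n, rng):
    """One data set from the generating model (7 is assumed to be the noise sd)."""
    xs = [rng.gauss(3.0, 1.0) for _ in range(n)]
    zs = [rng.gauss(1.0, 1.0) for _ in range(n)]
    ys = [rng.gauss(x + 7.0 * z, 7.0) for x, z in zip(xs, zs)]
    return xs, zs, ys


def fit_granular(xs, zs, ys):
    """Least-squares fit of y = b0 + bx*x + bz*z using centered sums."""
    xbar, zbar, ybar = statistics.fmean(xs), statistics.fmean(zs), statistics.fmean(ys)
    sxx = sum((x - xbar) ** 2 for x in xs)
    szz = sum((z - zbar) ** 2 for z in zs)
    sxz = sum((x - xbar) * (z - zbar) for x, z in zip(xs, zs))
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    szy = sum((z - zbar) * (y - ybar) for z, y in zip(zs, ys))
    det = sxx * szz - sxz ** 2
    bx = (szz * sxy - sxz * szy) / det
    bz = (sxx * szy - sxz * sxy) / det
    b0 = ybar - bx * xbar - bz * zbar
    return b0, bx, bz


def pearson(a, b):
    """Pearson correlation between two equally long sequences."""
    abar, bbar = statistics.fmean(a), statistics.fmean(b)
    cov = sum((u - abar) * (v - bbar) for u, v in zip(a, b))
    ss_a = sum((u - abar) ** 2 for u in a)
    ss_b = sum((v - bbar) ** 2 for v in b)
    return cov / (ss_a * ss_b) ** 0.5


rng = random.Random(2024)  # arbitrary seed
fits = [fit_granular(*simulate(100, rng)) for _ in range(300)]
b0s, bxs, bzs = (list(col) for col in zip(*fits))

print("mean b0, bx, bz:", statistics.fmean(b0s), statistics.fmean(bxs), statistics.fmean(bzs))
print("sd of bx estimates:", statistics.stdev(bxs))
print("corr(b0, bx):", pearson(b0s, bxs))
```

Across repetitions the estimates should average out close to the true values, while any single fit can land far from them. The negative correlation between the `b0` and `bx` estimates is the trade-off between intercept and slope described above.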

# Conclusion

I’ve shown you today the dangers of aggregating information into a single unit. There is a version of the strategy shown here which brings the best of both worlds: hierarchical pooling. This methodology pulls estimates with low information content towards the mean of the other, more informative ones. The degree of pooling can be readily expressed as a prior belief on how much the different subparts should be connected. As such, don’t throw information away. If you believe they belong together, express that belief as a prior. Don’t restrict your model to the same biases as you have! In summary:

- Always add all the granularity you need to solve the problem
- Don’t be afraid of complexity; it’s part of life
- Always sample the posteriors when you have complex models
- Embrace the uncertainty that your model shows
- Be aware that the uncertainty quantified is the **model’s** uncertainty
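As a pointer for the hierarchical pooling mentioned in the conclusion, one possible formulation is sketched below. It was not fitted in this post, and the hyperparameters \(\bar{\beta}\) and \(\tau\), together with their priors, are assumptions chosen purely for illustration.

\[ \begin{align} y_t &\sim N(\mu_t, \sigma)\\ \mu_t &= \beta_x x_t + \beta_z z_t + \beta_0\\ \beta_x, \beta_z &\sim N(\bar{\beta}, \tau)\\ \bar{\beta} &\sim N(0, 5)\\ \tau &\sim N^{+}(0, 5)\\ \beta_0 &\sim N(0, 5) \end{align} \]

Here \(N^{+}\) denotes a normal distribution truncated to positive values. As \(\tau\) shrinks towards \(0\) the two coefficients are forced together, which recovers Mr. Aggregation’s model. A large \(\tau\) leaves them free, which recovers Mr. Granularity’s model.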

Happy Inferencing!