Fitting issues for model with measurement error

Hi all,

After working with Stan through brms, I am exploring some more complex models, and this led me to run some simulations and tests of models with measurement error, using the example in the Stan User's Guide as a guideline.


Simulating some data:

N <- 1000
a <- 2
b <- 7
sigma <- 1
mu_x <- 0
sigma_x <- 5

x <- rnorm(N, mu_x, sigma_x)    # latent true covariate
y <- rnorm(N, a + b * x, sigma) # outcome
tau <- 1
x_meas <- rnorm(N, x, tau)      # noisy measurement of x

data_meas <- list(
  x_meas = x_meas,
  y = y,
  N = N
)

In other words, a regression model of the form y \sim \text{Normal}(a + bx, \sigma),
but with x as a latent variable measured with error: x_{meas} \sim \text{Normal}(x, \tau).
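As an aside (my own sketch, not part of the original post): this model class matters because ignoring the measurement error and regressing y on x_meas directly attenuates the slope by the factor sigma_x^2 / (sigma_x^2 + tau^2). A quick Python check mirroring the R simulation above:

```python
import random

random.seed(1)

# same generative values as the R simulation above
N = 100_000  # large N so sample moments sit close to their expectations
a, b, sigma = 2.0, 7.0, 1.0
mu_x, sigma_x, tau = 0.0, 5.0, 1.0

x = [random.gauss(mu_x, sigma_x) for _ in range(N)]
y = [random.gauss(a + b * xi, sigma) for xi in x]
x_meas = [random.gauss(xi, tau) for xi in x]

# naive OLS slope of y on the noisy x_meas
mx = sum(x_meas) / N
my = sum(y) / N
cov = sum((xm - mx) * (yi - my) for xm, yi in zip(x_meas, y)) / N
var = sum((xm - mx) ** 2 for xm in x_meas) / N
slope = cov / var

attenuation = sigma_x**2 / (sigma_x**2 + tau**2)  # = 25/26
print(slope)                  # pulled toward b * 25/26 ~ 6.73, not b = 7
print(b * attenuation)
```

With tau = 1 against sigma_x = 5 the bias is mild, but it is exactly what the latent-variable model is there to undo.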

I tested two parameterizations of the same model. First, the centered one:

data {
  int<lower=1> N;
  vector[N] x_meas;
  vector[N] y;
}
parameters {
  vector[N] x;
  real mu_x;
  real<lower=0> sigma_x;
  real<lower=0> tau;
  real a;
  real b;
  real<lower=0> sigma;
}
model {
  // Priors
  tau ~ exponential(1);
  sigma ~ exponential(1);
  sigma_x ~ cauchy(0,5);
  mu_x ~ normal(0, 1);
  a ~ normal(0, 5);
  b ~ normal(0, 5);

  // Latent variable model (centered)
  x ~ normal(mu_x, sigma_x);

  // Likelihoods
  x_meas ~ normal(x, tau);
  y ~ normal(a + b * x, sigma);
}

and a non-centered one:

parameters {
  vector[N] x_raw;
  // ...
}
transformed parameters {
  vector[N] x = mu_x + x_raw * sigma_x;
}

model {
  // latent variable model (non-centered)
  x_raw ~ normal(0, 1);     // Implies x ~ normal(mu_x, sigma_x)
  // ...
}

Using cmdstanr to test these models on my simulated data, I'm having a variety of issues depending on my choice of priors and/or simulation procedure. Sometimes the centered version samples better, sometimes the non-centered one. But in both cases I consistently get the E-BFMI less than 0.3 warning, nasty traceplots, and very low effective sample sizes / high Rhats for at least some of the parameters, with an ess_bulk on the order of about 10 (although the sample means do usually recover the simulation parameters).

I’m wondering if there is something obvious I’m doing wrong here. Maybe the model is inherently not identifiable? Maybe my priors are wildly off? Perhaps my parameterization in both cases is wrong? Any advice would be appreciated.

Using a wide, heavy-tailed half-Cauchy as the standard deviation of a normal nested inside another normal may make it very hard for the sampler to converge. Have you tried something tighter with natural [0, inf) support, like a lognormal?

I’ve tried several flavors of exponential, with rate parameters ranging from 0.5 to 1.5, and while some do better than others, they all give me a lot of warnings. Testing with a lognormal(0, 1), I don’t see much of a difference either.

My guess is that there’s too much noise on the measurement given the coefficient a = 2. The prior with sigma_x = 5 isn’t going to help much with identifying the x. I would also guess the fit and sampling will get better with more data.

You can also be in the dead zone where neither the centered nor the non-centered parameterization works well. If that’s the case, then you’ll find fitting gets easier with an order of magnitude more data using the centered form, and with an order of magnitude less data using the non-centered form.

Tighter priors should help here. There are two major differences between a half-Cauchy and a lognormal.

The lognormal, like some parameterizations of the inverse gamma, is zero-avoiding in the sense that its log density goes to -infinity as the variate goes to zero. The exponential and half-Cauchy, on the other hand, place non-zero density at zero. If the data are consistent with a zero value of the parameter, these can be problematic, as they drive the unconstrained parameter toward -infinity.

The lognormal and exponential also have much narrower tails than the Cauchy. A heavy-tailed prior lets the sampler push values far out into the tails, and if you don’t have enough data to counteract that, it can be a problem.
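To put rough numbers on both points (tail mass and behavior near zero), here is a quick Python check of my own using only closed-form CDFs and densities — not from the original thread:

```python
import math

def halfcauchy_tail(x, scale):
    # P(X > x) for a half-Cauchy(0, scale)
    return 1 - (2 / math.pi) * math.atan(x / scale)

def exponential_tail(x, rate):
    # P(X > x) for an exponential with the given rate
    return math.exp(-rate * x)

def lognormal_tail(x, mu, s):
    # P(X > x) = 1 - Phi((ln x - mu) / s), via the complementary error function
    return 0.5 * math.erfc((math.log(x) - mu) / (s * math.sqrt(2)))

# Tail mass beyond 20 -- an implausibly large residual scale for this problem:
print(halfcauchy_tail(20, 5))    # ~0.16: the half-Cauchy leaves real mass out there
print(lognormal_tail(20, 0, 1))  # ~0.0014
print(exponential_tail(20, 1))   # ~2e-9

# Behavior near zero: exponential(1) density at 0 is 1 (finite, positive),
# while the lognormal(0,1) density vanishes as x -> 0+ (zero-avoiding)
ln_dens = lambda x: math.exp(-(math.log(x)) ** 2 / 2) / (x * math.sqrt(2 * math.pi))
print(ln_dens(1e-6))
```

So a half-Cauchy(0, 5) still puts roughly 16% of its mass above 20, while the lognormal and exponential make such values essentially impossible a priori.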

P.S. An easier way to non-center a parameterization is to declare

vector<offset=mu_x, multiplier=sigma_x>[N] x;
...
x ~ normal(mu_x, sigma_x);

I think I solved it:

So, annoyingly, while tighter priors like lognormal(0, 1) do a bit better, I still got low E-BFMI warnings and very low ESS for my variance parameters sigma and tau — so the problem really seemed to be inherent to the model. Simpler models, such as the one I copied from here, worked fine until x_obs and y_obs got separate variances. I realized that, with the data given, the two variance components of the model remain unidentifiable unless I give them very different scales, and very different priors to go with those scales (see here).
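One way to see this non-identifiability concretely (my own sketch, not from the thread): with a single measurement per subject, everything the data can say is contained in the means, variances, and covariance of the observed pair (x_meas, y), and distinct parameter settings reproduce those moments exactly (the means can likewise be matched by adjusting a and mu_x):

```python
# Second moments of (x_meas, y) after marginalizing out the latent
# x ~ Normal(mu_x, sigma_x):
#   Var(x_meas)    = sigma_x^2 + tau^2
#   Cov(x_meas, y) = b * sigma_x^2
#   Var(y)         = b^2 * sigma_x^2 + sigma^2
def observed_moments(b, sigma_x2, tau2, sigma2):
    return (sigma_x2 + tau2, b * sigma_x2, b**2 * sigma_x2 + sigma2)

# True simulation values: b = 7, sigma_x^2 = 25, tau^2 = 1, sigma^2 = 1
m_true = observed_moments(7.0, 25.0, 1.0, 1.0)

# A different setting that trades variance between sigma_x and tau,
# adjusting b and sigma to compensate:
sigma_x2_alt = 25.5
b_alt = 175.0 / sigma_x2_alt                    # keeps Cov(x_meas, y) = 175
tau2_alt = 26.0 - sigma_x2_alt                  # keeps Var(x_meas) = 26
sigma2_alt = 1226.0 - b_alt**2 * sigma_x2_alt   # keeps Var(y) = 1226
m_alt = observed_moments(b_alt, sigma_x2_alt, tau2_alt, sigma2_alt)

print(m_true)  # (26.0, 175.0, 1226.0)
print(m_alt)   # identical observed moments from different parameters
```

Since both parameter vectors imply the same Gaussian distribution for the observables, only the priors can separate them, which is consistent with the sticky posteriors and low E-BFMI above.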

I validated this by changing to a repeated-measures design, which instantly fixes it, as tau can now be estimated from the data:

N <- 100
reps <- 3

sigma <- 0.5
tau <- 0.3
alpha <- 1.3
beta <- -0.4

# true covariate values
x <- runif(N, -3, 3)

# index of the true x behind each repeated measurement
idxs <- rep(1:N, each = reps)

x_obs <- rnorm(N * reps, x[idxs], tau)
y <- rnorm(N * reps, alpha + beta * x[idxs], sigma)
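Why replication identifies tau (a quick numeric check of my own, using the same setup as the R code above): the spread of the repeated x_obs around their per-subject mean depends on tau alone, so tau^2 can be read straight off the pooled within-group variance, independently of sigma, alpha, beta, or the distribution of the true x:

```python
import random
import statistics

random.seed(7)

N, reps = 100, 3
tau = 0.3

# true covariate values, as in the R simulation
x = [random.uniform(-3, 3) for _ in range(N)]

# reps noisy measurements of each true x
x_obs = [[random.gauss(xi, tau) for _ in range(reps)] for xi in x]

# pooled within-group variance is an unbiased estimator of tau^2
tau2_hat = statistics.mean(statistics.variance(group) for group in x_obs)
print(tau2_hat)  # close to tau^2 = 0.09
```

With tau pinned down by the replicates, sigma is then identified from the residual variance of y, and the degeneracy between the two variance components disappears.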

and the model:

data {
  int<lower=1> N;
  int<lower=1> reps;
  vector[N*reps] x_obs;
  vector[N*reps] y;
  array[N*reps] int<lower=1, upper=N> idxs;
}

parameters {
  vector[N] x;
  real alpha;
  real beta;
  real<lower=0> sigma;
  real<lower=0> tau;
}

model {
  // priors
  alpha ~ normal(0, 5);
  beta ~ normal(0, 5);
  sigma ~ lognormal(0, 1);
  tau ~ lognormal(0, 1);
  x ~ normal(0, 5);

  // likelihood (vectorized over all N*reps observations)
  x_obs ~ normal(x[idxs], tau);
  y ~ normal(alpha + beta * x[idxs], sigma);
}

In hindsight, I missed that tau is data, not a parameter, in the example here.

Solved thanks to the forum history!
