S2.Rmd

---
title: "S2 Appendix"
output: 
  pdf_document:
    number_sections: true
bibliography: bibliography.bib
---

This appendix aims to provide a comparison between Random-Walk Metropolis (RMW)
and Hamiltonian Monte Carlo (HMC), two Markov chain Monte Carlo (MCMC)
algorithms. First, we illustrate how the algorithms explore the
parameter space by means of a toy example. Then, we compare the performance
of each algorithm in fitting an SEIR model to the Pandemic Flu data
(described in the main text).

\tableofcontents 

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE)

library(bayesplot)
library(cmdstanr)
library(dplyr)
library(ggplot2)
library(ggpubr)
library(lubridate)
library(MCMCpack)
library(parallel)
library(patchwork)
library(posterior)
library(purrr)
library(stringr)
library(readsdr)
library(readr)
library(tictoc)
library(tidybayes)
library(tidyr)

source("./R/plots.R")
source("./R/helpers.R") 
```

\newpage

# Toy example

```{r, fig.height = 2.5, fig.width = 2.5, fig.cap = "Toy example data", fig.align='center'}
# test data
set.seed(8)
y <- rnorm(50)
x <- rnorm(50)

x <- as.numeric(scale(x))
y <- as.numeric(scale(y))

data_df <- data.frame(x = x, y = y)

ggplot(data_df, aes(x = x, y = y)) +
  geom_point(colour = "steelblue") +
  theme_pubr() 
```

With the purpose of presenting an intuitive comparison between the Metropolis 
algorithm (RWM) and Hamiltonian Monte Carlo (HMC), we follow a simple example
from @mcelreath_20. We assume that we have received bidimensional data 
(see Figure 1). To this data, we propose a model (see equations 1-4) in which
we assume that the measurements in each dimension are independent. For each
dimension, we hypothesise that the measurements follow a normal distribution 
from a common mean and one standard deviation. Therefore, our goal is to 
estimate the posterior distribution of parameters $\mu_x$ and $\mu_y$. Our prior
knowledge about these parameters is represented by a normal distribution
(equations 3 and 4). The estimation is achieved via RWM and HMC.

\begin{equation}
x_i \sim Normal(\mu_x, 1)
\end{equation}

\begin{equation}
y_i \sim Normal(\mu_y, 1)
\end{equation}

\begin{equation}
\mu_x \sim Normal(0, 0.5)
\end{equation}

\begin{equation}
\mu_y \sim Normal(0, 0.5)
\end{equation}


## Parameter space exploration

In Figure 2, we depict the process of finding and exploring the posterior 
distribution (the dashed line represents the target distribution's 99% bounds).
In the left-hand side panels, we present the first ten samples generated by the
MCMC algorithms from the starting point (red dot). Here, we can notice 
Metropolis' random-walk behaviour. In spite of the samples get closer 
progressively to the target distribution, they follow a haphazard trajectory. In
addition to this, five proposals were rejected by Metropolis' acceptance 
criterion so that various samples overlap, and we can only distinguish half of 
them. In stark contrast, HMC samples move directly to the target 
distribution. Once the chain finds the target, it uses information from
parameter space's *landscape* to construct trajectories. Intuitively, we can
conceive the target distribution as a bowl whose bottom represents areas of high
plausibility. Here, sample generation corresponds to the process of throwing a 
marble inside the bowl with some momentum and let that marble explore the 
bowl's _curvature_ to record the position where the marble loses speed.
Recall that the marble loses speed when it reaches the bowl's bottom.
Hamiltonian mechanics play its role by describing, in terms of kinetic and 
potential energies, the trajectories that the frictionless particle is 
following. We refer the reader to @mcelreath_20 to complement this intuition. 

Further, in the right-hand side panels, we present the first 500 draws from each
method. Interestingly, they both find the target, but HMC draws spreads more
evenly across the posterior distribution. In other words, HMC samples provide
a more accurate description of the explored space than RWM.

```{r}
# neg-log-probability
U <- function(q, a = 0, b = 0.5, k = 0, d = 0.5) {
  muy <- q[1]
  mux <- q[2]
  U   <- sum( dnorm(y, muy, 1, log = TRUE) ) +
    sum( dnorm(x, mux, 1, log = TRUE) ) +
    dnorm(muy, a, b, log = TRUE) + 
    dnorm(mux, k, d, log = TRUE)
  return( -U )
}
```


```{r}
q_init <- runif(2, -2, 2)

metropolis_sampling <- function(q_init, n_samples) {
  samples_output      <- data.frame(x = rep(NA, n_samples),
                                    y = rep(NA, n_samples))
  samples_output[1, ] <- q_init
  
  q_current    <- q_init
  
  for(i in seq_len(n_samples - 1)) {
    prev_prob  <- - U(q_current )
    new_x      <- rnorm(1, q_current[[1]], 0.15)
    new_y      <- rnorm(1, q_current[[2]], 0.15)
    
    q_proposal <- c(new_x, new_y)
    new_prob   <- - U(q_proposal) # U returns the neg log prob
    
    acceptance_log_prob <- new_prob - prev_prob
    
    log_runif <- log(runif(1))

    if(log_runif < acceptance_log_prob ) {
      q_current  <- q_proposal
    }
    
    samples_output[i + 1, ] <- q_current 
  }
  
  samples_output
}

set.seed(123)
MCMC_samples <- metropolis_sampling(q_init, 10000)
```

```{r}
demo_n <- 10
path_df <- MCMC_samples[1:demo_n,] %>% 
  mutate(is_first = c(T, rep(F, demo_n - 1)))

a <- plot_MCMC_path(path_df, MCMC_samples)

demo_n <- 500
path_df <- MCMC_samples[1:demo_n,] %>% 
  mutate(is_first = c(T, rep(F, demo_n - 1)))

b <- plot_MCMC_path(path_df, MCMC_samples)
```


```{r}
# Code borrowed from McElreath(2020)
# gradient function
# need vector of partial derivatives of U with respect to vector q
U_gradient <- function( q , a = 0 , b = 0.5 , k=0 , d = 0.5) {
  muy <- q[1]
  mux <- q[2]
  G1 <- sum( y - muy ) + (a - muy)/b^2 #dU/dmuy
  G2 <- sum( x - mux ) + (k - mux)/d^2 #dU/dmux
  return( c( -G1 , -G2 ) ) # negative bc energy is neg-log-prob
}
```

```{r}
HMC2 <- function (U, grad_U, epsilon, L, current_q) {
  q <- current_q
  p <- rnorm(length(q), 0, 1) # random flick - p is momentum.
  current_p <- p
  # Make a half step for momentum at the beginning
  p <- p - epsilon * grad_U(q) / 2
  # initialize bookkeeping - saves trajectory
  qtraj <- matrix(NA,nrow=L+1,ncol=length(q))
  ptraj <- qtraj
  qtraj[1,] <- current_q
  ptraj[1,] <- p
  
  # Alternate full steps for position and momentum
  for ( i in 1:L ) {
    q <- q + epsilon * p # Full step for the position
    # Make a full step for the momentum, except at end of trajectory
    if ( i!= L ) {
      p           <- p - epsilon * grad_U(q)
      ptraj[i+1,] <- p
    }
    qtraj[i+1,] <- q
  }
  
  # Make a half step for momentum at the end
  p           <- p - epsilon * grad_U(q) / 2
  ptraj[L+1,] <- p
  # Negate momentum at end of trajectory to make the proposal symmetric
  p <- -p
  # Evaluate potential and kinetic energies at start and end of trajectory
  current_U  <- U(current_q)
  current_K  <- sum(current_p^2) / 2
  proposed_U <- U(q)
  proposed_K <- sum(p^2) / 2
  # Accept or reject the state at end of trajectory, returning either
  # the position at the end of the trajectory or the initial position
  accept <- 0
  
  if (runif(1) < exp(current_U-proposed_U+current_K-proposed_K)) {
    new_q <- q  # accept
    accept <- 1
  } else new_q <- current_q  # reject
  return(list( q=new_q, traj=qtraj, ptraj=ptraj, accept=accept ))
}
```

```{r}
source("./R/HMC_utils.R")

Q <- list()
Q$q <- q_init

step      <- 0.02
L         <- 11
n_samples <- 2000


HMC_results <- vector(mode = "list", length = n_samples)

set.seed(10)

for(i in seq_len(n_samples)) {
  Q                <- HMC2( U , U_gradient , step , L , Q$q)
  HMC_results[[i]] <- Q
}

map_df(HMC_results, function(sample_obj) {
  data.frame(x = sample_obj$q[[1]], y = sample_obj$q[[2]])
}) -> posterior_df

traj_df <- construct_traj_df(HMC_results, 10, q_init)

c <- plot_HMC_path(traj_df, posterior_df, q_init, "First 10 iterations")

traj_df <- construct_traj_df(HMC_results, 500, q_init)

d <- plot_HMC_path(traj_df, posterior_df, q_init, "First 500 iterations") 
```

```{r, fig.cap = "Parameter space exploration"}
(a + b) / (c +d)
```


\newpage

# Pandemic flu example

The purpose of this example is to provide a performance comparison between
HMC and RWM.

## Data

The graph below presents the daily number of influenza cases detected by the
United States Public Health Service in Cumberland (Maryland) during the 1918
influenza pandemic, from 22 September 1918 to 30 November 1918.

```{r, fig.height = 3.5, fig.width = 5, fig.align = 'center', fig.cap = "Cumberland's incidence data"}
source("./R/plots.R")

flu_data <- read_csv("./data/Cumberland_data_1918.csv") %>% 
  rename(time = Time, y = Cases) %>% 
  mutate(Date = dmy(Date),
         Week = epiweek(Date))


plot_daily_incidence(flu_data)
```


## Model

The equations below describe the calibrated model. We adopt the approach 
described in the main article. Namely, we assume three unknowns:
$\beta$, $\rho$, and $I(0)$. Likewise, we assign to these parameters the priors
explained in the _Prior information_ section in the main article.

\begin{equation}
 \dot S = \frac{- \beta S(t)I(t)}{N}
\end{equation}

\begin{equation}
 \dot E = \frac{- \beta S(t)I(t)}{N} - \sigma E(t)
\end{equation}

\begin{equation}
 \dot I = \sigma E(t) - \gamma I(t)
\end{equation}

\begin{equation}
 \dot R = \gamma I(t)
\end{equation}

\begin{equation}
 \dot C = \sigma E(t)
\end{equation}

\begin{equation}
  x(\tau+1) = \rho(C(\tau+1) - C(\tau))
\end{equation}

\begin{equation}
  y \sim Pois(x)
\end{equation}

# Performance comparison

Undoubtedly, for an SD practitioner, the transition to a new tool involves an
investment in time and resources. In this section, we present a set of metrics
that supports the adoption of HMC. Although this assessment is far from being comprehensive (in terms of metrics and models evaluated), it indicates the
performance gap between HMC and RWM.

In this experiment, we calibrate the SEIR model (Section 2.2) using the two
MCMC methods. To obtain samples via RWM, we follow the approach adopted by
@Osgood_2015. That is, we use the R package *MCMCpack*. On the other hand,
we use *Stan* to draw samples through HMC. We perform the calibration under
six scenarios, which differ in the number of iterations (100, 200, 500, 1000,
1500 and, 2000). For instance, the 100-iterations scenario indicates that
we allocate 100 iterations to the *burn-in/warm-up* phase and 100 iterations to
the *sampling* phase. It should be noted that we only use the draws from
the *sampling phase* to estimate diagnostic quantities. Conversely,
we take into account the *burn-in/warm-up* phase to measure execution times.


```{r}
source("./R/posterior_components.R")

pop_size           <- 5234
const_list         <- list(N = pop_size, sigma = 0.5, par_gamma = 0.5)
stock_list         <- list(R = 1570)
filepath           <- "./models/SEIR.stmx"
mdl                <- read_xmile(filepath, 
                                 const_list = const_list,
                                 stock_list = stock_list)

deSolve_components <- mdl$deSolve_components
loglik             <- generate_loglik_fun(mdl$deSolve_components, flu_data$y)

posterior_fun <- function(pars) loglik(pars) + logprior(pars) 

set.seed(3001141)
inits_I    <- runif(4, -2, 2)
inits_beta <- runif(4, -2, 2)
inits_rho  <- runif(4, -2, 2)
```

```{r, message = TRUE}
n_iters <- c(1e2, 2e2, 5e2, 1e3, 1.5e3, 2e3)
seeds   <- c(409198311,87064581, 696684459, 817963518, 411110113, 897071894)

fldr <- "./backup_objs/Comparison/MCMC_RWM"
dir.create(fldr, showWarnings = FALSE, recursive = TRUE)

map2(n_iters, seeds, function(n_iter, seed) {
  
  fl <- str_glue("{n_iter}.rds")
  fn <- file.path(fldr, fl)
  
  if(!file.exists(fn)) {
    
    message(str_glue("Starting process for {n_iter} samples"))
    
    
    tic.clearlog()
    tic()
  
    mclapply(1:4, function(i) {
      
      inits     <- c(inits_I[[i]], inits_beta[[i]], inits_rho[[i]])
      
      MCMCmetrop1R(fun = posterior_fun,
                   theta.init = c(inits),
                   mcmc       = n_iter, 
                   burnin     = n_iter,
                   seed       = seed)-> mcmc_output
  
      posterior_df <- as.data.frame(mcmc_output) %>%
        set_names(c("I0", "beta", "rho")) %>% 
        mutate(I0   = exp(I0),
               beta  = exp(beta),
               rho   = expit(rho),
               Chain = i)
    }, mc.cores = 4) -> post_obj
  
    toc(quiet = FALSE, log = TRUE)
    
    log.lst <- tic.log(format = FALSE)
      
    result_obj <- list(post_obj     = post_obj,
                       time         = log.lst)
  
    saveRDS(result_obj, fn)
  } else {
    result_obj <- readRDS(fn)
  }
  
  result_obj
}) -> full_results
```

```{r}
unk_pars <- c("I0", "beta", "rho") # Unknown parameters
posterior_list <- map(full_results, "post_obj") %>% 
  map(function(pos_obj) {
    do.call("bind_rows", pos_obj)
  })

names(posterior_list) <- n_iters

map(posterior_list, function(posterior_df) {
  mcmc_trace(posterior_df, pars = unk_pars,
             facet_args = list(labeller = label_parsed)) +
    labs(title = "RWM")
}) -> rwm_traces
```

```{r}
stan_filepath <- "./Stan_files/example/flu_poisson.stan"

n_iters <- c(1e2, 2e2, 5e2, 1e3, 1.5e3, 2e3)
seeds   <- c(409198311,87064581, 696684459, 817963518, 411110113, 897071894)

fldr    <- "./backup_objs/Comparison/MCMC_HMC"
dir.create(fldr, showWarnings = FALSE, recursive = TRUE)

map2(n_iters, seeds, function(n_iter, seed) {
  file_name  <- str_glue("{n_iter}.rds")
  file_path  <- file.path(fldr, file_name)
  

  if(!file.exists(file_path)) {
    
    tic.clearlog()
    tic()
    
    stan_d <- list(n_obs    = nrow(flu_data),
                   y        = flu_data$y,
                   n_params = 2,
                   n_difeq  = 5,
                   t0       = 0,
                   ts       = 1:length(flu_data$y))

    mod <- cmdstan_model(stan_filepath)

    fit <- mod$sample(data            = stan_d,
                      seed            = seed,
                      chains          = 4,
                      parallel_chains = 4,
                      iter_warmup     = n_iter,
                      iter_sampling   = n_iter,
                      refresh         = 5,
                      save_warmup     = FALSE,
                      output_dir      = fldr)

    toc(quiet = FALSE, log = TRUE)
    
    log.lst <- tic.log(format = FALSE)
    
    result_obj <- list(fit  = fit,
                       time = log.lst)
    
    saveRDS(result_obj, file_path)
  } else {
    result_obj <- readRDS(file_path)
  }
  
  result_obj
}) -> HMC_sens
```

```{r}
fit_objs        <- map(HMC_sens, "fit")
names(fit_objs) <- n_iters

```

```{r}
map(fit_objs , function(fit_obj) {
  mcmc_trace(fit_obj$draws(), pars = unk_pars,
             facet_args = list(labeller = label_parsed)) +
    labs(title = "HMC")
}) -> HMC_traces
```


## Computational time

Unsurprisingly, the first metric that comes to mind for measuring performance is
computational time, the time the practitioner waits for the results. In this 
evaluation, we ignore whether the Markov chains converge to the posterior 
distribution. The results (Figure 4) indicate that, in all scenarios, HMC takes
less time to produce an equal amount of samples than RWM does. Nevertheless,
this metric is not free of confounders. It is not known whether these 
differences are due to the methods per se or if, on the contrary, the observed
gap stems from performance discrepancies in the technological implementations.

```{r}
source("./R/helpers.R")

comp_times <- map_dbl(full_results, 
                      function(pos_obj) calculate_time(pos_obj$time))

cp_RWM <- data.frame(n_iter = n_iters, time = comp_times, method = "RWM")

comp_times <- map_dbl(HMC_sens, function(pos_obj) calculate_time(pos_obj$time))

cp_HMC     <- data.frame(n_iter = n_iters, time = comp_times, method = "HMC")

cp_df <- bind_rows(cp_RWM, cp_HMC)
```


```{r, fig.cap = "Computational time by MCMC method", fig.height = 3}
ggplot(cp_df, aes(x = n_iter, y = time)) +
  geom_point(aes(shape = method, colour = method), size = 4) +
  scale_colour_manual(values = c("#077893", "#FF9465")) +
  theme_pubclean() +
  labs(y = "Elapsed time [Minutes]", x = "# iters")
```

## Trace plots

Before estimating diagnostic quantities, it is always recommended to check
trace plots so as to detect convergence issues. The trace plots
in this section show the draws obtained in the *sampling phase*. As mentioned
above, we allocate an equal number of iterations for the *burn-in/warm-up* and
*sampling* phases. In the first scenario, MCMC samplers return 400 samples 
(4 chains * 100 samples from the sampling phase) that describe the target
(posterior) distribution. In these graphs, we observe that even from 100 warm-up
samples, HMC apparently reaches convergence. Conversely, RWM seems to require at
least 1500 burn-in draws for the chains to converge.

### 100-iterations scenario

```{r, fig.height = 5, fig.cap= "Trace plots of the 100-iterations scenario by method"}
rwm_traces[[1]] / HMC_traces[[1]]
```

\newpage

### 200-iterations scenario

```{r, fig.height = 3, fig.cap= "Trace plots of the 200-iterations scenario by method"}
rwm_traces[[2]] / HMC_traces[[2]]
```

### 500-iterations scenario

```{r, fig.height = 3, fig.cap= "Trace plots of the 500-iterations scenario by method"}
rwm_traces[[3]] / HMC_traces[[3]]
```

\newpage

### 1000-iterations scenario

```{r, fig.height = 3, fig.cap= "Trace plots of the 1000-iterations scenario by method"}
rwm_traces[[4]] / HMC_traces[[4]]
```

### 1500-iterations scenario

```{r, fig.height = 3, fig.cap= "Trace plots of the 1500-iterations scenario by method"}
rwm_traces[[5]] / HMC_traces[[5]]
```

\newpage

### 2000-iterations scenario

```{r, fig.height = 3, fig.cap= "Trace plots of the 1500-iterations scenario by method"}
rwm_traces[[6]] / HMC_traces[[6]]
```

## Potential scale reduction factor

$\widehat{R}$ is a measure of convergence, and unlike computational time, it is
technologically independent. Based on the results presented in Figure 11, we 
confirm that, for this example, RWM requires at least 2000 burn-in samples so 
that all parameters reach convergence ($\widehat{R} < 1.01$), a value
significantly higher than the equivalent number of samples (500) observed in HMC.

```{r rhat_RWM}
imap_dfr(posterior_list, function(df, n_iter) {
  
  map_df(unk_pars, function(par) {
    
    dplyr::select(df, par, Chain) %>% 
      mutate(id = rep(1:(n_iter), 4)) %>% 
      pivot_wider(names_from = "Chain", values_from = !!ensym(par)) %>% 
      dplyr::select(-id) %>% 
      as.matrix() -> chain_matrix
    
      data.frame(par        = par, 
                   rhat     = rhat(chain_matrix),
                   ess_bulk = ess_bulk(chain_matrix),
                   ess_tail = ess_tail(chain_matrix),
                   n_iter   = n_iter)
  })
}) %>% mutate(method = "RWM") -> diag_RWM
```

```{r rhat_HMC}
imap_dfr(fit_objs, function(fit_object, n_iter) {
  
    map_dfr(unk_pars, function(par) {
      
      chain_matrix <- extract_variable_matrix(fit_object$draws(), par)
      
      data.frame(par      = par, 
                   rhat     = rhat(chain_matrix),
                   ess_bulk = ess_bulk(chain_matrix),
                   ess_tail = ess_tail(chain_matrix),
                   n_iter   = n_iter)
    })
}) %>% mutate(method = "HMC") -> diag_HMC
```


```{r, fig.height = 4, fig.cap = "R-hat comparison"}
diag_df <- bind_rows(diag_RWM, diag_HMC) %>% 
  mutate(par = ifelse(par == "I0", "I(0)", par))

ggplot(diag_df, aes(x = as.numeric(n_iter), y = rhat)) +
  geom_point(aes(colour = method), size = 0.5) +
  geom_line(aes(group = method, colour = method), alpha = 0.5) +
  scale_colour_manual(values = c("#077893", "#FF9465")) +
  facet_grid(method~par, labeller = label_parsed, scales = "free") +
  geom_hline(yintercept = 1.01, colour = "red", linetype = "dashed") +
  labs(x = "# iters", y = expression(widehat(R)),
       subtitle = "A)") + 
  theme_tidybayes() +
  theme(legend.position = "none",
        axis.text = element_text(size = 6, colour = "grey50")) -> a

print(a)
```

## Effective Sample Size

If $\widehat{R}$ is a measure of convergence, the effective sample size (ESS) is
a measure of efficiency. This metric helps us answer: *Are 2,000 samples from
RWM equivalent to 2,000 samples from HMC?* The reader should recall that the ESS 
approximates the number of independent samples from the MCMC draws, which are 
correlated by definition. We present two types of ESS: bulk & tail. ESS-bulk is
defined in OS1. From the Stan manual, ESS-tail is defined as the minimum of 
effective sample sizes for 5% and 95% quantiles. ESS-Tail is a useful measure
for sampling efficiency in the tails of the distribution (related e.g., to the
efficiency of variance and tail quantile estimates). Both metrics should be at
least 400. In Figure 12, we can observe that HMC produces a higher number of
ESS than RWM in all scenarios. Furthermore, HMC satisfies the 400-ESS threshold
from 500 iterations per chain, a third of the iterations required by RWM.

```{r, fig.height = 3.5, fig.cap = "ESS comparison"}
diag_df %>% dplyr::select(-rhat) %>% 
  pivot_longer(c(-par, -n_iter, -method)) -> ess_df

ggplot(ess_df, aes(x = as.numeric(n_iter), y = value)) +
  geom_point(aes(shape = method, colour = method), alpha = 0.75, size = 1.5) +
  scale_y_log10() +
  scale_colour_manual(values = c("#077893", "#FF9465")) +
  facet_grid(name ~ par, labeller = label_parsed) +
  geom_hline(yintercept = 400, colour = "red", linetype = "dashed") +
  labs(x = "# iters", y = "ESS (log scale)",
       subtitle = "B)") +
  guides(colour = guide_legend(title="Method"),
         shape  = guide_legend(title="Method")) +
  theme_test() +
  theme(axis.text = element_text(size = 6, colour = "grey50"))-> b

print(b)
```

```{r}
ggsave("./plots/Fig10_performance_comparison.pdf", plot = a / b)
ggsave("./plots/tiff_files/Fig10_performance_comparison.tiff", plot = a / b, dpi = 600)
```

# Original Computing Environment

```{r}
sessionInfo()
```

\newpage

# References