Surrogate Outcome Regression Analysis

Introduction

The SurrogateRegression package fits regression models when the target outcome is partially missing but a correlated surrogate outcome is available. Rather than treating the surrogate as a simple proxy, the package jointly models the target and surrogate in a bivariate normal regression framework. Missing values are handled directly in the likelihood: when only the target is missing, an accelerated least squares procedure is used; when both outcomes can be missing or design matrices differ, estimation uses an ECME algorithm. No assumptions are required about the relationship between target and surrogate beyond the bivariate model.

This vignette walks through the main functions: simulating data (rBNR), fitting the model (FitBNR), inspecting results (e.g. coef, vcov, residuals), and testing hypotheses on the target coefficients (TestBNR).

Data generation: rBNR

rBNR() simulates from a bivariate normal regression model with outcomes missing completely at random. You provide the target and surrogate design matrices (X, Z), regression coefficient vectors (b, a), and optional target/surrogate missingness fractions and covariance matrix.

set.seed(100)
n <- 800L
X <- cbind(1, rnorm(n))   # target design (intercept + 1 covariate)
Z <- cbind(1, rnorm(n))   # surrogate design
Y <- rBNR(
  X = X,
  Z = Z,
  b = c(1, 1),            # target coefficients
  a = c(-1, -1),          # surrogate coefficients
  t_miss = 0.2,           # 20% target missing
  s_miss = 0.1            # 10% surrogate missing
)
head(Y, 10)
#>        Target   Surrogate
#> 1   0.8552978 -0.14018771
#> 2   1.0001255 -5.49865133
#> 3   0.7597892 -0.80051545
#> 4          NA -0.68154468
#> 5   0.3207791 -2.28004537
#> 6   0.3851064  0.10403556
#> 7  -0.3908293  1.08703599
#> 8   3.2726895          NA
#> 9   1.9078554  0.56861011
#> 10         NA -0.03831721

The result is an n x 2 matrix with columns "Target" and "Surrogate"; NA indicates missing. By construction, at least one outcome is observed for each subject (t_miss + s_miss must be ≤ 1).

Fitting the model: FitBNR

FitBNR() takes the target and surrogate outcome vectors and their design matrices. It chooses the fitting method automatically: accelerated least squares when the surrogate is complete and the design is the same for both outcomes, and ECME when there is surrogate missingness or different designs.

fit <- FitBNR(
  t = Y[, "Target"],
  s = Y[, "Surrogate"],
  X = X,
  Z = Z
)
#> Objective increment:  0.352 
#> Objective increment:  0.0141 
#> Objective increment:  0.000563 
#> Objective increment:  2.25e-05 
#> Objective increment:  9.01e-07 
#> 4 update(s) performed before tolerance limit.

The returned object is of class bnr. Use show() (or print the object) to see the regression and covariance tables:

show(fit)
#>     Outcome Coefficient  Point     SE      L      U         p
#> 1    Target          x1  1.010 0.0408  0.931  1.090 2.43e-135
#> 2    Target          x2  0.986 0.0407  0.906  1.070 2.13e-129
#> 3 Surrogate          z1 -1.020 0.0371 -1.090 -0.945 1.23e-165
#> 4 Surrogate          z2 -0.959 0.0372 -1.030 -0.886 1.15e-146
#> 
#>         Covariance  Point     SE        L      U
#> 1           Target 1.0700 0.0597  0.95700 1.1900
#> 2 Target-Surrogate 0.0347 0.0435 -0.00881 0.0781
#> 3        Surrogate 0.9910 0.0522  0.89400 1.1000

Extracting results

Coefficients — Extract the regression coefficient table, optionally by outcome:

coef(fit)
#>     Outcome Coefficient      Point         SE          L          U
#> 1    Target          x1  1.0112884 0.04084382  0.9312360  1.0913408
#> 2    Target          x2  0.9858227 0.04073352  0.9059865  1.0656590
#> 3 Surrogate          z1 -1.0174233 0.03709281 -1.0901238 -0.9447227
#> 4 Surrogate          z2 -0.9586721 0.03717235 -1.0315285 -0.8858156
#>               p
#> 1 2.426259e-135
#> 2 2.131980e-129
#> 3 1.232975e-165
#> 4 1.150384e-146
coef(fit, type = "Target")
#>   Outcome Coefficient     Point         SE         L        U             p
#> 1  Target          x1 1.0112884 0.04084382 0.9312360 1.091341 2.426259e-135
#> 2  Target          x2 0.9858227 0.04073352 0.9059865 1.065659 2.131980e-129

Residuals — Fitted residuals for the target, surrogate, or both:

head(residuals(fit), 5)
#>       Target  Surrogate
#> 1  0.3390820  0.5747572
#> 2 -0.1408293 -2.6878226
#> 3 -0.1737009  0.4822566
#> 4         NA -0.2702931
#> 5 -0.8058222 -0.3821896
head(residuals(fit, type = "Target"), 5)
#>          1          2          3          4          5 
#>  0.3390820 -0.1408293 -0.1737009         NA -0.8058222

Variance-covariancevcov() can return the estimated residual covariance of the outcomes, the information matrix for regression coefficients, or for covariance parameters:

# Residual covariance of (target, surrogate)
vcov(fit, type = "Outcome")
#>               Target  Surrogate
#> Target    1.06786111 0.03465803
#> Surrogate 0.03465803 0.99068273
# Information matrix for regression coefficients (inverse = asymptotic cov of coefs)
info_reg <- vcov(fit, type = "Regression")
dim(info_reg)
#> [1] 4 4

Hypothesis testing: TestBNR

TestBNR() tests the null that a subset of the target regression coefficients is zero. You specify which coefficients are zero under the null with the logical vector is_zero (one entry per column of X), and choose test = "Wald" or test = "Score". When fitting uses least squares, only the Wald test is available.

# Test that the first target coefficient (intercept) is zero
test_intercept <- TestBNR(
  t = Y[, "Target"],
  s = Y[, "Surrogate"],
  X = X,
  Z = Z,
  is_zero = c(TRUE, FALSE),
  test = "Wald"
)
test_intercept
#>          Wald            df             p 
#>  6.130610e+02  1.000000e+00 2.415345e-135

# Test that the second target coefficient is zero
test_slope <- TestBNR(
  t = Y[, "Target"],
  s = Y[, "Surrogate"],
  X = X,
  Z = Z,
  is_zero = c(FALSE, TRUE),
  test = "Wald"
)
test_slope
#>          Wald            df             p 
#>  5.857336e+02  1.000000e+00 2.122841e-129

The function returns the test statistic, degrees of freedom, and p-value.

Partitioning by missingness: PartitionData

For advanced use, PartitionData() splits the data by missingness pattern (complete, target missing, surrogate missing) and precomputes inner products used in estimation. This is used internally by the package but can be useful for custom analyses.

part <- PartitionData(
  t = Y[, "Target"],
  s = Y[, "Surrogate"],
  X = X,
  Z = Z
)
names(part)
#> [1] "Orig"     "Dims"     "Inds"     "Complete" "TMiss"    "SMiss"    "IPs"
part$Dims$n0   # complete cases
#> [1] 560
part$Dims$n1   # target missing, surrogate observed
#> [1] 160
part$Dims$n2   # surrogate missing, target observed
#> [1] 80