The SurrogateRegression package fits regression models when the target outcome is partially missing but a correlated surrogate outcome is available. Rather than treating the surrogate as a simple proxy, the package jointly models the target and surrogate in a bivariate normal regression framework. Missing values are handled directly in the likelihood: when only the target is missing, an accelerated least squares procedure is used; when both outcomes can be missing or design matrices differ, estimation uses an ECME algorithm. No assumptions are required about the relationship between target and surrogate beyond the bivariate model.
This vignette walks through the main functions: simulating data
(rBNR), fitting the model (FitBNR), inspecting
results (e.g. coef, vcov,
residuals), and testing hypotheses on the target
coefficients (TestBNR).
rBNRrBNR() simulates from a bivariate normal regression
model with outcomes missing completely at random. You provide the target
and surrogate design matrices (X, Z),
regression coefficient vectors (b, a), and
optional target/surrogate missingness fractions and covariance
matrix.
set.seed(100)
n <- 800L
X <- cbind(1, rnorm(n)) # target design (intercept + 1 covariate)
Z <- cbind(1, rnorm(n)) # surrogate design
Y <- rBNR(
X = X,
Z = Z,
b = c(1, 1), # target coefficients
a = c(-1, -1), # surrogate coefficients
t_miss = 0.2, # 20% target missing
s_miss = 0.1 # 10% surrogate missing
)
head(Y, 10)
#> Target Surrogate
#> 1 0.8552978 -0.14018771
#> 2 1.0001255 -5.49865133
#> 3 0.7597892 -0.80051545
#> 4 NA -0.68154468
#> 5 0.3207791 -2.28004537
#> 6 0.3851064 0.10403556
#> 7 -0.3908293 1.08703599
#> 8 3.2726895 NA
#> 9 1.9078554 0.56861011
#> 10 NA -0.03831721The result is an n x 2 matrix with columns
"Target" and "Surrogate"; NA
indicates missing. By construction, at least one outcome is observed for
each subject (t_miss + s_miss must be ≤ 1).
FitBNRFitBNR() takes the target and surrogate outcome vectors
and their design matrices. It chooses the fitting method automatically:
accelerated least squares when the surrogate is complete and the design
is the same for both outcomes, and ECME when there is surrogate
missingness or different designs.
fit <- FitBNR(
t = Y[, "Target"],
s = Y[, "Surrogate"],
X = X,
Z = Z
)
#> Objective increment: 0.352
#> Objective increment: 0.0141
#> Objective increment: 0.000563
#> Objective increment: 2.25e-05
#> Objective increment: 9.01e-07
#> 4 update(s) performed before tolerance limit.The returned object is of class bnr. Use
show() (or print the object) to see the regression and
covariance tables:
show(fit)
#> Outcome Coefficient Point SE L U p
#> 1 Target x1 1.010 0.0408 0.931 1.090 2.43e-135
#> 2 Target x2 0.986 0.0407 0.906 1.070 2.13e-129
#> 3 Surrogate z1 -1.020 0.0371 -1.090 -0.945 1.23e-165
#> 4 Surrogate z2 -0.959 0.0372 -1.030 -0.886 1.15e-146
#>
#> Covariance Point SE L U
#> 1 Target 1.0700 0.0597 0.95700 1.1900
#> 2 Target-Surrogate 0.0347 0.0435 -0.00881 0.0781
#> 3 Surrogate 0.9910 0.0522 0.89400 1.1000Coefficients — Extract the regression coefficient table, optionally by outcome:
coef(fit)
#> Outcome Coefficient Point SE L U
#> 1 Target x1 1.0112884 0.04084382 0.9312360 1.0913408
#> 2 Target x2 0.9858227 0.04073352 0.9059865 1.0656590
#> 3 Surrogate z1 -1.0174233 0.03709281 -1.0901238 -0.9447227
#> 4 Surrogate z2 -0.9586721 0.03717235 -1.0315285 -0.8858156
#> p
#> 1 2.426259e-135
#> 2 2.131980e-129
#> 3 1.232975e-165
#> 4 1.150384e-146
coef(fit, type = "Target")
#> Outcome Coefficient Point SE L U p
#> 1 Target x1 1.0112884 0.04084382 0.9312360 1.091341 2.426259e-135
#> 2 Target x2 0.9858227 0.04073352 0.9059865 1.065659 2.131980e-129Residuals — Fitted residuals for the target, surrogate, or both:
head(residuals(fit), 5)
#> Target Surrogate
#> 1 0.3390820 0.5747572
#> 2 -0.1408293 -2.6878226
#> 3 -0.1737009 0.4822566
#> 4 NA -0.2702931
#> 5 -0.8058222 -0.3821896
head(residuals(fit, type = "Target"), 5)
#> 1 2 3 4 5
#> 0.3390820 -0.1408293 -0.1737009 NA -0.8058222Variance-covariance — vcov() can return
the estimated residual covariance of the outcomes, the information
matrix for regression coefficients, or for covariance parameters:
# Residual covariance of (target, surrogate)
vcov(fit, type = "Outcome")
#> Target Surrogate
#> Target 1.06786111 0.03465803
#> Surrogate 0.03465803 0.99068273
# Information matrix for regression coefficients (inverse = asymptotic cov of coefs)
info_reg <- vcov(fit, type = "Regression")
dim(info_reg)
#> [1] 4 4TestBNRTestBNR() tests the null that a subset of the
target regression coefficients is zero. You specify
which coefficients are zero under the null with the logical vector
is_zero (one entry per column of X), and
choose test = "Wald" or test = "Score". When
fitting uses least squares, only the Wald test is available.
# Test that the first target coefficient (intercept) is zero
test_intercept <- TestBNR(
t = Y[, "Target"],
s = Y[, "Surrogate"],
X = X,
Z = Z,
is_zero = c(TRUE, FALSE),
test = "Wald"
)
test_intercept
#> Wald df p
#> 6.130610e+02 1.000000e+00 2.415345e-135
# Test that the second target coefficient is zero
test_slope <- TestBNR(
t = Y[, "Target"],
s = Y[, "Surrogate"],
X = X,
Z = Z,
is_zero = c(FALSE, TRUE),
test = "Wald"
)
test_slope
#> Wald df p
#> 5.857336e+02 1.000000e+00 2.122841e-129The function returns the test statistic, degrees of freedom, and p-value.
PartitionDataFor advanced use, PartitionData() splits the data by
missingness pattern (complete, target missing, surrogate missing) and
precomputes inner products used in estimation. This is used internally
by the package but can be useful for custom analyses.
part <- PartitionData(
t = Y[, "Target"],
s = Y[, "Surrogate"],
X = X,
Z = Z
)
names(part)
#> [1] "Orig" "Dims" "Inds" "Complete" "TMiss" "SMiss" "IPs"
part$Dims$n0 # complete cases
#> [1] 560
part$Dims$n1 # target missing, surrogate observed
#> [1] 160
part$Dims$n2 # surrogate missing, target observed
#> [1] 80