Title: | Missingness Aware Gaussian Mixture Models |
---|---|
Description: | Parameter estimation and classification for Gaussian Mixture Models (GMMs) in the presence of missing data. This package complements existing implementations by allowing for both missing elements in the input vectors and full (as opposed to strictly diagonal) covariance matrices. Estimation is performed using an expectation conditional maximization algorithm that accounts for missingness of both the cluster assignments and the vector components. The output includes the marginal cluster membership probabilities; the mean and covariance of each cluster; the posterior probabilities of cluster membership; and a completed version of the input data, with missing values imputed to their posterior expectations. For additional details, please see McCaw ZR, Julienne H, Aschard H. "Fitting Gaussian mixture models on incomplete data." <doi:10.1186/s12859-022-04740-9>. |
Authors: | Zachary McCaw [aut, cre] |
Maintainer: | Zachary McCaw <[email protected]> |
License: | GPL-3 |
Version: | 1.0.1.2 |
Built: | 2024-10-31 03:50:14 UTC |
Source: | https://github.com/zrmacc/mgmm |
Calculates the Calinski-Harabasz index.
CalHar(data, assign, means)
data |
Observations. |
assign |
Assignments. |
means |
List of cluster means. |
Scalar metric.
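The index can be sketched in base R as the ratio of between-cluster to within-cluster dispersion, each scaled by its degrees of freedom. The helper `calinski_harabasz` below is hypothetical and is not the source of `CalHar`; it assumes `assign` holds cluster labels and that `means` is ordered by sorted label.

```r
# Hypothetical helper illustrating the textbook Calinski-Harabasz index.
# Assumption: means[[j]] is the mean of the j-th sorted label in `assign`.
calinski_harabasz <- function(data, assign, means) {
  n <- nrow(data)
  labels <- sort(unique(assign))
  k <- length(labels)
  grand_mean <- colMeans(data)
  between <- 0
  within <- 0
  for (j in seq_len(k)) {
    rows <- data[assign == labels[j], , drop = FALSE]
    # Between-cluster dispersion: cluster size times squared distance
    # of the cluster mean from the grand mean.
    between <- between + nrow(rows) * sum((means[[j]] - grand_mean)^2)
    # Within-cluster dispersion: squared distances to the cluster mean.
    within <- within + sum(sweep(rows, 2, means[[j]])^2)
  }
  (between / (k - 1)) / (within / (n - k))
}

# Toy data: two tight, well-separated clusters on a line.
data <- matrix(c(0, 1, 10, 11), ncol = 1)
assign <- c(1, 1, 2, 2)
means <- list(0.5, 10.5)
ch <- calinski_harabasz(data, assign, means)
```

Larger values arise when clusters are compact relative to their separation, which matches the "higher is better" reading of CHI.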
Function to choose the number of clusters k. Examines cluster numbers between k0 and k1. For each cluster number, generates boot bootstrap data sets, fits the Gaussian Mixture Model (FitGMM), and calculates quality metrics (ClustQual). For each metric, determines the optimal cluster number k_opt, and k_1SE, the smallest cluster number whose quality is within 1 SE of the optimum.
ChooseK( data, k0 = 2, k1 = NULL, boot = 100, init_means = NULL, fix_means = FALSE, init_covs = NULL, lambda = 0, init_props = NULL, maxit = 10, eps = 1e-04, report = TRUE )
data |
Numeric data matrix. |
k0 |
Minimum number of clusters. |
k1 |
Maximum number of clusters. |
boot |
Bootstrap replicates. |
init_means |
Optional list of initial mean vectors. |
fix_means |
Fix the means to their starting value? Must provide initial values. |
init_covs |
Optional list of initial covariance matrices. |
lambda |
Optional ridge term added to covariance matrix to ensure positive definiteness. |
init_props |
Optional vector of initial cluster proportions. |
maxit |
Maximum number of EM iterations. |
eps |
Minimum acceptable increment in the EM objective. |
report |
Report bootstrap progress? |
List containing Choices, the recommended number of clusters according to each quality metric, and Results, the mean and standard error of the quality metrics at each cluster number evaluated.
See ClustQual for evaluating cluster quality, and FitGMM for estimating the GMM with a specified cluster number.
set.seed(100)
mean_list <- list(c(2, 2), c(2, -2), c(-2, 2), c(-2, -2))
data <- rGMM(n = 500, d = 2, k = 4, means = mean_list)
choose_k <- ChooseK(data, k0 = 2, k1 = 6, boot = 10)
choose_k$Choices
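The one-standard-error rule described for ChooseK can be illustrated with a small base R sketch. The helper `one_se_choice` is hypothetical, the metric values are made-up toy numbers, and the sketch assumes the tolerance is the standard error at the optimal cluster number; the package may define the band differently.

```r
# Hypothetical helper illustrating a one-standard-error selection rule.
# Assumption: "within 1 SE" uses the SE observed at the optimal k.
one_se_choice <- function(k, mean_metric, se_metric, higher_is_better = TRUE) {
  # Flip the sign for metrics where lower is better (e.g. BIC, DBI).
  sign <- if (higher_is_better) 1 else -1
  score <- sign * mean_metric
  best <- which.max(score)
  threshold <- score[best] - se_metric[best]
  c(k_opt = k[best], k_1se = min(k[score >= threshold]))
}

# Toy bootstrap summaries for a "higher is better" metric.
k <- 2:6
metric_mean <- c(0.40, 0.58, 0.60, 0.59, 0.55)
metric_se <- c(0.02, 0.02, 0.03, 0.03, 0.03)
choice <- one_se_choice(k, metric_mean, metric_se)
```

Here the optimum is at k = 4, but k = 3 already falls within one standard error of it, so the more parsimonious choice is k = 3.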
Evaluates cluster quality. Returns the following metrics:
BIC: Bayesian Information Criterion, lower value indicates better clustering quality.
CHI: Calinski-Harabasz Index, higher value indicates better clustering quality.
DBI: Davies-Bouldin Index, lower value indicates better clustering quality.
SIL: Silhouette Width, higher value indicates better clustering quality.
ClustQual(fit)
fit |
Object of class mix. |
List containing the cluster quality metrics.
See ChooseK for using quality metrics to choose the cluster number.
set.seed(100)

# Data generation
mean_list <- list(
  c(2, 2, 2),
  c(-2, 2, 2),
  c(2, -2, 2),
  c(2, 2, -2)
)
data <- rGMM(n = 500, d = 3, k = 4, means = mean_list)
fit <- FitGMM(data, k = 4)

# Clustering quality
cluster_qual <- ClustQual(fit)
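As a rough guide to how a BIC of the "lower is better" kind is typically formed for a full-covariance GMM, the sketch below counts free parameters and penalizes the log likelihood. The helper `gmm_bic` is hypothetical; the parameter count and the use of the number of rows as the sample size are assumptions, not a statement of how ClustQual computes BIC.

```r
# Hypothetical helper: BIC = -2 * logLik + (number of parameters) * log(n).
# Assumption: k components, each with a free d-vector mean and a full
# d x d covariance, plus k - 1 free mixture proportions.
gmm_bic <- function(log_lik, n, d, k) {
  n_par <- (k - 1) + k * d + k * d * (d + 1) / 2
  -2 * log_lik + n_par * log(n)
}

# Illustrative inputs only; log_lik would come from a fitted model.
bic <- gmm_bic(log_lik = -1000, n = 500, d = 3, k = 4)
```

With d = 3 and k = 4 the count is 3 + 12 + 24 = 39 parameters, so the penalty grows quickly with the number of components.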
Combines point estimates and standard errors across multiple imputations.
CombineMIs(points, covs)
points |
List of point estimates, potentially vector valued. |
covs |
List of sampling covariances, potentially matrix valued. |
List containing the final point estimate ('point') and sampling covariance ('cov').
set.seed(100)

# Generate data and introduce missingness.
data <- rGMM(n = 25, d = 2, k = 1)
data[1, 1] <- NA
data[2, 2] <- NA
data[3, ] <- NA

# Fit GMM.
fit <- FitGMM(data)

# Lists to store summary statistics.
points <- list()
covs <- list()

# Perform 50 multiple imputations.
# For each, calculate the marginal mean and its sampling variance.
for (i in seq_len(50)) {
  imputed <- GenImputation(fit)
  points[[i]] <- apply(imputed, 2, mean)
  covs[[i]] <- cov(imputed) / nrow(imputed)
}

# Combine summary statistics across imputations.
results <- CombineMIs(points, covs)
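A common way to pool multiply imputed estimates is Rubin's rules: average the point estimates, then add the average within-imputation covariance to an inflated between-imputation covariance. The sketch below assumes CombineMIs follows this standard recipe, which the documentation does not state explicitly; `rubin_combine` is a hypothetical name.

```r
# Hypothetical helper applying Rubin's rules to lists of estimates.
# Assumption: this mirrors the usual pooling formula, not necessarily
# the exact computation inside CombineMIs.
rubin_combine <- function(points, covs) {
  m <- length(points)
  est <- do.call(rbind, points)             # m x p matrix of estimates
  point <- colMeans(est)                    # pooled point estimate
  within <- Reduce(`+`, covs) / m           # mean within-imputation covariance
  centered <- sweep(est, 2, point)
  between <- crossprod(centered) / (m - 1)  # between-imputation covariance
  list(point = point, cov = within + (1 + 1 / m) * between)
}

# Toy inputs: three imputations of a bivariate estimate.
points <- list(c(1, 2), c(3, 2), c(2, 2))
covs <- list(diag(2), diag(2), diag(2))
res <- rubin_combine(points, covs)
```

Only the first coordinate varies across imputations, so only its pooled variance is inflated above the within-imputation value.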
Calculates the Davies-Bouldin index.
DavBou(data, assign, means)
data |
Observations |
assign |
Assignments |
means |
List of cluster means |
Scalar index.
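For orientation, the textbook Davies-Bouldin index averages, over clusters, the worst ratio of combined within-cluster scatter to between-centroid distance. The base R sketch below uses a hypothetical helper `davies_bouldin`; it assumes Euclidean distances, at least two clusters, and `means` ordered by sorted label, and it is not the source of `DavBou`.

```r
# Hypothetical helper illustrating the textbook Davies-Bouldin index.
# Assumptions: Euclidean distance, k >= 2, means[[j]] matches the j-th
# sorted label in `assign`.
davies_bouldin <- function(data, assign, means) {
  labels <- sort(unique(assign))
  k <- length(labels)
  # Scatter: average distance from each cluster's points to its mean.
  scatter <- vapply(seq_len(k), function(j) {
    rows <- data[assign == labels[j], , drop = FALSE]
    mean(sqrt(rowSums(sweep(rows, 2, means[[j]])^2)))
  }, numeric(1))
  # For each cluster, the largest similarity ratio with any other cluster.
  worst <- vapply(seq_len(k), function(i) {
    others <- setdiff(seq_len(k), i)
    ratios <- vapply(others, function(j) {
      (scatter[i] + scatter[j]) / sqrt(sum((means[[i]] - means[[j]])^2))
    }, numeric(1))
    max(ratios)
  }, numeric(1))
  mean(worst)
}

# Toy data: two tight, well-separated clusters on a line.
data <- matrix(c(0, 1, 10, 11), ncol = 1)
assign <- c(1, 1, 2, 2)
means <- list(0.5, 10.5)
dbi <- davies_bouldin(data, assign, means)
```

Small values indicate compact clusters that sit far apart, consistent with the "lower is better" reading of DBI.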
Given an n x d matrix of random vectors, estimates the parameters of a Gaussian Mixture Model (GMM). Accommodates arbitrary patterns of missingness in the input vectors, provided the elements are missing at random (MAR).
FitGMM( data, k = 1, init_means = NULL, fix_means = FALSE, init_covs = NULL, lambda = 0, init_props = NULL, maxit = 100, eps = 1e-06, report = TRUE )
data |
Numeric data matrix. |
k |
Number of mixture components. Defaults to 1. |
init_means |
Optional list of initial mean vectors. |
fix_means |
Fix the means to their starting value? Must provide initial values. |
init_covs |
Optional list of initial covariance matrices. |
lambda |
Optional ridge term added to covariance matrix to ensure positive definiteness. |
init_props |
Optional vector of initial cluster proportions. |
maxit |
Maximum number of EM iterations. |
eps |
Minimum acceptable increment in the EM objective. |
report |
Report fitting progress? |
Initial values for the cluster means, covariances, and proportions are specified using init_means, init_covs, and init_props, respectively. If the data contains complete observations, i.e. observations with no missing elements, then FitGMM will attempt to initialize these parameters internally using K-means. If the data contains no complete observations, then initial values are required for init_means, init_covs, and init_props.
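A K-means initialization on the complete cases can be sketched in base R as follows. This is only an illustration of the idea, under the assumption that starting values are derived from a K-means partition of the complete rows; the internal initialization in the package may differ in detail. The resulting lists have the shapes expected by the `init_means`, `init_covs`, and `init_props` arguments.

```r
set.seed(1)

# Toy data: two well-separated bivariate clusters with a few missing values.
data <- rbind(
  matrix(rnorm(40, mean = -3), ncol = 2),
  matrix(rnorm(40, mean = 3), ncol = 2)
)
data[1, 1] <- NA
data[25, 2] <- NA

# Restrict to rows with no missing elements.
complete <- data[stats::complete.cases(data), , drop = FALSE]

# K-means partition of the complete cases.
km <- stats::kmeans(complete, centers = 2, nstart = 5)

# Starting values derived from the partition.
init_means <- lapply(seq_len(2), function(j) km$centers[j, ])
init_covs <- lapply(seq_len(2), function(j) {
  stats::cov(complete[km$cluster == j, , drop = FALSE])
})
init_props <- as.numeric(table(km$cluster)) / nrow(complete)

# These could then be supplied explicitly, for example:
# fit <- FitGMM(data, k = 2, init_means = init_means,
#               init_covs = init_covs, init_props = init_props)
```

Supplying starting values this way is mainly useful when no row is fully observed, in which case they would have to come from prior knowledge rather than from K-means.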
For a single component, an object of class mvn, containing the estimated mean and covariance, the final objective function, and the imputed data.
For a multicomponent model, an object of class mix, containing the estimated means, covariances, cluster proportions, cluster responsibilities, and observation assignments.
See rGMM for data generation, and ChooseK for selecting the number of clusters.
# Single component without missingness.
# Bivariate normal observations.
sigma <- matrix(c(1, 0.5, 0.5, 1), nrow = 2)
data <- rGMM(n = 1e3, d = 2, k = 1, means = c(2, 2), covs = sigma)
fit <- FitGMM(data, k = 1)

# Single component with missingness.
# Trivariate normal observations.
sigma <- matrix(c(1, 0.5, 0.5, 0.5, 1, 0.5, 0.5, 0.5, 1), nrow = 3)
data <- rGMM(n = 1e3, d = 3, k = 1, miss = 0.1, means = c(2, 2, 2), covs = sigma)
fit <- FitGMM(data, k = 1)

# Two components without missingness.
# Trivariate normal observations.
mean_list <- list(c(-2, -2, -2), c(2, 2, 2))
sigma <- matrix(c(1, 0.5, 0.5, 0.5, 1, 0.5, 0.5, 0.5, 1), nrow = 3)
data <- rGMM(n = 1e3, d = 3, k = 2, means = mean_list, covs = sigma)
fit <- FitGMM(data, k = 2)

# Four components with missingness.
# Bivariate normal observations.
# Note: Fitting is slow.
mean_list <- list(c(2, 2), c(2, -2), c(-2, 2), c(-2, -2))
sigma <- 0.5 * diag(2)
data <- rGMM(
  n = 1000,
  d = 2,
  k = 4,
  pi = c(0.35, 0.15, 0.15, 0.35),
  miss = 0.1,
  means = mean_list,
  covs = sigma
)
fit <- FitGMM(data, k = 4)
Given a matrix of random vectors, estimates the parameters for a mixture of multivariate normal distributions. Accommodates arbitrary patterns of missingness, provided the elements are missing at random (MAR).
FitMix( data, k = 2, init_means = NULL, fix_means = FALSE, init_covs = NULL, lambda = 0, init_props = NULL, maxit = 100, eps = 1e-06, report = FALSE )
data |
Numeric data matrix. |
k |
Number of mixture components. Defaults to 2. |
init_means |
Optional list of initial mean vectors. |
fix_means |
Fix means to their starting values? Must initialize. |
init_covs |
Optional list of initial covariance matrices. |
lambda |
Optional ridge term added to covariance matrix to ensure positive definiteness. |
init_props |
Optional vector of initial cluster proportions. |
maxit |
Maximum number of EM iterations. |
eps |
Minimum acceptable increment in the EM objective. |
report |
Report fitting progress? |
Object of class mix.
Given an n x d matrix whose rows are d-dimensional random vectors, possibly containing missing elements, estimates the mean and covariance of the best-fitting multivariate normal distribution.
FitMVN( data, init_mean = NULL, fix_mean = FALSE, init_cov = NULL, lambda = 0, maxit = 100, eps = 1e-06, report = TRUE )
data |
Numeric data matrix. |
init_mean |
Optional initial mean vector. |
fix_mean |
Fix the mean to its starting value? Must initialize. |
init_cov |
Optional initial covariance matrix. |
lambda |
Optional ridge term added to covariance matrix to ensure positive definiteness. |
maxit |
Maximum number of EM iterations. |
eps |
Minimum acceptable increment in the EM objective. |
report |
Report fitting progress? |
An object of class mvn.
Generates a stochastic imputation of a data set from a model fitted to that data set.
GenImputation(fit)
fit |
Fitted model. |
Numeric matrix with missing values imputed.
set.seed(100)

# Generate data and introduce missingness.
data <- rGMM(n = 25, d = 2, k = 1)
data[1, 1] <- NA
data[2, 2] <- NA
data[3, ] <- NA

# Fit GMM.
fit <- FitGMM(data)

# Generate imputation.
imputed <- GenImputation(fit)
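For a single multivariate normal component, imputing a row amounts to working with the conditional distribution of the missing elements given the observed ones. The sketch below is a minimal base R illustration with a hypothetical helper `impute_row`; it assumes the mean and covariance are known and is not the package's implementation. With `stochastic = FALSE` it fills in the conditional expectation; with `stochastic = TRUE` it draws from the conditional distribution.

```r
# Hypothetical helper: impute one row under a known MVN(mu, sigma).
impute_row <- function(x, mu, sigma, stochastic = TRUE) {
  miss <- is.na(x)
  if (!any(miss)) {
    return(x)
  }
  if (all(miss)) {
    # Nothing observed: fall back to the marginal distribution.
    cond_mean <- mu
    cond_cov <- sigma
  } else {
    s_mo <- sigma[miss, !miss, drop = FALSE]
    s_oo_inv <- solve(sigma[!miss, !miss, drop = FALSE])
    # Conditional mean and covariance of missing given observed.
    cond_mean <- mu[miss] +
      as.numeric(s_mo %*% s_oo_inv %*% (x[!miss] - mu[!miss]))
    cond_cov <- sigma[miss, miss, drop = FALSE] -
      s_mo %*% s_oo_inv %*% t(s_mo)
  }
  draw <- cond_mean
  if (stochastic) {
    # Add correlated noise via the lower Cholesky factor.
    z <- rnorm(sum(miss))
    draw <- cond_mean + as.numeric(t(chol(cond_cov)) %*% z)
  }
  x[miss] <- draw
  x
}

mu <- c(0, 0)
sigma <- matrix(c(1, 0.5, 0.5, 1), nrow = 2)

# Deterministic completion: the conditional expectation.
expected <- impute_row(c(2, NA), mu, sigma, stochastic = FALSE)

# Stochastic imputation: a draw from the conditional distribution.
set.seed(100)
sampled <- impute_row(c(2, NA), mu, sigma)
```

For a mixture, a stochastic imputation would presumably first draw a component for each row according to its responsibilities and then apply the same conditional step within that component.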
Log likelihood for Fitted GMM
## S3 method for class 'mix'
logLik(object, ...)
object |
A mix object. |
... |
Unused. |
Log likelihood for Fitted MVN Model
## S3 method for class 'mvn'
logLik(object, ...)
object |
A mvn object. |
... |
Unused. |
Mean for Fitted GMM
## S3 method for class 'mix'
mean(x, ...)
x |
A mix object. |
... |
Unused. |
Mean for Fitted MVN Model
## S3 method for class 'mvn'
mean(x, ...)
x |
A mvn object. |
... |
Unused. |
Defines a class to hold Gaussian Mixture Models.
Assignments
Maximum a posteriori assignments.
Completed
Completed data, with missing values imputed to their posterior expectations.
Components
Number of components.
Covariances
List of fitted cluster covariance matrices.
Data
Original data, with missing values present.
Density
Density of each component at each example.
Means
List of fitted cluster means.
Objective
Final value of the EM objective.
Proportions
Fitted cluster proportions.
Responsibilities
Posterior membership probabilities for each example.
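The relationship between the Density, Proportions, Responsibilities, and Assignments slots can be sketched with Bayes' rule. The example assumes the densities are stored as an n x k matrix (rows are examples, columns are components); the numeric values are made-up toy inputs.

```r
# Toy component densities at three examples (rows) for two components.
density <- matrix(
  c(0.30, 0.01,
    0.02, 0.20,
    0.10, 0.10),
  ncol = 2, byrow = TRUE
)
proportions <- c(0.25, 0.75)

# Weight each component density by its mixture proportion.
weighted <- sweep(density, 2, proportions, `*`)

# Posterior membership probabilities: normalize each row to sum to one.
responsibilities <- weighted / rowSums(weighted)

# Maximum a posteriori assignment: the most probable component per row.
assignments <- apply(responsibilities, 1, which.max)
```

The third example has equal density under both components, so its responsibilities simply reproduce the mixture proportions.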
Mean Update for Mixture of MVNs with Missingness.
MixUpdateMeans(split_data, means, covs, gamma)
split_data |
Data partitioned by missingness. |
means |
List of component means. |
covs |
List of component covariances. |
gamma |
List of component responsibilities. |
List containing the updated component means.
Defines a class to hold multivariate normal models.
Completed
Completed data, with missing values imputed to their posterior expectations.
Covariance
Fitted covariance matrix.
Data
Original data, with missing values present.
Mean
Fitted mean vector.
Objective
Final value of the EM objective.
Returns a list with the input data split in separate matrices for complete cases, incomplete cases, and empty cases.
PartitionData(data)
data |
Data.frame. |
List containing:
The original row and column names: 'orig_row_names', 'orig_col_names'.
The original row and column numbers: 'n_row' and 'n_col'.
The complete cases 'data_comp'.
The incomplete cases 'data_incomp'.
The empty cases 'data_empty'.
Counts of complete 'n0', incomplete 'n1', and empty 'n2' cases.
Initial order of the observations 'init_order'.
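The three-way split can be pictured with a short base R sketch. The helper `partition_rows` is hypothetical and returns only a subset of the elements listed above; in particular, treating `init_order` as the original row indices in complete, incomplete, empty order is an assumption about its meaning.

```r
# Hypothetical helper: split rows by how many elements are missing.
partition_rows <- function(data) {
  data <- as.matrix(data)
  n_missing <- rowSums(is.na(data))
  is_comp <- n_missing == 0            # no missing elements
  is_empty <- n_missing == ncol(data)  # all elements missing
  is_incomp <- !is_comp & !is_empty    # some, but not all, missing
  list(
    data_comp = data[is_comp, , drop = FALSE],
    data_incomp = data[is_incomp, , drop = FALSE],
    data_empty = data[is_empty, , drop = FALSE],
    n0 = sum(is_comp),
    n1 = sum(is_incomp),
    n2 = sum(is_empty),
    # Assumption: original row indices, in the order the parts are stacked.
    init_order = c(which(is_comp), which(is_incomp), which(is_empty))
  )
}

data <- rbind(c(1, 2), c(NA, 3), c(NA, NA), c(4, 5))
parts <- partition_rows(data)
```

Keeping the original row indices is what would allow the parts to be stacked and then restored to the input order afterwards.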
Print method for objects of class mix.
## S3 method for class 'mix'
print(x, ...)
x |
A mix object. |
... |
Unused. |
Print for Fitted MVN Model
## S3 method for class 'mvn'
print(x, ...)
x |
A mvn object. |
... |
Unused. |
Reassembles a data matrix split by missingness pattern.
ReconstituteData(split_data)
split_data |
Split data, as returned by PartitionData. |
Numeric matrix.
Generates an n x d matrix of multivariate normal random vectors with observations (examples) as rows. If k = 1, all observations belong to the same cluster. If k > 1, the observations are generated via a two-step procedure. First, the cluster membership is drawn from a multinomial distribution, with mixture proportions specified by pi. Conditional on cluster membership, the observation is drawn from a multivariate normal distribution, with cluster-specific mean and covariance. The cluster means are provided using means, and the cluster covariance matrices are provided using covs. If miss > 0, missingness is introduced, completely at random, by setting that proportion of elements in the data matrix to NA.
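The two-step generative procedure can be sketched in base R as follows. The helper `draw_gmm` is hypothetical and is not the source of `rGMM`; it assumes an exact count of elements is set to NA (rather than each element being dropped independently) and labels rows with the sampled cluster.

```r
# Hypothetical helper illustrating two-step sampling from a GMM.
draw_gmm <- function(n, means, covs, pi, miss = 0) {
  d <- length(means[[1]])
  # Step 1: draw the cluster membership of each observation.
  z <- sample(length(means), size = n, replace = TRUE, prob = pi)
  # Step 2: draw each observation from its cluster's multivariate normal,
  # using the lower Cholesky factor of the covariance.
  draws <- vapply(z, function(j) {
    means[[j]] + as.numeric(t(chol(covs[[j]])) %*% rnorm(d))
  }, numeric(d))
  data <- t(matrix(draws, nrow = d))
  # Missingness completely at random: blank a fixed share of elements.
  if (miss > 0) {
    n_miss <- round(miss * n * d)
    data[sample(n * d, n_miss)] <- NA
  }
  rownames(data) <- z
  data
}

set.seed(100)
data <- draw_gmm(
  n = 200,
  means = list(c(2, 2), c(-2, -2)),
  covs = list(diag(2), diag(2)),
  pi = c(0.5, 0.5),
  miss = 0.1
)
```

Because the positions set to NA are chosen without reference to the values or the cluster labels, the missingness mechanism here is completely at random.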
rGMM(n, d = 2, k = 1, pi = NULL, miss = 0, means = NULL, covs = NULL)
n |
Observations (rows). |
d |
Observation dimension (columns). |
k |
Number of mixture components. Defaults to 1. |
pi |
Mixture proportions. If omitted, components are assumed equiprobable. |
miss |
Proportion of elements missing, a value in [0, 1). |
means |
Either a prototype mean vector, or a list of mean vectors. Defaults to the zero vector. |
covs |
Either a prototype covariance matrix, or a list of covariance matrices. Defaults to the identity matrix. |
Numeric matrix with observations as rows. Row numbers specify the true cluster assignments.
For estimation, see FitGMM.
set.seed(100)

# Single component without missingness.
# Bivariate normal observations.
cov <- matrix(c(1, 0.5, 0.5, 1), nrow = 2)
data <- rGMM(n = 1e3, d = 2, k = 1, means = c(2, 2), covs = cov)

# Single component with missingness.
# Trivariate normal observations.
cov <- matrix(c(1, 0.5, 0.5, 0.5, 1, 0.5, 0.5, 0.5, 1), nrow = 3)
data <- rGMM(n = 1e3, d = 3, k = 1, miss = 0.1, means = c(2, 2, 2), covs = cov)

# Two components without missingness.
# Trivariate normal observations.
mean_list <- list(c(-2, -2, -2), c(2, 2, 2))
cov <- matrix(c(1, 0.5, 0.5, 0.5, 1, 0.5, 0.5, 0.5, 1), nrow = 3)
data <- rGMM(n = 1e3, d = 3, k = 2, means = mean_list, covs = cov)

# Four components with missingness.
# Bivariate normal observations.
mean_list <- list(c(2, 2), c(2, -2), c(-2, 2), c(-2, -2))
cov <- 0.5 * diag(2)
data <- rGMM(
  n = 1000,
  d = 2,
  k = 4,
  pi = c(0.35, 0.15, 0.15, 0.35),
  miss = 0.1,
  means = mean_list,
  covs = cov
)
Show for Fitted Mixture Models
## S4 method for signature 'mix'
show(object)
object |
A mix object. |
Show for Multivariate Normal Models
## S4 method for signature 'mvn'
show(object)
object |
A mvn object. |
Covariance for Fitted GMM
## S3 method for class 'mix'
vcov(object, ...)
object |
A mix object. |
... |
Unused. |
Covariance for Fitted MVN Model
## S3 method for class 'mvn'
vcov(object, ...)
object |
A mvn object. |
... |
Unused. |