Class 4: Missing Data: Basic techniques#
Evaluation of missing data at training#
mulitple imputation
ML based was better than imputation which is better than dropping samples
example datasets: 45% of patients have at least 1 missing value
Imputation#
Mean imputation:
insert the mean based onthe other values
Hot deck
mean-like with similarity
Multiple inputation
3 diff ways
Imputation ML#
MLP
fully connected
Self organization
competitive learning
NN on modle of nodes in 2d grid,
KNN
select closest complete case to impute values from
expensive for large datasets due to need to search everywhere for each missing value
Testing#
Train NN based on data imputed with each technie
Conclusions:
in general, any imputation was better than deletion
ML based performed better
Discussion & Questions#
interesting that even simple methods provide improvement
SOM is sort of unclear how does that work?
Review of MLP and sigmoid
Handling missing values At application time#
reduced models vs imputation.
broad approach
15 common datasets
Techniques:#
Discard
Acquire missing values
Imputation
predictive value imputation
distribution based
unique values
Reduced Feature Models
retrain for different feature models
Feature imputability impacts the distribution or predictive type of imputation
More complex model#
decision tree with bagging
again, reduced model is the best strategy
Hybrid Models for efficient prediction#
reduced models
a hybrid is a complete model with stored subset for most common missing features
Reduced feature enseble
N models for N features
each one is missing one feature
average these together for final prediction
substantial reduction in when there is a single feature is missing
combine with imputation for multiple features
relative accuracy is better than imputation
General takeaways#
reduced models vs imputatation is a large improvement
this is sort of an imputation
Weaknesses#
Didn’t check unique value imputation
MCAR
focused on
Overall Discussion#
How might the two problems interact?
if missing data at both train and prediction…
train using missing data without imputation for training the separate models
Questions on these ideas
What additional things might you need to consider when choosing one?
feature imputability at training
what to do with time series data
How to check if missing CAR?
look at collection technique
what do to with varying data per person
LSTM for time series data
hierarchichal modeling other wise