Class 4: Missing Data: Basic techniques

Contents

Class 4: Missing Data: Basic techniques#

Evaluation of missing data at training#

mulitple imputation
ML based was better than imputation which is better than dropping samples
example datasets: 45% of patients have at least 1 missing value

Imputation#

Mean imputation:
- insert the mean based onthe other values
Hot deck
- mean-like with similarity
Multiple inputation
- 3 diff ways

Imputation ML#

MLP
- fully connected
Self organization
- competitive learning
- NN on modle of nodes in 2d grid,
KNN
- select closest complete case to impute values from
- expensive for large datasets due to need to search everywhere for each missing value

Testing#

Train NN based on data imputed with each technie

Conclusions:

in general, any imputation was better than deletion
ML based performed better

Discussion & Questions#

interesting that even simple methods provide improvement
SOM is sort of unclear how does that work?
Review of MLP and sigmoid

Handling missing values At application time#

reduced models vs imputation.
broad approach
15 common datasets

Techniques:#

Discard
Acquire missing values
Imputation
- predictive value imputation
- distribution based
- unique values
Reduced Feature Models
- retrain for different feature models

Feature imputability impacts the distribution or predictive type of imputation

More complex model#

decision tree with bagging
again, reduced model is the best strategy

Hybrid Models for efficient prediction#

reduced models
a hybrid is a complete model with stored subset for most common missing features
Reduced feature enseble
- N models for N features
- each one is missing one feature
- average these together for final prediction
- substantial reduction in when there is a single feature is missing
- combine with imputation for multiple features
- relative accuracy is better than imputation

General takeaways#

reduced models vs imputatation is a large improvement
this is sort of an imputation

Weaknesses#

Didn’t check unique value imputation
MCAR
focused on

Overall Discussion#

How might the two problems interact?
- if missing data at both train and prediction…
- train using missing data without imputation for training the separate models
Questions on these ideas
What additional things might you need to consider when choosing one?
- feature imputability at training
what to do with time series data
How to check if missing CAR?
- look at collection technique
what do to with varying data per person
- LSTM for time series data
- hierarchichal modeling other wise
- example of hierarchical with time series also

For Wednesday#