Class 4: Missing Data: Basic techniques

Evaluation of missing data at training

  • multiple imputation

  • ML-based imputation was better than statistical imputation, which was better than dropping samples

  • example datasets: 45% of patients have at least 1 missing value


  • Mean imputation:

    • insert the mean of the other (observed) values

  • Hot deck

    • mean-like, but uses values from similar cases

  • Multiple imputation

    • 3 different ways
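
A minimal sketch of mean imputation as described above, assuming each record is a list of numbers with `None` marking a missing value. The function name `mean_impute` and the `None` convention are illustrative choices, not something from the papers discussed in class.

```python
from statistics import mean


def mean_impute(rows):
    """Return a copy of rows where each None is replaced by its column mean.

    Assumption: every column has at least one observed value; otherwise
    statistics.mean raises StatisticsError.
    """
    n_cols = len(rows[0])
    col_means = []
    for j in range(n_cols):
        # Use only the observed (non-missing) values of column j.
        observed = [r[j] for r in rows if r[j] is not None]
        col_means.append(mean(observed))
    return [
        [col_means[j] if v is None else v for j, v in enumerate(r)]
        for r in rows
    ]


data = [
    [1.0, None],
    [3.0, 4.0],
    [None, 8.0],
]
filled = mean_impute(data)
```

Every missing entry in a column gets the same value, so the column mean is preserved but its variance shrinks. That is one reason the similarity-based and ML-based methods are worth comparing against.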

ML-based imputation

  • MLP

    • fully connected

  • Self-organizing map (SOM)

    • competitive learning

    • NN modeled as nodes on a 2D grid

  • KNN

    • select closest complete case to impute values from

    • expensive for large datasets due to need to search everywhere for each missing value
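
A rough sketch of KNN imputation along the lines of the bullets above, under a few assumptions: missing values are marked with `None`, donors are restricted to complete cases, distance is Euclidean over the features the incomplete record does have, and the imputed value is the average over the `k` nearest donors. The name `knn_impute` is hypothetical.

```python
import math


def knn_impute(rows, k=1):
    """Fill None entries using the k closest complete cases.

    Assumption: at least one complete case exists in rows.
    """
    complete = [r for r in rows if None not in r]
    imputed = []
    for r in rows:
        missing = [j for j, v in enumerate(r) if v is None]
        if not missing:
            imputed.append(list(r))
            continue
        observed = [j for j, v in enumerate(r) if v is not None]

        def dist(candidate):
            # Compare only on features that are present in the incomplete record.
            return math.sqrt(sum((r[j] - candidate[j]) ** 2 for j in observed))

        # Full scan of the complete cases for every incomplete record:
        # this is the cost that makes KNN expensive on large datasets.
        neighbours = sorted(complete, key=dist)[:k]
        filled = list(r)
        for j in missing:
            filled[j] = sum(c[j] for c in neighbours) / len(neighbours)
        imputed.append(filled)
    return imputed


data = [
    [1.0, 10.0],
    [2.0, 20.0],
    [9.0, 90.0],
    [1.9, None],
]
one_nn = knn_impute(data, k=1)
two_nn = knn_impute(data, k=2)
```

With `k=1` the missing value is copied from the single closest complete case. Larger `k` averages over several donors.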


  • Train NN based on data imputed with each technique


  • in general, any imputation was better than deletion

  • ML based performed better

Discussion & Questions

  • interesting that even simple methods provide improvement

  • SOM is somewhat unclear: how does it work?

  • Review of MLP and sigmoid

Handling missing values at application time

  • reduced models vs imputation.

  • broad approach

  • 15 common datasets


  • Discard

  • Acquire missing values

  • Imputation

    • predictive value imputation

    • distribution based

    • unique values

  • Reduced Feature Models

    • retrain for different feature models
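
One way to picture reduced feature models: at prediction time, use a model trained only on the features that are actually present in the incoming record. The sketch below assumes complete training data, `None` for missing values, and a nearest-centroid classifier as a stand-in base learner. `train_centroids` and `ReducedModelCache` are illustrative names, and the lazy cache is a design choice of this sketch, not a detail from the reading.

```python
def train_centroids(X, y, features):
    """Nearest-centroid classifier restricted to the given feature indices."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(features))
        for k, j in enumerate(features):
            acc[k] += row[j]
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


class ReducedModelCache:
    """Trains one model per observed-feature pattern, on demand."""

    def __init__(self, X, y):
        self.X, self.y = X, y
        self.models = {}  # maps tuple of present feature indices -> centroids

    def predict(self, row):
        features = tuple(j for j, v in enumerate(row) if v is not None)
        if features not in self.models:
            # Retrain using only the features present in this record.
            self.models[features] = train_centroids(self.X, self.y, features)
        centroids = self.models[features]
        query = [row[j] for j in features]
        return min(
            centroids,
            key=lambda label: sum((a - b) ** 2 for a, b in zip(centroids[label], query)),
        )


X = [[0.0, 0.0], [1.0, 1.0], [10.0, 10.0], [11.0, 11.0]]
y = ["a", "a", "b", "b"]
cache = ReducedModelCache(X, y)
pred_missing_first = cache.predict([None, 9.0])
pred_missing_second = cache.predict([0.2, None])
```

Nothing is imputed here. The cost is one stored model per missingness pattern, which is what motivates the hybrid approaches later in these notes.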

Feature imputability affects the choice between distribution-based and predictive value imputation

More complex model

  • decision tree with bagging

  • again, reduced model is the best strategy

Hybrid Models for efficient prediction

  • reduced models

  • a hybrid is a complete model plus a stored subset of reduced models for the most commonly missing features

  • Reduced feature ensemble

    • N models for N features

    • each one is missing one feature

    • average these together for final prediction

    • substantial reduction when only a single feature is missing

    • combine with imputation for multiple features

    • relative accuracy is better than imputation
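
A hedged sketch of the reduced feature ensemble as outlined in the bullets above: N models, each trained without one feature, with predictions averaged. Several details are assumptions of this sketch rather than facts from the reading: the base learner is a 1-nearest-neighbour regressor, only the models that omit a missing feature are averaged when something is missing, any other missing features are filled with training means, and all N models are averaged when nothing is missing. All names are hypothetical.

```python
class NearestNeighbourModel:
    """1-NN regressor that only looks at the given feature indices."""

    def __init__(self, X, y, features):
        self.features = features
        self.X = [[row[j] for j in features] for row in X]
        self.y = y

    def predict(self, row):
        query = [row[j] for j in self.features]
        best = min(
            range(len(self.X)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(self.X[i], query)),
        )
        return self.y[best]


def train_reduced_feature_ensemble(X, y):
    """Build N models for N features; model j is trained without feature j."""
    n = len(X[0])
    return {
        j: NearestNeighbourModel(X, y, [f for f in range(n) if f != j])
        for j in range(n)
    }


def predict_with_missing(ensemble, col_means, row):
    missing = [j for j, v in enumerate(row) if v is None]
    # Mean-fill so that models which still use a missing feature have a value
    # (the "combine with imputation for multiple features" case).
    filled = [col_means[j] if v is None else v for j, v in enumerate(row)]
    # Assumption: with no missing features, average all N models.
    chosen = missing if missing else list(ensemble)
    predictions = [ensemble[j].predict(filled) for j in chosen]
    return sum(predictions) / len(predictions)


X = [[0.0, 0.0, 5.0], [0.0, 1.0, 5.0], [10.0, 0.0, 5.0], [10.0, 1.0, 5.0]]
y = [0.0, 0.0, 1.0, 1.0]
col_means = [sum(col) / len(col) for col in zip(*X)]
ensemble = train_reduced_feature_ensemble(X, y)
one_missing = predict_with_missing(ensemble, col_means, [9.0, None, 5.0])
none_missing = predict_with_missing(ensemble, col_means, [0.0, 0.0, 5.0])
```

When exactly one feature is missing, the prediction comes from the single model that never saw that feature, so no imputation is involved. Only N models are stored, rather than one per missingness pattern.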

General takeaways

  • reduced models are a large improvement over imputation

  • this is sort of an imputation


  • Didn’t check unique value imputation

  • focused on MCAR (missing completely at random)

Overall Discussion

  • How might the two problems interact?

    • if missing data at both train and prediction…

    • train using missing data without imputation for training the separate models

  • Questions on these ideas

  • What additional things might you need to consider when choosing one?

    • feature imputability at training

  • what to do with time series data

  • How to check whether data are MCAR?

    • look at collection technique

  • what to do with varying data per person