Missing Data, Data Imputation

In statistics, missing data, or missing values, occur when no data value is stored for the variable in an observation (Wiki)

Types of missing data (Wiki)

  • Missing completely at random (MCAR)
  • Missing at random (MAR)
  • Missingness that depends on unobserved variables
  • Missingness that depends on the missing value itself
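
As a rough illustration of why the mechanism matters, the sketch below simulates three of these cases on synthetic data and compares the mean of the values that remain observed. The variable names, the 30% missing rate, and the logistic form of the missingness probability are assumptions made for this example only. The "unobserved variables" case is not simulated separately; it behaves like the last case from the analyst's point of view, because the driver of missingness is not in the data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

x = rng.normal(size=n)       # fully observed covariate
y = x + rng.normal(size=n)   # variable that will lose values; true mean is 0


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# Missing completely at random: every value has the same 30% chance to be missing.
mcar = rng.random(n) < 0.3
# Missing at random: the chance of losing y depends only on the observed x.
mar = rng.random(n) < sigmoid(2 * x)
# Missingness depends on the missing value itself: large y is more likely to be lost.
mnar = rng.random(n) < sigmoid(2 * y)

# Mean of y among the rows that are still observed under each mechanism.
observed_means = {
    name: y[~mask].mean()
    for name, mask in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]
}

for name, value in observed_means.items():
    print(f"{name}: mean of observed y = {value:.3f}")
```

Under MCAR the observed mean stays close to the true mean of 0, while in the other two cases the observed values are shifted downward, so a naive complete-case mean is biased.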

Discarding data

  • Listwise deletion (complete-case analysis)
    • Samples (rows) are removed from the dataset if they have any missing values. Probably the simplest and most popular approach. Many ML packages do this automatically
    • When a large number of variables have missing values, the number of samples left after deletion can be too small
    • May lead to biased estimates when the data are not missing completely at random. A smaller sample size also increases standard errors
  • Available-case analysis, complete-variables analysis
    • Excluding variables from the data if their missing-values rate is higher than some threshold
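
A minimal sketch of both discarding strategies using pandas. The toy data frame, its column names, and the 0.5 threshold are hypothetical and chosen only to make the effect visible.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "notes" is missing in 4 of 5 rows.
df = pd.DataFrame({
    "age": [25, np.nan, 31, 47, 52],
    "income": [50_000, 62_000, np.nan, 58_000, 71_000],
    "notes": [np.nan, np.nan, np.nan, "ok", np.nan],
})

# Listwise deletion: drop every row that has at least one missing value.
complete_cases = df.dropna()

# Variable exclusion: drop columns whose missing rate exceeds a threshold.
threshold = 0.5  # assumed cut-off, tune per dataset
missing_rate = df.isna().mean()
kept = df.loc[:, missing_rate <= threshold]

# Listwise deletion after removing the sparse column keeps more rows.
kept_complete = kept.dropna()

print(len(complete_cases), list(kept.columns), len(kept_complete))
```

Dropping the mostly empty column first leaves three complete rows instead of one, which shows how the two strategies interact.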

Imputation (Wiki)

Whenever a single imputation strategy is used, the standard errors of estimates tend to be too low. The intuition here is that we have substantial uncertainty about the missing values, but by choosing a single imputation we in essence pretend that we know the true value with certainty (Data Analysis Using Regression and Multilevel/Hierarchical Models).
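
The effect can be seen in a small simulation. The sketch below uses mean imputation as one example of a single imputation strategy. The sample size, the 40% missing rate, and the normal distribution are assumptions for illustration, not values from the book.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1_000

y = rng.normal(loc=10.0, scale=2.0, size=n)
missing = rng.random(n) < 0.4  # values removed completely at random

observed = y[~missing]

# Single imputation: fill every gap with the observed mean.
imputed = y.copy()
imputed[missing] = observed.mean()


def standard_error(values):
    """Standard error of the mean, treating every value as real data."""
    return values.std(ddof=1) / np.sqrt(len(values))


se_observed = standard_error(observed)  # uses only the values we actually have
se_imputed = standard_error(imputed)    # pretends the filled-in values are known

print(f"SE from observed values only: {se_observed:.4f}")
print(f"SE after mean imputation:     {se_imputed:.4f}")
```

The point estimate is unchanged, but the standard error after imputation is smaller than the one based on the observed values alone, because the filled-in values add apparent sample size without adding information.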

Time series imputation

Other methods, packages

  • MIDAS Multiple Imputation with Denoising Autoencoders (Code, Paper)
  • Impute.jl (Code)