Regression with missing data, a comparison study of techniques based on random forests

May 2021

Abstract

In this paper, we present the practical benefits of a new random forest algorithm for handling missing values in the sample. The purpose of this work is to compare the existing random forest approaches to missing values and to describe the performance of the new algorithm, as well as its algorithmic complexity. A variety of missing-data mechanisms (missing completely at random, MCAR; missing at random, MAR; missing not at random, MNAR) are considered and simulated. We study the quadratic error and the bias of our algorithm and compare them with those of the most popular random forest algorithms for missing values in the literature. In particular, we compare these techniques for both regression and prediction purposes. This work follows the paper of Gómez-Méndez and Joly [On the consistency of a random forest algorithm in the presence of missing entries. 2020. Available from: arXiv:2011.05433] on the consistency of this new algorithm.
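To make the three missingness mechanisms concrete, here is a minimal sketch of how each could be simulated on a toy two-variable sample. It is not the simulation design used in the paper: the function name `mask_x2`, the logistic link, and the `rate` and `slope` parameters are all assumptions chosen for illustration. The variable `x1` is always observed, and entries of `x2` are replaced by `None` with a probability that is constant (MCAR), depends on the observed `x1` (MAR), or depends on the value of `x2` itself (MNAR).

```python
import math
import random


def sigmoid(t):
    """Logistic function mapping a real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))


def mask_x2(x1, x2, mechanism, rate=0.3, slope=2.0, seed=0):
    """Return a copy of x2 with some entries replaced by None.

    Hypothetical helper for illustration only.
    MCAR: each entry is missing with the same probability `rate`.
    MAR:  the probability depends on x1, which is always observed.
    MNAR: the probability depends on the value of x2 that gets hidden.
    """
    rng = random.Random(seed)
    masked = []
    for a, b in zip(x1, x2):
        if mechanism == "MCAR":
            p = rate
        elif mechanism == "MAR":
            p = sigmoid(slope * a)
        elif mechanism == "MNAR":
            p = sigmoid(slope * b)
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        masked.append(None if rng.random() < p else b)
    return masked


# Toy sample: two independent standard normal variables.
data_rng = random.Random(42)
x1 = [data_rng.gauss(0.0, 1.0) for _ in range(5000)]
x2 = [data_rng.gauss(0.0, 1.0) for _ in range(5000)]

for mechanism in ("MCAR", "MAR", "MNAR"):
    masked = mask_x2(x1, x2, mechanism)
    missing_share = sum(v is None for v in masked) / len(masked)
    print(mechanism, round(missing_share, 3))
```

Under MNAR with a positive `slope`, large values of `x2` are hidden more often than small ones, so the observed entries are no longer representative of the full sample. This is the kind of distortion that makes bias a natural quantity to track when comparing methods.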

This is an original manuscript of an article published by Taylor & Francis in Journal of Statistical Computation and Simulation on 09 January 2023, available at: http://www.tandfonline.com/doi/full/10.1080/00949655.2022.2163646.

To cite this article:
Irving Gómez-Méndez & Emilien Joly (2023). Regression with missing data, a comparison study of techniques based on random forests, Journal of Statistical Computation and Simulation.

Type

Journal article

Irving Gómez-Méndez

Assistant Professor of Artificial Intelligence

I work at the intersection of statistics, machine learning, causal reasoning, and interactive quantitative software.