Flag missing information for multiple variables

 

When a data set contains many variables with missing values, flagging each variable individually adds one indicator per variable and greatly increases the complexity of the model. In this case, a single new variable can record the missing values of each sample across all variables. Although this method cannot reflect the influence of specific variables, it adds little to the complexity of the model.
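The trade-off can be illustrated with a minimal Python sketch (not SPL). It assumes, purely for illustration, that the combined variable is the number of missing fields in each row; SPL's mvp() may encode the information differently. The function names, column names, and sample values are all hypothetical.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]

def per_variable_flags(rows: List[Row]) -> List[Dict[str, int]]:
    """One 0/1 indicator for every variable that has at least one missing value."""
    columns = list(rows[0].keys())
    with_missing = [c for c in columns if any(r[c] is None for r in rows)]
    return [{f"MI_{c}": int(r[c] is None) for c in with_missing} for r in rows]

def combined_flag(rows: List[Row]) -> List[int]:
    """A single variable: count of missing fields per row (assumed encoding)."""
    return [sum(v is None for v in r.values()) for r in rows]

# Toy data, invented for this example.
rows: List[Row] = [
    {"x1": 1.0, "x2": 2.0, "x3": 3.0},
    {"x1": None, "x2": 2.5, "x3": 3.5},   # only x1 is missing
    {"x1": 1.5, "x2": None, "x3": 3.2},   # only x2 is missing
    {"x1": None, "x2": None, "x3": None},
]

flags = per_variable_flags(rows)
combined = combined_flag(rows)

print(len(flags[0]), "indicator columns vs. 1 combined column")
print(combined)
# Rows 1 and 2 miss different variables but get the same combined value,
# which is the information lost by using a single variable.
print(flags[1], flags[2], combined[1] == combined[2])
```

The per-variable approach grows with the number of incomplete variables, while the combined variable always adds exactly one column at the cost of no longer telling which variable was missing.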

In SPL, A.mvp(T) or P.mvp(cns, T) integrates multiple MI variables into a single mvp variable that represents the missing information across those variables, and then automatically preprocesses the mvp variable.

For example, the house price data contains 81 variables, and the mvp() function is used to flag the missing information across multiple variables:


| | A |
|---|---|
| 1 | =file("D://house_prices_train.csv").import@qtc() |
| 2 | =A1.fname() |
| 3 | =A2.(A1.mi(~)) |
| 4 | =A3.group(!~) |
| 5 | =to(A4(1).len()).("A4(1)("/~/")(#).field(1):MI_"/~).concat@c() |
| 6 | =A1.derive(${A5}) |
| 7 | =to(A4(1).len()).("\""/"MI_"/~/"\"").concat@c() |
| 8 | =A6.mvp([${A7}],A1.(SalePrice)) |
| 9 | =A1.derive(A8(1)(#).field(1):mvp) |

A2 Get the field names.

A3 Flag the missing values of each variable. Variables with missing values return an MI indicator; variables without missing values return null.

A4 Split the results into two groups according to whether the MI indicator is null.

A5-A6 Add the MI indicators to table A1.

A7 Collect the names of all MI indicator fields as input parameters for A8.

A8 Generate an mvp variable that represents the missing information of multiple variables, and automatically preprocess it. For example, Pow2 in the figure denotes a power transformation.

[Figure: result of A8]

A9 Add the mvp variable to the modeling data A1.
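For readers who do not use SPL, the steps A2 to A9 can be mirrored in a short Python sketch. This is only an analogy under stated assumptions: the helper mi() and the row-wise sum used for the combined variable are hypothetical stand-ins, not the behavior of SPL's mi() or mvp(). In particular, the stand-in ignores the target variable and does no automatic preprocessing. The table and its column names are invented.

```python
from typing import Dict, List, Optional

Column = List[Optional[float]]

def mi(column: Column) -> Optional[List[int]]:
    """Hypothetical stand-in for A3: a 0/1 missing indicator,
    or None when the column has no missing values."""
    if all(v is not None for v in column):
        return None
    return [int(v is None) for v in column]

# Toy column-oriented table, invented for this example.
table: Dict[str, Column] = {
    "a": [1.0, None, 3.0],
    "b": [1.0, 2.0, 3.0],
    "c": [None, None, 5.0],
    "target": [10.0, 20.0, 30.0],
}

names = list(table)                                        # A2: field names
indicators = [mi(table[n]) for n in names]                 # A3: indicator or None
kept = [ind for ind in indicators if ind is not None]      # A4: keep non-null ones
mi_cols = {f"MI_{i}": ind for i, ind in enumerate(kept, start=1)}
table_with_mi = {**table, **mi_cols}                       # A5-A6: add MI_1..MI_n
mi_names = list(mi_cols)                                   # A7: indicator names

# A8 stand-in (assumption): combine indicators by a row-wise sum.
mvp = [sum(vals) for vals in zip(*(table_with_mi[n] for n in mi_names))]

table_with_mvp = {**table, "mvp": mvp}                     # A9: add mvp to the data

print(mi_names)
print(mvp)
```

Numbering the indicators MI_1, MI_2, ... follows the naming built in A5 and A7, where each name is formed from the position of the variable within the non-null group.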
