Identification of pseudo-independent variables

 

When modeling data, we often encounter a variable that is itself affected by the dependent variable. If such a variable is added to the model as an independent variable, it dominates the model and the other independent variables effectively do not enter it. Moreover, because its value depends on the dependent variable, it is not available at actual prediction time, so the model cannot be used in practice. We call such variables "pseudo-independent variables".

For example, loan default data may contain two variables: whether the user defaults, and the number of overdue days. Whether the user defaults is the prediction target, that is, the dependent variable, while the number of overdue days is determined by it: only defaulting customers have overdue days greater than 0, and non-defaulting customers always have 0. The number of overdue days is therefore a pseudo-independent variable and should be removed from the modeling data.
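The loan example can be illustrated with a small Python sketch. The records and the helper name rule_accuracy are made up for illustration only; they are not taken from any real dataset. The sketch simply checks how often the rule "overdue days greater than 0" agrees with the default flag.

```python
# Hypothetical loan records: (defaulted, overdue_days). The values are invented.
records = [(0, 0), (0, 0), (1, 12), (0, 0), (1, 45), (1, 3)]

def rule_accuracy(rows):
    """Share of rows where the rule 'overdue_days > 0' equals the default flag."""
    hits = sum(
        1 for defaulted, overdue_days in rows
        if (overdue_days > 0) == bool(defaulted)
    )
    return hits / len(rows)

accuracy = rule_accuracy(records)
print(accuracy)  # the rule reproduces the target on every row of this toy data
```

Because the rule matches the target on every row, a model given this variable has no reason to use any other. The variable is also only known after the outcome is known, which is why it cannot be used for real prediction.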

As a demonstration, we construct a pseudo-independent variable, fraud_days, in a credit card dataset and then build a model with the external library Ymodel. The pseudo-independent variable can be identified from the model's AUC value and the importance of each variable.


     A
1    =file("D://creditcard_b.csv").import@tc()
2    =A1.derive(if(Class==0,0,rand(100)+1):fraud_days)
3    =ym_env()
4    =ym_model(A3,A2)
5    =ym_target(A4,"Class")
6    =ym_build_model(A4)
7    =ym_performance(A6)
8    =ym_importance(A6).sort@z(Importance)

A1 Import the data

A2 Construct a pseudo-independent variable named fraud_days: fraud_days=0 when the target variable Class is 0, and fraud_days>0 when Class is 1

A3 Initialize the environment

A4 Load the modeling data

A5 Set the target variable to Class

A6 Perform automatic modeling

A7 Get the model performance and observe the AUC value

A8 Get the importance of each variable and sort in descending order

The AUC is 1, a "perfect prediction", but such perfection is usually not a good sign. It means the model contains an independent variable that can "perfectly explain" the dependent variable. Looking further at the importance of each variable, we find that the importance of the constructed variable fraud_days is 1, while that of the other variables is 0 or almost 0. In other words, fraud_days is the only variable doing any work in the model, and the other variables are effectively excluded from it. Such a variable is a pseudo-independent variable and should be eliminated before modeling.
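The same symptom can be screened for before modeling by computing the AUC of each variable on its own. The following is a minimal Python sketch of that idea. It does not use Ymodel and is not how Ymodel computes importance. The function names, the 0.99 threshold, the amount column, and the synthetic data are all assumptions made for illustration; only fraud_days mirrors the construction in A2.

```python
import random

def auc(scores, labels):
    """AUC as the probability that a positive row scores higher than a
    negative row (ties count as 0.5). Pairwise, so only for small data."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def flag_pseudo_independent(columns, labels, threshold=0.99):
    """Return names of columns whose single-variable AUC is close to 1
    (or close to 0, which is the same separation with the sign flipped).
    The threshold is an assumed cut-off, not a standard value."""
    flagged = []
    for name, values in columns.items():
        a = auc(values, labels)
        if max(a, 1 - a) >= threshold:
            flagged.append(name)
    return flagged

# Synthetic data: 50 fraud rows (Class=1) and 450 normal rows (Class=0).
random.seed(0)
labels = [1] * 50 + [0] * 450
columns = {
    # Hypothetical ordinary feature, generated independently of the target.
    "amount": [random.uniform(0, 1000) for _ in labels],
    # Same construction as A2: 0 for normal rows, a positive value for fraud rows.
    "fraud_days": [random.randint(1, 100) if y == 1 else 0 for y in labels],
}

flagged = flag_pseudo_independent(columns, labels)
print(flagged)
```

A variable flagged this way is not automatically a pseudo-independent variable; it is a candidate to check against how the data is produced. If its value only becomes known after the target is known, as with fraud_days, it should be removed.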