Identification of pseudo-independent variables
When modeling data, we often encounter a variable that is itself affected by the dependent variable. If such a variable is added to the model as an independent variable, it dominates the model and the other independent variables contribute little or nothing. Moreover, because its value depends on the dependent variable, it is not available when the model is applied to real predictions, so the model cannot be used in practice. We call such variables "pseudo-independent variables".
For example, a loan default dataset may contain two variables: whether the user defaults and the number of overdue days. Whether the user defaults is the prediction target, that is, the dependent variable, while the number of overdue days is determined by it: only defaulting customers have overdue days greater than 0, and non-defaulting customers have 0. The number of overdue days is therefore a pseudo-independent variable and should be removed from the modeling data.
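The loan example can be sketched in a few lines of Python. The records and the field names `income`, `defaulted` and `overdue_days` are invented for illustration and do not come from a real dataset. The sketch only checks whether the rule "feature greater than 0" reproduces the target on every row, which is the behavior described above for overdue days.

```python
# Hypothetical loan records: field names and values are illustrative only.
loans = [
    {"income": 5200, "defaulted": 0, "overdue_days": 0},
    {"income": 3100, "defaulted": 1, "overdue_days": 45},
    {"income": 4800, "defaulted": 0, "overdue_days": 0},
    {"income": 5000, "defaulted": 1, "overdue_days": 12},
    {"income": 2900, "defaulted": 0, "overdue_days": 0},
]

def leaks_target(rows, feature, target):
    """Return True if `feature > 0` matches `target` on every row."""
    return all((row[feature] > 0) == bool(row[target]) for row in rows)

overdue_leaks = leaks_target(loans, "overdue_days", "defaulted")
income_leaks = leaks_target(loans, "income", "defaulted")
print(overdue_leaks, income_leaks)
```

In this toy data, `overdue_days` reproduces the target exactly while `income` does not, so `overdue_days` is the kind of variable that should be dropped before modeling.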
The following example constructs a pseudo-independent variable, fraud_days, in the credit card data, builds a model with the external library Ymodel, and identifies the pseudo-independent variable by checking the model's AUC value and the importance of each variable.
|   | A |
|---|---|
| 1 | =file("D://creditcard_b.csv").import@tc() |
| 2 | =A1.derive(if(Class==0,0,rand(100)):fraud_days) |
| 3 | =ym_env() |
| 4 | =ym_model(A3,A2) |
| 5 | =ym_target(A4,"Class") |
| 6 | =ym_build_model(A4) |
| 7 | =ym_performance(A6) |
| 8 | =ym_importance(A6).sort@z(Importance) |
A1 Import the data
A2 Construct a pseudo-independent variable named fraud_days. When the target variable Class is 0, fraud_days is 0. When Class is 1, fraud_days is a random number generated by rand(100)
A3 Initialize the environment
A4 Load the modeling data
A5 Set the target variable to Class
A6 Perform automatic modeling
A7 Get the model performance and observe the AUC value
A8 Get the importance of each variable and sort in descending order
The AUC is 1, a "perfect prediction", but this kind of perfection is usually a warning sign rather than a good result. It means the model contains an independent variable that "perfectly explains" the dependent variable. Looking further at the importance of each variable, we find that the importance of the constructed variable fraud_days is 1, while the other variables are 0 or almost 0. In other words, fraud_days alone drives the model and the other variables are effectively excluded. Such a variable is a pseudo-independent variable and should be removed before modeling.
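The same symptom can be reproduced without a modeling library. The Python sketch below is not Ymodel's method and does not use its API: it builds a small synthetic dataset, scores each feature on its own with a rank-based AUC, and flags features that separate the classes almost perfectly. The column `amount`, the helper names, the 0.99 threshold, and the use of `randint(1, 99)` to keep fraud_days strictly positive are all assumptions made for illustration.

```python
import random

def auc(scores, labels):
    """Rank-based AUC of a single feature used as a score (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def suspicious_features(columns, labels, threshold=0.99):
    """Names of features whose single-variable AUC is near 1 (or near 0)."""
    flagged = []
    for name, values in columns.items():
        a = auc(values, labels)
        if max(a, 1 - a) >= threshold:  # near 0 is just as separable as near 1
            flagged.append(name)
    return flagged

rng = random.Random(0)
# Synthetic target: every 10th row is a positive (Class = 1).
labels = [1 if i % 10 == 0 else 0 for i in range(500)]
columns = {
    # Noise unrelated to the target (hypothetical feature).
    "amount": [rng.gauss(100, 30) for _ in labels],
    # Constructed like fraud_days: 0 for Class 0, positive for Class 1.
    "fraud_days": [rng.randint(1, 99) if y == 1 else 0 for y in labels],
}

flagged = suspicious_features(columns, labels)
print(flagged)
```

A feature flagged this way is not automatically a pseudo-independent variable, since a genuinely strong predictor could also score highly. The flag is a prompt to check whether the value would actually be known at prediction time.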