Use Cases of Modeling Tests – Comparison between YModel and Manual Model Building

 

Case 1

Task: Using a bank's data on defaulted installment loans, predict the probability of default (PD) for its individual customers.

Data set description: 2.9 million rows, 37 columns, a size of 477MB.

Target variable: IsDefaulted.

Test content:

1. Model performance metrics on the test data set: AUC, lift in the top 10%, and model attenuation level (the drop from training AUC to test AUC).

2. Model building duration.

3. Skill requirements.
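
The three performance metrics can be illustrated with a minimal sketch. It is not the code used in the tests; it assumes binary labels (1 = defaulted), scores where higher means riskier, lift defined as the default rate in the top-scored 10% divided by the overall default rate, and attenuation defined as training AUC minus test AUC (which matches the values reported in the results). The function names and the toy data are hypothetical.

```python
from typing import Sequence


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC via the rank-sum (Mann-Whitney U) identity.

    Assumes y_true contains at least one 1 and at least one 0.
    """
    # Sort indices by score and give tied scores their average rank.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    pos_rank_sum = sum(r for r, y in zip(ranks, y_true) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def lift_at_top(y_true: Sequence[int], scores: Sequence[float],
                fraction: float = 0.10) -> float:
    """Default rate in the top-scored fraction divided by the overall rate."""
    n_top = max(1, int(len(scores) * fraction))
    ranked = sorted(zip(scores, y_true), key=lambda p: p[0], reverse=True)
    top_rate = sum(y for _, y in ranked[:n_top]) / n_top
    base_rate = sum(y_true) / len(y_true)
    return top_rate / base_rate


def attenuation(auc_train: float, auc_test: float) -> float:
    """Assumed definition: drop in AUC from training data to test data."""
    return auc_train - auc_test


# Toy data for illustration only (not taken from the test data sets).
y = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
s = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
print(auc(y, s), lift_at_top(y, s), attenuation(0.918, 0.911))
```

With this definition, the lift in the top 10% can never exceed 10, which is the scale on which the lift figures in the results should be read.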

 

Test result:

1. Model performance

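| Modeling type | Model | AUC for training data set | AUC for test data set | Model attenuation | Lift in top 10% for test data set |
|---|---|---|---|---|---|
| Manual model building | Model 1 | 1 | 0.973 | 0.027 | 9.22 |
| Manual model building | Model 2 | 0.999 | 0.971 | 0.028 | 9.18 |
| Manual model building | Model 3 | 0.999 | 0.968 | 0.031 | 9.09 |
| Manual model building | Model 4 | 0.998 | 0.922 | 0.076 | 7.9 |
| Manual model building | Model 5 | 0.996 | 0.965 | 0.031 | 8.63 |
| Manual model building | Model 6 | 0.995 | 0.959 | 0.036 | 8.77 |
| Manual model building | Model 7 | 0.993 | 0.927 | 0.066 | 7.99 |
| Manual model building | Model 8 | 0.988 | 0.956 | 0.032 | 8.63 |
| Manual model building | Model 9 | 0.982 | 0.928 | 0.054 | 7.99 |
| Manual model building | Model 10 | 0.976 | 0.914 | 0.062 | 7.76 |
| Manual model building | Model 11 | 0.969 | 0.919 | 0.05 | 7.85 |
| Manual model building | Model 12 | 0.961 | 0.924 | 0.037 | 7.95 |
| YModel | - | 0.918 | 0.911 | 0.007 | 8.0 |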

Note: Manual model building produces a series of intermediate models (Models 1-12) as a result of model tuning while YModel generates the desired final model directly.

Result explanation:

1) The first several manually built models have very high AUC on the training data set (up to 1) and are clearly overfitting. A more suitable model (Model 12) emerges only after multiple rounds of tuning.

2) Compared with YModel, Model 12 has a higher AUC on the test data set (0.924 versus 0.911) but a much higher model attenuation level (0.037 versus 0.007), so it is overfitting too. YModel's very small attenuation suggests it will perform more consistently when scoring unseen data.

3) YModel's lift in the top 10% on the test data set is slightly higher than Model 12's (8.0 versus 7.95).

Summary: The two approaches are close on the above metrics, but YModel has better generalization ability.
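
The comparison between Model 12 and YModel can be reproduced from the reported figures. The snippet below is only an illustrative sketch: the numbers are copied from the results above, attenuation is taken to be training AUC minus test AUC, and the dictionary keys and function names are hypothetical.

```python
# Figures copied from the Case 1 model performance results.
model_12 = {"auc_train": 0.961, "auc_test": 0.924, "lift_top10": 7.95}
ymodel = {"auc_train": 0.918, "auc_test": 0.911, "lift_top10": 8.0}


def attenuation(m: dict) -> float:
    """Assumed definition: training AUC minus test AUC."""
    return m["auc_train"] - m["auc_test"]


def compare(a: dict, b: dict) -> dict:
    """Gaps between model a and model b on the reported metrics."""
    return {
        "auc_test_gap": a["auc_test"] - b["auc_test"],
        "attenuation_ratio": attenuation(a) / attenuation(b),
        "lift_gap": a["lift_top10"] - b["lift_top10"],
    }


summary = compare(model_12, ymodel)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Read this way, Model 12 leads by about 0.013 in test AUC and trails by 0.05 in lift, while its attenuation is roughly five times that of YModel.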

2. Model building duration

Manual model building: About three weeks for manual preprocessing and model tuning.

YModel: 13 minutes for automatic preprocessing and model building.

3. Skill requirements

Manual model building: Requires professional statistical knowledge.

YModel: Requires only general knowledge.

 

Case 2

Task: Using a bank's data on defaulted corporate loans, predict the probability of default (PD) for its micro and small business customers.

Data set description: 36,000 rows, 5,500 columns, a size of 453MB; high-dimensional and sparse.

Target variable: IsDefaulted.

Test content:

1. Model performance metrics on the test data set: AUC, lift in the top 10%, and model attenuation level.

2. Model building duration.

 

Test result:


| | YModel | Manual model building |
|---|---|---|
| Model building duration | 17 minutes (data preprocessing & model building) | 2 weeks |
| Model number | 1 | 1 |
| AUC for training data set | 0.996 | 0.998 |
| AUC for test data set | 0.987 | 0.972 |
| Lift in top 10% for test data set | 9.8 | 9.6 |

1) YModel has higher AUC and lift on the test data set and a lower attenuation level (0.009 versus 0.026, taking attenuation as training AUC minus test AUC).

2) YModel is fast and efficient even on high-dimensional data, whereas manual model building is slow and becomes particularly complicated when the data is high-dimensional.

 

Case 3

Task: Predict claim settlement risk for an insurance company.

Data set description: 1.38 million rows, dozens of columns, a size of 4GB; a high proportion of missing data and high-cardinality categorical variables.

Target variable: ClaimOccured

Test content:

1. Gini index on the test data set.

2. Model building duration.
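
The text does not define the Gini index it reports. In scoring models it is commonly the Gini coefficient derived from AUC, Gini = 2 * AUC - 1; the sketch below assumes that definition (an assumption, not something the source confirms) so that Gini values can be read on the same scale as the AUC figures in Cases 1 and 2. The function names are hypothetical.

```python
def gini_from_auc(auc: float) -> float:
    """Gini coefficient under the assumed relation Gini = 2 * AUC - 1."""
    return 2.0 * auc - 1.0


def auc_from_gini(gini: float) -> float:
    """Inverse of the assumed relation: AUC = (Gini + 1) / 2."""
    return (gini + 1.0) / 2.0


# A random model (AUC 0.5) has Gini 0; a perfect model (AUC 1) has Gini 1.
print(gini_from_auc(0.5), gini_from_auc(1.0))
```

Under this assumption, a Gini index of 0.6 would correspond to an AUC of 0.8.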

 

Test result:


| | YModel | Manual model building |
|---|---|---|
| Model building duration | 60 minutes (data preprocessing & model building) | 1 month |
| Model performance (Gini) | 0.683 | 0.608 |
| Key derived variables | 3 | - |

1) YModel has a higher Gini index on the test data set.

2) YModel automatically handles missing data and high-cardinality categorical variables and generates derived variables on its own, which makes it much faster and more efficient.
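
How YModel handles these cases internally is not described here. As an illustration of the kind of preprocessing involved, the sketch below shows one common manual technique, smoothed target encoding, which replaces each category with a shrunken mean of the target and treats missing values as a category of their own. This is an assumed, generic approach, not YModel's method; the function names, the `MISSING` sentinel, and the `smoothing` parameter are all hypothetical.

```python
from collections import defaultdict

MISSING = "__missing__"  # hypothetical sentinel for absent values


def fit_target_encoding(categories, targets, smoothing=10.0):
    """Map each category to a smoothed mean of a binary target.

    Rare categories are pulled toward the overall mean (the prior),
    which keeps high-cardinality variables from memorizing noise.
    """
    prior = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        key = MISSING if cat is None else cat
        sums[key] += y
        counts[key] += 1
    encoding = {
        key: (sums[key] + smoothing * prior) / (counts[key] + smoothing)
        for key in counts
    }
    return encoding, prior


def apply_target_encoding(categories, encoding, prior):
    """Encode values; categories unseen during fitting fall back to the prior."""
    return [
        encoding.get(MISSING if cat is None else cat, prior)
        for cat in categories
    ]


# Toy data for illustration only.
enc, prior = fit_target_encoding(["a", "a", "b", None], [1, 0, 1, 0], smoothing=2.0)
print(apply_target_encoding(["b", "unseen", None], enc, prior))
```

The encoding should be fitted on the training data only and then applied to the test data; otherwise target information leaks into the features and the test metrics become optimistic.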