Undersampling
Undersampling balances the classes by reducing the number of samples in the majority class. The simplest and most direct method is to randomly remove samples from the majority class until it reaches the desired size.
For example, the target variable Survived in the Titanic data is binary, taking the values 0 and 1. The following script undersamples the majority class to reach a 1:1 class balance:
| | A |
|---|---|
| 1 | =file("D://titanic.csv").import@qtc() |
| 2 | 1 |
| 3 | =A1.group@p(Survived) |
| 4 | =A3.sort(~.len()) |
| 5 | =ceil(min(A4(2).len(),A4(1).len()*A2)) |
| 6 | =to(A4(2).len()).sort(rand()) |
| 7 | =A6(to(A5)).sort() |
| 8 | =(A4(2)(A7)\|A4(1)).sort() |
| 9 | =A1(A8) |
A2 Sets the target balance ratio, majority samples : minority samples (1 means 1:1)
A3 Groups by the target variable and returns the row positions of the members of each group
A4 Sorts the groups by sample count, so the first group is the minority class and the second group is the majority class
A5 Calculates how many majority samples to keep according to the ratio, capped at the size of the majority class
A6 Puts the positions within the majority group into random order
A7 Takes the first A5 entries of A6 and sorts them, which amounts to random sampling without replacement
A8 Merges the sampled majority positions with all minority positions and sorts the result
A9 Fetches the records at those positions from A1, completing the sampling
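For readers who want to cross-check the logic outside SPL, the steps A3 to A9 can be sketched in plain Python. This is only an illustrative sketch, not the SPL implementation: the function name `undersample_positions`, the `seed` parameter and the toy label list are assumptions made for the example, positions are 0-based rather than 1-based as in SPL, and exactly two classes are assumed.

```python
import math
import random
from collections import defaultdict


def undersample_positions(labels, ratio=1.0, seed=None):
    """Return sorted 0-based positions of an undersampled subset of `labels`.

    Hypothetical helper for illustration. `ratio` is the desired
    majority/minority ratio after sampling (the role of A2).
    """
    rng = random.Random(seed)

    # A3: group row positions by the value of the target variable
    groups = defaultdict(list)
    for pos, label in enumerate(labels):
        groups[label].append(pos)
    if len(groups) != 2:
        raise ValueError("this sketch assumes exactly two classes")

    # A4: order the groups by size, minority first
    minority, majority = sorted(groups.values(), key=len)

    # A5: number of majority samples to keep, capped at the group size
    keep = math.ceil(min(len(majority), len(minority) * ratio))

    # A6-A7: draw `keep` majority positions at random, without replacement
    kept_majority = rng.sample(majority, keep)

    # A8: merge with all minority positions and restore the original order
    return sorted(kept_majority + minority)


# Toy target column: six samples of class 0, three of class 1
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1]
positions = undersample_positions(labels, ratio=1.0, seed=0)

# A9: fetch the rows at the selected positions
balanced = [labels[p] for p in positions]
```

With a ratio of 1, the toy data keeps three samples of each class. With a ratio of 2, the cap in A5 applies: min(6, 3 × 2) = 6, so every majority sample is kept.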