Outlier processing
Methods for handling outliers:
Delete records with outliers: directly delete the records that contain outliers.
Treat as missing values: treat each outlier as a missing value and handle it with a missing-value processing method.
Correct outliers: replace the outlier with an endpoint value or with the average of the two adjacent observed values.
Label outliers: create new variables that flag the outliers for further analysis or processing.
No processing: perform data mining directly on the dataset that contains outliers.
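The strategies listed above can be sketched outside SPL as well. The following Python snippet is only an illustration: the bounds `LOW` and `HIGH`, the helper names, and the sample fares are all hypothetical and do not come from the SPL library.

```python
# Hypothetical bounds for what counts as a normal fare; not taken from the source.
LOW, HIGH = 0.0, 100.0

def is_outlier(x, low=LOW, high=HIGH):
    """Return True when x falls outside the closed interval [low, high]."""
    return x < low or x > high

def delete_outliers(values):
    """Strategy 1: drop the records that contain outliers."""
    return [x for x in values if not is_outlier(x)]

def outliers_to_missing(values):
    """Strategy 2: turn outliers into missing values (None) for later imputation."""
    return [None if is_outlier(x) else x for x in values]

def correct_with_neighbors(values):
    """Strategy 3: replace an outlier with the mean of the nearest normal
    value on each side; at either end only one neighbor is available."""
    result = list(values)
    for i, x in enumerate(values):
        if not is_outlier(x):
            continue
        left = next((v for v in reversed(values[:i]) if not is_outlier(v)), None)
        right = next((v for v in values[i + 1:] if not is_outlier(v)), None)
        neighbors = [v for v in (left, right) if v is not None]
        if neighbors:  # leave the value alone if every other value is an outlier too
            result[i] = sum(neighbors) / len(neighbors)
    return result

def label_outliers(values):
    """Strategy 4: keep the data and add a 0/1 flag per record."""
    return [(x, 1 if is_outlier(x) else 0) for x in values]

fares = [7.25, 8.05, 7.9, 512.33, 8.5]  # made-up sample values
```

Fixed bounds keep the sketch short. In practice the bounds would more likely be derived from the data itself, for example from the mean and standard deviation.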
Correcting outliers
In SPL, A.sert() and P.sert(cn) can automatically correct outliers. For example, the following script corrects the outliers in the "Fare" variable of the Titanic data.
|   | A |
|---|---|
| 1 | =file("D://titanic.csv").import@qtc() |
| 2 | =A1.sert@c("Fare") |
A2 Corrects the outliers in the Fare variable and returns the corrected result together with the correction record Rec. The @c option indicates that the original data is modified.
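The text does not say which rule sert uses to detect and replace outliers, so the following Python sketch should not be read as its algorithm. It only illustrates one common form of endpoint correction under stated assumptions: values beyond the mean plus or minus `z` sample standard deviations are clipped to that boundary, and a record of every change is returned alongside the corrected data. The function name `clip_outliers` and the record layout are hypothetical.

```python
import statistics

def clip_outliers(values, z=3.0):
    """Clip values to [mean - z*std, mean + z*std] and report what changed.

    Assumes at least two values that are not all identical.
    Returns (corrected, record), where record holds one
    (index, original, corrected) tuple per modified value.
    """
    mean = statistics.fmean(values)
    std = statistics.stdev(values)  # sample standard deviation
    low, high = mean - z * std, mean + z * std

    corrected, record = [], []
    for i, x in enumerate(values):
        y = min(max(x, low), high)  # pull the value back to the nearest endpoint
        corrected.append(y)
        if y != x:
            record.append((i, x, y))
    return corrected, record

# Made-up sample: nineteen ordinary fares and one extreme fare.
sample = [10.0] * 19 + [500.0]
corrected, record = clip_outliers(sample)
```

This version returns a new list and leaves its input untouched. Modifying the original data in place, as the @c option is described as doing, would be a separate design choice.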
Labeling outliers
For example, label the outliers of the "Fare" variable in titanic.csv using thresholds of 3 standard deviations (z=3) and 5 standard deviations (z=5), respectively.
|   | A |
|---|---|
| 1 | =file("D://titanic.csv").import@qtc() |
| 2 | =A1.avg(Fare) |
| 3 | =sqrt(var@s(A1.(Fare))) |
| 4 | =A1.derive((Fare-A2)/A3:Fare_z,if(Fare_z>3,1,if(Fare_z<-3,-1,0)):Fare_z3,if(Fare_z>5,1,if(Fare_z<-5,-1,0)):Fare_z5) |
A2 Calculates the mean of Fare.
A3 Calculates the standard deviation of Fare.
A4 Calculates the z-score of Fare, stored as Fare_z, and labels outliers according to it. For Fare_z3, z-scores greater than 3 are marked 1, z-scores less than -3 are marked -1, and all others are marked 0. Fare_z5 uses the same rule with a threshold of 5.
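For readers who want to cross-check the labeling logic outside SPL, here is a rough Python equivalent of cells A2 to A4. It assumes that var@s denotes the sample variance, so it uses `statistics.stdev`. The function name `label_fares` and the sample data are hypothetical.

```python
import statistics

def z_label(z, threshold):
    """Return 1 above +threshold, -1 below -threshold, otherwise 0."""
    if z > threshold:
        return 1
    if z < -threshold:
        return -1
    return 0

def label_fares(fares):
    """Add Fare_z, Fare_z3 and Fare_z5 to each fare, mirroring the derive call.

    Assumes at least two fares that are not all identical, so the
    standard deviation is defined and non-zero.
    """
    mean = statistics.fmean(fares)   # counterpart of A2
    std = statistics.stdev(fares)    # counterpart of A3 (sample standard deviation)
    rows = []
    for fare in fares:
        z = (fare - mean) / std
        rows.append({
            "Fare": fare,
            "Fare_z": z,
            "Fare_z3": z_label(z, 3),
            "Fare_z5": z_label(z, 5),
        })
    return rows

# Made-up sample: nineteen ordinary fares and one extreme fare.
rows = label_fares([10.0] * 19 + [500.0])
```

In this sample the extreme fare has a z-score of roughly 4.25, so it is flagged in Fare_z3 but not in Fare_z5. This shows how the two thresholds separate moderate from severe outliers.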
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL