Lina RaqForum 43 No.
261 View • 1 Years ago

Low frequency categorical data processing

When a categorical variable has many categories, some of them may be noise: categories with very few samples, abnormal categories, suspected data-entry errors, and so on. In this case, the number of categories can be reduced by combining the low-frequency categories.

The "Name" field in Titanic.csv is a categorical variable. Each passenger's name contains a title such as "Mr" or "Mrs", which can be extracted to generate a new variable, "Title". The low-frequency categories in "Title" are then combined: in the code here, any title that occurs fewer than 10 times is relabeled "others".
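For readers who do not use SPL, the title extraction can be sketched in plain Python. This is only an illustration, not the post's method: the helper name `extract_title` and the sample names are made up, and the sketch assumes every name follows the "Surname, Title. Given names" layout.

```python
def extract_title(name: str) -> str:
    """Return the title from a 'Surname, Title. Given names' string."""
    # Keep the part after the first comma, e.g. " Mr. John Adam"
    after_comma = name.split(",", 1)[1]
    # Keep the part before the first period and trim whitespace, e.g. "Mr"
    return after_comma.split(".", 1)[0].strip()


# Made-up names in the assumed layout (not rows from the real file)
names = [
    "Smith, Mr. John Adam",
    "Jones, Mrs. Mary (Ann Lee)",
    "Brown, Rev. Paul",
]
titles = [extract_title(n) for n in names]
print(titles)
```

A name without a comma would raise an `IndexError` in this sketch, so real data may need a guard for malformed rows.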

SPL code:

	A
1	=file("D://titanic.csv").import@qtc()
2	=A1.derive(Name.split@b(",")(2).split(".")(1):Title)
3	=A2.groups(Title;count(~):count)
4	=A2.group(Title)
5	=A4.align@a([true,false],~.len()<10)
6	=A5(1).(~.run(Title="others"))
7	=A2
8	=A7.groups(Title;count(~):count)
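
The merging step in the grid above (count each title, relabel those with fewer than 10 rows as "others", then count again) can also be sketched in plain Python. This is an illustrative equivalent, not the post's SPL: the function name `merge_rare`, its parameters, and the toy title list are assumptions made for the example.

```python
from collections import Counter


def merge_rare(values, min_count=10, other="others"):
    """Relabel every value that occurs fewer than min_count times."""
    counts = Counter(values)  # frequency of each category
    return [v if counts[v] >= min_count else other for v in values]


# Toy data (made up): two frequent titles and two rare ones
titles = ["Mr"] * 12 + ["Mrs"] * 10 + ["Dr"] * 3 + ["Rev"] * 2

merged = merge_rare(titles)
print(Counter(merged))  # frequencies after merging
```

With this toy input, "Dr" and "Rev" fall below the threshold and are pooled into a single "others" category of 5 rows, while "Mr" and "Mrs" are kept. The threshold of 10 mirrors the `~.len()<10` test in the grid; a different cutoff may suit other datasets.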