A typical comparison between esProc and python
Pandas is a class library for Python to process structured data, which is convenient for data analysis and processing. Therefore, we mainly use pandas to compare with esProc.
1. Search and statistics
- Task
Find out the files containing the specified word in all the texts in the directory, and list the contents of the row and row number.
- Python
1 |
import glob |
2 |
list1=[] |
3 |
for filename in glob.glob('D:/*.txt'): |
4 |
file = open(filename,'r') |
5 |
for num,value in enumerate(file): |
6 |
if value.find('XXX')>=0: |
7 |
list1.append(filename+' '+num+' '+value) |
8 |
file.close() |
It seems that it can also be done in the form of apply (lambda:), but it is not as clear and understandable as for loop.
- esProc
A |
|
1 |
=directory@p(“*.txt”) |
2 |
=A1.conj(file(~).read@n().(if(pos(~,"xxx"),[A1.~,#,~].string())).select(~)) |
esProc provides the traversal function of the file system. Combined with the text computing ability, it can be done in two lines of code.
2. WordCount
- Task
Count the number of words in a file.
- Python
1 |
data=pd.read_csv(file,sep="\n",header=None) |
2 |
tmp=[] |
3 |
data=data.iloc[:,0].apply(lambda x:tmp.extend(x.split())) |
4 |
pd.Series(1,tmp).groupby(level=0).count() |
It's just to split the contents by empty characters. It will be very troublesome to remove other punctuations that do not have statistical meanings.
- esProc
A |
|
1 |
=lower(file(“T.txt”).read()).words().groups(~:word;count(1):count) |
esProc provides the word() function to split the string into words, and a one-liner can complete the task.
3. General operation
- Task
Find out the males over 25 and females over 23 from the text: 1)list by order of name; 2) Group by gender and calculate the average age; 3)list all the surnames that have appeared (do not consider multiple surnames)
- Python
1 |
import pandas as pd |
2 |
file = "D.csv" |
3 |
data=pd.read_csv(file) |
4 |
data_select = data[((data['sex']=='male')&(data['age']>=25))|((data['sex']=='female')&(data['age']>=23))] |
5 |
data_sort = data.sort_values('name') |
6 |
data_group = data.groupby('sex')['age'].mean() |
7 |
data['sur'] = data['name'].apply(lambda x:x[0]) |
8 |
data_distinct = data.drop_duplicates(['sur']) |
Pandas needs to rebuild a column to duplicate.
- esProc
A |
||
1 |
=file(“D.csv”).import@t(name,sex,age;”,”) |
|
2 |
=A1.select(sex==”Male”&&age>=25||sex==”Female”&&age>=23) |
//Filter |
3 |
=A2.sort(name) |
//Sort |
4 |
=A2.groups(sex;avg(age):age) |
//Group and aggregation |
5 |
=A2.id(left(name,1)) |
//Unique value |
esProc provides rich structured computing functions. To some extent, it can operate on text as a database table, and obtain similar computing power as SQL without database.
When there is a large amount of data, pandas can only extract the data in sections, and then calculate and merge, and the amount of code and calculation will increase sharply.
When the data volume is large, these calculations can be done based on cursors:
A |
||
1 |
=file(“D.csv”).cursor@tm(name,sex,age;”,”) |
|
2 |
=A1.select(sex==”Male”&&age>=25||sex==”Female”&&age>=23) |
//Filter |
3 |
=A2.sortx(name) |
//Sort |
4 |
=A2.groups(sex;avg(age):age) |
//Group and aggregation |
5 |
=A2.groupx(left(name,1);) |
//Unique value |
6 |
=A3.fetch(…) |
//Fetch result |
Different from the in-memory calculation, the cursor can only be traversed once, and only one of the above sorting and grouping operations can be executed. The cursor needs to be rebuilt when another is executed.
4. Divide columns
- Task
List in column the information of employees whose salaries are more than 8000 in each department (ranking from high to low).
- Python
1 |
import pandas as pd |
2 |
file = "E:/txt/EMPLOYEE.txt" |
3 |
data=pd.read_csv(file,sep='\t') |
4 |
sales_df = data[data['DEPT']=='Sales'].sort_values('SALARY',ascending=False).reset_index(drop=True) |
5 |
sales_df['FULLNAME'] = sales_df['NAME']+' '+sales_df['SURNAME'] |
6 |
sales_df = sales_df[['FULLNAME','SALARY']] |
7 |
hr_df = data[data['DEPT']=='HR'].sort_values('SALARY',ascending=False).reset_index(drop=True) |
8 |
hr_df['FULLNAME'] = hr_df['NAME']+' '+hr_df['SURNAME'] |
9 |
hr_df = hr_df[['FULLNAME','SALARY']] |
10 |
data = pd.concat([sales_df,hr_df],axis=1) |
11 |
print(data) |
When dividing column in pandas, the index should be reset, otherwise empty data will appear.
- esProc
A |
||
1 |
=connect("mysql") |
//Connect to database |
2 |
=A1.query("select * from EMPLOYEE order by SALARY") |
//Sort by salary |
3 |
=A2.select(DEPT=="Sales") |
//Sales department data |
4 |
=A2.select(DEPT=="HR") |
//HR department data |
5 |
=create(Sales,SALARY,HR,SALARY) |
//Create new table |
6 |
=A5.paste(A3.(NAME+" "+SURNAME),A3.(SALARY),A4.(NAME+" "+SURNAME),A4.(SALARY)) |
//Fill in data |
esProc can easily create new table, fill in data and divide columns.
5. Filter according to cumulative value
- Task
Find the top n customers whose sales accounts for half of the total sales, and rank them by sales in descending order.
- Python
1 |
import pandas as pd |
2 |
sale_file = "E:\\txt\\sales_volume.csv" |
3 |
sale_info = pd.read_csv(sale_file) |
4 |
sale_info.sort_values(by='Amount',inplace=True,ascending=False) |
5 |
half_amount = sale_info['Amount'].sum()/2 |
6 |
vip_list = [] |
7 |
amount = 0 |
8 |
for client_info in sale_info.itertuples(): |
9 |
amount += getattr(client_info, 'Amount') |
10 |
if amount < half_amount: |
11 |
vip_list.append(getattr(client_info, 'Client')) |
12 |
else: |
13 |
vip_list.append(getattr(client_info, 'Client')) |
14 |
break |
15 |
print(vip_list) |
Pandas does not have an existing loop function to calculate the position that meets the conditions, so it can only use for loop to complete the calculation.
- esProc
A |
||
1 |
E:\\txt\\sales_volume.csv |
//Data storage path |
2 |
=file(A1).import@tc().sort@z(Amount) |
//Import data and sort by amount in descending order |
3 |
=A2.sum(Amount)/2 |
//Calculate half of total sales |
4 |
=A2.pselect(cum(Amount)>=A3) |
//Find the position where the cumulative sales is more than half of total sales |
5 |
=A2.(Client).m(:A4) |
//Take the client names before the position |
It is recommended to use loop function to calculate in esProc, and filter and record the position information according to the conditions.
6. Calculate ratio compared with the previous period for the selected records
- Task
Find the trading information and increase rate of a stock in the three trading days with the highest share price.
- Python
1 |
import pandas as pd |
2 |
stock1001_file = "E:\\txt\\stock1001_price.txt" |
3 |
stock1001 = pd.read_csv(stock1001_file,sep = '\t') |
4 |
CL = stock1001['CL'] |
5 |
CL_psort = CL.argsort()[::-1].iloc[:3].values |
6 |
CL_psort_shift1 = CL_psort-1 |
7 |
CL_rise = CL[CL_psort].values/CL[CL_psort_shift1].values-1 |
8 |
max_3 = stock1001.loc[CL_psort].reset_index(drop = True) |
9 |
max_3['RISE'] = CL_rise |
10 |
print(max_3) |
Python can't use position to directly calculate ratio compared with the previous period. It needs to construct a position sequence of the previous period and then calculate.
- esProc
A |
||
1 |
=Stocks.sort(TradingDate) |
|
2 |
=A1.psort(ClosePrice:-1)([1,2,3]) |
//The position of three days with highest share price |
3 |
=A1.calc(A2, ClosePrice/ClosePrice[-1]-1) |
//Calculate the increase rate for the three days |
4 |
=A1(A2).new(TradingDate,ClosePrice,A3(#):IncreaseRate) |
esProc is very good at these calculations related to order and position, and the code is simple and in line with natural thinking.
7. Foreign key join 1
- Task
Find out employees with couple salary more than 10000.
- Python
1 |
import pandas as pd |
2 |
emp_file = "E:\\txt\\Employees.txt" |
3 |
rel_file = "E:\\txt\\EmpRel.txt" |
4 |
emp_info = pd.read_csv(emp_file,sep='\t') |
5 |
rel_info = pd.read_csv(rel_file,sep='\t') |
6 |
rel_info = rel_info[rel_info['Relationship']=="Spouse"] |
7 |
emp1_join = pd.merge(rel_info,emp_info,left_on='Emp1',right_on='ID') |
8 |
emp2_join = pd.merge(emp1_join,emp_info,left_on='Emp2',right_on='ID') |
9 |
res = emp2_join[emp2_join['BasePay_x']+emp2_join['BasePay_y']>=10000] |
10 |
r = res[['Name_x','Name_x']] |
11 |
print(r) |
Pandas can only join twice to get two columns of information, and then filter.
- esProc
A |
|
1 |
=file("E:\\txt\\Employees.txt").import@t().keys(ID) |
2 |
=file("E:\\txt\\EmpRel.txt").import@t() |
3 |
=A2.select(Relationship=="Spouse") |
4 |
>A3.switch(Emp1,A1;Emp2,A1) |
5 |
=A3.select(Emp1.BasePay+Emp2.BasePay>=10000) |
6 |
>A5.run(Emp1=Emp1.Name,Emp2=Emp2.Name) |
Through foreign key objectification, esProc converts the foreign key field into the corresponding reference in the foreign key table, so that it can be treated as a single table.
8. Foreign key join 2
- Task
Find out the department with the youngest department manager.
- Python
1 |
import pandas as pd |
2 |
import datetime |
3 |
emp_file = "E:/txt/EMPLOYEE.txt" |
4 |
manager_file = "E:/txt/DEPARTMENT.txt" |
5 |
emp_info = pd.read_csv(emp_file,sep='\t') |
6 |
manager_info = pd.read_csv(manager_file,sep='\t') |
7 |
manager_join = pd.merge(manager_info,emp_info,left_on='MANAGER',right_on='EID') |
8 |
today = datetime.datetime.today().year |
9 |
arr = pd.to_datetime(manager_join['BIRTHDAY']) |
10 |
manager_join['AGE'] = today-arr.dt.year |
11 |
min_age = manager_join['AGE'].min() |
12 |
young = manager_join[manager_join['AGE']==min_age]['DEPT'] |
13 |
print(young) |
- esProc
A |
|
1 |
=file("E:/txt/EMPLOYEE.txt").import@t() |
2 |
=file("E:/txt/DEPARTMENT.txt").import@t() |
3 |
=A2.join(MANAGER,A1:EID,~:manager) |
4 |
=A3.minp(manager.(age(BIRTHDAY))).manager.DEPT |
It is also the application of foreign key objectification.
9. Sorting and filtering of grouped subsets
- Task
Find out the sales clerks whose sales are within top 8 for every moth in 1995.
- Python
1 |
import pandas as pd |
2 |
sale_file = 'E:\\txt\\SalesRecord.txt' |
3 |
sale_info = pd.read_csv(sale_file,sep = '\t') |
4 |
sale_info['month']=pd.to_datetime(sale_info['sale_date']).dt.month |
5 |
sale_group = sale_info.groupby(by=['clerk_name','month'],as_index=False).sum() |
6 |
sale_group_month = sale_group.groupby(by='month') |
7 |
set_name = set(sale_info['clerk_name']) |
8 |
for index,sale_g_m in sale_group_month: |
9 |
sale_g_m = sale_g_m.sort_values(by='sale_amt',ascending = False) |
10 |
sale_g_max_8 = sale_g_m.iloc[:8] |
11 |
sale_g_max_8_name = sale_g_max_8['clerk_name'] |
12 |
set_name = set_name.intersection(set(sale_g_max_8_name)) |
13 |
print(set_name) |
- esProc
A |
|
1 |
E:\\txt\\SalesRecord.txt |
2 |
=file(A1).import@t() |
3 |
=A2.groups(clerk_name:name,month(sale_date):month;sum(sale_amt):amount) |
4 |
=A3.group(month) |
5 |
=A4.(~.sort(-amount).to(8)) |
6 |
=A5.isect(~.(name)) |
esProc retains grouped subsets, and it is very simple and convenient to sort and filter the subsets in the loop function.
10. Grouped subset cross row calculation and filter grouped results
- Task
Find out stocks that have had three consecutive trading day limits (up 10%).
- Python
1 |
import pandas as pd |
2 |
def con_rise(stock:pd.DataFrame): |
3 |
rise_day_list = [] |
4 |
rise_num = 0 |
5 |
shift_1 = stock['CL']/stock['CL'].shift(1)-1>0.06 |
6 |
for bl in shift_1: |
7 |
if bl == False: |
8 |
rise_num = 0 |
9 |
else: |
10 |
rise_num+=1 |
11 |
rise_day_list.append(rise_num) |
12 |
return max(rise_day_list) |
13 |
stock_file = 'E:\\txt\\StockRecords.txt' |
14 |
stock = pd.read_csv(stock_file,sep='\t') |
15 |
stock_g = stock.groupby(by = ['CODE']) |
16 |
good_code = [] |
17 |
for index,group in stock_g: |
18 |
group = group.sort_values(by='DT') |
19 |
group = group.reset_index(drop = True) |
20 |
max_rise = con_rise(group) |
21 |
if max_rise>=5: |
22 |
good_code.append(index) |
23 |
print(good_code) |
- esProc
A |
||
1 |
E:\\txt\\StockRecords.txt |
|
2 |
=file(A1).import@t() |
|
3 |
=A2.group(CODE).(~.sort(DT)) |
|
4 |
=A3.select(func(A5,~)>=5).(~.CODE) |
|
5 |
func |
|
6 |
=(rise=0,A5.(rise=if(CL/CL[-1]-1>=0.06,rise=if(!CL[-1],0,rise+1),0))) |
|
7 |
=max(B6) |
The step-by-step calculation of esProc is very clear: 1. Grouping and sorting; 2. Calculating the maximum days with an increase of more than 0.06; 3. Filtering. The custom function can also be put into the loop function for loop calculation.
11. Calculate ratio compared with the previous period for continuous intervals of grouped subsets
- Task
Calculate the longest consecutive rising days of each stock.
- Python
1 |
import pandas as pd |
2 |
def con_rise(stock:pd.DataFrame): |
3 |
rise_day_list = [] |
4 |
rise_num = 0 |
5 |
shift_1 = stock['CL']>stock['CL'].shift(1) |
6 |
for bl in shift_1: |
7 |
if bl == False: |
8 |
rise_num = 0 |
9 |
else: |
10 |
rise_num+=1 |
11 |
rise_day_list.append(rise_num) |
12 |
return max(rise_day_list) |
13 |
stock_file = "E:\\txt\\StockRecords.txt" |
14 |
stock_info = pd.read_csv(stock_file,sep="\t") |
15 |
stock_info.sort_values(by='DT',inplace=True) |
16 |
stock_group = stock_info.groupby(by='CODE') |
17 |
max_rise_list = [] |
18 |
for index,stock_g in stock_group: |
19 |
code = stock_g.iloc[0]['CODE'] |
20 |
max_rise_list.append([code,con_rise(stock_g)]) |
21 |
max_rise_df = pd.DataFrame(max_rise_list,columns=['CODE','con_rise']) |
22 |
print(max_rise_df) |
- esProc
A |
B |
|
1 |
E:\\txt\\StockRecords.txt |
|
2 |
=file(A1).import@t() |
|
3 |
=A2.sort(DT) |
|
4 |
=A3.group(CODE) |
|
5 |
=A4.new(CODE,func(A6,~):con_rise) |
|
6 |
func |
|
7 |
=(num=0,A6.max(num=if(CL>CL[-1],if(#==1,0,num+1),0))) |
The idea of comparing with last period for continuous periods of a single stock: if it is higher than the stock price of the previous day, add 1; if it is not greater than, set 0; finally, check the maximum value in the sequence. The calculation method of a single stock is written as a function, and the table of each stock is passed in as a parameter. esProc can easily call the function in the loop function to get the result. With the same idea, pandas code looks much more complex.
12. Grouped subsets iteration loop
- Task
Calculate how many months it takes for each sales representative to reach the sales amount of 50k.
- Python
1 |
import pandas as pd |
2 |
sale_file = "E:/txt/orders_i.csv" |
3 |
sale_data = pd.read_csv(sale_file,sep='\t') |
4 |
sale_g = sale_data.groupby('sellerid') |
5 |
breach50_list = [] |
6 |
for index,group in sale_g: |
7 |
amount=0 |
8 |
group = group.sort_values('month') |
9 |
for row in group.itertuples(): |
10 |
amount+=getattr(row, 'amount') |
11 |
if amount>=500000: |
12 |
breach50_list.append([index,getattr(row, 'month'),]) |
13 |
break |
14 |
breach50_df = pd.DataFrame(breach50_list,columns=['sellerid','month']) |
15 |
print(breach50_df) |
- esProc
A |
||
1 |
E:/txt/orders_i.csv |
|
2 |
=file(A1).import@t() |
|
3 |
=A2.group(sellerid;(~.iterate((x=month,~~+amount),0,~~>500000),x):breach50) |
esProc retains grouped subsets and uses the iterative function to realize the iteration.
13. Grouped subsets operations
- Task
Calculate the daily inventory status of various goods in the specified time period.
- Python
1 |
import pandas as pd |
2 |
import numpy as np |
3 |
starttime = '2015-01-01' |
4 |
endtime = '2015-12-31' |
5 |
stock_data = pd.read_csv('E:\\txt\\stocklog.csv',sep='\t') |
6 |
stock_data['DATE']=pd.to_datetime(stock_data['DATE']) |
7 |
stock_data = stock_data[(stock_data['DATE']>=starttime)&(stock_data['DATE']<=endtime)] |
8 |
stock_data['ENTER']=stock_data['QUANTITY'][stock_data['INDICATOR']!='ISSUE'] |
9 |
stock_data['ISSUE']=stock_data['QUANTITY'][stock_data['INDICATOR']=='ISSUE'] |
10 |
stock_g = stock_data[['STOCKID','DATE','ENTER','ISSUE']].groupby(by=['STOCKID','DATE'],as_index=False).sum() |
11 |
stock_gr = stock_g.groupby(by='STOCKID',as_index = False) |
12 |
date_df = pd.DataFrame(pd.date_range(starttime,endtime),columns=['DATE']) |
13 |
stock_status_list = [] |
14 |
for index,group in stock_gr: |
16 |
date_df['STOCKID']=group['STOCKID'].values[0] |
17 |
stock_status = pd.merge(date_df,group,on=['STOCKID','DATE'],how='left') |
18 |
stock_status = stock_status.sort_values(['STOCKID','DATE']) |
19 |
stock_status['OPEN']=0 |
20 |
stock_status['CLOSE']=0 |
21 |
stock_status['TOTAL']=0 |
22 |
stock_status = stock_status.fillna(0) |
23 |
stock_value = stock_status[['STOCKID','DATE','OPEN','ENTER','TOTAL','ISSUE','CLOSE']].values |
24 |
open = 0 |
25 |
for value in stock_value: |
26 |
value[2] = open |
27 |
value[4] = value[2] + value[3] |
28 |
value[6] = value[4] - value[5] |
29 |
open = value[6] |
30 |
stock = pd.DataFrame(stock_value,columns = ['STOCKID','DATE','OPEN','ENTER','TOTAL','ISSUE','CLOSE']) |
31 |
stock_status_list.append(stock) |
32 |
stock_status = pd.concat(stock_status_list,ignore_index=True) |
print(stock_status) |
- esProc
A |
B |
|
1 |
=file("E:\\txt\\stocklog.csv").import@t() |
|
2 |
=A1.select(DATE>=date("2015-01-01") && DATE<=date("2015-12-31")) |
|
3 |
=A2.groups(STOCKID,DATE; sum(if(INDICATOR=="ISSUE",QUANTITY,0)):ISSUE, sum(if(INDICATOR!="ISSUE",QUANTITY,0)):ENTER) |
|
4 |
=periods(start,end) |
|
5 |
for A3.group(STOCKID) |
=A5.align(A4,DATE) |
6 |
>b=c=0 |
|
7 |
=B5.new(A5.STOCKID:STOCKID,A4(#):DATE,c:OPEN,ENTER, (b=c+ENTER):TOTAL,ISSUE, (c=b-ISSUE):CLOSE) |
|
8 |
>B7.run(ENTER=ifn(ENTER,0),ISSUE=ifn(ISSUE,0)) |
|
9 |
=@|B7 |
esProc 9 lines of code complete the task of Python 32 lines of code, and when you read the code later on, esProc is easier to understand.
14. Sort grouped results
- Task
Find out the two departments with the largest number and the smallest number of employees.
- Python
1 |
import pandas as pd |
2 |
emp_file = 'E:\\txt\\employee.txt' |
3 |
emp_info = pd.read_csv(emp_file,sep='\t') |
4 |
emp_g = emp_info.groupby(by='DEPT') |
5 |
size = emp_g.size().sort_values() |
6 |
sorted_dept = size.index.values |
7 |
print(sorted_dept[[0,-1]]) |
- esProc
A |
|
1 |
E:\\txt\\employee.txt |
2 |
=file(A1).import@t() |
3 |
=A2.group(DEPT).sort(~.len()).m([1,-1]).(~.DEPT) |
With esProc, the grouping, sorting and filtering are completed by one line.
15. Adjacent records grouping with the original order
- Task
List the team information with the most consecutive NBA titles.
- Python
1 |
import pandas as pd |
2 |
import numpy as np |
3 |
pd.set_option('display.max_columns', None) |
4 |
nba_file = 'E:\\txt\\nba.txt' |
5 |
nba_champion = pd.read_csv(nba_file,sep='\t') |
6 |
nba_champion = nba_champion.sort_values(by = 'Year') |
7 |
arr = np.zeros(len(nba_champion)) |
8 |
arr[nba_champion['Champion']!=nba_champion['Champion'].shift(1)]=1 |
9 |
arr = np.cumsum(arr) |
10 |
nba_champion['flag']=arr |
11 |
nba_champion_g = nba_champion.groupby(by='flag') |
12 |
max_num = nba_champion_g.size().idxmax() |
13 |
max_champion = nba_champion_g.get_group(max_num) |
14 |
print(max_champion) |
Python doesn't have the ability to group by adjacent conditions. You need to create a list of grouping flags.
- esProc
A |
||
1 |
=connect("mysql") |
//Connect to database |
2 |
=A1.query("select * from nba order by year") |
//Sort by year |
3 |
=A2.group@o(Champion) |
//When adjacency is different, start a new group |
4 |
=A3.maxp(~.len()) |
//List the team information with the most consecutive NBA titles. |
The set of esProc is ordered, and it is very convenient to merge the adjacent same records into one group and start a new group when the adjacent record is different.
16. Group by specified order
- Task
List in order the number and average age of female employees of technology, production, sales and HR department.
- Python
1 |
import pandas as pd |
2 |
import datetime |
3 |
emp_file = 'E:\\txt\\employee.txt' |
4 |
dept_seq = ['Technology','Production','Sales','HR'] |
5 |
emp_info = pd.read_csv(emp_file,sep='\t') |
6 |
emp_g = emp_info.groupby(by='DEPT') |
7 |
emp_g_index = [index for index,group in emp_g] |
8 |
today = datetime.datetime.today().year |
9 |
dept_femal_list = [] |
10 |
for dept in dept_seq: |
11 |
if dept not in emp_g_index: |
12 |
dept_femal_num = 0 |
13 |
age_a = None |
14 |
else: |
15 |
dept_emp = emp_g.get_group(dept) |
16 |
dept_emp_femal = dept_emp[dept_emp['GENDER']=='F'] |
17 |
dept_femal_num = len(dept_emp_femal) |
18 |
arr = pd.to_datetime(dept_emp_femal['BIRTHDAY']) |
19 |
age_a = (today-arr.dt.year).mean() |
20 |
dept_femal_list.append([dept,dept_femal_num,age_a]) |
21 |
dept_femal_info = pd.DataFrame(dept_femal_list,columns=['DEPT','femal_num','femal_age']) |
22 |
print(dept_femal_info) |
- esProc
A |
|
1 |
E:\\txt\\employee.txt |
2 |
[Technology,Production,Sales,HR] |
3 |
=file(A1).import@t() |
4 |
=A3.align@a(A2,DEPT) |
5 |
=A4.new(A2(#):DEPT,(F_EMP=~.select(GENDER:"F"),F_EMP.count()):Femal_num,F_EMP.avg(age@y(BIRTHDAY)):Femal_age) |
It is much more convenient to deal with this kind of calculation by esProc based on the ordered set and the special alignment operation.
17. Repeated conditional grouping
- Task
Divide employees into groups by segmentation according to the length of service in the company and count the number of male and female employees in each group.
- Python
1 |
import pandas as pd |
2 |
import datetime |
3 |
def eval_g(dd:dict,ss:str): |
4 |
return eval(ss,dd) |
5 |
emp_file = 'E:\\txt\\employee.txt' |
6 |
emp_info = pd.read_csv(emp_file,sep='\t') |
7 |
employed_list = ['Within five years','Five to ten years','More than ten years','Over fifteen years'] |
8 |
employed_str_list = ["(s<5)","(s>=5) & (s<10)","(s>=10)","(s>=15)"] |
9 |
today = datetime.datetime.today().year |
10 |
arr = pd.to_datetime(emp_info['HIREDATE']) |
11 |
employed = today-arr.dt.year |
12 |
emp_info['EMPLOYED']=employed |
13 |
dd = {'s':emp_info['EMPLOYED']} |
14 |
group_cond = [] |
15 |
for n in range(len(employed_str_list)): |
16 |
emp_g = emp_info.groupby(eval_g(dd,employed_str_list[n])) |
17 |
emp_g_index = [index for index in emp_g.size().index] |
18 |
if True not in emp_g_index: |
19 |
female_emp=0 |
20 |
male_emp=0 |
21 |
else: |
22 |
group = emp_g.get_group(True) |
23 |
sum_emp = len(group) |
24 |
female_emp = len(group[group['GENDER']=='F']) |
25 |
male_emp = sum_emp-female_emp |
26 |
group_cond.append([employed_list[n],male_emp,female_emp]) |
27 |
group_df = pd.DataFrame(group_cond,columns=['EMPLOYED','MALE','FEMALE']) |
28 |
print(group_df) |
Pandas does not have the function of repeated conditional grouping, so it can only regroup according to the conditions and get the groups that meet the conditions.
- esProc
A |
B |
|
1 |
?<5 |
//Within five years |
2 |
?>=5 && ?<10 |
//Five to ten years |
3 |
?>=10 |
//More than ten years |
4 |
?>=15 |
//Over fifteen years |
5 |
E:\\txt\\employee.txt |
|
6 |
=[A1:A4] |
=A6.concat@c() |
7 |
=file(A5).import@t() |
=A7.derive(age@y(HIREDATE):EMPLOYED) |
8 |
=B7.enum@r(A6,EMPLOYED) |
=[B1:B4] |
9 |
=A8.new(B8(#):EMPLOYED,~.count(GENDER=="M"):MALE,~.count(GENDER=="F"):FEMAL) |
esProc has powerful enumeration grouping function, which can easily realize repeated conditional grouping.
18. Log processing 1
- Task
Every three rows of records are a piece of log. Organize the log into a structured file.
- Python
1 |
import pandas as pd |
2 |
import numpy as np |
3 |
log_file = 'E://txt//access_log.txt' |
4 |
log_info = pd.read_csv(log_file,header=None) |
5 |
log_g=log_info.groupby(log_info.index//3) |
6 |
rec_list = [] |
7 |
for i,g in log_g: |
8 |
rec = g.values.reshape(1*3) |
9 |
rec[1] = rec[1].split(":")[-1].replace("#","") |
10 |
rec="\t".join(rec) |
11 |
rec = np.array(rec.split("\t")) |
12 |
rec = rec[[6,7,0,1,3,4,8,5]] |
13 |
rec_list.append(rec) |
14 |
rec_df = pd.DataFrame(rec_list,columns=["USERID","UNAME","IP","TIME","URL","BROWER","LOCATION","module"]) |
15 |
print(rec_df) |
- esProc
A |
|
1 |
E://txt//access_log.txt |
2 |
=file(A1).import@s() |
3 |
=A2.group((#-1)\3) |
4 |
=A3.(~.(_1).concat("\t").array("\t")) |
5 |
=A4.new(~(7):USERID,~(8):UNAME,~(1):IP,~(2):TIME,~(4):URL,~(5):BROWER,~(9):LOCATION,left(~(6).array("\:")(2),-1):module) |
With the mechanism of grouping by row number, you can process one group of data every time in loop, simplifying the difficulty.
19. Log processing 2
- Task
Each piece of log has indefinite rows, and each row with the same mark indicates that it is a piece of record.
- Python
1 |
import pandas as pd |
2 |
log_file = 'E://txt//Indefinite _info.txt' |
3 |
log_info = pd.read_csv(log_file,header=None) |
4 |
log_g = log_info.groupby(log_info[0].apply(lambda x:x.split("\t")[0]),sort=False) |
5 |
columns = ["userid","gender","age","salary","province","musicid","watch_time","time"] |
6 |
df_dic = {} |
7 |
for c in columns: |
8 |
df_dic[c]=[] |
9 |
for index,group in log_g: |
10 |
rec_dic = {} |
11 |
rec = group.values.flatten() |
12 |
rec = '\t'.join(rec).split("\t") |
13 |
for r in rec: |
14 |
v = r.split(":") |
15 |
rec_dic[v[0]]=v[1] |
16 |
for col in columns: |
17 |
if col not in rec_dic.keys(): |
18 |
df_dic[col].append(None) |
19 |
else: |
20 |
df_dic[col].append(rec_dic[col]) |
21 |
df = pd.DataFrame(df_dic) |
22 |
print(df) |
- esProc
A |
|
1 |
E://txt//Indefinite _info.txt |
2 |
=file(A1).import@s() |
3 |
[userid,gender,age,salary,province,musicid,watch_time,time] |
4 |
=A2.group@o(_1.array("\t")(1)) |
5 |
=A4.(~.(_1.array("\t")).conj().id().align(A3,~.array("\:")(1)).(~.array("\:")(2))).conj() |
6 |
=create(${A3.concat@c()}).record(A5) |
The merge grouping of esProc and the special alignment operation make the log processing very easy.
20. Log processing 3
- Task
Each piece of log has indefinite rows, and a fixed mark is at the beginning of each record.
- Python
1 |
import pandas as pd |
2 |
log_file = 'E://txt//Indefinite_info2.txt' |
3 |
log_info = pd.read_csv(log_file,header=None) |
4 |
group_cond = log_info[0].apply(lambda x:1 if x.split("\t")[0].split(":")[0]=="userid" else 0).cumsum() |
5 |
log_g = log_info.groupby(group_cond,sort=False) |
6 |
columns = ["userid","gender","age","salary","province","musicid","watch_time","time"] |
7 |
df_dic = {} |
8 |
for c in columns: |
9 |
df_dic[c]=[] |
10 |
for index,group in log_g: |
11 |
rec_dic = {} |
12 |
rec = group.values.flatten() |
13 |
rec = '\t'.join(rec).split("\t") |
14 |
for r in rec: |
15 |
v = r.split(":") |
16 |
rec_dic[v[0]]=v[1] |
17 |
for col in columns: |
18 |
if col not in rec_dic.keys(): |
19 |
df_dic[col].append(None) |
20 |
else: |
21 |
df_dic[col].append(rec_dic[col]) |
22 |
df = pd.DataFrame(df_dic) |
23 |
print(df) |
Pandas does not have the function of grouping by conditions. It needs to construct an array of grouping by conditions.
- esProc
A |
|
1 |
E://txt//Indefinite_info2.txt |
2 |
[userid,gender,age,salary,province,musicid,watch_time,time] |
3 |
=file(A1).import@s() |
4 |
=A3.group@i(_1.array("\t")(1).array("\:")(1)=="userid") |
5 |
=A4.(~.(_1.array("\t")).conj().align(A2,~.array("\:")(1)).(~.array("\:")(2))).conj() |
6 |
=create(${A2.concat@c()}).record(A5) |
esProc has powerful grouping function and loop computing ability, and its code is simple and clear.
21. Inverse grouping
- Task
List the details of installment loan: current payment amount, current interest, current principal and principal balance.
- Python
1 |
import numpy as np |
2 |
import pandas as pd |
3 |
loan_data = pd.read_csv('E:\\txt\\loan.csv',sep='\t') |
4 |
loan_data['mrate'] = loan_data['Rate']/(100*12) |
5 |
loan_data['mpayment'] = loan_data['LoanAmt']*loan_data['mrate']*np.power(1+loan_data['mrate'],loan_data['Term']) \ |
6 |
/(np.power(1+loan_data['mrate'],loan_data['Term'])-1) |
7 |
loan_term_list = [] |
8 |
for i in range(len(loan_data)): |
9 |
tm = loan_data.loc[i]['Term'] |
10 |
loanid = np.tile(loan_data.loc[i]['LoanID'],tm) |
11 |
loanamt = np.tile(loan_data.loc[i]['LoanAmt'],tm) |
12 |
term = np.arange(1,tm+1) |
13 |
rate = np.tile(loan_data.loc[i]['mrate'],tm) |
14 |
payment = np.tile(np.array(loan_data.loc[i]['mpayment']),loan_data.loc[i]['Term']) |
15 |
interest = np.zeros(len(loanamt)) |
16 |
principal = np.zeros(len(loanamt)) |
17 |
principalbalance = np.zeros(len(loanamt)) |
18 |
loan_amt = loanamt[0] |
19 |
for j in range(len(loanamt)): |
20 |
interest[j] = loan_amt*loan_data.loc[i]['mrate'] |
21 |
principal[j] = payment[j] - interest[j] |
22 |
principalbalance[j] = loan_amt - principal[j] |
23 |
loan_amt = principalbalance[j] |
24 |
loan_data_df = pd.DataFrame(np.transpose(np.array([loanid,loanamt,term,rate,payment,interest,principal,principalbalance])), columns = ['loanid','loanamt','term','mRate','payment','interest','principal','principalbalance']) |
25 |
|
loan_term_list.append(loan_data_df) |
|
26 |
loan_term_pay = pd.concat(loan_term_list,ignore_index=True) |
27 |
print(loan_term_pay) |
It's very troublesome for padas to deal with such inverse grouping.
- esProc
A |
||
1 |
E:\\txt\\loan.csv |
|
2 |
=file(A1).import@t() |
|
3 |
=A2.derive(Rate/100/12:mRate,LoanAmt*mRate*power((1+mRate),Term)/(power((1+mRate),Term)-1):mPayment) |
|
4 |
=A3.news((t=LoanAmt,Term);LoanID, LoanAmt, mPayment:payment, to(Term)(#):Term, mRate, t* mRate:interest, payment-interest:principal, t=t-principal:principlebalance) |
Using the function of inverse grouping, it is easy to solve the problem of inverse grouping.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/2bkGwqTj
Youtube 👉 https://www.youtube.com/@esProc_SPL