A typical comparison between esProc and Python

  Pandas is a Python library for processing structured data that makes data analysis and processing convenient, so we mainly compare esProc with pandas.

1. Search and statistics

  • Task

  Find the files in a directory whose text contains a specified word, and list the matching lines together with their line numbers.

  • Python

import glob

list1 = []
for filename in glob.glob('D:/*.txt'):
    file = open(filename, 'r')
    for num, value in enumerate(file):
        if value.find('XXX') >= 0:
            list1.append(filename + ' ' + str(num) + ' ' + value)
    file.close()

  It can also be written with apply(lambda ...), but that form is not as clear and understandable as the for loop.
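For reference, the same scan can be condensed into a comprehension; this is only a sketch (the 'XXX' keyword and the glob pattern are placeholders, as above):

```python
import glob

def grep_files(pattern, keyword):
    # One matching entry per line: "filename line_number line_text"
    return [
        f"{filename} {num} {line.rstrip()}"
        for filename in glob.glob(pattern)
        for num, line in enumerate(open(filename))
        if keyword in line
    ]
```

As noted, the explicit for loop is usually easier to read than this compressed form.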

  • esProc

A1 =directory@p("*.txt")
A2 =A1.conj(file(~).read@n().(if(pos(~,"xxx"),[A1.~,#,~].string())).select(~))

  esProc provides file-system traversal functions; combined with its text-computing ability, the task is done in two lines of code.

2. WordCount

  • Task

  Count the number of words in a file.

  • Python

import pandas as pd

data = pd.read_csv(file, sep="\n", header=None)   # file: path to the text
tmp = []
data.iloc[:, 0].apply(lambda x: tmp.extend(x.split()))
pd.Series(1, tmp).groupby(level=0).count()

  This only splits the contents on whitespace; removing punctuation that has no statistical meaning would be much more troublesome.
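If punctuation does need to be stripped, a regular expression is one workable route; a minimal sketch (the token pattern is an assumption, not from the original):

```python
import re
from collections import Counter

def word_count(text):
    # Keep only alphabetic runs, so punctuation never becomes a "word"
    return Counter(re.findall(r"[a-z]+", text.lower()))

counts = word_count("Hello, world! Hello... world? end.")
```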

  • esProc

A1 =lower(file("T.txt").read()).words().groups(~:word;count(1):count)

  esProc provides the words() function to split a string into words, so a one-liner completes the task.

3. General operation

  • Task

  From a text file, find the males aged 25 and over and the females aged 23 and over, then: 1) list them in order of name; 2) group by gender and calculate the average age; 3) list all the surnames that appear (ignoring compound surnames).

  • Python

import pandas as pd

file = "D.csv"
data = pd.read_csv(file)
data_select = data[((data['sex']=='male') & (data['age']>=25))
                   | ((data['sex']=='female') & (data['age']>=23))].copy()
data_sort = data_select.sort_values('name')
data_group = data_select.groupby('sex')['age'].mean()
data_select['sur'] = data_select['name'].apply(lambda x: x[0])
data_distinct = data_select.drop_duplicates(['sur'])

  Pandas has to build an extra column before it can deduplicate by surname.

  • esProc

A1 =file("D.csv").import@t(name,sex,age;",")
A2 =A1.select(sex=="Male"&&age>=25||sex=="Female"&&age>=23)   //Filter
A3 =A2.sort(name)   //Sort
A4 =A2.groups(sex;avg(age):age)   //Group and aggregate
A5 =A2.id(left(name,1))   //Unique values

  esProc provides rich structured-computing functions. To some extent it can operate on a text file as if it were a database table, obtaining SQL-like computing power without a database.

  When the data volume is large, pandas can only read the data in chunks and then calculate and merge them, so the amount of code and computation increases sharply.
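For illustration, chunked filtering in pandas might look like the following sketch (an in-memory CSV stands in for a large D.csv; column names as above):

```python
import io
import pandas as pd

# In-memory stand-in for a large D.csv (assumed columns: name, sex, age)
csv_data = io.StringIO(
    "name,sex,age\nAlice,female,24\nBob,male,26\nCarol,female,22\nDave,male,24\n"
)

parts = []
# chunksize streams the file in pieces instead of loading it all at once
for chunk in pd.read_csv(csv_data, chunksize=2):
    cond = ((chunk['sex'] == 'male') & (chunk['age'] >= 25)) | \
           ((chunk['sex'] == 'female') & (chunk['age'] >= 23))
    parts.append(chunk[cond])

result = pd.concat(parts, ignore_index=True)
print(result)
```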

  When the data volume is large, these calculations can be done based on cursors:


A1 =file("D.csv").cursor@tm(name,sex,age;",")
A2 =A1.select(sex=="Male"&&age>=25||sex=="Female"&&age>=23)   //Filter
A3 =A2.sortx(name)   //Sort
A4 =A2.groups(sex;avg(age):age)   //Group and aggregation
A5 =A2.groupx(left(name,1);)   //Unique value
A6 =A3.fetch(…)   //Fetch result

  Unlike in-memory calculation, a cursor can only be traversed once, so only one of the sorting and grouping operations above can be executed at a time; the cursor must be rebuilt before executing another.

4. Divide columns

  • Task

  List, in side-by-side columns, the employees of each department whose salaries are more than 8000 (ranked from high to low).

  • Python

import pandas as pd

file = "E:/txt/EMPLOYEE.txt"
data = pd.read_csv(file, sep='\t')
sales_df = data[data['DEPT']=='Sales'].sort_values('SALARY', ascending=False).reset_index(drop=True)
sales_df['FULLNAME'] = sales_df['NAME'] + ' ' + sales_df['SURNAME']
sales_df = sales_df[['FULLNAME','SALARY']]
hr_df = data[data['DEPT']=='HR'].sort_values('SALARY', ascending=False).reset_index(drop=True)
hr_df['FULLNAME'] = hr_df['NAME'] + ' ' + hr_df['SURNAME']
hr_df = hr_df[['FULLNAME','SALARY']]
data = pd.concat([sales_df, hr_df], axis=1)
print(data)

  When laying out columns side by side in pandas, the index must be reset; otherwise missing values appear.

  • esProc

A1 =connect("mysql")   //Connect to database
A2 =A1.query("select * from EMPLOYEE order by SALARY")   //Sort by salary
A3 =A2.select(DEPT=="Sales")   //Sales department data
A4 =A2.select(DEPT=="HR")   //HR department data
A5 =create(Sales,SALARY,HR,SALARY)   //Create new table
A6 =A5.paste(A3.(NAME+" "+SURNAME),A3.(SALARY),A4.(NAME+" "+SURNAME),A4.(SALARY))   //Fill in data

  esProc can easily create a new table, fill in the data, and lay it out in columns.

5. Filter according to cumulative value

  • Task

  Find the top n customers whose sales account for half of the total sales, and rank them by sales in descending order.

  • Python

import pandas as pd

sale_file = "E:\\txt\\sales_volume.csv"
sale_info = pd.read_csv(sale_file)
sale_info.sort_values(by='Amount', inplace=True, ascending=False)
half_amount = sale_info['Amount'].sum() / 2
vip_list = []
amount = 0
for client_info in sale_info.itertuples():
    amount += getattr(client_info, 'Amount')
    if amount < half_amount:
        vip_list.append(getattr(client_info, 'Client'))
    else:
        vip_list.append(getattr(client_info, 'Client'))
        break
print(vip_list)

  Pandas does not have a built-in loop function to find the position that meets a condition, so a for loop is needed to complete the calculation.
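That said, the loop can be replaced by a cumulative sum if desired; a sketch with made-up Client/Amount data (whether it is clearer is debatable):

```python
import pandas as pd

sale_info = pd.DataFrame({
    'Client': ['A', 'B', 'C', 'D'],
    'Amount': [500, 300, 150, 50],
})
sale_info = sale_info.sort_values('Amount', ascending=False).reset_index(drop=True)
half_amount = sale_info['Amount'].sum() / 2
# position of the first row whose running total reaches half of the total
pos = int(sale_info['Amount'].cumsum().searchsorted(half_amount))
vip_list = sale_info.loc[:pos, 'Client'].tolist()
print(vip_list)
```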

  • esProc

A1 E:\\txt\\sales_volume.csv   //Data storage path
A2 =file(A1).import@tc().sort@z(Amount)   //Import data and sort by amount in descending order
A3 =A2.sum(Amount)/2   //Calculate half of total sales
A4 =A2.pselect(cum(Amount)>=A3)   //Find the position where the cumulative sales reaches half of the total
A5 =A2.(Client).m(:A4)   //Take the client names up to that position

  In esProc it is recommended to use loop functions, filtering and recording position information according to the conditions.

6. Calculate ratio compared with the previous period for the selected records

  • Task

  Find the trading information and the increase rate of a stock on the three trading days with the highest share price.

  • Python

import pandas as pd

stock1001_file = "E:\\txt\\stock1001_price.txt"
stock1001 = pd.read_csv(stock1001_file, sep='\t')
CL = stock1001['CL']
CL_psort = CL.argsort()[::-1].iloc[:3].values
CL_psort_shift1 = CL_psort - 1
CL_rise = CL[CL_psort].values / CL[CL_psort_shift1].values - 1
max_3 = stock1001.loc[CL_psort].reset_index(drop=True)
max_3['RISE'] = CL_rise
print(max_3)

  Python cannot use positions directly to calculate the ratio against the previous period; it has to construct a position sequence for the previous period and then calculate.
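A somewhat shorter pandas route is to precompute the day-over-day ratio with pct_change before selecting the top closes; a sketch with made-up CL data:

```python
import pandas as pd

stock = pd.DataFrame({'CL': [10.0, 11.0, 10.5, 12.0, 11.5]})
# ratio against the previous trading day, computed for every row up front
stock['RISE'] = stock['CL'].pct_change()
# the three days with the highest closing price, rates already attached
max_3 = stock.nlargest(3, 'CL').reset_index(drop=True)
print(max_3)
```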

  • esProc

A


1

=Stocks.sort(TradingDate)


2

=A1.psort(ClosePrice:-1)([1,2,3])

//The position of   three days with highest share price

3

=A1.calc(A2,   ClosePrice/ClosePrice[-1]-1)

//Calculate the   increase rate for the three days

4

=A1(A2).new(TradingDate,ClosePrice,A3(#):IncreaseRate)


  esProc is very good at calculations related to order and position, and the code is simple and in line with natural thinking.

7. Foreign key join 1

  • Task

  Find the married couples among employees whose combined salary is more than 10000.

  • Python

import pandas as pd

emp_file = "E:\\txt\\Employees.txt"
rel_file = "E:\\txt\\EmpRel.txt"
emp_info = pd.read_csv(emp_file, sep='\t')
rel_info = pd.read_csv(rel_file, sep='\t')
rel_info = rel_info[rel_info['Relationship']=="Spouse"]
emp1_join = pd.merge(rel_info, emp_info, left_on='Emp1', right_on='ID')
emp2_join = pd.merge(emp1_join, emp_info, left_on='Emp2', right_on='ID')
res = emp2_join[emp2_join['BasePay_x'] + emp2_join['BasePay_y'] >= 10000]
r = res[['Name_x','Name_y']]
print(r)

  Pandas has to join twice to get the two columns of information and then filter.

  • esProc

A1 =file("E:\\txt\\Employees.txt").import@t().keys(ID)
A2 =file("E:\\txt\\EmpRel.txt").import@t()
A3 =A2.select(Relationship=="Spouse")
A4 >A3.switch(Emp1,A1;Emp2,A1)
A5 =A3.select(Emp1.BasePay+Emp2.BasePay>=10000)
A6 >A5.run(Emp1=Emp1.Name,Emp2=Emp2.Name)

  Through foreign key objectification, esProc converts the foreign key field into a reference to the corresponding record in the foreign key table, so the data can be treated as a single table.

8. Foreign key join 2

  • Task

  Find out the department with the youngest department manager.

  • Python

import pandas as pd
import datetime

emp_file = "E:/txt/EMPLOYEE.txt"
manager_file = "E:/txt/DEPARTMENT.txt"
emp_info = pd.read_csv(emp_file, sep='\t')
manager_info = pd.read_csv(manager_file, sep='\t')
manager_join = pd.merge(manager_info, emp_info, left_on='MANAGER', right_on='EID')
today = datetime.datetime.today().year
arr = pd.to_datetime(manager_join['BIRTHDAY'])
manager_join['AGE'] = today - arr.dt.year
min_age = manager_join['AGE'].min()
young = manager_join[manager_join['AGE']==min_age]['DEPT']
print(young)

  • esProc

A1 =file("E:/txt/EMPLOYEE.txt").import@t()
A2 =file("E:/txt/DEPARTMENT.txt").import@t()
A3 =A2.join(MANAGER,A1:EID,~:manager)
A4 =A3.minp(manager.(age(BIRTHDAY))).manager.DEPT

  This is another application of foreign key objectification.

9. Sorting and filtering of grouped subsets

  • Task

  Find the sales clerks whose sales were within the top 8 for every month of 1995.

  • Python

import pandas as pd

sale_file = 'E:\\txt\\SalesRecord.txt'
sale_info = pd.read_csv(sale_file, sep='\t')
sale_info['month'] = pd.to_datetime(sale_info['sale_date']).dt.month
sale_group = sale_info.groupby(by=['clerk_name','month'], as_index=False).sum()
sale_group_month = sale_group.groupby(by='month')
set_name = set(sale_info['clerk_name'])
for index, sale_g_m in sale_group_month:
    sale_g_m = sale_g_m.sort_values(by='sale_amt', ascending=False)
    sale_g_max_8 = sale_g_m.iloc[:8]
    sale_g_max_8_name = sale_g_max_8['clerk_name']
    set_name = set_name.intersection(set(sale_g_max_8_name))
print(set_name)

  • esProc

A1 E:\\txt\\SalesRecord.txt
A2 =file(A1).import@t()
A3 =A2.groups(clerk_name:name,month(sale_date):month;sum(sale_amt):amount)
A4 =A3.group(month)
A5 =A4.(~.sort(-amount).to(8))
A6 =A5.isect(~.(name))

  esProc retains grouped subsets, and it is very simple and convenient to sort and filter the subsets in the loop function.

10. Grouped subset cross row calculation and filter grouped results

  • Task

  Find the stocks that have risen by the daily limit (up 10%) on three consecutive trading days.

  • Python

import pandas as pd

def con_rise(stock: pd.DataFrame):
    rise_day_list = []
    rise_num = 0
    shift_1 = stock['CL'] / stock['CL'].shift(1) - 1 > 0.06
    for bl in shift_1:
        if bl == False:
            rise_num = 0
        else:
            rise_num += 1
        rise_day_list.append(rise_num)
    return max(rise_day_list)

stock_file = 'E:\\txt\\StockRecords.txt'
stock = pd.read_csv(stock_file, sep='\t')
stock_g = stock.groupby(by=['CODE'])
good_code = []
for index, group in stock_g:
    group = group.sort_values(by='DT')
    group = group.reset_index(drop=True)
    max_rise = con_rise(group)
    if max_rise >= 5:
        good_code.append(index)
print(good_code)

  • esProc

A1 E:\\txt\\StockRecords.txt
A2 =file(A1).import@t()
A3 =A2.group(CODE).(~.sort(DT))
A4 =A3.select(func(A5,~)>=5).(~.CODE)
A5 func
B6 =(rise=0,A5.(rise=if(CL/CL[-1]-1>=0.06,rise=if(!CL[-1],0,rise+1),0)))
B7 =max(B6)

  The step-by-step calculation in esProc is very clear: 1) group and sort; 2) calculate the maximum number of consecutive days with an increase of more than 0.06; 3) filter. The custom function can also be called inside loop functions.

11. Calculate ratio compared with the previous period for continuous intervals of grouped subsets

  • Task

  Calculate the longest consecutive rising days of each stock.

  • Python

import pandas as pd

def con_rise(stock: pd.DataFrame):
    rise_day_list = []
    rise_num = 0
    shift_1 = stock['CL'] > stock['CL'].shift(1)
    for bl in shift_1:
        if bl == False:
            rise_num = 0
        else:
            rise_num += 1
        rise_day_list.append(rise_num)
    return max(rise_day_list)

stock_file = "E:\\txt\\StockRecords.txt"
stock_info = pd.read_csv(stock_file, sep="\t")
stock_info.sort_values(by='DT', inplace=True)
stock_group = stock_info.groupby(by='CODE')
max_rise_list = []
for index, stock_g in stock_group:
    code = stock_g.iloc[0]['CODE']
    max_rise_list.append([code, con_rise(stock_g)])
max_rise_df = pd.DataFrame(max_rise_list, columns=['CODE','con_rise'])
print(max_rise_df)

  • esProc

A1 E:\\txt\\StockRecords.txt
A2 =file(A1).import@t()
A3 =A2.sort(DT)
A4 =A3.group(CODE)
A5 =A4.new(CODE,func(A6,~):con_rise)
A6 func
B7 =(num=0,A6.max(num=if(CL>CL[-1],if(#==1,0,num+1),0)))

  The idea for a single stock: compare each day with the previous one; if the price is higher than the previous day's, add 1, otherwise reset to 0; finally take the maximum of the sequence. The per-stock calculation is written as a function, and each stock's table is passed in as a parameter; esProc can easily call the function inside a loop function to get the result. With the same idea, the pandas code looks much more complex.
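For reference, the per-stock helper can also be written without an explicit loop, using a shift/cumsum trick; a standalone sketch with made-up closing prices:

```python
import pandas as pd

cl = pd.Series([10, 11, 12, 11, 12, 13, 14])
up = cl > cl.shift(1)              # True on days that rose
# (~up).cumsum() is constant within each rising run, so the grouped
# cumulative sum counts consecutive rising days and resets on a fall
runs = up.astype(int).groupby((~up).cumsum()).cumsum()
longest = int(runs.max())
print(longest)
```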

12. Grouped subsets iteration loop

  • Task

  Calculate how many months it takes each sales representative to reach a cumulative sales amount of 500,000.

  • Python

import pandas as pd

sale_file = "E:/txt/orders_i.csv"
sale_data = pd.read_csv(sale_file, sep='\t')
sale_g = sale_data.groupby('sellerid')
breach50_list = []
for index, group in sale_g:
    amount = 0
    group = group.sort_values('month')
    for row in group.itertuples():
        amount += getattr(row, 'amount')
        if amount >= 500000:
            breach50_list.append([index, getattr(row, 'month')])
            break
breach50_df = pd.DataFrame(breach50_list, columns=['sellerid','month'])
print(breach50_df)

  • esProc

A1 E:/txt/orders_i.csv
A2 =file(A1).import@t()
A3 =A2.group(sellerid;(~.iterate((x=month,~~+amount),0,~~>500000),x):breach50)


  esProc retains grouped subsets and uses the iterate() function to realize the iteration.
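The same iteration can be approximated in pandas with a grouped cumulative sum; a sketch with made-up sellerid/month/amount data:

```python
import pandas as pd

sales = pd.DataFrame({
    'sellerid': [1, 1, 1, 2, 2],
    'month':    [1, 2, 3, 1, 2],
    'amount':   [200000, 250000, 100000, 400000, 150000],
})
sales = sales.sort_values(['sellerid', 'month'])
sales['cum'] = sales.groupby('sellerid')['amount'].cumsum()
# first month in which each seller's running total reaches 500000
breach50 = sales[sales['cum'] >= 500000].groupby('sellerid')['month'].first()
print(breach50)
```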

13. Grouped subsets operations

  • Task

  Calculate the daily inventory status of various goods in the specified time period.

  • Python

import pandas as pd
import numpy as np

starttime = '2015-01-01'
endtime = '2015-12-31'
stock_data = pd.read_csv('E:\\txt\\stocklog.csv', sep='\t')
stock_data['DATE'] = pd.to_datetime(stock_data['DATE'])
stock_data = stock_data[(stock_data['DATE']>=starttime) & (stock_data['DATE']<=endtime)]
stock_data['ENTER'] = stock_data['QUANTITY'][stock_data['INDICATOR']!='ISSUE']
stock_data['ISSUE'] = stock_data['QUANTITY'][stock_data['INDICATOR']=='ISSUE']
stock_g = stock_data[['STOCKID','DATE','ENTER','ISSUE']].groupby(by=['STOCKID','DATE'], as_index=False).sum()
stock_gr = stock_g.groupby(by='STOCKID', as_index=False)
date_df = pd.DataFrame(pd.date_range(starttime, endtime), columns=['DATE'])
stock_status_list = []
for index, group in stock_gr:
    date_df['STOCKID'] = group['STOCKID'].values[0]
    stock_status = pd.merge(date_df, group, on=['STOCKID','DATE'], how='left')
    stock_status = stock_status.sort_values(['STOCKID','DATE'])
    stock_status['OPEN'] = 0
    stock_status['CLOSE'] = 0
    stock_status['TOTAL'] = 0
    stock_status = stock_status.fillna(0)
    stock_value = stock_status[['STOCKID','DATE','OPEN','ENTER','TOTAL','ISSUE','CLOSE']].values
    open = 0
    for value in stock_value:
        value[2] = open
        value[4] = value[2] + value[3]
        value[6] = value[4] - value[5]
        open = value[6]
    stock = pd.DataFrame(stock_value, columns=['STOCKID','DATE','OPEN','ENTER','TOTAL','ISSUE','CLOSE'])
    stock_status_list.append(stock)
stock_status = pd.concat(stock_status_list, ignore_index=True)
print(stock_status)

  • esProc

A1 =file("E:\\txt\\stocklog.csv").import@t()
A2 =A1.select(DATE>=date("2015-01-01") && DATE<=date("2015-12-31"))
A3 =A2.groups(STOCKID,DATE;sum(if(INDICATOR=="ISSUE",QUANTITY,0)):ISSUE,sum(if(INDICATOR!="ISSUE",QUANTITY,0)):ENTER)
A4 =periods(start,end)
A5 for A3.group(STOCKID)
B5 =A5.align(A4,DATE)
B6 >b=c=0
B7 =B5.new(A5.STOCKID:STOCKID,A4(#):DATE,c:OPEN,ENTER,(b=c+ENTER):TOTAL,ISSUE,(c=b-ISSUE):CLOSE)
B8 >B7.run(ENTER=ifn(ENTER,0),ISSUE=ifn(ISSUE,0))
B9 =@|B7

  esProc completes in 9 lines what takes Python 32 lines, and the esProc code is easier to understand when read later.

14. Sort grouped results

  • Task

  Find out the two departments with the largest number and the smallest number of employees.

  • Python

import pandas as pd

emp_file = 'E:\\txt\\employee.txt'
emp_info = pd.read_csv(emp_file, sep='\t')
emp_g = emp_info.groupby(by='DEPT')
size = emp_g.size().sort_values()
sorted_dept = size.index.values
print(sorted_dept[[0,-1]])

  • esProc

A1 E:\\txt\\employee.txt
A2 =file(A1).import@t()
A3 =A2.group(DEPT).sort(~.len()).m([1,-1]).(~.DEPT)

  With esProc, the grouping, sorting, and filtering are completed in one line.

15. Adjacent records grouping with the original order

  • Task

  List the team information with the most consecutive NBA titles.

  • Python

import pandas as pd
import numpy as np

pd.set_option('display.max_columns', None)
nba_file = 'E:\\txt\\nba.txt'
nba_champion = pd.read_csv(nba_file, sep='\t')
nba_champion = nba_champion.sort_values(by='Year')
arr = np.zeros(len(nba_champion))
arr[nba_champion['Champion']!=nba_champion['Champion'].shift(1)] = 1
arr = np.cumsum(arr)
nba_champion['flag'] = arr
nba_champion_g = nba_champion.groupby(by='flag')
max_num = nba_champion_g.size().idxmax()
max_champion = nba_champion_g.get_group(max_num)
print(max_champion)

  Python cannot group by adjacency directly; a list of grouping flags has to be constructed first.
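The flag construction above is the usual shift/cumsum idiom; a minimal standalone sketch with made-up champions:

```python
import pandas as pd

champs = pd.Series(['Bulls', 'Bulls', 'Rockets', 'Rockets', 'Rockets', 'Bulls'])
# a new group starts whenever the champion differs from the previous year
flag = (champs != champs.shift()).cumsum()
longest = champs[flag == champs.groupby(flag).size().idxmax()]
print(longest.tolist())
```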

  • esProc

A1 =connect("mysql")   //Connect to database
A2 =A1.query("select * from nba order by year")   //Sort by year
A3 =A2.group@o(Champion)   //Start a new group when the adjacent value differs
A4 =A3.maxp(~.len())   //The team with the most consecutive NBA titles

  esProc sets are ordered, so it is very convenient to merge adjacent identical records into one group and start a new group when the adjacent record differs.

16. Group by specified order

  • Task

  List, in the given order, the number and average age of female employees in the technology, production, sales, and HR departments.

  • Python

import pandas as pd
import datetime

emp_file = 'E:\\txt\\employee.txt'
dept_seq = ['Technology','Production','Sales','HR']
emp_info = pd.read_csv(emp_file, sep='\t')
emp_g = emp_info.groupby(by='DEPT')
emp_g_index = [index for index, group in emp_g]
today = datetime.datetime.today().year
dept_femal_list = []
for dept in dept_seq:
    if dept not in emp_g_index:
        dept_femal_num = 0
        age_a = None
    else:
        dept_emp = emp_g.get_group(dept)
        dept_emp_femal = dept_emp[dept_emp['GENDER']=='F']
        dept_femal_num = len(dept_emp_femal)
        arr = pd.to_datetime(dept_emp_femal['BIRTHDAY'])
        age_a = (today - arr.dt.year).mean()
    dept_femal_list.append([dept, dept_femal_num, age_a])
dept_femal_info = pd.DataFrame(dept_femal_list, columns=['DEPT','femal_num','femal_age'])
print(dept_femal_info)

  • esProc

A1 E:\\txt\\employee.txt
A2 [Technology,Production,Sales,HR]
A3 =file(A1).import@t()
A4 =A3.align@a(A2,DEPT)
A5 =A4.new(A2(#):DEPT,(F_EMP=~.select(GENDER:"F"),F_EMP.count()):Femal_num,F_EMP.avg(age@y(BIRTHDAY)):Femal_age)

  esProc handles this kind of calculation much more conveniently, based on its ordered sets and alignment operations.

17. Repeated conditional grouping

  • Task

  Divide employees into (possibly overlapping) groups according to their length of service in the company, and count the number of male and female employees in each group.

  • Python

import pandas as pd
import datetime

def eval_g(dd: dict, ss: str):
    return eval(ss, dd)

emp_file = 'E:\\txt\\employee.txt'
emp_info = pd.read_csv(emp_file, sep='\t')
employed_list = ['Within five years','Five to ten years','More than ten years','Over fifteen years']
employed_str_list = ["(s<5)","(s>=5) & (s<10)","(s>=10)","(s>=15)"]
today = datetime.datetime.today().year
arr = pd.to_datetime(emp_info['HIREDATE'])
employed = today - arr.dt.year
emp_info['EMPLOYED'] = employed
dd = {'s': emp_info['EMPLOYED']}
group_cond = []
for n in range(len(employed_str_list)):
    emp_g = emp_info.groupby(eval_g(dd, employed_str_list[n]))
    emp_g_index = [index for index in emp_g.size().index]
    if True not in emp_g_index:
        female_emp = 0
        male_emp = 0
    else:
        group = emp_g.get_group(True)
        sum_emp = len(group)
        female_emp = len(group[group['GENDER']=='F'])
        male_emp = sum_emp - female_emp
    group_cond.append([employed_list[n], male_emp, female_emp])
group_df = pd.DataFrame(group_cond, columns=['EMPLOYED','MALE','FEMALE'])
print(group_df)

  Pandas has no function for overlapping conditional grouping; it can only regroup by each condition in turn and take the group that meets it.
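Because the segments overlap, one plain pandas pattern is to evaluate each condition's boolean mask independently and count under it; a sketch with made-up EMPLOYED/GENDER data:

```python
import pandas as pd

emp = pd.DataFrame({'EMPLOYED': [3, 6, 12, 16], 'GENDER': ['M', 'F', 'M', 'F']})
# Overlapping segments: a row may satisfy more than one condition
conditions = {
    'Within five years': emp['EMPLOYED'] < 5,
    'Five to ten years': (emp['EMPLOYED'] >= 5) & (emp['EMPLOYED'] < 10),
    'More than ten years': emp['EMPLOYED'] >= 10,
    'Over fifteen years': emp['EMPLOYED'] >= 15,
}
rows = [
    [label, (mask & (emp['GENDER'] == 'M')).sum(), (mask & (emp['GENDER'] == 'F')).sum()]
    for label, mask in conditions.items()
]
result = pd.DataFrame(rows, columns=['EMPLOYED', 'MALE', 'FEMALE'])
print(result)
```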

  • esProc

A1 ?<5   //Within five years
A2 ?>=5 && ?<10   //Five to ten years
A3 ?>=10   //More than ten years
A4 ?>=15   //Over fifteen years
A5 E:\\txt\\employee.txt
A6 =[A1:A4]
B6 =A6.concat@c()
A7 =file(A5).import@t()
B7 =A7.derive(age@y(HIREDATE):EMPLOYED)
A8 =B7.enum@r(A6,EMPLOYED)
B8 =[B1:B4]
A9 =A8.new(B8(#):EMPLOYED,~.count(GENDER=="M"):MALE,~.count(GENDER=="F"):FEMAL)


  esProc has powerful enumeration grouping function, which can easily realize repeated conditional grouping.

18. Log processing 1

  • Task

  Every three rows form one log record. Organize the log into a structured file.

  • Python

import pandas as pd
import numpy as np

log_file = 'E://txt//access_log.txt'
log_info = pd.read_csv(log_file, header=None)
log_g = log_info.groupby(log_info.index//3)
rec_list = []
for i, g in log_g:
    rec = g.values.reshape(1*3)
    rec[1] = rec[1].split(":")[-1].replace("#","")
    rec = "\t".join(rec)
    rec = np.array(rec.split("\t"))
    rec = rec[[6,7,0,1,3,4,8,5]]
    rec_list.append(rec)
rec_df = pd.DataFrame(rec_list, columns=["USERID","UNAME","IP","TIME","URL","BROWER","LOCATION","module"])
print(rec_df)

  • esProc

A1 E://txt//access_log.txt
A2 =file(A1).import@s()
A3 =A2.group((#-1)\3)
A4 =A3.(~.(_1).concat("\t").array("\t"))
A5 =A4.new(~(7):USERID,~(8):UNAME,~(1):IP,~(2):TIME,~(4):URL,~(5):BROWER,~(9):LOCATION,left(~(6).array("\:")(2),-1):module)

  With grouping by row number, each loop iteration processes one group of rows, which simplifies the task.
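The row-number grouping in A3 corresponds to plain index slicing in Python; a tiny standalone sketch with made-up log lines:

```python
# Six raw lines standing in for two three-line log records
lines = ['ip1', 'time:mod1#', 'rest1', 'ip2', 'time:mod2#', 'rest2']
# take every three consecutive lines as one record
records = [lines[i:i + 3] for i in range(0, len(lines), 3)]
print(records)
```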

19. Log processing 2

  • Task

  Each log record spans an indefinite number of rows; rows carrying the same mark belong to the same record.

  • Python

import pandas as pd

log_file = 'E://txt//Indefinite_info.txt'
log_info = pd.read_csv(log_file, header=None)
log_g = log_info.groupby(log_info[0].apply(lambda x: x.split("\t")[0]), sort=False)
columns = ["userid","gender","age","salary","province","musicid","watch_time","time"]
df_dic = {}
for c in columns:
    df_dic[c] = []
for index, group in log_g:
    rec_dic = {}
    rec = group.values.flatten()
    rec = '\t'.join(rec).split("\t")
    for r in rec:
        v = r.split(":")
        rec_dic[v[0]] = v[1]
    for col in columns:
        if col not in rec_dic.keys():
            df_dic[col].append(None)
        else:
            df_dic[col].append(rec_dic[col])
df = pd.DataFrame(df_dic)
print(df)

  • esProc

A1 E://txt//Indefinite_info.txt
A2 =file(A1).import@s()
A3 [userid,gender,age,salary,province,musicid,watch_time,time]
A4 =A2.group@o(_1.array("\t")(1))
A5 =A4.(~.(_1.array("\t")).conj().id().align(A3,~.array("\:")(1)).(~.array("\:")(2))).conj()
A6 =create(${A3.concat@c()}).record(A5)

  The merge grouping of esProc and the special alignment operation make the log processing very easy.

20. Log processing 3

  • Task

  Each log record spans an indefinite number of rows, and a fixed mark appears at the beginning of each record.

  • Python

import pandas as pd

log_file = 'E://txt//Indefinite_info2.txt'
log_info = pd.read_csv(log_file, header=None)
group_cond = log_info[0].apply(lambda x: 1 if x.split("\t")[0].split(":")[0]=="userid" else 0).cumsum()
log_g = log_info.groupby(group_cond, sort=False)
columns = ["userid","gender","age","salary","province","musicid","watch_time","time"]
df_dic = {}
for c in columns:
    df_dic[c] = []
for index, group in log_g:
    rec_dic = {}
    rec = group.values.flatten()
    rec = '\t'.join(rec).split("\t")
    for r in rec:
        v = r.split(":")
        rec_dic[v[0]] = v[1]
    for col in columns:
        if col not in rec_dic.keys():
            df_dic[col].append(None)
        else:
            df_dic[col].append(rec_dic[col])
df = pd.DataFrame(df_dic)
print(df)

  Pandas cannot group by a condition directly; an array of grouping flags has to be constructed first.

  • esProc

A1 E://txt//Indefinite_info2.txt
A2 [userid,gender,age,salary,province,musicid,watch_time,time]
A3 =file(A1).import@s()
A4 =A3.group@i(_1.array("\t")(1).array("\:")(1)=="userid")
A5 =A4.(~.(_1.array("\t")).conj().align(A2,~.array("\:")(1)).(~.array("\:")(2))).conj()
A6 =create(${A2.concat@c()}).record(A5)

  esProc has powerful grouping function and loop computing ability, and its code is simple and clear.

21. Inverse grouping

  • Task

  List the details of installment loan: current payment amount, current interest, current principal and principal balance.

  • Python

import numpy as np
import pandas as pd

loan_data = pd.read_csv('E:\\txt\\loan.csv', sep='\t')
loan_data['mrate'] = loan_data['Rate']/(100*12)
loan_data['mpayment'] = loan_data['LoanAmt']*loan_data['mrate']*np.power(1+loan_data['mrate'],loan_data['Term']) \
    /(np.power(1+loan_data['mrate'],loan_data['Term'])-1)
loan_term_list = []
for i in range(len(loan_data)):
    tm = loan_data.loc[i]['Term']
    loanid = np.tile(loan_data.loc[i]['LoanID'], tm)
    loanamt = np.tile(loan_data.loc[i]['LoanAmt'], tm)
    term = np.arange(1, tm+1)
    rate = np.tile(loan_data.loc[i]['mrate'], tm)
    payment = np.tile(np.array(loan_data.loc[i]['mpayment']), tm)
    interest = np.zeros(len(loanamt))
    principal = np.zeros(len(loanamt))
    principalbalance = np.zeros(len(loanamt))
    loan_amt = loanamt[0]
    for j in range(len(loanamt)):
        interest[j] = loan_amt*loan_data.loc[i]['mrate']
        principal[j] = payment[j] - interest[j]
        principalbalance[j] = loan_amt - principal[j]
        loan_amt = principalbalance[j]
    loan_data_df = pd.DataFrame(
        np.transpose(np.array([loanid,loanamt,term,rate,payment,interest,principal,principalbalance])),
        columns=['loanid','loanamt','term','mRate','payment','interest','principal','principalbalance'])
    loan_term_list.append(loan_data_df)
loan_term_pay = pd.concat(loan_term_list, ignore_index=True)
print(loan_term_pay)

  It is very troublesome for pandas to deal with this kind of inverse grouping.

  • esProc

A1 E:\\txt\\loan.csv
A2 =file(A1).import@t()
A3 =A2.derive(Rate/100/12:mRate,LoanAmt*mRate*power((1+mRate),Term)/(power((1+mRate),Term)-1):mPayment)
A4 =A3.news((t=LoanAmt,Term);LoanID, LoanAmt, mPayment:payment, to(Term)(#):Term, mRate, t*mRate:interest, payment-interest:principal, t=t-principal:principlebalance)


  Using the inverse-grouping function news(), the problem is solved easily.