An Easy Way to Merge and Summarize Excel Files

 

1. Background

  Do you often need to merge Excel worksheets with same headers for further analysis? If the answer is yes, then I figured that you find it an annoyance to do the combination, particularly when there are many worksheets or files. Even Excel VBA often fails if there is a huge amount of data. In this case, a professional stand-alone data computing tool will be helpful. esProc is one of its kind, an outstanding one. There will be no need to write complex and inefficient VBA code, but a few lines of SPL (abbreviation of Structured Process Language with innovative syntax that esProc uses) code will suffice instead. With esProc at hand, merging Excel worksheets will no longer be a nightmare.

2. Basic merges

A.Merge multiple sheets of one Excel file

  Below is an Excel file containing sales data. There are 3 worksheets of same structure in the file.

january_2013:

Customer ID Customer Name Invoice Number Sale Amount Purchase Date
1234 John Smith 100-0002 $1,200.00 2013/1/1
2345 Mary Harrison 100-0003 $1,425.00 2013/1/6
3456 Lucy Gomez 100-0004 $1,390.00 2013/1/11
4567 Rupert Jones 100-0005 $1,257.00 2013/1/18
5678 Jenny Walters 100-0006 $1,725.00 2013/1/24
6789 Samantha Donaldson 100-0007 $1,995.00 2013/1/31

february_2013:

Customer ID Customer Name Invoice Number Sale Amount Purchase Date
9876 Daniel Farber 100-0008 $1,115.00 2013/2/2
8765 Laney Stone 100-0009 $1,367.00 2013/2/8
7654 Roger Lipney 100-0010 $2,135.00 2013/2/15
6543 Thomas Haines 100-0011 $1,346.00 2013/2/17
5432 Anushka Vaz 100-0012 $1,560.00 2013/2/21
4321 Harriet Cooper 100-0013 $1,852.00 2013/2/25

march_2013:

Customer ID Customer Name Invoice Number Sale Amount Purchase Date
6634 Poop Smith 100-0014 $1,350.00 2013/3/4
8765 Tony Song 100-0015 $1,167.00 2013/3/8
2345 Mary Harrison 100-0016 $1,789.00 2013/3/17
6543 Rachel Paz 100-0017 $2,042.00 2013/3/22
3456 Lucy Gomez 100-0018 $1,511.00 2013/3/28
4321 Susan Wallace 100-0019 $2,280.00 2013/3/30


  The merging requires getting two fields, Customer Name & Sales Amount, from each worksheet. Below is the final merging result:

Customer Name Sale Amount
John Smith 1200
Mary Harrison 1425
Lucy Gomez 1390
Rupert Jones 1257
Jenny Walters 1725
....... ........
Susan Wallace 2280


esProc SPL script:

A
1 =file("D:/sales_2013.xlsx").xlsopen()
2 =A1.conj(A1.xlsimport@t('Customer Name','Sale Amount';~.stname))
3 >file("D:/result_2013.xlsx"). xlsexport@t(A2;"merge_sheets")

Explanation:
  A1: Open the Excel file and generate a sequence of 3 worksheets.
  A2: conj() function traverses the member worksheets in A1’s sequence to import the specified fields ‘Customer Name’ and ‘Sale Amount’ and merges them. xlsimport() function imports the fields, where a semicolon follows the last field and parameter ~.stname represents the current worksheet. @t option specifies that the first row in each worksheet will be the field names. The function is nested within conj() function’s loop traversal to import data from every worksheet.
  A3: Save A2’s table sequence as a new worksheet in the original Excel file with the name “merge_sheets”. @t option is used to make the first row the column headers.

  Only three lines of code, the script is short, concise yet clear and easy to understand.

B.Merge multiple sheets from different Excel files

  Here are several Excel files in which every worksheet containing a year’s sales data has the same structure as the worksheets in the previous instance:
001png

esProc SPL script:

A B
1 for directory@p("d:/excel/*.xlsx") =file(A1).xlsopen()
2 =B1.conj(B1.xlsimport@t('Customer Name','Sale Amount','Purchase Date';~.stname))
3 =@|B2
4 > file("d:/result.xlsx"). xlsexport@t(B3;"merge_data")

The desired merging effect:

Customer Name Sale Amount Purchase Date
John Smith 1200 2013-01-01
Mary Harrison 1425 2013-01-06
Lucy Gomez 1390 2013-01-11
Rupert Jones 1257 2013-01-18
...... ...... ......
Thomas Haines 1346 2013-02-17


Explanation:
  A1: The for loop traverses every Excel file under the given directory to perform a series of operations over every worksheet in B1~B3.
  B1: Open an Excel file and generate a sequence.
  B2: Import the specified fields, ‘Customer Name’, ‘Sale Amount’ and ‘Purchase Date’, from every worksheet of the current file and merge them. This is similar to the operations in A2 in the previous instance.
  B3: Combine B2’s table sequence with the current value of B3.
  A4: Save B3’s table sequence as a worksheet, named merge_data, in file result.xlsx.

  The program achieves multiple Excel file merging with only two rounds loop. The outer loop traverses each of the Excel files under the given directory while the inner loop B1.conj() function merges data from every worksheet in an Excel file.

C.File merging with big result set

  Both A2 in the first instance and B3 in the second instance receive all merged data in the memory and then write it out at once. But a big merged result set can use up too much memory space or the available memory space won’t be enough to hold the data. In that case, stream-style appending can be used to generate a big file.

esProc SPL script:

A B
1 =file("D:/out.xlsx")
2 for directory@p("d:/excel/*.xlsx") =file(A2).xlsopen()
3 =if(A1.exists(),B2.xlsimport@t(),B2.xlsimport())
4 >A1.xlsexport@s(B3;"merger")

The desired merging effect:

Customer ID Customer Name Invoice Number Sale Amount Purchase Date
1234 John Smith 100-0002 1200 2013-01-01
2345 Mary Harrison 100-0003 1425 2013-01-06
3456 Lucy Gomez 100-0004 1390 2013-01-11
4567 Rupert Jones 100-0005 1257 2013-01-18
...... ...... ...... ...... ......
6789 Thomas Haines 100-0002 1346 2013-02-17


Explanation:
  A1: Open a specified file into which data will be output.
  A2: Traverse all to-be-merged Excel files under the given directory.
  B2: Open a to-be-merged Excel file.
  B3: If the file into which data will be output exists, import all rows, including the column headers, of a worksheet; if the file exists, use @t option to make the first row the column headers and import data from the second row.
  B4: Import data in a stream style in B3 and append it to the worksheet named merger in A1’s specified file.

  The stream-style importing and appending is suited to combine many small excel files into a big one.

3. Grouping & aggregation

  We continue to use the previous Excel file containing the sales data maker further analysis.

A. Group data by one or more fields

  Here’s an eProc SPL script for grouping data by one or multiple fields:

A
1 =file("D:/sales_2013.xlsx").xlsopen()
2 =A1.conj(A1.xlsimport@t(;~.stname))
3 =A2.groups('Customer ID';sum('Sale Amount'):Total,avg('Sale Amount'):Average)
4 =A2.groups('Customer ID','Purchase Date';sum('Sale Amount'):Total)


A3’s grouping result:

Customer ID Total Average
1234 2550 1275.0
2345 3214 1607.0
3456 2901 1450.5
4321 4132 2066.0
…… …… ……
4567 1257 1257.0


A4’s grouping result:

Customer ID Purchase Date Total
1234 2013-01-01 1200
1234 2013-03-04 1350
2345 2013-01-11 1425
2345 2013-03-17 1789
…… …… ……
9876 2013-02-02 1115


Explanation:
  A1: Open the specified Excel file.
  A2: Import all data from the worksheets in the file and perform merging.
  A3: Group the merged data by ‘Customer ID’ and calculate total and average for each group.
  A4: Group the merged data by ‘Customer ID’ and ‘Purchase Date’, and calculate total for each group.

B. Group data in a certain order

  esProc supports comparing neighboring records during grouping and aggregation. If the data is already ordered, then a re-sorting will be unnecessary. Now suppose the original data is ordered by date, but we need to summarize data by month.

esProc SPL script:

A B
1 for directory@p("d:/excel/*.xlsx") =file(A1).xlsopen()
2 =B1.conj(B1.xlsimport@t(;~.stname))
3 =@|B2
4 =B3.derive(year('Purchase Date'):Year,month('Purchase Date'):Month)
5 =A4.groups (month('Purchase Date'):Month;sum('Sale Amount'):Total,avg('Sale Amount'):Average)
6 =A4.groups@o (month('Purchase Date'):Month;sum('Sale Amount'):Total,avg('Sale Amount'):Average)


A5’s grouping result:

Month Total Average
1 272414 15134.111111111111
2 168038 9335.444444444445
3 357693 19871.833333333332


A6’s grouping result:

Month Total Average
1 8992 1498.6666666666667
2 9375 1562.5
3 10139 1689.8333333333333
1 260221 43370.166666666664
2 103656 17276.0
3 101509 16918.166666666668
1 3201 533.5
2 55007 9167.833333333334
3 246045 41007.5


Explanation:
  A1~B3: Merge all worksheets in the Excel files of same structure under the specified directory.
  A4: Generate a new table sequence base on B3’s table sequence by splitting the Purchase Date field into a new Year field and a new Month field.
  A5: groups() function groups data by month and calculates total and average for each group.
  A6: groups@o groups data in each year by month, calculates total and average for each group, and, with @o present, merges the summarized results.

  A4 contains detailed sales records; A5 summarizes data only by month, ignoring the year; A6 summarizes data by both year and month. The three cells shows summarized data of different levels.

C. Group data by segment

  This type of grouping divides data into multiple segments according to specified conditions and summarizes each group of data.

esProc SPL script:

A B
1 for directory@p("d:/excel/*.xlsx") =file(A1).xlsopen()
2 =B1.conj(B1.xlsimport@t(;~.stname))
3 =@|B2
4 =B3.groups(if ('Sale Amount'<1000,"1::<1000",
if ('Sale Amount'<1500,"2::1000~~1500",
if ('Sale Amount'<2000,"3::1500~~2000",
if ('Sale Amount'<2500,"4::2000~~2500",
"5::>=2500")))):Segment; count(1):Number,sum('Sale Amount'):Total)

Grouping result:

Segment Number Total
1::<1000 22 8280
2::1000~~1500 9 11617
3::1500~~2000 6 10432
4::2000~~2500 4 8810
5::>=2500 13 759006


Explanation:
  About the code block A1~B3, refer to the explanations of previous instances.
  A4: It defines 5 intervals for the value of ‘Sale Amount’ field and calculates number and total for within each range.

  But the above script is still not convenient-to-use enough. When we want to change the way of dividing the data, we need to modify the parameter expressions in groups() function. This is a little complicated. Another function, pseg(), helps to achieve an easier modification. Here’s the SPL code:

A B
1 [0,1000,1500,2000,2500]
2 for directory@p("d:/excel/*.xlsx") =file(A1).xlsopen()
3 =B1.conj(B1.xlsimport@t(;~.stname))
4 =@|B2
5 =B4.groups(A1.pseg(~.'Sale Amount'):Segment; count(1):Number,sum('Sale Amount'):Total)

  We can also group data by multiple fields according to a set of conditions and then summarize each group. For example, we divide the scores of subjects into 5 grades – Excellent, Good, Satisfactory, Poor, Passing, and summarize each group. For more uses of groups() function, refer to esProc Function Reference.

D. Group big data

  In the above instances, the Excel file(s) is/are fairly small to be imported into the memory at a time. If the total data amount is huge, trying to retrieve all data into the memory in either esProc or VBA could fail. An alternative is the stream retrieval that reads a part of data into the memory and summarizes it at one time.

esProc SPL script:

A B
1 =file("d:/tdata.xlsx").xlsopen@r()
2 for A1.count() =A1.xlsimport@ct(;A1(A2). stname)
3 =@|B2
4 = B3.conjx() =A4.groups('Customer ID';sum('Sale Amount'):SaleTotal)
5 >file(“d:/out.xlsx”).exportxls@bt(B4;"Customer&Sales")


The result of filtering and then grouping data:

Customer ID SaleTotal
1234 107721792
2345 139041639
3456 137985543
4321 96170742
... ...
9876 37590417


Explanation:
  A1: Use @r option to open an Excel file in a stream style.
  A2: Traverse every worksheet in the Excel file.
  B2: Use @c option to import data with cursor.
  B3: Union B2’s cursor into B3.
  A4: Union members of B3’s cursor sequence into a new cursor.
  B4: Group A4’s sequence by Customer ID and calculate Sale Amount for each group.
  A5: Export and save B4’s result set into an Excel file.

  Each time, the stream retrieval reads a part of data with cursor and groups and summarizes it.

4.Deduplication

  Duplicate data affects the analysis results. As duplication is common in real-world cases, let’s look at how to remove duplicates in esProc SPL.

A. Via primary key

  Set ‘Invoice Number’ as the primary key of sales_2013 to remove duplicates according to it.

A
1 =file("d:/sales_2013.xlsx").xlsopen()
2 =A1.conj(A1.xlsimport@t('Customer Name', 'Invoice Number', 'Sale Amount';~. stname))
3 =A2.group@1('Invoice Number')
4 >file(“d:/out.xlsx”). xlsexport@t(A3;"result")


Concatenated deduplicated data:

Customer Name Invoice Number Sale Amount
John Smith 100-0002 1200
Mary Harrison 100-0003 1425
Lucy Gomez 100-0004 1390
Rupert Jones 100-0005 1257
Jenny Walters 100-0006 1725
…… …… ……
Susan Wallace 100-0019 2280


Explanation:
  A1: Open a certain Excel file.
  A2: Import specified columns from every worksheet.
  A3: Group A2’s table sequence by primary key ‘Invoice Number’ and deduplicate each group.
  A4: Export and save A3’s result set into an Excel file.

  Data is distinct in each worksheet, but probably not once the worksheets are concatenated. Primary key is one of the ways of deduplicating data.

B. Via a certain field

  Deduplicate data in sales_2013 according to a certain field to look through employee records of different names.

A
1 =file("d:/sales_2013.xlsx").xlsopen()
2 =A1.conj(A1.xlsimport@t('Customer ID', 'Customer Name';~. stname))
3 =A2.id('Customer Name')
4 =A2.group@1('Customer Name')
5 >file("d:/out.xlsx"). xlsexport@t(A4;"result")

Explanation:
  A1: Open a certain Excel file.
  A2: Import specified columns from every worksheet.
  A3: Get a sequence of distinct names from A2’s table sequence.
  A4: Get a table sequence of distinct names from A2’s table sequence.
  A5: Export and save data of A4’s table sequence into an Excel file, with the first row as the column headers.

A3’s deduplication result:

Member
Anushka Vaz
Daniel Farber
Harriet Cooper
……
Tony Song


A4’s deduplication result:

Customer ID Customer Name
5432 Anushka Vaz
9876 Daniel Farber
4321 Harriet Cooper
…… ……
8765 Tony Song

C. Via multiple fields

  Sometimes we need several other fields, instead of the primary key, to determine whether records are duplicates or not.

A
1 =file("d:/sales_2013.xlsx").importxls@t()
2 =file("d:/sales_2014.xlsx").importxls@t()
3 =[A1,A2].merge('Customer ID', 'Purchase Date')
4 =A3.group@1('Customer ID', 'Purchase Date')
5 >file("d:/out.xlsx"). xlsexport@t(A4;"result")


Explanation:
  A1: Import a specified Excel file.
  A2: Import another specified Excel file.
  A3: Merge A1’s table sequence and A2’s table sequence by fields ‘Customer ID’ and ‘Purchase Date’ and return a new table sequence.
  A4: Group and deduplicate A3’s table sequence by ‘Customer ID’ and ‘Purchase Date’.
  A5: Export and save 43’s result set into an Excel file.
  We can perform the merge and grouping and deduplication by more fields as needed.

D. Deduplicate records

  Records in each file are distinct, but the concatenated file could contain duplicate records.

A B
1 =file("d:/sales_2013.xlsx").importxls@t() =A1.group@1('Invoice Number')
2 =file("d:/sales_2014.xlsx").importxls@t() =A2.group@1('Invoice Number')
3 =[B1,B2].merge@u() =A3.count()


Explanation:
  A1: Import a specified Excel file.
  B1: Deduplicate A1’s table sequence by ‘Invoice Number’.
  A2&B2: Same as A1&B1.
  A3: Merge B1’s table sequence and B2’s table sequence, deduplicate data, and return a table sequence. @u option is used to merge multiple table sequences in order, remove duplicates, and generate a new table sequence.
  B3: Count the records in the merged table sequence. merge@u() function merges multiple ordered, deduplicated table sequences.

  In the above, we explain how to handle merging, grouping and summarization, and deduplication over same-structure Excel files in esProc SPL. For Excel files of different structures, we just need to read to-be-merged fields as a sequence object and then process it in same ways. It is similar to handle other types of text files.

5. Appendix

slaes.rar