"1. Background Do you often need to merge Excel worksheets with same headers for further analys .."

pjf RaqForum 22 No.
1 Reply • 967 View • 5 Years ago

An Easy Way to Merge and Summarize Excel Files

1. Background

Do you often need to merge Excel worksheets with same headers for further analysis? If the answer is yes, then I figured that you find it an annoyance to do the combination, particularly when there are many worksheets or files. Even Excel VBA often fails if there is a huge amount of data. In this case, a professional stand-alone data computing tool will be helpful. esProc is one of its kind, an outstanding one. There will be no need to write complex and inefficient VBA code, but a few lines of SPL (abbreviation of Structured Process Language with innovative syntax that esProc uses) code will suffice instead. With esProc at hand, merging Excel worksheets will no longer be a nightmare.

2. Basic merges

A．Merge multiple sheets of one Excel file

Below is an Excel file containing sales data. There are 3 worksheets of same structure in the file.

january_2013:

Customer ID	Customer Name	Invoice Number	Sale Amount	Purchase Date
1234	John Smith	100-0002	$1,200.00	2013/1/1
2345	Mary Harrison	100-0003	$1,425.00	2013/1/6
3456	Lucy Gomez	100-0004	$1,390.00	2013/1/11
4567	Rupert Jones	100-0005	$1,257.00	2013/1/18
5678	Jenny Walters	100-0006	$1,725.00	2013/1/24
6789	Samantha Donaldson	100-0007	$1,995.00	2013/1/31

february_2013:

Customer ID	Customer Name	Invoice Number	Sale Amount	Purchase Date
9876	Daniel Farber	100-0008	$1,115.00	2013/2/2
8765	Laney Stone	100-0009	$1,367.00	2013/2/8
7654	Roger Lipney	100-0010	$2,135.00	2013/2/15
6543	Thomas Haines	100-0011	$1,346.00	2013/2/17
5432	Anushka Vaz	100-0012	$1,560.00	2013/2/21
4321	Harriet Cooper	100-0013	$1,852.00	2013/2/25

march_2013:

Customer ID	Customer Name	Invoice Number	Sale Amount	Purchase Date
6634	Poop Smith	100-0014	$1,350.00	2013/3/4
8765	Tony Song	100-0015	$1,167.00	2013/3/8
2345	Mary Harrison	100-0016	$1,789.00	2013/3/17
6543	Rachel Paz	100-0017	$2,042.00	2013/3/22
3456	Lucy Gomez	100-0018	$1,511.00	2013/3/28
4321	Susan Wallace	100-0019	$2,280.00	2013/3/30

The merging requires getting two fields, Customer Name & Sales Amount, from each worksheet. Below is the final merging result:

Customer Name	Sale Amount
John Smith	1200
Mary Harrison	1425
Lucy Gomez	1390
Rupert Jones	1257
Jenny Walters	1725
.......	........
Susan Wallace	2280

esProc SPL script:

	A
1	=file("D:/sales_2013.xlsx").xlsopen()
2	=A1.conj(A1.xlsimport@t('Customer Name','Sale Amount';~.stname))
3	>file("D:/result_2013.xlsx"). xlsexport@t(A2;"merge_sheets")

Explanation:
A1: Open the Excel file and generate a sequence of 3 worksheets.
A2: conj() function traverses the member worksheets in A1’s sequence to import the specified fields ‘Customer Name’ and ‘Sale Amount’ and merges them. xlsimport() function imports the fields, where a semicolon follows the last field and parameter ~.stname represents the current worksheet. @t option specifies that the first row in each worksheet will be the field names. The function is nested within conj() function’s loop traversal to import data from every worksheet.
A3: Save A2’s table sequence as a new worksheet in the original Excel file with the name “merge_sheets”. @t option is used to make the first row the column headers.

Only three lines of code, the script is short, concise yet clear and easy to understand.

B．Merge multiple sheets from different Excel files

Here are several Excel files in which every worksheet containing a year’s sales data has the same structure as the worksheets in the previous instance:
001png

esProc SPL script:

	A	B
1	for directory@p("d:/excel/*.xlsx")	=file(A1).xlsopen()
2		=B1.conj(B1.xlsimport@t('Customer Name','Sale Amount','Purchase Date';~.stname))
3		=@\|B2
4	> file("d:/result.xlsx"). xlsexport@t(B3;"merge_data")

The desired merging effect:

Customer Name	Sale Amount	Purchase Date
John Smith	1200	2013-01-01
Mary Harrison	1425	2013-01-06
Lucy Gomez	1390	2013-01-11
Rupert Jones	1257	2013-01-18
......	......	......
Thomas Haines	1346	2013-02-17

Explanation:
A1: The for loop traverses every Excel file under the given directory to perform a series of operations over every worksheet in B1~B3.
B1: Open an Excel file and generate a sequence.
B2: Import the specified fields, ‘Customer Name’, ‘Sale Amount’ and ‘Purchase Date’, from every worksheet of the current file and merge them. This is similar to the operations in A2 in the previous instance.
B3: Combine B2’s table sequence with the current value of B3.
A4: Save B3’s table sequence as a worksheet, named merge_data, in file result.xlsx.

The program achieves multiple Excel file merging with only two rounds loop. The outer loop traverses each of the Excel files under the given directory while the inner loop B1.conj() function merges data from every worksheet in an Excel file.

C．File merging with big result set

Both A2 in the first instance and B3 in the second instance receive all merged data in the memory and then write it out at once. But a big merged result set can use up too much memory space or the available memory space won’t be enough to hold the data. In that case, stream-style appending can be used to generate a big file.

esProc SPL script:

	A	B
1	=file("D:/out.xlsx")
2	for directory@p("d:/excel/*.xlsx")	=file(A2).xlsopen()
3		=if(A1.exists(),B2.xlsimport@t(),B2.xlsimport())
4		>A1.xlsexport@s(B3;"merger")

The desired merging effect:

Customer ID	Customer Name	Invoice Number	Sale Amount	Purchase Date
1234	John Smith	100-0002	1200	2013-01-01
2345	Mary Harrison	100-0003	1425	2013-01-06
3456	Lucy Gomez	100-0004	1390	2013-01-11
4567	Rupert Jones	100-0005	1257	2013-01-18
......	......	......	......	......
6789	Thomas Haines	100-0002	1346	2013-02-17

Explanation:
A1: Open a specified file into which data will be output.
A2: Traverse all to-be-merged Excel files under the given directory.
B2: Open a to-be-merged Excel file.
B3: If the file into which data will be output exists, import all rows, including the column headers, of a worksheet; if the file exists, use @t option to make the first row the column headers and import data from the second row.
B4: Import data in a stream style in B3 and append it to the worksheet named merger in A1’s specified file.

The stream-style importing and appending is suited to combine many small excel files into a big one.

3. Grouping & aggregation

We continue to use the previous Excel file containing the sales data maker further analysis.

A. Group data by one or more fields

Here’s an eProc SPL script for grouping data by one or multiple fields:

	A
1	=file("D:/sales_2013.xlsx").xlsopen()
2	=A1.conj(A1.xlsimport@t(;~.stname))
3	=A2.groups('Customer ID';sum('Sale Amount'):Total,avg('Sale Amount'):Average)
4	=A2.groups('Customer ID','Purchase Date';sum('Sale Amount'):Total)

A3’s grouping result:

Customer ID	Total	Average
1234	2550	1275.0
2345	3214	1607.0
3456	2901	1450.5
4321	4132	2066.0
……	……	……
4567	1257	1257.0

A4’s grouping result:

Customer ID	Purchase Date	Total
1234	2013-01-01	1200
1234	2013-03-04	1350
2345	2013-01-11	1425
2345	2013-03-17	1789
……	……	……
9876	2013-02-02	1115

Explanation:
A1: Open the specified Excel file.
A2: Import all data from the worksheets in the file and perform merging.
A3: Group the merged data by ‘Customer ID’ and calculate total and average for each group.
A4: Group the merged data by ‘Customer ID’ and ‘Purchase Date’, and calculate total for each group.

B. Group data in a certain order

esProc supports comparing neighboring records during grouping and aggregation. If the data is already ordered, then a re-sorting will be unnecessary. Now suppose the original data is ordered by date, but we need to summarize data by month.

esProc SPL script:

	A	B
1	for directory@p("d:/excel/*.xlsx")	=file(A1).xlsopen()
2		=B1.conj(B1.xlsimport@t(;~.stname))
3		=@\|B2
4	=B3.derive(year('Purchase Date'):Year,month('Purchase Date'):Month)
5	=A4.groups (month('Purchase Date'):Month;sum('Sale Amount'):Total,avg('Sale Amount'):Average)
6	=A4.groups@o (month('Purchase Date'):Month;sum('Sale Amount'):Total,avg('Sale Amount'):Average)

A5’s grouping result:

Month	Total	Average
1	272414	15134.111111111111
2	168038	9335.444444444445
3	357693	19871.833333333332

A6’s grouping result:

Month	Total	Average
1	8992	1498.6666666666667
2	9375	1562.5
3	10139	1689.8333333333333
1	260221	43370.166666666664
2	103656	17276.0
3	101509	16918.166666666668
1	3201	533.5
2	55007	9167.833333333334
3	246045	41007.5

Explanation:
A1~B3: Merge all worksheets in the Excel files of same structure under the specified directory.
A4: Generate a new table sequence base on B3’s table sequence by splitting the Purchase Date field into a new Year field and a new Month field.
A5: groups() function groups data by month and calculates total and average for each group.
A6: groups@o groups data in each year by month, calculates total and average for each group, and, with @o present, merges the summarized results.

A4 contains detailed sales records; A5 summarizes data only by month, ignoring the year; A6 summarizes data by both year and month. The three cells shows summarized data of different levels.

C. Group data by segment

This type of grouping divides data into multiple segments according to specified conditions and summarizes each group of data.

esProc SPL script:

	A	B
1	for directory@p("d:/excel/*.xlsx")	=file(A1).xlsopen()
2		=B1.conj(B1.xlsimport@t(;~.stname))
3		=@\|B2
4	=B3.groups(if ('Sale Amount'<1000,"1::<1000", if ('Sale Amount'<1500,"2::1000~~1500", if ('Sale Amount'<2000,"3::1500~~2000", if ('Sale Amount'<2500,"4::2000~~2500", "5::>=2500")))):Segment; count(1):Number,sum('Sale Amount'):Total)

Grouping result:

Segment	Number	Total
1::<1000	22	8280
2::1000~~1500	9	11617
3::1500~~2000	6	10432
4::2000~~2500	4	8810
5::>=2500	13	759006

Explanation:
About the code block A1~B3, refer to the explanations of previous instances.
A4: It defines 5 intervals for the value of ‘Sale Amount’ field and calculates number and total for within each range.

But the above script is still not convenient-to-use enough. When we want to change the way of dividing the data, we need to modify the parameter expressions in groups() function. This is a little complicated. Another function, pseg(), helps to achieve an easier modification. Here’s the SPL code:

	A	B
1	[0,1000,1500,2000,2500]
2	for directory@p("d:/excel/*.xlsx")	=file(A1).xlsopen()
3		=B1.conj(B1.xlsimport@t(;~.stname))
4		=@\|B2
5	=B4.groups(A1.pseg(~.'Sale Amount'):Segment; count(1):Number,sum('Sale Amount'):Total)

We can also group data by multiple fields according to a set of conditions and then summarize each group. For example, we divide the scores of subjects into 5 grades – Excellent, Good, Satisfactory, Poor, Passing, and summarize each group. For more uses of groups() function, refer to esProc Function Reference.

D. Group big data

In the above instances, the Excel file(s) is/are fairly small to be imported into the memory at a time. If the total data amount is huge, trying to retrieve all data into the memory in either esProc or VBA could fail. An alternative is the stream retrieval that reads a part of data into the memory and summarizes it at one time.

esProc SPL script:

	A	B
1	=file("d:/tdata.xlsx").xlsopen@r()
2	for A1.count()	=A1.xlsimport@ct(;A1(A2). stname)
3		=@\|B2
4	= B3.conjx()	=A4.groups('Customer ID';sum('Sale Amount'):SaleTotal)
5	>file(“d:/out.xlsx”).exportxls@bt(B4;"Customer&Sales")

The result of filtering and then grouping data:

Customer ID	SaleTotal
1234	107721792
2345	139041639
3456	137985543
4321	96170742
...	...
9876	37590417

Explanation:
A1: Use @r option to open an Excel file in a stream style.
A2: Traverse every worksheet in the Excel file.
B2: Use @c option to import data with cursor.
B3: Union B2’s cursor into B3.
A4: Union members of B3’s cursor sequence into a new cursor.
B4: Group A4’s sequence by Customer ID and calculate Sale Amount for each group.
A5: Export and save B4’s result set into an Excel file.

Each time, the stream retrieval reads a part of data with cursor and groups and summarizes it.

4.Deduplication

Duplicate data affects the analysis results. As duplication is common in real-world cases, let’s look at how to remove duplicates in esProc SPL.

A. Via primary key

Set ‘Invoice Number’ as the primary key of sales_2013 to remove duplicates according to it.

	A
1	=file("d:/sales_2013.xlsx").xlsopen()
2	=A1.conj(A1.xlsimport@t('Customer Name', 'Invoice Number', 'Sale Amount';~. stname))
3	=A2.group@1('Invoice Number')
4	>file(“d:/out.xlsx”). xlsexport@t(A3;"result")

Concatenated deduplicated data:

Customer Name	Invoice Number	Sale Amount
John Smith	100-0002	1200
Mary Harrison	100-0003	1425
Lucy Gomez	100-0004	1390
Rupert Jones	100-0005	1257
Jenny Walters	100-0006	1725
……	……	……
Susan Wallace	100-0019	2280

Explanation:
A1: Open a certain Excel file.
A2: Import specified columns from every worksheet.
A3: Group A2’s table sequence by primary key ‘Invoice Number’ and deduplicate each group.
A4: Export and save A3’s result set into an Excel file.

Data is distinct in each worksheet, but probably not once the worksheets are concatenated. Primary key is one of the ways of deduplicating data.

B. Via a certain field

Deduplicate data in sales_2013 according to a certain field to look through employee records of different names.

	A
1	=file("d:/sales_2013.xlsx").xlsopen()
2	=A1.conj(A1.xlsimport@t('Customer ID', 'Customer Name';~. stname))
3	=A2.id('Customer Name')
4	=A2.group@1('Customer Name')
5	>file("d:/out.xlsx"). xlsexport@t(A4;"result")

Explanation:
A1: Open a certain Excel file.
A2: Import specified columns from every worksheet.
A3: Get a sequence of distinct names from A2’s table sequence.
A4: Get a table sequence of distinct names from A2’s table sequence.
A5: Export and save data of A4’s table sequence into an Excel file, with the first row as the column headers.

A3’s deduplication result:

Member
Anushka Vaz
Daniel Farber
Harriet Cooper
……
Tony Song

A4’s deduplication result:

Customer ID	Customer Name
5432	Anushka Vaz
9876	Daniel Farber
4321	Harriet Cooper
……	……
8765	Tony Song

C. Via multiple fields

Sometimes we need several other fields, instead of the primary key, to determine whether records are duplicates or not.

	A
1	=file("d:/sales_2013.xlsx").importxls@t()
2	=file("d:/sales_2014.xlsx").importxls@t()
3	=[A1,A2].merge('Customer ID', 'Purchase Date')
4	=A3.group@1('Customer ID', 'Purchase Date')
5	>file("d:/out.xlsx"). xlsexport@t(A4;"result")

Explanation:
A1: Import a specified Excel file.
A2: Import another specified Excel file.
A3: Merge A1’s table sequence and A2’s table sequence by fields ‘Customer ID’ and ‘Purchase Date’ and return a new table sequence.
A4: Group and deduplicate A3’s table sequence by ‘Customer ID’ and ‘Purchase Date’.
A5: Export and save 43’s result set into an Excel file.
We can perform the merge and grouping and deduplication by more fields as needed.

D. Deduplicate records

Records in each file are distinct, but the concatenated file could contain duplicate records.

	A	B
1	=file("d:/sales_2013.xlsx").importxls@t()	=A1.group@1('Invoice Number')
2	=file("d:/sales_2014.xlsx").importxls@t()	=A2.group@1('Invoice Number')
3	=[B1,B2].merge@u()	=A3.count()

Explanation:
A1: Import a specified Excel file.
B1: Deduplicate A1’s table sequence by ‘Invoice Number’.
A2&B2: Same as A1&B1.
A3: Merge B1’s table sequence and B2’s table sequence, deduplicate data, and return a table sequence. @u option is used to merge multiple table sequences in order, remove duplicates, and generate a new table sequence.
B3: Count the records in the merged table sequence. merge@u() function merges multiple ordered, deduplicated table sequences.

In the above, we explain how to handle merging, grouping and summarization, and deduplication over same-structure Excel files in esProc SPL. For Excel files of different structures, we just need to read to-be-merged fields as a sequence object and then process it in same ways. It is similar to handle other types of text files.

5. Appendix

slaes.rar

SPL Official Website 👉 http://www.scudata.com

SPL Feedback and Help 👉 https://www.reddit.com/r/esProc

SPL Learning Material 👉 http://c.scudata.com

SPL Source Code and Package 👉 https://github.com/SPLWare/esProc

Discord 👉 https://discord.gg/ydhVnFH9

Youtube 👉 https://www.youtube.com/@esProc_SPL

esProc

pjf • 967 View • 5 Years ago

An Easy Way to Merge and Summarize Excel Files

1. Background

2. Basic merges

A．Merge multiple sheets of one Excel file

B．Merge multiple sheets from different Excel files

C．File merging with big result set

3. Grouping & aggregation

A. Group data by one or more fields

B. Group data in a certain order

C. Group data by segment

D. Group big data

4.Deduplication

A. Via primary key

B. Via a certain field

C. Via multiple fields

D. Deduplicate records

5. Appendix

ToC