How to merge Excel Files easily in Java

  Most Java programmers are accustomed to using POI or HSSFWorkbook to auto-merge Excel files. That requires a lot of code and effort as well as strict Excel format, yet with low scalability. Imagine that there is an independent data computing tool. It enables you to write a SQL-style script simply and call it directly from a Java application and return the result set. That’s not imagination, however, because Raqsoft’s esProc has made it achievable. esProc provides JDBC interface to let a third application call an SPL (Structured Process Language) script conveniently. Here let’s look at how this works.

Basic merges

A.     Merge multiple sheets in one Excel

  The following Excel file stores sales data. It includes worksheets of three months:

  January_2013:

  undefined

 

  February_2013:

  undefined

 

  March_2013:

  undefined

  We perform a filter merging by selecting Customer Name field and Sale Amount field from every sheet. Here’s the result:

  undefined

  SPL script:


A

1

=file(”D:/sales_2013.xlsx”).xlsopen()

2

=A1.conj(A1.xlsimport@t('Customer Name','Sale Amount';~.stname))

3

>file(“D:/result_2013.xlsx”).xlsexport@t(A2;"merge_sheets")

  Save the script as oneExcel.dfx to be embedded in the Java application.

  Script explanation:

  A1: Open the Excel file to generate a sequence of worksheets according to it.

  A2: conj() function traverses A1’s members to import 'Customer Name' and 'Sale Amount' from each worksheet iteratively and concatenate them together. xlsimport() function imports specified columns where the last column is followed by semicolon; @t option tells the function to use the first row as field names. Parameter ~.stname points to the current worksheet.

  A3: Save A2’s table sequence in a worksheet in the original Excel file and name it "merge_sheets" while using @t option to write the first record as column headers.

  The three-liner is short, concise, clear and easy to understand.

 

B.      Merge multiple Excels

  Now we need to merge multiple Excels. Each Excel file includes multiple worksheets (their structures are the same as those in the previous example) where annual data is stored:

  SPL script:


A

B

1

for  directory@p(”d:/excel/*.xlsx“)

=file(A1).xlsopen()

2


=B1.conj(B1.xlsimport@t('Customer   Name','SaleAmount','Purchase Date';~.stname))

3


=@|B2

4

>file(“d:/result.xlsx”).xlsexport@t(B3;"merge_data")


  Merging result:

  undefined

  Save the script as MergeExcels.dfx to be embedded in the Java application.

  Script explanation:

  A1: for statement traverses Excel files in the specified directory to process data in the inner loop in B1-B3.

  B1: Open an Excel file to generate a sequence of worksheets.

  B2: Import 'Customer Name', 'Sale Amount' and 'Purchase Date' columns from each worksheet of the current file and concatenate them together.

  B3: Concatenate B2’s table sequence with the current cell value, which is represented by @.

  A4: Save B3’s table sequence in a worksheet in result.xlsx and name it "merge_data".

  The program uses a two-level loop to merge multiple Excel files. The outer loop traverses all Excel files under the specified directory and in the inner loop B1.conj merges data from multiple worksheets in each file.

 

C.      Large merging result

  In above examples, a merging result is loaded into the memory and then output at a time. If the result is too large or it exceeds the memory capacity, we can process and output data in a stream style.

  SPL script:


A

B

1

=file("D:/out.xlsx")


2

for directory@p(”d:/excel/*.xlsx“)

=file(A2).xlsopen()

3


=if(A1.exists(),B2.xlsimport@t(),B2.xlsimport())

4


>A1.xlsexport@s(B3;"merger")

  Merging result:

  undefined

  Save the script as BigExcel.dfx to be embedded in the Java application.

  Script explanation:

  A1: Open a specified output file.

  A2: Traverse all to-be-merged Excel files in the directory.

  B2: Open one of A2’s files.

  B3: If the output file doesn’t exist, read in all rows of a worksheet in B2’s Excel file, including the column headers; if the output file exists, use @t option to read row 1 as the column headers and read detailed data from row 2.

  B4: Append B3’s result to a worksheet named “merger” in A1’s output file in a stream style.

  In this example, we traverse all Excel files to read in data from one and append it to the specified output file each time. The stream style output applies to merging a lot of small Excel files into a big one.

 

JAVA invocation

  An SPL script is integration-friendly. The Java application calls an SPL script in same manner as calling a stored procedure via esProc JDBC. We’ll take OneExcel.dfx, the result of merging multiple worksheets in an Excel file, to illustrate the invocation process.

    Connection con = null;
    Class.forName("com.esproc.jdbc.InternalDriver");
    con= DriverManager.getConnection("jdbc:esproc:local://");// Call stored procedure, where OneExcel is the dfx file name
    st =(com. esproc.jdbc.InternalCStatement)con.prepareCall("call OneExcel()");//Execute stored procedure
    st.execute();//Get result set
    ResultSet rs = st.getResultSet();
    ...

  Just change the file name to call another dfx file. See How to Call an SPL Script in Java to learn details. esProc also supports the ODBC style integration through same process.

 

Advanced merges

  At times we need to group a merging result before outputting it.

Group & summarize a merging result

A.     Group by field(s)

  To group a merging result by one or more fields and perform aggregation, you can use the SPL script below:


A

1

=file(”D:/sales_2013.xlsx”).xlsopen()

2

=A1.conj(A1.xlsimport@t(;~.stname))

3

=A2.groups('Customer ID';sum('Sale Amount'):Total,avg('Sale Amount'):Average)

4

=A2.groups('Customer ID','Purchase Date';sum('Sale Amount'):Total)

  A3’s result:

  undefined

  A4’s result:

  undefined

  Script explanation:

  A1: Open the specified Excel file.

  A2: Read in data from every worksheet and concatenate it.

  A3: Group merging result by Customer ID and calculate total sales and average for each group.

  A4: Group merging result by Customer ID & and Purchase Date and calculate total sales for each group.

 

B.      Order-based grouping

  esProc supports merge grouping, which groups records by comparing one with its next one. This increases efficiency and maintains the original order when records are already ordered. Suppose records are ordered by dates, we can use the following SPL script to perform grouping and aggregation, for example, by month:

SPL script:


A

B

1

for   directory@p(”d:/excel/*.xlsx“)

=file(A1).xlsopen()

2


=B1.conj(B1.xlsimport@t(;~.stname))

3


=@|B2

4

=B3.derive(year('Purchase Date'):Year,month('Purchase Date'):Month)


5

=A4.groups (month('Purchase Date'):Month;sum('Sale Amount'):Total,avg('Sale Amount'):Average)


6

=A4.groups@o(month('Purchase Date'):Month;sum('Sale Amount'):Total,avg('Sale Amount'):Average)


  A5’s result:

  undefined

  A6’s result:

  undefined

  Script explanation:

  A1~B3: As explained previously, the statements merge all worksheets of the same-structured Excel file under the specified directory.

  A4: Create a table sequence based on B3’s table sequence and split the Purchase Date values to add Year field and Month field.

  A5: groups() function groups merging result by Month and calculate total sales and average for each group.

  A6: groups@o() function groups merging result by Year and Month and calculate total sales and average for each group. @o option enables a merge grouping by comparing a row with its next one.

 

C.     Group by segment

  To divide records into multiple segments according to specified intervals and perform aggregation:

  SPL script:


A

B

1

for directory@p(”d:/excel/*.xlsx“)

=file(A1).xlsopen()

2


=B1.conj(B1.xlsimport@t(;~.stname))

3


=@|B2

4

=B3.groups(if  ('Sale Amount'<1000,"1::<1000",if  ('Sale Amount'<1500,"2::1000~~1500",if  ('Sale Amount'<2000,"3::1500~~2000",if ('Sale Amount'<2500,"4::2000~~2500","5::>=2500")))):Segment;count(1):Number,sum('Sale Amount'):Total)


  Grouping result:

  undefined

  Script explanation:

  A4: Divide records into 5 segments by intervals of sales amount and calculate record count and total sales for each group.

  To change the grouping conditions we need to modify parameters in groups() function. And the parameter expression is too complicated. To make the script more concise, we rewrite it as follows:


A

B

1

[0,1000,1500,2000,2500]


2

for directory@p(”d:/excel/*.xlsx“)

=file(A1).xlsopen()

3


=B1.conj(B1.xlsimport@t(;~.stname))

4


=@|B2

5

=B4.groups(A1.pseg(~.'Sale Amount'):Segment;count(1):Number,sum('Sale Amount'):Total)


  We can also perform grouping and aggregation by multiple fields or multiple conditions. For example, sum students’ scores according to ranges - Excellent, Very good, Good, Fair, Pass.

 

D.     Large Excel merging

  If the size of to-be-merged Excel files exceeds the physical memory capacity (reading them in with VBA almost can’t solve the problem), we can merge them at retrieval time in a stream style, i.e. retrieving a segment of data to process each time.

  SPL script:


A

B

1

=file(“d:/tdata.xlsx”).xlsopen@r()


2

for A1.count()

=A1.xlsimport@ct(;A1(A2).stname)

3


=@|B2

4

= B3.conjx()

=A4.groups('Customer ID';sum('Sale Amount'):SaleTotal)

5

>file(“d:/out.xlsx”).exportxls@bt(B4;"Customer&Sales")


  Result of filter grouping & aggregation:

  undefined

  Script explanation:

  A1: @r opens the Excel in a stream style.

  A2: Traverse worksheets in the Excel.

  B2: @c option loads data with cursor.

  B3: Concatenate cursor B2 with B3’s sequence of cursors.

  A4: Concatenate records in B3’s cursors into a new cursor.

  B4: Group records in A4’s cursor by Customer ID and sum Sale Amount for each group.

  A5: Export result to a target Excel file.

 

Remove duplicates

  In real-world scenarios, we often need to remove duplicate data to get correct analytic result.

A.     By primary key

  Remove duplicate data from sales_2013 by primary key Invoice Number field.


A

1

=file(“d:/sales_2013.xlsx”).xlsopen()

2

=A1.conj(A1.xlsimport@t('Customer Name', 'Invoice Number', 'Sale Amount';~. stname))

3

=A2.group@1('Invoice Number')

4

>file(“d:/out.xlsx”).xlsexport@t(A3;"result")

  Result of merging Excel without duplicates:

  undefined

  Script explanation:

  A1: Open the specified Excel.

  A2: Import specified columns from each worksheet to concatenate them.

  A3: Group A2’s table sequence by primary key Invoice Number and remove duplicates from each group. Parameter @1 (number 1 instead of letter 1) gets the first record from each group and concatenates them into a record sequence to return.

  A4: Export result to a target Excel file.

  Though data in every worksheet is unique, data in merging result may not be unique. That is why we remove duplicates by the primary key.

 

B.      By a specified field

  Remove duplicates from sales_2013 by a specified field to get records of different customers.


A

1

=file(“d:/sales_2013.xlsx”).xlsopen()

2

=A1.conj(A1.xlsimport@t('Customer ID', 'Customer Name';~. stname))

3

=A2.id('Customer  Name')

4

=A2.group@1('Customer Name')

5

>file(“d:/out.xlsx”).xlsexport@t(A4;"result")

  Script explanation:

  A1: Open the specified Excel.

  A2: Import specified columns from each worksheet to concatenate them.

  A3: Get different customer names from A2’s table sequence.

  A4: Group A2’s table sequence by Customer Name and get the first record from each group to return.

  A5: Export A4’s table sequence to a target Excel while writing row 1 as he column headers.

  A3’s distinct result:

  undefined

  A4’s distinct result:

  undefined

 

C.     By multiple fields

  Sometimes we need more than the primary key field to judge whether a record is a duplicate.


A

1

=file(“d:/sales_2013.xlsx”).importxls@t()

2

=file(“d:/sales_2014.xlsx”).importxls@t()

3

=[A1,A2].merge('Customer ID', 'Purchase Date')

4

=A3.group@1('Customer ID','Purchase Date')

5

>file(“d:/out.xlsx”).xlsexport@t(A4;"result")

  Script explanation:

  A1: Import data from the specified Excel.

  A2: Import data from the specified Excel

  A3: Merge A1’s table sequence and A2’s table sequence by Customer ID and Purchase Date and return a new table sequence.

  A: Group A3 by Customer ID and Purchase Date and get the first record, i.e. the unique one, from each group.

  A5: Export result to a target Excel.

 

D.     Remove duplicate records from result

  In each worksheet the records are unique but there may be duplicate records in the merging result.


A

B

1

=file(“d:/sales_2013.xlsx”).importxls@t()

=A1.group@1('Invoice  Number')

2

=file(“d:/sales_2014.xlsx”).importxls@t()

=A2.group@1('Invoice Number')

3

=[B1,B2].merge@u()

=A3.count()

  Script explanation:

  A1: Import data from the specified Excel.

  B1: Remove duplicate records from A1’s table sequence by Invoice Number.

  A2,B2: Same as A1,B1.

  A3: Merge B1’s table sequence and B2’s table sequence while removing duplicate records and return a new table sequence. @u option combines records of two table sequences in order and remove duplicates.

  B3: Count records in A3’s table sequence.

  merge@u() function merges ordered table sequences containing unique records.

 

Summary

  To merge, group & aggregate and perform distinct over Excel files of same structure or different structure in SPL, esProc just reads each target field as set objects for further processing. It’s the same way of merging text files with esProc, the professional data computing tool.

 

SPL strengths

  • A universal tool

  Data computing in Java involves a mountain of work that generates unreusable, lengthy code. And most of the time the source data even doesn’t come from a database. SPL equalizes all types of source to process data in a uniform and convenient way.

  • Free & open-source

  The free esProc Community Edition has been released. To use it permanently, users can deploy it on the server and integrate it with Java.

  • Solid technical support

  You can find solutions to general problems in our Documentation. For users with esProc Basic Edition, you can ask any questions in Raqsoft Community, where we offer free technical supports.