How to Handle Huge XLS Files

Excel displays poor performance in handling huge xls files. Usually we load an Excel file to the database and handle it using the database’s computing ability. Sometimes, however, the data can’t be wholly loaded to the database due to specific reasons. It would be great there was an application that can directly deal with massive xls files.

 

Take the employee information file emp.xls (Below is part of the source data):

EID

NAME

SURNAME

GENDER

STATE

BIRTHDAY

HIREDATE

DEPT

SALARY

1

Rebecca

Moore

F

California

1974-11-20

2005-03-11

R&D

7000

2

Ashley

Wilson

F

New York

1980-07-19

2008-03-16

Finance

11000

3

Rachel

Johnson

F

New Mexico

1970-12-17

2010-12-01

Sales

9000

4

Emily

Smith

F

Texas

1985-03-07

2006-08-15

HR

7000

And the state information file states.xls as an example (Below is part of the data):

STATEID

NAME

POPULATION

ABBR

AREA

CAPITAL

REGIONID

1

Alabama

4779736

AL

52419

Montgomery

6

2

Alaska

710231

AK

663267

Juneau

9

3

Arizona

6392017

AZ

113998

Phoenix

8

4

Arkansas

2915918

AR

52897

Little Rock

7

Task: Join the two tables through emp’s STATE column and states’ NAME column and get records where SALARY is above 5000 and POPULATION is below 5 million.

 

It’s easy to do this with esProc.
You can download esProc installation package and free DSK edition license HERE.

1. Get records of states where POPULATION is below one million:

Write script wherexls.dfx in esProc:

A

1

$select * from states.xls   where POPULATION<1000000

A1 gets records of states where POPULATION is below one million using simple SQL.

Below is A1’s result after execution:

undefined

 

2. Group emp records by genders and count employees in each group:

Write script groupxls.dfx in esProc:

A

1

$select GENDER,count(*) as   count from emp.xls group by GENDER

A1 groups emp by gender and count employees in each group using simple SQL.

Below is A1’s result after execution:

undefined

 

3. Join emp table and states table through emp.STATE and states.NAME and select records where state population is below 5 million and employee salary is above 5000:

Write script joinxls.dfx in esProc:

A

1

$select * from emp.xls b   join states.xls a on a.NAME=b.STATE where a.POPULATION<5000000 and   b.SALARY >5000

A1 performs join filtering over two tables using simple SQL.

Below is A1’s result after execution:

undefined

 

We can convert an xls file into a bin file. The bin file is esProc’s built-in binary file format. The format uses simple compression mechanism to store same size of data in smaller space and thus enables less time in reading it.

Take orders file orders.xls as an example. To convert it to a bin file, we use the following script:

A

1

=file("orders.xls").xlsimport@t()

2

=file("orders.btx").export@b(A1)

Below is the disk space used by xls file and bin file:

undefined

Like an Excel file, a bin file can be directly handled with esProc. For example:

Group orders file orders.btx by year and list years when the total number of orders is less than 10,000:

Write script groupbtx.dfx in esProc:

A

1

$select year(O_ORDERDATE) as   year,count(*) as count from orders.btx group by year(O_ORDERDATE) having   count(*)<10000

A1 groups records by year and count orders for each year using simple SQL.

Below is A1’s result after execution:

undefined