esProc One-liners: Split a Field Value into N Records

Key words: field value splitting  split one row into N rows

To split a field value into N strings by a separator and unfold the row into N rows in SQL is very inconvenient.

That’s because SQL doesn’t support the set objects. So it can’t expand one row into multiple rows according to a set of strings split from a value. The SQL way of doing this is like this. Get the maximum number of strings after value splitting (M), create a one-column, M-row temporary table (T2) with column named lv, in which values are 1,2,……,M, and perform a cross-join between T2 and the original table over the (T2.lv)th string after value splitting. This SQL query contains multiple complicated nested subqueries. Moreover, databases have their own functions for string splitting, so they write different SQL queries for doing this.

Here’s one example. Based on elective course table COURSES below, we need to find how many courses each student chooses.

COURSE

STUDENTS

Chinese

Tom,Kate,John,Jimmy

Russia

Tom,Cart,Jimmy

Spanish

Kate,Joan,Cart

Portuguese

John,Tom

History

Tom,Cart,Kate

Music

Kate,Joan,Tom

The expected result:

STUDENT

NUM

Tom

5

Kate

4

Cart

3

Jimmy

2

Joan

2

John

2

 

Oracle SQL does it in this way:

SELECT STUDENT, COUNT(*) NUM FROM

    (SELECT T1.COURSE, REGEXP_SUBSTR(T1.STUDENTS, '[^,]+', 1, T2.LV ) STUDENT

    FROM COURSES T1,

        ( SELECT LEVEL LV

        FROM (SELECT MAX(REGEXP_COUNT(A.STUDENTS, '[^,]+', 1)) R_COUNT

            FROM COURSES A

        ) B

        CONNECT BY LEVEL <= B.R_COUNT) T2

    WHERE REGEXP_SUBSTR(T1.STUDENTS, '[^,]+', 1, T2.LV) IS NOT NULL

    ) C

GROUP BY STUDENT

ORDER BY NUM DESC;

C is the temporary table. So the SQL query is multilevel and difficult to read.

 

The program would be concise and simple if we could use esProc SPL to handle it. It is a one-liner:

connect("mydb").query("SELECT * FROM COURSES").news(STUDENTS.split@c();~:STUDENT ).groups(STUDENT;count(1):NUM).sort(-NUM)

The Structured Process Language (SPL) supports set objects and provides the function for expanding a row to multiple rows according to a set. The code is brief, clear, simple and easy to understand. Most importantly, it enables universal syntax for producing consistent queries over various data sources.

 

The SPL sets function library includes intersection, difference, union & concatenation, aggregates and loops. Refer to SPL Set to learn more.

SPL is integration-friendly with a Java program. Read How to Call an SPL Script in Java to learn details.

About how to work with esProc, read Getting Started with esProc.