Spark SQL Find the Number of Extensions for a Record

Question

Source: https://stackoverflow.com/questions/70470272/spark-sql-find-the-number-of-extensions-for-a-record

I have a dataset as follows:

| col1 | extension_col1 |
|------|----------------|
| 2345 | 2246 |
| 2246 | 2134 |
| 2134 | 2091 |
| 2091 | Null |
| 1234 | 1111 |
| 1111 | Null |

I need to find the number of extensions available for each record in col1. The records are already sorted and arranged contiguously in sets, each of which is terminated by a Null.

The final result will be like this:

| col1 | extension_col1 | No_Of_Extensions |
|------|----------------|------------------|
| 2345 | 2246 | 3 |
| 2246 | 2134 | 2 |
| 2134 | 2091 | 1 |
| 2091 | Null | 0 |
| 1234 | 1111 | 1 |
| 1111 | Null | 0 |

Value 2345 extends as 2345 > 2246 > 2134 > 2091 > Null, and hence it has 3 extension relations, excluding the null.

How can I get the third column (No_Of_Extensions) using Spark SQL/Scala?

Answer

In this scenario the table's first column is already ordered, so the approach is: start a new group each time the second column of the previous record is null, then, within each group, add a numbering column that counts down to zero. Doing this directly in SQL is cumbersome: you first have to generate row numbers and a marker column, then group by the marker and the row numbers. A common alternative is to fetch the data out of the database and process it in Python or SPL. SPL, an open-source Java package, is easy to integrate into a Java program and produces much simpler code; it expresses the algorithm in just two lines:

|   | A |
|---|---|
| 1 | =MYSQL.query("select * from t4") |
| 2 | =A1.group@i(#2[-1]==null).run(len=~.len(),~=~.derive(len-#:No_Of_Extensions)).conj() |

 

View SPL source code.
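For comparison, the same grouping logic can be sketched in plain Python. This is a minimal illustration on the sample data from the question, not a Spark or SPL program; the hard-coded `rows` list and variable names are assumptions made for the example:

```python
# Sample rows in their existing order: (col1, extension_col1),
# where None marks the end of a contiguous set.
rows = [
    (2345, 2246),
    (2246, 2134),
    (2134, 2091),
    (2091, None),
    (1234, 1111),
    (1111, None),
]

# Split the ordered rows into groups, each terminated by a null extension.
groups, current = [], []
for col1, ext in rows:
    current.append((col1, ext))
    if ext is None:          # a null closes the current set
        groups.append(current)
        current = []
if current:                  # keep a trailing set that lacks a null terminator
    groups.append(current)

# Within each group, No_Of_Extensions counts down to 0:
# the row at position i in a group of length n has n - 1 - i extensions.
result = [
    (col1, ext, len(group) - 1 - i)
    for group in groups
    for i, (col1, ext) in enumerate(group)
]

for row in result:
    print(row)
# → (2345, 2246, 3) ... (2091, None, 0), (1234, 1111, 1), (1111, None, 0)
```

The countdown works because each set is contiguous and its last row is the null-terminated one, so the number of extensions for a row is simply the number of rows that follow it within its own group.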