SPL Programming - 11.2 [Big data] Fuctions on cursor
To only count the number of records, SPL also has a skip() function.
skip() is intended to skip several records without reading, and return the number of skipped records. When there is no parameter, it will go to the end, and the number of skipped records returned is the total number of records.
It seems that it will be easier to directly use the functions on cursor. Are there corresponding cursor functions for operations like sum? After all, it’s still troublesome to write a loop by yourself. For example, you need to set an initial value for sum and max.
SPL does not have a sum function for cursor, but has groups(). The average order amount in the previous section can be written as follows:
We learned that when the first parameter of the groups()function is filled with blank, it means that the whole data table is divided into one group, which is equivalent to summarizing the whole table, so the order amount and quantity of all the records can be recorded. The result set of A2 is a data table with only one row of record, and the average value is obtained by the calculation of A3 (the field of the first member will be selected by default when getting field value from a record sequence).
Why not provide a sum()and count() function for the cursor like for a sequence?
Because the cursor traversal is very slow, we should calculate as many results as possible during the traversal. If we provide functions such as sum()and count(), we need to traverse the cursor once for sum()and once again for count(), the performance will be very poor. It is also troublesome to create a new cursor in the code, so SPL does not provide it at all. Directly calculate multiple results in one traversal in groups().
However, if we always use groups() to summarize the whole table, it will always return a table sequence with only one row of record. Although there is no problem in use, it still does not feel good. And we also need to get new field names, it is a little troublesome. Therefore, SPL provides a total function, and it can be slightly simplified:
The total() function also requires one-time traversal to calculate multiple values, but it returns as a sequence without using the field names.
Since it is the groups() function, it can also be used for group aggregation?
Yes, for example, we can calculate the total order amount and order quantity of each area:
Note that the area field should also be fetched this time.
It seems that there is no difference in code between doing group aggregation for cursor and for table sequence or record sequence before.
It’s true, but the cursor is still special. For example, we want to do another grouping by month. If it’s a record sequence, just write another groups(). But if it’s a cursor, it can’t be used again. It has finished traversing and can’t get records. At this time, we need to recreate a new cursor and then do groups().
But in this way, the cursor data is traversed twice, and the speed is relatively slow.
There is indeed this problem. SPL also provides a method to calculate multiple sets of aggregate values in the process of one cursor traversal, which is called traversal reuse. However, these contents belong to the category of performance optimization, which can only be explained clearly by adding new concepts. They are beyond the design outline of this book, so they are not discussed here. You can refer to other documents of SPL if interested.
If you use SQL, you can only traverse multiple times. It does not support traversal reuse.
There is a groups()function for the cursor, but there is no corresponding group() function (in fact, SPL also has this function, but the use conditions are limited, and it does not correspond to groups(). We’ll mention it in the next section). In other words, the groups() function of the cursor does not use the group() function to divide into subsets first, and then summarize them. Why?
Because it is not necessary, actually the groups() function for the table sequence and record sequence also does not divide into grouping subsets first and then perform summary calculation. There are many summary calculations, such as sum, count, max, etc., which can be implemented by accessing the set members in turn in a loop. Each member can be used only once, and there is no need for the random access we mentioned earlier, so the operation efficiency is higher. If you carefully read the iterative function in Chapter 5, you will understand this principle more thoroughly.
Then why doesn’t the cursor have a corresponding group() function?
Cursor is introduced to handle large data that cannot be loaded into memory. If the original data table cannot be loaded into memory, it is obviously impossible to load all grouping subsets. If the group() function is provided, it is necessary to keep these grouping subsets in external storage, which has very poor performance and is very troublesome. For the operation that must be realized by grouping subsets, we will think of other better ways to do it, so we do not provide this operation.
Now that we understand the aggregation and grouping operations on a cursor, let’s move on to other operations.
What will the select() function look like on a cursor? For example, we want to find orders with an order amount exceeding 5000.
At this time, it is very different from the operation of sequence and record sequence. For the selection function of the record sequence, the result record sequence is calculated directly, and we can view the members. However, it cannot be done for the cursor. Because the cursor corresponds to big data, the selected result may still be big data. We can’t calculate it into a table sequence in memory. It has also to be a cursor. Therefore, the select() function on the cursor still returns a cursor.
Then, how do we view the selected result?
We use fetch(). Since the selected cursor is still a cursor, we can also use the fetch() function to fetch data. However, because it is a cursor, we can’t fetch all the data, and we can only get part of it at a time.
A2 conditionally filters the cursor and still returns a cursor. Then A3 can get the table sequence composed of 100 records to examine.
When you execute this code, you can look at the return value of A2 and find that it is a strange thing. You can’t see the data in the cursor when it is not fetched. It can’t be displayed anyway, so SPL just displays it as the name of the internal object.
However, what exactly does it mean to execute the select() function against the cursor? Will it traverse the data in the cursor once?
It should not. Otherwise, where will the qualified records be placed in the traversal process? As we said, it can’t be loaded in memory, shall it be put into the hard disk?
SPL does not do this and there is no need to do so. The select()function of the cursor does not actually do any substantive selection action. It just remembers that there is a selection action to be done on the cursor. Only when fetching data will SPL check that there is a select() action on the cursor, and only then will it really perform condition judgment to select records. After selecting the records required by fetch()(for example, 100 records here), it stops calculating and waits for the next fetch() to calculate again.
The cursor returned by the select() function is called a deferred cursor. It will perform substantive calculation only when the data is really fetched. This is different from the groups()just mentioned. groups() will immediately trigger the cursor traversal action, and the returned is no longer a cursor.
When learning a cursor function, you should know its return value and whether it is a deferred cursor.
Deferred cursor technology allows cursor related code to be written much like calculations of a record sequence. For example, we want to group and summarize orders with an amount of more than 5000 by area. If it is a record sequence, we know that we just need to select()and then groups(); The same is true for cursor:
It is exactly the same as the operation of a record sequence, and is very simple.
Therefore, we can usually not pay much attention to whether the cursors returned by these functions are deferred. It is simply understood that these functions are performing corresponding calculations, and there is generally no problem. However, we still need to know this mechanism, which is necessary when carefully analyzing the running results of some code.
However, cursor still has a key difference from record sequence.
After we execute select()to a certain record sequence, we will get a new record sequence, but the original record sequence will not change, and we can still continue to do various other calculations. The new cursor obtained after executing select() to a cursor is related to the original cursor. In the process of traversing the new cursor, the original cursor will also be traversed at the same time. When the new cursor is traversed and can not be reused, the original cursor will also be traversed and can not be reused. Conversely, traversing the original cursor will also affect the new cursor. After a new deferred cursor is obtained by executing a function on a cursor, in principle, the original cursor should not be used again, otherwise there will be very chaotic consequences.
Cursor and record sequence are essentially different. After all, there are great differences between memory and external storage. SPL just try to make them similar, so it will be more convenient to learn and write code.
Because the cursor cannot be accessed randomly, positioning is meaningless, so there are no functions like pselect()on the cursor. In addition, align() and enum() functions are often only related to small data, so there is no corresponding cursor version.
But the functions new(), run() and derive() are very meaningful. These are also deferred cursors.
For example, we add an average unit price field to orders with an order amount of more than 5000:
We can create another deferred cursor based on the deferred cursor. SPL will not calculate immediately, but marks down that there is a select()action and a derive() action on the cursor, which will be executed one by one when fetching.
new()and run() are similar. In fact, news() for expansion can also be executed on cursor. There are no more examples here.
Foreign key related switch()and join() are meaningful to cursor.
Assuming that each region will have its own tax rate, now we want to calculate the total number and amount of orders with tax exceeding 100 by month.
switch()also returns a deferred cursor. In the process of groups() calculation, only when the data is read in will the switch() be actually processed, that is, converting the foreign key field area into the record of the dimension table.
Similarly, we want to add a tax field to the orders with an amount of more than 5000 and write it to another file.
The export()function can write the cursor directly to the file, @c means to write it as a comma separated csv file, @t means to write the field names as the first line as the title. In A6, we first use the join() function to add the tax rate to the cursor data, and then use run to change the tax rate into tax. The functions of deferring cursor can be used at multiple steps, and the writing method is exactly the same as the record sequence in memory.