SPL Programming - 10.1 [Association] Primary key
When we talked about the concept of structured data, we said that the fields of the data table have names, but the records have no names. So how do we identify the difference between one record and another? We know that each record corresponds to the information of one thing. Which is it? How to determine that the current referenced or operated record is the one corresponding to the expected thing?
It is often unreliable to use the order of a record in the table sequence to identify, because the sequence number will change with the insertion and deletion action.
In the examples of the previous chapter, most of the table sequences we created and referenced have a field with values that are unique in the whole table sequence, such as the id field in the personnel table or the dt field in the stock table. Structured data uses this unique field value to identify a record.
In a data table, if there are one or more fields and the values of any two records in this field (the sequence of values if multiple fields) are different, we can set this field (fields) as the primary key of the data table.
Note that the primary key is set manually and not found automatically. There may be multiple fields (or sets of fields) in a data table that meet the condition of primary key (uniqueness), but we will only select one (set) as the primary key. A primary key must be unique, but a unique field may not be a primary key.
The field value of the primary key of a record, also known as the primary key value of the record, is also referred to as the primary key when the context is unambiguous.
Because the primary key is unique, that is, the primary key value of a record will not be repeated in the whole data table, the primary key can be used to uniquely identify a record. We say that the current referenced or operated record is a record with a primary key of xxx, so there will only be one such record, and there will be no mistake.
For a data table with a primary key already set, in principle, it is not allowed to add new records with the same primary key as existing records. When inserting and deleting records, the primary key will not change. Using the primary key to identify a record is more stable than using the sequence number.
However, primary key is not required (none of the previous examples has one). A data table without a primary key usually only appends records instead of modifying, deleting existing records or inserting new records in front, because these actions will change the sequence number.
In fact, we are not unfamiliar with the concept of primary key in our daily life. All students in the school will have a unique student number. The attribute of student number itself has no effective information, and its function is only to uniquely identify a student. It will not work to directly use students’ names. People may often have duplicate names, so it is difficult to accurately identify a student. For things like orders, they don’t have a name. They need a (unique) number to identify them. Otherwise, we don’t know how to determine which order we are talking about. Other similar ones include flight number, bank account, telephone number, etc., which all play the role of primary key.
The “primary key” in daily life is almost everywhere. The primary key of structured data is only an embodiment of people’s daily experience in the programming language. The primary key is the name of the record. Setting the primary key is to give the record a (unique) name.
SPL provides some functions for primary key:
A2 uses the keys()function to set the primary key of the table sequence as the id field, and A3 uses key() to calculate the primary key of a record.
For the table sequence with primary key set, a key icon will be drawn on the primary key field when the values are displayed.
However, SPL does not check the uniqueness of the primary key when it is set. Whether the primary key is unique is controlled by the programmers themselves.
Using the grouping operation we have learned, it is easy to judge whether a field is unique in the data table:
Just see if the sequence after DISTINCT is as long as the original table sequence.
When inserting records, SPL does not check the uniqueness of the primary key. The following code will not report an error.
SPL only checks the uniqueness of the primary key when it is required. This has the advantage of better computing performance, and each check is a time-consuming action, but the disadvantage is that there may be potential errors. The principle of SPL fully believes in programmers and leaves freedom to programmers.
The main usage of the primary key is to identify records, that is, to find a record with the primary key value.
SPL uses the find() function to return the record whose primary key is its parameter. If it cannot be found, it will return null.
In the table sequence we are using now as an example, the primary key of records is arranged from small to large just like the sequence number. This is to facilitate the generation of a unique field (if they are generated randomly, it is necessary to compare whether they are repeated). However, SPL does not have this requirement for the primary key. It can be out of order:
A4 can also normally find the record with the parameter as primary key.
Since SPL does not check the uniqueness of the primary key, if there are records with the same primary key in the table sequence, the find()function will not report an error. It will find the first record and return, which is a bit like select@1().
By the way, find()is similar to pos(). It is not a loop function. Its parameter will be calculated first when being called, and there will be no ~, # and so on.
It is a common action to use the primary key to find a record. If the table sequence is a relatively large, the search speed will be slow, because this search usually needs to compare one by one.
SPL provides a mechanism to index the primary key, and the search speed will be much faster.
The index()function will create an index for the table sequence with a primary key, and the index will be automatically used by find() after the index is created.
When you execute this code and compare the running time before and after indexing, you will find that the performance gap is very large (but it also needs to be executed 10000 times and use interval@ms() to capture). After the index is provided, the method of comparison one by one is no longer used. The specific method is far beyond the content of this book, which will not be explained here.
SPL requires the uniqueness of the primary key when creating an index. When index() is executed, SPL checks whether the primary key is unique. If it is found that the primary key is not unique at this time, an error message of duplicate primary keys will be reported.
SPL seems a little lazy and won’t take the initiative to do things until necessary.
Unlike the operations mentioned earlier, primary key and index can only be created for a table sequence, not for a record sequence. They are part of the data structure.
However, find() can be used in a record sequence, only that it can’t use the index. It can only compare one by one.
As an aside, if the primary keys are in order, find()also supports a fast search technology called binary search, which is much faster than comparison one by one (close to the index, but still can’t catch up). It can be used for a record sequence without index or a table sequence that has not created an index. The select()function and pos() function also support this binary search. Interested readers can refer to the relevant documents of SPL, because high-performance computing is not the focus of this book (we didn’t talk about the principle of indexing just now), we won’t elaborate here.
Let’s review the grouping operation mentioned in the previous chapter. Observing the table sequence returned by groups(), it is obvious that the field corresponding to the grouping key value is unique, that is, this field can naturally constitute the primary key of the table sequence returned by groups(). Therefore, the field value used for grouping is called key. Careful readers may have found that when viewing the returned result of groups(), the small key icon has been drawn on the grouping key.