"It is the very inconspicuous account de-duplication count, written in SQL as COUNT (DISTINCT…). .."

mars RaqForum 25 No.
1 Reply • 207 View • 5 Months ago

A major culprit in the slow running and collapse of a database

It is the very inconspicuous account de-duplication count, written in SQL as COUNT (DISTINCT…).

Account de-duplication count is common and has important business significance in business analysis. The account here may be a user ID, bank account, phone number, license plate number…. The calculation logic is basically the same, which is to calculate how many accounts meet a certain condition from the historical data of a certain period of time.

For example, how many cars have been to NewYork last month? How many phones had calls between 2:00 am and 4:00 am last week? How many bank accounts have received overseas remittances this year? …

Historical data usually only records events that have occurred, such as a car appearing in a certain place at a certain time, a phone talking to someone at a certain time, a bank account that has been transferred in or out on a certain date, etc…. Directly filtering historical records using specified criteria will result in many records with the same account (a certain car may have been to NewYork multiple times,…), but it can only be counted once, so de-duplication needs to be done.

The filtering criteria in COUNT (DISTINCT…) are often not as simple as in the examples above. For example, how many credit card accounts have had weekly total consumption exceeding 1000 dollars this year? It is necessary to first calculate the weekly consumption amount of the account before filtering; How many accounts have consumed over 100 dollars for three consecutive days in the past month? More complex existence judgments are required for account transactions. However, regardless of the condition, it ultimately boils down to calculating COUNT (DISTINCT).

The famous e-commerce funnel statistics are such complex COUNT (DISTINCT), but it will be more complex. Funnel statistics involve multiple sequential events, with each step corresponding to a COUNT (DISTINCT), in order to calculate the customer churn rate for this step together with the previous COUNT (DISTINCT); The next step of COUNT (DISTINCT) should be filtered based on the previous step. And consider the order in which events occur. The entire process is quite complex, but essentially it is still doing COUNT (DISTINCT).

COUNT (DISTINCT) has always been a difficult problem in database computing, usually very slow. If the data volume is large (with a large number of accounts, which is also the norm), it may also lead to database crashes.

Why is this?

Because COUNT (DISINCT) requires a large amount of computation. COUNT (DISTINCT ID) needs to save the traversed different IDs in a list, and the next ID needs to be compared with this list to determine if it is new, in order to determine whether to add a count value and add it to the list. The ordinary COUNT (ID) does not need to save the traversed ID, and of course, it does not need to be compared, which is obviously much simpler. The position of COUNT (DISTINCT) in SQL may seem similar to that of COUNT and SUM, but its computing speed is much slower.

Moreover, many databases store the above ID list in memory when calculating COUNT (DISTINCT ID), which allows for high-speed access and comparison. However, if the number of accounts is large, memory may not be able to hold it, leading to inevitable crashes. If you cache this ID list to external storage, although it can avoid crashing, it is not convenient to access and compare, and performance will further decrease sharply.

Fortunately, the ID in COUNT (DISTINCT ID) calculation is usually just one column of data, and at the limit of 100GB of memory, it may hold billions of IDs, exceeding the number of accounts in most applications. Therefore, although the conventional COUNT (DISTINCT) calculation is slower, it is not likely to crash frequently.

But if the situation is more complex, it may not be certain. For example, funnel statistics will have multiple mixed COUNTs (DISTINCT), and the SQL will have nested JOINs. In this case, if you want to run faster, it will occupy much more memory (JOINs can also cause a serious contradiction between memory and performance), and the probability of crash will increase sharply.

The following is a three-step funnel analysis of a practical scenario implemented in SQL:

WITH e1 AS (
	SELECT userid, visittime AS step1_time, MIN(sessionid) AS sessionid, 1 AS step1
	FROM defined_events e1 JOIN eventgroup ON eventgroup.id = e1.eventgroup
	WHERE visittime >= DATE_ADD(arg_date,INTERVAL -14 day) AND visittime < arg_date AND eventgroup.name='SiteVisit'
	GROUP BY userid,visittime
), e2 AS (
	SELECT e2.userid, MIN(e2.sessionid) AS sessionid, 1 AS step2, MIN(visittime) AS step2_time, MIN(e1.step1_time) AS step1_time
	FROM defined_events e2 JOIN e1 ON e1.sessionid = e2.sessionid AND visittime > step1_time JOIN eventgroup ON eventgroup.id = e2.eventgroup
	WHERE visittime < DATE_ADD(step1_time ,INTERVAL +1 day) AND eventgroup.name = 'ProductDetailPage'
	GROUP BY e2.userid
), e3 AS (
	SELECT e3.userid, MIN(e3.sessionid) AS sessionid, 1 AS step3, MIN(visittime) AS step3_time, MIN(e2.step1_time) AS step1_time
	FROM defined_events e3 JOIN e2 ON e2.sessionid = e3.sessionid AND visittime > step2_time
	JOIN eventgroup ON eventgroup.id = e3.eventgroup
	WHERE visittime < DATE_ADD(step1_time ,INTERVAL +1 day) AND (eventgroup.name = 'OrderConfirmationType1')
	GROUP BY  e3.userid
)
SELECT s.devicetype AS devicetype,
	COUNT(DISTINCT CASE WHEN funnel_conversions.step1 IS NOT NULL THEN funnel_conversions.step1_userid  ELSE NULL END) AS step1_count,
	COUNT(DISTINCT CASE WHEN funnel_conversions.step2 IS NOT NULL THEN funnel_conversions.step2_userid  ELSE NULL END) AS step2_count,
	COUNT(DISTINCT CASE WHEN funnel_conversions.step3 IS NOT NULL THEN funnel_conversions.step3_userid  ELSE NULL END) AS step3_count,
	COUNT(DISTINCT CASE WHEN funnel_conversions.step3 IS NOT NULL THEN funnel_conversions.step3_userid  ELSE NULL END) 
		/ COUNT(DISTINCT CASE WHEN funnel_conversions.step1 IS NOT NULL THEN funnel_conversions.step1_userid  ELSE NULL END) AS step3_rate
FROM (
	SELECT e1.step1_time AS step1_time, e1.userid AS userid, e1.userid AS step1_userid, e2.userid AS step2_userid,e3.userid AS step3_userid,
		e1.sessionid AS step1_sessionid, step1, step2, step3
	FROM e1	LEFT JOIN e2 ON e1.userid=e2.userid LEFT JOIN e3 ON e2.userid=e3.userid) funnel_conversions
LEFT JOIN sessions s ON funnel_conversions.step1_sessionid = s.id 
GROUP BY s.devicetype

It can be seen that there are not only multiple COUNTs (DISTINCT) here, but also multiple self-join subqueries to achieve complex funnel step judgments. This SQL statement did not have a result after running for three minutes in the medium standard cluster (4-node) of Snowflake.

Then, how can we solve this annoying COUNT (DISTINCT)?

Actually, it’s not difficult, as long as the data is sorted by account, COUNT (DISINCT) is easy to calculate.

When the data is ordered by ID, then calculate COUNT (DISTINCT ID), simply save the value of the previous ID (one value). If the next ID is equal to the previous ID, the count of the current ID will be increased. If it is different, the saved ID will be replaced and clear the count. There is no need to search and compare in a large list, and the memory occupied when saving only one ID can be almost negligible. The calculation is fast and it is impossible to collapse.

For complex tasks such as funnel analysis, there is also no problem. When the data is ordered by ID, one ID of data can be read into memory each time, and complex calculations can be conveniently carried out without involving data from other IDs. In principle, It’s OK as long as the memory can hold data from one ID, and there is no problem saving multiple sets of COUNT (DISTINCT) counts.

Unfortunately, relational databases and SQL cannot do this.

As the theoretical foundation of relational databases, relational algebra is based on unordered sets. In SQL, set members (records of tables) have no order, and databases do not theoretically support ordering when storing data. The above optimization logic cannot be implemented in SQL.

esProc SPL can!

Strictly speaking, esProc SPL is not a database, but a professional computing engine. It no longer adopts relational algebra, but has created its own discrete dataset theory based on ordered sets and invented a new programming language SPL. esProc deliberately supports ordering when storing data, and SPL also provides rich methods for ordered calculations, which enables high-performance and low resource implementation of COUNT (DISTINCT) calculation.

The icount function of SPL is used for COUNT (DISTINCT) calculation, and by default, it will be implemented using the aforementioned method, which maintains a list of traversed different IDs, and the next ID will be compared to the list. If the data is ordered by ID, then use icount@o, in this case, SPL will adopt the ordered COUNT (DISTINCT) algorithm, which only maintains the previous ID value for comparison.

In particular, SPL also supports ordered cursor. When data is ordered by ID, a batch of data with the same ID can be fetched each time, and complex calculations can be performed to determine whether the current ID meets the filtering criteria. The subsequent count can be implemented directly using COUNT, without the need to maintain the previous ID and the action of comparison (which has already been compared during cursor retrieval).

Funnel analysis can be implemented using this mechanism:

	A
1	=now()
2	=eventgroup=file("eventgroup.btx").import@b()
3	=devicetype=file("devicetype.btx").import@b()
4	=long(elapse(arg_date,-14))
5	=long(arg_date)
6	=long(arg_date+1)
7	=A2.(case(NAME,"SiteVisit":1,"ProductDetailPage":2,"OrderConfirmationType1":3;null))
8	=file("defined_events.ctx").open()
9	=A8.cursor@m(USERID,SESSIONID,VISITTIME,EVENTGROUPNO;VISITTIME>=A4 && VISITTIME
10	=sessions=file("sessions.ctx").open().cursor@m(USERID,ID,DEVICETYPENO;;A9)
11	=A9.joinx@m(USERID:SESSIONID,A10:USERID:ID,DEVICETYPENO)
12	=A11.group(USERID)
13	=A12.new(~.align@a(3,EVENTGROUPNO):e,e(1).select(VISITTIME
14	=A13.run(e=join@m(e1:e1,SESSIONID;e2:e2,SESSIONID).select( e2=e2.select(VISITTIME>e1.VISITTIME && VISITTIME
15	=A14.run(e0=e1.id(DEVICETYPENO),e1=e.min(e1.VISITTIME),e2=e.min(e2),e=e.min(e1.SESSIONID),e3=e3.select(SESSIONID==e && VISITTIME>e2 && VISITTIME
16	=A15.news(e;~:DEVICETYPE,e2,e3)
17	=A16.groups(DEVICETYPE;count(1):STEP1_COUNT,count(e2):STEP2_COUNT,count(e3):STEP3_COUNT,null:STEP3_RATE)
18	=A17.run(DEVICETYPE=devicetype.m(DEVICETYPE).DEVICETYPE,STEP3_RATE=STEP3_COUNT/STEP1_COUNT)
19	=interval@s(A1,now())

(The SPL code is written in a grid, which is very different from ordinary programming languages. Please refer to here: A programming language coding in a grid ）

A12 reads the data of one USERID each time for subsequent judgment, and when calculating in A17, the count function is directly used without the need for an icount. This code is not only more concise and versatile (to do more funnel steps, just change A7, while SQL code needs to add many sub queries), but also runs faster. Using an EC2 with the same specification of Snowflake, it can be completed in just 10 seconds on a single machine.

In fact, COUNT (DISTINCT) is just a manifestation, and the essence of this problem is to group data by an account basis and then process it. Grouping and aggregation by account is a similar operation, and some aggregations are not simple SUM/MAX, and sometimes require a complex process to calculate. For example, calculating the number of times each phone has a call duration less than 3 seconds; Calculating the newly added points for last month’s credit card account, and the rule is that if you spend more than 100 dollars for three consecutive days, the points for these days will be doubled; …

Almost all event data in information systems is hung under a certain account, so this type of operation is very common and can be encountered in various queries and batch jobs. It can be said to be one of the most common business logic models. With esProc SPL’s ordered calculations based on ordered storage, this large class of problems can be implemented concisely and with high performance, while they are very difficult for relational databases in SQL system.

Here is a variant example, a spatiotemporal collision problem, to identify the top 20 mobile phones that have appeared the most frequently in the same time period and location as a specified mobile phone, with a data scale of approximately 25 billion rows. The SQL is written as follows:

WITH DT AS ( SELECT DISTINCT id, ROUND(tm/900)+1 as tn, loc FROM T WHERE tm<3*86400)
SELECT * FROM (
    SELECT B.id id, COUNT( DISINCT B.tn ) cnt
    FROM DT AS A JOIN DT AS B ON A.loc=B.loc AND A.tn=B.tn
    WHERE A.id=a AND B.id<>a
    GROUP BY id )
ORDER BY cnt DESC
LIMIT 20

There are nested DISTINCT operations and self-JOINs here, and a single node ClickHouse crashes directly and a 5-node cluster takes more than 30 minutes to get the result.

The SPL code utilizes ordered storage and the previously mentioned ordered cursor to effectively avoid these difficulties, getting the result in less than 6 minutes using just one node.

	A
1	=now()
2	>NL=100000,NT=3*96
3	=file("T.ctx").open()
4	=A3.cursor(tm,loc;id==a).fetch().align(NLNT,(loc-1)NT+tm\900+1)
5	=A3.cursor@mv(;id!=a && A4((loc-1)*NT+tm\900+1))
6	=A5.group@s(id;icount@o(tm\900):cnt).total(top(-20;cnt))
7	=interval@ms(A1,now())

Careful readers may find that the effectiveness of the esProc SPL algorithm depends on the order of data by ID, and the order of data generation is usually not the ID, but the time. Then, can this algorithm only be applied to previously sorted historical data, and become invalid for new data that cannot be sorted together in time?

esProc has taken this into account, and SPL’s multi-zone composite table can achieve incremental sorting when data enters, ensuring that the data is sorted by ID in real-time when read, allowing this ordered calculation scheme to be applied to the latest data. Moreover, such operations usually involve time intervals, and SPL’s pseudo table supports a two-dimensional ordering mechanism, which can quickly filter out data outside the time interval and further improve computational performance.

esProc is a pure Java software that can perform operations in any JVM environment and can be seamlessly embedded into Java programs, giving the computing power of a data warehouse to applications in various scenarios in a very lightweight manner.

esProc provides a visual development environment that supports single step execution, breakpoint setting, and WYSIWYG result preview. Developing and debugging is much more convenient than SQL and stored procedures.

SPL also has comprehensive process control statements, such as for loops and if branches, and supports subroutine calls. It has the procedural ability only available in stored procedures, and can comprehensively replace SQL and stored procedures.

…

Finally, esProc SPL is open source and free. It is here https://github.com/SPLWare/esProc.

SPL Official Website 👉 http://www.scudata.com

SPL Feedback and Help 👉 https://www.reddit.com/r/esProc

SPL Learning Material 👉 http://c.scudata.com

SPL Source Code and Package 👉 https://github.com/SPLWare/esProc

Discord 👉 https://discord.gg/ydhVnFH9

Youtube 👉 https://www.youtube.com/@esProc_SPL

TalkingData

mars • 207 View • 5 Months ago