Find Difference between Every Two Subsets

Question

Suppose I have a MySQL database in 1NF with the following tables:

sandwich\_id | sandwich\_name | sandwich_price

 

\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_

 

0| BLT | 5.5

 

1| Reuben | 7.0

 

3| Grilled Cheese | 3.75

 

...

 

And a separate table that stores all the ingredient values:

 

sandwich_id | ingredient

 

\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_

 

0| bacon

 

0| lettuce

 

0| tomato

 

1| corned beef

 

1| swiss cheese

 

...

 

How can I compare all the sandwiches by their ingredients to determine which are the most similar?

(Also, is there a technical term for that second table that I’m missing? I want to call it a map table, but I know that’s not quite right, since a map table stores foreign keys for two tables and this one’s more of an offshoot of the first…)

 

Answer

For your question, we just need to find the intersection between every two sets of ingredients for different sandwiches, count members of every intersection, and sort the counts. Since SQL lacks explicit set data type, the code is difficult to understand. It’s convenient to handle this in SPL (Structured Process Language):

A

1

$select i.sandwich_id   sandwich_id, i.ingredient ingredient, s.sandwich_name sandwich_namefrom   ingredientTable i,sandwichTable s where i.sandwich_id=s.sandwich_id

2

=A1.group(sandwich_id;sandwich_name,~.(ingredient):collection)

3

=xjoin(A2;A2).select(_1.sandwich_id<_2.sandwich_id)

4

=A3.new((_1.collection ^   _2.collection).len():sameCount,_1.sandwich_name,_2.sandwich_name)

5

=A4.sort(-sameCount)

 

A2: Group A1’s table sequence by sandwich_id. collection contains the set of ingredients for each sandwich.

A3: Group sandwiches every two of them. For example, 0,1,3 can be made 3 groups: [0,1],[0,3],[1,3].

A4: Find the number of members in each intersection result. The sign “^” means getting intersection.

A5: Sort A4’s result in descending order.

See The Set-oriented SPL to learn more about explicit sets.