Clojure set with custom equality semantics

Sets are collections of unique values.

To be able to eliminate the multiplicates, the set needs to know when two values are equal.

Sets use the standard Clojure equality semantics, meaning that if two values are =, then it removes the multiplicates.

Example:

(= :foo :foo)
;; #=> true

#{:foo :bar :foo}
;; #=> #{:foo :bar}
;; One of the two :foo values is removed because they are equal (`=`)

But what if you want to have a set with custom equality semantics?

Say you have a list of person entities, where each person has a :person/id attribute that identifies the person. When you put the person entities in a set, you want entities with unique :person/ids, no matter what other data the entity contains.

Example:

#{{:person/id 1
   :person/firstName "Mikko"}
  {:person/id 2
   :person/firstName "John"
   :person/lastName "Doe"}
  {:person/id 1
   :person/lastName "Koski"}} 
;; #=> #{#:person{:id 1, :firstName "Mikko"}
;;       #:person{:id 1, :lastName "Koski"}
;;       #:person{:id 2, :firstName "John", :lastName "Doe"}}

;; Note that there are two entities with `:person/id` 1, which is not what we want.

How could we create a set that uses the :person/id attribute to decide if two values are equal?

Custom equality with sorted-set-by

Sets are unordered data structures, but you can create ordered sets with sorted-set. Using sorted-set-by, you can have a sorted set with your own comparator for sorting.

The sorted-set has another interesting property: In addition to the sort order, the comparator also defines the equality semantics, meaning that if the comparator thinks that two values are equal (i.e., the comparator returns 0), then the set threats them as equal values, that is, it removes multiplicates.

So for example, we can make a comparator that compares only the :person/id attribute and return 0 for two maps with the same :person/id, regarless of the other keys and values in the map:

(defn compare-id [x y]
  (compare (:person/id x) (:person/id y)))
  
(compare-id
 {:person/id 1
  :person/firstName "Mikko"}
 {:person/id 1
  :person/lastName "Koski"})
;; #=> 0
;; Zero means the values are equal from comparator point-of-view

And if we use compare-id as the comparator with sorted-set-by, we get the result we wanted:

(def s
  (sorted-set-by
   compare-id
   {:person/id 1
    :person/firstName "Mikko"}
   {:person/id 2
    :person/firstName "John"}
   {:person/id 1
    :person/lastName "Koski"}))

s
;; #=> #{#:person{:id 1, :firstName "Mikko"} 
         #:person{:id 2, :firstName "John"}}

Do sets with custom equality work with disj, contains? and get?

Yes, they do.

These functions are documented as follows in Clojure reference:

Sets support 'removal' with disj, as well as contains? and get, the latter returning the object that is held in the set which compares equal to the key, if found:

-- Clojure - Data Structures

Here's an example that demonstrates that everything works as documented:

(disj s {:person/id 1 :person/lastName "Koski"})
;; #=> #{#:person{:id 2, :firstName "John"}}

(contains? s {:person/id 1 :person/lastName "Koski"})
;; #=> true

(get s {:person/id 1 :person/lastName "Koski"})
;; #=> #:person{:id 1, :firstName "Mikko"}

What about clojure.set operators?

So now that we've set up our set of person entities with custom equality by :person/id and tested that the essential functions disj, contains? and get work as expected, the next question is, do the set operations provided by the clojure.set namespace work with sets with custom equality?

The answer is yes, but with one caveat:

Do not mix different set types.

If we look at the source code of clojure.set, we can see that the functions don't create new sets for the return values; instead, they reuse the ones given as parameters. This is good news because we don't want to lose our custom equality semantics after using the clojure.set operations.

However, the functions in clojure.set use count of the given sets to determine which one will be used as a base for the return value. You shouldn't rely on this logic, so it's safest to use two sets of the same type.

Let's take an example.

First, let's call clojure.set/intersection with a sorted set of two values and a normal unsorted set of one value:

(def result-when-sorted-set-has-more-values
  (clojure.set/intersection
   (sorted-set-by
    compare-id
    {:person/id 1
     :person/firstName "Mikko"}
    {:person/id 2
     :person/firstName "John"})
   #{{:person/id 2
      :person/lastName "Doe"}}))

result-when-sorted-set-has-more-values
;; #=> #{#:person{:id 2, :lastName "Doe"}}

(sorted? result-when-sorted-set-has-more-values)
;; #=> false

(contains? result-when-sorted-set-has-more-values {:person/id 2})
;; #=> false

In the example above, the returned set has the value we expected but is not sorted, nor does it use the equality by :person/id.

Next, let's have a sorted set of one value and an unsorted set of two values:

(def result-when-unsorted-set-has-more-values
  (clojure.set/intersection
   (sorted-set-by
    compare-id
    {:person/id 2
     :person/firstName "John"})
   #{{:person/id 1
      :person/firstName "Mikko"}
     {:person/id 2
      :person/lastName "Doe"}}))

result-when-unsorted-set-has-more-values
;; #=> #{}

(sorted? result-when-unsorted-set-has-more-values)
;; #=> true

(contains? result-when-unsorted-set-has-more-values {:person/id 2})
;; #=> false

Now we did got back a sorted set, but the set didn't have any values, which is not what we wanted.

To get the correct result and correct set type, we have to use the same set type for both arguments:

(def result-when-both-sets-are-sorted
  (clojure.set/intersection
   (sorted-set-by
    compare-id
    {:person/id 2
     :person/firstName "John"})
   (sorted-set-by
    compare-id
    {:person/id 1
     :person/firstName "Mikko"}
    {:person/id 2
     :person/lastName "Doe"})))

result-when-both-sets-are-sorted
;; #=> #{#:person{:id 2, :firstName "John"}}

(sorted? result-when-both-sets-are-sorted)
;; #=> true

(contains? result-when-both-sets-are-sorted {:person/id 2})
;; #=> true

Conclusion

Using the sorted-set-by with custom equality semantics was a fun thing to learn, but I'm still unsure if there are any real world use-cases out there where this would be actually useful.

In the examples above, I had two sets of person entities with slightly different attributes, and the aim was to eliminate the duplicates. In reality, I'd probably do that with something like distinct-by (if it existed), or I'd use group-by and then map the vals with merge to get distinct entities with the attributes merged, which is probably what I'd want anyway:

(->> 
 [{:person/id 1
  :person/firstName "Mikko"}
  {:person/id 2
   :person/firstName "John"}
  {:person/id 1
   :person/lastName "Koski"}]
 (group-by :person/id)
 vals
 (map #(apply merge %)))
;; #=> (#:person{:id 1, :firstName "Mikko", :lastName "Koski"}
;;      #:person{:id 2, :firstName "John"})

But maybe if I'd need to do actual set operations (difference, intersection etc.) with the two sets of entities, then I'd maybe use the sorted-set-by with custom equality. Maybe.

What do you think? Is there any real-world use for this?

Please let me know and comment on Mastodon!