======================================================================
Title: Clojure performance considerations
Date: 2025-12-20
Tags: box3, clojure
Link: https://spool-five.com/box3/20251220t152030--clojure-performance-considerations__box3_clojure/
Word Count: 2422
======================================================================

One of the nice things about the Clojure programming language[1] is how flexible the built-in functions are. Like Python, Clojure is dynamically typed, so you can spend less time worrying about how your functions/methods handle types, and more time thinking about the actual data being worked on. In this way, it is more of a 'data-first' kind of language.

=>[1] https://clojure.org/

Clojure also has great built-in abstractions, meaning that core operations like map/filter/remove work on lots of data types. However, the downside of this kind of flexibility is that you can't necessarily rely on the compiler or type system to make sure you are doing things in the most performant way. A lot of the optimizations in the core library are hidden from the user. Again, this is good in that you can think about problems at a higher level and get to solutions very quickly. But, with programming, you always come up against cases where you do need to optimise and understand performance better. This note explores those kinds of considerations.

-A sample problem-

This is not a **real** problem, so the usual caveats apply. For this exercise, though, I'm going to imagine a task where you have some kind of list of words, and you have to:

1. Count the frequencies of letters that appear across the words
2. Count the frequencies after some kind of filter has been applied

In the first case, we are trying to find the best way to compute this, and in the second, we also want a solution that is somewhat 'composable' and reusable. We will start with some sample input.
-Dependencies & Helpers-

(require '[clojure.string :as str])

(defn time-taken [label expr]
  (let [start (. System (nanoTime))
        _result (eval expr)]
    (with-out-str
      (prn (str "=> " label ": "
                (/ (double (- (. System (nanoTime)) start)) 1000000.0)
                " msecs")))))

-Input Generator-

(def base-sentence "Pellentesque dapibus suscipit ligula Donec posuere augue in quam Etiam vel tortor sodales tellus ultricies commodo Suspendisse potenti Aenean in sem ac leo mollis blandit Donec neque quam dignissim in mollis nec sagittis eu wisi Phasellus lacus Etiam laoreet quam sed arcu Phasellus at dui in ligula mollis ultricies Integer placerat tristique nisl Praesent augue Fusce commodo Vestibulum convallis lorem a tempus semper dui dui euismod elit vitae placerat urna tortor vitae lacus Nullam libero mauris consequat quis varius et dictum id arcu Mauris mollis tincidunt felis Aliquam feugiat tellus ut neque Nulla facilisis risus a rhoncus fermentum tellus tellus lacinia purus et dictum nunc justo sit amet elit")

(defn generate-random-words [size]
  (let [words (->> (str/split base-sentence #" ")
                   (remove #(= % "")))]
    (take size (repeatedly #(rand-nth words)))))

(def small-word-set (generate-random-words 1000))
(def large-word-set (generate-random-words 1000000))

;; realise both these here, so they don't impact the timings later
small-word-set
large-word-set
nil

-Method 1 - The Naive Approach-

This problem is very easy to solve, especially given that there is a built-in function in Clojure for it: `frequencies`. But let's explore what kind of performance costs different approaches have. First, let's start with a pretty naive approach. This function just takes the words, converts them to lower case, joins them together and then calls the `frequencies` function.
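As a quick aside, it is worth seeing what `frequencies` does on its own, since every version below leans on it. A minimal REPL sketch (results shown as comments):

```clojure
;; `frequencies` takes anything seqable and returns a map of
;; element -> count. A string seqs into its characters:
(frequencies "banana")
;; => {\b 1, \a 3, \n 2}
```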
In order to 'join' the list of words, we could use either:

- `concat`: which would produce a sequence of characters
- `str`: which would produce a new, long string

(reduce concat ["word1" "word2"])
(w o r d 1 w o r d 2)

(reduce str ["word1" "word2"])
word1word2

Because Clojure is so flexible in terms of the types of sequences functions can accept, either a sequence of characters or a string can be given to `frequencies`:

(frequencies (reduce concat ["word1" "word2"]))
{w 2, o 2, r 2, d 2, 1 1, 2 1}

(frequencies (reduce str ["word1" "word2"]))
{w 2, o 2, r 2, d 2, 1 1, 2 1}

However, it is this very flexibility that I am trying to investigate here, because although many approaches will work, some may be better/more performant than others. And because Clojure is quite a high-level/abstract language, it is not always clear which approach to choose. Let's test out both on a longer sequence.

(defn count-letter-frequencies-v1 [join-fn input-words]
  (->> input-words
       (map str/lower-case)
       (reduce join-fn)
       (frequencies)))

(time-taken "Method 1 (Join with `str`)" '(count-letter-frequencies-v1 str small-word-set))
"=> Method 1 (Join with `str`): 12.1635 msecs"

(time-taken "Method 1 (Join with `concat`)" '(count-letter-frequencies-v1 concat small-word-set))
"=> Method 1 (Join with `concat`): 218.816541 msecs"

A huge difference! This is a first glimpse into why I am writing this reference note. There is nothing particularly obvious about why using `concat` would be so much less performant than using `str`, but it is. I can't fully explain the difference myself, but looking at the implementations, `concat` builds lazy sequences while `str` uses Java's StringBuilder. In particular, each call to `concat` wraps the previous result in a new lazy sequence, so `(reduce concat ...)` builds up deeply nested sequences that get slower and slower to walk, whereas StringBuilder simply appends onto a single mutable buffer.

Another very simple mistake I have made above concerns the order of the processing.
In plain terms, the function:

- Applies the `lower-case` function to each item in the list
- Joins all those items together
- Gets the frequencies of this single list of characters

There is an intermediate 'mapping' step included here which is not needed: the `(map str/lower-case)` step. Instead, why not just convert the final joined string to lower case?

(defn count-letter-frequencies-v1_1 [input-words]
  (->> input-words
       (reduce str)
       str/lower-case
       frequencies))

(time-taken "Method 1.1 (Reorder steps)" '(count-letter-frequencies-v1_1 small-word-set))
"=> Method 1.1 (Reorder steps): 2.069709 msecs"

This, again, is a faster method. A good heuristic when working with Clojure is to avoid 'intermediate' collections as much as possible. Clojure uses persistent data structures, so when we map over the list of words with `str/lower-case`, we are creating a new, intermediate collection that takes up memory.

For reference, here is what the actual output looks like:

(count-letter-frequencies-v1_1 small-word-set)
{a 446, b 40, c 248, d 171, e 601, f 47, g 82, h 23, i 590, j 5, l 467, m 272, n 258, o 290, p 119, q 90, r 236, s 494, t 414, u 528, v 57, w 17}

Unfortunately, neither of these methods can handle the longer list of words in a reasonable amount of time. So, we will try to move beyond these 'naive' implementations and think about alternative approaches in Clojure.

-Method 2 - Sequences of Sequences-

To frame the problem in a different way, we could say that at the abstract level we are dealing with a "sequence of sequences", i.e., a **recursive** problem. We have a list of words, but each word is itself a sequence. Since we already saw that the built-in `frequencies` function does a lot of the heavy lifting, we can try to use it on the 'inner' sequences as well. In other words, we will make a frequencies-map for each word, then merge those sub-maps together.
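To make the 'inner' step concrete, here is what the per-word maps look like on a tiny, made-up input (a REPL sketch; results shown as comments):

```clojure
(require '[clojure.string :as str])

;; One frequencies-map per word, lower-casing first:
(mapv (comp frequencies str/lower-case) ["Ad" "Dab"])
;; => [{\a 1, \d 1} {\d 1, \a 1, \b 1}]
```

These are the sub-maps that the next step has to combine into one overall count.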
Clojure has a function for this kind of joining operation, `merge-with`, which takes a merging function and any number of maps; where keys overlap, the values are combined with the function. In this case, we are just adding up the counts, so we can use `+`.

(defn count-letter-frequencies-v2 [input-words]
  (->> input-words
       ;; convert each word to lower-case and get the letter frequencies
       (mapv (comp frequencies str/lower-case))
       (apply (partial merge-with +))))

(time-taken "Method 2 (Small Word List)" '(count-letter-frequencies-v2 small-word-set))
"=> Method 2 (Small Word List): 1.968792 msecs"

Not necessarily much faster than the solution above at this small scale, but we now have a function that is capable of handling the larger word list.

(time-taken "Method 2 (Large Word List)" '(count-letter-frequencies-v2 large-word-set))
"=> Method 2 (Large Word List): 2824.748125 msecs"

It's still pretty slow, however, so let's see if there are any other methods we could try.

-Method 3 - Transducers-

When you are first learning Clojure, it takes a little while to get used to the function `reduce`. However, once it clicks, you start using it everywhere. Clojure's `transduce` is a little similar; while it might have taken me a week to get used to `reduce`, I'm still not quite used to `transduce` after many years! The best explanation I know of is from a Rich Hickey talk, where he describes a pie-making recipe: you take an apple from a batch, check it for imperfections (filter), chop it up, and then add it to the basket of good, chopped apple pieces. In real life, you would do all of these operations at once (pick up, check, chop, place in the container). In other words, you wouldn't (a) check all the apples (creating a new pile of 'good' apples), (b) slice up that pile (creating a new pile of sliced apples), then (c) place the slices in the final container.
In this latter case, you are creating a lot of intermediate, unnecessary piles of apples. Transducers are related to this intuition: they run each element through the whole processing pipeline in one pass, rather than producing a new intermediate sequence for each step. As I mentioned above, I can't say I have a very deep understanding of transducers, but here is my attempt to translate the above process into one.

;; Our 'recipe' for processing the words. Each 'sequence' function is a step
;; that will be composed together by the transducer.
;; Worth noting that this could also be written as:
;; (map (comp frequencies str/lower-case))
(def recipe
  (comp (map str/lower-case)
        (map frequencies)))

;; This is the thing we are 'collecting' the process into.
(def end-form (partial merge-with +))

(defn count-letter-frequencies-v3 [input-words]
  (transduce recipe end-form input-words))

(time-taken "Method 3 (Transduce)" '(count-letter-frequencies-v3 large-word-set))
"=> Method 3 (Transduce): 1085.506542 msecs"

It has similar performance to the method above, but it is more composable. For example, say we now only want to collect the frequencies of letters in words of length n, we could do something like:

(defn letter-frequencies-for-words-length-n [n]
  (comp (filter #(= (count %) n))
        (map str/lower-case)
        (map frequencies)))

(transduce (letter-frequencies-for-words-length-n 5) end-form large-word-set)
{a 82537, c 45819, d 18342, e 138203, f 18509, g 18402, i 55131, j 9131, l 55157, m 27535, n 46120, o 36610, p 9214, q 18586, r 27507, s 73468, t 45772, u 128774, v 18243}

-Method 4 - Transient Data Structures-

What if we just want to abandon all of this 'persistent' data structure stuff altogether? Clojure has a way to do this. In fact, the function that we have been calling a lot, `frequencies`, does this under the hood. We will try to do something similar as our last attempt.
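For reference, the implementation of `frequencies` in `clojure.core` is essentially the following (lightly reformatted from the source, and renamed here to `frequencies*` so it doesn't shadow the real one):

```clojure
;; clojure.core/frequencies: reduce into a transient map,
;; then make it persistent exactly once at the end.
(defn frequencies* [coll]
  (persistent!
   (reduce (fn [counts x]
             (assoc! counts x (inc (get counts x 0))))
           (transient {})
           coll)))
```

This is the pattern the next method borrows.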
This time, we are going to write the whole process very literally as one function (no more chaining/composing things together). This will make our function less reusable, but hopefully it will speed things up. The thinking here is that we take a word, add its letters to the 'final' stage immediately, and then move on to the next word. The important point is that this 'final' container will be a `transient` map the whole way through. This means that we don't care about what is happening 'inside' the map while the function is running (it can be stateful), and then we call `persistent!` on it at the end to get the final result.

(defn count-letter-frequencies-v4 [input-words]
  (->> input-words
       (reduce (fn [result word]
                 (reduce (fn [result letter]
                           ;; note that we aren't 'updating' the map here. `update` is
                           ;; a pure function, which has guarantees about the value
                           ;; inside the map. This doesn't work here with the transient
                           ;; map, so we have to create a new entry each time
                           (assoc! result letter (inc (get result letter 0))))
                         result
                         (str/lower-case word)))
               (transient {})) ;; create a transient map
       (persistent!))) ;; 'realise' the map

(time-taken "Method 4 (Transients)" '(count-letter-frequencies-v4 large-word-set))
"=> Method 4 (Transients): 200.558 msecs"

-Side-By-Side Comparisons-

Just to be sure that these are all working as intended, let's check that the outputs are equal.

(= (count-letter-frequencies-v1 str small-word-set)
   (count-letter-frequencies-v1_1 small-word-set)
   (count-letter-frequencies-v2 small-word-set)
   (count-letter-frequencies-v3 small-word-set)
   (count-letter-frequencies-v4 small-word-set))
true

Now, let's look at the times again, and try to format them as an org-table.

(defn mean [values]
  (/ (apply + values) (count values)))

(defn time-taken-value [expr]
  (let [start (. System (nanoTime))
        _result (eval expr)]
    (/ (double (- (. System (nanoTime)) start)) 1000000.0)))

(defn time-taken-n-times [n-times expr]
  (take n-times (repeatedly #(time-taken-value expr))))

(def medium-word-set (generate-random-words 50000))
medium-word-set

(defn compare-results [n-iterations]
  (let [tests [["Method 1 (Basic)" count-letter-frequencies-v1_1]
               ["Method 2 (Merge)" count-letter-frequencies-v2]
               ["Method 3 (Transducer)" count-letter-frequencies-v3]
               ["Method 4 (Transients)" count-letter-frequencies-v4]]]
    (->> tests
         (mapv (fn [[label test-fn]]
                 (str "|" label "|"
                      (mean (time-taken-n-times n-iterations `(~test-fn medium-word-set)))
                      "|")))
         (str/join "\n")
         (str "|Method (50k words)|Time (msecs)|\n"))))

(compare-results 5)
| Method (50k words)    | Time (msecs) |
|-----------------------|--------------|
| Method 1 (Basic)      | 718.3338416  |
| Method 2 (Merge)      | 46.7370168   |
| Method 3 (Transducer) | 51.1277918   |
| Method 4 (Transients) | 8.3595832    |
(def medium-large-word-set (generate-random-words 500000))
medium-large-word-set

(defn compare-results [n-iterations]
  (let [tests [["Method 2 (Merge)" count-letter-frequencies-v2]
               ["Method 3 (Transducer)" count-letter-frequencies-v3]
               ["Method 4 (Transients)" count-letter-frequencies-v4]]]
    (->> tests
         (mapv (fn [[label test-fn]]
                 (str "|" label "|"
                      (mean (time-taken-n-times n-iterations `(~test-fn medium-large-word-set)))
                      "|")))
         (str/join "\n")
         (str "|Method (500k words)|Time (msecs)|\n"))))

(compare-results 5)
| Method (500k words)   | Time (msecs)      |
|-----------------------|-------------------|
| Method 2 (Merge)      | 638.0600833999999 |
| Method 3 (Transducer) | 535.7018332       |
| Method 4 (Transients) | 86.6949416        |
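A note on methodology: single `System/nanoTime` samples around `eval` are noisy (JIT warm-up, GC pauses). For more rigorous numbers, a common choice is the third-party criterium library, assuming it is on the classpath; a sketch:

```clojure
;; Criterium runs a warm-up phase and many sample iterations,
;; reporting mean, std-deviation and quantiles rather than a
;; single noisy measurement.
(require '[criterium.core :refer [quick-bench]])

(quick-bench (count-letter-frequencies-v4 medium-word-set))
```

The relative ordering of the methods above should hold either way, but the absolute numbers from a proper benchmark harness are more trustworthy.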
-Take Away-

This was a very simple, and not very 'real-world', type of problem. But it was useful for me to think through the different approaches to solving it. In the end, the fastest solution was based on something already in the `clojure.core` library, the `frequencies` function. So, if there is one key takeaway, it's to look at the source code of the core library for inspiration when it comes to performance.
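Reading core source is easy to do from the REPL itself, via `clojure.repl/source`:

```clojure
(require '[clojure.repl :refer [source]])

;; Prints the clojure.core definition of `frequencies` --
;; the same transient-map pattern used in Method 4.
(source frequencies)
```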