======================================================================
Title: Clojure performance considerations
Date: 2025-12-20
Tags: box3, clojure
Link: https://spool-five.com/box3/20251220t152030--clojure-performance-considerations__box3_clojure/
Word Count: 2422
======================================================================

One of the nice things about the Clojure programming language[1] is how flexible the built-in functions are. Like Python, Clojure is dynamically typed, so you can spend less time worrying about how your functions/methods handle types, and more time thinking about the actual data being worked on. In this way, it is more of a 'data-first' kind of language.

=>[1] https://clojure.org/

Clojure also has great built-in abstractions, meaning that core operations like map/filter/remove work on lots of data types. However, the downside of this kind of flexibility is that you can't necessarily rely on the compiler or type system to make sure you are doing things in the most performant way. A lot of the optimizations in the core library are hidden from the user. Again, this is good in that you can think about problems at a higher level and get to solutions very quickly. But, with programming, you always come up against cases where you do need to optimise and understand performance better. This note explores those kinds of considerations.

-A sample problem-

This is not a **real** problem, so the usual caveats apply. For this exercise, though, I'm going to imagine a task where you have some kind of list of words, and you have to:

1. Count the frequencies of letters that appear across the words
2. Count the frequencies after some kind of filter has been applied

In the first case, we are trying to find the best way to compute this, and in the second, we also want a solution that is somewhat 'composable' and reusable. We will start with some sample input.
-Dependencies & Helpers-

(require '[clojure.string :as str])

(defn time-taken [label expr]
  (let [start (. System (nanoTime))
        _result (eval expr)]
    (with-out-str
      (prn (str "=> " label ": "
                (/ (double (- (. System (nanoTime)) start)) 1000000.0)
                " msecs")))))

-Input Generator-

(def base-sentence "Pellentesque dapibus suscipit ligula Donec posuere augue in quam Etiam vel tortor sodales tellus ultricies commodo Suspendisse potenti Aenean in sem ac leo mollis blandit Donec neque quam dignissim in mollis nec sagittis eu wisi Phasellus lacus Etiam laoreet quam sed arcu Phasellus at dui in ligula mollis ultricies Integer placerat tristique nisl Praesent augue Fusce commodo Vestibulum convallis lorem a tempus semper dui dui euismod elit vitae placerat urna tortor vitae lacus Nullam libero mauris consequat quis varius et dictum id arcu Mauris mollis tincidunt felis Aliquam feugiat tellus ut neque Nulla facilisis risus a rhoncus fermentum tellus tellus lacinia purus et dictum nunc justo sit amet elit")

(defn generate-random-words [size]
  (let [words (->> (str/split base-sentence #" ")
                   (remove #(= % "")))]
    (take size (repeatedly #(rand-nth words)))))

(def small-word-set (generate-random-words 1000))
(def large-word-set (generate-random-words 1000000))

;; realise both these here, so they don't impact the timings later
small-word-set
large-word-set
nil

-Method 1 - The Naive Approach-

This problem is very easy to solve, especially given that there is a built-in function in Clojure for it: `frequencies`. But let's explore what kind of performance costs different approaches have. First, let's start with a pretty naive approach. This function just takes the words, converts them to lower case, joins them together and then calls the `frequencies` function.
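As a quick aside, it is worth seeing what `frequencies` does on its own, since every version below leans on it. A minimal REPL sketch (results shown as comments):

```clojure
;; `frequencies` takes anything seqable and returns a map of
;; element -> count. A string seqs into its characters:
(frequencies "banana")
;; => {\b 1, \a 3, \n 2}
```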
In order to 'join' the list of words, we could use either:

- `concat`: which would produce a sequence of characters
- `str`: which would produce a new, long string

(reduce concat ["word1" "word2"])
(w o r d 1 w o r d 2)

(reduce str ["word1" "word2"])
word1word2

Because Clojure is so flexible in terms of the types of sequences functions can accept, either a sequence of characters or a string can be given to `frequencies`:

(frequencies (reduce concat ["word1" "word2"]))
{w 2, o 2, r 2, d 2, 1 1, 2 1}

(frequencies (reduce str ["word1" "word2"]))
{w 2, o 2, r 2, d 2, 1 1, 2 1}

However, it is this very flexibility that I am trying to investigate here, because although many approaches will work, some may be better/more performant than others. And because Clojure is quite a high-level/abstract language, it is not always clear which approach to choose. Let's test out both on a longer sequence.

(defn count-letter-frequencies-v1 [join-fn input-words]
  (->> input-words
       (map str/lower-case)
       (reduce join-fn)
       (frequencies)))

(time-taken "Method 1 (Join with `str`)" '(count-letter-frequencies-v1 str small-word-set))
"=> Method 1 (Join with `str`): 12.1635 msecs"

(time-taken "Method 1 (Join with `concat`)" '(count-letter-frequencies-v1 concat small-word-set))
"=> Method 1 (Join with `concat`): 218.816541 msecs"

A huge difference! This is a first glimpse into why I am writing this reference note. There is nothing particularly obvious about why using `concat` would be so much less performant than using `str`, but it is. I can't fully explain the difference myself, but looking at the implementations, `concat` builds lazy sequences while `str` uses Java's StringBuilder. In particular, each call to `concat` wraps the previous result in a new lazy sequence, so `(reduce concat ...)` builds up deeply nested sequences that get slower and slower to walk, whereas StringBuilder simply appends onto a single mutable buffer.

Another very simple mistake I have made above concerns the order of the processing.
In plain terms, the function:

- Applies the `lower-case` function to each item in the list
- Joins all those items together
- Gets the frequencies of this single list of characters

There is an intermediate 'mapping' step included here which is not needed: the `(map str/lower-case)` step. Instead, why not just convert the final joined string to lower case?

(defn count-letter-frequencies-v1_1 [input-words]
  (->> input-words
       (reduce str)
       str/lower-case
       frequencies))

(time-taken "Method 1.1 (Reorder steps)" '(count-letter-frequencies-v1_1 small-word-set))
"=> Method 1.1 (Reorder steps): 2.069709 msecs"

This, again, is a faster method. A good heuristic when working with Clojure is to avoid 'intermediate' collections as much as possible. Clojure uses persistent data structures, so when we map over the list of words with `str/lower-case`, we are creating a new, intermediate collection that takes up memory.

For reference, here is what the actual output looks like:

(count-letter-frequencies-v1_1 small-word-set)
{a 446, b 40, c 248, d 171, e 601, f 47, g 82, h 23, i 590, j 5, l 467, m 272, n 258, o 290, p 119, q 90, r 236, s 494, t 414, u 528, v 57, w 17}

Unfortunately, neither of these methods can handle the longer list of words in a reasonable amount of time. So, we will try to move beyond these 'naive' implementations and think about alternative approaches in Clojure.

-Method 2 - Sequences of Sequences-

To frame the problem in a different way, we could say that at the abstract level we are dealing with a "sequence of sequences", i.e., a **recursive** problem. We have a list of words, but each word is itself a sequence. Since we already saw that the built-in `frequencies` function does a lot of the heavy lifting, we can try to use it on the 'inner' sequences as well. In other words, we will make a frequencies-map for each word, then merge those sub-maps together.
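To make the 'inner' step concrete, here is what the per-word maps look like on a tiny, made-up input (a REPL sketch; results shown as comments):

```clojure
(require '[clojure.string :as str])

;; One frequencies-map per word, lower-casing first:
(mapv (comp frequencies str/lower-case) ["Ad" "Dab"])
;; => [{\a 1, \d 1} {\d 1, \a 1, \b 1}]
```

These are the sub-maps that the next step has to combine into one overall count.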
Clojure has a function for this kind of joining operation, `merge-with`, which takes a merging function and any number of maps; where keys overlap, the values are combined with the function. In this case, we are just adding up the counts, so we can use `+`.

(defn count-letter-frequencies-v2 [input-words]
  (->> input-words
       ;; convert each word to lower-case and get the letter frequencies
       (mapv (comp frequencies str/lower-case))
       (apply (partial merge-with +))))

(time-taken "Method 2 (Small Word List)" '(count-letter-frequencies-v2 small-word-set))
"=> Method 2 (Small Word List): 1.968792 msecs"

Not necessarily much faster than the solution above at this small scale, but we now have a function that is capable of handling the larger word list.

(time-taken "Method 2 (Large Word List)" '(count-letter-frequencies-v2 large-word-set))
"=> Method 2 (Large Word List): 2824.748125 msecs"

It's still pretty slow, however, so let's see if there are any other methods we could try.

-Method 3 - Transducers-

When you are first learning Clojure, it takes a little while to get used to the function `reduce`. However, once it clicks, you start using it everywhere. Clojure's `transduce` is a little similar; while it might have taken me a week to get used to `reduce`, I'm still not quite used to `transduce` after many years! The best explanation I know of is from a Rich Hickey talk, where he describes a pie-making recipe: you take an apple from a batch, check it for imperfections (filter), chop it up, and then add it to the basket of good, chopped apple pieces. In real life, you would do all of these operations at once (pick up, check, chop, place in the container). In other words, you wouldn't (a) check all the apples (creating a new pile of 'good' apples), (b) slice up that pile (creating a new pile of sliced apples), then (c) place the slices in the final container.
In this latter case, you are creating a lot of intermediate, unnecessary piles of apples. Transducers are related to this intuition: they run each element through the whole processing pipeline in one pass, rather than producing a new intermediate sequence for each step. As I mentioned above, I can't say I have a very deep understanding of transducers, but here is my attempt to translate the above process into one.

;; Our 'recipe' for processing the words. Each 'sequence' function is a step
;; that will be composed together by the transducer.
;; Worth noting that this could also be written as:
;; (map (comp frequencies str/lower-case))
(def recipe
  (comp (map str/lower-case)
        (map frequencies)))

;; This is the thing we are 'collecting' the process into.
(def end-form (partial merge-with +))

(defn count-letter-frequencies-v3 [input-words]
  (transduce recipe end-form input-words))

(time-taken "Method 3 (Transduce)" '(count-letter-frequencies-v3 large-word-set))
"=> Method 3 (Transduce): 1085.506542 msecs"

It has similar performance to the method above, but it is more composable. For example, say we now only want to collect the frequencies of letters in words of length n, we could do something like:

(defn letter-frequencies-for-words-length-n [n]
  (comp (filter #(= (count %) n))
        (map str/lower-case)
        (map frequencies)))

(transduce (letter-frequencies-for-words-length-n 5) end-form large-word-set)
{a 82537, c 45819, d 18342, e 138203, f 18509, g 18402, i 55131, j 9131, l 55157, m 27535, n 46120, o 36610, p 9214, q 18586, r 27507, s 73468, t 45772, u 128774, v 18243}

-Method 4 - Transient Data Structures-

What if we just want to abandon all of this 'persistent' data structure stuff altogether? Clojure has a way to do this. In fact, the function that we have been calling a lot, `frequencies`, does this under the hood. We will try to do something similar as our last attempt.
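For reference, the implementation of `frequencies` in `clojure.core` is essentially the following (lightly reformatted from the source, and renamed here to `frequencies*` so it doesn't shadow the real one):

```clojure
;; clojure.core/frequencies: reduce into a transient map,
;; then make it persistent exactly once at the end.
(defn frequencies* [coll]
  (persistent!
   (reduce (fn [counts x]
             (assoc! counts x (inc (get counts x 0))))
           (transient {})
           coll)))
```

This is the pattern the next method borrows.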
This time, we are going to write the whole process very literally as one function (no more chaining/composing things together). This will make our function less reusable, but hopefully it will speed things up. The thinking here is that we take a word, add its letters to the 'final' stage immediately, and then move on to the next word. The important point is that this 'final' container will be a `transient` map the whole way through. This means that we don't care about what is happening 'inside' the map while the function is running (it can be stateful), and then we call `persistent!` on it at the end to get the final result.

(defn count-letter-frequencies-v4 [input-words]
  (->> input-words
       (reduce (fn [result word]
                 (reduce (fn [result letter]
                           ;; note that we aren't 'updating' the map here. `update` is
                           ;; a pure function, which has guarantees about the value
                           ;; inside the map. This doesn't work here with the transient
                           ;; map, so we have to create a new entry each time
                           (assoc! result letter (inc (get result letter 0))))
                         result
                         (str/lower-case word)))
               (transient {})) ;; create a transient map
       (persistent!))) ;; 'realise' the map

(time-taken "Method 4 (Transients)" '(count-letter-frequencies-v4 large-word-set))
"=> Method 4 (Transients): 200.558 msecs"

-Side-By-Side Comparisons-

Just to be sure that these are all working as intended, let's check that the outputs are equal.

(= (count-letter-frequencies-v1 str small-word-set)
   (count-letter-frequencies-v1_1 small-word-set)
   (count-letter-frequencies-v2 small-word-set)
   (count-letter-frequencies-v3 small-word-set)
   (count-letter-frequencies-v4 small-word-set))
true

Now, let's look at the times again, and try to format them as an org-table.

(defn mean [values]
  (/ (apply + values) (count values)))

(defn time-taken-value [expr]
  (let [start (. System (nanoTime))
        _result (eval expr)]
    (/ (double (- (. System (nanoTime)) start)) 1000000.0)))

(defn time-taken-n-times [n-times expr]
  (take n-times (repeatedly #(time-taken-value expr))))

(def medium-word-set (generate-random-words 50000))
medium-word-set

(defn compare-results [n-iterations]
  (let [tests [["Method 1 (Basic)" count-letter-frequencies-v1_1]
               ["Method 2 (Merge)" count-letter-frequencies-v2]
               ["Method 3 (Transducer)" count-letter-frequencies-v3]
               ["Method 4 (Transients)" count-letter-frequencies-v4]]]
    (->> tests
         (mapv (fn [[label test-fn]]
                 (str "|" label "|"
                      (mean (time-taken-n-times n-iterations `(~test-fn medium-word-set)))
                      "|")))
         (str/join "\n")
         (str "|Method (50k words)|Time (msecs)|\n"))))

(compare-results 5)
| Method (50k words)    | Time (msecs) |
|-----------------------|--------------|
| Method 1 (Basic)      | 718.3338416  |
| Method 2 (Merge)      | 46.7370168   |
| Method 3 (Transducer) | 51.1277918   |
| Method 4 (Transients) | 8.3595832    |
(def medium-large-word-set (generate-random-words 500000))
medium-large-word-set

(defn compare-results [n-iterations]
  (let [tests [["Method 2 (Merge)" count-letter-frequencies-v2]
               ["Method 3 (Transducer)" count-letter-frequencies-v3]
               ["Method 4 (Transients)" count-letter-frequencies-v4]]]
    (->> tests
         (mapv (fn [[label test-fn]]
                 (str "|" label "|"
                      (mean (time-taken-n-times n-iterations `(~test-fn medium-large-word-set)))
                      "|")))
         (str/join "\n")
         (str "|Method (500k words)|Time (msecs)|\n"))))

(compare-results 5)
| Method (500k words)   | Time (msecs)      |
|-----------------------|-------------------|
| Method 2 (Merge)      | 638.0600833999999 |
| Method 3 (Transducer) | 535.7018332       |
| Method 4 (Transients) | 86.6949416        |
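A note on methodology: single `System/nanoTime` samples around `eval` are noisy (JIT warm-up, GC pauses). For more rigorous numbers, a common choice is the third-party criterium library, assuming it is on the classpath; a sketch:

```clojure
;; Criterium runs a warm-up phase and many sample iterations,
;; reporting mean, std-deviation and quantiles rather than a
;; single noisy measurement.
(require '[criterium.core :refer [quick-bench]])

(quick-bench (count-letter-frequencies-v4 medium-word-set))
```

The relative ordering of the methods above should hold either way, but the absolute numbers from a proper benchmark harness are more trustworthy.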
-Take Away-

This was a very simple, and not very 'real-world', type of problem. But it was useful for me to think through the different approaches to solving it. In the end, the fastest solution was based on something already in the `clojure.core` library, the `frequencies` function. So, if there is one key takeaway, it's to look at the source code of the core library for inspiration when it comes to performance.
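Reading core source is easy to do from the REPL itself, via `clojure.repl/source`:

```clojure
(require '[clojure.repl :refer [source]])

;; Prints the clojure.core definition of `frequencies` --
;; the same transient-map pattern used in Method 4.
(source frequencies)
```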