
Associated directory: 14-speeding-up

Speeding things up

After you've lived with this setup for a while, you might notice that running export.sh takes longer than you would like and rebuilds files that you would expect to be left alone.

This happens because rules.psv files are per provider, so adding or amending rules causes all the files for that provider to be regenerated, and you could easily have dozens or hundreds of them. The worst slowdown happens when you are adding new rules for unclassified transactions: you know that the rules you add apply only to a single new file, and yet all the other journals are regenerated as well.

Additionally, scripts that call hledger multiple times -- such as the tax or investment reports -- do so sequentially; if you have spare CPU cores, it would be better to run those invocations in parallel.

Parallelizing hledger in shell scripts

Let's tackle the latter problem first. We can use GNU parallel to make our shell scripts call hledger with as much parallelism as possible. We will do this by wrapping a single hledger invocation in a bash function and relying on the fact that parallel happily accepts arbitrary shell commands as input, which we can feed to it via a here-doc.

As a result, we can write what still looks like an ordinary shell script, in which every invocation of the hledger-wrapping function runs in parallel, occupying all CPU cores. The necessary changes are minimal; here, for example, is the investment report:

$ cat export/investments.sh

#!/bin/bash

# Wrap a single hledger invocation in a function so that "parallel" can run it.
function report() {
    asset="$1"
    shift
    hledger roi -f ../all.journal \
            --investment "acct:assets:${asset} not:acct:equity" \
            --pnl 'acct:virtual:unrealized not:acct:equity' "$@"
}
# Export the function so that the shells spawned by "parallel" can see it.
export -f report

# Each line of the here-doc is an arbitrary shell command; "parallel" reads them
# from stdin ("-"), runs them on all available CPU cores, and -k keeps the
# output in the original order.
parallel -k :::: - <<EOF
echo "Pension"
echo
report "pension"
report "pension" -Y
EOF

Tracking which rules apply to which csv file

To prevent changes in rules.psv from causing a rebuild of all journals for a given provider, we need some sort of "sentinel" that certifies that a change in rules.psv does not affect a given journal file.

For each csv file that we convert into a journal, we can track (in a separate file) which rules match any line of that csv. This can be done with a simple script (which I chose to write in Python) that reads the file once, tries the regular expressions from your rules file one by one, and prints the ones that match, as sketched below. We save the result next to file.csv, naming it file.matching_rules, and now we have our "sentinel".
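Here is a minimal sketch of such a script. It assumes it is invoked as matching_rules.py rules.psv file.csv and that the regular expression is the first pipe-separated field of every rules.psv line; the actual script shipped in 14-speeding-up may be structured differently.

#!/usr/bin/env python3
# Sketch of a "matching rules" extractor: print every rule whose regex matches
# at least one line of the given csv file.
# Assumed usage: matching_rules.py rules.psv file.csv > file.matching_rules
import re
import sys

def matching_rules(rules_path, csv_path):
    # Read the csv once and keep it in memory.
    with open(csv_path, encoding="utf-8", errors="replace") as f:
        csv_lines = f.readlines()

    matched = []
    with open(rules_path, encoding="utf-8", errors="replace") as f:
        for rule in f:
            rule = rule.rstrip("\n")
            if not rule or rule.startswith("#"):
                continue
            # Assumption: the regex is the first pipe-separated field.
            pattern = rule.split("|", 1)[0]
            try:
                # Case-insensitive, like hledger's CSV rule matching.
                regex = re.compile(pattern, re.IGNORECASE)
            except re.error:
                continue  # skip fields that are not valid regexes
            if any(regex.search(line) for line in csv_lines):
                matched.append(rule)
    return matched

if __name__ == "__main__":
    rules_file, csv_file = sys.argv[1], sys.argv[2]
    for rule in matching_rules(rules_file, csv_file):
        print(rule)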

Where previously the .csv and rules.psv were the dependencies of a .journal file, we now have the .csv and .matching_rules instead.
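Schematically, the dependency graph changes like this:

file.csv + rules.psv            -> file.journal           (before: any rule change rebuilds every journal)

file.csv + rules.psv            -> file.matching_rules    (after: cheap to regenerate)
file.csv + file.matching_rules  -> file.journal           (after: rebuilt only when its matching rules change)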

Now, if you change rules.psv, all the .matching_rules files are regenerated (which is fast), and then only the csv files affected by the changed rules are rebuilt into .journal files.

On my dataset, for one of the providers where I have 12 years of data across 100+ files, a rebuild after a rules.psv change went from almost two minutes to ten seconds -- quite a welcome speedup.

Best of all, the change in export.hs is quite small and does not require any manual per-file twiddling of build rules. You can find the resulting setup in 14-speeding-up or diffs/13-to-14.diff.

Next steps

To be continued!