Speeding up
Associated directory: 14-speeding-up
After you've lived with this setup for a while, you might start noticing that running export.sh takes longer than you would like and rebuilds files that you would expect to be left alone. This happens because rules.psv files are per provider, and adding or amending rules causes all the files for that provider to be regenerated, and you could easily have dozens or hundreds of them. The worst slowdown happens when you are adding new rules for unclassified transactions: you know that the rules you add apply to a single new file, and yet all the rest of the journals get regenerated as well.
Additionally, scripts that call hledger multiple times -- such as the tax or investment reports -- do so sequentially, and if you have spare CPU cores, it would be best to run those calls in parallel.
Let's tackle the latter problem first. We could use parallel to make sure that our shell scripts call hledger with as much parallelism as possible. We will do this by writing a bash function that wraps a single hledger invocation, and then relying on the fact that parallel happily accepts arbitrary shell commands as input, which we can feed to it via a here-doc.
As a result, we can write what looks like an ordinary shell script in which every invocation of the hledger-wrapping function happens in parallel, occupying all CPU cores. The changes necessary are really minimal; here, for example, is the investments report:
$ cat export/investments.sh
#!/bin/bash
function report() {
  asset="$1"
  shift
  hledger roi -f ../all.journal \
    --investment "acct:assets:${asset} not:acct:equity" \
    --pnl 'acct:virtual:unrealized not:acct:equity' "$@"
}
export -f report

parallel -k :::: - <<EOF
echo "Pension"
echo
report "pension"
report "pension" -Y
EOF
To prevent changes in rules.psv from causing a rebuild of all journals for a given provider, we need some sort of "sentinel" value that certifies that a change in rules.psv does not affect a given journal file.
For each csv file that we convert into a journal, we could track (in a separate file) which rules match any line in this csv. This could be done with a simple script (which I chose to write in Python) that reads the file once, tries the regular expressions from your rules file one by one, and prints the ones that match. We could save the result next to file.csv, naming it file.matching_rules, and now we have our "sentinel".
Where previously we had .csv and rules.psv as dependencies for a .journal file, we would now instead have .csv and .matching_rules.
Now, if you change rules.psv, all .matching_rules files would be regenerated (which is fast), and then only the csv files affected by the changed rules would be rebuilt into .journal files.
On my dataset, for one of the providers where I have 12 years of data across 100+ files, a rebuild after a rules.psv change went from almost two minutes to ten seconds -- quite a welcome speedup.
Best of all, the change in export.hs
is quite small and does not require any manual per-file twiddling of build rules. You can find the resulting setup in 14-speeding-up or diffs/13-to-14.diff.
To be continued!