
Parallel export #265

Open · wants to merge 5 commits into base: master

Conversation

@clslgrnc (Contributor) commented Aug 6, 2023

Following our brief exchange on Mastodon, here is complete parallel export code, covering both functions and the call graph.

With 5 workers I get a 2x speedup on a 1 MB binary, and the call graph and functions are a 100% match with a regular export.

  • Each job is assigned some of the functions to export.
  • All resulting databases are merged.
  • A final job exports the remaining data.

I refactored all SQL insertions related to functions in order to easily switch between:

  • sequential export: rows are inserted sequentially
  • parallel export: ensure `rowid % nbr_jobs == job_id` in order to avoid collisions when merging (see the sketch below)
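
For illustration, here is a minimal sketch of the collision-avoidance idea with plain `sqlite3` (a simplified schema, not Diaphora's actual tables or helpers): each worker assigns explicit rowids congruent to its job id, so the per-worker databases can later be merged with `ATTACH` without key conflicts.

```python
import sqlite3

def export_worker(db_path, functions, nbr_jobs, job_id):
    # Write only rowids congruent to job_id modulo nbr_jobs, so rows
    # from different workers can never collide when merged.
    db = sqlite3.connect(db_path)
    db.execute("CREATE TABLE IF NOT EXISTS functions "
               "(id INTEGER PRIMARY KEY, name TEXT)")
    rowid = job_id + nbr_jobs  # first positive id congruent to job_id
    for name in functions:
        db.execute("INSERT INTO functions (id, name) VALUES (?, ?)",
                   (rowid, name))
        rowid += nbr_jobs
    db.commit()

def merge(main_path, worker_paths):
    # Once the id ranges are disjoint, merging is a plain INSERT ... SELECT.
    db = sqlite3.connect(main_path)
    db.execute("CREATE TABLE IF NOT EXISTS functions "
               "(id INTEGER PRIMARY KEY, name TEXT)")
    for path in worker_paths:
        db.execute("ATTACH DATABASE ? AS worker", (path,))
        db.execute("INSERT INTO functions SELECT * FROM worker.functions")
        db.commit()
        db.execute("DETACH DATABASE worker")
```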

Run with: `IDADIR=<path-to-ida> ./diaphora_parallel_export.py <path-to-target-binary>`

Two potential improvements remain to be done:

  • distribute address ranges to workers without relaunching IDA (this needs a communication channel with the IDA scripts)
  • use a thread-safe database in order to parallelize merges

Sequential export still seems to work.

@joxeankoret (Owner)

I will not integrate it into the main branch for obvious reasons, but I will leave this PR here so people can use it if they like.

@clslgrnc (Contributor, Author) commented Aug 14, 2023

I can think of several obvious reasons not to integrate this into the main branch 😃

If you could integrate any (or all) of the first three commits (a1b769d, d28da05, 6f9ae86), though, it would make maintaining my fork easier. If you are interested, let me know which commits would be acceptable so that I can prepare a PR (if not, fair enough).

For anyone interested, please report any issues with diaphora_parallel_export.py over there: https://github.com/clslgrnc/diaphora-parallel-export/issues


Edit:
I just expanded the commit messages (and updated the commit SHAs)

@clslgrnc (Contributor, Author)

From #264

> Not a feature so much, but optimizing the export speed would be nice. I have an arm64 binary that takes over 2 and a half hours to export at the moment. [Diaphora: Wed Jul 19 21:07:13 2023] Database exported, time taken: 2:43:39.113402.

@Myles1 the diaphora_parallel_export.py script in this PR might be able to speed this up.
It launches several instances of idat64 on copies of the target idb, so RAM might be a limiting factor.
Let me know if you have any questions.

> It's just so nice to be able to automate function matching in this way. You've made an enormously helpful program.

Indeed, thank you @joxeankoret

@joxeankoret (Owner)

I will review the commits as soon as I can and integrate them where possible. I will probably do it this weekend.
Thank you!

Extract the list of functions exported by `do_export` into the
method `filter_functions`. This will allow a parallel version of
`do_export` to build a generator of functions to export on the fly
while receiving instructions from a main process.

Additionally, parallel export will not always export functions in
a predictable order. When resuming a crashed export, instead of
relying on the address of the last function inserted into the
database, all addresses already added are retrieved and
`self._funcs_cache` is restored from them.
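
A rough sketch of the idea, with hypothetical names and schema (not
Diaphora's actual API):

```python
def restore_funcs_cache(self):
    # On resume, rebuild the cache from every address already in the
    # database instead of trusting only the last inserted address.
    # `self.db`, the `functions` table and its `address` column are
    # assumptions made for this sketch.
    rows = self.db.execute("SELECT address FROM functions").fetchall()
    self._funcs_cache = {addr for (addr,) in rows}

def filter_functions(self, all_functions):
    # Generator consumed by do_export: yield only what still needs
    # to be exported.
    for ea in all_functions:
        if ea not in self._funcs_cache:
            yield ea
```
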
Extract the argument of `replace_wait_box` into a variable, so that
it can be used as a log line instead of a message box update.

Also a slight cosmetic change to the frequency of updates.

Extract all SQL queries related to functions into a dedicated class.
It will then be easier to alter all of them at once in the context
of parallel export of function data.

If further SQL queries related to functions are necessary, they
should be added to this class.
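
As a rough illustration of the intent (hypothetical class, schema and
method names, not the actual Diaphora code), funnelling the inserts
through one class makes it possible to swap the rowid policy for all
function-related queries at once:

```python
class FunctionQueries:
    # Single home for function-related INSERTs (hypothetical sketch).
    def __init__(self, db, nbr_jobs=1, job_id=0):
        self.db = db
        self.nbr_jobs = nbr_jobs  # 1 means plain sequential export
        self.job_id = job_id
        # First positive rowid congruent to job_id modulo nbr_jobs.
        self._next_rowid = job_id + nbr_jobs

    def _take_rowid(self):
        # Parallel export: hand out rowids with rowid % nbr_jobs == job_id.
        rowid = self._next_rowid
        self._next_rowid += self.nbr_jobs
        return rowid

    def insert_function(self, name, address):
        if self.nbr_jobs == 1:
            # Sequential export: let SQLite choose the next rowid.
            self.db.execute(
                "INSERT INTO functions (name, address) VALUES (?, ?)",
                (name, address))
        else:
            self.db.execute(
                "INSERT INTO functions (id, name, address) VALUES (?, ?, ?)",
                (self._take_rowid(), name, address))
```
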
`diaphora_parallel_export.py` performs the following actions:
- first, `idat64` is used to perform the auto-analysis and get an
  idb;
- then, a queue manager with two queues is created (see the sketch
  after this list):
  - `job_queue` is used to send jobs to the workers;
  - `report_queue` is used by workers to report completed jobs and
    termination;
- a number of workers are launched: they copy the idb and launch
  `idat64` with a script that retrieves the queues and awaits
  instructions;
- jobs and kill switches are sent;
- as soon as possible (when all jobs are done or in progress), the
  generated SQLite databases, containing function information, are
  merged, resulting in a database with all function data but no
  program data;
- `idat64` is used one last time to retrieve that program data.
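
Hedged sketch of the queue plumbing with Python's
`multiprocessing.managers` (function names, port and authkey are
illustrative, not the actual script's):

```python
from multiprocessing.managers import BaseManager
from queue import Queue

class QueueManager(BaseManager):
    pass

def start_manager(port=50000, authkey=b"diaphora"):
    # Main process: expose job_queue and report_queue over a local
    # socket so the scripts running inside each idat64 instance can
    # retrieve them.
    job_queue, report_queue = Queue(), Queue()
    QueueManager.register("get_job_queue", callable=lambda: job_queue)
    QueueManager.register("get_report_queue", callable=lambda: report_queue)
    manager = QueueManager(address=("127.0.0.1", port), authkey=authkey)
    manager.start()
    return manager

def worker_loop(port=50000, authkey=b"diaphora"):
    # Runs inside the IDA-side script (a separate process): connect
    # back to the manager and process jobs until a kill switch (None)
    # arrives.
    QueueManager.register("get_job_queue")
    QueueManager.register("get_report_queue")
    client = QueueManager(address=("127.0.0.1", port), authkey=authkey)
    client.connect()
    jobs, reports = client.get_job_queue(), client.get_report_queue()
    while True:
        job = jobs.get()
        if job is None:  # kill switch
            reports.put(("terminated", None))
            break
        # ... export the functions described by `job` ...
        reports.put(("done", job))
```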

In order to avoid collisions while merging databases, each worker
only uses database indices such that
`index % nbr_workers == worker_id`.

In order to choose the functions to analyze, a worker simply divides
the sorted list of functions into as many parts as the total number
of jobs and processes the nth part (where n is the job id), as
sketched below.
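
A minimal sketch of that partitioning (illustrative only, not the
actual helper used by the script):

```python
def job_slice(functions, nbr_jobs, job_id):
    # Split the sorted function list into nbr_jobs contiguous parts
    # and keep only the part assigned to this job.
    functions = sorted(functions)
    chunk = -(-len(functions) // nbr_jobs)  # ceiling division
    return functions[job_id * chunk:(job_id + 1) * chunk]
```
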
Keep the connection to the main database open.
Not sure if it really is faster.