Match results show 100% equal for functions with differences #313

Open
grimdoomer opened this issue Aug 31, 2024 · 4 comments

@grimdoomer

I'm trying to use Diaphora to diff different versions of the same binary and detect changed functions at the granularity of a single instruction. When diffing two versions of this binary whose only difference is a single instruction, the affected function is reported as "100% equal" with a ratio of 1.0.
image

If I diff the assembly for the functions I can see the single instruction change:
image

I understand the "lwz" line is a false positive because I changed the immediate display type in one of the databases before exporting, but I would still expect the slwi/sldi instruction change to be detected. Is there a setting I can change to make the comparison stricter? I thought some of the heuristics used the MD5 hash of the function's bytes, which I would expect to differ between these two functions.

For additional confirmation I diffed the two binaries in a hex editor and can clearly see the 4-byte change for the differing instruction:
image
image
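The same check can be done without a hex editor. The following is a minimal sketch, not part of Diaphora: `diff_runs` is a hypothetical helper, and the byte values are made up for illustration rather than taken from the attached samples.

```python
def diff_runs(a: bytes, b: bytes):
    """Return (offset, length) runs where two equal-sized buffers differ."""
    if len(a) != len(b):
        raise ValueError("buffers differ in size")
    runs = []
    start = None
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            if start is None:
                start = i  # a new run of differing bytes begins here
        elif start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:
        runs.append((start, len(a) - start))  # run reaches the end of the buffer
    return runs


# Illustrative data only: two 16-byte images that differ in one 4-byte word.
old = bytes(8) + bytes.fromhex("11223344") + bytes(4)
new = bytes(8) + bytes.fromhex("55667788") + bytes(4)
for offset, length in diff_runs(old, new):
    print(f"0x{offset:08x}: {length} byte(s) differ")
```

For real files, read both with `open(path, "rb").read()` and pass the results in. A changed instruction can show up as a run shorter than 4 bytes if some of its bytes happen to be identical in both versions.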

@grimdoomer
Author

Digging into this a bit more, I got the "best" matches to run by changing DIFFING_ENABLE_EXPERIMENTAL to False, but the results still showed "100% equal" for the function in question. I also checked the SQLite databases and confirmed that the function's bytes_hash differs between the two versions.
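A rough sketch of that database check, using Python's standard sqlite3 module. The schema here is a simplified assumption (a `functions` table with `name`, `address` and `bytes_hash` columns), and `load_hashes` / `hash_mismatches` are hypothetical helpers, not Diaphora functions. In-memory databases stand in for the two exported files.

```python
import sqlite3


def load_hashes(conn):
    """Map (name, address) -> bytes_hash for every function in one export."""
    rows = conn.execute("SELECT name, address, bytes_hash FROM functions")
    return {(name, address): bytes_hash for name, address, bytes_hash in rows}


def hash_mismatches(conn_a, conn_b):
    """Functions present in both exports whose bytes_hash differs."""
    a = load_hashes(conn_a)
    b = load_hashes(conn_b)
    return sorted(key for key in a.keys() & b.keys() if a[key] != b[key])


def make_db(rows):
    """Build an in-memory stand-in for an exported database (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE functions (name TEXT, address INTEGER, bytes_hash TEXT)")
    conn.executemany("INSERT INTO functions VALUES (?, ?, ?)", rows)
    return conn


# Made-up rows: sub_2000 has the same name and address but different bytes.
db_a = make_db([("sub_1000", 0x1000, "aaaa"), ("sub_2000", 0x2000, "bbbb")])
db_b = make_db([("sub_1000", 0x1000, "aaaa"), ("sub_2000", 0x2000, "cccc")])
print(hash_mismatches(db_a, db_b))
```

With real exports, replace `make_db(...)` with `sqlite3.connect(path)` and adjust the column names if the actual schema differs.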

In diaphora.py I found the find_equal_matches function, which reports functions as "100% equal". However, it only compares the following fields: id, address, mangled_function, nodes, edges, size. After I added bytes_hash to that comparison, the function in question is reported as a partial match with a ratio of 0.99, which is what I was expecting.
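To illustrate the effect of that change, here is a simplified sketch and not the actual find_equal_matches code. The table layout, the attached `diff` database, and the `equal_matches` helper are assumptions made for the example; only the field names come from the description above.

```python
import sqlite3

# Fields the "100% equal" check compared, per the description above.
FIELDS = ["id", "address", "mangled_function", "nodes", "edges", "size"]


def equal_matches(conn, fields):
    """Names of functions whose listed fields are identical in both databases."""
    cond = " AND ".join(f"f.{col} = df.{col}" for col in fields)
    sql = (
        "SELECT f.name FROM main.functions f "
        f"JOIN diff.functions df ON {cond} ORDER BY f.name"
    )
    return [row[0] for row in conn.execute(sql)]


conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS diff")  # second database (assumed setup)
schema = (
    "(id INTEGER, name TEXT, address INTEGER, mangled_function TEXT, "
    "nodes INTEGER, edges INTEGER, size INTEGER, bytes_hash TEXT)"
)
for db in ("main", "diff"):
    conn.execute(f"CREATE TABLE {db}.functions {schema}")

# Same id, address, name, CFG shape and size; only the bytes differ (made-up data).
conn.execute("INSERT INTO main.functions VALUES (1, 'sub_1000', 4096, 'sub_1000', 3, 2, 64, 'aaaa')")
conn.execute("INSERT INTO diff.functions VALUES (1, 'sub_1000', 4096, 'sub_1000', 3, 2, 64, 'bbbb')")

print(equal_matches(conn, FIELDS))                   # ['sub_1000']
print(equal_matches(conn, FIELDS + ["bytes_hash"]))  # []
```

Without bytes_hash, the pair is accepted as equal because every compared field matches. With it, the pair is rejected by this check, so later heuristics can score it as a partial match.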

So it seems like as long as two functions have the same name, address, size, and control flow they get reported as 100% equal even though the instructions in the functions could have changed? Is this intended behavior or a bug?

@joxeankoret
Owner

> So it seems like as long as two functions have the same name, address, size, and control flow they get reported as 100% equal even though the instructions in the functions could have changed? Is this intended behavior or a bug?

This is intended behaviour, but given your very detailed report, it might be wrong. I'm going to add your patch (adding bytes_hash). Could you please share the two samples? (Or their hashes, and I will search for them myself.)

joxeankoret self-assigned this Sep 4, 2024
joxeankoret added the bug label Sep 4, 2024
@grimdoomer
Author

I have attached the sample files and the IDC scripts used to reproduce the IDA databases I had set up. Load each .bin file as "PowerPC big-endian", use the default memory layout settings, choose 32-bit if asked how to analyze, and keep the defaults for IO ports, etc. Please let me know if you have any other questions or issues loading the samples.

image
image
image
image
image

hv_images.zip

@joxeankoret
Owner

Bug fixed locally, waiting for all the tests to pass. Thanks a lot!

joxeankoret added a commit that referenced this issue Sep 17, 2024
ML: Dropped support for training local models. They were not working properly at all.
BUG: HEUR: Added field 'bytes_hash' to the '100% equal' heuristic, as it was ignoring some minimal changes (issue #313)
BUG: HEUR: Always check if there are differences even for structurally 100% equal databases (issue #313).