Improve JALR execution with JIT-cache #471
Conversation
Benchmarks
| Benchmark suite | Current: f5d04fb | Previous: 2b492f4 | Ratio |
|---|---|---|---|
| Dhrystone | 1602.55 Average DMIPS over 10 runs | 1575.66 Average DMIPS over 10 runs | 0.98 |
| Coremark | 1407.431 Average iterations/sec over 10 runs | 1413.569 Average iterations/sec over 10 runs | 1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
Place the source code of `fibonacci.elf` in the `tests` directory. Always use lowercase filenames.
Based on my previous research, searching any code cache is not suitable for our framework because we do not translate
Yes, the cache misses might cancel out the improvement due to the cache-searching overhead, so the implementation of the cache search is important here. I have tried several implementations, such as searching in a static array and searching in a hash table, and I found that hardware-like caching is the most friendly approach for our framework, since its overhead can be almost disregarded. I have updated the main performance analysis picture at the top to support this statement.
I see, your strategy significantly improves the performance of benchmarks with a large number of
No, the performance degradation of
Could you further explain why the all-cache-miss case performs better than your improvement? The JIT-ed code stored in the code cache for
CI failure:
This seems to be an upstream issue (riscv-software-src/riscof#122).
A temporary workaround:
It is a bit confusing to have both the block cache and the JIT cache in the same file. Can you clarify?
After researching, I found that the performance degradation occurred when T2C executed the T1C cache directly and bypassed the profiler. In the previous implementation, I added all entries of the T1C-generated code to the cache (even the part produced by block chaining), but the chained blocks might be generated by parsing the branch history table and might not belong to the main control flow. If that cache is used by T2C, the profiler has no chance to run again and tag the right potential hotspot, and I think this is the main reason for the performance degradation. After constraining the source of the cache to the T1C, the performance is shown at the top and is now more consistent with expectations.
Does the JIT cache make sense only for T2C-enabled builds? If so, it should be part of the T2C component, provided we find no regressions.
Yes, but there is no header file for the declarations of the T2C implementation, so the related type definitions and code are placed in
We can consider header refactoring when integrating with frameworks like (PHP) IR or similar compiler frameworks. However, that is not necessary at this moment. Let's keep changes to a minimum.
Rework the git commits by merging the intermediate changes.
Currently, the "JALR" indirect jump instruction switches the mode of rv32emu from T2C back to the interpreter. This commit introduces a "JIT-cache" table lookup that redirects execution to the T2C JIT-ed code entry, avoiding the mode change. Several scenarios benefit from this approach, e.g. function pointer invocation and faraway function calls. The former, such as "qsort", can be sped up by a factor of two, and the latter, such as "fibonacci" (compiled from hand-written assembly to produce "JALR" instructions), can even reach a 4.8x performance enhancement.
Thanks @vacantron for contributing!
Currently, the "JALR" indirect jump instruction switches the mode of rv32emu from T2C back to the interpreter. This commit introduces a "JIT-cache" table lookup that redirects execution to the JIT-ed code entry and avoids the mode change.
Several scenarios benefit from this approach, e.g. function pointer invocation and faraway function calls. The former, such as "qsort", can be sped up by a factor of two, and the latter, such as "Fibonacci" (compiled from hand-written assembly), can even reach a 4.3x performance enhancement.