
Improve JALR execution with JIT-cache #471

Merged
merged 2 commits into from
Aug 7, 2024
Merged

Conversation

Collaborator

@vacantron vacantron commented Aug 1, 2024

Currently, the "JALR" indirect jump instruction turns the mode of rv32emu from T2C back to the interpreter. This commit introduces a "JIT-cache" table lookup that redirects it to the JIT-ed code entry and avoids the mode change.

There are several scenarios benefiting from this approach, e.g. function pointer invocation and far function calls. The former, like "qsort", can be sped up by two times, and the latter, like "Fibonacci", which is compiled from hand-written assembly, can even reach a 4.3x performance enhancement.

(Performance analysis chart: perf-indirect-jump-improvement)

Contributor

@jserv jserv left a comment


Benchmarks

| Benchmark suite | Current: f5d04fb | Previous: 2b492f4 | Ratio |
|---|---|---|---|
| Dhrystone | 1602.55 Average DMIPS over 10 runs | 1575.66 Average DMIPS over 10 runs | 0.98 |
| Coremark | 1407.431 Average iterations/sec over 10 runs | 1413.569 Average iterations/sec over 10 runs | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

Contributor

@jserv jserv left a comment


Place the source code of fibonacci.elf in the tests directory. Always use lowercase filenames.

src/jit-cache.h — review thread (outdated, resolved)
@qwe661234
Collaborator

qwe661234 commented Aug 2, 2024

Based on my previous research, searching any code cache is not suitable for our framework because we do not translate all paths into T1C or T2C. Therefore, you will see a performance loss from frequent cache misses. Maybe you can try the same approach as T1C: parse the branch history table and include the predicted path in the execution path for T2C.

@vacantron
Collaborator Author

vacantron commented Aug 2, 2024

Based on my previous research, searching any code cache is not suitable for our framework because we do not translate all paths into T1C or T2C. Therefore, you will see a performance loss from frequent cache misses.

Yes, cache misses might cancel out the improvement because of the cache-search overhead, so the implementation of the cache lookup is important here.

I have tried several implementations, such as searching a static array and searching a hash table, and I found that hardware-like caching is the most suitable approach for our framework, since its overhead is almost negligible.

I have updated the main performance analysis picture at the top to support the statement above.

@qwe661234
Collaborator

qwe661234 commented Aug 2, 2024

cache misses might cancel out the improvement because of the cache-search overhead, so the implementation of the cache lookup is important here.

I have tried several implementations, such as searching a static array and searching a hash table, and I found that hardware-like caching is the most suitable approach for our framework, since its overhead is almost negligible.

I have updated the main performance analysis picture at the top to support the statement above.

I see. Your strategy significantly improves the performance of benchmarks with a large number of jalr instructions, such as qsort and fib. However, it has a negative impact on other types of benchmarks like idea, and this result matches my research.

@vacantron
Collaborator Author

vacantron commented Aug 2, 2024

However, it has a negative impact on other types of benchmarks like idea, and this result matches my research.

No, the performance degradation of IDEA comes from the invocation of the JIT-generated code. Since the second bar (orange) shows the overhead of all cache misses, the last bar (green) only shows the performance variation of invoking the JIT-ed code, and whether it increases or decreases is determined by the quality of the generated code.

src/jit.h — 4 review threads (outdated, resolved)
@jserv jserv requested a review from qwe661234 August 3, 2024 03:00
@qwe661234
Collaborator

However, it has a negative impact on other types of benchmarks like idea, and this result matches my research.

No, the performance degradation of IDEA comes from the invocation of the JIT-generated code. Since the second bar (orange) shows the overhead of all cache misses, the last bar (green) only shows the performance variation of invoking the JIT-ed code, and whether it increases or decreases is determined by the quality of the generated code.

Could you further explain why the performance of all cache misses is better than your improvement? Is it because the JIT-ed code stored in the code cache for the call is from T1C, so its quality is poor?

@jserv
Contributor

jserv commented Aug 3, 2024

CI failure:

git submodule update --init tests/
Submodule 'riscv-arch-test' (https://github.com/riscv-non-isa/riscv-arch-test) registered for path 'tests/riscv-arch-test'
Cloning into '/home/runner/work/rv32emu/rv32emu/tests/riscv-arch-test'...
From https://github.com/riscv-non-isa/riscv-arch-test
 * branch              ed32d6767e5c35fa98dfc5aa91d7f4b199f8c639 -> FETCH_HEAD
Submodule path 'tests/riscv-arch-test': checked out 'ed32d6767e5c35fa98dfc5aa91d7f4b199f8c639'
Traceback (most recent call last):
  File "/home/runner/.local/bin/riscof", line 5, in <module>
    from riscof.cli import cli
  File "/home/runner/.local/lib/python3.10/site-packages/riscof/cli.py", line 18, in <module>
    import riscof.framework.main as framework
  File "/home/runner/.local/lib/python3.10/site-packages/riscof/framework/main.py", line 11, in <module>
    from riscv_isac.isac import preprocessing
ImportError: cannot import name 'preprocessing' from 'riscv_isac.isac' (/home/runner/.local/lib/python3.10/site-packages/riscv_isac/isac.py)
make: *** [mk/riscv-arch-test.mk:18: arch-test] Error 1

@vacantron
Collaborator Author

CI failure

This seems to be an upstream issue (riscv-software-src/riscof#122).

@jserv
Contributor

jserv commented Aug 3, 2024

riscv-software-src/riscof#122

A temporary workaround:

pip3 install git+https://github.com/riscv/riscof.git@d38859f85fe407bcacddd2efcd355ada4683aee4

@jserv
Contributor

jserv commented Aug 3, 2024

It is a bit confusing to have both block cache and jit cache in the same file. Can you clarify?

@vacantron
Collaborator Author

vacantron commented Aug 4, 2024

Could you further explain why the performance of all cache misses is better than your improvement? Is it because the JIT-ed code stored in the code cache for the call is from T1C, so its quality is poor?

After further investigation: the performance degradation occurred when T2C executed the T1C-cached code directly and bypassed the profiler.

In the previous implementation, I added all entries of the T1C-generated code to the cache (even the parts from block chaining), but the chained blocks might be generated by parsing the branch history table and might not lie on the main control flow. If T2C uses that cache, the profiler has no chance to run again and tag the right potential hotspots, and I think this is the main reason for the performance degradation of IDEA.

After constraining the cache entries to those originating from T1C, the performance (shown at the top) becomes more consistent with expectations.

tests/fibonacci.s — review thread (outdated, resolved)
src/t2c.c — 2 review threads (outdated, resolved)
src/jit.h — review thread (resolved)
src/jit.h — review thread (outdated, resolved)
@jserv
Contributor

jserv commented Aug 5, 2024

Does the JIT cache make sense only for T2C-enabled builds? If so, it should become part of the T2C component, provided we find no regressions.

@vacantron
Collaborator Author

Does the JIT cache make sense only for T2C-enabled builds? If so, it should become part of the T2C component, provided we find no regressions.

Yes, but there is no header file for the declarations of the T2C implementation, so the related type definitions and code are currently placed in jit.h and t2c.c. Should we split them into t2c.h?

@jserv
Contributor

jserv commented Aug 6, 2024

there is no header file for the declarations of the T2C implementation, so the related type definitions and code are currently placed in jit.h and t2c.c. Should we split them into t2c.h?

We can consider header refactoring when integrating with frameworks like (PHP) IR or similar compiler frameworks. However, that is not necessary at this moment. Let's keep changes to a minimum.

Contributor

@jserv jserv left a comment


Rework the git commits by merging the intermediate changes.

@jserv jserv added this to the release-2024.1 milestone Aug 7, 2024
Currently, the "JALR" indirect jump instruction turns the mode of
rv32emu from T2C back to the interpreter. This commit introduces a
"JIT-cache" table lookup that redirects it to the T2C JIT-ed code
entry and avoids the mode change.

There are several scenarios benefiting from this approach, e.g.
function pointer invocation and far function calls. The former,
like "qsort", can be sped up by two times, and the latter, like
"fibonacci", which is compiled from hand-written assembly to
create "JALR" instructions, can even reach a 4.8x performance
enhancement.
@jserv jserv merged commit 34c3db0 into sysprog21:master Aug 7, 2024
8 checks passed
@jserv
Contributor

jserv commented Aug 7, 2024

Thanks to @vacantron for contributing!
