Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add RISC-V vector spec v1.0 support #279

Open
wants to merge 12 commits into
base: master
Choose a base branch
from
Open

Conversation

sh-zheng
Copy link

@sh-zheng sh-zheng commented Mar 29, 2022

Enable rvv 1.0 with --enable-rvv.
Correctness tests for FP32 and FP64 are done with bench --verify on qemu.

@rdolbeau
Copy link
Contributor

rdolbeau commented May 4, 2022

Maybe you could first submit to https://github.com/rdolbeau/fftw3/tree/riscv-v where the code was originally developed with the EPI intrinsics.

The RVV implementation (and the SVE implementation upon which it is based in https://github.com/rdolbeau/fftw3/tree/arm-sve-clean) adds a lot of codelets by using a 'fixed-width' model, which may not be ideal for code-size and planning time. But a "scalable" implementation (full-width and/or dynamically masked) would require an adapted infrastructure, as currently the SIMD solvers require a known width at compile-time.

@sh-zheng
Copy link
Author

sh-zheng commented May 4, 2022

@rdolbeau The code is indeed a prototype and need further development. I'd love to submit it to your repo first, and we can make some optimizations on it later.
I found that the branch https://github.com/rdolbeau/fftw3/tree/riscv-v has not updated since 2020 and maybe out of date with the official mainline. Could you creat a new branch from mainline, where the code is latest to submit?

@rdolbeau
Copy link
Contributor

rdolbeau commented May 5, 2022

@sh-zheng Indeed I need to bring the branch up-to-date (or create a new, clean one), I'll try to do that ASAP.

#include <riscv_vector.h>

#if HAVE_RVV
/* don't know how to autodetect RVV; assume it is present */

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Detect RVV at run-time:

#include <sys/auxv.h>
if (getauxval(AT_HWCAP) & HWCAP_ISA_V)
{

}

It's not detect zve* right, but that's the best we can did for now.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much. I will try this detection.

@sh-zheng
Copy link
Author

sh-zheng commented May 5, 2022

@rdolbeau Thank you so much. A new, clean branch may be better, since the official rvv intrinsics are stable enough for development.

@rdolbeau
Copy link
Contributor

rdolbeau commented May 31, 2022

@sh-zheng Sorry for the delay, been quite busy; riscv-v-clean should be up-to-date with master and just a couple of consolidated commits (rdcycle & the V support).

@sh-zheng
Copy link
Author

@rdolbeau Thank you very much for the new branch. I will submit my codes into it as soon as possible.

@nick-knight
Copy link

What is the plan? For @sh-zheng to PR their changes to @rdolbeau's fork, and then PR that here?

a "scalable" implementation (full-width and/or dynamically masked) would require an adapted infrastructure, as currently the SIMD solvers require a known width at compile-time.

My suggestion is to contribute the fixed-width RVV support first, and then propose optimizations in follow-up patches. A fixed SIMD width seems to be a fundamental assumption of FFTW's code-gen, and breaking this assumption may require invasive changes. Additionally, both RVV and SVE could benefit from supporting scalable vectors, and it might be preferable to consider both use-cases at the same time, to ensure the changes don't preclude one ISA or the other.

@sh-zheng
Copy link
Author

sh-zheng commented Sep 7, 2022

@nick-knight I agree. Maybe we can submit the fixed-width version first. The submit of scalable version is still in progress and may not be perfect soon.

@JustToEnjoy
Copy link

@rdolbeau @sh-zheng
1, I fetch branch https://github.com/rdolbeau/fftw3/tree/riscv-v-clean, and build it using C906 toolchain, but it will happen error when I --enable-r5v, Have you encountered this problem?

r5v.c:28:12: warning: implicit declaration of function '__builtin_epi_vsetvl'; did you mean '__builtin_iceil'? [-Wimplicit-function-declaration] 28 | return __builtin_epi_vsetvl(rs/64, __epi_e64, __epi_m1) >= (rs/64); | ^~~~~~~~~~~~~~~~~~~~ | __builtin_iceilr5v.c:28:12: warning: nested extern declaration of '__builtin_epi_vsetvl' [-Wnested-externs]r5v.c:28:40: error: '__epi_e64' undeclared (first use in this function) 28 | return __builtin_epi_vsetvl(rs/64, __epi_e64, __epi_m1) >= (rs/64); | ^~~~~~~~~r5v.c:28:40: note: each undeclared identifier is reported only once for each function it appears inr5v.c:28:51: error: '__epi_m1' undeclared (first use in this function) 28 | return __builtin_epi_vsetvl(rs/64, __epi_e64, __epi_m1) >= (rs/64); | ^~~~~~~~r5v.c:29:3: warning: control reaches end of non-void function [-Wreturn-type] 29 | }

2,I remove --enable-r5v, it can build pass, but it can't work in C906, if I call any fftw API(for example, fftw_plan_dft_1d), the program will get stuck. No any error been print.

These two problems make me so consufed. Could you help me?

@rdolbeau
Copy link
Contributor

rdolbeau commented Mar 2, 2023

@JustToEnjoy Unfortunately on my branch the code still uses EPI syntax (the European Processor Initiative project, which developed some early support for a set V intrinsics internally). It needs updating to 'standard' intrinsics. @sh-zheng had a prototype but I'm not sure targeting which set of intrinsics, or which version(s) of which compilers (llvm, gcc) would have support.

@nick-knight
Copy link

SiFive has successfully built FFTW from this PR branch: it passes tests, and we are pointing our customers at it. We used an in-house LLVM toolchain, but I don't think it differs significantly from upstream LLVM w.r.t. the necessary intrinsics.

@rdolbeau
Copy link
Contributor

rdolbeau commented Mar 2, 2023

@nick-knight My suggestion to @sh-zheng was to merge on my branch (riscv-v-clean) so we keep the history, then merge to upstream (here). Unfortunately it seems our calendar didn't align it seems :-(

@nick-knight
Copy link

@rdolbeau I have done my best to acknowledge your contribution in customer engagements. If @sh-zheng follows through on your suggestion, then I assume they will close this PR and you will open a new one. When that happens, I will start pointing customers at the new one. Or, even better, the maintainer will accept the PR and we can just use master :)

@sh-zheng
Copy link
Author

So sorry for the delay since I'm in some troubles, and may not continue to move forward with the project :-( . I'd appreciate it if someone could take the project forward.

The project is now in such a state, that:
1. If we don't very care about the vector-length adaptability of the code, the PR could be submited right now, to support the vector length no greater than 1024.
2. If we really care the vector-length adaptability of the code, the PR should be merged into rdolbeau's branch, and need more development as the SVE version. But this may need more time.

My suggestion is that, since the main-line of the fftw has not supported the flexible vector-length, we could submit this version, as a preliminary implementation of fixed vector length, and at least make it usable. The version of flexible vector-length could be developed later, and be submitted in another PR. @nick-knight @rdolbeau

@rdolbeau
Copy link
Contributor

@sh-zheng @nick-knight For unrelated reason I recently opened a PR for SVE (#315). I'll try to take a look at @sh-zeng modifications and see if I can merge them on my RVV branch to follow-up, as they are somewhat similar in terms of behavior and issues.

@sh-zheng
Copy link
Author

sh-zheng commented Mar 13, 2023

Great!
I think the main different between the two PRs is the simd-common.h. Maybe they can be merged and submitted synchronously. @rdolbeau

1. Extend the support of VLEN to 65536. The data type in struct tw_instr of kernel/ifftw.h should be extended synchronously, to support longer integer.
2. Include vtw.h for VTW1, VTW2, and VTWS, to be compatible with the coding style of ARM SVE. Although the vtw.h is auto-generated in the version of ARM SVE, for the integrity and correctness, the vtw.h will be submitted in this version.
@OMaghiarIMG
Copy link

Hello @sh-zheng, think you might want to add dft/simd/rvv*/*.c and rdft/simd/rvv*/*.c to .gitignore.

simd-support/rvv.c Outdated Show resolved Hide resolved
@sh-zheng
Copy link
Author

@OMaghiarIMG Thank you for the review. A new version has been updated.

@rdolbeau
Copy link
Contributor

At last I've been able to move my own branch from the old EPI intrinsics to the current set of RISC-V V1.0 intrinsics. I've made a release package to ease testing (no need for ocaml and maintainer mode!).

It is able to compile and pass checks using either clang-18 (from Debian sid) or gcc-14 (recompiled from FSF source), running on a Banana Pi F3 SBC which features RISC-V V1.0 using 256-bits registers. It's not very fast and only has 4 GiB of RAM, but it's a lot faster than Qemu :-)

@sh-zheng branch also pass the same set of tests on the same hardware. Interestingly, the performance results I get are highly dependent on the compiler: with clang-18 my branch appear to be a bit faster, while with gcc-14 @sh-zheng branch is a bit faster. There's probably quite a bit of analytics needed to figure out what's the best way (or a reasonable compromise...) of doing single-vector/interleaved format using RISC-V V. The many different micro-architectures that are appearing will make trade-offs in the source a lot more complex than for x86-64 or aarch64...

Also, an alternate implementation would be to update and test the 'split' code, which uses two vectors for complex (one for real, one for imaginary). However this requires the V macro to hold both scalable vectors in a structure or array, which I'm not sure is supported in current compilers (the EPI compiler from the BSC does [based on clang], or a least did at some point). @rofirrim, do you know in which compiler(s) (upstream clang, upstream gcc, BSC's clang) there would be support for a struct or an array of scalable types?

@rofirrim
Copy link

Hi @rdolbeau,

Also, an alternate implementation would be to update and test the 'split' code, which uses two vectors for complex (one for real, one for imaginary). However this requires the V macro to hold both scalable vectors in a structure or array, which I'm not sure is supported in current compilers (the EPI compiler from the BSC does [based on clang], or a least did at some point). @rofirrim, do you know in which compiler(s) (upstream clang, upstream gcc, BSC's clang) there would be support for a struct or an array of scalable types?

I seem to recall you would be able to do that if it is acceptable to you to specify a minimum vector size (similar to what SVE does). Check: https://clang.llvm.org/docs/AttributeReference.html#riscv-rvv-vector-bits

I was wondering also if you could use a segmented tuple load/store and then extract the different vectors? Check https://www.godbolt.org/z/5h1Wj7dEf as an example of a naive complex multiplication in case it is useful.

Hope this helps.

@rdolbeau
Copy link
Contributor

I seem to recall you would be able to do that if it is acceptable to you to specify a minimum vector size (similar to what SVE does). Check: https://clang.llvm.org/docs/AttributeReference.html#riscv-rvv-vector-bits

Unfortunately, that would require compiling each file with a different set of options. Not impossible, but certainly a burden to add to the build system...

OTOH, my memory is failing me; while I did use an array or struct, for RISC-V V with the EPI compilers, it was a tuple type: https://github.com/rdolbeau/fftw3/blob/46763868ee386f7a94cb5863afa70519a8123d24/simd-support/simd-r5v-split.h#L73 . It should be possible to do using the current vfloat64m2_t tuple type as V, and not require any special support of feature... I need to investigate this. Only half as many registers are available, but for smaller transform the smaller overhead my be a benefit.

... or, as I just saw in the godbolt link you added, the tuple type vfloat32m1x2_t, which I didn't know existed. Plenty of options actually :-)

I was wondering also if you could use a segmented tuple load/store and then extract the different vectors?

Segmented load was the goal, it did work in simulation back then (https://github.com/rdolbeau/fftw3/blob/46763868ee386f7a94cb5863afa70519a8123d24/simd-support/simd-r5v-split.h#L216). IIRC they were once an extension, do I read the specifications right and they are actually in the base V extension now?

@rofirrim
Copy link

Segmented load was the goal, it did work in simulation back then (https://github.com/rdolbeau/fftw3/blob/46763868ee386f7a94cb5863afa70519a8123d24/simd-support/simd-r5v-split.h#L216). IIRC they were once an extension, do I read the specifications right and they are actually in the base V extension now?

Yes, they are now part of base V.

@rdolbeau
Copy link
Contributor

rdolbeau commented Jul 20, 2024

New release package from my branch here.

Edit: updated from 003 to 004

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants