Some tests failing on M4 #13

Open
dougallj opened this issue Jun 1, 2024 · 2 comments
@dougallj

dougallj commented Jun 1, 2024

I had a quick look at the M4 – the tests for EXTRX, EXTRY, VECINT and VECFP are failing.

EXTRX and EXTRY can be fixed with the following change to each:

         if ((AMX_VER >= AMX_VER_M2) && (operand & (1ull << 31))) {
             operand &=~ (0x1ffull << 32);
             z_step = z_col & 32 ? 16 : 32;
         }
+        if ((AMX_VER >= AMX_VER_M4) && (operand & (1ull << 31))) {
+            dst_offset &= ~0x3F;
+        }
         store_enable &= parse_writemask(operand >> 32, xybytes, 9);
     } else if (operand & EXTR_BETWEEN_XY) {
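
As a standalone illustration of what the added lines do, here is a minimal sketch in C. It is not the emulator's code: the `SKETCH_VER_*` constants and the function name are hypothetical stand-ins, and only the condition and the `~0x3F` mask are taken from the patch above.

```c
#include <stdint.h>

/* Hypothetical version constants for this sketch only; the emulator's real
   AMX_VER_* values are defined in its own headers and may differ. */
enum { SKETCH_VER_M1 = 1, SKETCH_VER_M2 = 2, SKETCH_VER_M3 = 3, SKETCH_VER_M4 = 4 };

/* Mirrors the added M4 branch: when operand bit 31 is set on M4 or later,
   clear the low six bits of the destination offset, which rounds it down
   to a multiple of 64. Otherwise the offset is returned unchanged. */
static uint64_t sketch_adjust_dst_offset(int amx_ver, uint64_t operand,
                                         uint64_t dst_offset) {
    if (amx_ver >= SKETCH_VER_M4 && (operand & (1ull << 31))) {
        dst_offset &= ~(uint64_t)0x3F;
    }
    return dst_offset;
}
```

With these assumptions, an offset of 0x7F becomes 0x40 on M4 when bit 31 is set, and stays 0x7F on earlier versions or when bit 31 is clear.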

VECINT and VECFP seem to have similar changes – if I only test operands of the form rand_next() & ~(0x3F | (0x3F<<10)) the tests pass. I was able to fix a simple test case by zeroing those bits if bit 31 was set, but that broke indexed-loads. Trying to fix indexed-loads didn't go well, and other experiments imply that that wouldn't be the end of it either. I might be able to work through it, but I figured I'd leave this here in case it's helpful.
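
For reference, the mask in that operand form clears two 6-bit fields, bits 0-5 and bits 10-15. A minimal sketch in C follows; the macro and function names are hypothetical, and the second function only illustrates the simple experiment described above (which broke indexed-loads), not a working fix.

```c
#include <stdint.h>

/* Bits 0-5 and bits 10-15, i.e. 0x3F | (0x3F << 10) == 0xFC3F. */
#define SKETCH_LOW_FIELDS ((0x3Full) | (0x3Full << 10))

/* Restrict a random operand to the form under which the tests pass. */
static uint64_t sketch_restrict_operand(uint64_t operand) {
    return operand & ~SKETCH_LOW_FIELDS;
}

/* The simple experiment from the comment: zero those bits only when
   bit 31 is set. Reported to fix one test case but break indexed-loads. */
static uint64_t sketch_clear_if_bit31(uint64_t operand) {
    if (operand & (1ull << 31)) {
        operand &= ~SKETCH_LOW_FIELDS;
    }
    return operand;
}
```

Using explicit `ull` constants avoids relying on sign extension of an `int` mask when it is applied to a 64-bit operand.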

Edit: Also, entirely unsurprisingly, Streaming-SVE mode (SME) and AMX are mutually exclusive – if either is enabled, trying to enable the other gives EXC_BAD_INSTRUCTION.

@corsix
Owner

corsix commented Jun 1, 2024

I'm mildly surprised that AMX instructions are still present at all, given the introduction of SME.

I don't have any M4 hardware to test against at the moment, though I might pick up an M4 MBP when they come out.

@dougallj
Author

dougallj commented Jun 2, 2024

Yeah, I was surprised too – my initial theory was that it was just for software compatibility (within Apple), but I think we're also seeing worse f16/bf16 throughput with SME, because the spec'd SME operations map less directly to what AMX can do at that size. I might be misremembering the AMX behaviour, or misusing SME, but I've measured single-core SME f16 FLOPS ≈ single-core SME f32 FLOPS (as did someone else: https://mastodon.social/@[email protected]/112528651326649755).
