
RFC: LiteSDCard: Add support for DMA/data completion interrupts #16

Open · wants to merge 10 commits into master

Conversation


@gsomlo gsomlo commented Oct 14, 2023

This is the new/reworked version of #15, which should offer a cleaner sequence of steps from the current state to the desired end goal of including DMA completion support.

Depends on enjoy-digital/litex#1787 being applied.

The important patches are 5/10 ("fix cmd irq/completion sequencing") and 6/10 ("switch cmd_done irq to edge-triggered"); the latter requires enjoy-digital/litex#1787 to be applied (breaking compatibility with the way LiteSDCard/LiteX gateware operated before that point).

Initializing completion(s) belongs with the rest of the IRQ support
setup, in `irq_init()`.

Signed-off-by: Gabriel Somlo <[email protected]>
Setting up DMA data transfers happens outside (and before) the call
to `send_cmd()`; therefore, wait for their completion also outside
(and after) the `send_cmd()` call.

Signed-off-by: Gabriel Somlo <[email protected]>
Separate the criteria for selecting commands that should use IRQ
into an explicit Boolean flag.

Signed-off-by: Gabriel Somlo <[email protected]>
To ensure interrupts cannot be missed, we must (re)initialize
completions *before* setting up any transaction that will result
in the assertion of an interrupt. Waiting for completion should
follow *after* the hardware transaction setup, and before proceeding
with e.g. a polling loop acting as fall-back for when IRQ support
is unavailable.

Signed-off-by: Gabriel Somlo <[email protected]>
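(As an illustrative sketch of the ordering described above, using the register names quoted later in this thread; this is not a verbatim excerpt of the patch:)

	/* sketch: (re)arm the completion *before* kicking off the transaction */
	reinit_completion(&host->cmd_done);

	/* ...then start the command (as in the existing send_cmd() path)... */
	litex_write32(host->sdcore + LITEX_CORE_CMDARG, arg);
	litex_write32(host->sdcore + LITEX_CORE_CMDCMD,
		      cmd << 8 | transfer << 5 | response_len);
	litex_write8(host->sdcore + LITEX_CORE_CMDSND, 1);

	/* ...and only wait afterwards: an IRQ that fired in between is not lost */
	if (host->irq > 0)
		wait_for_completion(&host->cmd_done);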
Switch from *level* to *edge* triggered IRQ mode for the command
completion interrupt. The SDIRQ_CMD_DONE interrupt will no longer
have to be toggled by the interrupt handler, laying the groundwork
for supporting (edge-triggered) data/dma completion interrupts in
a subsequent patch.

NOTE: this requires the underlying LiteX gateware to be newer
than commit #XXXXXXXX ("integration/soc/add_sdcard: switch to
edge-triggered cmd-done irq")

Signed-off-by: Gabriel Somlo <[email protected]>
Instead of heuristically selecting "slow" commands to use
interrupts, leaving "fast" ones to use polling, apply IRQ
handling equally to *all* commands.

The measured transfer times decreased slightly for some LiteX
designs (e.g., using VexRiscv/NaxRiscv), and did not increase
measurably for others (e.g., using RocketChip).

Signed-off-by: Gabriel Somlo <[email protected]>
Ensure that interrupts are turned off as part of removing the driver.

Signed-off-by: Gabriel Somlo <[email protected]>
Collect handled IRQs and ack them all together with
a single MMIO port write.

Signed-off-by: Gabriel Somlo <[email protected]>
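(A minimal sketch of the resulting handler shape, based on the fragment quoted later in this thread; acking via a write of the handled bits to LITEX_IRQ_PENDING is my assumption here, not a verbatim excerpt of the patch:)

static irqreturn_t litex_mmc_interrupt(int irq, void *arg)
{
	struct mmc_host *mmc = arg;
	struct litex_mmc_host *host = mmc_priv(mmc);
	u32 pending = litex_read32(host->sdirq + LITEX_IRQ_PENDING);
	u32 handled = 0;

	if (pending & SDIRQ_CARD_DETECT) {
		mmc_detect_change(mmc, msecs_to_jiffies(10));
		handled |= SDIRQ_CARD_DETECT;
	}
	if (pending & SDIRQ_CMD_DONE) {
		complete(&host->cmd_done);
		handled |= SDIRQ_CMD_DONE;
	}

	/* ack everything we handled with a single MMIO write */
	if (handled)
		litex_write32(host->sdirq + LITEX_IRQ_PENDING, handled);

	return handled ? IRQ_HANDLED : IRQ_NONE;
}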
Add support for utilizing interrupts generated by LiteSDCard when
DMA data transfers complete.

Signed-off-by: Gabriel Somlo <[email protected]>
@gsomlo
Author

gsomlo commented Oct 18, 2023

@paulusmack: thanks for the testing and the feedback! I'm moving the conversation to this PR instead, since it has the latest and clearest patch sequencing (although the end state is the same as in PR #15).

I will rework https://github.com/litex-hub/linux/pull/16/commits/14916705b5e4263a3d1f7a36da15fec4f5243ae1 to address your observation by writing back the value of pending immediately, before doing any actual IRQ handling, to avoid races.

In the meantime, I've done some more comprehensive speed tests of my own. I have a slower/older 32GB SanDisk card, and a newer/faster 32GB card (SanDisk "Ultra"). I ran tests on 4-core Rocket CPUs running at 50MHz (on a Nexys Video) and at 100MHz (on a stlnv7324 v2, Kintex-based board).

I used your time-based method (rather than my original, date-based one):

# for read speed (use *real* value for overall elapsed time):
mount /dev/mmcblk0p1 /mnt; time cp /mnt/testfile ./foo; umount /mnt

# for write speed (add together the *real* values for cp and umount for total elapsed time):
mount /dev/mmcblk0p1 /mnt; time cp foo /mnt/; time umount /mnt

I had a 32.4 MByte (33995776 bytes) file I used for testing.
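(So, for example, 45 seconds of *real* elapsed time corresponds to 33995776 / 45 / 1024 ≈ 738 KBytes/s.)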

The transfer speeds are in KBytes per second:

+-------------+---------------------------+---------------------------+
|             |       50MHz Rocket        |       100MHz Rocket       |
| test/commit |  SlowCard   |  FastCard   |  SlowCard   |  FastCard   |
|             |  rd  |  wr  |  rd  |  wr  |  rd  |  wr  |  rd  |  wr  |
+-------------+------+------+------+------+------+------+------+------+
| 1.  f1557fa |  731 |  696 | 1212 |  719 |  964 |  876 | 2278 | 1403 |
| 2.  66074a0 |  678 |  683 | 1182 |  620 |  944 |  765 | 2127 | 1168 |
| 3.  b695cee |  682 |  661 | 1139 |  614 |  949 |  883 | 2272 | 1166 |
| 4. heur.cmd |  718 |  676 | 1169 |  732 |  968 | 1090 | 2434 | 1384 |
| 5. poll 5us | 1126 |  184 |  374 |  --- |  388 |  191 |  388 |  328 |
| 6. poll 0us |  829 |  949 | 1921 |  --- | 1017 | 1435 | 3668 |  --- |  
+-------------+------+------+------+------+------+------+------+------+

The following commits were tested:

  1.  f1557fa : "heuristically" selected commands, no data completion irq handling (essentially the current status quo, only using edge-triggered gateware)
  2. 66074a0 : all commands subject to irq handling (remove "heuristic" selection), still no data completion irq handling
  3. b695cee : full patch series applied, all commands and data completion handled via irq
  4. patch 66074a0 reverted after applying full series (we're back to "heuristically" picking commands, but are also handling data completion via irq)
  5. polling with default #define SD_SLEEP_US 5 interval (comment out irq property in DT node)
  6. polling with aggressive #define SD_SLEEP_US 0 interval (edited in driver source)

It appears that keeping the heuristics for command completion and adding data completion IRQ handling is competitive with (sometimes faster than) the status quo, particularly as we increase the CPU clock frequency.

@Dolu1990 - I removed the "heuristics" (patch 66074a0) since it apparently made a positive difference in your case (naxriscv 64bit). Any chance you could run a similar set of tests to what I did here (and what @paulusmack did in #15 (comment)) and report back?

I'm guessing there's going to be variability since we're all using different SDCard models. Also, my results are 10-20 KB/s faster or slower if I repeat the same test multiple times, so this is definitely not super precise :)

It would be interesting if any one of us can figure out a way to dial up the cpu frequency past 100MHz to see what that trend would look like...

EDIT: added polling tests for 50MHz rocket: "fast" card writes errored out hard, and could not complete; "slow" card got really slow for writes with 5us polling interval, did fine in "aggressive" mode. It's late and I'm getting sleepy, so I'll hook up the 100MHz capable dev board tomorrow and see what I can get away with on polling there as well.

EDIT2: added polling tests for 100MHz rocket: "slow card" worked well, "fast card" had spectacularly fast reads, but face planted when trying to write.

Editorial opinion: While throwing around all these numbers, we should remember that raw sdcard transfer speeds are optimized at the expense of keeping a CPU core spinning in a polling loop, unless we use IRQs to free it up. Yes, there's an overhead to dealing with an interrupt, but it's fixed in terms of opcodes, which makes it a delay that becomes a smaller fraction of the overall transfer time as CPU frequencies scale up...

@paulusmack

Hmmm, for row 4 in your table (and also for row 1), we are still taking an interrupt for every command, even if litex_mmc_send_cmd doesn't wait for it, which seems like wasted CPU cycles. Surely it takes fewer CPU cycles to do two MMIO writes to disable/enable the interrupt than it does to take an interrupt and handle it.
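(Concretely, something along these lines around the polled path, as a sketch only, using the register names that appear in the driver snippets further down the thread:)

	/* mask just the cmd_done source while we poll for this command... */
	litex_write32(host->sdirq + LITEX_IRQ_ENABLE, SDIRQ_CARD_DETECT);
	/* ...send the command and poll LITEX_CORE_CMDEVT for completion... */
	/* ...then restore the enable mask afterwards */
	litex_write32(host->sdirq + LITEX_IRQ_ENABLE,
		      SDIRQ_CMD_DONE | SDIRQ_CARD_DETECT);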

@paulusmack

Also, I wonder why we need a data completion interrupt for writes. I don't see how the command can complete before all the data has been transferred, for a write to the card. So the command completion interrupt should be sufficient, for writes. In fact, for writes, we need neither the interrupt nor the polling of the completion status. That should simplify the code in litex_mmc_request() a bit.

@gsomlo
Author

gsomlo commented Oct 19, 2023

Hmmm, for row 4 in your table (and also for row 1), we are still taking an interrupt for every command, even if litex_mmc_send_cmd doesn't wait for it, which seems like wasted CPU cycles. Surely it takes fewer CPU cycles to do two MMIO writes to disable/enable the interrupt than it does to take an interrupt and handle it.

You're right, I guess if we're going to keep the "heuristics", we might indeed want to toggle the command completion interrupt. @Dolu1990 reported his transfer speeds improving when we got rid of them, so I just went for it, and "putting them back" for this specific test was an incomplete operation on my part -- good catch! :)

Also, I wonder why we need a data completion interrupt for writes. I don't see how the command can complete before all the data has been transferred, for a write to the card. So the command completion interrupt should be sufficient

I'll add some pr_info() statements and run some more tests that way, see if anything spins where we think it shouldn't. @Dolu1990 observed that the original upstream driver (with selective/heuristic commands-only irq handling) got "faster" when dialing SD_SLEEP_US down to 0, which to me indicated we're spinning a significant/measurable amount in places where we should be using IRQs instead.

I'm not exactly sure how much apparent "concurrency" there is between commands and data DMA transfers, or how they overlap. But the driver (in its current, unmodified-upstream state) does things in the following order:

  1. set up data DMA (if applicable)
  2. set up the command (with or without irq handling, as per the heuristic, which says "yes" to irq handling if data is transferred)
  3. wait for completion of command (completion queue or spinning/polling)
  4. if data, wait for its completion (spinning, no irq, hence the SD_SLEEP_US thing, which obviously comes into play at least for one, if not both xfer directions, as per @Dolu1990)

So I didn't want to (or think I needed to) worry about which (command or data xfer) finishes first: their respective completion queues would get triggered by the respective data/command IRQ, and when I got there I would call wait_for_completion(), which would either yield the cpu or keep going right away, depending on whether the IRQ I'm waiting for did or didn't yet fire.
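(Roughly, with the data completion added by this series, the flow would look like the sketch below; data_done and the helper comments are illustrative, not necessarily what the patches literally use:)

	reinit_completion(&host->cmd_done);	/* arm completions before any setup */
	reinit_completion(&host->data_done);	/* data_done: illustrative name */

	/* 1. set up the data DMA, if applicable */
	/* 2. send the command (the CMDARG/CMDCMD/CMDSND writes in litex_mmc_send_cmd()) */

	wait_for_completion(&host->cmd_done);	/* 3. returns at once if already signalled */
	if (transfer != SD_CTL_DATA_XFER_NONE)
		wait_for_completion(&host->data_done);	/* 4. same for the data/DMA IRQ */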

@enjoy-digital might be able to shed some light on whether we're guaranteed for commands to always finish before write DMA transfers (but not reads). I'll run some empirical tests on my end, but if I don't find any evidence of writes spinning past command completion, I don't have enough insight to equate that with "evidence of absence" :)

EDIT: I ran some tests on 50MHz Rocket (nexys video), by unwinding the readx_poll_timeout() loops used by the command-completion portion (https://github.com/litex-hub/linux/blob/litex-rebase/drivers/mmc/host/litex_mmc.c#L108)
and by the data-completion polling (https://github.com/litex-hub/linux/blob/litex-rebase/drivers/mmc/host/litex_mmc.c#L183), and adding counters for how many times each of them spins.

With (heuristic command-only) IRQs, nothing spins, for me, ever. With polling (no irq support), only command completion spins, data completion never does (neither on reads, nor on writes).

@Dolu1990 -- can you try the original unmodified gateware and upstream driver with the spin counter patch applied and let us know what you see?

@enjoy-digital -- off the top of your head, is it possible for a dma transfer to outlast a command that's started concurrently? I can probably dig into the gateware HDL sources and try to figure this out, but maybe you already know the answer :)

@paulusmack

Regarding DMA completion on writes, it doesn't seem logically possible to me for the card to claim that a write command has been completed when it hasn't yet received the data to write. I suppose ultimately it depends on what the SD card specification says about the precise meaning of the command completion status.

@gsomlo
Author

gsomlo commented Oct 20, 2023 via email

@paulusmack

Apparently there's a way for the DMA xfer to be still on-going when the command portion of the entire transaction is completed, I just can't seem to reproduce it on my setup.

I could understand that for a read, but for a write that would be bizarre.

I have never actually seen it either. Using interrupts and completions means that the time from when the command is complete and the interrupt is raised, to when we are actually checking for DMA completion, is at least 120 microseconds on my setup, more than enough time for transferring 512 bytes via DMA.

@paulusmack

paulusmack commented Oct 20, 2023

Looking at the driver, I see there are four call sites where we call litex_mmc_send_cmd() with transfer == SD_CTL_DATA_XFER_NONE and only one call site where transfer can possibly be something else.

That suggests to me that we could move the parts of litex_mmc_send_cmd() that have to do with data handling out to that one call site. Then we could split litex_mmc_send_cmd() into two versions: one for the four call sites where we are never transferring data, and a separate copy for the other call site that can transfer data. Then we make the first copy always do polling and the second copy always use an interrupt for command completion. That way we get good performance but we don't need any explicit heuristic in litex_mmc_send_cmd().

I'll do a patch to show what I'm thinking of.
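(Sketch of the split being described; the names here are purely illustrative, not the actual patch:)

/* polling-only variant, for the four call sites that never transfer data */
static int litex_mmc_send_cmd_nodata(struct litex_mmc_host *host,
				     u8 cmd, u32 arg, u8 response_len);

/* IRQ-based variant, for the single call site that can transfer data */
static int litex_mmc_send_cmd_data(struct litex_mmc_host *host,
				   u8 cmd, u32 arg, u8 response_len, u8 transfer);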

@gsomlo
Author

gsomlo commented Oct 20, 2023

Looking at the driver, I see there are four call sites where we call litex_mmc_send_cmd() with transfer == SD_CTL_DATA_XFER_NONE and only one call site where transfer can possibly be something else.

That suggests to me that we could move the parts of litex_mmc_send_cmd() that have to do with data handling out to that one call site. [...]

I'll do a patch to show what I'm thinking of.

How about starting on top of commit 1f7136d and then modifying the litex_mmc_send_cmd() signature to add a Boolean use_irq argument? Then you can decide on the heuristic right at the call site itself. Is that along the lines of what you had in mind?
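(I.e., something like the following, sketched with illustrative naming:)

static int litex_mmc_send_cmd(struct litex_mmc_host *host, u8 cmd, u32 arg,
			      u8 response_len, u8 transfer, bool use_irq);

/* the data-transfer call site would then decide explicitly, e.g.:
 *   litex_mmc_send_cmd(host, cmd, arg, response_len, transfer,
 *                      host->irq > 0 && transfer != SD_CTL_DATA_XFER_NONE);
 */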

Either way, I'd like to hear from @Dolu1990 before we really decide on any changes, since his NaxRiscV experiments apparently did run into scenarios where data completion occurred after command completion, and I'd like to get a better feeling for why/how that happened...

@Dolu1990

Hi,

So I just did some tests using a dual-core Nax (16 KB i$, 16 KB d$ each) at 100 MHz, with 256 KB of L2.
Using a time cp test of 32 MB:

- Vanilla 
  - W 468  R 936
- Vanilla + sleep 0 
  - W 2100 R 4551
- this patch (b695ceef)
  - W 1733 R 3343

I checked again with the logic analyser. For the read test all seems good; I don't see the CPU waiting on the timer tick anymore.
For the write test I observed the following behaviour, which seems suspect? Or maybe it is intended :)

static int litex_mmc_send_cmd(struct litex_mmc_host *host,
			      u8 cmd, u32 arg, u8 response_len, u8 transfer)
	...		      
	/* Send command */
	litex_write32(host->sdcore + LITEX_CORE_CMDARG, arg);
	litex_write32(host->sdcore + LITEX_CORE_CMDCMD,
		      cmd << 8 | transfer << 5 | response_len);
	litex_write8(host->sdcore + LITEX_CORE_CMDSND, 1);

    trace(1);
	if (host->irq > 0)
		wait_for_completion(&host->cmd_done);
    trace(2);
    
    
static irqreturn_t litex_mmc_interrupt(int irq, void *arg)
{
	struct mmc_host *mmc = arg;
	struct litex_mmc_host *host = mmc_priv(mmc);
	u32 pending = litex_read32(host->sdirq + LITEX_IRQ_PENDING);
	u32 handled = 0;
    trace(7);    
   ...

You can see below that trace(1) is followed by two interrupts (trace(7)) before reaching trace(2):
(logic analyser capture attached as an image)

Is that intended?

@gsomlo
Author

gsomlo commented Oct 20, 2023

@Dolu1990, I think your two interrupts might be the data/DMA completion IRQ happening before your command completion IRQ, which then wakes up the cmd_done completion wait. I think that's normal. But the problem I thought we were trying to solve is spinning while waiting for data completion after waking up from the cmd_done completion (the only place I'd expect the SD_SLEEP_US value to make a difference in the vanilla, pre-modification kernel).

Can you try testing for reads and writes with this patch applied to the original unmodified litex-rebase tree on the unmodified (level-triggered) irq gateware? (Feel free to replace the pr_warn() calls with your own trace() if that helps visualize the results better.)

I'd like to know if gls2_readx_poll_timeout() spins at all for you or not. gls1_readx_poll_timeout() should never spin if you're using IRQs for command completion.

If gls2_readx... does spin, you're completing a command before the data DMA operation is finished, and in that case it would make sense to

  1. figure out why
  2. decide if 1 is a problem and fix it, or add support for data completion IRQs (the whole point of this patch series) :)
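(For reference, a hypothetical sketch of what such a spin-counting poll helper could look like; this is not the actual foo.patch, and the "done" bit position is assumed:)

/* hypothetical sketch only -- timeout handling elided */
static int gls_wait_done_counted(void __iomem *reg, unsigned long *spins)
{
	*spins = 0;
	while (!(litex_read8(reg) & 0x1)) {	/* bit 0 assumed to be "done" */
		(*spins)++;			/* count each extra poll iteration */
		udelay(SD_SLEEP_US);
	}
	return 0;
}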

@Dolu1990

So I tested your https://github.com/litex-hub/linux/files/13056696/foo.patch.txt with none of the modifications from this PR, with SLEEP = 5 (default).

gls1_readx_poll_timeout() should never spin if you're using IRQs for command completion.

From what I can measure, that's right: it doesn't spin when IRQs are used in litex_mmc_send_cmd.

if (host->irq > 0 &&
	    (transfer != SD_CTL_DATA_XFER_NONE ||
	     response_len == SD_CTL_RESP_SHORT_BUSY)) {
		reinit_completion(&host->cmd_done);
		litex_write32(host->sdirq + LITEX_IRQ_ENABLE,
			      SDIRQ_CMD_DONE | SDIRQ_CARD_DETECT);
		wait_for_completion(&host->cmd_done);
        trace(0);
        ret = litex_mmc_sdcard_wait_done2(host->sdcore + LITEX_CORE_CMDEVT, dev); /* this one never spins */
        trace(1);
} else {
          ret = litex_mmc_sdcard_wait_done(host->sdcore + LITEX_CORE_CMDEVT, dev); /* here it spins, seemingly always once */
 }

I'd like to know if gls2_readx_poll_timeout() spins at all for you or not.

It never spins.

I thought we're trying to solve is spinning while waiting for data completion after waking up from the cmd_done completion

When did you observe that behaviour? My observation from a while ago was that the driver was waiting on kernel ticks before making progress. (I'm just asking to be sure there is no misunderstanding.)

@gsomlo
Author

gsomlo commented Oct 20, 2023

if (host->irq > 0 &&
	    (transfer != SD_CTL_DATA_XFER_NONE ||
	     response_len == SD_CTL_RESP_SHORT_BUSY)) {
		[...]
        trace(0);
        ret = litex_mmc_sdcard_wait_done2(host->sdcore + LITEX_CORE_CMDEVT, dev); /* this one never spins */
        trace(1);
} else {
          ret = litex_mmc_sdcard_wait_done(host->sdcore + LITEX_CORE_CMDEVT, dev); /* here it spins, seemingly always once */
 }

Oh, it makes sense now -- you're spinning while waiting for command completion during commands that weren't heuristically selected to use IRQs :)

This heuristic selectivity threw me off, and I (WRONGLY!) thought you must be spinning while waiting for DMA...

Given this new knowledge, I think we should

  1. forget about enjoy-digital/litex#1787 ("integration/soc/add_sdcard: switch to edge-triggered cmd-done irq")
  2. go with @paulusmack's idea of making the heuristic a Boolean argument to litex_mmc_send_cmd().
  3. finally, we should then select which call sites should receive a true vs. false value for that argument

Can you print out the command number that's causing you to spin (i.e., where the heuristic to use IRQ doesn't apply and perhaps should)?

In conclusion, I think we can also ditch all commits beyond 1f7136d and leave the IRQ setup and handling as-is, and just "tune" the heuristic instead... :)

Apologies to everyone for the misunderstanding!

@Dolu1990

The command which makes it spin while not using IRQs is 23 (decimal) => SET_BLOCK_COUNT =>

"Defines the number of blocks which are
going to be transferred in the immediately
succeeding multiple block read or write
command. "

So, to me it seems that, if we don't use an IRQ for a given command, then for performance reasons we shouldn't use readx_poll_timeout(), but instead something which does active polling, possibly with a "sleep(0)" to allow rescheduling to other tasks.

@gsomlo
Author

gsomlo commented Oct 20, 2023 via email

@Dolu1990

I did a test with the following:

	if (host->irq > 0 &&
	    (transfer != SD_CTL_DATA_XFER_NONE ||
	     response_len == SD_CTL_RESP_SHORT_BUSY)) {
		reinit_completion(&host->cmd_done);
		litex_write32(host->sdirq + LITEX_IRQ_ENABLE,
			      SDIRQ_CMD_DONE | SDIRQ_CARD_DETECT);
		wait_for_completion(&host->cmd_done);
        trace(0);
        ret = litex_mmc_sdcard_wait_done_sleep5(host->sdcore + LITEX_CORE_CMDEVT, dev);
        trace(1);
	} else {
        ret = litex_mmc_sdcard_wait_done_sleep0(host->sdcore + LITEX_CORE_CMDEVT, dev, cmd);
    }

And that solves the performance issue, with everything else vanilla.

That or we should add ... || cmd == SET_BLOCK_COUNT to the irq selection heuristic :)

I don't agree; it isn't a heuristic issue. I think it is more that sending a command and then waiting on the next timer IRQ before making forward progress seems weird to me, unless you expect the command to take ages (100ms+).

@gsomlo
Author

gsomlo commented Oct 20, 2023 via email

@Dolu1990

I think this is maybe a good alternative to a fully blocking busy loop:
https://github.com/torvalds/linux/blob/9c5d00cb7b6bbc5a7965d9ab7d223b5402d1f02c/drivers/net/ethernet/mediatek/mtk_eth_soc.c#L307

static int mtk_mdio_busy_wait(struct mtk_eth *eth)
{
	unsigned long t_start = jiffies;

	while (1) {
		if (!(mtk_r32(eth, MTK_PHY_IAC) & PHY_IAC_ACCESS))
			return 0;
		if (time_after(jiffies, t_start + PHY_IAC_TIMEOUT))
			break;
		cond_resched();
	}

	dev_err(eth->dev, "mdio: MDIO timeout\n");
	return -ETIMEDOUT;
}
  • It yields the CPU to other tasks using cond_resched()
  • It implements a timeout via time_after()
  • It loops until the condition is met

Would that be acceptable as a "non-IRQ wait path"?

@Dolu1990

Note, I have never used cond_resched() / know next to nothing about cond_resched() XD

@paulusmack

paulusmack commented Oct 24, 2023

Certainly we shouldn't spin indefinitely without a timeout. Regarding the use of cond_resched(), it's probably reasonably quick in the case where it doesn't reschedule - I would expect a few microseconds on the class of CPU that we're talking about here. If it does reschedule, it could be ages (seconds) before it returns.

It seems reasonable to me to spin without calling cond_resched() for a length of time that is comparable to the context switch latency. On FPGA-based CPUs it's probably reasonable to spin without calling cond_resched for up to 100 - 200 microseconds or so.

On my 100MHz Microwatt CPU (FPGA-based) I measured the latency to take and dispatch an interrupt and then resume a task that had called wait_for_completion() to be at least 120 microseconds. So for commands that are going to take less than that amount of time it is more efficient in terms of total CPU cycles consumed to spin rather than use an interrupt.

If we know that a command is likely to take a long time (a millisecond or more) then the best thing is to use an interrupt and call wait_for_completion() without any polling. But if we think it is probably quick, or we don't really know whether it will be quick, then what I would do is poll for about 100 to 150 microseconds. If that times out, then enable the interrupt and call wait_for_completion().

By the way, I notice we are using readx_poll_timeout() rather than readx_poll_timeout_atomic(). The difference is that readx_poll_timeout() uses usleep_range() which involves setting up an hrtimer and sleeping (meaning the scheduler gets involved). The guidance in Documentation/timers/timers-howto.rst is that for sleeping for less than 10 us it is better to use udelay() rather than usleep_range() (meaning it is better to use readx_poll_timeout_atomic() rather than readx_poll_timeout()). (Note that that guidance is probably written for GHz-class CPUs, so for slow FPGA-based CPUs it is probably best to use udelay() for anything under 100 us.)

My suggestions would be: for commands that may be quick, poll for completion using readx_poll_timeout_atomic() with 1 us delay between polls and a 150 us timeout, and if the poll times out, then enable the interrupt and use wait_for_completion() to wait until it is done. (If we do that then I don't think we need cond_resched().) If we call wait_for_completion(), then when it returns we should check the completion status, like what litex_mmc_sdcard_wait_done does, but without any polling, just a single read of the command-event register.
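(A sketch of that hybrid wait, assuming the existing cmd_done completion and IRQ-enable register are reused; the "done" bit position and the 1 us / 150 us numbers are illustrative:)

static int litex_mmc_wait_cmd_hybrid(struct litex_mmc_host *host)
{
	void __iomem *reg = host->sdcore + LITEX_CORE_CMDEVT;
	u8 evt;
	int ret;

	/* poll with udelay() for up to ~150 us first (1 us between polls) */
	ret = readx_poll_timeout_atomic(litex_read8, reg, evt, evt & 0x1, 1, 150);
	if (!ret)
		return 0;	/* quick command: finished without an interrupt */

	/* slow command: arm the IRQ and sleep until it fires; assumes the
	 * pending bit stays latched so enabling late still raises the IRQ */
	reinit_completion(&host->cmd_done);
	litex_write32(host->sdirq + LITEX_IRQ_ENABLE,
		      SDIRQ_CMD_DONE | SDIRQ_CARD_DETECT);
	wait_for_completion(&host->cmd_done);

	/* single read of the command-event register to pick up the status */
	evt = litex_read8(reg);
	return (evt & 0x1) ? 0 : -EIO;
}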

@Dolu1990

Sounds good to me :)

@gsomlo
Author

gsomlo commented Oct 26, 2023 via email
