ACPI/HMAT: Move HMAT messages to pr_debug() #34

Open
wants to merge 166 commits into
base: 24.04_linux-nvidia

Conversation

clsotog
Collaborator

@clsotog clsotog commented Dec 6, 2024

ACPI/HMAT: Move HMAT messages to pr_debug()

The HMAT messages printed at boot, beyond being noisy, can also print details for nodes that are not yet enabled. The primary method to consume HMAT details is via sysfs, and the sysfs interface gates what is emitted by whether the node is online or not. Hide the messages by default by moving them from "info" to "debug" log level.

Otherwise, these prints are just a pretty-print way to dump the ACPI HMAT table. It has always been the case that post-analysis was required for these messages to map proximity-domains to Linux NUMA nodes, and as Priya points out that analysis also needs to consider whether the proximity domain is marked "enabled" in the SRAT.
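
As a rough userspace illustration of what moving a message from "info" to "debug" means, the sketch below uses stand-in macros rather than the kernel's own pr_info()/pr_debug(). The message text and the report_domain() helper are hypothetical, and unlike the kernel macros these stand-ins evaluate to the number of characters printed so the effect is observable.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Userspace stand-ins for pr_info()/pr_debug(); illustrative only.
 * Each evaluates to the number of characters printed.
 */
#define pr_info(...)  printf(__VA_ARGS__)
#ifdef DEBUG
#define pr_debug(...) printf(__VA_ARGS__)
#else
/* Not printed without DEBUG, but the arguments are still type-checked. */
#define pr_debug(...) (0 ? printf(__VA_ARGS__) : 0)
#endif

/* Hypothetical message; the real HMAT output format is not reproduced. */
static int report_domain(unsigned int pxm, int enabled)
{
    assert(enabled == 0 || enabled == 1); /* mirrors an "enabled" flag */
    return pr_debug("HMAT: proximity domain %u (enabled=%d)\n", pxm, enabled);
}
```

Built without DEBUG, report_domain() prints nothing, while pr_info() calls still reach the log. In a real kernel the debug messages can usually be re-enabled at runtime through dynamic debug when that facility is configured.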

Reported-by: Priya Autee <[email protected]>

Acked-by: Rafael J. Wysocki <[email protected]>
Link: https://patch.msgid.link/170668982094.318782.2963631284830500182.stgit@dwillia2-xfh.jf.intel.com

(cherry picked from commit e2b952ffafced49fa6bd5cdc90f472b8bd932b5d cxl-next)
Signed-off-by: Carol L Soto <[email protected]>

ianmay81 and others added 30 commits November 20, 2024 14:54
…dversion"

This reverts commit 47d27f2.

We need to revert this to avoid regressing any modules used in Jammy.

Signed-off-by: Ian May <[email protected]>
Ignore: yes
Signed-off-by: Ian May <[email protected]>
Ignore: yes
Signed-off-by: Ian May <[email protected]>
Ignore: yes
Signed-off-by: Paolo Pisati <[email protected]>
Ignore: yes
Signed-off-by: Andrea Righi <[email protected]>
Ignore: yes
Signed-off-by: Andrea Righi <[email protected]>
Ignore: yes
Signed-off-by: Andrea Righi <[email protected]>
Ignore: yes
Signed-off-by: Andrea Righi <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2055128
Properties: no-test-build
Signed-off-by: Andrea Righi <[email protected]>
Ignore: yes
Signed-off-by: Andrea Righi <[email protected]>
haiyangz and others added 24 commits November 20, 2024 14:54
BugLink: https://bugs.launchpad.net/bugs/2084598

Change the Kconfig dependency, so this driver can be built and run on ARM64
with 4K page size.
16/64K page sizes are not supported yet.

Signed-off-by: Haiyang Zhang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
(cherry picked from commit 40a1d11)
Signed-off-by: John Cabaj <[email protected]>
Acked-by: Tim Gardner <[email protected]>
Acked-by: Paolo Pisati <[email protected]>
Signed-off-by: John Cabaj <[email protected]>
(cherry picked from commit 775969387fec7d04adcc705b656ee5a4396a0579 noble:linux-azure/master-next)
Signed-off-by: Jacob Martin <[email protected]>
Acked-by: Brad Figg <[email protected]>
Acked-by: John Cabaj <[email protected]>
Acked-by: Guoqing Jiang <[email protected]>
Signed-off-by: Jacob Martin <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2084598

As defined by the MANA Hardware spec, the queue size for DMA must be at
least 4KB and a power of 2. In addition, the HWC queue size has to be
exactly 4KB.

To support page sizes other than 4KB on ARM64, define the minimal
queue size as a macro separate from PAGE_SIZE, which was always
assumed to be 4KB before ARM64 support was added.

Also, add MANA specific macros and update code related to size
alignment, DMA region calculations, etc.
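
A minimal sketch of the size rules described in this commit message, assuming a fixed 4KB hardware page. The macro and function names here are hypothetical and are not taken from the driver source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names, not taken from the driver source. */
#define HW_PAGE_SHIFT 12u
#define HW_PAGE_SIZE  (UINT64_C(1) << HW_PAGE_SHIFT)  /* 4KB device page */
#define HW_MIN_QSIZE  HW_PAGE_SIZE                    /* smallest DMA queue */

/* Round len up to a multiple of the 4KB hardware page, not PAGE_SIZE. */
static inline uint64_t hw_page_align(uint64_t len)
{
    assert(len <= UINT64_MAX - (HW_PAGE_SIZE - 1)); /* guard against wrap */
    return (len + HW_PAGE_SIZE - 1) & ~(HW_PAGE_SIZE - 1);
}

/* A DMA queue size must be at least 4KB and a power of two. */
static inline bool hw_queue_size_valid(uint64_t size)
{
    return size >= HW_MIN_QSIZE && (size & (size - 1)) == 0;
}

/* The HWC queue must be exactly 4KB. */
static inline bool hwc_queue_size_valid(uint64_t size)
{
    return size == HW_PAGE_SIZE;
}
```

Keeping these constants independent of the system page size means the same checks hold on a 4KB x86 kernel and on a 64KB ARM64 kernel.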

Signed-off-by: Haiyang Zhang <[email protected]>
Reviewed-by: Michael Kelley <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
(cherry picked from commit 382d174)
Signed-off-by: John Cabaj <[email protected]>
Acked-by: Marcelo Henrique Cerri <[email protected]>
Acked-by: Thibault Ferrante <[email protected]>
Signed-off-by: John Cabaj <[email protected]>
(cherry picked from commit 4191de20636ee76151172bf5c88bf0cdb1bafc05 noble:linux-azure/master-next)
Signed-off-by: Jacob Martin <[email protected]>
Acked-by: Brad Figg <[email protected]>
Acked-by: John Cabaj <[email protected]>
Acked-by: Guoqing Jiang <[email protected]>
Signed-off-by: Jacob Martin <[email protected]>
… size

BugLink: https://bugs.launchpad.net/bugs/2084598

MANA hardware uses 4k page size. When calculating the page table index,
it should use the hardware page size, not the system page size.
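
The difference can be sketched with a small helper that computes a page index for a given page shift. The names and the 64KB system page shift are assumptions chosen for illustration only.

```c
#include <assert.h>
#include <stdint.h>

#define HW_PAGE_SHIFT      12u  /* hypothetical name: 4KB device page */
#define SYS_PAGE_SHIFT_64K 16u  /* example: ARM64 kernel with 64KB pages */

/* Index of the page containing addr, counted from the page containing base. */
static inline uint64_t page_index(uint64_t base, uint64_t addr,
                                  unsigned int shift)
{
    assert(shift < 64);   /* shifting by 64 or more is undefined */
    assert(addr >= base); /* index is only meaningful inside the region */
    return (addr >> shift) - (base >> shift);
}
```

For base 0x10000 and address 0x13000, the hardware shift gives index 3, while the 64KB system shift gives index 0, which is the kind of mismatch this fix addresses.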

Cc: [email protected]
Fixes: 0266a17 ("RDMA/mana_ib: Add a driver for Microsoft Azure Network Adapter")
Signed-off-by: Long Li <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Leon Romanovsky <[email protected]>
(cherry picked from commit 9e517a8)
Signed-off-by: John Cabaj <[email protected]>
Signed-off-by: Jacob Martin <[email protected]>
Acked-by: Brad Figg <[email protected]>
Acked-by: John Cabaj <[email protected]>
Acked-by: Guoqing Jiang <[email protected]>
Signed-off-by: Jacob Martin <[email protected]>
…l page

BugLink: https://bugs.launchpad.net/bugs/2084598

When mapping doorbell page from user-mode, the driver should use the system
page size as this memory is allocated via mmap() from user-mode.

Cc: [email protected]
Fixes: 0266a17 ("RDMA/mana_ib: Add a driver for Microsoft Azure Network Adapter")
Signed-off-by: Long Li <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Leon Romanovsky <[email protected]>
(cherry picked from commit 4a3b99b)
Signed-off-by: John Cabaj <[email protected]>
Signed-off-by: Jacob Martin <[email protected]>
Acked-by: Brad Figg <[email protected]>
Acked-by: John Cabaj <[email protected]>
Acked-by: Guoqing Jiang <[email protected]>
Signed-off-by: Jacob Martin <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2084598

Set the following configs on x86 and arm64:

CONFIG_MANA_INFINIBAND=m
CONFIG_MICROSOFT_MANA=m

Signed-off-by: Jacob Martin <[email protected]>
Acked-by: Brad Figg <[email protected]>
Acked-by: John Cabaj <[email protected]>
Acked-by: Guoqing Jiang <[email protected]>
Signed-off-by: Jacob Martin <[email protected]>
…ackage

BugLink: https://bugs.launchpad.net/bugs/2084598

Include mana.ko in linux-modules-ABIVER, rather than
linux-modules-extra-ABIVER.

Signed-off-by: Jacob Martin <[email protected]>
Acked-by: Brad Figg <[email protected]>
Acked-by: John Cabaj <[email protected]>
Acked-by: Guoqing Jiang <[email protected]>
Signed-off-by: Jacob Martin <[email protected]>
Ignore: yes
Signed-off-by: Jacob Martin <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2084817
Properties: no-test-build
Signed-off-by: Jacob Martin <[email protected]>
Ignore: yes
Signed-off-by: Jacob Martin <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2085928
Properties: no-test-build
Signed-off-by: Jacob Martin <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2086233

The CPPC performance feedback counters could be 0 or unchanged when the
target cpu is in a low-power idle state, e.g. power-gated or clock-gated.

When the counters are 0, cppc_cpufreq_get_rate() returns 0 KHz, which makes
cpufreq_online() get a false error and fail to generate a cpufreq policy.

When the counters are unchanged, the existing cppc_perf_from_fbctrs()
returns a cached desired perf, but some platforms may update the real
frequency back to the desired perf reg.

For the above cases in cppc_cpufreq_get_rate(), get the latest desired perf
from the CPPC reg to reflect the frequency because some platforms may
update the actual frequency back there; if failed, use the cached desired
perf.
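
The fallback order described above can be sketched in userspace as follows. The struct, the function names, and the two reader stubs are hypothetical stand-ins and do not reproduce the driver's actual code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical snapshot of the two feedback counters. */
struct fb_ctrs {
    uint64_t reference;
    uint64_t delivered;
};

/* Returns 0 when the counters are zero or did not advance (idle CPU). */
static uint64_t perf_from_ctrs(uint64_t reference_perf,
                               struct fb_ctrs t0, struct fb_ctrs t1)
{
    uint64_t d_ref = t1.reference - t0.reference;
    uint64_t d_del = t1.delivered - t0.delivered;

    if (d_ref == 0 || d_del == 0)
        return 0;
    return reference_perf * d_del / d_ref;
}

/* Stub register readers: return 0 on success and fill *out. */
static int read_desired_ok(uint64_t *out)   { assert(out); *out = 42; return 0; }
static int read_desired_fail(uint64_t *out) { assert(out); return -1; }

/* Order: counters, then a fresh desired-perf read, then the cached value. */
static uint64_t current_perf(uint64_t reference_perf,
                             struct fb_ctrs t0, struct fb_ctrs t1,
                             int (*read_desired)(uint64_t *out),
                             uint64_t cached_desired)
{
    uint64_t perf = perf_from_ctrs(reference_perf, t0, t1);
    uint64_t desired;

    if (perf)
        return perf;
    if (read_desired && read_desired(&desired) == 0)
        return desired;
    return cached_desired;
}
```

With this ordering an idle CPU never reports 0, so a caller that treats 0 as an error no longer fails spuriously.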

Fixes: 6a4fec4 ("cpufreq: cppc: cppc_cpufreq_get_rate() returns zero in all error cases.")
Signed-off-by: Jie Zhan <[email protected]>
Reviewed-by: Zeng Heng <[email protected]>
Reviewed-by: Ionela Voinescu <[email protected]>
Reviewed-by: Huisong Li <[email protected]>
Signed-off-by: Viresh Kumar <[email protected]>
(cherry picked from commit c471956 linux-next)
Signed-off-by: Jamie Nguyen <[email protected]>
Tested-by: Carol Soto <[email protected]>
Acked-by: Brad Figg <[email protected]>
Acked-by: Carol L Soto <[email protected]>
Acked-by: Koba Ko <[email protected]>
Signed-off-by: Brad Figg <[email protected]>
Acked-by: Noah Wager <[email protected]>
Acked-by: Jacob Martin <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2086233

Since commit 6c8d750 ("cpufreq / cppc: Work around for Hisilicon CPPC
cpufreq"), there has been a workaround for HiSilicon platforms that do not
support performance feedback counters but can report the actual
frequency through the desired perf register.  Later on, FIE was disabled in
that workaround as well.

Now the workaround can be handled by the common code.  Desired perf would be
read and converted to frequency if feedback counters don't change.  FIE
would be disabled if the CPPC regs are in PCC region.

Hence, the workaround is no longer needed and can be safely removed, in an
effort to consolidate the driver procedure.

Signed-off-by: Jie Zhan <[email protected]>
Reviewed-by: Xiongfeng Wang <[email protected]>
Reviewed-by: Huisong Li <[email protected]>
[ Viresh: Move fie_disabled within CONFIG option to fix warning ]
Signed-off-by: Viresh Kumar <[email protected]>
(cherry picked from commit ea1829d linux-next)
Signed-off-by: Jamie Nguyen <[email protected]>
Tested-by: Carol Soto <[email protected]>
Acked-by: Brad Figg <[email protected]>
Acked-by: Carol L Soto <[email protected]>
Acked-by: Koba Ko <[email protected]>
Signed-off-by: Brad Figg <[email protected]>
Acked-by: Noah Wager <[email protected]>
Acked-by: Jacob Martin <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2088114

There is a corner case where the desired_perf is exactly the same as the
old perf, but the actual current freq is not.

This happens during S3 while the cpufreq governor is set to powersave.
During cpufreq resume process, the booting CPU's new_freq obtained via
.get() is the highest frequency, while the policy->cur and
cpu->perf_ctrls.desired_perf are set to the lowest level (powersave
governor). This causes the warning: "CPU frequency out of sync:", and
the cpufreq core sets policy->cur to new_freq.

Then governor->limits() calls cppc_cpufreq_set_target() to configure
the CPU frequency, which returns immediately because the desired_perf
converted from target_freq is the same as
cpu->perf_ctrls.desired_perf and both are the lowest_perf.

Since target_freq and policy->cur have been already compared in
__cpufreq_driver_target(), there's no need to compare them again here.

Drop the comparison.
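
The corner case can be modelled with a toy state machine. This is only a sketch: the struct, the set_target() helper, and the skip_if_unchanged flag are hypothetical and exist solely to contrast the behaviour with and without the early return.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct cpu_state {
    uint32_t cached_desired_perf; /* what the driver last requested */
    uint32_t hw_perf;             /* what the hardware actually runs at */
};

/* skip_if_unchanged models the comparison that this commit drops. */
static void set_target(struct cpu_state *cpu, uint32_t desired_perf,
                       bool skip_if_unchanged)
{
    assert(cpu);
    if (skip_if_unchanged && desired_perf == cpu->cached_desired_perf)
        return; /* early return: the hardware is never reprogrammed */
    cpu->cached_desired_perf = desired_perf;
    cpu->hw_perf = desired_perf;
}
```

Starting from a cached value of 1 and a hardware value of 10 (the state after resume), the call with the comparison leaves the hardware at 10, while the call without it brings the hardware back to 1.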

Signed-off-by: Riwen Lu <[email protected]>
[ Viresh: Updated commit message / subject ]
Signed-off-by: Viresh Kumar <[email protected]>
(cherry picked from commit 90e4ed6)
Signed-off-by: Jamie Nguyen <[email protected]>
Acked-by: Brad Figg <[email protected]>
Acked-by: Jacob Martin <[email protected]>
Acked-by: Noah Wager <[email protected]>
Signed-off-by: Brad Figg <[email protected]>
Ignore: yes
Signed-off-by: Jacob Martin <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2086287
Properties: no-test-build
Signed-off-by: Jacob Martin <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2089306

This reverts commit "vfio/pci: Insert full vma on mmap'd MMIO fault".

The original commit changes vfio_pci to pre-fault the entire vma when
handling a fault. For PCIe devices with large BAR regions, this can take a
very long time to complete, causing kernel soft lockup warnings. This is
particularly noticeable when launching a virtual machine with a passthrough
PCIe GPU.

Signed-off-by: Jacob Martin <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2089306

This reverts commit "vfio/pci: Use unmap_mapping_range()".

The original commit rewrote the vfio_pci mmap'd MMIO fault handler to use
the "unmap_mapping_range()" and "vmf_insert_pfn()" functions in place of
vfio_pci tracking its own mapped areas and using "zap_vma_ptes()" and
"io_remap_pfn_range()".

Use of "vmf_insert_pfn()" is significantly slower than the prior
implementation. With large BAR region passthrough PCIe devices, this causes
host soft lockup warnings if the commit "vfio/pci: Insert full vma on
mmap'd MMIO fault" is present, or an extremely slow guest boot if it is
not.

Signed-off-by: Jacob Martin <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2091107

Recent changes to the devlink reload (commit 9b2348e
("devlink: warn about existing entities during reload-reinit"))
force the drivers to destroy devlink ports during reinit.
Adjust the ice driver to this requirement: unregister the netdevice and
destroy the devlink port. ice_init_eth() was removed and all the common
code between probe and reload was moved to ice_load().

During devlink reload we can't take devl_lock (it's already taken)
and in ice_probe() we have to lock it. Use devl_* variant of the API
which does not acquire and release devl_lock. Guard ice_load()
with devl_lock only in case of probe.
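
The locking arrangement can be sketched with a toy lock that only tracks whether it is held. All names below are hypothetical; the sketch shows the general pattern of a shared routine that expects the lock to be held, with only the probe path taking the lock itself.

```c
#include <assert.h>
#include <stdbool.h>

struct dev {
    bool lock_held; /* toy stand-in for the instance lock */
    int  loads;     /* how many times the common load path ran */
};

static void dev_lock(struct dev *d)   { assert(!d->lock_held); d->lock_held = true; }
static void dev_unlock(struct dev *d) { assert(d->lock_held);  d->lock_held = false; }

/* Common code shared by probe and reload: the caller must hold the lock. */
static void dev_load(struct dev *d)
{
    assert(d->lock_held);
    d->loads++;
}

/* Probe path: nobody holds the lock yet, so take it around the load. */
static void dev_probe(struct dev *d)
{
    dev_lock(d);
    dev_load(d);
    dev_unlock(d);
}

/* Reload path: the core calls in with the lock already held. */
static void dev_reload(struct dev *d)
{
    assert(d->lock_held);
    dev_load(d); /* taking the lock again here would deadlock */
}
```

Splitting the code this way keeps one load routine for both paths and makes the locking expectation explicit at each entry point.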

Suggested-by: Jiri Pirko <[email protected]>
Reviewed-by: Przemek Kitszel <[email protected]>
Reviewed-by: Vadim Fedorenko <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Reviewed-by: Brett Creeley <[email protected]>
Signed-off-by: Wojciech Drewek <[email protected]>
Tested-by: Pucha Himasekhar Reddy <[email protected]> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <[email protected]>
(cherry picked from commit 41cc4e5)
Signed-off-by: Jacob Martin <[email protected]>
The HMAT messages printed at boot, beyond being noisy, can also print
details for nodes that are not yet enabled. The primary method to
consume HMAT details is via sysfs, and the sysfs interface gates what is
emitted by whether the node is online or not. Hide the messages by
default by moving them from "info" to "debug" log level.

Otherwise, these prints are just a pretty-print way to dump the ACPI
HMAT table. It has always been the case that post-analysis was required
for these messages to map proximity-domains to Linux NUMA nodes, and as
Priya points out that analysis also needs to consider whether the
proximity domain is marked "enabled" in the SRAT.

Reported-by: Priya Autee <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
Acked-by: Rafael J. Wysocki <[email protected]>
Link: https://patch.msgid.link/170668982094.318782.2963631284830500182.stgit@dwillia2-xfh.jf.intel.com
Signed-off-by: Dave Jiang <[email protected]>
(cherry picked from commit e2b952ffafced49fa6bd5cdc90f472b8bd932b5d cxl-next)
Signed-off-by: Carol L Soto <[email protected]>
@clsotog clsotog force-pushed the clsotog/hmat_printk branch from 2b906f4 to 7e938a5 Compare December 6, 2024 22:11
Collaborator

@khfeng khfeng left a comment

Acked-by: Kai-Heng Feng [email protected]

nvidia-bfigg pushed a commit that referenced this pull request Jan 6, 2025
…nt message

Address a bug in the kernel that triggers a "sleeping function called from
invalid context" warning when /sys/kernel/debug/kmemleak is printed under
specific conditions:
- CONFIG_PREEMPT_RT=y
- Set SELinux as the LSM for the system
- Set kptr_restrict to 1
- kmemleak buffer contains at least one item

BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 136, name: cat
preempt_count: 1, expected: 0
RCU nest depth: 2, expected: 2
6 locks held by cat/136:
 #0: ffff32e64bcbf950 (&p->lock){+.+.}-{3:3}, at: seq_read_iter+0xb8/0xe30
 #1: ffffafe6aaa9dea0 (scan_mutex){+.+.}-{3:3}, at: kmemleak_seq_start+0x34/0x128
 #3: ffff32e6546b1cd0 (&object->lock){....}-{2:2}, at: kmemleak_seq_show+0x3c/0x1e0
 #4: ffffafe6aa8d8560 (rcu_read_lock){....}-{1:2}, at: has_ns_capability_noaudit+0x8/0x1b0
 #5: ffffafe6aabbc0f8 (notif_lock){+.+.}-{2:2}, at: avc_compute_av+0xc4/0x3d0
irq event stamp: 136660
hardirqs last  enabled at (136659): [<ffffafe6a80fd7a0>] _raw_spin_unlock_irqrestore+0xa8/0xd8
hardirqs last disabled at (136660): [<ffffafe6a80fd85c>] _raw_spin_lock_irqsave+0x8c/0xb0
softirqs last  enabled at (0): [<ffffafe6a5d50b28>] copy_process+0x11d8/0x3df8
softirqs last disabled at (0): [<0000000000000000>] 0x0
Preemption disabled at:
[<ffffafe6a6598a4c>] kmemleak_seq_show+0x3c/0x1e0
CPU: 1 UID: 0 PID: 136 Comm: cat Tainted: G            E      6.11.0-rt7+ #34
Tainted: [E]=UNSIGNED_MODULE
Hardware name: linux,dummy-virt (DT)
Call trace:
 dump_backtrace+0xa0/0x128
 show_stack+0x1c/0x30
 dump_stack_lvl+0xe8/0x198
 dump_stack+0x18/0x20
 rt_spin_lock+0x8c/0x1a8
 avc_perm_nonode+0xa0/0x150
 cred_has_capability.isra.0+0x118/0x218
 selinux_capable+0x50/0x80
 security_capable+0x7c/0xd0
 has_ns_capability_noaudit+0x94/0x1b0
 has_capability_noaudit+0x20/0x30
 restricted_pointer+0x21c/0x4b0
 pointer+0x298/0x760
 vsnprintf+0x330/0xf70
 seq_printf+0x178/0x218
 print_unreferenced+0x1a4/0x2d0
 kmemleak_seq_show+0xd0/0x1e0
 seq_read_iter+0x354/0xe30
 seq_read+0x250/0x378
 full_proxy_read+0xd8/0x148
 vfs_read+0x190/0x918
 ksys_read+0xf0/0x1e0
 __arm64_sys_read+0x70/0xa8
 invoke_syscall.constprop.0+0xd4/0x1d8
 el0_svc+0x50/0x158
 el0t_64_sync+0x17c/0x180

%pS and %pK, in the same back trace line, are redundant, and %pS can
defeat the purpose of %pK in certain contexts.

%pS alone already provides the necessary information, and if it cannot
resolve the symbol, it falls back to printing the raw address, voiding
the original intent behind the %pK.

Additionally, %pK requires a privilege check CAP_SYSLOG enforced through
the LSM, which can trigger a "sleeping function called from invalid
context" warning under RT_PREEMPT kernels when the check occurs in an
atomic context. This issue may also affect other LSMs.

This change avoids the unnecessary privilege check and resolves the
sleeping function warning without any loss of information.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 3a6f33d ("mm/kmemleak: use %pK to display kernel pointers in backtrace")
Signed-off-by: Alessandro Carminati <[email protected]>
Acked-by: Sebastian Andrzej Siewior <[email protected]>
Acked-by: Catalin Marinas <[email protected]>
Cc: Clément Léger <[email protected]>
Cc: Alessandro Carminati <[email protected]>
Cc: Eric Chanudet <[email protected]>
Cc: Gabriele Paoloni <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Weißschuh <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>