author    | Jan200101 <sentrycraft123@gmail.com> | 2024-09-26 20:08:54 +0200
committer | Jan200101 <sentrycraft123@gmail.com> | 2024-09-26 20:08:54 +0200
commit    | 28cd700b822627cde036087f4c613a234f0d0858 (patch)
tree      | 9b78517a97e9ff6447151b0a4d54977d80931c83 /SOURCES/0001-acpi-proc-idle-skip-dummy-wait.patch
parent    | c44454c0aa0ae1b0999ef46e18516670536b122a (diff)
download  | kernel-fsync-28cd700b822627cde036087f4c613a234f0d0858.tar.gz, kernel-fsync-28cd700b822627cde036087f4c613a234f0d0858.zip
kernel 6.9.12 cleaning
Diffstat (limited to 'SOURCES/0001-acpi-proc-idle-skip-dummy-wait.patch')
-rw-r--r-- | SOURCES/0001-acpi-proc-idle-skip-dummy-wait.patch | 185 |
1 files changed, 67 insertions, 118 deletions
diff --git a/SOURCES/0001-acpi-proc-idle-skip-dummy-wait.patch b/SOURCES/0001-acpi-proc-idle-skip-dummy-wait.patch
index eb3cf54..ef562c9 100644
--- a/SOURCES/0001-acpi-proc-idle-skip-dummy-wait.patch
+++ b/SOURCES/0001-acpi-proc-idle-skip-dummy-wait.patch
@@ -1,125 +1,74 @@
-Processors based on the Zen microarchitecture support IOPORT based deeper
-C-states. The idle driver reads the acpi_gbl_FADT.xpm_timer_block.address
-in the IOPORT based C-state exit path which is claimed to be a
-"Dummy wait op" and has been around since ACPI introduction to Linux
-dating back to Andy Grover's Mar 14, 2002 posting [1].
-The comment above the dummy operation was elaborated by Andreas Mohr back
-in 2006 in commit b488f02156d3d ("ACPI: restore comment justifying 'extra'
-P_LVLx access") [2] where the commit log claims:
-"this dummy read was about: STPCLK# doesn't get asserted in time on
-(some) chipsets, which is why we need to have a dummy I/O read to delay
-further instruction processing until the CPU is fully stopped."
-
-However, sampling certain workloads with IBS on AMD Zen3 system shows
-that a significant amount of time is spent in the dummy op, which
-incorrectly gets accounted as C-State residency. A large C-State
-residency value can prime the cpuidle governor to recommend a deeper
-C-State during the subsequent idle instances, starting a vicious cycle,
-leading to performance degradation on workloads that rapidly switch
-between busy and idle phases.
-
-One such workload is tbench where a massive performance degradation can
-be observed during certain runs. Following are some statistics gathered
-by running tbench with 128 clients, on a dual socket (2 x 64C/128T) Zen3
-system with the baseline kernel, baseline kernel keeping C2 disabled,
-and baseline kernel with this patch applied keeping C2 enabled:
-
-baseline kernel was tip:sched/core at
-commit f3dd3f674555 ("sched: Remove the limitation of WF_ON_CPU on
-wakelist if wakee cpu is idle")
-
-Kernel         : baseline     baseline + C2 disabled    baseline + patch
-
-Min (MB/s)     : 2215.06      33072.10 (+1393.05%)      33016.10 (+1390.52%)
-Max (MB/s)     : 32938.80     34399.10                  34774.50
-Median (MB/s)  : 32191.80     33476.60                  33805.70
-AMean (MB/s)   : 22448.55     33649.27 (+49.89%)        33865.43 (+50.85%)
-AMean Stddev   : 17526.70     680.14                    880.72
-AMean CoefVar  : 78.07%       2.02%                     2.60%
-
-The data shows there are edge cases that can cause massive regressions
-in case of tbench. Profiling the bad runs with IBS shows a significant
-amount of time being spent in acpi_idle_do_entry method:
-
-Overhead  Command     Shared Object      Symbol
-  74.76%  swapper     [kernel.kallsyms]  [k] acpi_idle_do_entry
-   0.71%  tbench      [kernel.kallsyms]  [k] update_sd_lb_stats.constprop.0
-   0.69%  tbench_srv  [kernel.kallsyms]  [k] update_sd_lb_stats.constprop.0
-   0.49%  swapper     [kernel.kallsyms]  [k] psi_group_change
-  ...
-
-Annotation of acpi_idle_do_entry method reveals almost all the time in
-acpi_idle_do_entry is spent on the port I/O in wait_for_freeze():
-
-  0.14 │    in (%dx),%al     # <------ First "in" corresponding to inb(cx->address)
-  0.51 │    mov 0x144d64d(%rip),%rax
-  0.00 │    test $0x80000000,%eax
-       │  ↓ jne 62           # <------ Skip if running in guest
-  0.00 │    mov 0x19800c3(%rip),%rdx
- 99.33 │    in (%dx),%eax    # <------ Second "in" corresponding to inl(acpi_gbl_FADT.xpm_timer_block.address)
-  0.00 │62: mov -0x8(%rbp),%r12
-  0.00 │    leave
-  0.00 │  ← ret
-
-This overhead is reflected in the C2 residency on the test system where
-C2 is an IOPORT based C-State. The total C-state residency reported by
-"cpupower idle-info" on CPU0 for good and bad case over the 80s tbench
-run is as follows (all numbers are in microseconds):
-
-                      Good Run       Bad Run
-                      (Baseline)
-
-POLL:                 43338          6231      (-85.62%)
-C1 (MWAIT Based):     23576156       363861    (-98.45%)
-C2 (IOPORT Based):    10781218       77027280  (+614.45%)
-
-The larger residency value in bad case leads to the system recommending
-C2 state again for subsequent idle instances. The pattern lasts till the
-end of the tbench run. Following is the breakdown of "entry_method"
-passed to acpi_idle_do_entry during good run and bad run:
-
-                                                               Good Run    Bad Run
-                                                               (Baseline)
-
-Number of times acpi_idle_do_entry was called:                 6149573     6149050  (-0.01%)
- |-> Number of times entry_method was "ACPI_CSTATE_FFH":       6141494     88144    (-98.56%)
- |-> Number of times entry_method was "ACPI_CSTATE_HALT":      0           0        (+0.00%)
- |-> Number of times entry_method was "ACPI_CSTATE_SYSTEMIO":  8079        6060906  (+74920.49%)
-
-For processors based on the Zen microarchitecture, this dummy wait op is
-unnecessary and can be skipped when choosing IOPORT based C-States to
-avoid polluting the C-state residency information.
-
-Link: https://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux-fullhistory.git/commit/?id=972c16130d9dc182cedcdd408408d9eacc7d6a2d [1]
-Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b488f02156d3deb08f5ad7816d565c370a8cc6f1 [2]
-
-Suggested-by: Calvin Ong <calvin.ong@amd.com>
-Cc: stable@vger.kernel.org
-Cc: regressions@lists.linux.dev
-Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
+From e400ad8b7e6a1b9102123c6240289a811501f7d9 Mon Sep 17 00:00:00 2001
+From: Dave Hansen <dave.hansen@intel.com>
+Date: Thu, 22 Sep 2022 11:47:45 -0700
+Subject: [PATCH] ACPI: processor idle: Practically limit "Dummy wait"
+ workaround to old Intel systems
+
+Old, circa 2002 chipsets have a bug: they don't go idle when they are
+supposed to. So, a workaround was added to slow the CPU down and
+ensure that the CPU waits a bit for the chipset to actually go idle.
+This workaround is ancient and has been in place in some form since
+the original kernel ACPI implementation.
+
+But, this workaround is very painful on modern systems. The "inl()"
+can take thousands of cycles (see Link: for some more detailed
+numbers and some fun kernel archaeology).
+
+First and foremost, modern systems should not be using this code.
+Typical Intel systems have not used it in over a decade because it is
+horribly inferior to MWAIT-based idle.
+
+Despite this, people do seem to be tripping over this workaround on
+AMD system today.
+
+Limit the "dummy wait" workaround to Intel systems. Keep Modern AMD
+systems from tripping over the workaround. Remotely modern Intel
+systems use intel_idle instead of this code and will, in practice,
+remain unaffected by the dummy wait.
+
+Reported-by: K Prateek Nayak <kprateek.nayak@amd.com>
+Suggested-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
+Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
+Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
+Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
+Link: https://lore.kernel.org/all/20220921063638.2489-1-kprateek.nayak@amd.com/
+Link: https://lkml.kernel.org/r/20220922184745.3252932-1-dave.hansen@intel.com
---
- drivers/acpi/processor_idle.c | 7 +++++--
- 1 file changed, 5 insertions(+), 2 deletions(-)
+ drivers/acpi/processor_idle.c | 23 ++++++++++++++++++++---
+ 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
-index 16a1663d02d4..18850aa2b79b 100644
+index 16a1663d02d460..9f40917c49efbf 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
-@@ -529,9 +529,11 @@ static __cpuidle void io_idle(unsigned long addr)
- inb(addr);
-
- #ifdef CONFIG_X86
-- /* No delay is needed if we are in guest */
-- if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
-- return;
+@@ -531,10 +531,27 @@ static void wait_for_freeze(void)
+ /* No delay is needed if we are in guest */
+ if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
+ return;
+ /*
-+ * No delay is needed if we are in guest or on a processor
-+ * based on the Zen microarchitecture.
++ * Modern (>=Nehalem) Intel systems use ACPI via intel_idle,
++ * not this code. Assume that any Intel systems using this
++ * are ancient and may need the dummy wait. This also assumes
++ * that the motivating chipset issue was Intel-only.
+ */
-+ if (boot_cpu_has(X86_FEATURE_HYPERVISOR) || boot_cpu_has(X86_FEATURE_ZEN))
++ if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
++ return;
+ #endif
-- /* Dummy wait op - must do something useless after P_LVL2 read
-- because chipsets cannot guarantee that STPCLK# signal
-- gets asserted in time to freeze execution properly. */
++ /*
++ * Dummy wait op - must do something useless after P_LVL2 read
++ * because chipsets cannot guarantee that STPCLK# signal gets
++ * asserted in time to freeze execution properly
++ *
++ * This workaround has been in place since the original ACPI
++ * implementation was merged, circa 2002.
++ *
++ * If a profile is pointing to this instruction, please first
++ * consider moving your system to a more modern idle
++ * mechanism.
++ */
+ inl(acpi_gbl_FADT.xpm_timer_block.address);
+ }
+ 
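For orientation outside the diff itself: the two versions of this patch file gate the same "dummy wait" inl() read differently. The dropped variant skipped it on AMD Zen processors via boot_cpu_has(X86_FEATURE_ZEN); the variant now carried skips it on every non-Intel CPU via boot_cpu_data.x86_vendor != X86_VENDOR_INTEL, matching upstream commit e400ad8b7e6a. The following is a minimal, self-contained user-space sketch of that post-change control flow, not kernel code: the stub helpers, the vendor check, and the port numbers (inb_stub, inl_stub, cpu_vendor_is_intel, 0x414, 0x408) are illustrative assumptions, not the kernel's own API.

```c
#include <stdbool.h>
#include <stdio.h>

/* Stand-ins for the kernel's port I/O helpers; no real I/O happens here. */
static unsigned char inb_stub(unsigned long addr) { (void)addr; return 0; }
static unsigned int  inl_stub(unsigned long addr) { (void)addr; return 0; }

/* Stand-ins for the kernel's CPU checks (illustrative assumptions). */
static bool running_in_guest(void)    { return false; }
static bool cpu_vendor_is_intel(void) { return false; } /* e.g. an AMD Zen box */

/*
 * Models the post-change flow: the "dummy wait" read of the PM timer port
 * is only reached on bare-metal Intel systems; guests and non-Intel CPUs
 * return right after the port read that actually enters the C-state.
 */
static void io_idle_sketch(unsigned long cstate_port, unsigned long pm_timer_port)
{
    inb_stub(cstate_port);        /* IO port based C-state entry */

    if (running_in_guest())
        return;                   /* no chipset to wait for */
    if (!cpu_vendor_is_intel())
        return;                   /* workaround targeted old Intel chipsets only */

    /* Dummy wait: delay until STPCLK# is asserted on ancient chipsets. */
    inl_stub(pm_timer_port);
}

int main(void)
{
    io_idle_sketch(0x414, 0x408); /* illustrative port addresses */
    puts("idle-exit path simulated");
    return 0;
}
```

On an AMD machine this path returns before the second port read, which is why the dummy wait stops being charged to IOPORT C-state residency in either variant of the patch.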