Diffstat (limited to 'SOURCES/futex2.patch')
-rw-r--r-- | SOURCES/futex2.patch | 4024 |
1 files changed, 2745 insertions, 1279 deletions
diff --git a/SOURCES/futex2.patch b/SOURCES/futex2.patch
index 1bc4486..3604062 100644
--- a/SOURCES/futex2.patch
+++ b/SOURCES/futex2.patch
@@ -1,37 +1,314 @@
-From 14a106cc87e6d03169ac8c7ea030e3d7fac2dfe4 Mon Sep 17 00:00:00 2001
+From a64bf661d4fc6dbfde640bf002eae2e22884a419 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Andr=C3=A9=20Almeida?= <andrealmeid@collabora.com>
-Date: Wed, 5 Aug 2020 12:40:26 -0300
-Subject: [PATCH 1/9] futex2: Add new futex interface
+Date: Fri, 5 Feb 2021 10:34:00 -0300
+Subject: [PATCH 01/13] futex2: Implement wait and wake functions
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
-Initial implementation for futex2. Support only private u32 wait/wake, with
-timeout (monotonic and realtime clocks).
+Create a new set of futex syscalls known as futex2. This new interface
+is aimed to implement a more maintainable code, while removing obsolete
+features and expanding it with new functionalities.
+
+Implements wait and wake semantics for futexes, along with the base
+infrastructure for future operations. The whole wait path is designed to
+be used by N waiters, thus making easier to implement vectorized wait.
+
+* Syscalls implemented by this patch:
+
+- futex_wait(void *uaddr, unsigned int val, unsigned int flags,
+ struct timespec *timo)
+
+ The user thread is put to sleep, waiting for a futex_wake() at uaddr,
+ if the value at *uaddr is the same as val (otherwise, the syscall
+ returns immediately with -EAGAIN). timo is an optional timeout value
+ for the operation.
+
+ Return 0 on success, error code otherwise.
+
+ - futex_wake(void *uaddr, unsigned long nr_wake, unsigned int flags)
+
+ Wake `nr_wake` threads waiting at uaddr.
+
+ Return the number of woken threads on success, error code otherwise.
+
+** The `flag` argument
+
+ The flag is used to specify the size of the futex word
+ (FUTEX_[8, 16, 32]). It's mandatory to define one, since there's no
+ default size.
+
+ By default, the timeout uses a monotonic clock, but can be used as a
+ realtime one by using the FUTEX_REALTIME_CLOCK flag.
+
+ By default, futexes are of the private type, that means that this user
+ address will be accessed by threads that shares the same memory region.
+ This allows for some internal optimizations, so they are faster.
+ However, if the address needs to be shared with different processes
+ (like using `mmap()` or `shm()`), they need to be defined as shared and
+ the flag FUTEX_SHARED_FLAG is used to set that.
+
+ By default, the operation has no NUMA-awareness, meaning that the user
+ can't choose the memory node where the kernel side futex data will be
+ stored. The user can choose the node where it wants to operate by
+ setting the FUTEX_NUMA_FLAG and using the following structure (where X
+ can be 8, 16, or 32):
+
+ struct futexX_numa {
+ __uX value;
+ __sX hint;
+ };
+
+ This structure should be passed at the `void *uaddr` of futex
+ functions. The address of the structure will be used to be waited/waken
+ on, and the `value` will be compared to `val` as usual. The `hint`
+ member is used to defined which node the futex will use. When waiting,
+ the futex will be registered on a kernel-side table stored on that
+ node; when waking, the futex will be searched for on that given table.
+ That means that there's no redundancy between tables, and the wrong
+ `hint` value will led to undesired behavior. Userspace is responsible
+ for dealing with node migrations issues that may occur. `hint` can
+ range from [0, MAX_NUMA_NODES], for specifying a node, or -1, to use
+ the same node the current process is using.
+
+ When not using FUTEX_NUMA_FLAG on a NUMA system, the futex will be
+ stored on a global table on some node, defined at compilation time.
+
+** The `timo` argument
+
+As per the Y2038 work done in the kernel, new interfaces shouldn't add
+timeout options known to be buggy.
Given that, `timo` should be a 64bit
+timeout at all platforms, using an absolute timeout value.
Signed-off-by: André Almeida <andrealmeid@collabora.com>
+---
+
+[RFC Add futex2 syscall 0/0]
+
+Hi,
+
+This patch series introduces the futex2 syscalls.
+
+* What happened to the current futex()?
+
+For some years now, developers have been trying to add new features to
+futex, but maintainers have been reluctant to accept them, given the
+multiplexed interface full of legacy features and tricky to do big
+changes. Some problems that people tried to address with patchsets are:
+NUMA-awareness[0], smaller sized futexes[1], wait on multiple futexes[2].
+NUMA, for instance, just doesn't fit the current API in a reasonable
+way. Considering that, it's not possible to merge new features into the
+current futex.
+
+ ** The NUMA problem
+
+ At the current implementation, all futex kernel side infrastructure is
+ stored on a single node. Given that, all futex() calls issued by
+ processors that aren't located on that node will have a memory access
+ penalty when doing it.
+
+ ** The 32bit sized futex problem
+
+ Embedded systems or anything with memory constrains would benefit of
+ using smaller sizes for the futex userspace integer. Also, a mutex
+ implementation can be done using just three values, so 8 bits is enough
+ for various scenarios.
+
+ ** The wait on multiple problem
+
+ The use case lies in the Wine implementation of the Windows NT interface
+ WaitMultipleObjects. This Windows API function allows a thread to sleep
+ waiting on the first of a set of event sources (mutexes, timers, signal,
+ console input, etc) to signal. Considering this is a primitive
+ synchronization operation for Windows applications, being able to quickly
+ signal events on the producer side, and quickly go to sleep on the
+ consumer side is essential for good performance of those running
+ over Wine.
+
+[0] https://lore.kernel.org/lkml/20160505204230.932454245@linutronix.de/
+[1] https://lore.kernel.org/lkml/20191221155659.3159-2-malteskarupke@web.de/
+[2] https://lore.kernel.org/lkml/20200213214525.183689-1-andrealmeid@collabora.com/
+
+* The solution
+
+As proposed by Peter Zijlstra and Florian Weimer[3], a new interface
+is required to solve this, which must be designed with those features in
+mind. futex2() is that interface. As opposed to the current multiplexed
+interface, the new one should have one syscall per operation. This will
+allow the maintainability of the API if it gets extended, and will help
+users with type checking of arguments.
+
+In particular, the new interface is extended to support the ability to
+wait on any of a list of futexes at a time, which could be seen as a
+vectored extension of the FUTEX_WAIT semantics.
+
+[3] https://lore.kernel.org/lkml/20200303120050.GC2596@hirez.programming.kicks-ass.net/
+
+* The interface
+
+The new interface can be seen in details in the following patches, but
+this is a high level summary of what the interface can do:
+
+ - Supports wake/wait semantics, as in futex()
+ - Supports requeue operations, similarly as FUTEX_CMP_REQUEUE, but with
+ individual flags for each address
+ - Supports waiting for a vector of futexes, using a new syscall named
+ futex_waitv()
+ - Supports variable sized futexes (8bits, 16bits and 32bits)
+ - Supports NUMA-awareness operations, where the user can specify on
+ which memory node would like to operate
+
+* Implementation
+
+The internal implementation follows a similar design to the original futex.
+Given that we want to replicate the same external behavior of current
+futex, this should be somewhat expected. For some functions, like the
+init and the code to get a shared key, I literally copied code and
+comments from kernel/futex.c.
I decided to do so instead of exposing the
+original function as a public function since in that way we can freely
+modify our implementation if required, without any impact on old futex.
+Also, the comments precisely describes the details and corner cases of
+the implementation.
+
+Each patch contains a brief description of implementation, but patch 6
+"docs: locking: futex2: Add documentation" adds a more complete document
+about it.
+
+* The patchset
+
+This patchset can be also found at my git tree:
+
+https://gitlab.collabora.com/tonyk/linux/-/tree/futex2
+
+ - Patch 1: Implements wait/wake, and the basics foundations of futex2
+
+ - Patches 2-4: Implement the remaining features (shared, waitv, requeue).
+
+ - Patch 5: Adds the x86_x32 ABI handling. I kept it in a separated
+ patch since I'm not sure if x86_x32 is still a thing, or if it should
+ return -ENOSYS.
+
+ - Patch 6: Add a documentation file which details the interface and
+ the internal implementation.
+
+ - Patches 7-13: Selftests for all operations along with perf
+ support for futex2.
+
+ - Patch 14: While working on porting glibc for futex2, I found out
+ that there's a futex_wake() call at the user thread exit path, if
+ that thread was created with clone(..., CLONE_CHILD_SETTID, ...). In
+ order to make pthreads work with futex2, it was required to add
+ this patch. Note that this is more a proof-of-concept of what we
+ will need to do in future, rather than part of the interface and
+ shouldn't be merged as it is.
+
+* Testing:
+
+This patchset provides selftests for each operation and their flags.
+Along with that, the following work was done:
+
+ ** Stability
+
+ To stress the interface in "real world scenarios":
+
+ - glibc[4]: nptl's low level locking was modified to use futex2 API
+ (except for robust and PI things). All relevant nptl/ tests passed.
+
+ - Wine[5]: Proton/Wine was modified in order to use futex2() for the
+ emulation of Windows NT sync mechanisms based on futex, called "fsync".
+ Triple-A games with huge CPU's loads and tons of parallel jobs worked
+ as expected when compared with the previous FUTEX_WAIT_MULTIPLE
+ implementation at futex(). Some games issue 42k futex2() calls
+ per second.
+
+ - Full GNU/Linux distro: I installed the modified glibc in my host
+ machine, so all pthread's programs would use futex2(). After tweaking
+ systemd[6] to allow futex2() calls at seccomp, everything worked as
+ expected (web browsers do some syscall sandboxing and need some
+ configuration as well).
+
+ - perf: The perf benchmarks tests can also be used to stress the
+ interface, and they can be found in this patchset.
+
+ ** Performance
+
+ - For comparing futex() and futex2() performance, I used the artificial
+ benchmarks implemented at perf (wake, wake-parallel, hash and
+ requeue). The setup was 200 runs for each test and using 8, 80, 800,
+ 8000 for the number of threads, Note that for this test, I'm not using
+ patch 14 ("kernel: Enable waitpid() for futex2") , for reasons explained
+ at "The patchset" section.
+
+ - For the first three ones, I measured an average of 4% gain in
+ performance. This is not a big step, but it shows that the new
+ interface is at least comparable in performance with the current one.
+
+ - For requeue, I measured an average of 21% decrease in performance
+ compared to the original futex implementation. This is expected given
+ the new design with individual flags. The performance trade-offs are
+ explained at patch 4 ("futex2: Implement requeue operation").
+
+[4] https://gitlab.collabora.com/tonyk/glibc/-/tree/futex2
+[5] https://gitlab.collabora.com/tonyk/wine/-/tree/proton_5.13
+[6] https://gitlab.collabora.com/tonyk/systemd
+
+* FAQ
+
+ ** "Where's the code for NUMA and FUTEX_8/16?"
+
+ The current code is already complex enough to take some time for
+ review, so I believe it's better to split that work out to a future
+ iteration of this patchset. Besides that, this RFC is the core part of the
+ infrastructure, and the following features will not pose big design
+ changes to it, the work will be more about wiring up the flags and
+ modifying some functions.
+
+ ** "And what's about FUTEX_64?"
+
+ By supporting 64 bit futexes, the kernel structure for futex would
+ need to have a 64 bit field for the value, and that could defeat one of
+ the purposes of having different sized futexes in the first place:
+ supporting smaller ones to decrease memory usage. This might be
+ something that could be disabled for 32bit archs (and even for
+ CONFIG_BASE_SMALL).
+
+ Which use case would benefit for FUTEX_64? Does it worth the trade-offs?
+
+ ** "Where's the PI/robust stuff?"
+
+ As said by Peter Zijlstra at [3], all those new features are related to
+ the "simple" futex interface, that doesn't use PI or robust. Do we want
+ to have this complexity at futex2() and if so, should it be part of
+ this patchset or can it be future work?
+
+Thanks,
+ André
+
Signed-off-by: Jan200101 <sentrycraft123@gmail.com>
---
MAINTAINERS | 2 +-
+ arch/arm/tools/syscall.tbl | 2 +
+ arch/arm64/include/asm/unistd.h | 2 +-
+ arch/arm64/include/asm/unistd32.h | 4 +
arch/x86/entry/syscalls/syscall_32.tbl | 2 +
arch/x86/entry/syscalls/syscall_64.tbl | 2 +
include/linux/syscalls.h | 7 +
include/uapi/asm-generic/unistd.h | 8 +-
- include/uapi/linux/futex.h | 40 ++
+ include/uapi/linux/futex.h | 56 ++
init/Kconfig | 7 +
kernel/Makefile | 1 +
- kernel/futex2.c | 484 ++++++++++++++++++
+ kernel/futex2.c | 625 ++++++++++++++++++
kernel/sys_ni.c | 4 +
- tools/include/uapi/asm-generic/unistd.h | 9 +-
+ tools/include/uapi/asm-generic/unistd.h | 8 +-
.../arch/x86/entry/syscalls/syscall_64.tbl | 2 +
- 12 files changed, 565 insertions(+), 3 deletions(-)
+ 15 files changed, 728 insertions(+), 4 deletions(-)
create mode 100644 kernel/futex2.c
diff --git a/MAINTAINERS b/MAINTAINERS
-index 2daa6ee67..855d38511 100644
+index bfc1b86e3..86ed91b72 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
-@@ -7259,7 +7259,7 @@ F: Documentation/locking/*futex*
+@@ -7332,7 +7332,7 @@ F: Documentation/locking/*futex*
F: include/asm-generic/futex.h
F: include/linux/futex.h
F: include/uapi/linux/futex.h
@@ -40,72 +317,110 @@ index 2daa6ee67..855d38511 100644
F: tools/perf/bench/futex*
F: tools/testing/selftests/futex/
+diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
+index 20e1170e2..4eef220cd 100644
+--- a/arch/arm/tools/syscall.tbl
++++ b/arch/arm/tools/syscall.tbl
+@@ -455,3 +455,5 @@
+ 439 common faccessat2 sys_faccessat2
+ 440 common process_madvise sys_process_madvise
+ 441 common epoll_pwait2 sys_epoll_pwait2
++442 common futex_wait sys_futex_wait
++443 common futex_wake sys_futex_wake
+diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
+index 86a9d7b3e..d1f7d35f9 100644
+--- a/arch/arm64/include/asm/unistd.h
++++ b/arch/arm64/include/asm/unistd.h
+@@ -38,7 +38,7 @@
+ #define __ARM_NR_compat_set_tls
(__ARM_NR_COMPAT_BASE + 5)
+ #define __ARM_NR_COMPAT_END (__ARM_NR_COMPAT_BASE + 0x800)
+
+-#define __NR_compat_syscalls 442
++#define __NR_compat_syscalls 444
+ #endif
+
+ #define __ARCH_WANT_SYS_CLONE
+diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
+index cccfbbefb..2db1529b2 100644
+--- a/arch/arm64/include/asm/unistd32.h
++++ b/arch/arm64/include/asm/unistd32.h
+@@ -891,6 +891,10 @@ __SYSCALL(__NR_faccessat2, sys_faccessat2)
+ __SYSCALL(__NR_process_madvise, sys_process_madvise)
+ #define __NR_epoll_pwait2 441
+ __SYSCALL(__NR_epoll_pwait2, compat_sys_epoll_pwait2)
++#define __NR_futex_wait 442
++__SYSCALL(__NR_futex_wait, sys_futex_wait)
++#define __NR_futex_wake 443
++__SYSCALL(__NR_futex_wake, sys_futex_wake)
+
+ /*
+ * Please add new compat syscalls above this comment and update
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
-index 0d0667a9f..83a75ff39 100644
+index 874aeacde..ece90c8d9 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
-@@ -445,3 +445,5 @@
- 438 i386 pidfd_getfd sys_pidfd_getfd
+@@ -446,3 +446,5 @@
439 i386 faccessat2 sys_faccessat2
440 i386 process_madvise sys_process_madvise
-+441 i386 futex_wait sys_futex_wait
-+442 i386 futex_wake sys_futex_wake
+ 441 i386 epoll_pwait2 sys_epoll_pwait2 compat_sys_epoll_pwait2
++442 i386 futex_wait sys_futex_wait
++443 i386 futex_wake sys_futex_wake
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
-index 379819244..6658fd63c 100644
+index 78672124d..72fb65ef9 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
-@@ -362,6 +362,8 @@
- 438 common pidfd_getfd sys_pidfd_getfd
+@@ -363,6 +363,8 @@
439 common faccessat2 sys_faccessat2
440 common process_madvise sys_process_madvise
-+441 common futex_wait sys_futex_wait
-+442 common futex_wake sys_futex_wake
+ 441 common epoll_pwait2
sys_epoll_pwait2
++442 common futex_wait sys_futex_wait
++443 common futex_wake sys_futex_wake
#
# Due to a historical design error, certain syscalls are numbered differently
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
-index 37bea07c1..b6b77cf2b 100644
+index 7688bc983..bf146c2b0 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
-@@ -589,6 +589,13 @@ asmlinkage long sys_get_robust_list(int pid,
+@@ -618,6 +618,13 @@ asmlinkage long sys_get_robust_list(int pid,
asmlinkage long sys_set_robust_list(struct robust_list_head __user *head,
size_t len);
+/* kernel/futex2.c */
-+asmlinkage long sys_futex_wait(void __user *uaddr, unsigned long val,
-+ unsigned long flags,
++asmlinkage long sys_futex_wait(void __user *uaddr, unsigned int val,
++ unsigned int flags,
+ struct __kernel_timespec __user __user *timo);
-+asmlinkage long sys_futex_wake(void __user *uaddr, unsigned long nr_wake,
-+ unsigned long flags);
++asmlinkage long sys_futex_wake(void __user *uaddr, unsigned int nr_wake,
++ unsigned int flags);
+
/* kernel/hrtimer.c */
asmlinkage long sys_nanosleep(struct __kernel_timespec __user *rqtp,
struct __kernel_timespec __user *rmtp);
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
-index 205631898..ae47d6a9e 100644
+index 728752917..57e19200f 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
-@@ -860,8 +860,14 @@ __SYSCALL(__NR_faccessat2, sys_faccessat2)
- #define __NR_process_madvise 440
- __SYSCALL(__NR_process_madvise, sys_process_madvise)
+@@ -862,8 +862,14 @@ __SYSCALL(__NR_process_madvise, sys_process_madvise)
+ #define __NR_epoll_pwait2 441
+ __SC_COMP(__NR_epoll_pwait2, sys_epoll_pwait2, compat_sys_epoll_pwait2)
-+#define __NR_futex_wait 441
++#define __NR_futex_wait 442
+__SYSCALL(__NR_futex_wait, sys_futex_wait)
+
-+#define __NR_futex_wake 442
++#define __NR_futex_wake 443
+__SYSCALL(__NR_futex_wake, sys_futex_wake)
+
#undef __NR_syscalls
--#define __NR_syscalls 441
-+#define __NR_syscalls 443
+-#define __NR_syscalls 442
++#define __NR_syscalls 444
/*
* 32 bit systems traditionally used different
diff --git a/include/uapi/linux/futex.h b/include/uapi/linux/futex.h
-index a89eb0acc..35a5bf1cd 100644
+index a89eb0acc..9fbdaaf4f 100644
--- a/include/uapi/linux/futex.h
+++ b/include/uapi/linux/futex.h
-@@ -41,6 +41,46 @@
+@@ -41,6 +41,62 @@
#define FUTEX_CMP_REQUEUE_PI_PRIVATE (FUTEX_CMP_REQUEUE_PI | \
FUTEX_PRIVATE_FLAG)
@@ -120,7 +435,7 @@ index a89eb0acc..35a5bf1cd 100644
+
+#define FUTEX_NUMA_FLAG 16
+
-+/*
++/**
+ * struct futexXX_numa - struct for NUMA-aware futex operation
+ * @value: futex value
+ * @hint: node id to operate
@@ -128,35 +443,51 @@ index a89eb0acc..35a5bf1cd 100644
+
+struct futex8_numa {
+ __u8 value;
-+ __u8 hint;
++ __s8 hint;
+};
+
+struct futex16_numa {
+ __u16 value;
-+ __u16 hint;
++ __s16 hint;
+};
+
+struct futex32_numa {
+ __u32 value;
-+ __u32 hint;
++ __s32 hint;
+};
+
+#define FUTEX_WAITV_MAX 128
+
++/**
++ * struct futex_waitv - A waiter for vectorized wait
++ * @uaddr: User address to wait on
++ * @val: Expected value at uaddr
++ * @flags: Flags for this waiter
++ */
+struct futex_waitv {
+ void *uaddr;
+ unsigned int val;
+ unsigned int flags;
+};
+
++/**
++ * struct futex_requeue - Define an address and its flags for requeue operation
++ * @uaddr: User address of one of the requeue arguments
++ * @flags: Flags for this address
++ */
++struct futex_requeue {
++ void *uaddr;
++ unsigned int flags;
++};
++
/*
* Support for robust futexes: the kernel cleans up held futexes at
* thread exit time.
diff --git a/init/Kconfig b/init/Kconfig
-index 02d13ae27..1264687ea 100644
+index 29ad68325..c3e62e1b1 100644
--- a/init/Kconfig
+++ b/init/Kconfig
-@@ -1522,6 +1522,13 @@ config FUTEX
+@@ -1531,6 +1531,13 @@ config FUTEX
support for "fast userspace mutexes". The resulting kernel may not
run glibc-based applications correctly.
@@ -165,16 +496,16 @@ index 02d13ae27..1264687ea 100644
+ depends on FUTEX
+ default y
+ help
-+ Experimental support for futex2 interface.
++ Support for futex2 interface.
+
config FUTEX_PI
bool
depends on FUTEX && RT_MUTEXES
diff --git a/kernel/Makefile b/kernel/Makefile
-index af601b9bd..bb7f33986 100644
+index aa7368c7e..afbe15e51 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
-@@ -54,6 +54,7 @@ obj-$(CONFIG_PROFILING) += profile.o
+@@ -57,6 +57,7 @@ obj-$(CONFIG_PROFILING) += profile.o
obj-$(CONFIG_STACKTRACE) += stacktrace.o
obj-y += time/
obj-$(CONFIG_FUTEX) += futex.o
@@ -184,30 +515,49 @@ index af601b9bd..bb7f33986 100644
ifneq ($(CONFIG_SMP),y)
diff --git a/kernel/futex2.c b/kernel/futex2.c
new file mode 100644
-index 000000000..107b80a46
+index 000000000..802578ad6
--- /dev/null
+++ b/kernel/futex2.c
-@@ -0,0 +1,484 @@
+@@ -0,0 +1,625 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * futex2 system call interface by André Almeida <andrealmeid@collabora.com>
+ *
-+ * Copyright 2020 Collabora Ltd.
++ * Copyright 2021 Collabora Ltd.
++ *
++ * Based on original futex implementation by:
++ * (C) 2002 Rusty Russell, IBM
++ * (C) 2003, 2006 Ingo Molnar, Red Hat Inc.
++ * (C) 2003, 2004 Jamie Lokier
++ * (C) 2006 Thomas Gleixner, Timesys Corp.
++ * (C) 2007 Eric Dumazet
++ * (C) 2009 Darren Hart, IBM
+ */
+
+#include <linux/freezer.h>
+#include <linux/jhash.h>
++#include <linux/memblock.h>
+#include <linux/sched/wake_q.h>
+#include <linux/spinlock.h>
+#include <linux/syscalls.h>
-+#include <linux/memblock.h>
+#include <uapi/linux/futex.h>
+
+/**
++ * struct futex_key - Components to build unique key for a futex
++ * @pointer: Pointer to current->mm
++ * @index: Start address of the page containing futex
++ * @offset: Address offset of uaddr in a page
++ */
++struct futex_key {
++ u64 pointer;
++ unsigned long index;
++ unsigned long offset;
++};
++
++/**
+ * struct futex_waiter - List entry for a waiter
-+ * @key.address: Memory address of userspace futex
-+ * @key.mm: Pointer to memory management struct of this process
-+ * @key: Stores information that uniquely identify a futex
++ * @uaddr: Virtual address of userspace futex
++ * @key: Information that uniquely identify a futex
+ * @list: List node struct
+ * @val: Expected value for this waiter
+ * @flags: Flags
@@ -215,10 +565,8 @@ index 000000000..107b80a46
+ * @index: Index of waiter in futexv list
+ */
+struct futex_waiter {
-+ struct futex_key {
-+ uintptr_t address;
-+ struct mm_struct *mm;
-+ } key;
++ uintptr_t uaddr;
++ struct futex_key key;
+ struct list_head list;
+ unsigned int val;
+ unsigned int flags;
@@ -227,6 +575,18 @@ index 000000000..107b80a46
+};
+
+/**
++ * struct futexv_head - List of futexes to be waited
++ * @task: Task to be awaken
++ * @hint: Was someone on this list awakened?
++ * @objects: List of futexes
++ */
++struct futexv_head {
++ struct task_struct *task;
++ bool hint;
++ struct futex_waiter objects[0];
++};
++
++/**
+ * struct futex_bucket - A bucket of futex's hash table
+ * @waiters: Number of waiters in the bucket
+ * @lock: Bucket lock
@@ -238,30 +598,28 @@ index 000000000..107b80a46
+ struct list_head list;
+};
+
-+struct futexv {
-+ struct task_struct *task;
-+ int hint;
-+ struct futex_waiter objects[0];
-+};
-+
++/**
++ * struct futex_single_waiter - Wrapper for a futexv_head of one element
++ * @futexv: Single futexv element
++ * @waiter: Single waiter element
++ */
+struct futex_single_waiter {
-+ struct futexv parent;
++ struct futexv_head futexv;
+ struct futex_waiter waiter;
+} __packed;
+
-+struct futex_bucket *futex_table;
-+
-+/* mask for futex2 flag operations */
++/* Mask for futex2 flag operations */
+#define FUTEX2_MASK (FUTEX_SIZE_MASK | FUTEX_SHARED_FLAG | \
+ FUTEX_CLOCK_REALTIME)
+
-+// mask for sys_futex_waitv
++/* Mask for sys_futex_waitv flag */
+#define FUTEXV_MASK (FUTEX_CLOCK_REALTIME)
+
-+// mask for each futex in futex_waitv list
++/* Mask for each futex in futex_waitv list */
+#define FUTEXV_WAITER_MASK (FUTEX_SIZE_MASK | FUTEX_SHARED_FLAG)
+
-+int futex2_hashsize;
++struct futex_bucket *futex_table;
++unsigned int futex2_hashsize;
+
+/*
+ * Reflects a new waiter being added to the waitqueue.
@@ -271,7 +629,8 @@ index 000000000..107b80a46
+#ifdef CONFIG_SMP
+ atomic_inc(&bucket->waiters);
+ /*
-+ * Full barrier (A), see the ordering comment above.
++ * Issue a barrier after adding so futex_wake() will see that the
++ * value had increased
+ */
+ smp_mb__after_atomic();
+#endif
@@ -295,7 +654,8 @@ index 000000000..107b80a46
+{
+#ifdef CONFIG_SMP
+ /*
-+ * Full barrier (B), see the ordering comment above.
++ * Issue a barrier before reading so we get an updated value from
++ * futex_wait()
+ */
+ smp_mb();
+ return atomic_read(&bucket->waiters);
@@ -315,7 +675,7 @@ index 000000000..107b80a46
+static struct futex_bucket *futex_get_bucket(void __user *uaddr,
+ struct futex_key *key)
+{
-+ uintptr_t address = (uintptr_t) uaddr;
++ uintptr_t address = (uintptr_t)uaddr;
+ u32 hash_key;
+
+ /* Checking if uaddr is valid and accessible */
@@ -324,11 +684,13 @@ index 000000000..107b80a46
+ if (unlikely(!access_ok(address, sizeof(u32))))
+ return ERR_PTR(-EFAULT);
+
-+ key->address = address;
-+ key->mm = current->mm;
++ key->offset = address % PAGE_SIZE;
++ address -= key->offset;
++ key->pointer = (u64)address;
++ key->index = (unsigned long)current->mm;
+
+ /* Generate hash key for this futex using uaddr and current->mm */
-+ hash_key = jhash2((u32 *) key, sizeof(*key) / sizeof(u32), 0);
++ hash_key = jhash2((u32 *)key, sizeof(*key) / sizeof(u32), 0);
+
+ /* Since HASH_SIZE is 2^n, subtracting 1 makes a perfect bit mask */
+ return &futex_table[hash_key & (futex2_hashsize - 1)];
@@ -339,9 +701,9 @@ index 000000000..107b80a46
+ * @uval: variable to store the value
+ * @uaddr: userspace address
+ *
-+ * Check the comment at futex_get_user_val for more information.
++ * Check the comment at futex_enqueue() for more information.
+ */
-+static int futex_get_user(u32 *uval, u32 *uaddr)
++static int futex_get_user(u32 *uval, u32 __user *uaddr)
+{
+ int ret;
+
@@ -353,7 +715,7 @@ index 000000000..107b80a46
+}
+
+/**
-+ * futex_setup_time - Prepare the timeout mechanism, without starting it.
++ * futex_setup_time - Prepare the timeout mechanism and start it.
+ * @timo: Timeout value from userspace
+ * @timeout: Pointer to hrtimer handler
+ * @flags: Flags from userspace, to decide which clockid to use
@@ -381,220 +743,342 @@ index 000000000..107b80a46
+
+ hrtimer_set_expires(&timeout->timer, time);
+
++ hrtimer_sleeper_start_expires(timeout, HRTIMER_MODE_ABS);
++
+ return 0;
+}
+
++/**
++ * futex_dequeue_multiple - Remove multiple futexes from hash table
++ * @futexv: list of waiters
++ * @nr: number of futexes to be removed
++ *
++ * This function is used if (a) something went wrong while enqueuing, and we
++ * need to undo our work (then nr <= nr_futexes) or (b) we woke up, and thus
++ * need to remove every waiter, check if some was indeed woken and return.
++ * Before removing a waiter, we check if it's on the list, since we have no
++ * clue who have been waken.
++ *
++ * Return:
++ * * -1 - If no futex was woken during the removal
++ * * 0>= - At least one futex was found woken, index of the last one
++ */
++static int futex_dequeue_multiple(struct futexv_head *futexv, unsigned int nr)
++{
++ int i, ret = -1;
++
++ for (i = 0; i < nr; i++) {
++ spin_lock(&futexv->objects[i].bucket->lock);
++ if (!list_empty_careful(&futexv->objects[i].list)) {
++ list_del_init_careful(&futexv->objects[i].list);
++ bucket_dec_waiters(futexv->objects[i].bucket);
++ } else {
++ ret = i;
++ }
++ spin_unlock(&futexv->objects[i].bucket->lock);
++ }
++
++ return ret;
++}
+
+/**
-+ * futex_get_user_value - Get the value from the userspace address and compares
-+ * with the expected one. In success, leaves the function
-+ * holding the bucket lock. Else, hold no lock.
-+ * @bucket: hash bucket of this address
-+ * @uaddr: futex's userspace address
-+ * @val: expected value
-+ * @multiple: is this call in the wait on multiple path
++ * futex_enqueue - Check the value and enqueue a futex on a wait list
+ *
-+ * Return: 0 on success, error code otherwise
++ * @futexv: List of futexes
++ * @nr_futexes: Number of futexes in the list
++ * @awakened: If a futex was awakened during enqueueing, store the index here
++ *
++ * Get the value from the userspace address and compares with the expected one.
++ *
++ * Getting the value from user futex address:
++ *
++ * Since we are in a hurry, we use a spin lock and we can't sleep.
++ * Try to get the value with page fault disabled (when enable, we might
++ * sleep).
++ *
++ * If we fail, we aren't sure if the address is invalid or is just a
++ * page fault. Then, release the lock (so we can sleep) and try to get
++ * the value with page fault enabled. In order to trigger a page fault
++ * handling, we just call __get_user() again. If we sleep with enqueued
++ * futexes, we might miss a wake, so dequeue everything before sleeping.
++ *
++ * If get_user succeeds, this mean that the address is valid and we do
++ * the work again. Since we just handled the page fault, the page is
++ * likely pinned in memory and we should be luckier this time and be
++ * able to get the value. If we fail anyway, we will try again.
++ *
++ * If even with page faults enabled we get and error, this means that
++ * the address is not valid and we return from the syscall.
++ *
++ * If we got an unexpected value or need to treat a page fault and realized that
++ * a futex was awakened, we can priority this and return success.
++ *
++ * In success, enqueue the futex in the correct bucket
++ *
++ * Return:
++ * * 1 - We were awake in the process and nothing is enqueued
++ * * 0 - Everything is enqueued and we are ready to sleep
++ * * 0< - Something went wrong, nothing is enqueued, return error code
+ */
-+static int futex_get_user_value(struct futex_bucket *bucket, u32 __user *uaddr,
-+ unsigned int val, bool multiple)
++static int futex_enqueue(struct futexv_head *futexv, unsigned int nr_futexes,
++ int *awakened)
+{
-+ u32 uval;
-+ int ret;
++ int i, ret;
++ u32 uval, *uaddr, val;
++ struct futex_bucket *bucket;
+
-+ /*
-+ * Get the value from user futex address.
-+ *
-+ * Since we are in a hurry, we use a spin lock and we can't sleep.
-+ * Try to get the value with page fault disabled (when enable, we might
-+ * sleep).
-+ *
-+ * If we fail, we aren't sure if the address is invalid or is just a
-+ * page fault. Then, release the lock (so we can sleep) and try to get
-+ * the value with page fault enabled. In order to trigger a page fault
-+ * handling, we just call __get_user() again.
-+ *
-+ * If get_user succeeds, this mean that the address is valid and we do
-+ * the loop again. Since we just handled the page fault, the page is
-+ * likely pinned in memory and we should be luckier this time and be
-+ * able to get the value. If we fail anyway, we will try again.
-+ *
-+ * If even with page faults enabled we get and error, this means that
-+ * the address is not valid and we return from the syscall.
-+ */
-+ do {
++retry:
++ set_current_state(TASK_INTERRUPTIBLE);
++
++ for (i = 0; i < nr_futexes; i++) {
++ uaddr = (u32 * __user)futexv->objects[i].uaddr;
++ val = (u32)futexv->objects[i].val;
++
++ bucket = futexv->objects[i].bucket;
++
++ bucket_inc_waiters(bucket);
+ spin_lock(&bucket->lock);
+
+ ret = futex_get_user(&uval, uaddr);
+
-+ if (ret) {
++ if (unlikely(ret)) {
+ spin_unlock(&bucket->lock);
-+ if (multiple || __get_user(uval, uaddr))
++
++ bucket_dec_waiters(bucket);
++ __set_current_state(TASK_RUNNING);
++ *awakened = futex_dequeue_multiple(futexv, i);
++
++ if (__get_user(uval, uaddr))
+ return -EFAULT;
+
++ if (*awakened >= 0)
++ return 1;
++
++ goto retry;
+ }
-+ } while (ret);
+
-+ if (uval != val) {
++ if (uval != val) {
++ spin_unlock(&bucket->lock);
++
++ bucket_dec_waiters(bucket);
++ __set_current_state(TASK_RUNNING);
++ *awakened = futex_dequeue_multiple(futexv, i);
++
++ if (*awakened >= 0)
++ return 1;
++
++ return -EAGAIN;
++ }
++
++ list_add_tail(&futexv->objects[i].list, &bucket->list);
+ spin_unlock(&bucket->lock);
-+ return -EWOULDBLOCK;
+ }
+
+ return 0;
+}
+
+/**
-+ * futex_dequeue - Remove a futex from a queue
-+ * @bucket: current bucket holding the futex
-+ * @waiter: futex to be removed
-+ *
-+ * Return: True if futex was removed by this function, false if another wake
-+ * thread removed this futex.
++ * __futex_wait - Enqueue the list of futexes and wait to be woken
++ * @futexv: List of futexes to wait
++ * @nr_futexes: Length of futexv
++ * @timeout: Pointer to timeout handler
+ *
-+ * This function should be used after we found that this futex was in a queue.
-+ * Thus, it needs to be removed before the next step. However, someone could
-+ * wake it between the time of the first check and the time to get the lock for
-+ * the bucket. Check one more time if the futex is there with the bucket locked.
-+ * If it's there, just remove it and return true. Else, mark the removal as
-+ * false and do nothing.
++ * Return: ++ * * 0 >= - Hint of which futex woke us ++ * * 0 < - Error code + */ -+static bool futex_dequeue(struct futex_bucket *bucket, struct futex_waiter *waiter) ++static int __futex_wait(struct futexv_head *futexv, unsigned int nr_futexes, ++ struct hrtimer_sleeper *timeout) +{ -+ bool removed = true; ++ int ret; + -+ spin_lock(&bucket->lock); -+ if (list_empty(&waiter->list)) -+ removed = false; -+ else -+ list_del(&waiter->list); -+ spin_unlock(&bucket->lock); ++ while (1) { ++ int awakened = -1; ++ ++ ret = futex_enqueue(futexv, nr_futexes, &awakened); ++ ++ if (ret) { ++ if (awakened >= 0) ++ return awakened; ++ return ret; ++ } ++ ++ /* Before sleeping, check if someone was woken */ ++ if (!futexv->hint && (!timeout || timeout->task)) ++ freezable_schedule(); ++ ++ __set_current_state(TASK_RUNNING); ++ ++ /* ++ * One of those things triggered this wake: ++ * ++ * * We have been removed from the bucket. futex_wake() woke ++ * us. We just need to dequeue and return 0 to userspace. ++ * ++ * However, if no futex was dequeued by a futex_wake(): ++ * ++ * * If the there's a timeout and it has expired, ++ * return -ETIMEDOUT. ++ * ++ * * If there is a signal pending, something wants to kill our ++ * thread, return -ERESTARTSYS. ++ * ++ * * If there's no signal pending, it was a spurious wake ++ * (scheduler gave us a change to do some work, even if we ++ * don't want to). We need to remove ourselves from the ++ * bucket and add again, to prevent losing wakeups in the ++ * meantime. 
++ */ + -+ if (removed) -+ bucket_dec_waiters(bucket); ++ ret = futex_dequeue_multiple(futexv, nr_futexes); + -+ return removed; ++ /* Normal wake */ ++ if (ret >= 0) ++ return ret; ++ ++ if (timeout && !timeout->task) ++ return -ETIMEDOUT; ++ ++ if (signal_pending(current)) ++ return -ERESTARTSYS; ++ ++ /* Spurious wake, do everything again */ ++ } +} + +/** -+ * sys_futex_wait - Wait on a futex address if (*uaddr) == val -+ * @uaddr: User address of futex -+ * @val: Expected value of futex -+ * @flags: Specify the size of futex and the clockid -+ * @timo: Optional absolute timeout. Supports only 64bit time. ++ * futex_wait - Setup the timer (if there's one) and wait on a list of futexes ++ * @futexv: List of futexes ++ * @nr_futexes: Length of futexv ++ * @timo: Timeout ++ * @flags: Timeout flags ++ * ++ * Return: ++ * * 0 >= - Hint of which futex woke us ++ * * 0 < - Error code + */ -+SYSCALL_DEFINE4(futex_wait, void __user *, uaddr, unsigned int, val, -+ unsigned int, flags, struct __kernel_timespec __user *, timo) ++static int futex_set_timer_and_wait(struct futexv_head *futexv, ++ unsigned int nr_futexes, ++ struct __kernel_timespec __user *timo, ++ unsigned int flags) +{ -+ unsigned int size = flags & FUTEX_SIZE_MASK; + struct hrtimer_sleeper timeout; -+ struct futex_bucket *bucket; -+ struct futex_single_waiter wait_single; -+ struct futex_waiter *waiter; + int ret; + -+ wait_single.parent.task = current; -+ wait_single.parent.hint = 0; -+ waiter = &wait_single.waiter; -+ waiter->index = 0; -+ -+ if (flags & ~FUTEX2_MASK) -+ return -EINVAL; -+ -+ if (size != FUTEX_32) -+ return -EINVAL; -+ + if (timo) { + ret = futex_setup_time(timo, &timeout, flags); + if (ret) + return ret; + } + -+ /* Get an unlocked hash bucket */ -+ bucket = futex_get_bucket(uaddr, &waiter->key); -+ if (IS_ERR(bucket)) -+ return PTR_ERR(bucket); ++ ret = __futex_wait(futexv, nr_futexes, timo ? 
&timeout : NULL); + + if (timo) -+ hrtimer_sleeper_start_expires(&timeout, HRTIMER_MODE_ABS); -+ -+retry: -+ bucket_inc_waiters(bucket); ++ hrtimer_cancel(&timeout.timer); + -+ /* Compare the expected and current value, get the bucket lock */ -+ ret = futex_get_user_value(bucket, uaddr, val, false); -+ if (ret) { -+ bucket_dec_waiters(bucket); -+ goto out; -+ } ++ return ret; ++} + -+ /* Add the waiter to the hash table and sleep */ -+ set_current_state(TASK_INTERRUPTIBLE); -+ list_add_tail(&waiter->list, &bucket->list); -+ spin_unlock(&bucket->lock); ++/** ++ * sys_futex_wait - Wait on a futex address if (*uaddr) == val ++ * @uaddr: User address of futex ++ * @val: Expected value of futex ++ * @flags: Specify the size of futex and the clockid ++ * @timo: Optional absolute timeout. ++ * ++ * The user thread is put to sleep, waiting for a futex_wake() at uaddr, if the ++ * value at *uaddr is the same as val (otherwise, the syscall returns ++ * immediately with -EAGAIN). ++ * ++ * Returns 0 on success, error code otherwise. ++ */ ++SYSCALL_DEFINE4(futex_wait, void __user *, uaddr, unsigned int, val, ++ unsigned int, flags, struct __kernel_timespec __user *, timo) ++{ ++ unsigned int size = flags & FUTEX_SIZE_MASK; ++ struct futex_single_waiter wait_single = {0}; ++ struct futex_waiter *waiter; ++ struct futexv_head *futexv; + -+ /* Do not sleep if someone woke this futex or if it was timeouted */ -+ if (!list_empty_careful(&waiter->list) && (!timo || timeout.task)) -+ freezable_schedule(); ++ if (flags & ~FUTEX2_MASK) ++ return -EINVAL; + -+ __set_current_state(TASK_RUNNING); ++ if (size != FUTEX_32) ++ return -EINVAL; + -+ /* -+ * One of those things triggered this wake: -+ * -+ * * We have been removed from the bucket. futex_wake() woke us. We just -+ * need to return 0 to userspace. -+ * -+ * However, if we find ourselves in the bucket we must remove ourselves -+ * from the bucket and ... -+ * -+ * * If the there's a timeout and it has expired, return -ETIMEDOUT. 
-+ * -+ * * If there is a signal pending, something wants to kill our thread. -+ * Return -ERESTARTSYS. -+ * -+ * * If there's no signal pending, it was a spurious wake (scheduler -+ * gave us a change to do some work, even if we don't want to). We -+ * need to remove ourselves from the bucket and add again, to prevent -+ * losing wakeups in the meantime. -+ */ ++ futexv = &wait_single.futexv; ++ futexv->task = current; ++ futexv->hint = false; + -+ /* Normal wake */ -+ if (list_empty_careful(&waiter->list)) -+ goto out; ++ waiter = &wait_single.waiter; ++ waiter->index = 0; ++ waiter->val = val; ++ waiter->uaddr = (uintptr_t)uaddr; + -+ if (!futex_dequeue(bucket, waiter)) -+ goto out; ++ INIT_LIST_HEAD(&waiter->list); + -+ /* Timeout */ -+ if (timo && !timeout.task) -+ return -ETIMEDOUT; ++ /* Get an unlocked hash bucket */ ++ waiter->bucket = futex_get_bucket(uaddr, &waiter->key); ++ if (IS_ERR(waiter->bucket)) ++ return PTR_ERR(waiter->bucket); + -+ /* Spurious wakeup */ -+ if (!signal_pending(current)) -+ goto retry; ++ return futex_set_timer_and_wait(futexv, 1, timo, flags); ++} + -+ /* Some signal is pending */ -+ ret = -ERESTARTSYS; -+out: -+ if (timo) -+ hrtimer_cancel(&timeout.timer); ++/** ++ * futex_get_parent - For a given futex in a futexv list, get a pointer to the futexv ++ * @waiter: Address of futex in the list ++ * @index: Index of futex in the list ++ * ++ * Return: A pointer to its futexv struct ++ */ ++static inline struct futexv_head *futex_get_parent(uintptr_t waiter, ++ unsigned int index) ++{ ++ uintptr_t parent = waiter - sizeof(struct futexv_head) ++ - (uintptr_t)(index * sizeof(struct futex_waiter)); + -+ return ret; ++ return (struct futexv_head *)parent; +} + -+static struct futexv *futex_get_parent(uintptr_t waiter, u8 index) ++/** ++ * futex_mark_wake - Find the task to be wake and add it in wake queue ++ * @waiter: Waiter to be wake ++ * @bucket: Bucket to be decremented ++ * @wake_q: Wake queue to insert the task ++ */ ++static 
void futex_mark_wake(struct futex_waiter *waiter, ++ struct futex_bucket *bucket, ++ struct wake_q_head *wake_q) +{ -+ uintptr_t parent = waiter - sizeof(struct futexv) -+ - (uintptr_t) (index * sizeof(struct futex_waiter)); ++ struct task_struct *task; ++ struct futexv_head *parent = futex_get_parent((uintptr_t)waiter, ++ waiter->index); ++ ++ parent->hint = true; ++ task = parent->task; ++ get_task_struct(task); ++ list_del_init_careful(&waiter->list); ++ wake_q_add_safe(wake_q, task); ++ bucket_dec_waiters(bucket); ++} + -+ return (struct futexv *) parent; ++static inline bool futex_match(struct futex_key key1, struct futex_key key2) ++{ ++ return (key1.index == key2.index && ++ key1.pointer == key2.pointer && ++ key1.offset == key2.offset); +} + +/** + * sys_futex_wake - Wake a number of futexes waiting on an address + * @uaddr: Address of futex to be woken up -+ * @nr_wake: Number of futexes to be woken up -+ * @flags: TODO ++ * @nr_wake: Number of futexes waiting in uaddr to be woken up ++ * @flags: Flags for size and shared ++ * ++ * Wake `nr_wake` threads waiting at uaddr. ++ * ++ * Returns the number of woken threads on success, error code otherwise. 
+ */ +SYSCALL_DEFINE3(futex_wake, void __user *, uaddr, unsigned int, nr_wake, + unsigned int, flags) @@ -602,7 +1086,6 @@ index 000000000..107b80a46 + unsigned int size = flags & FUTEX_SIZE_MASK; + struct futex_waiter waiter, *aux, *tmp; + struct futex_bucket *bucket; -+ struct task_struct *task; + DEFINE_WAKE_Q(wake_q); + int ret = 0; + @@ -616,26 +1099,15 @@ index 000000000..107b80a46 + if (IS_ERR(bucket)) + return PTR_ERR(bucket); + -+ if (!bucket_get_waiters(bucket)) ++ if (!bucket_get_waiters(bucket) || !nr_wake) + return 0; + + spin_lock(&bucket->lock); + list_for_each_entry_safe(aux, tmp, &bucket->list, list) { -+ if (ret >= nr_wake) -+ break; -+ -+ if (waiter.key.address == aux->key.address && -+ waiter.key.mm == aux->key.mm) { -+ struct futexv *parent = -+ futex_get_parent((uintptr_t) aux, aux->index); -+ -+ parent->hint = 1; -+ task = parent->task; -+ get_task_struct(task); -+ list_del_init_careful(&aux->list); -+ wake_q_add_safe(&wake_q, task); -+ ret++; -+ bucket_dec_waiters(bucket); ++ if (futex_match(waiter.key, aux->key)) { ++ futex_mark_wake(aux, bucket, &wake_q); ++ if (++ret >= nr_wake) ++ break; + } + } + spin_unlock(&bucket->lock); @@ -673,10 +1145,10 @@ index 000000000..107b80a46 +} +core_initcall(futex2_init); diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c -index f27ac94d5..35ff743b1 100644 +index 19aa80689..27ef83ca8 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c -@@ -148,6 +148,10 @@ COND_SYSCALL_COMPAT(set_robust_list); +@@ -150,6 +150,10 @@ COND_SYSCALL_COMPAT(set_robust_list); COND_SYSCALL(get_robust_list); COND_SYSCALL_COMPAT(get_robust_list); @@ -688,610 +1160,761 @@ index f27ac94d5..35ff743b1 100644 /* kernel/itimer.c */ diff --git a/tools/include/uapi/asm-generic/unistd.h b/tools/include/uapi/asm-generic/unistd.h -index 205631898..cd79f94e0 100644 +index 728752917..57e19200f 100644 --- a/tools/include/uapi/asm-generic/unistd.h +++ b/tools/include/uapi/asm-generic/unistd.h -@@ -860,8 +860,15 @@ __SYSCALL(__NR_faccessat2, 
sys_faccessat2) - #define __NR_process_madvise 440 - __SYSCALL(__NR_process_madvise, sys_process_madvise) +@@ -862,8 +862,14 @@ __SYSCALL(__NR_process_madvise, sys_process_madvise) + #define __NR_epoll_pwait2 441 + __SC_COMP(__NR_epoll_pwait2, sys_epoll_pwait2, compat_sys_epoll_pwait2) -+#define __NR_futex_wait 441 ++#define __NR_futex_wait 442 +__SYSCALL(__NR_futex_wait, sys_futex_wait) + -+#define __NR_futex_wake 442 ++#define __NR_futex_wake 443 +__SYSCALL(__NR_futex_wake, sys_futex_wake) + #undef __NR_syscalls --#define __NR_syscalls 441 -+#define __NR_syscalls 443 -+ +-#define __NR_syscalls 442 ++#define __NR_syscalls 444 /* * 32 bit systems traditionally used different diff --git a/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl b/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl -index 379819244..47de3bf93 100644 +index 78672124d..15d2b89b6 100644 --- a/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl +++ b/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl -@@ -362,6 +362,8 @@ - 438 common pidfd_getfd sys_pidfd_getfd +@@ -363,6 +363,8 @@ 439 common faccessat2 sys_faccessat2 440 common process_madvise sys_process_madvise -+441 common futex_wait sys_futex_wait -+442 common futex_wake sys_futex_wake + 441 common epoll_pwait2 sys_epoll_pwait2 ++442 common futex_wait sys_futex_wait ++443 common futex_wake sys_futex_wake # # Due to a historical design error, certain syscalls are numbered differently -- -2.29.2 +2.30.2 -From d71973d99efb1e2fd2542ea4d4b45b0e03e45b9c Mon Sep 17 00:00:00 2001 +From ea4e3d7ee8dc965fbe3cabd753b88ada23cecb39 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Andr=C3=A9=20Almeida?= <andrealmeid@collabora.com> -Date: Thu, 15 Oct 2020 17:15:57 -0300 -Subject: [PATCH 2/9] futex2: Add suport for vectorized wait +Date: Fri, 5 Feb 2021 10:34:01 -0300 +Subject: [PATCH 02/13] futex2: Add support for shared futexes MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit -Add support to wait on multiple futexes +Add 
support for shared futexes for cross-process resources. This design +relies on the same approach used by the old futex to create a unique id for +file-backed shared memory, by using a counter at struct inode. + +There are two types of futexes: private and shared ones. Private futexes +are meant to be used by threads that share the same memory space; they +are easier to identify uniquely and thus allow some performance +optimizations. The elements for identifying one are: the start address of +the page where the address is, the address offset within the page and +the current->mm pointer. + +Now, for uniquely identifying a shared futex: + +- If the page containing the user address is an anonymous page, we can + just use the same data used for private futexes (the start address of + the page, the address offset within the page and the current->mm + pointer); that is enough to uniquely identify such a futex. We + also set one bit in the key to differentiate it from a private futex + used on the same address (mixing shared and private calls is not + allowed). + +- If the page is file-backed, current->mm may not be the same for + every user of this futex, so we need to use other data: the + page->index, a UUID for the struct inode and the offset within the + page. + +Note that members of futex_key don't have any particular meaning once +they are part of the struct - they are just bytes to identify a futex. +Given that, we don't need to use a particular name or type that matches +the original data; we only need to care about the bitsize of each +component and make both private and shared data fit in the same memory +space.
Signed-off-by: André Almeida <andrealmeid@collabora.com> Signed-off-by: Jan200101 <sentrycraft123@gmail.com> --- - arch/x86/entry/syscalls/syscall_32.tbl | 1 + - arch/x86/entry/syscalls/syscall_64.tbl | 1 + - include/uapi/asm-generic/unistd.h | 5 +- - kernel/futex2.c | 430 ++++++++++++------ - kernel/sys_ni.c | 1 + - tools/include/uapi/asm-generic/unistd.h | 5 +- - .../arch/x86/entry/syscalls/syscall_64.tbl | 1 + - 7 files changed, 309 insertions(+), 135 deletions(-) + fs/inode.c | 1 + + include/linux/fs.h | 1 + + kernel/futex2.c | 220 +++++++++++++++++++++++++++++++++++++++++++-- + 3 files changed, 217 insertions(+), 5 deletions(-) -diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl -index 83a75ff39..65734d5e1 100644 ---- a/arch/x86/entry/syscalls/syscall_32.tbl -+++ b/arch/x86/entry/syscalls/syscall_32.tbl -@@ -447,3 +447,4 @@ - 440 i386 process_madvise sys_process_madvise - 441 i386 futex_wait sys_futex_wait - 442 i386 futex_wake sys_futex_wake -+443 i386 futex_waitv sys_futex_waitv -diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl -index 6658fd63c..f30811b56 100644 ---- a/arch/x86/entry/syscalls/syscall_64.tbl -+++ b/arch/x86/entry/syscalls/syscall_64.tbl -@@ -364,6 +364,7 @@ - 440 common process_madvise sys_process_madvise - 441 common futex_wait sys_futex_wait - 442 common futex_wake sys_futex_wake -+443 common futex_waitv sys_futex_waitv - - # - # Due to a historical design error, certain syscalls are numbered differently -diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h -index ae47d6a9e..81a90b697 100644 ---- a/include/uapi/asm-generic/unistd.h -+++ b/include/uapi/asm-generic/unistd.h -@@ -866,8 +866,11 @@ __SYSCALL(__NR_futex_wait, sys_futex_wait) - #define __NR_futex_wake 442 - __SYSCALL(__NR_futex_wake, sys_futex_wake) - -+#define __NR_futex_waitv 443 -+__SYSCALL(__NR_futex_waitv, sys_futex_waitv) -+ - #undef __NR_syscalls --#define 
__NR_syscalls 443 -+#define __NR_syscalls 444 - - /* - * 32 bit systems traditionally used different +diff --git a/fs/inode.c b/fs/inode.c +index 6442d97d9..886fe11cc 100644 +--- a/fs/inode.c ++++ b/fs/inode.c +@@ -139,6 +139,7 @@ int inode_init_always(struct super_block *sb, struct inode *inode) + inode->i_blkbits = sb->s_blocksize_bits; + inode->i_flags = 0; + atomic64_set(&inode->i_sequence, 0); ++ atomic64_set(&inode->i_sequence2, 0); + atomic_set(&inode->i_count, 1); + inode->i_op = &empty_iops; + inode->i_fop = &no_open_fops; +diff --git a/include/linux/fs.h b/include/linux/fs.h +index fd47deea7..516bda982 100644 +--- a/include/linux/fs.h ++++ b/include/linux/fs.h +@@ -681,6 +681,7 @@ struct inode { + }; + atomic64_t i_version; + atomic64_t i_sequence; /* see futex */ ++ atomic64_t i_sequence2; /* see futex2 */ + atomic_t i_count; + atomic_t i_dio_count; + atomic_t i_writecount; diff --git a/kernel/futex2.c b/kernel/futex2.c -index 107b80a46..4b782b5ef 100644 +index 802578ad6..27767b2d0 100644 --- a/kernel/futex2.c +++ b/kernel/futex2.c -@@ -48,14 +48,25 @@ struct futex_bucket { - struct list_head list; - }; +@@ -14,8 +14,10 @@ + */ -+/** -+ * struct futexv - List of futexes to be waited -+ * @task: Task to be awaken -+ * @hint: Was someone on this list awaken? 
-+ * @objects: List of futexes -+ */ - struct futexv { - struct task_struct *task; -- int hint; -+ bool hint; - struct futex_waiter objects[0]; - }; + #include <linux/freezer.h> ++#include <linux/hugetlb.h> + #include <linux/jhash.h> + #include <linux/memblock.h> ++#include <linux/pagemap.h> + #include <linux/sched/wake_q.h> + #include <linux/spinlock.h> + #include <linux/syscalls.h> +@@ -23,8 +25,8 @@ -+/** -+ * struct futex_single_waiter - Wrapper for a futexv of one element -+ * @futexv: TODO -+ * @waiter: TODO -+ */ - struct futex_single_waiter { -- struct futexv parent; -+ struct futexv futexv; - struct futex_waiter waiter; - } __packed; - -@@ -65,10 +76,10 @@ struct futex_bucket *futex_table; - #define FUTEX2_MASK (FUTEX_SIZE_MASK | FUTEX_SHARED_FLAG | \ - FUTEX_CLOCK_REALTIME) - --// mask for sys_futex_waitv -+/* mask for sys_futex_waitv flag */ - #define FUTEXV_MASK (FUTEX_CLOCK_REALTIME) - --// mask for each futex in futex_waitv list -+/* mask for each futex in futex_waitv list */ + /** + * struct futex_key - Components to build unique key for a futex +- * @pointer: Pointer to current->mm +- * @index: Start address of the page containing futex ++ * @pointer: Pointer to current->mm or inode's UUID for file backed futexes ++ * @index: Start address of the page containing futex or index of the page + * @offset: Address offset of uaddr in a page + */ + struct futex_key { +@@ -97,6 +99,11 @@ struct futex_single_waiter { + /* Mask for each futex in futex_waitv list */ #define FUTEXV_WAITER_MASK (FUTEX_SIZE_MASK | FUTEX_SHARED_FLAG) - int futex2_hashsize; -@@ -151,7 +162,7 @@ static struct futex_bucket *futex_get_bucket(void __user *uaddr, - * - * Check the comment at futex_get_user_val for more information. - */ --static int futex_get_user(u32 *uval, u32 *uaddr) -+static int futex_get_user(u32 *uval, u32 __user *uaddr) - { - int ret; ++#define is_object_shared ((futexv->objects[i].flags & FUTEX_SHARED_FLAG) ? 
true : false) ++ ++#define FUT_OFF_INODE 1 /* We set bit 0 if key has a reference on inode */ ++#define FUT_OFF_MMSHARED 2 /* We set bit 1 if key has a reference on mm */ ++ + struct futex_bucket *futex_table; + unsigned int futex2_hashsize; -@@ -194,95 +205,227 @@ static int futex_setup_time(struct __kernel_timespec __user *timo, - return 0; +@@ -143,16 +150,200 @@ static inline int bucket_get_waiters(struct futex_bucket *bucket) + #endif } +/** -+ * futex_dequeue_multiple - Remove multiple futexes from hash table -+ * @futexv: list of waiters -+ * @nr: number of futexes to be removed -+ * -+ * This function should be used after we found that this futex was in a queue. -+ * Thus, it needs to be removed before the next step. However, someone could -+ * wake it between the time of the first check and the time to get the lock for -+ * the bucket. Check one more time if the futex is there with the bucket locked. -+ * If it's there, just remove it and return true. Else, mark the removal as -+ * false and do nothing. -+ * -+ * Return: -+ * * -1 if no futex was woken during the removal -+ * * =< 0 at least one futex was found woken, index of the last one -+ */ -+static int futex_dequeue_multiple(struct futexv *futexv, unsigned int nr) -+{ -+ int i, ret = -1; -+ -+ for (i = 0; i < nr; i++) { -+ spin_lock(&futexv->objects[i].bucket->lock); -+ if (!list_empty_careful(&futexv->objects[i].list)) { -+ list_del_init_careful(&futexv->objects[i].list); -+ bucket_dec_waiters(futexv->objects[i].bucket); -+ } else { -+ ret = i; -+ } -+ spin_unlock(&futexv->objects[i].bucket->lock); -+ } -+ -+ return ret; -+} - - /** -- * futex_get_user_value - Get the value from the userspace address and compares -- * with the expected one. In success, leaves the function -- * holding the bucket lock. Else, hold no lock. 
-- * @bucket: hash bucket of this address -- * @uaddr: futex's userspace address -- * @val: expected value -- * @multiple: is this call in the wait on multiple path -+ * futex_enqueue - Check the value and enqueue a futex on a wait list ++ * futex_get_inode_uuid - Gets an UUID for an inode ++ * @inode: inode to get UUID + * -+ * @futexv: List of futexes -+ * @nr_futexes: Number of futexes in the list -+ * @awaken: If a futex was awaken during enqueueing, store the index here ++ * Generate a machine wide unique identifier for this inode. + * -+ * Get the value from the userspace address and compares with the expected one. -+ * In success, enqueue the futex in the correct bucket ++ * This relies on u64 not wrapping in the life-time of the machine; which with ++ * 1ns resolution means almost 585 years. + * -+ * Get the value from user futex address. ++ * This further relies on the fact that a well formed program will not unmap ++ * the file while it has a (shared) futex waiting on it. This mapping will have ++ * a file reference which pins the mount and inode. + * -+ * Since we are in a hurry, we use a spin lock and we can't sleep. -+ * Try to get the value with page fault disabled (when enable, we might -+ * sleep). ++ * If for some reason an inode gets evicted and read back in again, it will get ++ * a new sequence number and will _NOT_ match, even though it is the exact same ++ * file. + * -+ * If we fail, we aren't sure if the address is invalid or is just a -+ * page fault. Then, release the lock (so we can sleep) and try to get -+ * the value with page fault enabled. In order to trigger a page fault -+ * handling, we just call __get_user() again. If we sleep with enqueued -+ * futexes, we might miss a wake, so dequeue everything before sleeping. ++ * It is important that match_futex() will never have a false-positive, esp. ++ * for PI futexes that can mess up the state. The above argues that false-negatives ++ * are only possible for malformed programs. 
+ * -+ * If get_user succeeds, this mean that the address is valid and we do -+ * the work again. Since we just handled the page fault, the page is -+ * likely pinned in memory and we should be luckier this time and be -+ * able to get the value. If we fail anyway, we will try again. -+ * -+ * If even with page faults enabled we get and error, this means that -+ * the address is not valid and we return from the syscall. -+ * -+ * If we got an unexpected value or need to treat a page fault and realized that -+ * a futex was awaken, we can priority this and return success. - * - * Return: 0 on success, error code otherwise - */ --static int futex_get_user_value(struct futex_bucket *bucket, u32 __user *uaddr, -- unsigned int val, bool multiple) -+static int futex_enqueue(struct futexv *futexv, unsigned int nr_futexes, -+ unsigned int *awaken) - { -- u32 uval; -- int ret; -+ int i, ret; -+ u32 uval, *uaddr, val; -+ struct futex_bucket *bucket; - -- /* -- * Get the value from user futex address. -- * -- * Since we are in a hurry, we use a spin lock and we can't sleep. -- * Try to get the value with page fault disabled (when enable, we might -- * sleep). -- * -- * If we fail, we aren't sure if the address is invalid or is just a -- * page fault. Then, release the lock (so we can sleep) and try to get -- * the value with page fault enabled. In order to trigger a page fault -- * handling, we just call __get_user() again. -- * -- * If get_user succeeds, this mean that the address is valid and we do -- * the loop again. Since we just handled the page fault, the page is -- * likely pinned in memory and we should be luckier this time and be -- * able to get the value. If we fail anyway, we will try again. -- * -- * If even with page faults enabled we get and error, this means that -- * the address is not valid and we return from the syscall. 
-- */ -- do { -- spin_lock(&bucket->lock); -+retry: -+ set_current_state(TASK_INTERRUPTIBLE); -+ -+ for (i = 0; i < nr_futexes; i++) { -+ uaddr = (u32 * __user) futexv->objects[i].key.address; -+ val = (u32) futexv->objects[i].val; -+ bucket = futexv->objects[i].bucket; -+ -+ bucket_inc_waiters(bucket); -+ spin_lock(&bucket->lock); - -- ret = futex_get_user(&uval, uaddr); -+ ret = futex_get_user(&uval, uaddr); - -- if (ret) { -+ if (unlikely(ret)) { - spin_unlock(&bucket->lock); -- if (multiple || __get_user(uval, uaddr)) -+ -+ bucket_dec_waiters(bucket); -+ __set_current_state(TASK_RUNNING); -+ *awaken = futex_dequeue_multiple(futexv, i); -+ -+ if (__get_user(uval, uaddr)) - return -EFAULT; - -+ if (*awaken >= 0) -+ return 0; ++ * Returns: UUID for the given inode ++ */ ++static u64 futex_get_inode_uuid(struct inode *inode) ++{ ++ static atomic64_t i_seq; ++ u64 old; + -+ goto retry; -+ } ++ /* Does the inode already have a sequence number? */ ++ old = atomic64_read(&inode->i_sequence2); + -+ if (uval != val) { -+ spin_unlock(&bucket->lock); ++ if (likely(old)) ++ return old; + -+ bucket_dec_waiters(bucket); -+ __set_current_state(TASK_RUNNING); -+ *awaken = futex_dequeue_multiple(futexv, i); ++ for (;;) { ++ u64 new = atomic64_add_return(1, &i_seq); + -+ if (*awaken >= 0) -+ return 0; ++ if (WARN_ON_ONCE(!new)) ++ continue; + -+ return -EWOULDBLOCK; - } -- } while (ret); - -- if (uval != val) { -+ list_add_tail(&futexv->objects[i].list, &bucket->list); - spin_unlock(&bucket->lock); -- return -EWOULDBLOCK; - } - - return 0; - } - ++ old = atomic64_cmpxchg_relaxed(&inode->i_sequence2, 0, new); ++ if (old) ++ return old; ++ return new; ++ } ++} + -+static int __futex_wait(struct futexv *futexv, -+ unsigned int nr_futexes, -+ struct hrtimer_sleeper *timeout) ++/** ++ * futex_get_shared_key - Get a key for a shared futex ++ * @address: Futex memory address ++ * @mm: Current process mm_struct pointer ++ * @key: Key struct to be filled ++ * ++ * Returns: 0 on success, 
error code otherwise ++ */ ++static int futex_get_shared_key(uintptr_t address, struct mm_struct *mm, ++ struct futex_key *key) +{ + int ret; -+ unsigned int awaken = -1; ++ struct page *page, *tail; ++ struct address_space *mapping; + -+ while (1) { -+ ret = futex_enqueue(futexv, nr_futexes, &awaken); ++again: ++ ret = get_user_pages_fast(address, 1, 0, &page); ++ if (ret < 0) ++ return ret; + -+ if (ret < 0) -+ break; ++ /* ++ * The treatment of mapping from this point on is critical. The page ++ * lock protects many things but in this context the page lock ++ * stabilizes mapping, prevents inode freeing in the shared ++ * file-backed region case and guards against movement to swap cache. ++ * ++ * Strictly speaking the page lock is not needed in all cases being ++ * considered here and page lock forces unnecessarily serialization ++ * From this point on, mapping will be re-verified if necessary and ++ * page lock will be acquired only if it is unavoidable ++ * ++ * Mapping checks require the head page for any compound page so the ++ * head page and mapping is looked up now. For anonymous pages, it ++ * does not matter if the page splits in the future as the key is ++ * based on the address. For filesystem-backed pages, the tail is ++ * required as the index of the page determines the key. For ++ * base pages, there is no tail page and tail == page. ++ */ ++ tail = page; ++ page = compound_head(page); ++ mapping = READ_ONCE(page->mapping); + -+ if (awaken <= 0) { -+ return awaken; -+ } ++ /* ++ * If page->mapping is NULL, then it cannot be a PageAnon ++ * page; but it might be the ZERO_PAGE or in the gate area or ++ * in a special mapping (all cases which we are happy to fail); ++ * or it may have been a good file page when get_user_pages_fast ++ * found it, but truncated or holepunched or subjected to ++ * invalidate_complete_page2 before we got the page lock (also ++ * cases which we are happy to fail). 
And we hold a reference, ++ * so refcount care in invalidate_complete_page's remove_mapping ++ * prevents drop_caches from setting mapping to NULL beneath us. ++ * ++ * The case we do have to guard against is when memory pressure made ++ * shmem_writepage move it from filecache to swapcache beneath us: ++ * an unlikely race, but we do need to retry for page->mapping. ++ */ ++ if (unlikely(!mapping)) { ++ int shmem_swizzled; ++ ++ /* ++ * Page lock is required to identify which special case above ++ * applies. If this is really a shmem page then the page lock ++ * will prevent unexpected transitions. ++ */ ++ lock_page(page); ++ shmem_swizzled = PageSwapCache(page) || page->mapping; ++ unlock_page(page); ++ put_page(page); + ++ if (shmem_swizzled) ++ goto again; + -+ /* Before sleeping, check if someone was woken */ -+ if (!futexv->hint && (!timeout || timeout->task)) -+ freezable_schedule(); ++ return -EFAULT; ++ } + -+ __set_current_state(TASK_RUNNING); ++ /* ++ * Private mappings are handled in a simple way. ++ * ++ * If the futex key is stored on an anonymous page, then the associated ++ * object is the mm which is implicitly pinned by the calling process. ++ * ++ * NOTE: When userspace waits on a MAP_SHARED mapping, even if ++ * it's a read-only handle, it's expected that futexes attach to ++ * the object not the particular process. ++ */ ++ if (PageAnon(page)) { ++ key->offset |= FUT_OFF_MMSHARED; ++ } else { ++ struct inode *inode; + + /* -+ * One of those things triggered this wake: -+ * -+ * * We have been removed from the bucket. futex_wake() woke -+ * us. We just need to dequeue return 0 to userspace. -+ * -+ * However, if no futex was dequeued by a futex_wake(): -+ * -+ * * If the there's a timeout and it has expired, -+ * return -ETIMEDOUT. -+ * -+ * * If there is a signal pending, something wants to kill our -+ * thread, return -ERESTARTSYS. ++ * The associated futex object in this case is the inode and ++ * the page->mapping must be traversed. 
Ordinarily this should ++ * be stabilised under page lock but it's not strictly ++ * necessary in this case as we just want to pin the inode, not ++ * update the radix tree or anything like that. + * -+ * * If there's no signal pending, it was a spurious wake -+ * (scheduler gave us a change to do some work, even if we -+ * don't want to). We need to remove ourselves from the -+ * bucket and add again, to prevent losing wakeups in the -+ * meantime. ++ * The RCU read lock is taken as the inode is finally freed ++ * under RCU. If the mapping still matches expectations then the ++ * mapping->host can be safely accessed as being a valid inode. + */ ++ rcu_read_lock(); + -+ ret = futex_dequeue_multiple(futexv, nr_futexes); ++ if (READ_ONCE(page->mapping) != mapping) { ++ rcu_read_unlock(); ++ put_page(page); + -+ /* Normal wake */ -+ if (ret >= 0) -+ break; ++ goto again; ++ } + -+ if (timeout && !timeout->task) -+ return -ETIMEDOUT; ++ inode = READ_ONCE(mapping->host); ++ if (!inode) { ++ rcu_read_unlock(); ++ put_page(page); + -+ /* signal */ -+ if (signal_pending(current)) -+ return -ERESTARTSYS; ++ goto again; ++ } ++ ++ key->pointer = futex_get_inode_uuid(inode); ++ key->index = (unsigned long)basepage_index(tail); ++ key->offset |= FUT_OFF_INODE; + -+ /* spurious wake, do everything again */ ++ rcu_read_unlock(); + } + -+ return ret; ++ put_page(page); ++ ++ return 0; +} + /** -- * futex_dequeue - Remove a futex from a queue -- * @bucket: current bucket holding the futex -- * @waiter: futex to be removed -+ * futex_wait - Setup the timer and wait on a list of futexes -+ * @futexv: List of waiters -+ * @nr_futexes: Number of waiters -+ * @timo: Timeout -+ * @timeout: Timeout -+ * @flags: Timeout flags + * futex_get_bucket - Check if the user address is valid, prepare internal + * data and calculate the hash + * @uaddr: futex user address + * @key: data that uniquely identifies a futex ++ * @shared: is this a shared futex? 
++ * ++ * For private futexes, each uaddr will be unique for a given mm_struct, and it ++ * won't be freed for the life time of the process. For shared futexes, check ++ * futex_get_shared_key(). * -- * Return: True if futex was removed by this function, false if another wake -- * thread removed this futex. -- * -- * This function should be used after we found that this futex was in a queue. -- * Thus, it needs to be removed before the next step. However, someone could -- * wake it between the time of the first check and the time to get the lock for -- * the bucket. Check one more time if the futex is there with the bucket locked. -- * If it's there, just remove it and return true. Else, mark the removal as -- * false and do nothing. -+ * Return: error code, or a hint of one of the waiters + * Return: address of bucket on success, error code otherwise */ --static bool futex_dequeue(struct futex_bucket *bucket, struct futex_waiter *waiter) -+static int futex_wait(struct futexv *futexv, unsigned int nr_futexes, -+ struct __kernel_timespec __user *timo, -+ struct hrtimer_sleeper *timeout, unsigned int flags) + static struct futex_bucket *futex_get_bucket(void __user *uaddr, +- struct futex_key *key) ++ struct futex_key *key, ++ bool shared) { -- bool removed = true; -+ int ret; + uintptr_t address = (uintptr_t)uaddr; + u32 hash_key; +@@ -168,6 +359,9 @@ static struct futex_bucket *futex_get_bucket(void __user *uaddr, + key->pointer = (u64)address; + key->index = (unsigned long)current->mm; -- spin_lock(&bucket->lock); -- if (list_empty(&waiter->list)) -- removed = false; -- else -- list_del(&waiter->list); -- spin_unlock(&bucket->lock); -+ if (timo) { -+ ret = futex_setup_time(timo, timeout, flags); -+ if (ret) -+ return ret; ++ if (shared) ++ futex_get_shared_key(address, current->mm, key); ++ + /* Generate hash key for this futex using uaddr and current->mm */ + hash_key = jhash2((u32 *)key, sizeof(*key) / sizeof(u32), 0); -- if (removed) -- 
bucket_dec_waiters(bucket); -+ hrtimer_sleeper_start_expires(timeout, HRTIMER_MODE_ABS); -+ } +@@ -303,6 +497,7 @@ static int futex_enqueue(struct futexv_head *futexv, unsigned int nr_futexes, + int *awakened) + { + int i, ret; ++ bool retry = false; + u32 uval, *uaddr, val; + struct futex_bucket *bucket; -- return removed; -+ ret = __futex_wait(futexv, nr_futexes, timo ? timeout : NULL); -+ -+ -+ if (timo) -+ hrtimer_cancel(&timeout->timer); +@@ -313,6 +508,18 @@ static int futex_enqueue(struct futexv_head *futexv, unsigned int nr_futexes, + uaddr = (u32 * __user)futexv->objects[i].uaddr; + val = (u32)futexv->objects[i].val; + ++ if (is_object_shared && retry) { ++ struct futex_bucket *tmp = ++ futex_get_bucket((void *)uaddr, ++ &futexv->objects[i].key, true); ++ if (IS_ERR(tmp)) { ++ __set_current_state(TASK_RUNNING); ++ futex_dequeue_multiple(futexv, i); ++ return PTR_ERR(tmp); ++ } ++ futexv->objects[i].bucket = tmp; ++ } + -+ return ret; - } + bucket = futexv->objects[i].bucket; - /** -@@ -297,15 +440,20 @@ SYSCALL_DEFINE4(futex_wait, void __user *, uaddr, unsigned int, val, + bucket_inc_waiters(bucket); +@@ -333,6 +540,7 @@ static int futex_enqueue(struct futexv_head *futexv, unsigned int nr_futexes, + if (*awakened >= 0) + return 1; + ++ retry = true; + goto retry; + } + +@@ -474,6 +682,7 @@ static int futex_set_timer_and_wait(struct futexv_head *futexv, + SYSCALL_DEFINE4(futex_wait, void __user *, uaddr, unsigned int, val, + unsigned int, flags, struct __kernel_timespec __user *, timo) { ++ bool shared = (flags & FUTEX_SHARED_FLAG) ? 
true : false; unsigned int size = flags & FUTEX_SIZE_MASK; - struct hrtimer_sleeper timeout; -- struct futex_bucket *bucket; - struct futex_single_waiter wait_single; + struct futex_single_waiter wait_single = {0}; struct futex_waiter *waiter; -+ struct futexv *futexv; - int ret; +@@ -497,7 +706,7 @@ SYSCALL_DEFINE4(futex_wait, void __user *, uaddr, unsigned int, val, + INIT_LIST_HEAD(&waiter->list); -- wait_single.parent.task = current; -- wait_single.parent.hint = 0; -+ futexv = &wait_single.futexv; -+ futexv->task = current; -+ futexv->hint = false; -+ - waiter = &wait_single.waiter; - waiter->index = 0; -+ waiter->val = val; -+ -+ INIT_LIST_HEAD(&waiter->list); + /* Get an unlocked hash bucket */ +- waiter->bucket = futex_get_bucket(uaddr, &waiter->key); ++ waiter->bucket = futex_get_bucket(uaddr, &waiter->key, shared); + if (IS_ERR(waiter->bucket)) + return PTR_ERR(waiter->bucket); - if (flags & ~FUTEX2_MASK) - return -EINVAL; -@@ -313,85 +461,101 @@ SYSCALL_DEFINE4(futex_wait, void __user *, uaddr, unsigned int, val, +@@ -562,6 +771,7 @@ static inline bool futex_match(struct futex_key key1, struct futex_key key2) + SYSCALL_DEFINE3(futex_wake, void __user *, uaddr, unsigned int, nr_wake, + unsigned int, flags) + { ++ bool shared = (flags & FUTEX_SHARED_FLAG) ? 
true : false; + unsigned int size = flags & FUTEX_SIZE_MASK; + struct futex_waiter waiter, *aux, *tmp; + struct futex_bucket *bucket; +@@ -574,7 +784,7 @@ SYSCALL_DEFINE3(futex_wake, void __user *, uaddr, unsigned int, nr_wake, if (size != FUTEX_32) return -EINVAL; -- if (timo) { -- ret = futex_setup_time(timo, &timeout, flags); -- if (ret) -- return ret; -- } -- - /* Get an unlocked hash bucket */ -- bucket = futex_get_bucket(uaddr, &waiter->key); -- if (IS_ERR(bucket)) -- return PTR_ERR(bucket); -+ waiter->bucket = futex_get_bucket(uaddr, &waiter->key); -+ if (IS_ERR(waiter->bucket)) -+ return PTR_ERR(waiter->bucket); +- bucket = futex_get_bucket(uaddr, &waiter.key); ++ bucket = futex_get_bucket(uaddr, &waiter.key, shared); + if (IS_ERR(bucket)) + return PTR_ERR(bucket); -- if (timo) -- hrtimer_sleeper_start_expires(&timeout, HRTIMER_MODE_ABS); -+ ret = futex_wait(futexv, 1, timo, &timeout, flags); +-- +2.30.2 + + +From bdfdc48ad40d314933c7872f4818172e76bcd350 Mon Sep 17 00:00:00 2001 +From: =?UTF-8?q?Andr=C3=A9=20Almeida?= <andrealmeid@collabora.com> +Date: Fri, 5 Feb 2021 10:34:00 -0300 +Subject: [PATCH 03/13] futex2: Implement vectorized wait +MIME-Version: 1.0 +Content-Type: text/plain; charset=UTF-8 +Content-Transfer-Encoding: 8bit + +Add support to wait on multiple futexes. This is the interface +implemented by this syscall: + +futex_waitv(struct futex_waitv *waiters, unsigned int nr_futexes, + unsigned int flags, struct timespec *timo) + +struct futex_waitv { + void *uaddr; + unsigned int val; + unsigned int flags; +}; + +Given an array of struct futex_waitv, wait on each uaddr. The thread +wakes if a futex_wake() is performed at any uaddr. The syscall returns +immediately if any waiter has *uaddr != val. *timo is an optional +timeout value for the operation. The flags argument of the syscall +should be used solely for specifying the timeout as realtime, if needed. +Flags for shared futexes, sizes, etc. 
should be used on the individual +flags of each waiter. + +Returns the array index of one of the awakened futexes. There’s no given +information of how many were awakened, or any particular attribute of it +(if it’s the first awakened, if it is of the smaller index...). + +Signed-off-by: André Almeida <andrealmeid@collabora.com> +Signed-off-by: Jan200101 <sentrycraft123@gmail.com> +--- + arch/arm/tools/syscall.tbl | 1 + + arch/arm64/include/asm/unistd.h | 2 +- + arch/x86/entry/syscalls/syscall_32.tbl | 1 + + arch/x86/entry/syscalls/syscall_64.tbl | 1 + + include/linux/compat.h | 11 ++ + include/linux/syscalls.h | 4 + + include/uapi/asm-generic/unistd.h | 5 +- + kernel/futex2.c | 171 ++++++++++++++++++ + kernel/sys_ni.c | 1 + + tools/include/uapi/asm-generic/unistd.h | 5 +- + .../arch/x86/entry/syscalls/syscall_64.tbl | 1 + + 11 files changed, 200 insertions(+), 3 deletions(-) + +diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl +index 4eef220cd..6d0f6626a 100644 +--- a/arch/arm/tools/syscall.tbl ++++ b/arch/arm/tools/syscall.tbl +@@ -457,3 +457,4 @@ + 441 common epoll_pwait2 sys_epoll_pwait2 + 442 common futex_wait sys_futex_wait + 443 common futex_wake sys_futex_wake ++444 common futex_waitv sys_futex_waitv +diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h +index d1f7d35f9..64ebdc1ec 100644 +--- a/arch/arm64/include/asm/unistd.h ++++ b/arch/arm64/include/asm/unistd.h +@@ -38,7 +38,7 @@ + #define __ARM_NR_compat_set_tls (__ARM_NR_COMPAT_BASE + 5) + #define __ARM_NR_COMPAT_END (__ARM_NR_COMPAT_BASE + 0x800) + +-#define __NR_compat_syscalls 444 ++#define __NR_compat_syscalls 445 + #endif + + #define __ARCH_WANT_SYS_CLONE +diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl +index ece90c8d9..fe242fa0b 100644 +--- a/arch/x86/entry/syscalls/syscall_32.tbl ++++ b/arch/x86/entry/syscalls/syscall_32.tbl +@@ -448,3 +448,4 @@ + 441 i386 epoll_pwait2 sys_epoll_pwait2 
compat_sys_epoll_pwait2 + 442 i386 futex_wait sys_futex_wait + 443 i386 futex_wake sys_futex_wake ++444 i386 futex_waitv sys_futex_waitv compat_sys_futex_waitv +diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl +index 72fb65ef9..9d0f07e05 100644 +--- a/arch/x86/entry/syscalls/syscall_64.tbl ++++ b/arch/x86/entry/syscalls/syscall_64.tbl +@@ -365,6 +365,7 @@ + 441 common epoll_pwait2 sys_epoll_pwait2 + 442 common futex_wait sys_futex_wait + 443 common futex_wake sys_futex_wake ++444 common futex_waitv sys_futex_waitv + + # + # Due to a historical design error, certain syscalls are numbered differently +diff --git a/include/linux/compat.h b/include/linux/compat.h +index 6e65be753..041d18174 100644 +--- a/include/linux/compat.h ++++ b/include/linux/compat.h +@@ -365,6 +365,12 @@ struct compat_robust_list_head { + compat_uptr_t list_op_pending; + }; + ++struct compat_futex_waitv { ++ compat_uptr_t uaddr; ++ compat_uint_t val; ++ compat_uint_t flags; ++}; ++ + #ifdef CONFIG_COMPAT_OLD_SIGACTION + struct compat_old_sigaction { + compat_uptr_t sa_handler; +@@ -654,6 +660,11 @@ asmlinkage long + compat_sys_get_robust_list(int pid, compat_uptr_t __user *head_ptr, + compat_size_t __user *len_ptr); + ++/* kernel/futex2.c */ ++asmlinkage long compat_sys_futex_waitv(struct compat_futex_waitv *waiters, ++ compat_uint_t nr_futexes, compat_uint_t flags, ++ struct __kernel_timespec __user *timo); ++ + /* kernel/itimer.c */ + asmlinkage long compat_sys_getitimer(int which, + struct old_itimerval32 __user *it); +diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h +index bf146c2b0..7da1ceb36 100644 +--- a/include/linux/syscalls.h ++++ b/include/linux/syscalls.h +@@ -68,6 +68,7 @@ union bpf_attr; + struct io_uring_params; + struct clone_args; + struct open_how; ++struct futex_waitv; + + #include <linux/types.h> + #include <linux/aio_abi.h> +@@ -624,6 +625,9 @@ asmlinkage long sys_futex_wait(void __user *uaddr, unsigned int val, + 
struct __kernel_timespec __user __user *timo); + asmlinkage long sys_futex_wake(void __user *uaddr, unsigned int nr_wake, + unsigned int flags); ++asmlinkage long sys_futex_waitv(struct futex_waitv __user *waiters, ++ unsigned int nr_futexes, unsigned int flags, ++ struct __kernel_timespec __user *timo); --retry: -- bucket_inc_waiters(bucket); + /* kernel/hrtimer.c */ + asmlinkage long sys_nanosleep(struct __kernel_timespec __user *rqtp, +diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h +index 57e19200f..090da8e12 100644 +--- a/include/uapi/asm-generic/unistd.h ++++ b/include/uapi/asm-generic/unistd.h +@@ -868,8 +868,11 @@ __SYSCALL(__NR_futex_wait, sys_futex_wait) + #define __NR_futex_wake 443 + __SYSCALL(__NR_futex_wake, sys_futex_wake) + ++#define __NR_futex_waitv 444 ++__SC_COMP(__NR_futex_waitv, sys_futex_waitv, compat_sys_futex_waitv) ++ + #undef __NR_syscalls +-#define __NR_syscalls 444 ++#define __NR_syscalls 445 + + /* + * 32 bit systems traditionally used different +diff --git a/kernel/futex2.c b/kernel/futex2.c +index 27767b2d0..f3c2379ab 100644 +--- a/kernel/futex2.c ++++ b/kernel/futex2.c +@@ -713,6 +713,177 @@ SYSCALL_DEFINE4(futex_wait, void __user *, uaddr, unsigned int, val, + return futex_set_timer_and_wait(futexv, 1, timo, flags); + } + ++#ifdef CONFIG_COMPAT ++/** ++ * compat_futex_parse_waitv - Parse a waitv array from userspace ++ * @futexv: Kernel side list of waiters to be filled ++ * @uwaitv: Userspace list to be parsed ++ * @nr_futexes: Length of futexv ++ * ++ * Return: Error code on failure, pointer to a prepared futexv otherwise ++ */ ++static int compat_futex_parse_waitv(struct futexv_head *futexv, ++ struct compat_futex_waitv __user *uwaitv, ++ unsigned int nr_futexes) ++{ ++ struct futex_bucket *bucket; ++ struct compat_futex_waitv waitv; ++ unsigned int i; ++ ++ for (i = 0; i < nr_futexes; i++) { ++ if (copy_from_user(&waitv, &uwaitv[i], sizeof(waitv))) ++ return -EFAULT; ++ ++ if ((waitv.flags & 
~FUTEXV_WAITER_MASK) || ++ (waitv.flags & FUTEX_SIZE_MASK) != FUTEX_32) ++ return -EINVAL; ++ ++ futexv->objects[i].key.pointer = 0; ++ futexv->objects[i].flags = waitv.flags; ++ futexv->objects[i].uaddr = (uintptr_t)compat_ptr(waitv.uaddr); ++ futexv->objects[i].val = waitv.val; ++ futexv->objects[i].index = i; ++ ++ bucket = futex_get_bucket(compat_ptr(waitv.uaddr), ++ &futexv->objects[i].key, ++ is_object_shared); ++ ++ if (IS_ERR(bucket)) ++ return PTR_ERR(bucket); ++ ++ futexv->objects[i].bucket = bucket; ++ ++ INIT_LIST_HEAD(&futexv->objects[i].list); ++ } ++ ++ return 0; ++} ++ ++COMPAT_SYSCALL_DEFINE4(futex_waitv, struct compat_futex_waitv __user *, waiters, ++ unsigned int, nr_futexes, unsigned int, flags, ++ struct __kernel_timespec __user *, timo) ++{ ++ struct futexv_head *futexv; ++ int ret; ++ ++ if (flags & ~FUTEXV_MASK) ++ return -EINVAL; ++ ++ if (!nr_futexes || nr_futexes > FUTEX_WAITV_MAX || !waiters) ++ return -EINVAL; ++ ++ futexv = kmalloc((sizeof(struct futex_waiter) * nr_futexes) + ++ sizeof(*futexv), GFP_KERNEL); ++ if (!futexv) ++ return -ENOMEM; ++ ++ futexv->hint = false; ++ futexv->task = current; ++ ++ ret = compat_futex_parse_waitv(futexv, waiters, nr_futexes); ++ ++ if (!ret) ++ ret = futex_set_timer_and_wait(futexv, nr_futexes, timo, flags); ++ ++ kfree(futexv); ++ + return ret; +} - -- /* Compare the expected and current value, get the bucket lock */ -- ret = futex_get_user_value(bucket, uaddr, val, false); -- if (ret) { -- bucket_dec_waiters(bucket); -- goto out; -- } ++#endif ++ +/** + * futex_parse_waitv - Parse a waitv array from userspace -+ * @futexv: list of waiters -+ * @uwaitv: userspace list -+ * @nr_futexes: number of waiters in the list ++ * @futexv: Kernel side list of waiters to be filled ++ * @uwaitv: Userspace list to be parsed ++ * @nr_futexes: Length of futexv + * + * Return: Error code on failure, pointer to a prepared futexv otherwise + */ -+static int futex_parse_waitv(struct futexv *futexv, ++static int 
futex_parse_waitv(struct futexv_head *futexv, + struct futex_waitv __user *uwaitv, + unsigned int nr_futexes) +{ ++ struct futex_bucket *bucket; + struct futex_waitv waitv; + unsigned int i; -+ struct futex_bucket *bucket; - -- /* Add the waiter to the hash table and sleep */ -- set_current_state(TASK_INTERRUPTIBLE); -- list_add_tail(&waiter->list, &bucket->list); -- spin_unlock(&bucket->lock); ++ + for (i = 0; i < nr_futexes; i++) { + if (copy_from_user(&waitv, &uwaitv[i], sizeof(waitv))) + return -EFAULT; - -- /* Do not sleep if someone woke this futex or if it was timeouted */ -- if (!list_empty_careful(&waiter->list) && (!timo || timeout.task)) -- freezable_schedule(); ++ + if ((waitv.flags & ~FUTEXV_WAITER_MASK) || + (waitv.flags & FUTEX_SIZE_MASK) != FUTEX_32) + return -EINVAL; - -- __set_current_state(TASK_RUNNING); -+ bucket = futex_get_bucket(waitv.uaddr, -+ &futexv->objects[i].key); ++ ++ futexv->objects[i].key.pointer = 0; ++ futexv->objects[i].flags = waitv.flags; ++ futexv->objects[i].uaddr = (uintptr_t)waitv.uaddr; ++ futexv->objects[i].val = waitv.val; ++ futexv->objects[i].index = i; ++ ++ bucket = futex_get_bucket(waitv.uaddr, &futexv->objects[i].key, ++ is_object_shared); ++ + if (IS_ERR(bucket)) + return PTR_ERR(bucket); - -- /* -- * One of those things triggered this wake: -- * -- * * We have been removed from the bucket. futex_wake() woke us. We just -- * need to return 0 to userspace. -- * -- * However, if we find ourselves in the bucket we must remove ourselves -- * from the bucket and ... -- * -- * * If the there's a timeout and it has expired, return -ETIMEDOUT. -- * -- * * If there is a signal pending, something wants to kill our thread. -- * Return -ERESTARTSYS. -- * -- * * If there's no signal pending, it was a spurious wake (scheduler -- * gave us a change to do some work, even if we don't want to). We -- * need to remove ourselves from the bucket and add again, to prevent -- * losing wakeups in the meantime. 
-- */ ++ + futexv->objects[i].bucket = bucket; -+ futexv->objects[i].val = waitv.val; -+ futexv->objects[i].flags = waitv.flags; -+ futexv->objects[i].index = i; ++ + INIT_LIST_HEAD(&futexv->objects[i].list); + } - -- /* Normal wake */ -- if (list_empty_careful(&waiter->list)) -- goto out; ++ + return 0; +} - -- if (!futex_dequeue(bucket, waiter)) -- goto out; ++ +/** -+ * sys_futex_waitv - function -+ * @waiters: TODO -+ * @nr_futexes: TODO -+ * @flags: TODO -+ * @timo: TODO ++ * sys_futex_waitv - Wait on a list of futexes ++ * @waiters: List of futexes to wait on ++ * @nr_futexes: Length of futexv ++ * @flags: Flag for timeout (monotonic/realtime) ++ * @timo: Optional absolute timeout. ++ * ++ * Given an array of `struct futex_waitv`, wait on each uaddr. The thread wakes ++ * if a futex_wake() is performed at any uaddr. The syscall returns immediately ++ * if any waiter has *uaddr != val. *timo is an optional timeout value for the ++ * operation. Each waiter has individual flags. The `flags` argument for the ++ * syscall should be used solely for specifying the timeout as realtime, if ++ * needed. Flags for shared futexes, sizes, etc. should be used on the ++ * individual flags of each waiter. ++ * ++ * Returns the array index of one of the awaken futexes. There's no given ++ * information of how many were awakened, or any particular attribute of it (if ++ * it's the first awakened, if it is of the smaller index...). 
+ */ +SYSCALL_DEFINE4(futex_waitv, struct futex_waitv __user *, waiters, + unsigned int, nr_futexes, unsigned int, flags, + struct __kernel_timespec __user *, timo) +{ -+ struct hrtimer_sleeper timeout; -+ struct futexv *futexv; ++ struct futexv_head *futexv; + int ret; - -- /* Timeout */ -- if (timo && !timeout.task) -- return -ETIMEDOUT; ++ + if (flags & ~FUTEXV_MASK) + return -EINVAL; - -- /* Spurious wakeup */ -- if (!signal_pending(current)) -- goto retry; ++ + if (!nr_futexes || nr_futexes > FUTEX_WAITV_MAX || !waiters) + return -EINVAL; - -- /* Some signal is pending */ -- ret = -ERESTARTSYS; --out: -- if (timo) -- hrtimer_cancel(&timeout.timer); -+ futexv = kmalloc(sizeof(struct futexv) + -+ (sizeof(struct futex_waiter) * nr_futexes), -+ GFP_KERNEL); ++ ++ futexv = kmalloc((sizeof(struct futex_waiter) * nr_futexes) + ++ sizeof(*futexv), GFP_KERNEL); + if (!futexv) + return -ENOMEM; + @@ -1300,37 +1923,21 @@ index 107b80a46..4b782b5ef 100644 + + ret = futex_parse_waitv(futexv, waiters, nr_futexes); + if (!ret) -+ ret = futex_wait(futexv, nr_futexes, timo, &timeout, flags); ++ ret = futex_set_timer_and_wait(futexv, nr_futexes, timo, flags); + + kfree(futexv); - - return ret; - } - -+/** -+ * futex_get_parent - Get parent -+ * @waiter: TODO -+ * @index: TODO -+ * -+ * Return: TODO -+ */ - static struct futexv *futex_get_parent(uintptr_t waiter, u8 index) - { - uintptr_t parent = waiter - sizeof(struct futexv) -@@ -439,7 +603,7 @@ SYSCALL_DEFINE3(futex_wake, void __user *, uaddr, unsigned int, nr_wake, - struct futexv *parent = - futex_get_parent((uintptr_t) aux, aux->index); - -- parent->hint = 1; -+ parent->hint = true; - task = parent->task; - get_task_struct(task); - list_del_init_careful(&aux->list); ++ ++ return ret; ++} ++ + /** + * futex_get_parent - For a given futex in a futexv list, get a pointer to the futexv + * @waiter: Address of futex in the list diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c -index 35ff743b1..1898e7340 100644 +index 
27ef83ca8..977890c58 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c -@@ -151,6 +151,7 @@ COND_SYSCALL_COMPAT(get_robust_list); +@@ -153,6 +153,7 @@ COND_SYSCALL_COMPAT(get_robust_list); /* kernel/futex2.c */ COND_SYSCALL(futex_wait); COND_SYSCALL(futex_wake); @@ -1339,465 +1946,827 @@ index 35ff743b1..1898e7340 100644 /* kernel/hrtimer.c */ diff --git a/tools/include/uapi/asm-generic/unistd.h b/tools/include/uapi/asm-generic/unistd.h -index cd79f94e0..7de33be59 100644 +index 57e19200f..23febe59e 100644 --- a/tools/include/uapi/asm-generic/unistd.h +++ b/tools/include/uapi/asm-generic/unistd.h -@@ -866,8 +866,11 @@ __SYSCALL(__NR_futex_wait, sys_futex_wait) - #define __NR_futex_wake 442 +@@ -868,8 +868,11 @@ __SYSCALL(__NR_futex_wait, sys_futex_wait) + #define __NR_futex_wake 443 __SYSCALL(__NR_futex_wake, sys_futex_wake) -+#define __NR_futex_waitv 443 -+__SYSCALL(__NR_futex_waitv, sys_futex_waitv) ++#define __NR_futex_waitv 444 ++__SYSCALL(__NR_futex_wait, sys_futex_wait) + #undef __NR_syscalls --#define __NR_syscalls 443 -+#define __NR_syscalls 444 - +-#define __NR_syscalls 444 ++#define __NR_syscalls 445 /* + * 32 bit systems traditionally used different diff --git a/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl b/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl -index 47de3bf93..bd47f368f 100644 +index 15d2b89b6..820c1e4b1 100644 --- a/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl +++ b/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl -@@ -364,6 +364,7 @@ - 440 common process_madvise sys_process_madvise - 441 common futex_wait sys_futex_wait - 442 common futex_wake sys_futex_wake -+443 common futex_waitv sys_futex_waitv +@@ -365,6 +365,7 @@ + 441 common epoll_pwait2 sys_epoll_pwait2 + 442 common futex_wait sys_futex_wait + 443 common futex_wake sys_futex_wake ++444 common futex_waitv sys_futex_waitv # # Due to a historical design error, certain syscalls are numbered differently -- -2.29.2 +2.30.2 -From 24681616a5432f7680f934abf335a9ab9a1eaf1e Mon 
Sep 17 00:00:00 2001 +From e1198b0e26063ba40993154176b8232f646c3c4b Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Andr=C3=A9=20Almeida?= <andrealmeid@collabora.com> -Date: Thu, 15 Oct 2020 18:06:40 -0300 -Subject: [PATCH 3/9] futex2: Add support for shared futexes +Date: Fri, 5 Feb 2021 10:34:01 -0300 +Subject: [PATCH 04/13] futex2: Implement requeue operation MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit -Add support for shared futexes for cross-process resources. +Implement requeue interface similarly to FUTEX_CMP_REQUEUE operation. +This is the syscall implemented by this patch: + +futex_requeue(struct futex_requeue *uaddr1, struct futex_requeue *uaddr2, + unsigned int nr_wake, unsigned int nr_requeue, + unsigned int cmpval, unsigned int flags) + +struct futex_requeue { + void *uaddr; + unsigned int flags; +}; + +If (uaddr1->uaddr == cmpval), wake at uaddr1->uaddr a nr_wake number of +waiters and then, remove a number of nr_requeue waiters at uaddr1->uaddr +and add them to uaddr2->uaddr list. Each uaddr has its own set of flags, +that must be defined at struct futex_requeue (such as size, shared, NUMA). +The flags argument of the syscall is there just for the sake of +extensibility, and right now it needs to be zero. + +Return the number of the woken futexes + the number of requeued ones on +success, error code otherwise. Signed-off-by: André Almeida <andrealmeid@collabora.com> +--- + +The original FUTEX_CMP_REQUEUE interfaces is such as follows: + +futex(*uaddr1, FUTEX_CMP_REQUEUE, nr_wake, nr_requeue, *uaddr2, cmpval); + +Given that when this interface was created they was only one type of +futex (as opposed to futex2, where there is shared, sizes, and NUMA), +there was no way to specify individual flags for uaddr1 and 2. When +FUTEX_PRIVATE was implemented, a new opcode was created as well +(FUTEX_CMP_REQUEUE_PRIVATE), but they apply both futexes, so they +should be of the same type regarding private/shared. 
This imposes a +limitation on the use cases of the operation, and to overcome that at futex2, +`struct futex_requeue` was created, so one can set individual flags for +each futex. This flexibility is a trade-off with performance, given that +now we need to perform two extra copy_from_user(). One alternative would +be to use the upper half of flags bits to the first one, and the bottom +half for the second futex, but this would also impose limitations, given +that we would limit by half the flags possibilities. If equal futexes +are common enough, the following extension could be added to overcome +the current performance: + +- A flag FUTEX_REQUEUE_EQUAL is added to futex2() flags; +- If futex_requeue() see this flag, that means that both futexes uses + the same set of attributes. +- Then, the function parses the flags as of futex_wait/wake(). +- *uaddr1 and *uaddr2 are used as void* (instead of struct + futex_requeue) just like wait/wake(). + +In that way, we could avoid the copy_from_user(). 
+ Signed-off-by: Jan200101 <sentrycraft123@gmail.com> --- - kernel/futex2.c | 187 ++++++++++++++++++++++++++++++++++++++++++------ - 1 file changed, 165 insertions(+), 22 deletions(-) + arch/arm/tools/syscall.tbl | 1 + + arch/arm64/include/asm/unistd.h | 2 +- + arch/x86/entry/syscalls/syscall_32.tbl | 1 + + arch/x86/entry/syscalls/syscall_64.tbl | 1 + + include/linux/compat.h | 12 ++ + include/linux/syscalls.h | 5 + + include/uapi/asm-generic/unistd.h | 5 +- + kernel/futex2.c | 215 +++++++++++++++++++++++++ + kernel/sys_ni.c | 1 + + 9 files changed, 241 insertions(+), 2 deletions(-) +diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl +index 6d0f6626a..9aa108802 100644 +--- a/arch/arm/tools/syscall.tbl ++++ b/arch/arm/tools/syscall.tbl +@@ -458,3 +458,4 @@ + 442 common futex_wait sys_futex_wait + 443 common futex_wake sys_futex_wake + 444 common futex_waitv sys_futex_waitv ++445 common futex_requeue sys_futex_requeue +diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h +index 64ebdc1ec..d1cc2849d 100644 +--- a/arch/arm64/include/asm/unistd.h ++++ b/arch/arm64/include/asm/unistd.h +@@ -38,7 +38,7 @@ + #define __ARM_NR_compat_set_tls (__ARM_NR_COMPAT_BASE + 5) + #define __ARM_NR_COMPAT_END (__ARM_NR_COMPAT_BASE + 0x800) + +-#define __NR_compat_syscalls 445 ++#define __NR_compat_syscalls 446 + #endif + + #define __ARCH_WANT_SYS_CLONE +diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl +index fe242fa0b..0cd1df235 100644 +--- a/arch/x86/entry/syscalls/syscall_32.tbl ++++ b/arch/x86/entry/syscalls/syscall_32.tbl +@@ -449,3 +449,4 @@ + 442 i386 futex_wait sys_futex_wait + 443 i386 futex_wake sys_futex_wake + 444 i386 futex_waitv sys_futex_waitv compat_sys_futex_waitv ++445 i386 futex_requeue sys_futex_requeue compat_sys_futex_requeue +diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl +index 9d0f07e05..abbfddcdb 100644 +--- 
a/arch/x86/entry/syscalls/syscall_64.tbl ++++ b/arch/x86/entry/syscalls/syscall_64.tbl +@@ -366,6 +366,7 @@ + 442 common futex_wait sys_futex_wait + 443 common futex_wake sys_futex_wake + 444 common futex_waitv sys_futex_waitv ++445 common futex_requeue sys_futex_requeue + + # + # Due to a historical design error, certain syscalls are numbered differently +diff --git a/include/linux/compat.h b/include/linux/compat.h +index 041d18174..d4c1b402b 100644 +--- a/include/linux/compat.h ++++ b/include/linux/compat.h +@@ -371,6 +371,11 @@ struct compat_futex_waitv { + compat_uint_t flags; + }; + ++struct compat_futex_requeue { ++ compat_uptr_t uaddr; ++ compat_uint_t flags; ++}; ++ + #ifdef CONFIG_COMPAT_OLD_SIGACTION + struct compat_old_sigaction { + compat_uptr_t sa_handler; +@@ -665,6 +670,13 @@ asmlinkage long compat_sys_futex_waitv(struct compat_futex_waitv *waiters, + compat_uint_t nr_futexes, compat_uint_t flags, + struct __kernel_timespec __user *timo); + ++asmlinkage long compat_sys_futex_requeue(struct compat_futex_requeue *uaddr1, ++ struct compat_futex_requeue *uaddr2, ++ compat_uint_t nr_wake, ++ compat_uint_t nr_requeue, ++ compat_uint_t cmpval, ++ compat_uint_t flags); ++ + /* kernel/itimer.c */ + asmlinkage long compat_sys_getitimer(int which, + struct old_itimerval32 __user *it); +diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h +index 7da1ceb36..06823bc7e 100644 +--- a/include/linux/syscalls.h ++++ b/include/linux/syscalls.h +@@ -69,6 +69,7 @@ struct io_uring_params; + struct clone_args; + struct open_how; + struct futex_waitv; ++struct futex_requeue; + + #include <linux/types.h> + #include <linux/aio_abi.h> +@@ -628,6 +629,10 @@ asmlinkage long sys_futex_wake(void __user *uaddr, unsigned int nr_wake, + asmlinkage long sys_futex_waitv(struct futex_waitv __user *waiters, + unsigned int nr_futexes, unsigned int flags, + struct __kernel_timespec __user *timo); ++asmlinkage long sys_futex_requeue(struct futex_requeue __user *uaddr1, ++ struct 
futex_requeue __user *uaddr2, ++ unsigned int nr_wake, unsigned int nr_requeue, ++ unsigned int cmpval, unsigned int flags); + + /* kernel/hrtimer.c */ + asmlinkage long sys_nanosleep(struct __kernel_timespec __user *rqtp, +diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h +index 090da8e12..095c10a83 100644 +--- a/include/uapi/asm-generic/unistd.h ++++ b/include/uapi/asm-generic/unistd.h +@@ -871,8 +871,11 @@ __SYSCALL(__NR_futex_wake, sys_futex_wake) + #define __NR_futex_waitv 444 + __SC_COMP(__NR_futex_waitv, sys_futex_waitv, compat_sys_futex_waitv) + ++#define __NR_futex_requeue 445 ++__SC_COMP(__NR_futex_requeue, sys_futex_requeue, compat_sys_futex_requeue) ++ + #undef __NR_syscalls +-#define __NR_syscalls 445 ++#define __NR_syscalls 446 + + /* + * 32 bit systems traditionally used different diff --git a/kernel/futex2.c b/kernel/futex2.c -index 4b782b5ef..5ddb9922d 100644 +index f3c2379ab..bad8c183c 100644 --- a/kernel/futex2.c +++ b/kernel/futex2.c -@@ -6,7 +6,9 @@ - */ - - #include <linux/freezer.h> -+#include <linux/hugetlb.h> - #include <linux/jhash.h> -+#include <linux/pagemap.h> - #include <linux/sched/wake_q.h> - #include <linux/spinlock.h> - #include <linux/syscalls.h> -@@ -15,6 +17,7 @@ - - /** - * struct futex_waiter - List entry for a waiter -+ * @uaddr: Memory address of userspace futex - * @key.address: Memory address of userspace futex - * @key.mm: Pointer to memory management struct of this process - * @key: Stores information that uniquely identify a futex -@@ -25,9 +28,11 @@ - * @index: Index of waiter in futexv list - */ - struct futex_waiter { -+ uintptr_t uaddr; - struct futex_key { - uintptr_t address; - struct mm_struct *mm; -+ unsigned long int offset; - } key; - struct list_head list; - unsigned int val; -@@ -125,16 +130,116 @@ static inline int bucket_get_waiters(struct futex_bucket *bucket) - #endif +@@ -977,6 +977,221 @@ SYSCALL_DEFINE3(futex_wake, void __user *, uaddr, unsigned int, nr_wake, + return 
ret; } -+static u64 get_inode_sequence_number(struct inode *inode) ++static void futex_double_unlock(struct futex_bucket *b1, struct futex_bucket *b2) +{ -+ static atomic64_t i_seq; -+ u64 old; ++ spin_unlock(&b1->lock); ++ if (b1 != b2) ++ spin_unlock(&b2->lock); ++} + -+ /* Does the inode already have a sequence number? */ -+ old = atomic64_read(&inode->i_sequence); -+ if (likely(old)) -+ return old; ++static inline int __futex_requeue(struct futex_requeue rq1, ++ struct futex_requeue rq2, unsigned int nr_wake, ++ unsigned int nr_requeue, unsigned int cmpval, ++ bool shared1, bool shared2) ++{ ++ struct futex_waiter w1, w2, *aux, *tmp; ++ bool retry = false; ++ struct futex_bucket *b1, *b2; ++ DEFINE_WAKE_Q(wake_q); ++ u32 uval; ++ int ret; + -+ for (;;) { -+ u64 new = atomic64_add_return(1, &i_seq); -+ if (WARN_ON_ONCE(!new)) -+ continue; ++ b1 = futex_get_bucket(rq1.uaddr, &w1.key, shared1); ++ if (IS_ERR(b1)) ++ return PTR_ERR(b1); + -+ old = atomic64_cmpxchg_relaxed(&inode->i_sequence, 0, new); -+ if (old) -+ return old; -+ return new; ++ b2 = futex_get_bucket(rq2.uaddr, &w2.key, shared2); ++ if (IS_ERR(b2)) ++ return PTR_ERR(b2); ++ ++retry: ++ if (shared1 && retry) { ++ b1 = futex_get_bucket(rq1.uaddr, &w1.key, shared1); ++ if (IS_ERR(b1)) ++ return PTR_ERR(b1); + } -+} + -+#define FUT_OFF_INODE 1 /* We set bit 0 if key has a reference on inode */ -+#define FUT_OFF_MMSHARED 2 /* We set bit 1 if key has a reference on mm */ ++ if (shared2 && retry) { ++ b2 = futex_get_bucket(rq2.uaddr, &w2.key, shared2); ++ if (IS_ERR(b2)) ++ return PTR_ERR(b2); ++ } + -+static int futex_get_shared_key(uintptr_t address, struct mm_struct *mm, -+ struct futex_key *key) -+{ -+ int err; -+ struct page *page, *tail; -+ struct address_space *mapping; ++ bucket_inc_waiters(b2); ++ /* ++ * To ensure the locks are taken in the same order for all threads (and ++ * thus avoiding deadlocks), take the "smaller" one first ++ */ ++ if (b1 <= b2) { ++ spin_lock(&b1->lock); ++ if (b1 < b2) 
++ spin_lock_nested(&b2->lock, SINGLE_DEPTH_NESTING); ++ } else { ++ spin_lock(&b2->lock); ++ spin_lock_nested(&b1->lock, SINGLE_DEPTH_NESTING); ++ } + -+again: -+ err = get_user_pages_fast(address, 1, 0, &page); ++ ret = futex_get_user(&uval, rq1.uaddr); + -+ if (err < 0) -+ return err; -+ else -+ err = 0; ++ if (unlikely(ret)) { ++ futex_double_unlock(b1, b2); ++ if (__get_user(uval, (u32 * __user)rq1.uaddr)) ++ return -EFAULT; + ++ bucket_dec_waiters(b2); ++ retry = true; ++ goto retry; ++ } + -+ tail = page; -+ page = compound_head(page); -+ mapping = READ_ONCE(page->mapping); ++ if (uval != cmpval) { ++ futex_double_unlock(b1, b2); + ++ bucket_dec_waiters(b2); ++ return -EAGAIN; ++ } + -+ if (unlikely(!mapping)) { -+ int shmem_swizzled; ++ list_for_each_entry_safe(aux, tmp, &b1->list, list) { ++ if (futex_match(w1.key, aux->key)) { ++ if (ret < nr_wake) { ++ futex_mark_wake(aux, b1, &wake_q); ++ ret++; ++ continue; ++ } + -+ lock_page(page); -+ shmem_swizzled = PageSwapCache(page) || page->mapping; -+ unlock_page(page); -+ put_page(page); ++ if (ret >= nr_wake + nr_requeue) ++ break; + -+ if (shmem_swizzled) -+ goto again; ++ aux->key.pointer = w2.key.pointer; ++ aux->key.index = w2.key.index; ++ aux->key.offset = w2.key.offset; + -+ return -EFAULT; ++ if (b1 != b2) { ++ list_del_init_careful(&aux->list); ++ bucket_dec_waiters(b1); ++ ++ list_add_tail(&aux->list, &b2->list); ++ bucket_inc_waiters(b2); ++ } ++ ret++; ++ } + } + -+ if (PageAnon(page)) { ++ futex_double_unlock(b1, b2); ++ wake_up_q(&wake_q); ++ bucket_dec_waiters(b2); + -+ key->mm = mm; -+ key->address = address; ++ return ret; ++} + -+ key->offset |= FUT_OFF_MMSHARED; ++#ifdef CONFIG_COMPAT ++static int compat_futex_parse_requeue(struct futex_requeue *rq, ++ struct compat_futex_requeue __user *uaddr, ++ bool *shared) ++{ ++ struct compat_futex_requeue tmp; + -+ } else { -+ struct inode *inode; ++ if (copy_from_user(&tmp, uaddr, sizeof(tmp))) ++ return -EFAULT; + -+ rcu_read_lock(); ++ if 
(tmp.flags & ~FUTEXV_WAITER_MASK || ++ (tmp.flags & FUTEX_SIZE_MASK) != FUTEX_32) ++ return -EINVAL; + -+ if (READ_ONCE(page->mapping) != mapping) { -+ rcu_read_unlock(); -+ put_page(page); ++ *shared = (tmp.flags & FUTEX_SHARED_FLAG) ? true : false; + -+ goto again; -+ } ++ rq->uaddr = compat_ptr(tmp.uaddr); ++ rq->flags = tmp.flags; + -+ inode = READ_ONCE(mapping->host); -+ if (!inode) { -+ rcu_read_unlock(); -+ put_page(page); ++ return 0; ++} + -+ goto again; -+ } ++COMPAT_SYSCALL_DEFINE6(futex_requeue, struct compat_futex_requeue __user *, uaddr1, ++ struct compat_futex_requeue __user *, uaddr2, ++ unsigned int, nr_wake, unsigned int, nr_requeue, ++ unsigned int, cmpval, unsigned int, flags) ++{ ++ struct futex_requeue rq1, rq2; ++ bool shared1, shared2; ++ int ret; + -+ key->address = get_inode_sequence_number(inode); -+ key->mm = (struct mm_struct *) basepage_index(tail); -+ key->offset |= FUT_OFF_INODE; ++ if (flags) ++ return -EINVAL; + -+ rcu_read_unlock(); -+ } ++ ret = compat_futex_parse_requeue(&rq1, uaddr1, &shared1); ++ if (ret) ++ return ret; + -+ put_page(page); -+ return err; ++ ret = compat_futex_parse_requeue(&rq2, uaddr2, &shared2); ++ if (ret) ++ return ret; ++ ++ return __futex_requeue(rq1, rq2, nr_wake, nr_requeue, cmpval, shared1, shared2); +} ++#endif + - /** - * futex_get_bucket - Check if the user address is valid, prepare internal - * data and calculate the hash - * @uaddr: futex user address - * @key: data that uniquely identifies a futex -+ * @shared: is this a shared futex? 
- * - * Return: address of bucket on success, error code otherwise - */ - static struct futex_bucket *futex_get_bucket(void __user *uaddr, -- struct futex_key *key) -+ struct futex_key *key, -+ bool shared) - { - uintptr_t address = (uintptr_t) uaddr; - u32 hash_key; -@@ -145,8 +250,15 @@ static struct futex_bucket *futex_get_bucket(void __user *uaddr, - if (unlikely(!access_ok(address, sizeof(u32)))) - return ERR_PTR(-EFAULT); - -- key->address = address; -- key->mm = current->mm; -+ key->offset = address % PAGE_SIZE; -+ address -= key->offset; ++/** ++ * futex_parse_requeue - Copy a user struct futex_requeue and check it's flags ++ * @rq: Kernel struct ++ * @uaddr: Address of user struct ++ * @shared: Out parameter, defines if this is a shared futex ++ * ++ * Return: 0 on success, error code otherwise ++ */ ++static int futex_parse_requeue(struct futex_requeue *rq, ++ struct futex_requeue __user *uaddr, bool *shared) ++{ ++ if (copy_from_user(rq, uaddr, sizeof(*rq))) ++ return -EFAULT; + -+ if (!shared) { -+ key->address = address; -+ key->mm = current->mm; -+ } else { -+ futex_get_shared_key(address, current->mm, key); -+ } - - /* Generate hash key for this futex using uaddr and current->mm */ - hash_key = jhash2((u32 *) key, sizeof(*key) / sizeof(u32), 0); -@@ -275,9 +387,10 @@ static int futex_dequeue_multiple(struct futexv *futexv, unsigned int nr) - * Return: 0 on success, error code otherwise - */ - static int futex_enqueue(struct futexv *futexv, unsigned int nr_futexes, -- unsigned int *awaken) -+ int *awaken) - { - int i, ret; -+ bool shared, retry = false; - u32 uval, *uaddr, val; - struct futex_bucket *bucket; - -@@ -285,8 +398,18 @@ static int futex_enqueue(struct futexv *futexv, unsigned int nr_futexes, - set_current_state(TASK_INTERRUPTIBLE); - - for (i = 0; i < nr_futexes; i++) { -- uaddr = (u32 * __user) futexv->objects[i].key.address; -+ uaddr = (u32 * __user) futexv->objects[i].uaddr; - val = (u32) futexv->objects[i].val; -+ shared = 
(futexv->objects[i].flags & FUTEX_SHARED_FLAG) ? true : false; ++ if (rq->flags & ~FUTEXV_WAITER_MASK || ++ (rq->flags & FUTEX_SIZE_MASK) != FUTEX_32) ++ return -EINVAL; + -+ if (shared && retry) { -+ futexv->objects[i].bucket = -+ futex_get_bucket((void *) uaddr, -+ &futexv->objects[i].key, true); -+ if (IS_ERR(futexv->objects[i].bucket)) -+ return PTR_ERR(futexv->objects[i].bucket); -+ } ++ *shared = (rq->flags & FUTEX_SHARED_FLAG) ? true : false; + - bucket = futexv->objects[i].bucket; - - bucket_inc_waiters(bucket); -@@ -301,24 +424,32 @@ static int futex_enqueue(struct futexv *futexv, unsigned int nr_futexes, - __set_current_state(TASK_RUNNING); - *awaken = futex_dequeue_multiple(futexv, i); - -+ if (shared) { -+ retry = true; -+ goto retry; -+ } ++ return 0; ++} + - if (__get_user(uval, uaddr)) - return -EFAULT; - - if (*awaken >= 0) -- return 0; -+ return 1; - -+ retry = true; - goto retry; - } - - if (uval != val) { - spin_unlock(&bucket->lock); - ++/** ++ * sys_futex_requeue - Wake futexes at uaddr1 and requeue from uaddr1 to uaddr2 ++ * @uaddr1: Address of futexes to be waken/dequeued ++ * @uaddr2: Address for the futexes to be enqueued ++ * @nr_wake: Number of futexes waiting in uaddr1 to be woken up ++ * @nr_requeue: Number of futexes to be requeued from uaddr1 to uaddr2 ++ * @cmpval: Expected value at uaddr1 ++ * @flags: Reserved flags arg for requeue operation expansion. Must be 0. ++ * ++ * If (uaddr1->uaddr == cmpval), wake at uaddr1->uaddr a nr_wake number of ++ * waiters and then, remove a number of nr_requeue waiters at uaddr1->uaddr ++ * and add then to uaddr2->uaddr list. Each uaddr has its own set of flags, ++ * that must be defined at struct futex_requeue (such as size, shared, NUMA). ++ * ++ * Return the number of the woken futexes + the number of requeued ones on ++ * success, error code otherwise. 
++ */ ++SYSCALL_DEFINE6(futex_requeue, struct futex_requeue __user *, uaddr1, ++ struct futex_requeue __user *, uaddr2, ++ unsigned int, nr_wake, unsigned int, nr_requeue, ++ unsigned int, cmpval, unsigned int, flags) ++{ ++ struct futex_requeue rq1, rq2; ++ bool shared1, shared2; ++ int ret; + - bucket_dec_waiters(bucket); - __set_current_state(TASK_RUNNING); - *awaken = futex_dequeue_multiple(futexv, i); - -- if (*awaken >= 0) -- return 0; -+ if (*awaken >= 0) { -+ return 1; -+ } - - return -EWOULDBLOCK; - } -@@ -336,19 +467,18 @@ static int __futex_wait(struct futexv *futexv, - struct hrtimer_sleeper *timeout) - { - int ret; -- unsigned int awaken = -1; - -- while (1) { -- ret = futex_enqueue(futexv, nr_futexes, &awaken); - -- if (ret < 0) -- break; -+ while (1) { -+ int awaken = -1; - -- if (awaken <= 0) { -- return awaken; -+ ret = futex_enqueue(futexv, nr_futexes, &awaken); -+ if (ret) { -+ if (awaken >= 0) -+ return awaken; -+ return ret; - } - -- - /* Before sleeping, check if someone was woken */ - if (!futexv->hint && (!timeout || timeout->task)) - freezable_schedule(); -@@ -419,6 +549,7 @@ static int futex_wait(struct futexv *futexv, unsigned int nr_futexes, - hrtimer_sleeper_start_expires(timeout, HRTIMER_MODE_ABS); - } - ++ if (flags) ++ return -EINVAL; + - ret = __futex_wait(futexv, nr_futexes, timo ? timeout : NULL); - - -@@ -438,9 +569,10 @@ static int futex_wait(struct futexv *futexv, unsigned int nr_futexes, - SYSCALL_DEFINE4(futex_wait, void __user *, uaddr, unsigned int, val, - unsigned int, flags, struct __kernel_timespec __user *, timo) ++ ret = futex_parse_requeue(&rq1, uaddr1, &shared1); ++ if (ret) ++ return ret; ++ ++ ret = futex_parse_requeue(&rq2, uaddr2, &shared2); ++ if (ret) ++ return ret; ++ ++ return __futex_requeue(rq1, rq2, nr_wake, nr_requeue, cmpval, shared1, shared2); ++} ++ + static int __init futex2_init(void) { -+ bool shared = (flags & FUTEX_SHARED_FLAG) ? 
true : false; - unsigned int size = flags & FUTEX_SIZE_MASK; -- struct hrtimer_sleeper timeout; - struct futex_single_waiter wait_single; -+ struct hrtimer_sleeper timeout; - struct futex_waiter *waiter; - struct futexv *futexv; - int ret; -@@ -452,6 +584,7 @@ SYSCALL_DEFINE4(futex_wait, void __user *, uaddr, unsigned int, val, - waiter = &wait_single.waiter; - waiter->index = 0; - waiter->val = val; -+ waiter->uaddr = (uintptr_t) uaddr; - - INIT_LIST_HEAD(&waiter->list); - -@@ -462,11 +595,14 @@ SYSCALL_DEFINE4(futex_wait, void __user *, uaddr, unsigned int, val, - return -EINVAL; - - /* Get an unlocked hash bucket */ -- waiter->bucket = futex_get_bucket(uaddr, &waiter->key); -- if (IS_ERR(waiter->bucket)) -+ waiter->bucket = futex_get_bucket(uaddr, &waiter->key, shared); -+ if (IS_ERR(waiter->bucket)) { - return PTR_ERR(waiter->bucket); -+ } + int i; +diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c +index 977890c58..1750dfc41 100644 +--- a/kernel/sys_ni.c ++++ b/kernel/sys_ni.c +@@ -154,6 +154,7 @@ COND_SYSCALL_COMPAT(get_robust_list); + COND_SYSCALL(futex_wait); + COND_SYSCALL(futex_wake); + COND_SYSCALL(futex_waitv); ++COND_SYSCALL(futex_requeue); - ret = futex_wait(futexv, 1, timo, &timeout, flags); -+ if (ret > 0) -+ ret = 0; + /* kernel/hrtimer.c */ - return ret; - } -@@ -486,8 +622,10 @@ static int futex_parse_waitv(struct futexv *futexv, - struct futex_waitv waitv; - unsigned int i; - struct futex_bucket *bucket; -+ bool shared; +-- +2.30.2 + + +From 9ef45e80251029ad164b538b20f0d68a9b75865c Mon Sep 17 00:00:00 2001 +From: =?UTF-8?q?Andr=C3=A9=20Almeida?= <andrealmeid@collabora.com> +Date: Thu, 11 Feb 2021 10:47:23 -0300 +Subject: [PATCH 05/13] futex2: Add compatibility entry point for x86_x32 ABI +MIME-Version: 1.0 +Content-Type: text/plain; charset=UTF-8 +Content-Transfer-Encoding: 8bit + +New syscalls should use the same entry point for x86_64 and x86_x32 +paths. Add a wrapper for x32 calls to use parse functions that assumes +32bit pointers. 
+ +Signed-off-by: André Almeida <andrealmeid@collabora.com> +Signed-off-by: Jan200101 <sentrycraft123@gmail.com> +--- + kernel/futex2.c | 42 +++++++++++++++++++++++++++++++++++------- + 1 file changed, 35 insertions(+), 7 deletions(-) + +diff --git a/kernel/futex2.c b/kernel/futex2.c +index bad8c183c..8a8b45f98 100644 +--- a/kernel/futex2.c ++++ b/kernel/futex2.c +@@ -23,6 +23,10 @@ + #include <linux/syscalls.h> + #include <uapi/linux/futex.h> - for (i = 0; i < nr_futexes; i++) { ++#ifdef CONFIG_X86_64 ++#include <linux/compat.h> ++#endif + - if (copy_from_user(&waitv, &uwaitv[i], sizeof(waitv))) - return -EFAULT; - -@@ -495,8 +633,10 @@ static int futex_parse_waitv(struct futexv *futexv, - (waitv.flags & FUTEX_SIZE_MASK) != FUTEX_32) - return -EINVAL; - -+ shared = (waitv.flags & FUTEX_SHARED_FLAG) ? true : false; + /** + * struct futex_key - Components to build unique key for a futex + * @pointer: Pointer to current->mm or inode's UUID for file backed futexes +@@ -875,7 +879,16 @@ SYSCALL_DEFINE4(futex_waitv, struct futex_waitv __user *, waiters, + futexv->hint = false; + futexv->task = current; + +- ret = futex_parse_waitv(futexv, waiters, nr_futexes); ++#ifdef CONFIG_X86_X32_ABI ++ if (in_x32_syscall()) { ++ ret = compat_futex_parse_waitv(futexv, (struct compat_futex_waitv *)waiters, ++ nr_futexes); ++ } else ++#endif ++ { ++ ret = futex_parse_waitv(futexv, waiters, nr_futexes); ++ } + - bucket = futex_get_bucket(waitv.uaddr, -- &futexv->objects[i].key); -+ &futexv->objects[i].key, shared); - if (IS_ERR(bucket)) - return PTR_ERR(bucket); - -@@ -505,6 +645,7 @@ static int futex_parse_waitv(struct futexv *futexv, - futexv->objects[i].flags = waitv.flags; - futexv->objects[i].index = i; - INIT_LIST_HEAD(&futexv->objects[i].list); -+ futexv->objects[i].uaddr = (uintptr_t) waitv.uaddr; - } + if (!ret) + ret = futex_set_timer_and_wait(futexv, nr_futexes, timo, flags); - return 0; -@@ -573,6 +714,7 @@ static struct futexv *futex_get_parent(uintptr_t waiter, u8 index) 
- SYSCALL_DEFINE3(futex_wake, void __user *, uaddr, unsigned int, nr_wake, - unsigned int, flags) - { -+ bool shared = (flags & FUTEX_SHARED_FLAG) ? true : false; - unsigned int size = flags & FUTEX_SIZE_MASK; - struct futex_waiter waiter, *aux, *tmp; - struct futex_bucket *bucket; -@@ -586,7 +728,7 @@ SYSCALL_DEFINE3(futex_wake, void __user *, uaddr, unsigned int, nr_wake, - if (size != FUTEX_32) +@@ -1181,13 +1194,28 @@ SYSCALL_DEFINE6(futex_requeue, struct futex_requeue __user *, uaddr1, + if (flags) return -EINVAL; -- bucket = futex_get_bucket(uaddr, &waiter.key); -+ bucket = futex_get_bucket(uaddr, &waiter.key, shared); - if (IS_ERR(bucket)) - return PTR_ERR(bucket); +- ret = futex_parse_requeue(&rq1, uaddr1, &shared1); +- if (ret) +- return ret; ++#ifdef CONFIG_X86_X32_ABI ++ if (in_x32_syscall()) { ++ ret = compat_futex_parse_requeue(&rq1, (struct compat_futex_requeue *)uaddr1, ++ &shared1); ++ if (ret) ++ return ret; -@@ -599,7 +741,8 @@ SYSCALL_DEFINE3(futex_wake, void __user *, uaddr, unsigned int, nr_wake, - break; +- ret = futex_parse_requeue(&rq2, uaddr2, &shared2); +- if (ret) +- return ret; ++ ret = compat_futex_parse_requeue(&rq2, (struct compat_futex_requeue *)uaddr2, ++ &shared2); ++ if (ret) ++ return ret; ++ } else ++#endif ++ { ++ ret = futex_parse_requeue(&rq1, uaddr1, &shared1); ++ if (ret) ++ return ret; ++ ++ ret = futex_parse_requeue(&rq2, uaddr2, &shared2); ++ if (ret) ++ return ret; ++ } - if (waiter.key.address == aux->key.address && -- waiter.key.mm == aux->key.mm) { -+ waiter.key.mm == aux->key.mm && -+ waiter.key.offset == aux->key.offset) { - struct futexv *parent = - futex_get_parent((uintptr_t) aux, aux->index); + return __futex_requeue(rq1, rq2, nr_wake, nr_requeue, cmpval, shared1, shared2); + } +-- +2.30.2 + + +From 80944da5db0f1e00d0bf174d85f74ae4df2444aa Mon Sep 17 00:00:00 2001 +From: =?UTF-8?q?Andr=C3=A9=20Almeida?= <andrealmeid@collabora.com> +Date: Tue, 9 Feb 2021 13:59:00 -0300 +Subject: [PATCH 06/13] docs: locking: 
futex2: Add documentation +MIME-Version: 1.0 +Content-Type: text/plain; charset=UTF-8 +Content-Transfer-Encoding: 8bit + +Add a new documentation file specifying both userspace API and internal +implementation details of futex2 syscalls. + +Signed-off-by: André Almeida <andrealmeid@collabora.com> +Signed-off-by: Jan200101 <sentrycraft123@gmail.com> +--- + Documentation/locking/futex2.rst | 198 +++++++++++++++++++++++++++++++ + Documentation/locking/index.rst | 1 + + 2 files changed, 199 insertions(+) + create mode 100644 Documentation/locking/futex2.rst + +diff --git a/Documentation/locking/futex2.rst b/Documentation/locking/futex2.rst +new file mode 100644 +index 000000000..edd47c22f +--- /dev/null ++++ b/Documentation/locking/futex2.rst +@@ -0,0 +1,198 @@ ++.. SPDX-License-Identifier: GPL-2.0 ++ ++====== ++futex2 ++====== ++ ++:Author: André Almeida <andrealmeid@collabora.com> ++ ++futex, or fast user mutex, is a set of syscalls to allow the userspace to create ++performant synchronization mechanisms, such as mutexes, semaphores and ++conditional variables in userspace. C standard libraries, like glibc, uses it ++as means to implements more high level interfaces like pthreads. ++ ++The interface ++============= ++ ++uAPI functions ++-------------- ++ ++.. kernel-doc:: kernel/futex2.c ++ :identifiers: sys_futex_wait sys_futex_wake sys_futex_waitv sys_futex_requeue ++ ++uAPI structures ++--------------- ++ ++.. kernel-doc:: include/uapi/linux/futex.h ++ ++The ``flag`` argument ++--------------------- ++ ++The flag is used to specify the size of the futex word ++(FUTEX_[8, 16, 32]). It's mandatory to define one, since there's no ++default size. ++ ++By default, the timeout uses a monotonic clock, but can be used as a realtime ++one by using the FUTEX_REALTIME_CLOCK flag. ++ ++By default, futexes are of the private type, that means that this user address ++will be accessed by threads that shares the same memory region. 
This allows for ++some internal optimizations, so they are faster. However, if the address needs ++to be shared with different processes (like using ``mmap()`` or ``shm()``), they ++need to be defined as shared and the flag FUTEX_SHARED_FLAG is used to set that. ++ ++By default, the operation has no NUMA-awareness, meaning that the user can't ++choose the memory node where the kernel side futex data will be stored. The ++user can choose the node where it wants to operate by setting the ++FUTEX_NUMA_FLAG and using the following structure (where X can be 8, 16, or ++32):: ++ ++ struct futexX_numa { ++ __uX value; ++ __sX hint; ++ }; ++ ++This structure should be passed at the ``void *uaddr`` of futex functions. The ++address of the structure will be used to be waited on/woken on, and the ++``value`` will be compared to ``val`` as usual. The ``hint`` member is used to ++define which node the futex will use. When waiting, the futex will be ++registered on a kernel-side table stored on that node; when waking, the futex ++will be searched for on that given table. That means that there's no redundancy ++between tables, and a wrong ``hint`` value will lead to undesired behavior. ++Userspace is responsible for dealing with node migration issues that may ++occur. ``hint`` can range from [0, MAX_NUMA_NODES], for specifying a node, or ++-1, to use the same node the current process is using. ++ ++When not using FUTEX_NUMA_FLAG on a NUMA system, the futex will be stored on a ++global table on some node, defined at compilation time. ++ ++The ``timo`` argument ++--------------------- ++ ++As per the Y2038 work done in the kernel, new interfaces shouldn't add timeout ++options known to be buggy. Given that, ``timo`` should be a 64-bit timeout on ++all platforms, using an absolute timeout value. ++ ++Implementation ++============== ++ ++The internal implementation follows a similar design to the original futex.
++Given that we want to replicate the same external behavior of current futex, ++this should be somewhat expected. ++ ++Waiting ++------- ++ ++For the wait operations, they are all treated as if you want to wait on N ++futexes, so the path for futex_wait and futex_waitv is basically the same. ++For both syscalls, the first step is to prepare an internal list for the list ++of futexes to wait for (using struct futexv_head). For futex_wait() calls, this ++list will have a single object. ++ ++We have a hash table, where waiters register themselves before sleeping. Then, ++the wake function checks this table looking for waiters at uaddr. The hash ++bucket to be used is determined by a struct futex_key, that stores information ++to uniquely identify an address from a given process. Given the huge address ++space, there'll be hash collisions, so we store information to be later used on ++collision treatment. ++ ++First, for every futex we want to wait on, we check if (``*uaddr == val``). ++This check is done holding the bucket lock, so we are correctly serialized with ++any futex_wake() calls. If any waiter fails the check above, we dequeue all ++futexes. The check (``*uaddr == val``) can fail for two reasons: ++ ++- The values are different, and we return -EAGAIN. However, if while ++ dequeueing we found that some futexes were awakened, we prioritize this ++ and return success. ++ ++- When trying to access the user address, we do so with page faults ++ disabled because we are holding a bucket's spin lock (and can't sleep ++ while holding a spin lock). If there's an error, it might be a page ++ fault, or an invalid address. We release the lock, dequeue everyone ++ (because it's illegal to sleep while there are futexes enqueued, we ++ could lose wakeups) and try again with page faults enabled. If we ++ succeed, this means that the address is valid, but we need to do ++ all the work again.
For serialization reasons, we need to have the ++ spin lock when getting the user value. Additionally, for shared ++ futexes, we also need to recalculate the hash, since the underlying ++ mapping mechanisms could have changed when dealing with page fault. ++ If, even with page faults enabled, we can't access the address, it ++ means it's an invalid user address, and we return -EFAULT. For this ++ case, we prioritize the error, even if some futexes were awakened. ++ ++If the check is OK, they are enqueued on a linked list in our bucket, and ++proceed to the next one. If all waiters succeed, we put the thread to sleep ++until a futex_wake() call, timeout expires or we get a signal. After waking up, ++we dequeue everyone, and check if some futex was awakened. This dequeue is done by ++iteratively walking at each element of struct futex_head list. ++ ++All enqueuing/dequeuing operations require holding the bucket lock, to avoid ++racing while modifying the list. ++ ++Waking ++------ ++ ++We get the bucket that's storing the waiters at uaddr, and wake the required ++number of waiters, checking for hash collision. ++ ++There's an optimization that makes futex_wake() skip taking the bucket lock if ++there's no one to be woken on that bucket. It checks an atomic counter that each ++bucket has; if it says 0, then the syscall exits. In order for this to work, the ++waiter thread increases it before taking the lock, so the wake thread will ++correctly see that there's someone waiting and will continue the path to take ++the bucket lock. To get the correct serialization, the waiter issues a memory ++barrier after increasing the bucket counter and the waker issues a memory ++barrier before checking it. ++ ++Requeuing ++--------- ++ ++The requeue path first checks for each struct futex_requeue and their flags. ++Then, it will compare the expected value with the one at uaddr1::uaddr.
++Following the same serialization explained at Waking_, we increase the atomic ++counter for the bucket of uaddr2 before taking the lock. We need to have both ++bucket locks at the same time so we don't race with other futex operations. To ++ensure the locks are taken in the same order for all threads (and thus avoiding ++deadlocks), every requeue operation takes the "smaller" bucket first, when ++comparing both addresses. ++ ++If the compare with user value succeeds, we proceed by waking ``nr_wake`` ++futexes, and then requeuing ``nr_requeue`` from bucket of uaddr1 to the uaddr2. ++This consists in a simple list deletion/addition and replacing the old futex key ++for the new one. ++ ++Futex keys ++---------- ++ ++There are two types of futexes: private and shared ones. The private are futexes ++meant to be used by threads that share the same memory space, are easier to be ++uniquely identified and thus can have some performance optimization. The elements ++for identifying one are: the start address of the page where the address is, ++the address offset within the page and the current->mm pointer. ++ ++Now, for uniquely identifying shared futexes: ++ ++- If the page containing the user address is an anonymous page, we can ++ just use the same data used for private futexes (the start address of ++ the page, the address offset within the page and the current->mm ++ pointer) that will be enough for uniquely identifying such futex. We ++ also set one bit at the key to differentiate if a private futex is ++ used on the same address (mixing shared and private calls does not ++ work). ++ ++- If the page is file-backed, current->mm maybe isn't the same one for ++ every user of this futex, so we need to use other data: the ++ page->index, a UUID for the struct inode and the offset within the ++ page. ++ ++Note that members of futex_key don't have any particular meaning after they ++are part of the struct - they are just bytes to identify a futex.
Given that, ++we don't need to use a particular name or type that matches the original data, ++we only need to care about the bitsize of each component and make both private ++and shared fit in the same memory space. ++ ++Source code documentation ++========================= ++ ++.. kernel-doc:: kernel/futex2.c ++ :no-identifiers: sys_futex_wait sys_futex_wake sys_futex_waitv sys_futex_requeue +diff --git a/Documentation/locking/index.rst b/Documentation/locking/index.rst +index 7003bd5ae..9bf03c7fa 100644 +--- a/Documentation/locking/index.rst ++++ b/Documentation/locking/index.rst +@@ -24,6 +24,7 @@ locking + percpu-rw-semaphore + robust-futexes + robust-futex-ABI ++ futex2 + + .. only:: subproject and html -- -2.29.2 +2.30.2 -From ce3ae4bd9f98763fda07f315c1f239c4aaef4b5e Mon Sep 17 00:00:00 2001 +From 807830198558476757c3e1b77fcfad2129fe29fa Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Andr=C3=A9=20Almeida?= <andrealmeid@collabora.com> -Date: Thu, 9 Jul 2020 11:34:40 -0300 -Subject: [PATCH 4/9] selftests: futex: Add futex2 wake/wait test +Date: Fri, 5 Feb 2021 10:34:01 -0300 +Subject: [PATCH 07/13] selftests: futex2: Add wake/wait test MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit -Add a simple test to test wake/wait mechanism using futex2 interface. +Add a simple file to test wake/wait mechanism using futex2 interface. +Test three scenarios: using a common local int variable as private +futex, a shm futex as shared futex and a file-backed shared memory as a +shared futex. This should test all branches of futex_get_key(). + Create helper files so more tests can evaluate futex2. While 32bit ABIs -from glibc aren't able to use 64 bit sized time variables, add a +from glibc aren't yet able to use 64 bit sized time variables, add a temporary workaround that implements the required types and calls the appropriated syscalls, since futex2 doesn't supports 32 bit sized time. 
Signed-off-by: André Almeida <andrealmeid@collabora.com> Signed-off-by: Jan200101 <sentrycraft123@gmail.com> --- - tools/include/uapi/asm-generic/unistd.h | 1 - .../selftests/futex/functional/.gitignore | 1 + - .../selftests/futex/functional/Makefile | 4 +- - .../selftests/futex/functional/futex2_wait.c | 148 ++++++++++++++++++ + .../selftests/futex/functional/Makefile | 6 +- + .../selftests/futex/functional/futex2_wait.c | 209 ++++++++++++++++++ .../testing/selftests/futex/functional/run.sh | 3 + - .../selftests/futex/include/futex2test.h | 77 +++++++++ - 6 files changed, 232 insertions(+), 2 deletions(-) + .../selftests/futex/include/futex2test.h | 79 +++++++ + 5 files changed, 296 insertions(+), 2 deletions(-) create mode 100644 tools/testing/selftests/futex/functional/futex2_wait.c create mode 100644 tools/testing/selftests/futex/include/futex2test.h -diff --git a/tools/include/uapi/asm-generic/unistd.h b/tools/include/uapi/asm-generic/unistd.h -index 7de33be59..81a90b697 100644 ---- a/tools/include/uapi/asm-generic/unistd.h -+++ b/tools/include/uapi/asm-generic/unistd.h -@@ -872,7 +872,6 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv) - #undef __NR_syscalls - #define __NR_syscalls 444 - -- - /* - * 32 bit systems traditionally used different - * syscalls for off_t and loff_t arguments, while diff --git a/tools/testing/selftests/futex/functional/.gitignore b/tools/testing/selftests/futex/functional/.gitignore index 0efcd494d..d61f1df94 100644 --- a/tools/testing/selftests/futex/functional/.gitignore @@ -1808,10 +2777,15 @@ index 0efcd494d..d61f1df94 100644 futex_wait_wouldblock +futex2_wait diff --git a/tools/testing/selftests/futex/functional/Makefile b/tools/testing/selftests/futex/functional/Makefile -index 23207829e..7142a94a7 100644 +index 23207829e..9b334f190 100644 --- a/tools/testing/selftests/futex/functional/Makefile +++ b/tools/testing/selftests/futex/functional/Makefile -@@ -5,6 +5,7 @@ LDLIBS := -lpthread -lrt +@@ -1,10 +1,11 @@ + # 
SPDX-License-Identifier: GPL-2.0 +-INCLUDES := -I../include -I../../ ++INCLUDES := -I../include -I../../ -I../../../../../usr/include/ + CFLAGS := $(CFLAGS) -g -O2 -Wall -D_GNU_SOURCE -pthread $(INCLUDES) + LDLIBS := -lpthread -lrt HEADERS := \ ../include/futextest.h \ @@ -1831,14 +2805,14 @@ index 23207829e..7142a94a7 100644 diff --git a/tools/testing/selftests/futex/functional/futex2_wait.c b/tools/testing/selftests/futex/functional/futex2_wait.c new file mode 100644 -index 000000000..0646a24b7 +index 000000000..4b5416585 --- /dev/null +++ b/tools/testing/selftests/futex/functional/futex2_wait.c -@@ -0,0 +1,148 @@ +@@ -0,0 +1,209 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/****************************************************************************** + * -+ * Copyright Collabora Ltd., 2020 ++ * Copyright Collabora Ltd., 2021 + * + * DESCRIPTION + * Test wait/wake mechanism of futex2, using 32bit sized futexes. @@ -1847,7 +2821,7 @@ index 000000000..0646a24b7 + * André Almeida <andrealmeid@collabora.com> + * + * HISTORY -+ * 2020-Jul-9: Initial version by André <andrealmeid@collabora.com> ++ * 2021-Feb-5: Initial version by André <andrealmeid@collabora.com> + * + *****************************************************************************/ + @@ -1860,12 +2834,16 @@ index 000000000..0646a24b7 +#include <time.h> +#include <pthread.h> +#include <sys/shm.h> ++#include <sys/mman.h> ++#include <fcntl.h> ++#include <string.h> +#include "futex2test.h" +#include "logging.h" + +#define TEST_NAME "futex2-wait" +#define timeout_ns 30000000 +#define WAKE_WAIT_US 10000 ++#define SHM_PATH "futex2_shm_file" +futex_t *f1; + +void usage(char *prog) @@ -1881,6 +2859,7 @@ index 000000000..0646a24b7 +{ + struct timespec64 to64; + unsigned int flags = 0; ++ + if (arg) + flags = *((unsigned int *) arg); + @@ -1901,6 +2880,13 @@ index 000000000..0646a24b7 + return NULL; +} + ++void *waitershm(void *arg) ++{ ++ futex2_wait(arg, 0, FUTEX_32 | FUTEX_SHARED_FLAG, NULL); ++ ++ return 
NULL; ++} ++ +int main(int argc, char *argv[]) +{ + pthread_t waiter; @@ -1908,6 +2894,7 @@ index 000000000..0646a24b7 + int res, ret = RET_PASS; + int c; + futex_t f_private = 0; ++ + f1 = &f_private; + + while ((c = getopt(argc, argv, "cht:v:")) != -1) { @@ -1928,10 +2915,11 @@ index 000000000..0646a24b7 + } + + ksft_print_header(); -+ ksft_set_plan(2); ++ ksft_set_plan(3); + ksft_print_msg("%s: Test FUTEX2_WAIT\n", + basename(argv[0])); + ++ /* Testing a private futex */ + info("Calling private futex2_wait on f1: %u @ %p with val=%u\n", *f1, f1, *f1); + + if (pthread_create(&waiter, NULL, waiterfn, NULL)) @@ -1951,12 +2939,15 @@ index 000000000..0646a24b7 + } + + int shm_id = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0666); ++ + if (shm_id < 0) { + perror("shmget"); + exit(1); + } + ++ /* Testing an anon page shared memory */ + unsigned int *shared_data = shmat(shm_id, NULL, 0); ++ + *shared_data = 0; + f1 = shared_data; + @@ -1970,16 +2961,60 @@ index 000000000..0646a24b7 + info("Calling shared futex2_wake on f1: %u @ %p with val=%u\n", *f1, f1, *f1); + res = futex2_wake(f1, 1, FUTEX_32 | FUTEX_SHARED_FLAG); + if (res != 1) { -+ ksft_test_result_fail("futex2_wake shared returned: %d %s\n", ++ ksft_test_result_fail("futex2_wake shared (shmget) returned: %d %s\n", + res ? errno : res, + res ? 
strerror(errno) : ""); + ret = RET_FAIL; + } else { -+ ksft_test_result_pass("futex2_wake shared succeeds\n"); ++ ksft_test_result_pass("futex2_wake shared (shmget) succeeds\n"); + } + + shmdt(shared_data); + ++ /* Testing a file backed shared memory */ ++ void *shm; ++ int fd, pid; ++ ++ f_private = 0; ++ ++ fd = open(SHM_PATH, O_RDWR | O_CREAT, S_IRUSR | S_IWUSR); ++ if (fd < 0) { ++ perror("open"); ++ exit(1); ++ } ++ ++ res = ftruncate(fd, sizeof(f_private)); ++ if (res) { ++ perror("ftruncate"); ++ exit(1); ++ } ++ ++ shm = mmap(NULL, sizeof(f_private), PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); ++ if (shm == MAP_FAILED) { ++ perror("mmap"); ++ exit(1); ++ } ++ ++ memcpy(shm, &f_private, sizeof(f_private)); ++ ++ pthread_create(&waiter, NULL, waitershm, shm); ++ ++ usleep(WAKE_WAIT_US); ++ ++ res = futex2_wake(shm, 1, FUTEX_32 | FUTEX_SHARED_FLAG); ++ if (res != 1) { ++ ksft_test_result_fail("futex2_wake shared (mmap) returned: %d %s\n", ++ res ? errno : res, ++ res ? strerror(errno) : ""); ++ ret = RET_FAIL; ++ } else { ++ ksft_test_result_pass("futex2_wake shared (mmap) succeeds\n"); ++ } ++ ++ munmap(shm, sizeof(f_private)); ++ ++ remove(SHM_PATH); ++ + ksft_print_cnts(); + return ret; +} @@ -1996,14 +3031,14 @@ index 1acb6ace1..3730159c8 100755 +./futex2_wait $COLOR diff --git a/tools/testing/selftests/futex/include/futex2test.h b/tools/testing/selftests/futex/include/futex2test.h new file mode 100644 -index 000000000..807b8b57f +index 000000000..e724d56b9 --- /dev/null +++ b/tools/testing/selftests/futex/include/futex2test.h -@@ -0,0 +1,77 @@ +@@ -0,0 +1,79 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ +/****************************************************************************** + * -+ * Copyright Collabora Ltd., 2020 ++ * Copyright Collabora Ltd., 2021 + * + * DESCRIPTION + * Futex2 library addons for old futex library @@ -2012,7 +3047,7 @@ index 000000000..807b8b57f + * André Almeida <andrealmeid@collabora.com> + * + * HISTORY -+ * 2020-Jul-9: 
Initial version by André <andrealmeid@collabora.com> ++ * 2021-Feb-5: Initial version by André <andrealmeid@collabora.com> + * + *****************************************************************************/ +#include "futextest.h" @@ -2027,12 +3062,7 @@ index 000000000..807b8b57f +# define FUTEX_16 1 +#endif +#ifndef FUTEX_32 -+#define FUTEX_32 2 -+#endif -+#ifdef __x86_64__ -+# ifndef FUTEX_64 -+# define FUTEX_64 3 -+# endif ++# define FUTEX_32 2 +#endif + +/* @@ -2061,8 +3091,12 @@ index 000000000..807b8b57f + * - End of Y2038 section - + */ + -+/* -+ * wait for uaddr if (*uaddr == val) ++/** ++ * futex2_wait - If (*uaddr == val), wait at uaddr until timo ++ * @uaddr: User address to wait on ++ * @val: Expected value at uaddr, return if is not equal ++ * @flags: Operation flags ++ * @timo: Optional timeout for operation + */ +static inline int futex2_wait(volatile void *uaddr, unsigned long val, + unsigned long flags, struct timespec64 *timo) @@ -2070,27 +3104,31 @@ index 000000000..807b8b57f + return syscall(__NR_futex_wait, uaddr, val, flags, timo); +} + -+/* -+ * wake nr futexes waiting for uaddr ++/** ++ * futex2_wake - Wake a number of waiters at uaddr ++ * @uaddr: Address to wake ++ * @nr: Number of waiters to wake ++ * @flags: Operation flags + */ +static inline int futex2_wake(volatile void *uaddr, unsigned int nr, unsigned long flags) +{ + return syscall(__NR_futex_wake, uaddr, nr, flags); +} -- -2.29.2 +2.30.2 -From 1e0349f5a81a43cdb50d9a97812194df6d937b69 Mon Sep 17 00:00:00 2001 +From 382ed2cfcea3ed7e77d07e3e12b3769a081001ea Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Andr=C3=A9=20Almeida?= <andrealmeid@collabora.com> -Date: Thu, 9 Jul 2020 11:36:14 -0300 -Subject: [PATCH 5/9] selftests: futex: Add futex2 timeout test +Date: Fri, 5 Feb 2021 10:34:01 -0300 +Subject: [PATCH 08/13] selftests: futex2: Add timeout test MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Adapt existing futex wait timeout file to test the 
same mechanism for -futex2. +futex2. futex2 accepts only absolute 64bit timers, but supports both +monotonic and realtime clocks. Signed-off-by: André Almeida <andrealmeid@collabora.com> Signed-off-by: Jan200101 <sentrycraft123@gmail.com> @@ -2099,14 +3137,14 @@ Signed-off-by: Jan200101 <sentrycraft123@gmail.com> 1 file changed, 49 insertions(+), 9 deletions(-) diff --git a/tools/testing/selftests/futex/functional/futex_wait_timeout.c b/tools/testing/selftests/futex/functional/futex_wait_timeout.c -index ee55e6d38..245670e44 100644 +index ee55e6d38..b4dffe9e3 100644 --- a/tools/testing/selftests/futex/functional/futex_wait_timeout.c +++ b/tools/testing/selftests/futex/functional/futex_wait_timeout.c @@ -11,6 +11,7 @@ * * HISTORY * 2009-Nov-6: Initial version by Darren Hart <dvhart@linux.intel.com> -+ * 2020-Jul-9: Add futex2 test by André <andrealmeid@collabora.com> ++ * 2021-Feb-5: Add futex2 test by André <andrealmeid@collabora.com> * *****************************************************************************/ @@ -2198,13 +3236,13 @@ index ee55e6d38..245670e44 100644 return ret; } -- -2.29.2 +2.30.2 -From 298120f6e3a758cd03e26a104f5ce60a88501b7f Mon Sep 17 00:00:00 2001 +From 27d37b4e24805d9dc5478c296ee680a8a4db8a6e Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Andr=C3=A9=20Almeida?= <andrealmeid@collabora.com> -Date: Thu, 9 Jul 2020 11:37:42 -0300 -Subject: [PATCH 6/9] selftests: futex: Add futex2 wouldblock test +Date: Fri, 5 Feb 2021 10:34:01 -0300 +Subject: [PATCH 09/13] selftests: futex2: Add wouldblock test MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit @@ -2219,14 +3257,14 @@ Signed-off-by: Jan200101 <sentrycraft123@gmail.com> 1 file changed, 29 insertions(+), 4 deletions(-) diff --git a/tools/testing/selftests/futex/functional/futex_wait_wouldblock.c b/tools/testing/selftests/futex/functional/futex_wait_wouldblock.c -index 0ae390ff8..1f72e5928 100644 +index 0ae390ff8..ed3660090 100644 --- 
a/tools/testing/selftests/futex/functional/futex_wait_wouldblock.c +++ b/tools/testing/selftests/futex/functional/futex_wait_wouldblock.c @@ -12,6 +12,7 @@ * * HISTORY * 2009-Nov-14: Initial version by Gowrishankar <gowrishankar.m@in.ibm.com> -+ * 2020-Jul-9: Add futex2 test by André <andrealmeid@collabora.com> ++ * 2021-Feb-5: Add futex2 test by André <andrealmeid@collabora.com> * *****************************************************************************/ @@ -2293,26 +3331,30 @@ index 0ae390ff8..1f72e5928 100644 return ret; } -- -2.29.2 +2.30.2 -From 05c697a239aad5e8608c6acf0da9239cac5f7a2e Mon Sep 17 00:00:00 2001 +From 2b2f4e71b3bb09c0d45f9eae4c1986155d3a1235 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Andr=C3=A9=20Almeida?= <andrealmeid@collabora.com> -Date: Tue, 8 Dec 2020 18:47:31 -0300 -Subject: [PATCH 7/9] selftests: futex: Add futex2 waitv test +Date: Fri, 5 Feb 2021 10:34:02 -0300 +Subject: [PATCH 10/13] selftests: futex2: Add waitv test MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit +Create a new file to test the waitv mechanism. Test both private and +shared futexes. Wake the last futex in the array, and check if the +return value from futex_waitv() is the right index. 
+ Signed-off-by: André Almeida <andrealmeid@collabora.com> Signed-off-by: Jan200101 <sentrycraft123@gmail.com> --- .../selftests/futex/functional/.gitignore | 1 + .../selftests/futex/functional/Makefile | 3 +- - .../selftests/futex/functional/futex2_waitv.c | 156 ++++++++++++++++++ + .../selftests/futex/functional/futex2_waitv.c | 157 ++++++++++++++++++ .../testing/selftests/futex/functional/run.sh | 3 + - .../selftests/futex/include/futex2test.h | 25 ++- - 5 files changed, 183 insertions(+), 5 deletions(-) + .../selftests/futex/include/futex2test.h | 26 +++ + 5 files changed, 189 insertions(+), 1 deletion(-) create mode 100644 tools/testing/selftests/futex/functional/futex2_waitv.c diff --git a/tools/testing/selftests/futex/functional/.gitignore b/tools/testing/selftests/futex/functional/.gitignore @@ -2325,7 +3367,7 @@ index d61f1df94..d0b8f637b 100644 futex2_wait +futex2_waitv diff --git a/tools/testing/selftests/futex/functional/Makefile b/tools/testing/selftests/futex/functional/Makefile -index 7142a94a7..b857b9450 100644 +index 9b334f190..09c08ccde 100644 --- a/tools/testing/selftests/futex/functional/Makefile +++ b/tools/testing/selftests/futex/functional/Makefile @@ -16,7 +16,8 @@ TEST_GEN_FILES := \ @@ -2340,14 +3382,14 @@ index 7142a94a7..b857b9450 100644 diff --git a/tools/testing/selftests/futex/functional/futex2_waitv.c b/tools/testing/selftests/futex/functional/futex2_waitv.c new file mode 100644 -index 000000000..d4b116651 +index 000000000..2f81d296d --- /dev/null +++ b/tools/testing/selftests/futex/functional/futex2_waitv.c -@@ -0,0 +1,156 @@ +@@ -0,0 +1,157 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/****************************************************************************** + * -+ * Copyright Collabora Ltd., 2020 ++ * Copyright Collabora Ltd., 2021 + * + * DESCRIPTION + * Test waitv/wake mechanism of futex2, using 32bit sized futexes. 
@@ -2356,7 +3398,7 @@ index 000000000..d4b116651 + * André Almeida <andrealmeid@collabora.com> + * + * HISTORY -+ * 2020-Jul-9: Initial version by André <andrealmeid@collabora.com> ++ * 2021-Feb-5: Initial version by André <andrealmeid@collabora.com> + * + *****************************************************************************/ + @@ -2401,7 +3443,11 @@ index 000000000..d4b116651 + + res = futex2_waitv(waitv, NR_FUTEXES, 0, &to64); + if (res < 0) { -+ printf("waiter failed errno %d %s\n", ++ ksft_test_result_fail("futex2_waitv private returned: %d %s\n", ++ res ? errno : res, ++ res ? strerror(errno) : ""); ++ } else if (res != NR_FUTEXES - 1) { ++ ksft_test_result_fail("futex2_waitv private returned: %d %s\n", + res ? errno : res, + res ? strerror(errno) : ""); + } @@ -2437,23 +3483,21 @@ index 000000000..d4b116651 + ksft_print_msg("%s: Test FUTEX2_WAITV\n", + basename(argv[0])); + -+ //info("Calling private futex2_wait on f1: %u @ %p with val=%u\n", *f1, f1, *f1); -+ + for (i = 0; i < NR_FUTEXES; i++) { -+ waitv[i].uaddr = &futexes[i]; ++ waitv[i].uaddr = &futexes[i]; + waitv[i].flags = FUTEX_32; + waitv[i].val = 0; + } + ++ /* Private waitv */ + if (pthread_create(&waiter, NULL, waiterfn, NULL)) + error("pthread_create failed\n", errno); + + usleep(WAKE_WAIT_US); + -+ // info("Calling private futex2_wake on f1: %u @ %p with val=%u\n", *f1, f1, *f1); + res = futex2_wake(waitv[NR_FUTEXES - 1].uaddr, 1, FUTEX_32); + if (res != 1) { -+ ksft_test_result_fail("futex2_wake private returned: %d %s\n", ++ ksft_test_result_fail("futex2_waitv private returned: %d %s\n", + res ? errno : res, + res ? 
strerror(errno) : ""); + ret = RET_FAIL; @@ -2461,37 +3505,36 @@ index 000000000..d4b116651 + ksft_test_result_pass("futex2_waitv private succeeds\n"); + } + ++ /* Shared waitv */ + for (i = 0; i < NR_FUTEXES; i++) { + int shm_id = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0666); ++ + if (shm_id < 0) { + perror("shmget"); + exit(1); + } + + unsigned int *shared_data = shmat(shm_id, NULL, 0); -+ *shared_data = 0; + -+ waitv[i].uaddr = shared_data; ++ *shared_data = 0; ++ waitv[i].uaddr = shared_data; + waitv[i].flags = FUTEX_32 | FUTEX_SHARED_FLAG; + waitv[i].val = 0; + } + -+ //info("Calling shared futex2_wait on f1: %u @ %p with val=%u\n", *f1, f1, *f1); -+ + if (pthread_create(&waiter, NULL, waiterfn, NULL)) + error("pthread_create failed\n", errno); + + usleep(WAKE_WAIT_US); + -+ // info("Calling shared futex2_wake on f1: %u @ %p with val=%u\n", *f1, f1, *f1); + res = futex2_wake(waitv[NR_FUTEXES - 1].uaddr, 1, FUTEX_32 | FUTEX_SHARED_FLAG); + if (res != 1) { -+ ksft_test_result_fail("futex2_wake shared returned: %d %s\n", ++ ksft_test_result_fail("futex2_waitv shared returned: %d %s\n", + res ? errno : res, + res ? 
strerror(errno) : ""); + ret = RET_FAIL; + } else { -+ ksft_test_result_pass("futex2_wake shared succeeds\n"); ++ ksft_test_result_pass("futex2_waitv shared succeeds\n"); + } + + for (i = 0; i < NR_FUTEXES; i++) @@ -2512,18 +3555,13 @@ index 3730159c8..18b3883d7 100755 +echo +./futex2_waitv $COLOR diff --git a/tools/testing/selftests/futex/include/futex2test.h b/tools/testing/selftests/futex/include/futex2test.h -index 807b8b57f..10be0c504 100644 +index e724d56b9..31979afc4 100644 --- a/tools/testing/selftests/futex/include/futex2test.h +++ b/tools/testing/selftests/futex/include/futex2test.h -@@ -27,10 +27,18 @@ - #ifndef FUTEX_32 - #define FUTEX_32 2 +@@ -28,6 +28,19 @@ + # define FUTEX_32 2 #endif --#ifdef __x86_64__ --# ifndef FUTEX_64 --# define FUTEX_64 3 --# endif -+ + +#ifndef FUTEX_SHARED_FLAG +#define FUTEX_SHARED_FLAG 8 +#endif @@ -2535,16 +3573,22 @@ index 807b8b57f..10be0c504 100644 + unsigned int val; + unsigned int flags; +}; - #endif - ++#endif ++ /* -@@ -75,3 +83,12 @@ static inline int futex2_wake(volatile void *uaddr, unsigned int nr, unsigned lo + * - Y2038 section for 32-bit applications - + * +@@ -77,3 +90,16 @@ static inline int futex2_wake(volatile void *uaddr, unsigned int nr, unsigned lo { return syscall(__NR_futex_wake, uaddr, nr, flags); } + -+/* -+ * wait for uaddr if (*uaddr == val) ++/** ++ * futex2_waitv - Wait at multiple futexes, wake on any ++ * @waiters: Array of waiters ++ * @nr_waiters: Length of waiters array ++ * @flags: Operation flags ++ * @timo: Optional timeout for operation + */ +static inline int futex2_waitv(volatile struct futex_waitv *waiters, unsigned long nr_waiters, + unsigned long flags, struct timespec64 *timo) @@ -2552,123 +3596,304 @@ index 807b8b57f..10be0c504 100644 + return syscall(__NR_futex_waitv, waiters, nr_waiters, flags, timo); +} -- -2.29.2 +2.30.2 -From 9358bbdf929a90bc144d13e002fed8f4223d3178 Mon Sep 17 00:00:00 2001 +From 18a89fdf17baa9595b09bb98cc545ecba4ce93fb Mon Sep 17 00:00:00 2001 From: 
=?UTF-8?q?Andr=C3=A9=20Almeida?= <andrealmeid@collabora.com> -Date: Fri, 4 Dec 2020 19:12:23 -0300 -Subject: [PATCH 8/9] futex2: Add sysfs entry for syscall numbers +Date: Fri, 5 Feb 2021 10:34:02 -0300 +Subject: [PATCH 11/13] selftests: futex2: Add requeue test MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit +Add testing for futex_requeue(). The first test just requeue from one +waiter to another one, and wake it. The second performs both wake and +requeue, and we check return values to see if the operation +woke/requeued the expected number of waiters. + Signed-off-by: André Almeida <andrealmeid@collabora.com> Signed-off-by: Jan200101 <sentrycraft123@gmail.com> --- - kernel/futex2.c | 42 ++++++++++++++++++++++++++++++++++++++++++ - 1 file changed, 42 insertions(+) + .../selftests/futex/functional/.gitignore | 1 + + .../selftests/futex/functional/Makefile | 3 +- + .../futex/functional/futex2_requeue.c | 164 ++++++++++++++++++ + .../selftests/futex/include/futex2test.h | 16 ++ + 4 files changed, 183 insertions(+), 1 deletion(-) + create mode 100644 tools/testing/selftests/futex/functional/futex2_requeue.c -diff --git a/kernel/futex2.c b/kernel/futex2.c -index 5ddb9922d..58cd8a868 100644 ---- a/kernel/futex2.c -+++ b/kernel/futex2.c -@@ -762,6 +762,48 @@ SYSCALL_DEFINE3(futex_wake, void __user *, uaddr, unsigned int, nr_wake, - return ret; - } +diff --git a/tools/testing/selftests/futex/functional/.gitignore b/tools/testing/selftests/futex/functional/.gitignore +index d0b8f637b..af7557e82 100644 +--- a/tools/testing/selftests/futex/functional/.gitignore ++++ b/tools/testing/selftests/futex/functional/.gitignore +@@ -8,3 +8,4 @@ futex_wait_uninitialized_heap + futex_wait_wouldblock + futex2_wait + futex2_waitv ++futex2_requeue +diff --git a/tools/testing/selftests/futex/functional/Makefile b/tools/testing/selftests/futex/functional/Makefile +index 09c08ccde..3ccb9ea58 100644 +--- 
a/tools/testing/selftests/futex/functional/Makefile ++++ b/tools/testing/selftests/futex/functional/Makefile +@@ -17,7 +17,8 @@ TEST_GEN_FILES := \ + futex_wait_uninitialized_heap \ + futex_wait_private_mapped_file \ + futex2_wait \ +- futex2_waitv ++ futex2_waitv \ ++ futex2_requeue -+static ssize_t wait_show(struct kobject *kobj, struct kobj_attribute *attr, -+ char *buf) -+{ -+ return sprintf(buf, "%u\n", __NR_futex_wait); + TEST_PROGS := run.sh + +diff --git a/tools/testing/selftests/futex/functional/futex2_requeue.c b/tools/testing/selftests/futex/functional/futex2_requeue.c +new file mode 100644 +index 000000000..1bc3704dc +--- /dev/null ++++ b/tools/testing/selftests/futex/functional/futex2_requeue.c +@@ -0,0 +1,164 @@ ++// SPDX-License-Identifier: GPL-2.0-or-later ++/****************************************************************************** ++ * ++ * Copyright Collabora Ltd., 2021 ++ * ++ * DESCRIPTION ++ * Test requeue mechanism of futex2, using 32bit sized futexes. ++ * ++ * AUTHOR ++ * André Almeida <andrealmeid@collabora.com> ++ * ++ * HISTORY ++ * 2021-Feb-5: Initial version by André <andrealmeid@collabora.com> ++ * ++ *****************************************************************************/ ++ ++#include <errno.h> ++#include <error.h> ++#include <getopt.h> ++#include <stdio.h> ++#include <stdlib.h> ++#include <string.h> ++#include <time.h> ++#include <pthread.h> ++#include <sys/shm.h> ++#include <limits.h> ++#include "futex2test.h" ++#include "logging.h" ++ ++#define TEST_NAME "futex2-wait" ++#define timeout_ns 30000000 ++#define WAKE_WAIT_US 10000 ++volatile futex_t *f1; + ++void usage(char *prog) ++{ ++ printf("Usage: %s\n", prog); ++ printf(" -c Use color\n"); ++ printf(" -h Display this help message\n"); ++ printf(" -v L Verbosity level: %d=QUIET %d=CRITICAL %d=INFO\n", ++ VQUIET, VCRITICAL, VINFO); +} -+static struct kobj_attribute futex2_wait_attr = __ATTR_RO(wait); + -+static ssize_t wake_show(struct kobject *kobj, struct 
kobj_attribute *attr, -+ char *buf) ++void *waiterfn(void *arg) +{ -+ return sprintf(buf, "%u\n", __NR_futex_wake); ++ struct timespec64 to64; ++ ++ /* setting absolute timeout for futex2 */ ++ if (gettime64(CLOCK_MONOTONIC, &to64)) ++ error("gettime64 failed\n", errno); ++ ++ to64.tv_nsec += timeout_ns; ++ ++ if (to64.tv_nsec >= 1000000000) { ++ to64.tv_sec++; ++ to64.tv_nsec -= 1000000000; ++ } + ++ if (futex2_wait(f1, *f1, FUTEX_32, &to64)) ++ printf("waiter failed errno %d\n", errno); ++ ++ return NULL; +} -+static struct kobj_attribute futex2_wake_attr = __ATTR_RO(wake); + -+static ssize_t waitv_show(struct kobject *kobj, struct kobj_attribute *attr, -+ char *buf) ++int main(int argc, char *argv[]) +{ -+ return sprintf(buf, "%u\n", __NR_futex_waitv); ++ pthread_t waiter[10]; ++ int res, ret = RET_PASS; ++ int c, i; ++ volatile futex_t _f1 = 0; ++ volatile futex_t f2 = 0; ++ struct futex_requeue r1, r2; + -+} -+static struct kobj_attribute futex2_waitv_attr = __ATTR_RO(waitv); ++ f1 = &_f1; + -+static struct attribute *futex2_sysfs_attrs[] = { -+ &futex2_wait_attr.attr, -+ &futex2_wake_attr.attr, -+ &futex2_waitv_attr.attr, -+ NULL, -+}; ++ r1.flags = FUTEX_32; ++ r2.flags = FUTEX_32; + -+static const struct attribute_group futex2_sysfs_attr_group = { -+ .attrs = futex2_sysfs_attrs, -+ .name = "futex2", -+}; ++ r1.uaddr = f1; ++ r2.uaddr = &f2; + -+static int __init futex2_sysfs_init(void) -+{ -+ return sysfs_create_group(kernel_kobj, &futex2_sysfs_attr_group); -+} -+subsys_initcall(futex2_sysfs_init); ++ while ((c = getopt(argc, argv, "cht:v:")) != -1) { ++ switch (c) { ++ case 'c': ++ log_color(1); ++ break; ++ case 'h': ++ usage(basename(argv[0])); ++ exit(0); ++ case 'v': ++ log_verbosity(atoi(optarg)); ++ break; ++ default: ++ usage(basename(argv[0])); ++ exit(1); ++ } ++ } + - static int __init futex2_init(void) ++ ksft_print_header(); ++ ksft_set_plan(2); ++ ksft_print_msg("%s: Test FUTEX2_REQUEUE\n", ++ basename(argv[0])); ++ ++ /* ++ * Requeue a waiter 
from f1 to f2, and wake f2. ++ */ ++ if (pthread_create(&waiter[0], NULL, waiterfn, NULL)) ++ error("pthread_create failed\n", errno); ++ ++ usleep(WAKE_WAIT_US); ++ ++ res = futex2_requeue(&r1, &r2, 0, 1, 0, 0); ++ if (res != 1) { ++ ksft_test_result_fail("futex2_requeue private returned: %d %s\n", ++ res ? errno : res, ++ res ? strerror(errno) : ""); ++ ret = RET_FAIL; ++ } ++ ++ ++ info("Calling private futex2_wake on f2: %u @ %p with val=%u\n", f2, &f2, f2); ++ res = futex2_wake(&f2, 1, FUTEX_32); ++ if (res != 1) { ++ ksft_test_result_fail("futex2_requeue private returned: %d %s\n", ++ res ? errno : res, ++ res ? strerror(errno) : ""); ++ ret = RET_FAIL; ++ } else { ++ ksft_test_result_pass("futex2_requeue simple succeeds\n"); ++ } ++ ++ ++ /* ++ * Create 10 waiters at f1. At futex_requeue, wake 3 and requeue 7. ++ * At futex_wake, wake INT_MAX (should be exaclty 7). ++ */ ++ for (i = 0; i < 10; i++) { ++ if (pthread_create(&waiter[i], NULL, waiterfn, NULL)) ++ error("pthread_create failed\n", errno); ++ } ++ ++ usleep(WAKE_WAIT_US); ++ ++ res = futex2_requeue(&r1, &r2, 3, 7, 0, 0); ++ if (res != 10) { ++ ksft_test_result_fail("futex2_requeue private returned: %d %s\n", ++ res ? errno : res, ++ res ? strerror(errno) : ""); ++ ret = RET_FAIL; ++ } ++ ++ res = futex2_wake(&f2, INT_MAX, FUTEX_32); ++ if (res != 7) { ++ ksft_test_result_fail("futex2_requeue private returned: %d %s\n", ++ res ? errno : res, ++ res ? 
strerror(errno) : ""); ++ ret = RET_FAIL; ++ } else { ++ ksft_test_result_pass("futex2_requeue succeeds\n"); ++ } ++ ++ ksft_print_cnts(); ++ return ret; ++} +diff --git a/tools/testing/selftests/futex/include/futex2test.h b/tools/testing/selftests/futex/include/futex2test.h +index 31979afc4..e2635006b 100644 +--- a/tools/testing/selftests/futex/include/futex2test.h ++++ b/tools/testing/selftests/futex/include/futex2test.h +@@ -103,3 +103,19 @@ static inline int futex2_waitv(volatile struct futex_waitv *waiters, unsigned lo { - int i; + return syscall(__NR_futex_waitv, waiters, nr_waiters, flags, timo); + } ++ ++/** ++ * futex2_requeue - Wake futexes at uaddr1 and requeue from uaddr1 to uaddr2 ++ * @uaddr1: Original address to wake and requeue from ++ * @uaddr2: Address to requeue to ++ * @nr_wake: Number of futexes to wake at uaddr1 before requeuing ++ * @nr_requeue: Number of futexes to requeue from uaddr1 to uaddr2 ++ * @cmpval: If (uaddr1->uaddr != cmpval), return immediatally ++ * @flgas: Operation flags ++ */ ++static inline int futex2_requeue(struct futex_requeue *uaddr1, struct futex_requeue *uaddr2, ++ unsigned int nr_wake, unsigned int nr_requeue, ++ unsigned int cmpval, unsigned long flags) ++{ ++ return syscall(__NR_futex_requeue, uaddr1, uaddr2, nr_wake, nr_requeue, cmpval, flags); ++} -- -2.29.2 +2.30.2 -From f7b1c9a2ad05933e559ef78bc7753b2fac1698fd Mon Sep 17 00:00:00 2001 +From 799e24f7b39e114107b36c4cc4ece4825a9fa6a0 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Andr=C3=A9=20Almeida?= <andrealmeid@collabora.com> -Date: Tue, 5 Jan 2021 15:44:02 -0300 -Subject: [PATCH 9/9] perf bench: Add futex2 benchmark tests +Date: Fri, 5 Feb 2021 10:34:02 -0300 +Subject: [PATCH 12/13] perf bench: Add futex2 benchmark tests MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit -Port existing futex infrastructure to use futex2 calls. +Add support at the existing futex benchmarking code base to enable +futex2 calls. 
`perf bench` tests can be used not only as a way to +measure the performance of implementation, but also as stress testing +for the kernel infrastructure. Signed-off-by: André Almeida <andrealmeid@collabora.com> Signed-off-by: Jan200101 <sentrycraft123@gmail.com> --- - tools/arch/x86/include/asm/unistd_64.h | 8 +++++ - tools/perf/bench/bench.h | 3 ++ - tools/perf/bench/futex-hash.c | 24 ++++++++++++--- - tools/perf/bench/futex-wake-parallel.c | 41 ++++++++++++++++++++++---- - tools/perf/bench/futex-wake.c | 36 ++++++++++++++++++---- - tools/perf/bench/futex.h | 17 +++++++++++ - tools/perf/builtin-bench.c | 17 ++++++++--- - 7 files changed, 127 insertions(+), 19 deletions(-) + tools/arch/x86/include/asm/unistd_64.h | 12 ++++++ + tools/perf/bench/bench.h | 4 ++ + tools/perf/bench/futex-hash.c | 24 +++++++++-- + tools/perf/bench/futex-requeue.c | 57 ++++++++++++++++++++------ + tools/perf/bench/futex-wake-parallel.c | 41 +++++++++++++++--- + tools/perf/bench/futex-wake.c | 37 +++++++++++++---- + tools/perf/bench/futex.h | 47 +++++++++++++++++++++ + tools/perf/builtin-bench.c | 18 ++++++-- + 8 files changed, 206 insertions(+), 34 deletions(-) diff --git a/tools/arch/x86/include/asm/unistd_64.h b/tools/arch/x86/include/asm/unistd_64.h -index 4205ed415..151a41ceb 100644 +index 4205ed415..cf5ad4ea1 100644 --- a/tools/arch/x86/include/asm/unistd_64.h +++ b/tools/arch/x86/include/asm/unistd_64.h -@@ -17,3 +17,11 @@ +@@ -17,3 +17,15 @@ #ifndef __NR_setns #define __NR_setns 308 #endif + +#ifndef __NR_futex_wait -+# define __NR_futex_wait 441 ++# define __NR_futex_wait 442 +#endif + +#ifndef __NR_futex_wake -+# define __NR_futex_wake 442 ++# define __NR_futex_wake 443 ++#endif ++ ++#ifndef __NR_futex_requeue ++# define __NR_futex_requeue 445 +#endif diff --git a/tools/perf/bench/bench.h b/tools/perf/bench/bench.h -index eac36afab..f6f881a05 100644 +index eac36afab..12346844b 100644 --- a/tools/perf/bench/bench.h +++ b/tools/perf/bench/bench.h -@@ -38,8 +38,11 @@ int 
bench_mem_memcpy(int argc, const char **argv); +@@ -38,9 +38,13 @@ int bench_mem_memcpy(int argc, const char **argv); int bench_mem_memset(int argc, const char **argv); int bench_mem_find_bit(int argc, const char **argv); int bench_futex_hash(int argc, const char **argv); @@ -2678,10 +3903,12 @@ index eac36afab..f6f881a05 100644 int bench_futex_wake_parallel(int argc, const char **argv); +int bench_futex2_wake_parallel(int argc, const char **argv); int bench_futex_requeue(int argc, const char **argv); ++int bench_futex2_requeue(int argc, const char **argv); /* pi futexes */ int bench_futex_lock_pi(int argc, const char **argv); + int bench_epoll_wait(int argc, const char **argv); diff --git a/tools/perf/bench/futex-hash.c b/tools/perf/bench/futex-hash.c -index 915bf3da7..72921c22b 100644 +index 915bf3da7..6e62e7708 100644 --- a/tools/perf/bench/futex-hash.c +++ b/tools/perf/bench/futex-hash.c @@ -34,7 +34,7 @@ static unsigned int nthreads = 0; @@ -2710,7 +3937,7 @@ index 915bf3da7..72921c22b 100644 } -int bench_futex_hash(int argc, const char **argv) -+static int bench_futex_hash_common(int argc, const char **argv) ++static int __bench_futex_hash(int argc, const char **argv) { int ret = 0; cpu_set_t cpuset; @@ -2732,16 +3959,146 @@ index 915bf3da7..72921c22b 100644 + +int bench_futex_hash(int argc, const char **argv) +{ -+ return bench_futex_hash_common(argc, argv); ++ return __bench_futex_hash(argc, argv); +} + +int bench_futex2_hash(int argc, const char **argv) +{ + futex2 = true; -+ return bench_futex_hash_common(argc, argv); ++ return __bench_futex_hash(argc, argv); ++} +diff --git a/tools/perf/bench/futex-requeue.c b/tools/perf/bench/futex-requeue.c +index 7a15c2e61..4c7486fbe 100644 +--- a/tools/perf/bench/futex-requeue.c ++++ b/tools/perf/bench/futex-requeue.c +@@ -2,8 +2,8 @@ + /* + * Copyright (C) 2013 Davidlohr Bueso <davidlohr@hp.com> + * +- * futex-requeue: Block a bunch of threads on futex1 and requeue them +- * on futex2, N at a time. 
++ * futex-requeue: Block a bunch of threads on addr1 and requeue them ++ * on addr2, N at a time. + * + * This program is particularly useful to measure the latency of nthread + * requeues without waking up any tasks -- thus mimicking a regular futex_wait. +@@ -29,7 +29,10 @@ + #include <stdlib.h> + #include <sys/time.h> + +-static u_int32_t futex1 = 0, futex2 = 0; ++static u_int32_t addr1 = 0, addr2 = 0; ++ ++static struct futex_requeue rq1 = { .uaddr = &addr1, .flags = FUTEX_32 }; ++static struct futex_requeue rq2 = { .uaddr = &addr2, .flags = FUTEX_32 }; + + /* + * How many tasks to requeue at a time. +@@ -38,7 +41,7 @@ static u_int32_t futex1 = 0, futex2 = 0; + static unsigned int nrequeue = 1; + + static pthread_t *worker; +-static bool done = false, silent = false, fshared = false; ++static bool done = false, silent = false, fshared = false, futex2 = false; + static pthread_mutex_t thread_lock; + static pthread_cond_t thread_parent, thread_worker; + static struct stats requeuetime_stats, requeued_stats; +@@ -80,7 +83,11 @@ static void *workerfn(void *arg __maybe_unused) + pthread_cond_wait(&thread_worker, &thread_lock); + pthread_mutex_unlock(&thread_lock); + +- futex_wait(&futex1, 0, NULL, futex_flag); ++ if (!futex2) ++ futex_wait(&addr1, 0, NULL, futex_flag); ++ else ++ futex2_wait(&addr1, 0, futex_flag, NULL); ++ + return NULL; + } + +@@ -112,7 +119,7 @@ static void toggle_done(int sig __maybe_unused, + done = true; + } + +-int bench_futex_requeue(int argc, const char **argv) ++static int __bench_futex_requeue(int argc, const char **argv) + { + int ret = 0; + unsigned int i, j; +@@ -140,15 +147,20 @@ int bench_futex_requeue(int argc, const char **argv) + if (!worker) + err(EXIT_FAILURE, "calloc"); + +- if (!fshared) ++ if (futex2) { ++ futex_flag = FUTEX_32 | (fshared * FUTEX_SHARED_FLAG); ++ rq1.flags |= FUTEX_SHARED_FLAG * fshared; ++ rq2.flags |= FUTEX_SHARED_FLAG * fshared; ++ } else if (!fshared) { + futex_flag = FUTEX_PRIVATE_FLAG; ++ } + + if 
(nrequeue > nthreads) + nrequeue = nthreads; + + printf("Run summary [PID %d]: Requeuing %d threads (from [%s] %p to %p), " + "%d at a time.\n\n", getpid(), nthreads, +- fshared ? "shared":"private", &futex1, &futex2, nrequeue); ++ fshared ? "shared":"private", &addr1, &addr2, nrequeue); + + init_stats(&requeued_stats); + init_stats(&requeuetime_stats); +@@ -177,11 +189,15 @@ int bench_futex_requeue(int argc, const char **argv) + gettimeofday(&start, NULL); + while (nrequeued < nthreads) { + /* +- * Do not wakeup any tasks blocked on futex1, allowing ++ * Do not wakeup any tasks blocked on addr1, allowing + * us to really measure futex_wait functionality. + */ +- nrequeued += futex_cmp_requeue(&futex1, 0, &futex2, 0, +- nrequeue, futex_flag); ++ if (!futex2) ++ nrequeued += futex_cmp_requeue(&addr1, 0, &addr2, ++ 0, nrequeue, futex_flag); ++ else ++ nrequeued += futex2_requeue(&rq1, &rq2, ++ 0, nrequeue, 0, 0); + } + + gettimeofday(&end, NULL); +@@ -195,8 +211,12 @@ int bench_futex_requeue(int argc, const char **argv) + j + 1, nrequeued, nthreads, runtime.tv_usec / (double)USEC_PER_MSEC); + } + +- /* everybody should be blocked on futex2, wake'em up */ +- nrequeued = futex_wake(&futex2, nrequeued, futex_flag); ++ /* everybody should be blocked on addr2, wake'em up */ ++ if (!futex2) ++ nrequeued = futex_wake(&addr2, nrequeued, futex_flag); ++ else ++ nrequeued = futex2_wake(&addr2, nrequeued, futex_flag); ++ + if (nthreads != nrequeued) + warnx("couldn't wakeup all tasks (%d/%d)", nrequeued, nthreads); + +@@ -221,3 +241,14 @@ int bench_futex_requeue(int argc, const char **argv) + usage_with_options(bench_futex_requeue_usage, options); + exit(EXIT_FAILURE); + } ++ ++int bench_futex_requeue(int argc, const char **argv) ++{ ++ return __bench_futex_requeue(argc, argv); ++} ++ ++int bench_futex2_requeue(int argc, const char **argv) ++{ ++ futex2 = true; ++ return __bench_futex_requeue(argc, argv); +} diff --git a/tools/perf/bench/futex-wake-parallel.c 
b/tools/perf/bench/futex-wake-parallel.c -index cd2b81a84..540104538 100644 +index cd2b81a84..8a89c6ab9 100644 --- a/tools/perf/bench/futex-wake-parallel.c +++ b/tools/perf/bench/futex-wake-parallel.c @@ -17,6 +17,12 @@ int bench_futex_wake_parallel(int argc __maybe_unused, const char **argv __maybe @@ -2800,7 +4157,7 @@ index cd2b81a84..540104538 100644 } -int bench_futex_wake_parallel(int argc, const char **argv) -+static int bench_futex_wake_parallel_common(int argc, const char **argv) ++static int __bench_futex_wake_parallel(int argc, const char **argv) { int ret = 0; unsigned int i, j; @@ -2822,31 +4179,30 @@ index cd2b81a84..540104538 100644 + +int bench_futex_wake_parallel(int argc, const char **argv) +{ -+ return bench_futex_wake_parallel_common(argc, argv); ++ return __bench_futex_wake_parallel(argc, argv); +} + +int bench_futex2_wake_parallel(int argc, const char **argv) +{ + futex2 = true; -+ return bench_futex_wake_parallel_common(argc, argv); ++ return __bench_futex_wake_parallel(argc, argv); +} + #endif /* HAVE_PTHREAD_BARRIER */ diff --git a/tools/perf/bench/futex-wake.c b/tools/perf/bench/futex-wake.c -index 2dfcef3e3..b98b84e7b 100644 +index 2dfcef3e3..be4481f5e 100644 --- a/tools/perf/bench/futex-wake.c +++ b/tools/perf/bench/futex-wake.c -@@ -46,6 +46,9 @@ static struct stats waketime_stats, wakeup_stats; - static unsigned int threads_starting, nthreads = 0; - static int futex_flag = 0; +@@ -39,7 +39,7 @@ static u_int32_t futex1 = 0; + static unsigned int nwakes = 1; -+/* Should we use futex2 API? 
*/ -+static bool futex2 = false; -+ - static const struct option options[] = { - OPT_UINTEGER('t', "threads", &nthreads, "Specify amount of threads"), - OPT_UINTEGER('w', "nwakes", &nwakes, "Specify amount of threads to wake at once"), -@@ -69,8 +72,13 @@ static void *workerfn(void *arg __maybe_unused) + pthread_t *worker; +-static bool done = false, silent = false, fshared = false; ++static bool done = false, silent = false, fshared = false, futex2 = false; + static pthread_mutex_t thread_lock; + static pthread_cond_t thread_parent, thread_worker; + static struct stats waketime_stats, wakeup_stats; +@@ -69,8 +69,13 @@ static void *workerfn(void *arg __maybe_unused) pthread_mutex_unlock(&thread_lock); while (1) { @@ -2862,16 +4218,16 @@ index 2dfcef3e3..b98b84e7b 100644 } pthread_exit(NULL); -@@ -118,7 +126,7 @@ static void toggle_done(int sig __maybe_unused, +@@ -118,7 +123,7 @@ static void toggle_done(int sig __maybe_unused, done = true; } -int bench_futex_wake(int argc, const char **argv) -+static int bench_futex_wake_common(int argc, const char **argv) ++static int __bench_futex_wake(int argc, const char **argv) { int ret = 0; unsigned int i, j; -@@ -148,7 +156,9 @@ int bench_futex_wake(int argc, const char **argv) +@@ -148,7 +153,9 @@ int bench_futex_wake(int argc, const char **argv) if (!worker) err(EXIT_FAILURE, "calloc"); @@ -2882,14 +4238,16 @@ index 2dfcef3e3..b98b84e7b 100644 futex_flag = FUTEX_PRIVATE_FLAG; printf("Run summary [PID %d]: blocking on %d threads (at [%s] futex %p), " -@@ -181,8 +191,13 @@ int bench_futex_wake(int argc, const char **argv) +@@ -180,9 +187,14 @@ int bench_futex_wake(int argc, const char **argv) + /* Ok, all threads are patiently blocked, start waking folks up */ gettimeofday(&start, NULL); - while (nwoken != nthreads) +- while (nwoken != nthreads) - nwoken += futex_wake(&futex1, nwakes, futex_flag); -+ if (!futex2) { ++ while (nwoken != nthreads) { ++ if (!futex2) + nwoken += futex_wake(&futex1, nwakes, futex_flag); -+ } else 
{ ++ else + nwoken += futex2_wake(&futex1, nwakes, futex_flag); + } gettimeofday(&end, NULL); @@ -2897,32 +4255,38 @@ index 2dfcef3e3..b98b84e7b 100644 timersub(&end, &start, &runtime); update_stats(&wakeup_stats, nwoken); -@@ -212,3 +227,14 @@ int bench_futex_wake(int argc, const char **argv) +@@ -212,3 +224,14 @@ int bench_futex_wake(int argc, const char **argv) free(worker); return ret; } + +int bench_futex_wake(int argc, const char **argv) +{ -+ return bench_futex_wake_common(argc, argv); ++ return __bench_futex_wake(argc, argv); +} + +int bench_futex2_wake(int argc, const char **argv) +{ + futex2 = true; -+ return bench_futex_wake_common(argc, argv); ++ return __bench_futex_wake(argc, argv); +} diff --git a/tools/perf/bench/futex.h b/tools/perf/bench/futex.h -index 31b53cc7d..5111799b5 100644 +index 31b53cc7d..6b2213cf3 100644 --- a/tools/perf/bench/futex.h +++ b/tools/perf/bench/futex.h -@@ -86,4 +86,21 @@ futex_cmp_requeue(u_int32_t *uaddr, u_int32_t val, u_int32_t *uaddr2, int nr_wak +@@ -86,4 +86,51 @@ futex_cmp_requeue(u_int32_t *uaddr, u_int32_t val, u_int32_t *uaddr2, int nr_wak return futex(uaddr, FUTEX_CMP_REQUEUE, nr_wake, nr_requeue, uaddr2, val, opflags); } + -+/* -+ * wait for uaddr if (*uaddr == val) ++/** ++ * futex2_wait - Wait at uaddr if *uaddr == val, until timo. 
++ * @uaddr: User address to wait for ++ * @val: Expected value at uaddr ++ * @flags: Operation options ++ * @timo: Optional timeout ++ * ++ * Return: 0 on success, error code otherwise + */ +static inline int futex2_wait(volatile void *uaddr, unsigned long val, + unsigned long flags, struct timespec *timo) @@ -2930,16 +4294,40 @@ index 31b53cc7d..5111799b5 100644 + return syscall(__NR_futex_wait, uaddr, val, flags, timo); +} + -+/* -+ * wake nr futexes waiting for uaddr ++/** ++ * futex2_wake - Wake a number of waiters waiting at uaddr ++ * @uaddr: Address to wake ++ * @nr: Number of waiters to wake ++ * @flags: Operation options ++ * ++ * Return: number of waked futexes + */ +static inline int futex2_wake(volatile void *uaddr, unsigned int nr, unsigned long flags) +{ + return syscall(__NR_futex_wake, uaddr, nr, flags); +} ++ ++/** ++ * futex2_requeue - Requeue waiters from an address to another one ++ * @uaddr1: Address where waiters are currently waiting on ++ * @uaddr2: New address to wait ++ * @nr_wake: Number of waiters at uaddr1 to be wake ++ * @nr_requeue: After waking nr_wake, number of waiters to be requeued ++ * @cmpval: Expected value at uaddr1 ++ * @flags: Operation options ++ * ++ * Return: waked futexes + requeued futexes at uaddr1 ++ */ ++static inline int futex2_requeue(volatile struct futex_requeue *uaddr1, ++ volatile struct futex_requeue *uaddr2, ++ unsigned int nr_wake, unsigned int nr_requeue, ++ unsigned int cmpval, unsigned long flags) ++{ ++ return syscall(__NR_futex_requeue, uaddr1, uaddr2, nr_wake, nr_requeue, cmpval, flags); ++} #endif /* _FUTEX_H */ diff --git a/tools/perf/builtin-bench.c b/tools/perf/builtin-bench.c -index 62a7b7420..200ecacad 100644 +index 62a7b7420..e41a95ad2 100644 --- a/tools/perf/builtin-bench.c +++ b/tools/perf/builtin-bench.c @@ -12,10 +12,11 @@ @@ -2958,7 +4346,7 @@ index 62a7b7420..200ecacad 100644 */ #include <subcmd/parse-options.h> #include "builtin.h" -@@ -75,6 +76,13 @@ static struct bench 
futex_benchmarks[] = { +@@ -75,6 +76,14 @@ static struct bench futex_benchmarks[] = { { NULL, NULL, NULL } }; @@ -2966,13 +4354,14 @@ index 62a7b7420..200ecacad 100644 + { "hash", "Benchmark for futex2 hash table", bench_futex2_hash }, + { "wake", "Benchmark for futex2 wake calls", bench_futex2_wake }, + { "wake-parallel", "Benchmark for parallel futex2 wake calls", bench_futex2_wake_parallel }, ++ { "requeue", "Benchmark for futex2 requeue calls", bench_futex2_requeue }, + { NULL, NULL, NULL } +}; + #ifdef HAVE_EVENTFD_SUPPORT static struct bench epoll_benchmarks[] = { { "wait", "Benchmark epoll concurrent epoll_waits", bench_epoll_wait }, -@@ -105,6 +113,7 @@ static struct collection collections[] = { +@@ -105,6 +114,7 @@ static struct collection collections[] = { { "numa", "NUMA scheduling and MM benchmarks", numa_benchmarks }, #endif {"futex", "Futex stressing benchmarks", futex_benchmarks }, @@ -2981,5 +4370,82 @@ index 62a7b7420..200ecacad 100644 {"epoll", "Epoll stressing benchmarks", epoll_benchmarks }, #endif -- -2.29.2 +2.30.2 + + +From ea9a7956b5f6f44f3ee70d82542c64fcb7c86c5e Mon Sep 17 00:00:00 2001 +From: =?UTF-8?q?Andr=C3=A9=20Almeida?= <andrealmeid@collabora.com> +Date: Fri, 5 Feb 2021 10:34:02 -0300 +Subject: [PATCH 13/13] futex2: Add sysfs entry for syscall numbers +MIME-Version: 1.0 +Content-Type: text/plain; charset=UTF-8 +Content-Transfer-Encoding: 8bit + +In the course of futex2 development, it will be rebased on top of +different kernel releases, and the syscall number can change in this +process. Expose futex2 syscall number via sysfs so tools that are +experimenting with futex2 (like Proton/Wine) can test it and set the +syscall number at runtime, rather than setting it at compilation time. 
+
+Signed-off-by: André Almeida <andrealmeid@collabora.com>
+Signed-off-by: Jan200101 <sentrycraft123@gmail.com>
+---
+ kernel/futex2.c | 42 ++++++++++++++++++++++++++++++++++++++++++
+ 1 file changed, 42 insertions(+)
+
+diff --git a/kernel/futex2.c b/kernel/futex2.c
+index 8a8b45f98..1eb20410d 100644
+--- a/kernel/futex2.c
++++ b/kernel/futex2.c
+@@ -1220,6 +1220,48 @@ SYSCALL_DEFINE6(futex_requeue, struct futex_requeue __user *, uaddr1,
+ 	return __futex_requeue(rq1, rq2, nr_wake, nr_requeue, cmpval, shared1, shared2);
+ }
+ 
++static ssize_t wait_show(struct kobject *kobj, struct kobj_attribute *attr,
++			 char *buf)
++{
++	return sprintf(buf, "%u\n", __NR_futex_wait);
++
++}
++static struct kobj_attribute futex2_wait_attr = __ATTR_RO(wait);
++
++static ssize_t wake_show(struct kobject *kobj, struct kobj_attribute *attr,
++			 char *buf)
++{
++	return sprintf(buf, "%u\n", __NR_futex_wake);
++
++}
++static struct kobj_attribute futex2_wake_attr = __ATTR_RO(wake);
++
++static ssize_t waitv_show(struct kobject *kobj, struct kobj_attribute *attr,
++			  char *buf)
++{
++	return sprintf(buf, "%u\n", __NR_futex_waitv);
++
++}
++static struct kobj_attribute futex2_waitv_attr = __ATTR_RO(waitv);
++
++static struct attribute *futex2_sysfs_attrs[] = {
++	&futex2_wait_attr.attr,
++	&futex2_wake_attr.attr,
++	&futex2_waitv_attr.attr,
++	NULL,
++};
++
++static const struct attribute_group futex2_sysfs_attr_group = {
++	.attrs = futex2_sysfs_attrs,
++	.name = "futex2",
++};
++
++static int __init futex2_sysfs_init(void)
++{
++	return sysfs_create_group(kernel_kobj, &futex2_sysfs_attr_group);
++}
++subsys_initcall(futex2_sysfs_init);
++
+ static int __init futex2_init(void)
+ {
+ 	int i;
+-- 
+2.30.2
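
The commit message for patch 13/13 says tools should pick up the futex2 syscall numbers at runtime instead of hard-coding them. Because the attribute group is named "futex2" and is registered on kernel_kobj, the files should appear as /sys/kernel/futex2/wait, /sys/kernel/futex2/wake and /sys/kernel/futex2/waitv. The following is a minimal userspace sketch of reading one of them; the helper name read_syscall_nr is hypothetical and is not part of the patch.

```c
#include <stdio.h>

/*
 * Hypothetical helper (not part of the patch): read a syscall number
 * from a sysfs attribute such as /sys/kernel/futex2/wait, which the
 * patch prints as "%u\n".
 *
 * Returns the number on success, or -1 if the file does not exist
 * (for example on a kernel without this patch) or cannot be parsed.
 */
static long read_syscall_nr(const char *path)
{
	FILE *f = fopen(path, "r");
	unsigned int nr;
	int matched;

	if (!f)
		return -1;

	matched = fscanf(f, "%u", &nr);
	fclose(f);

	return matched == 1 ? (long)nr : -1;
}
```

A caller would typically do this once at startup, for example read_syscall_nr("/sys/kernel/futex2/wait"), and fall back to the legacy futex() path when the result is -1; a non-negative result can then be passed as the first argument to syscall(2) in place of a compile-time __NR_futex_wait.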