path: root/SOURCES/futex2.patch
Diffstat (limited to 'SOURCES/futex2.patch')
-rw-r--r-- SOURCES/futex2.patch 4024
1 files changed, 2745 insertions, 1279 deletions
diff --git a/SOURCES/futex2.patch b/SOURCES/futex2.patch
index 1bc4486..3604062 100644
--- a/SOURCES/futex2.patch
+++ b/SOURCES/futex2.patch
@@ -1,37 +1,314 @@
-From 14a106cc87e6d03169ac8c7ea030e3d7fac2dfe4 Mon Sep 17 00:00:00 2001
+From a64bf661d4fc6dbfde640bf002eae2e22884a419 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Andr=C3=A9=20Almeida?= <andrealmeid@collabora.com>
-Date: Wed, 5 Aug 2020 12:40:26 -0300
-Subject: [PATCH 1/9] futex2: Add new futex interface
+Date: Fri, 5 Feb 2021 10:34:00 -0300
+Subject: [PATCH 01/13] futex2: Implement wait and wake functions
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
-Initial implementation for futex2. Support only private u32 wait/wake, with
-timeout (monotonic and realtime clocks).
+Create a new set of futex syscalls known as futex2. This new interface
+aims to provide more maintainable code, while removing obsolete
+features and adding new functionality.
+
+Implements wait and wake semantics for futexes, along with the base
+infrastructure for future operations. The whole wait path is designed to
+be used by N waiters, making it easier to implement vectorized wait.
+
+* Syscalls implemented by this patch:
+
+- futex_wait(void *uaddr, unsigned int val, unsigned int flags,
+ struct timespec *timo)
+
+ The user thread is put to sleep, waiting for a futex_wake() at uaddr,
+ if the value at *uaddr is the same as val (otherwise, the syscall
+ returns immediately with -EAGAIN). timo is an optional timeout value
+ for the operation.
+
+ Return 0 on success, error code otherwise.
+
+ - futex_wake(void *uaddr, unsigned long nr_wake, unsigned int flags)
+
+ Wake `nr_wake` threads waiting at uaddr.
+
+ Return the number of woken threads on success, error code otherwise.
+
+** The `flags` argument
+
+ The flags are used to specify the size of the futex word
+ (FUTEX_[8, 16, 32]). It's mandatory to define one, since there's no
+ default size.
+
+ By default, the timeout uses a monotonic clock, but a realtime clock
+ can be selected with the FUTEX_CLOCK_REALTIME flag.
+
+ By default, futexes are of the private type, which means that the user
+ address will be accessed by threads that share the same memory region.
+ This allows for some internal optimizations, so they are faster.
+ However, if the address needs to be shared with different processes
+ (like using `mmap()` or `shm()`), they need to be defined as shared and
+ the flag FUTEX_SHARED_FLAG is used to set that.
+
+ By default, the operation has no NUMA-awareness, meaning that the user
+ can't choose the memory node where the kernel side futex data will be
+ stored. The user can choose the node where it wants to operate by
+ setting the FUTEX_NUMA_FLAG and using the following structure (where X
+ can be 8, 16, or 32):
+
+ struct futexX_numa {
+ __uX value;
+ __sX hint;
+ };
+
+ This structure should be passed at the `void *uaddr` of futex
+ functions. The address of the structure will be used to wait on or
+ wake, and the `value` will be compared to `val` as usual. The `hint`
+ member is used to define which node the futex will use. When waiting,
+ the futex will be registered on a kernel-side table stored on that
+ node; when waking, the futex will be searched for on that given table.
+ That means that there's no redundancy between tables, and a wrong
+ `hint` value will lead to undesired behavior. Userspace is responsible
+ for dealing with node migration issues that may occur. `hint` can
+ range from [0, MAX_NUMA_NODES], for specifying a node, or -1, to use
+ the same node the current process is using.
+
+ When not using FUTEX_NUMA_FLAG on a NUMA system, the futex will be
+ stored on a global table on some node, defined at compilation time.
+
+** The `timo` argument
+
+As per the Y2038 work done in the kernel, new interfaces shouldn't add
+timeout options known to be buggy. Given that, `timo` should be a 64-bit
+timeout on all platforms, using an absolute timeout value.
Signed-off-by: André Almeida <andrealmeid@collabora.com>
+---
+
+[RFC Add futex2 syscall 0/0]
+
+Hi,
+
+This patch series introduces the futex2 syscalls.
+
+* What happened to the current futex()?
+
+For some years now, developers have been trying to add new features to
+futex, but maintainers have been reluctant to accept them, given that
+the multiplexed interface is full of legacy features and makes big
+changes tricky. Some problems that people tried to address with patchsets
+are: NUMA-awareness[0], smaller sized futexes[1], wait on multiple
+futexes[2]. NUMA, for instance, just doesn't fit the current API in a
+reasonable way. Considering that, it's not possible to merge new features
+into the current futex.
+
+ ** The NUMA problem
+
+ In the current implementation, all futex kernel side infrastructure is
+ stored on a single node. Given that, all futex() calls issued by
+ processors that aren't located on that node will incur a memory access
+ penalty.
+
+ ** The 32bit sized futex problem
+
+ Embedded systems or anything with memory constraints would benefit from
+ using smaller sizes for the futex userspace integer. Also, a mutex
+ implementation can be done using just three values, so 8 bits is enough
+ for various scenarios.
+
+ ** The wait on multiple problem
+
+ The use case lies in the Wine implementation of the Windows NT interface
+ WaitForMultipleObjects. This Windows API function allows a thread to sleep
+ waiting on the first of a set of event sources (mutexes, timers, signal,
+ console input, etc) to signal. Considering this is a primitive
+ synchronization operation for Windows applications, being able to quickly
+ signal events on the producer side, and quickly go to sleep on the
+ consumer side is essential for good performance of those running
+ over Wine.
+
+[0] https://lore.kernel.org/lkml/20160505204230.932454245@linutronix.de/
+[1] https://lore.kernel.org/lkml/20191221155659.3159-2-malteskarupke@web.de/
+[2] https://lore.kernel.org/lkml/20200213214525.183689-1-andrealmeid@collabora.com/
+
+* The solution
+
+As proposed by Peter Zijlstra and Florian Weimer[3], a new interface
+is required to solve this, which must be designed with those features in
+mind. futex2() is that interface. As opposed to the current multiplexed
+interface, the new one should have one syscall per operation. This will
+keep the API maintainable if it gets extended, and will help
+users with type checking of arguments.
+
+In particular, the new interface is extended to support the ability to
+wait on any of a list of futexes at a time, which could be seen as a
+vectored extension of the FUTEX_WAIT semantics.
+
+[3] https://lore.kernel.org/lkml/20200303120050.GC2596@hirez.programming.kicks-ass.net/
+
+* The interface
+
+The new interface can be seen in detail in the following patches, but
+this is a high level summary of what the interface can do:
+
+ - Supports wake/wait semantics, as in futex()
+ - Supports requeue operations, similar to FUTEX_CMP_REQUEUE, but with
+ individual flags for each address
+ - Supports waiting for a vector of futexes, using a new syscall named
+ futex_waitv()
+ - Supports variable sized futexes (8bits, 16bits and 32bits)
+ - Supports NUMA-aware operations, where the user can specify
+ which memory node to operate on
+
+* Implementation
+
+The internal implementation follows a similar design to the original futex.
+Given that we want to replicate the same external behavior of current
+futex, this should be somewhat expected. For some functions, like the
+init and the code to get a shared key, I literally copied code and
+comments from kernel/futex.c. I decided to do so instead of exposing the
+original function as a public function since in that way we can freely
+modify our implementation if required, without any impact on old futex.
+Also, the comments precisely describe the details and corner cases of
+the implementation.
+
+Each patch contains a brief description of implementation, but patch 6
+"docs: locking: futex2: Add documentation" adds a more complete document
+about it.
+
+* The patchset
+
+This patchset can be also found at my git tree:
+
+https://gitlab.collabora.com/tonyk/linux/-/tree/futex2
+
+ - Patch 1: Implements wait/wake, and the basic foundations of futex2
+
+ - Patches 2-4: Implement the remaining features (shared, waitv, requeue).
+
+ - Patch 5: Adds the x86_x32 ABI handling. I kept it in a separate
+ patch since I'm not sure if x86_x32 is still a thing, or if it should
+ return -ENOSYS.
+
+ - Patch 6: Adds a documentation file which details the interface and
+ the internal implementation.
+
+ - Patches 7-13: Selftests for all operations along with perf
+ support for futex2.
+
+ - Patch 14: While working on porting glibc for futex2, I found out
+ that there's a futex_wake() call at the user thread exit path, if
+ that thread was created with clone(..., CLONE_CHILD_SETTID, ...). In
+ order to make pthreads work with futex2, it was required to add
+ this patch. Note that this is more of a proof-of-concept of what we
+ will need to do in future, rather than part of the interface, and it
+ shouldn't be merged as is.
+
+* Testing:
+
+This patchset provides selftests for each operation and their flags.
+Along with that, the following work was done:
+
+ ** Stability
+
+ To stress the interface in "real world scenarios":
+
+ - glibc[4]: nptl's low level locking was modified to use futex2 API
+ (except for robust and PI things). All relevant nptl/ tests passed.
+
+ - Wine[5]: Proton/Wine was modified in order to use futex2() for the
+ emulation of Windows NT sync mechanisms based on futex, called "fsync".
+ Triple-A games with huge CPU loads and tons of parallel jobs worked
+ as expected when compared with the previous FUTEX_WAIT_MULTIPLE
+ implementation at futex(). Some games issue 42k futex2() calls
+ per second.
+
+ - Full GNU/Linux distro: I installed the modified glibc in my host
+ machine, so all pthread programs would use futex2(). After tweaking
+ systemd[6] to allow futex2() calls at seccomp, everything worked as
+ expected (web browsers do some syscall sandboxing and need some
+ configuration as well).
+
+ - perf: The perf benchmarks tests can also be used to stress the
+ interface, and they can be found in this patchset.
+
+ ** Performance
+
+ - For comparing futex() and futex2() performance, I used the artificial
+ benchmarks implemented at perf (wake, wake-parallel, hash and
+ requeue). The setup was 200 runs for each test and using 8, 80, 800,
+ 8000 for the number of threads. Note that for this test, I'm not using
+ patch 14 ("kernel: Enable waitpid() for futex2"), for reasons explained
+ at "The patchset" section.
+
+ - For the first three, I measured an average of 4% gain in
+ performance. This is not a big step, but it shows that the new
+ interface is at least comparable in performance with the current one.
+
+ - For requeue, I measured an average of 21% decrease in performance
+ compared to the original futex implementation. This is expected given
+ the new design with individual flags. The performance trade-offs are
+ explained at patch 4 ("futex2: Implement requeue operation").
+
+[4] https://gitlab.collabora.com/tonyk/glibc/-/tree/futex2
+[5] https://gitlab.collabora.com/tonyk/wine/-/tree/proton_5.13
+[6] https://gitlab.collabora.com/tonyk/systemd
+
+* FAQ
+
+ ** "Where's the code for NUMA and FUTEX_8/16?"
+
+ The current code is already complex enough to take some time for
+ review, so I believe it's better to split that work out to a future
+ iteration of this patchset. Besides that, this RFC is the core part of the
+ infrastructure, and the following features will not pose big design
+ changes to it; the work will be more about wiring up the flags and
+ modifying some functions.
+
+ ** "And what about FUTEX_64?"
+
+ By supporting 64 bit futexes, the kernel structure for futex would
+ need to have a 64 bit field for the value, and that could defeat one of
+ the purposes of having different sized futexes in the first place:
+ supporting smaller ones to decrease memory usage. This might be
+ something that could be disabled for 32bit archs (and even for
+ CONFIG_BASE_SMALL).
+
+ Which use cases would benefit from FUTEX_64? Is it worth the trade-offs?
+
+ ** "Where's the PI/robust stuff?"
+
+ As said by Peter Zijlstra at [3], all those new features are related to
+ the "simple" futex interface, that doesn't use PI or robust. Do we want
+ to have this complexity at futex2() and if so, should it be part of
+ this patchset or can it be future work?
+
+Thanks,
+ André
+
Signed-off-by: Jan200101 <sentrycraft123@gmail.com>
---
MAINTAINERS | 2 +-
+ arch/arm/tools/syscall.tbl | 2 +
+ arch/arm64/include/asm/unistd.h | 2 +-
+ arch/arm64/include/asm/unistd32.h | 4 +
arch/x86/entry/syscalls/syscall_32.tbl | 2 +
arch/x86/entry/syscalls/syscall_64.tbl | 2 +
include/linux/syscalls.h | 7 +
include/uapi/asm-generic/unistd.h | 8 +-
- include/uapi/linux/futex.h | 40 ++
+ include/uapi/linux/futex.h | 56 ++
init/Kconfig | 7 +
kernel/Makefile | 1 +
- kernel/futex2.c | 484 ++++++++++++++++++
+ kernel/futex2.c | 625 ++++++++++++++++++
kernel/sys_ni.c | 4 +
- tools/include/uapi/asm-generic/unistd.h | 9 +-
+ tools/include/uapi/asm-generic/unistd.h | 8 +-
.../arch/x86/entry/syscalls/syscall_64.tbl | 2 +
- 12 files changed, 565 insertions(+), 3 deletions(-)
+ 15 files changed, 728 insertions(+), 4 deletions(-)
create mode 100644 kernel/futex2.c
diff --git a/MAINTAINERS b/MAINTAINERS
-index 2daa6ee67..855d38511 100644
+index bfc1b86e3..86ed91b72 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
-@@ -7259,7 +7259,7 @@ F: Documentation/locking/*futex*
+@@ -7332,7 +7332,7 @@ F: Documentation/locking/*futex*
F: include/asm-generic/futex.h
F: include/linux/futex.h
F: include/uapi/linux/futex.h
@@ -40,72 +317,110 @@ index 2daa6ee67..855d38511 100644
F: tools/perf/bench/futex*
F: tools/testing/selftests/futex/
+diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
+index 20e1170e2..4eef220cd 100644
+--- a/arch/arm/tools/syscall.tbl
++++ b/arch/arm/tools/syscall.tbl
+@@ -455,3 +455,5 @@
+ 439 common faccessat2 sys_faccessat2
+ 440 common process_madvise sys_process_madvise
+ 441 common epoll_pwait2 sys_epoll_pwait2
++442 common futex_wait sys_futex_wait
++443 common futex_wake sys_futex_wake
+diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
+index 86a9d7b3e..d1f7d35f9 100644
+--- a/arch/arm64/include/asm/unistd.h
++++ b/arch/arm64/include/asm/unistd.h
+@@ -38,7 +38,7 @@
+ #define __ARM_NR_compat_set_tls (__ARM_NR_COMPAT_BASE + 5)
+ #define __ARM_NR_COMPAT_END (__ARM_NR_COMPAT_BASE + 0x800)
+
+-#define __NR_compat_syscalls 442
++#define __NR_compat_syscalls 444
+ #endif
+
+ #define __ARCH_WANT_SYS_CLONE
+diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
+index cccfbbefb..2db1529b2 100644
+--- a/arch/arm64/include/asm/unistd32.h
++++ b/arch/arm64/include/asm/unistd32.h
+@@ -891,6 +891,10 @@ __SYSCALL(__NR_faccessat2, sys_faccessat2)
+ __SYSCALL(__NR_process_madvise, sys_process_madvise)
+ #define __NR_epoll_pwait2 441
+ __SYSCALL(__NR_epoll_pwait2, compat_sys_epoll_pwait2)
++#define __NR_futex_wait 442
++__SYSCALL(__NR_futex_wait, sys_futex_wait)
++#define __NR_futex_wake 443
++__SYSCALL(__NR_futex_wake, sys_futex_wake)
+
+ /*
+ * Please add new compat syscalls above this comment and update
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
-index 0d0667a9f..83a75ff39 100644
+index 874aeacde..ece90c8d9 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
-@@ -445,3 +445,5 @@
- 438 i386 pidfd_getfd sys_pidfd_getfd
+@@ -446,3 +446,5 @@
439 i386 faccessat2 sys_faccessat2
440 i386 process_madvise sys_process_madvise
-+441 i386 futex_wait sys_futex_wait
-+442 i386 futex_wake sys_futex_wake
+ 441 i386 epoll_pwait2 sys_epoll_pwait2 compat_sys_epoll_pwait2
++442 i386 futex_wait sys_futex_wait
++443 i386 futex_wake sys_futex_wake
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
-index 379819244..6658fd63c 100644
+index 78672124d..72fb65ef9 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
-@@ -362,6 +362,8 @@
- 438 common pidfd_getfd sys_pidfd_getfd
+@@ -363,6 +363,8 @@
439 common faccessat2 sys_faccessat2
440 common process_madvise sys_process_madvise
-+441 common futex_wait sys_futex_wait
-+442 common futex_wake sys_futex_wake
+ 441 common epoll_pwait2 sys_epoll_pwait2
++442 common futex_wait sys_futex_wait
++443 common futex_wake sys_futex_wake
#
# Due to a historical design error, certain syscalls are numbered differently
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
-index 37bea07c1..b6b77cf2b 100644
+index 7688bc983..bf146c2b0 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
-@@ -589,6 +589,13 @@ asmlinkage long sys_get_robust_list(int pid,
+@@ -618,6 +618,13 @@ asmlinkage long sys_get_robust_list(int pid,
asmlinkage long sys_set_robust_list(struct robust_list_head __user *head,
size_t len);
+/* kernel/futex2.c */
-+asmlinkage long sys_futex_wait(void __user *uaddr, unsigned long val,
-+ unsigned long flags,
++asmlinkage long sys_futex_wait(void __user *uaddr, unsigned int val,
++ unsigned int flags,
+ struct __kernel_timespec __user *timo);
-+asmlinkage long sys_futex_wake(void __user *uaddr, unsigned long nr_wake,
-+ unsigned long flags);
++asmlinkage long sys_futex_wake(void __user *uaddr, unsigned int nr_wake,
++ unsigned int flags);
+
/* kernel/hrtimer.c */
asmlinkage long sys_nanosleep(struct __kernel_timespec __user *rqtp,
struct __kernel_timespec __user *rmtp);
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
-index 205631898..ae47d6a9e 100644
+index 728752917..57e19200f 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
-@@ -860,8 +860,14 @@ __SYSCALL(__NR_faccessat2, sys_faccessat2)
- #define __NR_process_madvise 440
- __SYSCALL(__NR_process_madvise, sys_process_madvise)
+@@ -862,8 +862,14 @@ __SYSCALL(__NR_process_madvise, sys_process_madvise)
+ #define __NR_epoll_pwait2 441
+ __SC_COMP(__NR_epoll_pwait2, sys_epoll_pwait2, compat_sys_epoll_pwait2)
-+#define __NR_futex_wait 441
++#define __NR_futex_wait 442
+__SYSCALL(__NR_futex_wait, sys_futex_wait)
+
-+#define __NR_futex_wake 442
++#define __NR_futex_wake 443
+__SYSCALL(__NR_futex_wake, sys_futex_wake)
+
#undef __NR_syscalls
--#define __NR_syscalls 441
-+#define __NR_syscalls 443
+-#define __NR_syscalls 442
++#define __NR_syscalls 444
/*
* 32 bit systems traditionally used different
diff --git a/include/uapi/linux/futex.h b/include/uapi/linux/futex.h
-index a89eb0acc..35a5bf1cd 100644
+index a89eb0acc..9fbdaaf4f 100644
--- a/include/uapi/linux/futex.h
+++ b/include/uapi/linux/futex.h
-@@ -41,6 +41,46 @@
+@@ -41,6 +41,62 @@
#define FUTEX_CMP_REQUEUE_PI_PRIVATE (FUTEX_CMP_REQUEUE_PI | \
FUTEX_PRIVATE_FLAG)
@@ -120,7 +435,7 @@ index a89eb0acc..35a5bf1cd 100644
+
+#define FUTEX_NUMA_FLAG 16
+
-+/*
++/**
+ * struct futexXX_numa - struct for NUMA-aware futex operation
+ * @value: futex value
+ * @hint: node id to operate
@@ -128,35 +443,51 @@ index a89eb0acc..35a5bf1cd 100644
+
+struct futex8_numa {
+ __u8 value;
-+ __u8 hint;
++ __s8 hint;
+};
+
+struct futex16_numa {
+ __u16 value;
-+ __u16 hint;
++ __s16 hint;
+};
+
+struct futex32_numa {
+ __u32 value;
-+ __u32 hint;
++ __s32 hint;
+};
+
+#define FUTEX_WAITV_MAX 128
+
++/**
++ * struct futex_waitv - A waiter for vectorized wait
++ * @uaddr: User address to wait on
++ * @val: Expected value at uaddr
++ * @flags: Flags for this waiter
++ */
+struct futex_waitv {
+ void *uaddr;
+ unsigned int val;
+ unsigned int flags;
+};
+
++/**
++ * struct futex_requeue - Define an address and its flags for requeue operation
++ * @uaddr: User address of one of the requeue arguments
++ * @flags: Flags for this address
++ */
++struct futex_requeue {
++ void *uaddr;
++ unsigned int flags;
++};
++
/*
* Support for robust futexes: the kernel cleans up held futexes at
* thread exit time.
diff --git a/init/Kconfig b/init/Kconfig
-index 02d13ae27..1264687ea 100644
+index 29ad68325..c3e62e1b1 100644
--- a/init/Kconfig
+++ b/init/Kconfig
-@@ -1522,6 +1522,13 @@ config FUTEX
+@@ -1531,6 +1531,13 @@ config FUTEX
support for "fast userspace mutexes". The resulting kernel may not
run glibc-based applications correctly.
@@ -165,16 +496,16 @@ index 02d13ae27..1264687ea 100644
+ depends on FUTEX
+ default y
+ help
-+ Experimental support for futex2 interface.
++ Support for futex2 interface.
+
config FUTEX_PI
bool
depends on FUTEX && RT_MUTEXES
diff --git a/kernel/Makefile b/kernel/Makefile
-index af601b9bd..bb7f33986 100644
+index aa7368c7e..afbe15e51 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
-@@ -54,6 +54,7 @@ obj-$(CONFIG_PROFILING) += profile.o
+@@ -57,6 +57,7 @@ obj-$(CONFIG_PROFILING) += profile.o
obj-$(CONFIG_STACKTRACE) += stacktrace.o
obj-y += time/
obj-$(CONFIG_FUTEX) += futex.o
@@ -184,30 +515,49 @@ index af601b9bd..bb7f33986 100644
ifneq ($(CONFIG_SMP),y)
diff --git a/kernel/futex2.c b/kernel/futex2.c
new file mode 100644
-index 000000000..107b80a46
+index 000000000..802578ad6
--- /dev/null
+++ b/kernel/futex2.c
-@@ -0,0 +1,484 @@
+@@ -0,0 +1,625 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * futex2 system call interface by André Almeida <andrealmeid@collabora.com>
+ *
-+ * Copyright 2020 Collabora Ltd.
++ * Copyright 2021 Collabora Ltd.
++ *
++ * Based on original futex implementation by:
++ * (C) 2002 Rusty Russell, IBM
++ * (C) 2003, 2006 Ingo Molnar, Red Hat Inc.
++ * (C) 2003, 2004 Jamie Lokier
++ * (C) 2006 Thomas Gleixner, Timesys Corp.
++ * (C) 2007 Eric Dumazet
++ * (C) 2009 Darren Hart, IBM
+ */
+
+#include <linux/freezer.h>
+#include <linux/jhash.h>
++#include <linux/memblock.h>
+#include <linux/sched/wake_q.h>
+#include <linux/spinlock.h>
+#include <linux/syscalls.h>
-+#include <linux/memblock.h>
+#include <uapi/linux/futex.h>
+
+/**
++ * struct futex_key - Components to build unique key for a futex
++ * @pointer: Pointer to current->mm
++ * @index: Start address of the page containing futex
++ * @offset: Address offset of uaddr in a page
++ */
++struct futex_key {
++ u64 pointer;
++ unsigned long index;
++ unsigned long offset;
++};
++
++/**
+ * struct futex_waiter - List entry for a waiter
-+ * @key.address: Memory address of userspace futex
-+ * @key.mm: Pointer to memory management struct of this process
-+ * @key: Stores information that uniquely identify a futex
++ * @uaddr: Virtual address of userspace futex
++ * @key: Information that uniquely identifies a futex
+ * @list: List node struct
+ * @val: Expected value for this waiter
+ * @flags: Flags
@@ -215,10 +565,8 @@ index 000000000..107b80a46
+ * @index: Index of waiter in futexv list
+ */
+struct futex_waiter {
-+ struct futex_key {
-+ uintptr_t address;
-+ struct mm_struct *mm;
-+ } key;
++ uintptr_t uaddr;
++ struct futex_key key;
+ struct list_head list;
+ unsigned int val;
+ unsigned int flags;
@@ -227,6 +575,18 @@ index 000000000..107b80a46
+};
+
+/**
++ * struct futexv_head - List of futexes to be waited
++ * @task: Task to be awakened
++ * @hint: Was someone on this list awakened?
++ * @objects: List of futexes
++ */
++struct futexv_head {
++ struct task_struct *task;
++ bool hint;
++ struct futex_waiter objects[0];
++};
++
++/**
+ * struct futex_bucket - A bucket of futex's hash table
+ * @waiters: Number of waiters in the bucket
+ * @lock: Bucket lock
@@ -238,30 +598,28 @@ index 000000000..107b80a46
+ struct list_head list;
+};
+
-+struct futexv {
-+ struct task_struct *task;
-+ int hint;
-+ struct futex_waiter objects[0];
-+};
-+
++/**
++ * struct futex_single_waiter - Wrapper for a futexv_head of one element
++ * @futexv: Single futexv element
++ * @waiter: Single waiter element
++ */
+struct futex_single_waiter {
-+ struct futexv parent;
++ struct futexv_head futexv;
+ struct futex_waiter waiter;
+} __packed;
+
-+struct futex_bucket *futex_table;
-+
-+/* mask for futex2 flag operations */
++/* Mask for futex2 flag operations */
+#define FUTEX2_MASK (FUTEX_SIZE_MASK | FUTEX_SHARED_FLAG | \
+ FUTEX_CLOCK_REALTIME)
+
-+// mask for sys_futex_waitv
++/* Mask for sys_futex_waitv flag */
+#define FUTEXV_MASK (FUTEX_CLOCK_REALTIME)
+
-+// mask for each futex in futex_waitv list
++/* Mask for each futex in futex_waitv list */
+#define FUTEXV_WAITER_MASK (FUTEX_SIZE_MASK | FUTEX_SHARED_FLAG)
+
-+int futex2_hashsize;
++struct futex_bucket *futex_table;
++unsigned int futex2_hashsize;
+
+/*
+ * Reflects a new waiter being added to the waitqueue.
@@ -271,7 +629,8 @@ index 000000000..107b80a46
+#ifdef CONFIG_SMP
+ atomic_inc(&bucket->waiters);
+ /*
-+ * Full barrier (A), see the ordering comment above.
++ * Issue a barrier after adding so futex_wake() will see that the
++ * value has increased
+ */
+ smp_mb__after_atomic();
+#endif
@@ -295,7 +654,8 @@ index 000000000..107b80a46
+{
+#ifdef CONFIG_SMP
+ /*
-+ * Full barrier (B), see the ordering comment above.
++ * Issue a barrier before reading so we get an updated value from
++ * futex_wait()
+ */
+ smp_mb();
+ return atomic_read(&bucket->waiters);
@@ -315,7 +675,7 @@ index 000000000..107b80a46
+static struct futex_bucket *futex_get_bucket(void __user *uaddr,
+ struct futex_key *key)
+{
-+ uintptr_t address = (uintptr_t) uaddr;
++ uintptr_t address = (uintptr_t)uaddr;
+ u32 hash_key;
+
+ /* Checking if uaddr is valid and accessible */
@@ -324,11 +684,13 @@ index 000000000..107b80a46
+ if (unlikely(!access_ok(address, sizeof(u32))))
+ return ERR_PTR(-EFAULT);
+
-+ key->address = address;
-+ key->mm = current->mm;
++ key->offset = address % PAGE_SIZE;
++ address -= key->offset;
++ key->pointer = (u64)address;
++ key->index = (unsigned long)current->mm;
+
+ /* Generate hash key for this futex using uaddr and current->mm */
-+ hash_key = jhash2((u32 *) key, sizeof(*key) / sizeof(u32), 0);
++ hash_key = jhash2((u32 *)key, sizeof(*key) / sizeof(u32), 0);
+
+ /* Since HASH_SIZE is 2^n, subtracting 1 makes a perfect bit mask */
+ return &futex_table[hash_key & (futex2_hashsize - 1)];
@@ -339,9 +701,9 @@ index 000000000..107b80a46
+ * @uval: variable to store the value
+ * @uaddr: userspace address
+ *
-+ * Check the comment at futex_get_user_val for more information.
++ * Check the comment at futex_enqueue() for more information.
+ */
-+static int futex_get_user(u32 *uval, u32 *uaddr)
++static int futex_get_user(u32 *uval, u32 __user *uaddr)
+{
+ int ret;
+
@@ -353,7 +715,7 @@ index 000000000..107b80a46
+}
+
+/**
-+ * futex_setup_time - Prepare the timeout mechanism, without starting it.
++ * futex_setup_time - Prepare the timeout mechanism and start it.
+ * @timo: Timeout value from userspace
+ * @timeout: Pointer to hrtimer handler
+ * @flags: Flags from userspace, to decide which clockid to use
@@ -381,220 +743,342 @@ index 000000000..107b80a46
+
+ hrtimer_set_expires(&timeout->timer, time);
+
++ hrtimer_sleeper_start_expires(timeout, HRTIMER_MODE_ABS);
++
+ return 0;
+}
+
++/**
++ * futex_dequeue_multiple - Remove multiple futexes from hash table
++ * @futexv: list of waiters
++ * @nr: number of futexes to be removed
++ *
++ * This function is used if (a) something went wrong while enqueuing, and we
++ * need to undo our work (then nr <= nr_futexes) or (b) we woke up, and thus
++ * need to remove every waiter, check if some was indeed woken and return.
++ * Before removing a waiter, we check if it's on the list, since we have no
++ * clue who has been woken.
++ *
++ * Return:
++ * * -1 - If no futex was woken during the removal
++ * * >= 0 - At least one futex was found woken, index of the last one
++ */
++static int futex_dequeue_multiple(struct futexv_head *futexv, unsigned int nr)
++{
++ int i, ret = -1;
++
++ for (i = 0; i < nr; i++) {
++ spin_lock(&futexv->objects[i].bucket->lock);
++ if (!list_empty_careful(&futexv->objects[i].list)) {
++ list_del_init_careful(&futexv->objects[i].list);
++ bucket_dec_waiters(futexv->objects[i].bucket);
++ } else {
++ ret = i;
++ }
++ spin_unlock(&futexv->objects[i].bucket->lock);
++ }
++
++ return ret;
++}
+
+/**
-+ * futex_get_user_value - Get the value from the userspace address and compares
-+ * with the expected one. In success, leaves the function
-+ * holding the bucket lock. Else, hold no lock.
-+ * @bucket: hash bucket of this address
-+ * @uaddr: futex's userspace address
-+ * @val: expected value
-+ * @multiple: is this call in the wait on multiple path
++ * futex_enqueue - Check the value and enqueue a futex on a wait list
+ *
-+ * Return: 0 on success, error code otherwise
++ * @futexv: List of futexes
++ * @nr_futexes: Number of futexes in the list
++ * @awakened: If a futex was awakened during enqueueing, store the index here
++ *
++ * Get the value from the userspace address and compare it with the expected one.
++ *
++ * Getting the value from user futex address:
++ *
++ * Since we are in a hurry, we use a spin lock and we can't sleep.
++ * Try to get the value with page fault disabled (when enabled, we might
++ * sleep).
++ *
++ * If we fail, we aren't sure if the address is invalid or is just a
++ * page fault. Then, release the lock (so we can sleep) and try to get
++ * the value with page fault enabled. In order to trigger a page fault
++ * handling, we just call __get_user() again. If we sleep with enqueued
++ * futexes, we might miss a wake, so dequeue everything before sleeping.
++ *
++ * If get_user succeeds, this means that the address is valid and we do
++ * the work again. Since we just handled the page fault, the page is
++ * likely pinned in memory and we should be luckier this time and be
++ * able to get the value. If we fail anyway, we will try again.
++ *
++ * If even with page faults enabled we get an error, this means that
++ * the address is not valid and we return from the syscall.
++ *
++ * If we got an unexpected value or need to treat a page fault and realized that
++ * a futex was awakened, we can prioritize this and return success.
++ *
++ * On success, enqueue the futex in the correct bucket
++ *
++ * Return:
++ * * 1 - We were awakened in the process and nothing is enqueued
++ * * 0 - Everything is enqueued and we are ready to sleep
++ * * < 0 - Something went wrong, nothing is enqueued, return error code
+ */
-+static int futex_get_user_value(struct futex_bucket *bucket, u32 __user *uaddr,
-+ unsigned int val, bool multiple)
++static int futex_enqueue(struct futexv_head *futexv, unsigned int nr_futexes,
++ int *awakened)
+{
-+ u32 uval;
-+ int ret;
++ int i, ret;
++ u32 uval, *uaddr, val;
++ struct futex_bucket *bucket;
+
-+ /*
-+ * Get the value from user futex address.
-+ *
-+ * Since we are in a hurry, we use a spin lock and we can't sleep.
-+ * Try to get the value with page fault disabled (when enable, we might
-+ * sleep).
-+ *
-+ * If we fail, we aren't sure if the address is invalid or is just a
-+ * page fault. Then, release the lock (so we can sleep) and try to get
-+ * the value with page fault enabled. In order to trigger a page fault
-+ * handling, we just call __get_user() again.
-+ *
-+ * If get_user succeeds, this mean that the address is valid and we do
-+ * the loop again. Since we just handled the page fault, the page is
-+ * likely pinned in memory and we should be luckier this time and be
-+ * able to get the value. If we fail anyway, we will try again.
-+ *
-+ * If even with page faults enabled we get and error, this means that
-+ * the address is not valid and we return from the syscall.
-+ */
-+ do {
++retry:
++ set_current_state(TASK_INTERRUPTIBLE);
++
++ for (i = 0; i < nr_futexes; i++) {
++ uaddr = (u32 __user *)futexv->objects[i].uaddr;
++ val = (u32)futexv->objects[i].val;
++
++ bucket = futexv->objects[i].bucket;
++
++ bucket_inc_waiters(bucket);
+ spin_lock(&bucket->lock);
+
+ ret = futex_get_user(&uval, uaddr);
+
-+ if (ret) {
++ if (unlikely(ret)) {
+ spin_unlock(&bucket->lock);
-+ if (multiple || __get_user(uval, uaddr))
++
++ bucket_dec_waiters(bucket);
++ __set_current_state(TASK_RUNNING);
++ *awakened = futex_dequeue_multiple(futexv, i);
++
++ if (__get_user(uval, uaddr))
+ return -EFAULT;
+
++ if (*awakened >= 0)
++ return 1;
++
++ goto retry;
+ }
-+ } while (ret);
+
-+ if (uval != val) {
++ if (uval != val) {
++ spin_unlock(&bucket->lock);
++
++ bucket_dec_waiters(bucket);
++ __set_current_state(TASK_RUNNING);
++ *awakened = futex_dequeue_multiple(futexv, i);
++
++ if (*awakened >= 0)
++ return 1;
++
++ return -EAGAIN;
++ }
++
++ list_add_tail(&futexv->objects[i].list, &bucket->list);
+ spin_unlock(&bucket->lock);
-+ return -EWOULDBLOCK;
+ }
+
+ return 0;
+}
+
+/**
-+ * futex_dequeue - Remove a futex from a queue
-+ * @bucket: current bucket holding the futex
-+ * @waiter: futex to be removed
-+ *
-+ * Return: True if futex was removed by this function, false if another wake
-+ * thread removed this futex.
++ * __futex_wait - Enqueue the list of futexes and wait to be woken
++ * @futexv: List of futexes to wait
++ * @nr_futexes: Length of futexv
++ * @timeout: Pointer to timeout handler
+ *
-+ * This function should be used after we found that this futex was in a queue.
-+ * Thus, it needs to be removed before the next step. However, someone could
-+ * wake it between the time of the first check and the time to get the lock for
-+ * the bucket. Check one more time if the futex is there with the bucket locked.
-+ * If it's there, just remove it and return true. Else, mark the removal as
-+ * false and do nothing.
++ * Return:
++ * * >= 0 - Hint of which futex woke us
++ * * < 0 - Error code
+ */
-+static bool futex_dequeue(struct futex_bucket *bucket, struct futex_waiter *waiter)
++static int __futex_wait(struct futexv_head *futexv, unsigned int nr_futexes,
++ struct hrtimer_sleeper *timeout)
+{
-+ bool removed = true;
++ int ret;
+
-+ spin_lock(&bucket->lock);
-+ if (list_empty(&waiter->list))
-+ removed = false;
-+ else
-+ list_del(&waiter->list);
-+ spin_unlock(&bucket->lock);
++ while (1) {
++ int awakened = -1;
++
++ ret = futex_enqueue(futexv, nr_futexes, &awakened);
++
++ if (ret) {
++ if (awakened >= 0)
++ return awakened;
++ return ret;
++ }
++
++ /* Before sleeping, check if someone was woken */
++ if (!futexv->hint && (!timeout || timeout->task))
++ freezable_schedule();
++
++ __set_current_state(TASK_RUNNING);
++
++ /*
++ * One of these things triggered this wake:
++ *
++ * * We have been removed from the bucket. futex_wake() woke
++ * us. We just need to dequeue and return 0 to userspace.
++ *
++ * However, if no futex was dequeued by a futex_wake():
++ *
++ * * If there's a timeout and it has expired,
++ * return -ETIMEDOUT.
++ *
++ * * If there is a signal pending, it has to be handled
++ * first, return -ERESTARTSYS.
++ *
++ * * If there's no signal pending, it was a spurious wake
++ * (scheduler gave us a chance to do some work, even if we
++ * don't want to). We need to remove ourselves from the
++ * bucket and add again, to prevent losing wakeups in the
++ * meantime.
++ */
+
-+ if (removed)
-+ bucket_dec_waiters(bucket);
++ ret = futex_dequeue_multiple(futexv, nr_futexes);
+
-+ return removed;
++ /* Normal wake */
++ if (ret >= 0)
++ return ret;
++
++ if (timeout && !timeout->task)
++ return -ETIMEDOUT;
++
++ if (signal_pending(current))
++ return -ERESTARTSYS;
++
++ /* Spurious wake, do everything again */
++ }
+}
+
+/**
-+ * sys_futex_wait - Wait on a futex address if (*uaddr) == val
-+ * @uaddr: User address of futex
-+ * @val: Expected value of futex
-+ * @flags: Specify the size of futex and the clockid
-+ * @timo: Optional absolute timeout. Supports only 64bit time.
++ * futex_set_timer_and_wait - Set up the timer (if there's one) and wait on a list of futexes
++ * @futexv: List of futexes
++ * @nr_futexes: Length of futexv
++ * @timo: Timeout
++ * @flags: Timeout flags
++ *
++ * Return:
++ * * >= 0 - Hint of which futex woke us
++ * * < 0 - Error code
+ */
-+SYSCALL_DEFINE4(futex_wait, void __user *, uaddr, unsigned int, val,
-+ unsigned int, flags, struct __kernel_timespec __user *, timo)
++static int futex_set_timer_and_wait(struct futexv_head *futexv,
++ unsigned int nr_futexes,
++ struct __kernel_timespec __user *timo,
++ unsigned int flags)
+{
-+ unsigned int size = flags & FUTEX_SIZE_MASK;
+ struct hrtimer_sleeper timeout;
-+ struct futex_bucket *bucket;
-+ struct futex_single_waiter wait_single;
-+ struct futex_waiter *waiter;
+ int ret;
+
-+ wait_single.parent.task = current;
-+ wait_single.parent.hint = 0;
-+ waiter = &wait_single.waiter;
-+ waiter->index = 0;
-+
-+ if (flags & ~FUTEX2_MASK)
-+ return -EINVAL;
-+
-+ if (size != FUTEX_32)
-+ return -EINVAL;
-+
+ if (timo) {
+ ret = futex_setup_time(timo, &timeout, flags);
+ if (ret)
+ return ret;
+ }
+
-+ /* Get an unlocked hash bucket */
-+ bucket = futex_get_bucket(uaddr, &waiter->key);
-+ if (IS_ERR(bucket))
-+ return PTR_ERR(bucket);
++ ret = __futex_wait(futexv, nr_futexes, timo ? &timeout : NULL);
+
+ if (timo)
-+ hrtimer_sleeper_start_expires(&timeout, HRTIMER_MODE_ABS);
-+
-+retry:
-+ bucket_inc_waiters(bucket);
++ hrtimer_cancel(&timeout.timer);
+
-+ /* Compare the expected and current value, get the bucket lock */
-+ ret = futex_get_user_value(bucket, uaddr, val, false);
-+ if (ret) {
-+ bucket_dec_waiters(bucket);
-+ goto out;
-+ }
++ return ret;
++}
+
-+ /* Add the waiter to the hash table and sleep */
-+ set_current_state(TASK_INTERRUPTIBLE);
-+ list_add_tail(&waiter->list, &bucket->list);
-+ spin_unlock(&bucket->lock);
++/**
++ * sys_futex_wait - Wait on a futex address if (*uaddr) == val
++ * @uaddr: User address of futex
++ * @val: Expected value of futex
++ * @flags: Specify the size of futex and the clockid
++ * @timo: Optional absolute timeout.
++ *
++ * The user thread is put to sleep, waiting for a futex_wake() at uaddr, if the
++ * value at *uaddr is the same as val (otherwise, the syscall returns
++ * immediately with -EAGAIN).
++ *
++ * Returns 0 on success, error code otherwise.
++ */
++SYSCALL_DEFINE4(futex_wait, void __user *, uaddr, unsigned int, val,
++ unsigned int, flags, struct __kernel_timespec __user *, timo)
++{
++ unsigned int size = flags & FUTEX_SIZE_MASK;
++ struct futex_single_waiter wait_single = {0};
++ struct futex_waiter *waiter;
++ struct futexv_head *futexv;
+
-+ /* Do not sleep if someone woke this futex or if it was timeouted */
-+ if (!list_empty_careful(&waiter->list) && (!timo || timeout.task))
-+ freezable_schedule();
++ if (flags & ~FUTEX2_MASK)
++ return -EINVAL;
+
-+ __set_current_state(TASK_RUNNING);
++ if (size != FUTEX_32)
++ return -EINVAL;
+
-+ /*
-+ * One of those things triggered this wake:
-+ *
-+ * * We have been removed from the bucket. futex_wake() woke us. We just
-+ * need to return 0 to userspace.
-+ *
-+ * However, if we find ourselves in the bucket we must remove ourselves
-+ * from the bucket and ...
-+ *
-+ * * If the there's a timeout and it has expired, return -ETIMEDOUT.
-+ *
-+ * * If there is a signal pending, something wants to kill our thread.
-+ * Return -ERESTARTSYS.
-+ *
-+ * * If there's no signal pending, it was a spurious wake (scheduler
-+ * gave us a change to do some work, even if we don't want to). We
-+ * need to remove ourselves from the bucket and add again, to prevent
-+ * losing wakeups in the meantime.
-+ */
++ futexv = &wait_single.futexv;
++ futexv->task = current;
++ futexv->hint = false;
+
-+ /* Normal wake */
-+ if (list_empty_careful(&waiter->list))
-+ goto out;
++ waiter = &wait_single.waiter;
++ waiter->index = 0;
++ waiter->val = val;
++ waiter->uaddr = (uintptr_t)uaddr;
+
-+ if (!futex_dequeue(bucket, waiter))
-+ goto out;
++ INIT_LIST_HEAD(&waiter->list);
+
-+ /* Timeout */
-+ if (timo && !timeout.task)
-+ return -ETIMEDOUT;
++ /* Get an unlocked hash bucket */
++ waiter->bucket = futex_get_bucket(uaddr, &waiter->key);
++ if (IS_ERR(waiter->bucket))
++ return PTR_ERR(waiter->bucket);
+
-+ /* Spurious wakeup */
-+ if (!signal_pending(current))
-+ goto retry;
++ return futex_set_timer_and_wait(futexv, 1, timo, flags);
++}
+
-+ /* Some signal is pending */
-+ ret = -ERESTARTSYS;
-+out:
-+ if (timo)
-+ hrtimer_cancel(&timeout.timer);
++/**
++ * futex_get_parent - For a given futex in a futexv list, get a pointer to the futexv
++ * @waiter: Address of futex in the list
++ * @index: Index of futex in the list
++ *
++ * Return: A pointer to its futexv struct
++ */
++static inline struct futexv_head *futex_get_parent(uintptr_t waiter,
++ unsigned int index)
++{
++ uintptr_t parent = waiter - sizeof(struct futexv_head)
++ - (uintptr_t)(index * sizeof(struct futex_waiter));
+
-+ return ret;
++ return (struct futexv_head *)parent;
+}
+
-+static struct futexv *futex_get_parent(uintptr_t waiter, u8 index)
++/**
++ * futex_mark_wake - Find the task to be woken and add it to the wake queue
++ * @waiter: Waiter to be woken
++ * @bucket: Bucket to be decremented
++ * @wake_q: Wake queue to insert the task
++ */
++static void futex_mark_wake(struct futex_waiter *waiter,
++ struct futex_bucket *bucket,
++ struct wake_q_head *wake_q)
+{
-+ uintptr_t parent = waiter - sizeof(struct futexv)
-+ - (uintptr_t) (index * sizeof(struct futex_waiter));
++ struct task_struct *task;
++ struct futexv_head *parent = futex_get_parent((uintptr_t)waiter,
++ waiter->index);
++
++ parent->hint = true;
++ task = parent->task;
++ get_task_struct(task);
++ list_del_init_careful(&waiter->list);
++ wake_q_add_safe(wake_q, task);
++ bucket_dec_waiters(bucket);
++}
+
-+ return (struct futexv *) parent;
++static inline bool futex_match(struct futex_key key1, struct futex_key key2)
++{
++ return (key1.index == key2.index &&
++ key1.pointer == key2.pointer &&
++ key1.offset == key2.offset);
+}
+
+/**
+ * sys_futex_wake - Wake a number of futexes waiting on an address
+ * @uaddr: Address of futex to be woken up
-+ * @nr_wake: Number of futexes to be woken up
-+ * @flags: TODO
++ * @nr_wake: Number of futexes waiting on uaddr to be woken up
++ * @flags: Flags for size and shared
++ *
++ * Wake `nr_wake` threads waiting at uaddr.
++ *
++ * Returns the number of woken threads on success, error code otherwise.
+ */
+SYSCALL_DEFINE3(futex_wake, void __user *, uaddr, unsigned int, nr_wake,
+ unsigned int, flags)
@@ -602,7 +1086,6 @@ index 000000000..107b80a46
+ unsigned int size = flags & FUTEX_SIZE_MASK;
+ struct futex_waiter waiter, *aux, *tmp;
+ struct futex_bucket *bucket;
-+ struct task_struct *task;
+ DEFINE_WAKE_Q(wake_q);
+ int ret = 0;
+
@@ -616,26 +1099,15 @@ index 000000000..107b80a46
+ if (IS_ERR(bucket))
+ return PTR_ERR(bucket);
+
-+ if (!bucket_get_waiters(bucket))
++ if (!bucket_get_waiters(bucket) || !nr_wake)
+ return 0;
+
+ spin_lock(&bucket->lock);
+ list_for_each_entry_safe(aux, tmp, &bucket->list, list) {
-+ if (ret >= nr_wake)
-+ break;
-+
-+ if (waiter.key.address == aux->key.address &&
-+ waiter.key.mm == aux->key.mm) {
-+ struct futexv *parent =
-+ futex_get_parent((uintptr_t) aux, aux->index);
-+
-+ parent->hint = 1;
-+ task = parent->task;
-+ get_task_struct(task);
-+ list_del_init_careful(&aux->list);
-+ wake_q_add_safe(&wake_q, task);
-+ ret++;
-+ bucket_dec_waiters(bucket);
++ if (futex_match(waiter.key, aux->key)) {
++ futex_mark_wake(aux, bucket, &wake_q);
++ if (++ret >= nr_wake)
++ break;
+ }
+ }
+ spin_unlock(&bucket->lock);
@@ -673,10 +1145,10 @@ index 000000000..107b80a46
+}
+core_initcall(futex2_init);
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
-index f27ac94d5..35ff743b1 100644
+index 19aa80689..27ef83ca8 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
-@@ -148,6 +148,10 @@ COND_SYSCALL_COMPAT(set_robust_list);
+@@ -150,6 +150,10 @@ COND_SYSCALL_COMPAT(set_robust_list);
COND_SYSCALL(get_robust_list);
COND_SYSCALL_COMPAT(get_robust_list);
@@ -688,610 +1160,761 @@ index f27ac94d5..35ff743b1 100644
/* kernel/itimer.c */
diff --git a/tools/include/uapi/asm-generic/unistd.h b/tools/include/uapi/asm-generic/unistd.h
-index 205631898..cd79f94e0 100644
+index 728752917..57e19200f 100644
--- a/tools/include/uapi/asm-generic/unistd.h
+++ b/tools/include/uapi/asm-generic/unistd.h
-@@ -860,8 +860,15 @@ __SYSCALL(__NR_faccessat2, sys_faccessat2)
- #define __NR_process_madvise 440
- __SYSCALL(__NR_process_madvise, sys_process_madvise)
+@@ -862,8 +862,14 @@ __SYSCALL(__NR_process_madvise, sys_process_madvise)
+ #define __NR_epoll_pwait2 441
+ __SC_COMP(__NR_epoll_pwait2, sys_epoll_pwait2, compat_sys_epoll_pwait2)
-+#define __NR_futex_wait 441
++#define __NR_futex_wait 442
+__SYSCALL(__NR_futex_wait, sys_futex_wait)
+
-+#define __NR_futex_wake 442
++#define __NR_futex_wake 443
+__SYSCALL(__NR_futex_wake, sys_futex_wake)
+
#undef __NR_syscalls
--#define __NR_syscalls 441
-+#define __NR_syscalls 443
-+
+-#define __NR_syscalls 442
++#define __NR_syscalls 444
/*
* 32 bit systems traditionally used different
diff --git a/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl b/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl
-index 379819244..47de3bf93 100644
+index 78672124d..15d2b89b6 100644
--- a/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl
-@@ -362,6 +362,8 @@
- 438 common pidfd_getfd sys_pidfd_getfd
+@@ -363,6 +363,8 @@
439 common faccessat2 sys_faccessat2
440 common process_madvise sys_process_madvise
-+441 common futex_wait sys_futex_wait
-+442 common futex_wake sys_futex_wake
+ 441 common epoll_pwait2 sys_epoll_pwait2
++442 common futex_wait sys_futex_wait
++443 common futex_wake sys_futex_wake
#
# Due to a historical design error, certain syscalls are numbered differently
--
-2.29.2
+2.30.2
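Note on futex_get_parent() in the patch above: the wake path recovers the
futexv_head that owns a waiter purely by pointer arithmetic, stepping back
over `index` waiters and then over the head itself. What follows is a minimal
userspace sketch of that idea, not the kernel code. The head_model and
waiter_model structs and both helpers are hypothetical stand-ins, and the
sketch uses offsetof() on the flexible array instead of sizeof() of the head,
which is an assumption made here for portability.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct futex_waiter (fields are assumptions). */
struct waiter_model {
	uintptr_t uaddr;
	unsigned int val;
	unsigned int index;	/* position of this waiter inside objects[] */
};

/* Hypothetical stand-in for struct futexv_head. */
struct head_model {
	void *task;
	int hint;
	struct waiter_model objects[];	/* waiters follow the head in memory */
};

/* Allocate a head followed by nr waiters, each one knowing its own index. */
struct head_model *head_alloc(unsigned int nr)
{
	struct head_model *h;

	h = calloc(1, sizeof(*h) + nr * sizeof(h->objects[0]));
	if (!h)
		return NULL;

	for (unsigned int i = 0; i < nr; i++)
		h->objects[i].index = i;

	return h;
}

/* Walk back from a waiter to the head that contains it. */
struct head_model *get_parent(struct waiter_model *w)
{
	uintptr_t addr;

	assert(w != NULL);

	addr = (uintptr_t)w
		- offsetof(struct head_model, objects)
		- (uintptr_t)w->index * sizeof(struct waiter_model);

	return (struct head_model *)addr;
}
```

For a head obtained from head_alloc(4), get_parent(&h->objects[i]) gives back
h for every i, which is why each waiter has to carry its own index.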
-From d71973d99efb1e2fd2542ea4d4b45b0e03e45b9c Mon Sep 17 00:00:00 2001
+From ea4e3d7ee8dc965fbe3cabd753b88ada23cecb39 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Andr=C3=A9=20Almeida?= <andrealmeid@collabora.com>
-Date: Thu, 15 Oct 2020 17:15:57 -0300
-Subject: [PATCH 2/9] futex2: Add suport for vectorized wait
+Date: Fri, 5 Feb 2021 10:34:01 -0300
+Subject: [PATCH 02/13] futex2: Add support for shared futexes
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
-Add support to wait on multiple futexes
+Add support for shared futexes for cross-process resources. This design
+relies on the same approach used by the old futex interface to create a
+unique ID for file-backed shared memory, by using a counter in struct inode.
+
+There are two types of futexes: private and shared ones. Private futexes
+are meant to be used by threads that share the same memory space; they
+are easier to identify uniquely and thus allow some performance
+optimizations. The elements for identifying one are: the start address of
+the page where the address is, the address offset within the page and
+the current->mm pointer.
+
+To uniquely identify a shared futex:
+
+- If the page containing the user address is an anonymous page, we can
+ just use the same data used for private futexes (the start address of
+ the page, the address offset within the page and the current->mm
+ pointer), which is enough to uniquely identify such a futex. We
+ also set one bit in the key to distinguish it from a private futex
+ used on the same address (mixing shared and private calls is not
+ allowed).
+
+- If the page is file-backed, current->mm may not be the same for
+ every user of this futex, so we need to use other data: the
+ page->index, a UUID for the struct inode and the offset within the
+ page.
+
+Note that members of futex_key don't have any particular meaning after
+they are part of the struct - they are just bytes to identify a futex.
+Given that, we don't need to use a particular name or type that matches
+the original data, we only need to care about the bitsize of each
+component and make both private and shared data fit in the same memory
+space.
Signed-off-by: André Almeida <andrealmeid@collabora.com>
Signed-off-by: Jan200101 <sentrycraft123@gmail.com>
---
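The key layout described in the commit message can be sketched in userspace
as follows. This is only an illustration under stated assumptions: the
key_model struct, its field widths, the 4096-byte page size and all four
helpers are hypothetical. The two flag values mirror FUT_OFF_INODE and
FUT_OFF_MMSHARED from this patch; setting FUT_OFF_INODE for file-backed keys
is inferred from that macro's comment and is not shown in the hunks here.
The flags fit in the low bits of the offset because a 32-bit futex is
4-byte aligned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE		4096u	/* assumption for this sketch */
#define FUT_OFF_INODE		1u	/* bit 0: key refers to an inode */
#define FUT_OFF_MMSHARED	2u	/* bit 1: key refers to an mm, shared */

/* Hypothetical model of struct futex_key; the members are just bytes. */
struct key_model {
	uint64_t pointer;	/* mm pointer, or inode UUID if file-backed */
	uint64_t index;		/* page start address, or page index in file */
	uint32_t offset;	/* offset within the page, plus flag bits */
};

/* Private futex: mm pointer, page start address, offset within the page. */
struct key_model key_private(uint64_t mm, uint64_t uaddr)
{
	struct key_model k;

	assert(uaddr % sizeof(uint32_t) == 0);	/* low two bits stay free */

	k.pointer = mm;
	k.index = uaddr & ~(uint64_t)(MODEL_PAGE_SIZE - 1);
	k.offset = (uint32_t)(uaddr % MODEL_PAGE_SIZE);
	return k;
}

/* Shared futex on an anonymous page: same data plus one flag bit. */
struct key_model key_shared_anon(uint64_t mm, uint64_t uaddr)
{
	struct key_model k = key_private(mm, uaddr);

	k.offset |= FUT_OFF_MMSHARED;
	return k;
}

/* Shared futex on a file-backed page: inode UUID, page index, offset. */
struct key_model key_shared_file(uint64_t inode_uuid, uint64_t page_index,
				 uint64_t uaddr)
{
	struct key_model k;

	assert(uaddr % sizeof(uint32_t) == 0);

	k.pointer = inode_uuid;
	k.index = page_index;
	k.offset = (uint32_t)(uaddr % MODEL_PAGE_SIZE) | FUT_OFF_INODE;
	return k;
}

/* Two keys name the same futex only if every component is equal. */
bool key_match(struct key_model a, struct key_model b)
{
	return a.index == b.index &&
	       a.pointer == b.pointer &&
	       a.offset == b.offset;
}
```

In this model a private key and a shared anonymous key built from the same
address never match, while two processes mapping the same file page at
different virtual addresses end up with equal keys.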
- arch/x86/entry/syscalls/syscall_32.tbl | 1 +
- arch/x86/entry/syscalls/syscall_64.tbl | 1 +
- include/uapi/asm-generic/unistd.h | 5 +-
- kernel/futex2.c | 430 ++++++++++++------
- kernel/sys_ni.c | 1 +
- tools/include/uapi/asm-generic/unistd.h | 5 +-
- .../arch/x86/entry/syscalls/syscall_64.tbl | 1 +
- 7 files changed, 309 insertions(+), 135 deletions(-)
+ fs/inode.c | 1 +
+ include/linux/fs.h | 1 +
+ kernel/futex2.c | 220 +++++++++++++++++++++++++++++++++++++++++++--
+ 3 files changed, 217 insertions(+), 5 deletions(-)
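The per-inode counter mentioned in the commit message is assigned lazily in
futex_get_inode_uuid(): an ID is taken from a global counter and installed
with a compare-and-exchange, so concurrent callers agree on a single value.
A rough userspace analogue using C11 atomics is sketched below. The
inode_model struct, the global_seq counter and get_uuid() are hypothetical
names, and the sketch uses the default sequentially consistent ordering
rather than the relaxed cmpxchg used in the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the i_sequence2 field; 0 means "no ID yet". */
struct inode_model {
	_Atomic uint64_t seq;
};

/* Global source of IDs, shared by all inodes in this model. */
static _Atomic uint64_t global_seq;

uint64_t get_uuid(struct inode_model *inode)
{
	uint64_t old = atomic_load(&inode->seq);

	/* Fast path: the inode already has an ID. */
	if (old)
		return old;

	for (;;) {
		uint64_t expected = 0;
		uint64_t new_id = atomic_fetch_add(&global_seq, 1) + 1;

		/* 0 is reserved as the "unset" marker, skip it on wrap. */
		if (new_id == 0)
			continue;

		/* Install new_id only if nobody else got there first. */
		if (atomic_compare_exchange_strong(&inode->seq, &expected,
						   new_id))
			return new_id;

		/* Lost the race: expected now holds the winner's ID. */
		return expected;
	}
}
```

Calling get_uuid() twice on the same inode_model returns the same non-zero
value, and two distinct objects receive distinct values.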
-diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
-index 83a75ff39..65734d5e1 100644
---- a/arch/x86/entry/syscalls/syscall_32.tbl
-+++ b/arch/x86/entry/syscalls/syscall_32.tbl
-@@ -447,3 +447,4 @@
- 440 i386 process_madvise sys_process_madvise
- 441 i386 futex_wait sys_futex_wait
- 442 i386 futex_wake sys_futex_wake
-+443 i386 futex_waitv sys_futex_waitv
-diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
-index 6658fd63c..f30811b56 100644
---- a/arch/x86/entry/syscalls/syscall_64.tbl
-+++ b/arch/x86/entry/syscalls/syscall_64.tbl
-@@ -364,6 +364,7 @@
- 440 common process_madvise sys_process_madvise
- 441 common futex_wait sys_futex_wait
- 442 common futex_wake sys_futex_wake
-+443 common futex_waitv sys_futex_waitv
-
- #
- # Due to a historical design error, certain syscalls are numbered differently
-diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
-index ae47d6a9e..81a90b697 100644
---- a/include/uapi/asm-generic/unistd.h
-+++ b/include/uapi/asm-generic/unistd.h
-@@ -866,8 +866,11 @@ __SYSCALL(__NR_futex_wait, sys_futex_wait)
- #define __NR_futex_wake 442
- __SYSCALL(__NR_futex_wake, sys_futex_wake)
-
-+#define __NR_futex_waitv 443
-+__SYSCALL(__NR_futex_waitv, sys_futex_waitv)
-+
- #undef __NR_syscalls
--#define __NR_syscalls 443
-+#define __NR_syscalls 444
-
- /*
- * 32 bit systems traditionally used different
+diff --git a/fs/inode.c b/fs/inode.c
+index 6442d97d9..886fe11cc 100644
+--- a/fs/inode.c
++++ b/fs/inode.c
+@@ -139,6 +139,7 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
+ inode->i_blkbits = sb->s_blocksize_bits;
+ inode->i_flags = 0;
+ atomic64_set(&inode->i_sequence, 0);
++ atomic64_set(&inode->i_sequence2, 0);
+ atomic_set(&inode->i_count, 1);
+ inode->i_op = &empty_iops;
+ inode->i_fop = &no_open_fops;
+diff --git a/include/linux/fs.h b/include/linux/fs.h
+index fd47deea7..516bda982 100644
+--- a/include/linux/fs.h
++++ b/include/linux/fs.h
+@@ -681,6 +681,7 @@ struct inode {
+ };
+ atomic64_t i_version;
+ atomic64_t i_sequence; /* see futex */
++ atomic64_t i_sequence2; /* see futex2 */
+ atomic_t i_count;
+ atomic_t i_dio_count;
+ atomic_t i_writecount;
diff --git a/kernel/futex2.c b/kernel/futex2.c
-index 107b80a46..4b782b5ef 100644
+index 802578ad6..27767b2d0 100644
--- a/kernel/futex2.c
+++ b/kernel/futex2.c
-@@ -48,14 +48,25 @@ struct futex_bucket {
- struct list_head list;
- };
+@@ -14,8 +14,10 @@
+ */
-+/**
-+ * struct futexv - List of futexes to be waited
-+ * @task: Task to be awaken
-+ * @hint: Was someone on this list awaken?
-+ * @objects: List of futexes
-+ */
- struct futexv {
- struct task_struct *task;
-- int hint;
-+ bool hint;
- struct futex_waiter objects[0];
- };
+ #include <linux/freezer.h>
++#include <linux/hugetlb.h>
+ #include <linux/jhash.h>
+ #include <linux/memblock.h>
++#include <linux/pagemap.h>
+ #include <linux/sched/wake_q.h>
+ #include <linux/spinlock.h>
+ #include <linux/syscalls.h>
+@@ -23,8 +25,8 @@
-+/**
-+ * struct futex_single_waiter - Wrapper for a futexv of one element
-+ * @futexv: TODO
-+ * @waiter: TODO
-+ */
- struct futex_single_waiter {
-- struct futexv parent;
-+ struct futexv futexv;
- struct futex_waiter waiter;
- } __packed;
-
-@@ -65,10 +76,10 @@ struct futex_bucket *futex_table;
- #define FUTEX2_MASK (FUTEX_SIZE_MASK | FUTEX_SHARED_FLAG | \
- FUTEX_CLOCK_REALTIME)
-
--// mask for sys_futex_waitv
-+/* mask for sys_futex_waitv flag */
- #define FUTEXV_MASK (FUTEX_CLOCK_REALTIME)
-
--// mask for each futex in futex_waitv list
-+/* mask for each futex in futex_waitv list */
+ /**
+ * struct futex_key - Components to build unique key for a futex
+- * @pointer: Pointer to current->mm
+- * @index: Start address of the page containing futex
++ * @pointer: Pointer to current->mm or inode's UUID for file backed futexes
++ * @index: Start address of the page containing futex or index of the page
+ * @offset: Address offset of uaddr in a page
+ */
+ struct futex_key {
+@@ -97,6 +99,11 @@ struct futex_single_waiter {
+ /* Mask for each futex in futex_waitv list */
#define FUTEXV_WAITER_MASK (FUTEX_SIZE_MASK | FUTEX_SHARED_FLAG)
- int futex2_hashsize;
-@@ -151,7 +162,7 @@ static struct futex_bucket *futex_get_bucket(void __user *uaddr,
- *
- * Check the comment at futex_get_user_val for more information.
- */
--static int futex_get_user(u32 *uval, u32 *uaddr)
-+static int futex_get_user(u32 *uval, u32 __user *uaddr)
- {
- int ret;
++#define is_object_shared ((futexv->objects[i].flags & FUTEX_SHARED_FLAG) ? true : false)
++
++#define FUT_OFF_INODE 1 /* We set bit 0 if key has a reference on inode */
++#define FUT_OFF_MMSHARED 2 /* We set bit 1 if key has a reference on mm */
++
+ struct futex_bucket *futex_table;
+ unsigned int futex2_hashsize;
-@@ -194,95 +205,227 @@ static int futex_setup_time(struct __kernel_timespec __user *timo,
- return 0;
+@@ -143,16 +150,200 @@ static inline int bucket_get_waiters(struct futex_bucket *bucket)
+ #endif
}
+/**
-+ * futex_dequeue_multiple - Remove multiple futexes from hash table
-+ * @futexv: list of waiters
-+ * @nr: number of futexes to be removed
-+ *
-+ * This function should be used after we found that this futex was in a queue.
-+ * Thus, it needs to be removed before the next step. However, someone could
-+ * wake it between the time of the first check and the time to get the lock for
-+ * the bucket. Check one more time if the futex is there with the bucket locked.
-+ * If it's there, just remove it and return true. Else, mark the removal as
-+ * false and do nothing.
-+ *
-+ * Return:
-+ * * -1 if no futex was woken during the removal
-+ * * =< 0 at least one futex was found woken, index of the last one
-+ */
-+static int futex_dequeue_multiple(struct futexv *futexv, unsigned int nr)
-+{
-+ int i, ret = -1;
-+
-+ for (i = 0; i < nr; i++) {
-+ spin_lock(&futexv->objects[i].bucket->lock);
-+ if (!list_empty_careful(&futexv->objects[i].list)) {
-+ list_del_init_careful(&futexv->objects[i].list);
-+ bucket_dec_waiters(futexv->objects[i].bucket);
-+ } else {
-+ ret = i;
-+ }
-+ spin_unlock(&futexv->objects[i].bucket->lock);
-+ }
-+
-+ return ret;
-+}
-
- /**
-- * futex_get_user_value - Get the value from the userspace address and compares
-- * with the expected one. In success, leaves the function
-- * holding the bucket lock. Else, hold no lock.
-- * @bucket: hash bucket of this address
-- * @uaddr: futex's userspace address
-- * @val: expected value
-- * @multiple: is this call in the wait on multiple path
-+ * futex_enqueue - Check the value and enqueue a futex on a wait list
++ * futex_get_inode_uuid - Get a UUID for an inode
++ * @inode: inode to get UUID
+ *
-+ * @futexv: List of futexes
-+ * @nr_futexes: Number of futexes in the list
-+ * @awaken: If a futex was awaken during enqueueing, store the index here
++ * Generate a machine wide unique identifier for this inode.
+ *
-+ * Get the value from the userspace address and compares with the expected one.
-+ * In success, enqueue the futex in the correct bucket
++ * This relies on u64 not wrapping in the life-time of the machine; which with
++ * 1ns resolution means almost 585 years.
+ *
-+ * Get the value from user futex address.
++ * This further relies on the fact that a well formed program will not unmap
++ * the file while it has a (shared) futex waiting on it. This mapping will have
++ * a file reference which pins the mount and inode.
+ *
-+ * Since we are in a hurry, we use a spin lock and we can't sleep.
-+ * Try to get the value with page fault disabled (when enable, we might
-+ * sleep).
++ * If for some reason an inode gets evicted and read back in again, it will get
++ * a new sequence number and will _NOT_ match, even though it is the exact same
++ * file.
+ *
-+ * If we fail, we aren't sure if the address is invalid or is just a
-+ * page fault. Then, release the lock (so we can sleep) and try to get
-+ * the value with page fault enabled. In order to trigger a page fault
-+ * handling, we just call __get_user() again. If we sleep with enqueued
-+ * futexes, we might miss a wake, so dequeue everything before sleeping.
++ * It is important that futex_match() will never have a false-positive, esp.
++ * for PI futexes that can mess up the state. The above argues that false-negatives
++ * are only possible for malformed programs.
+ *
-+ * If get_user succeeds, this mean that the address is valid and we do
-+ * the work again. Since we just handled the page fault, the page is
-+ * likely pinned in memory and we should be luckier this time and be
-+ * able to get the value. If we fail anyway, we will try again.
-+ *
-+ * If even with page faults enabled we get and error, this means that
-+ * the address is not valid and we return from the syscall.
-+ *
-+ * If we got an unexpected value or need to treat a page fault and realized that
-+ * a futex was awaken, we can priority this and return success.
- *
- * Return: 0 on success, error code otherwise
- */
--static int futex_get_user_value(struct futex_bucket *bucket, u32 __user *uaddr,
-- unsigned int val, bool multiple)
-+static int futex_enqueue(struct futexv *futexv, unsigned int nr_futexes,
-+ unsigned int *awaken)
- {
-- u32 uval;
-- int ret;
-+ int i, ret;
-+ u32 uval, *uaddr, val;
-+ struct futex_bucket *bucket;
-
-- /*
-- * Get the value from user futex address.
-- *
-- * Since we are in a hurry, we use a spin lock and we can't sleep.
-- * Try to get the value with page fault disabled (when enable, we might
-- * sleep).
-- *
-- * If we fail, we aren't sure if the address is invalid or is just a
-- * page fault. Then, release the lock (so we can sleep) and try to get
-- * the value with page fault enabled. In order to trigger a page fault
-- * handling, we just call __get_user() again.
-- *
-- * If get_user succeeds, this mean that the address is valid and we do
-- * the loop again. Since we just handled the page fault, the page is
-- * likely pinned in memory and we should be luckier this time and be
-- * able to get the value. If we fail anyway, we will try again.
-- *
-- * If even with page faults enabled we get and error, this means that
-- * the address is not valid and we return from the syscall.
-- */
-- do {
-- spin_lock(&bucket->lock);
-+retry:
-+ set_current_state(TASK_INTERRUPTIBLE);
-+
-+ for (i = 0; i < nr_futexes; i++) {
-+ uaddr = (u32 * __user) futexv->objects[i].key.address;
-+ val = (u32) futexv->objects[i].val;
-+ bucket = futexv->objects[i].bucket;
-+
-+ bucket_inc_waiters(bucket);
-+ spin_lock(&bucket->lock);
-
-- ret = futex_get_user(&uval, uaddr);
-+ ret = futex_get_user(&uval, uaddr);
-
-- if (ret) {
-+ if (unlikely(ret)) {
- spin_unlock(&bucket->lock);
-- if (multiple || __get_user(uval, uaddr))
-+
-+ bucket_dec_waiters(bucket);
-+ __set_current_state(TASK_RUNNING);
-+ *awaken = futex_dequeue_multiple(futexv, i);
-+
-+ if (__get_user(uval, uaddr))
- return -EFAULT;
-
-+ if (*awaken >= 0)
-+ return 0;
++ * Returns: UUID for the given inode
++ */
++static u64 futex_get_inode_uuid(struct inode *inode)
++{
++ static atomic64_t i_seq;
++ u64 old;
+
-+ goto retry;
-+ }
++ /* Does the inode already have a sequence number? */
++ old = atomic64_read(&inode->i_sequence2);
+
-+ if (uval != val) {
-+ spin_unlock(&bucket->lock);
++ if (likely(old))
++ return old;
+
-+ bucket_dec_waiters(bucket);
-+ __set_current_state(TASK_RUNNING);
-+ *awaken = futex_dequeue_multiple(futexv, i);
++ for (;;) {
++ u64 new = atomic64_add_return(1, &i_seq);
+
-+ if (*awaken >= 0)
-+ return 0;
++ if (WARN_ON_ONCE(!new))
++ continue;
+
-+ return -EWOULDBLOCK;
- }
-- } while (ret);
-
-- if (uval != val) {
-+ list_add_tail(&futexv->objects[i].list, &bucket->list);
- spin_unlock(&bucket->lock);
-- return -EWOULDBLOCK;
- }
-
- return 0;
- }
-
++ old = atomic64_cmpxchg_relaxed(&inode->i_sequence2, 0, new);
++ if (old)
++ return old;
++ return new;
++ }
++}
+
-+static int __futex_wait(struct futexv *futexv,
-+ unsigned int nr_futexes,
-+ struct hrtimer_sleeper *timeout)
++/**
++ * futex_get_shared_key - Get a key for a shared futex
++ * @address: Futex memory address
++ * @mm: Current process mm_struct pointer
++ * @key: Key struct to be filled
++ *
++ * Returns: 0 on success, error code otherwise
++ */
++static int futex_get_shared_key(uintptr_t address, struct mm_struct *mm,
++ struct futex_key *key)
+{
+ int ret;
-+ unsigned int awaken = -1;
++ struct page *page, *tail;
++ struct address_space *mapping;
+
-+ while (1) {
-+ ret = futex_enqueue(futexv, nr_futexes, &awaken);
++again:
++ ret = get_user_pages_fast(address, 1, 0, &page);
++ if (ret < 0)
++ return ret;
+
-+ if (ret < 0)
-+ break;
++ /*
++ * The treatment of mapping from this point on is critical. The page
++ * lock protects many things but in this context the page lock
++ * stabilizes mapping, prevents inode freeing in the shared
++ * file-backed region case and guards against movement to swap cache.
++ *
++ * Strictly speaking the page lock is not needed in all cases being
++ * considered here and page lock forces unnecessary serialization.
++ * From this point on, mapping will be re-verified if necessary and
++ * page lock will be acquired only if it is unavoidable.
++ *
++ * Mapping checks require the head page for any compound page so the
++ * head page and mapping is looked up now. For anonymous pages, it
++ * does not matter if the page splits in the future as the key is
++ * based on the address. For filesystem-backed pages, the tail is
++ * required as the index of the page determines the key. For
++ * base pages, there is no tail page and tail == page.
++ */
++ tail = page;
++ page = compound_head(page);
++ mapping = READ_ONCE(page->mapping);
+
-+ if (awaken <= 0) {
-+ return awaken;
-+ }
++ /*
++ * If page->mapping is NULL, then it cannot be a PageAnon
++ * page; but it might be the ZERO_PAGE or in the gate area or
++ * in a special mapping (all cases which we are happy to fail);
++ * or it may have been a good file page when get_user_pages_fast
++ * found it, but truncated or holepunched or subjected to
++ * invalidate_complete_page2 before we got the page lock (also
++ * cases which we are happy to fail). And we hold a reference,
++ * so refcount care in invalidate_complete_page's remove_mapping
++ * prevents drop_caches from setting mapping to NULL beneath us.
++ *
++ * The case we do have to guard against is when memory pressure made
++ * shmem_writepage move it from filecache to swapcache beneath us:
++ * an unlikely race, but we do need to retry for page->mapping.
++ */
++ if (unlikely(!mapping)) {
++ int shmem_swizzled;
++
++ /*
++ * Page lock is required to identify which special case above
++ * applies. If this is really a shmem page then the page lock
++ * will prevent unexpected transitions.
++ */
++ lock_page(page);
++ shmem_swizzled = PageSwapCache(page) || page->mapping;
++ unlock_page(page);
++ put_page(page);
+
++ if (shmem_swizzled)
++ goto again;
+
-+ /* Before sleeping, check if someone was woken */
-+ if (!futexv->hint && (!timeout || timeout->task))
-+ freezable_schedule();
++ return -EFAULT;
++ }
+
-+ __set_current_state(TASK_RUNNING);
++ /*
++ * Private mappings are handled in a simple way.
++ *
++ * If the futex key is stored on an anonymous page, then the associated
++ * object is the mm which is implicitly pinned by the calling process.
++ *
++ * NOTE: When userspace waits on a MAP_SHARED mapping, even if
++ * it's a read-only handle, it's expected that futexes attach to
++ * the object not the particular process.
++ */
++ if (PageAnon(page)) {
++ key->offset |= FUT_OFF_MMSHARED;
++ } else {
++ struct inode *inode;
+
+ /*
-+ * One of those things triggered this wake:
-+ *
-+ * * We have been removed from the bucket. futex_wake() woke
-+ * us. We just need to dequeue return 0 to userspace.
-+ *
-+ * However, if no futex was dequeued by a futex_wake():
-+ *
-+ * * If the there's a timeout and it has expired,
-+ * return -ETIMEDOUT.
-+ *
-+ * * If there is a signal pending, something wants to kill our
-+ * thread, return -ERESTARTSYS.
++ * The associated futex object in this case is the inode and
++ * the page->mapping must be traversed. Ordinarily this should
++ * be stabilised under page lock but it's not strictly
++ * necessary in this case as we just want to pin the inode, not
++ * update the radix tree or anything like that.
+ *
-+ * * If there's no signal pending, it was a spurious wake
-+ * (scheduler gave us a change to do some work, even if we
-+ * don't want to). We need to remove ourselves from the
-+ * bucket and add again, to prevent losing wakeups in the
-+ * meantime.
++ * The RCU read lock is taken as the inode is finally freed
++ * under RCU. If the mapping still matches expectations then the
++ * mapping->host can be safely accessed as being a valid inode.
+ */
++ rcu_read_lock();
+
-+ ret = futex_dequeue_multiple(futexv, nr_futexes);
++ if (READ_ONCE(page->mapping) != mapping) {
++ rcu_read_unlock();
++ put_page(page);
+
-+ /* Normal wake */
-+ if (ret >= 0)
-+ break;
++ goto again;
++ }
+
-+ if (timeout && !timeout->task)
-+ return -ETIMEDOUT;
++ inode = READ_ONCE(mapping->host);
++ if (!inode) {
++ rcu_read_unlock();
++ put_page(page);
+
-+ /* signal */
-+ if (signal_pending(current))
-+ return -ERESTARTSYS;
++ goto again;
++ }
++
++ key->pointer = futex_get_inode_uuid(inode);
++ key->index = (unsigned long)basepage_index(tail);
++ key->offset |= FUT_OFF_INODE;
+
-+ /* spurious wake, do everything again */
++ rcu_read_unlock();
+ }
+
-+ return ret;
++ put_page(page);
++
++ return 0;
+}
+
/**
-- * futex_dequeue - Remove a futex from a queue
-- * @bucket: current bucket holding the futex
-- * @waiter: futex to be removed
-+ * futex_wait - Setup the timer and wait on a list of futexes
-+ * @futexv: List of waiters
-+ * @nr_futexes: Number of waiters
-+ * @timo: Timeout
-+ * @timeout: Timeout
-+ * @flags: Timeout flags
+ * futex_get_bucket - Check if the user address is valid, prepare internal
+ * data and calculate the hash
+ * @uaddr: futex user address
+ * @key: data that uniquely identifies a futex
++ * @shared: is this a shared futex?
++ *
++ * For private futexes, each uaddr will be unique for a given mm_struct, and it
++ * won't be freed for the lifetime of the process. For shared futexes, check
++ * futex_get_shared_key().
*
-- * Return: True if futex was removed by this function, false if another wake
-- * thread removed this futex.
-- *
-- * This function should be used after we found that this futex was in a queue.
-- * Thus, it needs to be removed before the next step. However, someone could
-- * wake it between the time of the first check and the time to get the lock for
-- * the bucket. Check one more time if the futex is there with the bucket locked.
-- * If it's there, just remove it and return true. Else, mark the removal as
-- * false and do nothing.
-+ * Return: error code, or a hint of one of the waiters
+ * Return: address of bucket on success, error code otherwise
*/
--static bool futex_dequeue(struct futex_bucket *bucket, struct futex_waiter *waiter)
-+static int futex_wait(struct futexv *futexv, unsigned int nr_futexes,
-+ struct __kernel_timespec __user *timo,
-+ struct hrtimer_sleeper *timeout, unsigned int flags)
+ static struct futex_bucket *futex_get_bucket(void __user *uaddr,
+- struct futex_key *key)
++ struct futex_key *key,
++ bool shared)
{
-- bool removed = true;
-+ int ret;
+ uintptr_t address = (uintptr_t)uaddr;
+ u32 hash_key;
+@@ -168,6 +359,9 @@ static struct futex_bucket *futex_get_bucket(void __user *uaddr,
+ key->pointer = (u64)address;
+ key->index = (unsigned long)current->mm;
-- spin_lock(&bucket->lock);
-- if (list_empty(&waiter->list))
-- removed = false;
-- else
-- list_del(&waiter->list);
-- spin_unlock(&bucket->lock);
-+ if (timo) {
-+ ret = futex_setup_time(timo, timeout, flags);
-+ if (ret)
-+ return ret;
++ if (shared)
++ futex_get_shared_key(address, current->mm, key);
++
+ /* Generate hash key for this futex using uaddr and current->mm */
+ hash_key = jhash2((u32 *)key, sizeof(*key) / sizeof(u32), 0);
-- if (removed)
-- bucket_dec_waiters(bucket);
-+ hrtimer_sleeper_start_expires(timeout, HRTIMER_MODE_ABS);
-+ }
+@@ -303,6 +497,7 @@ static int futex_enqueue(struct futexv_head *futexv, unsigned int nr_futexes,
+ int *awakened)
+ {
+ int i, ret;
++ bool retry = false;
+ u32 uval, *uaddr, val;
+ struct futex_bucket *bucket;
-- return removed;
-+ ret = __futex_wait(futexv, nr_futexes, timo ? timeout : NULL);
-+
-+
-+ if (timo)
-+ hrtimer_cancel(&timeout->timer);
+@@ -313,6 +508,18 @@ static int futex_enqueue(struct futexv_head *futexv, unsigned int nr_futexes,
+ uaddr = (u32 * __user)futexv->objects[i].uaddr;
+ val = (u32)futexv->objects[i].val;
+
++ if (is_object_shared && retry) {
++ struct futex_bucket *tmp =
++ futex_get_bucket((void *)uaddr,
++ &futexv->objects[i].key, true);
++ if (IS_ERR(tmp)) {
++ __set_current_state(TASK_RUNNING);
++ futex_dequeue_multiple(futexv, i);
++ return PTR_ERR(tmp);
++ }
++ futexv->objects[i].bucket = tmp;
++ }
+
-+ return ret;
- }
+ bucket = futexv->objects[i].bucket;
- /**
-@@ -297,15 +440,20 @@ SYSCALL_DEFINE4(futex_wait, void __user *, uaddr, unsigned int, val,
+ bucket_inc_waiters(bucket);
+@@ -333,6 +540,7 @@ static int futex_enqueue(struct futexv_head *futexv, unsigned int nr_futexes,
+ if (*awakened >= 0)
+ return 1;
+
++ retry = true;
+ goto retry;
+ }
+
+@@ -474,6 +682,7 @@ static int futex_set_timer_and_wait(struct futexv_head *futexv,
+ SYSCALL_DEFINE4(futex_wait, void __user *, uaddr, unsigned int, val,
+ unsigned int, flags, struct __kernel_timespec __user *, timo)
{
++ bool shared = (flags & FUTEX_SHARED_FLAG) ? true : false;
unsigned int size = flags & FUTEX_SIZE_MASK;
- struct hrtimer_sleeper timeout;
-- struct futex_bucket *bucket;
- struct futex_single_waiter wait_single;
+ struct futex_single_waiter wait_single = {0};
struct futex_waiter *waiter;
-+ struct futexv *futexv;
- int ret;
+@@ -497,7 +706,7 @@ SYSCALL_DEFINE4(futex_wait, void __user *, uaddr, unsigned int, val,
+ INIT_LIST_HEAD(&waiter->list);
-- wait_single.parent.task = current;
-- wait_single.parent.hint = 0;
-+ futexv = &wait_single.futexv;
-+ futexv->task = current;
-+ futexv->hint = false;
-+
- waiter = &wait_single.waiter;
- waiter->index = 0;
-+ waiter->val = val;
-+
-+ INIT_LIST_HEAD(&waiter->list);
+ /* Get an unlocked hash bucket */
+- waiter->bucket = futex_get_bucket(uaddr, &waiter->key);
++ waiter->bucket = futex_get_bucket(uaddr, &waiter->key, shared);
+ if (IS_ERR(waiter->bucket))
+ return PTR_ERR(waiter->bucket);
- if (flags & ~FUTEX2_MASK)
- return -EINVAL;
-@@ -313,85 +461,101 @@ SYSCALL_DEFINE4(futex_wait, void __user *, uaddr, unsigned int, val,
+@@ -562,6 +771,7 @@ static inline bool futex_match(struct futex_key key1, struct futex_key key2)
+ SYSCALL_DEFINE3(futex_wake, void __user *, uaddr, unsigned int, nr_wake,
+ unsigned int, flags)
+ {
++ bool shared = (flags & FUTEX_SHARED_FLAG) ? true : false;
+ unsigned int size = flags & FUTEX_SIZE_MASK;
+ struct futex_waiter waiter, *aux, *tmp;
+ struct futex_bucket *bucket;
+@@ -574,7 +784,7 @@ SYSCALL_DEFINE3(futex_wake, void __user *, uaddr, unsigned int, nr_wake,
if (size != FUTEX_32)
return -EINVAL;
-- if (timo) {
-- ret = futex_setup_time(timo, &timeout, flags);
-- if (ret)
-- return ret;
-- }
--
- /* Get an unlocked hash bucket */
-- bucket = futex_get_bucket(uaddr, &waiter->key);
-- if (IS_ERR(bucket))
-- return PTR_ERR(bucket);
-+ waiter->bucket = futex_get_bucket(uaddr, &waiter->key);
-+ if (IS_ERR(waiter->bucket))
-+ return PTR_ERR(waiter->bucket);
+- bucket = futex_get_bucket(uaddr, &waiter.key);
++ bucket = futex_get_bucket(uaddr, &waiter.key, shared);
+ if (IS_ERR(bucket))
+ return PTR_ERR(bucket);
-- if (timo)
-- hrtimer_sleeper_start_expires(&timeout, HRTIMER_MODE_ABS);
-+ ret = futex_wait(futexv, 1, timo, &timeout, flags);
+--
+2.30.2
+
+
+From bdfdc48ad40d314933c7872f4818172e76bcd350 Mon Sep 17 00:00:00 2001
+From: =?UTF-8?q?Andr=C3=A9=20Almeida?= <andrealmeid@collabora.com>
+Date: Fri, 5 Feb 2021 10:34:00 -0300
+Subject: [PATCH 03/13] futex2: Implement vectorized wait
+MIME-Version: 1.0
+Content-Type: text/plain; charset=UTF-8
+Content-Transfer-Encoding: 8bit
+
+Add support for waiting on multiple futexes. This is the interface
+implemented by this syscall:
+
+futex_waitv(struct futex_waitv *waiters, unsigned int nr_futexes,
+ unsigned int flags, struct timespec *timo)
+
+struct futex_waitv {
+ void *uaddr;
+ unsigned int val;
+ unsigned int flags;
+};
+
+Given an array of struct futex_waitv, wait on each uaddr. The thread
+wakes if a futex_wake() is performed at any uaddr. The syscall returns
+immediately if any waiter has *uaddr != val. *timo is an optional
+timeout value for the operation. The flags argument of the syscall
+should be used solely for specifying the timeout as realtime, if needed.
+Flags for shared futexes, sizes, etc. should be used on the individual
+flags of each waiter.
+
+Returns the array index of one of the awakened futexes. No information is
+given about how many were awakened, or about any attribute of the returned
+one (whether it was awakened first, has the smallest index...).
+
+Signed-off-by: André Almeida <andrealmeid@collabora.com>
+Signed-off-by: Jan200101 <sentrycraft123@gmail.com>
+---
+ arch/arm/tools/syscall.tbl | 1 +
+ arch/arm64/include/asm/unistd.h | 2 +-
+ arch/x86/entry/syscalls/syscall_32.tbl | 1 +
+ arch/x86/entry/syscalls/syscall_64.tbl | 1 +
+ include/linux/compat.h | 11 ++
+ include/linux/syscalls.h | 4 +
+ include/uapi/asm-generic/unistd.h | 5 +-
+ kernel/futex2.c | 171 ++++++++++++++++++
+ kernel/sys_ni.c | 1 +
+ tools/include/uapi/asm-generic/unistd.h | 5 +-
+ .../arch/x86/entry/syscalls/syscall_64.tbl | 1 +
+ 11 files changed, 200 insertions(+), 3 deletions(-)
+
+diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
+index 4eef220cd..6d0f6626a 100644
+--- a/arch/arm/tools/syscall.tbl
++++ b/arch/arm/tools/syscall.tbl
+@@ -457,3 +457,4 @@
+ 441 common epoll_pwait2 sys_epoll_pwait2
+ 442 common futex_wait sys_futex_wait
+ 443 common futex_wake sys_futex_wake
++444 common futex_waitv sys_futex_waitv
+diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
+index d1f7d35f9..64ebdc1ec 100644
+--- a/arch/arm64/include/asm/unistd.h
++++ b/arch/arm64/include/asm/unistd.h
+@@ -38,7 +38,7 @@
+ #define __ARM_NR_compat_set_tls (__ARM_NR_COMPAT_BASE + 5)
+ #define __ARM_NR_COMPAT_END (__ARM_NR_COMPAT_BASE + 0x800)
+
+-#define __NR_compat_syscalls 444
++#define __NR_compat_syscalls 445
+ #endif
+
+ #define __ARCH_WANT_SYS_CLONE
+diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
+index ece90c8d9..fe242fa0b 100644
+--- a/arch/x86/entry/syscalls/syscall_32.tbl
++++ b/arch/x86/entry/syscalls/syscall_32.tbl
+@@ -448,3 +448,4 @@
+ 441 i386 epoll_pwait2 sys_epoll_pwait2 compat_sys_epoll_pwait2
+ 442 i386 futex_wait sys_futex_wait
+ 443 i386 futex_wake sys_futex_wake
++444 i386 futex_waitv sys_futex_waitv compat_sys_futex_waitv
+diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
+index 72fb65ef9..9d0f07e05 100644
+--- a/arch/x86/entry/syscalls/syscall_64.tbl
++++ b/arch/x86/entry/syscalls/syscall_64.tbl
+@@ -365,6 +365,7 @@
+ 441 common epoll_pwait2 sys_epoll_pwait2
+ 442 common futex_wait sys_futex_wait
+ 443 common futex_wake sys_futex_wake
++444 common futex_waitv sys_futex_waitv
+
+ #
+ # Due to a historical design error, certain syscalls are numbered differently
+diff --git a/include/linux/compat.h b/include/linux/compat.h
+index 6e65be753..041d18174 100644
+--- a/include/linux/compat.h
++++ b/include/linux/compat.h
+@@ -365,6 +365,12 @@ struct compat_robust_list_head {
+ compat_uptr_t list_op_pending;
+ };
+
++struct compat_futex_waitv {
++ compat_uptr_t uaddr;
++ compat_uint_t val;
++ compat_uint_t flags;
++};
++
+ #ifdef CONFIG_COMPAT_OLD_SIGACTION
+ struct compat_old_sigaction {
+ compat_uptr_t sa_handler;
+@@ -654,6 +660,11 @@ asmlinkage long
+ compat_sys_get_robust_list(int pid, compat_uptr_t __user *head_ptr,
+ compat_size_t __user *len_ptr);
+
++/* kernel/futex2.c */
++asmlinkage long compat_sys_futex_waitv(struct compat_futex_waitv *waiters,
++ compat_uint_t nr_futexes, compat_uint_t flags,
++ struct __kernel_timespec __user *timo);
++
+ /* kernel/itimer.c */
+ asmlinkage long compat_sys_getitimer(int which,
+ struct old_itimerval32 __user *it);
+diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
+index bf146c2b0..7da1ceb36 100644
+--- a/include/linux/syscalls.h
++++ b/include/linux/syscalls.h
+@@ -68,6 +68,7 @@ union bpf_attr;
+ struct io_uring_params;
+ struct clone_args;
+ struct open_how;
++struct futex_waitv;
+
+ #include <linux/types.h>
+ #include <linux/aio_abi.h>
+@@ -624,6 +625,9 @@ asmlinkage long sys_futex_wait(void __user *uaddr, unsigned int val,
+ struct __kernel_timespec __user __user *timo);
+ asmlinkage long sys_futex_wake(void __user *uaddr, unsigned int nr_wake,
+ unsigned int flags);
++asmlinkage long sys_futex_waitv(struct futex_waitv __user *waiters,
++ unsigned int nr_futexes, unsigned int flags,
++ struct __kernel_timespec __user *timo);
--retry:
-- bucket_inc_waiters(bucket);
+ /* kernel/hrtimer.c */
+ asmlinkage long sys_nanosleep(struct __kernel_timespec __user *rqtp,
+diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
+index 57e19200f..090da8e12 100644
+--- a/include/uapi/asm-generic/unistd.h
++++ b/include/uapi/asm-generic/unistd.h
+@@ -868,8 +868,11 @@ __SYSCALL(__NR_futex_wait, sys_futex_wait)
+ #define __NR_futex_wake 443
+ __SYSCALL(__NR_futex_wake, sys_futex_wake)
+
++#define __NR_futex_waitv 444
++__SC_COMP(__NR_futex_waitv, sys_futex_waitv, compat_sys_futex_waitv)
++
+ #undef __NR_syscalls
+-#define __NR_syscalls 444
++#define __NR_syscalls 445
+
+ /*
+ * 32 bit systems traditionally used different
+diff --git a/kernel/futex2.c b/kernel/futex2.c
+index 27767b2d0..f3c2379ab 100644
+--- a/kernel/futex2.c
++++ b/kernel/futex2.c
+@@ -713,6 +713,177 @@ SYSCALL_DEFINE4(futex_wait, void __user *, uaddr, unsigned int, val,
+ return futex_set_timer_and_wait(futexv, 1, timo, flags);
+ }
+
++#ifdef CONFIG_COMPAT
++/**
++ * compat_futex_parse_waitv - Parse a waitv array from userspace
++ * @futexv: Kernel side list of waiters to be filled
++ * @uwaitv: Userspace list to be parsed
++ * @nr_futexes: Length of futexv
++ *
++ * Return: 0 on success, error code on failure
++ */
++static int compat_futex_parse_waitv(struct futexv_head *futexv,
++ struct compat_futex_waitv __user *uwaitv,
++ unsigned int nr_futexes)
++{
++ struct futex_bucket *bucket;
++ struct compat_futex_waitv waitv;
++ unsigned int i;
++
++ for (i = 0; i < nr_futexes; i++) {
++ if (copy_from_user(&waitv, &uwaitv[i], sizeof(waitv)))
++ return -EFAULT;
++
++ if ((waitv.flags & ~FUTEXV_WAITER_MASK) ||
++ (waitv.flags & FUTEX_SIZE_MASK) != FUTEX_32)
++ return -EINVAL;
++
++ futexv->objects[i].key.pointer = 0;
++ futexv->objects[i].flags = waitv.flags;
++ futexv->objects[i].uaddr = (uintptr_t)compat_ptr(waitv.uaddr);
++ futexv->objects[i].val = waitv.val;
++ futexv->objects[i].index = i;
++
++ bucket = futex_get_bucket(compat_ptr(waitv.uaddr),
++ &futexv->objects[i].key,
++ is_object_shared);
++
++ if (IS_ERR(bucket))
++ return PTR_ERR(bucket);
++
++ futexv->objects[i].bucket = bucket;
++
++ INIT_LIST_HEAD(&futexv->objects[i].list);
++ }
++
++ return 0;
++}
++
++COMPAT_SYSCALL_DEFINE4(futex_waitv, struct compat_futex_waitv __user *, waiters,
++ unsigned int, nr_futexes, unsigned int, flags,
++ struct __kernel_timespec __user *, timo)
++{
++ struct futexv_head *futexv;
++ int ret;
++
++ if (flags & ~FUTEXV_MASK)
++ return -EINVAL;
++
++ if (!nr_futexes || nr_futexes > FUTEX_WAITV_MAX || !waiters)
++ return -EINVAL;
++
++ futexv = kmalloc((sizeof(struct futex_waiter) * nr_futexes) +
++ sizeof(*futexv), GFP_KERNEL);
++ if (!futexv)
++ return -ENOMEM;
++
++ futexv->hint = false;
++ futexv->task = current;
++
++ ret = compat_futex_parse_waitv(futexv, waiters, nr_futexes);
++
++ if (!ret)
++ ret = futex_set_timer_and_wait(futexv, nr_futexes, timo, flags);
++
++ kfree(futexv);
++
+ return ret;
+}
-
-- /* Compare the expected and current value, get the bucket lock */
-- ret = futex_get_user_value(bucket, uaddr, val, false);
-- if (ret) {
-- bucket_dec_waiters(bucket);
-- goto out;
-- }
++#endif
++
+/**
+ * futex_parse_waitv - Parse a waitv array from userspace
-+ * @futexv: list of waiters
-+ * @uwaitv: userspace list
-+ * @nr_futexes: number of waiters in the list
++ * @futexv: Kernel side list of waiters to be filled
++ * @uwaitv: Userspace list to be parsed
++ * @nr_futexes: Length of futexv
+ *
+ * Return: 0 on success, error code on failure
+ */
-+static int futex_parse_waitv(struct futexv *futexv,
++static int futex_parse_waitv(struct futexv_head *futexv,
+ struct futex_waitv __user *uwaitv,
+ unsigned int nr_futexes)
+{
++ struct futex_bucket *bucket;
+ struct futex_waitv waitv;
+ unsigned int i;
-+ struct futex_bucket *bucket;
-
-- /* Add the waiter to the hash table and sleep */
-- set_current_state(TASK_INTERRUPTIBLE);
-- list_add_tail(&waiter->list, &bucket->list);
-- spin_unlock(&bucket->lock);
++
+ for (i = 0; i < nr_futexes; i++) {
+ if (copy_from_user(&waitv, &uwaitv[i], sizeof(waitv)))
+ return -EFAULT;
-
-- /* Do not sleep if someone woke this futex or if it was timeouted */
-- if (!list_empty_careful(&waiter->list) && (!timo || timeout.task))
-- freezable_schedule();
++
+ if ((waitv.flags & ~FUTEXV_WAITER_MASK) ||
+ (waitv.flags & FUTEX_SIZE_MASK) != FUTEX_32)
+ return -EINVAL;
-
-- __set_current_state(TASK_RUNNING);
-+ bucket = futex_get_bucket(waitv.uaddr,
-+ &futexv->objects[i].key);
++
++ futexv->objects[i].key.pointer = 0;
++ futexv->objects[i].flags = waitv.flags;
++ futexv->objects[i].uaddr = (uintptr_t)waitv.uaddr;
++ futexv->objects[i].val = waitv.val;
++ futexv->objects[i].index = i;
++
++ bucket = futex_get_bucket(waitv.uaddr, &futexv->objects[i].key,
++ is_object_shared);
++
+ if (IS_ERR(bucket))
+ return PTR_ERR(bucket);
-
-- /*
-- * One of those things triggered this wake:
-- *
-- * * We have been removed from the bucket. futex_wake() woke us. We just
-- * need to return 0 to userspace.
-- *
-- * However, if we find ourselves in the bucket we must remove ourselves
-- * from the bucket and ...
-- *
-- * * If the there's a timeout and it has expired, return -ETIMEDOUT.
-- *
-- * * If there is a signal pending, something wants to kill our thread.
-- * Return -ERESTARTSYS.
-- *
-- * * If there's no signal pending, it was a spurious wake (scheduler
-- * gave us a change to do some work, even if we don't want to). We
-- * need to remove ourselves from the bucket and add again, to prevent
-- * losing wakeups in the meantime.
-- */
++
+ futexv->objects[i].bucket = bucket;
-+ futexv->objects[i].val = waitv.val;
-+ futexv->objects[i].flags = waitv.flags;
-+ futexv->objects[i].index = i;
++
+ INIT_LIST_HEAD(&futexv->objects[i].list);
+ }
-
-- /* Normal wake */
-- if (list_empty_careful(&waiter->list))
-- goto out;
++
+ return 0;
+}
-
-- if (!futex_dequeue(bucket, waiter))
-- goto out;
++
+/**
-+ * sys_futex_waitv - function
-+ * @waiters: TODO
-+ * @nr_futexes: TODO
-+ * @flags: TODO
-+ * @timo: TODO
++ * sys_futex_waitv - Wait on a list of futexes
++ * @waiters: List of futexes to wait on
++ * @nr_futexes: Length of the waiters array
++ * @flags: Flag for timeout (monotonic/realtime)
++ * @timo: Optional absolute timeout.
++ *
++ * Given an array of `struct futex_waitv`, wait on each uaddr. The thread wakes
++ * if a futex_wake() is performed at any uaddr. The syscall returns immediately
++ * if any waiter has *uaddr != val. *timo is an optional timeout value for the
++ * operation. Each waiter has individual flags. The `flags` argument for the
++ * syscall should be used solely for specifying the timeout as realtime, if
++ * needed. Flags for shared futexes, sizes, etc. should be used on the
++ * individual flags of each waiter.
++ *
++ * Returns the array index of one of the awakened futexes. No information is
++ * given about how many were awakened, or about any attribute of the returned
++ * one (whether it was awakened first, has the smallest index...).
+ */
+SYSCALL_DEFINE4(futex_waitv, struct futex_waitv __user *, waiters,
+ unsigned int, nr_futexes, unsigned int, flags,
+ struct __kernel_timespec __user *, timo)
+{
-+ struct hrtimer_sleeper timeout;
-+ struct futexv *futexv;
++ struct futexv_head *futexv;
+ int ret;
-
-- /* Timeout */
-- if (timo && !timeout.task)
-- return -ETIMEDOUT;
++
+ if (flags & ~FUTEXV_MASK)
+ return -EINVAL;
-
-- /* Spurious wakeup */
-- if (!signal_pending(current))
-- goto retry;
++
+ if (!nr_futexes || nr_futexes > FUTEX_WAITV_MAX || !waiters)
+ return -EINVAL;
-
-- /* Some signal is pending */
-- ret = -ERESTARTSYS;
--out:
-- if (timo)
-- hrtimer_cancel(&timeout.timer);
-+ futexv = kmalloc(sizeof(struct futexv) +
-+ (sizeof(struct futex_waiter) * nr_futexes),
-+ GFP_KERNEL);
++
++ futexv = kmalloc((sizeof(struct futex_waiter) * nr_futexes) +
++ sizeof(*futexv), GFP_KERNEL);
+ if (!futexv)
+ return -ENOMEM;
+
@@ -1300,37 +1923,21 @@ index 107b80a46..4b782b5ef 100644
+
+ ret = futex_parse_waitv(futexv, waiters, nr_futexes);
+ if (!ret)
-+ ret = futex_wait(futexv, nr_futexes, timo, &timeout, flags);
++ ret = futex_set_timer_and_wait(futexv, nr_futexes, timo, flags);
+
+ kfree(futexv);
-
- return ret;
- }
-
-+/**
-+ * futex_get_parent - Get parent
-+ * @waiter: TODO
-+ * @index: TODO
-+ *
-+ * Return: TODO
-+ */
- static struct futexv *futex_get_parent(uintptr_t waiter, u8 index)
- {
- uintptr_t parent = waiter - sizeof(struct futexv)
-@@ -439,7 +603,7 @@ SYSCALL_DEFINE3(futex_wake, void __user *, uaddr, unsigned int, nr_wake,
- struct futexv *parent =
- futex_get_parent((uintptr_t) aux, aux->index);
-
-- parent->hint = 1;
-+ parent->hint = true;
- task = parent->task;
- get_task_struct(task);
- list_del_init_careful(&aux->list);
++
++ return ret;
++}
++
+ /**
+ * futex_get_parent - For a given futex in a futexv list, get a pointer to the futexv
+ * @waiter: Address of futex in the list
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
-index 35ff743b1..1898e7340 100644
+index 27ef83ca8..977890c58 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
-@@ -151,6 +151,7 @@ COND_SYSCALL_COMPAT(get_robust_list);
+@@ -153,6 +153,7 @@ COND_SYSCALL_COMPAT(get_robust_list);
/* kernel/futex2.c */
COND_SYSCALL(futex_wait);
COND_SYSCALL(futex_wake);
@@ -1339,465 +1946,827 @@ index 35ff743b1..1898e7340 100644
/* kernel/hrtimer.c */
diff --git a/tools/include/uapi/asm-generic/unistd.h b/tools/include/uapi/asm-generic/unistd.h
-index cd79f94e0..7de33be59 100644
+index 57e19200f..23febe59e 100644
--- a/tools/include/uapi/asm-generic/unistd.h
+++ b/tools/include/uapi/asm-generic/unistd.h
-@@ -866,8 +866,11 @@ __SYSCALL(__NR_futex_wait, sys_futex_wait)
- #define __NR_futex_wake 442
+@@ -868,8 +868,11 @@ __SYSCALL(__NR_futex_wait, sys_futex_wait)
+ #define __NR_futex_wake 443
__SYSCALL(__NR_futex_wake, sys_futex_wake)
-+#define __NR_futex_waitv 443
-+__SYSCALL(__NR_futex_waitv, sys_futex_waitv)
++#define __NR_futex_waitv 444
++__SC_COMP(__NR_futex_waitv, sys_futex_waitv, compat_sys_futex_waitv)
+
#undef __NR_syscalls
--#define __NR_syscalls 443
-+#define __NR_syscalls 444
-
+-#define __NR_syscalls 444
++#define __NR_syscalls 445
/*
+ * 32 bit systems traditionally used different
diff --git a/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl b/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl
-index 47de3bf93..bd47f368f 100644
+index 15d2b89b6..820c1e4b1 100644
--- a/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl
-@@ -364,6 +364,7 @@
- 440 common process_madvise sys_process_madvise
- 441 common futex_wait sys_futex_wait
- 442 common futex_wake sys_futex_wake
-+443 common futex_waitv sys_futex_waitv
+@@ -365,6 +365,7 @@
+ 441 common epoll_pwait2 sys_epoll_pwait2
+ 442 common futex_wait sys_futex_wait
+ 443 common futex_wake sys_futex_wake
++444 common futex_waitv sys_futex_waitv
#
# Due to a historical design error, certain syscalls are numbered differently
--
-2.29.2
+2.30.2
-From 24681616a5432f7680f934abf335a9ab9a1eaf1e Mon Sep 17 00:00:00 2001
+From e1198b0e26063ba40993154176b8232f646c3c4b Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Andr=C3=A9=20Almeida?= <andrealmeid@collabora.com>
-Date: Thu, 15 Oct 2020 18:06:40 -0300
-Subject: [PATCH 3/9] futex2: Add support for shared futexes
+Date: Fri, 5 Feb 2021 10:34:01 -0300
+Subject: [PATCH 04/13] futex2: Implement requeue operation
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
-Add support for shared futexes for cross-process resources.
+Implement a requeue interface similar to the FUTEX_CMP_REQUEUE operation.
+This is the syscall implemented by this patch:
+
+futex_requeue(struct futex_requeue *uaddr1, struct futex_requeue *uaddr2,
+ unsigned int nr_wake, unsigned int nr_requeue,
+ unsigned int cmpval, unsigned int flags)
+
+struct futex_requeue {
+ void *uaddr;
+ unsigned int flags;
+};
+
+If the value at uaddr1->uaddr equals cmpval, wake nr_wake waiters at
+uaddr1->uaddr and then remove nr_requeue waiters from uaddr1->uaddr and
+add them to the uaddr2->uaddr list. Each uaddr has its own set of flags,
+which must be defined in struct futex_requeue (such as size, shared, NUMA).
+The flags argument of the syscall is there just for the sake of
+extensibility, and right now it needs to be zero.
+
+Return the number of the woken futexes + the number of requeued ones on
+success, error code otherwise.
Signed-off-by: André Almeida <andrealmeid@collabora.com>
+---
+
+The original FUTEX_CMP_REQUEUE interface is as follows:
+
+futex(*uaddr1, FUTEX_CMP_REQUEUE, nr_wake, nr_requeue, *uaddr2, cmpval);
+
+Given that when this interface was created there was only one type of
+futex (as opposed to futex2, where there are shared, sized, and NUMA
+futexes), there was no way to specify individual flags for uaddr1 and
+uaddr2. When FUTEX_PRIVATE was implemented, a new opcode was created as
+well (FUTEX_CMP_REQUEUE_PRIVATE), but it applies to both futexes, so they
+must be of the same type regarding private/shared. This imposes a
+limitation on the use cases of the operation, and to overcome that in
+futex2, `struct futex_requeue` was created, so one can set individual
+flags for each futex. This flexibility is a trade-off with performance,
+given that now we need to perform two extra copy_from_user() calls. One
+alternative would be to use the upper half of the flags bits for the first
+futex and the bottom half for the second, but this would also impose
+limitations, given that it would halve the number of available flags. If
+equal futexes are common enough, the following extension could be added
+to avoid the current performance cost:
+
+- A flag FUTEX_REQUEUE_EQUAL is added to futex2() flags;
+- If futex_requeue() sees this flag, that means that both futexes use
+  the same set of attributes.
+- Then, the function parses the flags as futex_wait/wake() do.
+- *uaddr1 and *uaddr2 are used as void * (instead of struct
+  futex_requeue), just like in wait/wake().
+
+In that way, we could avoid the copy_from_user().
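For illustration only, the "upper half / bottom half" alternative mentioned
above could look roughly like the sketch below. None of these names exist in
the patch; requeue_pack_flags(), requeue_flags_first(), requeue_flags_second()
and the 16/16 bit split are assumptions made for the example. It shows why
each futex would be left with only half of the flag bits.

```c
#include <stdint.h>

/*
 * Hypothetical layout, not part of this patch: the upper 16 bits of a
 * single 32-bit flags word carry the flags of the first futex, the lower
 * 16 bits carry the flags of the second futex.
 */
#define REQ_FLAGS_HALF_BITS 16u
#define REQ_FLAGS_HALF_MASK 0xffffu

/* Pack two per-futex flag sets into one 32-bit flags argument. */
static inline uint32_t requeue_pack_flags(uint16_t flags1, uint16_t flags2)
{
	return ((uint32_t)flags1 << REQ_FLAGS_HALF_BITS) | flags2;
}

/* Extract the flags of the first futex (upper half). */
static inline uint16_t requeue_flags_first(uint32_t packed)
{
	return (uint16_t)(packed >> REQ_FLAGS_HALF_BITS);
}

/* Extract the flags of the second futex (lower half). */
static inline uint16_t requeue_flags_second(uint32_t packed)
{
	return (uint16_t)(packed & REQ_FLAGS_HALF_MASK);
}
```

With such a layout no copy_from_user() of a struct would be needed, but any
per-futex flag would have to fit in 16 bits, which is the limitation the
paragraph above describes.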
+
Signed-off-by: Jan200101 <sentrycraft123@gmail.com>
---
- kernel/futex2.c | 187 ++++++++++++++++++++++++++++++++++++++++++------
- 1 file changed, 165 insertions(+), 22 deletions(-)
+ arch/arm/tools/syscall.tbl | 1 +
+ arch/arm64/include/asm/unistd.h | 2 +-
+ arch/x86/entry/syscalls/syscall_32.tbl | 1 +
+ arch/x86/entry/syscalls/syscall_64.tbl | 1 +
+ include/linux/compat.h | 12 ++
+ include/linux/syscalls.h | 5 +
+ include/uapi/asm-generic/unistd.h | 5 +-
+ kernel/futex2.c | 215 +++++++++++++++++++++++++
+ kernel/sys_ni.c | 1 +
+ 9 files changed, 241 insertions(+), 2 deletions(-)
+diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
+index 6d0f6626a..9aa108802 100644
+--- a/arch/arm/tools/syscall.tbl
++++ b/arch/arm/tools/syscall.tbl
+@@ -458,3 +458,4 @@
+ 442 common futex_wait sys_futex_wait
+ 443 common futex_wake sys_futex_wake
+ 444 common futex_waitv sys_futex_waitv
++445 common futex_requeue sys_futex_requeue
+diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
+index 64ebdc1ec..d1cc2849d 100644
+--- a/arch/arm64/include/asm/unistd.h
++++ b/arch/arm64/include/asm/unistd.h
+@@ -38,7 +38,7 @@
+ #define __ARM_NR_compat_set_tls (__ARM_NR_COMPAT_BASE + 5)
+ #define __ARM_NR_COMPAT_END (__ARM_NR_COMPAT_BASE + 0x800)
+
+-#define __NR_compat_syscalls 445
++#define __NR_compat_syscalls 446
+ #endif
+
+ #define __ARCH_WANT_SYS_CLONE
+diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
+index fe242fa0b..0cd1df235 100644
+--- a/arch/x86/entry/syscalls/syscall_32.tbl
++++ b/arch/x86/entry/syscalls/syscall_32.tbl
+@@ -449,3 +449,4 @@
+ 442 i386 futex_wait sys_futex_wait
+ 443 i386 futex_wake sys_futex_wake
+ 444 i386 futex_waitv sys_futex_waitv compat_sys_futex_waitv
++445 i386 futex_requeue sys_futex_requeue compat_sys_futex_requeue
+diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
+index 9d0f07e05..abbfddcdb 100644
+--- a/arch/x86/entry/syscalls/syscall_64.tbl
++++ b/arch/x86/entry/syscalls/syscall_64.tbl
+@@ -366,6 +366,7 @@
+ 442 common futex_wait sys_futex_wait
+ 443 common futex_wake sys_futex_wake
+ 444 common futex_waitv sys_futex_waitv
++445 common futex_requeue sys_futex_requeue
+
+ #
+ # Due to a historical design error, certain syscalls are numbered differently
+diff --git a/include/linux/compat.h b/include/linux/compat.h
+index 041d18174..d4c1b402b 100644
+--- a/include/linux/compat.h
++++ b/include/linux/compat.h
+@@ -371,6 +371,11 @@ struct compat_futex_waitv {
+ compat_uint_t flags;
+ };
+
++struct compat_futex_requeue {
++ compat_uptr_t uaddr;
++ compat_uint_t flags;
++};
++
+ #ifdef CONFIG_COMPAT_OLD_SIGACTION
+ struct compat_old_sigaction {
+ compat_uptr_t sa_handler;
+@@ -665,6 +670,13 @@ asmlinkage long compat_sys_futex_waitv(struct compat_futex_waitv *waiters,
+ compat_uint_t nr_futexes, compat_uint_t flags,
+ struct __kernel_timespec __user *timo);
+
++asmlinkage long compat_sys_futex_requeue(struct compat_futex_requeue *uaddr1,
++ struct compat_futex_requeue *uaddr2,
++ compat_uint_t nr_wake,
++ compat_uint_t nr_requeue,
++ compat_uint_t cmpval,
++ compat_uint_t flags);
++
+ /* kernel/itimer.c */
+ asmlinkage long compat_sys_getitimer(int which,
+ struct old_itimerval32 __user *it);
+diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
+index 7da1ceb36..06823bc7e 100644
+--- a/include/linux/syscalls.h
++++ b/include/linux/syscalls.h
+@@ -69,6 +69,7 @@ struct io_uring_params;
+ struct clone_args;
+ struct open_how;
+ struct futex_waitv;
++struct futex_requeue;
+
+ #include <linux/types.h>
+ #include <linux/aio_abi.h>
+@@ -628,6 +629,10 @@ asmlinkage long sys_futex_wake(void __user *uaddr, unsigned int nr_wake,
+ asmlinkage long sys_futex_waitv(struct futex_waitv __user *waiters,
+ unsigned int nr_futexes, unsigned int flags,
+ struct __kernel_timespec __user *timo);
++asmlinkage long sys_futex_requeue(struct futex_requeue __user *uaddr1,
++ struct futex_requeue __user *uaddr2,
++ unsigned int nr_wake, unsigned int nr_requeue,
++ unsigned int cmpval, unsigned int flags);
+
+ /* kernel/hrtimer.c */
+ asmlinkage long sys_nanosleep(struct __kernel_timespec __user *rqtp,
+diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
+index 090da8e12..095c10a83 100644
+--- a/include/uapi/asm-generic/unistd.h
++++ b/include/uapi/asm-generic/unistd.h
+@@ -871,8 +871,11 @@ __SYSCALL(__NR_futex_wake, sys_futex_wake)
+ #define __NR_futex_waitv 444
+ __SC_COMP(__NR_futex_waitv, sys_futex_waitv, compat_sys_futex_waitv)
+
++#define __NR_futex_requeue 445
++__SC_COMP(__NR_futex_requeue, sys_futex_requeue, compat_sys_futex_requeue)
++
+ #undef __NR_syscalls
+-#define __NR_syscalls 445
++#define __NR_syscalls 446
+
+ /*
+ * 32 bit systems traditionally used different
diff --git a/kernel/futex2.c b/kernel/futex2.c
-index 4b782b5ef..5ddb9922d 100644
+index f3c2379ab..bad8c183c 100644
--- a/kernel/futex2.c
+++ b/kernel/futex2.c
-@@ -6,7 +6,9 @@
- */
-
- #include <linux/freezer.h>
-+#include <linux/hugetlb.h>
- #include <linux/jhash.h>
-+#include <linux/pagemap.h>
- #include <linux/sched/wake_q.h>
- #include <linux/spinlock.h>
- #include <linux/syscalls.h>
-@@ -15,6 +17,7 @@
-
- /**
- * struct futex_waiter - List entry for a waiter
-+ * @uaddr: Memory address of userspace futex
- * @key.address: Memory address of userspace futex
- * @key.mm: Pointer to memory management struct of this process
- * @key: Stores information that uniquely identify a futex
-@@ -25,9 +28,11 @@
- * @index: Index of waiter in futexv list
- */
- struct futex_waiter {
-+ uintptr_t uaddr;
- struct futex_key {
- uintptr_t address;
- struct mm_struct *mm;
-+ unsigned long int offset;
- } key;
- struct list_head list;
- unsigned int val;
-@@ -125,16 +130,116 @@ static inline int bucket_get_waiters(struct futex_bucket *bucket)
- #endif
+@@ -977,6 +977,221 @@ SYSCALL_DEFINE3(futex_wake, void __user *, uaddr, unsigned int, nr_wake,
+ return ret;
}
-+static u64 get_inode_sequence_number(struct inode *inode)
++static void futex_double_unlock(struct futex_bucket *b1, struct futex_bucket *b2)
+{
-+ static atomic64_t i_seq;
-+ u64 old;
++ spin_unlock(&b1->lock);
++ if (b1 != b2)
++ spin_unlock(&b2->lock);
++}
+
-+ /* Does the inode already have a sequence number? */
-+ old = atomic64_read(&inode->i_sequence);
-+ if (likely(old))
-+ return old;
++static inline int __futex_requeue(struct futex_requeue rq1,
++ struct futex_requeue rq2, unsigned int nr_wake,
++ unsigned int nr_requeue, unsigned int cmpval,
++ bool shared1, bool shared2)
++{
++ struct futex_waiter w1, w2, *aux, *tmp;
++ bool retry = false;
++ struct futex_bucket *b1, *b2;
++ DEFINE_WAKE_Q(wake_q);
++ u32 uval;
++ int ret;
+
-+ for (;;) {
-+ u64 new = atomic64_add_return(1, &i_seq);
-+ if (WARN_ON_ONCE(!new))
-+ continue;
++ b1 = futex_get_bucket(rq1.uaddr, &w1.key, shared1);
++ if (IS_ERR(b1))
++ return PTR_ERR(b1);
+
-+ old = atomic64_cmpxchg_relaxed(&inode->i_sequence, 0, new);
-+ if (old)
-+ return old;
-+ return new;
++ b2 = futex_get_bucket(rq2.uaddr, &w2.key, shared2);
++ if (IS_ERR(b2))
++ return PTR_ERR(b2);
++
++retry:
++ if (shared1 && retry) {
++ b1 = futex_get_bucket(rq1.uaddr, &w1.key, shared1);
++ if (IS_ERR(b1))
++ return PTR_ERR(b1);
+ }
-+}
+
-+#define FUT_OFF_INODE 1 /* We set bit 0 if key has a reference on inode */
-+#define FUT_OFF_MMSHARED 2 /* We set bit 1 if key has a reference on mm */
++ if (shared2 && retry) {
++ b2 = futex_get_bucket(rq2.uaddr, &w2.key, shared2);
++ if (IS_ERR(b2))
++ return PTR_ERR(b2);
++ }
+
-+static int futex_get_shared_key(uintptr_t address, struct mm_struct *mm,
-+ struct futex_key *key)
-+{
-+ int err;
-+ struct page *page, *tail;
-+ struct address_space *mapping;
++ bucket_inc_waiters(b2);
++ /*
++ * To ensure the locks are taken in the same order for all threads (and
++ * thus avoiding deadlocks), take the "smaller" one first
++ */
++ if (b1 <= b2) {
++ spin_lock(&b1->lock);
++ if (b1 < b2)
++ spin_lock_nested(&b2->lock, SINGLE_DEPTH_NESTING);
++ } else {
++ spin_lock(&b2->lock);
++ spin_lock_nested(&b1->lock, SINGLE_DEPTH_NESTING);
++ }
+
-+again:
-+ err = get_user_pages_fast(address, 1, 0, &page);
++ ret = futex_get_user(&uval, rq1.uaddr);
+
-+ if (err < 0)
-+ return err;
-+ else
-+ err = 0;
++ if (unlikely(ret)) {
++ futex_double_unlock(b1, b2);
++ if (__get_user(uval, (u32 * __user)rq1.uaddr))
++ return -EFAULT;
+
++ bucket_dec_waiters(b2);
++ retry = true;
++ goto retry;
++ }
+
-+ tail = page;
-+ page = compound_head(page);
-+ mapping = READ_ONCE(page->mapping);
++ if (uval != cmpval) {
++ futex_double_unlock(b1, b2);
+
++ bucket_dec_waiters(b2);
++ return -EAGAIN;
++ }
+
-+ if (unlikely(!mapping)) {
-+ int shmem_swizzled;
++ list_for_each_entry_safe(aux, tmp, &b1->list, list) {
++ if (futex_match(w1.key, aux->key)) {
++ if (ret < nr_wake) {
++ futex_mark_wake(aux, b1, &wake_q);
++ ret++;
++ continue;
++ }
+
-+ lock_page(page);
-+ shmem_swizzled = PageSwapCache(page) || page->mapping;
-+ unlock_page(page);
-+ put_page(page);
++ if (ret >= nr_wake + nr_requeue)
++ break;
+
-+ if (shmem_swizzled)
-+ goto again;
++ aux->key.pointer = w2.key.pointer;
++ aux->key.index = w2.key.index;
++ aux->key.offset = w2.key.offset;
+
-+ return -EFAULT;
++ if (b1 != b2) {
++ list_del_init_careful(&aux->list);
++ bucket_dec_waiters(b1);
++
++ list_add_tail(&aux->list, &b2->list);
++ bucket_inc_waiters(b2);
++ }
++ ret++;
++ }
+ }
+
-+ if (PageAnon(page)) {
++ futex_double_unlock(b1, b2);
++ wake_up_q(&wake_q);
++ bucket_dec_waiters(b2);
+
-+ key->mm = mm;
-+ key->address = address;
++ return ret;
++}
+
-+ key->offset |= FUT_OFF_MMSHARED;
++#ifdef CONFIG_COMPAT
++static int compat_futex_parse_requeue(struct futex_requeue *rq,
++ struct compat_futex_requeue __user *uaddr,
++ bool *shared)
++{
++ struct compat_futex_requeue tmp;
+
-+ } else {
-+ struct inode *inode;
++ if (copy_from_user(&tmp, uaddr, sizeof(tmp)))
++ return -EFAULT;
+
-+ rcu_read_lock();
++ if (tmp.flags & ~FUTEXV_WAITER_MASK ||
++ (tmp.flags & FUTEX_SIZE_MASK) != FUTEX_32)
++ return -EINVAL;
+
-+ if (READ_ONCE(page->mapping) != mapping) {
-+ rcu_read_unlock();
-+ put_page(page);
++ *shared = (tmp.flags & FUTEX_SHARED_FLAG) ? true : false;
+
-+ goto again;
-+ }
++ rq->uaddr = compat_ptr(tmp.uaddr);
++ rq->flags = tmp.flags;
+
-+ inode = READ_ONCE(mapping->host);
-+ if (!inode) {
-+ rcu_read_unlock();
-+ put_page(page);
++ return 0;
++}
+
-+ goto again;
-+ }
++COMPAT_SYSCALL_DEFINE6(futex_requeue, struct compat_futex_requeue __user *, uaddr1,
++ struct compat_futex_requeue __user *, uaddr2,
++ unsigned int, nr_wake, unsigned int, nr_requeue,
++ unsigned int, cmpval, unsigned int, flags)
++{
++ struct futex_requeue rq1, rq2;
++ bool shared1, shared2;
++ int ret;
+
-+ key->address = get_inode_sequence_number(inode);
-+ key->mm = (struct mm_struct *) basepage_index(tail);
-+ key->offset |= FUT_OFF_INODE;
++ if (flags)
++ return -EINVAL;
+
-+ rcu_read_unlock();
-+ }
++ ret = compat_futex_parse_requeue(&rq1, uaddr1, &shared1);
++ if (ret)
++ return ret;
+
-+ put_page(page);
-+ return err;
++ ret = compat_futex_parse_requeue(&rq2, uaddr2, &shared2);
++ if (ret)
++ return ret;
++
++ return __futex_requeue(rq1, rq2, nr_wake, nr_requeue, cmpval, shared1, shared2);
+}
++#endif
+
- /**
- * futex_get_bucket - Check if the user address is valid, prepare internal
- * data and calculate the hash
- * @uaddr: futex user address
- * @key: data that uniquely identifies a futex
-+ * @shared: is this a shared futex?
- *
- * Return: address of bucket on success, error code otherwise
- */
- static struct futex_bucket *futex_get_bucket(void __user *uaddr,
-- struct futex_key *key)
-+ struct futex_key *key,
-+ bool shared)
- {
- uintptr_t address = (uintptr_t) uaddr;
- u32 hash_key;
-@@ -145,8 +250,15 @@ static struct futex_bucket *futex_get_bucket(void __user *uaddr,
- if (unlikely(!access_ok(address, sizeof(u32))))
- return ERR_PTR(-EFAULT);
-
-- key->address = address;
-- key->mm = current->mm;
-+ key->offset = address % PAGE_SIZE;
-+ address -= key->offset;
++/**
++ * futex_parse_requeue - Copy a user struct futex_requeue and check its flags
++ * @rq: Kernel struct
++ * @uaddr: Address of user struct
++ * @shared: Out parameter, defines if this is a shared futex
++ *
++ * Return: 0 on success, error code otherwise
++ */
++static int futex_parse_requeue(struct futex_requeue *rq,
++ struct futex_requeue __user *uaddr, bool *shared)
++{
++ if (copy_from_user(rq, uaddr, sizeof(*rq)))
++ return -EFAULT;
+
-+ if (!shared) {
-+ key->address = address;
-+ key->mm = current->mm;
-+ } else {
-+ futex_get_shared_key(address, current->mm, key);
-+ }
-
- /* Generate hash key for this futex using uaddr and current->mm */
- hash_key = jhash2((u32 *) key, sizeof(*key) / sizeof(u32), 0);
-@@ -275,9 +387,10 @@ static int futex_dequeue_multiple(struct futexv *futexv, unsigned int nr)
- * Return: 0 on success, error code otherwise
- */
- static int futex_enqueue(struct futexv *futexv, unsigned int nr_futexes,
-- unsigned int *awaken)
-+ int *awaken)
- {
- int i, ret;
-+ bool shared, retry = false;
- u32 uval, *uaddr, val;
- struct futex_bucket *bucket;
-
-@@ -285,8 +398,18 @@ static int futex_enqueue(struct futexv *futexv, unsigned int nr_futexes,
- set_current_state(TASK_INTERRUPTIBLE);
-
- for (i = 0; i < nr_futexes; i++) {
-- uaddr = (u32 * __user) futexv->objects[i].key.address;
-+ uaddr = (u32 * __user) futexv->objects[i].uaddr;
- val = (u32) futexv->objects[i].val;
-+ shared = (futexv->objects[i].flags & FUTEX_SHARED_FLAG) ? true : false;
++ if (rq->flags & ~FUTEXV_WAITER_MASK ||
++ (rq->flags & FUTEX_SIZE_MASK) != FUTEX_32)
++ return -EINVAL;
+
-+ if (shared && retry) {
-+ futexv->objects[i].bucket =
-+ futex_get_bucket((void *) uaddr,
-+ &futexv->objects[i].key, true);
-+ if (IS_ERR(futexv->objects[i].bucket))
-+ return PTR_ERR(futexv->objects[i].bucket);
-+ }
++ *shared = (rq->flags & FUTEX_SHARED_FLAG) ? true : false;
+
- bucket = futexv->objects[i].bucket;
-
- bucket_inc_waiters(bucket);
-@@ -301,24 +424,32 @@ static int futex_enqueue(struct futexv *futexv, unsigned int nr_futexes,
- __set_current_state(TASK_RUNNING);
- *awaken = futex_dequeue_multiple(futexv, i);
-
-+ if (shared) {
-+ retry = true;
-+ goto retry;
-+ }
++ return 0;
++}
+
- if (__get_user(uval, uaddr))
- return -EFAULT;
-
- if (*awaken >= 0)
-- return 0;
-+ return 1;
-
-+ retry = true;
- goto retry;
- }
-
- if (uval != val) {
- spin_unlock(&bucket->lock);
-
++/**
++ * sys_futex_requeue - Wake futexes at uaddr1 and requeue from uaddr1 to uaddr2
++ * @uaddr1: Address of futexes to be woken/dequeued
++ * @uaddr2: Address for the futexes to be enqueued
++ * @nr_wake: Number of futexes waiting in uaddr1 to be woken up
++ * @nr_requeue: Number of futexes to be requeued from uaddr1 to uaddr2
++ * @cmpval: Expected value at uaddr1
++ * @flags: Reserved flags arg for requeue operation expansion. Must be 0.
++ *
++ * If (*uaddr1->uaddr == cmpval), wake nr_wake waiters at uaddr1->uaddr and
++ * then remove nr_requeue waiters from uaddr1->uaddr and add them to the
++ * uaddr2->uaddr list. Each uaddr has its own set of flags,
++ * that must be defined at struct futex_requeue (such as size, shared, NUMA).
++ *
++ * Return the number of the woken futexes + the number of requeued ones on
++ * success, error code otherwise.
++ */
++SYSCALL_DEFINE6(futex_requeue, struct futex_requeue __user *, uaddr1,
++ struct futex_requeue __user *, uaddr2,
++ unsigned int, nr_wake, unsigned int, nr_requeue,
++ unsigned int, cmpval, unsigned int, flags)
++{
++ struct futex_requeue rq1, rq2;
++ bool shared1, shared2;
++ int ret;
+
- bucket_dec_waiters(bucket);
- __set_current_state(TASK_RUNNING);
- *awaken = futex_dequeue_multiple(futexv, i);
-
-- if (*awaken >= 0)
-- return 0;
-+ if (*awaken >= 0) {
-+ return 1;
-+ }
-
- return -EWOULDBLOCK;
- }
-@@ -336,19 +467,18 @@ static int __futex_wait(struct futexv *futexv,
- struct hrtimer_sleeper *timeout)
- {
- int ret;
-- unsigned int awaken = -1;
-
-- while (1) {
-- ret = futex_enqueue(futexv, nr_futexes, &awaken);
-
-- if (ret < 0)
-- break;
-+ while (1) {
-+ int awaken = -1;
-
-- if (awaken <= 0) {
-- return awaken;
-+ ret = futex_enqueue(futexv, nr_futexes, &awaken);
-+ if (ret) {
-+ if (awaken >= 0)
-+ return awaken;
-+ return ret;
- }
-
--
- /* Before sleeping, check if someone was woken */
- if (!futexv->hint && (!timeout || timeout->task))
- freezable_schedule();
-@@ -419,6 +549,7 @@ static int futex_wait(struct futexv *futexv, unsigned int nr_futexes,
- hrtimer_sleeper_start_expires(timeout, HRTIMER_MODE_ABS);
- }
-
++ if (flags)
++ return -EINVAL;
+
- ret = __futex_wait(futexv, nr_futexes, timo ? timeout : NULL);
-
-
-@@ -438,9 +569,10 @@ static int futex_wait(struct futexv *futexv, unsigned int nr_futexes,
- SYSCALL_DEFINE4(futex_wait, void __user *, uaddr, unsigned int, val,
- unsigned int, flags, struct __kernel_timespec __user *, timo)
++ ret = futex_parse_requeue(&rq1, uaddr1, &shared1);
++ if (ret)
++ return ret;
++
++ ret = futex_parse_requeue(&rq2, uaddr2, &shared2);
++ if (ret)
++ return ret;
++
++ return __futex_requeue(rq1, rq2, nr_wake, nr_requeue, cmpval, shared1, shared2);
++}
++
+ static int __init futex2_init(void)
{
-+ bool shared = (flags & FUTEX_SHARED_FLAG) ? true : false;
- unsigned int size = flags & FUTEX_SIZE_MASK;
-- struct hrtimer_sleeper timeout;
- struct futex_single_waiter wait_single;
-+ struct hrtimer_sleeper timeout;
- struct futex_waiter *waiter;
- struct futexv *futexv;
- int ret;
-@@ -452,6 +584,7 @@ SYSCALL_DEFINE4(futex_wait, void __user *, uaddr, unsigned int, val,
- waiter = &wait_single.waiter;
- waiter->index = 0;
- waiter->val = val;
-+ waiter->uaddr = (uintptr_t) uaddr;
-
- INIT_LIST_HEAD(&waiter->list);
-
-@@ -462,11 +595,14 @@ SYSCALL_DEFINE4(futex_wait, void __user *, uaddr, unsigned int, val,
- return -EINVAL;
-
- /* Get an unlocked hash bucket */
-- waiter->bucket = futex_get_bucket(uaddr, &waiter->key);
-- if (IS_ERR(waiter->bucket))
-+ waiter->bucket = futex_get_bucket(uaddr, &waiter->key, shared);
-+ if (IS_ERR(waiter->bucket)) {
- return PTR_ERR(waiter->bucket);
-+ }
+ int i;
+diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
+index 977890c58..1750dfc41 100644
+--- a/kernel/sys_ni.c
++++ b/kernel/sys_ni.c
+@@ -154,6 +154,7 @@ COND_SYSCALL_COMPAT(get_robust_list);
+ COND_SYSCALL(futex_wait);
+ COND_SYSCALL(futex_wake);
+ COND_SYSCALL(futex_waitv);
++COND_SYSCALL(futex_requeue);
- ret = futex_wait(futexv, 1, timo, &timeout, flags);
-+ if (ret > 0)
-+ ret = 0;
+ /* kernel/hrtimer.c */
- return ret;
- }
-@@ -486,8 +622,10 @@ static int futex_parse_waitv(struct futexv *futexv,
- struct futex_waitv waitv;
- unsigned int i;
- struct futex_bucket *bucket;
-+ bool shared;
+--
+2.30.2
+
+
+From 9ef45e80251029ad164b538b20f0d68a9b75865c Mon Sep 17 00:00:00 2001
+From: =?UTF-8?q?Andr=C3=A9=20Almeida?= <andrealmeid@collabora.com>
+Date: Thu, 11 Feb 2021 10:47:23 -0300
+Subject: [PATCH 05/13] futex2: Add compatibility entry point for x86_x32 ABI
+MIME-Version: 1.0
+Content-Type: text/plain; charset=UTF-8
+Content-Transfer-Encoding: 8bit
+
+New syscalls should use the same entry point for x86_64 and x86_x32
+paths. Add a wrapper for x32 calls to use parse functions that assume
+32-bit pointers.
+
+Signed-off-by: André Almeida <andrealmeid@collabora.com>
+Signed-off-by: Jan200101 <sentrycraft123@gmail.com>
+---
+ kernel/futex2.c | 42 +++++++++++++++++++++++++++++++++++-------
+ 1 file changed, 35 insertions(+), 7 deletions(-)
+
+diff --git a/kernel/futex2.c b/kernel/futex2.c
+index bad8c183c..8a8b45f98 100644
+--- a/kernel/futex2.c
++++ b/kernel/futex2.c
+@@ -23,6 +23,10 @@
+ #include <linux/syscalls.h>
+ #include <uapi/linux/futex.h>
- for (i = 0; i < nr_futexes; i++) {
++#ifdef CONFIG_X86_64
++#include <linux/compat.h>
++#endif
+
- if (copy_from_user(&waitv, &uwaitv[i], sizeof(waitv)))
- return -EFAULT;
-
-@@ -495,8 +633,10 @@ static int futex_parse_waitv(struct futexv *futexv,
- (waitv.flags & FUTEX_SIZE_MASK) != FUTEX_32)
- return -EINVAL;
-
-+ shared = (waitv.flags & FUTEX_SHARED_FLAG) ? true : false;
+ /**
+ * struct futex_key - Components to build unique key for a futex
+ * @pointer: Pointer to current->mm or inode's UUID for file backed futexes
+@@ -875,7 +879,16 @@ SYSCALL_DEFINE4(futex_waitv, struct futex_waitv __user *, waiters,
+ futexv->hint = false;
+ futexv->task = current;
+
+- ret = futex_parse_waitv(futexv, waiters, nr_futexes);
++#ifdef CONFIG_X86_X32_ABI
++ if (in_x32_syscall()) {
++ ret = compat_futex_parse_waitv(futexv, (struct compat_futex_waitv *)waiters,
++ nr_futexes);
++ } else
++#endif
++ {
++ ret = futex_parse_waitv(futexv, waiters, nr_futexes);
++ }
+
- bucket = futex_get_bucket(waitv.uaddr,
-- &futexv->objects[i].key);
-+ &futexv->objects[i].key, shared);
- if (IS_ERR(bucket))
- return PTR_ERR(bucket);
-
-@@ -505,6 +645,7 @@ static int futex_parse_waitv(struct futexv *futexv,
- futexv->objects[i].flags = waitv.flags;
- futexv->objects[i].index = i;
- INIT_LIST_HEAD(&futexv->objects[i].list);
-+ futexv->objects[i].uaddr = (uintptr_t) waitv.uaddr;
- }
+ if (!ret)
+ ret = futex_set_timer_and_wait(futexv, nr_futexes, timo, flags);
- return 0;
-@@ -573,6 +714,7 @@ static struct futexv *futex_get_parent(uintptr_t waiter, u8 index)
- SYSCALL_DEFINE3(futex_wake, void __user *, uaddr, unsigned int, nr_wake,
- unsigned int, flags)
- {
-+ bool shared = (flags & FUTEX_SHARED_FLAG) ? true : false;
- unsigned int size = flags & FUTEX_SIZE_MASK;
- struct futex_waiter waiter, *aux, *tmp;
- struct futex_bucket *bucket;
-@@ -586,7 +728,7 @@ SYSCALL_DEFINE3(futex_wake, void __user *, uaddr, unsigned int, nr_wake,
- if (size != FUTEX_32)
+@@ -1181,13 +1194,28 @@ SYSCALL_DEFINE6(futex_requeue, struct futex_requeue __user *, uaddr1,
+ if (flags)
return -EINVAL;
-- bucket = futex_get_bucket(uaddr, &waiter.key);
-+ bucket = futex_get_bucket(uaddr, &waiter.key, shared);
- if (IS_ERR(bucket))
- return PTR_ERR(bucket);
+- ret = futex_parse_requeue(&rq1, uaddr1, &shared1);
+- if (ret)
+- return ret;
++#ifdef CONFIG_X86_X32_ABI
++ if (in_x32_syscall()) {
++ ret = compat_futex_parse_requeue(&rq1, (struct compat_futex_requeue *)uaddr1,
++ &shared1);
++ if (ret)
++ return ret;
-@@ -599,7 +741,8 @@ SYSCALL_DEFINE3(futex_wake, void __user *, uaddr, unsigned int, nr_wake,
- break;
+- ret = futex_parse_requeue(&rq2, uaddr2, &shared2);
+- if (ret)
+- return ret;
++ ret = compat_futex_parse_requeue(&rq2, (struct compat_futex_requeue *)uaddr2,
++ &shared2);
++ if (ret)
++ return ret;
++ } else
++#endif
++ {
++ ret = futex_parse_requeue(&rq1, uaddr1, &shared1);
++ if (ret)
++ return ret;
++
++ ret = futex_parse_requeue(&rq2, uaddr2, &shared2);
++ if (ret)
++ return ret;
++ }
- if (waiter.key.address == aux->key.address &&
-- waiter.key.mm == aux->key.mm) {
-+ waiter.key.mm == aux->key.mm &&
-+ waiter.key.offset == aux->key.offset) {
- struct futexv *parent =
- futex_get_parent((uintptr_t) aux, aux->index);
+ return __futex_requeue(rq1, rq2, nr_wake, nr_requeue, cmpval, shared1, shared2);
+ }
+--
+2.30.2
+
+
+From 80944da5db0f1e00d0bf174d85f74ae4df2444aa Mon Sep 17 00:00:00 2001
+From: =?UTF-8?q?Andr=C3=A9=20Almeida?= <andrealmeid@collabora.com>
+Date: Tue, 9 Feb 2021 13:59:00 -0300
+Subject: [PATCH 06/13] docs: locking: futex2: Add documentation
+MIME-Version: 1.0
+Content-Type: text/plain; charset=UTF-8
+Content-Transfer-Encoding: 8bit
+
+Add a new documentation file specifying both userspace API and internal
+implementation details of futex2 syscalls.
+
+Signed-off-by: André Almeida <andrealmeid@collabora.com>
+Signed-off-by: Jan200101 <sentrycraft123@gmail.com>
+---
+ Documentation/locking/futex2.rst | 198 +++++++++++++++++++++++++++++++
+ Documentation/locking/index.rst | 1 +
+ 2 files changed, 199 insertions(+)
+ create mode 100644 Documentation/locking/futex2.rst
+
+diff --git a/Documentation/locking/futex2.rst b/Documentation/locking/futex2.rst
+new file mode 100644
+index 000000000..edd47c22f
+--- /dev/null
++++ b/Documentation/locking/futex2.rst
+@@ -0,0 +1,198 @@
++.. SPDX-License-Identifier: GPL-2.0
++
++======
++futex2
++======
++
++:Author: André Almeida <andrealmeid@collabora.com>
++
++futex, or fast user mutex, is a set of syscalls that allows userspace to create
++performant synchronization mechanisms, such as mutexes, semaphores and
++condition variables. C standard libraries, like glibc, use it
++as a means to implement higher-level interfaces like pthreads.
++
++The interface
++=============
++
++uAPI functions
++--------------
++
++.. kernel-doc:: kernel/futex2.c
++ :identifiers: sys_futex_wait sys_futex_wake sys_futex_waitv sys_futex_requeue
++
++uAPI structures
++---------------
++
++.. kernel-doc:: include/uapi/linux/futex.h
++
++The ``flag`` argument
++---------------------
++
++The flag is used to specify the size of the futex word
++(FUTEX_[8, 16, 32]). It's mandatory to define one, since there's no
++default size.
++
++By default, the timeout uses a monotonic clock, but a realtime clock can be
++selected with the FUTEX_REALTIME_CLOCK flag.
++
++By default, futexes are of the private type, which means that the user address
++will be accessed by threads that share the same memory region. This allows for
++some internal optimizations, so they are faster. However, if the address needs
++to be shared with different processes (like using ``mmap()`` or ``shm()``), it
++needs to be defined as shared, which is done with the FUTEX_SHARED_FLAG flag.
++
++By default, the operation has no NUMA-awareness, meaning that the user can't
++choose the memory node where the kernel side futex data will be stored. The
++user can choose the node where it wants to operate by setting the
++FUTEX_NUMA_FLAG and using the following structure (where X can be 8, 16, or
++32)::
++
++ struct futexX_numa {
++ __uX value;
++ __sX hint;
++ };
++
++This structure should be passed as the ``void *uaddr`` of futex functions. The
++address of the structure is the address that is waited on and woken on, and the
++``value`` will be compared to ``val`` as usual. The ``hint`` member is used to
++define which node the futex will use. When waiting, the futex will be
++registered on a kernel-side table stored on that node; when waking, the futex
++will be searched for on that given table. That means that there's no redundancy
++between tables, and a wrong ``hint`` value will lead to undesired behavior.
++Userspace is responsible for dealing with node migration issues that may
++occur. ``hint`` can range from [0, MAX_NUMA_NODES], for specifying a node, or
++-1, to use the same node the current process is using.
++
++When not using FUTEX_NUMA_FLAG on a NUMA system, the futex will be stored on a
++global table on some node, defined at compilation time.
++
++The ``timo`` argument
++---------------------
++
++As per the Y2038 work done in the kernel, new interfaces shouldn't add timeout
++options known to be buggy. Given that, ``timo`` should be a 64-bit timeout on
++all platforms, using an absolute timeout value.
++
++Implementation
++==============
++
++The internal implementation follows a similar design to the original futex.
++Given that we want to replicate the same external behavior of current futex,
++this should be somewhat expected.
++
++Waiting
++-------
++
++All wait operations are treated as if you want to wait on N futexes, so the
++path for futex_wait and futex_waitv is basically the same. For both syscalls,
++the first step is to prepare an internal list of the futexes to wait for
++(using struct futexv_head). For futex_wait() calls, this list will have a
++single object.
++
++We have a hash table, where waiters register themselves before sleeping. Then,
++the wake function checks this table looking for waiters at uaddr. The hash
++bucket to be used is determined by a struct futex_key, that stores information
++to uniquely identify an address from a given process. Given the huge address
++space, there'll be hash collisions, so we store information to be later used on
++collision treatment.
++
++First, for every futex we want to wait on, we check if (``*uaddr == val``).
++This check is done holding the bucket lock, so we are correctly serialized with
++any futex_wake() calls. If any waiter fails the check above, we dequeue all
++futexes. The check (``*uaddr == val``) can fail for two reasons:
++
++- The values are different, and we return -EAGAIN. However, if while
++ dequeueing we found that some futexes were awakened, we prioritize this
++ and return success.
++
++- When trying to access the user address, we do so with page faults
++ disabled because we are holding a bucket's spin lock (and can't sleep
++ while holding a spin lock). If there's an error, it might be a page
++ fault, or an invalid address. We release the lock, dequeue everyone
++ (because it's illegal to sleep while there are futexes enqueued, we
++ could lose wakeups) and try again with page faults enabled. If we
++ succeed, this means that the address is valid, but we need to do
++ all the work again. For serialization reasons, we need to hold the
++ spin lock when getting the user value. Additionally, for shared
++ futexes, we also need to recalculate the hash, since the underlying
++ mapping mechanisms could have changed when dealing with the page fault.
++ If, even with page faults enabled, we can't access the address, it
++ means it's an invalid user address, and we return -EFAULT. For this
++ case, we prioritize the error, even if some futexes were awakened.
++
++If the check is OK, the waiter is enqueued on a linked list in its bucket, and
++we proceed to the next one. If all waiters succeed, we put the thread to sleep
++until a futex_wake() call, the timeout expires or we get a signal. After waking
++up, we dequeue everyone, and check if some futex was awakened. This dequeue is
++done by iteratively walking each element of the struct futex_head list.
++
++All enqueuing/dequeuing operations require holding the bucket lock, to avoid
++racing while modifying the list.
++
++Waking
++------
++
++We get the bucket that's storing the waiters at uaddr, and wake the required
++number of waiters, checking for hash collision.
++
++There's an optimization that lets futex_wake() skip taking the bucket lock if
++there's no one to be woken on that bucket. It checks an atomic counter that each
++bucket has; if it reads 0, then the syscall exits. For this to work, the
++waiter thread increases it before taking the lock, so the wake thread will
++correctly see that there's someone waiting and will continue the path to take
++the bucket lock. To get the correct serialization, the waiter issues a memory
++barrier after increasing the bucket counter and the waker issues a memory
++barrier before checking it.
++
++Requeuing
++---------
++
++The requeue path first checks each struct futex_requeue and its flags.
++Then, it will compare the expected value with the one at uaddr1::uaddr.
++Following the same serialization explained at Waking_, we increase the atomic
++counter for the bucket of uaddr2 before taking the lock. We need to hold both
++bucket locks at the same time so we don't race with other futex operations. To
++ensure the locks are taken in the same order for all threads (and thus avoid
++deadlocks), every requeue operation takes the "smaller" bucket first, when
++comparing both addresses.
++
++If the comparison with the user value succeeds, we proceed by waking
++``nr_wake`` futexes, and then requeuing ``nr_requeue`` from the bucket of uaddr1
++to that of uaddr2. This consists of a simple list deletion/addition and
++replacing the old futex key with the new one.
++
++Futex keys
++----------
++
++There are two types of futexes: private and shared ones. Private futexes are
++meant to be used by threads that share the same memory space; they are easier
++to uniquely identify and thus allow some performance optimization. The elements
++for identifying one are: the start address of the page where the address is,
++the address offset within the page and the current->mm pointer.
++
++Now, for uniquely identifying a shared futex:
++
++- If the page containing the user address is an anonymous page, we can
++ just use the same data used for private futexes (the start address of
++ the page, the address offset within the page and the current->mm
++ pointer); that will be enough for uniquely identifying such a futex. We
++ also set one bit in the key to tell it apart from a private futex
++ used on the same address (mixing shared and private calls does not
++ work).
++
++- If the page is file-backed, current->mm may not be the same one for
++ every user of this futex, so we need to use other data: the
++ page->index, a UUID for the struct inode and the offset within the
++ page.
++
++Note that members of futex_key don't have any particular meaning after they
++are part of the struct - they are just bytes to identify a futex. Given that,
++we don't need to use a particular name or type that matches the original data,
++we only need to care about the bitsize of each component and make both private
++and shared fit in the same memory space.
++
++Source code documentation
++=========================
++
++.. kernel-doc:: kernel/futex2.c
++ :no-identifiers: sys_futex_wait sys_futex_wake sys_futex_waitv sys_futex_requeue
+diff --git a/Documentation/locking/index.rst b/Documentation/locking/index.rst
+index 7003bd5ae..9bf03c7fa 100644
+--- a/Documentation/locking/index.rst
++++ b/Documentation/locking/index.rst
+@@ -24,6 +24,7 @@ locking
+ percpu-rw-semaphore
+ robust-futexes
+ robust-futex-ABI
++ futex2
+
+ .. only:: subproject and html
--
-2.29.2
+2.30.2
-From ce3ae4bd9f98763fda07f315c1f239c4aaef4b5e Mon Sep 17 00:00:00 2001
+From 807830198558476757c3e1b77fcfad2129fe29fa Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Andr=C3=A9=20Almeida?= <andrealmeid@collabora.com>
-Date: Thu, 9 Jul 2020 11:34:40 -0300
-Subject: [PATCH 4/9] selftests: futex: Add futex2 wake/wait test
+Date: Fri, 5 Feb 2021 10:34:01 -0300
+Subject: [PATCH 07/13] selftests: futex2: Add wake/wait test
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
-Add a simple test to test wake/wait mechanism using futex2 interface.
+Add a simple file to test wake/wait mechanism using futex2 interface.
+Test three scenarios: using a common local int variable as private
+futex, a shm futex as shared futex and a file-backed shared memory as a
+shared futex. This should test all branches of futex_get_key().
+
Create helper files so more tests can evaluate futex2. While 32bit ABIs
-from glibc aren't able to use 64 bit sized time variables, add a
+from glibc aren't yet able to use 64 bit sized time variables, add a
temporary workaround that implements the required types and calls the
appropriated syscalls, since futex2 doesn't supports 32 bit sized time.
Signed-off-by: André Almeida <andrealmeid@collabora.com>
Signed-off-by: Jan200101 <sentrycraft123@gmail.com>
---
- tools/include/uapi/asm-generic/unistd.h | 1 -
.../selftests/futex/functional/.gitignore | 1 +
- .../selftests/futex/functional/Makefile | 4 +-
- .../selftests/futex/functional/futex2_wait.c | 148 ++++++++++++++++++
+ .../selftests/futex/functional/Makefile | 6 +-
+ .../selftests/futex/functional/futex2_wait.c | 209 ++++++++++++++++++
.../testing/selftests/futex/functional/run.sh | 3 +
- .../selftests/futex/include/futex2test.h | 77 +++++++++
- 6 files changed, 232 insertions(+), 2 deletions(-)
+ .../selftests/futex/include/futex2test.h | 79 +++++++
+ 5 files changed, 296 insertions(+), 2 deletions(-)
create mode 100644 tools/testing/selftests/futex/functional/futex2_wait.c
create mode 100644 tools/testing/selftests/futex/include/futex2test.h
-diff --git a/tools/include/uapi/asm-generic/unistd.h b/tools/include/uapi/asm-generic/unistd.h
-index 7de33be59..81a90b697 100644
---- a/tools/include/uapi/asm-generic/unistd.h
-+++ b/tools/include/uapi/asm-generic/unistd.h
-@@ -872,7 +872,6 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
- #undef __NR_syscalls
- #define __NR_syscalls 444
-
--
- /*
- * 32 bit systems traditionally used different
- * syscalls for off_t and loff_t arguments, while
diff --git a/tools/testing/selftests/futex/functional/.gitignore b/tools/testing/selftests/futex/functional/.gitignore
index 0efcd494d..d61f1df94 100644
--- a/tools/testing/selftests/futex/functional/.gitignore
@@ -1808,10 +2777,15 @@ index 0efcd494d..d61f1df94 100644
futex_wait_wouldblock
+futex2_wait
diff --git a/tools/testing/selftests/futex/functional/Makefile b/tools/testing/selftests/futex/functional/Makefile
-index 23207829e..7142a94a7 100644
+index 23207829e..9b334f190 100644
--- a/tools/testing/selftests/futex/functional/Makefile
+++ b/tools/testing/selftests/futex/functional/Makefile
-@@ -5,6 +5,7 @@ LDLIBS := -lpthread -lrt
+@@ -1,10 +1,11 @@
+ # SPDX-License-Identifier: GPL-2.0
+-INCLUDES := -I../include -I../../
++INCLUDES := -I../include -I../../ -I../../../../../usr/include/
+ CFLAGS := $(CFLAGS) -g -O2 -Wall -D_GNU_SOURCE -pthread $(INCLUDES)
+ LDLIBS := -lpthread -lrt
HEADERS := \
../include/futextest.h \
@@ -1831,14 +2805,14 @@ index 23207829e..7142a94a7 100644
diff --git a/tools/testing/selftests/futex/functional/futex2_wait.c b/tools/testing/selftests/futex/functional/futex2_wait.c
new file mode 100644
-index 000000000..0646a24b7
+index 000000000..4b5416585
--- /dev/null
+++ b/tools/testing/selftests/futex/functional/futex2_wait.c
-@@ -0,0 +1,148 @@
+@@ -0,0 +1,209 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/******************************************************************************
+ *
-+ * Copyright Collabora Ltd., 2020
++ * Copyright Collabora Ltd., 2021
+ *
+ * DESCRIPTION
+ * Test wait/wake mechanism of futex2, using 32bit sized futexes.
@@ -1847,7 +2821,7 @@ index 000000000..0646a24b7
+ * André Almeida <andrealmeid@collabora.com>
+ *
+ * HISTORY
-+ * 2020-Jul-9: Initial version by André <andrealmeid@collabora.com>
++ * 2021-Feb-5: Initial version by André <andrealmeid@collabora.com>
+ *
+ *****************************************************************************/
+
@@ -1860,12 +2834,16 @@ index 000000000..0646a24b7
+#include <time.h>
+#include <pthread.h>
+#include <sys/shm.h>
++#include <sys/mman.h>
++#include <fcntl.h>
++#include <string.h>
+#include "futex2test.h"
+#include "logging.h"
+
+#define TEST_NAME "futex2-wait"
+#define timeout_ns 30000000
+#define WAKE_WAIT_US 10000
++#define SHM_PATH "futex2_shm_file"
+futex_t *f1;
+
+void usage(char *prog)
@@ -1881,6 +2859,7 @@ index 000000000..0646a24b7
+{
+ struct timespec64 to64;
+ unsigned int flags = 0;
++
+ if (arg)
+ flags = *((unsigned int *) arg);
+
@@ -1901,6 +2880,13 @@ index 000000000..0646a24b7
+ return NULL;
+}
+
++void *waitershm(void *arg)
++{
++ futex2_wait(arg, 0, FUTEX_32 | FUTEX_SHARED_FLAG, NULL);
++
++ return NULL;
++}
++
+int main(int argc, char *argv[])
+{
+ pthread_t waiter;
@@ -1908,6 +2894,7 @@ index 000000000..0646a24b7
+ int res, ret = RET_PASS;
+ int c;
+ futex_t f_private = 0;
++
+ f1 = &f_private;
+
+ while ((c = getopt(argc, argv, "cht:v:")) != -1) {
@@ -1928,10 +2915,11 @@ index 000000000..0646a24b7
+ }
+
+ ksft_print_header();
-+ ksft_set_plan(2);
++ ksft_set_plan(3);
+ ksft_print_msg("%s: Test FUTEX2_WAIT\n",
+ basename(argv[0]));
+
++ /* Testing a private futex */
+ info("Calling private futex2_wait on f1: %u @ %p with val=%u\n", *f1, f1, *f1);
+
+ if (pthread_create(&waiter, NULL, waiterfn, NULL))
@@ -1951,12 +2939,15 @@ index 000000000..0646a24b7
+ }
+
+ int shm_id = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0666);
++
+ if (shm_id < 0) {
+ perror("shmget");
+ exit(1);
+ }
+
++ /* Testing an anon page shared memory */
+ unsigned int *shared_data = shmat(shm_id, NULL, 0);
++
+ *shared_data = 0;
+ f1 = shared_data;
+
@@ -1970,16 +2961,60 @@ index 000000000..0646a24b7
+ info("Calling shared futex2_wake on f1: %u @ %p with val=%u\n", *f1, f1, *f1);
+ res = futex2_wake(f1, 1, FUTEX_32 | FUTEX_SHARED_FLAG);
+ if (res != 1) {
-+ ksft_test_result_fail("futex2_wake shared returned: %d %s\n",
++ ksft_test_result_fail("futex2_wake shared (shmget) returned: %d %s\n",
+ res ? errno : res,
+ res ? strerror(errno) : "");
+ ret = RET_FAIL;
+ } else {
-+ ksft_test_result_pass("futex2_wake shared succeeds\n");
++ ksft_test_result_pass("futex2_wake shared (shmget) succeeds\n");
+ }
+
+ shmdt(shared_data);
+
++ /* Testing a file backed shared memory */
++ void *shm;
++ int fd, pid;
++
++ f_private = 0;
++
++ fd = open(SHM_PATH, O_RDWR | O_CREAT, S_IRUSR | S_IWUSR);
++ if (fd < 0) {
++ perror("open");
++ exit(1);
++ }
++
++ res = ftruncate(fd, sizeof(f_private));
++ if (res) {
++ perror("ftruncate");
++ exit(1);
++ }
++
++ shm = mmap(NULL, sizeof(f_private), PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
++ if (shm == MAP_FAILED) {
++ perror("mmap");
++ exit(1);
++ }
++
++ memcpy(shm, &f_private, sizeof(f_private));
++
++ pthread_create(&waiter, NULL, waitershm, shm);
++
++ usleep(WAKE_WAIT_US);
++
++ res = futex2_wake(shm, 1, FUTEX_32 | FUTEX_SHARED_FLAG);
++ if (res != 1) {
++ ksft_test_result_fail("futex2_wake shared (mmap) returned: %d %s\n",
++ res ? errno : res,
++ res ? strerror(errno) : "");
++ ret = RET_FAIL;
++ } else {
++ ksft_test_result_pass("futex2_wake shared (mmap) succeeds\n");
++ }
++
++ munmap(shm, sizeof(f_private));
++
++ remove(SHM_PATH);
++
+ ksft_print_cnts();
+ return ret;
+}
@@ -1996,14 +3031,14 @@ index 1acb6ace1..3730159c8 100755
+./futex2_wait $COLOR
diff --git a/tools/testing/selftests/futex/include/futex2test.h b/tools/testing/selftests/futex/include/futex2test.h
new file mode 100644
-index 000000000..807b8b57f
+index 000000000..e724d56b9
--- /dev/null
+++ b/tools/testing/selftests/futex/include/futex2test.h
-@@ -0,0 +1,77 @@
+@@ -0,0 +1,79 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/******************************************************************************
+ *
-+ * Copyright Collabora Ltd., 2020
++ * Copyright Collabora Ltd., 2021
+ *
+ * DESCRIPTION
+ * Futex2 library addons for old futex library
@@ -2012,7 +3047,7 @@ index 000000000..807b8b57f
+ * André Almeida <andrealmeid@collabora.com>
+ *
+ * HISTORY
-+ * 2020-Jul-9: Initial version by André <andrealmeid@collabora.com>
++ * 2021-Feb-5: Initial version by André <andrealmeid@collabora.com>
+ *
+ *****************************************************************************/
+#include "futextest.h"
@@ -2027,12 +3062,7 @@ index 000000000..807b8b57f
+# define FUTEX_16 1
+#endif
+#ifndef FUTEX_32
-+#define FUTEX_32 2
-+#endif
-+#ifdef __x86_64__
-+# ifndef FUTEX_64
-+# define FUTEX_64 3
-+# endif
++# define FUTEX_32 2
+#endif
+
+/*
@@ -2061,8 +3091,12 @@ index 000000000..807b8b57f
+ * - End of Y2038 section -
+ */
+
-+/*
-+ * wait for uaddr if (*uaddr == val)
++/**
++ * futex2_wait - If (*uaddr == val), wait at uaddr until timo
++ * @uaddr: User address to wait on
+ * @val: Expected value at uaddr, return if it is not equal
++ * @flags: Operation flags
++ * @timo: Optional timeout for operation
+ */
+static inline int futex2_wait(volatile void *uaddr, unsigned long val,
+ unsigned long flags, struct timespec64 *timo)
@@ -2070,27 +3104,31 @@ index 000000000..807b8b57f
+ return syscall(__NR_futex_wait, uaddr, val, flags, timo);
+}
+
-+/*
-+ * wake nr futexes waiting for uaddr
++/**
++ * futex2_wake - Wake a number of waiters at uaddr
++ * @uaddr: Address to wake
++ * @nr: Number of waiters to wake
++ * @flags: Operation flags
+ */
+static inline int futex2_wake(volatile void *uaddr, unsigned int nr, unsigned long flags)
+{
+ return syscall(__NR_futex_wake, uaddr, nr, flags);
+}
--
-2.29.2
+2.30.2
-From 1e0349f5a81a43cdb50d9a97812194df6d937b69 Mon Sep 17 00:00:00 2001
+From 382ed2cfcea3ed7e77d07e3e12b3769a081001ea Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Andr=C3=A9=20Almeida?= <andrealmeid@collabora.com>
-Date: Thu, 9 Jul 2020 11:36:14 -0300
-Subject: [PATCH 5/9] selftests: futex: Add futex2 timeout test
+Date: Fri, 5 Feb 2021 10:34:01 -0300
+Subject: [PATCH 08/13] selftests: futex2: Add timeout test
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Adapt existing futex wait timeout file to test the same mechanism for
-futex2.
+futex2. futex2 accepts only absolute 64bit timers, but supports both
+monotonic and realtime clocks.
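
Since the timeout is absolute, a caller has to read the clock and add the
desired delay, carrying nanosecond overflow into the seconds field. The sketch
below shows that arithmetic under stated assumptions: struct
example_timespec64 is a stand-in for the selftest's struct timespec64 (its
layout is assumed to be two 64-bit fields), deadline_after() and
monotonic_deadline() are hypothetical names, and delta_ns is assumed to be
non-negative. The selftests use their own gettime64() helper instead of
clock_gettime().

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <assert.h>
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/* Stand-in for struct timespec64; the field layout is an assumption. */
struct example_timespec64 {
	int64_t tv_sec;
	int64_t tv_nsec;
};

/* Add a non-negative delay to `now`, keeping tv_nsec below one second. */
static struct example_timespec64 deadline_after(struct example_timespec64 now,
						int64_t delta_ns)
{
	now.tv_sec += delta_ns / NSEC_PER_SEC;
	now.tv_nsec += delta_ns % NSEC_PER_SEC;

	if (now.tv_nsec >= NSEC_PER_SEC) {
		now.tv_sec++;
		now.tv_nsec -= NSEC_PER_SEC;
	}

	return now;
}

/* Build an absolute CLOCK_MONOTONIC deadline `delta_ns` from the present. */
static int monotonic_deadline(int64_t delta_ns, struct example_timespec64 *out)
{
	struct example_timespec64 now;
	struct timespec ts;

	if (clock_gettime(CLOCK_MONOTONIC, &ts))
		return -1;

	now.tv_sec = ts.tv_sec;
	now.tv_nsec = ts.tv_nsec;
	*out = deadline_after(now, delta_ns);

	return 0;
}
```

The same computation applies to CLOCK_REALTIME; only the clock id passed to
clock_gettime() changes.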
Signed-off-by: André Almeida <andrealmeid@collabora.com>
Signed-off-by: Jan200101 <sentrycraft123@gmail.com>
@@ -2099,14 +3137,14 @@ Signed-off-by: Jan200101 <sentrycraft123@gmail.com>
1 file changed, 49 insertions(+), 9 deletions(-)
diff --git a/tools/testing/selftests/futex/functional/futex_wait_timeout.c b/tools/testing/selftests/futex/functional/futex_wait_timeout.c
-index ee55e6d38..245670e44 100644
+index ee55e6d38..b4dffe9e3 100644
--- a/tools/testing/selftests/futex/functional/futex_wait_timeout.c
+++ b/tools/testing/selftests/futex/functional/futex_wait_timeout.c
@@ -11,6 +11,7 @@
*
* HISTORY
* 2009-Nov-6: Initial version by Darren Hart <dvhart@linux.intel.com>
-+ * 2020-Jul-9: Add futex2 test by André <andrealmeid@collabora.com>
++ * 2021-Feb-5: Add futex2 test by André <andrealmeid@collabora.com>
*
*****************************************************************************/
@@ -2198,13 +3236,13 @@ index ee55e6d38..245670e44 100644
return ret;
}
--
-2.29.2
+2.30.2
-From 298120f6e3a758cd03e26a104f5ce60a88501b7f Mon Sep 17 00:00:00 2001
+From 27d37b4e24805d9dc5478c296ee680a8a4db8a6e Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Andr=C3=A9=20Almeida?= <andrealmeid@collabora.com>
-Date: Thu, 9 Jul 2020 11:37:42 -0300
-Subject: [PATCH 6/9] selftests: futex: Add futex2 wouldblock test
+Date: Fri, 5 Feb 2021 10:34:01 -0300
+Subject: [PATCH 09/13] selftests: futex2: Add wouldblock test
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
@@ -2219,14 +3257,14 @@ Signed-off-by: Jan200101 <sentrycraft123@gmail.com>
1 file changed, 29 insertions(+), 4 deletions(-)
diff --git a/tools/testing/selftests/futex/functional/futex_wait_wouldblock.c b/tools/testing/selftests/futex/functional/futex_wait_wouldblock.c
-index 0ae390ff8..1f72e5928 100644
+index 0ae390ff8..ed3660090 100644
--- a/tools/testing/selftests/futex/functional/futex_wait_wouldblock.c
+++ b/tools/testing/selftests/futex/functional/futex_wait_wouldblock.c
@@ -12,6 +12,7 @@
*
* HISTORY
* 2009-Nov-14: Initial version by Gowrishankar <gowrishankar.m@in.ibm.com>
-+ * 2020-Jul-9: Add futex2 test by André <andrealmeid@collabora.com>
++ * 2021-Feb-5: Add futex2 test by André <andrealmeid@collabora.com>
*
*****************************************************************************/
@@ -2293,26 +3331,30 @@ index 0ae390ff8..1f72e5928 100644
return ret;
}
--
-2.29.2
+2.30.2
-From 05c697a239aad5e8608c6acf0da9239cac5f7a2e Mon Sep 17 00:00:00 2001
+From 2b2f4e71b3bb09c0d45f9eae4c1986155d3a1235 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Andr=C3=A9=20Almeida?= <andrealmeid@collabora.com>
-Date: Tue, 8 Dec 2020 18:47:31 -0300
-Subject: [PATCH 7/9] selftests: futex: Add futex2 waitv test
+Date: Fri, 5 Feb 2021 10:34:02 -0300
+Subject: [PATCH 10/13] selftests: futex2: Add waitv test
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
+Create a new file to test the waitv mechanism. Test both private and
+shared futexes. Wake the last futex in the array, and check if the
+return value from futex_waitv() is the right index.
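
The array setup that this test relies on can be sketched as below. Everything
here is illustrative: struct example_waitv only mirrors the three fields the
test assigns (the real definition lives in futex2test.h and the type of uaddr
is an assumption), EXAMPLE_FUTEX_32 copies the value of FUTEX_32 from that
header, and fill_waiters() and expected_wake_index() are hypothetical helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EXAMPLE_FUTEX_32 2	/* same value as FUTEX_32 in futex2test.h */

/* Illustrative mirror of the waitv entry; the uaddr type is an assumption. */
struct example_waitv {
	void *uaddr;
	unsigned int val;
	unsigned int flags;
};

/* Point entry i at futexes[i], expecting the value 0 at every address. */
static void fill_waiters(struct example_waitv *waiters, uint32_t *futexes,
			 size_t nr, unsigned int flags)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		futexes[i] = 0;
		waiters[i].uaddr = &futexes[i];
		waiters[i].val = 0;
		waiters[i].flags = flags;
	}
}

/* The test wakes the last entry, so that is the index it expects back. */
static size_t expected_wake_index(size_t nr)
{
	return nr - 1;
}
```

Waking the last element rather than the first makes the check more telling:
an implementation that always reported index 0 would not pass.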
+
Signed-off-by: André Almeida <andrealmeid@collabora.com>
Signed-off-by: Jan200101 <sentrycraft123@gmail.com>
---
.../selftests/futex/functional/.gitignore | 1 +
.../selftests/futex/functional/Makefile | 3 +-
- .../selftests/futex/functional/futex2_waitv.c | 156 ++++++++++++++++++
+ .../selftests/futex/functional/futex2_waitv.c | 157 ++++++++++++++++++
.../testing/selftests/futex/functional/run.sh | 3 +
- .../selftests/futex/include/futex2test.h | 25 ++-
- 5 files changed, 183 insertions(+), 5 deletions(-)
+ .../selftests/futex/include/futex2test.h | 26 +++
+ 5 files changed, 189 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/futex/functional/futex2_waitv.c
diff --git a/tools/testing/selftests/futex/functional/.gitignore b/tools/testing/selftests/futex/functional/.gitignore
@@ -2325,7 +3367,7 @@ index d61f1df94..d0b8f637b 100644
futex2_wait
+futex2_waitv
diff --git a/tools/testing/selftests/futex/functional/Makefile b/tools/testing/selftests/futex/functional/Makefile
-index 7142a94a7..b857b9450 100644
+index 9b334f190..09c08ccde 100644
--- a/tools/testing/selftests/futex/functional/Makefile
+++ b/tools/testing/selftests/futex/functional/Makefile
@@ -16,7 +16,8 @@ TEST_GEN_FILES := \
@@ -2340,14 +3382,14 @@ index 7142a94a7..b857b9450 100644
diff --git a/tools/testing/selftests/futex/functional/futex2_waitv.c b/tools/testing/selftests/futex/functional/futex2_waitv.c
new file mode 100644
-index 000000000..d4b116651
+index 000000000..2f81d296d
--- /dev/null
+++ b/tools/testing/selftests/futex/functional/futex2_waitv.c
-@@ -0,0 +1,156 @@
+@@ -0,0 +1,157 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/******************************************************************************
+ *
-+ * Copyright Collabora Ltd., 2020
++ * Copyright Collabora Ltd., 2021
+ *
+ * DESCRIPTION
+ * Test waitv/wake mechanism of futex2, using 32bit sized futexes.
@@ -2356,7 +3398,7 @@ index 000000000..d4b116651
+ * André Almeida <andrealmeid@collabora.com>
+ *
+ * HISTORY
-+ * 2020-Jul-9: Initial version by André <andrealmeid@collabora.com>
++ * 2021-Feb-5: Initial version by André <andrealmeid@collabora.com>
+ *
+ *****************************************************************************/
+
@@ -2401,7 +3443,11 @@ index 000000000..d4b116651
+
+ res = futex2_waitv(waitv, NR_FUTEXES, 0, &to64);
+ if (res < 0) {
-+ printf("waiter failed errno %d %s\n",
++ ksft_test_result_fail("futex2_waitv private returned: %d %s\n",
++ res ? errno : res,
++ res ? strerror(errno) : "");
++ } else if (res != NR_FUTEXES - 1) {
++ ksft_test_result_fail("futex2_waitv private returned: %d %s\n",
+ res ? errno : res,
+ res ? strerror(errno) : "");
+ }
@@ -2437,23 +3483,21 @@ index 000000000..d4b116651
+ ksft_print_msg("%s: Test FUTEX2_WAITV\n",
+ basename(argv[0]));
+
-+ //info("Calling private futex2_wait on f1: %u @ %p with val=%u\n", *f1, f1, *f1);
-+
+ for (i = 0; i < NR_FUTEXES; i++) {
-+ waitv[i].uaddr = &futexes[i];
++ waitv[i].uaddr = &futexes[i];
+ waitv[i].flags = FUTEX_32;
+ waitv[i].val = 0;
+ }
+
++ /* Private waitv */
+ if (pthread_create(&waiter, NULL, waiterfn, NULL))
+ error("pthread_create failed\n", errno);
+
+ usleep(WAKE_WAIT_US);
+
-+ // info("Calling private futex2_wake on f1: %u @ %p with val=%u\n", *f1, f1, *f1);
+ res = futex2_wake(waitv[NR_FUTEXES - 1].uaddr, 1, FUTEX_32);
+ if (res != 1) {
-+ ksft_test_result_fail("futex2_wake private returned: %d %s\n",
++ ksft_test_result_fail("futex2_waitv private returned: %d %s\n",
+ res ? errno : res,
+ res ? strerror(errno) : "");
+ ret = RET_FAIL;
@@ -2461,37 +3505,36 @@ index 000000000..d4b116651
+ ksft_test_result_pass("futex2_waitv private succeeds\n");
+ }
+
++ /* Shared waitv */
+ for (i = 0; i < NR_FUTEXES; i++) {
+ int shm_id = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0666);
++
+ if (shm_id < 0) {
+ perror("shmget");
+ exit(1);
+ }
+
+ unsigned int *shared_data = shmat(shm_id, NULL, 0);
-+ *shared_data = 0;
+
-+ waitv[i].uaddr = shared_data;
++ *shared_data = 0;
++ waitv[i].uaddr = shared_data;
+ waitv[i].flags = FUTEX_32 | FUTEX_SHARED_FLAG;
+ waitv[i].val = 0;
+ }
+
-+ //info("Calling shared futex2_wait on f1: %u @ %p with val=%u\n", *f1, f1, *f1);
-+
+ if (pthread_create(&waiter, NULL, waiterfn, NULL))
+ error("pthread_create failed\n", errno);
+
+ usleep(WAKE_WAIT_US);
+
-+ // info("Calling shared futex2_wake on f1: %u @ %p with val=%u\n", *f1, f1, *f1);
+ res = futex2_wake(waitv[NR_FUTEXES - 1].uaddr, 1, FUTEX_32 | FUTEX_SHARED_FLAG);
+ if (res != 1) {
-+ ksft_test_result_fail("futex2_wake shared returned: %d %s\n",
++ ksft_test_result_fail("futex2_waitv shared returned: %d %s\n",
+ res ? errno : res,
+ res ? strerror(errno) : "");
+ ret = RET_FAIL;
+ } else {
-+ ksft_test_result_pass("futex2_wake shared succeeds\n");
++ ksft_test_result_pass("futex2_waitv shared succeeds\n");
+ }
+
+ for (i = 0; i < NR_FUTEXES; i++)
@@ -2512,18 +3555,13 @@ index 3730159c8..18b3883d7 100755
+echo
+./futex2_waitv $COLOR
diff --git a/tools/testing/selftests/futex/include/futex2test.h b/tools/testing/selftests/futex/include/futex2test.h
-index 807b8b57f..10be0c504 100644
+index e724d56b9..31979afc4 100644
--- a/tools/testing/selftests/futex/include/futex2test.h
+++ b/tools/testing/selftests/futex/include/futex2test.h
-@@ -27,10 +27,18 @@
- #ifndef FUTEX_32
- #define FUTEX_32 2
+@@ -28,6 +28,19 @@
+ # define FUTEX_32 2
#endif
--#ifdef __x86_64__
--# ifndef FUTEX_64
--# define FUTEX_64 3
--# endif
-+
+
+#ifndef FUTEX_SHARED_FLAG
+#define FUTEX_SHARED_FLAG 8
+#endif
@@ -2535,16 +3573,22 @@ index 807b8b57f..10be0c504 100644
+ unsigned int val;
+ unsigned int flags;
+};
- #endif
-
++#endif
++
/*
-@@ -75,3 +83,12 @@ static inline int futex2_wake(volatile void *uaddr, unsigned int nr, unsigned lo
+ * - Y2038 section for 32-bit applications -
+ *
+@@ -77,3 +90,16 @@ static inline int futex2_wake(volatile void *uaddr, unsigned int nr, unsigned lo
{
return syscall(__NR_futex_wake, uaddr, nr, flags);
}
+
-+/*
-+ * wait for uaddr if (*uaddr == val)
++/**
++ * futex2_waitv - Wait at multiple futexes, wake on any
++ * @waiters: Array of waiters
++ * @nr_waiters: Length of waiters array
++ * @flags: Operation flags
++ * @timo: Optional timeout for operation
+ */
+static inline int futex2_waitv(volatile struct futex_waitv *waiters, unsigned long nr_waiters,
+ unsigned long flags, struct timespec64 *timo)
@@ -2552,123 +3596,304 @@ index 807b8b57f..10be0c504 100644
+ return syscall(__NR_futex_waitv, waiters, nr_waiters, flags, timo);
+}
--
-2.29.2
+2.30.2
-From 9358bbdf929a90bc144d13e002fed8f4223d3178 Mon Sep 17 00:00:00 2001
+From 18a89fdf17baa9595b09bb98cc545ecba4ce93fb Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Andr=C3=A9=20Almeida?= <andrealmeid@collabora.com>
-Date: Fri, 4 Dec 2020 19:12:23 -0300
-Subject: [PATCH 8/9] futex2: Add sysfs entry for syscall numbers
+Date: Fri, 5 Feb 2021 10:34:02 -0300
+Subject: [PATCH 11/13] selftests: futex2: Add requeue test
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
+Add testing for futex_requeue(). The first test requeues a single waiter
+from one futex to another, then wakes it. The second performs both wake and
+requeue, and checks the return values to confirm that the operation
+woke/requeued the expected number of waiters.
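
The numbers the test checks follow from simple bookkeeping, which the toy
model below reproduces. It is not the kernel implementation: struct toy_queue,
toy_wake() and toy_requeue() are hypothetical and only count waiters, assuming
that the requeue call returns the number of waiters woken plus the number
requeued, which is what the selftest's expected values imply.

```c
#include <assert.h>
#include <limits.h>

/* Toy stand-in for a futex wait queue: only the waiter count is tracked. */
struct toy_queue {
	unsigned int waiters;
};

/* Wake at most `nr` waiters and report how many were actually woken. */
static unsigned int toy_wake(struct toy_queue *q, unsigned int nr)
{
	unsigned int woken = q->waiters < nr ? q->waiters : nr;

	q->waiters -= woken;
	return woken;
}

/*
 * Wake up to `nr_wake` waiters at `src`, then move up to `nr_requeue` of the
 * remaining ones to `dst`. Returns woken + requeued (assumed semantics).
 */
static unsigned int toy_requeue(struct toy_queue *src, struct toy_queue *dst,
				unsigned int nr_wake, unsigned int nr_requeue)
{
	unsigned int woken = toy_wake(src, nr_wake);
	unsigned int moved = src->waiters < nr_requeue ? src->waiters : nr_requeue;

	src->waiters -= moved;
	dst->waiters += moved;

	return woken + moved;
}
```

With ten waiters at the source, toy_requeue(&src, &dst, 3, 7) accounts for all
ten, and a following toy_wake(&dst, INT_MAX) finds exactly the seven that were
moved.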
+
Signed-off-by: André Almeida <andrealmeid@collabora.com>
Signed-off-by: Jan200101 <sentrycraft123@gmail.com>
---
- kernel/futex2.c | 42 ++++++++++++++++++++++++++++++++++++++++++
- 1 file changed, 42 insertions(+)
+ .../selftests/futex/functional/.gitignore | 1 +
+ .../selftests/futex/functional/Makefile | 3 +-
+ .../futex/functional/futex2_requeue.c | 164 ++++++++++++++++++
+ .../selftests/futex/include/futex2test.h | 16 ++
+ 4 files changed, 183 insertions(+), 1 deletion(-)
+ create mode 100644 tools/testing/selftests/futex/functional/futex2_requeue.c
-diff --git a/kernel/futex2.c b/kernel/futex2.c
-index 5ddb9922d..58cd8a868 100644
---- a/kernel/futex2.c
-+++ b/kernel/futex2.c
-@@ -762,6 +762,48 @@ SYSCALL_DEFINE3(futex_wake, void __user *, uaddr, unsigned int, nr_wake,
- return ret;
- }
+diff --git a/tools/testing/selftests/futex/functional/.gitignore b/tools/testing/selftests/futex/functional/.gitignore
+index d0b8f637b..af7557e82 100644
+--- a/tools/testing/selftests/futex/functional/.gitignore
++++ b/tools/testing/selftests/futex/functional/.gitignore
+@@ -8,3 +8,4 @@ futex_wait_uninitialized_heap
+ futex_wait_wouldblock
+ futex2_wait
+ futex2_waitv
++futex2_requeue
+diff --git a/tools/testing/selftests/futex/functional/Makefile b/tools/testing/selftests/futex/functional/Makefile
+index 09c08ccde..3ccb9ea58 100644
+--- a/tools/testing/selftests/futex/functional/Makefile
++++ b/tools/testing/selftests/futex/functional/Makefile
+@@ -17,7 +17,8 @@ TEST_GEN_FILES := \
+ futex_wait_uninitialized_heap \
+ futex_wait_private_mapped_file \
+ futex2_wait \
+- futex2_waitv
++ futex2_waitv \
++ futex2_requeue
-+static ssize_t wait_show(struct kobject *kobj, struct kobj_attribute *attr,
-+ char *buf)
-+{
-+ return sprintf(buf, "%u\n", __NR_futex_wait);
+ TEST_PROGS := run.sh
+
+diff --git a/tools/testing/selftests/futex/functional/futex2_requeue.c b/tools/testing/selftests/futex/functional/futex2_requeue.c
+new file mode 100644
+index 000000000..1bc3704dc
+--- /dev/null
++++ b/tools/testing/selftests/futex/functional/futex2_requeue.c
+@@ -0,0 +1,164 @@
++// SPDX-License-Identifier: GPL-2.0-or-later
++/******************************************************************************
++ *
++ * Copyright Collabora Ltd., 2021
++ *
++ * DESCRIPTION
++ * Test requeue mechanism of futex2, using 32bit sized futexes.
++ *
++ * AUTHOR
++ * André Almeida <andrealmeid@collabora.com>
++ *
++ * HISTORY
++ * 2021-Feb-5: Initial version by André <andrealmeid@collabora.com>
++ *
++ *****************************************************************************/
++
++#include <errno.h>
++#include <error.h>
++#include <getopt.h>
++#include <stdio.h>
++#include <stdlib.h>
++#include <string.h>
++#include <time.h>
++#include <pthread.h>
++#include <sys/shm.h>
++#include <limits.h>
++#include "futex2test.h"
++#include "logging.h"
++
++#define TEST_NAME "futex2-requeue"
++#define timeout_ns 30000000
++#define WAKE_WAIT_US 10000
++volatile futex_t *f1;
+
++void usage(char *prog)
++{
++ printf("Usage: %s\n", prog);
++ printf(" -c Use color\n");
++ printf(" -h Display this help message\n");
++ printf(" -v L Verbosity level: %d=QUIET %d=CRITICAL %d=INFO\n",
++ VQUIET, VCRITICAL, VINFO);
+}
-+static struct kobj_attribute futex2_wait_attr = __ATTR_RO(wait);
+
-+static ssize_t wake_show(struct kobject *kobj, struct kobj_attribute *attr,
-+ char *buf)
++void *waiterfn(void *arg)
+{
-+ return sprintf(buf, "%u\n", __NR_futex_wake);
++ struct timespec64 to64;
++
++ /* setting absolute timeout for futex2 */
++ if (gettime64(CLOCK_MONOTONIC, &to64))
++ error("gettime64 failed\n", errno);
++
++ to64.tv_nsec += timeout_ns;
++
++ if (to64.tv_nsec >= 1000000000) {
++ to64.tv_sec++;
++ to64.tv_nsec -= 1000000000;
++ }
+
++ if (futex2_wait(f1, *f1, FUTEX_32, &to64))
++ printf("waiter failed errno %d\n", errno);
++
++ return NULL;
+}
-+static struct kobj_attribute futex2_wake_attr = __ATTR_RO(wake);
+
-+static ssize_t waitv_show(struct kobject *kobj, struct kobj_attribute *attr,
-+ char *buf)
++int main(int argc, char *argv[])
+{
-+ return sprintf(buf, "%u\n", __NR_futex_waitv);
++ pthread_t waiter[10];
++ int res, ret = RET_PASS;
++ int c, i;
++ volatile futex_t _f1 = 0;
++ volatile futex_t f2 = 0;
++ struct futex_requeue r1, r2;
+
-+}
-+static struct kobj_attribute futex2_waitv_attr = __ATTR_RO(waitv);
++ f1 = &_f1;
+
-+static struct attribute *futex2_sysfs_attrs[] = {
-+ &futex2_wait_attr.attr,
-+ &futex2_wake_attr.attr,
-+ &futex2_waitv_attr.attr,
-+ NULL,
-+};
++ r1.flags = FUTEX_32;
++ r2.flags = FUTEX_32;
+
-+static const struct attribute_group futex2_sysfs_attr_group = {
-+ .attrs = futex2_sysfs_attrs,
-+ .name = "futex2",
-+};
++ r1.uaddr = f1;
++ r2.uaddr = &f2;
+
-+static int __init futex2_sysfs_init(void)
-+{
-+ return sysfs_create_group(kernel_kobj, &futex2_sysfs_attr_group);
-+}
-+subsys_initcall(futex2_sysfs_init);
++ while ((c = getopt(argc, argv, "cht:v:")) != -1) {
++ switch (c) {
++ case 'c':
++ log_color(1);
++ break;
++ case 'h':
++ usage(basename(argv[0]));
++ exit(0);
++ case 'v':
++ log_verbosity(atoi(optarg));
++ break;
++ default:
++ usage(basename(argv[0]));
++ exit(1);
++ }
++ }
+
- static int __init futex2_init(void)
++ ksft_print_header();
++ ksft_set_plan(2);
++ ksft_print_msg("%s: Test FUTEX2_REQUEUE\n",
++ basename(argv[0]));
++
++ /*
++ * Requeue a waiter from f1 to f2, and wake f2.
++ */
++ if (pthread_create(&waiter[0], NULL, waiterfn, NULL))
++ error("pthread_create failed\n", errno);
++
++ usleep(WAKE_WAIT_US);
++
++ res = futex2_requeue(&r1, &r2, 0, 1, 0, 0);
++ if (res != 1) {
++ ksft_test_result_fail("futex2_requeue private returned: %d %s\n",
++ res ? errno : res,
++ res ? strerror(errno) : "");
++ ret = RET_FAIL;
++ }
++
++
++ info("Calling private futex2_wake on f2: %u @ %p with val=%u\n", f2, &f2, f2);
++ res = futex2_wake(&f2, 1, FUTEX_32);
++ if (res != 1) {
++ ksft_test_result_fail("futex2_requeue private returned: %d %s\n",
++ res ? errno : res,
++ res ? strerror(errno) : "");
++ ret = RET_FAIL;
++ } else {
++ ksft_test_result_pass("futex2_requeue simple succeeds\n");
++ }
++
++
++ /*
++ * Create 10 waiters at f1. At futex_requeue, wake 3 and requeue 7.
++ * At futex_wake, wake INT_MAX (should be exactly 7).
++ */
++ for (i = 0; i < 10; i++) {
++ if (pthread_create(&waiter[i], NULL, waiterfn, NULL))
++ error("pthread_create failed\n", errno);
++ }
++
++ usleep(WAKE_WAIT_US);
++
++ res = futex2_requeue(&r1, &r2, 3, 7, 0, 0);
++ if (res != 10) {
++ ksft_test_result_fail("futex2_requeue private returned: %d %s\n",
++ res ? errno : res,
++ res ? strerror(errno) : "");
++ ret = RET_FAIL;
++ }
++
++ res = futex2_wake(&f2, INT_MAX, FUTEX_32);
++ if (res != 7) {
++ ksft_test_result_fail("futex2_requeue private returned: %d %s\n",
++ res ? errno : res,
++ res ? strerror(errno) : "");
++ ret = RET_FAIL;
++ } else {
++ ksft_test_result_pass("futex2_requeue succeeds\n");
++ }
++
++ ksft_print_cnts();
++ return ret;
++}
+diff --git a/tools/testing/selftests/futex/include/futex2test.h b/tools/testing/selftests/futex/include/futex2test.h
+index 31979afc4..e2635006b 100644
+--- a/tools/testing/selftests/futex/include/futex2test.h
++++ b/tools/testing/selftests/futex/include/futex2test.h
+@@ -103,3 +103,19 @@ static inline int futex2_waitv(volatile struct futex_waitv *waiters, unsigned lo
{
- int i;
+ return syscall(__NR_futex_waitv, waiters, nr_waiters, flags, timo);
+ }
++
++/**
++ * futex2_requeue - Wake futexes at uaddr1 and requeue from uaddr1 to uaddr2
++ * @uaddr1: Original address to wake and requeue from
++ * @uaddr2: Address to requeue to
++ * @nr_wake: Number of futexes to wake at uaddr1 before requeuing
++ * @nr_requeue: Number of futexes to requeue from uaddr1 to uaddr2
++ * @cmpval: If (uaddr1->uaddr != cmpval), return immediately
++ * @flags: Operation flags
++ */
++static inline int futex2_requeue(struct futex_requeue *uaddr1, struct futex_requeue *uaddr2,
++ unsigned int nr_wake, unsigned int nr_requeue,
++ unsigned int cmpval, unsigned long flags)
++{
++ return syscall(__NR_futex_requeue, uaddr1, uaddr2, nr_wake, nr_requeue, cmpval, flags);
++}
--
-2.29.2
+2.30.2
-From f7b1c9a2ad05933e559ef78bc7753b2fac1698fd Mon Sep 17 00:00:00 2001
+From 799e24f7b39e114107b36c4cc4ece4825a9fa6a0 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Andr=C3=A9=20Almeida?= <andrealmeid@collabora.com>
-Date: Tue, 5 Jan 2021 15:44:02 -0300
-Subject: [PATCH 9/9] perf bench: Add futex2 benchmark tests
+Date: Fri, 5 Feb 2021 10:34:02 -0300
+Subject: [PATCH 12/13] perf bench: Add futex2 benchmark tests
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
-Port existing futex infrastructure to use futex2 calls.
+Add futex2 support to the existing futex benchmarking code. `perf bench`
+tests can be used not only to measure the performance of the
+implementation, but also to stress test the kernel infrastructure.
Signed-off-by: André Almeida <andrealmeid@collabora.com>
Signed-off-by: Jan200101 <sentrycraft123@gmail.com>
---
- tools/arch/x86/include/asm/unistd_64.h | 8 +++++
- tools/perf/bench/bench.h | 3 ++
- tools/perf/bench/futex-hash.c | 24 ++++++++++++---
- tools/perf/bench/futex-wake-parallel.c | 41 ++++++++++++++++++++++----
- tools/perf/bench/futex-wake.c | 36 ++++++++++++++++++----
- tools/perf/bench/futex.h | 17 +++++++++++
- tools/perf/builtin-bench.c | 17 ++++++++---
- 7 files changed, 127 insertions(+), 19 deletions(-)
+ tools/arch/x86/include/asm/unistd_64.h | 12 ++++++
+ tools/perf/bench/bench.h | 4 ++
+ tools/perf/bench/futex-hash.c | 24 +++++++++--
+ tools/perf/bench/futex-requeue.c | 57 ++++++++++++++++++++------
+ tools/perf/bench/futex-wake-parallel.c | 41 +++++++++++++++---
+ tools/perf/bench/futex-wake.c | 37 +++++++++++++----
+ tools/perf/bench/futex.h | 47 +++++++++++++++++++++
+ tools/perf/builtin-bench.c | 18 ++++++--
+ 8 files changed, 206 insertions(+), 34 deletions(-)
diff --git a/tools/arch/x86/include/asm/unistd_64.h b/tools/arch/x86/include/asm/unistd_64.h
-index 4205ed415..151a41ceb 100644
+index 4205ed415..cf5ad4ea1 100644
--- a/tools/arch/x86/include/asm/unistd_64.h
+++ b/tools/arch/x86/include/asm/unistd_64.h
-@@ -17,3 +17,11 @@
+@@ -17,3 +17,15 @@
#ifndef __NR_setns
#define __NR_setns 308
#endif
+
+#ifndef __NR_futex_wait
-+# define __NR_futex_wait 441
++# define __NR_futex_wait 442
+#endif
+
+#ifndef __NR_futex_wake
-+# define __NR_futex_wake 442
++# define __NR_futex_wake 443
++#endif
++
++#ifndef __NR_futex_requeue
++# define __NR_futex_requeue 445
+#endif
diff --git a/tools/perf/bench/bench.h b/tools/perf/bench/bench.h
-index eac36afab..f6f881a05 100644
+index eac36afab..12346844b 100644
--- a/tools/perf/bench/bench.h
+++ b/tools/perf/bench/bench.h
-@@ -38,8 +38,11 @@ int bench_mem_memcpy(int argc, const char **argv);
+@@ -38,9 +38,13 @@ int bench_mem_memcpy(int argc, const char **argv);
int bench_mem_memset(int argc, const char **argv);
int bench_mem_find_bit(int argc, const char **argv);
int bench_futex_hash(int argc, const char **argv);
@@ -2678,10 +3903,12 @@ index eac36afab..f6f881a05 100644
int bench_futex_wake_parallel(int argc, const char **argv);
+int bench_futex2_wake_parallel(int argc, const char **argv);
int bench_futex_requeue(int argc, const char **argv);
++int bench_futex2_requeue(int argc, const char **argv);
/* pi futexes */
int bench_futex_lock_pi(int argc, const char **argv);
+ int bench_epoll_wait(int argc, const char **argv);
diff --git a/tools/perf/bench/futex-hash.c b/tools/perf/bench/futex-hash.c
-index 915bf3da7..72921c22b 100644
+index 915bf3da7..6e62e7708 100644
--- a/tools/perf/bench/futex-hash.c
+++ b/tools/perf/bench/futex-hash.c
@@ -34,7 +34,7 @@ static unsigned int nthreads = 0;
@@ -2710,7 +3937,7 @@ index 915bf3da7..72921c22b 100644
}
-int bench_futex_hash(int argc, const char **argv)
-+static int bench_futex_hash_common(int argc, const char **argv)
++static int __bench_futex_hash(int argc, const char **argv)
{
int ret = 0;
cpu_set_t cpuset;
@@ -2732,16 +3959,146 @@ index 915bf3da7..72921c22b 100644
+
+int bench_futex_hash(int argc, const char **argv)
+{
-+ return bench_futex_hash_common(argc, argv);
++ return __bench_futex_hash(argc, argv);
+}
+
+int bench_futex2_hash(int argc, const char **argv)
+{
+ futex2 = true;
-+ return bench_futex_hash_common(argc, argv);
++ return __bench_futex_hash(argc, argv);
++}
+diff --git a/tools/perf/bench/futex-requeue.c b/tools/perf/bench/futex-requeue.c
+index 7a15c2e61..4c7486fbe 100644
+--- a/tools/perf/bench/futex-requeue.c
++++ b/tools/perf/bench/futex-requeue.c
+@@ -2,8 +2,8 @@
+ /*
+ * Copyright (C) 2013 Davidlohr Bueso <davidlohr@hp.com>
+ *
+- * futex-requeue: Block a bunch of threads on futex1 and requeue them
+- * on futex2, N at a time.
++ * futex-requeue: Block a bunch of threads on addr1 and requeue them
++ * on addr2, N at a time.
+ *
+ * This program is particularly useful to measure the latency of nthread
+ * requeues without waking up any tasks -- thus mimicking a regular futex_wait.
+@@ -29,7 +29,10 @@
+ #include <stdlib.h>
+ #include <sys/time.h>
+
+-static u_int32_t futex1 = 0, futex2 = 0;
++static u_int32_t addr1 = 0, addr2 = 0;
++
++static struct futex_requeue rq1 = { .uaddr = &addr1, .flags = FUTEX_32 };
++static struct futex_requeue rq2 = { .uaddr = &addr2, .flags = FUTEX_32 };
+
+ /*
+ * How many tasks to requeue at a time.
+@@ -38,7 +41,7 @@ static u_int32_t futex1 = 0, futex2 = 0;
+ static unsigned int nrequeue = 1;
+
+ static pthread_t *worker;
+-static bool done = false, silent = false, fshared = false;
++static bool done = false, silent = false, fshared = false, futex2 = false;
+ static pthread_mutex_t thread_lock;
+ static pthread_cond_t thread_parent, thread_worker;
+ static struct stats requeuetime_stats, requeued_stats;
+@@ -80,7 +83,11 @@ static void *workerfn(void *arg __maybe_unused)
+ pthread_cond_wait(&thread_worker, &thread_lock);
+ pthread_mutex_unlock(&thread_lock);
+
+- futex_wait(&futex1, 0, NULL, futex_flag);
++ if (!futex2)
++ futex_wait(&addr1, 0, NULL, futex_flag);
++ else
++ futex2_wait(&addr1, 0, futex_flag, NULL);
++
+ return NULL;
+ }
+
+@@ -112,7 +119,7 @@ static void toggle_done(int sig __maybe_unused,
+ done = true;
+ }
+
+-int bench_futex_requeue(int argc, const char **argv)
++static int __bench_futex_requeue(int argc, const char **argv)
+ {
+ int ret = 0;
+ unsigned int i, j;
+@@ -140,15 +147,20 @@ int bench_futex_requeue(int argc, const char **argv)
+ if (!worker)
+ err(EXIT_FAILURE, "calloc");
+
+- if (!fshared)
++ if (futex2) {
++ futex_flag = FUTEX_32 | (fshared * FUTEX_SHARED_FLAG);
++ rq1.flags |= FUTEX_SHARED_FLAG * fshared;
++ rq2.flags |= FUTEX_SHARED_FLAG * fshared;
++ } else if (!fshared) {
+ futex_flag = FUTEX_PRIVATE_FLAG;
++ }
+
+ if (nrequeue > nthreads)
+ nrequeue = nthreads;
+
+ printf("Run summary [PID %d]: Requeuing %d threads (from [%s] %p to %p), "
+ "%d at a time.\n\n", getpid(), nthreads,
+- fshared ? "shared":"private", &futex1, &futex2, nrequeue);
++ fshared ? "shared":"private", &addr1, &addr2, nrequeue);
+
+ init_stats(&requeued_stats);
+ init_stats(&requeuetime_stats);
+@@ -177,11 +189,15 @@ int bench_futex_requeue(int argc, const char **argv)
+ gettimeofday(&start, NULL);
+ while (nrequeued < nthreads) {
+ /*
+- * Do not wakeup any tasks blocked on futex1, allowing
++ * Do not wakeup any tasks blocked on addr1, allowing
+ * us to really measure futex_wait functionality.
+ */
+- nrequeued += futex_cmp_requeue(&futex1, 0, &futex2, 0,
+- nrequeue, futex_flag);
++ if (!futex2)
++ nrequeued += futex_cmp_requeue(&addr1, 0, &addr2,
++ 0, nrequeue, futex_flag);
++ else
++ nrequeued += futex2_requeue(&rq1, &rq2,
++ 0, nrequeue, 0, 0);
+ }
+
+ gettimeofday(&end, NULL);
+@@ -195,8 +211,12 @@ int bench_futex_requeue(int argc, const char **argv)
+ j + 1, nrequeued, nthreads, runtime.tv_usec / (double)USEC_PER_MSEC);
+ }
+
+- /* everybody should be blocked on futex2, wake'em up */
+- nrequeued = futex_wake(&futex2, nrequeued, futex_flag);
++ /* everybody should be blocked on addr2, wake'em up */
++ if (!futex2)
++ nrequeued = futex_wake(&addr2, nrequeued, futex_flag);
++ else
++ nrequeued = futex2_wake(&addr2, nrequeued, futex_flag);
++
+ if (nthreads != nrequeued)
+ warnx("couldn't wakeup all tasks (%d/%d)", nrequeued, nthreads);
+
+@@ -221,3 +241,14 @@ int bench_futex_requeue(int argc, const char **argv)
+ usage_with_options(bench_futex_requeue_usage, options);
+ exit(EXIT_FAILURE);
+ }
++
++int bench_futex_requeue(int argc, const char **argv)
++{
++ return __bench_futex_requeue(argc, argv);
++}
++
++int bench_futex2_requeue(int argc, const char **argv)
++{
++ futex2 = true;
++ return __bench_futex_requeue(argc, argv);
+}
diff --git a/tools/perf/bench/futex-wake-parallel.c b/tools/perf/bench/futex-wake-parallel.c
-index cd2b81a84..540104538 100644
+index cd2b81a84..8a89c6ab9 100644
--- a/tools/perf/bench/futex-wake-parallel.c
+++ b/tools/perf/bench/futex-wake-parallel.c
@@ -17,6 +17,12 @@ int bench_futex_wake_parallel(int argc __maybe_unused, const char **argv __maybe
@@ -2800,7 +4157,7 @@ index cd2b81a84..540104538 100644
}
-int bench_futex_wake_parallel(int argc, const char **argv)
-+static int bench_futex_wake_parallel_common(int argc, const char **argv)
++static int __bench_futex_wake_parallel(int argc, const char **argv)
{
int ret = 0;
unsigned int i, j;
@@ -2822,31 +4179,30 @@ index cd2b81a84..540104538 100644
+
+int bench_futex_wake_parallel(int argc, const char **argv)
+{
-+ return bench_futex_wake_parallel_common(argc, argv);
++ return __bench_futex_wake_parallel(argc, argv);
+}
+
+int bench_futex2_wake_parallel(int argc, const char **argv)
+{
+ futex2 = true;
-+ return bench_futex_wake_parallel_common(argc, argv);
++ return __bench_futex_wake_parallel(argc, argv);
+}
+
#endif /* HAVE_PTHREAD_BARRIER */
diff --git a/tools/perf/bench/futex-wake.c b/tools/perf/bench/futex-wake.c
-index 2dfcef3e3..b98b84e7b 100644
+index 2dfcef3e3..be4481f5e 100644
--- a/tools/perf/bench/futex-wake.c
+++ b/tools/perf/bench/futex-wake.c
-@@ -46,6 +46,9 @@ static struct stats waketime_stats, wakeup_stats;
- static unsigned int threads_starting, nthreads = 0;
- static int futex_flag = 0;
+@@ -39,7 +39,7 @@ static u_int32_t futex1 = 0;
+ static unsigned int nwakes = 1;
-+/* Should we use futex2 API? */
-+static bool futex2 = false;
-+
- static const struct option options[] = {
- OPT_UINTEGER('t', "threads", &nthreads, "Specify amount of threads"),
- OPT_UINTEGER('w', "nwakes", &nwakes, "Specify amount of threads to wake at once"),
-@@ -69,8 +72,13 @@ static void *workerfn(void *arg __maybe_unused)
+ pthread_t *worker;
+-static bool done = false, silent = false, fshared = false;
++static bool done = false, silent = false, fshared = false, futex2 = false;
+ static pthread_mutex_t thread_lock;
+ static pthread_cond_t thread_parent, thread_worker;
+ static struct stats waketime_stats, wakeup_stats;
+@@ -69,8 +69,13 @@ static void *workerfn(void *arg __maybe_unused)
pthread_mutex_unlock(&thread_lock);
while (1) {
@@ -2862,16 +4218,16 @@ index 2dfcef3e3..b98b84e7b 100644
}
pthread_exit(NULL);
-@@ -118,7 +126,7 @@ static void toggle_done(int sig __maybe_unused,
+@@ -118,7 +123,7 @@ static void toggle_done(int sig __maybe_unused,
done = true;
}
-int bench_futex_wake(int argc, const char **argv)
-+static int bench_futex_wake_common(int argc, const char **argv)
++static int __bench_futex_wake(int argc, const char **argv)
{
int ret = 0;
unsigned int i, j;
-@@ -148,7 +156,9 @@ int bench_futex_wake(int argc, const char **argv)
+@@ -148,7 +153,9 @@ int bench_futex_wake(int argc, const char **argv)
if (!worker)
err(EXIT_FAILURE, "calloc");
@@ -2882,14 +4238,16 @@ index 2dfcef3e3..b98b84e7b 100644
futex_flag = FUTEX_PRIVATE_FLAG;
printf("Run summary [PID %d]: blocking on %d threads (at [%s] futex %p), "
-@@ -181,8 +191,13 @@ int bench_futex_wake(int argc, const char **argv)
+@@ -180,9 +187,14 @@ int bench_futex_wake(int argc, const char **argv)
+
/* Ok, all threads are patiently blocked, start waking folks up */
gettimeofday(&start, NULL);
- while (nwoken != nthreads)
+- while (nwoken != nthreads)
- nwoken += futex_wake(&futex1, nwakes, futex_flag);
-+ if (!futex2) {
++ while (nwoken != nthreads) {
++ if (!futex2)
+ nwoken += futex_wake(&futex1, nwakes, futex_flag);
-+ } else {
++ else
+ nwoken += futex2_wake(&futex1, nwakes, futex_flag);
+ }
gettimeofday(&end, NULL);
@@ -2897,32 +4255,38 @@ index 2dfcef3e3..b98b84e7b 100644
timersub(&end, &start, &runtime);
update_stats(&wakeup_stats, nwoken);
-@@ -212,3 +227,14 @@ int bench_futex_wake(int argc, const char **argv)
+@@ -212,3 +224,14 @@ int bench_futex_wake(int argc, const char **argv)
free(worker);
return ret;
}
+
+int bench_futex_wake(int argc, const char **argv)
+{
-+ return bench_futex_wake_common(argc, argv);
++ return __bench_futex_wake(argc, argv);
+}
+
+int bench_futex2_wake(int argc, const char **argv)
+{
+ futex2 = true;
-+ return bench_futex_wake_common(argc, argv);
++ return __bench_futex_wake(argc, argv);
+}
diff --git a/tools/perf/bench/futex.h b/tools/perf/bench/futex.h
-index 31b53cc7d..5111799b5 100644
+index 31b53cc7d..6b2213cf3 100644
--- a/tools/perf/bench/futex.h
+++ b/tools/perf/bench/futex.h
-@@ -86,4 +86,21 @@ futex_cmp_requeue(u_int32_t *uaddr, u_int32_t val, u_int32_t *uaddr2, int nr_wak
+@@ -86,4 +86,51 @@ futex_cmp_requeue(u_int32_t *uaddr, u_int32_t val, u_int32_t *uaddr2, int nr_wak
return futex(uaddr, FUTEX_CMP_REQUEUE, nr_wake, nr_requeue, uaddr2,
val, opflags);
}
+
-+/*
-+ * wait for uaddr if (*uaddr == val)
++/**
++ * futex2_wait - Wait at uaddr if *uaddr == val, until timo.
++ * @uaddr: User address to wait for
++ * @val: Expected value at uaddr
++ * @flags: Operation options
++ * @timo: Optional timeout
++ *
++ * Return: 0 on success, error code otherwise
+ */
+static inline int futex2_wait(volatile void *uaddr, unsigned long val,
+ unsigned long flags, struct timespec *timo)
@@ -2930,16 +4294,40 @@ index 31b53cc7d..5111799b5 100644
+ return syscall(__NR_futex_wait, uaddr, val, flags, timo);
+}
+
-+/*
-+ * wake nr futexes waiting for uaddr
++/**
++ * futex2_wake - Wake a number of waiters waiting at uaddr
++ * @uaddr: Address to wake
++ * @nr: Number of waiters to wake
++ * @flags: Operation options
++ *
++ * Return: number of woken waiters
+ */
+static inline int futex2_wake(volatile void *uaddr, unsigned int nr, unsigned long flags)
+{
+ return syscall(__NR_futex_wake, uaddr, nr, flags);
+}
++
++/**
++ * futex2_requeue - Requeue waiters from an address to another one
++ * @uaddr1: Address where waiters are currently waiting
++ * @uaddr2: New address to wait on
++ * @nr_wake: Number of waiters at uaddr1 to be woken
++ * @nr_requeue: After waking nr_wake, number of waiters to be requeued
++ * @cmpval: Expected value at uaddr1
++ * @flags: Operation options
++ *
++ * Return: number of woken waiters plus number of requeued waiters
++ */
++static inline int futex2_requeue(volatile struct futex_requeue *uaddr1,
++ volatile struct futex_requeue *uaddr2,
++ unsigned int nr_wake, unsigned int nr_requeue,
++ unsigned int cmpval, unsigned long flags)
++{
++ return syscall(__NR_futex_requeue, uaddr1, uaddr2, nr_wake, nr_requeue, cmpval, flags);
++}
#endif /* _FUTEX_H */
diff --git a/tools/perf/builtin-bench.c b/tools/perf/builtin-bench.c
-index 62a7b7420..200ecacad 100644
+index 62a7b7420..e41a95ad2 100644
--- a/tools/perf/builtin-bench.c
+++ b/tools/perf/builtin-bench.c
@@ -12,10 +12,11 @@
@@ -2958,7 +4346,7 @@ index 62a7b7420..200ecacad 100644
*/
#include <subcmd/parse-options.h>
#include "builtin.h"
-@@ -75,6 +76,13 @@ static struct bench futex_benchmarks[] = {
+@@ -75,6 +76,14 @@ static struct bench futex_benchmarks[] = {
{ NULL, NULL, NULL }
};
@@ -2966,13 +4354,14 @@ index 62a7b7420..200ecacad 100644
+ { "hash", "Benchmark for futex2 hash table", bench_futex2_hash },
+ { "wake", "Benchmark for futex2 wake calls", bench_futex2_wake },
+ { "wake-parallel", "Benchmark for parallel futex2 wake calls", bench_futex2_wake_parallel },
++ { "requeue", "Benchmark for futex2 requeue calls", bench_futex2_requeue },
+ { NULL, NULL, NULL }
+};
+
#ifdef HAVE_EVENTFD_SUPPORT
static struct bench epoll_benchmarks[] = {
{ "wait", "Benchmark epoll concurrent epoll_waits", bench_epoll_wait },
-@@ -105,6 +113,7 @@ static struct collection collections[] = {
+@@ -105,6 +114,7 @@ static struct collection collections[] = {
{ "numa", "NUMA scheduling and MM benchmarks", numa_benchmarks },
#endif
{"futex", "Futex stressing benchmarks", futex_benchmarks },
@@ -2981,5 +4370,82 @@ index 62a7b7420..200ecacad 100644
{"epoll", "Epoll stressing benchmarks", epoll_benchmarks },
#endif
--
-2.29.2
+2.30.2
+
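The perf benchmarks in the patch above share one pattern: the original entry point becomes a static `__bench_*` function, a file-scope `futex2` flag selects the API, and two thin wrappers expose both variants. The sketch below restates that pattern outside perf. All `demo_*` and `DEMO_*` names are hypothetical, and the two futex2 flag values are placeholders chosen for illustration (the real ones come from the futex2 uapi header); only `FUTEX_PRIVATE_FLAG` (128) is the value used by the existing futex interface.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder values for illustration; take the real ones from the uapi header. */
#define DEMO_FUTEX_32            2u
#define DEMO_FUTEX_SHARED_FLAG   8u
/* Value of FUTEX_PRIVATE_FLAG in the existing futex interface. */
#define DEMO_FUTEX_PRIVATE_FLAG  128u

/* File-scope switches, mirroring the "futex2" and "fshared" statics in the benchmarks. */
static bool demo_futex2 = false;
static bool demo_fshared = false;

/* Shared body: compute the flags the way __bench_futex_requeue() does. */
static unsigned int demo_bench_common(void)
{
	unsigned int flags = 0;

	if (demo_futex2)
		flags = DEMO_FUTEX_32 | (demo_fshared * DEMO_FUTEX_SHARED_FLAG);
	else if (!demo_fshared)
		flags = DEMO_FUTEX_PRIVATE_FLAG;

	/* In futex2 mode the size flag must always be present. */
	assert(!demo_futex2 || (flags & DEMO_FUTEX_32));
	return flags;
}

/* Legacy entry point: relies on demo_futex2 keeping its default value. */
unsigned int demo_bench_futex(void)
{
	return demo_bench_common();
}

/* futex2 entry point: flips the switch, then runs the same body. */
unsigned int demo_bench_futex2(void)
{
	demo_futex2 = true;
	return demo_bench_common();
}
```

As in the benchmarks, the futex2 wrapper never resets the flag. That is harmless when each process runs a single benchmark, but it would matter if both entry points were called in the same process.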
+
+From ea9a7956b5f6f44f3ee70d82542c64fcb7c86c5e Mon Sep 17 00:00:00 2001
+From: =?UTF-8?q?Andr=C3=A9=20Almeida?= <andrealmeid@collabora.com>
+Date: Fri, 5 Feb 2021 10:34:02 -0300
+Subject: [PATCH 13/13] futex2: Add sysfs entry for syscall numbers
+MIME-Version: 1.0
+Content-Type: text/plain; charset=UTF-8
+Content-Transfer-Encoding: 8bit
+
+During futex2 development, the patch set will be rebased on top of
+different kernel releases, and the syscall numbers can change in the
+process. Expose the futex2 syscall numbers via sysfs so that tools
+experimenting with futex2 (like Proton/Wine) can look them up and set
+them at runtime, rather than at compilation time.
+
+Signed-off-by: André Almeida <andrealmeid@collabora.com>
+Signed-off-by: Jan200101 <sentrycraft123@gmail.com>
+---
+ kernel/futex2.c | 42 ++++++++++++++++++++++++++++++++++++++++++
+ 1 file changed, 42 insertions(+)
+
+diff --git a/kernel/futex2.c b/kernel/futex2.c
+index 8a8b45f98..1eb20410d 100644
+--- a/kernel/futex2.c
++++ b/kernel/futex2.c
+@@ -1220,6 +1220,48 @@ SYSCALL_DEFINE6(futex_requeue, struct futex_requeue __user *, uaddr1,
+ return __futex_requeue(rq1, rq2, nr_wake, nr_requeue, cmpval, shared1, shared2);
+ }
+
++static ssize_t wait_show(struct kobject *kobj, struct kobj_attribute *attr,
++ char *buf)
++{
++ return sprintf(buf, "%u\n", __NR_futex_wait);
++
++}
++static struct kobj_attribute futex2_wait_attr = __ATTR_RO(wait);
++
++static ssize_t wake_show(struct kobject *kobj, struct kobj_attribute *attr,
++ char *buf)
++{
++ return sprintf(buf, "%u\n", __NR_futex_wake);
++
++}
++static struct kobj_attribute futex2_wake_attr = __ATTR_RO(wake);
++
++static ssize_t waitv_show(struct kobject *kobj, struct kobj_attribute *attr,
++ char *buf)
++{
++ return sprintf(buf, "%u\n", __NR_futex_waitv);
++
++}
++static struct kobj_attribute futex2_waitv_attr = __ATTR_RO(waitv);
++
++static struct attribute *futex2_sysfs_attrs[] = {
++ &futex2_wait_attr.attr,
++ &futex2_wake_attr.attr,
++ &futex2_waitv_attr.attr,
++ NULL,
++};
++
++static const struct attribute_group futex2_sysfs_attr_group = {
++ .attrs = futex2_sysfs_attrs,
++ .name = "futex2",
++};
++
++static int __init futex2_sysfs_init(void)
++{
++ return sysfs_create_group(kernel_kobj, &futex2_sysfs_attr_group);
++}
++subsys_initcall(futex2_sysfs_init);
++
+ static int __init futex2_init(void)
+ {
+ int i;
+--
+2.30.2
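A userspace tool could consume the sysfs entries from PATCH 13/13 roughly as sketched below. The helper name `read_syscall_nr` is hypothetical. The path `/sys/kernel/futex2/wait` mentioned afterwards is inferred from the attribute group being named "futex2" and registered under `kernel_kobj`; it has not been verified on a running kernel.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read a syscall number from a sysfs-style file that holds a single
 * unsigned decimal value followed by a newline (the format produced by
 * the *_show() handlers in the patch).
 *
 * Returns the number on success, or -1 if the file cannot be opened or
 * does not start with an unsigned integer.
 */
static long read_syscall_nr(const char *path)
{
	FILE *f;
	unsigned int nr;
	long ret = -1;

	assert(path != NULL);

	f = fopen(path, "r");
	if (!f)
		return -1;	/* entry missing: kernel lacks the patch */

	if (fscanf(f, "%u", &nr) == 1)
		ret = (long)nr;

	fclose(f);
	return ret;
}
```

A caller would pass something like `read_syscall_nr("/sys/kernel/futex2/wait")` and, if the result is not -1, use it as the first argument to `syscall()` in place of a compile-time `__NR_futex_wait`. A return of -1 is the signal to fall back to the legacy futex interface.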