[LU-12508] (llite_mmap.c:71:our_vma()) ASSERTION( !down_write_trylock(&mm->mmap_sem) ) failed when writing in multiple threads Created: 05/Jul/19 Updated: 13/Dec/21 Resolved: 19/Nov/19 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Jacek Tomaka | Assignee: | Yang Sheng |
| Resolution: | Not a Bug | Votes: | 0 |
| Labels: | None | ||
| Environment: | Linux nanny1926 3.10.0-693.5.2.el7.x86_64 #1 SMP Fri Oct 20 20:32:50 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
2019-07-02T01:45:11-05:00 nanny1926 kernel: LustreError: 251884:0:(llite_mmap.c:71:our_vma()) ASSERTION( !down_write_trylock(&mm->mmap_sem) ) failed:
2019-07-02T01:45:11-05:00 nanny1926 kernel: LustreError: 251884:0:(llite_mmap.c:71:our_vma()) LBUG
2019-07-02T01:45:11-05:00 nanny1926 kernel: Pid: 251884, comm: java
2019-07-02T01:45:11-05:00 nanny1926 kernel: #012Call Trace:
2019-07-02T01:45:11-05:00 nanny1926 kernel: [<ffffffffc03d67ae>] libcfs_call_trace+0x4e/0x60 [libcfs]
2019-07-02T01:45:11-05:00 nanny1926 kernel: [<ffffffffc03d683c>] lbug_with_loc+0x4c/0xb0 [libcfs]
2019-07-02T01:45:11-05:00 nanny1926 kernel: [<ffffffffc116e66b>] our_vma+0x16b/0x170 [lustre]
2019-07-02T01:45:11-05:00 nanny1926 kernel: [<ffffffffc11857f9>] vvp_io_rw_lock+0x409/0x6e0 [lustre]
2019-07-02T01:45:11-05:00 nanny1926 kernel: [<ffffffffc0fbb312>] ? lov_io_iter_init+0x302/0x8b0 [lov]
2019-07-02T01:45:11-05:00 nanny1926 kernel: [<ffffffffc1185b29>] vvp_io_write_lock+0x59/0xf0 [lustre]
2019-07-02T01:45:11-05:00 nanny1926 kernel: [<ffffffffc063ebec>] cl_io_lock+0x5c/0x3d0 [obdclass]
2019-07-02T01:45:11-05:00 nanny1926 kernel: [<ffffffffc063f1db>] cl_io_loop+0x11b/0xc90 [obdclass]
2019-07-02T01:45:11-05:00 nanny1926 kernel: [<ffffffffc1133258>] ll_file_io_generic+0x498/0xc40 [lustre]
2019-07-02T01:45:11-05:00 nanny1926 kernel: [<ffffffffc1133cdd>] ll_file_aio_write+0x12d/0x1f0 [lustre]
2019-07-02T01:45:11-05:00 nanny1926 kernel: [<ffffffffc1133e6e>] ll_file_write+0xce/0x1e0 [lustre]
2019-07-02T01:45:11-05:00 nanny1926 kernel: [<ffffffff81200cad>] vfs_write+0xbd/0x1e0
2019-07-02T01:45:11-05:00 nanny1926 kernel: [<ffffffff8111f394>] ? __audit_syscall_entry+0xb4/0x110
2019-07-02T01:45:11-05:00 nanny1926 kernel: [<ffffffff81201abf>] SyS_write+0x7f/0xe0
2019-07-02T01:45:11-05:00 nanny1926 kernel: [<ffffffff816b5292>] tracesys+0xdd/0xe2
2019-07-02T01:45:11-05:00 nanny1926 kernel:
2019-07-02T01:45:11-05:00 nanny1926 kernel: Kernel panic - not syncing: LBUG

The workload is reading in up to 256 threads and writing 16 files in up to 16 threads.

vvp_mmap_locks() takes mmap_sem for read before calling our_vma():

452                 down_read(&mm->mmap_sem);
453                 while((vma = our_vma(mm, addr, count)) != NULL) {
454                         struct dentry *de = file_dentry(vma->vm_file);
455                         struct inode *inode = de->d_inode;
456                         int flags = CEF_MUST;

whereas our_vma() has this:
70 /* mmap_sem must have been held by caller. */
71 LASSERT(!down_write_trylock(&mm->mmap_sem));
So I guess if there are multiple threads in vvp_mmap_locks() and more than one happens to acquire the read lock, or one of them acquires the write lock, then the trylock in the other thread would fail, no? |
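To make the expected semantics of the assertion concrete, here is a minimal user-space analogue (not the Lustre code itself) of the caller/callee pattern discussed above, with a pthread rwlock standing in for mm->mmap_sem; all names in it are hypothetical. The callee asserts that the lock is held by checking that a write-trylock fails, so a failing trylock is the normal, expected outcome while any reader or writer holds the lock:

/* User-space analogue of the pattern above; hypothetical names,
 * pthread rwlock instead of mm->mmap_sem. */
#include <assert.h>
#include <pthread.h>
#include <stdio.h>

static pthread_rwlock_t map_lock = PTHREAD_RWLOCK_INITIALIZER;

/* Analogue of our_vma(): expects the caller to already hold map_lock.
 * pthread_rwlock_trywrlock() returns non-zero (EBUSY) when the lock is
 * held by anyone, which mirrors down_write_trylock() returning 0. */
static void lookup_vma(void)
{
        assert(pthread_rwlock_trywrlock(&map_lock) != 0);
}

/* Analogue of vvp_mmap_locks(): takes the lock for read, then calls in. */
static void mmap_locks(void)
{
        pthread_rwlock_rdlock(&map_lock);
        lookup_vma();
        pthread_rwlock_unlock(&map_lock);
}

int main(void)
{
        mmap_locks();
        printf("assertion held while the read lock was taken\n");
        return 0;
}

Under this reading, the assertion can only fire if down_write_trylock unexpectedly succeeds, i.e. if the semaphore looks completely free even though the caller took it for read just above.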
| Comments |
| Comment by Yang Sheng [ 05/Jul/19 ] |
|
Hi, Jacek, Please upgrade the kernel to 3.10.0-957.12.1 to fix this issue. Thanks, |
| Comment by Jacek Tomaka [ 05/Jul/19 ] |
|
Hi Yang, Thank you! https://access.redhat.com/solutions/3393611 also mentions that the issue is fixed in kernel-3.10.0-693.47.2.el7 for CentOS 7.4. Would that work as well? Regards. Jacek Tomaka
|
| Comment by Yang Sheng [ 05/Jul/19 ] |
|
Hi, Jacek, Yes, 693.47.2 also includes this fix. Thanks, |
| Comment by Yang Sheng [ 08/Jul/19 ] |
|
Hi, Jacek, Just for the record, could you please tell us on what kind of machine this was reproducible? Especially the CPU architecture. Thanks, |
| Comment by Jacek Tomaka [ 08/Jul/19 ] |
|
Hi YangSheng, So far I have managed to reproduce it on three nodes: Intel(R) Xeon Phi(TM) CPU 7210 @ 1.30GHz (three times: twice straight after a reboot, once after running the workload overnight in a loop; by the morning it had crashed) and Intel(R) Xeon Phi(TM) CPU 7250 @ 1.40GHz (two different nodes, once each). I believe we started seeing it after a change that made one of the workflows write in parallel. Please let me know if you need more information. We have not yet deployed the newer kernel for testing, but I will confirm whether it fixes the issue. Regards. Jacek Tomaka
|
| Comment by Jacek Tomaka [ 11/Jul/19 ] |
|
Hi YangSheng,

We hit the same assertion again, now on kernel 3.10.0-957.21.3.el7 with Lustre 2.12.2:

Lustre: Lustre: Build Version: 2.12.2
2019-07-11T02:13:50-05:00 nanny2309 kernel: LustreError: 161434:0:(llite_mmap.c:61:our_vma()) ASSERTION( !down_write_trylock(&mm->mmap_sem) ) failed:
2019-07-11T02:13:50-05:00 nanny2309 kernel: LustreError: 161434:0:(llite_mmap.c:61:our_vma()) LBUG
2019-07-11T02:13:50-05:00 nanny2309 kernel: Pid: 161434, comm: pool-HeadersRea 3.10.0-957.21.3.el7.x86_64 #1 SMP Tue Jun 18 16:35:19 UTC 2019
2019-07-11T02:13:50-05:00 nanny2309 kernel: Call Trace:
2019-07-11T02:13:50-05:00 nanny2309 kernel: [<ffffffffc073b7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
2019-07-11T02:13:50-05:00 nanny2309 kernel: [<ffffffffc073b87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
2019-07-11T02:13:50-05:00 nanny2309 kernel: [<ffffffffc0e3cb4b>] our_vma+0x16b/0x170 [lustre]
2019-07-11T02:13:50-05:00 nanny2309 kernel: [<ffffffffc0e522e9>] vvp_io_rw_lock+0x439/0x760 [lustre]
2019-07-11T02:13:50-05:00 nanny2309 kernel: [<ffffffffc0e526ce>] vvp_io_read_lock+0x3e/0xe0 [lustre]
2019-07-11T02:13:50-05:00 nanny2309 kernel: [<ffffffffc08942ff>] cl_io_lock+0x5f/0x3d0 [obdclass]
2019-07-11T02:13:50-05:00 nanny2309 kernel: [<ffffffffc089488a>] cl_io_loop+0xba/0x1c0 [obdclass]
2019-07-11T02:13:50-05:00 nanny2309 kernel: [<ffffffffc0e0b5b0>] ll_file_io_generic+0x590/0xcb0 [lustre]
2019-07-11T02:13:50-05:00 nanny2309 kernel: [<ffffffffc0e0c868>] ll_file_aio_read+0x2b8/0x3d0 [lustre]
2019-07-11T02:13:50-05:00 nanny2309 kernel: [<ffffffffc0e0ca24>] ll_file_read+0xa4/0x170 [lustre]
2019-07-11T02:13:50-05:00 nanny2309 kernel: [<ffffffffa32416ff>] vfs_read+0x9f/0x170
2019-07-11T02:13:50-05:00 nanny2309 kernel: [<ffffffffa32425bf>] SyS_read+0x7f/0xf0
2019-07-11T02:13:50-05:00 nanny2309 kernel: [<ffffffffa377606b>] tracesys+0xa3/0xc9
2019-07-11T02:13:50-05:00 nanny2309 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
2019-07-11T02:13:50-05:00 nanny2309 kernel: Kernel panic - not syncing: LBUG

So I was wondering if you could answer my question about the code analysis I did in the description of this bug, i.e. why do we expect down_write_trylock to succeed when the caller only holds the read lock? When there is contention it will obviously fail. Shouldn't it in this case just spin? |
| Comment by Yang Sheng [ 11/Jul/19 ] |
|
Hi, Jacek, OK, that is really a surprise. Can you get a vmcore for analysis? We expect a failed return from down_write_trylock there: a failure means the caller already holds the lock, in which case the ASSERTION is not triggered. Thanks, |
| Comment by James A Simmons [ 11/Jul/19 ] |
|
It looks like you are still having problems. Can you try https://review.whamcloud.com/#/c/35271 to see if it helps? |
| Comment by Jacek Tomaka [ 18/Jul/19 ] |
|
Yang Sheng: the crash dump is collected with core_collector makedumpfile -F -l --message-level 1 -d 31. Will that be good enough, or will you need some of the pages that I have excluded? James A Simmons: |
| Comment by Jacek Tomaka [ 18/Jul/19 ] |
|
OK, Yang Sheng, I do have a kernel crash dump. It is 5.4 GB. We are happy to give you access to it or upload it somewhere, but unfortunately we cannot make it public for everyone. |
| Comment by Peter Jones [ 18/Jul/19 ] |
|
Sometimes the diagnostic data collected as part of Lustre troubleshooting is too large to be attached to a JIRA ticket. For these cases, Whamcloud provides an anonymous write-only FTP upload service. In order to use this service, you'll need an FTP client (e.g. ncftp, ftp, etc.) and a JIRA issue. Use the 'uploads' directory and create a new subdirectory using your Jira issue as a name. |
| Comment by Jacek Tomaka [ 19/Jul/19 ] |
|
Hi Peter!

ls -la uploads/LU-12508
-rw-r--r-- 1 99 50        101 Jul 19 01:46 md5.sums
-rw-r--r-- 1 99 50     149799 Jul 19 01:45 vmcore-dmesg.txt
-rw-r--r-- 1 99 50 4032739061 Jul 19 01:46 vmcore.flat.bz2

Also, please treat this data as sensitive and delete it as soon as you are done with the investigation. Please do not share it with third parties. |
| Comment by Jacek Tomaka [ 24/Jul/19 ] |
|
Hello, any news on this? Here is the semaphore state from the crashed kernel:

mmap_sem = {
  {
    count = {
      counter = 0xfffffffe00000002
    },
    __UNIQUE_ID_rh_kabi_hide4 = {
      count = 0xfffffffe00000002
    },
    {<No data fields>}
  },
  wait_lock = {
    raw_lock = {
      val = {
        counter = 0
      }
    }
  },
  osq = {
    tail = {
      counter = 0
    }
  },
  wait_list = {
    next = 0xffff9b2d10fa7a30
  },
  owner = 0x1
},
BTW: do you know whether the patch that Red Hat claims fixes this problem is https://lkml.org/lkml/2018/11/29/961 ? Also, when I reduce the number of writer threads in our workload to 4, I can no longer reproduce the problem. I have also tested without adaptive ticks (without setting nohz_full=1-255) and the problem is still there (with 16 writer threads). |
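One hedged way to read the dumped count, assuming the stock x86-64 rwsem bias constants of the RHEL 7 3.10 kernel line (the constants below are that assumption, not values read from the vmcore): 0xfffffffe00000002 is what you would get with the write bias, the waiting bias and one active reader applied at the same time, i.e. a writer and a reader both owning the rwsem, which is the broken state described by the rwsem wakeup patch quoted later in this ticket. A small check of the arithmetic:

/* Sanity check of one possible decomposition of the dumped rwsem count.
 * Bias values are the usual x86-64 definitions (an assumption here). */
#include <assert.h>
#include <stdio.h>

#define RWSEM_ACTIVE_MASK       0xffffffffL
#define RWSEM_ACTIVE_BIAS       0x00000001L
#define RWSEM_WAITING_BIAS      (-RWSEM_ACTIVE_MASK - 1)  /* -0x100000000 */
#define RWSEM_ACTIVE_READ_BIAS  RWSEM_ACTIVE_BIAS
#define RWSEM_ACTIVE_WRITE_BIAS (RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)

int main(void)
{
        long observed = (long)0xfffffffe00000002ULL;  /* count from the dump */
        long writer_and_reader = RWSEM_ACTIVE_WRITE_BIAS + RWSEM_WAITING_BIAS +
                                 RWSEM_ACTIVE_READ_BIAS;

        assert(observed == writer_and_reader);
        printf("%#lx = write bias + waiting bias + 1 reader\n",
               (unsigned long)observed);
        return 0;
}

Other decompositions of the same value are possible, so this is only a consistency check, not a proof.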
| Comment by Yang Sheng [ 25/Jul/19 ] |
|
Hi, Jacek, From the source diff, we can conclude this patch is the only change between the 957 kernel and the previous one. I think this issue was caused by some memory-barrier problem. It only shows up on certain kinds of CPU, but I still cannot narrow down the range. It looks like you have a reliable way to reproduce it? Can you share it? Thanks, |
| Comment by Yang Sheng [ 04/Sep/19 ] |
|
Hi, guys, Please upgrade to the latest RHEL 7.7 kernel to fix this issue. Thanks, |
| Comment by Jacek Tomaka [ 05/Sep/19 ] |
|
YangSheng, |
| Comment by Yang Sheng [ 05/Sep/19 ] |
|
Hi, Jacek, This is the patch's commit message:

From a9e9bcb45b1525ba7aea26ed9441e8632aeeda58 Mon Sep 17 00:00:00 2001
From: Waiman Long <longman@redhat.com>
Date: Sun, 28 Apr 2019 17:25:38 -0400
Subject: [PATCH] locking/rwsem: Prevent decrement of reader count before
increment
During my rwsem testing, it was found that after a down_read(), the
reader count may occasionally become 0 or even negative. Consequently,
a writer may steal the lock at that time and execute with the reader
in parallel thus breaking the mutual exclusion guarantee of the write
lock. In other words, both readers and writer can become rwsem owners
simultaneously.
The current reader wakeup code does it in one pass to clear waiter->task
and put them into wake_q before fully incrementing the reader count.
Once waiter->task is cleared, the corresponding reader may see it,
finish the critical section and do unlock to decrement the count before
the count is incremented. This is not a problem if there is only one
reader to wake up as the count has been pre-incremented by 1. It is
a problem if there are more than one readers to be woken up and writer
can steal the lock.
The wakeup was actually done in 2 passes before the following v4.9 commit:
70800c3c0cc5 ("locking/rwsem: Scan the wait_list for readers only once")
To fix this problem, the wakeup is now done in two passes
again. In the first pass, we collect the readers and count them.
The reader count is then fully incremented. In the second pass, the
waiter->task is then cleared and they are put into wake_q to be woken
up later.
Thanks, |
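A toy, single-threaded sketch (not the kernel code) of the ordering the quoted patch enforces: all wakeable readers get their reader count added up front, and only then are the waiters released, so a released reader that unlocks immediately can never drive the count to zero while other readers are still being woken. Every name here is hypothetical:

/* Toy illustration of the two-pass reader wakeup described in the patch. */
#include <stdio.h>

#define MAX_WAITERS 4

struct toy_waiter {
        int is_reader;
        int released;        /* stands in for clearing waiter->task */
};

struct toy_rwsem {
        long count;          /* each active reader contributes +1 */
        struct toy_waiter waiters[MAX_WAITERS];
        int nr_waiters;
};

static void toy_wake_readers(struct toy_rwsem *sem)
{
        int i, woken = 0;

        /* Pass 1: count the readers at the head of the queue and grant
         * them all their reader count up front. */
        for (i = 0; i < sem->nr_waiters && sem->waiters[i].is_reader; i++)
                woken++;
        sem->count += woken;

        /* Pass 2: only now release the waiters; even if one of them runs
         * and unlocks immediately, the count stays at least as large as
         * the number of readers still being woken, so a writer cannot
         * observe count == 0 and steal the lock. */
        for (i = 0; i < woken; i++)
                sem->waiters[i].released = 1;

        printf("woke %d readers, count is now %ld\n", woken, sem->count);
}

int main(void)
{
        struct toy_rwsem sem = {
                .count = 0,
                .waiters = { { 1, 0 }, { 1, 0 }, { 0, 0 } },
                .nr_waiters = 3,
        };

        toy_wake_readers(&sem);
        return 0;
}

In the buggy one-pass version the release and the count increment were interleaved, which roughly corresponds to the window the commit message describes.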
| Comment by Jacek Tomaka [ 05/Sep/19 ] |
|
That looks reasonable. Thanks! |
| Comment by Jacek Tomaka [ 19/Nov/19 ] |
|
Hi YangSheng, |