[LU-12508] (llite_mmap.c:71:our_vma()) ASSERTION( !down_write_trylock(&mm->mmap_sem) ) failed when writing in multiple threads Created: 05/Jul/19  Updated: 13/Dec/21  Resolved: 19/Nov/19

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.1
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Jacek Tomaka Assignee: Yang Sheng
Resolution: Not a Bug Votes: 0
Labels: None
Environment:

Linux nanny1926 3.10.0-693.5.2.el7.x86_64 #1 SMP Fri Oct 20 20:32:50 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux


Issue Links:
Duplicate
duplicates LU-10678 LBUG: osd_handler.c:2353:osd_read_loc... Resolved
Related
is related to LU-15156 Back port upstream patch for rwsem issue Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   
2019-07-02T01:45:11-05:00 nanny1926 kernel: LustreError: 251884:0:(llite_mmap.c:71:our_vma()) ASSERTION( !down_write_trylock(&mm->mmap_sem) ) failed:
2019-07-02T01:45:11-05:00 nanny1926 kernel: LustreError: 251884:0:(llite_mmap.c:71:our_vma()) LBUG
2019-07-02T01:45:11-05:00 nanny1926 kernel: Pid: 251884, comm: java
2019-07-02T01:45:11-05:00 nanny1926 kernel: #012Call Trace:
2019-07-02T01:45:11-05:00 nanny1926 kernel: [<ffffffffc03d67ae>] libcfs_call_trace+0x4e/0x60 [libcfs]
2019-07-02T01:45:11-05:00 nanny1926 kernel: [<ffffffffc03d683c>] lbug_with_loc+0x4c/0xb0 [libcfs]
2019-07-02T01:45:11-05:00 nanny1926 kernel: [<ffffffffc116e66b>] our_vma+0x16b/0x170 [lustre]
2019-07-02T01:45:11-05:00 nanny1926 kernel: [<ffffffffc11857f9>] vvp_io_rw_lock+0x409/0x6e0 [lustre]
2019-07-02T01:45:11-05:00 nanny1926 kernel: [<ffffffffc0fbb312>] ? lov_io_iter_init+0x302/0x8b0 [lov]
2019-07-02T01:45:11-05:00 nanny1926 kernel: [<ffffffffc1185b29>] vvp_io_write_lock+0x59/0xf0 [lustre]
2019-07-02T01:45:11-05:00 nanny1926 kernel: [<ffffffffc063ebec>] cl_io_lock+0x5c/0x3d0 [obdclass]
2019-07-02T01:45:11-05:00 nanny1926 kernel: [<ffffffffc063f1db>] cl_io_loop+0x11b/0xc90 [obdclass]
2019-07-02T01:45:11-05:00 nanny1926 kernel: [<ffffffffc1133258>] ll_file_io_generic+0x498/0xc40 [lustre]
2019-07-02T01:45:11-05:00 nanny1926 kernel: [<ffffffffc1133cdd>] ll_file_aio_write+0x12d/0x1f0 [lustre]
2019-07-02T01:45:11-05:00 nanny1926 kernel: [<ffffffffc1133e6e>] ll_file_write+0xce/0x1e0 [lustre]
2019-07-02T01:45:11-05:00 nanny1926 kernel: [<ffffffff81200cad>] vfs_write+0xbd/0x1e0
2019-07-02T01:45:11-05:00 nanny1926 kernel: [<ffffffff8111f394>] ? __audit_syscall_entry+0xb4/0x110
2019-07-02T01:45:11-05:00 nanny1926 kernel: [<ffffffff81201abf>] SyS_write+0x7f/0xe0
2019-07-02T01:45:11-05:00 nanny1926 kernel: [<ffffffff816b5292>] tracesys+0xdd/0xe2
2019-07-02T01:45:11-05:00 nanny1926 kernel:
2019-07-02T01:45:11-05:00 nanny1926 kernel: Kernel panic - not syncing: LBUG

The workload is reading in up to 256 threads and writing 16 files in up to 16 threads.
 
It is reproducible (though it does not fail every time) on this particular machine, which might just come down to particular network timing.
I will try to reproduce it on another machine and get back to you if successful.
 
Any ideas why this lock would have failed?
A quick analysis shows that the only place where our_vma is called is lustre/llite/vvp_io.c:453, and the caller only acquires the read lock:
vvp_mmap_locks:

452                 down_read(&mm->mmap_sem);
453                 while((vma = our_vma(mm, addr, count)) != NULL) {
454                         struct dentry *de = file_dentry(vma->vm_file);
455                         struct inode *inode = de->d_inode;
456                         int flags = CEF_MUST;
 

whereas our_vma has this:

70         /* mmap_sem must have been held by caller. */
71         LASSERT(!down_write_trylock(&mm->mmap_sem));
 

So I guess that if there are multiple threads in vvp_mmap_locks and more than one of them happens to acquire the read lock, or one of them acquires the write lock, then the other would fail, no?
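
To make my reading of the code concrete, here is a minimal sketch of the pattern (my own simplification with made-up function names, not the actual Lustre source):

/* Minimal sketch of the locking pattern as I read it; assumes the usual
 * kernel and Lustre headers are available. */
#include <linux/rwsem.h>
#include <linux/mm_types.h>
#include <libcfs/libcfs.h>      /* LASSERT */

/* Caller side, as in vvp_mmap_locks(): take the shared (read) side. */
static void caller_side(struct mm_struct *mm)
{
        down_read(&mm->mmap_sem);
        /* ... our_vma(mm, addr, count) is called while the read lock is held ... */
        up_read(&mm->mmap_sem);
}

/* Callee side, as in our_vma(): sanity-check that the caller really holds
 * mmap_sem.  While ANY reader (or a writer) holds the semaphore,
 * down_write_trylock() must fail, so the LASSERT passes -- even with many
 * concurrent readers.  The assertion can only fire if the trylock
 * unexpectedly succeeds, i.e. the rwsem looks completely unlocked. */
static void callee_side(struct mm_struct *mm)
{
        LASSERT(!down_write_trylock(&mm->mmap_sem));
        /* (if the trylock did succeed, the write lock would now be held
         *  and would have to be released with up_write()) */
}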



 Comments   
Comment by Yang Sheng [ 05/Jul/19 ]

Hi, Jacek,

Please upgrade the kernel to 3.10.0-957.12.1 to fix this issue.

Thanks,
YangSheng

Comment by Jacek Tomaka [ 05/Jul/19 ]

Hi Yang,

Thank you!

https://access.redhat.com/solutions/3393611 also mentions that the issue is fixed in kernel-3.10.0-693.47.2.el7 for CentOS 7.4. Would that work as well?

Regards.

Jacek Tomaka

 

Comment by Yang Sheng [ 05/Jul/19 ]

Hi, Jacek,

Yes, 693.47.2 also includes this fix.

Thanks,
YangSheng

Comment by Yang Sheng [ 08/Jul/19 ]

Hi, Jacek,

Just for the record, could you please tell us what kind of machine this was reproducible on? Especially the CPU architecture.

Thanks,
YangSheng

Comment by Jacek Tomaka [ 08/Jul/19 ]

Hi YangSheng, 

So far I have managed to reproduce it on three nodes:

Intel(R) Xeon Phi(TM) CPU 7210 @ 1.30GHz (three times): twice it happened straight after a reboot; once it required running the reproducer overnight in a loop, and by the morning it had crashed.

Intel(R) Xeon Phi(TM) CPU 7250 @ 1.40GHz (two different nodes, once each).

I believe we started seeing it after a change that made one of the workflows write in parallel.

Please let me know if you need more information.

We still have not deployed the newer version of the kernel for testing, but I will confirm whether it fixes the issue.

Regards.

Jacek Tomaka

 

Comment by Jacek Tomaka [ 11/Jul/19 ]

Hi YangSheng,
I managed to reproduce it with Lustre 2.12.2 and 3.10.0-957.21.3 (CentOS 7.6).
This time it is a read!

Lustre: Lustre: Build Version: 2.12.2
2019-07-11T02:13:50-05:00 nanny2309 kernel: LustreError: 161434:0:(llite_mmap.c:61:our_vma()) ASSERTION( !down_write_trylock(&mm->mmap_sem) ) failed: 
2019-07-11T02:13:50-05:00 nanny2309 kernel: LustreError: 161434:0:(llite_mmap.c:61:our_vma()) LBUG
2019-07-11T02:13:50-05:00 nanny2309 kernel: Pid: 161434, comm: pool-HeadersRea 3.10.0-957.21.3.el7.x86_64 #1 SMP Tue Jun 18 16:35:19 UTC 2019
2019-07-11T02:13:50-05:00 nanny2309 kernel: Call Trace:
2019-07-11T02:13:50-05:00 nanny2309 kernel: [<ffffffffc073b7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
2019-07-11T02:13:50-05:00 nanny2309 kernel: [<ffffffffc073b87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
2019-07-11T02:13:50-05:00 nanny2309 kernel: [<ffffffffc0e3cb4b>] our_vma+0x16b/0x170 [lustre]
2019-07-11T02:13:50-05:00 nanny2309 kernel: [<ffffffffc0e522e9>] vvp_io_rw_lock+0x439/0x760 [lustre]
2019-07-11T02:13:50-05:00 nanny2309 kernel: [<ffffffffc0e526ce>] vvp_io_read_lock+0x3e/0xe0 [lustre]
2019-07-11T02:13:50-05:00 nanny2309 kernel: [<ffffffffc08942ff>] cl_io_lock+0x5f/0x3d0 [obdclass]
2019-07-11T02:13:50-05:00 nanny2309 kernel: [<ffffffffc089488a>] cl_io_loop+0xba/0x1c0 [obdclass]
2019-07-11T02:13:50-05:00 nanny2309 kernel: [<ffffffffc0e0b5b0>] ll_file_io_generic+0x590/0xcb0 [lustre]
2019-07-11T02:13:50-05:00 nanny2309 kernel: [<ffffffffc0e0c868>] ll_file_aio_read+0x2b8/0x3d0 [lustre]
2019-07-11T02:13:50-05:00 nanny2309 kernel: [<ffffffffc0e0ca24>] ll_file_read+0xa4/0x170 [lustre]
2019-07-11T02:13:50-05:00 nanny2309 kernel: [<ffffffffa32416ff>] vfs_read+0x9f/0x170
2019-07-11T02:13:50-05:00 nanny2309 kernel: [<ffffffffa32425bf>] SyS_read+0x7f/0xf0
2019-07-11T02:13:50-05:00 nanny2309 kernel: [<ffffffffa377606b>] tracesys+0xa3/0xc9
2019-07-11T02:13:50-05:00 nanny2309 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
2019-07-11T02:13:50-05:00 nanny2309 kernel: Kernel panic - not syncing: LBUG

So I was wondering if you could answer my question about the code analysis I did in the description of this bug, i.e. why are we expecting down_write_trylock to succeed when the caller only holds the read lock? When there is contention it will obviously fail. Shouldn't it just spin in this case?

Comment by Yang Sheng [ 11/Jul/19 ]

Hi, Jacek,

OK, that is really a surprise. Can you get a vmcore for analysis? We expect a failed return from down_write_trylock there; a failure means the caller already holds the lock, and then the ASSERTION would not be triggered.

Thanks,
Yangsheng

Comment by James A Simmons [ 11/Jul/19 ]

It looks like you are still having problems. Can you try https://review.whamcloud.com/#/c/35271 to see if it helps?

Comment by Jacek Tomaka [ 18/Jul/19 ]

Yang Sheng:
Sorry for the delay; setting up kdump was a bit of a task, but I think I have it in place now. With kdump in place, the reproducer has already been running for a couple of hours and still has not failed, but I will keep it running.
My kdump.conf says:

core_collector makedumpfile -F -l --message-level 1 -d 31
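
For reference, here is the same line annotated with my understanding of what -d 31 excludes, based on the makedumpfile man page (an assumption on my part, so please correct me if I have it wrong):

# My reading of the makedumpfile dump level: -d takes a bitmask of page
# types to exclude, and 31 = 1+2+4+8+16 excludes all five classes:
#   zero pages, cache pages, cache-private pages, user data pages,
#   and free pages.
# Kernel text, kernel data and slab should still be captured, so the
# mmap_sem/rwsem state ought to be in the dump; only user-space page
# contents are missing.
core_collector makedumpfile -F -l --message-level 1 -d 31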

Will that be good enough, or will you need some of the pages that I have excluded?

James A Simmons:
I can give it a go once I get you the dump. Why do you think it will help?

Comment by Jacek Tomaka [ 18/Jul/19 ]

OK, Yang Sheng, I do have a kernel crash dump. It is 5.4 GB. We are happy to give you access to it or upload it somewhere, but unfortunately we cannot make it public for everyone.
Alternatively, I can be your proxy: if you let me know what commands to issue against the dump, I can run them and paste the output.
Please let me know how you would like to proceed.
Regards.
Jacek Tomaka

Comment by Peter Jones [ 18/Jul/19 ]

Sometimes the diagnostic data collected as part of Lustre troubleshooting is too large to be attached to a JIRA ticket. For these cases, Whamcloud provides an anonymous write-only FTP upload service. In order to use this service, you'll need an FTP client (e.g. ncftp, ftp, etc.) and a JIRA issue. Use the 'uploads' directory and create a new subdirectory using your Jira issue as a name.
In the following example, there are three debug logs in a single directory and the JIRA issue LU-4242 has been created. After completing the upload, please update the relevant issue with a note mentioning the upload, so that our engineers know where to find your logs.
$ ls -lh
total 333M
-rw-r--r-- 1 mjmac mjmac  98M Feb 23 17:36 mds-debug
-rw-r--r-- 1 mjmac mjmac 118M Feb 23 17:37 oss-00-debug
-rw-r--r-- 1 mjmac mjmac 118M Feb 23 17:37 oss-01-debug
$ ncftp ftp.whamcloud.com
NcFTP 3.2.2 (Sep 04, 2008) by Mike Gleason (http://www.NcFTP.com/contact/).
Connecting to 99.96.190.235...
(vsFTPd 2.2.2)
Logging in...
Login successful.
Logged in to ftp.whamcloud.com.
ncftp / > cd uploads
Directory successfully changed.
ncftp /uploads > mkdir LU-4242
ncftp /uploads > cd LU-4242
Directory successfully changed.
ncftp /uploads/LU-4242 > put *
mds-debug: 97.66 MB 11.22 MB/s
oss-00-debug: 117.19 MB 11.16 MB/s
oss-01-debug: 117.48 MB 11.18 MB/s
ncftp /uploads/LU-4242 >
Please note that this is a WRITE-ONLY FTP service, so you will not be able to see (with ls) the files or directories you've created, nor will you (or anyone other than Whamcloud staff) be able to see or read them.

Comment by Jacek Tomaka [ 19/Jul/19 ]

Hi Peter!
Thank you for the description:
Uploaded the files:

ls -la uploads/LU-12508
-rw-r--r--    1 99       50            101 Jul 19 01:46 md5.sums
-rw-r--r--    1 99       50         149799 Jul 19 01:45 vmcore-dmesg.txt
-rw-r--r--    1 99       50       4032739061 Jul 19 01:46 vmcore.flat.bz2

Also, please treat this data as sensitive and delete as soon as you are done with investigation. Please do not share with third parties.
Happy hunting!

Comment by Jacek Tomaka [ 24/Jul/19 ]

Hello, any news on this?

Here is the semaphore state from the crashed kernel: 

 mmap_sem = {
    {
      count = {
        counter = 0xfffffffe00000002
      }, 
      __UNIQUE_ID_rh_kabi_hide4 = {
        count = 0xfffffffe00000002
      }, 
      {<No data fields>}
    }, 
    wait_lock = {
      raw_lock = {
        val = {
          counter = 0
        }
      }
    }, 
    osq = {
      tail = {
        counter = 0
      }
    }, 
    wait_list = {
      next = 0xffff9b2d10fa7a30
    }, 
    owner = 0x1
  }, 
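
If I decode that count correctly under the old (pre-4.20) x86-64 rwsem-xadd layout, I get the following; this is my own interpretation and a quick userspace sketch, not output from the crash tooling:

/* My attempt at decoding count = 0xfffffffe00000002, assuming the
 * rwsem-xadd convention: low 32 bits = active lockers, high 32 bits =
 * negative bias contributed by the waiter/slow path. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
        uint64_t raw = 0xfffffffe00000002ULL;

        int32_t active = (int32_t)(raw & 0xffffffffu);  /* low 32 bits  */
        int32_t bias   = (int32_t)(raw >> 32);          /* high 32 bits */

        /* Prints "active=2 bias=-2": two tasks hold an active reference
         * on the semaphore, and the negative bias suggests tasks are
         * queued on wait_list at the time of the crash.  (The exact
         * bookkeeping of the bias depends on slow-path details I have
         * not traced.) */
        printf("active=%d bias=%d\n", active, bias);
        return 0;
}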
 

BTW: do you know whether the patch that Red Hat claims fixes this problem is https://lkml.org/lkml/2018/11/29/961 ?

Also, when I reduce the number of writer threads to 4 in our workload, I can no longer reproduce the problem.

I have also tested without adaptive ticks (without setting nohz_full=1-255), and the problem is still there (with 16 writer threads).

Comment by Yang Sheng [ 25/Jul/19 ]

Hi, Jacek,

From the source diff, we can conclude that this patch is the only relevant change between the 957 kernel and the previous one. I think this issue was caused by some memory-barrier problem. It only shows up on certain kinds of CPU, but I still cannot narrow down the range. It looks like you have a reliable way to reproduce it; can you share it?

Thanks,
YangSheng

Comment by Yang Sheng [ 04/Sep/19 ]

Hi, Guys,

Please upgrade to the latest RHEL 7.7 kernel to fix this issue.

Thanks,
YangSheng

Comment by Jacek Tomaka [ 05/Sep/19 ]

YangSheng,
Would you be so kind as to point me to the bug fix that you think solves the issue?
It takes significant effort to test on a newer version, and no one likes to be sent on a wild goose chase, as happened with 7.6 before.
Regards.
Jacek Tomaka

Comment by Yang Sheng [ 05/Sep/19 ]

Hi, Jacek,

This is the patch comment:

From a9e9bcb45b1525ba7aea26ed9441e8632aeeda58 Mon Sep 17 00:00:00 2001
From: Waiman Long <longman@redhat.com>
Date: Sun, 28 Apr 2019 17:25:38 -0400
Subject: [PATCH] locking/rwsem: Prevent decrement of reader count before
 increment

During my rwsem testing, it was found that after a down_read(), the
reader count may occasionally become 0 or even negative. Consequently,
a writer may steal the lock at that time and execute with the reader
in parallel thus breaking the mutual exclusion guarantee of the write
lock. In other words, both readers and writer can become rwsem owners
simultaneously.

The current reader wakeup code does it in one pass to clear waiter->task
and put them into wake_q before fully incrementing the reader count.
Once waiter->task is cleared, the corresponding reader may see it,
finish the critical section and do unlock to decrement the count before
the count is incremented. This is not a problem if there is only one
reader to wake up as the count has been pre-incremented by 1.  It is
a problem if there are more than one readers to be woken up and writer
can steal the lock.

The wakeup was actually done in 2 passes before the following v4.9 commit:

  70800c3c0cc5 ("locking/rwsem: Scan the wait_list for readers only once")

To fix this problem, the wakeup is now done in two passes
again. In the first pass, we collect the readers and count them.
The reader count is then fully incremented. In the second pass, the
waiter->task is then cleared and they are put into wake_q to be woken
up later.
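
To illustrate the window described above, here is a rough interleaving sketch (my own simplification, tying it back to the LASSERT in this ticket; it is not code taken from the patch):

/*
 * Rough sketch of the race window (my own simplification):
 *
 * 1. The task releasing the rwsem scans wait_list, clears waiterA->task
 *    and waiterB->task and queues both readers on wake_q, but has only
 *    pre-incremented the reader count by 1 so far.
 * 2. Reader B, spinning on its waiter.task in the slowpath, sees it is
 *    NULL, runs its critical section and calls up_read(), decrementing
 *    the count before the waker's remaining increment lands.
 * 3. The count can now look unlocked, so a concurrent
 *    down_write_trylock() -- such as the one inside our_vma()'s LASSERT
 *    -- may succeed while reader A still believes it holds the read
 *    side: readers and a writer own the rwsem at the same time.
 *
 * The fix restores a two-pass wakeup: pass 1 counts the readers and
 * fully increments the reader count; pass 2 clears waiter->task and
 * queues them on wake_q, so the decrement can no longer overtake the
 * increment.
 */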

Thanks,
YangSheng

Comment by Jacek Tomaka [ 05/Sep/19 ]

That looks reasonable. Thanks!

Comment by Jacek Tomaka [ 19/Nov/19 ]

Hi YangSheng,
I would like to confirm that applying the patch you referenced to the kernel makes the otherwise reliable reproducer no longer hit this issue. Thank you for your help!
Regards.
Jacek Tomaka
