Lustre / LU-12508

(llite_mmap.c:71:our_vma()) ASSERTION( !down_write_trylock(&mm->mmap_sem) ) failed when writing in multiple threads

Details

    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Major
    • Affects Version/s: Lustre 2.10.1
    • Environment: Linux nanny1926 3.10.0-693.5.2.el7.x86_64 #1 SMP Fri Oct 20 20:32:50 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
    • Severity: 3

    Description

      2019-07-02T01:45:11-05:00 nanny1926 kernel: LustreError:
       251884:0:(llite_mmap.c:71:our_vma()) ASSERTION(
       !down_write_trylock(&mm->mmap_sem) ) failed:
       2019-07-02T01:45:11-05:00 nanny1926 kernel: LustreError:
       251884:0:(llite_mmap.c:71:our_vma()) LBUG
       2019-07-02T01:45:11-05:00 nanny1926 kernel: Pid: 251884, comm: java
       2019-07-02T01:45:11-05:00 nanny1926 kernel: #012Call Trace:
       2019-07-02T01:45:11-05:00 nanny1926 kernel: [<ffffffffc03d67ae>]
       libcfs_call_trace+0x4e/0x60 [libcfs]
       2019-07-02T01:45:11-05:00 nanny1926 kernel: [<ffffffffc03d683c>]
       lbug_with_loc+0x4c/0xb0 [libcfs]
       2019-07-02T01:45:11-05:00 nanny1926 kernel: [<ffffffffc116e66b>]
       our_vma+0x16b/0x170 [lustre]
       2019-07-02T01:45:11-05:00 nanny1926 kernel: [<ffffffffc11857f9>]
       vvp_io_rw_lock+0x409/0x6e0 [lustre]
       2019-07-02T01:45:11-05:00 nanny1926 kernel: [<ffffffffc0fbb312>] ?
       lov_io_iter_init+0x302/0x8b0 [lov]
       2019-07-02T01:45:11-05:00 nanny1926 kernel: [<ffffffffc1185b29>]
       vvp_io_write_lock+0x59/0xf0 [lustre]
       2019-07-02T01:45:11-05:00 nanny1926 kernel: [<ffffffffc063ebec>]
       cl_io_lock+0x5c/0x3d0 [obdclass]
       2019-07-02T01:45:11-05:00 nanny1926 kernel: [<ffffffffc063f1db>]
       cl_io_loop+0x11b/0xc90 [obdclass]
       2019-07-02T01:45:11-05:00 nanny1926 kernel: [<ffffffffc1133258>]
       ll_file_io_generic+0x498/0xc40 [lustre]
       2019-07-02T01:45:11-05:00 nanny1926 kernel: [<ffffffffc1133cdd>]
       ll_file_aio_write+0x12d/0x1f0 [lustre]
       2019-07-02T01:45:11-05:00 nanny1926 kernel: [<ffffffffc1133e6e>]
       ll_file_write+0xce/0x1e0 [lustre]
       2019-07-02T01:45:11-05:00 nanny1926 kernel: [<ffffffff81200cad>]
       vfs_write+0xbd/0x1e0
       2019-07-02T01:45:11-05:00 nanny1926 kernel: [<ffffffff8111f394>] ?
       __audit_syscall_entry+0xb4/0x110
       2019-07-02T01:45:11-05:00 nanny1926 kernel: [<ffffffff81201abf>]
       SyS_write+0x7f/0xe0
       2019-07-02T01:45:11-05:00 nanny1926 kernel: [<ffffffff816b5292>]
       tracesys+0xdd/0xe2
       2019-07-02T01:45:11-05:00 nanny1926 kernel:
       2019-07-02T01:45:11-05:00 nanny1926 kernel: Kernel panic - not syncing: LBUG
      

      The workload reads in up to 256 threads, and writes 16 files in up to 16 threads.
       
      It is reproducible (though it does not fail every time) on this particular machine, which might just come down to particular network timing.
      I will try to reproduce it on another machine and get back to you if successful.
       
      Any ideas why this assertion would have failed?
      A quick analysis shows that the only place our_vma() is called from is lustre/llite/vvp_io.c:453, in vvp_mmap_locks(), and the caller only takes the read lock:

       452                 down_read(&mm->mmap_sem);
       453                 while ((vma = our_vma(mm, addr, count)) != NULL) {
       454                         struct dentry *de = file_dentry(vma->vm_file);
       455                         struct inode *inode = de->d_inode;
       456                         int flags = CEF_MUST;


      whereas our_vma has this:

      70         /* mmap_sem must have been held by caller. */
      71         LASSERT(!down_write_trylock(&mm->mmap_sem));
       

      So I guess if there are multiple threads in vvp_mmap_locks() and more than one happens to acquire the read lock, or one of them acquires the write lock, then the assertion in the other would fail, no?
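
      For what it's worth, the rwsem semantics the assertion relies on can be checked with a minimal userspace analogue (a pthread rwlock standing in for mmap_sem; an illustrative sketch, not Lustre or kernel code): read locks are shared, and the try-write-lock is expected to fail while anyone holds the lock, which is exactly what the LASSERT demands:

       #include <assert.h>
       #include <pthread.h>
       #include <stdio.h>

       int main(void)
       {
               pthread_rwlock_t mmap_sem = PTHREAD_RWLOCK_INITIALIZER;

               /* Two vvp_mmap_locks()-style readers: read locks are
                * shared, so both acquisitions succeed. */
               pthread_rwlock_rdlock(&mmap_sem);
               pthread_rwlock_rdlock(&mmap_sem);

               /* our_vma()-style check: taking the write lock must fail
                * (EBUSY) while the lock is read-held, so the analogue of
                * LASSERT(!down_write_trylock(...)) passes here. */
               assert(pthread_rwlock_trywrlock(&mmap_sem) != 0);
               printf("trywrlock failed as expected: lock is held\n");

               pthread_rwlock_unlock(&mmap_sem);
               pthread_rwlock_unlock(&mmap_sem);
               return 0;
       }

      So multiple concurrent readers alone should not trip the assertion; it fires only when down_write_trylock() unexpectedly succeeds, i.e. when the rwsem looks unowned even though this thread took it with down_read() a few lines earlier.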


          Activity


            Tomaka Jacek Tomaka (Inactive) added a comment -

            Hi YangSheng,
            I would like to confirm that applying the patch you referenced to the kernel makes the otherwise reliable reproducer no longer hit this issue. Thank you for your help!
            Regards.
            Jacek Tomaka

            Tomaka Jacek Tomaka (Inactive) added a comment -

            That looks reasonable. Thanks!
            ys Yang Sheng added a comment -

            Hi, Jacek,

            This is the patch's commit message:

            From a9e9bcb45b1525ba7aea26ed9441e8632aeeda58 Mon Sep 17 00:00:00 2001
            From: Waiman Long <longman@redhat.com>
            Date: Sun, 28 Apr 2019 17:25:38 -0400
            Subject: [PATCH] locking/rwsem: Prevent decrement of reader count before
             increment
            
            During my rwsem testing, it was found that after a down_read(), the
            reader count may occasionally become 0 or even negative. Consequently,
            a writer may steal the lock at that time and execute with the reader
            in parallel thus breaking the mutual exclusion guarantee of the write
            lock. In other words, both readers and writer can become rwsem owners
            simultaneously.
            
            The current reader wakeup code does it in one pass to clear waiter->task
            and put them into wake_q before fully incrementing the reader count.
            Once waiter->task is cleared, the corresponding reader may see it,
            finish the critical section and do unlock to decrement the count before
            the count is incremented. This is not a problem if there is only one
            reader to wake up as the count has been pre-incremented by 1.  It is
            a problem if there are more than one readers to be woken up and writer
            can steal the lock.
            
            The wakeup was actually done in 2 passes before the following v4.9 commit:
            
              70800c3c0cc5 ("locking/rwsem: Scan the wait_list for readers only once")
            
            To fix this problem, the wakeup is now done in two passes
            again. In the first pass, we collect the readers and count them.
            The reader count is then fully incremented. In the second pass, the
            waiter->task is then cleared and they are put into wake_q to be woken
            up later.

            Thanks,
            YangSheng
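
            To make the interleaving described in that commit message concrete, here is a deterministic single-threaded replay of it (an illustrative model tracking a plain reader count; the real rwsem-xadd code uses bias arithmetic, so the numbers are schematic):

             #include <stdatomic.h>
             #include <stdio.h>

             /* A waker is waking two queued readers, A and B; count was
              * pre-incremented by 1 (for one reader) before the wakeup pass. */
             static atomic_long count;

             int main(void)
             {
                     atomic_store(&count, 1);

                     /* One-pass wakeup (the buggy order): waiter->task is
                      * cleared for BOTH readers before B's +1 is added. */
                     printf("waker: A and B released, count=%ld\n",
                            atomic_load(&count));

                     /* B wakes, finishes its critical section, and calls
                      * up_read() before the waker applies B's increment: */
                     atomic_fetch_sub(&count, 1);
                     printf("B: up_read(), count=%ld\n", atomic_load(&count));

                     /* count is now 0, so a writer's down_write_trylock()
                      * sees an apparently unlocked rwsem and steals the
                      * write lock while reader A is still running. */
                     long unlocked = 0;
                     if (atomic_compare_exchange_strong(&count, &unlocked, -1))
                             printf("writer: stole the lock while A reads\n");

                     /* The waker's increment for B arrives too late. */
                     atomic_fetch_add(&count, 1);
                     printf("waker: late increment, count=%ld\n",
                            atomic_load(&count));
                     return 0;
             }

            This is the kind of window in which our_vma()'s down_write_trylock() can succeed despite the down_read() in vvp_mmap_locks(), which is why the LBUG above is a symptom of the kernel bug rather than of Lustre locking.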


            Tomaka Jacek Tomaka (Inactive) added a comment -

            YangSheng,
            Would you be so kind as to point me to the bug fix that you think solves the issue?
            It takes significant effort to test on a newer version, and no one likes to be sent on a wild goose chase, as happened with 7.6 before.
            Regards.
            Jacek Tomaka
            ys Yang Sheng added a comment -

            Hi guys,

            Please upgrade to the latest RHEL 7.7 kernel to fix this issue.

            Thanks,
            YangSheng

            ys Yang Sheng added a comment -

            Hi, Jacek,

            From the source diff, we can conclude that this patch is the only relevant change between the -957 kernel and the previous one. I think this issue was caused by some memory-barrier problem. It only shows up on certain kinds of CPU, but I still cannot narrow down the range. It looks like you have a reliable way to reproduce it; can you share it?

            Thanks,
            YangSheng

            Tomaka Jacek Tomaka (Inactive) added a comment - edited

            Hello, any news on this?

            Here is the semaphore state from the crashed kernel: 

             mmap_sem = {
                {
                  count = {
                    counter = 0xfffffffe00000002
                  }, 
                  __UNIQUE_ID_rh_kabi_hide4 = {
                    count = 0xfffffffe00000002
                  }, 
                  {<No data fields>}
                }, 
                wait_lock = {
                  raw_lock = {
                    val = {
                      counter = 0
                    }
                  }
                }, 
                osq = {
                  tail = {
                    counter = 0
                  }
                }, 
                wait_list = {
                  next = 0xffff9b2d10fa7a30
                }, 
                owner = 0x1
              }, 
             

            BTW: do you know if the patch that Red Hat claims fixes this problem is https://lkml.org/lkml/2018/11/29/961 ?

            Also, when I reduce the number of writer threads to 4 in our workload, I can no longer reproduce the problem.

            I have also tested without adaptive ticks (without setting nohz_full=1-255) and the problem is still there (with 16 writing threads).
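
            As a sanity check on that dump, the count can be decoded against the rwsem-xadd bias constants used by 64-bit kernels of that era (the constants below are an assumption quoted from the upstream/RHEL7 sources, not verified against this exact kernel build):

             #include <stdio.h>

             #define RWSEM_ACTIVE_BIAS        1L
             #define RWSEM_WAITING_BIAS       (-(1L << 32))
             #define RWSEM_ACTIVE_WRITE_BIAS  (RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)

             int main(void)
             {
                     long count = (long)0xfffffffe00000002UL;  /* from the dump */

                     /* The low 32 bits count active lockers; the high bits
                      * carry the waiting/writer bias. */
                     printf("active lockers: %d\n", (int)count);    /* 2 */
                     printf("high bias     : %ld\n", count >> 32);  /* -2 */

                     /* A healthy write-locked rwsem holds ACTIVE_WRITE_BIAS,
                      * i.e. exactly one active owner; the dumped value is
                      * twice that: */
                     printf("2 * ACTIVE_WRITE_BIAS = %#lx\n",
                            (unsigned long)(2 * RWSEM_ACTIVE_WRITE_BIAS));
                     return 0;
             }

            Two active lockers on a semaphore that also carries writer bias is consistent with the "reader and writer inside the lock at the same time" state described in the patch quoted earlier in this ticket; owner = 0x1 likewise looks like the RWSEM_READER_OWNED marker of that era (again an assumption from the kernel sources).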

            Tomaka Jacek Tomaka (Inactive) added a comment - edited

            Hi Peter!
            Thank you for the description.
            I have uploaded the files:

            ls -la uploads/LU-12508
            -rw-r--r--    1 99       50            101 Jul 19 01:46 md5.sums
            -rw-r--r--    1 99       50         149799 Jul 19 01:45 vmcore-dmesg.txt
            -rw-r--r--    1 99       50       4032739061 Jul 19 01:46 vmcore.flat.bz2
            

            Also, please treat this data as sensitive and delete it as soon as you are done with the investigation. Please do not share it with third parties.
            Happy hunting!

            pjones Peter Jones added a comment -

            Sometimes the diagnostic data collected as part of Lustre troubleshooting is too large to be attached to a JIRA ticket. For these cases, Whamcloud provides an anonymous write-only FTP upload service. In order to use this service, you'll need an FTP client (e.g. ncftp, ftp, etc.) and a JIRA issue. Use the 'uploads' directory and create a new subdirectory using your Jira issue as a name.
            In the following example, there are three debug logs in a single directory and the JIRA issue LU-4242 has been created. After completing the upload, please update the relevant issue with a note mentioning the upload, so that our engineers know where to find your logs.
            $ ls -lh
            total 333M
            -rw-r--r-- 1 mjmac mjmac  98M Feb 23 17:36 mds-debug
            -rw-r--r-- 1 mjmac mjmac 118M Feb 23 17:37 oss-00-debug
            -rw-r--r-- 1 mjmac mjmac 118M Feb 23 17:37 oss-01-debug
            $ ncftp ftp.whamcloud.com
            NcFTP 3.2.2 (Sep 04, 2008) by Mike Gleason (http://www.NcFTP.com/contact/).
            Connecting to 99.96.190.235...
            (vsFTPd 2.2.2)
            Logging in...
            Login successful.
            Logged in to ftp.whamcloud.com.
            ncftp / > cd uploads
            Directory successfully changed.
            ncftp /uploads > mkdir LU-4242
            ncftp /uploads > cd LU-4242
            Directory successfully changed.
            ncftp /uploads/LU-4242 > put *
            mds-debug: 97.66 MB 11.22 MB/s
            oss-00-debug: 117.19 MB 11.16 MB/s
            oss-01-debug: 117.48 MB 11.18 MB/s
            ncftp /uploads/LU-4242 >
            Please note that this is a WRITE-ONLY FTP service, so you will not be able to see (with ls) the files or directories you've created, nor will you (or anyone other than Whamcloud staff) be able to see or read them.


            Tomaka Jacek Tomaka (Inactive) added a comment -

            OK, Yang Sheng, I do have a kernel crash dump. It is 5.4 GB. We are happy to give you access to it or upload it somewhere, but unfortunately we cannot make it public for everyone.
            Alternatively, I can be your proxy: if you let me know what commands to issue against the dump, I can run them and paste the output.
            Please let me know how you would like to proceed.
            Regards.
            Jacek Tomaka

            People

              Assignee: ys Yang Sheng
              Reporter: Tomaka Jacek Tomaka (Inactive)
              Votes: 0
              Watchers: 8

              Dates

                Created:
                Updated:
                Resolved: