Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Affects Version/s: Lustre 2.10.3
    • Environment: CentOS 7.4, kernel 3.10.0-693.2.2.el7_lustre.pl1.x86_64
    • Severity: 3

    Description

      We got another OSS deadlock on Oak last night. It is likely a regression in 2.10.3.

      Since the upgrade to 2.10.3, these servers have generally not been stable for more than 48 hours. This issue might be related to the OSS situation described in LU-10697. As for the latest MDS instabilities, it sounds like they will be fixed by LU-10680.

      In this case, the OSS deadlock occurred on oak-io2-s1. The OSTs from its partner (oak-io2-s2) had already been migrated to it due to a previous deadlock/issue, so 48 OSTs were mounted.

      Timeframe overview:
      Feb 21 11:28:49: OSTs from oak-io2-s2 migrated to oak-io2-s1
      Feb 23 19:05:04: first stack trace of stuck thread (oak-io2-s1 kernel: Pid: 17265, comm: ll_ost00_032)
      Feb 23 22:59: monitoring reports that ssh to oak-io2-s1 doesn't work anymore
      Feb 23 23:01:51 oak-io2-s1 kernel: INFO: task kswapd0:264 blocked for more than 120 seconds.
      Feb 24 02:03:56 manual crash dump taken of oak-io2-s1

      Attaching the following files:

      • kernel logs in oak-io2-s1_kernel.log (where you can find most of the details in the timeframe above)
      • vmcore-dmesg: oak-io2-s1_vmcore-dmesg.txt
      • crash foreach bt: oak_io2-s1_foreach_bt.txt
      • kernel memory usage: oak-io2-s1_kmem.txt
      • vmcore (oak-io2-s1-vmcore-2018-02-24-02_03_56.gz):

      https://stanford.box.com/s/n8ft8quvr6ubuvd12ukdsoarmrz4uixr
      (debuginfo files are available in comment-221257).

      We decided to downgrade all servers on this system to 2.10.2 because these issues have had a significant impact on production lately.

      Thanks much!

      Stephane

       


          Activity

            [LU-10709] OSS deadlock in 2.10.3

            We upgraded our kernel on Oak from the Bruno-patched CentOS 7.4 kernel to CentOS 7.6 (3.10.0-957.27.2.el7.x86_64 + Lustre patches = 3.10.0-957.27.2.el7_lustre.pl1.x86_64). After one or two days, a similar deadlock occurred. It looks like the kernfs interface still has the same issue.

            • vmcore uploaded to the WC ftp server as vmcore-oak-io1-s1-2019-09-01-21-43-46
            • kernel-debuginfo available there too for 3.10.0-957.27.2.el7_lustre.pl1.x86_64
            • foreach bt attached as foreach_bt-oak-io1-s1-2019-09-01-21-43-46.log

            I know this is a kernel bug, but I wanted to update this ticket for the sake of completeness, and the deadlock is triggered by Lustre through lu_cache_shrink.

            User-space tool accessing the kernfs interface and triggering lu_cache_shrink:

            PID: 254093  TASK: ffff9f16acadd140  CPU: 30  COMMAND: "sas_counters"
             #0 [ffff9f1de48af3d8] __schedule at ffffffffa096aa72
             #1 [ffff9f1de48af460] schedule at ffffffffa096af19
             #2 [ffff9f1de48af470] rwsem_down_read_failed at ffffffffa096c54d
             #3 [ffff9f1de48af4f8] call_rwsem_down_read_failed at ffffffffa0588bf8
             #4 [ffff9f1de48af548] down_read at ffffffffa096a200
             #5 [ffff9f1de48af560] lu_cache_shrink at ffffffffc0e5ee7a [obdclass]
             #6 [ffff9f1de48af5b0] shrink_slab at ffffffffa03cb08e
             #7 [ffff9f1de48af650] do_try_to_free_pages at ffffffffa03ce412
             #8 [ffff9f1de48af6c8] try_to_free_pages at ffffffffa03ce62c
             #9 [ffff9f1de48af760] __alloc_pages_slowpath at ffffffffa09604ef
            #10 [ffff9f1de48af850] __alloc_pages_nodemask at ffffffffa03c2524
            #11 [ffff9f1de48af900] alloc_pages_current at ffffffffa040f438
            #12 [ffff9f1de48af948] new_slab at ffffffffa041a4c5
            #13 [ffff9f1de48af980] ___slab_alloc at ffffffffa041bf2c
            #14 [ffff9f1de48afa58] __slab_alloc at ffffffffa096190c
            #15 [ffff9f1de48afa98] kmem_cache_alloc at ffffffffa041d7cb
            #16 [ffff9f1de48afad8] alloc_inode at ffffffffa045eee1
            #17 [ffff9f1de48afaf8] iget_locked at ffffffffa046025b
            #18 [ffff9f1de48afb38] kernfs_get_inode at ffffffffa04c9c17
            #19 [ffff9f1de48afb58] kernfs_iop_lookup at ffffffffa04ca93b
            #20 [ffff9f1de48afb80] lookup_real at ffffffffa044d573
            #21 [ffff9f1de48afba0] __lookup_hash at ffffffffa044df92
            #22 [ffff9f1de48afbd0] lookup_slow at ffffffffa0961de1
            #23 [ffff9f1de48afc08] link_path_walk at ffffffffa045289f
            #24 [ffff9f1de48afcb8] path_lookupat at ffffffffa0452aaa
            #25 [ffff9f1de48afd50] filename_lookup at ffffffffa045330b
            #26 [ffff9f1de48afd88] user_path_at_empty at ffffffffa04552f7
            #27 [ffff9f1de48afe58] user_path_at at ffffffffa0455361
            #28 [ffff9f1de48afe68] vfs_fstatat at ffffffffa0448223
            #29 [ffff9f1de48afeb8] SYSC_newlstat at ffffffffa0448641
            #30 [ffff9f1de48aff40] sys_newlstat at ffffffffa0448aae
            #31 [ffff9f1de48aff50] system_call_fastpath at ffffffffa0977ddb
                RIP: 00007fdc07510ab5  RSP: 00007ffe9a9e7b30  RFLAGS: 00010202
                RAX: 0000000000000006  RBX: 00000000ffffff9c  RCX: 00007ffe9a9e7b30
                RDX: 00007ffe9a9e6b50  RSI: 00007ffe9a9e6b50  RDI: 00007fdbf86babd0
                RBP: 00000000012d2ca0   R8: 0000000000000001   R9: 0000000000000001
                R10: 00007fdc0834be97  R11: 0000000000000246  R12: 00007ffe9a9e6b50
                R13: 0000000000000001  R14: 00007fdc087ade08  R15: 00007fdbfba9c1d0
                ORIG_RAX: 0000000000000006  CS: 0033  SS: 002b
            

            mdraid task blocked on kernfs too:

            PID: 283550  TASK: ffff9f35c54a0000  CPU: 19  COMMAND: "md0_raid6"
             #0 [ffff9f35c5423b68] __schedule at ffffffffa096aa72
             #1 [ffff9f35c5423bf8] schedule_preempt_disabled at ffffffffa096be39
             #2 [ffff9f35c5423c08] __mutex_lock_slowpath at ffffffffa0969db7
             #3 [ffff9f35c5423c60] mutex_lock at ffffffffa096919f
             #4 [ffff9f35c5423c78] kernfs_find_and_get_ns at ffffffffa04ca883
             #5 [ffff9f35c5423ca0] sysfs_notify at ffffffffa04cd00b
             #6 [ffff9f35c5423cc8] md_update_sb at ffffffffa0795a89
             #7 [ffff9f35c5423d48] md_check_recovery at ffffffffa079681a
             #8 [ffff9f35c5423d68] raid5d at ffffffffc0d9a466 [raid456]
             #9 [ffff9f35c5423e50] md_thread at ffffffffa078dedd
            #10 [ffff9f35c5423ec8] kthread at ffffffffa02c2e81
            

            The original kernel report (https://bugzilla.kernel.org/show_bug.cgi?id=199589) has been dismissed, and I am not sure whether it was ever actually reported to Red Hat.
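
            For reference, here is a minimal C sketch of the lock interaction implied by the two backtraces above. The lock and function names (lu_sites_guard, kernfs_mutex, the shrinker callback) follow the traces, but the code itself is purely illustrative and is not the actual Lustre or kernel implementation:

            #include <linux/mutex.h>
            #include <linux/rwsem.h>
            #include <linux/shrinker.h>

            /* Global rwsem taken for read by the Lustre shrinker (frame #5). */
            static DECLARE_RWSEM(lu_sites_guard);
            /* Mutex serializing kernfs lookups and sysfs_notify() (frames #19 and #4). */
            static DEFINE_MUTEX(kernfs_mutex);

            /*
             * Shrinker callback: runs from direct reclaim of any GFP_KERNEL
             * allocation. If a writer already holds lu_sites_guard, the caller
             * sleeps here, keeping whatever locks it entered reclaim with.
             */
            static unsigned long lu_cache_shrink_sketch(struct shrinker *sk,
                                                        struct shrink_control *sc)
            {
                    down_read(&lu_sites_guard);   /* frames #4/#5 of sas_counters */
                    /* ... walk the registered lu_sites and count/release cached objects ... */
                    up_read(&lu_sites_guard);
                    return 0;
            }

            /*
             * sas_counters: kernfs_iop_lookup() runs with kernfs_mutex held and
             * allocates the new inode with GFP_KERNEL; the allocation enters
             * direct reclaim and blocks in the shrinker above, still holding
             * kernfs_mutex.
             *
             * md0_raid6: sysfs_notify() from md_update_sb() needs kernfs_mutex
             * and queues behind the stuck lookup, so the RAID superblock update
             * can no longer make progress.
             */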

            sthiell Stephane Thiell added a comment

            Hey Bruno,

            Great, thanks! It would definitely be nice to get some feedback from the kernel developers and/or have this patch integrated upstream.

            Our Oak system has been rock solid since this patch: right now we have about 45 days of uptime without any server crash, even though the filesystem is still very busy, mdraid checks are running almost all the time, and sas_counters is launched every minute on all OSS nodes.

            Note: I can't find your email to linux-raid@, maybe it didn't go through?

            Thanks!

            Stephane

            sthiell Stephane Thiell added a comment

            Hello Stephane,
            Following your previous requests for external reporting of this problem/bug:
            • I have created a bug report at kernel.org: https://bugzilla.kernel.org/show_bug.cgi?id=199589
            • I have also asked the MD-RAID maintainers for their opinion in an email to linux-raid@vger.kernel.org, with the title "Deadlock during memory reclaim path involving sysfs and MD-Raid layers".

            Last, the code in recent 4.x kernels seems to indicate that the problem is still there, but now in kernfs instead of sysfs: the latter uses the former's methods internally, and the same potential deadlock seems to exist around kernfs_mutex.
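
            As an illustration only: 4.x kernels (since roughly 4.12) provide the scoped memalloc_nofs_save()/memalloc_nofs_restore() helpers, which achieve the same effect as passing GFP_NOFS without changing every allocation call. The wrapper name below (example_kernfs_inode_alloc) is hypothetical and only shows the shape of such a fix, not an actual kernel patch:

            #include <linux/fs.h>
            #include <linux/sched/mm.h>   /* memalloc_nofs_save/restore, kernels >= 4.12 */

            /* Hypothetical call site: make any direct reclaim triggered while
             * allocating a kernfs/sysfs inode behave as if GFP_NOFS had been
             * requested, so filesystem shrinkers are skipped. */
            static struct inode *example_kernfs_inode_alloc(struct super_block *sb)
            {
                    unsigned int nofs_flags;
                    struct inode *inode;

                    nofs_flags = memalloc_nofs_save();
                    inode = new_inode(sb);          /* allocation now implies GFP_NOFS */
                    memalloc_nofs_restore(nofs_flags);

                    return inode;
            }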

            bfaccini Bruno Faccini (Inactive) added a comment

            Bruno,

            Great, I'll follow that with much attention. Thank you again, your patch has really saved us.

            Stephane

            sthiell Stephane Thiell added a comment

            Stephane,
            Thanks for your help and patch testing!
            I will take care of both the sysfs patch submission and the report to linux-raid soon.
            I will also double-check the 4.x kernels and give you an answer soon.

            bfaccini Bruno Faccini (Inactive) added a comment

            Hi Bruno,

            The system has been very stable lately with the patch. I think we can consider the issue fixed by next week (just to be sure).

            A few questions for you when you have time (no rush):

            • do you plan to submit the sysfs patch upstream to Red Hat?
            • do you want to notify linux-raid about this sysfs race condition, or would you like me to do it?
            • do you think this issue is automatically fixed on more recent 4.x kernels, given that the sysfs interface has changed?

            Thanks!!

            Stephane

            sthiell Stephane Thiell added a comment

            Stephane, thanks for the update, and let's cross our fingers now...

            bfaccini Bruno Faccini (Inactive) added a comment

            Hey Bruno,

            Quick status update: the patch was only deployed last Sunday morning (3/18) due to earlier production constraints. sas_counters is running quite frequently again, and I started the mdraid checks manually (they usually start on Saturday night). So far there is no issue to report and things are looking good, but we need more time (at least a week) to be sure. I will keep you posted!

            sthiell Stephane Thiell added a comment

            OK. Excellent, thank you!! I just built a new kernel with this patch. It is not a kernel version update, just the same kernel as before with this patch added (the new version is kernel-3.10.0-693.2.2.el7_lustre.pl2.x86_64). I will perform the kernel change on all Oak servers early tomorrow morning (Pacific time), when fewer users are connected to the system, and report back.

            sthiell Stephane Thiell added a comment
            bfaccini Bruno Faccini (Inactive) added a comment (edited)

            > Does that mean that the MD layer would then have to use this new sysfs_alloc_inode() method? Or everyone?

            Everyone. By the way, in the deadlock scenario it is the sas_counters user-land thread that triggers the memory reclaim during sysfs inode allocation.

            > But yes, we'd be very interested to test such a patch!

            Attached sysfs_alloc_inode_GFP_NOFS.patch file.
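
            The attached patch is not reproduced here, but as a rough sketch of the general idea only (the cache name, helper layout, and sysfs_ops_sketch below are assumptions for illustration, not the contents of sysfs_alloc_inode_GFP_NOFS.patch): giving sysfs its own ->alloc_inode() that allocates with GFP_NOFS clears __GFP_FS for any reclaim triggered by that allocation, so shrinkers that honor __GFP_FS, as lu_cache_shrink does, bail out instead of taking their locks.

            #include <linux/fs.h>
            #include <linux/gfp.h>
            #include <linux/slab.h>

            /* Assumed private inode cache for sysfs, created at init time with
             * kmem_cache_create() (not shown). */
            static struct kmem_cache *sysfs_inode_cachep;

            static struct inode *sysfs_alloc_inode(struct super_block *sb)
            {
                    struct inode *inode;

                    /* GFP_NOFS: reclaim triggered here will not call back into
                     * filesystem shrinkers, breaking the inversion above. */
                    inode = kmem_cache_alloc(sysfs_inode_cachep, GFP_NOFS);
                    if (!inode)
                            return NULL;
                    if (inode_init_always(sb, inode)) {
                            kmem_cache_free(sysfs_inode_cachep, inode);
                            return NULL;
                    }
                    return inode;
            }

            static void sysfs_destroy_inode(struct inode *inode)
            {
                    /* A real implementation would defer the free with call_rcu(),
                     * as other filesystems do; kept simple for the sketch. */
                    kmem_cache_free(sysfs_inode_cachep, inode);
            }

            /* Wired into the existing sysfs super_operations: */
            static const struct super_operations sysfs_ops_sketch = {
                    .alloc_inode   = sysfs_alloc_inode,
                    .destroy_inode = sysfs_destroy_inode,
                    /* ... remaining operations unchanged ... */
            };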



            Thanks Bruno! Does that mean that the MD layer would then have to use this new sysfs_alloc_inode() method? Or everyone?

            But yes, we'd be very interested to test such a patch!

             

            sthiell Stephane Thiell added a comment

            People

              bfaccini Bruno Faccini (Inactive)
              sthiell Stephane Thiell
              Votes: 1
              Watchers: 8
