Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.10.3
    • Labels: None
    • Environment: CentOS 7.4, kernel 3.10.0-693.2.2.el7_lustre.pl1.x86_64
    • Severity: 3

    Description

      We got another OSS deadlock last night on Oak. It is likely a regression in 2.10.3.

      Since the upgrade to 2.10.3, these servers haven't been stable for more than 48h in general. This issue might be related to the OSS situation described in LU-10697. As for the latest MDS instabilities, it sounds like they will be fixed by LU-10680.

      In this case, the OSS deadlock hit oak-io2-s1. OSTs from its partner (oak-io2-s2) had already been migrated to it due to a previous deadlock/issue, so 48 OSTs were mounted.

      Timeframe overview:
      Feb 21 11:28:49: OSTs from oak-io2-s2 migrated to oak-io2-s1
      Feb 23 19:05:04: first stack trace of stuck thread (oak-io2-s1 kernel: Pid: 17265, comm: ll_ost00_032)
      Feb 23 22:59: monitoring reports that ssh to oak-io2-s1 doesn't work anymore
      Feb 23 23:01:51 oak-io2-s1 kernel: INFO: task kswapd0:264 blocked for more than 120 seconds.
      Feb 24 02:03:56 manual crash dump taken of oak-io2-s1

      Attaching the following files:

      • kernel logs in oak-io2-s1_kernel.log (where you can find most of the details in the timeframe above)
      • vmcore-dmesg: oak-io2-s1_vmcore-dmesg.txt
      • crash foreach bt: oak_io2-s1_foreach_bt.txt
      • kernel memory usage: oak-io2-s1_kmem.txt
      • vmcore (oak-io2-s1-vmcore-2018-02-24-02_03_56.gz):

      https://stanford.box.com/s/n8ft8quvr6ubuvd12ukdsoarmrz4uixr
      (debuginfo files are available in comment-221257).

      We decided to downgrade all servers on this system to 2.10.2 because these issues have had a significant impact on production lately.

      Thanks much!

      Stephane

       

          Activity

            [LU-10709] OSS deadlock in 2.10.3

            I've attached foreach_bt_oak-io4-s1.log just in case someone wants to take a look at a newer trace. This was without Bruno's patch, running CentOS 7.6.

            In my understanding, this task is part of the deadlock:

            PID: 308960  TASK: ffff944ae769a080  CPU: 17  COMMAND: "sas_discover"
             #0 [ffff946c547f32d8] __schedule at ffffffffb2d6aa72
             #1 [ffff946c547f3360] schedule at ffffffffb2d6af19
             #2 [ffff946c547f3370] wait_transaction_locked at ffffffffc0739085 [jbd2]
             #3 [ffff946c547f33c8] add_transaction_credits at ffffffffc0739368 [jbd2]
             #4 [ffff946c547f3428] start_this_handle at ffffffffc07395e1 [jbd2]
             #5 [ffff946c547f34c0] jbd2__journal_start at ffffffffc0739a93 [jbd2]
             #6 [ffff946c547f3508] __ldiskfs_journal_start_sb at ffffffffc149c189 [ldiskfs]
             #7 [ffff946c547f3548] ldiskfs_release_dquot at ffffffffc149379c [ldiskfs]
             #8 [ffff946c547f3568] dqput at ffffffffb28afc1d
             #9 [ffff946c547f3590] __dquot_drop at ffffffffb28b12d5
            #10 [ffff946c547f35c8] dquot_drop at ffffffffb28b1345
            #11 [ffff946c547f35d8] ldiskfs_clear_inode at ffffffffc14980e2 [ldiskfs]
            #12 [ffff946c547f35f0] ldiskfs_evict_inode at ffffffffc14bb2bf [ldiskfs]
            #13 [ffff946c547f3630] evict at ffffffffb285ff34
            #14 [ffff946c547f3658] dispose_list at ffffffffb286003e
            #15 [ffff946c547f3680] prune_icache_sb at ffffffffb286104c
            #16 [ffff946c547f36e8] prune_super at ffffffffb28453b3
            #17 [ffff946c547f3720] shrink_slab at ffffffffb27cb155
            #18 [ffff946c547f37c0] zone_reclaim at ffffffffb27cdf31
            #19 [ffff946c547f3868] get_page_from_freelist at ffffffffb27c1f2b
            #20 [ffff946c547f3980] __alloc_pages_nodemask at ffffffffb27c2296
            #21 [ffff946c547f3a30] alloc_pages_current at ffffffffb280f438
            #22 [ffff946c547f3a78] new_slab at ffffffffb281a4c5
            #23 [ffff946c547f3ab0] ___slab_alloc at ffffffffb281bf2c
            #24 [ffff946c547f3b88] __slab_alloc at ffffffffb2d6190c
            #25 [ffff946c547f3bc8] kmem_cache_alloc at ffffffffb281d7cb
            #26 [ffff946c547f3c08] alloc_inode at ffffffffb285eee1
            #27 [ffff946c547f3c28] iget_locked at ffffffffb286025b
            #28 [ffff946c547f3c68] kernfs_get_inode at ffffffffb28c9c17
            #29 [ffff946c547f3c88] kernfs_iop_lookup at ffffffffb28ca93b
            #30 [ffff946c547f3cb0] lookup_real at ffffffffb284d573
            #31 [ffff946c547f3cd0] do_last at ffffffffb285153a
            #32 [ffff946c547f3d70] path_openat at ffffffffb2853a27
            #33 [ffff946c547f3e08] do_filp_open at ffffffffb285542d
            #34 [ffff946c547f3ee0] do_sys_open at ffffffffb2841587
            #35 [ffff946c547f3f40] sys_openat at ffffffffb28416c4
            #36 [ffff946c547f3f50] system_call_fastpath at ffffffffb2d77ddb
                RIP: 00007f93ec8f8f70  RSP: 00007ffdf42ef080  RFLAGS: 00010202
                RAX: 0000000000000101  RBX: 00007f93e1287630  RCX: 00007f93dfde5000
                RDX: 0000000000090800  RSI: 00007f93dfeb9b18  RDI: ffffffffffffff9c
                RBP: 00007f93dfeb9b18   R8: 00007f93e1287640   R9: 00007f93ed7334c2
                R10: 0000000000000000  R11: 0000000000000246  R12: 000000000076ac90
                R13: 0000000000000001  R14: 00007f93edb95e08  R15: 00007f93e1040390
                ORIG_RAX: 0000000000000101  CS: 0033  SS: 002b
            

            As Bruno explained before (retranscribed here to the best of my understanding), this involves an external tool accessing sysfs/kernfs, like the task above (although I've seen systemd accessing it too), under memory pressure -> zone_reclaim -> shrink_slab -> I/O flush due to quotas, I guess -> an I/O deadlock in mdraid because md_check_recovery() is running and accessing kernfs:

            PID: 318783  TASK: ffff9460e2cd4100  CPU: 15  COMMAND: "md24_raid6"
             #0 [ffff946ca83ebb68] __schedule at ffffffffb2d6aa72
             #1 [ffff946ca83ebbf8] schedule_preempt_disabled at ffffffffb2d6be39
             #2 [ffff946ca83ebc08] __mutex_lock_slowpath at ffffffffb2d69db7
             #3 [ffff946ca83ebc60] mutex_lock at ffffffffb2d6919f
             #4 [ffff946ca83ebc78] kernfs_find_and_get_ns at ffffffffb28ca883
             #5 [ffff946ca83ebca0] sysfs_notify at ffffffffb28cd00b
             #6 [ffff946ca83ebcc8] md_update_sb at ffffffffb2b95a89
             #7 [ffff946ca83ebd48] md_check_recovery at ffffffffb2b9681a
             #8 [ffff946ca83ebd68] raid5d at ffffffffc1511466 [raid456]
             #9 [ffff946ca83ebe50] md_thread at ffffffffb2b8dedd
            #10 [ffff946ca83ebec8] kthread at ffffffffb26c2e81
            #11 [ffff946ca83ebf50] ret_from_fork_nospec_begin at ffffffffb2d77c1d
            

             

            sthiell Stephane Thiell added a comment
            sthiell Stephane Thiell added a comment - - edited

            This deadlock is still not fixed upstream, and we hit the same issue with CentOS 7.6. We have opened a new case at Red Hat (case 02514526), but my expectations are low. Meanwhile, I've ported Bruno's kernel patch from sysfs to the newer kernfs interface (it's very similar, unless I missed something). We have it running on two updated OSSes on Oak right now and it seems to work fine.

            As a reminder, the problem is complex but is solved by allocating sysfs/kernfs inodes using GFP_NOFS so that they don't trigger lu_cache_shrink() while holding a sysfs/kernfs inode lock. I'm not sure where to draw the line, i.e. whether it's a kernel or a Lustre bug at this point.
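
            For anyone following along, here is a rough sketch of the kind of change involved. This is my own schematic reconstruction, not the actual patch; it assumes the memalloc_nofs_save()/memalloc_nofs_restore() helpers available since 4.12, while older 3.10 kernels would have to mask __GFP_FS differently:

            /*
             * Schematic only -- not the actual patch.  The idea: forbid FS reclaim
             * around the kernfs inode allocation, which runs with kernfs_mutex held
             * (see the kernfs_iop_lookup frames in the traces above).
             */
            #include <linux/sched/mm.h>   /* memalloc_nofs_save/restore, kernels >= 4.12 */

            struct inode *kernfs_get_inode(struct super_block *sb, struct kernfs_node *kn)
            {
                    struct inode *inode;
                    unsigned int nofs;

                    /*
                     * With __GFP_FS allowed, iget_locked() -> alloc_inode() may enter
                     * direct reclaim and call the filesystem shrinkers (lu_cache_shrink,
                     * prune_super -> ldiskfs quota release -> jbd2) while the caller
                     * still holds kernfs_mutex.  Scoping the allocation as NOFS breaks
                     * that re-entry.
                     */
                    nofs = memalloc_nofs_save();
                    inode = iget_locked(sb, kn->ino);   /* field/helper name varies by kernel version */
                    memalloc_nofs_restore(nofs);

                    if (inode && (inode->i_state & I_NEW))
                            kernfs_init_inode(kn, inode);   /* unchanged */

                    return inode;
            }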


            We upgraded our kernel on Oak from the Bruno-patched CentOS 7.4 kernel to CentOS 7.6 (3.10.0-957.27.2.el7.x86_64 + Lustre patches = 3.10.0-957.27.2.el7_lustre.pl1.x86_64). After one or two days, a similar deadlock occurred. It looks like the kernfs interface still has the same issue.

            • vmcore uploaded to the WC ftp server as vmcore-oak-io1-s1-2019-09-01-21-43-46
            • kernel-debuginfo available there too for 3.10.0-957.27.2.el7_lustre.pl1.x86_64
            • foreach bt attached as foreach_bt-oak-io1-s1-2019-09-01-21-43-46.log

            I know this is a kernel bug, but I wanted to update this ticket for the sake of completeness, since the deadlock is still triggered by Lustre through lu_cache_shrink.

            User tool accessing the kernfs interface and triggering lu_cache_shrink:

            PID: 254093  TASK: ffff9f16acadd140  CPU: 30  COMMAND: "sas_counters"
             #0 [ffff9f1de48af3d8] __schedule at ffffffffa096aa72
             #1 [ffff9f1de48af460] schedule at ffffffffa096af19
             #2 [ffff9f1de48af470] rwsem_down_read_failed at ffffffffa096c54d
             #3 [ffff9f1de48af4f8] call_rwsem_down_read_failed at ffffffffa0588bf8
             #4 [ffff9f1de48af548] down_read at ffffffffa096a200
             #5 [ffff9f1de48af560] lu_cache_shrink at ffffffffc0e5ee7a [obdclass]
             #6 [ffff9f1de48af5b0] shrink_slab at ffffffffa03cb08e
             #7 [ffff9f1de48af650] do_try_to_free_pages at ffffffffa03ce412
             #8 [ffff9f1de48af6c8] try_to_free_pages at ffffffffa03ce62c
             #9 [ffff9f1de48af760] __alloc_pages_slowpath at ffffffffa09604ef
            #10 [ffff9f1de48af850] __alloc_pages_nodemask at ffffffffa03c2524
            #11 [ffff9f1de48af900] alloc_pages_current at ffffffffa040f438
            #12 [ffff9f1de48af948] new_slab at ffffffffa041a4c5
            #13 [ffff9f1de48af980] ___slab_alloc at ffffffffa041bf2c
            #14 [ffff9f1de48afa58] __slab_alloc at ffffffffa096190c
            #15 [ffff9f1de48afa98] kmem_cache_alloc at ffffffffa041d7cb
            #16 [ffff9f1de48afad8] alloc_inode at ffffffffa045eee1
            #17 [ffff9f1de48afaf8] iget_locked at ffffffffa046025b
            #18 [ffff9f1de48afb38] kernfs_get_inode at ffffffffa04c9c17
            #19 [ffff9f1de48afb58] kernfs_iop_lookup at ffffffffa04ca93b
            #20 [ffff9f1de48afb80] lookup_real at ffffffffa044d573
            #21 [ffff9f1de48afba0] __lookup_hash at ffffffffa044df92
            #22 [ffff9f1de48afbd0] lookup_slow at ffffffffa0961de1
            #23 [ffff9f1de48afc08] link_path_walk at ffffffffa045289f
            #24 [ffff9f1de48afcb8] path_lookupat at ffffffffa0452aaa
            #25 [ffff9f1de48afd50] filename_lookup at ffffffffa045330b
            #26 [ffff9f1de48afd88] user_path_at_empty at ffffffffa04552f7
            #27 [ffff9f1de48afe58] user_path_at at ffffffffa0455361
            #28 [ffff9f1de48afe68] vfs_fstatat at ffffffffa0448223
            #29 [ffff9f1de48afeb8] SYSC_newlstat at ffffffffa0448641
            #30 [ffff9f1de48aff40] sys_newlstat at ffffffffa0448aae
            #31 [ffff9f1de48aff50] system_call_fastpath at ffffffffa0977ddb
                RIP: 00007fdc07510ab5  RSP: 00007ffe9a9e7b30  RFLAGS: 00010202
                RAX: 0000000000000006  RBX: 00000000ffffff9c  RCX: 00007ffe9a9e7b30
                RDX: 00007ffe9a9e6b50  RSI: 00007ffe9a9e6b50  RDI: 00007fdbf86babd0
                RBP: 00000000012d2ca0   R8: 0000000000000001   R9: 0000000000000001
                R10: 00007fdc0834be97  R11: 0000000000000246  R12: 00007ffe9a9e6b50
                R13: 0000000000000001  R14: 00007fdc087ade08  R15: 00007fdbfba9c1d0
                ORIG_RAX: 0000000000000006  CS: 0033  SS: 002b
            

            mdraid task blocked on kernfs too:

            PID: 283550  TASK: ffff9f35c54a0000  CPU: 19  COMMAND: "md0_raid6"
             #0 [ffff9f35c5423b68] __schedule at ffffffffa096aa72
             #1 [ffff9f35c5423bf8] schedule_preempt_disabled at ffffffffa096be39
             #2 [ffff9f35c5423c08] __mutex_lock_slowpath at ffffffffa0969db7
             #3 [ffff9f35c5423c60] mutex_lock at ffffffffa096919f
             #4 [ffff9f35c5423c78] kernfs_find_and_get_ns at ffffffffa04ca883
             #5 [ffff9f35c5423ca0] sysfs_notify at ffffffffa04cd00b
             #6 [ffff9f35c5423cc8] md_update_sb at ffffffffa0795a89
             #7 [ffff9f35c5423d48] md_check_recovery at ffffffffa079681a
             #8 [ffff9f35c5423d68] raid5d at ffffffffc0d9a466 [raid456]
             #9 [ffff9f35c5423e50] md_thread at ffffffffa078dedd
            #10 [ffff9f35c5423ec8] kthread at ffffffffa02c2e81
            

            The original kernel report (https://bugzilla.kernel.org/show_bug.cgi?id=199589) has been dismissed, and I'm not sure this was actually reported to Red Hat.
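
            For context, here is how a plain lstat() from sas_counters can end up inside lu_cache_shrink: any allocation whose gfp mask allows __GFP_FS may enter direct reclaim, and shrink_slab() then walks every registered shrinker, including the one obdclass registers for the lu_object caches. A minimal sketch of such a shrinker registration follows (hypothetical names; it uses the count_objects/scan_objects API, while the actual Lustre code differs and may use the older single-callback interface on this kernel):

            #include <linux/module.h>
            #include <linux/shrinker.h>
            #include <linux/rwsem.h>

            /* Stand-in for the semaphore lu_cache_shrink blocks on (hypothetical name). */
            static DECLARE_RWSEM(demo_sites_guard);

            static unsigned long demo_cache_count(struct shrinker *shrink,
                                                  struct shrink_control *sc)
            {
                    unsigned long cached = 0;

                    /* lu_cache_shrink does a down_read() like this; if a writer holds
                     * the semaphore, the reclaiming task blocks right here, which is
                     * the rwsem_down_read_failed frame in the sas_counters trace. */
                    down_read(&demo_sites_guard);
                    /* ... walk per-site LRUs and count cached objects ... */
                    up_read(&demo_sites_guard);

                    return cached;
            }

            static unsigned long demo_cache_scan(struct shrinker *shrink,
                                                 struct shrink_control *sc)
            {
                    /* ... free up to sc->nr_to_scan objects ... */
                    return SHRINK_STOP;
            }

            static struct shrinker demo_shrinker = {
                    .count_objects = demo_cache_count,
                    .scan_objects  = demo_cache_scan,
                    .seeks         = DEFAULT_SEEKS,
            };

            static int __init demo_init(void)
            {
                    /* Once registered, any __GFP_FS allocation that enters direct
                     * reclaim (e.g. the kernfs inode allocation above) can end up
                     * calling this shrinker via shrink_slab(). */
                    return register_shrinker(&demo_shrinker);
            }

            static void __exit demo_exit(void)
            {
                    unregister_shrinker(&demo_shrinker);
            }

            module_init(demo_init);
            module_exit(demo_exit);
            MODULE_LICENSE("GPL");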

            sthiell Stephane Thiell added a comment

            Hey Bruno,

            Great, thanks! It would definitely be nice to get some feedback from the kernel developers and/or have this patch integrated upstream.

            Our Oak system has been rock solid since this patch. Right now we have ~45 days of uptime without any server crash, even though the filesystem is still very busy, mdraid checks are running almost all the time, and sas_counters is launched every minute on all OSSes.

            Note: I can't find your email to linux-raid@, maybe it didn't go through?

            Thanks!

            Stephane

            sthiell Stephane Thiell added a comment

            Hello Stephane,
            Following your previous requests for external reporting of this problem/bug:
            • I have created a bug report at kernel.org: https://bugzilla.kernel.org/show_bug.cgi?id=199589.
            • I have also asked the MD-Raid maintainers for their feedback through an email to linux-raid@vger.kernel.org, with the title "Deadlock during memory reclaim path involving sysfs and MD-Raid layers".

            Lastly, recent 4.x kernel code seems to indicate that the problem is still there, but now in kernfs instead of sysfs, as the latter internally uses the former's methods, and the same potential deadlock seems to exist around kernfs_mutex.
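
            For reference, the pattern described above is visible in the fs/kernfs/dir.c lookup path, roughly as follows (heavily simplified paraphrase, not verbatim kernel source; parent/namespace resolution is elided): the lookup holds kernfs_mutex across the inode allocation, while sysfs_notify() from md_update_sb() needs that same mutex.

            /* fs/kernfs/dir.c, heavily simplified paraphrase */
            static struct dentry *kernfs_iop_lookup(struct inode *dir,
                                                    struct dentry *dentry,
                                                    unsigned int flags)
            {
                    struct kernfs_node *parent, *kn;
                    struct inode *inode = NULL;
                    const void *ns = NULL;

                    mutex_lock(&kernfs_mutex);      /* held across the allocation below */

                    /* ... resolve parent kernfs_node and namespace tag (elided) ... */
                    kn = kernfs_find_ns(parent, dentry->d_name.name, ns);
                    if (kn)
                            /*
                             * kernfs_get_inode() -> iget_locked() -> kmem_cache_alloc():
                             * under memory pressure this enters direct reclaim and the
                             * filesystem shrinkers while kernfs_mutex is still held.
                             */
                            inode = kernfs_get_inode(dir->i_sb, kn);

                    mutex_unlock(&kernfs_mutex);
                    return d_splice_alias(inode, dentry);
            }

            /*
             * Meanwhile md_update_sb() -> sysfs_notify() -> kernfs_find_and_get_ns()
             * takes the same kernfs_mutex, so the raid thread cannot complete the
             * I/O that the reclaim path above ends up waiting on: the cycle seen
             * in the traces.
             */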

            bfaccini Bruno Faccini (Inactive) added a comment

            Bruno,

            Great, I'll follow that with much attention. Thank you again, your patch has really saved us.

            Stephane

            sthiell Stephane Thiell added a comment

            Stephane,
            Thanks for your help and patch testing!
            Will do both soon: sysfs patch submission and reporting to linux-raid.
            I will also double-check the 4.x kernels and give you an answer soon.

            bfaccini Bruno Faccini (Inactive) added a comment

            Hi Bruno,

            The system has been very stable lately with the patch. I think we can consider the issue fixed by next week (just to be sure).

            A few questions for you when you have time (no rush):

            • do you plan to submit the sysfs patch upstream to Red Hat?
            • do you want to notify linux-raid about this sysfs race condition, or do you want me to do it?
            • do you think this issue is automatically fixed when using more recent 4.x kernels because the sysfs interface has changed?

            Thanks!!

            Stephane

            sthiell Stephane Thiell added a comment

            Stephane, thanks for the update, and let's cross our fingers now...

            bfaccini Bruno Faccini (Inactive) added a comment

            Hey Bruno,

            Quick status update: the patch was only deployed last Sunday morning (3/18), due to production constraints before that. sas_counters is running quite frequently again and I started the mdraid checks manually (usually they start on Saturday night). So far, no issues to report and it's looking good, but we need more time to be sure (at least a week). Will keep you posted!

             

            sthiell Stephane Thiell added a comment

            People

              Assignee: bfaccini Bruno Faccini (Inactive)
              Reporter: sthiell Stephane Thiell
              Votes: 1
              Watchers: 8
