Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Affects Version: Lustre 2.10.3
    • Environment: CentOS 7.4, kernel 3.10.0-693.2.2.el7_lustre.pl1.x86_64

    Description

      We got another OSS deadlock last night on Oak. Likely to be a regression of 2.10.3.

      Since the upgrade to 2.10.3, these servers haven't generally been stable for more than 48h. This issue might be related to the OSS situation described in LU-10697. As for the latest MDS instabilities, it sounds like they will be fixed by LU-10680.

      In this case, the OSS deadlock occurred on oak-io2-s1; the OSTs from its partner (oak-io2-s2) had already been migrated to it due to a previous deadlock/issue, so 48 OSTs were mounted.

      Timeframe overview:
      Feb 21 11:28:49: OSTs from oak-io2-s2 migrated to oak-io2-s1
      Feb 23 19:05:04: first stack trace of stuck thread (oak-io2-s1 kernel: Pid: 17265, comm: ll_ost00_032)
      Feb 23 22:59: monitoring reports that ssh to oak-io2-s1 doesn't work anymore
      Feb 23 23:01:51 oak-io2-s1 kernel: INFO: task kswapd0:264 blocked for more than 120 seconds.
      Feb 24 02:03:56 manual crash dump taken of oak-io2-s1

      Attaching the following files:

      • kernel logs in oak-io2-s1_kernel.log (where you can find most of the details in the timeframe above)
      • vmcore-dmesg: oak-io2-s1_vmcore-dmesg.txt
      • crash foreach bt: oak_io2-s1_foreach_bt.xt
      • kernel memory usage: oak-io2-s1_kmem.txt
      • vmcore (oak-io2-s1-vmcore-2018-02-24-02_03_56.gz):

      https://stanford.box.com/s/n8ft8quvr6ubuvd12ukdsoarmrz4uixr
      (debuginfo files are available in comment-221257).

      We decided to downgrade all servers to 2.10.2 on this system because this has had a significant impact on production lately.

      Thanks much!

      Stephane

       

      Attachments

        Issue Links

          Activity

            [LU-10709] OSS deadlock in 2.10.3

            Stephane,
            Thanks for your help and patch testing!
            Will do soon for both sysfs patch submission and reporting to linux-raid.
            Will double-check for 4.x kernels and give you an answer soon.

            bfaccini Bruno Faccini (Inactive) added a comment

            Hi Bruno,

            The system has been very stable lately with the patch. I think we can consider the issue fixed by next week (just to be sure).

            A few questions for you when you have time (no rush):

            • do you plan to submit the sysfs patch upstream to Red Hat?
            • do you want to notify linux-raid about this sysfs race condition (or do you want me to do it?)
            • do you think this issue is automatically fixed when using more recent 4.x kernels because the sysfs interface has changed?

            Thanks!!

            Stephane

            sthiell Stephane Thiell added a comment

            Stephane, thanks for the update, and let's cross our fingers now...

            bfaccini Bruno Faccini (Inactive) added a comment

            Hey Bruno,

            Quick status update: the patch was only deployed last Sunday morning (3/18) due to production constraints before then. sas_counters is running quite frequently again, and I started the mdraid checks manually (they usually start on Saturday night). So far, no issue to report; it's looking good, but we need more time to be sure (at least a week). Will keep you posted!

             

            sthiell Stephane Thiell added a comment

            OK. Excellent, thank you!! I just built a new kernel with this patch. No kernel update, just the same as before plus this patch added (the new version is kernel-3.10.0-693.2.2.el7_lustre.pl2.x86_64). I'll perform the kernel change on all Oak servers tomorrow early morning (Pacific time), when fewer users are connected to the system, and report back.

            sthiell Stephane Thiell added a comment

            > Does that mean that the MD layer would then have to use this new sysfs_alloc_inode() method? Or everyone?

            Everyone. BTW, in the deadlock scenario, it is the sas_counters user-land thread that triggers the memory reclaim during sysfs inode allocation.

            > But yes, we'd be very interested to test such a patch!

            Attached sysfs_alloc_inode_GFP_NOFS.patch file.

            sysfs_alloc_inode_GFP_NOFS.patch

             

             

            bfaccini Bruno Faccini (Inactive) added a comment - edited

            Thanks Bruno! Does that mean that the MD layer would then have to use this new sysfs_alloc_inode() method? Or everyone?

            But yes, we'd be very interested to test such a patch!

             

            sthiell Stephane Thiell added a comment

            Got it! Thanks and sorry for the extra work.
            This crash dump shows almost the same scenario, with the small difference that with zone_reclaim=1 direct reclaim occurs as late as possible; eventually, though, the node comes under high memory pressure and the shrinkers are run:

            PID: 39539  TASK: ffff880066253f40  CPU: 37  COMMAND: "sas_counters"
             #0 [ffff88102f92b1f0] __schedule at ffffffff816a8f65
             #1 [ffff88102f92b258] schedule_preempt_disabled at ffffffff816aa409
             #2 [ffff88102f92b268] __mutex_lock_slowpath at ffffffff816a8337
             #3 [ffff88102f92b2c8] mutex_lock at ffffffff816a774f
             #4 [ffff88102f92b2e0] dquot_acquire at ffffffff81265e8a
             #5 [ffff88102f92b318] ldiskfs_acquire_dquot at ffffffffc0f4f766 [ldiskfs]
             #6 [ffff88102f92b338] dqget at ffffffff81267814
             #7 [ffff88102f92b398] dquot_get_dqblk at ffffffff812685f4
             #8 [ffff88102f92b3b8] osd_acct_index_lookup at ffffffffc10968bf [osd_ldiskfs]
             #9 [ffff88102f92b3f0] lquota_disk_read at ffffffffc0fef214 [lquota]
            #10 [ffff88102f92b420] qsd_refresh_usage at ffffffffc0ff6bfa [lquota]
            #11 [ffff88102f92b458] qsd_op_adjust at ffffffffc1005881 [lquota]
            #12 [ffff88102f92b498] osd_object_delete at ffffffffc105fc50 [osd_ldiskfs]
            #13 [ffff88102f92b4d8] lu_object_free at ffffffffc099ee9d [obdclass]
            #14 [ffff88102f92b530] lu_site_purge_objects at ffffffffc099fafe [obdclass]
            #15 [ffff88102f92b5d8] lu_cache_shrink at ffffffffc09a0949 [obdclass]
            #16 [ffff88102f92b628] shrink_slab at ffffffff81195413
            #17 [ffff88102f92b6c8] zone_reclaim at ffffffff81198091
            #18 [ffff88102f92b770] get_page_from_freelist at ffffffff8118c264
            #19 [ffff88102f92b880] __alloc_pages_nodemask at ffffffff8118caf6
            #20 [ffff88102f92b930] alloc_pages_current at ffffffff811d1108
            #21 [ffff88102f92b978] new_slab at ffffffff811dbe15
            #22 [ffff88102f92b9b0] ___slab_alloc at ffffffff811dd71c
            #23 [ffff88102f92ba80] __slab_alloc at ffffffff816a10ee
            #24 [ffff88102f92bac0] kmem_cache_alloc at ffffffff811df6b3
            #25 [ffff88102f92bb00] alloc_inode at ffffffff8121c851
            #26 [ffff88102f92bb20] iget_locked at ffffffff8121dafb
            #27 [ffff88102f92bb60] sysfs_get_inode at ffffffff8127fcf7
            #28 [ffff88102f92bb80] sysfs_lookup at ffffffff81281ea1
            #29 [ffff88102f92bbb0] lookup_real at ffffffff8120b46d
            #30 [ffff88102f92bbd0] __lookup_hash at ffffffff8120bd42
            #31 [ffff88102f92bc00] lookup_slow at ffffffff816a1342
            #32 [ffff88102f92bc38] link_path_walk at ffffffff8120e9df
            #33 [ffff88102f92bce8] path_lookupat at ffffffff8120ebdb
            #34 [ffff88102f92bd80] filename_lookup at ffffffff8120f34b
            #35 [ffff88102f92bdb8] user_path_at_empty at ffffffff81212ec7
            #36 [ffff88102f92be88] user_path_at at ffffffff81212f31
            #37 [ffff88102f92be98] vfs_fstatat at ffffffff81206473
            #38 [ffff88102f92bee8] SYSC_newlstat at ffffffff81206a41
            #39 [ffff88102f92bf70] sys_newlstat at ffffffff81206cce
            #40 [ffff88102f92bf80] system_call_fastpath at ffffffff816b5009
                RIP: 00007f3374dc21a5  RSP: 00007ffe70743110  RFLAGS: 00010202
                RAX: 0000000000000006  RBX: ffffffff816b5009  RCX: 00007ffe70743120
                RDX: 00007ffe707425b0  RSI: 00007ffe707425b0  RDI: 00007f3366adea10
                RBP: 00000000ffffff9c   R8: 00007f3366adea10   R9: 312d74726f702f65
                R10: 617078652f303a32  R11: 0000000000000246  R12: ffffffff81206cce
                R13: ffff88102f92bf78  R14: 00007ffe707426b0  R15: 000000008217f4f9
                ORIG_RAX: 0000000000000006  CS: 0033  SS: 002b
            

            So, I think that at the moment, if you want to prevent this problem from happening again, you will need to remove one of the involved actors/features, i.e. either sas_counters, the regular MD device check/sync, or Lustre quotas.

            And on the other hand, I wonder if the final solution could be to have a new sysfs_alloc_inode() method available in "struct super_operations sysfs_ops", where the kmem_cache_alloc() would occur with GFP_NOFS, so that filesystem shrinkers that honour it (like lu_cache_shrink()!!) are not run from this allocation path.
            Would you be OK to test this? It would be a kernel patch.
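            For illustration only (this is not the attached patch): a minimal sketch of the idea, assuming a dedicated sysfs inode cache (sysfs_inode_cachep is a hypothetical name) alongside the existing sysfs_evict_inode() helper; cache creation and RCU freeing details are omitted.

            /*
             * Illustrative sketch only, not sysfs_alloc_inode_GFP_NOFS.patch itself.
             * Idea: allocate sysfs inodes with GFP_NOFS so that the direct reclaim
             * triggered by this allocation does not run filesystem shrinkers that
             * honour __GFP_FS, such as lu_cache_shrink().
             */
            static struct kmem_cache *sysfs_inode_cachep;   /* hypothetical, created at init */

            static struct inode *sysfs_alloc_inode(struct super_block *sb)
            {
                    /* GFP_NOFS clears __GFP_FS, so lu_cache_shrink() bails out
                     * instead of purging Lustre objects from this context. */
                    return kmem_cache_alloc(sysfs_inode_cachep, GFP_NOFS);
            }

            static void sysfs_destroy_inode(struct inode *inode)
            {
                    kmem_cache_free(sysfs_inode_cachep, inode);
            }

            static const struct super_operations sysfs_ops = {
                    .statfs         = simple_statfs,
                    .drop_inode     = generic_delete_inode,
                    .evict_inode    = sysfs_evict_inode,
                    .alloc_inode    = sysfs_alloc_inode,    /* new */
                    .destroy_inode  = sysfs_destroy_inode,  /* new */
            };

            Whether the attached patch takes exactly this form is not shown here; the key point is that __GFP_FS is absent from the sysfs inode allocation, so Lustre's shrinker skips it.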

            bfaccini Bruno Faccini (Inactive) added a comment

            Hello Bruno,

            Please find the last (and big) crash dump below, in two parts:

            MD5 (vmcore_oak-io1-s1-2018-03-05-14_06_57.1) = 60c66a81c9acc1675a41722d6016efcc

            https://stanford.box.com/s/nsdwy6bind6l48spesjg76uv2tb9jmrp 

            MD5 (vmcore_oak-io1-s1-2018-03-05-14_06_57.2) = a17323c3afbdf8fc970d430c35ac864c

            https://stanford.box.com/s/3idy1mv956cf0l1a9dj6c9otulmtmnd4 

            Simply use cat to aggregate the parts:

            cat vmcore_oak-io1-s1-2018-03-05-14_06_57.1 vmcore_oak-io1-s1-2018-03-05-14_06_57.2 > vmcore_oak-io1-s1-2018-03-05-14_06_57
            
            

            you should get:

            MD5 (vmcore_oak-io1-s1-2018-03-05-14_06_57) = 14752f1c982d5011c0375fcca8c3ebbe

            Thanks!!

            sthiell Stephane Thiell added a comment

            Hey Bruno,

            Yeah, the last one is about 29GB and that is too big for Box (which has a file size limit of 15GB). Hmm, I could split it in two, which should work with Box (but it won't be until tomorrow).

            sthiell Stephane Thiell added a comment

            Stephane,
            I am presently unable to download this last crash dump directly to one of our internal systems, where I usually use lynx to do so. It seems that using Google Drive is much more complex than the previous Stanford repository. Can you use that same/previous way for it too?

            bfaccini Bruno Faccini (Inactive) added a comment

            People

              Assignee: bfaccini Bruno Faccini (Inactive)
              Reporter: sthiell Stephane Thiell
              Votes: 1
              Watchers: 8
