[LU-8408] mgc_request.c:141:config_log_put()) ASSERTION( atomic_read(&cld->cld_refcount) > 0 ) Created: 15/Jul/16  Updated: 16/Feb/17  Resolved: 02/Sep/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Critical
Reporter: Christopher Morrone Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: llnl

Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Running Lustre 2.8.0, we hit the following assertion on many clients at the same time while attempting to umount a DNE filesystem.

One thing to note is that the clients were mounting two Lustre filesystems at the time, one DNE and one non-DNE. The DNE filesystem was the only one that I was trying to unmount.

The DNE filesystem was also not quite happy at the time. Two out of 16 MDTs were not running, and the other 14 were stuck in recovery with a recovery timer of 0 seconds. The assertion happened, I believe, after I finished bringing up the two missing MDTs; the umount was simply stuck until they came up.

2016-07-15 10:36:13 [162519.584734] Lustre: 7006:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 109 previous similar messages
2016-07-15 10:46:55 [163161.876527] Lustre: 7006:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1468604725/real 1468604725]  req@ffff881feee70300 x1539773254358284/t0(0) o400->lquake2-MDT000f-mdc-ffff88201e8e2800@172.19.1.126@o2ib100:12/10 lens 224/224 e 2 to 1 dl 1468604815 ref 1 fl Rpc:X/c0/ffffffff rc 0/-1
2016-07-15 10:46:55 [163161.917624] Lustre: 7006:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 111 previous similar messages
2016-07-15 10:57:25 [163792.204109] Lustre: 7006:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1468605355/real 1468605355]  req@ffff881feee70300 x1539773254361464/t0(0) o400->lquake2-MDT000f-mdc-ffff88201e8e2800@172.19.1.126@o2ib100:12/10 lens 224/224 e 2 to 1 dl 1468605445 ref 1 fl Rpc:X/c0/ffffffff rc 0/-1
2016-07-15 10:57:25 [163792.245207] Lustre: 7006:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 109 previous similar messages

<ConMan> Console [opal99] log at 2016-07-15 11:00:00 PDT.
2016-07-15 11:07:53 [164420.530684] Lustre: 7006:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1468606018/real 1468606018]  req@ffff881fee80ad00 x1539773254364936/t0(0) o38->lquake2-MDT0001-mdc-ffff88201e8e2800@172.19.1.112@o2ib100:12/10 lens 520/544 e 0 to 1 dl 1468606073 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
2016-07-15 11:07:53 [164420.571687] Lustre: 7006:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 109 previous similar messages
2016-07-15 11:16:26 [164934.492190] Lustre: lquake2-MDT000f-mdc-ffff88201e8e2800: Connection restored to 172.19.1.126@o2ib100 (at 172.19.1.126@o2ib100)
2016-07-15 11:16:26 [164934.510252] Lustre: Skipped 3 previous similar messages
2016-07-15 11:16:32 [164940.031847] Lustre: lquake2-MDT0008-mdc-ffff88201e8e2800: Connection restored to 172.19.1.119@o2ib100 (at 172.19.1.119@o2ib100)
2016-07-15 11:16:53 [164960.817277] Lustre: lquake2-MDT000e-mdc-ffff88201e8e2800: Connection restored to 172.19.1.125@o2ib100 (at 172.19.1.125@o2ib100)
2016-07-15 11:16:53 [164960.835337] Lustre: Skipped 8 previous similar messages
2016-07-15 11:24:52 [165440.838390] hsi0: can't use GFP_NOIO for QPs on device hfi1_0, using GFP_KERNEL
2016-07-15 11:24:54 [165442.178261] LustreError: 152795:0:(mgc_request.c:141:config_log_put()) ASSERTION( atomic_read(&cld->cld_refcount) > 0 ) failed: 
2016-07-15 11:24:54 [165442.196348] LustreError: 152795:0:(mgc_request.c:141:config_log_put()) LBUG
2016-07-15 11:24:54 [165442.206772] Pid: 152795, comm: umount
2016-07-15 11:24:54 [165442.213463] 
2016-07-15 11:24:54 [165442.213463] Call Trace:
2016-07-15 11:24:54 [165442.223668]  [<ffffffffa0add7e3>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
2016-07-15 11:24:54 [165442.235364]  [<ffffffffa0addd85>] lbug_with_loc+0x45/0xc0 [libcfs]
2016-07-15 11:24:54 [165442.245430]  [<ffffffffa11fc2e3>] config_log_put+0x3a3/0x3e0 [mgc]
2016-07-15 11:24:54 [165442.255502]  [<ffffffffa12037be>] mgc_process_config+0x77e/0x1280 [mgc]
2016-07-15 11:24:54 [165442.266077]  [<ffffffff810c6cae>] ? dequeue_task_fair+0x42e/0x640
2016-07-15 11:24:54 [165442.276080]  [<ffffffff810c0045>] ? sched_clock_cpu+0xa5/0xe0
2016-07-15 11:24:54 [165442.285669]  [<ffffffff8101560b>] ? __switch_to+0x17b/0x4d0
2016-07-15 11:24:54 [165442.295087]  [<ffffffffa0c7f715>] obd_process_config.constprop.13+0x85/0x2d0 [obdclass]
2016-07-15 11:24:54 [165442.307221]  [<ffffffffa0c7fae0>] ? lustre_cfg_new+0x180/0x400 [obdclass]
2016-07-15 11:24:54 [165442.318001]  [<ffffffffa0c81760>] lustre_end_log+0xf0/0x5c0 [obdclass]
2016-07-15 11:24:54 [165442.328469]  [<ffffffffa1156a5d>] ll_put_super+0x8d/0xae0 [lustre]
2016-07-15 11:24:54 [165442.338523]  [<ffffffff8122dc17>] ? fsnotify_clear_marks_by_inode+0xa7/0x140
2016-07-15 11:24:54 [165442.349585]  [<ffffffff8112af1d>] ? call_rcu_sched+0x1d/0x20
2016-07-15 11:24:54 [165442.359112]  [<ffffffffa1182a5c>] ? ll_destroy_inode+0x1c/0x20 [lustre]
2016-07-15 11:24:54 [165442.369662]  [<ffffffff81203fa8>] ? destroy_inode+0x38/0x60
2016-07-15 11:24:54 [165442.378003]  [<ffffffff812040d6>] ? evict+0x106/0x170
2016-07-15 11:24:54 [165442.385728]  [<ffffffff8120417e>] ? dispose_list+0x3e/0x50
2016-07-15 11:24:54 [165442.393919]  [<ffffffff81204e24>] ? evict_inodes+0x114/0x140
2016-07-15 11:24:54 [165442.402261]  [<ffffffff811ea176>] generic_shutdown_super+0x56/0xe0
2016-07-15 11:24:54 [165442.411149]  [<ffffffff811ea552>] kill_anon_super+0x12/0x20
2016-07-15 11:24:54 [165442.419336]  [<ffffffffa0c7f2b5>] lustre_kill_super+0x45/0x50 [obdclass]
2016-07-15 11:24:54 [165442.428768]  [<ffffffff811ea909>] deactivate_locked_super+0x49/0x60
2016-07-15 11:24:54 [165442.437678]  [<ffffffff811eaf06>] deactivate_super+0x46/0x60
2016-07-15 11:24:54 [165442.445858]  [<ffffffff81208a25>] mntput_no_expire+0xc5/0x120
2016-07-15 11:24:54 [165442.454095]  [<ffffffff81209b9f>] SyS_umount+0x9f/0x3c0
2016-07-15 11:24:54 [165442.461739]  [<ffffffff8165d709>] system_call_fastpath+0x16/0x1b
2016-07-15 11:24:54 [165442.470213] 
2016-07-15 11:24:54 [165442.473946] Kernel panic - not syncing: LBUG
2016-07-15 11:24:54 [165442.480405] CPU: 0 PID: 152795 Comm: umount Tainted: P           OE  ------------   3.10.0-327.22.2.1chaos.ch6.x86_64 #1
2016-07-15 11:24:54 [165442.494284] Hardware name: Penguin Computing Relion OCP1930e/S2600KPR, BIOS SE5C610.86B.01.01.0016.033120161139 03/31/2016
2016-07-15 11:24:54 [165442.508355]  ffffffffa0af9e0f 00000000389f16da ffff88080bc67ab8 ffffffff8164c6b4
2016-07-15 11:24:54 [165442.518381]  ffff88080bc67b38 ffffffff816456af ffffffff00000008 ffff88080bc67b48
2016-07-15 11:24:54 [165442.528375]  ffff88080bc67ae8 00000000389f16da ffffffffa1206417 0000000000000246
2016-07-15 11:24:54 [165442.538336] Call Trace:
2016-07-15 11:24:54 [165442.542668]  [<ffffffff8164c6b4>] dump_stack+0x19/0x1b
2016-07-15 11:24:54 [165442.550011]  [<ffffffff816456af>] panic+0xd8/0x1e7
2016-07-15 11:24:54 [165442.556931]  [<ffffffffa0adddeb>] lbug_with_loc+0xab/0xc0 [libcfs]
2016-07-15 11:24:54 [165442.565366]  [<ffffffffa11fc2e3>] config_log_put+0x3a3/0x3e0 [mgc]
2016-07-15 11:24:54 [165442.573761]  [<ffffffffa12037be>] mgc_process_config+0x77e/0x1280 [mgc]
2016-07-15 11:24:54 [165442.582608]  [<ffffffff810c6cae>] ? dequeue_task_fair+0x42e/0x640
2016-07-15 11:24:54 [165442.590863]  [<ffffffff810c0045>] ? sched_clock_cpu+0xa5/0xe0
2016-07-15 11:24:54 [165442.598713]  [<ffffffff8101560b>] ? __switch_to+0x17b/0x4d0
2016-07-15 11:24:54 [165442.606353]  [<ffffffffa0c7f715>] obd_process_config.constprop.13+0x85/0x2d0 [obdclass]
2016-07-15 11:24:54 [165442.616688]  [<ffffffffa0c7fae0>] ? lustre_cfg_new+0x180/0x400 [obdclass]
2016-07-15 11:24:54 [165442.625648]  [<ffffffffa0c81760>] lustre_end_log+0xf0/0x5c0 [obdclass]
2016-07-15 11:24:54 [165442.634304]  [<ffffffffa1156a5d>] ll_put_super+0x8d/0xae0 [lustre]
2016-07-15 11:24:54 [165442.642562]  [<ffffffff8122dc17>] ? fsnotify_clear_marks_by_inode+0xa7/0x140
2016-07-15 11:24:54 [165442.651820]  [<ffffffff8112af1d>] ? call_rcu_sched+0x1d/0x20
2016-07-15 11:24:54 [165442.659521]  [<ffffffffa1182a5c>] ? ll_destroy_inode+0x1c/0x20 [lustre]
2016-07-15 11:24:54 [165442.668280]  [<ffffffff81203fa8>] ? destroy_inode+0x38/0x60
2016-07-15 11:24:54 [165442.675870]  [<ffffffff812040d6>] ? evict+0x106/0x170
2016-07-15 11:24:54 [165442.682867]  [<ffffffff8120417e>] ? dispose_list+0x3e/0x50
2016-07-15 11:24:54 [165442.690342]  [<ffffffff81204e24>] ? evict_inodes+0x114/0x140
2016-07-15 11:24:54 [165442.698012]  [<ffffffff811ea176>] generic_shutdown_super+0x56/0xe0
2016-07-15 11:24:54 [165442.706269]  [<ffffffff811ea552>] kill_anon_super+0x12/0x20
2016-07-15 11:24:54 [165442.713850]  [<ffffffffa0c7f2b5>] lustre_kill_super+0x45/0x50 [obdclass]
2016-07-15 11:24:54 [165442.722674]  [<ffffffff811ea909>] deactivate_locked_super+0x49/0x60
2016-07-15 11:24:54 [165442.731014]  [<ffffffff811eaf06>] deactivate_super+0x46/0x60
2016-07-15 11:24:54 [165442.738662]  [<ffffffff81208a25>] mntput_no_expire+0xc5/0x120
2016-07-15 11:24:54 [165442.746397]  [<ffffffff81209b9f>] SyS_umount+0x9f/0x3c0
2016-07-15 11:24:54 [165442.753539]  [<ffffffff8165d709>] system_call_fastpath+0x16/0x1b


 Comments   
Comment by Peter Jones [ 16/Jul/16 ]

Fan Yong

Could you please advise on this issue?

Thanks

Peter

Comment by nasf (Inactive) [ 01/Aug/16 ]

Honestly, from the bug description alone it is not easy to pinpoint the root cause. I have tried many times in my VM environment but cannot reproduce the same failure trace. I then studied the MGC logic carefully and found that the current 'config_llog_data::cld_refcount' handling is somewhat confusing, and may cause a reference leak or an extra reference drop in some cases. I will make a patch that cleans up the 'config_llog_data::cld_refcount' logic entirely. It is quite possible that this patch contains the solution for the trouble in this ticket. I would suggest verifying the patch on your system after it passes Maloo tests.

Comment by Gerrit Updater [ 01/Aug/16 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/21616
Subject: LU-8408 mgc: handle config_llog_data::cld_refcount properly
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 6454c370f5afcdc5b91ea792653973888c326b68

Comment by Christopher Morrone [ 05/Aug/16 ]

If you provide a patch for 2.8, I'll try it out. The one for master does not apply.

Comment by nasf (Inactive) [ 05/Aug/16 ]

The patch for b2_8_fe:
http://review.whamcloud.com/21740

Comment by Christopher Morrone [ 18/Aug/16 ]

We have not hit this since running the b2_8_fe backport of the LU-8408 patch (http://review.whamcloud.com/21740).

Comment by nasf (Inactive) [ 19/Aug/16 ]

Excellent! Thanks Chris for the verification.

Comment by Gerrit Updater [ 02/Sep/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/21616/
Subject: LU-8408 mgc: handle config_llog_data::cld_refcount properly
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 48d24ebd6d51873a6c560000ea3b638fdae22a27

Comment by nasf (Inactive) [ 02/Sep/16 ]

The patch has landed on master.

Generated at Sat Feb 10 02:17:17 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.