Details
-
Bug
-
Resolution: Fixed
-
Critical
-
Lustre 2.8.0
-
3
-
9223372036854775807
Description
Running Lustre 2.8.0, we hit the following assertion on many clients at the same time while attempting to umount a DNE filesystem.
One thing to note is that the client were mounting two lustre filesystems at the time, one DNE and one non-DNE. The DNE filesystem was the only on that I was trying to unmount.
The DNE filesystem was also not quite happy at the time. 2 out of 16 MDTs were not running. The other 14 MDTs were stuck in recovery with a recovery timer of 0 seconds. The assertion happened, I believe, after I finished bringing up the two missing MDTs. The umount was just stuck until they came up.
2016-07-15 10:36:13 [162519.584734] Lustre: 7006:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 109 previous similar messages 2016-07-15 10:46:55 [163161.876527] Lustre: 7006:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1468604725/real 1468 604725] req@ffff881feee70300 x1539773254358284/t0(0) o400->lquake2-MDT000f-mdc-ffff88201e8e2800@172.19.1.126@o2ib100:12/10 lens 224/224 e 2 to 1 dl 1468604815 ref 1 fl Rpc:X/c0/ffffffff rc 0/-1 2016-07-15 10:46:55 [163161.917624] Lustre: 7006:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 111 previous similar messages 2016-07-15 10:57:25 [163792.204109] Lustre: 7006:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1468605355/real 1468 605355] req@ffff881feee70300 x1539773254361464/t0(0) o400->lquake2-MDT000f-mdc-ffff88201e8e2800@172.19.1.126@o2ib100:12/10 lens 224/224 e 2 to 1 dl 1468605445 ref 1 fl Rpc:X/c0/ffffffff rc 0/-1 2016-07-15 10:57:25 [163792.245207] Lustre: 7006:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 109 previous similar messages <ConMan> Console [opal99] log at 2016-07-15 11:00:00 PDT. 2016-07-15 11:07:53 [164420.530684] Lustre: 7006:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1468606018/real 1468 606018] req@ffff881fee80ad00 x1539773254364936/t0(0) o38->lquake2-MDT0001-mdc-ffff88201e8e2800@172.19.1.112@o2ib100:12/10 lens 520/544 e 0 to 1 dl 1468606073 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 2016-07-15 11:07:53 [164420.571687] Lustre: 7006:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 109 previous similar messages 2016-07-15 11:16:26 [164934.492190] Lustre: lquake2-MDT000f-mdc-ffff88201e8e2800: Connection restored to 172.19.1.126@o2ib100 (at 172.19.1.126@o2ib100) 2016-07-15 11:16:26 [164934.510252] Lustre: Skipped 3 previous similar messages 2016-07-15 11:16:32 [164940.031847] Lustre: lquake2-MDT0008-mdc-ffff88201e8e2800: Connection restored to 172.19.1.119@o2ib100 (at 172.19.1.119@o2ib100) 2016-07-15 11:16:53 [164960.817277] Lustre: lquake2-MDT000e-mdc-ffff88201e8e2800: Connection restored to 172.19.1.125@o2ib100 (at 172.19.1.125@o2ib100) 2016-07-15 11:16:53 [164960.835337] Lustre: Skipped 8 previous similar messages 2016-07-15 11:24:52 [165440.838390] hsi0: can't use GFP_NOIO for QPs on device hfi1_0, using GFP_KERNEL 2016-07-15 11:24:54 [165442.178261] LustreError: 152795:0:(mgc_request.c:141:config_log_put()) ASSERTION( atomic_read(&cld->cld_refcount) > 0 ) failed: 2016-07-15 11:24:54 [165442.196348] LustreError: 152795:0:(mgc_request.c:141:config_log_put()) LBUG 2016-07-15 11:24:54 [165442.206772] Pid: 152795, comm: umount 2016-07-15 11:24:54 Jul 15 11:24:54 opal99 kernel: [[165442.213463] 2016-07-15 11:24:54 [165442.213463] Call Trace: 2016-07-15 11:24:54 165442.178261] LustreError: 152795:0:(mgc_request.c:141:config_l[165442.223668] [<ffffffffa0add7e3>] libcfs_debug_dumpstack+0x53/0x80 [libcfs] 2016-07-15 11:24:54 og_put()) ASSERTION( atomic_read[165442.235364] [<ffffffffa0addd85>] lbug_with_loc+0x45/0xc0 [libcfs] 2016-07-15 11:24:54 (&cld->cld_refcount) > 0 ) faile[165442.245430] [<ffffffffa11fc2e3>] config_log_put+0x3a3/0x3e0 [mgc] 2016-07-15 11:24:54 d: 2016-07-15 11:24:54 Jul 15 11:24:54 opal99 kern[165442.255502] [<ffffffffa12037be>] mgc_process_config+0x77e/0x1280 [mgc] 2016-07-15 11:24:54 el: [165442.196348] LustreError:[165442.266077] [<ffffffff810c6cae>] ? dequeue_task_fair+0x42e/0x640 2016-07-15 11:24:54 152795:0:(mgc_request.c:141:con[165442.276080] [<ffffffff810c0045>] ? sched_clock_cpu+0xa5/0xe0 2016-07-15 11:24:54 fig_log_put()) LBUG 2016-07-15 11:24:54 Jul 15 11:2[165442.285669] [<ffffffff8101560b>] ? __switch_to+0x17b/0x4d0 2016-07-15 11:24:54 6:21 opal99 kernel: LustreError:[165442.295087] [<ffffffffa0c7f715>] obd_process_config.constprop.13+0x85/0x2d0 [obdclass] 2016-07-15 11:24:54 152795:0:(mgc_request.c:141:con[165442.307221] [<ffffffffa0c7fae0>] ? lustre_cfg_new+0x180/0x400 [obdclass] 2016-07-15 11:24:54 fig_log_put()) ASSERTION( atomic[165442.318001] [<ffffffffa0c81760>] lustre_end_log+0xf0/0x5c0 [obdclass] 2016-07-15 11:24:54 _read(&cld->cld_refcount) > 0 ) [165442.328469] [<ffffffffa1156a5d>] ll_put_super+0x8d/0xae0 [lustre] 2016-07-15 11:24:54 failed: 2016-07-15 11:24:54 Jul 15 11:26:21 opal99[165442.338523] [<ffffffff8122dc17>] ? fsnotify_clear_marks_by_inode+0xa7/0x140 2016-07-15 11:24:54 kernel: LustreError: 152795:0:([165442.349585] [<ffffffff8112af1d>] ? call_rcu_sched+0x1d/0x20 2016-07-15 11:24:54 mgc_request.c:141:config_log_put[165442.359112] [<ffffffffa1182a5c>] ? ll_destroy_inode+0x1c/0x20 [lustre] 2016-07-15 11:24:54 ()) LBUG 2016-07-15 11:24:54 [165442.369662] [<ffffffff81203fa8>] ? destroy_inode+0x38/0x60 2016-07-15 11:24:54 [165442.378003] [<ffffffff812040d6>] ? evict+0x106/0x170 2016-07-15 11:24:54 [165442.385728] [<ffffffff8120417e>] ? dispose_list+0x3e/0x50 2016-07-15 11:24:54 [165442.393919] [<ffffffff81204e24>] ? evict_inodes+0x114/0x140 2016-07-15 11:24:54 [165442.402261] [<ffffffff811ea176>] generic_shutdown_super+0x56/0xe0 2016-07-15 11:24:54 [165442.411149] [<ffffffff811ea552>] kill_anon_super+0x12/0x20 2016-07-15 11:24:54 [165442.419336] [<ffffffffa0c7f2b5>] lustre_kill_super+0x45/0x50 [obdclass] 2016-07-15 11:24:54 [165442.428768] [<ffffffff811ea909>] deactivate_locked_super+0x49/0x60 2016-07-15 11:24:54 [165442.437678] [<ffffffff811eaf06>] deactivate_super+0x46/0x60 2016-07-15 11:24:54 [165442.445858] [<ffffffff81208a25>] mntput_no_expire+0xc5/0x120 2016-07-15 11:24:54 [165442.454095] [<ffffffff81209b9f>] SyS_umount+0x9f/0x3c0 2016-07-15 11:24:54 [165442.461739] [<ffffffff8165d709>] system_call_fastpath+0x16/0x1b 2016-07-15 11:24:54 [165442.470213] 2016-07-15 11:24:54 [165442.473946] Kernel panic - not syncing: LBUG 2016-07-15 11:24:54 [165442.480405] CPU: 0 PID: 152795 Comm: umount Tainted: P OE ------------ 3.10.0-327.22.2.1chaos.ch6.x86_64 #1 2016-07-15 11:24:54 [165442.494284] Hardware name: Penguin Computing Relion OCP1930e/S2600KPR, BIOS SE5C610.86B.01.01.0016.033120161139 03/31/2016 2016-07-15 11:24:54 [165442.508355] ffffffffa0af9e0f 00000000389f16da ffff88080bc67ab8 ffffffff8164c6b4 2016-07-15 11:24:54 [165442.518381] ffff88080bc67b38 ffffffff816456af ffffffff00000008 ffff88080bc67b48 2016-07-15 11:24:54 [165442.528375] ffff88080bc67ae8 00000000389f16da ffffffffa1206417 0000000000000246 2016-07-15 11:24:54 [165442.538336] Call Trace: 2016-07-15 11:24:54 [165442.542668] [<ffffffff8164c6b4>] dump_stack+0x19/0x1b 2016-07-15 11:24:54 [165442.550011] [<ffffffff816456af>] panic+0xd8/0x1e7 2016-07-15 11:24:54 [165442.556931] [<ffffffffa0adddeb>] lbug_with_loc+0xab/0xc0 [libcfs] 2016-07-15 11:24:54 [165442.565366] [<ffffffffa11fc2e3>] config_log_put+0x3a3/0x3e0 [mgc] 2016-07-15 11:24:54 [165442.573761] [<ffffffffa12037be>] mgc_process_config+0x77e/0x1280 [mgc] 2016-07-15 11:24:54 [165442.582608] [<ffffffff810c6cae>] ? dequeue_task_fair+0x42e/0x640 2016-07-15 11:24:54 [165442.590863] [<ffffffff810c0045>] ? sched_clock_cpu+0xa5/0xe0 2016-07-15 11:24:54 [165442.598713] [<ffffffff8101560b>] ? __switch_to+0x17b/0x4d0 2016-07-15 11:24:54 [165442.606353] [<ffffffffa0c7f715>] obd_process_config.constprop.13+0x85/0x2d0 [obdclass] 2016-07-15 11:24:54 [165442.616688] [<ffffffffa0c7fae0>] ? lustre_cfg_new+0x180/0x400 [obdclass] 2016-07-15 11:24:54 [165442.625648] [<ffffffffa0c81760>] lustre_end_log+0xf0/0x5c0 [obdclass] 2016-07-15 11:24:54 [165442.634304] [<ffffffffa1156a5d>] ll_put_super+0x8d/0xae0 [lustre] 2016-07-15 11:24:54 [165442.642562] [<ffffffff8122dc17>] ? fsnotify_clear_marks_by_inode+0xa7/0x140 2016-07-15 11:24:54 [165442.651820] [<ffffffff8112af1d>] ? call_rcu_sched+0x1d/0x20 2016-07-15 11:24:54 [165442.659521] [<ffffffffa1182a5c>] ? ll_destroy_inode+0x1c/0x20 [lustre] 2016-07-15 11:24:54 [165442.668280] [<ffffffff81203fa8>] ? destroy_inode+0x38/0x60 2016-07-15 11:24:54 [165442.675870] [<ffffffff812040d6>] ? evict+0x106/0x170 2016-07-15 11:24:54 [165442.682867] [<ffffffff8120417e>] ? dispose_list+0x3e/0x50 2016-07-15 11:24:54 [165442.690342] [<ffffffff81204e24>] ? evict_inodes+0x114/0x140 2016-07-15 11:24:54 [165442.698012] [<ffffffff811ea176>] generic_shutdown_super+0x56/0xe0 2016-07-15 11:24:54 [165442.706269] [<ffffffff811ea552>] kill_anon_super+0x12/0x20 2016-07-15 11:24:54 [165442.713850] [<ffffffffa0c7f2b5>] lustre_kill_super+0x45/0x50 [obdclass] 2016-07-15 11:24:54 [165442.722674] [<ffffffff811ea909>] deactivate_locked_super+0x49/0x60 2016-07-15 11:24:54 [165442.731014] [<ffffffff811eaf06>] deactivate_super+0x46/0x60 2016-07-15 11:24:54 [165442.738662] [<ffffffff81208a25>] mntput_no_expire+0xc5/0x120 2016-07-15 11:24:54 [165442.746397] [<ffffffff81209b9f>] SyS_umount+0x9f/0x3c0 2016-07-15 11:24:54 [165442.753539] [<ffffffff8165d709>] system_call_fastpath+0x16/0x1b