Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-8408

mgc_request.c:141:config_log_put()) ASSERTION( atomic_read(&cld->cld_refcount) > 0 )

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.9.0
    • Lustre 2.8.0
    • 3
    • 9223372036854775807

    Description

      Running Lustre 2.8.0, we hit the following assertion on many clients at the same time while attempting to umount a DNE filesystem.

      One thing to note is that the client were mounting two lustre filesystems at the time, one DNE and one non-DNE. The DNE filesystem was the only on that I was trying to unmount.

      The DNE filesystem was also not quite happy at the time. 2 out of 16 MDTs were not running. The other 14 MDTs were stuck in recovery with a recovery timer of 0 seconds. The assertion happened, I believe, after I finished bringing up the two missing MDTs. The umount was just stuck until they came up.

      2016-07-15 10:36:13 [162519.584734] Lustre: 7006:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 109 previous similar messages
      2016-07-15 10:46:55 [163161.876527] Lustre: 7006:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1468604725/real 1468
      604725]  req@ffff881feee70300 x1539773254358284/t0(0) o400->lquake2-MDT000f-mdc-ffff88201e8e2800@172.19.1.126@o2ib100:12/10 lens 224/224 e 2 to 1 dl 1468604815 ref 1 fl
       Rpc:X/c0/ffffffff rc 0/-1
      2016-07-15 10:46:55 [163161.917624] Lustre: 7006:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 111 previous similar messages
      2016-07-15 10:57:25 [163792.204109] Lustre: 7006:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1468605355/real 1468
      605355]  req@ffff881feee70300 x1539773254361464/t0(0) o400->lquake2-MDT000f-mdc-ffff88201e8e2800@172.19.1.126@o2ib100:12/10 lens 224/224 e 2 to 1 dl 1468605445 ref 1 fl
       Rpc:X/c0/ffffffff rc 0/-1
      2016-07-15 10:57:25 [163792.245207] Lustre: 7006:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 109 previous similar messages
      
      <ConMan> Console [opal99] log at 2016-07-15 11:00:00 PDT.
      2016-07-15 11:07:53 [164420.530684] Lustre: 7006:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1468606018/real 1468
      606018]  req@ffff881fee80ad00 x1539773254364936/t0(0) o38->lquake2-MDT0001-mdc-ffff88201e8e2800@172.19.1.112@o2ib100:12/10 lens 520/544 e 0 to 1 dl 1468606073 ref 1 fl 
      Rpc:XN/0/ffffffff rc 0/-1
      2016-07-15 11:07:53 [164420.571687] Lustre: 7006:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 109 previous similar messages
      2016-07-15 11:16:26 [164934.492190] Lustre: lquake2-MDT000f-mdc-ffff88201e8e2800: Connection restored to 172.19.1.126@o2ib100 (at 172.19.1.126@o2ib100)
      2016-07-15 11:16:26 [164934.510252] Lustre: Skipped 3 previous similar messages
      2016-07-15 11:16:32 [164940.031847] Lustre: lquake2-MDT0008-mdc-ffff88201e8e2800: Connection restored to 172.19.1.119@o2ib100 (at 172.19.1.119@o2ib100)
      2016-07-15 11:16:53 [164960.817277] Lustre: lquake2-MDT000e-mdc-ffff88201e8e2800: Connection restored to 172.19.1.125@o2ib100 (at 172.19.1.125@o2ib100)
      2016-07-15 11:16:53 [164960.835337] Lustre: Skipped 8 previous similar messages
      2016-07-15 11:24:52 [165440.838390] hsi0: can't use GFP_NOIO for QPs on device hfi1_0, using GFP_KERNEL
      2016-07-15 11:24:54 [165442.178261] LustreError: 152795:0:(mgc_request.c:141:config_log_put()) ASSERTION( atomic_read(&cld->cld_refcount) > 0 ) failed: 
      2016-07-15 11:24:54 [165442.196348] LustreError: 152795:0:(mgc_request.c:141:config_log_put()) LBUG
      2016-07-15 11:24:54 [165442.206772] Pid: 152795, comm: umount
      2016-07-15 11:24:54 Jul 15 11:24:54 opal99 kernel: [[165442.213463] 
      2016-07-15 11:24:54 [165442.213463] Call Trace:
      2016-07-15 11:24:54 165442.178261] LustreError: 152795:0:(mgc_request.c:141:config_l[165442.223668]  [<ffffffffa0add7e3>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
      2016-07-15 11:24:54 og_put()) ASSERTION( atomic_read[165442.235364]  [<ffffffffa0addd85>] lbug_with_loc+0x45/0xc0 [libcfs]
      2016-07-15 11:24:54 (&cld->cld_refcount) > 0 ) faile[165442.245430]  [<ffffffffa11fc2e3>] config_log_put+0x3a3/0x3e0 [mgc]
      2016-07-15 11:24:54 d: 
      2016-07-15 11:24:54 Jul 15 11:24:54 opal99 kern[165442.255502]  [<ffffffffa12037be>] mgc_process_config+0x77e/0x1280 [mgc]
      2016-07-15 11:24:54 el: [165442.196348] LustreError:[165442.266077]  [<ffffffff810c6cae>] ? dequeue_task_fair+0x42e/0x640
      2016-07-15 11:24:54  152795:0:(mgc_request.c:141:con[165442.276080]  [<ffffffff810c0045>] ? sched_clock_cpu+0xa5/0xe0
      2016-07-15 11:24:54 fig_log_put()) LBUG
      2016-07-15 11:24:54 Jul 15 11:2[165442.285669]  [<ffffffff8101560b>] ? __switch_to+0x17b/0x4d0
      2016-07-15 11:24:54 6:21 opal99 kernel: LustreError:[165442.295087]  [<ffffffffa0c7f715>] obd_process_config.constprop.13+0x85/0x2d0 [obdclass]
      2016-07-15 11:24:54  152795:0:(mgc_request.c:141:con[165442.307221]  [<ffffffffa0c7fae0>] ? lustre_cfg_new+0x180/0x400 [obdclass]
      2016-07-15 11:24:54 fig_log_put()) ASSERTION( atomic[165442.318001]  [<ffffffffa0c81760>] lustre_end_log+0xf0/0x5c0 [obdclass]
      2016-07-15 11:24:54 _read(&cld->cld_refcount) > 0 ) [165442.328469]  [<ffffffffa1156a5d>] ll_put_super+0x8d/0xae0 [lustre]
      2016-07-15 11:24:54 failed: 
      2016-07-15 11:24:54 Jul 15 11:26:21 opal99[165442.338523]  [<ffffffff8122dc17>] ? fsnotify_clear_marks_by_inode+0xa7/0x140
      2016-07-15 11:24:54  kernel: LustreError: 152795:0:([165442.349585]  [<ffffffff8112af1d>] ? call_rcu_sched+0x1d/0x20
      2016-07-15 11:24:54 mgc_request.c:141:config_log_put[165442.359112]  [<ffffffffa1182a5c>] ? ll_destroy_inode+0x1c/0x20 [lustre]
      2016-07-15 11:24:54 ()) LBUG
      2016-07-15 11:24:54 [165442.369662]  [<ffffffff81203fa8>] ? destroy_inode+0x38/0x60
      2016-07-15 11:24:54 [165442.378003]  [<ffffffff812040d6>] ? evict+0x106/0x170
      2016-07-15 11:24:54 [165442.385728]  [<ffffffff8120417e>] ? dispose_list+0x3e/0x50
      2016-07-15 11:24:54 [165442.393919]  [<ffffffff81204e24>] ? evict_inodes+0x114/0x140
      2016-07-15 11:24:54 [165442.402261]  [<ffffffff811ea176>] generic_shutdown_super+0x56/0xe0
      2016-07-15 11:24:54 [165442.411149]  [<ffffffff811ea552>] kill_anon_super+0x12/0x20
      2016-07-15 11:24:54 [165442.419336]  [<ffffffffa0c7f2b5>] lustre_kill_super+0x45/0x50 [obdclass]
      2016-07-15 11:24:54 [165442.428768]  [<ffffffff811ea909>] deactivate_locked_super+0x49/0x60
      2016-07-15 11:24:54 [165442.437678]  [<ffffffff811eaf06>] deactivate_super+0x46/0x60
      2016-07-15 11:24:54 [165442.445858]  [<ffffffff81208a25>] mntput_no_expire+0xc5/0x120
      2016-07-15 11:24:54 [165442.454095]  [<ffffffff81209b9f>] SyS_umount+0x9f/0x3c0
      2016-07-15 11:24:54 [165442.461739]  [<ffffffff8165d709>] system_call_fastpath+0x16/0x1b
      2016-07-15 11:24:54 [165442.470213] 
      2016-07-15 11:24:54 [165442.473946] Kernel panic - not syncing: LBUG
      2016-07-15 11:24:54 [165442.480405] CPU: 0 PID: 152795 Comm: umount Tainted: P           OE  ------------   3.10.0-327.22.2.1chaos.ch6.x86_64 #1
      2016-07-15 11:24:54 [165442.494284] Hardware name: Penguin Computing Relion OCP1930e/S2600KPR, BIOS SE5C610.86B.01.01.0016.033120161139 03/31/2016
      2016-07-15 11:24:54 [165442.508355]  ffffffffa0af9e0f 00000000389f16da ffff88080bc67ab8 ffffffff8164c6b4
      2016-07-15 11:24:54 [165442.518381]  ffff88080bc67b38 ffffffff816456af ffffffff00000008 ffff88080bc67b48
      2016-07-15 11:24:54 [165442.528375]  ffff88080bc67ae8 00000000389f16da ffffffffa1206417 0000000000000246
      2016-07-15 11:24:54 [165442.538336] Call Trace:
      2016-07-15 11:24:54 [165442.542668]  [<ffffffff8164c6b4>] dump_stack+0x19/0x1b
      2016-07-15 11:24:54 [165442.550011]  [<ffffffff816456af>] panic+0xd8/0x1e7
      2016-07-15 11:24:54 [165442.556931]  [<ffffffffa0adddeb>] lbug_with_loc+0xab/0xc0 [libcfs]
      2016-07-15 11:24:54 [165442.565366]  [<ffffffffa11fc2e3>] config_log_put+0x3a3/0x3e0 [mgc]
      2016-07-15 11:24:54 [165442.573761]  [<ffffffffa12037be>] mgc_process_config+0x77e/0x1280 [mgc]
      2016-07-15 11:24:54 [165442.582608]  [<ffffffff810c6cae>] ? dequeue_task_fair+0x42e/0x640
      2016-07-15 11:24:54 [165442.590863]  [<ffffffff810c0045>] ? sched_clock_cpu+0xa5/0xe0
      2016-07-15 11:24:54 [165442.598713]  [<ffffffff8101560b>] ? __switch_to+0x17b/0x4d0
      2016-07-15 11:24:54 [165442.606353]  [<ffffffffa0c7f715>] obd_process_config.constprop.13+0x85/0x2d0 [obdclass]
      2016-07-15 11:24:54 [165442.616688]  [<ffffffffa0c7fae0>] ? lustre_cfg_new+0x180/0x400 [obdclass]
      2016-07-15 11:24:54 [165442.625648]  [<ffffffffa0c81760>] lustre_end_log+0xf0/0x5c0 [obdclass]
      2016-07-15 11:24:54 [165442.634304]  [<ffffffffa1156a5d>] ll_put_super+0x8d/0xae0 [lustre]
      2016-07-15 11:24:54 [165442.642562]  [<ffffffff8122dc17>] ? fsnotify_clear_marks_by_inode+0xa7/0x140
      2016-07-15 11:24:54 [165442.651820]  [<ffffffff8112af1d>] ? call_rcu_sched+0x1d/0x20
      2016-07-15 11:24:54 [165442.659521]  [<ffffffffa1182a5c>] ? ll_destroy_inode+0x1c/0x20 [lustre]
      2016-07-15 11:24:54 [165442.668280]  [<ffffffff81203fa8>] ? destroy_inode+0x38/0x60
      2016-07-15 11:24:54 [165442.675870]  [<ffffffff812040d6>] ? evict+0x106/0x170
      2016-07-15 11:24:54 [165442.682867]  [<ffffffff8120417e>] ? dispose_list+0x3e/0x50
      2016-07-15 11:24:54 [165442.690342]  [<ffffffff81204e24>] ? evict_inodes+0x114/0x140
      2016-07-15 11:24:54 [165442.698012]  [<ffffffff811ea176>] generic_shutdown_super+0x56/0xe0
      2016-07-15 11:24:54 [165442.706269]  [<ffffffff811ea552>] kill_anon_super+0x12/0x20
      2016-07-15 11:24:54 [165442.713850]  [<ffffffffa0c7f2b5>] lustre_kill_super+0x45/0x50 [obdclass]
      2016-07-15 11:24:54 [165442.722674]  [<ffffffff811ea909>] deactivate_locked_super+0x49/0x60
      2016-07-15 11:24:54 [165442.731014]  [<ffffffff811eaf06>] deactivate_super+0x46/0x60
      2016-07-15 11:24:54 [165442.738662]  [<ffffffff81208a25>] mntput_no_expire+0xc5/0x120
      2016-07-15 11:24:54 [165442.746397]  [<ffffffff81209b9f>] SyS_umount+0x9f/0x3c0
      2016-07-15 11:24:54 [165442.753539]  [<ffffffff8165d709>] system_call_fastpath+0x16/0x1b
      

      Attachments

        Activity

          People

            yong.fan nasf (Inactive)
            morrone Christopher Morrone (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: