LU-9204

replay-vbr test_1b: umount MDS hung in DNE


Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Minor
    • Affects Version/s: Lustre 2.10.0
    • Severity: 3

    Description

      This issue was created by maloo for sarah_lw <wei3.liu@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/5e48e8d2-f92a-11e6-aa39-5254006e85c2.

      The sub-test test_1b failed with the following error:

      test failed to respond and timed out
      

      Server/client: lustre-master tag-2.9.53, el7, DNE with ZFS
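
      To rerun just this sub-test with the Lustre test scripts, something like the following should work (a sketch, not the exact maloo invocation; it assumes the standard test install path and the usual MDSCOUNT/FSTYPE/ONLY variables from test-framework.sh, with a DNE setup of at least two MDTs on ZFS):

      MDSCOUNT=2 FSTYPE=zfs ONLY=1b bash /usr/lib64/lustre/tests/replay-vbr.sh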

      MDS console

      13:06:28:[ 9555.377705] Lustre: DEBUG MARKER: == replay-vbr test 1b: open (O_CREAT) checks version of parent ======================================= 12:59:38 (1487681978)
      13:06:28:[ 9556.168476] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param mdd.lustre-MDT0000.sync_permission=0
      13:06:28:[ 9556.511353] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param mdt.lustre-MDT0000.commit_on_sharing=0
      13:06:28:[ 9556.861773] Lustre: DEBUG MARKER: sync; sync; sync
      13:06:28:[ 9557.944765] Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0000 notransno
      13:06:28:[ 9558.286819] Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0000 readonly
      13:06:28:[ 9558.541020] LustreError: 22420:0:(osd_handler.c:651:osd_ro()) lustre-MDT0000: *** setting device osd-zfs read-only ***
      13:06:28:[ 9558.543353] LustreError: 22420:0:(osd_handler.c:651:osd_ro()) Skipped 7 previous similar messages
      13:06:28:[ 9558.715652] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 REPLAY BARRIER on lustre-MDT0000
      13:06:28:[ 9558.886272] Lustre: DEBUG MARKER: mds1 REPLAY BARRIER on lustre-MDT0000
      13:06:28:[ 9819.188656] Lustre: MGS: Received new LWP connection from 10.9.6.28@tcp, removing former export from same NID
      13:06:28:[ 9819.751607] Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds1' ' /proc/mounts
      13:06:28:[ 9820.090977] Lustre: DEBUG MARKER: umount -d /mnt/lustre-mds1
      13:06:28:[ 9820.265288] Lustre: Failing over lustre-MDT0000
      13:06:28:[ 9820.267192] Lustre: Skipped 6 previous similar messages
      13:06:28:[ 9820.634437] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.9.6.35@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      13:06:28:[ 9820.638958] LustreError: Skipped 321 previous similar messages
      13:06:28:[ 9822.726157] LustreError: 11-0: lustre-MDT0000-osp-MDT0002: operation obd_ping to node 0@lo failed: rc = -107
      13:06:28:[ 9822.728445] LustreError: Skipped 3 previous similar messages
      13:06:28:[ 9822.730302] Lustre: lustre-MDT0000-osp-MDT0002: Connection to lustre-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
      13:06:28:[ 9822.734323] Lustre: Skipped 3 previous similar messages
      13:06:28:[ 9822.734585] LustreError: 166-1: MGC10.9.6.34@tcp: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
      13:06:28:[ 9822.734586] LustreError: Skipped 1 previous similar message
      13:06:28:[ 9822.734612] LustreError: 9478:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1487681946, 300s ago), entering recovery for MGS@10.9.6.34@tcp ns: MGC10.9.6.34@tcp lock: ffff88004b23bc00/0xe3098f4980534cd7 lrc: 4/1,0 mode: --/CR res: [0x65727473756c:0x4:0x0].0x0 rrc: 1 type: PLN flags: 0x1000000000000 nid: local remote: 0xe3098f4980534cde expref: -99 pid: 9478 timeout: 0 lvb_type: 0
      13:06:28:[ 9828.751055] Lustre: 9455:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1487682246/real 1487682246]  req@ffff880047458300 x1559937866081120/t0(0) o250->MGC10.9.6.34@tcp@0@lo:26/25 lens 520/544 e 0 to 1 dl 1487682252 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      13:06:28:[ 9828.758132] Lustre: 9455:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
      13:06:28:[ 9896.207670] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.9.6.28@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      13:06:28:[ 9896.212466] LustreError: Skipped 237 previous similar messages
      13:06:28:[ 9960.081186] INFO: task umount:22669 blocked for more than 120 seconds.
      13:06:28:[ 9960.083476] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      13:06:28:
      13:06:28:[ 9960.085839] umount          D ffff880039e8ba00     0 22669  22668 0x00000080
      13:06:28:[ 9960.090108]  ffff880039d43a10 0000000000000082 ffff880037519f60 ffff880039d43fd8
      13:06:28:[ 9960.092505]  ffff880039d43fd8 ffff880039d43fd8 ffff880037519f60 ffff880039e8bb28
      13:12:33:[ 9960.094963]  ffff880039e8bb30 7fffffffffffffff ffff880037519f60 ffff880039e8ba00
      13:12:33:[ 9960.097448] Call Trace:
      13:12:33:[ 9960.099437]  [<ffffffff8168bac9>] schedule+0x29/0x70
      13:12:33:[ 9960.101633]  [<ffffffff81689519>] schedule_timeout+0x239/0x2d0
      13:12:33:[ 9960.103812]  [<ffffffff811daa99>] ? discard_slab+0x39/0x50
      13:12:33:[ 9960.105961]  [<ffffffff811dc58a>] ? __slab_free+0x2ca/0x2f0
      13:12:33:[ 9960.108126]  [<ffffffffa0ae88e1>] ? llog_free_handle+0x121/0x400 [obdclass]
      13:12:33:[ 9960.110386]  [<ffffffff8168bea6>] wait_for_completion+0x116/0x170
      13:12:33:[ 9960.112518]  [<ffffffff810c4fd0>] ? wake_up_state+0x20/0x20
      13:12:33:[ 9960.114676]  [<ffffffffa0fd5aee>] mgs_ir_fini_fs+0x27e/0x2ec [mgs]
      13:12:33:[ 9960.116802]  [<ffffffffa0fca329>] mgs_put_fsdb+0x49/0x910 [mgs]
      13:12:33:[ 9960.118949]  [<ffffffffa0fcacfc>] mgs_cleanup_fsdb_list+0x6c/0x90 [mgs]
      13:12:33:[ 9960.121089]  [<ffffffffa0fb0a47>] mgs_device_fini+0x97/0x5b0 [mgs]
      13:12:33:[ 9960.123266]  [<ffffffffa0b23f94>] class_cleanup+0x7f4/0xd90 [obdclass]
      13:12:33:[ 9960.125377]  [<ffffffffa0b26f73>] class_process_config+0x2323/0x3340 [obdclass]
      13:12:33:[ 9960.127605]  [<ffffffffa0b16679>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
      13:12:33:[ 9960.129740]  [<ffffffffa0b2807f>] class_manual_cleanup+0xef/0x810 [obdclass]
      13:12:33:[ 9960.131945]  [<ffffffffa0b58360>] server_put_super+0xb20/0xcd0 [obdclass]
      13:12:33:[ 9960.134048]  [<ffffffff812007f2>] generic_shutdown_super+0x72/0xf0
      13:12:33:[ 9960.136150]  [<ffffffff81200bc2>] kill_anon_super+0x12/0x20
      13:12:33:[ 9960.138127]  [<ffffffffa0b2bb92>] lustre_kill_super+0x32/0x50 [obdclass]
      13:12:33:[ 9960.140217]  [<ffffffff81200f79>] deactivate_locked_super+0x49/0x60
      13:12:33:[ 9960.142205]  [<ffffffff81201576>] deactivate_super+0x46/0x60
      13:12:33:[ 9960.144173]  [<ffffffff8121e9b5>] mntput_no_expire+0xc5/0x120
      13:12:33:[ 9960.146059]  [<ffffffff8121faf0>] SyS_umount+0xa0/0x3b0
      13:12:33:[ 9960.147941]  [<ffffffff81696a09>] system_call_fastpath+0x16/0x1b
      13:12:33:[10046.552431] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.9.6.29@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      13:12:33:[10046.556737] LustreError: Skipped 495 previous similar messages
      13:12:33:[10080.149184] INFO: task umount:22669 blocked for more than 120 seconds.
      13:12:33:[10080.151185] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      13:12:33:[10080.153204] umount          D ffff880039e8ba00     0 22669  22668 0x00000080
      13:12:33:[10080.155238]  ffff880039d43a10 0000000000000082 ffff880037519f60 ffff880039d43fd8
      13:12:33:[10080.157558]  ffff880039d43fd8 ffff880039d43fd8 ffff880037519f60 ffff880039e8bb28
      13:12:33:[10080.159579]  ffff880039e8bb30 7fffffffffffffff ffff880037519f60 ffff880039e8ba00
      13:12:33:[10080.161883] Call Trace:
      13:12:33:[10080.163346]  [<ffffffff8168bac9>] schedule+0x29/0x70
      13:12:33:[10080.165282]  [<ffffffff81689519>] schedule_timeout+0x239/0x2d0
      13:12:33:[10080.166946]  [<ffffffff811daa99>] ? discard_slab+0x39/0x50
      13:12:33:[10080.168727]  [<ffffffff811dc58a>] ? __slab_free+0x2ca/0x2f0
      13:12:33:[10080.170347]  [<ffffffffa0ae88e1>] ? llog_free_handle+0x121/0x400 [obdclass]
      13:12:33:[10080.172222]  [<ffffffff8168bea6>] wait_for_completion+0x116/0x170
      13:12:33:[10080.173817]  [<ffffffff810c4fd0>] ? wake_up_state+0x20/0x20
      13:12:33:[10080.175549]  [<ffffffffa0fd5aee>] mgs_ir_fini_fs+0x27e/0x2ec [mgs]
      13:12:33:[10080.177123]  [<ffffffffa0fca329>] mgs_put_fsdb+0x49/0x910 [mgs]
      13:12:33:[10080.178850]  [<ffffffffa0fcacfc>] mgs_cleanup_fsdb_list+0x6c/0x90 [mgs]
      13:12:33:[10080.180440]  [<ffffffffa0fb0a47>] mgs_device_fini+0x97/0x5b0 [mgs]
      13:12:33:[10080.182902]  [<ffffffffa0b23f94>] class_cleanup+0x7f4/0xd90 [obdclass]
      13:12:33:[10080.184978]  [<ffffffffa0b26f73>] class_process_config+0x2323/0x3340 [obdclass]
      13:12:33:[10080.186644]  [<ffffffffa0b16679>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
      13:12:33:[10080.188473]  [<ffffffffa0b2807f>] class_manual_cleanup+0xef/0x810 [obdclass]
      13:12:33:[10080.190134]  [<ffffffffa0b58360>] server_put_super+0xb20/0xcd0 [obdclass]
      13:12:33:[10080.191985]  [<ffffffff812007f2>] generic_shutdown_super+0x72/0xf0
      13:12:33:[10080.193641]  [<ffffffff81200bc2>] kill_anon_super+0x12/0x20
      13:12:33:[10080.195308]  [<ffffffffa0b2bb92>] lustre_kill_super+0x32/0x50 [obdclass]
      13:12:33:[10080.197089]  [<ffffffff81200f79>] deactivate_locked_super+0x49/0x60
      13:12:33:[10080.198671]  [<ffffffff81201576>] deactivate_super+0x46/0x60
      13:12:33:[10080.200426]  [<ffffffff8121e9b5>] mntput_no_expire+0xc5/0x120
      13:12:33:[10080.201976]  [<ffffffff8121faf0>] SyS_umount+0xa0/0x3b0
      13:12:33:[10080.203649]  [<ffffffff81696a09>] system_call_fastpath+0x16/0x1b
      13:12:33:[10200.205111] INFO: task umount:22669 blocked for more than 120 seconds.
      13:12:33:[10200.206778] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
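
      For reference, the DEBUG MARKER lines in the console above amount to the following replay-barrier sequence on the MDS, ending in the umount that hangs (a sketch reconstructed from this log; the test framework's own helpers may wrap these steps differently, and the device/mount-point names are specific to this run):

      # disable sync-on-permission-change and commit-on-sharing on the MDT
      lctl set_param mdd.lustre-MDT0000.sync_permission=0
      lctl set_param mdt.lustre-MDT0000.commit_on_sharing=0
      # flush, then freeze the target: stop assigning transaction numbers
      # and set the osd-zfs device read-only
      sync; sync; sync
      lctl --device lustre-MDT0000 notransno
      lctl --device lustre-MDT0000 readonly
      lctl mark "mds1 REPLAY BARRIER on lustre-MDT0000"
      # fail over the MDS; this is the umount blocked in mgs_ir_fini_fs()
      umount -d /mnt/lustre-mds1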
      

      Info required for matching: replay-vbr 1b

      People

        Assignee: WC Triage
        Reporter: Maloo