Lustre / LU-5942

Interop 2.5.3<->master replay-dual test_10: MDS hang with D state


Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Critical
    • Affects Version/s: None
    • Fix Version/s: Lustre 2.7.0
    • Severity: 3
    • Rank: 16588

    Description

      This issue was created by maloo for sarah <sarah@whamcloud.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/93f10028-6bb0-11e4-88ff-5254006e85c2.

      The sub-test test_10 failed with the following error:

      import is not in FULL state
      
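      The "import is not in FULL state" failure comes from the test framework polling the client import state after recovery; on a live system this state is readable via `lctl get_param mdc.*.import`. A minimal sketch of that check, with the import output inlined as sample text since no Lustre cluster is assumed here (`import_state` is a hypothetical helper, and the sample values are illustrative, not from this run):

      ```shell
      # Extract the "state:" field from lctl-get_param-style import output.
      # During a hung recovery the state stays in e.g. REPLAY_WAIT instead
      # of reaching FULL.
      import_state() {
          awk '/^[[:space:]]*state:/ { print $2; exit }'
      }

      sample='import:
          name: lustre-MDT0000-mdc-ffff88007a000000
          state: REPLAY_WAIT
          connect_flags: [ ... ]'

      printf '%s\n' "$sample" | import_state
      # prints: REPLAY_WAIT
      ```

      The test loop retries this check until the state reads FULL or a timeout expires; here it timed out because the MDS recovery thread was stuck in D state.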

      MDS console:

      Lustre: DEBUG MARKER: /usr/sbin/lctl mark == replay-dual test 10: resending a replayed unlink == 04:24:27 \(1415708667\)
      Lustre: DEBUG MARKER: == replay-dual test 10: resending a replayed unlink == 04:24:27 (1415708667)
      Lustre: DEBUG MARKER: sync; sync; sync
      Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0000 notransno
      Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0000 readonly
      LustreError: 20635:0:(osd_handler.c:1402:osd_ro()) *** setting lustre-MDT0000 read-only ***
      Turning device dm-0 (0xfd00000) read-only
      Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 REPLAY BARRIER on lustre-MDT0000
      Lustre: DEBUG MARKER: mds1 REPLAY BARRIER on lustre-MDT0000
      Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000119
      Lustre: DEBUG MARKER: grep -c /mnt/mds1' ' /proc/mounts
      Lustre: DEBUG MARKER: umount -d /mnt/mds1
      Lustre: Failing over lustre-MDT0000
      Lustre: Skipped 1 previous similar message
      LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.2.4.100@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      LustreError: Skipped 16 previous similar messages
      Lustre: 20823:0:(client.c:1947:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1415708669/real 1415708669]  req@ffff88007b2d3800 x1484473792234212/t0(0) o251->MGC10.2.4.99@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1415708675 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
      Lustre: 20823:0:(client.c:1947:ptlrpc_expire_one_request()) Skipped 1 previous similar message
      Removing read-only on unknown block (0xfd00000)
      Lustre: server umount lustre-MDT0000 complete
      Lustre: Skipped 1 previous similar message
      Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
      Lustre: DEBUG MARKER: hostname
      Lustre: DEBUG MARKER: test -b /dev/lvm-Role_MDS/P1
      Lustre: DEBUG MARKER: mkdir -p /mnt/mds1; mount -t lustre   		                   /dev/lvm-Role_MDS/P1 /mnt/mds1
      LDISKFS-fs (dm-0): recovery complete
      LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. quota=on. Opts: 
      LustreError: 11-0: lustre-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation mds_connect failed with -11.
      LustreError: Skipped 1 previous similar message
      Lustre: lustre-MDT0000: Will be in recovery for at least 1:00, or until 4 clients reconnect
      Lustre: Skipped 1 previous similar message
      Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u
      Lustre: DEBUG MARKER: e2label /dev/lvm-Role_MDS/P1 2>/dev/null
      Lustre: *** cfs_fail_loc=119, val=2147483648***
      LustreError: 21089:0:(ldlm_lib.c:2389:target_send_reply_msg()) @@@ dropping reply  req@ffff880064eee400 x1484473741963360/t47244640260(47244640260) o36->250c9384-2350-36d9-cb7a-d96794f57fdd@10.2.4.94@tcp:0/0 lens 520/448 e 0 to 0 dl 1415709342 ref 1 fl Complete:/4/0 rc 0/0
      Lustre: lustre-MDT0000: Denying connection for new client lustre-MDT0000-lwp-OST0001_UUID (at 10.2.4.100@tcp), waiting for all 4 known clients (0 recovered, 4 in progress, and 0 evicted) to recover in 10:21
      Lustre: Skipped 771 previous similar messages
      INFO: task tgt_recov:21089 blocked for more than 120 seconds.
            Not tainted 2.6.32-431.29.2.el6_lustre.x86_64 #1
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      tgt_recov     D 0000000000000000     0 21089      2 0x00000080
       ffff88006987dda0 0000000000000046 ffff88006987dd00 ffff88007b8aff67
       ffffc90001fb7370 ffff88006921153c 0000000000000004 ffff8800692111b8
       ffff880055108638 ffff88006987dfd8 000000000000fbc8 ffff880055108638
      Call Trace:
       [<ffffffffa07e0f10>] ? check_for_next_transno+0x0/0x590 [ptlrpc]
       [<ffffffffa07ddf4d>] target_recovery_overseer+0x9d/0x230 [ptlrpc]
       [<ffffffffa07dc630>] ? exp_req_replay_healthy+0x0/0x30 [ptlrpc]
       [<ffffffff8109afa0>] ? autoremove_wake_function+0x0/0x40
       [<ffffffffa07e521e>] target_recovery_thread+0x9ae/0x1a10 [ptlrpc]
       [<ffffffff81061d12>] ? default_wake_function+0x12/0x20
       [<ffffffffa07e4870>] ? target_recovery_thread+0x0/0x1a10 [ptlrpc]
       [<ffffffff8109abf6>] kthread+0x96/0xa0
       [<ffffffff8100c20a>] child_rip+0xa/0x20
       [<ffffffff8109ab60>] ? kthread+0x0/0xa0
       [<ffffffff8100c200>] ? child_rip+0x0/0x20
      INFO: task tgt_recov:21089 blocked for more than 120 seconds.
            Not tainted 2.6.32-431.29.2.el6_lustre.x86_64 #1
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      

      Info required for matching: replay-dual 10


      People

        Assignee: Mikhail Pershin (tappro)
        Reporter: Maloo (maloo)
        Votes: 0
        Watchers: 3
