  Lustre / LU-11336

replay-single test 80d hangs on MDT unmount


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.0
    • Environment: DNE/ZFS
    • Severity: 3
    • Rank: 9223372036854775807

    Description

      replay-single test_80d hangs on MDT unmount. Looking at a recent hang at https://testing.whamcloud.com/test_sets/ffb8585c-a706-11e8-80f7-52540065bddc, the last thing we see in the test log is

      CMD: trevis-44vm10 /usr/sbin/lctl --device lustre-MDT0001 notransno
      CMD: trevis-44vm10 /usr/sbin/lctl --device lustre-MDT0001 readonly
      CMD: trevis-44vm10 /usr/sbin/lctl mark mds2 REPLAY BARRIER on lustre-MDT0001
      Failing mds1 on trevis-44vm9
      CMD: trevis-44vm9 grep -c /mnt/lustre-mds1' ' /proc/mounts || true
      Stopping /mnt/lustre-mds1 (opts:) on trevis-44vm9
      CMD: trevis-44vm9 umount -d /mnt/lustre-mds1
      CMD: trevis-44vm9 lsmod | grep lnet > /dev/null &&
      lctl dl | grep ' ST ' || true
      CMD: trevis-44vm9 ! zpool list -H lustre-mdt1 >/dev/null 2>&1 ||
      			grep -q ^lustre-mdt1/ /proc/mounts ||
      			zpool export  lustre-mdt1
      Failing mds2 on trevis-44vm10
      CMD: trevis-44vm10 grep -c /mnt/lustre-mds2' ' /proc/mounts || true
      Stopping /mnt/lustre-mds2 (opts:) on trevis-44vm10
      CMD: trevis-44vm10 umount -d /mnt/lustre-mds2
      

      Looking at the console log for the MDS (vm10), we see some errors and a call trace:

      [ 5775.673375] Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0001 notransno
      [ 5776.031453] Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0001 readonly
      [ 5776.390963] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds2 REPLAY BARRIER on lustre-MDT0001
      [ 5776.569652] Lustre: DEBUG MARKER: mds2 REPLAY BARRIER on lustre-MDT0001
      [ 5778.383410] Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds2' ' /proc/mounts || true
      [ 5778.731629] Lustre: DEBUG MARKER: umount -d /mnt/lustre-mds2
      [ 5778.917429] LustreError: 8724:0:(ldlm_resource.c:1146:ldlm_resource_complain()) lustre-MDT0000-osp-MDT0001: namespace resource [0x20000abf8:0xa:0x0].0x0 (ffff96bab691e6c0) refcount nonzero (1) after lock cleanup; forcing cleanup.
      [ 5778.922402] LustreError: 8724:0:(ldlm_resource.c:1146:ldlm_resource_complain()) lustre-MDT0000-osp-MDT0001: namespace resource [0x20000abf8:0xa:0x0].0x0 (ffff96bab691e6c0) refcount nonzero (1) after lock cleanup; forcing cleanup.
      [ 5783.926373] LustreError: 0-0: Forced cleanup waiting for lustre-MDT0000-osp-MDT0001 namespace with 1 resources in use, (rc=-110)
      [ 5830.783393] LNet: Service thread pid 27370 was inactive for 62.11s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [ 5830.785144] Pid: 27370, comm: mdt00_000 3.10.0-862.9.1.el7_lustre.x86_64 #1 SMP Fri Aug 17 20:37:05 UTC 2018
      [ 5830.786119] Call Trace:
      [ 5830.786419]  [<ffffffffc0f14dca>] top_trans_wait_result+0xa6/0x155 [ptlrpc]
      [ 5830.787317]  [<ffffffffc0ef6c0b>] top_trans_stop+0x42b/0x930 [ptlrpc]
      [ 5830.788062]  [<ffffffffc131d80c>] lod_trans_stop+0x25c/0x340 [lod]
      [ 5830.788958]  [<ffffffffc13ca9ae>] mdd_trans_stop+0x2e/0x174 [mdd]
      [ 5830.789654]  [<ffffffffc13af627>] mdd_create+0x967/0x13f0 [mdd]
      [ 5830.790307]  [<ffffffffc124cc80>] mdt_create+0x820/0xc40 [mdt]
      [ 5830.791044]  [<ffffffffc124d20b>] mdt_reint_create+0x16b/0x350 [mdt]
      [ 5830.791709]  [<ffffffffc124ece3>] mdt_reint_rec+0x83/0x210 [mdt]
      [ 5830.792402]  [<ffffffffc122e1d2>] mdt_reint_internal+0x6b2/0xa80 [mdt]
      [ 5830.793112]  [<ffffffffc12391e7>] mdt_reint+0x67/0x140 [mdt]
      [ 5830.793705]  [<ffffffffc0ee627a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      [ 5830.794488]  [<ffffffffc0e8940b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [ 5830.795329]  [<ffffffffc0e8cc44>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc]
      [ 5830.796151]  [<ffffffffb88bb621>] kthread+0xd1/0xe0
      [ 5830.796694]  [<ffffffffb8f205f7>] ret_from_fork_nospec_end+0x0/0x39
      [ 5830.797409]  [<ffffffffffffffff>] 0xffffffffffffffff
      [ 5830.798027] LustreError: dumping log to /tmp/lustre-log.1535036577.27370
      [ 9434.653754] SysRq : Changing Loglevel
      [ 9434.654263] Loglevel set to 8
      

      This test only hangs in this way for DNE/ZFS testing.
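
      For anyone trying to reproduce this locally, here is a minimal sketch using the standard lustre/tests framework; the node and device configuration (e.g. cfg/local.sh) is assumed, and only the DNE and ZFS settings come from this report's environment:

      # Hedged reproduction sketch: run replay-single test 80d with two MDTs (DNE)
      # on a ZFS backend. The script path is the usual install location and is an
      # assumption, not taken from the report.
      cd /usr/lib64/lustre/tests
      MDSCOUNT=2 FSTYPE=zfs ONLY=80d bash replay-single.sh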

      It’s not easy to pinpoint when this issue started because of issues with the kernel call trace, but this test did not fail or hang in July 2018, and the first signs of replay-single test 80d hanging appear on August 6, 2018.

      Here are links to logs for some hangs:
      https://testing.whamcloud.com/test_sets/06129818-af39-11e8-bd05-52540065bddc
      https://testing.whamcloud.com/test_sets/0339a0e2-ae2a-11e8-bfed-52540065bddc
      https://testing.whamcloud.com/test_sets/021c99b8-ad2f-11e8-bfed-52540065bddc
      https://testing.whamcloud.com/test_sets/fd1d1dd8-abe0-11e8-80f7-52540065bddc
      https://testing.whamcloud.com/test_sets/057c01e4-9f6e-11e8-b0aa-52540065bddc
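
      From the console trace above, the mdt00_000 service thread appears to be stuck in top_trans_wait_result(), i.e. a distributed (cross-MDT) transaction on MDT0001 waiting for its remote update to lustre-MDT0000 (via lustre-MDT0000-osp-MDT0001), which had already been stopped as the first half of the failover, so the unmount of MDT0001 cannot finish. If someone can catch a node in this state, a sketch of what might be worth collecting on the hung MDS (the PID is from this particular instance; file names are only examples):

      # Dump the in-memory Lustre debug buffer for attachment to this ticket.
      lctl dk > /tmp/lustre-debug.$(date +%s).log
      # Dump all task stacks to the console log (requires sysrq to be enabled).
      echo t > /proc/sysrq-trigger
      # Kernel stack of the stuck mdt00_000 service thread reported above.
      cat /proc/27370/stack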


            People

              Assignee: WC Triage
              Reporter: James Nunez (Inactive)
              Votes: 0
              Watchers: 1
