Lustre / LU-11249

recovery-small test 29a hangs on unmount/mount


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.0
    • Labels: None
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      recovery-small test_29a times out. Logs for the failed test session are at https://testing.whamcloud.com/test_sets/23dc8976-9d91-11e8-87f3-52540065bddc.
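
      For reference, here is a minimal sketch of how the failing subtest can be rerun standalone with the Lustre test framework; the install path and environment setup are assumptions, not details from the failed session:

      # Sketch: rerun only test_29a of recovery-small against an already
      # configured test cluster (path assumes the lustre-tests RPM layout).
      cd /usr/lib64/lustre/tests      # or lustre/tests in a source tree
      ONLY=29a bash recovery-small.sh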

      The MDS console log shows issues that are not seen when this test succeeds. First, there are errors when we try to unmount the MDT:

      [ 1737.267177] Lustre: DEBUG MARKER: == recovery-small test 29a: error adding new clients doesn't cause LBUG (bug 22273) ================== 00:41:46 (1533948106)
      [ 1737.442848] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000711
      [ 1737.752721] Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds1' ' /proc/mounts || true
      [ 1738.058674] Lustre: DEBUG MARKER: umount -d /mnt/lustre-mds1
      [ 1738.864698] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.9.6.11@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      [ 1738.864761] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.9.6.11@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      [ 1738.864891] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.9.6.11@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      [ 1738.865011] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.9.6.11@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      [ 1738.865139] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.9.6.11@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      [ 1738.865258] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.9.6.11@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      [ 1738.865375] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.9.6.11@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      [ 1739.492348] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.9.6.9@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      [ 1742.857182] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.9.6.10@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      [ 1743.871557] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.9.6.11@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      [ 1743.871720] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.9.6.11@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      [ 1743.871855] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.9.6.11@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      [ 1743.871995] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.9.6.11@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      [ 1743.872116] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.9.6.11@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      [ 1743.872235] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.9.6.11@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      [ 1743.872353] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.9.6.11@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      [ 1744.644124] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
      [ 1744.644124] lctl dl | grep ' ST ' || true
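
      The DEBUG MARKER lines above show the steps the test runs on the MDS before the hang: arm the "error adding new clients" failure injection and unmount the MDT. As a sketch, using the mount point from this session (taken from the markers, not from the test script itself):

      # Arm the failure injection for "error adding new clients" (fail_loc 0x80000711),
      # check whether the MDT is still mounted, then unmount it.
      lctl set_param fail_loc=0x80000711
      grep -c '/mnt/lustre-mds1 ' /proc/mounts || true
      umount -d /mnt/lustre-mds1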
      

      In the same log, we see problems when we try to mount the MDT again:

      [ 1747.118280] Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-mds1; mount -t lustre  -o abort_recovery /dev/mapper/mds1_flakey /mnt/lustre-mds1
      [ 1747.292237] LDISKFS-fs (dm-3): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
      [ 1747.529282] LustreError: 14742:0:(mdt_handler.c:6403:mdt_iocontrol()) lustre-MDT0000: Aborting recovery for device
      [ 1747.530362] LustreError: 14742:0:(ldlm_lib.c:2593:target_stop_recovery_thread()) lustre-MDT0000: Aborting recovery
      [ 1747.531991] Lustre: 14814:0:(ldlm_lib.c:2046:target_recovery_overseer()) recovery is aborted, evict exports in recovery
      [ 1747.533049] Lustre: lustre-MDT0000: disconnecting 2 stale clients
      [ 1747.533989] Lustre: 14814:0:(ldlm_lib.c:2046:target_recovery_overseer()) recovery is aborted, evict exports in recovery
      [ 1747.535041] Lustre: 14814:0:(ldlm_lib.c:2046:target_recovery_overseer()) recovery is aborted, evict exports in recovery
      [ 1747.773274] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n health_check
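
      The remount step from the markers above, as a sketch; the flakey device-mapper target and mount point are the names used by this test session:

      # Remount the MDT with recovery aborted; the test framework then polls
      # server health, which is the last marker in the log excerpt above.
      mkdir -p /mnt/lustre-mds1
      mount -t lustre -o abort_recovery /dev/mapper/mds1_flakey /mnt/lustre-mds1
      lctl get_param -n health_check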
      

      In the OSS console log, we see quota reintegration errors:

      [ 1744.759406] Lustre: DEBUG MARKER: == rpc test complete, duration -o sec ================================================================ 00:41:58 (1533948118)
      [ 1745.083373] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-6vm12.trevis.whamcloud.com: executing set_default_debug -1 all 4
      [ 1745.269671] Lustre: DEBUG MARKER: trevis-6vm12.trevis.whamcloud.com: executing set_default_debug -1 all 4
      [ 1761.394382] LustreError: 11361:0:(qsd_reint.c:56:qsd_reint_completion()) lustre-OST0006: failed to enqueue global quota lock, glb fid:[0x200000006:0x20000:0x0], rc:-5
      [ 1761.394471] LustreError: 11362:0:(qsd_reint.c:56:qsd_reint_completion()) lustre-OST0006: failed to enqueue global quota lock, glb fid:[0x200000006:0x1020000:0x0], rc:-5
      [ 1761.394748] LustreError: 11352:0:(qsd_reint.c:56:qsd_reint_completion()) lustre-OST0005: failed to enqueue global quota lock, glb fid:[0x200000006:0x20000:0x0], rc:-5
      [ 1761.394972] LustreError: 11353:0:(qsd_reint.c:56:qsd_reint_completion()) lustre-OST0005: failed to enqueue global quota lock, glb fid:[0x200000006:0x1020000:0x0], rc:-5
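
      The rc:-5 (-EIO) messages mean the OSTs failed to re-enqueue their global quota locks against the restarted MDT during quota reintegration. One way to inspect the quota slave state on the OSS (an assumption: the quota_slave.info parameter with the osd-ldiskfs prefix applies to this ldiskfs-backed setup):

      # Show quota slave reintegration/connection state for the affected OSTs.
      lctl get_param osd-ldiskfs.lustre-OST0005.quota_slave.info
      lctl get_param osd-ldiskfs.lustre-OST0006.quota_slave.info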
      

      recovery-small test 29a has timed out several times in the past month, but those logs do not contain the errors listed above.


          People

            Assignee: WC Triage (wc-triage)
            Reporter: James Nunez (Inactive) (jamesanunez)
            Votes: 0
            Watchers: 1

            Dates

              Created:
              Updated: