  1. Lustre
  2. LU-7256

sanity-lfsck TIMEOUT on umount /mnt/mds4

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: Lustre 2.8.0
    • Fix Version/s: Lustre 2.9.0
    • Labels:
      None
    • Environment:
      review-dne-part-2 in autotest
    • Severity:
      3

      Description

      After all the tests in sanity-lfsck run, the suite hangs at umount for /mnt/mds4. All subtests are marked as PASS. Logs are at https://testing.hpdd.intel.com/test_sets/fb298f72-6a05-11e5-9d0a-5254006e85c2

      The last lines in the suite_stdout log are:

      13:48:04:CMD: shadow-23vm8 grep -c /mnt/mds4' ' /proc/mounts
      13:48:04:Stopping /mnt/mds4 (opts:-f) on shadow-23vm8
      13:48:04:CMD: shadow-23vm8 umount -d -f /mnt/mds4
      14:47:53:********** Timeout by autotest system **********
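
To put a number on the hang, here is a minimal sketch that subtracts the two timestamps in the excerpt above. It assumes both lines were logged on the same day and by the same clock; the variable names are illustrative only.

```python
from datetime import datetime

# Timestamps copied from the suite_stdout excerpt (same day and clock assumed).
umount_issued = datetime.strptime("13:48:04", "%H:%M:%S")
autotest_timeout = datetime.strptime("14:47:53", "%H:%M:%S")

# Time the umount sat without returning before autotest gave up.
hang = autotest_timeout - umount_issued
print(hang, "=", int(hang.total_seconds()), "seconds")
```

So the umount of /mnt/mds4 was pending for just under an hour before autotest killed the run.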
      

      The logs for the last test run, sanity-lfsck test_31h, show that something is not functioning correctly, although what went wrong is not clear to me. The MDS2, MDS3, and MDS4 console shows more problems communicating between the MDTs than is normal for this test:

      14:23:44:Lustre: lustre-MDT0003: Not available for connect from 10.1.5.32@tcp (stopping)
      14:23:44:Lustre: Skipped 226 previous similar messages
      14:23:44:LustreError: 137-5: lustre-MDT0001_UUID: not available for connect from 10.1.5.32@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      14:23:44:LustreError: Skipped 585 previous similar messages
      14:23:44:Lustre: 4063:0:(client.c:2092:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1443880351/real 1443880351]  req@ffff880058aa2080 x1514012107003420/t0(0) o250->MGC10.1.5.33@tcp@10.1.5.33@tcp:26/25 lens 520/544 e 0 to 1 dl 1443880376 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      14:23:44:Lustre: 4063:0:(client.c:2092:ptlrpc_expire_one_request()) Skipped 50 previous similar messages
      14:23:44:Lustre: lustre-MDT0003: Not available for connect from 10.1.5.32@tcp (stopping)
      14:23:44:Lustre: Skipped 475 previous similar messages
      14:23:44:LustreError: 137-5: lustre-MDT0001_UUID: not available for connect from 10.1.5.32@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      14:23:44:LustreError: Skipped 1191 previous similar messages
      14:23:44:Lustre: 4063:0:(client.c:2092:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1443880961/real 1443880961]  req@ffff880058a93cc0 x1514012107006512/t0(0) o38->lustre-MDT0000-osp-MDT0003@10.1.5.33@tcp:24/4 lens 520/544 e 0 to 1 dl 1443880986 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      14:23:44:Lustre: 4063:0:(client.c:2092:ptlrpc_expire_one_request()) Skipped 40 previous similar messages
      14:23:44:Lustre: lustre-MDT0003: Not available for connect from 10.1.5.32@tcp (stopping)
      14:23:44:Lustre: Skipped 475 previous similar messages
      14:23:44:LustreError: 137-5: lustre-MDT0001_UUID: not available for connect from 10.1.5.32@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      14:23:44:LustreError: Skipped 1191 previous similar messages
      14:23:44:Lustre: 4063:0:(client.c:2092:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1443881581/real 1443881581]  req@ffff88007cb1fc80 x1514012107009652/t0(0) o250->MGC10.1.5.33@tcp@10.1.5.33@tcp:26/25 lens 520/544 e 0 to 1 dl 1443881606 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      14:23:44:Lustre: 4063:0:(client.c:2092:ptlrpc_expire_one_request()) Skipped 40 previous similar messages
      14:23:44:Lustre: lustre-MDT0003: Not available for connect from 10.1.5.32@tcp (stopping)
      14:23:44:Lustre: Skipped 475 previous similar messages
      14:23:44:LustreError: 137-5: lustre-MDT0001_UUID: not available for connect from 10.1.5.32@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      14:23:44:LustreError: Skipped 1189 previous similar messages
      14:23:44:Lustre: 4063:0:(client.c:2092:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1443882191/real 1443882191]  req@ffff88006705b980 x1514012107012736/t0(0) o38->lustre-MDT0000-osp-MDT0003@10.1.5.33@tcp:24/4 lens 520/544 e 0 to 1 dl 1443882216 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
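
The timed-out requests in this excerpt can be summarized with a short script. This is only a sketch: the four request lines are copied from the log above and trimmed to the fields being parsed, the regular expression is written against this one message layout, and the opcode names in `OPCODES` are my assumption about how opcodes 38 and 250 map to Lustre connect RPCs.

```python
import re

# Request lines copied from the console excerpt, trimmed to the fields parsed below.
LOG = """\
[sent 1443880351/real 1443880351]  req@ffff880058aa2080 x1514012107003420/t0(0) o250->MGC10.1.5.33@tcp@10.1.5.33@tcp:26/25 lens 520/544 e 0 to 1 dl 1443880376 ref 1
[sent 1443880961/real 1443880961]  req@ffff880058a93cc0 x1514012107006512/t0(0) o38->lustre-MDT0000-osp-MDT0003@10.1.5.33@tcp:24/4 lens 520/544 e 0 to 1 dl 1443880986 ref 1
[sent 1443881581/real 1443881581]  req@ffff88007cb1fc80 x1514012107009652/t0(0) o250->MGC10.1.5.33@tcp@10.1.5.33@tcp:26/25 lens 520/544 e 0 to 1 dl 1443881606 ref 1
[sent 1443882191/real 1443882191]  req@ffff88006705b980 x1514012107012736/t0(0) o38->lustre-MDT0000-osp-MDT0003@10.1.5.33@tcp:24/4 lens 520/544 e 0 to 1 dl 1443882216 ref 1
"""

# Assumed opcode names; not taken from the ticket itself.
OPCODES = {38: "MDS_CONNECT", 250: "MGS_CONNECT"}

# Captures: sent time, opcode, target (import name plus NID), deadline.
REQ = re.compile(r"\[sent (\d+)/real \d+\].* o(\d+)->(\S+) lens .* dl (\d+)")

def parse_requests(text):
    """Return (sent, opcode, target, deadline) for every matching request line."""
    rows = []
    for line in text.splitlines():
        m = REQ.search(line)
        if m:
            sent, opc, target, deadline = m.groups()
            rows.append((int(sent), int(opc), target, int(deadline)))
    return rows

requests = parse_requests(LOG)
for sent, opc, target, deadline in requests:
    print(OPCODES.get(opc, opc), target, "deadline", deadline - sent, "s after send")

# Seconds between consecutive timeouts that actually reached the console.
gaps = [b[0] - a[0] for a, b in zip(requests, requests[1:])]
print("gaps between logged timeouts:", gaps)
```

On these four lines every request carries a 25-second deadline, and the targets alternate between the MGC and the lustre-MDT0000 OSP import, both on 10.1.5.33@tcp. The roughly ten-minute gaps are between messages that reached the console; the "Skipped N previous similar messages" lines indicate many more timeouts occurred in between, so the gaps should not be read as the retry interval.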
      

      There are also stack traces in the console logs for all nodes for test 31h, but there are no Lustre function calls anywhere close to the top of the stack traces.

      There have been three recent occurrences of this issue, with logs at:
      https://testing.hpdd.intel.com/test_sets/42501ed6-69be-11e5-9fbf-5254006e85c2
      https://testing.hpdd.intel.com/test_sets/60742a32-6a19-11e5-9fbf-5254006e85c2
      https://testing.hpdd.intel.com/test_sets/fb298f72-6a05-11e5-9d0a-5254006e85c2
      all on 2015-10-03 in review-dne-part-2.

              People

              • Assignee:
                yong.fan nasf (Inactive)
                Reporter:
                jamesanunez James Nunez