Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.12.0
    • Lustre 2.12.0
    • None
    • 3

    Description

      This issue was created by maloo for bobijam <bobijam@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/5c95f0b2-8186-11e8-b441-52540065bddc

      test_115 failed with the following error:

      Timeout occurred after 216 mins, last suite running was replay-single, restarting cluster to continue tests
      

      The MDS dmesg keeps showing the following error messages during several tests, and the tests take far too long to complete.

      [ 2545.541360] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server.
      [ 2545.571570] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.9.5.210@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      [ 2545.618732] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.9.5.212@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      [ 2545.618926] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.9.5.212@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      [ 2545.619112] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.9.5.212@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      ...
      

      Another hit occurred at https://testing.whamcloud.com/test_sets/08372d04-8188-11e8-97ff-52540065bddc

      test_80c 'Timeout occurred after 159 mins, last suite running was replay-single, restarting cluster to continue tests' 
      

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      replay-single test_115 - Timeout occurred after 216 mins, last suite running was replay-single, restarting cluster to continue tests
       

          Activity

            [LU-11128] replay-single test timeout
            pjones Peter Jones added a comment -

            Landed for 2.12


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32980/
            Subject: LU-11128 ptlrpc: new request vs disconnect race
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 93d20d171c20491a96e5e85d7442a002f300619d

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33168/
            Subject: LU-11128 ptlrpc: add debugging for idle connections
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 0aa58d26f5df2b71a040ed6f0f419b925528b6ad

            adilger Andreas Dilger added a comment -

            In that case, can you submit a patch to increase the default idle_timeout value and set it lower in the test config for testing?

            bzzz Alex Zhuravlev added a comment -

            Andreas, I'm fine with changing the defaults, and yes, one of the reasons to keep it short is to hit the code more frequently.

            A ping reply is not counted:

            if (lustre_msg_get_opc(req->rq_reqmsg) != OBD_PING)
                req->rq_import->imp_last_reply_time = ktime_get_real_seconds();

            The idle check is then:

            if (now - imp->imp_last_reply_time < imp->imp_idle_timeout)
                return false;
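The two fragments quoted in the comment above can be read together as a small timestamp-based idle test. The following is a minimal standalone sketch of that idea, not the Lustre implementation: the `sketch_*` names, the opcode enum, and passing `now` as a parameter are all assumptions made for illustration. Only the two conditions themselves come from the comment. The real function presumably applies further checks after this one, which the comment does not show.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical opcodes for this sketch only; the real values live in Lustre headers. */
enum sketch_opc {
	SKETCH_OPC_PING = 1,
	SKETCH_OPC_OTHER = 2,
};

/* Hypothetical stand-in for the two import fields named in the comment. */
struct sketch_import {
	int64_t last_reply_time; /* seconds, mirrors imp_last_reply_time */
	int64_t idle_timeout;    /* seconds, mirrors imp_idle_timeout */
};

/* Record a reply. Ping replies are ignored so that keepalive traffic
 * alone never makes a connection look busy. */
static void sketch_note_reply(struct sketch_import *imp,
			      enum sketch_opc opc, int64_t now)
{
	if (opc != SKETCH_OPC_PING)
		imp->last_reply_time = now;
}

/* Return false while the last non-ping reply is more recent than the
 * idle timeout. Returning true here means only that this one check
 * passed (an assumption of the sketch, not a full idle decision). */
static bool sketch_may_be_idle(const struct sketch_import *imp, int64_t now)
{
	if (now - imp->last_reply_time < imp->idle_timeout)
		return false;
	return true;
}
```

With a 20 second timeout and a last non-ping reply at t=100, the sketch reports "not idle" up to t=119 and "possibly idle" from t=120 on. A ping reply in between does not move the timestamp. This also illustrates why a short timeout exercises the disconnect path more often during tests.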

            gerrit Gerrit Updater added a comment -

            Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33168
            Subject: LU-11128 ptlrpc: add debugging for idle connections
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 216c9c7cbcd38fa56ee2240c8a13066ad66b3f77

            People

              bzzz Alex Zhuravlev
              maloo Maloo