Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.12.0
    • Lustre 2.12.0
    • None
    • 3

    Description

      This issue was created by maloo for bobijam <bobijam@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/5c95f0b2-8186-11e8-b441-52540065bddc

      test_115 failed with the following error:

      Timeout occurred after 216 mins, last suite running was replay-single, restarting cluster to continue tests
      

      The MDS dmesg keeps showing the following error messages during several tests, and the tests take too long to complete.

      [ 2545.541360] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server.
      [ 2545.571570] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.9.5.210@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      [ 2545.618732] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.9.5.212@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      [ 2545.618926] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.9.5.212@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      [ 2545.619112] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.9.5.212@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      ...
      

      Another hit occurred at https://testing.whamcloud.com/test_sets/08372d04-8188-11e8-97ff-52540065bddc

      test_80c 'Timeout occurred after 159 mins, last suite running was replay-single, restarting cluster to continue tests' 
      

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      replay-single test_115 - Timeout occurred after 216 mins, last suite running was replay-single, restarting cluster to continue tests
       

          Activity

            [LU-11128] replay-single test timeout
            pjones Peter Jones made changes -
            Link New: This issue is related to HP-239 [ HP-239 ]
            pjones Peter Jones made changes -
            Link New: This issue is related to LU-11269 [ LU-11269 ]
            pjones Peter Jones made changes -
            Resolution New: Fixed [ 1 ]
            Status Original: Open [ 1 ] New: Resolved [ 5 ]
            pjones Peter Jones added a comment -

            Landed for 2.12


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32980/
            Subject: LU-11128 ptlrpc: new request vs disconnect race
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 93d20d171c20491a96e5e85d7442a002f300619d

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33168/
            Subject: LU-11128 ptlrpc: add debugging for idle connections
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 0aa58d26f5df2b71a040ed6f0f419b925528b6ad
            adilger Andreas Dilger made changes -
            Link New: This issue is related to LU-11405 [ LU-11405 ]

            adilger Andreas Dilger added a comment -

            In that case, can you submit a patch to increase the default idle_timeout value and set it lower in the test config for testing?
            adilger Andreas Dilger made changes -
            Link New: This issue is duplicated by LU-11183 [ LU-11183 ]

            bzzz Alex Zhuravlev added a comment -

            Andreas, I'm fine with changing the defaults, and yes, one of the reasons to keep it short is to hit the code more frequently.

            A ping reply is not counted:

            if (lustre_msg_get_opc(req->rq_reqmsg) != OBD_PING)
                req->rq_import->imp_last_reply_time = ktime_get_real_seconds();

            The idle check is then:

            if (now - imp->imp_last_reply_time < imp->imp_idle_timeout)
                return false;
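To make the interaction between the two quoted snippets concrete, here is a minimal standalone sketch in C. It is not the Lustre implementation: the `sketch_` types and functions are hypothetical stand-ins, only the two fields named in the comment (`imp_last_reply_time`, `imp_idle_timeout`) are modelled, and the current time is passed in explicitly instead of being read with `ktime_get_real_seconds()`.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-in for the request opcode; only "ping" versus
 * "anything else" matters for this sketch. */
enum sketch_opcode { SKETCH_OP_PING, SKETCH_OP_OTHER };

/* Hypothetical stand-in for the import; field names follow the quoted
 * snippets, everything else about the real structure is omitted. */
struct sketch_import {
	time_t imp_last_reply_time; /* time of the last non-ping reply, seconds */
	time_t imp_idle_timeout;    /* idle threshold, seconds */
};

/* Record a reply. Ping replies are deliberately ignored, so keepalive
 * traffic never resets the idle clock. */
static void sketch_note_reply(struct sketch_import *imp,
			      enum sketch_opcode opc, time_t now)
{
	if (opc != SKETCH_OP_PING)
		imp->imp_last_reply_time = now;
}

/* Same comparison as the quoted check: the import is not idle while the
 * last non-ping reply is more recent than the timeout. */
static bool sketch_import_is_idle(const struct sketch_import *imp, time_t now)
{
	if (now - imp->imp_last_reply_time < imp->imp_idle_timeout)
		return false;
	return true;
}
```

With a 20-second timeout, a non-ping reply at t=100 followed by a ping reply at t=115 leaves the import non-idle at t=110 but idle at t=125, because the ping did not refresh `imp_last_reply_time`. A shorter timeout makes this path trigger more often, which is the testing trade-off discussed in the comments.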

            People

              bzzz Alex Zhuravlev
              maloo Maloo
              Votes: 0
              Watchers: 5
