Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.12.0
    • Lustre 2.12.0
    • None
    • 3
    • 9223372036854775807

    Description

      This issue was created by maloo for bobijam <bobijam@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/5c95f0b2-8186-11e8-b441-52540065bddc

      test_115 failed with the following error:

      Timeout occurred after 216 mins, last suite running was replay-single, restarting cluster to continue tests
      

      The MDS dmesg keeps showing the following error messages during several tests, and the tests take too long.

      [ 2545.541360] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server.
      [ 2545.571570] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.9.5.210@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      [ 2545.618732] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.9.5.212@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      [ 2545.618926] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.9.5.212@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      [ 2545.619112] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.9.5.212@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      ...
      

      Another hit occurred at https://testing.whamcloud.com/test_sets/08372d04-8188-11e8-97ff-52540065bddc

      test_80c 'Timeout occurred after 159 mins, last suite running was replay-single, restarting cluster to continue tests' 
      

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      replay-single test_115 - Timeout occurred after 216 mins, last suite running was replay-single, restarting cluster to continue tests
       

          Activity

            [LU-11128] replay-single test timeout
            pjones Peter Jones added a comment -

            Landed for 2.12


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32980/
            Subject: LU-11128 ptlrpc: new request vs disconnect race
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 93d20d171c20491a96e5e85d7442a002f300619d

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33168/
            Subject: LU-11128 ptlrpc: add debugging for idle connections
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 0aa58d26f5df2b71a040ed6f0f419b925528b6ad

            adilger Andreas Dilger added a comment -

            In that case, can you submit a patch to increase the default idle_timeout value, and set it lower in the test config for testing?

            bzzz Alex Zhuravlev added a comment -

            Andreas, I'm fine with changing the defaults, and yes, one of the reasons to keep it short is to hit the code more frequently.

            A ping reply is not counted:

            if (lustre_msg_get_opc(req->rq_reqmsg) != OBD_PING)
                req->rq_import->imp_last_reply_time = ktime_get_real_seconds();

            The idle check is then:

            if (now - imp->imp_last_reply_time < imp->imp_idle_timeout)
                return false;
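The two fragments quoted in the comment above can be modelled in a small standalone sketch. This is not the ptlrpc code: the struct, enum, and function names are hypothetical stand-ins, times are plain integer seconds, and the real idle check may well test further conditions that are not shown in this ticket. It only illustrates why OBD_PING traffic does not keep a connection from being considered idle.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the Lustre opcode and import types. */
enum sketch_opc { SKETCH_OPC_PING, SKETCH_OPC_OTHER };

struct sketch_import {
    long last_reply_time; /* seconds; time of the last non-ping reply */
    long idle_timeout;    /* seconds; must be positive in this sketch */
};

/* Record a reply. Ping replies deliberately leave the timestamp alone. */
void sketch_note_reply(struct sketch_import *imp, enum sketch_opc opc, long now)
{
    if (opc != SKETCH_OPC_PING)
        imp->last_reply_time = now;
}

/* True once no non-ping reply has arrived for at least idle_timeout seconds. */
bool sketch_is_idle(const struct sketch_import *imp, long now)
{
    assert(imp->idle_timeout > 0);
    if (now - imp->last_reply_time < imp->idle_timeout)
        return false;
    return true;
}
```

With a 20s timeout and a last real reply at t=100, a ping reply at t=115 does not move the timestamp, so the import is still reported idle at t=120.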

            gerrit Gerrit Updater added a comment -

            Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33168
            Subject: LU-11128 ptlrpc: add debugging for idle connections
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 216c9c7cbcd38fa56ee2240c8a13066ad66b3f77

            adilger Andreas Dilger added a comment -

            I wonder if 20s is too short by default? Especially in the case of large systems where there may be thousands of clients that have nearly identical behaviour (e.g. active/idle at the same time, though possibly to different OSTs).

            On the one hand, the 20s timeout is definitely good for finding issues with this code during testing, but I think the default should be longer (e.g. 60s or 300s) depending on how long it takes for a large number of clients to reconnect. We could still set a shorter time in the test-framework to ensure the code continues to be tested. For testing purposes, it might also make sense to have an option (e.g. "lctl set_param osc.*.idle_timeout=debug" and "...=nodebug" or similar) to print a message to the console when the client disconnects (e.g. "testfs-OST0004: disconnect after 50s idle" and "testfs-OST0004: reconnect after 650s idle" or similar) so that we can help debug problems related to this feature. The console message should be enabled during testing.

            I see in the code that idle_timeout has a maximum value of CONNECT_SWITCH_MAX = 50s, which seems a bit short to me? Is that because the OBD_PING RPCs will keep the connection alive if it is longer than this? What happens if ping_interval (default 25s) is shorter than idle_timeout? Is that why the default idle_timeout is 20s?
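The console message format suggested in the comment above could be produced by a helper along these lines. This is only a sketch: the function name and its parameters are hypothetical and are not taken from the patch that eventually landed, and the real kernel code would use the Lustre console logging macros rather than snprintf.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: format "<target>: <event> after <N>s idle" into buf.
 * Returns the snprintf() result, i.e. the length the full message needs.
 */
int sketch_format_idle_msg(char *buf, size_t len, const char *target,
                           const char *event, long idle_secs)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "%s: %s after %lds idle",
                    target, event, idle_secs);
}
```

Called with the target "testfs-OST0004", the event "disconnect", and 50 seconds, this yields the first example string from the comment.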

            bzzz Alex Zhuravlev added a comment -

            20s by default

            adilger Andreas Dilger added a comment -

            One question I had about this failure - how short/long is the idle connection timeout? It seems like we shouldn't be getting so many timeouts in the middle of actively running tests. Is there some correlation between the tests failing with this issue and the length of time the client is idle?

            When you think about it, we don't want the connections to be dropping after only a few seconds of idle time, or we may get big reconnection storms if the system is still mostly in use, which will also hurt performance because of dropped grant and such.

            bzzz Alex Zhuravlev added a comment -

            https://review.whamcloud.com/#/c/32980 has passed many replay-single runs; I think it's ready for inspection and regular testing.

            People

              bzzz Alex Zhuravlev
              maloo Maloo
              Votes: 0
              Watchers: 5
