
LU-8351: ptlrpc_import_delay_req() refuses to delay blocking ASTs when the import is not yet in LUSTRE_IMP_FULL

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.10.0

    Description

      ptlrpc_import_delay_req() refuses to delay blocking ASTs while the
      import is not yet in LUSTRE_IMP_FULL. This leads to the client being
      evicted on the assumption that it failed to respond.

      Solution: allow blocking ASTs that are being resent to be delayed.
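
      For illustration, below is a minimal userspace C sketch of the decision
      the fix adjusts. The field names (rq_send_state, rq_no_delay, rq_resend,
      imp_dlm_fake, imp_state) mirror the Lustre excerpts quoted in the
      comments further down; delay_req_model() and the main() driver are
      hypothetical, and the allow_resend_delay branch is only a guess at the
      shape of the fix based on the one-line solution above, not the merged
      patch.

      /* Illustrative model only -- not the actual LU-8351 patch. */
      #include <stdio.h>
      #include <errno.h>
      #include <stdbool.h>

      enum imp_state { IMP_RECOVER, IMP_FULL };

      struct req_model {
              enum imp_state rq_send_state; /* state the request wants to be sent in */
              bool rq_no_delay;             /* caller asked not to wait */
              bool rq_resend;               /* blocking AST being resent after reconnect */
      };

      struct imp_model {
              enum imp_state imp_state;     /* current reverse-import state */
              bool imp_dlm_fake;            /* reverse import used for ASTs */
      };

      /*
       * Simplified stand-in for ptlrpc_import_delay_req(): returns true if
       * the request should be delayed, otherwise *status reports an error.
       * When allow_resend_delay is true, a resent request is delayed until
       * the import reaches FULL instead of failing with -EWOULDBLOCK.
       */
      static bool delay_req_model(const struct imp_model *imp,
                                  const struct req_model *req, int *status,
                                  bool allow_resend_delay)
      {
              *status = 0;
              if (req->rq_send_state == imp->imp_state)
                      return false;                   /* send immediately */
              if (allow_resend_delay && req->rq_resend)
                      return true;                    /* wait for IMP_FULL */
              if (imp->imp_dlm_fake || req->rq_no_delay) {
                      *status = -EWOULDBLOCK;         /* leads to client eviction */
                      return false;
              }
              return true;
      }

      int main(void)
      {
              /* Reverse import still in RECOVER: the window between
               * ptlrpc_resend() and IMPORT_SET_STATE(imp, LUSTRE_IMP_FULL). */
              struct imp_model imp = { .imp_state = IMP_RECOVER, .imp_dlm_fake = true };
              struct req_model ast = { .rq_send_state = IMP_FULL, .rq_resend = true };
              int status;

              printf("before: delay=%d status=%d\n",
                     delay_req_model(&imp, &ast, &status, false), status);
              printf("after:  delay=%d status=%d\n",
                     delay_req_model(&imp, &ast, &status, true), status);
              return 0;
      }

      Run as written, the model reports delay=0 with status set to
      -EWOULDBLOCK (the eviction path) before the change, and delay=1 with
      status 0 after it, i.e. the resent blocking AST simply waits for the
      import to become FULL.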

      Attachments

        Activity

          pjones Peter Jones added a comment -

          Landed for 2.10


          Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/21065/
          Subject: LU-8351 ptlrpc: allow blocking asts to be delayed
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: bde948c9cf11793d8e7fef59ea2a10bb2d684bf6


          vsaveliev Vladimir Saveliev added a comment -

          Was it considered that the alternative is the server being stuck waiting for a lock from an unavailable client for a long time?
          What's the plan there?
          Before the patch: send the AST; once there is a timeout, retry a few more times, then evict the client.
          Now: try to send the AST, it gets blocked... wait...? (for how long) and then what?
          What if the client never reappears, are we waiting for it to be evicted by the ping evictor or something?
          What about all the other clients that could get blocked on this lock meanwhile?

          I believe that all the above will remain as it is now.

          The only situation the patch addresses arises when a server handles a reconnect from a client and there is a blocking AST for that client waiting to be resent.

          The server enters target_handle_connect() and eventually rev_import_reconnect(), where the reverse import is set to LUSTRE_IMP_RECOVER; ptlrpc_import_recovery_state_machine() is then called to do the resends.
          ptlrpc_resend() sets the rq_resend flag for each request on the sending list and wakes up the corresponding ptlrpc_check_set() callers. Then the import state is set to LUSTRE_IMP_FULL:

                  if (imp->imp_state == LUSTRE_IMP_RECOVER) {
                          struct ptlrpc_connection *conn = imp->imp_connection;
          
                          rc = ptlrpc_resend(imp);
                          if (rc)
                                  GOTO(out, rc);
                          IMPORT_SET_STATE(imp, LUSTRE_IMP_FULL);
          

          There is a race window on the import lock between ptlrpc_resend() and IMPORT_SET_STATE(imp, LUSTRE_IMP_FULL).
          If the awakened ptlrpc_check_set()->ptlrpc_import_delay_req() takes the import lock before the import state is set to LUSTRE_IMP_FULL, it sees req->rq_send_state != imp->imp_state and sets *status to -EWOULDBLOCK (-EAGAIN):

                  } else if (req->rq_send_state != imp->imp_state) {
                          /* invalidate in progress - any requests should be drop */
                          if (atomic_read(&imp->imp_inval_count) != 0) {
                                  DEBUG_REQ(D_ERROR, req, "invalidate in flight");
                                  *status = -EIO;
                          } else if (imp->imp_dlm_fake || req->rq_no_delay) {
                                  *status = -EWOULDBLOCK;
          

          After that the request goes to the interpret stage, and ldlm_handle_ast_error()->ldlm_failed_ast() evicts the client because req->rq_status is -EAGAIN.
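
          As a rough sketch of that last step, here is a small model of why an
          -EAGAIN status on a blocking AST ends in eviction. The names echo
          ldlm_handle_ast_error() and ldlm_failed_ast(), but the bodies are
          purely illustrative: the real functions live in the ldlm code, take
          more arguments and handle more error codes than shown here.

          /* Illustrative model only -- not the real ldlm error handling. */
          #include <stdio.h>
          #include <errno.h>

          static void ldlm_failed_ast_model(const char *ast_type, int rc)
          {
                  /* In Lustre this marks the AST as failed and evicts the client. */
                  printf("%s AST failed (rc=%d): evicting client\n", ast_type, rc);
          }

          static int ldlm_handle_ast_error_model(int rq_status, const char *ast_type)
          {
                  /* Only the case described in this ticket is modelled: the
                   * -EWOULDBLOCK (== -EAGAIN on Linux) set by the race above
                   * is treated as a failed AST; other codes are handled
                   * differently in the real code. */
                  if (rq_status == -EAGAIN) {
                          ldlm_failed_ast_model(ast_type, rq_status);
                          return rq_status;
                  }
                  return 0;
          }

          int main(void)
          {
                  return ldlm_handle_ast_error_model(-EAGAIN, "blocking") ? 1 : 0;
          }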

          Also, the import on the server is not a real import, so how could it be non-full?

          The server is handling reconnect. The state of the reverse import is changed in rev_import_reconnect().

          This probably needs an interop check too? I.e. the change is on the server, but we always run with client scripts, so we need to ensure the server has this fix before running this test?

          I do not know. The test is attached just to illustrate the problem. If you would like testing against older servers not to fail, you might want to just add the test to the exception list.


          Artem Blagodarenko (artem.blagodarenko@seagate.com) uploaded a new patch: http://review.whamcloud.com/21065
          Subject: LU-8351 ptlrpc: allow blocking asts to be delayed
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 44dc760a25d313cd96f9dbd5071ff6018326eae6


          People

            green Oleg Drokin
            artem_blagodarenko Artem Blagodarenko (Inactive)
            Votes: 0
            Watchers: 5
