Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-7115

fld_client_rpc() may run into deadloop

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.10.0
    • Lustre 2.7.0
    • None
    • 3
    • 9223372036854775807

    Description

      In fld_client_rpc():

              if (rc != 0) {
                      if (imp->imp_state != LUSTRE_IMP_CLOSED && !imp->imp_deactive) {
                              /* Since LWP is not replayable, so it will keep
                               * trying unless umount happens, otherwise it would
                               * cause unecessary failure of the application. */
                              ptlrpc_req_finished(req);
                              rc = 0;
                              goto again;
                      }
                      GOTO(out_req, rc);
              }
      

      If the connection is broken, this function will run into an dead loop. I think we'd reshape the function somehow to make it interruptable, otherwise, if connection never being established, caller will stuck in this function forever.

      Seems fld_update_from_controller() has similar problem.

      Attachments

        Issue Links

          Activity

            [LU-7115] fld_client_rpc() may run into deadloop
            pjones Peter Jones added a comment -

            Landed for 2.10

            pjones Peter Jones added a comment - Landed for 2.10

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/17041/
            Subject: LU-7115 fld: don't retry for no LWP device
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 9407629a816feff9f773517f90b615164319642f

            gerrit Gerrit Updater added a comment - Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/17041/ Subject: LU-7115 fld: don't retry for no LWP device Project: fs/lustre-release Branch: master Current Patch Set: Commit: 9407629a816feff9f773517f90b615164319642f

            Yang Sheng (yang.sheng@intel.com) uploaded a new patch: http://review.whamcloud.com/17041
            Subject: LU-7115 fld: don't try again for no LWP device
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 4fb4d98b6432e2a9d2e0397599421a1c2032e51e

            gerrit Gerrit Updater added a comment - Yang Sheng (yang.sheng@intel.com) uploaded a new patch: http://review.whamcloud.com/17041 Subject: LU-7115 fld: don't try again for no LWP device Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 4fb4d98b6432e2a9d2e0397599421a1c2032e51e

            As Di mentioned, the thread can be terminated by umount target, I think that's fine to me, we can just leave as it is.

            And this function will be called by client as well, we may need to check if it's called by client (not from LWP but from mdc device), and break the loop for non-LWP device case.

            niu Niu Yawei (Inactive) added a comment - As Di mentioned, the thread can be terminated by umount target, I think that's fine to me, we can just leave as it is. And this function will be called by client as well, we may need to check if it's called by client (not from LWP but from mdc device), and break the loop for non-LWP device case.

            Is this really a bug or could this be closed?

            adilger Andreas Dilger added a comment - Is this really a bug or could this be closed?
            di.wang Di Wang added a comment -

            Oh, it can break the loop, see

                           if (imp->imp_state != LUSTRE_IMP_CLOSED && !imp->imp_deactive) {
                                    /* Since LWP is not replayable, so it will keep
                                     * trying unless umount happens, otherwise it would
                                     * cause unecessary failure of the application. */
                                    ptlrpc_req_finished(req);
                                    rc = 0;
                                    goto again;
                            }
            

            It will check the import state here. And also the point right now is that we do not break the connection between MDTs, until umount and admin step in, so this implementation actually fit in here.
            Though we still need check the LWP here as I said in the previous comment.

            di.wang Di Wang added a comment - Oh, it can break the loop, see if (imp->imp_state != LUSTRE_IMP_CLOSED && !imp->imp_deactive) { /* Since LWP is not replayable, so it will keep * trying unless umount happens, otherwise it would * cause unecessary failure of the application. */ ptlrpc_req_finished(req); rc = 0; goto again; } It will check the import state here. And also the point right now is that we do not break the connection between MDTs, until umount and admin step in, so this implementation actually fit in here. Though we still need check the LWP here as I said in the previous comment.

            I mean if the connection is broken, then it'll run into a deadloop, then there is no way to terminate the thread which calls this function, and we won't able to shutdown the MDT/OST at the end.

            I think it's not a serious problem, and looks not easy to fix.

            niu Niu Yawei (Inactive) added a comment - I mean if the connection is broken, then it'll run into a deadloop, then there is no way to terminate the thread which calls this function, and we won't able to shutdown the MDT/OST at the end. I think it's not a serious problem, and looks not easy to fix.
            di.wang Di Wang added a comment -

            I thought the point here is to not fail for LWP unless it is being umounted or deactive, not sure how interruptible can help here, since it is the connection between MDTs. I may miss sth?

            But we do need check if the import is for LWP, i.e. only do this "try again" for LWP connection, not for MDC or other import.

            di.wang Di Wang added a comment - I thought the point here is to not fail for LWP unless it is being umounted or deactive, not sure how interruptible can help here, since it is the connection between MDTs. I may miss sth? But we do need check if the import is for LWP, i.e. only do this "try again" for LWP connection, not for MDC or other import.
            pjones Peter Jones added a comment -

            Yang Sheng

            Could you please look into this issue?

            Thanks

            Peter

            pjones Peter Jones added a comment - Yang Sheng Could you please look into this issue? Thanks Peter

            People

              ys Yang Sheng
              niu Niu Yawei (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: