
import_sec_validate_get() import ffff88061d4a7000 (FULL) with no sec


    Description

      We started getting many of these messages on our 2.1.4 MDS. The timing corresponded with client reconnection problems, i.e. LU-1934.

      2013-05-15 15:54:44 Lustre: lsd-MDT0000: Client a22a8a8a-754a-c1ab-859d-eda3c476a27d (at 192.168.112.1@o2ib6) reconnecting
      2013-05-15 15:54:44 Lustre: Skipped 143 previous similar messages
      2013-05-15 15:54:44 Lustre: lsd-MDT0000: Client a22a8a8a-754a-c1ab-859d-eda3c476a27d (at 192.168.112.1@o2ib6) refused reconnection, still busy with 1 active RPCs
      2013-05-15 15:54:44 Lustre: Skipped 143 previous similar messages
      2013-05-15 15:56:31 LustreError: 5204:0:(sec.c:385:import_sec_validate_get()) import ffff88061d4a7000 (FULL) with no sec
      2013-05-15 15:56:31 LustreError: 5204:0:(sec.c:385:import_sec_validate_get()) Skipped 2399 previous similar messages
      

      The MDS was just rebooted without a crash dump so, unless this reproduces, we won't be able to tell much about the imports in question.


          Activity


            jlevi Jodi Levi (Inactive) added a comment -

            Patch landed to Master. Patch for other branches will be tracked outside of this ticket.

            gerrit Gerrit Updater added a comment -

            James Simmons (uja.ornl@gmail.com) uploaded a new patch: http://review.whamcloud.com/13254
            Subject: LU-3353 ptlrpc: Suppress error message when imp_sec is freed
            Project: fs/lustre-release
            Branch: b2_5
            Current Patch Set: 1
            Commit: 2d6469ec1a1ffc820cb7d5a6e18006a4e41bad19

            marc@llnl.gov D. Marc Stearman (Inactive) added a comment -

            Can a patch be landed for 2.5? We still see this on our production clusters, and it would be nice to reduce unneeded log messages.

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/10200/
            Subject: LU-3353 ptlrpc: Suppress error message when imp_sec is freed
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: edf6116663724467207422dcc0c6120320055cac

            ashehata Amir Shehata (Inactive) added a comment -

            The core cause of the race condition, where a reverse import is destroyed while client-bound requests are still in flight, is a client reconnecting. The place in the code that I could identify is:

            target_handle_connect()
            {
            ...
                spin_lock(&export->exp_lock);
                if (export->exp_imp_reverse != NULL)
                    /* destroyed import can be still referenced in ctxt */
                    tmp_imp = export->exp_imp_reverse;
                export->exp_imp_reverse = revimp;
                spin_unlock(&export->exp_lock);
            ...
            }

            Later on, tmp_imp is destroyed.

            While it is being destroyed, import_sec_validate_get() could still be called, and the security could already have been cleared.

            The suggested solution is to suppress the error message in import_sec_validate_get() if the import is being destroyed.
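
            As an illustration of the suggested approach only (this is not the landed patch), the idea can be sketched with a toy model in C. Every name below (toy_import, toy_sec, the destroying flag, toy_sec_validate_get, toy_errors_logged) is hypothetical and stands in for the real Lustre structures, whose fields and teardown checks are not shown in this ticket. The lookup still fails with -EACCES when no sec is attached, but it only reports an error when the import is not already being torn down.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, simplified stand-ins; not the real Lustre structures. */
struct toy_sec {
	int flavor;
};

struct toy_import {
	struct toy_sec *sec;	/* NULL once the security context is freed */
	bool destroying;	/* assumed "teardown in progress" marker */
};

/* Counts the messages that would reach the error log. */
static int toy_errors_logged;

/*
 * Toy counterpart of import_sec_validate_get(): return -EACCES when the
 * import has no sec, but log an error only when that is unexpected, i.e.
 * when the import is not in the middle of being destroyed.
 */
static int toy_sec_validate_get(const struct toy_import *imp,
				struct toy_sec **sec_out)
{
	assert(imp != NULL && sec_out != NULL);

	if (imp->sec == NULL) {
		if (!imp->destroying) {
			fprintf(stderr, "import %p with no sec\n",
				(const void *)imp);
			toy_errors_logged++;
		}
		return -EACCES;
	}

	*sec_out = imp->sec;
	return 0;
}
```

            In this sketch the caller sees the same return code either way; only the logging differs, which matches the goal of reducing noise without changing how in-flight requests on a dying reverse import are rejected.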

            shadow Alexey Lyashkov added a comment -

            It's a bug in the reconnect logic: we disconnect a reverse import but fail to invalidate the list of requests on that import.
            In that case a lock cancel callback may still be in the sending queue, waiting for a sec context, and is never killed, although this also needs some network flap to happen.
            I have a crash dump of such a situation.

            niu Niu Yawei (Inactive) added a comment -

            This looks like a race: when the export is being destroyed, it kills the imp_sec on its reverse import while there are still some in-flight requests on that reverse import, which triggers the error message "import_sec_validate_get() import ... with no sec". I don't think it can cause any serious damage so far.

            nedbass Ned Bass (Inactive) added a comment -

            I don't think we specify any security flavor, so it should be null.

            niu Niu Yawei (Inactive) added a comment -

            I'm not familiar with Lustre security. Ned, do you know what kind of security flavor was specified for the cluster? Thanks.

            pjones Peter Jones added a comment -

            Niu

            Could you please comment on this one?

            Thanks

            Peter


            People

              niu Niu Yawei (Inactive)
              nedbass Ned Bass (Inactive)