LU-1716: Race in setting connection flags and using them on 2.x client connect


    Description

      A Lustre 2.1 client fails to connect to a Lustre 2.2 server:

      > c0-0c2s6n1 LustreError: 11-0: an error occurred while communicating with 10.149.3.5@o2ib. The mgs_config_read operation failed with -524
      > c0-0c2s6n1 LustreError: 4645:0:(mgc_request.c:1917:mgc_process_config()) Cannot process recover llog -524
      > c0-0c2s6n1 LustreError: 15c-8: MGC10.149.3.5@o2ib: The configuration from log 'snxs2-client' failed (-524). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
      > c0-0c2s6n1 LustreError: 4645:0:(llite_lib.c:983:ll_fill_super()) Unable to process log: -524

      The race can be reproduced with the following patch:

      diff --git a/lustre/ptlrpc/import.c b/lustre/ptlrpc/import.c
      index 2953352..a69e6b9 100644
      --- a/lustre/ptlrpc/import.c
      +++ b/lustre/ptlrpc/import.c
      @@ -805,6 +805,7 @@ static int ptlrpc_connect_interpret(const struct lu_env *env,
                       } else {
                               IMPORT_SET_STATE(imp, LUSTRE_IMP_FULL);
                               ptlrpc_activate_import(imp);
      +                        OBD_FAIL_TIMEOUT(0x5555, 2);
                       }
       
                       GOTO(finish, rc = 0);
      
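      To make the window concrete, here is a minimal standalone sketch (plain pthreads, not Lustre code) of the pattern the patch above exposes: the connect-completion path publishes the FULL state first and stores the server-granted flags afterwards, so a consumer that starts as soon as it sees FULL reads the optimistic defaults. The sleep() stands in for OBD_FAIL_TIMEOUT(0x5555, 2); all names and values here are hypothetical.

      /*
       * Standalone illustration only -- not Lustre code.  Thread A mimics
       * ptlrpc_connect_interpret(): it marks the import FULL and only later
       * records the granted connect flags (the sleep stands in for the fail
       * hook).  Thread B mimics a consumer such as the MGS config llog
       * processing: it runs as soon as it sees FULL and reads stale flags.
       */
      #include <pthread.h>
      #include <stdio.h>
      #include <unistd.h>

      #define IMP_NEW  0
      #define IMP_FULL 1

      static volatile int imp_state = IMP_NEW;
      static volatile unsigned long long connect_flags = 0xffffffffULL; /* optimistic defaults */

      static void *connect_interpret(void *arg)
      {
              imp_state = IMP_FULL;            /* import activated... */
              sleep(2);                        /* ...window widened by the fail hook */
              connect_flags = 0x40ULL;         /* granted flags stored too late */
              return NULL;
      }

      static void *config_consumer(void *arg)
      {
              while (imp_state != IMP_FULL)    /* waits for FULL, not for the flags */
                      ;
              printf("consumer saw flags %#llx (stale defaults)\n",
                     (unsigned long long)connect_flags);
              return NULL;
      }

      int main(void)
      {
              pthread_t a, b;

              pthread_create(&a, NULL, connect_interpret, NULL);
              pthread_create(&b, NULL, config_consumer, NULL);
              pthread_join(a, NULL);
              pthread_join(b, NULL);
              return 0;
      }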


        Activity


          nrutman Nathan Rutman added a comment - Xyratex-bug-id: MRP-577
          pjones Peter Jones added a comment -

          Landed for 2.3 and 2.4

          askulysh Andriy Skulysh added a comment - b2_1 patch http://review.whamcloud.com/3783
          green Oleg Drokin added a comment -

          Hm, actually I now believe the issue is more serious than an interop problem. Since flags like CAPA, GSS and remote client are set on the client by default and are only reset after the server rejects them, this could occur even without any interop, though it is strange that we have not seen it before (perhaps it takes a lot of very fast cores on the client to expose the race?).

          I will change the title to reflect this.


          askulysh Andriy Skulysh added a comment -

          I haven't added a test to the master branch because the fix changes the order of assigning the connect flags and setting the import to FULL state, so OBD_FAIL_TIMEOUT() will not catch anything.
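          For illustration only (this is not the actual Lustre patch), the reordering Andriy describes can be sketched as follows: record the granted connect flags before the import is published as FULL, so any thread that observes FULL also sees the final flags. The struct, the function names and the use of C11 atomics below are hypothetical, purely to make the publish order explicit in a standalone example.

          #include <stdatomic.h>

          enum imp_state { IMP_NEW, IMP_FULL };

          struct fake_import {
                  unsigned long long connect_flags;   /* starts as optimistic defaults */
                  _Atomic int        state;
          };

          /* Racy order (what the reproducer exposes): publish FULL, store flags later. */
          static void connect_finish_racy(struct fake_import *imp, unsigned long long granted)
          {
                  atomic_store_explicit(&imp->state, IMP_FULL, memory_order_release);
                  imp->connect_flags = granted;       /* consumers may already be running */
          }

          /* Fixed order: store the granted flags first, then publish FULL. */
          static void connect_finish_fixed(struct fake_import *imp, unsigned long long granted)
          {
                  imp->connect_flags = granted;
                  atomic_store_explicit(&imp->state, IMP_FULL, memory_order_release);
          }

          int main(void)
          {
                  struct fake_import a = { 0xffffffffULL, IMP_NEW };
                  struct fake_import b = { 0xffffffffULL, IMP_NEW };

                  connect_finish_racy(&a, 0x40ULL);   /* old behaviour */
                  connect_finish_fixed(&b, 0x40ULL);  /* behaviour after the fix */
                  return 0;
          }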

          bogl Bob Glossman (Inactive) added a comment -

          Never mind, I see I am playing catch-up here. It looks like Andriy has already addressed the style and cleanup issues, but not the test. That being the case, should I pull the 2.1 test into master even though it doesn't prove anything in this release?

          bogl Bob Glossman (Inactive) added a comment -

          Andreas, I can incorporate the test from http://review.whamcloud.com/#change,3654 into the fix for master. Since this bug is an interop problem, the test won't prove anything with only a 2.3 client and server, but it should be good for showing that the problem doesn't come back when testing interop of 2.3 with future versions. I will also rework the patch to address your style and cleanup comments and resubmit.

          askulysh Andriy Skulysh added a comment -

          Added test http://review.whamcloud.com/#change,3654 for b2_1.
          In fact it reproduces the bug in another place:

          00000100:00000001:0.0:1344768393.872519:0:4160:0:(client.c:617:__ptlrpc_request_bufs_pack()) Process leaving (rc=0 : 0 : 0)
          00000000:00040000:0.0:1344768393.872521:0:4160:0:(mdc_request.c:239:mdc_getattr()) ASSERTION(client_is_remote(exp)) failed
          00000000:00040000:0.0:1344768393.872527:0:4160:0:(mdc_request.c:239:mdc_getattr()) LBUG

          but it is a connection flags race as well. The LBUG disappears with my fix.

          adilger Andreas Dilger added a comment -

          Bob/Andriy, it would be useful to add a proper OBD_FAIL value for this test, write a recovery-small.sh test to trigger the timeout, and submit it as a patch to autotest with the following directives in the commit comment (see http://wiki.whamcloud.com/display/PUB/Changing+Test+Parameters+with+Gerrit+Commit+Messages for details):

          Test-Parameters: fortestonly clientbuildno=??? list=recovery-small,recovery-small,recovery-small

          where clientbuildno=<2.1.2 release build number>. This is without the actual fix applied, just the test case. That would give us an indication of how easily this problem can be hit, and if it fails during the testing, then the test case should be added to the patch, which will presumably allow the test to pass.

          bogl Bob Glossman (Inactive) added a comment -

          I tried to reproduce the reported problem with repeated mount/unmount cycles of a 2.3 filesystem on a 2.1 client, but couldn't make it happen. I didn't add your suggested OBD_FAIL_* hook to try to force it. Should I have expected to see it, or is there something else that needs to be done to trigger it?

          People

            bogl Bob Glossman (Inactive)
            askulysh Andriy Skulysh
            Votes: 0
            Watchers: 6
