
lnet: only report mismatched nid in ME if bits match

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.16.0, Lustre 2.15.5
    • Lustre 2.15.5
    • None
    • 3

    Description

      In rare cases, a client-to-server AST reply is dropped by the server. The server then logs messages similar to the following, with o104, o105, or o106 as the RPC type:

      Lustre: 3678513:0:(client.c:2318:ptlrpc_expire_one_request())
           @@@ Request sent has timed out for slow reply: [sent 1706140870/real 1706140870]
            req@00000000a8fbe768 x1788044801687552/t0(0) o104->lfs00-MDT0001@10.31.3.109@tcp:15/16
           lens 328/224 e 0 to 1 dl 1706140908 ref 1 fl Rpc:XQr/0/ffffffff rc 0/-1 job:''
      

      The kernel debug logs show that LNet is dropping the RPC because no matching request is found:

      lnet_parse_put()) Dropping PUT from 12345-10.31.3.108@tcp portal 16 match 1788044801687552 offset 224 length 224: 4
      :
      request_out_callback()) @@@ type 5, status 0  req@00000000a8fbe768 x1788044801687552/t0(0) o104->lfs02-MDT0001@10.31.3.109@tcp:15/16 lens 328/224 e 0 to 0 dl 1706140946 ref 2 fl Rpc:r/2/ffffffff rc 0/-1 job:''
      lnet_parse_put()) Dropping PUT from 12345-10.31.3.108@tcp portal 16 match 1788044801687552 offset 224 length 224: 4
      lnet_is_health_check()) Msg 00000000a906b193 is in inconsistent state, don't perform health checking (-2, 0)
      lnet_is_health_check()) health check = 0, status = -2, hstatus = 0
      

      As part of MD matching for an incoming GET or PUT from a peer with multiple NIDs, use the "matchbits" alone when they are available, and only report an error (rather than fail the match) on a NID/PID mismatch. If the "matchbits" cannot be used for matching, fail on a NID/PID mismatch as before.
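The rule described above can be illustrated with a small toy model. This is a sketch in Python, not the LNet implementation: the class names, the `try_match` function, and the use of `match_bits=None` to stand for "match bits are not available" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class MatchEntry:
    """Hypothetical stand-in for an ME; None means 'wildcard' or 'unavailable'."""
    nid: Optional[str]
    pid: Optional[int]
    match_bits: Optional[int]  # None models "match bits cannot be used" (assumption)
    ignore_bits: int = 0


@dataclass(frozen=True)
class Incoming:
    """Hypothetical stand-in for the header fields of an incoming PUT or GET."""
    nid: str
    pid: int
    match_bits: int


def id_matches(me: MatchEntry, msg: Incoming) -> bool:
    # A None field in the ME accepts any sender NID or PID.
    nid_ok = me.nid is None or me.nid == msg.nid
    pid_ok = me.pid is None or me.pid == msg.pid
    return nid_ok and pid_ok


def bits_match(me: MatchEntry, msg: Incoming) -> bool:
    # Bits set in ignore_bits are excluded from the comparison.
    return ((me.match_bits ^ msg.match_bits) & ~me.ignore_bits) == 0


def try_match(me: MatchEntry, msg: Incoming) -> Tuple[bool, Optional[str]]:
    """Return (matched, warning) following the rule in the description."""
    if me.match_bits is not None:
        if not bits_match(me, msg):
            return False, None
        if not id_matches(me, msg):
            # Bits match: accept the message, but report the mismatch.
            warning = (f"NID/PID mismatch: expected {me.nid}/{me.pid}, "
                       f"got {msg.nid}/{msg.pid}")
            return True, warning
        return True, None
    # No usable match bits: fail on NID/PID mismatch as before.
    return id_matches(me, msg), None
```

With the values from the logs above (an ME expecting 10.31.3.109@tcp and a PUT arriving from 10.31.3.108@tcp with the same match bits), this model accepts the message and returns a warning instead of dropping it.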

          Activity

            [LU-17476] lnet: only report mismatched nid in ME if bits match

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/55489/
            Subject: LU-17476 lnet: use bits only to match ME in all cases
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set:
            Commit: a34b3596ad29fc4fd9e7d1f007e4f6ee514dfcaa


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/55488/
            Subject: LU-17476 lnet: prefer to use bits only to match ME
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set:
            Commit: eb35ce5538512b67fd82955c54a148eb707a10ee


            "Frederick Dilger <fdilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/55489
            Subject: LU-17476 lnet: use bits only to match ME in all cases
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: b342b92b923938892df81a207a0271473c16060e


            "Frederick Dilger <fdilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/55488
            Subject: LU-17476 lnet: prefer to use bits only to match ME
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: e10fa830fe3313768df31eee657fe1b02792c1ab

            hornc Chris Horn added a comment - edited

            adilger, I root-caused a NID mismatch bug that is due to a bad backport that landed in 2.15.4. Details are in LU-17664.
            hornc Chris Horn added a comment -

            Do you have the complete log(s) from Oleg's analysis? Maybe there is a clue there.

            I think maybe there is some race between discovery and ptlrpc connection setup. The reply buffer (always?) uses whatever NID is stored in the import's (or export's reverse import) ptlrpc_connection.c_peer.nid. This is populated by a call to LNetPrimaryNID(). Therefore we can infer that at some point LNetPrimaryNID() returned the .109, but then discovery later set .108 as the primary. This shouldn't happen with the primary NID locking feature, but there is no other explanation I can think of.
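The suspected race can be pictured with a toy model. This is only a sketch of the hypothesis in this comment, not of ptlrpc or LNet code: `PeerTable`, `Connection`, and `discovery_complete` are hypothetical names, and the assumption that an undiscovered NID is reported as its own primary is made purely for illustration.

```python
class PeerTable:
    """Hypothetical stand-in for LNet's view of multi-NID peers."""

    def __init__(self):
        self._primary = {}  # maps any known NID of a peer to its primary NID

    def primary_nid(self, nid):
        # Assumption: before discovery, a NID is reported as its own primary.
        return self._primary.get(nid, nid)

    def discovery_complete(self, primary, nids):
        # Discovery records which NID is the primary for all NIDs of the peer.
        for nid in nids:
            self._primary[nid] = primary


class Connection:
    """Hypothetical stand-in for a connection that caches the peer NID."""

    def __init__(self, peers, nid):
        # The primary NID is looked up once and stored; it is not refreshed.
        self.peer_nid = peers.primary_nid(nid)


peers = PeerTable()
conn = Connection(peers, "10.31.3.109@tcp")          # set up before discovery
peers.discovery_complete("10.31.3.108@tcp",
                         ["10.31.3.108@tcp", "10.31.3.109@tcp"])

# The cached value is now stale relative to the peer table.
stale = conn.peer_nid != peers.primary_nid(conn.peer_nid)
```

In this model the connection keeps .109 while the peer table now reports .108 as the primary, which is the mismatch pattern seen in the logs.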


            adilger Andreas Dilger added a comment -

            "This message tells us the reply buffer was setup using 10.31.3.109@tcp. i.e. ptlrpc thinks .109 is the client's primary NID."

            hornc, thanks for looking into this. Can you see if this is something wrong with how the RPC is generated at the PtlRPC/LDLM layer, maybe where LDLM is getting the peer NID for the AST? We only really saw this with LDLM requests, but that might also have been self-selecting, given that it is one of the few server-to-client RPC types.
            hornc Chris Horn added a comment - edited

            "It definitely is possible that the blocking AST request might be generated with the wrong NID for the reply buffer, we haven't yet looked into that code to confirm."

            From the logging you provided, this is definitely the case.

            00000100:00000200:19.0:1706140870.571438:0:18949:0:(events.c:65:request_out_callback())
                 @@@ type 5, status 0  req@00000000a8fbe768 x1788044801687552/t0(0) o104->lfs00-MDT0001@10.31.3.109@tcp:15/16
                 lens 328/224 e 0 to 0 dl 1706140908 ref 2 fl Rpc:r/0/ffffffff rc 0/-1 job:''

            This message tells us the reply buffer was set up using 10.31.3.109@tcp, i.e. ptlrpc thinks .109 is the client's primary NID.

            00000400:00000100:17.0:1706140946.315344:0:18949:0:(lib-move.c:4092:lnet_parse_put())
                 Dropping PUT from 12345-10.31.3.108@tcp portal 16 match 1788044801687552 offset 224 length 224: 4

            This message tells us the lnet_peer object for this peer has .108 as the primary NID. Thus the message is dropped due to the NID mismatch. I suspect there is a bug with the primary NID locking, or somewhere in ptlrpc connection management where it stores the primary NID.
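The comparison made in this comment can be scripted when sifting through debug logs. The following Python sketch pairs a request line with a "Dropping PUT" line and reports whether the match bits agree while the NIDs differ. The regular expressions are written against the two log lines quoted here only. The function name `explain_drop` is hypothetical, and reading the second number in `:15/16` as the reply portal is an assumption based on the `portal 16` value in the drop message.

```python
import re

# Matches e.g. "x1788044801687552/t0(0) o104->lfs00-MDT0001@10.31.3.109@tcp:15/16"
REQ_RE = re.compile(
    r"x(?P<xid>\d+)/t\d+\(\d+\)\s+o(?P<opc>\d+)->"
    r"(?P<target>[^@\s]+)@(?P<nid>\S+?):(?P<req_portal>\d+)/(?P<rep_portal>\d+)"
)

# Matches e.g. "Dropping PUT from 12345-10.31.3.108@tcp portal 16 match 1788044801687552"
DROP_RE = re.compile(
    r"Dropping PUT from (?P<pid>\d+)-(?P<nid>\S+) "
    r"portal (?P<portal>\d+) match (?P<match>\d+)"
)


def explain_drop(request_line, drop_line):
    """Compare a request log line with a drop log line.

    Returns None if either line does not have the expected shape.
    """
    req = REQ_RE.search(request_line)
    drop = DROP_RE.search(drop_line)
    if req is None or drop is None:
        return None
    return {
        # In the quoted logs the match bits equal the request XID.
        "bits_match": req["xid"] == drop["match"],
        "portal_match": req["rep_portal"] == drop["portal"],
        "expected_nid": req["nid"],   # NID the request was addressed to
        "sender_nid": drop["nid"],    # NID LNet reports for the sender
        "nid_mismatch": req["nid"] != drop["nid"],
    }
```

For the two messages above, this reports matching bits and portal, an expected NID of 10.31.3.109@tcp, and a sender NID of 10.31.3.108@tcp.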

            pjones Peter Jones added a comment -

            Landed for 2.16


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/54082/
            Subject: LU-17476 lnet: use bits only to match ME in all cases
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: a7ae2e5515879dc31e87106314d35dc439a2c50d


            People

              ssmirnov Serguei Smirnov
              ssmirnov Serguei Smirnov