Clients can't mount after server upgrade

Details

    • Bug
    • Resolution: Duplicate
    • Major
    • None
    • Lustre 2.15.5
    • None
    • 3

    Description

      After we upgraded the servers to 2.15.5RC2, clients that were still on RC1 can't mount with the following NID order:
      mount -t lustre 172.16.5.5@tcp,172.16.5.6@tcp:172.16.5.7@tcp,172.16.5.8@tcp:172.16.5.3@tcp,172.16.5.4@tcp:172.16.5.1@tcp,172.16.5.2@tcp:/testfs /lustre/testfs/client

      However, this order worked:
      mount -t lustre 172.16.5.1@tcp,172.16.5.2@tcp:172.16.5.3@tcp,172.16.5.4@tcp:172.16.5.5@tcp,172.16.5.6@tcp:172.16.5.7@tcp,172.16.5.8@tcp:/testfs /lustre/testfs/client

      For clients that were updated to RC2, the mount worked with both orders.
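To make the difference between the two commands easier to see, here is a minimal Python sketch that splits each mount device string into its NID groups. It assumes the usual mount.lustre convention that colons separate nodes and commas separate the NIDs of one node; `parse_mount_device` is a hypothetical helper written for this illustration, not part of Lustre.

```python
def parse_mount_device(device):
    """Split 'nid,nid:nid,nid:/fsname' into (list of NID groups, fsname).

    Assumption: ':' separates nodes and ',' separates NIDs of one node.
    """
    spec, sep, fsname = device.rpartition(":/")
    if not sep or not spec or not fsname:
        raise ValueError(f"not a Lustre mount device: {device!r}")
    groups = [group.split(",") for group in spec.split(":")]
    return groups, fsname


# Device strings copied from the two mount commands in the description.
failing = ("172.16.5.5@tcp,172.16.5.6@tcp:172.16.5.7@tcp,172.16.5.8@tcp:"
           "172.16.5.3@tcp,172.16.5.4@tcp:172.16.5.1@tcp,172.16.5.2@tcp:/testfs")
working = ("172.16.5.1@tcp,172.16.5.2@tcp:172.16.5.3@tcp,172.16.5.4@tcp:"
           "172.16.5.5@tcp,172.16.5.6@tcp:172.16.5.7@tcp,172.16.5.8@tcp:/testfs")

failing_groups, fs = parse_mount_device(failing)
working_groups, _ = parse_mount_device(working)

# Compare the groups as sets of tuples to ignore ordering.
same_nodes = sorted(map(tuple, failing_groups)) == sorted(map(tuple, working_groups))

print("file system:", fs)
print("first group (failing):", failing_groups[0])
print("first group (working):", working_groups[0])
print("same groups, different order:", same_nodes)
```

Both commands name the same four NID pairs. They differ only in order: the failing command lists 172.16.5.5@tcp,172.16.5.6@tcp first, while the working one lists 172.16.5.1@tcp,172.16.5.2@tcp first.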

          Activity

            [LU-17958] Clients can't mount after server upgrade

            adilger Andreas Dilger added a comment - This is very likely a duplicate of LU-17476. The patches from that ticket have been backported to b2_15 and will be included in RC3.

            ssmirnov Serguei Smirnov added a comment - Here's what I saw on the server with net debugging enabled when reproducing the failed mount on the client:

            00000400:00000200:14.0:1718661485.310558:0:34329:0:(lib-move.c:4551:lnet_parse()) TRACE: 172.16.5.5@tcp(172.16.5.5@tcp) <- 172.16.2.243@tcp : PUT - for me
            00000400:00000200:14.0:1718661485.310560:0:34329:0:(api-ni.c:1540:lnet_nid4_cpt_hash()) Match nid 172.16.2.243@tcp to cpt 3
            00000400:00000200:14.0:1718661485.310562:0:34329:0:(api-ni.c:1540:lnet_nid4_cpt_hash()) Match nid 172.16.2.243@tcp to cpt 3
            00000400:00000200:14.0:1718661485.310563:0:34329:0:(api-ni.c:1540:lnet_nid4_cpt_hash()) Match nid 172.16.2.243@tcp to cpt 3
            00000400:00000200:14.0:1718661485.310564:0:34329:0:(lib-ptl.c:573:lnet_ptl_match_md()) Request from 12345-172.16.2.243@tcp of length 520 into portal 26 MB=0x66706332b50c0
            00000400:00000100:14.0:1718661485.310566:0:34329:0:(lib-move.c:4249:lnet_parse_put()) Dropping PUT from 12345-172.16.2.243@tcp portal 26 match 1802126186205376 offset 0 length 520: 4

            "lnetctl ping" and lnet_selftest showed no issues between the same server and client.

            The dropped PUT in the log indicates "failing to match MD", which is similar to the symptoms in LU-17476. I'm not sure why the PUT is getting dropped in this case, though, because the client is just initiating the connection and there should be no prior history of transactions.

            My suggestion was to try rebooting the server. If the server did get stuck with some "incorrect expectations" towards the client that is failing to mount, the reboot should be able to clear this state.
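For anyone scanning a longer debug dump for the same symptom, the following is a rough Python sketch that splits the log text into records and pulls out the "Dropping PUT" entries. The record layout and message wording are assumptions taken only from the excerpt above, and `split_records` and `dropped_puts` are hypothetical helpers, not Lustre tools.

```python
import re

# Assumption: every record begins with "<8 hex>:<8 hex>:<cpu>.<type>:",
# as in the excerpt above; other debug log layouts are not handled.
RECORD_START = re.compile(r"(?=\b[0-9a-f]{8}:[0-9a-f]{8}:\d+\.\d+:)")
REQUEST = re.compile(
    r"Request from (?P<peer>\S+) of length (?P<length>\d+) "
    r"into portal (?P<portal>\d+) MB=0x(?P<mb>[0-9a-f]+)")
DROP = re.compile(
    r"Dropping PUT from (?P<peer>\S+) portal (?P<portal>\d+) "
    r"match (?P<match>\d+) offset (?P<offset>\d+) "
    r"length (?P<length>\d+): (?P<rc>\d+)")


def split_records(text):
    """Split log text into records, even if several share one line."""
    return [part.strip() for part in RECORD_START.split(text) if part.strip()]


def dropped_puts(text):
    """Return one dict per 'Dropping PUT' record found in the text."""
    drops = []
    for record in split_records(text):
        m = DROP.search(record)
        if m:
            drops.append({
                "peer": m["peer"],
                "portal": int(m["portal"]),
                "match": int(m["match"]),
                "length": int(m["length"]),
                "rc": int(m["rc"]),
            })
    return drops


# Last two records of the excerpt, joined on one line as in a raw paste.
sample = (
    "00000400:00000200:14.0:1718661485.310564:0:34329:0:"
    "(lib-ptl.c:573:lnet_ptl_match_md()) Request from 12345-172.16.2.243@tcp "
    "of length 520 into portal 26 MB=0x66706332b50c0 "
    "00000400:00000100:14.0:1718661485.310566:0:34329:0:"
    "(lib-move.c:4249:lnet_parse_put()) Dropping PUT from "
    "12345-172.16.2.243@tcp portal 26 match 1802126186205376 "
    "offset 0 length 520: 4")

records = split_records(sample)
drops = dropped_puts(sample)
request = REQUEST.search(records[0])

print(len(records), "records")
print(drops[0])
# The hex MB value and the decimal match value are the same number.
print(int(request["mb"], 16) == drops[0]["match"])
```

On this excerpt the hex match bits in the lnet_ptl_match_md() line (MB=0x66706332b50c0) and the decimal match value in the drop line (1802126186205376) are the same number, so both lines describe the same request.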

            People

              ssmirnov Serguei Smirnov
              mdiep Minh Diep