Lustre / LU-15539

clients report mds_mds_connection in connect_flags after lustre update on servers

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • None
    • Lustre 2.12.8
    • Environment: lustre-2.12.8_6.llnl
      3.10.0-1160.53.1.1chaos.ch6.x86_64
      RHEL7.9
      zfs-0.7.11-9.8llnl
    • 3

    Description

      We upgraded a lustre server cluster from lustre-2.12.7_2.llnl to lustre-2.12.8_6.llnl. 

      The node on which the MGS runs, copper1, began reporting "new MDS connections" from NIDs that are assigned to client nodes:

      Lustre: MGS: Received new MDS connection from 192.168.128.68@o2ib38, keep former export from same NID
      Lustre: MGS: Received new MDS connection from 192.168.128.8@o2ib42, keep former export from same NID
      Lustre: MGS: Received new MDS connection from 192.168.131.78@o2ib39, keep former export from same NID
      Lustre: MGS: Received new MDS connection from 192.168.132.204@o2ib39, keep former export from same NID
      Lustre: MGS: Received new MDS connection from 192.168.134.127@o2ib27, keep former export from same NID
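To check which of these NIDs really belong to MDS nodes, the console messages can be reduced to a list of unique NIDs. The sketch below assumes the exact message text quoted above; `nids_from_log` and `unexpected_mds_nids` are hypothetical helpers, and `known_mds_nids` is a set of genuine MDS NIDs that the site would have to supply.

```python
import re

# Pattern for the MGS console message quoted above; the NID is the token
# between "from " and the comma that follows it.
MDS_CONN_RE = re.compile(r"MGS: Received new MDS connection from ([^\s,]+)")


def nids_from_log(lines):
    """Return unique NIDs, in first-seen order, logged as new MDS connections."""
    nids = []
    for line in lines:
        match = MDS_CONN_RE.search(line)
        if match and match.group(1) not in nids:
            nids.append(match.group(1))
    return nids


def unexpected_mds_nids(lines, known_mds_nids):
    """NIDs the MGS treated as MDS connections that are not known MDS NIDs.

    known_mds_nids is a hypothetical, site-supplied set of NIDs that really
    belong to MDS nodes.
    """
    return [nid for nid in nids_from_log(lines) if nid not in known_mds_nids]
```

Feeding it the console log lines from the MGS node would list client NIDs, such as 192.168.128.68@o2ib38, that the MGS misclassified.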
      

      The clients' connect_flags include "mds_mds_connection":

      [root@quartz7:lustre]# head */*/connect_flags
      ==> mgc/MGC172.19.3.1@o2ib600/connect_flags <==
      flags=0x2000011005002020
      flags2=0x0
      version
      barrier
      adaptive_timeouts
      mds_mds_connection
      full20
      imp_recov
      bulk_mbits
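The name list is a decoding of the `flags=` bitmask. A minimal decoder for this one value is sketched below. The bit values and printed names are reproduced from memory of `lustre_idl.h` and `obd_connect_names[]`, cover only the flags shown above, and should be checked against your source tree; `read_flags` and `decode_connect_flags` are hypothetical helpers.

```python
# Subset of the OBD_CONNECT_* bits, limited to the flags printed above.
# Values are assumptions recalled from lustre_idl.h; verify before relying on them.
CONNECT_BITS = {
    0x20: "version",                   # OBD_CONNECT_VERSION
    0x2000: "barrier",                 # OBD_CONNECT_BARRIER
    0x1000000: "adaptive_timeouts",    # OBD_CONNECT_AT
    0x4000000: "mds_mds_connection",   # OBD_CONNECT_MDS_MDS
    0x1000000000: "full20",            # OBD_CONNECT_FULL20
    0x10000000000: "imp_recov",        # OBD_CONNECT_IMP_RECOV
    0x2000000000000000: "bulk_mbits",  # OBD_CONNECT_BULK_MBITS
}


def read_flags(text):
    """Return the integer value of the 'flags=' line of a connect_flags file."""
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("flags="):
            # int(..., 16) accepts the leading "0x".
            return int(line.split("=", 1)[1], 16)
    raise ValueError("no 'flags=' line found")


def decode_connect_flags(flags):
    """Return (names of set bits known to CONNECT_BITS, leftover unknown bits)."""
    names = []
    known_mask = 0
    for bit in sorted(CONNECT_BITS):
        known_mask |= bit
        if flags & bit:
            names.append(CONNECT_BITS[bit])
    return names, flags & ~known_mask
```

For 0x2000011005002020 this yields the same seven names, in the same order, as the output above, with no leftover bits; mds_mds_connection corresponds to bit 0x4000000.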
      

      The clients are running lustre-2.12.7_2.llnl, which does not have "LU-13356 client: don't use OBD_CONNECT_MNE_SWAB".
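As I understand `lustre_idl.h`, `OBD_CONNECT_MNE_SWAB` is defined as an alias of `OBD_CONNECT_MDS_MDS` (the same bit, with an MGS-only meaning). If that holds, an MGC without LU-13356 that still sets MNE_SWAB presents exactly the bit that is printed as mds_mds_connection. The sketch below only illustrates that assumed alias; `sets_mds_mds_bit` is a hypothetical helper, not a Lustre function.

```python
# Assumption: lustre_idl.h defines OBD_CONNECT_MNE_SWAB as an alias of
# OBD_CONNECT_MDS_MDS, so the two names share one bit in the connect flags.
OBD_CONNECT_MDS_MDS = 0x4000000
OBD_CONNECT_MNE_SWAB = OBD_CONNECT_MDS_MDS


def sets_mds_mds_bit(flags):
    """True if a connect flags word carries the bit shared by MNE_SWAB and MDS_MDS."""
    return bool(flags & OBD_CONNECT_MNE_SWAB)
```

Under this assumption, `sets_mds_mds_bit(0x2000011005002020)` is true for the unpatched client above, while the same word with that bit cleared (0x2000011001002020) is not.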

      Shutting down the servers and restoring them to lustre-2.12.7_2.llnl did not change the symptoms.

      Patch stacks are:
      https://github.com/LLNL/lustre/releases/tag/2.12.8_6.llnl
      https://github.com/LLNL/lustre/releases/tag/2.12.7_2.llnl

      This was seen during the same Lustre server update in which we saw LU-15541, but it appears to be a separate issue.

          Activity

            pjones Peter Jones made changes -
            Link Original: This issue is related to JFC-21 [ JFC-21 ]
            ofaaland Olaf Faaland made changes -
            Resolution New: Fixed [ 1 ]
            Status Original: Open [ 1 ] New: Resolved [ 5 ]
            ofaaland Olaf Faaland added a comment -

            Updated all the clients to 2.12.8_6.llnl (or later) with patch "LU-13356 client: don't use OBD_CONNECT_MNE_SWAB".
            Then updated the servers to 2.12.8_6.llnl (or later) with that patch.
            No longer seeing inappropriate "Received new MDS connection" messages on bringup.

            ofaaland Olaf Faaland added a comment -

            In retrospect, the issue that kept the Lustre file system from coming up was the LNet issue documented in https://jira.whamcloud.com/browse/LU-15541, so I reduced this issue's priority to "Minor".

            ofaaland Olaf Faaland made changes -
            Labels Original: llnl topllnl New: llnl
            ofaaland Olaf Faaland made changes -
            Priority Original: Critical [ 2 ] New: Minor [ 4 ]
            pjones Peter Jones made changes -
            Link New: This issue is related to LU-13356 [ LU-13356 ]
            ofaaland Olaf Faaland added a comment - - edited

            We unmounted all the clients while the server cluster was down, then brought the server cluster back up running 2.12.7_2.llnl. We did not see the "Received new MDS connection" messages on bringup.

            We are proceeding with client cluster updates, and will update server clusters in about 2 weeks.


            People

              laisiyao Lai Siyao
              ofaaland Olaf Faaland