Lustre / LU-14108

Mounting targets created with mkfs "network" option should disable discovery

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version: Lustre 2.16.0
    • Labels: None
    • Severity: 3
    • 9223372036854775807

    Description

      The --network= option to mkfs.lustre allows restricting a target (OST/MDT) to a given LNet network, so that the target registers with the MGS using NIDs on the specified network only. However, dynamic discovery is unaware of this restriction, which can create problems.
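
      For example (a sketch; the device, fsname, index, and MGS NID are illustrative), an OST formatted as follows registers with the MGS using its o2ib1 NIDs only:

          # format an OST restricted to the o2ib1 LNet network
          mkfs.lustre --ost --fsname=testfs --index=0 \
              --mgsnode=<mgs NID> --network=o2ib1 /dev/sdb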

      If this scenario is recognised, it is possible to deal with it in at least two ways:

      1) Prevent the mount with an error.

      2) Disable dynamic discovery with a warning, and prevent discovery from being enabled in the future (the manual discovery knob is shown in the sketch after this list).
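
      For reference, the discovery setting that option 2 would pin can be inspected and changed manually with lnetctl (a sketch, independent of how the fix itself is implemented):

          # disable LNet dynamic peer discovery on this node
          lnetctl set discovery 0
          # verify: the "discovery" field in the global settings should now report 0
          lnetctl global show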

      Attachments

        Issue Links

          Activity

            [LU-14108] Mounting targets created with mkfs "network" option should disable discovery

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/46632/
            Subject: LU-14108 mount: prevent if --network and discovery
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: e82836f56ee7a9337a86ad0a32f19751024c7ec6


            "Cyril Bordage <cbordage@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/47137
            Subject: LU-14108 tests: mount with discovery and network
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 038db7e1872e6ef743921744fb09f0f0c201124a


            "Cyril Bordage <cbordage@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/46632
            Subject: LU-14108 mount: prevent if --network and discovery
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: f5f0dba9c253a25ba9b6d06a0b4e361aa48690d5


            Serguei Smirnov added a comment:
            To reproduce, use the --network option when formatting targets with mkfs.lustre. The following is an example from the customer setup:

            "

            The error occurs between the MDS (md1) and the OSSs (all OSSs show similar errors). For example, rcf2-OST000d is mounted on os2, but md1 repeatedly tries to connect to it via os1. Normally rcf2-OST000d is mounted on os2 (the primary OSS for this OST), so the error should not happen.

            tunefs.lustre shows network=o2ib1; the primary is os2 (172.20.2.26@o2ib1) and the secondary is os1 (172.20.2.25@o2ib1).

            "

            [root@os2 ~]# tunefs.lustre --dryrun /dev/mapper/OST0d
            checking for existing Lustre data: found
            
               Read previous values:
            Target:     rcf2-OST000d
            Index:      13
            Lustre FS:  rcf2
            Mount type: ldiskfs
            Flags:      0x2
                          (OST )
            Persistent mount opts: errors=remount-ro
            Parameters:  mgsnode=172.20.2.11@o2ib:172.20.2.12@o2ib failover.node=172.20.2.25@o2ib1 network=o2ib1
            
            
               Permanent disk data:
            Target:     rcf2-OST000d
            Index:      13
            Lustre FS:  rcf2
            Mount type: ldiskfs
            Flags:      0x2
                          (OST )
            Persistent mount opts: errors=remount-ro
            Parameters:  mgsnode=172.20.2.11@o2ib:172.20.2.12@o2ib failover.node=172.20.2.25@o2ib1 network=o2ib1
            
            exiting before disk write.

            It should be possible to reproduce this with a simpler setup.
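
            For example, a minimal two-network sketch (interface names, devices, fsname, and the MGS NID are hypothetical):

            # configure two LNet networks on the OSS and leave discovery enabled (the default)
            lnetctl lnet configure
            lnetctl net add --net tcp --if eth0
            lnetctl net add --net tcp1 --if eth1
            lnetctl set discovery 1

            # format the OST restricted to tcp1, then mount it
            mkfs.lustre --ost --fsname=testfs --index=0 \
                --mgsnode=<mgs NID> --network=tcp1 /dev/sdb
            mount -t lustre /dev/sdb /mnt/ost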

            The symptom is an error message similar to the following:

            Oct 22 11:58:27 os1-s kernel: Lustre: rcf2-OST0001: Received new MDS connection from 172.20.2.11@o2ib, removing former export from same NID
            Oct 22 11:58:27 os1-s kernel: Lustre: Skipped 47 previous similar messages
             ...
            Oct 22 12:02:13 os1-s kernel: LustreError: 137-5: rcf2-OST000d_UUID: not available for connect from 172.20.2.11@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
            Oct 22 12:02:13 os1-s kernel: LustreError: Skipped 47 previous similar messages

            To see how the targets are registered with the MGS, run a command similar to the following on the MGS:

            lctl --device MGS llog_print rcf2-MDT0000

            This should show that only NIDs on the network specified with the mkfs --network option were used.
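
            If the configuration log is long, one illustrative way to list just the recorded NIDs is to filter the output:

            lctl --device MGS llog_print rcf2-MDT0000 | grep -oE '[0-9.]+@[a-z0-9]+' | sort -u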


            People

              Assignee: Cyril Bordage (cbordage)
              Reporter: Serguei Smirnov (ssmirnov)
              Votes: 0
              Watchers: 3

              Dates

                Created:
                Updated:
                Resolved: