Target does not mount with the new mgsnode parameter format in case of multirail configuration

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.9.0
    • Affects Version/s: Lustre 2.5.3
    • Environment: Lustre 2.5.3.90 w/ Bull patches, including LU-5690
    • Severity: 3

    Description

      We are unable to mount the targets on Lustre servers when using multirail configuration on the MGS.

      LU-4334 introduced a format change of the mgsnode value on the targets.

      Old format:
      mgsnode=192.168.101.41@tcp,192.168.102.41@tcp1 mgsnode=192.168.101.42@tcp,192.168.102.42@tcp1

      New format:
      mgsnode=192.168.101.41@tcp,192.168.102.41@tcp1:192.168.101.42@tcp,192.168.102.42@tcp1
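
      The relationship between the two formats can be sketched as follows. This is only an illustration of the string layout shown above (NIDs of one MGS node separated by commas, failover nodes separated by colons). The helper names join_mgsnodes and split_mgsnode are hypothetical and are not part of Lustre.

```python
def join_mgsnodes(per_node_values):
    """Join one comma-separated NID list per MGS node into the new
    colon-separated form (hypothetical helper, not a Lustre API)."""
    return "mgsnode=" + ":".join(per_node_values)


def split_mgsnode(param):
    """Parse 'mgsnode=a,b:c,d' into [[a, b], [c, d]]:
    ':' separates failover nodes, ',' separates NIDs of one node."""
    key, _, value = param.partition("=")
    if key != "mgsnode" or not value:
        raise ValueError(f"not an mgsnode parameter: {param!r}")
    return [node.split(",") for node in value.split(":")]


# Old format: one mgsnode= parameter per MGS node.
old = [
    "mgsnode=192.168.101.41@tcp,192.168.102.41@tcp1",
    "mgsnode=192.168.101.42@tcp,192.168.102.42@tcp1",
]

# New format: a single mgsnode= parameter covering both nodes.
new = join_mgsnodes(p.partition("=")[2] for p in old)
nodes = split_mgsnode(new)
print(new)
print(nodes)
```

      A parser that handles the new format has to split on ':' before splitting on ',', since each failover node may itself carry several NIDs.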

      With patch LU-5690, we are now unable to start any target with this new format. We can see this Lustre error in the console of the OSS:

      LDISKFS-fs (vdb): Unrecognized mount option "192.168.102.41@tcp1:192.168.101.42@tcp" or missing value

      The debug log reports the following message while trying to mount OST 0:

      00000020:01200004:0.0F:1466084531.621867:0:2966:0:(obd_mount.c:1339:lustre_fill_super()) VFS Op: sb ffff88001f583c00
      00000020:01000004:0.0:1466084531.621882:0:2966:0:(obd_mount.c:830:lmd_print()) mount data:
      00000020:01000004:0.0:1466084531.621883:0:2966:0:(obd_mount.c:833:lmd_print()) device: /dev/vdb
      00000020:01000004:0.0:1466084531.621884:0:2966:0:(obd_mount.c:834:lmd_print()) flags: 0
      00000020:01000004:0.0:1466084531.621884:0:2966:0:(obd_mount.c:837:lmd_print()) options: errors=remount-ro,192.168.102.41@tcp1:192.168.101.42@tcp,192.168.102.42@tcp1
      00000020:01000004:0.0:1466084531.621885:0:2966:0:(obd_mount.c:1386:lustre_fill_super()) Mounting server from /dev/vdb
      00000020:01000004:0.0:1466084531.621887:0:2966:0:(obd_mount_server.c:1627:osd_start()) Attempting to start scratch-OST0000, type=osd-ldiskfs, lsifl=200002, mountfl=0
      00000020:01000004:0.0:1466084531.621925:0:2966:0:(obd_mount.c:191:lustre_start_simple()) Starting obd scratch-OST0000-osd (typ=osd-ldiskfs)
      00000004:00020000:0.0:1466084531.623545:0:2966:0:(osd_handler.c:5613:osd_mount()) scratch-OST0000-osd: can't mount /dev/vdb: -22
      00000020:00020000:0.0:1466084531.624487:0:2966:0:(obd_config.c:572:class_setup()) setup scratch-OST0000-osd failed (-22)
      00000020:00020000:0.0:1466084531.625290:0:2966:0:(obd_mount.c:200:lustre_start_simple()) scratch-OST0000-osd setup error -22
      00000020:01000000:0.0:1466084531.626153:0:2966:0:(obd_config.c:750:class_decref()) finishing cleanup of obd scratch-OST0000-osd (scratch-OST0000-osd_UUID)
      00000020:00020000:0.0:1466084531.626156:0:2966:0:(obd_mount_server.c:1701:server_fill_super()) Unable to start osd on /dev/vdb: -22
      00000020:01000004:0.0:1466084531.627005:0:2966:0:(obd_mount.c:653:lustre_put_lsi()) put ffff88001f583c00 1
      00000020:01000004:0.0:1466084531.627007:0:2966:0:(obd_mount.c:603:lustre_free_lsi()) Freeing lsi ffff880017c67000
      00000020:00020000:0.0:1466084531.627009:0:2966:0:(obd_mount.c:1405:lustre_fill_super()) Unable to mount (-22)
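
      One way to read the "options:" line in the log is that only the first comma-delimited piece of the mgsnode value was consumed, and the rest was handed to ldiskfs as if it were filesystem options. The sketch below is a simplified model of that behavior, not the kernel's lmd_parse() code. The order of options in the input string and the KNOWN_BACKEND_OPTS set are assumptions made for illustration.

```python
def naive_strip_mgsnode(options):
    """Simplified model: split on ',', consume only tokens that start with
    'mgsnode=', and pass everything else through to the backend."""
    tokens = options.split(",")
    consumed = [t for t in tokens if t.startswith("mgsnode=")]
    leftover = [t for t in tokens if not t.startswith("mgsnode=")]
    return consumed, ",".join(leftover)


# Hypothetical subset of option names the backend filesystem accepts.
KNOWN_BACKEND_OPTS = {"errors"}


def first_unrecognized(options):
    """Return the first comma-delimited token whose name is unknown."""
    for token in options.split(","):
        if token.partition("=")[0] not in KNOWN_BACKEND_OPTS:
            return token
    return None


# Assumed input: persistent mount opts followed by the new-format parameter.
mount_opts = (
    "errors=remount-ro,"
    "mgsnode=192.168.101.41@tcp,192.168.102.41@tcp1:"
    "192.168.101.42@tcp,192.168.102.42@tcp1"
)

consumed, passed_to_backend = naive_strip_mgsnode(mount_opts)
rejected = first_unrecognized(passed_to_backend)
print(consumed)
print(passed_to_backend)
print(rejected)
```

      Under these assumptions, the leftover string matches the "options:" line reported by lmd_print(), and the first unknown token matches the one quoted in the LDISKFS-fs console error.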

      This is easily reproducible with Lustre 2.5.3.90+LU-5690.

      1. tunefs.lustre --erase-params --mgsnode=192.168.101.41@tcp,192.168.102.41@tcp1 --mgsnode=192.168.101.42@tcp,192.168.102.42@tcp1 /dev/vdb
        checking for existing Lustre data: found
        Reading CONFIGS/mountdata

      Read previous values:
      Target: scratch-OST0000
      Index: 0
      Lustre FS: scratch
      Mount type: ldiskfs
      Flags: 0x42
      (OST update )
      Persistent mount opts: errors=remount-ro
      Parameters: mgsnode=192.168.101.41@tcp,192.168.102.41@tcp1 mgsnode=192.168.101.42@tcp,192.168.102.42@tcp1

      Permanent disk data:
      Target: scratch-OST0000
      Index: 0
      Lustre FS: scratch
      Mount type: ldiskfs
      Flags: 0x42
      (OST update )
      Persistent mount opts: errors=remount-ro
      Parameters: mgsnode=192.168.101.41@tcp,192.168.102.41@tcp1:192.168.101.42@tcp,192.168.102.42@tcp1

      Writing CONFIGS/mountdata

      2. mount -t lustre /dev/vdb /mnt/fs/scratch/ost0
        mount.lustre: set /sys/block/vdb/queue/max_sectors_kb to 2147483647

      mount.lustre: mount /dev/vdb at /mnt/fs/scratch/ost0 failed: Invalid argument
      This may have multiple causes.
      Are the mount options correct?
      Check the syslog for more info.

          Activity

            [LU-8311] Target does not mount with the new mgsnode parameter format in case of multirail configuration
            yujian Jian Yu added a comment -

            Thank you Darby. The issue will be resolved in LU-8397.

            dvicker Darby Vicker added a comment -

            Thanks a lot for the info. If you need any more data from me, I'd be glad to post that, either to this LU or to one of those others.

            I'd like to try reverting that patch from the 2.9 release and see if it fixes our issue.  Please let me know if you think that's worthwhile and, if so, which LU I should post the info to.  

            yujian Jian Yu added a comment -

            Hi Darby,

            From the debug log in debug.log.ib_and_eth, I didn't see any MGS NID that was not parsed by lmd_parse(). So, it's not the same issue as this one.
            It looks like there is a defect in lustre_start_mgc() introduced by patch https://review.whamcloud.com/7509 for LU-3829. And a similar issue was reported in LU-8397. I'll add comments in that ticket.

            dvicker Darby Vicker added a comment - - edited

            I just uploaded a couple of debug logs. Both were taken while mounting an OST on one of our OSSs. One was taken while we were configured only for ethernet.

                   tunefs.lustre \
                       --verbose \
                       --writeconf \
                       --erase-param \
                       --mgsnode=192.52.98.30@tcp0 \
                       --mgsnode=192.52.98.31@tcp0 \
                       --servicenode=${LUSTRE_LOCAL_TCP_IP}@tcp0 \
                       --servicenode=${LUSTRE_PEER_TCP_IP}@tcp0 \
                       $pool/ost-fsl
            
            

            The other was while configured with both IB and ethernet.

                   tunefs.lustre \
                       --verbose \
                       --writeconf \
                       --erase-param \
                       --mgsnode=192.52.98.30@tcp0,10.148.0.30@o2ib0 \
                       --mgsnode=192.52.98.31@tcp0,10.148.0.31@o2ib0 \
                       --servicenode=${LUSTRE_LOCAL_TCP_IP}@tcp0,${LUSTRE_LOCAL_IB_IP}@o2ib0 \
                       --servicenode=${LUSTRE_PEER_TCP_IP}@tcp0,${LUSTRE_PEER_IB_IP}@o2ib0 \
                       $pool/ost-fsl
            
            
            dvicker Darby Vicker added a comment -

            Good to know - will do.


            adilger Andreas Dilger added a comment -

            It is important to note that the 2.9.51 code is a development tag and as such has no expectation of being tested or supported. Development tags may contain protocol changes and experimental code, so unless you are using this only for testing the stability of the development branch, I would suggest going back to 2.9.0.

            dvicker Darby Vicker added a comment -

            We are running into something very similar to this, but I'm not sure if it's related or something different. There is a lot of detail in a thread on the mailing list; here is a link to one of the latest posts.

            http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/2017-January/014154.html

            The summary of our situation is that our LFS was originally formatted using 2.8, but we have since upgraded to 2.9.51. We are using a JBOD with server pairs for failover, with ZFS as the backend. All servers are dual-homed on both ethernet and IB. MDT and OST failover works fine. MGS failover doesn't work if we have both ethernet and IB NIDs, but does if we only have ethernet NIDs. We have built our own Lustre server RPMs using a "git checkout 2.9.51" and zfs 0.6.5.8-1. I've verified that commit 2458067d8d55173ad68caac8c0460d46bf8106a1 is in the git log. Any help would be much appreciated.

            pjones Peter Jones added a comment -

            Landed for 2.9


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/23355/
            Subject: LU-8311 doc: add NIDs examples to mkfs.lustre and mount.lustre
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: fd488bacbf31b623815cf86ca21e5d6c888c068e

            gerrit Gerrit Updater added a comment -

            Jian Yu (jian.yu@intel.com) uploaded a new patch: http://review.whamcloud.com/23355
            Subject: LU-8311 doc: add NIDs examples to mkfs.lustre and mount.lustre
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 7bfd77d5b7a34bf8dba529276d6516e40d59cc3d

            People

              Assignee: yujian Jian Yu
              Reporter: bruno.travouillon Bruno Travouillon (Inactive)
              Votes: 0
              Watchers: 10
