Lustre / LU-4311

Mount sometimes fails with EIO on OSS with several mounts in parallel


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: Lustre 2.4.1
    • Severity: 3

    Description

      On one of our test clusters, installed with Lustre 2.4.1, we sometimes saw the following error message in the output of the "shine" command-line tool when starting a Lustre file system; when it occurs, the corresponding OST is not mounted:

      mount.lustre: mount /dev/mapper/mpathj at /mnt/fs1/ost/6 failed: Input/output error
      Is the MGS running?
      
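      The MGS is in fact running when this happens; as a quick sanity check, LNET connectivity from the OSS to the MGS can be verified with lctl ping, using the MGS NIDs that appear in the kernel traces below:

      lctl ping 10.3.0.10@o2ib
      lctl ping 10.4.0.10@o2ib1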

      The test file system is composed of six servers: one MDS (with one MDT), four OSS (three with two OSTs each, one with a single OST), and a separate MGS.
      Configuration (see attached config_parameters file for details):
      MGS: lama5 (failover lama6)
      MDS: lama6 (failover lama5)
      OSS: lama7 (failover lama8, lama9 and lama10) to lama10 (failover lama7, lama8 and lama9)

      When the error occurs, we see the following Lustre kernel traces on the MGS (-114 is -EALREADY):

      MGS: Client <client_name> seen on new nid <nid2> when existing nid <nid1> is already connected
      ...
      @@@ MGS fail to handle opc = 250: rc = -114
      ...
      

      and on the OSS (-5 is -EIO, which mount reports as "Input/output error"):

      InitRecov MGC10.3.0.10@o2ib 1/d0:i0:r0:or0:NEW
      ...
      InitRecov MGC10.3.0.10@o2ib 1/d0:i0:r1:or0:CONNECTING
      ...
      recovery of MGS on MGC10.3.0.10@o2ib_0 failed (-5)
      ...
      MGS: recovery started, waiting 100000 seconds
      ...
      MGC10.3.0.10@o2ib: Communicating with 10.4.0.10@o2ib1, operation mgs_connect failed with -114
      ...
      recovery of MGS on MGC10.3.0.10@o2ib_0 failed (-114)
      MGS: recovery finished
      ...
      fs1-OST0005: cannot register this server with the MGS: rc = -5. Is the MGS running?
      ...
      Unable to start targets: -5
      ...
      Unable to mount  (-5)
      

      I was able to reproduce the error without shine, and with only one OSS, using the script below.
      The MGS (lama5) and MDS (lama6) are started/mounted, and the script is run on lama10.
      If the tunefs.lustre calls or the lustre_rmmod are removed, or if the first mount is run in the foreground (see the serialized variant after the script), the error does not occur.

      N=1
      rm -f error stop
      while true; do
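        # Rewrite both targets' configurations with --writeconf, so that each
        # iteration makes the OSTs re-register with the MGS from scratch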
              tunefs.lustre --erase-params --quiet "--mgsnode=lama5-ic1@o2ib0,lama5-ic2@o2ib1" \
                   "--mgsnode=lama6-ic1@o2ib0,lama6-ic2@o2ib1" "--failnode=lama7-ic1@o2ib0" \
                   "--failnode=lama8-ic1@o2ib0" "--failnode=lama9-ic1@o2ib0" \
                    --network=o2ib0 --writeconf /dev/ldn.cook.ost3 > /dev/null
      
              tunefs.lustre --erase-params --quiet "--mgsnode=lama5-ic1@o2ib0,lama5-ic2@o2ib1" \
                   "--mgsnode=lama6-ic1@o2ib0,lama6-ic2@o2ib1" "--failnode=lama7-ic2@o2ib1" \
                   "--failnode=lama8-ic2@o2ib1" "--failnode=lama9-ic2@o2ib1" \
                   --network=o2ib1 --writeconf /dev/ldn.cook.ost6 > /dev/null
      
              modprobe fsfilt_ldiskfs
              modprobe lustre
              ssh lama5 lctl clear
              dmesg -c > /dev/null
              ssh lama5 dmesg -c > /dev/null
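        # Launch the two mounts in parallel (first one in the background):
        # both registrations then race on the node's single shared MGC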
              (/bin/mount -t lustre -o errors=panic /dev/ldn.cook.ost3 /mnt/fs1/ost/5 || touch error) &
              /bin/mount -t lustre -o errors=panic /dev/ldn.cook.ost6 /mnt/fs1/ost/6 || touch error
              wait
              if [ -f error ]; then
                      lctl dk > oss.lustre.dk.bad
                      ssh lama5 lctl dk > mgs.lustre.dk.bad
                      dmesg > oss.dmesg.bad
                      ssh lama5 dmesg > mgs.dmesg.bad
              else
                      lctl dk > oss.lustre.dk.good
                      ssh lama5 lctl dk > mgs.lustre.dk.good
                      dmesg > oss.dmesg.good
                      ssh lama5 dmesg > mgs.dmesg.good
              fi
              umount /mnt/fs1/ost/5
              umount /mnt/fs1/ost/6
              lustre_rmmod
              [ -f stop -o -f error ] && break
              [ $N -ge 25 ] && break
              echo "============================> loop $N"
              N=$((N+1))
      done
      
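      As noted above, the failure does not occur when the first mount is run in the foreground. The serialized variant of the mount step looks like this (a sketch: only these two lines change, the rest of the loop is identical); the two OSTs then never register with the MGS concurrently:

      /bin/mount -t lustre -o errors=panic /dev/ldn.cook.ost3 /mnt/fs1/ost/5 || touch error
      /bin/mount -t lustre -o errors=panic /dev/ldn.cook.ost6 /mnt/fs1/ost/6 || touch error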

      I have attached a tarball containing the config parameters, the reproducer, and the files produced by the reproducer:
      reproducer
      config_parameters
      mgs.dmesg.good, mgs.lustre.dk.good, oss.dmesg.good, oss.lustre.dk.good
      mgs.dmesg.bad, mgs.lustre.dk.bad, oss.dmesg.bad, oss.lustre.dk.bad
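      The double connection can be spotted in the bad-case logs by searching for the MGS messages quoted above, for example:

      grep -E 'seen on new nid|opc = 250' mgs.dmesg.bad mgs.lustre.dk.bad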

      I have tried the following patch, which skips the forced reconnection at INIT_RECOV_BACKUP time when the import is already in the CONNECTING state, i.e. when a connection attempt is already in progress.
      With this patch the mount no longer fails, but it is only a workaround: it does not solve the underlying problem of the double connection on the MGS. Some serialisation/synchronisation is probably missing.

      --- a/lustre/mgc/mgc_request.c
      +++ b/lustre/mgc/mgc_request.c
      @@ -1029,6 +1029,7 @@ int mgc_set_info_async(const struct lu_e
                              ptlrpc_import_state_name(imp->imp_state));
                       /* Resurrect if we previously died */
                       if ((imp->imp_state != LUSTRE_IMP_FULL &&
      +                     imp->imp_state != LUSTRE_IMP_CONNECTING &&
                            imp->imp_state != LUSTRE_IMP_NEW) || value > 1)
                               ptlrpc_reconnect_import(imp);
                       RETURN(0);
      


People

    Assignee: Bruno Faccini (Inactive)
    Reporter: Patrick Valentin (Inactive)
    Votes: 0
    Watchers: 4
