LU-5298: The lwp device cannot be started when we migrate from Lustre 2.1 to Lustre 2.4


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.4.3
    • Component/s: None
    • Environment: RHEL6 w/ patched kernel for Lustre server
    • Severity: 3
    • Rank: 14784

    Description

      We have an issue with the quotas on our filesystems after the upgrade from
      Lustre 2.1.6 to Lustre 2.4.3.

      The quotas have been successfully enabled on all target devices using
      'tunefs.lustre --quota $device'. The enforcement has been enabled with 'lctl conf_param scratch.quota.mdt=ug' and 'lctl conf_param scratch.quota.ost=ug' on
      the MGS. However, the enforcement does not work and users can exceed their
      quota limits.
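
      For reference, the commands were along these lines (the device path below is
      illustrative; tunefs.lustre --quota was run against each unmounted MDT and OST
      device):

      # tunefs.lustre --quota /dev/mdt0_device
      # lctl conf_param scratch.quota.mdt=ug
      # lctl conf_param scratch.quota.ost=ug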

      Checking quota_slave.info on the MDT:

      # lctl get_param osd-*.*.quota_slave.info
      osd-ldiskfs.scratch-MDT0000.quota_slave.info=
      target name:    scratch-MDT0000
      pool ID:        0
      type:           md
      quota enabled:  ug
      conn to master: not setup yet
      space acct:     ug
      user uptodate:  glb[0],slv[0],reint[1]
      group uptodate: glb[0],slv[0],reint[1]
      

      We can see that the connection to the QMT is not set up yet. I also noticed
      that the lwp device is not started, so no callback can be sent to the QMT.
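
      One way to confirm this (a check of our own, not part of the original output)
      is to list the local OBD devices on the MDS; when the lwp device has not been
      started, nothing matches:

      # lctl dl | grep lwp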

      Looking at the code, it appears that the lwp device cannot be started when we
      migrate from Lustre 2.1 to Lustre 2.4.

      In lustre/obdclass/obd_mount_server.c:

       717 /**
       718  * Retrieve MDT nids from the client log, then start the lwp device.
       719  * there are only two scenarios which would include mdt nid.
       720  * 1.
       721  * marker   5 (flags=0x01, v2.1.54.0) lustre-MDT0000  'add mdc' xxx-
       722  * add_uuid  nid=192.168.122.162@tcp(0x20000c0a87aa2)  0:  1:192.168.122.162@tcp
       723  * attach    0:lustre-MDT0000-mdc  1:mdc  2:lustre-clilmv_UUID
       724  * setup     0:lustre-MDT0000-mdc  1:lustre-MDT0000_UUID  2:192.168.122.162@tcp
       725  * add_uuid  nid=192.168.172.1@tcp(0x20000c0a8ac01)  0:  1:192.168.172.1@tcp
       726  * add_conn  0:lustre-MDT0000-mdc  1:192.168.172.1@tcp
       727  * modify_mdc_tgts add 0:lustre-clilmv  1:lustre-MDT0000_UUID xxxx
       728  * marker   5 (flags=0x02, v2.1.54.0) lustre-MDT0000  'add mdc' xxxx-
       729  * 2.
       730  * marker   7 (flags=0x01, v2.1.54.0) lustre-MDT0000  'add failnid' xxxx-
       731  * add_uuid  nid=192.168.122.2@tcp(0x20000c0a87a02)  0:  1:192.168.122.2@tcp
       732  * add_conn  0:lustre-MDT0000-mdc  1:192.168.122.2@tcp
       733  * marker   7 (flags=0x02, v2.1.54.0) lustre-MDT0000  'add failnid' xxxx-
       734  **/
       735 static int client_lwp_config_process(const struct lu_env *env,
       736                      struct llog_handle *handle,
       737                      struct llog_rec_hdr *rec, void *data)
       738 {
      [...]
       779         /* Don't try to connect old MDT server without LWP support,
       780          * otherwise, the old MDT could regard this LWP client as
       781          * a normal client and save the export on disk for recovery.
       782          *
       783          * This usually happen when rolling upgrade. LU-3929 */
       784         if (marker->cm_vers < OBD_OCD_VERSION(2, 3, 60, 0))
       785             GOTO(out, rc = 0);
      

      The function checks the MDT server version recorded in the llog. I checked on
      the MGS of our Lustre 2.4 filesystem, and the device scratch-MDT0000-mdc is
      recorded with version 2.1.6.0:

      #09 (224)marker   5 (flags=0x01, v2.1.6.0) scratch-MDT0000 'add mdc' Sat Jul  5 14:40:44 2014-
      #10 (088)add_uuid  nid=192.168.122.41@tcp(0x20000c0a87a29)  0:  1:192.168.122.41@tcp
      #11 (128)attach    0:scratch-MDT0000-mdc  1:mdc  2:scratch-clilmv_UUID
      #12 (144)setup     0:scratch-MDT0000-mdc  1:scratch-MDT0000_UUID  2:192.168.122.41@tcp
      #13 (168)modify_mdc_tgts add 0:scratch-clilmv  1:scratch-MDT0000_UUID  2:0  3:1  4:scratch-MDT0000-mdc_UUID
      #14 (224)marker   5 (flags=0x02, v2.1.6.0) scratch-MDT0000 'add mdc' Sat Jul  5 14:40:44 2014-
      
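      For reference, a config log in this format can be dumped from the MGS with the
      llog_reader utility; the device and file names below are illustrative:

      # debugfs -c -R 'dump CONFIGS/scratch-client /tmp/scratch-client' /dev/mgs_device
      # llog_reader /tmp/scratch-client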

      After a writeconf on the filesystem, the llog has been updated and the device
      scratch-MDT0000-mdc is now registered with version 2.4.3.0:

      #09 (224)marker   6 (flags=0x01, v2.4.3.0) scratch-MDT0000 'add mdc' Sat Jul  5 15:19:27 2014-
      #10 (088)add_uuid  nid=192.168.122.41@tcp(0x20000c0a87a29)  0:  1:192.168.122.41@tcp
      #11 (128)attach    0:scratch-MDT0000-mdc  1:mdc  2:scratch-clilmv_UUID
      #12 (144)setup     0:scratch-MDT0000-mdc  1:scratch-MDT0000_UUID  2:192.168.122.41@tcp
      #13 (168)modify_mdc_tgts add 0:scratch-clilmv  1:scratch-MDT0000_UUID  2:0  3:1  4:scratch-MDT0000-mdc_UUID
      #14 (224)marker   6 (flags=0x02, v2.4.3.0) scratch-MDT0000 'add mdc' Sat Jul  5 15:19:27 2014-
      
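      For reference, such a writeconf typically follows the procedure from the
      operations manual (device paths below are illustrative): stop the clients and
      all targets, run tunefs.lustre --writeconf on each target, then remount the
      MGS/MDT before the OSTs.

      # tunefs.lustre --writeconf /dev/mdt0_device
      # tunefs.lustre --writeconf /dev/ost0_device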

      Checking quota_slave.info after the writeconf:

      # lctl get_param osd-*.*.quota_slave.info
      osd-ldiskfs.scratch-MDT0000.quota_slave.info=
      target name:    scratch-MDT0000
      pool ID:        0
      type:           md
      quota enabled:  none
      conn to master: setup
      space acct:     ug
      user uptodate:  glb[0],slv[0],reint[0]
      group uptodate: glb[0],slv[0],reint[0]
      # lctl conf_param scratch.quota.mdt=ug
      # lctl conf_param scratch.quota.ost=ug
      # lctl get_param osd-*.*.quota_slave.info
      osd-ldiskfs.scratch-MDT0000.quota_slave.info=
      target name:    scratch-MDT0000
      pool ID:        0
      type:           md
      quota enabled:  ug
      conn to master: setup
      space acct:     ug
      user uptodate:  glb[1],slv[1],reint[0]
      group uptodate: glb[1],slv[1],reint[0]
      

      The same behavior is observed on the OSTs.
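
      The corresponding check on an OSS would be, for instance (the wildcard pattern
      is illustrative):

      # lctl get_param osd-*.*OST*.quota_slave.info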

      It would be better to:

      • get the current version of the MDT server instead of the one recorded in the llog,
      • or update the operations manual to require a writeconf when upgrading to a major release,
      • or add a CWARN before the GOTO(out, rc = 0) at lustre/obdclass/obd_mount_server.c:785 (see the sketch below).
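
      As an illustration of the last option, here is a minimal sketch of what such a
      warning could look like, based on the excerpt quoted above (the message text
      and the use of marker->cm_tgtname are assumptions, not an actual patch):

      /* Hypothetical sketch: warn instead of silently skipping the lwp setup when
       * the config llog was recorded by a pre-2.3.60 MDT. */
      if (marker->cm_vers < OBD_OCD_VERSION(2, 3, 60, 0)) {
              CWARN("%s: config log was recorded by an old MDT (%d.%d.%d.%d), "
                    "the lwp device will not be started; a writeconf is needed "
                    "to refresh the configuration logs\n",
                    marker->cm_tgtname,
                    OBD_OCD_VERSION_MAJOR(marker->cm_vers),
                    OBD_OCD_VERSION_MINOR(marker->cm_vers),
                    OBD_OCD_VERSION_PATCH(marker->cm_vers),
                    OBD_OCD_VERSION_FIX(marker->cm_vers));
              GOTO(out, rc = 0);
      }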

      I think this issue may be related to LU-5192.

            People

              Assignee: Niu Yawei (niu)
              Reporter: Bruno Travouillon (bruno.travouillon)
