
[LU-14399] mount MDT takes very long with hsm enable

Details


    Description

      We observed that when mounting an MDT with HSM enabled, the mount command takes minutes, compared to seconds before. We see this in the log:

      [53618.238941] Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-mds2; mount -t lustre -o localrecov  /dev/mapper/mds2_flakey /mnt/lustre-mds2
      [53618.624098] LDISKFS-fs (dm-6): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
      [53720.390690] Lustre: 1722736:0:(mdt_coordinator.c:1114:mdt_hsm_cdt_start()) lustre-MDT0001: trying to init HSM before MDD
      [53720.392834] LustreError: 1722736:0:(mdt_coordinator.c:1125:mdt_hsm_cdt_start()) lustre-MDT0001: cannot take the layout locks needed for registered restore: -2
      [53720.398049] LustreError: 1722741:0:(mdt_coordinator.c:1090:mdt_hsm_cdt_start()) lustre-MDT0001: Coordinator already started or stopping
      [53720.400681] Lustre: lustre-MDT0001: Imperative Recovery not enabled, recovery window 60-180
      [53720.424872] Lustre: lustre-MDT0001: in recovery but waiting for the first client to connect
      [53720.953893] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n health_check
      [53722.067555] Lustre: DEBUG MARKER:  
      

      Seems related to LU-13920


          Activity

            [LU-14399] mount MDT takes very long with hsm enable
            bzzz Alex Zhuravlev added a comment - - edited

            well, I'm running the master branch, so clearly LU-14399 is in. this is what helps:

            -       while (!test_bit(MDT_FL_CFGLOG, &mdt->mdt_state) && i < obd_timeout) {
            +       while (!test_bit(MDT_FL_CFGLOG, &mdt->mdt_state) && i < obd_timeout * 2) {
            

            not sure how good it is...
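
            (For comparison only: an event-driven wait would avoid the one-second polling altogether. A minimal sketch, assuming MDD were to set MDT_FL_CFGLOG and then wake a hypothetical mdt_cfglog_waitq; neither the waitqueue nor this helper exists in the current tree:)

            /* Sketch only: mdt_cfglog_waitq is hypothetical; MDD would need to do
             *     set_bit(MDT_FL_CFGLOG, &mdt->mdt_state);
             *     wake_up(&mdt->mdt_cfglog_waitq);
             * once the hsm_actions llog is ready. */
            static void cdt_wait_for_cfglog(struct mdt_device *mdt)
            {
                /* wait_event_interruptible_timeout() returns 0 on timeout, > 0 if the
                 * bit was set within the window, -ERESTARTSYS on a signal */
                long rc = wait_event_interruptible_timeout(
                        mdt->mdt_cfglog_waitq,
                        test_bit(MDT_FL_CFGLOG, &mdt->mdt_state),
                        cfs_time_seconds(obd_timeout));

                if (rc <= 0)
                    CWARN("%s: trying to init HSM before MDD\n", mdt_obd_name(mdt));
            }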


            scherementsev Sergey Cheremencev added a comment -

            I guess "trying to init HSM before MDD" is a hint?

            IMO, it hints that your build doesn't include "LU-14399 hsm: process hsm_actions in coordinator".

            static void cdt_start_pending_restore(struct mdt_device *mdt,
                                                  struct coordinator *cdt)
            {
                struct mdt_thread_info *cdt_mti;
                unsigned int i = 0;
                int rc;

                /* wait until MDD initializes the hsm_actions llog */
                while (!test_bit(MDT_FL_CFGLOG, &mdt->mdt_state) && i < obd_timeout) {
                    schedule_timeout_interruptible(cfs_time_seconds(1));
                    i++;
                }
                if (!test_bit(MDT_FL_CFGLOG, &mdt->mdt_state))
                    CWARN("%s: trying to init HSM before MDD\n", mdt_obd_name(mdt));
             

            "trying to init HSM before MDD" message should be printed by cdt_start_pending_restore, while in your case it is mdt_hsm_cdt_start.

            bzzz Alex Zhuravlev added a comment -

            [ 8878.042110] Lustre: Found index 0 for lustre-MDT0000, updating log
            [ 8878.044766] Lustre: Modifying parameter lustre-MDT0000.mdt.identity_upcall in log lustre-MDT0000
            [ 8898.880071] Lustre: 497204:0:(mdt_coordinator.c:1145:mdt_hsm_cdt_start()) lustre-MDT0000: trying to init HSM before MDD
            [ 8898.888824] LustreError: 497204:0:(mdt_coordinator.c:1156:mdt_hsm_cdt_start()) lustre-MDT0000: cannot take the layout locks needed for registered restore: -2
            [ 8899.150565] Lustre: DEBUG MARKER: conf-sanity test_132: @@@@@@ FAIL: Can not take the layout lock
            

            I guess "trying to init HSM before MDD" is a hint?


            scherementsev Sergey Cheremencev added a comment -

            Alex, it still doesn't fail on my local VM.
            I can look into the logs if you attach them to the ticket.

            bzzz Alex Zhuravlev added a comment - - edited

            well, essentially just a local VM:

            FSTYPE=ldiskfs MDSCOUNT=2 MDSSIZE=300000 OSTSIZE=400000 OSTCOUNT=2 LOGDIR=/tmp/ltest-logs REFORMAT=yes HONOR_EXCEPT=y bash conf-sanity.sh 

            scherementsev Sergey Cheremencev added a comment -

            Hello Alex,

            what setup did you use? It doesn't fail on my local VM on the latest master (0feec5a3).


            bzzz Alex Zhuravlev added a comment -

            the patch that just landed fails every run on my setup:

            ...
            Writing CONFIGS/mountdata
            start mds service on tmp.BKaRODHgLn
            Starting mds1: -o localrecov /dev/mapper/mds1_flakey /mnt/lustre-mds1
            Started lustre-MDT0000
            conf-sanity test_132: @@@@@@ FAIL: Can not take the layout lock
            Trace dump:
            = ./../tests/test-framework.sh:6389:error()
            = conf-sanity.sh:9419:test_132()
            = ./../tests/test-framework.sh:6693:run_one()
            = ./../tests/test-framework.sh:6740:run_one_logged()
            = ./../tests/test-framework.sh:6581:run_test()
            = conf-sanity.sh:9422:main()
            Dumping lctl log to /tmp/ltest-logs/conf-sanity.test_132.*.1642612854.log
            Dumping logs only on local client.
            FAIL 132 (84s)

            pjones Peter Jones added a comment -

            Landed for 2.15


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/41445/
            Subject: LU-14399 hsm: process hsm_actions in coordinator
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: e26d7cc3992252e5fce5a51aee716f933b04c13a

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/41445/ Subject: LU-14399 hsm: process hsm_actions in coordinator Project: fs/lustre-release Branch: master Current Patch Set: Commit: e26d7cc3992252e5fce5a51aee716f933b04c13a

            Sergey Cheremencev (sergey.cheremencev@hpe.com) uploaded a new patch: https://review.whamcloud.com/42005
            Subject: LU-14399 tests: hsm_actions after failover
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 87493ef365d9faaf1f6c1e1a40f65157d37f72dc


            People

              scherementsev Sergey Cheremencev
              mdiep Minh Diep
