LU-15517

MDT stuck in recovery if one other MDT is failed over to partner node

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.8
    • Environment: 3.10.0-1160.53.1.1chaos.ch6.x86_64
      zfs-0.7.11-9.8llnl.ch6.x86_64
      lustre-2.12.8_6.llnl-1.ch6.x86_64
    • Severity: 3

    Description

      An MDT fails to enter recovery when another MDT is running on its failover partner MDS. The recovery_status file indicates "WAITING" and reports the MDT running on the partner node as a non-ready MDT.
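
      For reference, the recovery state of the stuck target can be read with lctl; a minimal example, assuming the target name from this report:

      # on the MDS hosting the stuck target, e.g. lquake-MDT0009
      lctl get_param mdt.lquake-MDT0009.recovery_status
      # the output includes the "status:" line (WAITING here) and the list
      # of non-ready MDTs that recovery is waiting for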

      The console log of the MDT unable to enter recovery repeatedly shows messages like:

      [Thu Feb  3 12:15:18 2022] Lustre: 4102:0:(ldlm_lib.c:1827:extend_recovery_timer()) lquake-MDT0009: extended recovery timer reached hard limit: 900, extend: 1
      [Thu Feb  3 12:15:18 2022] Lustre: 4102:0:(ldlm_lib.c:1827:extend_recovery_timer()) Skipped 29 previous similar messages
      

      My steps to reproduce:
      1. Start all MDTs on their primary MDS (in my case, that's jet1...jet16 => MDT0000...MDT000f). Allow them to complete recovery.
      2. umount MDT000e on jet15, and mount it on jet16. Allow it to complete recovery.
      3. umount MDT0009 on jet10, and then mount it again (on the same node, jet10, where it was happily running moments ago).
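
      As a sketch, the failover in step 2 amounts to something like the following on a ZFS-backed MDT; the pool, dataset, and mount point names are assumptions, not taken from this report:

      # on jet15: stop MDT000e
      umount /mnt/lustre/mdt000e
      # on jet16 (failover partner): import the pool and start MDT000e there
      zpool import mdt000e-pool
      mount -t lustre mdt000e-pool/mdt000e /mnt/lustre/mdt000e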

      Debug log shows that the update log for MDT000e was not received. There are no messages in the console log regarding MDT000e after MDT0009 starts up.
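
      A minimal sketch of how such a debug log can be captured on the affected MDS; the debug flags and file name are illustrative:

      # enable RPC tracing before remounting the stuck MDT
      lctl set_param debug=+rpctrace
      # ... reproduce the hang ...
      # dump the kernel debug buffer to a file
      lctl dk /tmp/dk.jet10.txt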

      I determined the PID of the thread lod_sub_recovery_thread() for MDT000e. The debug log shows the thread starts up, follows roughly this sequence of calls, and never returns from ptlrpc_set_wait().

      lod_sub_prep_llog -> llog_osd_get_cat_list -> dt_locate_at -> lu_object_find_at -> ? ->
      osp_attr_get -> osp_remote_sync -> ptlrpc_queue_wait -> ptlrpc_set_wait
      

      So apparently the RPC never times out: the upper layers never get a chance to retry, no error messages report the problem, and the MDT never enters recovery.
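
      One way to confirm where that thread is blocked, given the PID found above, is the kernel's per-task stack file (the PID is a placeholder):

      # substitute the PID of the lod_sub_recovery thread for MDT000e
      cat /proc/<PID>/stack
      # expected to show the thread sleeping under ptlrpc_set_wait()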

      Our patch stack is:

      82ea54e (tag: 2.12.8_6.llnl) LU-13356 client: don't use OBD_CONNECT_MNE_SWAB
      d06f5b2 LU-15357 mdd: fix changelog context leak
      543b60b LU-9964 llite: prevent mulitple group locks
      d776b67 LU-15234 lnet: Race on discovery queue
      77040da Revert "LU-15234 lnet: Race on discovery queue"
      bba827c LU-15234 lnet: Race on discovery queue
      7faa872 LU-14865 utils: llog_reader.c printf type mismatch
      5dc104e LU-13946 build: OpenZFS 2.0 compatibility
      0fef268 TOSS-4917 grant: chatty warning in tgt_grant_incoming
      5b70822 TOSS-4917 grant: improve debug for grant calcs
      8af02e8 log lfs setstripe paths to syslog
      391be81 Don't install lustre init script on systemd systems
      c613926 LLNL build customizations
      a4f71cd TOSS-4431 build: build ldiskfs only for x86_64
      067cb55 (tag: v2_12_8, tag: 2.12.8) New release 2.12.8

      See https://github.com/LLNL/lustre/tree/2.12.8-llnl for the details.
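
      For reference, a listing like the one above can be regenerated from that tree with git, using the tags shown in the log (a sketch):

      git clone -b 2.12.8-llnl https://github.com/LLNL/lustre.git
      cd lustre
      git log --oneline 2.12.8..2.12.8_6.llnl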

      Attachments

        1. dk.1.jet16.txt.gz
          287 kB
          Olaf Faaland
        2. dk.3.txt.gz
          5.18 MB
          Olaf Faaland
        3. dmesg.jet16.txt.gz
          26 kB
          Olaf Faaland
        4. dmesg.txt.gz
          27 kB
          Olaf Faaland
        5. lu-15517.llog_reader.out
          328 kB
          Olaf Faaland

        Activity

          pjones Peter Jones made changes -
          Link Original: This issue is related to JFC-21
          ofaaland Olaf Faaland made changes -
          Labels Original: llnl topllnl New: llnl
          tappro Mikhail Pershin added a comment - - edited

          Olaf, I'd propose adding several patches to your stack. I think the general problem is that the update llogs became quite big on one or several MDTs. There is a series of patches that manages update llogs more gracefully (top to bottom):

          https://review.whamcloud.com/47011 - LU-15645 obdclass: llog to handle gaps
          https://review.whamcloud.com/47010 - LU-13195 osp: osp_send_update_req() should check generation
          https://review.whamcloud.com/45847 - LU-12577 llog: protect partial updates from readers
          https://review.whamcloud.com/46863 - LU-13195 osp: invalidate object on write error
          https://review.whamcloud.com/46873 - mostly to make the one above apply cleanly

          This might help the MDT process the update logs after all, if the problem is in them as I expect. Another patch that could also be useful concerns OSP_DISCONNECT handling between MDTs:

          https://review.whamcloud.com/44753 - osp: do force disconnect if import is not ready

          Mike
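
          A sketch of fetching one of those changes from Gerrit for backporting; the patch-set number ("1") is a placeholder, check the change page for the current one:

          # Gerrit change refs have the form refs/changes/<last 2 digits>/<change>/<patchset>
          git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/11/47011/1
          git cherry-pick FETCH_HEAD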

           


          In all cases, the lod_sub_process_config() error was preceded by an osp_disconnect() error, like this:

          2022-04-11 14:46:58 [1649713618.877752] Lustre: Failing over lquake-MDT0004
          2022-04-11 14:46:58 [1649713618.896752] LustreError: 11-0: lquake-MDT0001-osp-MDT0004: operation mds_disconnect to node 172.19.1.112@o2ib100 failed: rc = -107
          2022-04-11 14:46:58 [1649713618.915752] Lustre: lquake-MDT000e-osp-MDT0004: Connection to lquake-MDT000e (at 172.19.1.125@o2ib100) was lost; in progress operations using this service will wait for recovery to complete
          2022-04-11 14:46:58 [1649713618.936752] LustreError: 5267:0:(osp_dev.c:485:osp_disconnect()) lquake-MDT0003-osp-MDT0004: can't disconnect: rc = -19
          2022-04-11 14:46:58 [1649713618.950752] LustreError: 5267:0:(lod_dev.c:267:lod_sub_process_config()) lquake-MDT0004-mdtlov: error cleaning up LOD index 3: cmd 0xcf031 : rc = -19 

          I also saw these errors on jet6 (which was hosting MDT0005):

          2022-04-11 14:48:39 [1649713719.122533] LustreError: 30035:0:(osp_object.c:594:osp_attr_get()) lquake-MDT0000-osp-MDT0005:osp_attr_get update error [0x20000000a:0x0:0x0]: rc = -5
          2022-04-11 14:48:39 [1649713719.141533] LustreError: 30035:0:(llog_cat.c:447:llog_cat_close()) lquake-MDT0000-osp-MDT0005: failure destroying log during cleanup: rc = -5
          2022-04-11 14:48:39 [1649713719.467531] Lustre: server umount lquake-MDT0005 complete 
          ofaaland Olaf Faaland added a comment - - edited

          Hi Mikhail,
          I found several errors reported by lod_sub_process_config(). They were not encountered during my experiment, but earlier today when I stopped the targets prior to a software update. Note that they appeared on several nodes, all of them MDTs.

          Apr 11 14:46:58 jet2 kernel: LustreError: 27555:0:(lod_dev.c:267:lod_sub_process_config()) lquake-MDT0001-mdtlov: error cleaning up LOD index 3: cmd 0xcf031 : rc = -19
          Apr 11 14:46:58 jet3 kernel: LustreError: 15820:0:(lod_dev.c:267:lod_sub_process_config()) lquake-MDT0002-mdtlov: error cleaning up LOD index 3: cmd 0xcf031 : rc = -19
          Apr 11 14:46:58 jet5 kernel: LustreError: 5267:0:(lod_dev.c:267:lod_sub_process_config()) lquake-MDT0004-mdtlov: error cleaning up LOD index 3: cmd 0xcf031 : rc = -19
          Apr 11 14:46:58 jet7 kernel: LustreError: 22882:0:(lod_dev.c:267:lod_sub_process_config()) lquake-MDT0006-mdtlov: error cleaning up LOD index 3: cmd 0xcf031 : rc = -19
          Apr 11 14:46:58 jet8 kernel: LustreError: 21185:0:(lod_dev.c:267:lod_sub_process_config()) lquake-MDT0007-mdtlov: error cleaning up LOD index 3: cmd 0xcf031 : rc = -19
          Apr 11 14:46:58 jet9 kernel: LustreError: 3160:0:(lod_dev.c:267:lod_sub_process_config()) lquake-MDT0008-mdtlov: error cleaning up LOD index 1: cmd 0xcf031 : rc = -19
          Apr 11 14:46:58 jet10 kernel: LustreError: 18778:0:(lod_dev.c:267:lod_sub_process_config()) lquake-MDT0009-mdtlov: error cleaning up LOD index 1: cmd 0xcf031 : rc = -19
          Apr 11 14:46:58 jet12 kernel: LustreError: 3992:0:(lod_dev.c:267:lod_sub_process_config()) lquake-MDT000b-mdtlov: error cleaning up LOD index 3: cmd 0xcf031 : rc = -19
          Apr 11 14:46:58 jet14 kernel: LustreError: 1060:0:(lod_dev.c:267:lod_sub_process_config()) lquake-MDT000d-mdtlov: error cleaning up LOD index 1: cmd 0xcf031 : rc = -19
          Apr 11 14:46:58 jet15 kernel: LustreError: 22650:0:(lod_dev.c:267:lod_sub_process_config()) lquake-MDT000e-mdtlov: error cleaning up LOD index 3: cmd 0xcf031 : rc = -19
          

          Olaf, the reason I asked about reproducing this on a clean system is that there have been several issues with update log processing causing MDT-MDT recovery to fail. Could you check whether any of the MDTs reported errors from lod_sub_recovery_thread() or llog_process_thread(), or any other llog-processing complaints, while the MDT is stuck in the WAITING state?
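
          As a sketch, one quick way to check is to scan each MDS console log for those functions; the saved log path is an assumption:

          # on each MDS hosting a stuck or peer MDT
          dmesg | grep -E 'lod_sub_recovery_thread|llog_process_thread'
          grep -E 'lod_sub_recovery_thread|llog_process_thread' /var/log/messages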


          Olaf, no need to upload all the llogs right now; let me think a bit first.


          Ah, I missed that you've uploaded the llog_reader output. It looks like it knows about the update log; let me check it first.


          Olaf, please check these files with llog_reader. I wonder if any of them contain 'holes' in the records; the llog_reader in your version of Lustre might not be able to translate update records correctly and may report them as 'Unknown' type, so you can check. I'd also ask you to upload the files listed above here for analysis, is that possible?
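
          For reference, once an update llog file has been copied out of the MDT backend, it can be dumped as below; the input path is a placeholder, and the output corresponds to the attached lu-15517.llog_reader.out:

          # 'holes' show up as gaps in the record indexes; records the tool
          # cannot translate are printed with type 'Unknown'
          llog_reader /tmp/extracted_update_llog > lu-15517.llog_reader.out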


          People

             Assignee: tappro Mikhail Pershin
             Reporter: ofaaland Olaf Faaland
             Votes: 0
             Watchers: 9
