MDT0000 remount in recovery 40 hours

    Description

      Summary
      Metadata server mds01 encountered an ldiskfs error, which caused mdt0000 to be remounted read-only.
      Exascaler HA then remounted the mdt0000 volume, which stayed in recovery for 40+ hours.

      Details

      mds01 encountered the ldiskfs error "illegal pblock", which caused mdt0000 to be remounted read-only:

      Mar 21 14:41:20 mds01 kernel: LDISKFS-fs error (device dm-18): ldiskfs_map_blocks:592: inode #1426025594: block 1836017711: comm mdt00_094: lblock 0 mapped to illegal pblock 1836017711 (length 1)
      Mar 21 14:41:20 mds01 kernel: Aborting journal on device dm-18-8.
      Mar 21 14:41:20 mds01 kernel: LustreError: 47100:0:(osd_handler.c:1727:osd_trans_commit_cb()) transaction @0xffff9575746da700 commit error: 2
      Mar 21 14:41:20 mds01 kernel: LDISKFS-fs (dm-18): Remounting filesystem read-only
      Mar 21 14:41:20 mds01 kernel: LustreError: 47222:0:(osd_io.c:1833:osd_ldiskfs_read()) dm-18: can't read 59@0 on ino 1426025594: rc = -5
      Mar 21 14:41:20 mds01 kernel: LustreError: 47222:0:(mdd_dir.c:4507:mdd_migrate()) eaglefs-MDD0000: [0x2001457f2:0xbb37:0x0] readlink failed: rc = -5
      Mar 21 14:41:20 mds01 kernel: LDISKFS-fs warning (device dm-18): kmmpd:186: kmmpd being stopped since filesystem has been remounted as readonly.
      

      Exascaler HA then remounted the mdt0000 volume:

      Mar 21 14:42:25 mds01 kernel: LDISKFS-fs warning (device dm-18): ldiskfs_multi_mount_protect:321: MMP interval 42 higher than expected, please wait.\x0a
      Mar 21 14:43:08 mds01 kernel: LDISKFS-fs warning (device dm-18): ldiskfs_clear_journal_err:4994: Filesystem error recorded from previous mount: IO failure
      Mar 21 14:43:08 mds01 kernel: LDISKFS-fs warning (device dm-18): ldiskfs_clear_journal_err:4995: Marking fs in need of filesystem check.
      Mar 21 14:43:08 mds01 kernel: LDISKFS-fs (dm-18): warning: mounting fs with errors, running e2fsck is recommended
      Mar 21 14:43:08 mds01 kernel: LDISKFS-fs (dm-18): recovery complete
      Mar 21 14:43:08 mds01 kernel: LDISKFS-fs (dm-18): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
      Mar 21 14:43:09 mds01 kernel: Lustre: eaglefs-MDT0000: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
      

      mds01 began recovery:

      Mar 21 14:43:09 mds01 kernel: Lustre: eaglefs-MDT0000: Will be in recovery for at least 2:30, or until 2199 clients reconnect
      Mar 21 14:48:46 mds01 kernel: Lustre: eaglefs-MDT0000: Denying connection for new client 4b73dcdb-c341-5e93-6e2d-56df59cccb3c (at 10.148.4.147@o2ib), waiting for 2199 known clients (2152 recovered, 44 in progress, and 3 evicted) to recover in 0:53
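
The client counts in the "Denying connection" message above can be pulled out mechanically, which helps when tracking whether recovery is making progress across many such lines. The following is a minimal sketch, assuming the message wording stays exactly as in the log excerpt; the function name `parse_recovery_counts` and the regular expression are illustrative and not part of Lustre.

```python
import re

# Matches the client counts in a "Denying connection" recovery message.
# The pattern is an assumption based only on the log line quoted in this ticket.
STATUS_RE = re.compile(
    r"waiting for (?P<known>\d+) known clients "
    r"\((?P<recovered>\d+) recovered, (?P<in_progress>\d+) in progress, "
    r"and (?P<evicted>\d+) evicted\)"
)


def parse_recovery_counts(line):
    """Return the client counts as a dict of ints, or None if the line does not match."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


line = (
    "Mar 21 14:48:46 mds01 kernel: Lustre: eaglefs-MDT0000: Denying connection "
    "for new client 4b73dcdb-c341-5e93-6e2d-56df59cccb3c (at 10.148.4.147@o2ib), "
    "waiting for 2199 known clients (2152 recovered, 44 in progress, and 3 evicted) "
    "to recover in 0:53"
)

counts = parse_recovery_counts(line)
# Clients not covered by any of the three reported states.
unaccounted = counts["known"] - (
    counts["recovered"] + counts["in_progress"] + counts["evicted"]
)
print(counts, "unaccounted:", unaccounted)
```

In this ticket the same counts (2152 recovered, 44 in progress, 3 evicted) appear again in the messages from Mar 23, so comparing the parsed values between lines shows that the 44 in-progress clients never completed.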
      

      Recovery on mds01 was aborted by the "hard timeout" 41 hours later:

      Mar 23 08:31:14 mds01 kernel: Lustre: eaglefs-MDT0000: Denying connection for new client de9156e9-00e3-cbf3-a651-7566c63216df (at 10.148.0.125@o2ib), waiting for 2199 known clients (2152 recovered, 44 in progress, and 3 evicted) already passed deadline 2507:31
      Mar 23 08:37:35 mds01 kernel: Lustre: eaglefs-OST0014-osc-MDT0000: Connection to eaglefs-OST0014 (at 10.148.66.34@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      Mar 23 08:38:00 mds01 kernel: Lustre: eaglefs-OST003c-osc-MDT0000: Connection to eaglefs-OST003c (at 10.148.66.84@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      Mar 23 08:38:47 mds01 kernel: Lustre: eaglefs-OST000a-osc-MDT0000: Connection to eaglefs-OST000a (at 10.148.66.22@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      Mar 23 08:41:12 mds01 kernel: Lustre: eaglefs-MDT0000: Denying connection for new client de9156e9-00e3-cbf3-a651-7566c63216df (at 10.148.0.125@o2ib), waiting for 2199 known clients (2152 recovered, 44 in progress, and 3 evicted) already passed deadline 2517:31
      
      Mar 23 08:44:39 mds01 kernel: Lustre: 64029:0:(ldlm_lib.c:2046:target_recovery_overseer()) eaglefs-MDT0000 recovery is aborted by hard timeout
      Mar 23 08:44:39 mds01 kernel: Lustre: 64029:0:(ldlm_lib.c:2056:target_recovery_overseer()) recovery is aborted, evict exports in recovery
      Mar 23 08:44:41 mds01 kernel: Lustre: eaglefs-MDT0000: Recovery over after 2527:32, of 2199 clients 0 recovered and 2199 were evicted.
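
The durations in these messages ("already passed deadline 2507:31", "Recovery over after 2527:32") are easier to compare with the "40 hours" in the title once converted to hours. The sketch below assumes the values are formatted as minutes:seconds; the helper names `mmss_to_seconds` and `parse_recovery_over` are hypothetical and exist only for this illustration.

```python
import re

# Matches the final recovery summary line. The pattern is an assumption based
# only on the "Recovery over after ..." line quoted in this ticket.
OVER_RE = re.compile(
    r"Recovery over after (?P<elapsed>\d+:\d+), of (?P<clients>\d+) clients "
    r"(?P<recovered>\d+) recovered and (?P<evicted>\d+) were evicted"
)


def mmss_to_seconds(text):
    """Convert a 'minutes:seconds' string to seconds (format is assumed)."""
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + int(seconds)


def parse_recovery_over(line):
    """Return elapsed seconds and client counts, or None if the line does not match."""
    match = OVER_RE.search(line)
    if match is None:
        return None
    return {
        "elapsed_s": mmss_to_seconds(match.group("elapsed")),
        "clients": int(match.group("clients")),
        "recovered": int(match.group("recovered")),
        "evicted": int(match.group("evicted")),
    }


line = (
    "Mar 23 08:44:41 mds01 kernel: Lustre: eaglefs-MDT0000: Recovery over after "
    "2527:32, of 2199 clients 0 recovered and 2199 were evicted."
)

summary = parse_recovery_over(line)
hours, remainder = divmod(summary["elapsed_s"], 3600)
minutes, seconds = divmod(remainder, 60)
print(f"recovery lasted {hours}h {minutes:02d}m {seconds:02d}s", summary)
```

Under the minutes:seconds assumption, the reported 2527:32 corresponds to a little over 42 hours, with all 2199 clients evicted at the end.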
      

        Activity

          [LU-13464] MDT0000 remount in recovery 40 hours

          Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41171
          Subject: LU-13464 ldlm: add recovery time limit
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: ea7a9006d1fe7d3ac5b8a027325901bfbad43a21

          Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40303/
          Subject: LU-13464 target: abort recovery if timer fail
          Project: fs/lustre-release
          Branch: b2_12
          Current Patch Set:
          Commit: d1f4de9bb568affc523dcbc46d82f4a6676990de

          Jian Yu (yujian@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40303
          Subject: LU-13464 target: abort recovery if timer fail
          Project: fs/lustre-release
          Branch: b2_12
          Current Patch Set: 1
          Commit: 6ccd96714ce1061c1018086ae5ac1f228f618ed0

          pjones Peter Jones added a comment -

          Landed for 2.14

          Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38277/
          Subject: LU-13464 target: abort recovery if timer fail
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 87443d9c27e8535c3e17d6bf142ad68d4449b93f

          Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38277
          Subject: LU-13464 target: abort recovery if timer fail
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 34ca5b9121654a0a125e2dbd0e3984faf74b6804

          People

            hongchao.zhang Hongchao Zhang