  1. Lustre
  2. LU-13464

MDT0000 remount in recovery 40 hours

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: Lustre 2.14.0
    • Labels:
    • Severity:
      3

      Description

      Summary
      Metadata server mds01 encountered an ldiskfs error, which caused mdt0000 to be remounted read-only.
      Exascaler HA then remounted the mdt0000 volume, which stayed in recovery for 40+ hours.

      Details

      mds01 encountered the ldiskfs error "illegal pblock", which caused mdt0000 to be remounted read-only

      Mar 21 14:41:20 mds01 kernel: LDISKFS-fs error (device dm-18): ldiskfs_map_blocks:592: inode #1426025594: block 1836017711: comm mdt00_094: lblock 0 mapped to illegal pblock 1836017711 (length 1)
      Mar 21 14:41:20 mds01 kernel: Aborting journal on device dm-18-8.
      Mar 21 14:41:20 mds01 kernel: LustreError: 47100:0:(osd_handler.c:1727:osd_trans_commit_cb()) transaction @0xffff9575746da700 commit error: 2
      Mar 21 14:41:20 mds01 kernel: LDISKFS-fs (dm-18): Remounting filesystem read-only
      Mar 21 14:41:20 mds01 kernel: LustreError: 47222:0:(osd_io.c:1833:osd_ldiskfs_read()) dm-18: can't read 59@0 on ino 1426025594: rc = -5
      Mar 21 14:41:20 mds01 kernel: LustreError: 47222:0:(mdd_dir.c:4507:mdd_migrate()) eaglefs-MDD0000: [0x2001457f2:0xbb37:0x0] readlink failed: rc = -5
      Mar 21 14:41:20 mds01 kernel: LDISKFS-fs warning (device dm-18): kmmpd:186: kmmpd being stopped since filesystem has been remounted as readonly.
      

      Exascaler HA then remounted the mdt0000 volume

      Mar 21 14:42:25 mds01 kernel: LDISKFS-fs warning (device dm-18): ldiskfs_multi_mount_protect:321: MMP interval 42 higher than expected, please wait.\x0a
      Mar 21 14:43:08 mds01 kernel: LDISKFS-fs warning (device dm-18): ldiskfs_clear_journal_err:4994: Filesystem error recorded from previous mount: IO failure
      Mar 21 14:43:08 mds01 kernel: LDISKFS-fs warning (device dm-18): ldiskfs_clear_journal_err:4995: Marking fs in need of filesystem check.
      Mar 21 14:43:08 mds01 kernel: LDISKFS-fs (dm-18): warning: mounting fs with errors, running e2fsck is recommended
      Mar 21 14:43:08 mds01 kernel: LDISKFS-fs (dm-18): recovery complete
      Mar 21 14:43:08 mds01 kernel: LDISKFS-fs (dm-18): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
      Mar 21 14:43:09 mds01 kernel: Lustre: eaglefs-MDT0000: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
      

      mds01 began recovery

      Mar 21 14:43:09 mds01 kernel: Lustre: eaglefs-MDT0000: Will be in recovery for at least 2:30, or until 2199 clients reconnect
      Mar 21 14:48:46 mds01 kernel: Lustre: eaglefs-MDT0000: Denying connection for new client 4b73dcdb-c341-5e93-6e2d-56df59cccb3c (at 10.148.4.147@o2ib), waiting for 2199 known clients (2152 recovered, 44 in progress, and 3 evicted) to recover in 0:53
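
The recovery progress reported in these messages can be tracked by pulling the client counts out of each "Denying connection" line. The following is a minimal sketch in Python, assuming the message wording shown in the log above; the function name `parse_recovery_counts` is hypothetical and is not part of any Lustre tool.

```python
import re

# Pattern assumes the "waiting for N known clients (...)" wording seen in the log above.
_COUNTS = re.compile(
    r"waiting for (?P<known>\d+) known clients "
    r"\((?P<recovered>\d+) recovered, (?P<in_progress>\d+) in progress, "
    r"and (?P<evicted>\d+) evicted\)"
)


def parse_recovery_counts(line):
    """Return the client counts from a 'Denying connection' line, or None."""
    match = _COUNTS.search(line)
    if match is None:
        return None
    counts = {key: int(value) for key, value in match.groupdict().items()}
    # Known clients that are not recovered, in progress, or evicted.
    counts["unaccounted"] = counts["known"] - (
        counts["recovered"] + counts["in_progress"] + counts["evicted"]
    )
    return counts


line = (
    "Lustre: eaglefs-MDT0000: Denying connection for new client "
    "4b73dcdb-c341-5e93-6e2d-56df59cccb3c (at 10.148.4.147@o2ib), "
    "waiting for 2199 known clients (2152 recovered, 44 in progress, "
    "and 3 evicted) to recover in 0:53"
)
print(parse_recovery_counts(line))
```

Running the same parser over the later messages in this report gives identical counts (2152 recovered, 44 in progress, 3 evicted), which is consistent with the 44 in-progress clients never completing recovery.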
      

      mds01 recovery was aborted by "hard timeout" about 42 hours later

      Mar 23 08:31:14 mds01 kernel: Lustre: eaglefs-MDT0000: Denying connection for new client de9156e9-00e3-cbf3-a651-7566c63216df (at 10.148.0.125@o2ib), waiting for 2199 known clients (2152 recovered, 44 in progress, and 3 evicted) already passed deadline 2507:31
      Mar 23 08:37:35 mds01 kernel: Lustre: eaglefs-OST0014-osc-MDT0000: Connection to eaglefs-OST0014 (at 10.148.66.34@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      Mar 23 08:38:00 mds01 kernel: Lustre: eaglefs-OST003c-osc-MDT0000: Connection to eaglefs-OST003c (at 10.148.66.84@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      Mar 23 08:38:47 mds01 kernel: Lustre: eaglefs-OST000a-osc-MDT0000: Connection to eaglefs-OST000a (at 10.148.66.22@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      Mar 23 08:41:12 mds01 kernel: Lustre: eaglefs-MDT0000: Denying connection for new client de9156e9-00e3-cbf3-a651-7566c63216df (at 10.148.0.125@o2ib), waiting for 2199 known clients (2152 recovered, 44 in progress, and 3 evicted) already passed deadline 2517:31
      
      Mar 23 08:44:39 mds01 kernel: Lustre: 64029:0:(ldlm_lib.c:2046:target_recovery_overseer()) eaglefs-MDT0000 recovery is aborted by hard timeout
      Mar 23 08:44:39 mds01 kernel: Lustre: 64029:0:(ldlm_lib.c:2056:target_recovery_overseer()) recovery is aborted, evict exports in recovery
      Mar 23 08:44:41 mds01 kernel: Lustre: eaglefs-MDT0000: Recovery over after 2527:32, of 2199 clients 0 recovered and 2199 were evicted.
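
To put the reported recovery time in perspective, the "2527:32" value can be converted to hours and compared with the wall-clock span between the first and last recovery messages. This is a rough sketch that assumes the value is minutes:seconds (consistent with "at least 2:30" matching the 150-second window logged earlier). The helper name `mmss_to_hours` is hypothetical, and the year is an assumption because the syslog lines omit it.

```python
from datetime import datetime


def mmss_to_hours(value):
    """Convert a 'minutes:seconds' string such as '2527:32' to hours."""
    minutes, seconds = value.split(":")
    return (int(minutes) * 60 + int(seconds)) / 3600


# Duration reported by "Recovery over after 2527:32".
reported_hours = mmss_to_hours("2527:32")
print(f"reported: {reported_hours:.1f} hours")

# Wall-clock span between the "Will be in recovery" and "Recovery over" lines.
# The year 2020 is assumed; it does not affect the difference.
start = datetime(2020, 3, 21, 14, 43, 9)
end = datetime(2020, 3, 23, 8, 44, 41)
wall_hours = (end - start).total_seconds() / 3600
print(f"wall clock: {wall_hours:.1f} hours")
```

Both figures come out at roughly 42 hours, far beyond the 150-900 second recovery window reported at mount time.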
      

            People

            • Assignee:
              hongchao.zhang Hongchao Zhang
              Reporter:
              hongchao.zhang Hongchao Zhang