[LU-13464] MDT0000 remount in recovery 40 hours Created: 19/Apr/20  Updated: 08/Jan/21  Resolved: 27/May/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.14.0, Lustre 2.12.6

Type: Bug Priority: Minor
Reporter: Hongchao Zhang Assignee: Hongchao Zhang
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Summary
Metadata server mds01 encountered an ldiskfs error, which caused mdt0000 to be remounted read-only.
Exascaler HA then remounted the mdt0000 volume, which remained in recovery for 40+ hours.

Details

mds01 encountered the ldiskfs error "illegal pblock", which caused mdt0000 to be remounted read-only

Mar 21 14:41:20 mds01 kernel: LDISKFS-fs error (device dm-18): ldiskfs_map_blocks:592: inode #1426025594: block 1836017711: comm mdt00_094: lblock 0 mapped to illegal pblock 1836017711 (length 1)
Mar 21 14:41:20 mds01 kernel: Aborting journal on device dm-18-8.
Mar 21 14:41:20 mds01 kernel: LustreError: 47100:0:(osd_handler.c:1727:osd_trans_commit_cb()) transaction @0xffff9575746da700 commit error: 2
Mar 21 14:41:20 mds01 kernel: LDISKFS-fs (dm-18): Remounting filesystem read-only
Mar 21 14:41:20 mds01 kernel: LustreError: 47222:0:(osd_io.c:1833:osd_ldiskfs_read()) dm-18: can't read 59@0 on ino 1426025594: rc = -5
Mar 21 14:41:20 mds01 kernel: LustreError: 47222:0:(mdd_dir.c:4507:mdd_migrate()) eaglefs-MDD0000: [0x2001457f2:0xbb37:0x0] readlink failed: rc = -5
Mar 21 14:41:20 mds01 kernel: LDISKFS-fs warning (device dm-18): kmmpd:186: kmmpd being stopped since filesystem has been remounted as readonly.

Exascaler HA then remounted the mdt0000 volume

Mar 21 14:42:25 mds01 kernel: LDISKFS-fs warning (device dm-18): ldiskfs_multi_mount_protect:321: MMP interval 42 higher than expected, please wait.\x0a
Mar 21 14:43:08 mds01 kernel: LDISKFS-fs warning (device dm-18): ldiskfs_clear_journal_err:4994: Filesystem error recorded from previous mount: IO failure
Mar 21 14:43:08 mds01 kernel: LDISKFS-fs warning (device dm-18): ldiskfs_clear_journal_err:4995: Marking fs in need of filesystem check.
Mar 21 14:43:08 mds01 kernel: LDISKFS-fs (dm-18): warning: mounting fs with errors, running e2fsck is recommended
Mar 21 14:43:08 mds01 kernel: LDISKFS-fs (dm-18): recovery complete
Mar 21 14:43:08 mds01 kernel: LDISKFS-fs (dm-18): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
Mar 21 14:43:09 mds01 kernel: Lustre: eaglefs-MDT0000: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900

mds01 began recovery

Mar 21 14:43:09 mds01 kernel: Lustre: eaglefs-MDT0000: Will be in recovery for at least 2:30, or until 2199 clients reconnect
Mar 21 14:48:46 mds01 kernel: Lustre: eaglefs-MDT0000: Denying connection for new client 4b73dcdb-c341-5e93-6e2d-56df59cccb3c (at 10.148.4.147@o2ib), waiting for 2199 known clients (2152 recovered, 44 in progress, and 3 evicted) to recover in 0:53
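The client counters in the "Denying connection" message can be pulled out and cross-checked. The following is a minimal Python sketch, not part of the original report; the regular expression and the function names parse_recovery_status and unaccounted_clients are illustrative assumptions based only on the message format shown above.

```python
import re

# Pattern for the client counters in a "Denying connection" console message.
# The pattern is derived from the log lines in this report, not from Lustre source.
STATUS_RE = re.compile(
    r"waiting for (?P<known>\d+) known clients "
    r"\((?P<recovered>\d+) recovered, (?P<in_progress>\d+) in progress, "
    r"and (?P<evicted>\d+) evicted\)"
)


def parse_recovery_status(line):
    """Return the client counters from one log line, or None if absent."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


def unaccounted_clients(status):
    """Known clients that are not recovered, in progress, or evicted."""
    return status["known"] - (
        status["recovered"] + status["in_progress"] + status["evicted"]
    )


SAMPLE = (
    "Mar 21 14:48:46 mds01 kernel: Lustre: eaglefs-MDT0000: Denying connection "
    "for new client 4b73dcdb-c341-5e93-6e2d-56df59cccb3c (at 10.148.4.147@o2ib), "
    "waiting for 2199 known clients (2152 recovered, 44 in progress, and 3 evicted) "
    "to recover in 0:53"
)

status = parse_recovery_status(SAMPLE)
print(status)
print("unaccounted:", unaccounted_clients(status))
```

For the sample line, 2152 + 44 + 3 = 2199, so every known client is accounted for. This suggests that the 44 clients still "in progress" were the ones recovery was waiting on.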

mds01 recovery was aborted by the "hard timeout" 41 hours later

Mar 23 08:31:14 mds01 kernel: Lustre: eaglefs-MDT0000: Denying connection for new client de9156e9-00e3-cbf3-a651-7566c63216df (at 10.148.0.125@o2ib), waiting for 2199 known clients (2152 recovered, 44 in progress, and 3 evicted) already passed deadline 2507:31
Mar 23 08:37:35 mds01 kernel: Lustre: eaglefs-OST0014-osc-MDT0000: Connection to eaglefs-OST0014 (at 10.148.66.34@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Mar 23 08:38:00 mds01 kernel: Lustre: eaglefs-OST003c-osc-MDT0000: Connection to eaglefs-OST003c (at 10.148.66.84@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Mar 23 08:38:47 mds01 kernel: Lustre: eaglefs-OST000a-osc-MDT0000: Connection to eaglefs-OST000a (at 10.148.66.22@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Mar 23 08:41:12 mds01 kernel: Lustre: eaglefs-MDT0000: Denying connection for new client de9156e9-00e3-cbf3-a651-7566c63216df (at 10.148.0.125@o2ib), waiting for 2199 known clients (2152 recovered, 44 in progress, and 3 evicted) already passed deadline 2517:31

Mar 23 08:44:39 mds01 kernel: Lustre: 64029:0:(ldlm_lib.c:2046:target_recovery_overseer()) eaglefs-MDT0000 recovery is aborted by hard timeout
Mar 23 08:44:39 mds01 kernel: Lustre: 64029:0:(ldlm_lib.c:2056:target_recovery_overseer()) recovery is aborted, evict exports in recovery
Mar 23 08:44:41 mds01 kernel: Lustre: eaglefs-MDT0000: Recovery over after 2527:32, of 2199 clients 0 recovered and 2199 were evicted.
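
The durations in these messages appear to be in minutes:seconds. That reading is an inference from the log itself, where the 150-second minimum window is printed as "2:30". Under that assumption, a short Python sketch converts the reported values to hours; the helper names are illustrative.

```python
def mmss_to_seconds(text):
    """Convert a 'minutes:seconds' string such as '2527:32' to seconds."""
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + int(seconds)


def seconds_to_hours(seconds):
    """Convert a number of seconds to fractional hours."""
    return seconds / 3600.0


# Values copied from the log lines above.
for label, value in [
    ("minimum recovery window", "2:30"),
    ("past deadline at 08:31:14", "2507:31"),
    ("total recovery time", "2527:32"),
]:
    secs = mmss_to_seconds(value)
    print(f"{label}: {value} = {secs} s = {seconds_to_hours(secs):.2f} h")
```

With this interpretation, "2527:32" is 151652 seconds, or roughly 42.1 hours, which is consistent with the 40+ hours stated in the summary.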


 Comments   
Comment by Gerrit Updater [ 19/Apr/20 ]

Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38277
Subject: LU-13464 target: abort recovery if timer fail
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 34ca5b9121654a0a125e2dbd0e3984faf74b6804
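
The patch subject suggests that recovery should still be aborted when the recovery timer does not do so. The contents of the patch are not reproduced in this report, so the following Python sketch is only a generic illustration of that idea and not the actual change: the elapsed time is checked against a hard limit on every event instead of relying on a timer callback. The class RecoveryDeadline and the function handle_event are hypothetical names, and using 900 seconds (the upper bound of the "150-900" recovery window in the log) as the hard limit is an assumption.

```python
import time


class RecoveryDeadline:
    """Hypothetical helper: tracks a hard recovery limit without a timer callback."""

    def __init__(self, hard_limit_s, clock=time.monotonic):
        self._clock = clock
        self._start = clock()
        self._hard_limit_s = hard_limit_s

    def elapsed(self):
        """Seconds since recovery started, according to the injected clock."""
        return self._clock() - self._start

    def expired(self):
        """True once the hard limit is reached, whether or not any timer fired."""
        return self.elapsed() >= self._hard_limit_s


def handle_event(deadline, abort):
    """Call on every recovery event, for example a client connection attempt.

    Returns True if recovery was aborted by this call.
    """
    if deadline.expired():
        abort()
        return True
    return False


# Usage with a fake clock so the behaviour can be observed without waiting.
now = [0.0]
deadline = RecoveryDeadline(900, clock=lambda: now[0])
aborted = []

now[0] = 899.0
handle_event(deadline, lambda: aborted.append("abort"))  # still inside the limit
now[0] = 900.0
handle_event(deadline, lambda: aborted.append("abort"))  # limit reached, abort runs
print(aborted)
```

In this sketch each denied connection attempt, like the ones logged above, would be an opportunity to notice that the limit has passed.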

Comment by Gerrit Updater [ 27/May/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38277/
Subject: LU-13464 target: abort recovery if timer fail
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 87443d9c27e8535c3e17d6bf142ad68d4449b93f

Comment by Peter Jones [ 27/May/20 ]

Landed for 2.14

Comment by Gerrit Updater [ 19/Oct/20 ]

Jian Yu (yujian@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40303
Subject: LU-13464 target: abort recovery if timer fail
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 6ccd96714ce1061c1018086ae5ac1f228f618ed0

Comment by Gerrit Updater [ 29/Oct/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40303/
Subject: LU-13464 target: abort recovery if timer fail
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: d1f4de9bb568affc523dcbc46d82f4a6676990de

Comment by Gerrit Updater [ 08/Jan/21 ]

Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41171
Subject: LU-13464 ldlm: add recovery time limit
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ea7a9006d1fe7d3ac5b8a027325901bfbad43a21
