[LU-13464] MDT0000 remount in recovery 40 hours Created: 19/Apr/20 Updated: 08/Jan/21 Resolved: 27/May/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.14.0, Lustre 2.12.6 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Hongchao Zhang | Assignee: | Hongchao Zhang |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||
| Severity: | 3 | ||||
| Rank (Obsolete): | 9223372036854775807 | ||||
| Description |
|
Summary Details
mds01 encountered the ldiskfs error "illegal pblock", which caused mdt0000 to be remounted read-only:
Mar 21 14:41:20 mds01 kernel: LDISKFS-fs error (device dm-18): ldiskfs_map_blocks:592: inode #1426025594: block 1836017711: comm mdt00_094: lblock 0 mapped to illegal pblock 1836017711 (length 1)
Mar 21 14:41:20 mds01 kernel: Aborting journal on device dm-18-8.
Mar 21 14:41:20 mds01 kernel: LustreError: 47100:0:(osd_handler.c:1727:osd_trans_commit_cb()) transaction @0xffff9575746da700 commit error: 2
Mar 21 14:41:20 mds01 kernel: LDISKFS-fs (dm-18): Remounting filesystem read-only
Mar 21 14:41:20 mds01 kernel: LustreError: 47222:0:(osd_io.c:1833:osd_ldiskfs_read()) dm-18: can't read 59@0 on ino 1426025594: rc = -5
Mar 21 14:41:20 mds01 kernel: LustreError: 47222:0:(mdd_dir.c:4507:mdd_migrate()) eaglefs-MDD0000: [0x2001457f2:0xbb37:0x0] readlink failed: rc = -5
Mar 21 14:41:20 mds01 kernel: LDISKFS-fs warning (device dm-18): kmmpd:186: kmmpd being stopped since filesystem has been remounted as readonly.

Exascaler HA then remounted the mdt0000 volume:
Mar 21 14:42:25 mds01 kernel: LDISKFS-fs warning (device dm-18): ldiskfs_multi_mount_protect:321: MMP interval 42 higher than expected, please wait.\x0a
Mar 21 14:43:08 mds01 kernel: LDISKFS-fs warning (device dm-18): ldiskfs_clear_journal_err:4994: Filesystem error recorded from previous mount: IO failure
Mar 21 14:43:08 mds01 kernel: LDISKFS-fs warning (device dm-18): ldiskfs_clear_journal_err:4995: Marking fs in need of filesystem check.
Mar 21 14:43:08 mds01 kernel: LDISKFS-fs (dm-18): warning: mounting fs with errors, running e2fsck is recommended
Mar 21 14:43:08 mds01 kernel: LDISKFS-fs (dm-18): recovery complete
Mar 21 14:43:08 mds01 kernel: LDISKFS-fs (dm-18): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
Mar 21 14:43:09 mds01 kernel: Lustre: eaglefs-MDT0000: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900

mds01 began recovery:
Mar 21 14:43:09 mds01 kernel: Lustre: eaglefs-MDT0000: Will be in recovery for at least 2:30, or until 2199 clients reconnect
Mar 21 14:48:46 mds01 kernel: Lustre: eaglefs-MDT0000: Denying connection for new client 4b73dcdb-c341-5e93-6e2d-56df59cccb3c (at 10.148.4.147@o2ib), waiting for 2199 known clients (2152 recovered, 44 in progress, and 3 evicted) to recover in 0:53

mds01 recovery encountered a "hard timeout" 41 hours later:
Mar 23 08:31:14 mds01 kernel: Lustre: eaglefs-MDT0000: Denying connection for new client de9156e9-00e3-cbf3-a651-7566c63216df (at 10.148.0.125@o2ib), waiting for 2199 known clients (2152 recovered, 44 in progress, and 3 evicted) already passed deadline 2507:31
Mar 23 08:37:35 mds01 kernel: Lustre: eaglefs-OST0014-osc-MDT0000: Connection to eaglefs-OST0014 (at 10.148.66.34@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Mar 23 08:38:00 mds01 kernel: Lustre: eaglefs-OST003c-osc-MDT0000: Connection to eaglefs-OST003c (at 10.148.66.84@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Mar 23 08:38:47 mds01 kernel: Lustre: eaglefs-OST000a-osc-MDT0000: Connection to eaglefs-OST000a (at 10.148.66.22@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Mar 23 08:41:12 mds01 kernel: Lustre: eaglefs-MDT0000: Denying connection for new client de9156e9-00e3-cbf3-a651-7566c63216df (at 10.148.0.125@o2ib), waiting for 2199 known clients (2152 recovered, 44 in progress, and 3 evicted) already passed deadline 2517:31
Mar 23 08:44:39 mds01 kernel: Lustre: 64029:0:(ldlm_lib.c:2046:target_recovery_overseer()) eaglefs-MDT0000 recovery is aborted by hard timeout
Mar 23 08:44:39 mds01 kernel: Lustre: 64029:0:(ldlm_lib.c:2056:target_recovery_overseer()) recovery is aborted, evict exports in recovery
Mar 23 08:44:41 mds01 kernel: Lustre: eaglefs-MDT0000: Recovery over after 2527:32, of 2199 clients 0 recovered and 2199 were evicted. |
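For anyone triaging similar logs, the recovery counters quoted above can be pulled out with a small script. This is only a sketch: the regular expressions are written against the message text shown in this ticket, the helper names `recovery_status` and `recovery_hours` are made up for illustration, and the duration is assumed to be in minutes:seconds (consistent with "at least 2:30" for the 150-second window).

```python
import re

# Matches the client counters in the "Denying connection" messages above.
STATUS_RE = re.compile(
    r"waiting for (?P<known>\d+) known clients "
    r"\((?P<recovered>\d+) recovered, (?P<in_progress>\d+) in progress, "
    r"and (?P<evicted>\d+) evicted\)"
)

# Matches the final "Recovery over after MMMM:SS" message.
OVER_RE = re.compile(r"Recovery over after (?P<min>\d+):(?P<sec>\d+)")


def recovery_status(line):
    """Return the client counters from a status line, or None if absent."""
    m = STATUS_RE.search(line)
    if m is None:
        return None
    return {name: int(value) for name, value in m.groupdict().items()}


def recovery_hours(line):
    """Return the recovery duration in hours (assumes minutes:seconds)."""
    m = OVER_RE.search(line)
    if m is None:
        return None
    return (int(m["min"]) * 60 + int(m["sec"])) / 3600


if __name__ == "__main__":
    status_line = (
        "Lustre: eaglefs-MDT0000: Denying connection for new client "
        "de9156e9-00e3-cbf3-a651-7566c63216df (at 10.148.0.125@o2ib), "
        "waiting for 2199 known clients (2152 recovered, 44 in progress, "
        "and 3 evicted) already passed deadline 2517:31"
    )
    over_line = (
        "Lustre: eaglefs-MDT0000: Recovery over after 2527:32, "
        "of 2199 clients 0 recovered and 2199 were evicted."
    )
    print(recovery_status(status_line))
    print(round(recovery_hours(over_line), 1), "hours")
```

In the log above the counters are identical at 14:48:46 on Mar 21 and at 08:41:12 on Mar 23, so the 44 "in progress" clients made no progress over the whole period.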
| Comments |
| Comment by Gerrit Updater [ 19/Apr/20 ] |
|
Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38277 |
| Comment by Gerrit Updater [ 27/May/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38277/ |
| Comment by Peter Jones [ 27/May/20 ] |
|
Landed for 2.14 |
| Comment by Gerrit Updater [ 19/Oct/20 ] |
|
Jian Yu (yujian@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40303 |
| Comment by Gerrit Updater [ 29/Oct/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40303/ |
| Comment by Gerrit Updater [ 08/Jan/21 ] |
|
Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41171 |