[LU-11800] MDT stuck during recovery Created: 17/Dec/18  Updated: 21/Jan/22  Resolved: 21/Jan/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: Sarah Liu
Assignee: WC Triage
Resolution: Cannot Reproduce
Votes: 0
Labels: None
Environment: soak with 2.12-RC2 ib build #173 EL7.6

Severity: 3

 Description   

After the soak test had been running for about 2 days, one MDS got stuck during recovery.

MDS log (soak-10):

Dec 16 13:45:36 soak-10 kernel: LustreError: 12500:0:(llog_osd.c:988:llog_osd_next_block()) soaked-MDT0001-osp-MDT0002: invalid llog tail at log id [0x54cf:0x400069cc:0x2]:0 offset 16121856 bytes 32768
Dec 16 13:45:36 soak-10 kernel: LustreError: 12500:0:(lod_dev.c:428:lod_sub_recovery_thread()) soaked-MDT0001-osp-MDT0002 get update log failed: rc = -22
Dec 16 13:45:37 soak-10 multipathd: 360080e50001fedb80000015952012962: sdi - rdac checker reports path is ghost
Dec 16 13:45:37 soak-10 kernel: device-mapper: multipath: Reinstating path 8:128.
Dec 16 13:45:37 soak-10 multipathd: 8:128: reinstated
Dec 16 13:45:37 soak-10 multipathd: 360080e50001fedb80000015952012962: queue_if_no_path enabled
Dec 16 13:45:37 soak-10 multipathd: 360080e50001fedb80000015952012962: Recovered to normal mode
Dec 16 13:45:37 soak-10 multipathd: 360080e50001fedb80000015952012962: remaining active paths: 1
Dec 16 13:45:37 soak-10 kernel: device-mapper: multipath: Failing path 8:128.
Dec 16 13:45:37 soak-10 multipathd: sdi: mark as failed
Dec 16 13:45:37 soak-10 multipathd: 360080e50001fedb80000015952012962: Entering recovery mode: max_retries=300
Dec 16 13:45:37 soak-10 multipathd: 360080e50001fedb80000015952012962: remaining active paths: 0
Dec 16 13:45:37 soak-10 multipathd: 360080e50001fedb80000015952012962: Entering recovery mode: max_retries=300
Dec 16 13:45:37 soak-10 kernel: LustreError: 12498:0:(lod_dev.c:428:lod_sub_recovery_thread()) soaked-MDT0002-osd get update log failed: rc = -108
Dec 16 13:45:37 soak-10 kernel: LustreError: 12498:0:(lod_dev.c:428:lod_sub_recovery_thread()) Skipped 2 previous similar messages
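
The rc values in the log above are negative Linux errno codes. A minimal sketch for decoding them follows; the lookup table and the helper name decode_rc are illustrative only and are not part of Lustre. The table is hard-coded with the Linux values so the result does not depend on the host platform.

```python
import re

# Linux errno values for the two codes seen in the log above.
# Hard-coded (assumption: the MDS runs Linux, as the EL7.6 environment implies).
LINUX_ERRNO = {
    22: ("EINVAL", "Invalid argument"),
    108: ("ESHUTDOWN", "Cannot send after transport endpoint shutdown"),
}

RC_PATTERN = re.compile(r"rc = (-?\d+)")


def decode_rc(line):
    """Return (rc, errno name, description) for a log line, or None if no rc."""
    match = RC_PATTERN.search(line)
    if match is None:
        return None
    rc = int(match.group(1))
    # Kernel code reports errors as negated errno values, so flip the sign.
    name, text = LINUX_ERRNO.get(-rc, ("UNKNOWN", "not in this table"))
    return rc, name, text


if __name__ == "__main__":
    for sample in (
        "soaked-MDT0001-osp-MDT0002 get update log failed: rc = -22",
        "soaked-MDT0002-osd get update log failed: rc = -108",
    ):
        print(decode_rc(sample))
```

Here -22 decodes to EINVAL, which matches the "invalid llog tail" message, and -108 decodes to ESHUTDOWN.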


 Comments   
Comment by Andreas Dilger [ 18/Dec/18 ]

This might be a problem with the storage hardware, given the multipath messages immediately after the llog error?
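
One way to check the timing relationship suggested here is to count multipath events that fall close to each LustreError line. The sketch below is only an illustration: the function names, the 2-second window, and the year 2018 (taken from the ticket creation date, since syslog omits the year) are assumptions. The sample lines are copied from the log in the description.

```python
from datetime import datetime

SAMPLE = [
    "Dec 16 13:45:36 soak-10 kernel: LustreError: 12500:0:(llog_osd.c:988:llog_osd_next_block()) soaked-MDT0001-osp-MDT0002: invalid llog tail at log id [0x54cf:0x400069cc:0x2]:0 offset 16121856 bytes 32768",
    "Dec 16 13:45:37 soak-10 multipathd: 360080e50001fedb80000015952012962: sdi - rdac checker reports path is ghost",
    "Dec 16 13:45:37 soak-10 kernel: device-mapper: multipath: Failing path 8:128.",
    "Dec 16 13:45:37 soak-10 kernel: LustreError: 12498:0:(lod_dev.c:428:lod_sub_recovery_thread()) soaked-MDT0002-osd get update log failed: rc = -108",
]


def parse(line, year):
    """Split a syslog line into (timestamp, source, message)."""
    # The first 15 characters are the "Mon DD HH:MM:SS" timestamp.
    stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    _host, source, message = line[16:].split(" ", 2)
    return stamp, source.rstrip(":"), message


def storage_events_near_errors(lines, window_s=2, year=2018):
    """For each LustreError line, count multipath events within window_s seconds."""
    errors, storage = [], []
    for line in lines:
        stamp, source, message = parse(line, year)
        if message.startswith("LustreError:"):
            errors.append(stamp)
        elif source == "multipathd" or message.startswith("device-mapper: multipath:"):
            storage.append(stamp)
    return [
        (err.strftime("%H:%M:%S"),
         sum(1 for s in storage if abs((s - err).total_seconds()) <= window_s))
        for err in errors
    ]


if __name__ == "__main__":
    print(storage_events_near_errors(SAMPLE))
```

Syslog timestamps have one-second resolution, so a count like this can show that the events cluster together but cannot establish which came first or that one caused the other.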
