Details
Type: Bug
Resolution: Incomplete
Priority: Major
Description
Per Andreas, the MDS crashes while attempting to mount an MDT with a corrupt OI file:
It looks like the 32927 patch worked as intended, and the problematic LASSERT was replaced with an error message:
[ 243.128917] LustreError: 2994:0:(llog_osd.c:792:llog_osd_next_block()) proj-MDT0000-osd: invalid llog tail at log id 0x14:1/0 offset 16384
[ 243.142871] LustreError: 2994:0:(osp_sync.c:1287:osp_sync_thread()) proj-OST000a-osc-MDT0000: llog process with osp_sync_process_queues failed: -22
[ 243.157690] LustreError: 2994:0:(osp_sync.c:1287:osp_sync_thread()) Skipped 1 previous similar message
[ 243.181463] LustreError: 3016:0:(llog_osd.c:780:llog_osd_next_block()) proj-MDT0000-osd: invalid llog tail at log id 0x2a:1/0 offset 16384 last_rec idx 4294936591 tail idx 0
Now the MDS is crashing in OI Scrub, due to a corrupt block in the OI file:
[ 306.547356] LustreError: 2900:0:(osd_iam_lfix.c:190:iam_lfix_init()) Wrong magic in node 173391004 (#21): 0x0 != 0x1976 or wrong count: 0 (170)
[ 306.561797] LustreError: 2900:0:(osd_iam_lfix.c:190:iam_lfix_init()) Skipped 3 previous similar messages
[ 307.088114] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 307.096929] IP: [<ffffffffc1182dd0>] __iam_path_lookup+0x70/0x240 [osd_ldiskfs]
[ 307.110928] Oops: 0002 [#1] SMP
[ 307.388833] Call Trace:
[ 307.393153] [<ffffffffc118304f>] __iam_it_get+0xaf/0x1b0 [osd_ldiskfs]
[ 307.402139] [<ffffffffc1183bda>] iam_it_get+0x2a/0x160 [osd_ldiskfs]
[ 307.410927] [<ffffffffc117c713>] __osd_oi_lookup+0x113/0x390 [osd_ldiskfs]
[ 307.420296] [<ffffffffc117ed54>] osd_oi_lookup+0x94/0x170 [osd_ldiskfs]
[ 307.429354] [<ffffffffc1194662>] osd_scrub_check_update+0x112/0x12f0 [osd_ldiskfs]
[ 307.448551] [<ffffffffc11973d5>] osd_scrub_exec+0x65/0x5f0 [osd_ldiskfs]
[ 307.467661] [<ffffffffc1198e81>] osd_inode_iteration+0x571/0xd80 [osd_ldiskfs]
[ 307.496163] [<ffffffffc119a100>] osd_scrub_main+0xa70/0x1070 [osd_ldiskfs]
This issue with __iam_path_lookup() should probably be moved into a separate LU ticket, so that the crash (likely due to incorrect error handling) can be identified and avoided. We can't really do anything to repair the corruption in place, but the problematic OI file(s) could be removed and OI Scrub can rebuild them.
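To illustrate the kind of error handling suspected above, here is a minimal, self-contained C sketch. It is not the osd_ldiskfs code: the names leaf_head, leaf, leaf_init() and path_lookup() are hypothetical, and the layout is simplified. Only the expected magic 0x1976 and the limit 170 are taken from the log message, and reading 170 as the maximum entry count is an assumption. The point is that when the leaf check fails, the caller should return the error instead of going on to use a leaf pointer that was never set.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical, simplified leaf header; NOT the real osd_ldiskfs layout. */
#define LEAF_MAGIC     0x1976u /* expected magic, as printed in the log */
#define LEAF_MAX_COUNT 170u    /* limit printed in the log (assumed meaning) */

struct leaf_head {
	uint16_t magic;
	uint16_t count;
};

struct leaf {
	const struct leaf_head *head; /* stays NULL unless validation passes */
};

/* Validate a leaf block; on failure, report it and leave l->head NULL. */
static int leaf_init(struct leaf *l, const struct leaf_head *blk)
{
	l->head = NULL;
	if (blk == NULL)
		return -EIO;
	if (blk->magic != LEAF_MAGIC ||
	    blk->count == 0 || blk->count > LEAF_MAX_COUNT) {
		fprintf(stderr, "wrong magic %#x != %#x or wrong count: %u (%u)\n",
			(unsigned)blk->magic, LEAF_MAGIC,
			(unsigned)blk->count, LEAF_MAX_COUNT);
		return -EIO;
	}
	l->head = blk;
	return 0;
}

/*
 * Hypothetical lookup step: propagate the validation error to the caller
 * rather than dereferencing l.head after a failed leaf_init().
 */
static int path_lookup(const struct leaf_head *blk, unsigned *count_out)
{
	struct leaf l;
	int rc = leaf_init(&l, blk);

	if (rc != 0)
		return rc; /* corrupt block: fail the lookup, do not crash */
	*count_out = l.head->count;
	return 0;
}
```

With a zeroed block such as the one in the log (magic 0x0, count 0), path_lookup() in this sketch returns -EIO, which a caller like a scrub pass could treat as "this index is unusable and needs rebuilding" rather than hitting a NULL dereference.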