[LU-16182] llog_osd_prev_block() ASSERTION( last_rec->lrh_index == tail->lrt_index ) failed Created: 22/Sep/22 Updated: 23/Sep/22 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.10, Lustre 2.15.2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Andreas Dilger | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
There is an LBUG triggered during MDT mount due to corruption in the Changelog catalog/log file:
LustreError: 29516:0:(llog_osd.c:1075:llog_osd_prev_block()) ASSERTION( last_rec->lrh_index == tail->lrt_index ) failed:
LustreError: 29516:0:(llog_osd.c:1075:llog_osd_prev_block()) LBUG
Pid: 29516, comm: mount.lustre
Call Trace:
libcfs_debug_dumpstack+0x53/0x80 [libcfs]
lbug_with_loc+0x45/0xc0 [libcfs]
llog_osd_prev_block+0x9f7/0xaf0 [obdclass]
llog_reverse_process+0x147/0xac0 [obdclass]
? changelog_init_cb+0x0/0x1f0 [mdd]
llog_cat_reverse_process_cb+0x157/0x540 [obdclass]
llog_reverse_process+0x269/0xac0 [obdclass]
llog_cat_reverse_process+0x199/0x2d0 [obdclass]
mdd_prepare+0x1269/0x1a00 [mdd]
mdt_prepare+0x51/0x3b0 [mdt]
server_start_targets+0x2574/0x2e10 [obdclass]
server_fill_super+0x108d/0x184c [obdclass]
lustre_fill_super+0x328/0x950 [obdclass]
mount_nodev+0x4d/0xb0
lustre_mount+0x38/0x60 [obdclass]
mount_fs+0x39/0x1b0
vfs_kern_mount+0x5f/0xf0
do_mount+0x24e/0xa40
The kernel code should not have an LASSERT() check for data that is read from disk, so this should be removed and replaced with error handling:
LASSERT(last_rec->lrh_index == tail->lrt_index);
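As a rough illustration of the kind of check that could replace the assertion, here is a minimal userspace sketch. It is not the actual llog_osd_prev_block() code: struct rec_hdr, struct rec_tail and check_last_rec() are hypothetical stand-ins that only model the fields involved, fprintf() stands in for whatever kernel error reporting would be used, and -EINVAL is an assumed error code.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>

/* Simplified stand-ins for the on-disk llog record header and tail.
 * These are assumptions for illustration, not the Lustre definitions. */
struct rec_hdr {
	uint32_t lrh_len;
	uint32_t lrh_index;
	uint32_t lrh_type;
	uint32_t lrh_id;
};

struct rec_tail {
	uint32_t lrt_len;
	uint32_t lrt_index;
};

/* Hypothetical helper: validate the last record of a block read from disk.
 * Returns 0 if header and tail agree, -EINVAL if they do not. */
static int check_last_rec(const struct rec_hdr *last_rec,
			  const struct rec_tail *tail)
{
	/* The pointers come from the caller, so NULL here is a code bug
	 * and an assertion is still appropriate. */
	assert(last_rec != NULL && tail != NULL);

	/* The indices come from disk, so a mismatch means corruption and
	 * must be reported to the caller instead of crashing. */
	if (last_rec->lrh_index != tail->lrt_index) {
		fprintf(stderr,
			"llog: corrupt block: header index %u != tail index %u\n",
			(unsigned int)last_rec->lrh_index,
			(unsigned int)tail->lrt_index);
		return -EINVAL;
	}

	return 0;
}
```

A caller would propagate the negative return value upward so that the mount path can decide how to handle the damaged log.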
The llog_osd_prev_block() function has many other places where errors are returned to the caller, so it looks (at first glance) like the LASSERT() should be replaced with an error message that prints the llog record number and FID, plus an error return to the caller that stops changelog processing and either clears this record or deletes the whole changelog. In addition, changelog_init_cb() has LASSERT() checks on the llog records that could fail if the records are corrupted:
LASSERT(llh->lgh_hdr->llh_flags & LLOG_F_IS_PLAIN);
LASSERT(rec->cr_hdr.lrh_type == CHANGELOG_REC);
These should also be fixed. The llog_cat_reverse_process(changelog_init_cb) handling looks like it is finding the highest changelog index currently in use? If this reverse llog processing fails, then the last changelog index may be lost. Options would include doing "forward" llog processing, though this may suffer from the same problem, or using the changelog_users file to at least start with a changelog index higher than what the users have processed (e.g. current_index = max(user_index) + 10M or similar). |
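The changelog_users fallback suggested above could be sketched roughly as follows. This is only an illustration of the arithmetic: fallback_changelog_index() and CHANGELOG_INDEX_MARGIN are hypothetical names, the per-user indices are assumed to have already been read into an array, and the 10M margin is taken directly from the suggestion above rather than from any existing Lustre constant.

```c
#include <stddef.h>
#include <stdint.h>

/* The "+ 10M" safety margin from the description; hypothetical name. */
#define CHANGELOG_INDEX_MARGIN 10000000ULL

/* Hypothetical helper: pick a starting changelog index that is safely
 * above anything the registered users have already processed. */
static uint64_t fallback_changelog_index(const uint64_t *user_index,
					 size_t count)
{
	uint64_t max_index = 0;
	size_t i;

	/* Find the highest index recorded for any changelog user. */
	for (i = 0; i < count; i++) {
		if (user_index[i] > max_index)
			max_index = user_index[i];
	}

	/* Saturate instead of wrapping if the margin would overflow. */
	if (max_index > UINT64_MAX - CHANGELOG_INDEX_MARGIN)
		return UINT64_MAX;

	return max_index + CHANGELOG_INDEX_MARGIN;
}
```

With no registered users the function simply returns the margin. Whether that is an acceptable starting point is a policy question this sketch does not try to answer.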