[LU-17344] ldlm_resource_get() ASSERTION(name->name[0] != 0) failed Created: 07/Dec/23 Updated: 29/Dec/23 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.14.0, Lustre 2.16.0, Lustre 2.15.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Andreas Dilger | Assignee: | Hongchao Zhang |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 | ||||
| Description |
|
A system running LFSCK was crashing in a loop, apparently trying to destroy a bad object FID:

LustreError: 16300:0:(ldlm_resource.c:1488:ldlm_resource_get()) ASSERTION(name->name[0] != 0) failed:
kernel:Kernel panic - not syncing: LBUG
Call Trace:
  libcfs_call_trace+0x90/0xf0 [libcfs]
  lbug_with_loc+0x4c/0xa0 [libcfs]
  ldlm_resource_get+0x7e9/0x950 [ptlrpc]
  ldlm_lock_create+0x55/0xa60 [ptlrpc]
  ldlm_cli_enqueue_local+0xcc/0x850 [ptlrpc]
  lfsck_layout_slave_conditional_destroy [lfsck]
  lfsck_layout_slave_in_notify+0xa19/0xed0 [lfsck]
  lfsck_in_notify+0x23c/0x320 [lfsck]
  tgt_handle_lfsck_notify+0x5c/0x140 [ptlrpc]
  tgt_request_handle+0x8bf/0x18c0 [ptlrpc]
  ptlrpc_server_handle_request+0x253/0xc40 [ptlrpc]
  ptlrpc_main+0xc4a/0x1cb0 [ptlrpc]
  kthread+0xd1/0xe0

It probably makes sense to have lfsck_layout_slave_conditional_destroy() or a higher level check that the FID is valid before calling all the way down to ldlm_cli_enqueue_local(). |
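The kind of early check suggested above could look roughly like the following userspace sketch. It is not the Lustre code and not the eventual patch: every `sketch_*` name is a hypothetical stand-in, and the mapping of the FID sequence into `name[0]` is an assumption made for illustration. The point is only that the caller tests the same condition the assertion trips on and returns an error before the enqueue is attempted.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a Lustre FID (sequence, object ID, version). */
struct sketch_fid {
	uint64_t f_seq;
	uint32_t f_oid;
	uint32_t f_ver;
};

/* Hypothetical stand-in for an LDLM resource name. */
struct sketch_res_id {
	uint64_t name[4];
};

/* Assumed mapping for this sketch: sequence goes into name[0]. */
static void sketch_build_res_name(const struct sketch_fid *fid,
				  struct sketch_res_id *res)
{
	memset(res, 0, sizeof(*res));
	res->name[0] = fid->f_seq;
	res->name[1] = ((uint64_t)fid->f_ver << 32) | fid->f_oid;
}

/* Counts how often the (fake) enqueue path is reached. */
static int sketch_enqueue_calls;

/* Stand-in for the lock enqueue that would otherwise hit the assertion. */
static int sketch_enqueue_local(const struct sketch_res_id *res)
{
	(void)res;
	sketch_enqueue_calls++;
	return 0;
}

/* Reject a FID that would yield name[0] == 0 before calling down the stack. */
static int sketch_conditional_destroy(const struct sketch_fid *fid)
{
	struct sketch_res_id res;

	sketch_build_res_name(fid, &res);
	if (res.name[0] == 0)
		return -EINVAL;	/* bad FID, e.g. 0x0:0x0:0x0 */

	return sketch_enqueue_local(&res);
}
```

With this shape, a request carrying FID 0x0:0x0:0x0 fails with -EINVAL in the LFSCK layer and never reaches the lock code.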
| Comments |
| Comment by Andreas Dilger [ 07/Dec/23 ] |
|
Hongchao, could you please take a look at this? It looks like a relatively simple patch in lfsck_layout_slave_conditional_destroy() to check that the FID is valid before calling down the stack to "destroy" this bad object.

It looks like this crash was on the OST side (lfsck_layout_slave_in_notify()), so it makes sense to handle this case there if bad data is sent from the MDS. It would likely also make sense to add a check on the MDS side of LFSCK, so that it doesn't even send a request to destroy an object that doesn't exist (e.g. FID 0x0:0x0:0x0 or similar) because of a bad file layout on disk.

Changing ldlm_resource_get() to print and return an error instead of LASSERT() would probably also make the code more robust, though harder to debug in the future. |
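For illustration only, the "print and return an error instead of LASSERT()" idea might be sketched in userspace C as below. The `demo_*` types and function are hypothetical and do not reflect the real ldlm_resource_get() signature or its return convention; the sketch just shows an invalid name being logged and rejected rather than asserted on.

```c
#include <errno.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins for the resource name and resource. */
struct demo_res_name {
	uint64_t name[4];
};

struct demo_resource {
	struct demo_res_name lr_name;
};

/*
 * Instead of asserting name->name[0] != 0, log the bad name and return
 * -EINVAL so the caller can fail the single request.
 */
static int demo_resource_get(const struct demo_res_name *name,
			     struct demo_resource *out)
{
	if (name->name[0] == 0) {
		fprintf(stderr,
			"demo_resource_get: invalid resource name [%#llx:%#llx:%#llx:%#llx]\n",
			(unsigned long long)name->name[0],
			(unsigned long long)name->name[1],
			(unsigned long long)name->name[2],
			(unsigned long long)name->name[3]);
		return -EINVAL;
	}

	out->lr_name = *name;
	return 0;
}
```

The trade-off noted in the comment still applies: an error return keeps the server up, but the bad caller is no longer pinpointed by a stack trace, so the logged message has to carry enough detail to find it.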
| Comment by Hongchao Zhang [ 11/Dec/23 ] |
|
Hi, |
| Comment by Gerrit Updater [ 29/Dec/23 ] |
|
"Hongchao Zhang <hongchao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53565 |