[LU-17344] ldlm_resource_get() ASSERTION(name->name[0] != 0) failed Created: 07/Dec/23  Updated: 29/Dec/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0, Lustre 2.16.0, Lustre 2.15.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Andreas Dilger Assignee: Hongchao Zhang
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

A system running LFSCK was crashing in a loop, apparently trying to destroy a bad object FID:

 LustreError: 16300:0:(ldlm_resource.c:1488:ldlm_resource_get()) ASSERTION(name->name[0] != 0) failed:
 kernel:Kernel panic - not syncing: LBUG 
 Call Trace:
 libcfs_call_trace+0x90/0xf0 [libcfs]
 lbug_with_loc+0x4c/0xa0 [libcfs]
 ldlm_resource_get+0x7e9/0x950 [ptlrpc]
 ldlm_lock_create+0x55/0xa60 [ptlrpc]
 ldlm_cli_enqueue_local+0xcc/0x850 [ptlrpc]
 lfsck_layout_slave_conditional_destroy [lfsck]
 lfsck_layout_slave_in_notify+0xa19/0xed0 [lfsck]
 lfsck_in_notify+0x23c/0x320 [lfsck]
 tgt_handle_lfsck_notify+0x5c/0x140 [ptlrpc]
 tgt_request_handle+0x8bf/0x18c0 [ptlrpc]
 ptlrpc_server_handle_request+0x253/0xc40 [ptlrpc]
 ptlrpc_main+0xc4a/0x1cb0 [ptlrpc]
 kthread+0xd1/0xe0

It probably makes sense for lfsck_layout_slave_conditional_destroy(), or a higher-level caller, to check that the FID is valid before calling all the way down to ldlm_cli_enqueue_local().
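A minimal userspace sketch of that kind of early check is shown here. It is not Lustre code: struct demo_fid only mirrors the sequence/object-id/version shape of a FID, and demo_fid_is_valid(), demo_enqueue_local() and demo_conditional_destroy() are hypothetical names chosen for illustration. The assumption is that a FID whose sequence and object id are both zero is never a legitimate destroy target.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a Lustre FID: sequence, object id, version. */
struct demo_fid {
	uint64_t f_seq;
	uint32_t f_oid;
	uint32_t f_ver;
};

/* Assumption: a FID with zero sequence and zero object id is invalid. */
static int demo_fid_is_valid(const struct demo_fid *fid)
{
	return fid != NULL && !(fid->f_seq == 0 && fid->f_oid == 0);
}

/*
 * Stand-in for the lower layer, which asserts on bad input
 * instead of returning an error (the analogue of the LBUG above).
 */
static void demo_enqueue_local(const struct demo_fid *fid)
{
	assert(fid->f_seq != 0 || fid->f_oid != 0);
}

/* Validate first, so a bad FID never reaches the asserting layer. */
static int demo_conditional_destroy(const struct demo_fid *fid)
{
	if (!demo_fid_is_valid(fid))
		return -EINVAL;

	demo_enqueue_local(fid);
	/* ... the actual destroy would follow here ... */
	return 0;
}
```

With this shape, a request carrying an all-zero FID is rejected with -EINVAL by the caller, and the asserting layer is only reached for FIDs that pass the check.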



 Comments   
Comment by Andreas Dilger [ 07/Dec/23 ]

Hongchao, could you please take a look at this? It looks like a relatively simple patch in lfsck_layout_slave_conditional_destroy() to check that the FID is valid before calling down the stack to "destroy" this bad object.

It looks like this crash was on the OST side (lfsck_layout_slave_in_notify()), so it makes sense to handle this case there in case bad data is sent from the MDS. It would likely also make sense to add a check on the MDS side of LFSCK so that it doesn't even send a request to destroy an object that doesn't exist (e.g. FID 0x0:0x0:0x0 or similar) because of a bad file layout on disk.

Changing ldlm_resource_get() to print and return an error instead of hitting LASSERT() would probably also make the code more robust, though such problems would be harder to debug in the future.
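As a rough illustration of that trade-off, the userspace sketch below contrasts an asserting lookup with one that logs and returns an error. It is only a model: struct demo_res_id, DEMO_RES_NAME_SIZE and both functions are hypothetical, the four-word name array is an assumption, and real kernel code would use its own logging and error conventions rather than fprintf().

```c
#include <assert.h>
#include <errno.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define DEMO_RES_NAME_SIZE 4	/* assumed size, for illustration only */

/* Hypothetical stand-in for a lock resource name. */
struct demo_res_id {
	uint64_t name[DEMO_RES_NAME_SIZE];
};

/* Assert-style check: a zero first word aborts the whole process. */
static void demo_resource_get_strict(const struct demo_res_id *id)
{
	assert(id->name[0] != 0);
}

/* Error-return style: log the bad name and let the caller decide. */
static int demo_resource_get_checked(const struct demo_res_id *id)
{
	if (id == NULL) {
		fprintf(stderr, "invalid resource name: NULL\n");
		return -EINVAL;
	}
	if (id->name[0] == 0) {
		fprintf(stderr,
			"invalid resource name: [%#" PRIx64 ":%#" PRIx64
			":%#" PRIx64 ":%#" PRIx64 "]\n",
			id->name[0], id->name[1], id->name[2], id->name[3]);
		return -EINVAL;
	}
	return 0;
}
```

The checked variant keeps the server running when it is handed a bad name, at the cost of losing the stack trace that the assertion would have produced, which is why printing the full name in the error message matters.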

Comment by Hongchao Zhang [ 11/Dec/23 ]

Hi,
The FID seems to have been parsed correctly, the corresponding dt_object was returned successfully, and
the existence of the dt_object had already been verified. Is it possible that the issue is related to the conversion
between ost_id and lu_fid for some special FID? How about adding a corresponding check here?
Thanks

Comment by Gerrit Updater [ 29/Dec/23 ]

"Hongchao Zhang <hongchao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53565
Subject: LU-17344 lfsck: check the validity of the res_id
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 23b263a28ff0b60f9f0674959fc9cb439cb84e71
