[LU-17066] lod_xattr_set()) ASSERTION( (!!(!strcmp(name, "lustre.""lov") || !strcmp(name, "trusted.lov")) == !!(!lod_dt_obj(dt)->ldo_comp_cached)) ) failed Created: 31/Aug/23  Updated: 13/Nov/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.16.0
Fix Version/s: Lustre 2.16.0

Type: Bug
Priority: Critical
Reporter: Oleg Drokin
Assignee: WC Triage
Resolution: Unresolved
Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Adding an additional racer test (https://review.whamcloud.com/c/fs/lustre-release/+/41368) surfaced the following crash:

[16104.408588] LustreError: 4168:0:(lod_object.c:5136:lod_xattr_set()) ASSERTION( (!!(!strcmp(name, "lustre.""lov") || !strcmp(name, "trusted.lov")) == !!(!lod_dt_obj(dt)->ldo_comp_cached)) ) failed: 
[16104.411561] LustreError: 4168:0:(lod_object.c:5136:lod_xattr_set()) LBUG
[16104.412125] Pid: 4168, comm: mdt_rdpg01_003 3.10.0-7.9-debug #2 SMP Tue Feb 1 18:17:58 EST 2022
[16104.413146] Call Trace:
[16104.413633] [<0>] libcfs_call_trace+0x90/0xf0 [libcfs]
[16104.414162] [<0>] lbug_with_loc+0x4c/0xa0 [libcfs]
[16104.414692] [<0>] lod_xattr_set+0x1b13/0x1c90 [lod]
[16104.415221] [<0>] mdo_xattr_set+0xc0/0x4c0 [mdd]
[16104.415746] [<0>] mdd_xattr_set+0xf85/0x1200 [mdd]
[16104.416279] [<0>] mo_xattr_set+0x43/0x45 [mdt]
[16104.416812] [<0>] mdt_close_handle_layouts+0x9a4/0xee0 [mdt]
[16104.417358] [<0>] mdt_mfd_close+0x5b2/0xbb0 [mdt]
[16104.417887] [<0>] mdt_close_internal+0xb4/0x240 [mdt]
[16104.418532] [<0>] mdt_close+0x28c/0x970 [mdt]
[16104.419217] [<0>] tgt_request_handle+0x88e/0x19b0 [ptlrpc]
[16104.419815] [<0>] ptlrpc_server_handle_request+0x251/0xc00 [ptlrpc]
[16104.420555] [<0>] ptlrpc_main+0xc66/0x1670 [ptlrpc]
[16104.421185] [<0>] kthread+0xe4/0xf0
[16104.422327] [<0>] ret_from_fork_nospec_begin+0x7/0x21
[16104.422934] [<0>] 0xfffffffffffffffe
[16104.423456] Kernel panic - not syncing: LBUG

Crashdump: http://testing.linuxhacker.ru/lustre-reports/external/crashes/boilpot-bigmem-65-2023-08-25-15:53:21/

Interestingly, the same assertion at the same location was previously seen in mid-2021 during testing of https://review.whamcloud.com/44178, in the sanity-flr and hot-pool tests: https://testing.whamcloud.com/test_sets/de728c9c-b9dd-4c65-b898-507df9215560

Here's the tracker for this crash that lists all matching failures: https://knox.linuxhacker.ru/crashdb_ui_external.py.cgi?newid=6580



 Comments   
Comment by Alex Zhuravlev [ 16/Oct/23 ]

I'm hitting this quite often locally.

Comment by Alex Zhuravlev [ 13/Nov/23 ]

It's a race between migrate (which takes the LAYOUT bit) and unlink (which does not). The migrate drops the cached layout, but a racing unlink reloads it.
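
To make the failing condition concrete, here is a minimal userspace sketch of one reading of the race described above. It is not Lustre code: struct model_obj and every model_* function are hypothetical stand-ins, and only the condition printed in the assertion message (the xattr name is a LOV name if and only if ldo_comp_cached is clear) is taken from the log. The assumption is that migrate clears the cached-layout flag before setting the LOV xattr, and that a racing unlink sets the flag again in between.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the lod object; only the flag named in the
 * assertion message is modelled. */
struct model_obj {
	bool comp_cached;
};

/* True when name is one of the two LOV xattr names from the assertion.
 * Note that "lustre.""lov" in the log is the C literal "lustre.lov". */
static bool is_lov_xattr(const char *name)
{
	assert(name != NULL);
	return strcmp(name, "lustre.lov") == 0 ||
	       strcmp(name, "trusted.lov") == 0;
}

/* Mirrors the printed condition: LOV xattr <=> layout not cached. */
static bool xattr_set_invariant_holds(const struct model_obj *o,
				      const char *name)
{
	return is_lov_xattr(name) == !o->comp_cached;
}

/* Assumed migrate step: drop the cached layout. */
static void model_migrate_drop_layout(struct model_obj *o)
{
	o->comp_cached = false;
}

/* Assumed unlink step: reload the layout, which re-caches it. */
static void model_unlink_reload_layout(struct model_obj *o)
{
	o->comp_cached = true;
}
```

In this model, calling model_migrate_drop_layout() and then checking "trusted.lov" satisfies the condition. If model_unlink_reload_layout() runs between those two steps, the same check fails, which corresponds to the LBUG in the trace under the assumptions stated above.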

Comment by Alex Zhuravlev [ 13/Nov/23 ]

"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53113
Subject: LU-17066 tests: reproducer
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: fbcff6be03f5e30e1811cf5790d52cc8daed69dd
