[LU-16638] LustreError: 18531:0:(osc_object.c:410:osc_req_attr_set()) LBUG
Created: 13/Mar/23  Updated: 28/Jun/23  Resolved: 13/Mar/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.2
Fix Version/s: None

Type: Bug
Priority: Major
Reporter: Shane Nehring
Assignee: WC Triage
Resolution: Duplicate
Votes: 0
Labels: None
Environment:

RHEL 9.0 client running 2.15.2 with tcp networking.


Issue Links:
Duplicate
duplicates LU-16412 check truncated page in ->readpage() Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

We're seeing a regular crash on one of our clients that re-exports a Lustre volume via NFS to other clients. For a while I thought it was related to atime updates, since those showed up in the stack trace, but it's still crashing in the same spot in osc_object after disabling atime.

[238496.543455] LustreError: 18531:0:(osc_object.c:396:osc_req_attr_set()) page@00000000124db7f5[4 000000001cc24e6a 4 1 0000000000000000]
[238496.543488] LustreError: 18531:0:(osc_object.c:396:osc_req_attr_set()) vvp-page@00000000644ae261(0:0) vm@0000000013d555b5 17ffffc0002001 4:0 ffff8b33354a0500 540 lru
[238496.543514] LustreError: 18531:0:(osc_object.c:396:osc_req_attr_set()) lov-page@00000000fac88b5b
[238496.543532] LustreError: 18531:0:(osc_object.c:396:osc_req_attr_set()) osc-page@00000000c5423838 540: 1< 0x845fed 1 + + > 2< 2211840 0 4096 0x7 0x9 | 0000000000000000 0000000032e9a87e 00000000b31fd886 > 3< 1 0 0 > 4< 0 0 8 156499967 - | - - - + > 5< - - - + | 0 - | 0 - ->
[238496.543569] LustreError: 18531:0:(osc_object.c:396:osc_req_attr_set()) end page@00000000124db7f5
[238496.543585] LustreError: 18531:0:(osc_object.c:396:osc_req_attr_set()) uncovered page!
[238496.543598] LustreError: 18531:0:(ldlm_resource.c:1783:ldlm_resource_dump()) --- Resource: [0xd3409f:0x0:0x0].0x0 (000000004660d5d9) refcount = 3
[238496.543618] LustreError: 18531:0:(ldlm_resource.c:1787:ldlm_resource_dump()) Granted locks (in reverse order):
[238496.543635] LustreError: 18531:0:(ldlm_resource.c:1790:ldlm_resource_dump()) ### ### ns: work-OST0003-osc-ffff8b3d1067e800 lock: 00000000552d990c/0x2904edfb430539b2 lrc: 3/1,0 mode: PR/PR res: [0xd3409f:0x0:0x0].0x0 rrc: 4 type: EXT [0->2211839] (req 2146304->2211839) gid 0 flags: 0x800420400020000 nid: local remote: 0x27d356efda730f51 expref: -99 pid: 18893 timeout: 0 lvb_type: 1
[238496.543687] LustreError: 18531:0:(ldlm_resource.c:1802:ldlm_resource_dump()) Waiting locks:
[238496.543701] LustreError: 18531:0:(ldlm_resource.c:1804:ldlm_resource_dump()) ### ### ns: work-OST0003-osc-ffff8b3d1067e800 lock: 0000000049878f3e/0x2904edfb430539b9 lrc: 4/1,0 mode: --/PR res: [0xd3409f:0x0:0x0].0x0 rrc: 4 type: EXT [2211840->2277375] (req 2211840->2277375) gid 0 flags: 0x20000 nid: local remote: 0x27d356efda730f5f expref: -99 pid: 18894 timeout: 0 lvb_type: 1
[238496.543746] Pid: 18531, comm: ptlrpcd_03_06 5.14.0-70.36.1.el9_0.x86_64 #1 SMP PREEMPT Thu Nov 24 11:28:21 EST 2022
[238496.543762] Call Trace TBD:
[238496.543767] LustreError: 18531:0:(osc_object.c:410:osc_req_attr_set()) LBUG
[238496.543779] Pid: 18531, comm: ptlrpcd_03_06 5.14.0-70.36.1.el9_0.x86_64 #1 SMP PREEMPT Thu Nov 24 11:28:21 EST 2022
[238496.543794] Call Trace TBD:
[238496.543799] Kernel panic - not syncing: LBUG
[238496.543807] CPU: 46 PID: 18531 Comm: ptlrpcd_03_06 Kdump: loaded Tainted: P           OE    --------- ---  5.14.0-70.36.1.el9_0.x86_64 #1
[238496.543827] Hardware name: Supermicro AS -1114S-WN10RT/H12SSW-NTR, BIOS 2.3 12/03/2021
[238496.543840] Call Trace:
[238496.543848]  dump_stack_lvl+0x34/0x48
[238496.543860]  panic+0x102/0x2d4
[238496.543869]  lbug_with_loc.cold+0x18/0x18 [libcfs]
[238496.543887]  osc_req_attr_set+0x32a/0x540 [osc]
[238496.543905]  cl_req_attr_set+0x5e/0x160 [obdclass]
[238496.543939]  osc_build_rpc+0x4a7/0x11f0 [osc]
[238496.544421]  osc_send_read_rpc+0x6de/0x810 [osc]
[238496.545787]  osc_check_rpcs+0x335/0x3c0 [osc]
[238496.546230]  osc_io_unplug0+0x75/0x90 [osc]
[238496.546662]  brw_queue_work+0x2f/0xd0 [osc]
[238496.547086]  work_interpreter+0x32/0x170 [ptlrpc]
[238496.547527]  ptlrpc_check_set+0x415/0x1ea0 [ptlrpc]
[238496.547966]  ptlrpcd_check+0x3d0/0x5c0 [ptlrpc]
[238496.548787]  ptlrpcd+0x20d/0x4a0 [ptlrpc]
[238496.550000]  kthread+0x149/0x170
[238496.550732]  ret_from_fork+0x22/0x30
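
The "uncovered page!" message can be cross-checked against the lock dump above with a little arithmetic. The sketch below is only an illustration, not Lustre's own check: it assumes 4 KiB pages (consistent with the 4096 count in the osc-page line), assumes the `mode: X/Y` field is granted/requested mode with `--` meaning not yet granted, and uses hypothetical helper names (`parse_extent`, `page_covered`). The two sample strings are excerpts of the lock lines in the log.

```python
import re

PAGE_SIZE = 4096  # assumption: 4 KiB pages, as suggested by the 4096 count in the log


def parse_extent(lock_line):
    """Return (start, end, granted) parsed from an ldlm_resource_dump() lock line."""
    ext = re.search(r"type: EXT \[(\d+)->(\d+)\]", lock_line)
    mode = re.search(r"mode: (\S+)/(\S+)", lock_line)
    if ext is None or mode is None:
        raise ValueError("not an extent lock line")
    # Assumption: first mode is the granted mode; "--" means the lock is still waiting.
    granted = mode.group(1) != "--"
    return int(ext.group(1)), int(ext.group(2)), granted


def page_covered(page_index, locks, page_size=PAGE_SIZE):
    """True if some granted extent fully contains the page's byte range."""
    start = page_index * page_size
    end = start + page_size - 1
    return any(g and s <= start and end <= e for s, e, g in locks)


# Excerpts of the granted and waiting lock lines from the dump above.
granted_line = "mode: PR/PR res: [0xd3409f:0x0:0x0].0x0 rrc: 4 type: EXT [0->2211839] (req 2146304->2211839)"
waiting_line = "mode: --/PR res: [0xd3409f:0x0:0x0].0x0 rrc: 4 type: EXT [2211840->2277375] (req 2211840->2277375)"
locks = [parse_extent(granted_line), parse_extent(waiting_line)]

# Page index 540 starts at byte 540 * 4096 = 2211840, one byte past the granted extent.
print(page_covered(540, locks))
print(page_covered(539, locks))
```

Under these assumptions, page 540 (bytes 2211840 to 2215935) falls entirely inside the waiting lock's extent and outside the granted one, which matches the "uncovered page!" report; page 539 ends exactly at byte 2211839 and would be covered.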

This crash is relatively new for us. We started to notice it after we switched from o2ib to tcp to address stability issues in our environment that we believe (we're still investigating) are related to RDMA on RHEL 9 with Omni-Path.

We have a couple of vmcores from the crash kernel available if desired; however, I'd rather not attach them here.



 Comments   
Comment by Shane Nehring [ 13/Mar/23 ]

Sorry, I don't know how I missed that this is a duplicate. I'm assuming the fix will end up in b2_15 in time for 2.15.3?

Comment by Peter Jones [ 13/Mar/23 ]

Yes - likely

Comment by Andreas Dilger [ 13/Mar/23 ]

This looks like a duplicate of LU-16412, which exposed a bug in the kernel. Patches were recently landed in the mainline kernel and backported to stable kernels, but they likely still need to be added to vendor kernels. You could potentially speed that process up by filing a ticket with your OS vendor and referencing the lore.kernel.org links in LU-16412 to request that the patch be added to their kernel (it also improves performance in some workloads, so it is a win-win).

A workaround has also been added to Lustre:
https://review.whamcloud.com/50277

and will likely appear in 2.15.3.

Comment by Andreas Dilger [ 13/Mar/23 ]

Shane, I was in a meeting while writing my last comment, so I didn't see your previous exchange with Peter until after I submitted it.

You likely missed that it was a duplicate because I only recently edited LU-16412 to include enough information to find it; that information was previously only in a customer ticket. Luckily, you don't have to wait through the long time it took us to debug this issue and find that it was a kernel bug.

Comment by Shane Nehring [ 13/Mar/23 ]

Thank you, Andreas. I've reached out to Red Hat to ask about incorporating that patch into el9.

Comment by Shane Nehring [ 27/Jun/23 ]

The kernel patch that corrects this has been incorporated into RHEL 9.2 in RHSA-2023:3723.
