[LU-16638] LustreError: 18531:0:(osc_object.c:410:osc_req_attr_set()) LBUG Created: 13/Mar/23 Updated: 28/Jun/23 Resolved: 13/Mar/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.15.2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Shane Nehring | Assignee: | WC Triage |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Environment: |
RHEL 9.0 client running 2.15.2 with tcp networking. |
||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Description |
|
We're seeing a regular crash on one of our clients that reexports a lustre volume via nfs to other clients. For a while I thought it was related to atime updates as we were seeing that in the stack trace, but it's still crashing in the same spot in osc_object after disabling atime.

[238496.543455] LustreError: 18531:0:(osc_object.c:396:osc_req_attr_set()) page@00000000124db7f5[4 000000001cc24e6a 4 1 0000000000000000]
[238496.543488] LustreError: 18531:0:(osc_object.c:396:osc_req_attr_set()) vvp-page@00000000644ae261(0:0) vm@0000000013d555b5 17ffffc0002001 4:0 ffff8b33354a0500 540 lru
[238496.543514] LustreError: 18531:0:(osc_object.c:396:osc_req_attr_set()) lov-page@00000000fac88b5b
[238496.543532] LustreError: 18531:0:(osc_object.c:396:osc_req_attr_set()) osc-page@00000000c5423838 540: 1< 0x845fed 1 + + > 2< 2211840 0 4096 0x7 0x9 | 0000000000000000 0000000032e9a87e 00000000b31fd886 > 3< 1 0 0 > 4< 0 0 8 156499967 - | - - - + > 5< - - - + | 0 - | 0 - ->
[238496.543569] LustreError: 18531:0:(osc_object.c:396:osc_req_attr_set()) end page@00000000124db7f5
[238496.543585] LustreError: 18531:0:(osc_object.c:396:osc_req_attr_set()) uncovered page!
[238496.543598] LustreError: 18531:0:(ldlm_resource.c:1783:ldlm_resource_dump()) --- Resource: [0xd3409f:0x0:0x0].0x0 (000000004660d5d9) refcount = 3
[238496.543618] LustreError: 18531:0:(ldlm_resource.c:1787:ldlm_resource_dump()) Granted locks (in reverse order):
[238496.543635] LustreError: 18531:0:(ldlm_resource.c:1790:ldlm_resource_dump()) ### ### ns: work-OST0003-osc-ffff8b3d1067e800 lock: 00000000552d990c/0x2904edfb430539b2 lrc: 3/1,0 mode: PR/PR res: [0xd3409f:0x0:0x0].0x0 rrc: 4 type: EXT [0->2211839] (req 2146304->2211839) gid 0 flags: 0x800420400020000 nid: local remote: 0x27d356efda730f51 expref: -99 pid: 18893 timeout: 0 lvb_type: 1
[238496.543687] LustreError: 18531:0:(ldlm_resource.c:1802:ldlm_resource_dump()) Waiting locks:
[238496.543701] LustreError: 18531:0:(ldlm_resource.c:1804:ldlm_resource_dump()) ### ### ns: work-OST0003-osc-ffff8b3d1067e800 lock: 0000000049878f3e/0x2904edfb430539b9 lrc: 4/1,0 mode: --/PR res: [0xd3409f:0x0:0x0].0x0 rrc: 4 type: EXT [2211840->2277375] (req 2211840->2277375) gid 0 flags: 0x20000 nid: local remote: 0x27d356efda730f5f expref: -99 pid: 18894 timeout: 0 lvb_type: 1
[238496.543746] Pid: 18531, comm: ptlrpcd_03_06 5.14.0-70.36.1.el9_0.x86_64 #1 SMP PREEMPT Thu Nov 24 11:28:21 EST 2022
[238496.543762] Call Trace TBD:
[238496.543767] LustreError: 18531:0:(osc_object.c:410:osc_req_attr_set()) LBUG
[238496.543779] Pid: 18531, comm: ptlrpcd_03_06 5.14.0-70.36.1.el9_0.x86_64 #1 SMP PREEMPT Thu Nov 24 11:28:21 EST 2022
[238496.543794] Call Trace TBD:
[238496.543799] Kernel panic - not syncing: LBUG
[238496.543807] CPU: 46 PID: 18531 Comm: ptlrpcd_03_06 Kdump: loaded Tainted: P OE --------- --- 5.14.0-70.36.1.el9_0.x86_64 #1
[238496.543827] Hardware name: Supermicro AS -1114S-WN10RT/H12SSW-NTR, BIOS 2.3 12/03/2021
[238496.543840] Call Trace:
[238496.543848] dump_stack_lvl+0x34/0x48
[238496.543860] panic+0x102/0x2d4
[238496.543869] lbug_with_loc.cold+0x18/0x18 [libcfs]
[238496.543887] osc_req_attr_set+0x32a/0x540 [osc]
[238496.543905] cl_req_attr_set+0x5e/0x160 [obdclass]
[238496.543939] osc_build_rpc+0x4a7/0x11f0 [osc]
[238496.544421] osc_send_read_rpc+0x6de/0x810 [osc]
[238496.545787] osc_check_rpcs+0x335/0x3c0 [osc]
[238496.546230] osc_io_unplug0+0x75/0x90 [osc]
[238496.546662] brw_queue_work+0x2f/0xd0 [osc]
[238496.547086] work_interpreter+0x32/0x170 [ptlrpc]
[238496.547527] ptlrpc_check_set+0x415/0x1ea0 [ptlrpc]
[238496.547966] ptlrpcd_check+0x3d0/0x5c0 [ptlrpc]
[238496.548787] ptlrpcd+0x20d/0x4a0 [ptlrpc]
[238496.550000] kthread+0x149/0x170
[238496.550732] ret_from_fork+0x22/0x30

This crash is relatively new for us, we started to notice it after we switched from o2ib to tcp to address stability issues in our environment that we believe (we're still investigating) are related to rdma on rhel9 with omnipath. We have a couple vmcores from the crash kernel available if desired, however I'd rather not attach them here. |
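For readers following the log, the "uncovered page!" message can be cross-checked against the lock dump with a little arithmetic. The sketch below is not part of the original report. It assumes that the first number in the `2<` group of the osc-page line (2211840) is the page's byte offset in the object and that 4096 is its length; the helper names `parse_extent` and `covers` are hypothetical. Under those assumptions, the page starts one byte past the end of the granted PR extent and falls entirely inside the extent of the lock that is still waiting.

```python
import re

# Lock lines abridged from the ldlm_resource_dump() output in the description.
GRANTED = "mode: PR/PR res: [0xd3409f:0x0:0x0].0x0 rrc: 4 type: EXT [0->2211839] (req 2146304->2211839)"
WAITING = "mode: --/PR res: [0xd3409f:0x0:0x0].0x0 rrc: 4 type: EXT [2211840->2277375] (req 2211840->2277375)"

# Matches the extent of an EXT lock, e.g. "type: EXT [0->2211839]".
EXTENT_RE = re.compile(r"type: EXT \[(\d+)->(\d+)\]")


def parse_extent(line):
    """Return the (start, end) byte extent from a dumped lock line (hypothetical helper)."""
    match = EXTENT_RE.search(line)
    if match is None:
        raise ValueError("no EXT extent found in line")
    return int(match.group(1)), int(match.group(2))


def covers(extent, offset, count):
    """True if the inclusive extent contains bytes offset..offset+count-1."""
    start, end = extent
    return start <= offset and offset + count - 1 <= end


# Assumption: values read from the "2< 2211840 0 4096 ..." group of the osc-page line.
PAGE_OFFSET = 2211840
PAGE_COUNT = 4096

granted = parse_extent(GRANTED)
waiting = parse_extent(WAITING)

print("granted extent covers page:", covers(granted, PAGE_OFFSET, PAGE_COUNT))
print("waiting extent covers page:", covers(waiting, PAGE_OFFSET, PAGE_COUNT))
```

This only confirms that the numbers in the dump are consistent with the "uncovered page!" message; it says nothing about why the page was queued for a read RPC before the second lock was granted.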
| Comments |
| Comment by Shane Nehring [ 13/Mar/23 ] |
|
Sorry, don't know how I missed that this is a duplicate. I'm assuming this'll end up in b2_15 in time for 2.15.3? |
| Comment by Peter Jones [ 13/Mar/23 ] |
|
Yes - likely |
| Comment by Andreas Dilger [ 13/Mar/23 ] |
|
This looks like a duplicate of A workaround has also been added to Lustre: and will likely appear in 2.15.3. |
| Comment by Andreas Dilger [ 13/Mar/23 ] |
|
Shane, I was in a meeting while writing my last comment, so I didn't see your previous exchange with Peter until after I submitted my comment. You likely missed that it was a duplicate because I just recently edited |
| Comment by Shane Nehring [ 13/Mar/23 ] |
|
Thank you, Andreas. I've reached out to Red Hat to ask about incorporating that patch into el9. |
| Comment by Shane Nehring [ 27/Jun/23 ] |
|
The kernel patch that corrects this has been incorporated into RHEL 9.2 in RHSA-2023:3723. |