-
Bug
-
Resolution: Unresolved
-
Major
-
None
-
Lustre 2.15.8
-
None
-
rhel9 x86_64
-
3
We have seen about 50 client nodes crash over the past few months. We narrowed it down to one user code and then distilled that into a simple reproducer (see attached).
This seems to be an interaction between transparent huge pages + direct I/O + mmap?
If the reproducer is run on 2 client nodes at the same time, one of the 2 nodes will almost always LBUG.
Transparent huge pages (THP) on the client must be set to "madvise" or "always" for the crash to occur:
cat /sys/kernel/mm/transparent_hugepage/enabled
The node does not crash if THP is set to "never".
Crashes are reproduced more reliably if the client node is idle and the VFS caches have been dropped:
echo 3 > /proc/sys/vm/drop_caches
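To check many clients for the THP condition described above, something like the following minimal Python sketch could be used. It assumes the usual sysfs format, in which the active mode is the bracketed word (for example "always [madvise] never"). The helper names active_thp_mode and is_exposed are hypothetical, and "exposed" only reflects what this report observed ("madvise" or "always" crash, "never" does not).

```python
import re
from pathlib import Path

# Same sysfs file as the "cat" command in this report.
THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def active_thp_mode(text):
    """Return the bracketed (active) mode, e.g. 'madvise' from 'always [madvise] never'."""
    match = re.search(r"\[(\w+)\]", text)
    if match is None:
        raise ValueError(f"no active THP mode found in {text!r}")
    return match.group(1)


def is_exposed(mode):
    """True for the modes this report saw crash (hypothetical helper)."""
    return mode in {"always", "madvise"}


if __name__ == "__main__":
    if THP_ENABLED.exists():
        mode = active_thp_mode(THP_ENABLED.read_text())
        print(f"THP mode: {mode}, exposed per this report: {is_exposed(mode)}")
    else:
        print("THP sysfs file not found on this system")
```

Run on a client node, it prints the active THP mode and whether that mode matches the crashing configurations reported here.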
The LBUG is:
2026-05-26 14:43:18 [ 6824.545844] LustreError: 1194:0:(osc_object.c:396:osc_req_attr_set()) page@000000002eb233c4[4 0000000003303e39 4 1 0000000000000000]
2026-05-26 14:43:18 [ 6824.545844]
2026-05-26 14:43:18 [ 6824.545881] LustreError: 1194:0:(osc_object.c:396:osc_req_attr_set()) vmpage @000000005a73be6c 17ffffc0008101 3:0 ffff9f6d394de100 319 lru
2026-05-26 14:43:18 [ 6824.545881]
2026-05-26 14:43:18 [ 6824.545907] LustreError: 1194:0:(osc_object.c:396:osc_req_attr_set()) osc-page@000000008e66afcd 319: 1< 0x845fed 1 + + > 2< 1306624 0 4096 0x7 0x9 | 0000000000000000 0000000062419670 00000000c0454700 > 3< 1 0 0 > 4< 0 0 8 5505024 - | - - - + > 5< - - - + | 0 - | 0 - ->
2026-05-26 14:43:18 [ 6824.545907]
2026-05-26 14:43:18 [ 6824.545944] LustreError: 1194:0:(osc_object.c:396:osc_req_attr_set()) end page@000000002eb233c4
2026-05-26 14:43:18 [ 6824.545944]
2026-05-26 14:43:18 [ 6824.545962] LustreError: 1194:0:(osc_object.c:396:osc_req_attr_set()) uncovered page!
2026-05-26 14:43:18 [ 6824.545977] LustreError: 1194:0:(ldlm_resource.c:1786:ldlm_resource_dump()) --- Resource: [0x84e8db7:0x0:0x0].0x0 (00000000549d2e3e) refcount = 11
2026-05-26 14:43:18 [ 6824.545997] LustreError: 1194:0:(ldlm_resource.c:1790:ldlm_resource_dump()) Granted locks (in reverse order):
2026-05-26 14:43:18 [ 6824.546017] LustreError: 1194:0:(ldlm_resource.c:1793:ldlm_resource_dump()) ### ### ns: dagg-OST0014-osc-ffff9f838ae16800 lock: 000000000f20ff18/0xff58e093f2cba8d lrc: 3/0,1 mode: PW/PW res: [0x84e8db7:0x0:0x0].0x0 rrc: 12 type: EXT [1196032->1306623] (req 1196032->1200127) gid 0 flags: 0x800020000020000 nid: local remote: 0xf25a968d8e59dff9 expref: -99 pid: 52469 timeout: 0 lvb_type: 1
2026-05-26 14:43:18 [ 6824.546089] Pid: 1194, comm: ptlrpcd_00_13 5.14.0-611.55.1.el9_7.x86_64 #1 SMP PREEMPT_DYNAMIC Tue May 19 15:19:29 EDT 2026
2026-05-26 14:43:18 [ 6824.546106] Call Trace TBD:
2026-05-26 14:43:18 [ 6824.546113] LustreError: 1194:0:(osc_object.c:410:osc_req_attr_set()) LBUG
2026-05-26 14:43:18 [ 6824.546126] Pid: 1194, comm: ptlrpcd_00_13 5.14.0-611.55.1.el9_7.x86_64 #1 SMP PREEMPT_DYNAMIC Tue May 19 15:19:29 EDT 2026
2026-05-26 14:43:18 [ 6824.546143] Call Trace TBD:
2026-05-26 14:43:18 [ 6824.546149] Kernel panic - not syncing: LBUG
2026-05-26 14:43:18 [ 6824.713137] CPU: 10 PID: 1194 Comm: ptlrpcd_00_13 Tainted: P OE ------ --- 5.14.0-611.55.1.el9_7.x86_64 #1
2026-05-26 14:43:18 [ 6824.724256] Hardware name: Dell Inc. PowerEdge R740/06G98X, BIOS 2.26.1 01/28/2026
2026-05-26 14:43:18 [ 6824.731820] Call Trace:
2026-05-26 14:43:18 [ 6824.734272] <TASK>
2026-05-26 14:43:18 [ 6824.736379] dump_stack_lvl+0x34/0x48
2026-05-26 14:43:18 [ 6824.740048] panic+0x107/0x2bb
2026-05-26 14:43:18 [ 6824.743107] lbug_with_loc.cold+0x18/0x18 [libcfs]
2026-05-26 14:43:18 [ 6824.747922] osc_req_attr_set+0x32a/0x540 [osc]
2026-05-26 14:43:18 [ 6824.752475] cl_req_attr_set+0x5b/0x160 [obdclass]
2026-05-26 14:43:18 [ 6824.757337] osc_build_rpc+0x6e6/0x1270 [osc]
2026-05-26 14:43:18 [ 6824.761713] osc_send_read_rpc+0x6de/0x810 [osc]
2026-05-26 14:43:18 [ 6824.766356] ? osc_extent_finish+0x431/0xa60 [osc]
2026-05-26 14:43:18 [ 6824.771169] osc_check_rpcs+0x335/0x3c0 [osc]
2026-05-26 14:43:18 [ 6824.775553] osc_io_unplug0+0x75/0x90 [osc]
2026-05-26 14:43:18 [ 6824.779755] brw_queue_work+0x2f/0xd0 [osc]
2026-05-26 14:43:18 [ 6824.783960] work_interpreter+0x2f/0x170 [ptlrpc]
2026-05-26 14:43:18 [ 6824.788751] ptlrpc_check_set+0x411/0x1ea0 [ptlrpc]
2026-05-26 14:43:18 [ 6824.793709] ? schedule+0x2c/0xb0
2026-05-26 14:43:18 [ 6824.797028] ptlrpcd_check+0x3d5/0x5d0 [ptlrpc]
2026-05-26 14:43:18 [ 6824.801639] ptlrpcd+0x20c/0x4a0 [ptlrpc]
2026-05-26 14:43:18 [ 6824.805729] ? __pfx_woken_wake_function+0x10/0x10
2026-05-26 14:43:18 [ 6824.810521] ? __pfx_ptlrpcd+0x10/0x10 [ptlrpc]
2026-05-26 14:43:18 [ 6824.815129] kthread+0x101/0x110
2026-05-26 14:43:18 [ 6824.818362] ? __pfx_kthread+0x10/0x10
2026-05-26 14:43:18 [ 6824.822117] ret_from_fork+0x28/0x50
2026-05-26 14:43:18 [ 6824.825697] </TASK>
2026-05-26 14:43:18 [ 6824.832627] Kernel Offset: 0x38c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
2026-05-26 14:43:18 [ 6824.991250] ---[ end Kernel panic - not syncing: LBUG ]---
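One way to read the "uncovered page!" message is to compare the page in the dump with the extent of the granted lock. The Python sketch below does that arithmetic under two assumptions that are not confirmed in this report: that the 319 in the osc-page line is the page index, and that the page size is 4096 bytes (the dump shows a 4096 count and an object offset of 1306624). The function names are hypothetical.

```python
PAGE_SIZE = 4096  # assumed; consistent with the 4096 count in the osc-page dump


def page_byte_range(index, page_size=PAGE_SIZE):
    """Return the inclusive (first_byte, last_byte) range of a page index."""
    start = index * page_size
    return start, start + page_size - 1


def covers(lock_start, lock_end, page_start, page_end):
    """True if the inclusive lock extent fully contains the page."""
    return lock_start <= page_start and page_end <= lock_end


# Values taken from the log above.
LOCK_START, LOCK_END = 1196032, 1306623  # EXT [1196032->1306623]
PAGE_INDEX = 319                         # assumed to be the page index

page_start, page_end = page_byte_range(PAGE_INDEX)
print(page_start, page_end)                                # 1306624 1310719
print(covers(LOCK_START, LOCK_END, page_start, page_end))  # False
print(page_start - LOCK_END)                               # 1
```

Under those assumptions, the page starts at byte 1306624, one byte past the end of the granted PW extent (1306623), so the read RPC includes the first page beyond the lock. That is consistent with the "uncovered page!" assertion, but it does not by itself explain how the page came to be queued.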
is related to: LU-18449 Handling kernel mmap read-ahead triggering by advise(MADV_HUGEPAGE) (Resolved)