Details
- Type: Bug
- Resolution: Unresolved
- Priority: Medium
Description
We have hit this LBUG a few times already on systems running Lustre 2.14.0-ddnXXX versions, with 6.8.0-YY-generic kernels, under Ubuntu 24.04.2 LTS.
The latest messages and LBUG stack always look like the following:
...................................
[2144379.685702] Lustre: 1934029:0:(sec_gss.c:723:gss_cli_ctx_handle_err_notify()) ai400x2-MDT000c-mdc-ffff96e1f48df800: unknown context (hdl 0x25042b781652d9f:0x84131884100cf992) from server XXX.YY.0.212@o2ib (uid 0), retrying
[2144379.716860] LustreError: 3239181:0:(gss_keyring.c:1684:gss_kt_update()) negotiation: rpc err 0, gss err 80000
[2144390.105931] LustreError: 3239554:0:(gss_keyring.c:1684:gss_kt_update()) negotiation: rpc err 0, gss err 80000
[2144390.116499] LustreError: 3239554:0:(gss_keyring.c:1684:gss_kt_update()) Skipped 4 previous similar messages
[2144393.189789] Lustre: 1934029:0:(sec_gss.c:723:gss_cli_ctx_handle_err_notify()) ai400x2-OST0030-osc-ffff97118a651800: unknown context (hdl 0x25042b781652d9d:0x84131884100cf95c) from server XXX.YY.0.212@o2ib (uid 0), retrying
[2144393.189796] Lustre: 1934029:0:(sec_gss.c:723:gss_cli_ctx_handle_err_notify()) Skipped 4 previous similar messages
[2144394.213405] Lustre: 1934054:0:(client.c:2349:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1751190142/real 1751190142] req@0000000039ee632a x1834742542087360/t0(0) o400->ai400x2-OST0034-osc-ffff97118a651800@XXX.YY.0.212@o2ib:28/4 lens 224/224 e 0 to 1 dl 1751190180 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:''
[2144394.213420] Lustre: 1934054:0:(client.c:2349:ptlrpc_expire_one_request()) Skipped 17 previous similar messages
[2144394.298992] LustreError: 3239573:0:(gss_keyring.c:1684:gss_kt_update()) negotiation: rpc err 0, gss err 80000
[2144394.309549] LustreError: 3239573:0:(gss_keyring.c:1684:gss_kt_update()) Skipped 3 previous similar messages
[2144399.462031] Lustre: 1934029:0:(sec_gss.c:723:gss_cli_ctx_handle_err_notify()) ai400x2-OST0035-osc-ffff9711ec8ac000: unknown context (hdl 0x5f3bb58e6609a6e8:0x84131884100cfa4e) from server XXX.YY.0.212@o2ib (uid 0), retrying
[2144399.462038] Lustre: 1934029:0:(sec_gss.c:723:gss_cli_ctx_handle_err_notify()) Skipped 4 previous similar messages
[2144410.055910] LustreError: 3240371:0:(gss_keyring.c:1684:gss_kt_update()) negotiation: rpc err 0, gss err 80000
[2144410.066544] LustreError: 3240371:0:(gss_keyring.c:1684:gss_kt_update()) Skipped 3 previous similar messages
[2144414.181554] Lustre: 1934029:0:(sec_gss.c:723:gss_cli_ctx_handle_err_notify()) ai400x2-OST0035-osc-ffff97118a651800: unknown context (hdl 0x5f3bb58e6609a6e5:0x84131884100cf97e) from server XXX.YY.0.212@o2ib (uid 0), retrying
[2144427.941242] Lustre: 3239556:0:(client.c:2349:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1751190176/real 1751190176] req@000000001d7173bd x1834742542137088/t0(0) o801->ai400x2-OST0034-osc-ffff9711ec8ac000@XXX.YY.0.209@o2ib:28/4 lens 224/224 e 0 to 1 dl 1751190214 ref 2 fl Rpc:XQr/0/ffffffff rc 0/-1 job:''
[2144427.941250] Lustre: 3239556:0:(client.c:2349:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
[2144461.733097] LustreError: 3239219:0:(gss_keyring.c:1684:gss_kt_update()) negotiation: rpc err -110, gss err 0
[2144461.743558] LustreError: 3239219:0:(gss_keyring.c:1684:gss_kt_update()) Skipped 2 previous similar messages
[2144495.524883] Lustre: 3240377:0:(client.c:2349:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1751190239/real 1751190239] req@000000001d02b987 x1834742542223104/t0(0) o801->ai400x2-OST0035-osc-ffff9711ec8ac000@XXX.YY.0.211@o2ib:28/4 lens 224/224 e 0 to 1 dl 1751190282 ref 2 fl Rpc:XQr/0/ffffffff rc 0/-1 job:''
[2144495.524891] Lustre: 3240377:0:(client.c:2349:ptlrpc_expire_one_request()) Skipped 24 previous similar messages
[2144495.524959] LustreError: 3240377:0:(gss_keyring.c:1684:gss_kt_update()) negotiation: rpc err -110, gss err 0
[2144495.535398] LustreError: 3240377:0:(gss_keyring.c:1684:gss_kt_update()) Skipped 13 previous similar messages
[2144565.028500] LNetError: 1934018:0:(o2iblnd_cb.c:3488:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 34 seconds
[2144565.039378] LNetError: 1934018:0:(o2iblnd_cb.c:3578:kiblnd_check_conns()) Timed out RDMA with XXX.YY.0.209@o2ib (40): c: 0, oc: 0, rc: 32
[2144565.053854] LustreError: 3244213:0:(gss_keyring.c:1684:gss_kt_update()) negotiation: rpc err -110, gss err 0
[2144565.064286] LustreError: 3244213:0:(gss_keyring.c:1684:gss_kt_update()) Skipped 10 previous similar messages
[2144570.084474] LNetError: 1934018:0:(o2iblnd_cb.c:3488:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 40 seconds
[2144570.095341] LNetError: 1934018:0:(o2iblnd_cb.c:3578:kiblnd_check_conns()) Timed out RDMA with XXX.YY.16.209@o2ib (47): c: 0, oc: 0, rc: 32
[2144578.520003] Lustre: ai400x2-MDT000c-mdc-ffff97118a651800: Connection restored to XXX.YY.0.210@o2ib (at XXX.YY.0.210@o2ib)
[2144598.478497] Lustre: ai400x2-OST0030-osc-ffff9711ec8ac000: Connection restored to XXX.YY.0.210@o2ib (at XXX.YY.0.210@o2ib)
[2144598.478509] Lustre: Skipped 5 previous similar messages
[2144704.739750] LNet: 1 peer NIs in recovery (showing 1): XXX.YY.0.210@o2ib
[2144727.651636] Lustre: 1934029:0:(client.c:2349:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1751190351/real 1751190351] req@000000001d24e4aa x1834742542381888/t0(0) o400->ai400x2-OST0035-osc-ffff97118a651800@XXX.YY.0.210@o2ib:28/4 lens 224/224 e 5 to 1 dl 1751190514 ref 1 fl Rpc:XQr/c0/ffffffff rc 0/-1 job:''
[2144727.651646] Lustre: 1934029:0:(client.c:2349:ptlrpc_expire_one_request()) Skipped 31 previous similar messages
[2144761.728091] Lustre: ai400x2-OST0034-osc-ffff96e1f48df800: Connection restored to XXX.YY.0.212@o2ib (at XXX.YY.0.212@o2ib)
[2144761.728103] Lustre: Skipped 2 previous similar messages
[2144784.075193] Lustre: ai400x2-OST0035-osc-ffff96e1f48df800: Connection restored to XXX.YY.0.210@o2ib (at XXX.YY.0.210@o2ib)
[2144784.075204] Lustre: Skipped 2 previous similar messages
[2145153.953562] Lustre: ai400x2-OST0030-osc-ffff9711ec8ac000: Connection to ai400x2-OST0030 (at XXX.YY.0.210@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[2145153.953569] Lustre: Skipped 2 previous similar messages
[2145153.963540] Lustre: 1934029:0:(sec_gss.c:723:gss_cli_ctx_handle_err_notify()) ai400x2-OST0031-osc-ffff9711ec8ac000: unknown context (hdl 0x34754495f3edaeae:0x84131884100cfab2) from server XXX.YY.0.209@o2ib (uid 0), retrying
[2145154.014206] LustreError: 3269841:0:(gss_keyring.c:1684:gss_kt_update()) negotiation: rpc err 0, gss err 80000
[2145154.024726] LustreError: 3269841:0:(gss_keyring.c:1684:gss_kt_update()) Skipped 6 previous similar messages
[2145159.073634] Lustre: ai400x2-OST0030-osc-ffff97118a651800: Connection to ai400x2-OST0030 (at XXX.YY.0.210@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[2145159.073640] Lustre: Skipped 1 previous similar message
[2145159.074339] Lustre: 1934029:0:(sec_gss.c:723:gss_cli_ctx_handle_err_notify()) ai400x2-OST0031-osc-ffff97118a651800: unknown context (hdl 0x34754495f3edaf84:0x84131884100cfab7) from server XXX.YY.0.209@o2ib (uid 0), retrying
[2145159.074345] Lustre: 1934029:0:(sec_gss.c:723:gss_cli_ctx_handle_err_notify()) Skipped 1 previous similar message
[2145164.710542] Lustre: ai400x2-MDT000c-mdc-ffff9711ec8ac000: Connection to ai400x2-MDT000c (at XXX.YY.0.210@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[2145164.710549] Lustre: Skipped 8 previous similar messages
[2145164.710711] Lustre: 1934029:0:(sec_gss.c:723:gss_cli_ctx_handle_err_notify()) ai400x2-MDT000c-mdc-ffff9711ec8ac000: unknown context (hdl 0x2596d1f9bc0a9cb7:0x84131884100cfab9) from server XXX.YY.0.209@o2ib (uid 0), retrying
[2145164.710716] Lustre: 1934029:0:(sec_gss.c:723:gss_cli_ctx_handle_err_notify()) Skipped 3 previous similar messages
[2145179.233547] Lustre: 1934029:0:(sec_gss.c:723:gss_cli_ctx_handle_err_notify()) ai400x2-OST0035-osc-ffff97118a651800: unknown context (hdl 0x2596d1f9bc0a9c2d:0x84131884100cfab5) from server XXX.YY.0.209@o2ib (uid 0), retrying
[2145179.233554] Lustre: 1934029:0:(sec_gss.c:723:gss_cli_ctx_handle_err_notify()) Skipped 2 previous similar messages
[2145192.033065] Lustre: 1934031:0:(client.c:2349:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1751190940/real 1751190940] req@000000003bd215c5 x1834742543202368/t0(0) o400->ai400x2-MDT000c-mdc-ffff97118a651800@XXX.YY.0.210@o2ib:12/10 lens 224/224 e 0 to 1 dl 1751190978 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:''
[2145199.009348] Lustre: 1934029:0:(sec_gss.c:723:gss_cli_ctx_handle_err_notify()) ai400x2-OST0034-osc-ffff97118a651800: unknown context (hdl 0xd350adfb3d96a1a8:0x84131884100cfaca) from server XXX.YY.0.210@o2ib (uid 0), retrying
[2145199.009379] Lustre: 1934029:0:(sec_gss.c:723:gss_cli_ctx_handle_err_notify()) Skipped 7 previous similar messages
[2145284.247160] Lustre: ai400x2-OST0030-osc-ffff97118a651800: Connection restored to XXX.YY.0.209@o2ib (at XXX.YY.0.209@o2ib)
[2145284.247171] Lustre: Skipped 2 previous similar messages
[2145284.247408] LustreError: 3274848:0:(osc_quota.c:319:osc_quotactl()) ptlrpc_queue_wait failed, rc: -107
[2145284.247408] LustreError: 3274990:0:(osc_quota.c:319:osc_quotactl()) ptlrpc_queue_wait failed, rc: -107
[2145284.247413] LustreError: 3274990:0:(osc_quota.c:319:osc_quotactl()) Skipped 5 previous similar messages
[2145284.256889] LustreError: 3274848:0:(osc_quota.c:319:osc_quotactl()) Skipped 30 previous similar messages
[2145295.776696] LustreError: 3274970:0:(osc_quota.c:319:osc_quotactl()) ptlrpc_queue_wait failed, rc: -5
[2145295.786415] LustreError: 3274970:0:(osc_quota.c:319:osc_quotactl()) Skipped 17 previous similar messages
[2145319.485266] Lustre: ai400x2-MDT000c-mdc-ffff9711ec8ac000: Connection restored to XXX.YY.0.209@o2ib (at XXX.YY.0.209@o2ib)
[2145319.485273] Lustre: Skipped 5 previous similar messages
[2145384.693860] Lustre: ai400x2-OST0034-osc-ffff9711ec8ac000: Connection restored to XXX.YY.0.209@o2ib (at XXX.YY.0.209@o2ib)
[2145384.693872] Lustre: Skipped 2 previous similar messages
[2145412.515799] Lustre: ai400x2-OST0034-osc-ffff9711ec8ac000: Connection to ai400x2-OST0034 (at XXX.YY.0.209@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[2145412.515806] Lustre: Skipped 2 previous similar messages
[2145412.516087] Lustre: 1934029:0:(sec_gss.c:723:gss_cli_ctx_handle_err_notify()) ai400x2-OST0034-osc-ffff96e1f48df800: unknown context (hdl 0xd350adfb3d96aa02:0x84131884100cfae9) from server XXX.YY.0.211@o2ib (uid 0), retrying
[2145412.516094] Lustre: 1934029:0:(sec_gss.c:723:gss_cli_ctx_handle_err_notify()) Skipped 6 previous similar messages
[2145421.951268] Lustre: ai400x2-OST0035-osc-ffff9711ec8ac000: Connection restored to XXX.YY.0.209@o2ib (at XXX.YY.0.209@o2ib)
[2145421.951280] Lustre: Skipped 2 previous similar messages
[2145421.953184] LustreError: 3274977:0:(osc_quota.c:319:osc_quotactl()) ptlrpc_queue_wait failed, rc: -107
[2145421.953184] LustreError: 3274964:0:(osc_quota.c:319:osc_quotactl()) ptlrpc_queue_wait failed, rc: -107
[2145421.953189] LustreError: 3274964:0:(osc_quota.c:319:osc_quotactl()) Skipped 6 previous similar messages
[2145421.962664] LustreError: 3274977:0:(osc_quota.c:319:osc_quotactl()) Skipped 15 previous similar messages
[2145427.398103] LustreError: 3282960:0:(gss_keyring.c:1684:gss_kt_update()) negotiation: rpc err 0, gss err 80000
[2145427.401522] LustreError: 3282038:0:(osc_quota.c:319:osc_quotactl()) ptlrpc_queue_wait failed, rc: -5
[2145427.408203] LustreError: 3282960:0:(gss_keyring.c:1684:gss_kt_update()) Skipped 21 previous similar messages
[2145427.417506] LustreError: 3282038:0:(osc_quota.c:319:osc_quotactl()) Skipped 1 previous similar message
[2145436.579373] Lustre: ai400x2-OST0035-osc-ffff9711ec8ac000: Connection to ai400x2-OST0035 (at XXX.YY.0.209@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[2145436.579379] Lustre: Skipped 2 previous similar messages
[2145439.648021] LustreError: 3282993:0:(osc_quota.c:319:osc_quotactl()) ptlrpc_queue_wait failed, rc: -5
[2145452.959649] LustreError: 3253714:0:(osc_object.c:408:osc_req_attr_set()) page@0000000074958940[2 000000002bc2294e 3 1 0000000000000000]
[2145452.974663] LustreError: 3253714:0:(osc_object.c:408:osc_req_attr_set()) vvp-page@000000007ab2b168 vm@000000006e4ac094 17ffffd000802a 2:0 ffff96faaa3e0cb8 0 lru
[2145452.991604] LustreError: 3253714:0:(osc_object.c:408:osc_req_attr_set()) osc-page@00000000a415e5e1 0: 1< 0x845fed 2 + + > 2< 0 0 4096 0x5 0x420 | 0000000000000000 000000007ce40d8f 00000000c1a16043 > 3< 1 31 0 > 4< 0 0 32 0 - | - - + - > 5< - - + - | 0 - | 0 - ->
[2145453.017647] LustreError: 3253714:0:(osc_object.c:408:osc_req_attr_set()) end page@0000000074958940
[2145453.028947] LustreError: 3253714:0:(osc_object.c:408:osc_req_attr_set()) uncovered page!
[2145453.037485] LustreError: 3253714:0:(ldlm_resource.c:1792:ldlm_resource_dump()) --- Resource: [0x98000040f:0x2ef745:0x0].0x0 (00000000cba29b0b) refcount = 2
[2145453.052103] LustreError: 3253714:0:(ldlm_resource.c:1796:ldlm_resource_dump()) Granted locks (in reverse order):
[2145453.062739] LustreError: 3253714:0:(ldlm_resource.c:1799:ldlm_resource_dump()) ### ### ns: ai400x2-OST0035-osc-ffff96e1f48df800 lock: 00000000b018eb41/0x6ad49641becf0192 lrc: 2/0,0 mode: PW/PW res: [0x98000040f:0x2ef745:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->4095) gid 0 flags: 0x800020000020000 nid: local remote: 0xab523e3ee5cca203 expref: -99 pid: 3274810 timeout: 0 lvb_type: 1
[2145453.099373] LustreError: 3253714:0:(osc_object.c:421:osc_req_attr_set()) LBUG
[2145453.106979] CPU: 18 PID: 3253714 Comm: kworker/u161:2 Kdump: loaded Tainted: G OE 6.8.0-60-generic #63-Ubuntu
[2145453.106983] Hardware name: Dell Inc. PowerEdge R640/0X45NX, BIOS 2.12.2 07/09/2021
[2145453.106985] Workqueue: writeback wb_workfn (flush-ai400x2-ffff96e1f48df800)
[2145453.106996] Call Trace:
[2145453.106998] <TASK>
[2145453.107002] dump_stack_lvl+0x76/0xa0
[2145453.107007] dump_stack+0x10/0x20
[2145453.107012] lbug_with_loc+0x3e/0x70 [libcfs]
[2145453.107044] osc_req_attr_set+0x303/0x5e0 [osc]
[2145453.107074] cl_req_attr_set+0xaf/0x180 [obdclass]
[2145453.107154] osc_build_rpc+0x561/0x1350 [osc]
[2145453.107178] osc_check_rpcs+0x9f3/0x1370 [osc]
[2145453.107204] osc_cache_writeback_range+0xc7d/0xee0 [osc]
[2145453.107227] ? osc_io_init+0x70/0xc0 [osc]
[2145453.107250] ? cl_io_init0+0xb1/0x170 [obdclass]
[2145453.107306] osc_io_fsync_start+0x93/0x370 [osc]
[2145453.107331] ? __pfx_osc_io_fsync_start+0x10/0x10 [osc]
[2145453.107354] cl_io_start+0x84/0x170 [obdclass]
[2145453.107408] ? __pfx_cl_io_start+0x10/0x10 [obdclass]
[2145453.107462] ? __pfx_lov_io_start+0x10/0x10 [lov]
[2145453.107484] lov_io_call.isra.0+0x58/0x150 [lov]
[2145453.107500] lov_io_start+0x117/0x200 [lov]
[2145453.107516] cl_io_start+0x84/0x170 [obdclass]
[2145453.107570] cl_io_loop+0x109/0x200 [obdclass]
[2145453.107625] cl_sync_file_range+0x288/0x340 [lustre]
[2145453.107673] ll_writepages+0x187/0x200 [lustre]
[2145453.107710] do_writepages+0xcd/0x1f0
[2145453.107716] __writeback_single_inode+0x44/0x290
[2145453.107719] ? desc_read_finalized_seq+0x3c/0xb0
[2145453.107724] writeback_sb_inodes+0x21b/0x520
[2145453.107727] __writeback_inodes_wb+0x54/0x100
[2145453.107730] ? queue_io+0x115/0x120
[2145453.107733] wb_writeback+0x2df/0x350
[2145453.107736] ? get_nr_inodes+0x56/0xa0
[2145453.107740] wb_do_writeback+0x225/0x2a0
[2145453.107743] wb_workfn+0x5f/0x240
[2145453.107746] ? finish_task_switch.isra.0+0x93/0x300
[2145453.107750] ? __schedule+0x284/0x6b0
[2145453.107754] process_one_work+0x181/0x3a0
[2145453.107758] worker_thread+0x306/0x440
[2145453.107760] ? __pfx_worker_thread+0x10/0x10
[2145453.107762] kthread+0xef/0x120
[2145453.107766] ? __pfx_kthread+0x10/0x10
[2145453.107769] ret_from_fork+0x44/0x70
[2145453.107772] ? __pfx_kthread+0x10/0x10
[2145453.107775] ret_from_fork_asm+0x1b/0x30
[2145453.107779] </TASK>
[2145453.107786] Kernel panic - not syncing: LBUG
[2145453.112523] CPU: 18 PID: 3253714 Comm: kworker/u161:2 Kdump: loaded Tainted: G OE 6.8.0-60-generic #63-Ubuntu
[2145453.124477] Hardware name: Dell Inc. PowerEdge R640/0X45NX, BIOS 2.12.2 07/09/2021
[2145453.132509] Workqueue: writeback wb_workfn (flush-ai400x2-ffff96e1f48df800)
[2145453.139931] Call Trace:
[2145453.142842] <TASK>
[2145453.145392] dump_stack_lvl+0x27/0xa0
[2145453.149509] dump_stack+0x10/0x20
[2145453.153270] panic+0x366/0x3c0
[2145453.156767] lbug_with_loc+0x70/0x70 [libcfs]
[2145453.161593] osc_req_attr_set+0x303/0x5e0 [osc]
[2145453.166593] cl_req_attr_set+0xaf/0x180 [obdclass]
[2145453.171964] osc_build_rpc+0x561/0x1350 [osc]
[2145453.176801] osc_check_rpcs+0x9f3/0x1370 [osc]
[2145453.181718] osc_cache_writeback_range+0xc7d/0xee0 [osc]
[2145453.187490] ? osc_io_init+0x70/0xc0 [osc]
[2145453.192047] ? cl_io_init0+0xb1/0x170 [obdclass]
[2145453.197179] osc_io_fsync_start+0x93/0x370 [osc]
[2145453.202263] ? __pfx_osc_io_fsync_start+0x10/0x10 [osc]
[2145453.207954] cl_io_start+0x84/0x170 [obdclass]
[2145453.212903] ? __pfx_cl_io_start+0x10/0x10 [obdclass]
[2145453.218456] ? __pfx_lov_io_start+0x10/0x10 [lov]
[2145453.223616] lov_io_call.isra.0+0x58/0x150 [lov]
[2145453.228683] lov_io_start+0x117/0x200 [lov]
[2145453.233309] cl_io_start+0x84/0x170 [obdclass]
[2145453.238229] cl_io_loop+0x109/0x200 [obdclass]
[2145453.243147] cl_sync_file_range+0x288/0x340 [lustre]
[2145453.248574] ll_writepages+0x187/0x200 [lustre]
[2145453.253561] do_writepages+0xcd/0x1f0
[2145453.257635] __writeback_single_inode+0x44/0x290
[2145453.262658] ? desc_read_finalized_seq+0x3c/0xb0
[2145453.267686] writeback_sb_inodes+0x21b/0x520
[2145453.272360] __writeback_inodes_wb+0x54/0x100
[2145453.277110] ? queue_io+0x115/0x120
[2145453.280984] wb_writeback+0x2df/0x350
[2145453.285034] ? get_nr_inodes+0x56/0xa0
[2145453.289170] wb_do_writeback+0x225/0x2a0
[2145453.293476] wb_workfn+0x5f/0x240
[2145453.297170] ? finish_task_switch.isra.0+0x93/0x300
[2145453.302422] ? __schedule+0x284/0x6b0
[2145453.306453] process_one_work+0x181/0x3a0
[2145453.310826] worker_thread+0x306/0x440
[2145453.314942] ? __pfx_worker_thread+0x10/0x10
[2145453.319562] kthread+0xef/0x120
[2145453.323058] ? __pfx_kthread+0x10/0x10
[2145453.327150] ret_from_fork+0x44/0x70
[2145453.331065] ? __pfx_kthread+0x10/0x10
[2145453.335153] ret_from_fork_asm+0x1b/0x30
[2145453.339403] </TASK>
Based on further crash-dump analysis, a review of the client/server logs, and the relevant source code, here is the suspected scenario.
In osc_req_attr_set(), when verifying that an LDLM lock is set and covers the page(s) as expected, the osc_dlmlock_at_pgoff() -> osc_match_base() -> ldlm_lock_match_with_skip() call sequence must have run and a matching lock must actually have been found/verified, but the final call to sptlrpc_import_check_ctx() must have returned an error, causing the match to fail even though the lock exists, hence the "uncovered page!" LBUG.
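The failing check can be sketched as follows (a simplified paraphrase of the lustre/osc/osc_object.c path reported at osc_object.c:408/421 in the console log, not a verbatim excerpt; variable and argument names are illustrative):

/* Simplified paraphrase of the failing check in osc_req_attr_set().
 * osc_dlmlock_at_pgoff() goes through osc_match_base() and
 * ldlm_lock_match_with_skip(); when sptlrpc_import_check_ctx() fails there,
 * no lock is returned even though a covering PW lock is actually granted. */
lock = osc_dlmlock_at_pgoff(env, osc, osc_index(opg), dap_flags);
if (lock == NULL && !opg->ops_srvlock) {
	/* dirty page with no matched DLM lock: dump the resource (the dump in
	 * the console log above does show a granted PW extent lock) and LBUG */
	CL_PAGE_DEBUG(D_ERROR, env, page, "uncovered page!\n");
	ldlm_resource_dump(D_ERROR, res);
	libcfs_debug_dumplog();
	LBUG();
}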
This context check should have triggered an "lgss_keyring" upcall (through the code path get_my_ctx() -> .lookup_ctx = gss_sec_lookup_ctx_kr() -> request_key() -> request_key_and_link() -> construct_key() -> "/sbin/request-key" upcall, which itself runs "lgss_keyring"; see the request-key configuration example further below), according to the following Lustre debug traces found in the debug log extracted from the crash-dump, just before the LBUG:
00010000:00080000:60.0:1751191238.684254:0:3283839:0:(ldlm_request.c:1471:ldlm_cli_update_pool()) @@@ Zero SLV or limit found (SLV=0, limit=0) req@000000002a7e81cd x1834742545129984/t0(0) o801->ai400x2-OST0035-osc-ffff96e1f48df800@XXX.YY.0.209@o2ib:28/4 lens 224/224 e 0 to 0 dl 1751191276 ref 2 fl Rpc:RQU/0/0 rc 0/0 job:''
02000000:00020000:60.0:1751191238.684293:0:3283839:0:(gss_keyring.c:1684:gss_kt_update()) negotiation: rpc err 0, gss err 80000
and also per these messages found in the Lustre client syslog:
2025-06-29T03:00:38.684405-07:00 runner-prenyx01 lgss_keyring: [3283839]:ERROR:lgssc_negotiation(): negotiation gss error 80000
2025-06-29T03:00:38.684548-07:00 runner-prenyx01 lgss_keyring: [3283839]:ERROR:lgssc_kr_negotiate_krb(): key 26deb1ca: failed to negotiation
2025-06-29T03:00:38.684644-07:00 runner-prenyx01 lgss_keyring: [3283839]:ERROR:error_kernel_key(): unlink key 26deb1ca from -3: No such file or directory
Both sets of traces/messages show that "lgss_keyring" returned the GSS error 0x80000 (GSS_S_NO_CONTEXT), which was most probably replied by the OST while handling the SEC_CTX_INIT RPC/request, itself sent through the Lustre downcall code triggered by writing to "sptlrpc/gss/init_channel".
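For reference, "gss err 80000" maps to GSS_S_NO_CONTEXT through the standard GSS-API major-status encoding (routine errors occupy bits 16-23); a minimal user-space check, using the usual gssapi.h constant values (not Lustre code):

#include <stdio.h>

/* GSS-API major-status layout (RFC 2744): routine errors in bits 16-23. */
#define GSS_C_ROUTINE_ERROR_OFFSET	16
#define GSS_C_ROUTINE_ERROR_MASK	0377
#define GSS_S_NO_CONTEXT	(8UL << GSS_C_ROUTINE_ERROR_OFFSET)	/* 0x80000 */

int main(void)
{
	unsigned long maj = 0x80000;	/* "gss err 80000" from the logs above */
	unsigned long routine = (maj >> GSS_C_ROUTINE_ERROR_OFFSET) &
				GSS_C_ROUTINE_ERROR_MASK;

	printf("major 0x%lx -> routine error %lu%s\n", maj, routine,
	       maj == GSS_S_NO_CONTEXT ? " (GSS_S_NO_CONTEXT)" : "");
	return 0;
}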
GSS_S_NO_CONTEXT was probably returned because of the OST failover/failback (also confirmed by recovery/reconnect-related messages in both client/server syslogs and Lustre debug logs), and the affected security context was then unbound and flagged in error (PTLRPC_CTX_ERROR_BIT) by "lgss_keyring" while handling the synchronous downcall error, via the keyctl() syscall which ends up in gss_kt_update().
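For completeness, the "/sbin/request-key" to "lgss_keyring" wiring mentioned above normally goes through the request-key mechanism; a typical /etc/request-key.d/lgssc.conf entry looks like the line below (the exact lgss_keyring path may vary per distribution):

create lgssc * * /usr/sbin/lgss_keyring %o %k %t %d %c %u %g %T %P %S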
On the server/OST side, the following syslog messages confirm this:
2025-06-29T03:00:23-07:00 ai400x2vm-025 kernel: LustreError: 0-0: ai400x2-OST0035: not available for GSS context init from XXX.YY.1.54@o2ib (no target).
2025-06-29T03:00:23-07:00 ai400x2vm-025 kernel: LustreError: Skipped 222 previous similar messages
Some change should be made in the Lustre code to handle this situation gracefully, avoid this LBUG, and wait for the OST to be back up with a valid GSS context before answering/returning to the client LDLM layer.
As a first thought, the GSS error handling in sptlrpc_import_check_ctx() could simply fall back to a context refresh, which could be done with the following change:
--- orig/lustre/ptlrpc/sec.c	2025-05-12 03:16:11.000000000 -0700
+++ bfi/lustre/ptlrpc/sec.c	2025-09-02 08:46:00.444983813 -0700
@@ -1031,8 +1031,11 @@ int sptlrpc_import_check_ctx(struct obd_
 	}
 
 	if (cli_ctx_is_error(ctx)) {
-		sptlrpc_cli_ctx_put(ctx, 1);
-		RETURN(-EACCES);
+		/* let's try to refresh as error should be transient and due
+		 * to target failover/failback... */
+		CDEBUG(D_SEC,
+		       "%s: failed to get ctx (%p, fl %lx)\n",
+		       imp->imp_obd->obd_name, ctx, ctx->cc_flags);
 	}
 
 	req = ptlrpc_request_cache_alloc(GFP_NOFS);
or maybe a more "ugly" fix would be to simply ignore the error returned by sptlrpc_import_check_ctx() in ldlm_lock_match_with_skip(), since a lock has been found and there was a valid context just before:
--- orig/lustre/ldlm/ldlm_lock.c	2025-05-12 03:16:11.000000000 -0700
+++ bfi/lustre/ldlm/ldlm_lock.c	2025-09-03 02:56:44.564077043 -0700
@@ -1499,10 +1499,16 @@ enum ldlm_mode ldlm_lock_match_with_skip
 	}
 
 	/* check user's security context */
-	if (lock->l_conn_export &&
-	    sptlrpc_import_check_ctx(
-		class_exp2cliimp(lock->l_conn_export)))
-		GOTO(out_fail_match, matched = 0);
+	if (lock->l_conn_export) {
+		int rc;
+
+		rc = sptlrpc_import_check_ctx(
+			class_exp2cliimp(lock->l_conn_export));
+		/* continue upon transient GSS context errors due to
+		 * server failover/failback */
+		if (rc != 0 && rc != -EACCES)
+			GOTO(out_fail_match, matched = 0);
+	}
 
 	LDLM_DEBUG(lock, "matched (%llu %llu)",
 		   (type == LDLM_PLAIN || type == LDLM_IBITS) ?
Will try to reproduce this and then propose these fix ideas for review.