Details
- Bug
- Resolution: Unresolved
- Major
- None
- Lustre 2.12.7
- None
- 3
- 9223372036854775807
Description
Seeing crashes in random locations with apparent memory corruption shortly after mdd_dir_page_build() reports an error.
https://testing.whamcloud.com/test_sets/293f9d80-1e10-4042-86b5-7816504cc1ae
https://testing.whamcloud.com/test_sets/a6c5e9e1-dbdd-418e-8e51-f417bdee3be7
LNetError: 11244:0:(o2iblnd_cb.c:3371:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
LNetError: 11244:0:(o2iblnd_cb.c:3446:kiblnd_check_conns()) Timed out RDMA with 172.168.202.16@o2ib (105): c: 7, oc: 0, rc: 8
Lustre: 11433:0:(mdd_object.c:3460:mdd_dir_page_build()) build page failed: -22!
LustreError: 11251:0:(events.c:496:ptlrpc_master_callback()) ASSERTION( callback == request_out_callback || callback == reply_in_callback || callback == client_bulk_callback || callback == request_in_callback || callback == reply_out_callback || callback == server_bulk_callback ) failed:
LustreError: 11251:0:(events.c:496:ptlrpc_master_callback()) LBUG
Pid: 11251, comm: kiblnd_sd_01_02 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Wed Mar 25 16:04:09 PDT 2020
Call Trace:
libcfs_call_trace+0x8c/0xc0 [libcfs]
lbug_with_loc+0x4c/0xa0 [libcfs]
ptlrpc_master_callback+0xbd/0xc0 [ptlrpc]
lnet_eq_enqueue_event+0x2e/0x140 [lnet]
lnet_finalize+0x24c/0xd40 [lnet]
kiblnd_recv+0x1cd/0x7c0 [ko2iblnd]
lnet_ni_recv+0xc8/0x330 [lnet]
lnet_recv_put+0x85/0xb0 [lnet]
lnet_parse_local+0x5ae/0xd40 [lnet]
lnet_parse+0x99a/0x11e0 [lnet]
kiblnd_handle_rx+0x213/0x6b0 [ko2iblnd]
kiblnd_scheduler+0xf42/0x1190 [ko2iblnd]
kthread+0xd1/0xe0
Not yet sure of cause/effect, but filing ticket to track and submit a debug patch.
LDISKFS-fs warning (device dm-18): ldiskfs_dx_add_entry:2629: Large directory feature is not enabled on this filesystem
Lustre: 52007:0:(mdd_object.c:3460:mdd_dir_page_build()) build page failed: -22!
WARNING: CPU: 80 PID: 74750 at lib/list_debug.c:59 __list_del_entry+0xa1/0xd0
list_del corruption. prev->next should be ffffa1d80d4a7000, but was (null)
LustreError: 56942:0:(events.c:496:ptlrpc_master_callback()) ASSERTION( callback == request_out_callback || callback == reply_in_callback || callback == client_bulk_callback || callback == request_in_callback || callback == reply_out_callback || callback == server_bulk_callback ) failed:
CPU: 80 PID: 74750 Comm: mdt_rdpg01_086 Kdump: loaded 3.10.0-1062.1.1.el7_lustre.x86_64 #1
__list_del_entry+0xa1/0xd0
list_del+0xd/0x30
ptlrpc_server_drop_request+0xe5/0x6d0 [ptlrpc]
ptlrpc_server_finish_active_request+0x92/0x140 [ptlrpc]
ptlrpc_server_handle_request+0x401/0xab0 [ptlrpc]
ptlrpc_main+0xb34/0x1470 [ptlrpc]
kthread+0xd1/0xe0
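Both console logs above contain the same "build page failed: -22!" message from mdd_dir_page_build(). Kernel code conventionally reports errors as negative errno values, and 22 is EINVAL on Linux. The following is a rough triage sketch, not part of any Lustre tooling, that pulls the PID, source location, and function out of such a line and decodes the return code. The regular expression is inferred only from the lines quoted in this ticket, and the helper names (parse_console_line, decode_rc) are hypothetical.

```python
import errno
import re

# Assumed layout, inferred from the log lines in this ticket:
#   "<Level>: <pid>:<n>:(<file>:<line>:<function>()) <message>"
LOG_RE = re.compile(
    r"^(?P<level>\w+): (?P<pid>\d+):\d+:"
    r"\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\) (?P<msg>.*)$"
)


def parse_console_line(line):
    """Return a dict of fields for a matching console line, else None."""
    m = LOG_RE.match(line.strip())
    return m.groupdict() if m else None


def decode_rc(rc):
    """Map a (negative) kernel return code to its symbolic errno name."""
    return errno.errorcode.get(abs(rc), "UNKNOWN")


sample = ("Lustre: 11433:0:(mdd_object.c:3460:mdd_dir_page_build()) "
          "build page failed: -22!")
rec = parse_console_line(sample)
rc = int(re.search(r"-\d+", rec["msg"]).group())
print(rec["func"], rc, decode_rc(rc))  # prints: mdd_dir_page_build -22 EINVAL
```

Running this over a saved console log and grouping by function makes it easier to check whether each crash is preceded by the same mdd_dir_page_build() failure.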
We are seeing recurring MDS crashes that fit the description of this bug. The server is running 2.15.3.