[LU-15487] crash after mdd_dir_page_build() error Created: 27/Jan/22  Updated: 04/Feb/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.7
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Andreas Dilger Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
Severity: 3

 Description   

Crashes are occurring at random locations, apparently from memory corruption, shortly after mdd_dir_page_build() reports an error:
https://testing.whamcloud.com/test_sets/293f9d80-1e10-4042-86b5-7816504cc1ae
https://testing.whamcloud.com/test_sets/a6c5e9e1-dbdd-418e-8e51-f417bdee3be7

 LNetError: 11244:0:(o2iblnd_cb.c:3371:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
 LNetError: 11244:0:(o2iblnd_cb.c:3446:kiblnd_check_conns()) Timed out RDMA with 172.168.202.16@o2ib (105): c: 7, oc: 0, rc: 8
 Lustre: 11433:0:(mdd_object.c:3460:mdd_dir_page_build()) build page failed: -22!
 LustreError: 11251:0:(events.c:496:ptlrpc_master_callback()) ASSERTION( callback == request_out_callback || callback == reply_in_callback || callback == client_bulk_callback || callback == request_in_callback || callback == reply_out_callback || callback == server_bulk_callback ) failed: 
 LustreError: 11251:0:(events.c:496:ptlrpc_master_callback()) LBUG
 Pid: 11251, comm: kiblnd_sd_01_02 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Wed Mar 25 16:04:09 PDT 2020
 Call Trace:
 libcfs_call_trace+0x8c/0xc0 [libcfs]
 lbug_with_loc+0x4c/0xa0 [libcfs]
 ptlrpc_master_callback+0xbd/0xc0 [ptlrpc]
 lnet_eq_enqueue_event+0x2e/0x140 [lnet]
 lnet_finalize+0x24c/0xd40 [lnet]
 kiblnd_recv+0x1cd/0x7c0 [ko2iblnd]
 lnet_ni_recv+0xc8/0x330 [lnet]
 lnet_recv_put+0x85/0xb0 [lnet]
 lnet_parse_local+0x5ae/0xd40 [lnet]
 lnet_parse+0x99a/0x11e0 [lnet]
 kiblnd_handle_rx+0x213/0x6b0 [ko2iblnd]
 kiblnd_scheduler+0xf42/0x1190 [ko2iblnd]
 kthread+0xd1/0xe0

It is not yet clear whether the mdd_dir_page_build() error is a cause or an effect of the corruption. Filing this ticket to track the issue and to submit a debug patch.

 LDISKFS-fs warning (device dm-18): ldiskfs_dx_add_entry:2629: Large directory feature is not enabled on this filesystem
 Lustre: 52007:0:(mdd_object.c:3460:mdd_dir_page_build()) build page failed: -22!
 WARNING: CPU: 80 PID: 74750 at lib/list_debug.c:59 __list_del_entry+0xa1/0xd0
 list_del corruption. prev->next should be ffffa1d80d4a7000, but was           (null)
 LustreError: 56942:0:(events.c:496:ptlrpc_master_callback()) ASSERTION( callback == request_out_callback || callback == reply_in_callback || callback == client_bulk_callback || callback == request_in_callback || callback == reply_out_callback || callback == server_bulk_callback ) failed: 
CPU: 80 PID: 74750 Comm: mdt_rdpg01_086 Kdump: loaded 3.10.0-1062.1.1.el7_lustre.x86_64 #1
 __list_del_entry+0xa1/0xd0
 list_del+0xd/0x30
 ptlrpc_server_drop_request+0xe5/0x6d0 [ptlrpc]
 ptlrpc_server_finish_active_request+0x92/0x140 [ptlrpc]
 ptlrpc_server_handle_request+0x401/0xab0 [ptlrpc]
 ptlrpc_main+0xb34/0x1470 [ptlrpc]
 kthread+0xd1/0xe0
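
For reference when reading the warning above: the "list_del corruption" message comes from the kernel's list debugging check, which refuses to unlink an entry when the previous node's next pointer no longer points back at it (here it read as NULL). The following is a simplified userspace sketch of that kind of check, not the kernel's lib/list_debug.c code; all demo_* names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a kernel-style circular doubly linked list. */
struct demo_list_head {
	struct demo_list_head *next;
	struct demo_list_head *prev;
};

/* An empty list head points at itself in both directions. */
void demo_list_init(struct demo_list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Insert entry immediately after head. */
void demo_list_add(struct demo_list_head *entry, struct demo_list_head *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/*
 * Return true only if both neighbours still point back at entry.
 * A failure of the prev->next test is the condition reported as
 * "prev->next should be <entry>, but was <other>".
 */
bool demo_list_del_entry_valid(const struct demo_list_head *entry)
{
	const struct demo_list_head *prev = entry->prev;
	const struct demo_list_head *next = entry->next;

	if (prev == NULL || next == NULL)
		return false;
	if (prev->next != entry)
		return false;
	if (next->prev != entry)
		return false;
	return true;
}

/* Unlink entry if the list is consistent; leave it untouched otherwise. */
bool demo_list_del(struct demo_list_head *entry)
{
	if (!demo_list_del_entry_valid(entry))
		return false;

	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	entry->next = NULL;
	entry->prev = NULL;
	return true;
}
```

In this sketch, zeroing the neighbour's next pointer makes demo_list_del() refuse the unlink, which mirrors the shape of the warning in the log. It says nothing about what overwrote the pointer in the actual crash.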


 Comments   
Comment by Gerrit Updater [ 28/Jan/22 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/46368
Subject: LU-15487 mdd: print FID in mdd_dir_page_build() error
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 01f6a7873c4b9a13d95e2373985a4448e16a5034

Comment by Gerrit Updater [ 04/Mar/22 ]

"Lai Siyao <lai.siyao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/46697
Subject: LU-15487 osd-ldiskfs: verify size before packing rec
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 6cb991211de24e624a03e7a099f4132da44f71dd
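
The patch subject above refers to verifying a size before packing a record; the change itself is not shown in this ticket. As a generic illustration of that kind of check only (not the code from 46697), here is a minimal sketch that rejects a variable-length record with -EINVAL (-22, the same errno value that appears in the "build page failed" message) instead of writing past the end of a fixed-size buffer. The record layout and all demo_* names are hypothetical and do not correspond to Lustre structures.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical on-buffer record header; the name bytes follow it directly. */
struct demo_rec_hdr {
	uint16_t reclen;   /* header plus name, in bytes */
	uint16_t namelen;  /* length of the name, not NUL-terminated */
};

/*
 * Append one record to buf at offset *used, but only if it fits.
 * Returns 0 on success and advances *used.
 * Returns -EINVAL without touching buf or *used if the record
 * would not fit in the remaining space.
 */
int demo_pack_rec(char *buf, size_t bufsize, size_t *used,
		  const char *name, size_t namelen)
{
	struct demo_rec_hdr hdr;
	size_t need;

	/* The record length must be representable in the 16-bit field. */
	if (namelen > UINT16_MAX - sizeof(hdr))
		return -EINVAL;
	need = sizeof(hdr) + namelen;

	/* Verify the size before writing anything into the buffer. */
	if (*used > bufsize || need > bufsize - *used)
		return -EINVAL;

	hdr.reclen = (uint16_t)need;
	hdr.namelen = (uint16_t)namelen;

	/* memcpy avoids alignment assumptions about buf + *used. */
	memcpy(buf + *used, &hdr, sizeof(hdr));
	memcpy(buf + *used + sizeof(hdr), name, namelen);
	*used += need;
	return 0;
}
```

The comparison is written as need > bufsize - *used rather than *used + need > bufsize so that the check itself cannot overflow.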

Comment by Gerrit Updater [ 30/May/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46368/
Subject: LU-15487 mdd: print FID in mdd_dir_page_build() error
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 8bd1104a7b8ad0300e667f025aab17c0f93502f0
