[LU-12907] LNet routers: LNetError: 14141:0:(lib-msg.c:894:lnet_finalize()) ASSERTION( !(((current_thread_info()->preempt_count) & ((((1UL << (10))-1) << ((0 + 8) + 8)) | (((1UL << (8))-1) << (0 + 8)) | (((1UL << (1))-1) << (((0 + 8) + 8) + 10))))) Created: 26/Oct/19 Updated: 07/Dec/19 Resolved: 07/Dec/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Stephane Thiell | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Environment: |
CentOS 7.6 |
||
| Severity: | 2 | ||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||
| Description |
|
We have recently been upgrading our LNet routers to 2.12.3, and all of them crashed simultaneously tonight with the following assertion:
[39140.467535] LNetError: 14141:0:(lib-msg.c:894:lnet_finalize()) ASSERTION( !(((current_thread_info()->preempt_count) & ((((1UL << (10))-1) << ((0 + 8) + 8)) | (((1UL << (8))-1) << (0 + 8)) | (((1UL << (1))-1) << (((0 + 8) + 8) + 10))))) [39140.491917] general protection fault: 0000 [#1] SMP [39140.491969] Modules linked in: ko2iblnd(OE) lnet(OE) libcfs(OE) rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_fpga_tools(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) mlx4_ib(OE) ib_uverbsm [39140.491977] crct10dif_pclmul crct10dif_common tg3 libahci megaraid_sas ptp libata crc32c_intel pps_core [last unloaded: mlx_compat] [39140.491982] CPU: 0 PID: 14141 Comm: kiblnd_connd Kdump: loaded Tainted: G OE ------------ 3.10.0-957.27.2.el7.x86_64 #1 [39140.491983] Hardware name: Dell Inc. PowerEdge R630/02C2CP, BIOS 2.10.5 07/25/2019 [39140.491985] task: ffff90b918b1a080 ti: ffff90b8fa518000 task.ti: ffff90b8fa518000 [39140.491995] RIP: 0010:[<ffffffff886f3875>] [<ffffffff886f3875>] cpuacct_charge+0x35/0x50 [39140.491997] RSP: 0018:ffff90b91c603dd0 EFLAGS: 00010006 [39140.491998] RAX: 18244c8948c18cb8 RBX: ffff90b918b1a0e8 RCX: 000000000000ffff [39140.492000] RDX: ffffffff8925b640 RSI: 0000000001743e28 RDI: ffff90b918b1a080 [39140.492002] RBP: ffff90b91c603dd0 R08: ffffffffffffb820 R09: 000000000000040f [39140.492003] R10: 0000000000000004 R11: 0000000000000005 R12: 0000000001743e28 [39140.492005] R13: ffff90b91c61ac00 R14: ffff90b918b1a080 R15: 0000000000000000 [39140.492008] FS: 0000000000000000(0000) GS:ffff90b91c600000(0000) knlGS:0000000000000000 [39140.492010] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [39140.492011] CR2: 00007fd46cd96248 CR3: 0000000154c10000 CR4: 00000000003607f0 [39140.492013] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [39140.492015] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [39140.492016] Call Trace: [39140.492025] <IRQ> [39140.492025] [<ffffffff886e143c>] 
update_curr+0x14c/0x1e0 [39140.492029] [<ffffffff886e295d>] task_tick_fair+0x2bd/0x660 [39140.492034] [<ffffffff88634919>] ? sched_clock+0x9/0x10 [39140.492038] [<ffffffff886db1f5>] ? sched_clock_cpu+0x85/0xc0 [39140.492041] [<ffffffff886d60ad>] scheduler_tick+0xcd/0x150 [39140.492046] [<ffffffff8870c160>] ? tick_sched_do_timer+0x50/0x50 [39140.492051] [<ffffffff886ac3a5>] update_process_times+0x65/0x80 [39140.492055] [<ffffffff8870bed0>] tick_sched_handle+0x30/0x70 [39140.492058] [<ffffffff8870c199>] tick_sched_timer+0x39/0x80 [39140.492065] [<ffffffff886c71e3>] __hrtimer_run_queues+0xf3/0x270 [39140.492069] [<ffffffff886c776f>] hrtimer_interrupt+0xaf/0x1d0 [39140.492076] [<ffffffff8865a61b>] local_apic_timer_interrupt+0x3b/0x60 [39140.492081] [<ffffffff88d7b6e3>] smp_apic_timer_interrupt+0x43/0x60 [39140.492087] [<ffffffff88d77df2>] apic_timer_interrupt+0x162/0x170 [39140.492111] <EOI> [39140.492111] [<ffffffffc0ac3f9d>] ? lnet_finalize+0x98d/0x9a0 [lnet] [39140.492127] [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs] [39140.492156] [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet] [39140.492171] [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet] [39140.492184] [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet] [39140.492196] [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet] [39140.492206] [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs] [39140.492219] [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet] [39140.492232] [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet] [39140.492243] [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet] [39140.492254] [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet] [39140.492264] [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs] [39140.492276] [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet] [39140.492288] [<ffffffffc0acf9a8>] ? 
lnet_return_tx_credits_locked+0x2a8/0x490 [lnet] [39140.492299] [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet] [39140.492309] [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet] [39140.492319] [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs] [39140.492330] [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet] [39140.492341] [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet] [39140.492353] [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet] [39140.492363] [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet] [39140.492372] [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs] [39140.492383] [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet] [39140.492395] [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet] [39140.492406] [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet] [39140.492416] [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet] [39140.492425] [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs] [39140.492436] [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet] [39140.492447] [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet] [39140.492457] [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet] [39140.492468] [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet] [39140.492476] [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs] [39140.492487] [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet] [39140.492501] [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet] [39140.492512] [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet] [39140.492522] [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet] [39140.492530] [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs] [39140.492541] [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet] [39140.492552] [<ffffffffc0acf9a8>] ? 
lnet_return_tx_credits_locked+0x2a8/0x490 [lnet] [39140.492563] [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet] [39140.492573] [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet] [39140.492582] [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs] [39140.492592] [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet] [39140.492603] [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet] [39140.492614] [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet] [39140.492624] [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet] [39140.492632] [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs] [39140.492642] [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet] [39140.492653] [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet] [39140.492664] [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet] [39140.492674] [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet] [39140.492682] [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs] [39140.492693] [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet] [39140.492704] [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet] [39140.492714] [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet] [39140.492724] [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet] [39140.492732] [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs] [39140.492743] [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet] [39140.492753] [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet] [39140.492764] [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet] [39140.492774] [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet] [39140.492782] [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs] [39140.492792] [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet] [39140.492803] [<ffffffffc0acf9a8>] ? 
lnet_return_tx_credits_locked+0x2a8/0x490 [lnet] [39140.492813] [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet] [39140.492823] [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet] [39140.492831] [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs] [39140.492842] [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet] [39140.492853] [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet] [39140.492863] [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet] [39140.492873] [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet] [39140.492881] [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs] [39140.492892] [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet] [39140.492902] [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet] [39140.492913] [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet] [39140.492923] [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet] [39140.492931] [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs] [39140.492941] [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet] [39140.492952] [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet] [39140.492962] [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet] [39140.492972] [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet] [39140.492980] [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs] [39140.492991] [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet] [39140.493002] [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet] [39140.493012] [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet] [39140.493022] [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet] [39140.493030] [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs] [39140.493040] [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet] [39140.493051] [<ffffffffc0acf9a8>] ? 
lnet_return_tx_credits_locked+0x2a8/0x490 [lnet] [39140.493061] [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet] [39140.493071] [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet] [39140.493079] [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs] [39140.493089] [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet] [39140.493100] [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet] [39140.493111] [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet] [39140.493121] [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet] [39140.493128] [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs] [39140.493139] [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet] [39140.493150] [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet] [39140.493160] [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet] [39140.493170] [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet] [39140.493178] [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs] [39140.493188] [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet] [39140.493199] [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet] [39140.493209] [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet] [39140.493219] [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet] [39140.493227] [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs] [39140.493237] [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet] [39140.493248] [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet] [39140.493258] [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet] [39140.493268] [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet] [39140.493276] [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs] [39140.493287] [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet] [39140.493297] [<ffffffffc0acf9a8>] ? 
lnet_return_tx_credits_locked+0x2a8/0x490 [lnet] [39140.493308] [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet] [39140.493318] [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet] [39140.493325] [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs] [39140.493336] [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet] [39140.493347] [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet] [39140.493357] [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet] [39140.493367] [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet] [39140.493375] [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs] [39140.493385] [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet] [39140.493396] [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet] [39140.493406] [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet] [39140.493416] [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet] [39140.493424] [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs] [39140.493434] [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet] [39140.493445] [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet] [39140.493455] [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet] [39140.493466] [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet] [39140.493473] [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs] [39140.493484] [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet] [39140.493496] [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet] [39140.493508] [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet] [39140.493518] [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet] [39140.493525] [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs] [39140.493536] [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet] [39140.493547] [<ffffffffc0acf9a8>] ? 
lnet_return_tx_credits_locked+0x2a8/0x490 [lnet] [39140.493557] [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet] [39140.493567] [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet] [39140.493575] [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs] [39140.493585] [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet] [39140.493596] [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet] [39140.493606] [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet] [39140.493616] [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet] [39140.493624] [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs] [39140.493634] [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet] [39140.493645] [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet] [39140.493655] [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet] [39140.493665] [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet] [39140.493673] [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs] [39140.493684] [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet] [39140.493694] [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet] [39140.493704] [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet] [39140.493714] [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet] [39140.493722] [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs] [39140.493733] [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet] [39140.493743] [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet] [39140.493753] [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet] [39140.493763] [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet] [39140.493771] [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs] [39140.493782] [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet] [39140.493792] [<ffffffffc0acf9a8>] ? 
lnet_return_tx_credits_locked+0x2a8/0x490 [lnet] [39140.493802] [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet] [39140.493813] [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet] [39140.493820] [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs] [39140.493831] [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet] [39140.493842] [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet] [39140.493852] [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet] [39140.493862] [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet] [39140.493869] [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs] [39140.493880] [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet] [39140.493891] [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet] [39140.493901] [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet] [39140.493911] [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet] [39140.493918] [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs] [39140.493929] [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet] [39140.493940] [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet] [39140.493950] [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet] [39140.493960] [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet] [39140.493967] [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs] [39140.493978] [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet] [39140.493989] [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet] [39140.493999] [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet] [39140.494009] [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet] [39140.494017] [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs] [39140.494027] [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet] [39140.494038] [<ffffffffc0acf9a8>] ? 
lnet_return_tx_credits_locked+0x2a8/0x490 [lnet] [39140.494048] [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet] [39140.494058] [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet] [39140.494065] [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs] [39140.494076] [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet] [39140.494087] [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet] [39140.494097] [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet] [39140.494107] [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet] [39140.494114] [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs] [39140.494125] [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet] [39140.494136] [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet] [39140.494146] [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet] [39140.494156] [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet] [39140.494163] [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs] [39140.494174] [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet] [39140.494185] [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet] [39140.494195] [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet] [39140.494205] [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet] [39140.494212] [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs] [39140.494223] [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet] [39140.494233] [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet] [39140.494243] [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet] [39140.494253] [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet] [39140.494261] [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs] [39140.494271] [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet] [39140.494282] [<ffffffffc0acf9a8>] ? 
lnet_return_tx_credits_locked+0x2a8/0x490 [lnet] [39140.494292] [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet] [39140.494302] [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet] [39140.494310] [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs] [39140.494320] [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet] [39140.494332] [<ffffffffc0ac0082>] ? libcfs_nid2str_r+0xe2/0x130 [lnet] [39140.494343] [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet] [39140.494353] [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet] [39140.494363] [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet] [39140.494372] [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs] [39140.494382] [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet] [39140.494391] [<ffffffffc090fae8>] ? libcfs_debug_vmsg2+0x6d8/0xb30 [libcfs] [39140.494402] [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet] [39140.494412] [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet] [39140.494423] [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet] [39140.494432] [<ffffffffc0babd22>] ? kiblnd_pool_free_node+0x82/0x170 [ko2iblnd] [39140.494440] [<ffffffffc0bb561d>] ? kiblnd_tx_done+0x10d/0x3e0 [ko2iblnd] [39140.494447] [<ffffffffc0bb593b>] ? kiblnd_txlist_done+0x4b/0x60 [ko2iblnd] [39140.494454] [<ffffffffc0bbab83>] ? kiblnd_check_conns+0x553/0x880 [ko2iblnd] [39140.494465] [<ffffffffc09213ba>] ? cfs_percpt_unlock+0x1a/0xb0 [libcfs] [39140.494473] [<ffffffffc0bbfc1b>] ? kiblnd_connd+0x83b/0xa00 [ko2iblnd] [39140.494476] [<ffffffff886d7c40>] ? wake_up_state+0x20/0x20 [39140.494484] [<ffffffffc0bbf3e0>] ? kiblnd_cm_callback+0x2380/0x2380 [ko2iblnd] [39140.494487] [<ffffffff886c2e81>] ? kthread+0xd1/0xe0 [39140.494490] [<ffffffff886c2db0>] ? insert_kthread_work+0x40/0x40 [39140.494495] [<ffffffff88d76c37>] ? ret_from_fork_nospec_begin+0x21/0x21 [39140.494499] [<ffffffff886c2db0>] ? 
insert_kthread_work+0x40/0x40
[39140.494536] Code: 48 89 e5 48 63 48 18 48 8b 87 40 09 00 00 48 8b 50 48 eb 0b 66 90 48 8b 50 68 48 85 d2 74 1b 48 8b 42 40 48 03 04 cd a0 bf 34 89 <48> 01 30 48 8b 02 48 8b 40 40 48 85 c0 75 dc 5d c3 66 2e 0f 1f
[39140.494541] RIP [<ffffffff886f3875>] cpuacct_charge+0x35/0x50
[39140.494541] RSP <ffff90b91c603dd0>
[root@sh-rtr-fir-1-1 127.0.0.1-2019-10-25-21:01:51]# rpm -qa | grep lustre
lustre-client-2.12.3-1.el7.x86_64
lustre-client-dkms-2.12.3-1.el7.noarch
[root@sh-rtr-fir-1-1 127.0.0.1-2019-10-25-21:01:51]# uname -a
Linux sh-rtr-fir-1-1.int 3.10.0-957.27.2.el7.x86_64 #1 SMP Mon Jul 29 17:46:05 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
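Note on reading the assertion text: the long bit expression in the LNetError line appears to be the preprocessor expansion of the kernel's in_interrupt() test, i.e. lnet_finalize() asserting that it is not running in interrupt context. The sketch below is only a decoding aid, assuming the Linux 3.10 preempt_count layout that the shifts in the message imply (8 preempt bits, 8 softirq bits, 10 hardirq bits, 1 NMI bit). The constant names mirror the kernel macros; the helper functions field_mask() and in_interrupt() are illustrative Python stand-ins, not kernel or Lustre code.

```python
# Bit widths taken from the shifts visible in the expanded assertion
# (assumed to match the Linux 3.10 preempt_count layout).
PREEMPT_BITS = 8
SOFTIRQ_BITS = 8
HARDIRQ_BITS = 10
NMI_BITS = 1

PREEMPT_SHIFT = 0
SOFTIRQ_SHIFT = PREEMPT_SHIFT + PREEMPT_BITS   # (0 + 8)
HARDIRQ_SHIFT = SOFTIRQ_SHIFT + SOFTIRQ_BITS   # ((0 + 8) + 8)
NMI_SHIFT = HARDIRQ_SHIFT + HARDIRQ_BITS       # (((0 + 8) + 8) + 10)


def field_mask(bits: int, shift: int) -> int:
    """Equivalent of ((1UL << bits) - 1) << shift from the assertion text."""
    return ((1 << bits) - 1) << shift


HARDIRQ_MASK = field_mask(HARDIRQ_BITS, HARDIRQ_SHIFT)
SOFTIRQ_MASK = field_mask(SOFTIRQ_BITS, SOFTIRQ_SHIFT)
NMI_MASK = field_mask(NMI_BITS, NMI_SHIFT)

# The three terms OR-ed together in the assertion.
INTERRUPT_MASK = HARDIRQ_MASK | SOFTIRQ_MASK | NMI_MASK


def in_interrupt(preempt_count: int) -> bool:
    """True if any hardirq, softirq or NMI bit is set in preempt_count."""
    return bool(preempt_count & INTERRUPT_MASK)


if __name__ == "__main__":
    print(hex(HARDIRQ_MASK), hex(SOFTIRQ_MASK), hex(NMI_MASK), hex(INTERRUPT_MASK))
```

With this layout, a preempt_count that only has preemption-disable bits set (the low 8 bits) does not trip the check, while any bit in the softirq, hardirq or NMI fields does.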
| Comments |
| Comment by Peter Jones [ 26/Oct/19 ] |
|
Amir, can you please advise? Peter |
| Comment by Amir Shehata (Inactive) [ 28/Oct/19 ] |
|
For both
lnetctl set health_sensitivity 0
lnetctl set retry_count 0
lnetctl set transaction_timeout 10
If this resolves the issue, let's keep it off while I investigate on my side. |
| Comment by Stephane Thiell [ 28/Oct/19 ] |
|
Thanks, we'll try to do that and see how it goes. retry_count should probably be set first, because otherwise I get:
[68613.258761] LNetError: 242742:0:(api-ni.c:467:retry_count_set()) Can not set retry_count when health feature is turned off
What I didn't mention in my original report is that it happened while we were running lfs project -p ... -r -s /scratch/... to assign project IDs to directories, and we had several of them running on a single client (up to 20). I wasn't sure it was related, but it has only happened when doing that. I'm not sure how this could be related to LNet though... |
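The ordering problem reported in this comment (retry_count is rejected once health has been turned off) can be sketched as a small helper that emits the three lnetctl commands from the previous comment with retry_count first. This is a hypothetical illustration, not a Lustre tool: the function name disable_health_commands() is invented here, and it only builds the command strings so they can be reviewed before being run by hand. Keeping transaction_timeout last is an assumption carried over from the order in which the commands were originally given.

```python
def disable_health_commands(transaction_timeout: int = 10) -> list:
    """Return the lnetctl commands in an order that avoids the
    'Can not set retry_count when health feature is turned off' error.

    Hypothetical helper: it only builds strings, it does not run anything.
    """
    return [
        # Must come first: it is refused once health_sensitivity is 0.
        "lnetctl set retry_count 0",
        # Turns the health feature off.
        "lnetctl set health_sensitivity 0",
        # Timeout value as suggested in the thread (10 in the first comment).
        "lnetctl set transaction_timeout {}".format(transaction_timeout),
    ]


if __name__ == "__main__":
    for cmd in disable_health_commands():
        print(cmd)
```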
| Comment by Amir Shehata (Inactive) [ 28/Oct/19 ] |
|
That specific operation could generate a workload that exposes the problem. I also pointed out a couple of patches on |
| Comment by Stephane Thiell [ 30/Oct/19 ] |
|
Hi Amir, All of our routers but one (7 total) crashed again last night with this assertion. We hadn't turned off health on these yet. So I tried to apply your patch on top of b2_12, but it fails to compile:
Making all in .
/tmp/rpmbuild-lustre-sthiell-wfd0qnr4/BUILD/lustre-2.12.3_1_ge97f606/lnet/lnet/api-ni.c: In function 'lnet_unprepare':
/tmp/rpmbuild-lustre-sthiell-wfd0qnr4/BUILD/lustre-2.12.3_1_ge97f606/lnet/lnet/api-ni.c:1244:3: error: implicit declaration of function 'lnet_clean_zombie_rstqs' [-Werror=implicit-function-declaration]
lnet_clean_zombie_rstqs();
^
We never had an LNet router crash before 2.12.3 as far as I remember, so I think this is an important regression in 2.12.3. I hope you can fix the patch so we can try it. Until then, we're going to disable health as much as we can. Thanks! |
| Comment by Stephane Thiell [ 30/Oct/19 ] |
|
Hi Amir, We have now disabled LNet health everywhere (servers, routers and all clients). On the routers, for example, we used this:
[root@sh-rtr-fir-2-1 ~]# cat /etc/lnet.conf
global:
    - retry_count: 0
    - health_sensitivity: 0
    - transaction_timeout: 10
net:
    - net type: o2ib4
      local NI(s):
        - nid:
          interfaces:
              0: ib0
    - net type: o2ib7
      local NI(s):
        - nid:
          interfaces:
              0: ib1
routing:
    - enable: 1
I'll report back if the issue happens again. |
| Comment by Amir Shehata (Inactive) [ 30/Oct/19 ] |
|
Hi Stephane, I applied the patch to a fresh checkout of b2_12 and it compiled OK.
git clone git://git.whamcloud.com/fs/lustre-release.git
cd lustre-release
git checkout b2_12
# apply LU-12441 patch
git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/52/35452/9 && git cherry-pick FETCH_HEAD
# verify there are no conflicts with LU-12856.patch
patch -p1 --dry-run < LU-12856.patch
# apply the LU-12856.patch
patch -p1 < LU-12856.patch
make rpms
Regarding your changes above: the LND calculates its timeout value as transaction_timeout/retry_count. If retry_count is 0, then lnd_timeout = transaction_timeout. When you turn off health, you should set transaction_timeout to whatever timeout you had previously in your LND. I would suggest 50s unless your setup requires a longer timeout. |
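The timeout rule described in this comment can be written out as a short sketch. It follows only the wording above (transaction_timeout divided by retry_count, falling back to transaction_timeout when retry_count is 0); the function name lnd_timeout() and the use of integer division are assumptions made for illustration, and the actual LNet code may compute the value differently.

```python
def lnd_timeout(transaction_timeout: int, retry_count: int) -> int:
    """LND timeout as described in the comment above.

    Assumption: integer division; the real LNet implementation may differ.
    """
    if retry_count <= 0:
        # Health/retries disabled: the LND uses the whole transaction timeout.
        return transaction_timeout
    return transaction_timeout // retry_count


if __name__ == "__main__":
    # Settings from the lnet.conf posted earlier in the thread.
    print(lnd_timeout(10, 0))
    # Value suggested in this comment when health is turned off.
    print(lnd_timeout(50, 0))
```

Under this rule, the earlier configuration (transaction_timeout 10, retry_count 0) gives the LND a 10-second timeout, which is why a larger transaction_timeout is suggested when health is disabled.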
| Comment by Stephane Thiell [ 31/Oct/19 ] |
|
Hi Amir, Thanks for the explanation regarding transaction_timeout. You attached |
| Comment by Amir Shehata (Inactive) [ 31/Oct/19 ] |
|
Hi Stephane, I pushed the two patches on b2_12 (https://review.whamcloud.com/36634). Let me know if they work for you. |
| Comment by Stephane Thiell [ 01/Nov/19 ] |
|
Thanks Amir,
[Thu Oct 31 21:00:21 2019][1331442.548826] LustreError: 60972:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff9b725f135a00
[Thu Oct 31 21:00:21 2019][1331442.559784] LustreError: 60971:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff9b725f135a00
[Thu Oct 31 21:00:21 2019][1331442.570740] LustreError: 60972:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff9b725f135a00
[Thu Oct 31 21:00:22 2019][1331442.681474] LustreError: 60974:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff9b8531d99000
[Thu Oct 31 21:00:22 2019][1331442.755621] LustreError: 60973:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff9b61fb25c800
[Thu Oct 31 21:00:22 2019][1331442.878766] LustreError: 60967:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff9b660cc4e200
[Thu Oct 31 21:00:22 2019][1331442.889730] LustreError: 60967:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff9b836c616000
[Thu Oct 31 21:00:22 2019][1331442.900739] LustreError: 60968:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff9b7d5c8dd800
[Thu Oct 31 21:00:22 2019][1331442.911739] LustreError: 60967:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff9b7e767e9200
[Thu Oct 31 21:00:22 2019][1331442.922729] LustreError: 60969:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff9b5e8d4a2a00
[Thu Oct 31 21:00:22 2019][1331442.922737] LustreError: 118454:0:(ldlm_lib.c:3262:target_bulk_io()) @@@ network error on bulk WRITE req@ffff9b701e11c050 x1648394617161472/t0(0) o4->ae27ee87-ec90-0302-3abf-01a84652e2bd@10.8.27.1@o2ib6:418/0 lens 488/448 e 0 to 0 dl 1572580858 ref 1 fl Interpret:/0/0 rc 0/0
[Thu Oct 31 21:00:22 2019][1331442.922739] LustreError: 118454:0:(ldlm_lib.c:3262:target_bulk_io()) Skipped 12 previous similar messages
[Thu Oct 31 21:00:22 2019][1331442.968061] LNetError: 60969:0:(lib-move.c:2729:lnet_detach_rsp_tracker()) ASSERTION( rspt->rspt_cpt == cpt ) failed:
[Thu Oct 31 21:00:22 2019][1331442.978925] LNetError: 60969:0:(lib-move.c:2729:lnet_detach_rsp_tracker()) LBUG
[Thu Oct 31 21:00:22 2019][1331442.986412] Pid: 60969, comm: kiblnd_sd_01_02 3.10.0-957.27.2.el7_lustre.pl1.x86_64 #1 SMP Mon Aug 5 15:28:37 PDT 2019
[Thu Oct 31 21:00:22 2019][1331442.997281] Call Trace:
[Thu Oct 31 21:00:22 2019][1331442.999925] [<ffffffffc0ccc7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[Thu Oct 31 21:00:22 2019][1331443.006665] [<ffffffffc0ccc87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[Thu Oct 31 21:00:22 2019][1331443.013052] [<ffffffffc0dfb49b>] lnet_detach_rsp_tracker+0x5b/0x60 [lnet]
[Thu Oct 31 21:00:22 2019][1331443.020142] [<ffffffffc0debd3a>] lnet_finalize+0x72a/0x9a0 [lnet]
[Thu Oct 31 21:00:22 2019][1331443.026537] [<ffffffffc0df5a51>] lnet_post_send_locked+0x751/0x9c0 [lnet]
[Thu Oct 31 21:00:22 2019][1331443.033626] [<ffffffffc0df79a8>] lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
[Thu Oct 31 21:00:22 2019][1331443.041401] [<ffffffffc0dea5ec>] lnet_msg_decommit+0xec/0x700 [lnet]
[Thu Oct 31 21:00:22 2019][1331443.048046] [<ffffffffc0deb9b7>] lnet_finalize+0x3a7/0x9a0 [lnet]
[Thu Oct 31 21:00:22 2019][1331443.054435] [<ffffffffc0d4161d>] kiblnd_tx_done+0x10d/0x3e0 [ko2iblnd]
[Thu Oct 31 21:00:22 2019][1331443.061257] [<ffffffffc0d4cb0d>] kiblnd_scheduler+0x89d/0x1180 [ko2iblnd]
[Thu Oct 31 21:00:22 2019][1331443.068335] [<ffffffff8dac2e81>] kthread+0xd1/0xe0
[Thu Oct 31 21:00:22 2019][1331443.073424] [<ffffffff8e177c24>] ret_from_fork_nospec_begin+0xe/0x21
[Thu Oct 31 21:00:22 2019][1331443.080071] [<ffffffffffffffff>] 0xffffffffffffffff
[Thu Oct 31 21:00:22 2019][1331443.085273] Kernel panic - not syncing: LBUG
Reminder: this is with 2.12.3 on servers, routers and clients. |
| Comment by Amir Shehata (Inactive) [ 04/Nov/19 ] |
|
Hi Stephane, Did you have time to try the two new patches? It would be nice to verify if they resolve the issue. |
| Comment by Stephane Thiell [ 05/Nov/19 ] |
|
Hi Amir, Both patches are installed on all Lustre servers now. We haven't done the clients and routers yet (it's another system). However, we're having other issues now. I don't think they are related; I suspect a new DoM issue:
[76929.924807] LNet: Service thread pid 42438 was inactive for 537.92s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[76929.941766] LNet: Skipped 3 previous similar messages
[76929.946821] Pid: 42438, comm: mdt01_064 3.10.0-957.27.2.el7_lustre.pl1.x86_64 #1 SMP Mon Aug 5 15:28:37 PDT 2019
[76929.957008] Call Trace:
[76929.959474] [<ffffffffc105ab75>] ldlm_completion_ast+0x4e5/0x860 [ptlrpc]
[76929.966421] [<ffffffffc105b5e1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
[76929.973621] [<ffffffffc174650b>] mdt_object_local_lock+0x50b/0xb20 [mdt]
[76929.980452] [<ffffffffc1746b90>] mdt_object_lock_internal+0x70/0x360 [mdt]
[76929.987463] [<ffffffffc1746ea0>] mdt_object_lock+0x20/0x30 [mdt]
[76929.993591] [<ffffffffc1785c4b>] mdt_brw_enqueue+0x44b/0x760 [mdt]
[76929.999916] [<ffffffffc17344bf>] mdt_intent_brw+0x1f/0x30 [mdt]
[76930.005960] [<ffffffffc174cbb5>] mdt_intent_policy+0x435/0xd80 [mdt]
[76930.012462] [<ffffffffc1041d46>] ldlm_lock_enqueue+0x356/0xa20 [ptlrpc]
[76930.019212] [<ffffffffc106a336>] ldlm_handle_enqueue0+0xa56/0x15f0 [ptlrpc]
[76930.026331] [<ffffffffc10f2a12>] tgt_enqueue+0x62/0x210 [ptlrpc]
[76930.032496] [<ffffffffc10f736a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[76930.039444] [<ffffffffc109e24b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[76930.047160] [<ffffffffc10a1bac>] ptlrpc_main+0xb2c/0x1460 [ptlrpc]
[76930.053504] [<ffffffffb30c2e81>] kthread+0xd1/0xe0
[76930.058435] [<ffffffffb3777c24>] ret_from_fork_nospec_begin+0xe/0x21
[76930.064929] [<ffffffffffffffff>] 0xffffffffffffffff
We just got a crash dump of an MDS, and I will open a new ticket for this issue when ready. |
| Comment by Stephane Thiell [ 06/Nov/19 ] |
|
Hi Amir, We've completed the installation of the two LNet patches on all our routers now. We're deploying new clients with them too. I'll report back if we see any problem. Note: the other issue (mdt_intent_brw ) is DoM related and tracked in |
| Comment by Stephane Thiell [ 08/Nov/19 ] |
|
Amir, we have resumed our parallel runs of lfs project -r, and so far we haven't seen any problem when using your two patches (servers, routers and clients in that case). So far, it looks good. I'll update next week. Fingers crossed. |
| Comment by Stephane Thiell [ 14/Nov/19 ] |
|
Amir, still no problem when using your patches. |
| Comment by Amir Shehata (Inactive) [ 14/Nov/19 ] |
|
Great. Hopefully we can land that on the b2_12 branch. |
| Comment by Peter Jones [ 07/Dec/19 ] |
|
Both fixes mentioned have now landed to b2_12 so will be in the upcoming 2.12.4 release |
| Comment by Stephane Thiell [ 07/Dec/19 ] |
|
Thanks! |