[LU-12907] LNet routers: LNetError: 14141:0:(lib-msg.c:894:lnet_finalize()) ASSERTION( !(((current_thread_info()->preempt_count) & ((((1UL << (10))-1) << ((0 + 8) + 8)) | (((1UL << (8))-1) << (0 + 8)) | (((1UL << (1))-1) << (((0 + 8) + 8) + 10))))) Created: 26/Oct/19  Updated: 07/Dec/19  Resolved: 07/Dec/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.3
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Stephane Thiell Assignee: Amir Shehata (Inactive)
Resolution: Duplicate Votes: 0
Labels: None
Environment:

CentOS 7.6


Attachments: Text File vmcore-dmesg-sh-rtr-fir-1-1.log    
Issue Links:
Related
is related to LU-12568 LNetError: 28086:0:(lib-move.c:2862:l... Resolved
is related to LU-12441 Response tracker is not detached on r... Resolved
Severity: 2
Rank (Obsolete): 9223372036854775807

 Description   

We recently upgraded our LNet routers to 2.12.3, and all of them crashed simultaneously tonight with the following assertion:

 

[39140.467535] LNetError: 14141:0:(lib-msg.c:894:lnet_finalize()) ASSERTION( !(((current_thread_info()->preempt_count) & ((((1UL << (10))-1) << ((0 + 8) + 8)) | (((1UL << (8))-1) << (0 + 8)) | (((1UL << (1))-1) << (((0 + 8) + 8) + 10)))))
[39140.491917] general protection fault: 0000 [#1] SMP 
[39140.491969] Modules linked in: ko2iblnd(OE) lnet(OE) libcfs(OE) rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_fpga_tools(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) mlx4_ib(OE) ib_uverbsm
[39140.491977]  crct10dif_pclmul crct10dif_common tg3 libahci megaraid_sas ptp libata crc32c_intel pps_core [last unloaded: mlx_compat]
[39140.491982] CPU: 0 PID: 14141 Comm: kiblnd_connd Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.27.2.el7.x86_64 #1
[39140.491983] Hardware name: Dell Inc. PowerEdge R630/02C2CP, BIOS 2.10.5 07/25/2019
[39140.491985] task: ffff90b918b1a080 ti: ffff90b8fa518000 task.ti: ffff90b8fa518000
[39140.491995] RIP: 0010:[<ffffffff886f3875>]  [<ffffffff886f3875>] cpuacct_charge+0x35/0x50
[39140.491997] RSP: 0018:ffff90b91c603dd0  EFLAGS: 00010006
[39140.491998] RAX: 18244c8948c18cb8 RBX: ffff90b918b1a0e8 RCX: 000000000000ffff
[39140.492000] RDX: ffffffff8925b640 RSI: 0000000001743e28 RDI: ffff90b918b1a080
[39140.492002] RBP: ffff90b91c603dd0 R08: ffffffffffffb820 R09: 000000000000040f
[39140.492003] R10: 0000000000000004 R11: 0000000000000005 R12: 0000000001743e28
[39140.492005] R13: ffff90b91c61ac00 R14: ffff90b918b1a080 R15: 0000000000000000
[39140.492008] FS:  0000000000000000(0000) GS:ffff90b91c600000(0000) knlGS:0000000000000000
[39140.492010] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[39140.492011] CR2: 00007fd46cd96248 CR3: 0000000154c10000 CR4: 00000000003607f0
[39140.492013] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[39140.492015] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[39140.492016] Call Trace:
[39140.492025]  <IRQ> 
[39140.492025]  [<ffffffff886e143c>] update_curr+0x14c/0x1e0
[39140.492029]  [<ffffffff886e295d>] task_tick_fair+0x2bd/0x660
[39140.492034]  [<ffffffff88634919>] ? sched_clock+0x9/0x10
[39140.492038]  [<ffffffff886db1f5>] ? sched_clock_cpu+0x85/0xc0
[39140.492041]  [<ffffffff886d60ad>] scheduler_tick+0xcd/0x150
[39140.492046]  [<ffffffff8870c160>] ? tick_sched_do_timer+0x50/0x50
[39140.492051]  [<ffffffff886ac3a5>] update_process_times+0x65/0x80
[39140.492055]  [<ffffffff8870bed0>] tick_sched_handle+0x30/0x70
[39140.492058]  [<ffffffff8870c199>] tick_sched_timer+0x39/0x80
[39140.492065]  [<ffffffff886c71e3>] __hrtimer_run_queues+0xf3/0x270
[39140.492069]  [<ffffffff886c776f>] hrtimer_interrupt+0xaf/0x1d0
[39140.492076]  [<ffffffff8865a61b>] local_apic_timer_interrupt+0x3b/0x60
[39140.492081]  [<ffffffff88d7b6e3>] smp_apic_timer_interrupt+0x43/0x60
[39140.492087]  [<ffffffff88d77df2>] apic_timer_interrupt+0x162/0x170
[39140.492111]  <EOI> 
[39140.492111]  [<ffffffffc0ac3f9d>] ? lnet_finalize+0x98d/0x9a0 [lnet]
[39140.492127]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[39140.492156]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
[39140.492171]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
[39140.492184]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
[39140.492196]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
[39140.492206]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[39140.492219]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
[39140.492232]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
[39140.492243]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
[39140.492254]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
[39140.492264]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[39140.492276]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
[39140.492288]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
[39140.492299]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
[39140.492309]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
[39140.492319]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[39140.492330]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
[39140.492341]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
[39140.492353]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
[39140.492363]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
[39140.492372]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[39140.492383]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
[39140.492395]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
[39140.492406]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
[39140.492416]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
[39140.492425]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[39140.492436]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
[39140.492447]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
[39140.492457]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
[39140.492468]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
[39140.492476]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[39140.492487]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
[39140.492501]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
[39140.492512]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
[39140.492522]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
[39140.492530]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[39140.492541]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
[39140.492552]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
[39140.492563]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
[39140.492573]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
[39140.492582]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[39140.492592]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
[39140.492603]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
[39140.492614]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
[39140.492624]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
[39140.492632]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[39140.492642]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
[39140.492653]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
[39140.492664]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
[39140.492674]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
[39140.492682]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[39140.492693]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
[39140.492704]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
[39140.492714]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
[39140.492724]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
[39140.492732]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[39140.492743]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
[39140.492753]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
[39140.492764]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
[39140.492774]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
[39140.492782]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[39140.492792]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
[39140.492803]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
[39140.492813]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
[39140.492823]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
[39140.492831]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[39140.492842]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
[39140.492853]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
[39140.492863]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
[39140.492873]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
[39140.492881]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[39140.492892]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
[39140.492902]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
[39140.492913]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
[39140.492923]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
[39140.492931]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[39140.492941]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
[39140.492952]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
[39140.492962]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
[39140.492972]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
[39140.492980]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[39140.492991]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
[39140.493002]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
[39140.493012]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
[39140.493022]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
[39140.493030]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[39140.493040]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
[39140.493051]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
[39140.493061]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
[39140.493071]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
[39140.493079]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[39140.493089]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
[39140.493100]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
[39140.493111]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
[39140.493121]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
[39140.493128]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[39140.493139]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
[39140.493150]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
[39140.493160]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
[39140.493170]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
[39140.493178]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[39140.493188]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
[39140.493199]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
[39140.493209]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
[39140.493219]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
[39140.493227]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[39140.493237]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
[39140.493248]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
[39140.493258]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
[39140.493268]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
[39140.493276]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[39140.493287]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
[39140.493297]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
[39140.493308]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
[39140.493318]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
[39140.493325]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[39140.493336]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
[39140.493347]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
[39140.493357]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
[39140.493367]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
[39140.493375]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[39140.493385]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
[39140.493396]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
[39140.493406]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
[39140.493416]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
[39140.493424]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[39140.493434]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
[39140.493445]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
[39140.493455]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
[39140.493466]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
[39140.493473]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[39140.493484]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
[39140.493496]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
[39140.493508]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
[39140.493518]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
[39140.493525]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[39140.493536]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
[39140.493547]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
[39140.493557]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
[39140.493567]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
[39140.493575]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[39140.493585]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
[39140.493596]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
[39140.493606]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
[39140.493616]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
[39140.493624]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[39140.493634]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
[39140.493645]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
[39140.493655]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
[39140.493665]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
[39140.493673]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[39140.493684]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
[39140.493694]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
[39140.493704]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
[39140.493714]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
[39140.493722]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[39140.493733]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
[39140.493743]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
[39140.493753]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
[39140.493763]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
[39140.493771]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[39140.493782]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
[39140.493792]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
[39140.493802]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
[39140.493813]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
[39140.493820]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[39140.493831]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
[39140.493842]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
[39140.493852]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
[39140.493862]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
[39140.493869]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[39140.493880]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
[39140.493891]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
[39140.493901]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
[39140.493911]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
[39140.493918]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[39140.493929]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
[39140.493940]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
[39140.493950]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
[39140.493960]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
[39140.493967]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[39140.493978]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
[39140.493989]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
[39140.493999]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
[39140.494009]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
[39140.494017]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[39140.494027]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
[39140.494038]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
[39140.494048]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
[39140.494058]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
[39140.494065]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[39140.494076]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
[39140.494087]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
[39140.494097]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
[39140.494107]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
[39140.494114]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[39140.494125]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
[39140.494136]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
[39140.494146]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
[39140.494156]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
[39140.494163]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[39140.494174]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
[39140.494185]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
[39140.494195]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
[39140.494205]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
[39140.494212]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[39140.494223]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
[39140.494233]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
[39140.494243]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
[39140.494253]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
[39140.494261]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[39140.494271]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
[39140.494282]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
[39140.494292]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
[39140.494302]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
[39140.494310]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[39140.494320]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
[39140.494332]  [<ffffffffc0ac0082>] ? libcfs_nid2str_r+0xe2/0x130 [lnet]
[39140.494343]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
[39140.494353]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
[39140.494363]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
[39140.494372]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[39140.494382]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
[39140.494391]  [<ffffffffc090fae8>] ? libcfs_debug_vmsg2+0x6d8/0xb30 [libcfs]
[39140.494402]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
[39140.494412]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
[39140.494423]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
[39140.494432]  [<ffffffffc0babd22>] ? kiblnd_pool_free_node+0x82/0x170 [ko2iblnd]
[39140.494440]  [<ffffffffc0bb561d>] ? kiblnd_tx_done+0x10d/0x3e0 [ko2iblnd]
[39140.494447]  [<ffffffffc0bb593b>] ? kiblnd_txlist_done+0x4b/0x60 [ko2iblnd]
[39140.494454]  [<ffffffffc0bbab83>] ? kiblnd_check_conns+0x553/0x880 [ko2iblnd]
[39140.494465]  [<ffffffffc09213ba>] ? cfs_percpt_unlock+0x1a/0xb0 [libcfs]
[39140.494473]  [<ffffffffc0bbfc1b>] ? kiblnd_connd+0x83b/0xa00 [ko2iblnd]
[39140.494476]  [<ffffffff886d7c40>] ? wake_up_state+0x20/0x20
[39140.494484]  [<ffffffffc0bbf3e0>] ? kiblnd_cm_callback+0x2380/0x2380 [ko2iblnd]
[39140.494487]  [<ffffffff886c2e81>] ? kthread+0xd1/0xe0
[39140.494490]  [<ffffffff886c2db0>] ? insert_kthread_work+0x40/0x40
[39140.494495]  [<ffffffff88d76c37>] ? ret_from_fork_nospec_begin+0x21/0x21
[39140.494499]  [<ffffffff886c2db0>] ? insert_kthread_work+0x40/0x40
[39140.494536] Code: 48 89 e5 48 63 48 18 48 8b 87 40 09 00 00 48 8b 50 48 eb 0b 66 90 48 8b 50 68 48 85 d2 74 1b 48 8b 42 40 48 03 04 cd a0 bf 34 89 <48> 01 30 48 8b 02 48 8b 40 40 48 85 c0 75 dc 5d c3 66 2e 0f 1f 
[39140.494541] RIP  [<ffffffff886f3875>] cpuacct_charge+0x35/0x50
[39140.494541]  RSP <ffff90b91c603dd0>
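
The call trace above repeats the same handful of frames many times, which is easier to see as a tally. Below is a minimal sketch that counts function names in a trace; the excerpt is copied from the log above, while the regular expression and helper names (`FRAME_RE`, `frame_names`, `count_frames`) are illustrative assumptions, not part of any Lustre tooling. The frames are marked "?" in the log, so the tally only suggests a repeating lnet_finalize -> lnet_health_check -> lnet_return_tx_credits_locked -> lnet_post_send_locked pattern; it does not prove the exact call chain.

```python
import re
from collections import Counter

# Short excerpt copied from the call trace above. The "?" marker means the
# kernel considers these frames unreliable, so treat the result as a hint.
TRACE = """\
[39140.492111]  [<ffffffffc0ac3f9d>] ? lnet_finalize+0x98d/0x9a0 [lnet]
[39140.492127]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[39140.492156]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
[39140.492171]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
[39140.492184]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
[39140.492196]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
[39140.492206]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[39140.492219]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
[39140.492232]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
[39140.492243]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
[39140.492254]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
"""

# Matches "[<address>] ? symbol+0xoff/0xsize" and captures the symbol name.
FRAME_RE = re.compile(
    r"\]\s+\[<[0-9a-f]+>\]\s+(?:\?\s+)?(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def frame_names(trace):
    """Return the function names in the order they appear in the trace."""
    return FRAME_RE.findall(trace)


def count_frames(trace):
    """Tally how often each function name appears in the trace."""
    return Counter(frame_names(trace))


counts = count_frames(TRACE)
for name, n in counts.most_common():
    print(f"{n:3d}  {name}")
```

Running the same helpers over the full attached vmcore-dmesg log, rather than this excerpt, would show how many times each frame of the pattern occurs.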
 
[root@sh-rtr-fir-1-1 127.0.0.1-2019-10-25-21:01:51]# rpm -qa | grep lustre
lustre-client-2.12.3-1.el7.x86_64
lustre-client-dkms-2.12.3-1.el7.noarch
[root@sh-rtr-fir-1-1 127.0.0.1-2019-10-25-21:01:51]# uname -a
Linux sh-rtr-fir-1-1.int 3.10.0-957.27.2.el7.x86_64 #1 SMP Mon Jul 29 17:46:05 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
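
The assertion at the top of the log is printed with its macros already expanded, which makes it hard to read. As a rough sketch, the three shifted fields can be evaluated directly. The constant names below are assumptions chosen to mirror the kernel's hard-IRQ, soft-IRQ and NMI fields of preempt_count; only the shift and width values come from the assertion text itself. Under that reading, the check amounts to "lnet_finalize() must not run in interrupt context".

```python
# Field layout taken from the literal shifts in the assertion text.
HARDIRQ_MASK = ((1 << 10) - 1) << ((0 + 8) + 8)      # 10 bits starting at bit 16
SOFTIRQ_MASK = ((1 << 8) - 1) << (0 + 8)             # 8 bits starting at bit 8
NMI_MASK = ((1 << 1) - 1) << (((0 + 8) + 8) + 10)    # 1 bit at bit 26

# The assertion ORs the three fields together before testing preempt_count.
IRQ_CONTEXT_MASK = HARDIRQ_MASK | SOFTIRQ_MASK | NMI_MASK


def in_interrupt(preempt_count):
    """True if any hard-IRQ, soft-IRQ or NMI bit is set in preempt_count."""
    return (preempt_count & IRQ_CONTEXT_MASK) != 0


print(f"hardirq mask: {HARDIRQ_MASK:#010x}")
print(f"softirq mask: {SOFTIRQ_MASK:#010x}")
print(f"nmi mask:     {NMI_MASK:#010x}")
print(f"combined:     {IRQ_CONTEXT_MASK:#010x}")
```

The low 8 bits (the preemption-disable count) are not part of the mask, so a thread that has merely disabled preemption would not trip the assertion.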


 Comments   
Comment by Peter Jones [ 26/Oct/19 ]

Amir

Can you please advise?

Peter

Comment by Amir Shehata (Inactive) [ 28/Oct/19 ]

For both LU-12906 and LU-12907 can we turn off health:

lnetctl set health_sensitivity 0
lnetctl set retry_count 0 
lnetctl set transaction_timeout 10

If this resolves the issue, let's keep it off while I investigate on my side.

Comment by Stephane Thiell [ 28/Oct/19 ]

Thanks, we'll try to do that and see how it goes.

retry_count should probably be set first, as otherwise I get:

[68613.258761] LNetError: 242742:0:(api-ni.c:467:retry_count_set()) Can not set retry_count when health feature is turned off 
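
The ordering constraint above can be sketched as a small helper that emits the three commands with retry_count first. This is only an illustration: the values are the ones suggested earlier in this thread, the ordering rule is inferred solely from the retry_count_set() error quoted above, and the helper names (`SETTINGS`, `APPLY_ORDER`, `lnetctl_commands`) are hypothetical. The script prints the commands rather than running them.

```python
# Values from the earlier comment; only the order in which they are applied differs.
SETTINGS = {
    "health_sensitivity": 0,
    "retry_count": 0,
    "transaction_timeout": 10,
}

# Assumption: retry_count has to be set while health is still enabled, so it
# goes before health_sensitivity. This is inferred from the error message only.
APPLY_ORDER = ["retry_count", "health_sensitivity", "transaction_timeout"]


def lnetctl_commands(settings, order=APPLY_ORDER):
    """Build 'lnetctl set' command strings in a safe order (does not run them)."""
    return [f"lnetctl set {name} {settings[name]}" for name in order if name in settings]


for cmd in lnetctl_commands(SETTINGS):
    print(cmd)
```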

What I didn't mention in my original report is that it happened while we were running lfs project -p ... -r -s /scratch/... to assign project IDs to directories, with several instances (up to 20) running on a single client. I wasn't sure it was related, but it has only happened when doing that. I'm not sure how this could be related to LNet, though...

Comment by Amir Shehata (Inactive) [ 28/Oct/19 ]

That specific operation could generate a workload that exposes the problem.

I also pointed out a couple of patches on LU-12906; it would be good to confirm whether they resolve the issue.

Comment by Stephane Thiell [ 30/Oct/19 ]

Hi Amir,

All of the routers but one (7 total) crashed again last night with this assertion. We hadn't turned off health on these yet. So I tried to apply your patch on top of b2_12, but it fails to compile:

Making all in .
/tmp/rpmbuild-lustre-sthiell-wfd0qnr4/BUILD/lustre-2.12.3_1_ge97f606/lnet/lnet/api-ni.c: In function 'lnet_unprepare':
/tmp/rpmbuild-lustre-sthiell-wfd0qnr4/BUILD/lustre-2.12.3_1_ge97f606/lnet/lnet/api-ni.c:1244:3: error: implicit declaration of function 'lnet_clean_zombie_rstqs' [-Werror=implicit-function-declaration]
   lnet_clean_zombie_rstqs();
   ^

As far as I remember, we never had an LNet router crash before 2.12.3, so I think this is an important regression in 2.12.3. I hope you can fix the patch so we can try it. Until then, we're going to disable health as much as we can. Thanks!

Comment by Stephane Thiell [ 30/Oct/19 ]

Hi Amir,

We have now disabled LNet health everywhere (servers, routers and all clients). On the routers, for example, we used this:

[root@sh-rtr-fir-2-1 ~]# cat /etc/lnet.conf 
global:
    - retry_count: 0
    - health_sensitivity: 0
    - transaction_timeout: 10
net:
    - net type: o2ib4
      local NI(s):
        - nid:
          interfaces:
            0: ib0
    - net type: o2ib7
      local NI(s):
        - nid:
          interfaces:
            0: ib1
routing:
    - enable: 1

I'll report back if the issue happens again.

Comment by Amir Shehata (Inactive) [ 30/Oct/19 ]

Hi Stephane,

I applied the patch to a fresh checkout of b2_12 and it compiled ok.

git clone git://git.whamcloud.com/fs/lustre-release.git
cd lustre-release
git checkout b2_12
# apply LU-12441 patch
git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/52/35452/9 && git cherry-pick FETCH_HEAD
# verify there are no conflicts with LU-12856.patch
patch -p1 --dry-run < LU-12856.patch
# apply the LU-12856.patch
patch -p1 < LU-12856.patch
make rpms

Regarding your changes above: the LND calculates its timeout value as transaction_timeout/retry_count. If retry_count is 0, then lnd_timeout = transaction_timeout. When you turn off health, you should set transaction_timeout to whatever timeout you previously had in your LND. I would suggest 50s unless your setup requires a longer timeout.
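
The timeout rule described above can be written out as a short sketch. It follows the wording of this comment only and is not taken from the LNet source; the function name `lnd_timeout` and the use of integer division are assumptions, and the non-zero retry_count in the last example is purely illustrative.

```python
def lnd_timeout(transaction_timeout, retry_count):
    """LND timeout as described in the comment above (not the LNet source)."""
    if retry_count == 0:
        # With retries disabled, the LND timeout equals the transaction timeout.
        return transaction_timeout
    # Assumption: integer division; the comment only says
    # "transaction_timeout/retry_count".
    return transaction_timeout // retry_count


# (10, 0) is the setting used on the routers; (50, 0) is the suggested value;
# (50, 2) is a made-up case with retries enabled.
for tt, rc in [(10, 0), (50, 0), (50, 2)]:
    print(f"transaction_timeout={tt} retry_count={rc} -> lnd_timeout={lnd_timeout(tt, rc)}")
```

With retry_count at 0, the router configuration shown earlier gives an LND timeout of 10 seconds, which is why raising transaction_timeout to 50 is suggested.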

Comment by Stephane Thiell [ 31/Oct/19 ]

Hi Amir,

Thanks for the explanation regarding transaction_timeout.

You attached LU-12568.patch to LU-12906, but now you seem to use LU-12856.patch in your commands above. That is probably why I can't apply the patch. Can you please provide LU-12856.patch then, or a Gerrit link? We'll test it on the routers. Thanks!

Comment by Amir Shehata (Inactive) [ 31/Oct/19 ]

Hi Stephane,

I pushed the two patches on b2_12

https://review.whamcloud.com/36634 LU-12441 lnet: Detach rspt when md_threshold is infinite
https://review.whamcloud.com/36635 LU-12568 lnet: Defer rspt cleanup when MD queued for unlink

Let me know if they work for you.

Comment by Stephane Thiell [ 01/Nov/19 ]

Thanks Amir,
I will work on rebuilding a new Lustre version with your patches first thing tomorrow. We had another OSS crash tonight even though we have disabled LNet health. We again tried to run multiple lfs project -r commands, but from only 2 clients, which I think triggered the server crash. The routers didn't crash this time.

[Thu Oct 31 21:00:21 2019][1331442.548826] LustreError: 60972:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff9b725f135a00
[Thu Oct 31 21:00:21 2019][1331442.559784] LustreError: 60971:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff9b725f135a00
[Thu Oct 31 21:00:21 2019][1331442.570740] LustreError: 60972:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff9b725f135a00
[Thu Oct 31 21:00:22 2019][1331442.681474] LustreError: 60974:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff9b8531d99000
[Thu Oct 31 21:00:22 2019][1331442.755621] LustreError: 60973:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff9b61fb25c800
[Thu Oct 31 21:00:22 2019][1331442.878766] LustreError: 60967:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff9b660cc4e200
[Thu Oct 31 21:00:22 2019][1331442.889730] LustreError: 60967:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff9b836c616000
[Thu Oct 31 21:00:22 2019][1331442.900739] LustreError: 60968:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff9b7d5c8dd800
[Thu Oct 31 21:00:22 2019][1331442.911739] LustreError: 60967:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff9b7e767e9200
[Thu Oct 31 21:00:22 2019][1331442.922729] LustreError: 60969:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff9b5e8d4a2a00
[Thu Oct 31 21:00:22 2019][1331442.922737] LustreError: 118454:0:(ldlm_lib.c:3262:target_bulk_io()) @@@ network error on bulk WRITE  req@ffff9b701e11c050 x1648394617161472/t0(0) o4->ae27ee87-ec90-0302-3abf-01a84652e2bd@10.8.27.1@o2ib6:418/0 lens 488/448 e 0 to 0 dl 1572580858 ref 1 fl Interpret:/0/0 rc 0/0
[Thu Oct 31 21:00:22 2019][1331442.922739] LustreError: 118454:0:(ldlm_lib.c:3262:target_bulk_io()) Skipped 12 previous similar messages
[Thu Oct 31 21:00:22 2019][1331442.968061] LNetError: 60969:0:(lib-move.c:2729:lnet_detach_rsp_tracker()) ASSERTION( rspt->rspt_cpt == cpt ) failed: 
[Thu Oct 31 21:00:22 2019][1331442.978925] LNetError: 60969:0:(lib-move.c:2729:lnet_detach_rsp_tracker()) LBUG
[Thu Oct 31 21:00:22 2019][1331442.986412] Pid: 60969, comm: kiblnd_sd_01_02 3.10.0-957.27.2.el7_lustre.pl1.x86_64 #1 SMP Mon Aug 5 15:28:37 PDT 2019
[Thu Oct 31 21:00:22 2019][1331442.997281] Call Trace:
[Thu Oct 31 21:00:22 2019][1331442.999925]  [<ffffffffc0ccc7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[Thu Oct 31 21:00:22 2019][1331443.006665]  [<ffffffffc0ccc87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[Thu Oct 31 21:00:22 2019][1331443.013052]  [<ffffffffc0dfb49b>] lnet_detach_rsp_tracker+0x5b/0x60 [lnet]
[Thu Oct 31 21:00:22 2019][1331443.020142]  [<ffffffffc0debd3a>] lnet_finalize+0x72a/0x9a0 [lnet]
[Thu Oct 31 21:00:22 2019][1331443.026537]  [<ffffffffc0df5a51>] lnet_post_send_locked+0x751/0x9c0 [lnet]
[Thu Oct 31 21:00:22 2019][1331443.033626]  [<ffffffffc0df79a8>] lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
[Thu Oct 31 21:00:22 2019][1331443.041401]  [<ffffffffc0dea5ec>] lnet_msg_decommit+0xec/0x700 [lnet]
[Thu Oct 31 21:00:22 2019][1331443.048046]  [<ffffffffc0deb9b7>] lnet_finalize+0x3a7/0x9a0 [lnet]
[Thu Oct 31 21:00:22 2019][1331443.054435]  [<ffffffffc0d4161d>] kiblnd_tx_done+0x10d/0x3e0 [ko2iblnd]
[Thu Oct 31 21:00:22 2019][1331443.061257]  [<ffffffffc0d4cb0d>] kiblnd_scheduler+0x89d/0x1180 [ko2iblnd]
[Thu Oct 31 21:00:22 2019][1331443.068335]  [<ffffffff8dac2e81>] kthread+0xd1/0xe0
[Thu Oct 31 21:00:22 2019][1331443.073424]  [<ffffffff8e177c24>] ret_from_fork_nospec_begin+0xe/0x21
[Thu Oct 31 21:00:22 2019][1331443.080071]  [<ffffffffffffffff>] 0xffffffffffffffff
[Thu Oct 31 21:00:22 2019][1331443.085273] Kernel panic - not syncing: LBUG

Reminder: this is with 2.12.3 on servers, routers and clients

Comment by Amir Shehata (Inactive) [ 04/Nov/19 ]

Hi Stephane,

Did you have time to try the two new patches? It would be nice to verify if they resolve the issue.

Comment by Stephane Thiell [ 05/Nov/19 ]

Hi Amir,

Both patches are installed on all Lustre servers now. We haven't done the clients and routers yet (it's another system). However, we're having other issues now. I don't think they are related; I suspect a new DoM issue:

[76929.924807] LNet: Service thread pid 42438 was inactive for 537.92s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[76929.941766] LNet: Skipped 3 previous similar messages
[76929.946821] Pid: 42438, comm: mdt01_064 3.10.0-957.27.2.el7_lustre.pl1.x86_64 #1 SMP Mon Aug 5 15:28:37 PDT 2019
[76929.957008] Call Trace:
[76929.959474]  [<ffffffffc105ab75>] ldlm_completion_ast+0x4e5/0x860 [ptlrpc]
[76929.966421]  [<ffffffffc105b5e1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
[76929.973621]  [<ffffffffc174650b>] mdt_object_local_lock+0x50b/0xb20 [mdt]
[76929.980452]  [<ffffffffc1746b90>] mdt_object_lock_internal+0x70/0x360 [mdt]
[76929.987463]  [<ffffffffc1746ea0>] mdt_object_lock+0x20/0x30 [mdt]
[76929.993591]  [<ffffffffc1785c4b>] mdt_brw_enqueue+0x44b/0x760 [mdt]
[76929.999916]  [<ffffffffc17344bf>] mdt_intent_brw+0x1f/0x30 [mdt]
[76930.005960]  [<ffffffffc174cbb5>] mdt_intent_policy+0x435/0xd80 [mdt]
[76930.012462]  [<ffffffffc1041d46>] ldlm_lock_enqueue+0x356/0xa20 [ptlrpc]
[76930.019212]  [<ffffffffc106a336>] ldlm_handle_enqueue0+0xa56/0x15f0 [ptlrpc]
[76930.026331]  [<ffffffffc10f2a12>] tgt_enqueue+0x62/0x210 [ptlrpc]
[76930.032496]  [<ffffffffc10f736a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[76930.039444]  [<ffffffffc109e24b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[76930.047160]  [<ffffffffc10a1bac>] ptlrpc_main+0xb2c/0x1460 [ptlrpc]
[76930.053504]  [<ffffffffb30c2e81>] kthread+0xd1/0xe0
[76930.058435]  [<ffffffffb3777c24>] ret_from_fork_nospec_begin+0xe/0x21
[76930.064929]  [<ffffffffffffffff>] 0xffffffffffffffff

We just got a crash dump of a MDS and I will open a new ticket when ready re: this issue.

Comment by Stephane Thiell [ 06/Nov/19 ]

Hi Amir,

We've completed the installation of the two LNet patches on all our routers now. We're deploying new clients with them too. I'll report back if we see any problem.

Note: the other issue (mdt_intent_brw) is DoM-related and tracked in LU-12935.

Comment by Stephane Thiell [ 08/Nov/19 ]

Amir, we have resumed our parallel runs of lfs project -r, and so far we haven't seen any problem when using your two patches (servers, routers and clients in that case). So far, it looks good. I'll update next week. Fingers crossed.

Comment by Stephane Thiell [ 14/Nov/19 ]

Amir, still no problem when using your patches.

Comment by Amir Shehata (Inactive) [ 14/Nov/19 ]

Great. Hopefully we can land that on the b2_12 branch.

Comment by Peter Jones [ 07/Dec/19 ]

Both fixes mentioned have now landed on b2_12, so they will be in the upcoming 2.12.4 release.

Comment by Stephane Thiell [ 07/Dec/19 ]

Thanks!
