[LU-12907] LNet routers: LNetError: 14141:0:(lib-msg.c:894:lnet_finalize()) ASSERTION( !(((current_thread_info()->preempt_count) & ((((1UL << (10))-1) << ((0 + 8) + 8)) | (((1UL << (8))-1) << (0 + 8)) | (((1UL << (1))-1) << (((0 + 8) + 8) + 10)))))

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.3
    • Labels: None
    • Environment: CentOS 7.6
    • Severity: 2
    • Rank (Obsolete): 9223372036854775807

    Description

      We have recently been upgrading our LNet routers to 2.12.3, and all of them crashed simultaneously tonight with the following assertion:

       

      [39140.467535] LNetError: 14141:0:(lib-msg.c:894:lnet_finalize()) ASSERTION( !(((current_thread_info()->preempt_count) & ((((1UL << (10))-1) << ((0 + 8) + 8)) | (((1UL << (8))-1) << (0 + 8)) | (((1UL << (1))-1) << (((0 + 8) + 8) + 10)))))
      [39140.491917] general protection fault: 0000 [#1] SMP 
      [39140.491969] Modules linked in: ko2iblnd(OE) lnet(OE) libcfs(OE) rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_fpga_tools(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) mlx4_ib(OE) ib_uverbsm
      [39140.491977]  crct10dif_pclmul crct10dif_common tg3 libahci megaraid_sas ptp libata crc32c_intel pps_core [last unloaded: mlx_compat]
      [39140.491982] CPU: 0 PID: 14141 Comm: kiblnd_connd Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.27.2.el7.x86_64 #1
      [39140.491983] Hardware name: Dell Inc. PowerEdge R630/02C2CP, BIOS 2.10.5 07/25/2019
      [39140.491985] task: ffff90b918b1a080 ti: ffff90b8fa518000 task.ti: ffff90b8fa518000
      [39140.491995] RIP: 0010:[<ffffffff886f3875>]  [<ffffffff886f3875>] cpuacct_charge+0x35/0x50
      [39140.491997] RSP: 0018:ffff90b91c603dd0  EFLAGS: 00010006
      [39140.491998] RAX: 18244c8948c18cb8 RBX: ffff90b918b1a0e8 RCX: 000000000000ffff
      [39140.492000] RDX: ffffffff8925b640 RSI: 0000000001743e28 RDI: ffff90b918b1a080
      [39140.492002] RBP: ffff90b91c603dd0 R08: ffffffffffffb820 R09: 000000000000040f
      [39140.492003] R10: 0000000000000004 R11: 0000000000000005 R12: 0000000001743e28
      [39140.492005] R13: ffff90b91c61ac00 R14: ffff90b918b1a080 R15: 0000000000000000
      [39140.492008] FS:  0000000000000000(0000) GS:ffff90b91c600000(0000) knlGS:0000000000000000
      [39140.492010] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [39140.492011] CR2: 00007fd46cd96248 CR3: 0000000154c10000 CR4: 00000000003607f0
      [39140.492013] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [39140.492015] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [39140.492016] Call Trace:
      [39140.492025]  <IRQ> 
      [39140.492025]  [<ffffffff886e143c>] update_curr+0x14c/0x1e0
      [39140.492029]  [<ffffffff886e295d>] task_tick_fair+0x2bd/0x660
      [39140.492034]  [<ffffffff88634919>] ? sched_clock+0x9/0x10
      [39140.492038]  [<ffffffff886db1f5>] ? sched_clock_cpu+0x85/0xc0
      [39140.492041]  [<ffffffff886d60ad>] scheduler_tick+0xcd/0x150
      [39140.492046]  [<ffffffff8870c160>] ? tick_sched_do_timer+0x50/0x50
      [39140.492051]  [<ffffffff886ac3a5>] update_process_times+0x65/0x80
      [39140.492055]  [<ffffffff8870bed0>] tick_sched_handle+0x30/0x70
      [39140.492058]  [<ffffffff8870c199>] tick_sched_timer+0x39/0x80
      [39140.492065]  [<ffffffff886c71e3>] __hrtimer_run_queues+0xf3/0x270
      [39140.492069]  [<ffffffff886c776f>] hrtimer_interrupt+0xaf/0x1d0
      [39140.492076]  [<ffffffff8865a61b>] local_apic_timer_interrupt+0x3b/0x60
      [39140.492081]  [<ffffffff88d7b6e3>] smp_apic_timer_interrupt+0x43/0x60
      [39140.492087]  [<ffffffff88d77df2>] apic_timer_interrupt+0x162/0x170
      [39140.492111]  <EOI> 
      [39140.492111]  [<ffffffffc0ac3f9d>] ? lnet_finalize+0x98d/0x9a0 [lnet]
      [39140.492127]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [39140.492156]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
      [39140.492171]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
      [39140.492184]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
      [39140.492196]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
      [39140.492206]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [39140.492219]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
      [39140.492232]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
      [39140.492243]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
      [39140.492254]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
      [39140.492264]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [39140.492276]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
      [39140.492288]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
      [39140.492299]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
      [39140.492309]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
      [39140.492319]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [39140.492330]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
      [39140.492341]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
      [39140.492353]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
      [39140.492363]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
      [39140.492372]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [39140.492383]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
      [39140.492395]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
      [39140.492406]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
      [39140.492416]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
      [39140.492425]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [39140.492436]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
      [39140.492447]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
      [39140.492457]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
      [39140.492468]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
      [39140.492476]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [39140.492487]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
      [39140.492501]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
      [39140.492512]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
      [39140.492522]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
      [39140.492530]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [39140.492541]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
      [39140.492552]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
      [39140.492563]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
      [39140.492573]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
      [39140.492582]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [39140.492592]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
      [39140.492603]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
      [39140.492614]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
      [39140.492624]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
      [39140.492632]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [39140.492642]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
      [39140.492653]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
      [39140.492664]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
      [39140.492674]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
      [39140.492682]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [39140.492693]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
      [39140.492704]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
      [39140.492714]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
      [39140.492724]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
      [39140.492732]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [39140.492743]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
      [39140.492753]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
      [39140.492764]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
      [39140.492774]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
      [39140.492782]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [39140.492792]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
      [39140.492803]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
      [39140.492813]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
      [39140.492823]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
      [39140.492831]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [39140.492842]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
      [39140.492853]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
      [39140.492863]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
      [39140.492873]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
      [39140.492881]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [39140.492892]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
      [39140.492902]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
      [39140.492913]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
      [39140.492923]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
      [39140.492931]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [39140.492941]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
      [39140.492952]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
      [39140.492962]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
      [39140.492972]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
      [39140.492980]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [39140.492991]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
      [39140.493002]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
      [39140.493012]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
      [39140.493022]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
      [39140.493030]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [39140.493040]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
      [39140.493051]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
      [39140.493061]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
      [39140.493071]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
      [39140.493079]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [39140.493089]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
      [39140.493100]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
      [39140.493111]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
      [39140.493121]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
      [39140.493128]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [39140.493139]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
      [39140.493150]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
      [39140.493160]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
      [39140.493170]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
      [39140.493178]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [39140.493188]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
      [39140.493199]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
      [39140.493209]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
      [39140.493219]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
      [39140.493227]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [39140.493237]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
      [39140.493248]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
      [39140.493258]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
      [39140.493268]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
      [39140.493276]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [39140.493287]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
      [39140.493297]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
      [39140.493308]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
      [39140.493318]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
      [39140.493325]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [39140.493336]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
      [39140.493347]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
      [39140.493357]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
      [39140.493367]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
      [39140.493375]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [39140.493385]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
      [39140.493396]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
      [39140.493406]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
      [39140.493416]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
      [39140.493424]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [39140.493434]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
      [39140.493445]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
      [39140.493455]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
      [39140.493466]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
      [39140.493473]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [39140.493484]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
      [39140.493496]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
      [39140.493508]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
      [39140.493518]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
      [39140.493525]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [39140.493536]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
      [39140.493547]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
      [39140.493557]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
      [39140.493567]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
      [39140.493575]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [39140.493585]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
      [39140.493596]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
      [39140.493606]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
      [39140.493616]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
      [39140.493624]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [39140.493634]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
      [39140.493645]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
      [39140.493655]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
      [39140.493665]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
      [39140.493673]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [39140.493684]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
      [39140.493694]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
      [39140.493704]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
      [39140.493714]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
      [39140.493722]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [39140.493733]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
      [39140.493743]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
      [39140.493753]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
      [39140.493763]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
      [39140.493771]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [39140.493782]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
      [39140.493792]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
      [39140.493802]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
      [39140.493813]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
      [39140.493820]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [39140.493831]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
      [39140.493842]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
      [39140.493852]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
      [39140.493862]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
      [39140.493869]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [39140.493880]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
      [39140.493891]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
      [39140.493901]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
      [39140.493911]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
      [39140.493918]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [39140.493929]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
      [39140.493940]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
      [39140.493950]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
      [39140.493960]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
      [39140.493967]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [39140.493978]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
      [39140.493989]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
      [39140.493999]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
      [39140.494009]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
      [39140.494017]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [39140.494027]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
      [39140.494038]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
      [39140.494048]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
      [39140.494058]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
      [39140.494065]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [39140.494076]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
      [39140.494087]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
      [39140.494097]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
      [39140.494107]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
      [39140.494114]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [39140.494125]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
      [39140.494136]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
      [39140.494146]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
      [39140.494156]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
      [39140.494163]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [39140.494174]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
      [39140.494185]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
      [39140.494195]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
      [39140.494205]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
      [39140.494212]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [39140.494223]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
      [39140.494233]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
      [39140.494243]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
      [39140.494253]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
      [39140.494261]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [39140.494271]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
      [39140.494282]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
      [39140.494292]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
      [39140.494302]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
      [39140.494310]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [39140.494320]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
      [39140.494332]  [<ffffffffc0ac0082>] ? libcfs_nid2str_r+0xe2/0x130 [lnet]
      [39140.494343]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
      [39140.494353]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
      [39140.494363]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
      [39140.494372]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [39140.494382]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
      [39140.494391]  [<ffffffffc090fae8>] ? libcfs_debug_vmsg2+0x6d8/0xb30 [libcfs]
      [39140.494402]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
      [39140.494412]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
      [39140.494423]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
      [39140.494432]  [<ffffffffc0babd22>] ? kiblnd_pool_free_node+0x82/0x170 [ko2iblnd]
      [39140.494440]  [<ffffffffc0bb561d>] ? kiblnd_tx_done+0x10d/0x3e0 [ko2iblnd]
      [39140.494447]  [<ffffffffc0bb593b>] ? kiblnd_txlist_done+0x4b/0x60 [ko2iblnd]
      [39140.494454]  [<ffffffffc0bbab83>] ? kiblnd_check_conns+0x553/0x880 [ko2iblnd]
      [39140.494465]  [<ffffffffc09213ba>] ? cfs_percpt_unlock+0x1a/0xb0 [libcfs]
      [39140.494473]  [<ffffffffc0bbfc1b>] ? kiblnd_connd+0x83b/0xa00 [ko2iblnd]
      [39140.494476]  [<ffffffff886d7c40>] ? wake_up_state+0x20/0x20
      [39140.494484]  [<ffffffffc0bbf3e0>] ? kiblnd_cm_callback+0x2380/0x2380 [ko2iblnd]
      [39140.494487]  [<ffffffff886c2e81>] ? kthread+0xd1/0xe0
      [39140.494490]  [<ffffffff886c2db0>] ? insert_kthread_work+0x40/0x40
      [39140.494495]  [<ffffffff88d76c37>] ? ret_from_fork_nospec_begin+0x21/0x21
      [39140.494499]  [<ffffffff886c2db0>] ? insert_kthread_work+0x40/0x40
      [39140.494536] Code: 48 89 e5 48 63 48 18 48 8b 87 40 09 00 00 48 8b 50 48 eb 0b 66 90 48 8b 50 68 48 85 d2 74 1b 48 8b 42 40 48 03 04 cd a0 bf 34 89 <48> 01 30 48 8b 02 48 8b 40 40 48 85 c0 75 dc 5d c3 66 2e 0f 1f 
      [39140.494541] RIP  [<ffffffff886f3875>] cpuacct_charge+0x35/0x50
      [39140.494541]  RSP <ffff90b91c603dd0>
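The trace above is long and highly repetitive, so a per-symbol frame count can make the repeating lnet_finalize / lnet_health_check / lnet_return_tx_credits_locked / lnet_post_send_locked pattern easier to see. The following is a minimal sketch, not part of the original report: the regular expression and the helper names (parse_frames, summarize) are illustrative assumptions, and the sample only reuses a few lines copied from the trace. Note that frames marked with "?" are ones the kernel unwinder considers unreliable, so counts are a hint rather than proof of recursion.

```python
import re
from collections import Counter

# Matches a RHEL7-style frame such as:
#   [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
# Groups: optional "?" marker, symbol name, optional module name.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\?\s+)?([A-Za-z0-9_.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(\w+)\])?"
)


def parse_frames(text):
    """Return a list of (symbol, module, reliable) tuples, in trace order."""
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.search(line)
        if m:
            reliable = m.group(1) is None  # no "?" before the symbol
            frames.append((m.group(2), m.group(3) or "kernel", reliable))
    return frames


def summarize(text):
    """Count how often each 'symbol [module]' pair appears in the trace."""
    return Counter(f"{sym} [{mod}]" for sym, mod, _ in parse_frames(text))


# A short excerpt copied from the trace in this report.
SAMPLE = """\
[39140.492025]  [<ffffffff886e143c>] update_curr+0x14c/0x1e0
[39140.492111]  [<ffffffffc0ac3f9d>] ? lnet_finalize+0x98d/0x9a0 [lnet]
[39140.492127]  [<ffffffffc090ff97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[39140.492156]  [<ffffffffc0acd71a>] ? lnet_post_send_locked+0x41a/0x9c0 [lnet]
[39140.492171]  [<ffffffffc0acf9a8>] ? lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
[39140.492184]  [<ffffffffc0ac3401>] ? lnet_health_check+0x6a1/0x8b0 [lnet]
[39140.492196]  [<ffffffffc0ac377f>] ? lnet_finalize+0x16f/0x9a0 [lnet]
"""

if __name__ == "__main__":
    for name, count in summarize(SAMPLE).most_common():
        print(f"{count:4d}  {name}")
```

Feeding the full console log to summarize() instead of SAMPLE gives the counts for the whole trace.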
       
      [root@sh-rtr-fir-1-1 127.0.0.1-2019-10-25-21:01:51]# rpm -qa | grep lustre
      lustre-client-2.12.3-1.el7.x86_64
      lustre-client-dkms-2.12.3-1.el7.noarch
      [root@sh-rtr-fir-1-1 127.0.0.1-2019-10-25-21:01:51]# uname -a
      Linux sh-rtr-fir-1-1.int 3.10.0-957.27.2.el7.x86_64 #1 SMP Mon Jul 29 17:46:05 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
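For readers decoding the assertion text: the bit expression printed in the LNetError line looks like the preprocessor expansion of the kernel's in_interrupt() check, i.e. preempt_count masked with the hardirq, softirq and NMI fields. The sketch below recomputes that mask from the constants visible in the message. The field names and widths (8 preempt bits, 8 softirq bits, 10 hardirq bits, 1 NMI bit) are inferred from the shifts in the message and from the 3.10-era kernel headers, so treat them as an assumption rather than something stated in this ticket.

```python
# Field widths inferred from the shift/width constants in the assertion text
# (assumption: they correspond to the 3.10-era preempt_count layout).
PREEMPT_BITS, SOFTIRQ_BITS, HARDIRQ_BITS, NMI_BITS = 8, 8, 10, 1

PREEMPT_SHIFT = 0
SOFTIRQ_SHIFT = PREEMPT_SHIFT + PREEMPT_BITS  # 8
HARDIRQ_SHIFT = SOFTIRQ_SHIFT + SOFTIRQ_BITS  # 16
NMI_SHIFT = HARDIRQ_SHIFT + HARDIRQ_BITS      # 26


def field_mask(bits, shift):
    """Mask covering `bits` consecutive bits starting at bit `shift`."""
    return ((1 << bits) - 1) << shift


SOFTIRQ_MASK = field_mask(SOFTIRQ_BITS, SOFTIRQ_SHIFT)
HARDIRQ_MASK = field_mask(HARDIRQ_BITS, HARDIRQ_SHIFT)
NMI_MASK = field_mask(NMI_BITS, NMI_SHIFT)

# Literal transcription of the expression printed in the LNetError line.
ASSERT_MASK = (
    (((1 << 10) - 1) << ((0 + 8) + 8))
    | (((1 << 8) - 1) << (0 + 8))
    | (((1 << 1) - 1) << (((0 + 8) + 8) + 10))
)


def in_interrupt(preempt_count):
    """True if any hardirq, softirq or NMI bit is set in preempt_count."""
    return (preempt_count & ASSERT_MASK) != 0


if __name__ == "__main__":
    print(f"assert mask  = {ASSERT_MASK:#010x}")
    print(f"hardirq mask = {HARDIRQ_MASK:#010x}")
    print(f"softirq mask = {SOFTIRQ_MASK:#010x}")
    print(f"nmi mask     = {NMI_MASK:#010x}")
```

Under these assumptions the assertion fails whenever lnet_finalize() observes a preempt_count with any of those bits set. A plain preemption-disable count (the low 8 bits) does not trip it.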
      

          Activity

            sthiell Stephane Thiell added a comment - Thanks!
            pjones Peter Jones added a comment -

            Both fixes mentioned have now landed to b2_12 so will be in the upcoming 2.12.4 release


            ashehata Amir Shehata (Inactive) added a comment - great. Hopefully we can land that on the b2_12 branch.

            sthiell Stephane Thiell added a comment - Amir, still no problem when using your patches.
            sthiell Stephane Thiell added a comment - edited

            Amir, we have resumed our parallel runs of lfs project -r, and so far we haven't seen any problems when using your two patches (on servers, routers and clients in this case). It looks good so far. I'll update next week. Fingers crossed.

            sthiell Stephane Thiell added a comment -

            Hi Amir,

            We've completed the installation of the two LNet patches on all our routers now. We're deploying new clients with them too. I'll report back if we see any problems.

            Note: the other issue (mdt_intent_brw) is DoM-related and tracked in LU-12935.

            sthiell Stephane Thiell added a comment -

            Hi Amir,

            Both patches are installed on all Lustre servers now. We haven't done the clients and routers yet (it's another system). However, we're now having other issues that I don't think are related; I suspect a new DoM issue:

            [76929.924807] LNet: Service thread pid 42438 was inactive for 537.92s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
            [76929.941766] LNet: Skipped 3 previous similar messages
            [76929.946821] Pid: 42438, comm: mdt01_064 3.10.0-957.27.2.el7_lustre.pl1.x86_64 #1 SMP Mon Aug 5 15:28:37 PDT 2019
            [76929.957008] Call Trace:
            [76929.959474]  [<ffffffffc105ab75>] ldlm_completion_ast+0x4e5/0x860 [ptlrpc]
            [76929.966421]  [<ffffffffc105b5e1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
            [76929.973621]  [<ffffffffc174650b>] mdt_object_local_lock+0x50b/0xb20 [mdt]
            [76929.980452]  [<ffffffffc1746b90>] mdt_object_lock_internal+0x70/0x360 [mdt]
            [76929.987463]  [<ffffffffc1746ea0>] mdt_object_lock+0x20/0x30 [mdt]
            [76929.993591]  [<ffffffffc1785c4b>] mdt_brw_enqueue+0x44b/0x760 [mdt]
            [76929.999916]  [<ffffffffc17344bf>] mdt_intent_brw+0x1f/0x30 [mdt]
            [76930.005960]  [<ffffffffc174cbb5>] mdt_intent_policy+0x435/0xd80 [mdt]
            [76930.012462]  [<ffffffffc1041d46>] ldlm_lock_enqueue+0x356/0xa20 [ptlrpc]
            [76930.019212]  [<ffffffffc106a336>] ldlm_handle_enqueue0+0xa56/0x15f0 [ptlrpc]
            [76930.026331]  [<ffffffffc10f2a12>] tgt_enqueue+0x62/0x210 [ptlrpc]
            [76930.032496]  [<ffffffffc10f736a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
            [76930.039444]  [<ffffffffc109e24b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
            [76930.047160]  [<ffffffffc10a1bac>] ptlrpc_main+0xb2c/0x1460 [ptlrpc]
            [76930.053504]  [<ffffffffb30c2e81>] kthread+0xd1/0xe0
            [76930.058435]  [<ffffffffb3777c24>] ret_from_fork_nospec_begin+0xe/0x21
            [76930.064929]  [<ffffffffffffffff>] 0xffffffffffffffff

            We just got a crash dump of an MDS and I will open a new ticket for this issue when ready.

            ashehata Amir Shehata (Inactive) added a comment -

            Hi Stephane,

            Did you have time to try the two new patches? It would be nice to verify if they resolve the issue.

            Thanks Amir,
            I will work on rebuilding a new Lustre version with your patches first thing tomorrow. We had another OSS crash tonight even though we have disabled LNet health. We tried again to run multiple lfs project -r commands, but from only 2 clients, which I think triggered the server crash. The routers didn't crash this time.

            [Thu Oct 31 21:00:21 2019][1331442.548826] LustreError: 60972:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff9b725f135a00
            [Thu Oct 31 21:00:21 2019][1331442.559784] LustreError: 60971:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff9b725f135a00
            [Thu Oct 31 21:00:21 2019][1331442.570740] LustreError: 60972:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff9b725f135a00
            [Thu Oct 31 21:00:22 2019][1331442.681474] LustreError: 60974:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff9b8531d99000
            [Thu Oct 31 21:00:22 2019][1331442.755621] LustreError: 60973:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff9b61fb25c800
            [Thu Oct 31 21:00:22 2019][1331442.878766] LustreError: 60967:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff9b660cc4e200
            [Thu Oct 31 21:00:22 2019][1331442.889730] LustreError: 60967:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff9b836c616000
            [Thu Oct 31 21:00:22 2019][1331442.900739] LustreError: 60968:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff9b7d5c8dd800
            [Thu Oct 31 21:00:22 2019][1331442.911739] LustreError: 60967:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff9b7e767e9200
            [Thu Oct 31 21:00:22 2019][1331442.922729] LustreError: 60969:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff9b5e8d4a2a00
            [Thu Oct 31 21:00:22 2019][1331442.922737] LustreError: 118454:0:(ldlm_lib.c:3262:target_bulk_io()) @@@ network error on bulk WRITE  req@ffff9b701e11c050 x1648394617161472/t0(0) o4->ae27ee87-ec90-0302-3abf-01a84652e2bd@10.8.27.1@o2ib6:418/0 lens 488/448 e 0 to 0 dl 1572580858 ref 1 fl Interpret:/0/0 rc 0/0
            [Thu Oct 31 21:00:22 2019][1331442.922739] LustreError: 118454:0:(ldlm_lib.c:3262:target_bulk_io()) Skipped 12 previous similar messages
            [Thu Oct 31 21:00:22 2019][1331442.968061] LNetError: 60969:0:(lib-move.c:2729:lnet_detach_rsp_tracker()) ASSERTION( rspt->rspt_cpt == cpt ) failed:
            [Thu Oct 31 21:00:22 2019][1331442.978925] LNetError: 60969:0:(lib-move.c:2729:lnet_detach_rsp_tracker()) LBUG
            [Thu Oct 31 21:00:22 2019][1331442.986412] Pid: 60969, comm: kiblnd_sd_01_02 3.10.0-957.27.2.el7_lustre.pl1.x86_64 #1 SMP Mon Aug 5 15:28:37 PDT 2019
            [Thu Oct 31 21:00:22 2019][1331442.997281] Call Trace:
            [Thu Oct 31 21:00:22 2019][1331442.999925]  [<ffffffffc0ccc7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
            [Thu Oct 31 21:00:22 2019][1331443.006665]  [<ffffffffc0ccc87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
            [Thu Oct 31 21:00:22 2019][1331443.013052]  [<ffffffffc0dfb49b>] lnet_detach_rsp_tracker+0x5b/0x60 [lnet]
            [Thu Oct 31 21:00:22 2019][1331443.020142]  [<ffffffffc0debd3a>] lnet_finalize+0x72a/0x9a0 [lnet]
            [Thu Oct 31 21:00:22 2019][1331443.026537]  [<ffffffffc0df5a51>] lnet_post_send_locked+0x751/0x9c0 [lnet]
            [Thu Oct 31 21:00:22 2019][1331443.033626]  [<ffffffffc0df79a8>] lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
            [Thu Oct 31 21:00:22 2019][1331443.041401]  [<ffffffffc0dea5ec>] lnet_msg_decommit+0xec/0x700 [lnet]
            [Thu Oct 31 21:00:22 2019][1331443.048046]  [<ffffffffc0deb9b7>] lnet_finalize+0x3a7/0x9a0 [lnet]
            [Thu Oct 31 21:00:22 2019][1331443.054435]  [<ffffffffc0d4161d>] kiblnd_tx_done+0x10d/0x3e0 [ko2iblnd]
            [Thu Oct 31 21:00:22 2019][1331443.061257]  [<ffffffffc0d4cb0d>] kiblnd_scheduler+0x89d/0x1180 [ko2iblnd]
            [Thu Oct 31 21:00:22 2019][1331443.068335]  [<ffffffff8dac2e81>] kthread+0xd1/0xe0
            [Thu Oct 31 21:00:22 2019][1331443.073424]  [<ffffffff8e177c24>] ret_from_fork_nospec_begin+0xe/0x21
            [Thu Oct 31 21:00:22 2019][1331443.080071]  [<ffffffffffffffff>] 0xffffffffffffffff
            [Thu Oct 31 21:00:22 2019][1331443.085273] Kernel panic - not syncing: LBUG
            

            Reminder: this is with 2.12.3 on servers, routers and clients

            sthiell Stephane Thiell added a comment

            Hi Stephane,

            I pushed the two patches to b2_12:

            https://review.whamcloud.com/36634 LU-12441 lnet: Detach rspt when md_threshold is infinite
            https://review.whamcloud.com/36635 LU-12568 lnet: Defer rspt cleanup when MD queued for unlink

            Let me know if they work for you.

            ashehata Amir Shehata (Inactive) added a comment

            People

              ashehata Amir Shehata (Inactive)
              sthiell Stephane Thiell