[LU-12953] LNet timeouts with restarted Lustre production file system Created: 08/Nov/19 Updated: 01/Feb/22 Resolved: 01/Feb/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.3 |
| Fix Version/s: | Lustre 2.12.3 |
| Type: | Bug | Priority: | Critical |
| Reporter: | James A Simmons | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | ORNL | | |
| Environment: |
Lustre OSS server running ZFS. |
||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
When restarting our production Lustre file system we encountered this bug:
[407608.498637] LNetError: 72335:0:(o2iblnd_cb.c:3335:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
Eventually we ended up seeing:
[423015.676012] [<ffffffff98d5d28b>] queued_spin_lock_slowpath+0xb/0xf
This looks similar to the issues reported by NASA, but I am filing it here just to make sure. |
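For reference, a minimal sketch of how the timeout settings behind the kiblnd_check_txs_locked() message could be inspected on a 2.12 server; the exact parameter names vary by release, so the commands below are assumptions rather than output taken from this ticket.
# Show LNet-level settings such as transaction_timeout and retry_count (assumes lnetctl from 2.12)
lnetctl global show
# List every ko2iblnd module parameter, including its connection and tx timeouts
grep . /sys/module/ko2iblnd/parameters/*
# Search the kernel log for further tx timeout events
dmesg | grep kiblnd_check_txs_locked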
| Comments |
| Comment by Peter Jones [ 09/Nov/19 ] |
|
Amir, could you please advise? James, do you have any patches applied, or are you running vanilla 2.12.3? Peter |
| Comment by Amir Shehata (Inactive) [ 12/Nov/19 ] |
|
Did you try:
lctl set_param osc.*.short_io_bytes=0
Seems like this resolved the issue on |
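For reference, a minimal sketch of checking and applying the workaround above on a client; the persistent form with set_param -P run on the MGS is an assumption about how a site might keep the setting across remounts, not something stated in this ticket.
# Inspect the current value on each OSC device
lctl get_param osc.*.short_io_bytes
# Disable short I/O for the running node only
lctl set_param osc.*.short_io_bytes=0
# Persist the setting for all clients (run on the MGS)
lctl set_param -P osc.*.short_io_bytes=0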
| Comment by James A Simmons [ 04/Feb/20 ] |
|
We are going to update to a newer 2.12 LTS release that has the https://jira.whamcloud.com/browse/LU-12856 fix. We will see if it resolves our problems. A specific user's job can reproduce this problem easily. |
| Comment by Jesse Hanley [ 02/Mar/20 ] |
|
We moved to the newer release a couple weeks ago and haven't seen a panic related to the short_io error, and the system has been pretty stable. Last night, we encountered a crash with a similar stack trace:
[1272493.521802] Call Trace:
[1272493.521807] [<ffffffff8e17544a>] queued_spin_lock_slowpath+0xb/0xf
[1272493.521810] [<ffffffff8e183330>] _raw_spin_lock+0x20/0x30
[1272493.521821] [<ffffffffc17ce5a9>] ofd_intent_policy+0x1d9/0x920 [ofd]
[1272493.521865] [<ffffffffc1402e7c>] ? ldlm_extent_alloc_lock+0x6c/0x280 [ptlrpc]
[1272493.521880] [<ffffffffc13ebe06>] ldlm_lock_enqueue+0x356/0xa20 [ptlrpc]
[1272493.521889] [<ffffffffc1072033>] ? cfs_hash_bd_add_locked+0x63/0x80 [libcfs]
[1272493.521894] [<ffffffffc10757be>] ? cfs_hash_add+0xbe/0x1a0 [libcfs]
[1272493.521912] [<ffffffffc1414506>] ldlm_handle_enqueue0+0xa56/0x15f0 [ptlrpc]
[1272493.521934] [<ffffffffc143d300>] ? lustre_swab_ldlm_lock_desc+0x30/0x30 [ptlrpc]
[1272493.521965] [<ffffffffc149ccf2>] tgt_enqueue+0x62/0x210 [ptlrpc]
[1272493.521989] [<ffffffffc14a3b0a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
[1272493.522011] [<ffffffffc147d021>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
[1272493.522017] [<ffffffffc1066bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
[1272493.522037] [<ffffffffc144846b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[1272493.522057] [<ffffffffc1445285>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
[1272493.522059] [<ffffffff8dad3903>] ? __wake_up+0x13/0x20
[1272493.522079] [<ffffffffc144bdd4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
[1272493.522099] [<ffffffffc144b2a0>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
[1272493.522101] [<ffffffff8dac61f1>] kthread+0xd1/0xe0
[1272493.522103] [<ffffffff8dac6120>] ? insert_kthread_work+0x40/0x40
[1272493.522105] [<ffffffff8e18dd1d>] ret_from_fork_nospec_begin+0x7/0x21
[1272493.522107] [<ffffffff8dac6120>] ? insert_kthread_work+0x40/0x40
The other 2 stack traces I'm seeing repeated are:
[1313981.701739] Call Trace:
[1313981.701741] <IRQ> [<ffffffff8dada648>] sched_show_task+0xa8/0x110
[1313981.701743] [<ffffffff8dade3e9>] dump_cpu_task+0x39/0x70
[1313981.701744] [<ffffffff8db57fa0>] rcu_dump_cpu_stacks+0x90/0xd0
[1313981.701746] [<ffffffff8db5b662>] rcu_check_callbacks+0x442/0x730
[1313981.701748] [<ffffffff8db0ff40>] ? tick_sched_do_timer+0x50/0x50
[1313981.701750] [<ffffffff8daaf536>] update_process_times+0x46/0x80
[1313981.701752] [<ffffffff8db0fcb0>] tick_sched_handle+0x30/0x70
[1313981.701753] [<ffffffff8db0ff79>] tick_sched_timer+0x39/0x80
[1313981.701755] [<ffffffff8daca5ee>] __hrtimer_run_queues+0x10e/0x270
[1313981.701757] [<ffffffff8dacab4f>] hrtimer_interrupt+0xaf/0x1d0
[1313981.701759] [<ffffffff8da5c60b>] local_apic_timer_interrupt+0x3b/0x60
[1313981.701761] [<ffffffff8e1929d3>] smp_apic_timer_interrupt+0x43/0x60
[1313981.701763] [<ffffffff8e18eefa>] apic_timer_interrupt+0x16a/0x170
[1313981.701765] <EOI> [<ffffffff8db17432>] ? native_queued_spin_lock_slowpath+0x122/0x200
[1313981.701766] [<ffffffff8e17544a>] queued_spin_lock_slowpath+0xb/0xf
[1313981.701767] [<ffffffff8e183330>] _raw_spin_lock+0x20/0x30
[1313981.701781] [<ffffffffc13e402c>] lock_res_and_lock+0x2c/0x50 [ptlrpc]
[1313981.701795] [<ffffffffc13ebc61>] ldlm_lock_enqueue+0x1b1/0xa20 [ptlrpc]
[1313981.701815] [<ffffffffc143b891>] ? lustre_pack_reply+0x11/0x20 [ptlrpc]
[1313981.701832] [<ffffffffc1414506>] ldlm_handle_enqueue0+0xa56/0x15f0 [ptlrpc]
[1313981.701851] [<ffffffffc143d300>] ? lustre_swab_ldlm_lock_desc+0x30/0x30 [ptlrpc]
[1313981.701875] [<ffffffffc149ccf2>] tgt_enqueue+0x62/0x210 [ptlrpc]
[1313981.701899] [<ffffffffc14a3b0a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
[1313981.701921] [<ffffffffc147d021>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
[1313981.701924] [<ffffffffc1066bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
[1313981.701944] [<ffffffffc144846b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[1313981.701963] [<ffffffffc1445285>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
[1313981.701983] [<ffffffffc144bdd4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
[1313981.702002] [<ffffffffc144b2a0>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
[1313981.702003] [<ffffffff8dac61f1>] kthread+0xd1/0xe0
[1313981.702005] [<ffffffff8dac6120>] ? insert_kthread_work+0x40/0x40
[1313981.702006] [<ffffffff8e18dd1d>] ret_from_fork_nospec_begin+0x7/0x21
[1313981.702008] [<ffffffff8dac6120>] ? insert_kthread_work+0x40/0x40
and:
[1313981.702010] Call Trace:
[1313981.702014] [<ffffffffc0bc1797>] ? kiblnd_send+0x357/0xa20 [ko2iblnd]
[1313981.702028] [<ffffffffc118e869>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[1313981.702031] [<ffffffffc1066bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
[1313981.702049] [<ffffffffc141f2ce>] ? ldlm_pool_recalc+0x12e/0x1f0 [ptlrpc]
[1313981.702060] [<ffffffffc118e869>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[1313981.702078] [<ffffffffc1420a40>] ? ldlm_pool_add+0x80/0xc0 [ptlrpc]
[1313981.702092] [<ffffffffc13e9cce>] ? ldlm_grant_lock+0xfe/0x270 [ptlrpc]
[1313981.702107] [<ffffffffc13ee841>] ? ldlm_resource_unlink_lock+0x41/0x70 [ptlrpc]
[1313981.702122] [<ffffffffc14029be>] ? ldlm_process_extent_lock+0x2ee/0x490 [ptlrpc]
[1313981.702124] [<ffffffff8db17436>] ? native_queued_spin_lock_slowpath+0x126/0x200
[1313981.702126] [<ffffffff8e17544a>] ? queued_spin_lock_slowpath+0xb/0xf
[1313981.702127] [<ffffffff8e183330>] ? _raw_spin_lock+0x20/0x30
[1313981.702140] [<ffffffffc13e402c>] ? lock_res_and_lock+0x2c/0x50 [ptlrpc]
[1313981.702157] [<ffffffffc1414870>] ? ldlm_handle_enqueue0+0xdc0/0x15f0 [ptlrpc]
[1313981.702175] [<ffffffffc143d300>] ? lustre_swab_ldlm_lock_desc+0x30/0x30 [ptlrpc]
[1313981.702199] [<ffffffffc149ccf2>] ? tgt_enqueue+0x62/0x210 [ptlrpc]
[1313981.702222] [<ffffffffc14a3b0a>] ? tgt_request_handle+0xada/0x1570 [ptlrpc]
[1313981.702244] [<ffffffffc147d021>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
[1313981.702248] [<ffffffffc1066bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
[1313981.702268] [<ffffffffc144846b>] ? ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[1313981.702288] [<ffffffffc1445285>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
[1313981.702289] [<ffffffff8e17544a>] ? queued_spin_lock_slowpath+0xb/0xf
[1313981.702291] [<ffffffff8e183330>] ? _raw_spin_lock+0x20/0x30
[1313981.702310] [<ffffffffc144bdd4>] ? ptlrpc_main+0xb34/0x1470 [ptlrpc]
[1313981.702329] [<ffffffffc144b2a0>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
[1313981.702330] [<ffffffff8dac61f1>] ? kthread+0xd1/0xe0
[1313981.702332] [<ffffffff8dac6120>] ? insert_kthread_work+0x40/0x40
[1313981.702333] [<ffffffff8e18dd1d>] ? ret_from_fork_nospec_begin+0x7/0x21
[1313981.702334] [<ffffffff8dac6120>] ? insert_kthread_work+0x40/0x40
When we brought the server back up, the server immediately panic'd when recovery completed:
[ 1176.846234] Call Trace:
[ 1176.846250] [<ffffffffc13fde41>] lu_context_refill+0x41/0x50 [obdclass]
[ 1176.846263] [<ffffffffc13fded4>] lu_env_refill+0x24/0x30 [obdclass]
[ 1176.846271] [<ffffffffc1a1cf61>] ofd_lvbo_init+0x2a1/0x7f0 [ofd]
[ 1176.846324] [<ffffffffc16ae0af>] ? req_capsule_shrink+0xff/0x240 [ptlrpc]
[ 1176.846343] [<ffffffffc166072d>] ldlm_server_completion_ast+0x5fd/0x980 [ptlrpc]
[ 1176.846361] [<ffffffffc1660130>] ? ldlm_server_blocking_ast+0xa40/0xa40 [ptlrpc]
[ 1176.846374] [<ffffffffc1632d08>] ldlm_work_cp_ast_lock+0xa8/0x1d0 [ptlrpc]
[ 1176.846394] [<ffffffffc167ab42>] ptlrpc_set_wait+0x72/0x790 [ptlrpc]
[ 1176.846397] [<ffffffff93a2630d>] ? kmem_cache_alloc_node_trace+0x11d/0x210
[ 1176.846409] [<ffffffffc13db869>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[ 1176.846423] [<ffffffffc1632c60>] ? ldlm_work_gl_ast_lock+0x3a0/0x3a0 [ptlrpc]
[ 1176.846441] [<ffffffffc16713d2>] ? ptlrpc_prep_set+0xd2/0x280 [ptlrpc]
[ 1176.846455] [<ffffffffc1638115>] ldlm_run_ast_work+0xd5/0x3a0 [ptlrpc]
[ 1176.846469] [<ffffffffc16395ef>] __ldlm_reprocess_all+0x11f/0x360 [ptlrpc]
[ 1176.846484] [<ffffffffc1639ba8>] ldlm_reprocess_res+0x28/0x30 [ptlrpc]
[ 1176.846492] [<ffffffffc12bffb0>] cfs_hash_for_each_relax+0x250/0x450 [libcfs]
[ 1176.846507] [<ffffffffc1639b80>] ? ldlm_lock_mode_downgrade+0x330/0x330 [ptlrpc]
[ 1176.846521] [<ffffffffc1639b80>] ? ldlm_lock_mode_downgrade+0x330/0x330 [ptlrpc]
[ 1176.846526] [<ffffffffc12c3345>] cfs_hash_for_each_nolock+0x75/0x1c0 [libcfs]
[ 1176.846540] [<ffffffffc1639bec>] ldlm_reprocess_recovery_done+0x3c/0x110 [ptlrpc]
[ 1176.846556] [<ffffffffc164c751>] target_recovery_thread+0xcd1/0x1160 [ptlrpc]
[ 1176.846571] [<ffffffffc164ba80>] ? replay_request_or_update.isra.24+0x8c0/0x8c0 [ptlrpc]
[ 1176.846574] [<ffffffff938c61f1>] kthread+0xd1/0xe0
[ 1176.846576] [<ffffffff938c6120>] ? insert_kthread_work+0x40/0x40
[ 1176.846578] [<ffffffff93f8dd1d>] ret_from_fork_nospec_begin+0x7/0x21
[ 1176.846580] [<ffffffff938c6120>] ? insert_kthread_work+0x40/0x40
[ 1176.846594] Code: 0b 74 06 00 0f 1f 00 31 db eb 15 0f 1f 40 00 48 83 c3 08 48 81 fb 40 01 00 00 0f 84 9f 00 00 00 49 8b 45 10 4c 8b a3 e0 80 52 c1 <48> 83 3c 18 00 75 dd 4d 85 e4 74 d8 41 8b 04 24 41 8b 55 00 85
[ 1176.846606] RIP [<ffffffffc13f8d5c>] keys_fill+0x5c/0x180 [obdclass]
[ 1176.846607] RSP <ffff9c868fea3ac0>
Upon second reboot, the same issue happened. We then aborted recovery during the last few nodes. |
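For reference, a minimal sketch of the commands typically used to watch and abort recovery on an OSS, as described above; the target name example-OST0000 is a placeholder, not a name taken from this ticket.
# Watch recovery progress for the local OST targets
lctl get_param obdfilter.*.recovery_status
# Abort recovery on a specific target if it is stuck
lctl --device example-OST0000 abort_recovery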
| Comment by James A Simmons [ 02/Mar/20 ] |
|
The last trace looks like |
| Comment by James A Simmons [ 01/Feb/22 ] |
|
Patch landed that resolved this issue. |