[LU-12953] LNet timeouts with restarted Lustre production file system Created: 08/Nov/19  Updated: 01/Feb/22  Resolved: 01/Feb/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.3
Fix Version/s: Lustre 2.12.3

Type: Bug Priority: Critical
Reporter: James A Simmons Assignee: Amir Shehata (Inactive)
Resolution: Fixed Votes: 0
Labels: ORNL
Environment: Lustre OSS server running ZFS.

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

When restarting our production Lustre file system, we encountered this bug:

[407608.498637] LNetError: 72335:0:(o2iblnd_cb.c:3335:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
[407608.509681] LNetError: 72335:0:(o2iblnd_cb.c:3410:kiblnd_check_conns()) Timed out RDMA with 10.10.32.102@o2ib2 (5): c: 3, oc: 0, rc: 7
[407608.526089] LustreError: 72335:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff8ca33db8a800
[407608.537667] LustreError: 72335:0:(events.c:450:server_bulk_callback()) event type 3, status -103, desc ffff8ca33db8a800
[407608.549244] LustreError: 167072:0:(ldlm_lib.c:3259:target_bulk_io()) @@@ network error on bulk WRITE req@ffff8cabcfcba850 x1648066684855104/t0(0) o4->8d9c48a5-020d-9844-4aa4-57225c35d4e2@10.10.32.102@o2ib2:135/0 lens 608/448 e 0 to 0 dl 1573227610 ref 1 fl Interpret:/0/0 rc 0/0
[407608.576219] Lustre: f2-OST001d: Bulk IO write error with 8d9c48a5-020d-9844-4aa4-57225c35d4e2 (at 10.10.32.102@o2ib2), client will retry: rc = -110

Eventually we ended up seeing:

[423015.676012] [<ffffffff98d5d28b>] queued_spin_lock_slowpath+0xb/0xf
[423015.676017] [<ffffffff98d6b760>] _raw_spin_lock+0x20/0x30
[423015.676026] [<ffffffffc19ddf39>] ofd_intent_policy+0x1d9/0x920 [ofd]
[423015.676070] [<ffffffffc161dd26>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
[423015.676080] [<ffffffffc12f4033>] ? cfs_hash_bd_add_locked+0x63/0x80 [libcfs]
[423015.676085] [<ffffffffc12f77be>] ? cfs_hash_add+0xbe/0x1a0 [libcfs]
[423015.676107] [<ffffffffc1646587>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
[423015.676130] [<ffffffffc166e6d0>] ? lustre_swab_ldlm_lock_desc+0x30/0x30 [ptlrpc]

This looks similar to the issues reported by NASA, but I'm opening this ticket just to make sure.



 Comments   
Comment by Peter Jones [ 09/Nov/19 ]

Amir

Could you please advise?

James

Do you have any patches applied or are you running vanilla 2.12.3?

Peter

Comment by Amir Shehata (Inactive) [ 12/Nov/19 ]

Did you try:

 lctl set_param osc.*.short_io_bytes=0

Seems like this resolved the issue in LU-12856.
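
For reference, a rough sequence for checking and applying this on the clients might look like the following; the persistent -P form is an assumption about how you manage tunables (it would be run from the MGS):

 # check the current value on a client
 lctl get_param osc.*.short_io_bytes
 # disable short I/O for all OSC devices on this client
 lctl set_param osc.*.short_io_bytes=0
 # optionally, push the setting persistently from the MGS node
 lctl set_param -P osc.*.short_io_bytes=0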

Comment by James A Simmons [ 04/Feb/20 ]

We are going to update to a newer 2.12 LTS that has the https://jira.whamcloud.com/browse/LU-12856 fix. We will see if it resolves our problems. A specific user's job can reproduce this problem easily.

Comment by Jesse Hanley [ 02/Mar/20 ]

We moved to the newer release a couple of weeks ago and haven't seen a panic related to the short_io error, and the system has been pretty stable.

Last night, we encountered a crash with a similar stack trace:

[1272493.521802] Call Trace:
[1272493.521807]  [<ffffffff8e17544a>] queued_spin_lock_slowpath+0xb/0xf
[1272493.521810]  [<ffffffff8e183330>] _raw_spin_lock+0x20/0x30
[1272493.521821]  [<ffffffffc17ce5a9>] ofd_intent_policy+0x1d9/0x920 [ofd]
[1272493.521865]  [<ffffffffc1402e7c>] ? ldlm_extent_alloc_lock+0x6c/0x280 [ptlrpc]
[1272493.521880]  [<ffffffffc13ebe06>] ldlm_lock_enqueue+0x356/0xa20 [ptlrpc]
[1272493.521889]  [<ffffffffc1072033>] ? cfs_hash_bd_add_locked+0x63/0x80 [libcfs]
[1272493.521894]  [<ffffffffc10757be>] ? cfs_hash_add+0xbe/0x1a0 [libcfs]
[1272493.521912]  [<ffffffffc1414506>] ldlm_handle_enqueue0+0xa56/0x15f0 [ptlrpc]
[1272493.521934]  [<ffffffffc143d300>] ? lustre_swab_ldlm_lock_desc+0x30/0x30 [ptlrpc]
[1272493.521965]  [<ffffffffc149ccf2>] tgt_enqueue+0x62/0x210 [ptlrpc]
[1272493.521989]  [<ffffffffc14a3b0a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
[1272493.522011]  [<ffffffffc147d021>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
[1272493.522017]  [<ffffffffc1066bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
[1272493.522037]  [<ffffffffc144846b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[1272493.522057]  [<ffffffffc1445285>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
[1272493.522059]  [<ffffffff8dad3903>] ? __wake_up+0x13/0x20
[1272493.522079]  [<ffffffffc144bdd4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
[1272493.522099]  [<ffffffffc144b2a0>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
[1272493.522101]  [<ffffffff8dac61f1>] kthread+0xd1/0xe0
[1272493.522103]  [<ffffffff8dac6120>] ? insert_kthread_work+0x40/0x40
[1272493.522105]  [<ffffffff8e18dd1d>] ret_from_fork_nospec_begin+0x7/0x21
[1272493.522107]  [<ffffffff8dac6120>] ? insert_kthread_work+0x40/0x40

The other two stack traces I'm seeing repeated are:

[1313981.701739] Call Trace:
[1313981.701741]  <IRQ>  [<ffffffff8dada648>] sched_show_task+0xa8/0x110
[1313981.701743]  [<ffffffff8dade3e9>] dump_cpu_task+0x39/0x70
[1313981.701744]  [<ffffffff8db57fa0>] rcu_dump_cpu_stacks+0x90/0xd0
[1313981.701746]  [<ffffffff8db5b662>] rcu_check_callbacks+0x442/0x730
[1313981.701748]  [<ffffffff8db0ff40>] ? tick_sched_do_timer+0x50/0x50
[1313981.701750]  [<ffffffff8daaf536>] update_process_times+0x46/0x80
[1313981.701752]  [<ffffffff8db0fcb0>] tick_sched_handle+0x30/0x70
[1313981.701753]  [<ffffffff8db0ff79>] tick_sched_timer+0x39/0x80
[1313981.701755]  [<ffffffff8daca5ee>] __hrtimer_run_queues+0x10e/0x270
[1313981.701757]  [<ffffffff8dacab4f>] hrtimer_interrupt+0xaf/0x1d0
[1313981.701759]  [<ffffffff8da5c60b>] local_apic_timer_interrupt+0x3b/0x60
[1313981.701761]  [<ffffffff8e1929d3>] smp_apic_timer_interrupt+0x43/0x60
[1313981.701763]  [<ffffffff8e18eefa>] apic_timer_interrupt+0x16a/0x170
[1313981.701765]  <EOI>  [<ffffffff8db17432>] ? native_queued_spin_lock_slowpath+0x122/0x200
[1313981.701766]  [<ffffffff8e17544a>] queued_spin_lock_slowpath+0xb/0xf
[1313981.701767]  [<ffffffff8e183330>] _raw_spin_lock+0x20/0x30
[1313981.701781]  [<ffffffffc13e402c>] lock_res_and_lock+0x2c/0x50 [ptlrpc]
[1313981.701795]  [<ffffffffc13ebc61>] ldlm_lock_enqueue+0x1b1/0xa20 [ptlrpc]
[1313981.701815]  [<ffffffffc143b891>] ? lustre_pack_reply+0x11/0x20 [ptlrpc]
[1313981.701832]  [<ffffffffc1414506>] ldlm_handle_enqueue0+0xa56/0x15f0 [ptlrpc]
[1313981.701851]  [<ffffffffc143d300>] ? lustre_swab_ldlm_lock_desc+0x30/0x30 [ptlrpc]
[1313981.701875]  [<ffffffffc149ccf2>] tgt_enqueue+0x62/0x210 [ptlrpc]
[1313981.701899]  [<ffffffffc14a3b0a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
[1313981.701921]  [<ffffffffc147d021>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
[1313981.701924]  [<ffffffffc1066bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
[1313981.701944]  [<ffffffffc144846b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[1313981.701963]  [<ffffffffc1445285>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
[1313981.701983]  [<ffffffffc144bdd4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
[1313981.702002]  [<ffffffffc144b2a0>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
[1313981.702003]  [<ffffffff8dac61f1>] kthread+0xd1/0xe0
[1313981.702005]  [<ffffffff8dac6120>] ? insert_kthread_work+0x40/0x40
[1313981.702006]  [<ffffffff8e18dd1d>] ret_from_fork_nospec_begin+0x7/0x21
[1313981.702008]  [<ffffffff8dac6120>] ? insert_kthread_work+0x40/0x40

and:

[1313981.702010] Call Trace:
[1313981.702014]  [<ffffffffc0bc1797>] ? kiblnd_send+0x357/0xa20 [ko2iblnd]
[1313981.702028]  [<ffffffffc118e869>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[1313981.702031]  [<ffffffffc1066bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
[1313981.702049]  [<ffffffffc141f2ce>] ? ldlm_pool_recalc+0x12e/0x1f0 [ptlrpc]
[1313981.702060]  [<ffffffffc118e869>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[1313981.702078]  [<ffffffffc1420a40>] ? ldlm_pool_add+0x80/0xc0 [ptlrpc]
[1313981.702092]  [<ffffffffc13e9cce>] ? ldlm_grant_lock+0xfe/0x270 [ptlrpc]
[1313981.702107]  [<ffffffffc13ee841>] ? ldlm_resource_unlink_lock+0x41/0x70 [ptlrpc]
[1313981.702122]  [<ffffffffc14029be>] ? ldlm_process_extent_lock+0x2ee/0x490 [ptlrpc]
[1313981.702124]  [<ffffffff8db17436>] ? native_queued_spin_lock_slowpath+0x126/0x200
[1313981.702126]  [<ffffffff8e17544a>] ? queued_spin_lock_slowpath+0xb/0xf
[1313981.702127]  [<ffffffff8e183330>] ? _raw_spin_lock+0x20/0x30
[1313981.702140]  [<ffffffffc13e402c>] ? lock_res_and_lock+0x2c/0x50 [ptlrpc]
[1313981.702157]  [<ffffffffc1414870>] ? ldlm_handle_enqueue0+0xdc0/0x15f0 [ptlrpc]
[1313981.702175]  [<ffffffffc143d300>] ? lustre_swab_ldlm_lock_desc+0x30/0x30 [ptlrpc]
[1313981.702199]  [<ffffffffc149ccf2>] ? tgt_enqueue+0x62/0x210 [ptlrpc]
[1313981.702222]  [<ffffffffc14a3b0a>] ? tgt_request_handle+0xada/0x1570 [ptlrpc]
[1313981.702244]  [<ffffffffc147d021>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
[1313981.702248]  [<ffffffffc1066bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
[1313981.702268]  [<ffffffffc144846b>] ? ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[1313981.702288]  [<ffffffffc1445285>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
[1313981.702289]  [<ffffffff8e17544a>] ? queued_spin_lock_slowpath+0xb/0xf
[1313981.702291]  [<ffffffff8e183330>] ? _raw_spin_lock+0x20/0x30
[1313981.702310]  [<ffffffffc144bdd4>] ? ptlrpc_main+0xb34/0x1470 [ptlrpc]
[1313981.702329]  [<ffffffffc144b2a0>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
[1313981.702330]  [<ffffffff8dac61f1>] ? kthread+0xd1/0xe0
[1313981.702332]  [<ffffffff8dac6120>] ? insert_kthread_work+0x40/0x40
[1313981.702333]  [<ffffffff8e18dd1d>] ? ret_from_fork_nospec_begin+0x7/0x21
[1313981.702334]  [<ffffffff8dac6120>] ? insert_kthread_work+0x40/0x40

When we brought the server back up, it panicked immediately after recovery completed:

[ 1176.846234] Call Trace:
[ 1176.846250]  [<ffffffffc13fde41>] lu_context_refill+0x41/0x50 [obdclass]
[ 1176.846263]  [<ffffffffc13fded4>] lu_env_refill+0x24/0x30 [obdclass]
[ 1176.846271]  [<ffffffffc1a1cf61>] ofd_lvbo_init+0x2a1/0x7f0 [ofd]
[ 1176.846324]  [<ffffffffc16ae0af>] ? req_capsule_shrink+0xff/0x240 [ptlrpc]
[ 1176.846343]  [<ffffffffc166072d>] ldlm_server_completion_ast+0x5fd/0x980 [ptlrpc]
[ 1176.846361]  [<ffffffffc1660130>] ? ldlm_server_blocking_ast+0xa40/0xa40 [ptlrpc]
[ 1176.846374]  [<ffffffffc1632d08>] ldlm_work_cp_ast_lock+0xa8/0x1d0 [ptlrpc]
[ 1176.846394]  [<ffffffffc167ab42>] ptlrpc_set_wait+0x72/0x790 [ptlrpc]
[ 1176.846397]  [<ffffffff93a2630d>] ? kmem_cache_alloc_node_trace+0x11d/0x210
[ 1176.846409]  [<ffffffffc13db869>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[ 1176.846423]  [<ffffffffc1632c60>] ? ldlm_work_gl_ast_lock+0x3a0/0x3a0 [ptlrpc]
[ 1176.846441]  [<ffffffffc16713d2>] ? ptlrpc_prep_set+0xd2/0x280 [ptlrpc]
[ 1176.846455]  [<ffffffffc1638115>] ldlm_run_ast_work+0xd5/0x3a0 [ptlrpc]
[ 1176.846469]  [<ffffffffc16395ef>] __ldlm_reprocess_all+0x11f/0x360 [ptlrpc]
[ 1176.846484]  [<ffffffffc1639ba8>] ldlm_reprocess_res+0x28/0x30 [ptlrpc]
[ 1176.846492]  [<ffffffffc12bffb0>] cfs_hash_for_each_relax+0x250/0x450 [libcfs]
[ 1176.846507]  [<ffffffffc1639b80>] ? ldlm_lock_mode_downgrade+0x330/0x330 [ptlrpc]
[ 1176.846521]  [<ffffffffc1639b80>] ? ldlm_lock_mode_downgrade+0x330/0x330 [ptlrpc]
[ 1176.846526]  [<ffffffffc12c3345>] cfs_hash_for_each_nolock+0x75/0x1c0 [libcfs]
[ 1176.846540]  [<ffffffffc1639bec>] ldlm_reprocess_recovery_done+0x3c/0x110 [ptlrpc]
[ 1176.846556]  [<ffffffffc164c751>] target_recovery_thread+0xcd1/0x1160 [ptlrpc]
[ 1176.846571]  [<ffffffffc164ba80>] ? replay_request_or_update.isra.24+0x8c0/0x8c0 [ptlrpc]
[ 1176.846574]  [<ffffffff938c61f1>] kthread+0xd1/0xe0
[ 1176.846576]  [<ffffffff938c6120>] ? insert_kthread_work+0x40/0x40
[ 1176.846578]  [<ffffffff93f8dd1d>] ret_from_fork_nospec_begin+0x7/0x21
[ 1176.846580]  [<ffffffff938c6120>] ? insert_kthread_work+0x40/0x40
[ 1176.846594] Code: 0b 74 06 00 0f 1f 00 31 db eb 15 0f 1f 40 00 48 83 c3 08 48 81 fb 40 01 00 00 0f 84 9f 00 00 00 49 8b 45 10 4c 8b a3 e0 80 52 c1 <48> 83 3c 18 00 75 dd 4d 85 e4 74 d8 41 8b 04 24 41 8b 55 00 85
[ 1176.846606] RIP  [<ffffffffc13f8d5c>] keys_fill+0x5c/0x180 [obdclass]
[ 1176.846607]  RSP <ffff9c868fea3ac0>

Upon a second reboot, the same issue happened. We then aborted recovery while waiting on the last few nodes.
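
For reference, aborting recovery on a stuck target can be done roughly along these lines; the target name f2-OST001d is taken from the log above and is used only as an illustration:

 # watch recovery progress on the OSS
 lctl get_param obdfilter.*.recovery_status
 # abort recovery on a specific target that is stuck waiting on clients
 lctl --device f2-OST001d abort_recovery

Mounting the target with -o abort_recov would be the other option.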

Comment by James A Simmons [ 02/Mar/20 ]

The last trace looks like LU-12853. Let me port the patch.

Comment by James A Simmons [ 01/Feb/22 ]

The patch that resolves this issue has landed.
