[LU-12279] client got evicted due to network issue. Created: 09/May/19 Updated: 18/Oct/19 Resolved: 25/May/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.13.0, Lustre 2.12.2 |
| Fix Version/s: | Lustre 2.13.0, Lustre 2.12.2 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Sarah Liu | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | soak | ||
| Environment: |
version=2.12.53_13_g4191e0c |
||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
soak has been running on master branch version 2.12.53_13_g4191e0c for about 2 days, no crash, but many applications failed 511 fail /956 pass. From the syslog, seems caused by network issue. The first 24 hours seems good, failure rate is similar to 2.12.1, but as the test went by, applications started to fail a lot. Some error msg seems similar as [root@soak-16 syslog]# grep -r "Async QP" soak-20.log:May 8 06:43:29 soak-20 kernel: LNetError: 0:0:(o2iblnd_cb.c:3665:kiblnd_qp_event()) 192.168.1.105@o2ib: Async QP event type 1 soak-35.log:May 8 06:42:10 soak-35 kernel: LNetError: 0:0:(o2iblnd_cb.c:3665:kiblnd_qp_event()) 192.168.1.105@o2ib: Async QP event type 1 soak-36.log:May 8 06:41:43 soak-36 kernel: LNetError: 0:0:(o2iblnd_cb.c:3665:kiblnd_qp_event()) 192.168.1.105@o2ib: Async QP event type 1 soak-17.log:May 8 06:42:27 soak-17 kernel: LNetError: 0:0:(o2iblnd_cb.c:3665:kiblnd_qp_event()) 192.168.1.105@o2ib: Async QP event type 1 soak-38.log:May 8 06:42:09 soak-38 kernel: LNetError: 0:0:(o2iblnd_cb.c:3665:kiblnd_qp_event()) 192.168.1.105@o2ib: Async QP event type 1 soak-40.log:May 8 06:41:49 soak-40 kernel: LNetError: 0:0:(o2iblnd_cb.c:3665:kiblnd_qp_event()) 192.168.1.105@o2ib: Async QP event type 1 [root@soak-16 syslog]# many of following errors showed in client syslog May 8 07:24:51 soak-17 kernel: LustreError: 218649:0:(import.c:343:ptlrpc_invalidate_import()) soaked-OST0009_UUID: rc = -110 waiting for callback (6 != 0) May 8 07:24:51 soak-17 kernel: LustreError: 218649:0:(import.c:369:ptlrpc_invalidate_import()) @@@ still on sending list req@ffff94340487a400 x1632824181437936/t0(0) o4->soaked-OST0009-osc-ffff943a9be9a800@192.168.1.105@o2ib:6/4 lens 488/448 e 0 to 0 dl 1557297784 ref 2 fl UnregBULK:ES/0/ffffffff rc -5/-1 May 8 07:24:51 soak-17 kernel: LustreError: 218649:0:(import.c:383:ptlrpc_invalidate_import()) soaked-OST0009_UUID: Unregistering RPCs found (6). Network is sluggish? Waiting them to error out. |
| Comments |
| Comment by Sarah Liu [ 17/May/19 ] |
|
hit the similar issue when testing 2.12.2-rc1 build with MOFED version=4.6 |
| Comment by Sarah Liu [ 21/May/19 ] |
|
Hit this problem again when testing 2.12.2-rc2. soak started with 2.12.2-rc2 on 5/18 00:41, it seems the first 48 hours running was fine (checked the status on Monday 5/20, fail/pass rate is 153 fail/988 pass), then it started to show the LNetError and application fail rate rises to 1311 fail/1348 pass [root@soak-16 syslog]# grep -r "Async QP" soak-20.log:May 20 17:19:36 soak-20 kernel: LNetError: 0:0:(o2iblnd_cb.c:3660:kiblnd_qp_event()) 192.168.1.105@o2ib: Async QP event type 1 soak-29.log:May 20 17:18:01 soak-29 kernel: LNetError: 0:0:(o2iblnd_cb.c:3660:kiblnd_qp_event()) 192.168.1.105@o2ib: Async QP event type 1 soak-31.log:May 20 17:17:55 soak-31 kernel: LNetError: 0:0:(o2iblnd_cb.c:3660:kiblnd_qp_event()) 192.168.1.105@o2ib: Async QP event type 1 soak-36.log:May 20 17:17:21 soak-36 kernel: LNetError: 0:0:(o2iblnd_cb.c:3660:kiblnd_qp_event()) 192.168.1.105@o2ib: Async QP event type 1 soak-37.log:May 20 17:17:42 soak-37 kernel: LNetError: 0:0:(o2iblnd_cb.c:3660:kiblnd_qp_event()) 192.168.1.105@o2ib: Async QP event type 1 soak-38.log:May 20 17:17:55 soak-38 kernel: LNetError: 0:0:(o2iblnd_cb.c:3660:kiblnd_qp_event()) 192.168.1.105@o2ib: Async QP event type 1 soak-8.log:May 21 13:38:22 soak-8 kernel: LNetError: 0:0:(o2iblnd_cb.c:3660:kiblnd_qp_event()) 192.168.1.110@o2ib: Async QP event type 1 [root@soak-16 syslog]# on 1 client, shows following trace [ 9590.178380] Lustre: Lustre: Build Version: 2.12.2_RC2 [ 9590.299163] LNet: Using FMR for registration [ 9590.313035] LNet: Added LNI 192.168.1.117@o2ib [8/256/0/180] [21874.380134] Lustre: Mounted soaked-client [22954.170140] LNetError: 11470:0:(o2iblnd_cb.c:3335:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds [22954.181512] LNetError: 11470:0:(o2iblnd_cb.c:3410:kiblnd_check_conns()) Timed out RDMA with 192.168.1.108@o2ib (7): c: 5, oc: 0, rc: 8 [22954.195335] Lustre: 11496:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1558140941/real 1558140947] req@ffff9930f11 0ba80 x1633815204885120/t0(0) o103->soaked-MDT0000-mdc-ffff9937de2da000@192.168.1.108@o2ib:17/18 lens 328/224 e 0 to 1 dl 1558140986 ref 1 fl Rpc:eX/0/ffffffff rc 0/-1 [22954.195349] Lustre: soaked-MDT0000-mdc-ffff9937de2da000: Connection to soaked-MDT0000 (at 192.168.1.108@o2ib) was lost; in progress operations using this service will wait for recovery to complete [22954.195351] LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail [22954.264070] Lustre: 11496:0:(client.c:2134:ptlrpc_expire_one_request()) Skipped 2 previous similar messages [23160.754222] INFO: task simul:13725 blocked for more than 120 seconds. [23160.761437] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [23160.770188] simul D ffff992ff5801040 0 13725 13719 0x00000080 [23160.778148] Call Trace: [23160.780916] [<ffffffffab051b92>] ? path_lookupat+0x122/0x8b0 [23160.787364] [<ffffffffab569b69>] schedule_preempt_disabled+0x29/0x70 [23160.794576] [<ffffffffab567ab7>] __mutex_lock_slowpath+0xc7/0x1d0 [23160.801508] [<ffffffffab566e9f>] mutex_lock+0x1f/0x2f [23160.807255] [<ffffffffab052465>] filename_create+0x85/0x180 [23160.813595] [<ffffffffab05319f>] ? getname_flags+0x4f/0x1a0 [23160.819941] [<ffffffffab053214>] ? getname_flags+0xc4/0x1a0 [23160.826274] [<ffffffffab0534c1>] user_path_create+0x41/0x60 [23160.832610] [<ffffffffab054888>] SyS_mkdirat+0x48/0x100 [23160.838562] [<ffffffffab575d15>] ? system_call_after_swapgs+0xa2/0x146 [23160.845972] [<ffffffffab575d21>] ? system_call_after_swapgs+0xae/0x146 [23160.853376] [<ffffffffab054959>] SyS_mkdir+0x19/0x20 [23160.859034] [<ffffffffab575ddb>] system_call_fastpath+0x22/0x27 [23160.865769] [<ffffffffab575d21>] ? system_call_after_swapgs+0xae/0x146 [23160.873175] INFO: task mdtest:16087 blocked for more than 120 seconds. [23160.880480] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [23160.889240] mdtest D ffff9937e7e75140 0 16087 16080 0x00000080 [23160.897164] Call Trace: [23160.899914] [<ffffffffaaed6530>] ? try_to_wake_up+0x190/0x390 [23160.906444] [<ffffffffab569b69>] schedule_preempt_disabled+0x29/0x70 [23160.913665] [<ffffffffab567ab7>] __mutex_lock_slowpath+0xc7/0x1d0 [23160.920592] [<ffffffffab566e9f>] mutex_lock+0x1f/0x2f [23160.926346] [<ffffffffab0536b5>] do_rmdir+0x165/0x220 [23160.932100] [<ffffffffab575d21>] ? system_call_after_swapgs+0xae/0x146 [23160.939493] [<ffffffffab575d15>] ? system_call_after_swapgs+0xa2/0x146 [23160.946896] [<ffffffffab575d21>] ? system_call_after_swapgs+0xae/0x146 [23160.954298] [<ffffffffab575d15>] ? system_call_after_swapgs+0xa2/0x146 [23160.961706] [<ffffffffab575d21>] ? system_call_after_swapgs+0xae/0x146 [23160.969108] [<ffffffffab575d15>] ? system_call_after_swapgs+0xa2/0x146 [23160.976518] [<ffffffffab054976>] SyS_rmdir+0x16/0x20 [23160.982174] [<ffffffffab575ddb>] system_call_fastpath+0x22/0x27 [23160.988895] [<ffffffffab575d21>] ? system_call_after_swapgs+0xae/0x146 [23205.167707] LNet: 11470:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 192.168.1.108@o2ib: 20 seconds [23230.167462] LNet: 11470:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 192.168.1.108@o2ib: 45 seconds [23257.167312] LNet: 11470:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 192.168.1.108@o2ib: 22 seconds [23280.995078] INFO: task simul:13725 blocked for more than 120 seconds. [23281.002285] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. |
| Comment by Amir Shehata (Inactive) [ 21/May/19 ] |
|
The QP is receiving IB_EVENT_QP_FATAL event. The connection gets closed leading to the issues. Can we take a look at the MLX stats to see what kind of errors are being reported where this event is being received and on 192.168.1.105 I'll have to look at the MOFED code to see if I can see something more |
| Comment by Amir Shehata (Inactive) [ 21/May/19 ] |
|
can you attach the config you're using? |
| Comment by Sarah Liu [ 21/May/19 ] |
|
There is no special config for lnet, following is the info Amir requested [root@soak-17 ~]# lnetctl net show -v 4
net:
- net type: lo
local NI(s):
- nid: 0@lo
status: up
statistics:
send_count: 0
recv_count: 0
drop_count: 0
sent_stats:
put: 0
get: 0
reply: 0
ack: 0
hello: 0
received_stats:
put: 0
get: 0
reply: 0
ack: 0
hello: 0
dropped_stats:
put: 0
get: 0
reply: 0
ack: 0
hello: 0
health stats:
health value: 0
interrupts: 0
dropped: 0
aborted: 0
no route: 0
timeouts: 0
error: 0
tunables:
peer_timeout: 0
peer_credits: 0
peer_buffer_credits: 0
credits: 0
dev cpt: 0
tcp bonding: 0
CPT: "[0,1]"
- net type: o2ib
local NI(s):
- nid: 192.168.1.117@o2ib
status: up
interfaces:
0: ib0
statistics:
send_count: 873025729
recv_count: 713516095
drop_count: 2
sent_stats:
put: 873025581
get: 148
reply: 0
ack: 0
hello: 0
received_stats:
put: 709016231
get: 140
reply: 4499716
ack: 8
hello: 0
dropped_stats:
put: 2
get: 0
reply: 0
ack: 0
hello: 0
health stats:
health value: 1000
interrupts: 0
dropped: 1463
aborted: 0
no route: 0
timeouts: 824
error: 0
tunables:
peer_timeout: 180
peer_credits: 8
peer_buffer_credits: 0
credits: 256
peercredits_hiw: 4
map_on_demand: 0
concurrent_sends: 8
fmr_pool_size: 512
fmr_flush_trigger: 384
fmr_cache: 1
ntx: 512
conns_per_peer: 1
lnd tunables:
dev cpt: 0
tcp bonding: 0
CPT: "[0,1]"
[root@soak-17 ~]#
[root@soak-5 ~]# lnetctl net show -v 4
net:
- net type: lo
local NI(s):
- nid: 0@lo
status: up
statistics:
send_count: 0
recv_count: 0
drop_count: 0
sent_stats:
put: 0
get: 0
reply: 0
ack: 0
hello: 0
received_stats:
put: 0
get: 0
reply: 0
ack: 0
hello: 0
dropped_stats:
put: 0
get: 0
reply: 0
ack: 0
hello: 0
health stats:
health value: 0
interrupts: 0
dropped: 0
aborted: 0
no route: 0
timeouts: 0
error: 0
tunables:
peer_timeout: 0
peer_credits: 0
peer_buffer_credits: 0
credits: 0
dev cpt: 0
tcp bonding: 0
CPT: "[0,1]"
- net type: o2ib
local NI(s):
- nid: 192.168.1.105@o2ib
status: up
interfaces:
0: ib0
statistics:
send_count: 12361
recv_count: 12349
drop_count: 1
sent_stats:
put: 12356
get: 5
reply: 0
ack: 0
hello: 0
received_stats:
put: 12340
get: 1
reply: 4
ack: 4
hello: 0
dropped_stats:
put: 1
get: 0
reply: 0
ack: 0
hello: 0
health stats:
health value: 1000
interrupts: 0
dropped: 16
aborted: 0
no route: 0
timeouts: 0
error: 0
tunables:
peer_timeout: 180
peer_credits: 8
peer_buffer_credits: 0
credits: 256
peercredits_hiw: 4
map_on_demand: 0
concurrent_sends: 8
fmr_pool_size: 512
fmr_flush_trigger: 384
fmr_cache: 1
ntx: 512
conns_per_peer: 1
lnd tunables:
dev cpt: 0
tcp bonding: 0
CPT: "[0,1]"
[root@soak-5 ~]#
Around the time when "Async QP event type 1" error shows, soak-5 was rebooted of failover test 2019-05-20 17:19:08,383:fsmgmt.fsmgmt:INFO triggering fault oss_restart 2019-05-20 17:19:08,386:fsmgmt.fsmgmt:INFO executing cmd pm -h powerman -c soak-5> /dev/null Also, when the ticket was opened, master tag-2.12.53 includes |
| Comment by Gerrit Updater [ 21/May/19 ] |
|
Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34933 |
| Comment by James A Simmons [ 21/May/19 ] |
|
I think we need a master patch first. |
| Comment by Peter Jones [ 21/May/19 ] |
|
it's testing a theory not landing at this point |
| Comment by Amir Shehata (Inactive) [ 21/May/19 ] |
|
Yes. I asked Sarah if we can turn on crash on eviction on the client and dump on eviction on the server, so we can investigate the dump if we encounter this issue again. |
| Comment by Gerrit Updater [ 22/May/19 ] |
|
James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34945 |
| Comment by Gerrit Updater [ 25/May/19 ] |
|
Andreas Dilger (adilger@whamcloud.com) merged in patch https://review.whamcloud.com/34945/ |
| Comment by Gerrit Updater [ 25/May/19 ] |
|
Andreas Dilger (adilger@whamcloud.com) merged in patch https://review.whamcloud.com/34933/ |
| Comment by Peter Jones [ 25/May/19 ] |
|
Landed for 2.13 |