[LU-12279] client got evicted due to network issue. Created: 09/May/19  Updated: 18/Oct/19  Resolved: 25/May/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0, Lustre 2.12.2
Fix Version/s: Lustre 2.13.0, Lustre 2.12.2

Type: Bug Priority: Minor
Reporter: Sarah Liu Assignee: Amir Shehata (Inactive)
Resolution: Fixed Votes: 0
Labels: soak
Environment:

version=2.12.53_13_g4191e0c


Issue Links:
Related
is related to LU-11931 RDMA packets sent from client to MGS ... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

soak has been running on master branch version 2.12.53_13_g4191e0c for about 2 days, no crash, but many applications failed 511 fail /956 pass. From the syslog, seems caused by network issue. The first 24 hours seems good, failure rate is similar to 2.12.1, but as the test went by, applications started to fail a lot.

Some error msg seems similar as LU-12065 which has already been fixed in this version.

[root@soak-16 syslog]# grep -r "Async QP"
soak-20.log:May  8 06:43:29 soak-20 kernel: LNetError: 0:0:(o2iblnd_cb.c:3665:kiblnd_qp_event()) 192.168.1.105@o2ib: Async QP event type 1
soak-35.log:May  8 06:42:10 soak-35 kernel: LNetError: 0:0:(o2iblnd_cb.c:3665:kiblnd_qp_event()) 192.168.1.105@o2ib: Async QP event type 1
soak-36.log:May  8 06:41:43 soak-36 kernel: LNetError: 0:0:(o2iblnd_cb.c:3665:kiblnd_qp_event()) 192.168.1.105@o2ib: Async QP event type 1
soak-17.log:May  8 06:42:27 soak-17 kernel: LNetError: 0:0:(o2iblnd_cb.c:3665:kiblnd_qp_event()) 192.168.1.105@o2ib: Async QP event type 1
soak-38.log:May  8 06:42:09 soak-38 kernel: LNetError: 0:0:(o2iblnd_cb.c:3665:kiblnd_qp_event()) 192.168.1.105@o2ib: Async QP event type 1
soak-40.log:May  8 06:41:49 soak-40 kernel: LNetError: 0:0:(o2iblnd_cb.c:3665:kiblnd_qp_event()) 192.168.1.105@o2ib: Async QP event type 1
[root@soak-16 syslog]# 

many of following errors showed in client syslog

May  8 07:24:51 soak-17 kernel: LustreError: 218649:0:(import.c:343:ptlrpc_invalidate_import()) soaked-OST0009_UUID: rc = -110 waiting for callback (6 != 0)
May  8 07:24:51 soak-17 kernel: LustreError: 218649:0:(import.c:369:ptlrpc_invalidate_import()) @@@ still on sending list  req@ffff94340487a400 x1632824181437936/t0(0) o4->soaked-OST0009-osc-ffff943a9be9a800@192.168.1.105@o2ib:6/4 lens 488/448 e 0 to 0 dl 1557297784 ref 2 fl UnregBULK:ES/0/ffffffff rc -5/-1
May  8 07:24:51 soak-17 kernel: LustreError: 218649:0:(import.c:383:ptlrpc_invalidate_import()) soaked-OST0009_UUID: Unregistering RPCs found (6). Network is sluggish? Waiting them to error out.


 Comments   
Comment by Sarah Liu [ 17/May/19 ]

hit the similar issue when testing 2.12.2-rc1 build with MOFED version=4.6

Comment by Sarah Liu [ 21/May/19 ]

Hit this problem again when testing 2.12.2-rc2. soak started with 2.12.2-rc2 on 5/18 00:41, it seems the first 48 hours running was fine (checked the status on Monday 5/20, fail/pass rate is 153 fail/988 pass), then it started to show the LNetError and application fail rate rises to 1311 fail/1348 pass

[root@soak-16 syslog]# grep -r "Async QP"
soak-20.log:May 20 17:19:36 soak-20 kernel: LNetError: 0:0:(o2iblnd_cb.c:3660:kiblnd_qp_event()) 192.168.1.105@o2ib: Async QP event type 1
soak-29.log:May 20 17:18:01 soak-29 kernel: LNetError: 0:0:(o2iblnd_cb.c:3660:kiblnd_qp_event()) 192.168.1.105@o2ib: Async QP event type 1
soak-31.log:May 20 17:17:55 soak-31 kernel: LNetError: 0:0:(o2iblnd_cb.c:3660:kiblnd_qp_event()) 192.168.1.105@o2ib: Async QP event type 1
soak-36.log:May 20 17:17:21 soak-36 kernel: LNetError: 0:0:(o2iblnd_cb.c:3660:kiblnd_qp_event()) 192.168.1.105@o2ib: Async QP event type 1
soak-37.log:May 20 17:17:42 soak-37 kernel: LNetError: 0:0:(o2iblnd_cb.c:3660:kiblnd_qp_event()) 192.168.1.105@o2ib: Async QP event type 1
soak-38.log:May 20 17:17:55 soak-38 kernel: LNetError: 0:0:(o2iblnd_cb.c:3660:kiblnd_qp_event()) 192.168.1.105@o2ib: Async QP event type 1
soak-8.log:May 21 13:38:22 soak-8 kernel: LNetError: 0:0:(o2iblnd_cb.c:3660:kiblnd_qp_event()) 192.168.1.110@o2ib: Async QP event type 1
[root@soak-16 syslog]#

on 1 client, shows following trace

[ 9590.178380] Lustre: Lustre: Build Version: 2.12.2_RC2
[ 9590.299163] LNet: Using FMR for registration
[ 9590.313035] LNet: Added LNI 192.168.1.117@o2ib [8/256/0/180]
[21874.380134] Lustre: Mounted soaked-client
[22954.170140] LNetError: 11470:0:(o2iblnd_cb.c:3335:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
[22954.181512] LNetError: 11470:0:(o2iblnd_cb.c:3410:kiblnd_check_conns()) Timed out RDMA with 192.168.1.108@o2ib (7): c: 5, oc: 0, rc: 8
[22954.195335] Lustre: 11496:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1558140941/real 1558140947]  req@ffff9930f11
0ba80 x1633815204885120/t0(0) o103->soaked-MDT0000-mdc-ffff9937de2da000@192.168.1.108@o2ib:17/18 lens 328/224 e 0 to 1 dl 1558140986 ref 1 fl Rpc:eX/0/ffffffff rc 0/-1
[22954.195349] Lustre: soaked-MDT0000-mdc-ffff9937de2da000: Connection to soaked-MDT0000 (at 192.168.1.108@o2ib) was lost; in progress operations using this service will wait 
for recovery to complete
[22954.195351] LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail
[22954.264070] Lustre: 11496:0:(client.c:2134:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
[23160.754222] INFO: task simul:13725 blocked for more than 120 seconds.
[23160.761437] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[23160.770188] simul           D ffff992ff5801040     0 13725  13719 0x00000080
[23160.778148] Call Trace:
[23160.780916]  [<ffffffffab051b92>] ? path_lookupat+0x122/0x8b0
[23160.787364]  [<ffffffffab569b69>] schedule_preempt_disabled+0x29/0x70
[23160.794576]  [<ffffffffab567ab7>] __mutex_lock_slowpath+0xc7/0x1d0
[23160.801508]  [<ffffffffab566e9f>] mutex_lock+0x1f/0x2f
[23160.807255]  [<ffffffffab052465>] filename_create+0x85/0x180
[23160.813595]  [<ffffffffab05319f>] ? getname_flags+0x4f/0x1a0
[23160.819941]  [<ffffffffab053214>] ? getname_flags+0xc4/0x1a0
[23160.826274]  [<ffffffffab0534c1>] user_path_create+0x41/0x60
[23160.832610]  [<ffffffffab054888>] SyS_mkdirat+0x48/0x100
[23160.838562]  [<ffffffffab575d15>] ? system_call_after_swapgs+0xa2/0x146
[23160.845972]  [<ffffffffab575d21>] ? system_call_after_swapgs+0xae/0x146
[23160.853376]  [<ffffffffab054959>] SyS_mkdir+0x19/0x20
[23160.859034]  [<ffffffffab575ddb>] system_call_fastpath+0x22/0x27
[23160.865769]  [<ffffffffab575d21>] ? system_call_after_swapgs+0xae/0x146
[23160.873175] INFO: task mdtest:16087 blocked for more than 120 seconds.
[23160.880480] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[23160.889240] mdtest          D ffff9937e7e75140     0 16087  16080 0x00000080
[23160.897164] Call Trace:
[23160.899914]  [<ffffffffaaed6530>] ? try_to_wake_up+0x190/0x390
[23160.906444]  [<ffffffffab569b69>] schedule_preempt_disabled+0x29/0x70
[23160.913665]  [<ffffffffab567ab7>] __mutex_lock_slowpath+0xc7/0x1d0
[23160.920592]  [<ffffffffab566e9f>] mutex_lock+0x1f/0x2f
[23160.926346]  [<ffffffffab0536b5>] do_rmdir+0x165/0x220
[23160.932100]  [<ffffffffab575d21>] ? system_call_after_swapgs+0xae/0x146
[23160.939493]  [<ffffffffab575d15>] ? system_call_after_swapgs+0xa2/0x146
[23160.946896]  [<ffffffffab575d21>] ? system_call_after_swapgs+0xae/0x146
[23160.954298]  [<ffffffffab575d15>] ? system_call_after_swapgs+0xa2/0x146
[23160.961706]  [<ffffffffab575d21>] ? system_call_after_swapgs+0xae/0x146
[23160.969108]  [<ffffffffab575d15>] ? system_call_after_swapgs+0xa2/0x146
[23160.976518]  [<ffffffffab054976>] SyS_rmdir+0x16/0x20
[23160.982174]  [<ffffffffab575ddb>] system_call_fastpath+0x22/0x27
[23160.988895]  [<ffffffffab575d21>] ? system_call_after_swapgs+0xae/0x146
[23205.167707] LNet: 11470:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 192.168.1.108@o2ib: 20 seconds
[23230.167462] LNet: 11470:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 192.168.1.108@o2ib: 45 seconds
[23257.167312] LNet: 11470:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 192.168.1.108@o2ib: 22 seconds
[23280.995078] INFO: task simul:13725 blocked for more than 120 seconds.
[23281.002285] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Comment by Amir Shehata (Inactive) [ 21/May/19 ]

The QP is receiving IB_EVENT_QP_FATAL event. The connection gets closed leading to the issues.

Can we take a look at the MLX stats to see what kind of errors are being reported where this event is being received and on 192.168.1.105

I'll have to look at the MOFED code to see if I can see something more

Comment by Amir Shehata (Inactive) [ 21/May/19 ]

can you attach the config you're using?

Comment by Sarah Liu [ 21/May/19 ]

There is no special config for lnet, following is the info Amir requested

[root@soak-17 ~]# lnetctl net show -v 4
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
          statistics:
              send_count: 0
              recv_count: 0
              drop_count: 0
          sent_stats:
              put: 0
              get: 0
              reply: 0
              ack: 0
              hello: 0
          received_stats:
              put: 0
              get: 0
              reply: 0
              ack: 0
              hello: 0
          dropped_stats:
              put: 0
              get: 0
              reply: 0
              ack: 0
              hello: 0
          health stats:
              health value: 0
              interrupts: 0
              dropped: 0
              aborted: 0
              no route: 0
              timeouts: 0
              error: 0
          tunables:
              peer_timeout: 0
              peer_credits: 0
              peer_buffer_credits: 0
              credits: 0
          dev cpt: 0
          tcp bonding: 0
          CPT: "[0,1]"
    - net type: o2ib
      local NI(s):
        - nid: 192.168.1.117@o2ib
          status: up
          interfaces:
              0: ib0
          statistics:
              send_count: 873025729
              recv_count: 713516095
              drop_count: 2
          sent_stats:
              put: 873025581
              get: 148
              reply: 0
              ack: 0
              hello: 0
          received_stats:
              put: 709016231
              get: 140
              reply: 4499716
              ack: 8
              hello: 0
          dropped_stats:
              put: 2
              get: 0
              reply: 0
              ack: 0
              hello: 0
          health stats:
              health value: 1000
              interrupts: 0
              dropped: 1463
              aborted: 0
              no route: 0
              timeouts: 824
              error: 0
          tunables:
              peer_timeout: 180
              peer_credits: 8
              peer_buffer_credits: 0
              credits: 256
              peercredits_hiw: 4
              map_on_demand: 0
              concurrent_sends: 8
              fmr_pool_size: 512
              fmr_flush_trigger: 384
              fmr_cache: 1
              ntx: 512
              conns_per_peer: 1
          lnd tunables:
          dev cpt: 0
          tcp bonding: 0
          CPT: "[0,1]"
[root@soak-17 ~]# 
[root@soak-5 ~]# lnetctl net show -v 4
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
          statistics:
              send_count: 0
              recv_count: 0
              drop_count: 0
          sent_stats:
              put: 0
              get: 0
              reply: 0
              ack: 0
              hello: 0
          received_stats:
              put: 0
              get: 0
              reply: 0
              ack: 0
              hello: 0
          dropped_stats:
              put: 0
              get: 0
              reply: 0
              ack: 0
              hello: 0
          health stats:
              health value: 0
              interrupts: 0
              dropped: 0
              aborted: 0
              no route: 0
              timeouts: 0
              error: 0
          tunables:
              peer_timeout: 0
              peer_credits: 0
              peer_buffer_credits: 0
              credits: 0
          dev cpt: 0
          tcp bonding: 0
          CPT: "[0,1]"
    - net type: o2ib
      local NI(s):
        - nid: 192.168.1.105@o2ib
          status: up
          interfaces:
              0: ib0
          statistics:
              send_count: 12361
              recv_count: 12349
              drop_count: 1
          sent_stats:
              put: 12356
              get: 5
              reply: 0
              ack: 0
              hello: 0
          received_stats:
              put: 12340
              get: 1
              reply: 4
              ack: 4
              hello: 0
          dropped_stats:
              put: 1
              get: 0
              reply: 0
              ack: 0
              hello: 0
          health stats:
              health value: 1000
              interrupts: 0
              dropped: 16
              aborted: 0
              no route: 0
              timeouts: 0
              error: 0
          tunables:
              peer_timeout: 180
              peer_credits: 8
              peer_buffer_credits: 0
              credits: 256
              peercredits_hiw: 4
              map_on_demand: 0
              concurrent_sends: 8
              fmr_pool_size: 512
              fmr_flush_trigger: 384
              fmr_cache: 1
              ntx: 512
              conns_per_peer: 1
          lnd tunables:
          dev cpt: 0
          tcp bonding: 0
          CPT: "[0,1]"
[root@soak-5 ~]# 

Around the time when "Async QP event type 1" error shows, soak-5 was rebooted of failover test

2019-05-20 17:19:08,383:fsmgmt.fsmgmt:INFO     triggering fault oss_restart
2019-05-20 17:19:08,386:fsmgmt.fsmgmt:INFO     executing cmd pm -h powerman -c soak-5> /dev/null

Also, when the ticket was opened, master tag-2.12.53 includes LU-11931

Comment by Gerrit Updater [ 21/May/19 ]

Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34933
Subject: LU-12279 lnet: use number of wrs to calculate CQEs
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 8e3ab1154fbf71f5de5bc28206350466fa8e1776

Comment by James A Simmons [ 21/May/19 ]

I think we need a master patch first.

Comment by Peter Jones [ 21/May/19 ]

it's testing a theory not landing at this point

Comment by Amir Shehata (Inactive) [ 21/May/19 ]

Yes. I asked Sarah if we can turn on crash on eviction on the client and dump on eviction on the server, so we can investigate the dump if we encounter this issue again.

Comment by Gerrit Updater [ 22/May/19 ]

James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34945
Subject: LU-12279 lnet: use number of wrs to calculate CQEs
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: cae64abc913d6c1e31881b638c05a104579f21be

Comment by Gerrit Updater [ 25/May/19 ]

Andreas Dilger (adilger@whamcloud.com) merged in patch https://review.whamcloud.com/34945/
Subject: LU-12279 lnet: use number of wrs to calculate CQEs
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 24294b843f79a1167f19d230ff1ab5c1a5cd88e7

Comment by Gerrit Updater [ 25/May/19 ]

Andreas Dilger (adilger@whamcloud.com) merged in patch https://review.whamcloud.com/34933/
Subject: LU-12279 lnet: use number of wrs to calculate CQEs
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 61270a9330e710948ff9097495845692a6450ccd

Comment by Peter Jones [ 25/May/19 ]

Landed for 2.13

Generated at Sat Feb 10 02:51:11 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.