Lustre / LU-12772

bulk timeout after 2.12.2 clients upgrade

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • Affects Version: Lustre 2.12.2

    Description

      After upgrading to 2.12.2 clients and 2.10.8 servers, we started seeing large numbers of bulk I/O timeouts.

      client side

      Sep 16 16:36:59 r323i3n6 kernel: [1568677019.825837] Lustre: nbp2-OST0008-osc-ffff9ad887dc1800: Connection to nbp2-OST0008 (at 10.151.26.105@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      Sep 16 16:37:01 r585i7n2 kernel: [1568677021.115645] Lustre: nbp2-OST0008-osc-ffff90c85d05e000: Connection to nbp2-OST0008 (at 10.151.26.105@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      Sep 16 16:37:01 r311i0n3 kernel: [1568677021.371165] Lustre: nbp2-OST0094-osc-ffff976bea358800: Connection to nbp2-OST0094 (at 10.151.26.105@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      Sep 16 16:37:01 r311i4n9 kernel: [1568677021.578522] Lustre: nbp2-OST0094-osc-ffff9c68adf2d000: Connection to nbp2-OST0094 (at 10.151.26.105@o2ib) was lost; in progress operations using this service will wait for recovery to complete

      server side

      [90158.366440] LustreError: 30777:0:(ldlm_lib.c:3239:target_bulk_io()) @@@ timeout on bulk WRITE after 300+0s  req@ffff8affe46eb450 x1644657589074304/t0(0) o4->825a80a7-da45-880e-35d1-4a750d2cf7f0@10.151.16.212@o2ib:502/0 lens 2168/448 e 0 to 0 dl 1568676837 ref 1 fl Interpret:/2/0 rc 0/0
      

      Attachments

        Issue Links

          Activity

            [LU-12772] bulk timeout after 2.12.2 clients upgrade

            We updated the clients to 2.12 but didn't see these timeouts. Only after rebooting the servers into 2.10.8 did we see the timeouts. But moving one filesystem back to 2.10.6 didn't make the timeouts go away on that filesystem; only rebooting the clients stops them. I am not convinced these are actual "network" timeouts. The clients and servers get out of sync and the client keeps retrying the same bulk I/O while the server times it out. The last set of debug logs I uploaded gives a good trace of the RPC traffic.
            This is exactly like https://jira.whamcloud.com/browse/LU-12543. When a user job finishes, its processes can't be killed because of an open file (or a lock). Only an eviction or a reboot will clear up the client.

            mhanafi Mahmoud Hanafi added a comment

            Here is a port for 36073 based on NASA's b2_10_8.

            0001-LU-12621-b_2_10_8-port.patch

            Mahmoud, just to clarify, did the timeout you posted above start happening after any changes?

            ashehata Amir Shehata (Inactive) added a comment

            I noticed that the clients keep timing out the same bulk request over and over. It seems to be stuck.

            Sep 23 15:45:53 r401i6n8 kernel: [1569278753.126871] Lustre: nbp2-OST00d9-osc-ffff9947f4303800: Connection to nbp2-OST00d9 (at 10.151.26.114@o2ib) was lost; in progress operations using this service will wait for recovery to complete
            Sep 23 15:45:53 r401i6n8 kernel: [1569278753.162870] Lustre: nbp2-OST00d9-osc-ffff9947f4303800: Connection restored to 10.151.26.114@o2ib (at 10.151.26.114@o2ib)
            Sep 23 15:55:59 r401i6n8 kernel: [1569279359.335922] Lustre: 3467:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1569278753/real 1569278753]  req@ffff99555d201680 x1645173752822896/t0(0) o3->nbp2-OST00d9-osc-ffff9947f4303800@10.151.26.114@o2ib:6/4 lens 488/4536 e 6 to 1 dl 1569279359 ref 2 fl Rpc:X/2/ffffffff rc -11/-1
            Sep 23 15:55:59 r401i6n8 kernel: [1569279359.335922] Lustre: nbp2-OST00d9-osc-ffff9947f4303800: Connection to nbp2-OST00d9 (at 10.151.26.114@o2ib) was lost; in progress operations using this service will wait for recovery to complete
            Sep 23 15:55:59 r401i6n8 kernel: [1569279359.371921] Lustre: nbp2-OST00d9-osc-ffff9947f4303800: Connection restored to 10.151.26.114@o2ib (at 10.151.26.114@o2ib)
            Sep 23 16:06:05 r401i6n8 kernel: [1569279965.540809] Lustre: 3467:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1569279359/real 1569279359]  req@ffff99555d201680 x1645173752822896/t0(0) o3->nbp2-OST00d9-osc-ffff9947f4303800@10.151.26.114@o2ib:6/4 lens 488/4536 e 6 to 1 dl 1569279965 ref 2 fl Rpc:X/2/ffffffff rc -11/-1
            Sep 23 16:06:05 r401i6n8 kernel: [1569279965.540809] Lustre: nbp2-OST00d9-osc-ffff9947f4303800: Connection to nbp2-OST00d9 (at 10.151.26.114@o2ib) was lost; in progress operations using this service will wait for recovery to complete
            Sep 23 16:06:05 r401i6n8 kernel: [1569279965.576807] Lustre: nbp2-OST00d9-osc-ffff9947f4303800: Connection restored to 10.151.26.114@o2ib (at 10.151.26.114@o2ib)
            Sep 23 16:16:11 r401i6n8 kernel: [1569280571.753676] Lustre: 3467:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1569279965/real 1569279965]  req@ffff99555d201680 x1645173752822896/t0(0) o3->nbp2-OST00d9-osc-ffff9947f4303800@10.151.26.114@o2ib:6/4 lens 488/4536 e 6 to 1 dl 1569280571 ref 2 fl Rpc:X/2/ffffffff rc -11/-1
            Sep 23 16:16:11 r401i6n8 kernel: [1569280571.753676] Lustre: nbp2-OST00d9-osc-ffff9947f4303800: Connection to nbp2-OST00d9 (at 10.151.26.114@o2ib) was lost; in progress operations using this service will wait for recovery to complete
            Sep 23 16:16:11 r401i6n8 kernel: [1569280571.789675] Lustre: nbp2-OST00d9-osc-ffff9947f4303800: Connection restored to 10.151.26.114@o2ib (at 10.151.26.114@o2ib)
            Sep 23 16:26:17 r401i6n8 kernel: [1569281177.958608] Lustre: 3467:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1569280571/real 1569280571]  req@ffff99555d201680 x1645173752822896/t0(0) o3->nbp2-OST00d9-osc-ffff9947f4303800@10.151.26.114@o2ib:6/4 lens 488/4536 e 6 to 1 dl 1569281177 ref 2 fl Rpc:X/2/ffffffff rc -11/-1
            Sep 23 16:26:17 r401i6n8 kernel: [1569281177.958608] Lustre: nbp2-OST00d9-osc-ffff9947f4303800: Connection to nbp2-OST00d9 (at 10.151.26.114@o2ib) was lost; in progress operations using this service will wait for recovery to complete
            Sep 23 16:26:17 r401i6n8 kernel: [1569281177.994606] Lustre: nbp2-OST00d9-osc-ffff9947f4303800: Connection restored to 10.151.26.114@o2ib (at 10.151.26.114@o2ib)
            

            I uploaded the client-side and server-side logs for the above.

            uploads/LU-12772/client.tgz
            uploads/LU-12772/server.tgz

            mhanafi Mahmoud Hanafi added a comment

            > 1. https://review.whamcloud.com/#/c/36073/

            The above suggested patch does not apply to 2.10.8 (and b2_10). Could you provide a backport or a list of prerequisite commits?

            jaylan Jay Lan (Inactive) added a comment
            ashehata Amir Shehata (Inactive) added a comment - edited

            For record keeping:
            > What does this mean?
            > [73978.224013] LNet: 232:0:(o2iblnd_cb.c:2341:kiblnd_passive_connect()) Can't accept conn from 10.151.26.122@o2ib (version 12): max_frags 256 too large (32 wanted)

            When two peers negotiate the maximum number of fragments that can be transferred between them, one peer can have a larger default max_frags; in this case it is 256. If the other peer supports fewer, say 32, it will reject the initial connection attempt and return a response saying that the maximum it can support is 32. The active side will then reconnect with 32, at which point the connection should succeed.

            The above message is printed during this negotiation attempt.
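
            For readers following along, here is a minimal sketch of the negotiation described above. The structure and function names are illustrative only, not the real kiblnd_passive_connect() code:

            /* Illustrative sketch of the max_frags negotiation; hypothetical names. */
            struct conn_request {
                    int max_frags;       /* fragments the active peer wants to use */
            };

            struct conn_reject {
                    int supported_frags; /* maximum the passive peer can accept */
            };

            /* Passive side: reject oversized requests and advertise our limit. */
            static int passive_accept(int local_max_frags, const struct conn_request *req,
                                      struct conn_reject *rej)
            {
                    if (req->max_frags > local_max_frags) {
                            /* e.g. "max_frags 256 too large (32 wanted)" */
                            rej->supported_frags = local_max_frags;
                            return -1;      /* reject; the active side will retry */
                    }
                    return 0;               /* accept the connection */
            }

            /* Active side: on rejection, retry with the advertised limit. */
            static int active_reconnect(struct conn_request *req, const struct conn_reject *rej)
            {
                    req->max_frags = rej->supported_frags;  /* e.g. drop from 256 to 32 */
                    return 0;               /* reconnect; this attempt should succeed */
            }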


            As discussed on the call, here is the relevant comment from Andreas regarding statfs:

            I'm working on patch https://review.whamcloud.com/35380 "LU-12368 obdclass: don't send multiple statfs RPCs" that adds a statfs mutex that will avoid the large majority of concurrent OST_STATFS RPCs being sent that are consuming the credits. While this is a new patch, it is a relatively straightforward client-only change, unlike the statfs proxy patches which are relatively large and change the protocol and need to be installed on both the client and server. This also includes a client-side tunable "llite.*.statfs_max_age" that will allow increasing the statfs cache age from the default 1s to something larger (e.g. 10s or 60s) on compute nodes to further reduce the number of RPCs being sent.

            The statfs mutex patch will only allow a single process to send the OST_STATFS RPC per target, and other threads will block until the first thread completes (they would be blocked waiting for their own RPCs to complete anyway, so this does not negatively impact performance), and then use the cached statfs data from the first thread. This will avoid up to 95/96 (ratio of CPU cores) of OST_STATFS RPCs, but each MDT+OST will still get one RPC. Conversely, the statfs proxy patch would avoid 129/130 (ratio of OSTs to OSTs+MDTs) of RPCs, but each thread will still send one MDT an RPC. The improvement is very comparable for this system (98% vs. 99% reduction in RPCs). The two patches are orthogonal, so they can both be used in master.
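
            To make the mechanism concrete, here is a minimal kernel-style sketch of the "single RPC in flight plus short-lived cache" idea described above. The names (cached_statfs, send_statfs_rpc) and the exact locking are illustrative, not taken from patch 35380:

            #include <linux/mutex.h>
            #include <linux/jiffies.h>
            #include <linux/statfs.h>

            /* hypothetical helper standing in for the real statfs RPC path */
            int send_statfs_rpc(struct kstatfs *sfs);

            struct cached_statfs {
                    struct mutex    lock;           /* only one thread sends the RPC */
                    unsigned long   valid_until;    /* jiffies until the cache expires */
                    struct kstatfs  cached;         /* last result, shared by all threads */
                    unsigned int    max_age;        /* seconds, cf. llite.*.statfs_max_age */
            };

            /* Assumes c->lock was set up with mutex_init() at allocation time. */
            static int cached_statfs_get(struct cached_statfs *c, struct kstatfs *out)
            {
                    int rc = 0;

                    mutex_lock(&c->lock);
                    if (time_before(jiffies, c->valid_until)) {
                            *out = c->cached;               /* reuse; no RPC sent */
                    } else {
                            rc = send_statfs_rpc(&c->cached);
                            if (rc == 0) {
                                    c->valid_until = jiffies + c->max_age * HZ;
                                    *out = c->cached;
                            }
                    }
                    mutex_unlock(&c->lock);
                    return rc;
            }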

            ashehata Amir Shehata (Inactive) added a comment
            mhanafi Mahmoud Hanafi added a comment - edited

            What does this mean?
            [73978.224013] LNet: 232:0:(o2iblnd_cb.c:2341:kiblnd_passive_connect()) Can't accept conn from 10.151.26.122@o2ib (version 12): max_frags 256 too large (32 wanted)


            The peer_credits_hiw, if not set, will default to half of peer_credits. However, in the current code, if the connection queue depth, which is based on peer_credits, is reduced, the peer_credits_hiw is not. The primary purpose of peer_credits_hiw is to return credits on the connection once they accumulate. If that value is too high, credits are not returned promptly, which can lead to a no-credits situation and eventually the timeouts you're seeing.
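
            As a rough illustration of the role of the high water mark described above: the receive path returns accumulated credits to the peer once they cross the threshold. The sketch below is simplified and hypothetical, not the actual o2iblnd credit-return code:

            /* Simplified sketch of a credit high water mark check; illustrative only. */
            struct conn_credits {
                    int outstanding_credits;        /* credits owed back to the peer */
                    int credits_hiw;                /* return threshold (peer_credits_hiw) */
            };

            static int should_return_credits(const struct conn_credits *c)
            {
                    /* If the hiw is set too high relative to a (possibly reduced)
                     * queue depth, this trigger fires late or never, credits pile
                     * up on one side, and the peer eventually stalls waiting for
                     * them — the "no credits" situation described above. */
                    return c->outstanding_credits >= c->credits_hiw;
            }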

            LU-12621 and LU-12569 resolve this issue. Basically these two patches fix the core problem in LU-10213. I'll add a comment on LU-10213 to clarify.

            Both LU-12621 and LU-12569 are planned to land on master and be ported back to 2.12.3. So the three patches which need to be applied to 2.12.2 and 2.10.8 are:

            1. https://review.whamcloud.com/#/c/36073/
            2. https://review.whamcloud.com/#/c/35578/
            3. https://review.whamcloud.com/#/c/34427

            These three patches should correctly calculate the queue depth, adjust the hiw, and keep the CQE the same between the peers.

            LU-12569 was intended to fix the case where the queue depth drops below the hiw. In that case the credits are depleted and the algorithm which uses the hiw is never triggered. The patch fixes this as follows:

            #define IBLND_CREDITS_HIGHWATER(t, conn) ((conn->ibc_version) == IBLND_MSG_VERSION_1 ? \
                       IBLND_CREDIT_HIGHWATER_V1 : \
                       min(t->lnd_peercredits_hiw, (__u32)conn->ibc_queue_depth - 1))
            

            The hiw is managed as follows:

            if (!tunables->lnd_peercredits_hiw)
                    tunables->lnd_peercredits_hiw = peer_credits_hiw;

            if (tunables->lnd_peercredits_hiw < net_tunables->lct_peer_tx_credits / 2)
                    tunables->lnd_peercredits_hiw = net_tunables->lct_peer_tx_credits / 2;

            if (tunables->lnd_peercredits_hiw >= net_tunables->lct_peer_tx_credits)
                    tunables->lnd_peercredits_hiw = net_tunables->lct_peer_tx_credits - 1;

            In your case you don't set the hiw explicitly, so tunables->lnd_peercredits_hiw = 0, which results in tunables->lnd_peercredits_hiw = net_tunables->lct_peer_tx_credits / 2.

            The LU-12569 patch would result in the hiw value being set to conn->ibc_queue_depth - 1. IMHO this violates the expectations in the system: the algorithm will be a lot less aggressive in returning credits than the standard default, which I'm not sure is what we'd want. In fact, based on the many no-credits messages we're seeing in the logs, I suspect we need to be more aggressive in returning credits. One option, now that we know the queue depth is negotiated down to 42, would be to explicitly set peer_credits_hiw to 21. I have no hard recommendation; I would try 21 and observe the system. If there are still a lot of no-credits messages, I would decrease the hiw further.
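
            For illustration, trying that on the clients would just mean adding peer_credits_hiw to the existing ko2iblnd options line (the other values here mirror the client configuration posted later in this ticket; 21 is only the suggested starting point):

            options ko2iblnd timeout=100 retry_count=7 peer_timeout=0 map_on_demand=32 peer_credits=63 concurrent_sends=63 peer_credits_hiw=21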

            In summary, I think the issue is potentially a combination of the hiw credit issue and the non-matching CQE.

            ashehata Amir Shehata (Inactive) added a comment

            We see lots of queue depth warnings, on both the client and server side. They show up only on EDR- and HDR-connected hosts.

            Sep 18 16:41:32 elrtr1 kernel: [1568850092.715421] LNet: 43868:0:(o2iblnd.c:941:kiblnd_create_conn()) peer 10.141.3.13@o2ib417 - queue depth reduced from 63 to 42  to allow for qp creation
            

            The two patches you listed are LU-12621 and LU-12569. But there is a note in LU-12569 that you're going to rework the patch as part of LU-10213. We have that in our 2.10.8 and 2.12.2. And there is a note at the end of LU-10213 that it should be reverted for 2.12.x.

            Remember we are also struggling with LU-11644. Our workaround of setting +net for it is still working.

            It's not exactly clear which patches we should apply.

            Here is our server config

            options ko2iblnd require_privileged_port=0 use_privileged_port=0
            options ko2iblnd ntx=251072 credits=125536 fmr_pool_size=62769 
            options ko2iblnd timeout=100 retry_count=7 peer_timeout=0 map_on_demand=32 peer_credits=63 concurrent_sends=63
            #lnet
            options lnet networks=o2ib(ib1) 
            options lnet routes="o2ib233 10.151.26.[80-94]@o2ib; o2ib313 10.151.25.[195-197,202-205,222]@o2ib 10.151.26.[60,127,140-144,150-154]@o2ib; o2ib417 10.151.26.[148,149]@o2ib 10.151.25.[167-170]@o2ib"
            options lnet dead_router_check_interval=60 live_router_check_interval=30
            options lnet avoid_asym_router_failure=1 check_routers_before_use=1 small_router_buffers=65536 large_router_buffers=8192
            options ptlrpc at_max=600 at_min=200
             

            Client side

            options ko2iblnd require_privileged_port=0 use_privileged_port=0
            options ko2iblnd timeout=100 retry_count=7 peer_timeout=0 map_on_demand=32 peer_credits=63 concurrent_sends=63
            options ko2iblnd credits=64
            #lnet
            options lnet networks=o2ib(ib1)
            options lnet avoid_asym_router_failure=1 check_routers_before_use=1 small_router_buffers=65536 large_router_buffers=8192
            options ptlrpc at_max=600 at_min=200
            

            Routers

            options ko2iblnd require_privileged_port=0 use_privileged_port=0
            options ko2iblnd timeout=150 retry_count=7 map_on_demand=32 peer_credits=63 concurrent_sends=63
            options ko2iblnd ntx=32768 credits=32768 fmr_pool_size=8193 
            #lnet
            options lnet networks="o2ib(ib0,ib1),o2ib417(ib2,ib3)" forwarding=enabled
            options lnet lnet_peer_discovery_disabled=1
            options lnet avoid_asym_router_failure=1 check_routers_before_use=1 small_router_buffers=65536 large_router_buffers=8192
            options ptlrpc at_max=600 at_min=200
            
            mhanafi Mahmoud Hanafi added a comment

            The symptom that we're not receiving RDMA completions seems to indicate that completions are not being processed or are getting delayed. One suspicion is that the mismatch in the number of CQ entries between the client and the server might have side effects.

            Currently when a connection is created the qp_attrs values are negotiated between the two peers ensuring the values are the same. Previous to your client upgrade the CQ entries calculation would've been the same on both the client and the server. After the upgrade, that's no longer the case.

            There are also a couple of other patches which fixed some problems around the calculations of the qp_attrs value:

            instead of the loop where we keep decrementing the queue depth in iterations until we successfully create the qp, we cache the maximum send_wrs the QP can support and use that as a cap (see the sketch after this list)
            https://review.whamcloud.com/#/c/36073/

            if we do end up reducing the queue depth, then make sure that is taken into account when calculating the high water mark (hiw) for the connection
            https://review.whamcloud.com/#/c/35578/
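
            A rough sketch of the difference between the two approaches described above; the struct, helpers, and names are hypothetical, not the patched o2iblnd code:

            struct conn {
                    int queue_depth;
            };

            /* hypothetical helpers standing in for the real qp-creation path */
            int try_create_qp(struct conn *conn, int depth);
            int device_max_send_wrs(struct conn *conn);

            /* Old behaviour: keep shrinking the queue depth until the qp can be
             * created. Each failed attempt is wasted work and the final depth can
             * end up smaller than necessary. */
            static int create_qp_by_trial(struct conn *conn)
            {
                    while (conn->queue_depth > 1) {
                            if (try_create_qp(conn, conn->queue_depth) == 0)
                                    return 0;
                            conn->queue_depth--;
                    }
                    return -1;
            }

            /* Capped approach (as 36073 is described above): query the device once
             * for the maximum send work requests it supports, cap the queue depth
             * up front, then create the qp a single time. */
            static int create_qp_capped(struct conn *conn, int wrs_per_queue_entry)
            {
                    int cap = device_max_send_wrs(conn) / wrs_per_queue_entry;

                    if (conn->queue_depth > cap)
                            conn->queue_depth = cap;
                    return try_create_qp(conn, conn->queue_depth);
            }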

            The issue with miscalculating the hiw is that under certain workloads connection credits might get depleted and not returned promptly. This could lead to timeouts on the connection. This is only a problem if the queue depth is reduced. On connection creation you'd see:

             CWARN("peer %s - queue depth reduced from %u to %u"
                   "  to allow for qp creation\n",
                   libcfs_nid2str(peer_ni->ibp_nid),
                   peer_ni->ibp_queue_depth,
                   conn->ibc_queue_depth);

            Do you see that warning?

            I do see a lot of "no credits" messages in the nbp2 log.

            I also see this:

            nbp2-oss17.dk.out:00000800:00000200:8.0:1568783002.636317:0:14106:0:(o2iblnd_cb.c:331:kiblnd_handle_rx()) Received d0[33] from 10.151.45.62@o2ib 

            This indicates that 33 credits are being returned in one shot on the connection. That's the largest number of returned credits I see in the log, and I see only one instance. I also see one instance of a bulk timeout. That could be a correlation.

            What do you have your peer_credits set to? Keep in mind that value can be reduced. If you see the warning above, it'll tell you the exact value. If 33 is greater than the queue depth, then we're certainly running into the problem fixed by the above patches.

            I also see 18912 instances of more than 20 credits being returned in one shot.

            My recommendations:

            1. apply the two patches above on both your code bases, 2.12.2 and 2.10.8
            2. Make sure that CQE calculations are identical between clients and servers. I would suggest calculating CQEs on 2.10.8 the same way they are calculated in 2.12.2, i.e. apply "0205858f6533c7cff8fe8b43d09cece15e0b5385 LU-12065 lnd: increase CQ entries" on 2.10.8 (see the example below). This way the CQ is calculated identically.
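
            As an example of how picking up that commit might look on a b2_10-based tree (only the commit hash comes from this ticket; branch handling is up to you):

            # on a lustre-release checkout with master fetched and your 2.10.8 branch checked out
            git cherry-pick 0205858f6533c7cff8fe8b43d09cece15e0b5385
            # resolve any conflicts, rebuild, and retest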

             

            ashehata Amir Shehata (Inactive) added a comment

            @Mahmoud FYI, instead of rebooting the clients, we tried to evict them from specific OSTs, and while it's a bit early to confirm it works all the time, I've been playing with that this afternoon and it seems OK so far! It breaks the loop of reconnect/recovery/bulk error (at least for bulk read errors):

            # echo e5822630-4b69-cb1b-3ba7-273be887cac8 > /proc/fs/lustre/obdfilter/oak-OST001c/evict_client
            [172383.896250] Lustre: 48149:0:(genops.c:1734:obd_export_evict_by_uuid()) oak-OST001c: evicting e5822630-4b69-cb1b-3ba7-273be887cac8 at adminstrative request
            

             
            client:

            [172579.060082] Lustre: 48149:0:(genops.c:1734:obd_export_evict_by_uuid()) oak-OST0002: evicting 30024d45-d33c-6bfc-4a6e-e26592b33903 at adminstrative request
            [172580.013039] LustreError: 304836:0:(ldlm_lib.c:3243:target_bulk_io()) @@@ Eviction on bulk READ req@ffff9a71a7578450 x1631575659750048/t0(0) o3->30024d45-d33c-6bfc-4a6e-e26592b33903@10.9.101.72@o2ib4:321/0 lens 488/432 e 0 to 0 dl 1568847286 ref 1 fl Interpret:/2/0 rc 0/0
            

             then... reconnect and ok 

            sthiell Stephane Thiell added a comment

            People

              ashehata Amir Shehata (Inactive)
              mhanafi Mahmoud Hanafi
              Votes: 0
              Watchers: 7