
bulk timeout after 2.12.2 clients upgrade

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • Affects Version: Lustre 2.12.2

    Description

      After upgrading to 2.12.2 clients and 2.10.8 servers, we started to see large numbers of bulk I/O timeouts.

      client side

      Sep 16 16:36:59 r323i3n6 kernel: [1568677019.825837] Lustre: nbp2-OST0008-osc-ffff9ad887dc1800: Connection to nbp2-OST0008 (at 10.151.26.105@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      Sep 16 16:37:01 r585i7n2 kernel: [1568677021.115645] Lustre: nbp2-OST0008-osc-ffff90c85d05e000: Connection to nbp2-OST0008 (at 10.151.26.105@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      Sep 16 16:37:01 r311i0n3 kernel: [1568677021.371165] Lustre: nbp2-OST0094-osc-ffff976bea358800: Connection to nbp2-OST0094 (at 10.151.26.105@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      Sep 16 16:37:01 r311i4n9 kernel: [1568677021.578522] Lustre: nbp2-OST0094-osc-ffff9c68adf2d000: Connection to nbp2-OST0094 (at 10.151.26.105@o2ib) was lost; in progress operations using this service will wait for recovery to complete

      server side

      [90158.366440] LustreError: 30777:0:(ldlm_lib.c:3239:target_bulk_io()) @@@ timeout on bulk WRITE after 300+0s  req@ffff8affe46eb450 x1644657589074304/t0(0) o4->825a80a7-da45-880e-35d1-4a750d2cf7f0@10.151.16.212@o2ib:502/0 lens 2168/448 e 0 to 0 dl 1568676837 ref 1 fl Interpret:/2/0 rc 0/0
      


          Activity

            mhanafi Mahmoud Hanafi added a comment - - edited

            Been trying to get additional debugging while we wait to update our servers.
            I found this exchange between the server and client.
            x1645242562975920 is the request that keeps timing out over and over.

            server

            00000100:00100000:14.0:1569522293.228000:0:32458:0:(service.c:1939:ptlrpc_server_handle_req_in()) got req x1645193981558368
            00002000:00100000:14.0:1569522293.228005:0:32458:0:(ofd_dev.c:2563:ofd_rw_hpreq_check()) @@@ nbp8-OST0107 ll_ost_io01_205: refresh rw locks: [0x101070000:0xb9c4b93:0x0] (1134592->1138687)
            00002000:00100000:14.0:1569522293.228010:0:32458:0:(ofd_dev.c:2422:ofd_prolong_extent_locks()) Prolong locks for req ffff885aeed2dc50 with x1645193981558368 ext(1134592->1138687)
            00002000:00010000:14.0:1569522293.228012:0:32458:0:(ofd_dev.c:2568:ofd_rw_hpreq_check()) nbp8-OST0107: refreshed 0 locks timeout for req ffff885aeed2dc50.
            00000100:00100000:14.0:1569522293.228016:0:32458:0:(nrs_fifo.c:179:nrs_fifo_req_get()) NRS start fifo request from 12345-10.141.2.117@o2ib417, seq: 127724415
            00000100:00100000:14.0:1569522293.228019:0:32458:0:(service.c:2089:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ll_ost_io01_205:4cf6649f-db20-1dab-e991-b18b25cd7717+6:0:x1645193981558368:12345-10.141.2.117@o2ib417:3
            00000100:00000200:14.0:1569522293.228020:0:32458:0:(service.c:2094:ptlrpc_server_handle_request()) got req 1645193981558368
            00000400:00000200:14.0:1569522293.228069:0:32458:0:(lib-move.c:2796:LNetPut()) LNetPut -> 12345-10.141.2.117@o2ib417
            00000400:00000200:14.0:1569522293.228076:0:32458:0:(lib-move.c:1930:lnet_select_pathway()) TRACE: 10.151.27.65@o2ib(10.151.27.65@o2ib:10.151.27.65@o2ib) -> 10.141.2.117@o2ib417(10.141.2.117@o2ib417:10.151.25.170@o2ib) : PUT
            00000800:00000200:14.0:1569522293.228079:0:32458:0:(o2iblnd_cb.c:1510:kiblnd_send()) sending 4096 bytes in 1 frags to 12345-10.151.25.170@o2ib
            00000800:00000200:14.0:1569522293.228081:0:32458:0:(o2iblnd_cb.c:703:kiblnd_setup_rd_kiov()) niov 1 offset 0 nob 4096
            00000800:00000200:14.0:1569522293.228084:0:32458:0:(o2iblnd.c:405:kiblnd_find_peer_locked()) got peer_ni [ffff885d82149e80] -> 10.151.25.170@o2ib (2) version: 12
            00000800:00000200:14.0:1569522293.228085:0:32458:0:(o2iblnd_cb.c:1391:kiblnd_launch_tx()) conn[ffff88588fc32800] (130)++
            00000800:00000200:14.0:1569522293.228086:0:32458:0:(o2iblnd_cb.c:1166:kiblnd_queue_tx_locked()) conn[ffff88588fc32800] (131)++
            00000800:00000200:14.0:1569522293.228088:0:32458:0:(o2iblnd_cb.c:1397:kiblnd_launch_tx()) conn[ffff88588fc32800] (132)--
            00000100:00000200:14.0:1569522293.228090:0:32458:0:(niobuf.c:262:ptlrpc_start_bulk_transfer()) Transferring 1 pages 4096 bytes via portal 8 id 12345-10.141.2.117@o2ib417 mbits 0x0-0x0
            00010000:00020000:8.0:1569522343.227486:0:32458:0:(ldlm_lib.c:3239:target_bulk_io()) @@@ timeout on bulk READ after 50+0s  req@ffff885aeed2dc50 x1645193981558368/t0(0) o3->4cf6649f-db20-1dab-e991-b18b25cd7717@10.141.2.117@o2ib417:613/0 lens 488/432 e 0 to 0 dl 1569522548 ref 1 fl Interpret:/2/0 rc 0/0
            
            

            On the client

            00000100:00000200:52.0:1569522293.227691:0:4645:0:(events.c:57:request_out_callback()) @@@ type 5, status 0  req@ffff9568c0c48040 x1645193981558368/t0(0) o3->nbp8-OST0107-osc-ffff958574643000@10.151.27.65@o2ib:6/4 lens 488/4536 e 0 to 0 dl 1569522648 ref 3 fl Rpc:/2/ffffffff rc 0/-1
            00000800:00000200:47.0:1569522293.227691:0:4643:0:(o2iblnd_cb.c:3859:kiblnd_scheduler()) conn[ffff95709daf1800] (70)--
            00000400:00000200:52.0:1569522293.227698:0:4645:0:(lib-md.c:69:lnet_md_unlink()) Unlinking md ffff956f2881d770
            00000400:00000200:52.0:1569522293.227700:0:4645:0:(lib-msg.c:816:lnet_is_health_check()) health check = 1, status = 0, hstatus = 0
            00000400:00000200:52.0:1569522293.227703:0:4645:0:(lib-msg.c:630:lnet_health_check()) health check: 10.141.2.117@o2ib417->10.141.25.169@o2ib417: PUT: OK
            00000800:00000200:52.0:1569522293.227706:0:4645:0:(o2iblnd_cb.c:3859:kiblnd_scheduler()) conn[ffff95709daf1800] (69)--
            00000800:00000200:38.2:1569522293.227889:0:0:0:(o2iblnd_cb.c:3721:kiblnd_cq_completion()) conn[ffff956f289cee00] (68)++
            00000800:00000200:13.0:1569522293.227905:0:4646:0:(o2iblnd_cb.c:3843:kiblnd_scheduler()) conn[ffff956f289cee00] (69)++
            00000800:00000200:13.0:1569522293.227913:0:4646:0:(o2iblnd_cb.c:343:kiblnd_handle_rx()) Received d2[0] from 10.141.25.170@o2ib417
            00000800:00000200:15.0:1569522293.227915:0:4644:0:(o2iblnd_cb.c:3859:kiblnd_scheduler()) conn[ffff956f289cee00] (70)--
            00000400:00000200:13.0:1569522293.227920:0:4646:0:(lib-move.c:4114:lnet_parse()) TRACE: 10.141.2.117@o2ib417(10.141.2.117@o2ib417) <- 10.151.27.65@o2ib : PUT - for me
            00000400:00000200:13.0:1569522293.227928:0:4646:0:(lib-ptl.c:571:lnet_ptl_match_md()) Request from 12345-10.151.27.65@o2ib of length 4096 into portal 8 MB=0x0
            00000400:00000100:13.0:1569522293.227933:0:4646:0:(lib-move.c:3753:lnet_parse_put()) Dropping PUT from 12345-10.151.27.65@o2ib portal 8 match 0 offset 0 length 4096: 4
            

            So the client drops the PUT because it has timed out the bulk I/O. But I think that was the previous one. It looks to me like it's stuck timing out a previous transfer and dropping the current one.

            I will upload the full debug logs, which show this keeps repeating.

            ftp:uploads/LU-12772/oss4.out.gz

            ftp:uploads/LU-12772/r901i3n9.out.gz

             

            This looks like LU-11951, but our 2.12 clients have that patch. FYI, we have turned off the OSC idle timeout:

            osc.*.idle_timeout=0
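
            For reference, this is roughly how that setting is applied and checked on a client (a minimal sketch assuming the standard lctl interface; making it persistent would normally go through "lctl conf_param" or "lctl set_param -P" on the MGS):

            # disable the OSC idle disconnect on this client (0 = never idle-disconnect)
            lctl set_param osc.*.idle_timeout=0
            # confirm the value currently in effect on all OSC devices
            lctl get_param osc.*.idle_timeout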

            jaylan Jay Lan (Inactive) added a comment - - edited

            > 1. https://review.whamcloud.com/#/c/36073/
            > 2. https://review.whamcloud.com/#/c/35578/

            Amir, you ported the above two patches to b2_12. Should I cherry-pick these two into nas-2.12.2? If I cherry-pick these patches into one of our nas-2.10.8 or nas-2.12.2 branches, we should also pick them into the other branch to avoid another out-of-sync problem. Thanks!
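
            For reference, pulling one of those Gerrit changes onto a local branch would look roughly like this (a sketch assuming the standard Gerrit refs/changes layout; the repository path and the patchset number N are placeholders to adjust):

            # fetch change 36073 from Gerrit and cherry-pick it onto the local nas branch
            git checkout nas-2.12.2
            git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/73/36073/N
            git cherry-pick FETCH_HEAD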


            ashehata Amir Shehata (Inactive) added a comment

            Did that reduce the "no credits" message seen in the logs?

            I think the next step is to apply the patch which makes the CQ size calculations the same on both clients and servers.
            mhanafi Mahmoud Hanafi added a comment - - edited

            Yes, the errors return with peer_credits of 32.


            ashehata Amir Shehata (Inactive) added a comment

            Do those problems persist after changing peer_credits and peer_credits_hiw to 32 and 16, respectively, on both clients and servers?
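
            For reference, those LND credit settings are normally applied as ko2iblnd module options and picked up when the module is reloaded (a sketch; the file name and values below are illustrative, not a recommendation from this ticket):

            # /etc/modprobe.d/ko2iblnd.conf (example values only)
            options ko2iblnd peer_credits=32 peer_credits_hiw=16

            # verify what the running module is using
            cat /sys/module/ko2iblnd/parameters/peer_credits
            cat /sys/module/ko2iblnd/parameters/peer_credits_hiw
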
            mhanafi Mahmoud Hanafi added a comment - - edited

            I have not been able to get a reproducer. Other than the specific I/O request that gets timed out over and over (for days), other I/O requests complete just fine. On a client in this state a 'sync' will just hang.
            An eviction from the OST will break the cycle of timeouts.
            We can run 2.10.8 clients, but since we don't have a reproducer it won't help.

            Some nodes get stuck in repeated timeouts again after a reboot. I have looked at the jobs running across those nodes; there is nothing common about the jobs.
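
            For reference, the eviction that breaks the cycle can be issued manually from the OSS (a sketch assuming the standard obdfilter evict_client interface; the target name and client NID below are illustrative, taken from the logs in this ticket):

            # on the OSS: evict one client (by NID) from the affected OST
            lctl set_param obdfilter.nbp2-OST00d9.evict_client=nid:10.151.3.144@o2ib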


            ashehata Amir Shehata (Inactive) added a comment

            I looked at the logs. The server logs show 637678 instances of "no credits". I also see 24350 instances of more than 30 credits being returned in one shot.
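
            For reference, counts like these can be pulled from the collected debug logs with something along these lines (a sketch; the exact message text and the file name are assumptions about the uploaded dumps):

            # count "no credits" occurrences in the decoded server debug log
            zgrep -c 'no credits' oss4.out.gz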

            In the server logs I see:

            nbp2-oss18_4:00000100:00100000:15.0:1569279965.578558:0:30761:0:(service.c:1939:ptlrpc_server_handle_req_in()) got req x1645173752822896
            nbp2-oss18_4:  req@ffff97c961286850 x1645173752822896/t0(0) o3->4149294c-5096-99e7-e20f-15e803fd47f8@10.151.3.144@o2ib:36/0 lens 488/0 e 0 to 0 dl 1569280371 ref 1 fl New:/2/ffffffff rc 0/-1
            nbp2-oss18_4:00002000:00100000:15.0:1569279965.578568:0:30761:0:(ofd_dev.c:2422:ofd_prolong_extent_locks()) Prolong locks for req ffff97c961286850 with x1645173752822896 ext(4096->8191)
            nbp2-oss18_4:00000100:00100000:15.0:1569279965.578575:0:30761:0:(service.c:2089:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ll_ost_io01_113:4149294c-5096-99e7-e20f-15e803fd47f8+6:0:x1645173752822896:12345-10.151.3.144@o2ib:3
            nbp2-oss18_5:00010000:00020000:23.0:1569280165.581313:0:30761:0:(ldlm_lib.c:3239:target_bulk_io()) @@@ timeout on bulk READ after 200+0s  req@ffff97c961286850 x1645173752822896/t0(0) o3->4149294c-5096-99e7-e20f-15e803fd47f8@10.151.3.144@o2ib:36/0 lens 488/432 e 0 to 0 dl 1569280371 ref 1 fl Interpret:/2/0 rc 0/0
            nbp2-oss18_5:00010000:00000080:23.0:1569280165.695547:0:30761:0:(ldlm_lib.c:2883:target_committed_to_req()) @@@ not sending last_committed update (0/1)  req@ffff97c961286850 x1645173752822896/t0(0) o3->4149294c-5096-99e7-e20f-15e803fd47f8@10.151.3.144@o2ib:36/0 lens 488/432 e 0 to 0 dl 1569280371 ref 1 fl Interpret:/2/ffffffff rc -110/-1
            nbp2-oss18_5:00000100:00100000:23.0:1569280165.695553:0:30761:0:(service.c:2139:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ll_ost_io01_113:4149294c-5096-99e7-e20f-15e803fd47f8+6:0:x1645173752822896:12345-10.151.3.144@o2ib:3 Request processed in 200116977us (200117000us total) trans 0 rc -110/-999
            

            Looks like the bulk read for that particular RPC request times out after 200 seconds. I see this multiple times.

            I'm still suspecting the mismatch in the connection attributes as well as the HIW issue discussed on the call.

            Do you have a system with both clients and servers on 2.10.8? I'm wondering if we'd see the same behavior.

            If we're able to reproduce this issue on the test system, that would be very helpful. We can experiment with the different parameters.


            mhanafi Mahmoud Hanafi added a comment

            We updated the clients to 2.12 but didn't see these timeouts. Only after rebooting the servers into 2.10.8 did we see them. But moving one filesystem back to 2.10.6 didn't make the timeouts go away on that filesystem; only a reboot of the clients stops them. I am not convinced these are actual "network" timeouts. The clients and servers get out of sync, and the client keeps retrying the same bulk I/O while the server times it out. The last set of debug logs I uploaded gives a good trace of the RPC traffic.
            This is exactly like https://jira.whamcloud.com/browse/LU-12543. When a user job finishes, its processes can't be killed because of an open file (or a lock). Only an eviction or a reboot will clear up the client.


            ashehata Amir Shehata (Inactive) added a comment

            Here is a port of 36073 based on NASA's b2_10_8 branch.

            0001-LU-12621-b_2_10_8-port.patch
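
            For reference, the attached port would typically be applied to the corresponding branch with git am (a sketch; the branch name is illustrative):

            git checkout nas-2.10.8
            git am 0001-LU-12621-b_2_10_8-port.patch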

            Mahmoud, just to clarify: did the timeouts you posted above start happening after any changes?


            mhanafi Mahmoud Hanafi added a comment

            I noticed that the clients keep timing out the same bulk request over and over. It seems to be stuck.

            Sep 23 15:45:53 r401i6n8 kernel: [1569278753.126871] Lustre: nbp2-OST00d9-osc-ffff9947f4303800: Connection to nbp2-OST00d9 (at 10.151.26.114@o2ib) was lost; in progress operations using this service will wait for recovery to complete
            Sep 23 15:45:53 r401i6n8 kernel: [1569278753.162870] Lustre: nbp2-OST00d9-osc-ffff9947f4303800: Connection restored to 10.151.26.114@o2ib (at 10.151.26.114@o2ib)
            Sep 23 15:55:59 r401i6n8 kernel: [1569279359.335922] Lustre: 3467:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1569278753/real 1569278753]  req@ffff99555d201680 x1645173752822896/t0(0) o3->nbp2-OST00d9-osc-ffff9947f4303800@10.151.26.114@o2ib:6/4 lens 488/4536 e 6 to 1 dl 1569279359 ref 2 fl Rpc:X/2/ffffffff rc -11/-1
            Sep 23 15:55:59 r401i6n8 kernel: [1569279359.335922] Lustre: nbp2-OST00d9-osc-ffff9947f4303800: Connection to nbp2-OST00d9 (at 10.151.26.114@o2ib) was lost; in progress operations using this service will wait for recovery to complete
            Sep 23 15:55:59 r401i6n8 kernel: [1569279359.371921] Lustre: nbp2-OST00d9-osc-ffff9947f4303800: Connection restored to 10.151.26.114@o2ib (at 10.151.26.114@o2ib)
            Sep 23 16:06:05 r401i6n8 kernel: [1569279965.540809] Lustre: 3467:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1569279359/real 1569279359]  req@ffff99555d201680 x1645173752822896/t0(0) o3->nbp2-OST00d9-osc-ffff9947f4303800@10.151.26.114@o2ib:6/4 lens 488/4536 e 6 to 1 dl 1569279965 ref 2 fl Rpc:X/2/ffffffff rc -11/-1
            Sep 23 16:06:05 r401i6n8 kernel: [1569279965.540809] Lustre: nbp2-OST00d9-osc-ffff9947f4303800: Connection to nbp2-OST00d9 (at 10.151.26.114@o2ib) was lost; in progress operations using this service will wait for recovery to complete
            Sep 23 16:06:05 r401i6n8 kernel: [1569279965.576807] Lustre: nbp2-OST00d9-osc-ffff9947f4303800: Connection restored to 10.151.26.114@o2ib (at 10.151.26.114@o2ib)
            Sep 23 16:16:11 r401i6n8 kernel: [1569280571.753676] Lustre: 3467:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1569279965/real 1569279965]  req@ffff99555d201680 x1645173752822896/t0(0) o3->nbp2-OST00d9-osc-ffff9947f4303800@10.151.26.114@o2ib:6/4 lens 488/4536 e 6 to 1 dl 1569280571 ref 2 fl Rpc:X/2/ffffffff rc -11/-1
            Sep 23 16:16:11 r401i6n8 kernel: [1569280571.753676] Lustre: nbp2-OST00d9-osc-ffff9947f4303800: Connection to nbp2-OST00d9 (at 10.151.26.114@o2ib) was lost; in progress operations using this service will wait for recovery to complete
            Sep 23 16:16:11 r401i6n8 kernel: [1569280571.789675] Lustre: nbp2-OST00d9-osc-ffff9947f4303800: Connection restored to 10.151.26.114@o2ib (at 10.151.26.114@o2ib)
            Sep 23 16:26:17 r401i6n8 kernel: [1569281177.958608] Lustre: 3467:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1569280571/real 1569280571]  req@ffff99555d201680 x1645173752822896/t0(0) o3->nbp2-OST00d9-osc-ffff9947f4303800@10.151.26.114@o2ib:6/4 lens 488/4536 e 6 to 1 dl 1569281177 ref 2 fl Rpc:X/2/ffffffff rc -11/-1
            Sep 23 16:26:17 r401i6n8 kernel: [1569281177.958608] Lustre: nbp2-OST00d9-osc-ffff9947f4303800: Connection to nbp2-OST00d9 (at 10.151.26.114@o2ib) was lost; in progress operations using this service will wait for recovery to complete
            Sep 23 16:26:17 r401i6n8 kernel: [1569281177.994606] Lustre: nbp2-OST00d9-osc-ffff9947f4303800: Connection restored to 10.151.26.114@o2ib (at 10.151.26.114@o2ib)
            

            I uploaded the client-side and server-side logs for the above.

            uploads/LU-12772/client.tgz
            uploads/LU-12772/server.tgz
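
            For reference, the client and server sides of one stuck request can be lined up by grepping for its XID in both uploads (a sketch; the directory names are assumptions about how the tarballs unpack):

            # pull every log line that references the stuck request, on both sides
            # (the bare digits match both the "x..." and the plain-number forms)
            grep -r 1645173752822896 client/ server/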


            jaylan Jay Lan (Inactive) added a comment

            > 1. https://review.whamcloud.com/#/c/36073/

            The patch suggested above does not apply to 2.10.8 (and b2_10). Could you provide a backport or a list of prerequisite commits?


            People

              Assignee: ashehata Amir Shehata (Inactive)
              Reporter: mhanafi Mahmoud Hanafi
              Votes: 0
              Watchers: 7
