LU-12772

bulk timeout after 2.12.2 clients upgrade

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.2
    • Labels: None
    • Severity: 2

    Description

      After upgrading to 2.12.2 clients (with 2.10.8 servers) we started to see large numbers of bulk I/O timeouts.

      client side

      Sep 16 16:36:59 r323i3n6 kernel: [1568677019.825837] Lustre: nbp2-OST0008-osc-ffff9ad887dc1800: Connection to nbp2-OST0008 (at 10.151.26.105@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      Sep 16 16:37:01 r585i7n2 kernel: [1568677021.115645] Lustre: nbp2-OST0008-osc-ffff90c85d05e000: Connection to nbp2-OST0008 (at 10.151.26.105@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      Sep 16 16:37:01 r311i0n3 kernel: [1568677021.371165] Lustre: nbp2-OST0094-osc-ffff976bea358800: Connection to nbp2-OST0094 (at 10.151.26.105@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      Sep 16 16:37:01 r311i4n9 kernel: [1568677021.578522] Lustre: nbp2-OST0094-osc-ffff9c68adf2d000: Connection to nbp2-OST0094 (at 10.151.26.105@o2ib) was lost; in progress operations using this service will wait for recovery to complete

      server side

      [90158.366440] LustreError: 30777:0:(ldlm_lib.c:3239:target_bulk_io()) @@@ timeout on bulk WRITE after 300+0s  req@ffff8affe46eb450 x1644657589074304/t0(0) o4->825a80a7-da45-880e-35d1-4a750d2cf7f0@10.151.16.212@o2ib:502/0 lens 2168/448 e 0 to 0 dl 1568676837 ref 1 fl Interpret:/2/0 rc 0/0
      

      Attachments

      Issue Links

      Activity

            [LU-12772] bulk timeout after 2.12.2 clients upgrade

            mhanafi Mahmoud Hanafi added a comment -

            The logs show that messages are getting to and from both the client and the server. The only way to break the cycle is to evict the client. It would be good for someone to look at the RPC traffic to make sure we are not hitting a bug.
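            For reference, a minimal sketch of how a single client can be evicted from one OST on the OSS, assuming the target and client NID from the logs above (exact parameter paths can vary between Lustre versions):

            # List the UUIDs of clients currently connected to the target
            lctl get_param obdfilter.nbp2-OST0008.exports.*.uuid

            # Evict one client by NID; this aborts its outstanding bulk transfers
            lctl set_param obdfilter.nbp2-OST0008.evict_client=nid:10.151.16.212@o2ib
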
            ashehata Amir Shehata (Inactive) added a comment - - edited

            Yes, the new 2.12 module parameters should be dynamic. You can set them:

            lnetctl set transaction_timeout 100 

            then you can verify with

            lnetctl global show 

            Let's keep lnet_retry_count at 0. There are a few health-related issues that have been fixed in master but haven't been ported back to 2.12 yet, so I don't think we should introduce any more variables.

            Also, to clarify: the ko2iblnd retry_count is used at connection establishment time. It is passed down to the IB stack right away, so we don't process it in LNet or the LND.
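
            A quick sketch of setting and verifying these at runtime (the /sys paths are an assumption about where the lnet module parameters are exposed):

            # Runtime change on a 2.12 client, no module reload needed
            lnetctl set transaction_timeout 100
            lnetctl set retry_count 0

            # Verify; transaction_timeout and retry_count should show up in the global section
            lnetctl global show

            # The same values should also be visible as lnet module parameters
            cat /sys/module/lnet/parameters/lnet_transaction_timeout
            cat /sys/module/lnet/parameters/lnet_retry_count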

            mhanafi Mahmoud Hanafi added a comment - - edited

            Looks like this setting can be changed after module load, correct?

            BTW, lnet_retry_count=0: should we also change that to 3?


            ashehata Amir Shehata (Inactive) added a comment -

            Mahmoud, I might have a clue about the issue with the timeouts. On 2.12 the LND timeouts have been hooked up to lnet_transaction_timeout as part of the health work. The ko2iblnd timeout parameter has been deprecated.

            The idea is that you can set a transaction timeout and a retry count (lnet_retry_count). LNet messages will be retried at most lnet_retry_count times within the transaction timeout. So, for example, if you have lnet_retry_count set to 3 and the transaction timeout set to 100, then a message that fails to be sent will be retried up to 3 times within those 100 seconds. The LND timeout will be derived as follows:

            lnd_timeout = lnet_transaction_timeout / lnet_retry_count 

            If you disable retries (they are disabled by default in 2.12.2), then:

            lnd_timeout = lnet_transaction_timeout 

            The default value of lnet_transaction_timeout is 50s.

            In your configuration you want to set the lnd timeout to 100s.

            On the 2.12.2 clients you should set:

            options lnet lnet_transaction_timeout=100

            That should give you the expected timeout. I'm thinking that in your setup under heavy load you need the timeouts to be that large.
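
            A sketch of persisting that across reboots, assuming a standard modprobe.d layout (the file name is just an example):

            # /etc/modprobe.d/lustre.conf on the 2.12.2 clients
            options lnet lnet_transaction_timeout=100

            # With lnet_retry_count left at 0 (retries disabled) the derived LND timeout is
            #   lnd_timeout = lnet_transaction_timeout = 100s
            # With retries enabled it would be lnet_transaction_timeout / lnet_retry_count.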


            ashehata Amir Shehata (Inactive) added a comment -

            Hi Jay,

            Yes, those two patches should be applied to both 2.12.2 and 2.10.8.

            mhanafi Mahmoud Hanafi added a comment - - edited

            I've been trying to get additional debugging while we wait to update our servers.
            I found this exchange between the server and client.
            x1645242562975920 is the request that keeps timing out over and over.

            server

            00000100:00100000:14.0:1569522293.228000:0:32458:0:(service.c:1939:ptlrpc_server_handle_req_in()) got req x1645193981558368
            00002000:00100000:14.0:1569522293.228005:0:32458:0:(ofd_dev.c:2563:ofd_rw_hpreq_check()) @@@ nbp8-OST0107 ll_ost_io01_205: refresh rw locks: [0x101070000:0xb9c4b93:0x0] (1134592->1138687)
            00002000:00100000:14.0:1569522293.228010:0:32458:0:(ofd_dev.c:2422:ofd_prolong_extent_locks()) Prolong locks for req ffff885aeed2dc50 with x1645193981558368 ext(1134592->1138687)
            00002000:00010000:14.0:1569522293.228012:0:32458:0:(ofd_dev.c:2568:ofd_rw_hpreq_check()) nbp8-OST0107: refreshed 0 locks timeout for req ffff885aeed2dc50.
            00000100:00100000:14.0:1569522293.228016:0:32458:0:(nrs_fifo.c:179:nrs_fifo_req_get()) NRS start fifo request from 12345-10.141.2.117@o2ib417, seq: 127724415
            00000100:00100000:14.0:1569522293.228019:0:32458:0:(service.c:2089:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ll_ost_io01_205:4cf6649f-db20-1dab-e991-b18b25cd7717+6:0:x1645193981558368:12345-10.141.2.117@o2ib417:3
            00000100:00000200:14.0:1569522293.228020:0:32458:0:(service.c:2094:ptlrpc_server_handle_request()) got req 1645193981558368
            00000400:00000200:14.0:1569522293.228069:0:32458:0:(lib-move.c:2796:LNetPut()) LNetPut -> 12345-10.141.2.117@o2ib417
            00000400:00000200:14.0:1569522293.228076:0:32458:0:(lib-move.c:1930:lnet_select_pathway()) TRACE: 10.151.27.65@o2ib(10.151.27.65@o2ib:10.151.27.65@o2ib) -> 10.141.2.117@o2ib417(10.141.2.117@o2ib417:10.151.25.170@o2ib) : PUT
            00000800:00000200:14.0:1569522293.228079:0:32458:0:(o2iblnd_cb.c:1510:kiblnd_send()) sending 4096 bytes in 1 frags to 12345-10.151.25.170@o2ib
            00000800:00000200:14.0:1569522293.228081:0:32458:0:(o2iblnd_cb.c:703:kiblnd_setup_rd_kiov()) niov 1 offset 0 nob 4096
            00000800:00000200:14.0:1569522293.228084:0:32458:0:(o2iblnd.c:405:kiblnd_find_peer_locked()) got peer_ni [ffff885d82149e80] -> 10.151.25.170@o2ib (2) version: 12
            00000800:00000200:14.0:1569522293.228085:0:32458:0:(o2iblnd_cb.c:1391:kiblnd_launch_tx()) conn[ffff88588fc32800] (130)++
            00000800:00000200:14.0:1569522293.228086:0:32458:0:(o2iblnd_cb.c:1166:kiblnd_queue_tx_locked()) conn[ffff88588fc32800] (131)++
            00000800:00000200:14.0:1569522293.228088:0:32458:0:(o2iblnd_cb.c:1397:kiblnd_launch_tx()) conn[ffff88588fc32800] (132)--
            00000100:00000200:14.0:1569522293.228090:0:32458:0:(niobuf.c:262:ptlrpc_start_bulk_transfer()) Transferring 1 pages 4096 bytes via portal 8 id 12345-10.141.2.117@o2ib417 mbits 0x0-0x0
            00010000:00020000:8.0:1569522343.227486:0:32458:0:(ldlm_lib.c:3239:target_bulk_io()) @@@ timeout on bulk READ after 50+0s  req@ffff885aeed2dc50 x1645193981558368/t0(0) o3->4cf6649f-db20-1dab-e991-b18b25cd7717@10.141.2.117@o2ib417:613/0 lens 488/432 e 0 to 0 dl 1569522548 ref 1 fl Interpret:/2/0 rc 0/0
            
            

            On the client

            00000100:00000200:52.0:1569522293.227691:0:4645:0:(events.c:57:request_out_callback()) @@@ type 5, status 0  req@ffff9568c0c48040 x1645193981558368/t0(0) o3->nbp8-OST0107-osc-ffff958574643000@10.151.27.65@o2ib:6/4 lens 488/4536 e 0 to 0 dl 1569522648 ref 3 fl Rpc:/2/ffffffff rc 0/-1
            00000800:00000200:47.0:1569522293.227691:0:4643:0:(o2iblnd_cb.c:3859:kiblnd_scheduler()) conn[ffff95709daf1800] (70)--
            00000400:00000200:52.0:1569522293.227698:0:4645:0:(lib-md.c:69:lnet_md_unlink()) Unlinking md ffff956f2881d770
            00000400:00000200:52.0:1569522293.227700:0:4645:0:(lib-msg.c:816:lnet_is_health_check()) health check = 1, status = 0, hstatus = 0
            00000400:00000200:52.0:1569522293.227703:0:4645:0:(lib-msg.c:630:lnet_health_check()) health check: 10.141.2.117@o2ib417->10.141.25.169@o2ib417: PUT: OK
            00000800:00000200:52.0:1569522293.227706:0:4645:0:(o2iblnd_cb.c:3859:kiblnd_scheduler()) conn[ffff95709daf1800] (69)--
            00000800:00000200:38.2:1569522293.227889:0:0:0:(o2iblnd_cb.c:3721:kiblnd_cq_completion()) conn[ffff956f289cee00] (68)++
            00000800:00000200:13.0:1569522293.227905:0:4646:0:(o2iblnd_cb.c:3843:kiblnd_scheduler()) conn[ffff956f289cee00] (69)++
            00000800:00000200:13.0:1569522293.227913:0:4646:0:(o2iblnd_cb.c:343:kiblnd_handle_rx()) Received d2[0] from 10.141.25.170@o2ib417
            00000800:00000200:15.0:1569522293.227915:0:4644:0:(o2iblnd_cb.c:3859:kiblnd_scheduler()) conn[ffff956f289cee00] (70)--
            00000400:00000200:13.0:1569522293.227920:0:4646:0:(lib-move.c:4114:lnet_parse()) TRACE: 10.141.2.117@o2ib417(10.141.2.117@o2ib417) <- 10.151.27.65@o2ib : PUT - for me
            00000400:00000200:13.0:1569522293.227928:0:4646:0:(lib-ptl.c:571:lnet_ptl_match_md()) Request from 12345-10.151.27.65@o2ib of length 4096 into portal 8 MB=0x0
            00000400:00000100:13.0:1569522293.227933:0:4646:0:(lib-move.c:3753:lnet_parse_put()) Dropping PUT from 12345-10.151.27.65@o2ib portal 8 match 0 offset 0 length 4096: 4
            

            So it drops it because it timed out the bulk I/O. But I think that was the previous one. It looks to me like it's stuck timing out a previous transfer and dropping the current one.

            I will upload full debug logs that show this keeps repeating.

            ftp:uploads/LU-12772/oss4.out.gz

            ftp:uploads/LU-12772/r901i3n9.out.gz

             

            This looks like LU-11951, but our 2.12 clients have that patch. FYI, we have turned off the OSC idle timeout:

            osc.*.idle_timeout=0
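
            For completeness, a small sketch of how that setting can be checked and applied with lctl (a sketch only; the persistent form must be run on the MGS):

            # Check the current idle timeout on all OSC devices (0 = disabled)
            lctl get_param osc.*.idle_timeout

            # Disable it at runtime on a client
            lctl set_param osc.*.idle_timeout=0

            # Or persistently, from the MGS
            lctl set_param -P osc.*.idle_timeout=0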

            jaylan Jay Lan (Inactive) added a comment - - edited

            > 1. https://review.whamcloud.com/#/c/36073/
            > 2. https://review.whamcloud.com/#/c/35578/

            Amir, you ported the above two patches to b2_12. Should I cherry-pick these two into nas-2.12.2? If I cherry-pick these patches into one of our nas-2.10.8 or nas-2.12.2 branches, we should also pick them into the other branch to avoid another out-of-sync problem. Thanks!
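
            A hedged sketch of pulling those two Gerrit changes into a local branch (the project path and the <PS> patchset numbers are assumptions/placeholders):

            git checkout nas-2.12.2
            # refs/changes/<last two digits>/<change number>/<patchset>
            git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/73/36073/<PS>
            git cherry-pick FETCH_HEAD
            git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/78/35578/<PS>
            git cherry-pick FETCH_HEAD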


            ashehata Amir Shehata (Inactive) added a comment -

            Did that reduce the "no credits" messages seen in the logs?

            I think the next step is to apply the patch which makes the CQ size calculations the same on both clients and servers.

            mhanafi Mahmoud Hanafi added a comment - - edited

            Yes, the errors return with peer_credits of 32.


            ashehata Amir Shehata (Inactive) added a comment -

            Do those problems persist after changing peer_credits and peer_credits_hiw to 32/16 respectively on clients and servers?
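
            If it helps, a sketch of what that setting might look like, assuming it is applied via modprobe options on both sides (an LNet/module reload is needed for it to take effect):

            # /etc/modprobe.d/ko2iblnd.conf (example file name), clients and servers
            options ko2iblnd peer_credits=32 peer_credits_hiw=16

            # Verify after reload
            cat /sys/module/ko2iblnd/parameters/peer_credits
            cat /sys/module/ko2iblnd/parameters/peer_credits_hiw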

            mhanafi Mahmoud Hanafi added a comment - - edited

            I have not been able to get a reproducer. Other than the specific I/O request that gets timed out over and over (for days), other I/O requests complete just fine. On a client in this state a 'sync' will just hang.
            An eviction from the OST will break the cycle of timeouts.
            We can run 2.10.8 clients, but since we don't have a reproducer it won't help.

            Some nodes get stuck again in repeated timeouts after a reboot. I have looked at the jobs running across those nodes; there is nothing common about the jobs.


            People

              Assignee: ashehata Amir Shehata (Inactive)
              Reporter: mhanafi Mahmoud Hanafi
              Votes: 0
              Watchers: 7
