LU-12772

bulk timeout after 2.12.2 clients upgrade

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • Affects Version: Lustre 2.12.2

    Description

      After upgrading to 2.12.2 clients and 2.10.8 servers we started to see a large number of bulk IO timeouts.

      client side

      Sep 16 16:36:59 r323i3n6 kernel: [1568677019.825837] Lustre: nbp2-OST0008-osc-ffff9ad887dc1800: Connection to nbp2-OST0008 (at 10.151.26.105@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      Sep 16 16:37:01 r585i7n2 kernel: [1568677021.115645] Lustre: nbp2-OST0008-osc-ffff90c85d05e000: Connection to nbp2-OST0008 (at 10.151.26.105@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      Sep 16 16:37:01 r311i0n3 kernel: [1568677021.371165] Lustre: nbp2-OST0094-osc-ffff976bea358800: Connection to nbp2-OST0094 (at 10.151.26.105@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      Sep 16 16:37:01 r311i4n9 kernel: [1568677021.578522] Lustre: nbp2-OST0094-osc-ffff9c68adf2d000: Connection to nbp2-OST0094 (at 10.151.26.105@o2ib) was lost; in progress operations using this service will wait for recovery to complete

      server side

      [90158.366440] LustreError: 30777:0:(ldlm_lib.c:3239:target_bulk_io()) @@@ timeout on bulk WRITE after 300+0s  req@ffff8affe46eb450 x1644657589074304/t0(0) o4->825a80a7-da45-880e-35d1-4a750d2cf7f0@10.151.16.212@o2ib:502/0 lens 2168/448 e 0 to 0 dl 1568676837 ref 1 fl Interpret:/2/0 rc 0/0
      

    Attachments

    Issue Links

    Activity

            [LU-12772] bulk timeout after 2.12.2 clients upgrade
            mhanafi Mahmoud Hanafi added a comment - LU-12856

            mhanafi Mahmoud Hanafi added a comment -

            Today we upgraded one of the filesystems to match the CQW size as per the 3 patches. After the OSS reboot we started to see bulk timeouts again from the clients to the rebooted OSSes. We have transaction_timeout=100 set.

            The timeouts are the same as before, the same cycle repeating over and over.

            ashehata Amir Shehata (Inactive) added a comment -

            That's good news.

            I opened a separate ticket to track the issue with setting the module parameters: LU-12817
            mhanafi Mahmoud Hanafi added a comment - - edited

            The transaction_timeout change has eliminated the timeouts. It took a while to clean up all the nodes, but last night we only had errors on 3 nodes that still needed to be rebooted. We will keep an eye on it for the next few days to verify it has resolved the issue.

            So setting the value in lustre.conf doesn't work:

             r329i0n0 /etc/modprobe.d # cat lustre.conf 
            # $Source: /cvsroot/bcfg2/bcv136/Cfg/etc/modprobe.d/lustre.conf/lustre.conf.genshi,v $
            # Get rid of messages for missing, special-purpose hardware (LU-1599)
            blacklist padlock-sha
            #ko2ib
            # Setting map_on_demand=32 as per intel recommendation LU-7808
            # Timing tests on module load show enabling FMR cuts the load
            # time by 1/2 - from 6 min to 3 min.
            # On lustre servers we set credits to 1/2 the ntx value
            # to help with HCA congestion. Stats gathered show
            # that during heavy reads we run into issues.
            # Reducing credits should throttle tx.
            options ko2iblnd require_privileged_port=0 use_privileged_port=0
            options ko2iblnd timeout=100 retry_count=7 peer_timeout=0 map_on_demand=32 peer_credits=63 concurrent_sends=63
            options ko2iblnd credits=64
            #lnet
            options lnet networks=o2ib(ib1) lnet_transaction_timeout=100
            options lnet avoid_asym_router_failure=1 check_routers_before_use=1 small_router_buffers=65536 large_router_buffers=8192
            ## ptlrpc parameters
            # Lowering at_max. This will make recovery faster. No
            # evidence shows we need at_max=700.
            options ptlrpc at_max=600 at_min=200
            
            r329i0n0 /etc/modprobe.d # modprobe lustre
            
            r329i0n0 /etc/modprobe.d # lnetctl global show
            global:
                numa_range: 0
                max_intf: 200
                discovery: 1
                drop_asym_route: 0
                retry_count: 0
                transaction_timeout: 50
                health_sensitivity: 0
                recovery_interval: 1
            r329i0n0 /etc/modprobe.d # uname -a
            Linux r329i0n0 4.12.14-95.19.1.20190617-nasa #1 SMP Tue Jun 11 11:04:48 UTC 2019 (a904a7f) x86_64 x86_64 x86_64 GNU/Linux
            r329i0n0 /etc/modprobe.d # modinfo lnet
            filename:       /lib/modules/4.12.14-95.19.1.20190617-nasa/updates/lustre-client/net/lnet.ko
            license:        GPL
            version:        0.7.0
            description:    Lustre Networking layer
            author:         OpenSFS, Inc. <http://www.lustre.org/>
            suserelease:    SLE12-SP4
            srcversion:     B54539F75D86EDF216C3DEF
            depends:        libcfs
            retpoline:      Y
            vermagic:       4.12.14-95.19.1.20190617-nasa SMP mod_unload modversions retpoline 
            parm:           accept:Accept connections (secure|all|none) (charp)
            parm:           accept_port:Acceptor's port (same on all nodes) (int)
            parm:           accept_backlog:Acceptor's listen backlog (int)
            parm:           accept_timeout:Acceptor's timeout (seconds) (int)
            parm:           forwarding:Explicitly enable/disable forwarding between networks (charp)
            parm:           tiny_router_buffers:# of 0 payload messages to buffer in the router (int)
            parm:           small_router_buffers:# of small (1 page) messages to buffer in the router (int)
            parm:           large_router_buffers:# of large messages to buffer in the router (int)
            parm:           peer_buffer_credits:# router buffer credits per peer (int)
            parm:           auto_down:Automatically mark peers down on comms error (int)
            parm:           check_routers_before_use:Assume routers are down and ping them before use (int)
            parm:           avoid_asym_router_failure:Avoid asymmetrical router failures (0 to disable) (int)
            parm:           dead_router_check_interval:Seconds between dead router health checks (<= 0 to disable) (int)
            parm:           live_router_check_interval:Seconds between live router health checks (<= 0 to disable) (int)
            parm:           router_ping_timeout:Seconds to wait for the reply to a router health query (int)
            parm:           config_on_load:configure network at module load (int)
            parm:           local_nid_dist_zero:Reserved (int)
            parm:           portal_rotor:redirect PUTs to different cpu-partitions (int)
            parm:           ip2nets:LNET network <- IP table (charp)
            parm:           networks:local networks (charp)
            parm:           routes:routes to non-local networks (charp)
            parm:           rnet_htable_size:size of remote network hash table (int)
            parm:           use_tcp_bonding:Set to 1 to use socklnd bonding. 0 to use Multi-Rail (int)
            parm:           lnet_numa_range:NUMA range to consider during Multi-Rail selection (uint)
            parm:           lnet_health_sensitivity:Value to decrement the health value by on error (health_sensitivity)
            parm:           lnet_recovery_interval:Interval to recover unhealthy interfaces in seconds (recovery_interval)
            parm:           lnet_interfaces_max:Maximum number of interfaces in a node. (interfaces_max)
            parm:           lnet_peer_discovery_disabled:Set to 1 to disable peer discovery on this node. (discovery_disabled)
            parm:           lnet_drop_asym_route:Set to 1 to drop asymmetrical route messages. (drop_asym_route)
            parm:           lnet_transaction_timeout:Maximum number of seconds to wait for a peer response. (transaction_timeout)
            parm:           lnet_retry_count:Maximum number of times to retry transmitting a message (retry_count)
            
            
            r329i0n0 /etc/modprobe.d # lnetctl set transaction_timeout 100
            r329i0n0 /etc/modprobe.d # lnetctl global show
            global:
                numa_range: 0
                max_intf: 200
                discovery: 1
                drop_asym_route: 0
                retry_count: 0
                transaction_timeout: 100
                health_sensitivity: 0
                recovery_interval: 1
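
            Until the module option is honored, a possible workaround (a sketch of the idea only: apply the value dynamically right after lnet loads, e.g. from a boot script):

            # sketch: set the timeout with lnetctl after module load, since
            # the lustre.conf option is not taking effect (LU-12817)
            modprobe lustre
            lnetctl set transaction_timeout 100
            lnetctl global show | grep transaction_timeout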
            

             


            ashehata Amir Shehata (Inactive) added a comment -

            After you evict the client and set its lnet_transaction_timeout to 100, does it get back into the stuck RPC state?

            I'll continue looking at the logs.

            mhanafi Mahmoud Hanafi added a comment -

            The logs show that messages are getting to and from client and server. The only way to break the cycle is to evict the client. It would be good for someone to look at the RPC traffic to make sure we're not hitting a bug.
            ashehata Amir Shehata (Inactive) added a comment - - edited

            Yes, the new 2.12 module parameters should be dynamic. You can set them:

            lnetctl set transaction_timeout 100 

            then you can verify with

            lnetctl global show 

            Let's keep lnet_retry_count at 0. There are a few health-related issues that have been fixed in master but haven't been ported back to 2.12 yet, so I don't think we should introduce any more variables.

            Also, to clarify: the ko2iblnd retry_count is used at connection establishment time. It's passed straight down to the IB stack, so we don't process it in LNet or the LND.

            mhanafi Mahmoud Hanafi added a comment - - edited

            Looks like this setting can be changed after module load, correct?

            BTW, lnet_retry_count=0 - should we also change that to 3?

            ashehata Amir Shehata (Inactive) added a comment -

            Mahmoud, I might have a clue about the issue with the timeouts. On 2.12 the LND timeouts have been hooked up to lnet_transaction_timeout as part of the health work. The ko2iblnd timeout parameter has been deprecated.

            The idea is that you can set a transaction timeout and a retry count (lnet_retry_count). LNet messages will be retried at most lnet_retry_count times within the transaction timeout. For example, if you have lnet_retry_count set to 3 and the transaction timeout set to 100, a message that fails to send will be retried up to 3 times within those 100 seconds. The LND timeout is derived as follows:

            lnd_timeout = lnet_transaction_timeout / lnet_retry_count 

            If retries are disabled, which they are by default in 2.12.2, then:

            lnd_timeout = lnet_transaction_timeout 

            The default value of lnet_transaction_timeout is 50s.
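
            A minimal sketch of that derivation (illustrative shell only, mirroring the division described above; not LNet source):

            # illustrative only: how the effective LND timeout follows from
            # the two LNet health parameters
            lnet_transaction_timeout=100
            lnet_retry_count=0
            if [ "$lnet_retry_count" -gt 0 ]; then
                # with retries enabled, each attempt gets a slice of the budget
                lnd_timeout=$((lnet_transaction_timeout / lnet_retry_count))
            else
                # retries disabled (the 2.12.2 default): the LND gets the full timeout
                lnd_timeout=$lnet_transaction_timeout
            fi
            echo "derived lnd_timeout: ${lnd_timeout}s"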

            In your configuration you want to set the lnd timeout to 100s.

            On the 2.12.2 clients you should set:

            options lnet lnet_transaction_timeout=100

            That should give you the expected timeout. I'm thinking that in your setup under heavy load you need the timeouts to be that large.
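
            To confirm the setting took effect, a quick check using the same lnetctl commands shown earlier in this ticket:

            # verify the running values on a client
            lnetctl global show | grep -E 'transaction_timeout|retry_count'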


            ashehata Amir Shehata (Inactive) added a comment -

            Hi Jay,

            Yes, those two patches should be applied to both 2.12.2 and 2.10.8.

            People

              Assignee: ashehata Amir Shehata (Inactive)
              Reporter: mhanafi Mahmoud Hanafi
              Votes: 0
              Watchers: 7
