Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-12856

LustreError: 82937:0:(ldlm_lib.c:3268:target_bulk_io()) @@@ truncated bulk READ 0(270336)

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.14.0, Lustre 2.12.4
    • Lustre 2.12.2
    • None
    • L2.12.2 server & L2.10.8 Server
      L2.12.2 client & L2.11 Client
    • 2
    • 9223372036854775807

    Description

      We are seeing

       Oct 14 07:51:06 nbp7-oss7 kernel: [1110415.124675] LNet: 72766:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 10.151.54.10@o2ib
      Oct 14 07:53:28 nbp7-oss7 kernel: [1110557.738283] LustreError: 48242:0:(ldlm_lib.c:3268:target_bulk_io()) @@@ truncated bulk READ 0(270336)  req@ffff8bd7711d9850 x1647322708401808/t0(0) o3->c1bb3af1-465f-4e4e-3b45-831f7f0aa442@10.151.53.134@o2ib:520/0 lens 488/440 e 0 to 0 dl 1571064920 ref 1 fl Interpret:/2/0 rc 0/0
      Oct 14 07:53:28 nbp7-oss7 kernel: [1110557.820414] Lustre: nbp7-OST0009: Bulk IO read error with c1bb3af1-465f-4e4e-3b45-831f7f0aa442 (at 10.151.53.134@o2ib), client will retry: rc -110
      Oct 14 07:58:40 nbp7-oss7 kernel: [1110869.753867] LNet: 72764:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 10.151.54.10@o2ib
      Oct 14 08:00:58 nbp7-oss7 kernel: [1111007.747557] LustreError: 7338:0:(ldlm_lib.c:3268:target_bulk_io()) @@@ truncated bulk READ 0(270336)  req@ffff8bd4ef6c3050 x1647322708401808/t0(0) o3->c1bb3af1-465f-4e4e-3b45-831f7f0aa442@10.151.53.134@o2ib:220/0 lens 488/440 e 0 to 0 dl 1571065375 ref 1 fl Interpret:/2/0 rc 0/0
      Oct 14 08:00:58 nbp7-oss7 kernel: [1111007.829410] Lustre: nbp7-OST0009: Bulk IO read error with c1bb3af1-465f-4e4e-3b45-831f7f0aa442 (at 10.151.53.134@o2ib), client will retry: rc -110
      Oct 14 08:02:39 nbp7-oss7 kernel: [1111108.765855] Lustre: nbp7-OST0003: Client c1bb3af1-465f-4e4e-3b45-831f7f0aa442 (at 10.151.53.134@o2ib) reconnecting
      Oct 14 08:02:39 nbp7-oss7 kernel: [1111108.800504] Lustre: Skipped 5 previous similar messages
      Oct 14 08:02:39 nbp7-oss7 kernel: [1111108.818286] Lustre: nbp7-OST0003: Connection restored to 2ee3c4e1-9fd7-3338-5bf8-d1f02bcd8a20 (at 10.151.53.134@o2ib)
      Oct 14 08:02:39 nbp7-oss7 kernel: [1111108.818288] Lustre: Skipped 5 previous similar messages
      Oct 14 08:09:47 nbp7-oss7 kernel: [1111536.849491] LNet: 72766:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 10.151.54.10@o2ib
      Oct 14 08:11:48 nbp7-oss7 kernel: [1111657.759009] LustreError: 82937:0:(ldlm_lib.c:3268:target_bulk_io()) @@@ truncated bulk READ 0(270336)  req@ffff8bd724b11850 x1647322708401808/t0(0) o3->c1bb3af1-465f-4e4e-3b45-831f7f0aa442@10.151.53.134@o2ib:132/0 lens 488/440 e 0 to 0 dl 1571066042 ref 1 fl Interpret:/2/0 rc 0/0
      Oct 14 08:11:48 nbp7-oss7 kernel: [1111657.841189] Lustre: nbp7-OST0009: Bulk IO read error with c1bb3af1-465f-4e4e-3b45-831f7f0aa442 (at 10.151.53.134@o2ib), client will retry: rc -110
      Oct 14 08:12:57 nbp7-oss7 kernel: [1111726.231189] Lustre: nbp7-OST000f: Client c1bb3af1-465f-4e4e-3b45-831f7f0aa442 (at 10.151.53.134@o2ib) reconnecting
      Oct 14 08:12:57 nbp7-oss7 kernel: [1111726.265819] Lustre: Skipped 5 previous similar messages
      

      We have the 3 patch reported in LU-12385 and LU-12772 applied to both client and server.
      Client Configs

      r593i6n16 ~ # lnetctl global show
      global:
          numa_range: 0
          max_intf: 200
          discovery: 1
          drop_asym_route: 0
          retry_count: 0
          transaction_timeout: 100
          health_sensitivity: 0
          recovery_interval: 1
      ----
      options ko2iblnd require_privileged_port=0 use_privileged_port=0
      options ko2iblnd timeout=100 retry_count=7 peer_timeout=0 map_on_demand=32 peer_credits=32 concurrent_sends=32
      options lnet networks=o2ib(ib1)
      options lnet avoid_asym_router_failure=1 check_routers_before_use=1 small_router_buffers=65536 large_router_buffers=8192
      options lnet lnet_transaction_timeout=100
      options ptlrpc at_max=600 at_min=200
      

      Server Configs

      nbp7-oss7 ~ # lnetctl global show
      global:
          numa_range: 0
          max_intf: 200
          discovery: 1
          drop_asym_route: 0
          retry_count: 0
          transaction_timeout: 100
          health_sensitivity: 0
          recovery_interval: 1
      ----
      options ko2iblnd require_privileged_port=0 use_privileged_port=0
      options ko2iblnd ntx=251072 credits=125536 fmr_pool_size=62769 
      options ko2iblnd timeout=100 retry_count=7 peer_timeout=0 map_on_demand=32 peer_credits=32 concurrent_sends=32
      
      
      #lnet
      options lnet networks=o2ib(ib1) 
      options lnet routes="o2ib233 10.151.26.[80-94]@o2ib; o2ib313 10.151.25.[195-197,202-205,222]@o2ib 10.151.26.[60,127,140-144,150-154]@o2ib; o2ib417 10.151.26.[148,149]@o2ib 10.151.25.[167-170]@o2ib"
      options lnet dead_router_check_interval=60 live_router_check_interval=30
      options lnet avoid_asym_router_failure=1 check_routers_before_use=1 small_router_buffers=65536 large_router_buffers=8192options lnet lnet_transaction_timeout=100
      options ptlrpc at_max=600 at_min=200
      

      These bulk errors are a major issue. We though matching client and server to 2.12.2 would solve it but doesn't look like it.

      Attachments

        1. 0001-LU-12856-debug.patch
          2 kB
        2. 0002-LU-12856-revert-LU-9983.patch
          2 kB
        3. client.10.151.53.134.debug.gz
          805 kB
        4. nbp7-mds.2019-10-16.out.gz
          65.90 MB
        5. r407i2n14.out2.gz
          50.70 MB
        6. server.xid.1647322708401808.debug.gz
          78.16 MB

        Issue Links

          Activity

            [LU-12856] LustreError: 82937:0:(ldlm_lib.c:3268:target_bulk_io()) @@@ truncated bulk READ 0(270336)

            Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36632
            Subject: LU-12856 target: check FLFLAGS are valid while accessing them
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 304b3626108ecf2accb2de9a61494807fe7995a4

            gerrit Gerrit Updater added a comment - Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36632 Subject: LU-12856 target: check FLFLAGS are valid while accessing them Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 304b3626108ecf2accb2de9a61494807fe7995a4

            looks like 16k

             # lctl get_param osc.nbptest*.short_io_bytes
            osc.nbptest-OST0000-osc-ffff9573da091800.short_io_bytes=16384
            osc.nbptest-OST0001-osc-ffff9573da091800.short_io_bytes=16384
            osc.nbptest-OST0002-osc-ffff9573da091800.short_io_bytes=16384
            osc.nbptest-OST0003-osc-ffff9573da091800.short_io_bytes=16384
            osc.nbptest-OST0004-osc-ffff9573da091800.short_io_bytes=16384
            osc.nbptest-OST0005-osc-ffff9573da091800.short_io_bytes=16384
            
            mhanafi Mahmoud Hanafi added a comment - looks like 16k # lctl get_param osc.nbptest*.short_io_bytes osc.nbptest-OST0000-osc-ffff9573da091800.short_io_bytes=16384 osc.nbptest-OST0001-osc-ffff9573da091800.short_io_bytes=16384 osc.nbptest-OST0002-osc-ffff9573da091800.short_io_bytes=16384 osc.nbptest-OST0003-osc-ffff9573da091800.short_io_bytes=16384 osc.nbptest-OST0004-osc-ffff9573da091800.short_io_bytes=16384 osc.nbptest-OST0005-osc-ffff9573da091800.short_io_bytes=16384

            btw, what is the default value of short_io_bytes prior setting it to 0?

            tappro Mikhail Pershin added a comment - btw, what is the default value of short_io_bytes prior setting it to 0?
            mhanafi Mahmoud Hanafi added a comment - - edited

            2.10.8 servers (LU-12772) will error with bulk timeout. 2.12.2 servers report truncated  read error.  In both cases client keeps trying the IO and the files can’t be closed

            mhanafi Mahmoud Hanafi added a comment - - edited 2.10.8 servers ( LU-12772 ) will error with bulk timeout. 2.12.2 servers report truncated  read error.  In both cases client keeps trying the IO and the files can’t be closed

            Mahmoud, just to be sure - we are still considering the problem as described in ticket - error due to truncated bulk READ, right?

            tappro Mikhail Pershin added a comment - Mahmoud, just to be sure - we are still considering the problem as described in ticket - error due to truncated bulk READ, right?

            The short_io_bytes param is don't show up in /proc or /sys.

             401i0n3 ~ # ls -l /sys/kernel/debug/lustre/osc/nbp2-OST0000-osc-ffff9bd2999f5000
            total 0
            -r--r--r-- 1 root root 0 Oct 31 09:43 srpc_contexts
            -r--r--r-- 1 root root 0 Oct 31 09:43 srpc_info
            --w------- 1 root root 0 Oct 31 09:43 srpc_sepol
            -rw-r--r-- 1 root root 0 Oct 31 09:43 stats
            r401i0n3 ~ # ls -l /proc/fs/lustre/osc/nbp2-OST0000-osc-ffff9bd2999f5000/
            total 0
            -rw-r--r-- 1 root root 0 Oct 31 10:16 checksum_type
            -r--r--r-- 1 root root 0 Oct 31 10:16 connect_flags
            -rw-r--r-- 1 root root 0 Oct 31 10:16 cur_grant_bytes
            -rw-r--r-- 1 root root 0 Oct 31 10:16 import
            -rw-r--r-- 1 root root 0 Oct 31 10:16 max_pages_per_rpc
            -rw-r--r-- 1 root root 0 Oct 31 10:16 osc_cached_mb
            -rw-r--r-- 1 root root 0 Oct 31 10:16 osc_stats
            -r--r--r-- 1 root root 0 Oct 31 10:16 ost_server_uuid
            -rw-r--r-- 1 root root 0 Oct 31 10:16 pinger_recov
            -rw-r--r-- 1 root root 0 Oct 31 10:16 rpc_stats
            -r--r--r-- 1 root root 0 Oct 31 10:16 state
            -r--r--r-- 1 root root 0 Oct 31 10:16 timeouts
            -r--r--r-- 1 root root 0 Oct 31 10:16 unstable_stats
            r401i0n3 ~ # modinfo lustre
            filename:       /lib/modules/4.12.14-95.19.1.20190617-nasa/updates/lustre-client/fs/lustre.ko
            license:        GPL
            version:        2.12.2
            description:    Lustre Client File System
            author:         OpenSFS, Inc. <http://www.lustre.org/>
            suserelease:    SLE12-SP4
            srcversion:     DA4405C9155F0D08BCC73DB
            depends:        obdclass,ptlrpc,libcfs,lnet,lmv,mdc,lov
            retpoline:      Y
            vermagic:       4.12.14-95.19.1.20190617-nasa SMP mod_unload modversions retpoline 
            
            
            r401i0n3 ~ # lctl get_param osc.nbp2-OST0000-osc-ffff9bd2999f5000.short_io_bytes
            osc.nbp2-OST0000-osc-ffff9bd2999f5000.short_io_bytes=0
            
            mhanafi Mahmoud Hanafi added a comment - The short_io_bytes param is don't show up in /proc or /sys. 401i0n3 ~ # ls -l /sys/kernel/debug/lustre/osc/nbp2-OST0000-osc-ffff9bd2999f5000 total 0 -r--r--r-- 1 root root 0 Oct 31 09:43 srpc_contexts -r--r--r-- 1 root root 0 Oct 31 09:43 srpc_info --w------- 1 root root 0 Oct 31 09:43 srpc_sepol -rw-r--r-- 1 root root 0 Oct 31 09:43 stats r401i0n3 ~ # ls -l /proc/fs/lustre/osc/nbp2-OST0000-osc-ffff9bd2999f5000/ total 0 -rw-r--r-- 1 root root 0 Oct 31 10:16 checksum_type -r--r--r-- 1 root root 0 Oct 31 10:16 connect_flags -rw-r--r-- 1 root root 0 Oct 31 10:16 cur_grant_bytes -rw-r--r-- 1 root root 0 Oct 31 10:16 import -rw-r--r-- 1 root root 0 Oct 31 10:16 max_pages_per_rpc -rw-r--r-- 1 root root 0 Oct 31 10:16 osc_cached_mb -rw-r--r-- 1 root root 0 Oct 31 10:16 osc_stats -r--r--r-- 1 root root 0 Oct 31 10:16 ost_server_uuid -rw-r--r-- 1 root root 0 Oct 31 10:16 pinger_recov -rw-r--r-- 1 root root 0 Oct 31 10:16 rpc_stats -r--r--r-- 1 root root 0 Oct 31 10:16 state -r--r--r-- 1 root root 0 Oct 31 10:16 timeouts -r--r--r-- 1 root root 0 Oct 31 10:16 unstable_stats r401i0n3 ~ # modinfo lustre filename: /lib/modules/4.12.14-95.19.1.20190617-nasa/updates/lustre-client/fs/lustre.ko license: GPL version: 2.12.2 description: Lustre Client File System author: OpenSFS, Inc. <http: //www.lustre.org/> suserelease: SLE12-SP4 srcversion: DA4405C9155F0D08BCC73DB depends: obdclass,ptlrpc,libcfs,lnet,lmv,mdc,lov retpoline: Y vermagic: 4.12.14-95.19.1.20190617-nasa SMP mod_unload modversions retpoline r401i0n3 ~ # lctl get_param osc.nbp2-OST0000-osc-ffff9bd2999f5000.short_io_bytes osc.nbp2-OST0000-osc-ffff9bd2999f5000.short_io_bytes=0

            Mahmoud, that is useful info, thanks, I will check related code first

            tappro Mikhail Pershin added a comment - Mahmoud, that is useful info, thanks, I will check related code first

            I tested ({{lctl set_param osc.*.short_io_bytes=0) }}on 100 clients  it did not reproduce the bulk timeout issue.

            So looks like we don't need to revert the patch. What debug info would you like to further diag the issue.

            mhanafi Mahmoud Hanafi added a comment - I tested ({{lctl set_param osc.*.short_io_bytes=0) }}on 100 clients  it did not reproduce the bulk timeout issue. So looks like we don't need to revert the patch. What debug info would you like to further diag the issue.

            Mahmoud, can you disable short_io on client implicitly and do tests?
            lctl set_param osc.*.short_io_bytes=0
            The patch you've reverted fixes short_io feature agreement between server and client, this explains actually client-server matrix of error, only after that patch clients are really able to use short_io feature. So it seems the real source of problem is that feature and we can check that by disabling it on clients.

            tappro Mikhail Pershin added a comment - Mahmoud, can you disable short_io on client implicitly and do tests? lctl set_param osc.*.short_io_bytes=0 The patch you've reverted fixes short_io feature agreement between server and client, this explains actually client-server matrix of error, only after that patch clients are really able to use short_io feature. So it seems the real source of problem is that feature and we can check that by disabling it on clients.
            pjones Peter Jones added a comment -

            Mike

            Can you identify what the problem is with the LU-1757 patch?

            Peter

            pjones Peter Jones added a comment - Mike Can you identify what the problem is with the LU-1757 patch? Peter
            mhanafi Mahmoud Hanafi added a comment - - edited

            Doing git bisect between 2.10.57 and 2.10.56, I was able to identify the commit that introduced this issue. It is

            commit 3483e195314bddb8d72594ebb10307c83a4bb860
            Author: Patrick Farrell <paf@cray.com>
            Date:   Thu Dec 7 07:00:58 2017 -0600
            
                LU-1757 brw: Fix short i/o and enable for mdc
                
                The short i/o flag was left out of the OST flags in the
                original patch, meaning it was not really on.  Also, the
                short_io_size value was used uninitialized, meaning it
                was sometimes non-zero, which coudl lead to several issues.
                
                Also add the short i/o flag to the MDC/MDT for data on MDT.
                Quick testing suggests this works fine with no further
                changes.
                
                Cray-bug-id: LUS-187
                Signed-off-by: Patrick Farrell <paf@cray.com>
                Change-Id: I4154b87d5ad73b53467b0382368fad7c5ba177fe
                Reviewed-on: https://review.whamcloud.com/30435
                Tested-by: Jenkins
                Reviewed-by: Mike Pershin <mike.pershin@intel.com>
                Reviewed-by: Alexandr Boyko <c17825@cray.com>
                Tested-by: Maloo <hpdd-maloo@intel.com>
                Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
                Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>
            

            I tested reverting the commit in 2.11.0 and it fixed the issue. I tested the reverted in l2.12.2 client it DID NOT fix the issue. Now I am really confused.....

            2.11.0 is sles12sp3 and 2.12.2 is sles12sp4.

            mhanafi Mahmoud Hanafi added a comment - - edited Doing git bisect between 2.10.57 and 2.10.56, I was able to identify the commit that introduced this issue. It is commit 3483e195314bddb8d72594ebb10307c83a4bb860 Author: Patrick Farrell <paf@cray.com> Date: Thu Dec 7 07:00:58 2017 -0600 LU-1757 brw: Fix short i/o and enable for mdc The short i/o flag was left out of the OST flags in the original patch, meaning it was not really on. Also, the short_io_size value was used uninitialized, meaning it was sometimes non-zero, which coudl lead to several issues. Also add the short i/o flag to the MDC/MDT for data on MDT. Quick testing suggests this works fine with no further changes. Cray-bug-id: LUS-187 Signed-off-by: Patrick Farrell <paf@cray.com> Change-Id: I4154b87d5ad73b53467b0382368fad7c5ba177fe Reviewed-on: https: //review.whamcloud.com/30435 Tested-by: Jenkins Reviewed-by: Mike Pershin <mike.pershin@intel.com> Reviewed-by: Alexandr Boyko <c17825@cray.com> Tested-by: Maloo <hpdd-maloo@intel.com> Reviewed-by: Andreas Dilger <andreas.dilger@intel.com> Reviewed-by: Oleg Drokin <oleg.drokin@intel.com> I tested reverting the commit in 2.11.0 and it fixed the issue. I tested the reverted in l2.12.2 client it DID NOT fix the issue. Now I am really confused..... 2.11.0 is sles12sp3 and 2.12.2 is sles12sp4.

            People

              tappro Mikhail Pershin
              mhanafi Mahmoud Hanafi
              Votes:
              0 Vote for this issue
              Watchers:
              12 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: