LustreError: 82937:0:(ldlm_lib.c:3268:target_bulk_io()) @@@ truncated bulk READ 0(270336)

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.14.0, Lustre 2.12.4
    • Affects Version/s: Lustre 2.12.2
    • Labels: None
    • Environment: L2.12.2 server & L2.10.8 server
      L2.12.2 client & L2.11 client
    • Severity: 2

    Description

      We are seeing the following errors:

       Oct 14 07:51:06 nbp7-oss7 kernel: [1110415.124675] LNet: 72766:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 10.151.54.10@o2ib
      Oct 14 07:53:28 nbp7-oss7 kernel: [1110557.738283] LustreError: 48242:0:(ldlm_lib.c:3268:target_bulk_io()) @@@ truncated bulk READ 0(270336)  req@ffff8bd7711d9850 x1647322708401808/t0(0) o3->c1bb3af1-465f-4e4e-3b45-831f7f0aa442@10.151.53.134@o2ib:520/0 lens 488/440 e 0 to 0 dl 1571064920 ref 1 fl Interpret:/2/0 rc 0/0
      Oct 14 07:53:28 nbp7-oss7 kernel: [1110557.820414] Lustre: nbp7-OST0009: Bulk IO read error with c1bb3af1-465f-4e4e-3b45-831f7f0aa442 (at 10.151.53.134@o2ib), client will retry: rc -110
      Oct 14 07:58:40 nbp7-oss7 kernel: [1110869.753867] LNet: 72764:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 10.151.54.10@o2ib
      Oct 14 08:00:58 nbp7-oss7 kernel: [1111007.747557] LustreError: 7338:0:(ldlm_lib.c:3268:target_bulk_io()) @@@ truncated bulk READ 0(270336)  req@ffff8bd4ef6c3050 x1647322708401808/t0(0) o3->c1bb3af1-465f-4e4e-3b45-831f7f0aa442@10.151.53.134@o2ib:220/0 lens 488/440 e 0 to 0 dl 1571065375 ref 1 fl Interpret:/2/0 rc 0/0
      Oct 14 08:00:58 nbp7-oss7 kernel: [1111007.829410] Lustre: nbp7-OST0009: Bulk IO read error with c1bb3af1-465f-4e4e-3b45-831f7f0aa442 (at 10.151.53.134@o2ib), client will retry: rc -110
      Oct 14 08:02:39 nbp7-oss7 kernel: [1111108.765855] Lustre: nbp7-OST0003: Client c1bb3af1-465f-4e4e-3b45-831f7f0aa442 (at 10.151.53.134@o2ib) reconnecting
      Oct 14 08:02:39 nbp7-oss7 kernel: [1111108.800504] Lustre: Skipped 5 previous similar messages
      Oct 14 08:02:39 nbp7-oss7 kernel: [1111108.818286] Lustre: nbp7-OST0003: Connection restored to 2ee3c4e1-9fd7-3338-5bf8-d1f02bcd8a20 (at 10.151.53.134@o2ib)
      Oct 14 08:02:39 nbp7-oss7 kernel: [1111108.818288] Lustre: Skipped 5 previous similar messages
      Oct 14 08:09:47 nbp7-oss7 kernel: [1111536.849491] LNet: 72766:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 10.151.54.10@o2ib
      Oct 14 08:11:48 nbp7-oss7 kernel: [1111657.759009] LustreError: 82937:0:(ldlm_lib.c:3268:target_bulk_io()) @@@ truncated bulk READ 0(270336)  req@ffff8bd724b11850 x1647322708401808/t0(0) o3->c1bb3af1-465f-4e4e-3b45-831f7f0aa442@10.151.53.134@o2ib:132/0 lens 488/440 e 0 to 0 dl 1571066042 ref 1 fl Interpret:/2/0 rc 0/0
      Oct 14 08:11:48 nbp7-oss7 kernel: [1111657.841189] Lustre: nbp7-OST0009: Bulk IO read error with c1bb3af1-465f-4e4e-3b45-831f7f0aa442 (at 10.151.53.134@o2ib), client will retry: rc -110
      Oct 14 08:12:57 nbp7-oss7 kernel: [1111726.231189] Lustre: nbp7-OST000f: Client c1bb3af1-465f-4e4e-3b45-831f7f0aa442 (at 10.151.53.134@o2ib) reconnecting
      Oct 14 08:12:57 nbp7-oss7 kernel: [1111726.265819] Lustre: Skipped 5 previous similar messages
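
One way to check whether these errors are repeated retries of a single request is to group the `truncated bulk` lines by transfer ID (the `x...` field). The sketch below is a hypothetical helper, not part of Lustre. It uses the three error lines quoted above, cut off after the `lens` field, and it reads the `N(M)` pair as bytes transferred versus bytes expected, which is an assumption of this sketch.

```python
import re
from collections import defaultdict

# The three "truncated bulk READ" lines from the log above, cut after "lens 488/440".
LOG_LINES = [
    "Oct 14 07:53:28 nbp7-oss7 kernel: [1110557.738283] LustreError: 48242:0:(ldlm_lib.c:3268:target_bulk_io()) @@@ truncated bulk READ 0(270336)  req@ffff8bd7711d9850 x1647322708401808/t0(0) o3->c1bb3af1-465f-4e4e-3b45-831f7f0aa442@10.151.53.134@o2ib:520/0 lens 488/440",
    "Oct 14 08:00:58 nbp7-oss7 kernel: [1111007.747557] LustreError: 7338:0:(ldlm_lib.c:3268:target_bulk_io()) @@@ truncated bulk READ 0(270336)  req@ffff8bd4ef6c3050 x1647322708401808/t0(0) o3->c1bb3af1-465f-4e4e-3b45-831f7f0aa442@10.151.53.134@o2ib:220/0 lens 488/440",
    "Oct 14 08:11:48 nbp7-oss7 kernel: [1111657.759009] LustreError: 82937:0:(ldlm_lib.c:3268:target_bulk_io()) @@@ truncated bulk READ 0(270336)  req@ffff8bd724b11850 x1647322708401808/t0(0) o3->c1bb3af1-465f-4e4e-3b45-831f7f0aa442@10.151.53.134@o2ib:132/0 lens 488/440",
]

# Matches the operation, the N(M) byte counts, the xid and the client NID.
PATTERN = re.compile(
    r"truncated bulk (?P<op>READ|WRITE) (?P<got>\d+)\((?P<want>\d+)\)"
    r".*? x(?P<xid>\d+)/t\d+\(\d+\) o\d+->(?P<uuid>[0-9a-f-]+)@(?P<nid>[^:\s]+):"
)


def group_by_xid(lines):
    """Return {xid: [(timestamp, client NID, bytes got, bytes wanted), ...]}."""
    by_xid = defaultdict(list)
    for line in lines:
        m = PATTERN.search(line)
        if m:
            # The first 15 characters of a syslog line are the "Mon DD HH:MM:SS" stamp.
            by_xid[m["xid"]].append(
                (line[:15], m["nid"], int(m["got"]), int(m["want"]))
            )
    return dict(by_xid)


groups = group_by_xid(LOG_LINES)
for xid, hits in groups.items():
    print(f"xid {xid}: {len(hits)} truncated bulk error(s)")
    for when, nid, got, want in hits:
        print(f"  {when}  {nid}  {got} of {want} bytes")
```

For the excerpt above, all three errors carry the same xid, 1647322708401808, from client 10.151.53.134@o2ib, at 07:53:28, 08:00:58 and 08:11:48.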
      

      We have the three patches from LU-12385 and LU-12772 applied to both client and server.
      Client Configs

      r593i6n16 ~ # lnetctl global show
      global:
          numa_range: 0
          max_intf: 200
          discovery: 1
          drop_asym_route: 0
          retry_count: 0
          transaction_timeout: 100
          health_sensitivity: 0
          recovery_interval: 1
      ----
      options ko2iblnd require_privileged_port=0 use_privileged_port=0
      options ko2iblnd timeout=100 retry_count=7 peer_timeout=0 map_on_demand=32 peer_credits=32 concurrent_sends=32
      options lnet networks=o2ib(ib1)
      options lnet avoid_asym_router_failure=1 check_routers_before_use=1 small_router_buffers=65536 large_router_buffers=8192
      options lnet lnet_transaction_timeout=100
      options ptlrpc at_max=600 at_min=200
      

      Server Configs

      nbp7-oss7 ~ # lnetctl global show
      global:
          numa_range: 0
          max_intf: 200
          discovery: 1
          drop_asym_route: 0
          retry_count: 0
          transaction_timeout: 100
          health_sensitivity: 0
          recovery_interval: 1
      ----
      options ko2iblnd require_privileged_port=0 use_privileged_port=0
      options ko2iblnd ntx=251072 credits=125536 fmr_pool_size=62769 
      options ko2iblnd timeout=100 retry_count=7 peer_timeout=0 map_on_demand=32 peer_credits=32 concurrent_sends=32
      
      
      #lnet
      options lnet networks=o2ib(ib1) 
      options lnet routes="o2ib233 10.151.26.[80-94]@o2ib; o2ib313 10.151.25.[195-197,202-205,222]@o2ib 10.151.26.[60,127,140-144,150-154]@o2ib; o2ib417 10.151.26.[148,149]@o2ib 10.151.25.[167-170]@o2ib"
      options lnet dead_router_check_interval=60 live_router_check_interval=30
      options lnet avoid_asym_router_failure=1 check_routers_before_use=1 small_router_buffers=65536 large_router_buffers=8192
      options lnet lnet_transaction_timeout=100
      options ptlrpc at_max=600 at_min=200
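
Since the report stresses that client and server are configured alike, it can help to compare the two module option sets mechanically. The following is a minimal sketch, not a Lustre tool: `parse_options` and `diff_params` are hypothetical helper names, and only the `ko2iblnd` lines from the two configs above are included.

```python
import shlex

CLIENT = """\
options ko2iblnd require_privileged_port=0 use_privileged_port=0
options ko2iblnd timeout=100 retry_count=7 peer_timeout=0 map_on_demand=32 peer_credits=32 concurrent_sends=32
"""

SERVER = """\
options ko2iblnd require_privileged_port=0 use_privileged_port=0
options ko2iblnd ntx=251072 credits=125536 fmr_pool_size=62769
options ko2iblnd timeout=100 retry_count=7 peer_timeout=0 map_on_demand=32 peer_credits=32 concurrent_sends=32
"""


def parse_options(text):
    """Collect `options <module> key=value ...` lines into {module: {key: value}}."""
    modules = {}
    for line in text.splitlines():
        parts = shlex.split(line)  # keeps quoted values such as routes="..." together
        if len(parts) < 3 or parts[0] != "options":
            continue
        params = modules.setdefault(parts[1], {})
        for item in parts[2:]:
            key, sep, value = item.partition("=")
            if sep:
                params[key] = value
    return modules


def diff_params(a, b):
    """Return (keys only in a, keys only in b, {key: (a value, b value)} for differing values)."""
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    changed = {k: (a[k], b[k]) for k in sorted(set(a) & set(b)) if a[k] != b[k]}
    return only_a, only_b, changed


client = parse_options(CLIENT)["ko2iblnd"]
server = parse_options(SERVER)["ko2iblnd"]
only_client, only_server, changed = diff_params(client, server)
print("client only:", only_client)
print("server only:", only_server)
print("different values:", changed)
```

For the `ko2iblnd` lines shown, the shared parameters have identical values on both sides. The server additionally sets `ntx`, `credits` and `fmr_pool_size`.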
      

      These bulk errors are a major issue. We thought matching the client and server at 2.12.2 would solve it, but it doesn't look like it did.

      Attachments

        1. 0001-LU-12856-debug.patch
          2 kB
        2. 0002-LU-12856-revert-LU-9983.patch
          2 kB
        3. client.10.151.53.134.debug.gz
          805 kB
        4. nbp7-mds.2019-10-16.out.gz
          65.90 MB
        5. r407i2n14.out2.gz
          50.70 MB
        6. server.xid.1647322708401808.debug.gz
          78.16 MB

          Activity

            [LU-12856] LustreError: 82937:0:(ldlm_lib.c:3268:target_bulk_io()) @@@ truncated bulk READ 0(270336)

            mhanafi Mahmoud Hanafi added a comment -

            The short_io_bytes parameter doesn't show up in /proc or /sys.

            r401i0n3 ~ # ls -l /sys/kernel/debug/lustre/osc/nbp2-OST0000-osc-ffff9bd2999f5000
            total 0
            -r--r--r-- 1 root root 0 Oct 31 09:43 srpc_contexts
            -r--r--r-- 1 root root 0 Oct 31 09:43 srpc_info
            --w------- 1 root root 0 Oct 31 09:43 srpc_sepol
            -rw-r--r-- 1 root root 0 Oct 31 09:43 stats
            r401i0n3 ~ # ls -l /proc/fs/lustre/osc/nbp2-OST0000-osc-ffff9bd2999f5000/
            total 0
            -rw-r--r-- 1 root root 0 Oct 31 10:16 checksum_type
            -r--r--r-- 1 root root 0 Oct 31 10:16 connect_flags
            -rw-r--r-- 1 root root 0 Oct 31 10:16 cur_grant_bytes
            -rw-r--r-- 1 root root 0 Oct 31 10:16 import
            -rw-r--r-- 1 root root 0 Oct 31 10:16 max_pages_per_rpc
            -rw-r--r-- 1 root root 0 Oct 31 10:16 osc_cached_mb
            -rw-r--r-- 1 root root 0 Oct 31 10:16 osc_stats
            -r--r--r-- 1 root root 0 Oct 31 10:16 ost_server_uuid
            -rw-r--r-- 1 root root 0 Oct 31 10:16 pinger_recov
            -rw-r--r-- 1 root root 0 Oct 31 10:16 rpc_stats
            -r--r--r-- 1 root root 0 Oct 31 10:16 state
            -r--r--r-- 1 root root 0 Oct 31 10:16 timeouts
            -r--r--r-- 1 root root 0 Oct 31 10:16 unstable_stats
            r401i0n3 ~ # modinfo lustre
            filename:       /lib/modules/4.12.14-95.19.1.20190617-nasa/updates/lustre-client/fs/lustre.ko
            license:        GPL
            version:        2.12.2
            description:    Lustre Client File System
            author:         OpenSFS, Inc. <http://www.lustre.org/>
            suserelease:    SLE12-SP4
            srcversion:     DA4405C9155F0D08BCC73DB
            depends:        obdclass,ptlrpc,libcfs,lnet,lmv,mdc,lov
            retpoline:      Y
            vermagic:       4.12.14-95.19.1.20190617-nasa SMP mod_unload modversions retpoline

            r401i0n3 ~ # lctl get_param osc.nbp2-OST0000-osc-ffff9bd2999f5000.short_io_bytes
            osc.nbp2-OST0000-osc-ffff9bd2999f5000.short_io_bytes=0

            tappro Mikhail Pershin added a comment -

            Mahmoud, that is useful info, thanks. I will check the related code first.

            mhanafi Mahmoud Hanafi added a comment -

            I tested with lctl set_param osc.*.short_io_bytes=0 on 100 clients, and the bulk timeout issue did not reproduce.

            So it looks like we don't need to revert the patch. What debug info would you like in order to diagnose the issue further?

            tappro Mikhail Pershin added a comment -

            Mahmoud, can you explicitly disable short_io on the client and run the tests?
            lctl set_param osc.*.short_io_bytes=0
            The patch you reverted fixes the short_io feature agreement between server and client, which explains the client-server matrix of errors: only after that patch are clients really able to use the short_io feature. So it seems the real source of the problem is that feature, and we can check that by disabling it on the clients.

            pjones Peter Jones added a comment -

            Mike

            Can you identify what the problem is with the LU-1757 patch?

            Peter

            mhanafi Mahmoud Hanafi added a comment - - edited

            Doing git bisect between 2.10.57 and 2.10.56, I was able to identify the commit that introduced this issue. It is

            commit 3483e195314bddb8d72594ebb10307c83a4bb860
            Author: Patrick Farrell <paf@cray.com>
            Date:   Thu Dec 7 07:00:58 2017 -0600
            
                LU-1757 brw: Fix short i/o and enable for mdc
                
                The short i/o flag was left out of the OST flags in the
                original patch, meaning it was not really on.  Also, the
                short_io_size value was used uninitialized, meaning it
                was sometimes non-zero, which coudl lead to several issues.
                
                Also add the short i/o flag to the MDC/MDT for data on MDT.
                Quick testing suggests this works fine with no further
                changes.
                
                Cray-bug-id: LUS-187
                Signed-off-by: Patrick Farrell <paf@cray.com>
                Change-Id: I4154b87d5ad73b53467b0382368fad7c5ba177fe
                Reviewed-on: https://review.whamcloud.com/30435
                Tested-by: Jenkins
                Reviewed-by: Mike Pershin <mike.pershin@intel.com>
                Reviewed-by: Alexandr Boyko <c17825@cray.com>
                Tested-by: Maloo <hpdd-maloo@intel.com>
                Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
                Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>
            

            I tested reverting the commit in 2.11.0 and it fixed the issue. I tested the revert in a 2.12.2 client and it DID NOT fix the issue. Now I am really confused.

            2.11.0 is sles12sp3 and 2.12.2 is sles12sp4.


            Here is the debug output. This was a 2.11.0 client with a 2.10.8 server plus the debug patch.
            Client Side

            00000100:00000200:24.0F:1572319805.927751:0:5247:0:(niobuf.c:884:ptl_send_rpc()) Setup reply buffer: 8192 bytes, xid 1648686524462144, portal 4
            00000100:00000200:24.0:1572319805.927756:0:5247:0:(niobuf.c:85:ptl_send_buf()) Sending 608 bytes to portal 6, xid 1648686524462144, offset 0
            00000400:00000200:24.0:1572319805.927759:0:5247:0:(lib-move.c:2983:LNetPut()) LNetPut -> 12345-10.151.27.39@o2ib
            00000400:00000200:24.0:1572319805.927764:0:5247:0:(lib-move.c:2113:lnet_select_pathway()) TRACE: 10.151.4.80@o2ib(10.151.4.80@o2ib:<?>) -> 10.151.27.39@o2ib(10.151.27.39@o2ib:10.151.27.39@o2ib) : PUT
            00000800:00000200:24.0:1572319805.927767:0:5247:0:(o2iblnd_cb.c:1617:kiblnd_send()) sending 608 bytes in 1 frags to 12345-10.151.27.39@o2ib
            00000800:00000200:24.0:1572319805.927771:0:5247:0:(o2iblnd.c:405:kiblnd_find_peer_locked()) got peer_ni [ffff88090474fc40] -> 10.151.27.39@o2ib (2) version: 12
            00000800:00000200:24.0:1572319805.927772:0:5247:0:(o2iblnd_cb.c:1498:kiblnd_launch_tx()) conn[ffff88103ff79c00] (68)++
            00000800:00000200:24.0:1572319805.927773:0:5247:0:(o2iblnd_cb.c:1273:kiblnd_queue_tx_locked()) conn[ffff88103ff79c00] (69)++
            00000800:00000200:24.0:1572319805.927776:0:5247:0:(o2iblnd_cb.c:1504:kiblnd_launch_tx()) conn[ffff88103ff79c00] (70)--
            00000100:00000200:24.0F:1572319806.043820:0:5247:0:(niobuf.c:429:ptlrpc_register_bulk()) Setup 1 bulk put-sink buffers: 1 pages 4096 bytes, mbits x0x5db78bea32fe0-0x5db78bea32fe0, portal 8
            00000100:00000200:24.0:1572319806.043826:0:5247:0:(niobuf.c:884:ptl_send_rpc()) Setup reply buffer: 1024 bytes, xid 1648686524477408, portal 4
            00000100:00000200:24.0:1572319806.043827:0:5247:0:(niobuf.c:85:ptl_send_buf()) Sending 608 bytes to portal 6, xid 1648686524477408, offset 0
            00000400:00000200:24.0:1572319806.043830:0:5247:0:(lib-move.c:2983:LNetPut()) LNetPut -> 12345-10.151.27.39@o2ib
            00000400:00000200:24.0:1572319806.043835:0:5247:0:(lib-move.c:2113:lnet_select_pathway()) TRACE: 10.151.4.80@o2ib(10.151.4.80@o2ib:<?>) -> 10.151.27.39@o2ib(10.151.27.39@o2ib:10.151.27.39@o2ib) : PUT
            00000800:00000200:24.0:1572319806.043837:0:5247:0:(o2iblnd_cb.c:1617:kiblnd_send()) sending 608 bytes in 1 frags to 12345-10.151.27.39@o2ib
            00000800:00000200:24.0:1572319806.043839:0:5247:0:(o2iblnd.c:405:kiblnd_find_peer_locked()) got peer_ni [ffff88090474fc40] -> 10.151.27.39@o2ib (2) version: 12
            00000800:00000200:24.0:1572319806.043840:0:5247:0:(o2iblnd_cb.c:1498:kiblnd_launch_tx()) conn[ffff88103ff79c00] (70)++
            00000800:00000200:24.0:1572319806.043841:0:5247:0:(o2iblnd_cb.c:1273:kiblnd_queue_tx_locked()) conn[ffff88103ff79c00] (71)++
            00000800:00000200:24.0:1572319806.043843:0:5247:0:(o2iblnd_cb.c:1504:kiblnd_launch_tx()) conn[ffff88103ff79c00] (73)--
            00000100:00000200:24.0:1572319806.044181:0:5247:0:(events.c:93:reply_in_callback()) @@@ type 6, status 0  req@ffff8807b7d589c0 x1648686524477408/t0(0) o3->nbptest-OST0001-osc-ffff88085dfaf800@10.151.27.39@o2ib:6/4 lens 608/440 e 0 to 0 dl 1572320211 ref 2 fl Rpc
            

            Server Side

             00000100:00000200:2.0:1572319602.131825:0:19476:0:(service.c:2540:ptlrpc_main()) service thread 1 (#2) started
            00000100:00000200:14.0:1572319805.872010:0:19476:0:(service.c:2094:ptlrpc_server_handle_request()) got req 1648686524462144
            00000100:00000200:10.0:1572319805.887904:0:19476:0:(niobuf.c:193:ptlrpc_start_bulk_transfer()) NASA rq_mbits = 0x0, rq->rq_xid = 0x5db78bea2f440, mask = 0xfffffffffffffff0
            00000100:00000200:10.0:1572319805.887906:0:19476:0:(niobuf.c:207:ptlrpc_start_bulk_transfer()) NASA posted_md = 0, total_md = 1, mbits = 0x0
            00000400:00000200:10.0:1572319805.887913:0:19476:0:(lib-move.c:2796:LNetPut()) LNetPut -> 12345-10.151.4.80@o2ib
            00000400:00000200:10.0:1572319805.887925:0:19476:0:(lib-move.c:1930:lnet_select_pathway()) TRACE: 10.151.27.39@o2ib(10.151.27.39@o2ib:10.151.27.39@o2ib) -> 10.151.4.80@o2ib(10.151.4.80@o2ib:10.151.4.80@o2ib) : PUT
            00000800:00000200:10.0:1572319805.887930:0:19476:0:(o2iblnd_cb.c:1510:kiblnd_send()) sending 4096 bytes in 1 frags to 12345-10.151.4.80@o2ib
            00000800:00000200:10.0:1572319805.887933:0:19476:0:(o2iblnd_cb.c:703:kiblnd_setup_rd_kiov()) niov 1 offset 0 nob 4096
            00000800:00000200:10.0:1572319805.887938:0:19476:0:(o2iblnd.c:405:kiblnd_find_peer_locked()) got peer_ni [ffff89169ab0b280] -> 10.151.4.80@o2ib (2) version: 12
            00000800:00000200:10.0:1572319805.887940:0:19476:0:(o2iblnd_cb.c:1391:kiblnd_launch_tx()) conn[ffff8916777d4200] (68)++
            00000800:00000200:10.0:1572319805.887942:0:19476:0:(o2iblnd_cb.c:1166:kiblnd_queue_tx_locked()) conn[ffff8916777d4200] (69)++
            00000800:00000200:10.0:1572319805.887946:0:19476:0:(o2iblnd_cb.c:1397:kiblnd_launch_tx()) conn[ffff8916777d4200] (70)--
            00000100:00000200:10.0:1572319805.887949:0:19476:0:(niobuf.c:268:ptlrpc_start_bulk_transfer()) NASA Transferring 1 pages 4096 bytes via portal 8 id 12345-10.151.4.80@o2ib mbits 0x0-0x0 posted_md = 1
            00010000:00020000:0.0:1572319905.887258:0:19476:0:(ldlm_lib.c:3239:target_bulk_io()) @@@ timeout on bulk READ after 100+0s  req@ffff8915c5c3f850 x1648686524462144/t0(0) o3->22dff754-ac95-23ef-d877-0f51317e2d84@10.151.4.80@o2ib:95/0 lens 608/432 e 0 to 0 dl 1572320060 ref 1 fl Interpret:/0/0 rc 0/0
            00000100:00000200:0.0:1572319905.887272:0:19476:0:(events.c:449:server_bulk_callback()) event type 6, status 0, desc ffff8915bf42fe00
            00000400:00000200:0.0:1572319905.887274:0:19476:0:(lib-md.c:69:lnet_md_unlink()) Unlinking md ffff891679fdb480
            00000020:02000400:0.0:1572319905.887327:0:19476:0:(tgt_handler.c:2046:tgt_brw_read()) nbptest-OST0000: Bulk IO read error with 22dff754-ac95-23ef-d877-0f51317e2d84 (at 10.151.4.80@o2ib), client will retry: rc -110
            00010000:00000080:0.0:1572319905.887337:0:19476:0:(ldlm_lib.c:2883:target_committed_to_req()) @@@ not sending last_committed update (0/1)  req@ffff8915c5c3f850 x1648686524462144/t0(0) o3->22dff754-ac95-23ef-d877-0f51317e2d84@10.151.4.80@o2ib:95/0 lens 608/432 e 0 to 0 dl 1572320060 ref 1 fl Interpret:/0/ffffffff rc -110/-1
            00000100:00000200:0.0:1572319905.887360:0:19476:0:(niobuf.c:958:ptlrpc_register_rqbd()) LNetMEAttach: portal 6
            
            mhanafi Mahmoud Hanafi added a comment (the debug output above)
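
The client log prints xids in decimal, while the server debug lines print `rq_xid` and `mbits` in hexadecimal, so correlating the two sides takes a base conversion. The sketch below only restates values copied from the debug output above. The variable names are made up for illustration, and it draws no conclusion about the root cause.

```python
from decimal import Decimal

# Values copied from the client-side excerpt.
TIMED_OUT_XID = 1648686524462144   # first request sent; the server later times out on this xid
SECOND_XID = 1648686524477408      # second request, sent after ptlrpc_register_bulk()
CLIENT_MBITS = 0x5db78bea32fe0     # "mbits x0x5db78bea32fe0-0x5db78bea32fe0"

# Values copied from the server-side excerpt.
SERVER_RQ_XID = 0x5db78bea2f440    # "rq->rq_xid = 0x5db78bea2f440"
SERVER_MBITS = 0x0                 # "mbits 0x0-0x0"

# Server timestamps: bulk transfer started, then the timeout was reported.
BULK_START = Decimal("1572319805.887949")
BULK_TIMEOUT = Decimal("1572319905.887258")

print(f"timed-out xid {TIMED_OUT_XID} = {TIMED_OUT_XID:#x}")
print(f"second xid    {SECOND_XID} = {SECOND_XID:#x}")
print("server rq_xid matches timed-out xid:", SERVER_RQ_XID == TIMED_OUT_XID)
print("client mbits equals second xid:     ", CLIENT_MBITS == SECOND_XID)
print(f"server bulk mbits: {SERVER_MBITS:#x}")

elapsed = BULK_TIMEOUT - BULK_START
print(f"seconds between bulk start and timeout: {elapsed}")
```

In this excerpt, the request that times out on the server (xid 1648686524462144) is the first one the client sent. The `mbits` value the client registered belongs to the second request (xid 1648686524477408), and the server started its bulk transfer with `mbits` 0x0. The gap between the server's bulk start and the timeout message is just under 100 seconds, in line with the "after 100+0s" in the log.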

            ssmirnov Serguei Smirnov added a comment -

            Hi mhanafi,

            LU-9983 introduced a change that is not in 2.10.56 but is in 2.10.57: "1f50b1e494ff1b4988508c6d6398ee6769467931 LU-9983 osp: align the OSP request size by 4k". This change is supposed to make sure that pages are aligned so that fragmentation doesn't occur.

            Could you please try the attached 0002-LU-12856-revert-LU-9983.patch on top of 2.10.57? It reverts the fragmentation-related changes introduced by LU-9983 (which are in both 2.10.56 and 2.10.57) and adds a debug message that is printed if fragmentation does occur.

            If the errors can still be seen, it will mean that some other change that came in between 2.10.56 and 2.10.57 is causing them.

            Thanks,

            Serguei.

            mhanafi Mahmoud Hanafi added a comment - - edited

            I reverted

             7954a52042 - LU-9983 ko2iblnd: allow for discontiguous fragments

            at 2.10.55. Doing so fixed all the bulk timeout issues. But I can't easily revert it at 2.12.2 to absolutely verify that this introduced the issue. Can you take a closer look at this commit, and maybe give us a patch that reverts it for 2.12.2 to try?

             

            Here are the results of the tests:

            Client                                Server    Pass/Fail
            2.10.8                                2.10.8    Pass
            2.10.8                                2.12.2    Pass
            2.12.2                                2.12.2    Fail
            2.12.0                                2.10.8    Fail
            2.12.2                                2.10.8    Fail
            2.11.0                                2.10.8    Fail
            tag: 2.10.57                          2.10.8    Fail
            tag: 2.10.56                          2.10.8    Fail with write bulk errors
            tag: 2.10.55                          2.10.8    Fail with write bulk errors
            tag: 2.10.54                          2.10.8    Pass
            tag: 2.10.56 with 7954a52 reverted    2.10.8    Pass

            (7954a52 is "LU-9983 ko2iblnd: allow for discontiguous fragments".)

            mhanafi Mahmoud Hanafi added a comment - - edited

            I ran a bunch more tests. Tag 2.10.54 is good, and tag 2.10.55 is very bad. Writes will even fail without rebooting the server. I will try the debug patch next.

            I tried to bisect, but I hit compilation errors.

             * 61f26ea47f - (HEAD) LU-9578 llite: use security context if it's enabled in the kernel (2 years ago)
            * 82e794e268 - LU-9452 ldlm: remove MSG_CONNECT_LIBCLIENT support (2 years ago)
            * 627d0133d9 - LU-7990 llite: increase whole-file readahead to RPC size (2 years ago)
            * 97671436c5 - LU-9405 utils: remove device path parsing from mount.lustre (2 years ago)
            * a3f734db9b - LU-9019 ofd: migrate to 64 bit time (2 years ago)
            * 685ef61f0f - LU-9814 ldiskfs: restore simple_strtol in prealloc (2 years ago)
            * 383ef1a93b - LU-4923 osd-ldiskfs: dirdata is not needed on MGS (2 years ago)
            * 1eb0573fde - LU-9782 osd-ldiskfs: avoid extra search (2 years ago)
            * 45900a7777 - LU-4134 obdclass: obd_device improvement (2 years ago)
            * 3187d551d5 - LU-9990 lnet: add backwards compatibility for YAML config (2 years ago)
            * 52a1befd75 - LU-9968 tests: correct stripe index sanity 300g (2 years ago)
            * d4f7bb22d8 - LU-9860 tests: Run command on MGS for conf-sanity 33a (2 years ago)
            * 66bb2d13f8 - (2.10.9956) LU-9956 kernel: kernel upgrade [SLES12 SP3 4.4.82-6.3] (2 years ago)
            * 66abf6ffe9 - LU-9140 nrs: measure the runtime of dd directly (2 years ago)
            * dd23aa4a64 - LU-10119 scripts: Correct shebang/hashpling format (2 years ago)
            * 45e5e76e32 - LU-8344 test: fix sanity 256 (2 years ago)
            * 9d06804860 - LU-10029 osd-ldiskfs: make project inherit attr removeable (2 years ago)
            * bdb0407957 - LU-9416 hsm: add kkuc before sending registration RPCs (2 years ago)
            * a1eb6de081 - LU-9752 man: Reference zgenhostid instead of genhostid (2 years ago)
            * 84f690eee2 - LU-9469 ldiskfs: add additional attach_jinode call (2 years ago)
            * 20787a89ad - LU-9908 tests: force umount client in test 70e, 41b, and 105 (2 years ago)
            * 7954a52042 - LU-9983 ko2iblnd: allow for discontiguous fragments (2 years ago)
            * 7a024b535b - LU-10051 build: Build with ZFS 0.7.2 (2 years ago)
            * 3a2f24fefb - LU-9158 test: Use project ID for project quota for quota_scan (2 years ago)
            * e81847bd06 - LU-9660 ptlrpc: do not wakeup every second (2 years ago)
            * d94de5c04e - LU-5170 lfs: Standardize error messages in lfs_setstripe() (2 years ago)
            * 1cc354d559 - LU-9741 test: Correct check of stripe count for directories (2 years ago)
            * f354258015 - LU-9672 gss: fix expiration time of sunrpc cache (2 years ago)
            * 481bd7754d - LU-9590 tests: remove replay-single tests from ALWAYS_EXCEPT (2 years ago)
            * aa83ef5a9e - LU-9462 doc: update lfs setstripe man page and usage (2 years ago)
            * 29293649b5 - LU-8721 tests: add parallel-scale fio test (2 years ago)
            * 036641a1e1 - LU-10047 tests: stop skipping test_102 subtests (2 years ago)
            * 30f82889c1 - (2.10.54, refs/bisect/good-30f82889c194ce9a05a9659cea81cc8e5078d0b1) New tag 2.10.54 (2 years ago)
            

             

             In file included from /GITCHECKOUT/lustre-nas/libcfs/include/libcfs/libcfs.h:39:0,
                             from /GITCHECKOUT/lustre-nas/lustre/include/obd_support.h:45,
                             from /GITCHECKOUT/lustre-nas/lustre/ptlrpc/wiretest.c:40:
            /GITCHECKOUT/lustre-nas/lustre/ptlrpc/wiretest.c: In function '...':
            /GITCHECKOUT/lustre-nas/lustre/ptlrpc/wiretest.c:1794:11: error: 'LMV_HASH_FLAG_DEAD' undeclared (first use in this function)
              CLASSERT(LMV_HASH_FLAG_DEAD == 0x40000000);
                       ^
            /GITCHECKOUT/lustre-nas/libcfs/include/libcfs/libcfs_private.h:360:46: note: in definition of macro 'CLASSERT'
             #define CLASSERT(cond) do {switch (1) {case (cond): case 0: break; } } while (0)
                                                          ^
            /GITCHECKOUT/lustre-nas/lustre/ptlrpc/wiretest.c:1794:11: note: each undeclared identifier is reported only once for each function it appears in
              CLASSERT(LMV_HASH_FLAG_DEAD == 0x40000000);
                       ^
            /GITCHECKOUT/lustre-nas/libcfs/include/libcfs/libcfs_private.h:360:46: note: in definition of macro 'CLASSERT'
             #define CLASSERT(cond) do {switch (1) {case (cond): case 0: break; } } while (0)
                                                          ^
            In file included from /usr/src/linux-4.4.143-94.47.1.20180815nasa/include/linux/linkage.h:4:0,
                             from /usr/src/linux-4.4.143-94.47.1.20180815nasa/include/linux/fs.h:4,
                             from /GITCHECKOUT/lustre-nas/lustre/ptlrpc/wiretest.c:36:
            /GITCHECKOUT/lustre-nas/lustre/ptlrpc/wiretest.c:3550:23: error: invalid application of 'sizeof' to incomplete type 'struct mgs_send_param'
              LASSERTF((int)sizeof(struct mgs_send_param) == 1024, "found %lld\n",
                                   ^
            /usr/src/linux-4.4.143-94.47.1.20180815nasa/include/linux/compiler.h:166:42: note: in definition of macro 'unlikely'
             # define unlikely(x) __builtin_expect(!!(x), 0)
                                                      ^
            /GITCHECKOUT/lustre-nas/lustre/ptlrpc/wiretest.c:3550:2: note: in expansion of macro 'LASSERTF'
              LASSERTF((int)sizeof(struct mgs_send_param) == 1024, "found %lld\n",
              ^
            In file included from /GITCHECKOUT/lustre-nas/libcfs/include/libcfs/libcfs.h:39:0,
                             from /GITCHECKOUT/lustre-nas/lustre/include/obd_support.h:45,
                             from /GITCHECKOUT/lustre-nas/lustre/ptlrpc/wiretest.c:40:
            /GITCHECKOUT/lustre-nas/lustre/ptlrpc/wiretest.c:3551:27: error: invalid application of 'sizeof' to incomplete type 'struct mgs_send_param'
                (long long)(int)sizeof(struct mgs_send_param));
                                       ^
            /GITCHECKOUT/lustre-nas/libcfs/include/libcfs/libcfs_private.h:88:9: note: in definition of macro '...'
                  ## __VA_ARGS__);   \
                     ^
            /GITCHECKOUT/lustre-nas/lustre/ptlrpc/wiretest.c:3552:11: error: '...' undeclared (first use in this function)
            g
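
When intermediate commits fail to compile, as in the log above, `git bisect` can be told to skip them instead of aborting: a script passed to `git bisect run` that exits with status 125 marks the current commit as untestable. The following is a minimal sketch of such a wrapper, not something taken from this ticket. The script name `bisect_step.py` and the build and reproducer commands passed to it are hypothetical placeholders.

```python
#!/usr/bin/env python3
"""Hypothetical helper for `git bisect run`: skip commits that do not build."""
import shlex
import subprocess
import sys

# Exit codes that `git bisect run` interprets.
GOOD = 0    # commit is good
BAD = 1     # commit is bad (any status 1-127 other than 125 means bad)
SKIP = 125  # commit cannot be tested, for example because it does not compile


def bisect_exit_code(build_ok, test_ok):
    """Map build and test outcomes to the exit status git bisect expects."""
    if not build_ok:
        return SKIP
    return GOOD if test_ok else BAD


def succeeded(cmd):
    """Run a command (list of arguments) and report whether it exited with 0."""
    return subprocess.run(cmd).returncode == 0


def main(build_cmd, test_cmd):
    build_ok = succeeded(build_cmd)
    # Only run the reproducer when the tree actually built.
    test_ok = build_ok and succeeded(test_cmd)
    return bisect_exit_code(build_ok, test_ok)


# Commands are only executed when both are supplied on the command line.
if __name__ == "__main__" and len(sys.argv) == 3:
    sys.exit(main(shlex.split(sys.argv[1]), shlex.split(sys.argv[2])))
```

Assuming the tags from the test table, a run could look like `git bisect start 2.10.55 2.10.54` followed by `git bisect run python3 bisect_step.py "<build command>" "<reproducer command>"`. Commits that fail to build are then skipped, and git reports a range of candidate commits if the first bad one cannot be isolated.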

            ashehata Amir Shehata (Inactive) added a comment

            Hi Mahmoud,

            Attached is a short patch to collect some more info around the problem message. I'll probably be adding more debug patches to try and track down the problem. We're also trying to reproduce it locally.

            0001-LU-12856-debug.patch

            People

              tappro Mikhail Pershin
              mhanafi Mahmoud Hanafi