ppc64le: "LNetError: RDMA has too many fragments for peer_ni" when reading two files

Details

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.10.6, Lustre 2.12.1
    • Labels: None
    • Severity: 3

    Description

      I'm trying to install and configure a Lustre client on an IBM POWER9 ppc64le machine to mount an existing Lustre file system through a router. The POWER9 host is similar to an ORNL Summit worker node, and we are going to use it for debugging software before running it on leadership facilities, so this case may be of interest to others.

      At first there were issues connecting to LNet (lctl ping); they were resolved after explicitly setting options in /etc/modprobe.d/ko2iblnd.conf as per LU-3322:

        map_on_demand=16 - on ibmpower9

        map_on_demand=256 - on x86_64 all servers and router.

      Lustre was restarted and modules were reloaded after these changes.

      Now I can mount Lustre on the POWER9 client and read a SINGLE file with dd. This works for files located on any of the six OSTs, but only when doing one read transfer at a time. I did not try writes.

      When I start two transfers in parallel, or start one transfer and then start the other 10 seconds later, I get an LNet error when the transfer of the second file starts. I can kill -9 the dd processes, but not always all of them; sometimes one of the processes cannot be killed with signal 9. Even when all I/O processes ("dd") are killed on the client, the router and servers continue to report errors in the logs, and I still observe I/O on both OSTs where the files reside. I cannot unmount Lustre on the POWER9 client or remove the modules.

      "lctl net unconfigure" reports "LNET busy". I have to reboot the POWER9 client. Only after the client reboot do the errors stop being reported on the servers and router.

      Attachments

        1. client.tgz
          2.66 MB
        2. router.tgz
          414 kB
        3. server.tgz
          616 kB

          Activity

            [LU-12419] ppc64le: "LNetError: RDMA has too many fragments for peer_ni" when reading two files

            simmonsja James A Simmons added a comment - 64K  page size for 2.10 will not be supported. Moving to  2.12 is the answer

            simmonsja James A Simmons added a comment - Does moving to 2.12 fix everything for you? Can this ticket be closed?

            simmonsja James A Simmons added a comment - Is everything working now for you?

            alex.ku Alex Kulyavtsev added a comment - Your router needs wrq_sge=2

            alex.ku Alex Kulyavtsev added a comment -

              map_on_demand=16 - on client with 64KB page

              map_on_demand=256 - on x86_64 (4KB page) all servers and router.
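One way to read the two map_on_demand values is as the number of page-sized fragments needed to cover a single bulk transfer. The sketch below assumes a 1 MiB transfer, which the ticket does not state; frags_per_transfer is a hypothetical helper written for illustration, not a Lustre tool.

```shell
#!/bin/sh
# Number of page-sized fragments needed to cover one bulk transfer.
# Usage: frags_per_transfer <transfer_bytes> <page_bytes>
# (hypothetical helper for illustration only)
frags_per_transfer() {
    # Round up so a partial trailing page still counts as a fragment.
    echo $(( ($1 + $2 - 1) / $2 ))
}

MIB=$((1024 * 1024))   # assumed bulk transfer size (not stated in the ticket)

frags_64k=$(frags_per_transfer "$MIB" $((64 * 1024)))  # ppc64le client, 64KB pages
frags_4k=$(frags_per_transfer "$MIB" $((4 * 1024)))    # x86_64 nodes, 4KB pages

echo "64KB pages: $frags_64k fragments per 1 MiB"
echo "4KB pages:  $frags_4k fragments per 1 MiB"
```

Under that assumption the 64KB-page client needs 16 fragments and the 4KB-page nodes need 256, which lines up with the two settings quoted in this comment.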

            alex.ku Alex Kulyavtsev added a comment - Attached configuration and debug files:

            Client:

            $ tar tvf client.tgz
            drwxr-xr-x  0 root   root        0 Jun 11 00:01 client/
            -rw-------  0 root   root    65260 Jun 10 15:28 client/debug_kernel.2.out
            drwxr-xr-x  0 root   root        0 Jun 11 00:02 client/etc/
            drwxr-xr-x  0 root   root        0 Jun 11 00:03 client/etc/modprobe.d/
            -rw-r--r--  0 root   root      183 Jun  3 18:04 client/etc/modprobe.d/ko2iblnd.conf
            -rw-r--r--  0 root   root       88 Mar  7  2018 client/etc/modprobe.d/lustre.conf
            -rw-r--r--  0 root   root      467 Jun  3 17:56 client/etc/lnet.conf
            -rw-r--r--  0 root   root      450 Jun  3 17:07 client/etc/lnet_routes.conf
            -rw-r--r--  0 root   root      903 Jun 10 15:43 client/ibstatus.out
            -rw-r--r--  0 root   root    10678 Jun 10 15:08 client/lnetctl.export.out
            -rw-r--r--  0 root   root     1422 Jun 10 15:42 client/ibstat.out
            -rw-r--r--  0 root   root     2303 Jun 10 15:08 client/systool.lnet.out
            -rw-r--r--  0 root   root     1992 Jun 10 15:08 client/systool.ko2iblnd.out
            -rw-r--r--  0 root   root 70350426 Jun 10 17:02 client/debug_kernel.3.out
            -rw-------  0 root   root   231798 Jun 10 15:15 client/debug_kernel.before.out
            -rwxr-xr-x  0 root   root      165 Jun 10 16:30 client/read-two.sh
            -rw-r--r--  0 root   root       75 Jun 10 15:38 client/rpms

            Router:

            $ tar tvf router.tgz
            drwxr-xr-x  0 root   root        0 Jun 11 00:01 router/
            -rw-r--r--  0 root   root     2944 Jun 10 15:43 router/ibv_devinfo.out
            drwxr-xr-x  0 root   root        0 Jun 10 23:57 router/etc/
            drwxr-xr-x  0 root   root        0 Jun 10 23:58 router/etc/modprobe.d/
            -rw-r--r--  0 root   root      252 Jun  4 17:35 router/etc/modprobe.d/ko2iblnd.conf
            -rw-r--r--  0 root   root      142 Mar 12  2018 router/etc/modprobe.d/lustre.conf
            -rw-r--r--  0 root   root      317 Aug  1  2018 router/etc/lnet.conf
            -rw-r--r--  0 root   root      406 Feb 14 13:27 router/etc/lnet_routes.conf
            -rw-r--r--  0 root   root      451 Jun 10 15:43 router/ibstatus.out
            -rw-r--r--  0 root   root    10493 Jun 10 15:35 router/lnetctl.export.out
            -rw-r--r--  0 root   root      705 Jun 10 15:42 router/ibstat.out
            -rw-r--r--  0 root   root     1987 Jun 10 15:34 router/systool.lnet.out
            -rw-r--r--  0 root   root     1884 Jun 10 15:34 router/systool.ko2iblnd.out
            -rw-r--r--  0 root   root 11576985 Jun 10 17:06 router/debug_kernel.3.out
            -rw-r--r--  0 root   root       73 Jun 10 15:39 router/rpms

            Server:

            $ tar tvf server.tgz
            drwxr-xr-x  0 root   root        0 Jun 11 00:06 server/
            -rw-r--r--  0 root   root     2087 Jun 10 15:43 server/ibv_devinfo.out
            drwxr-xr-x  0 root   root        0 Jun 11 00:05 server/etc/
            drwxr-xr-x  0 root   root        0 Jun 11 00:04 server/etc/modprobe.d/
            -rw-r--r--  0 root   root      252 Jun  4 17:35 server/etc/modprobe.d/ko2iblnd.conf
            -rw-r--r--  0 root   root      142 Mar 12  2018 server/etc/modprobe.d/lustre.conf
            -rw-r--r--  0 root   root      127 Jul 31  2018 server/etc/lnet.conf
            -rw-r--r--  0 root   root      406 Dec 21 16:45 server/etc/lnet_routes.conf
            -rw-r--r--  0 root   root      223 Jun 10 15:42 server/ibstatus.out
            -rw-r--r--  0 root   root    31618 Jun 10 15:33 server/lnetctl.export.out
            -rw-r--r--  0 root   root      351 Jun 10 15:42 server/ibstat.out
            -rw-r--r--  0 root   root     2113 Jun 10 15:33 server/systool.lnet.out
            -rw-r--r--  0 root   root     2010 Jun 10 15:33 server/systool.ko2iblnd.out
            -rw-r--r--  0 root   root 17240520 Jun 10 17:05 server/debug_kernel.3.out
            -rw-r--r--  0 root   root      324 Jun 10 15:39 server/rpms
            alex.ku Alex Kulyavtsev added a comment (edited):

            log after mount on the power9 client:

            Jun 10 14:48:50 ibmpower9 kernel: LNet: HW NUMA nodes: 6, HW CPU cores: 128, npartitions: 2

            Jun 10 14:48:50 ibmpower9 kernel: alg: No test for adler32 (adler32-zlib)

            Jun 10 14:48:51 ibmpower9 kernel: Lustre: Lustre: Build Version: 2.12.1

            Jun 10 14:48:51 ibmpower9 kernel: LNet: Using FastReg for registration

            Jun 10 14:48:51 ibmpower9 kernel: LNet: Added LNI 192.168.177.202@o2ib177 [8/256/0/180]

            Jun 10 14:48:51 ibmpower9 kernel: Lustre: Mounted lfstev-client

            Jun 10 14:50:31 ibmpower9 kernel: NVRM: Xid (PCI:0004:04:00): 43, Ch 00000010, engmask 00000101

            Jun 10 14:52:55 ibmpower9 kernel: NVRM: Xid (PCI:0004:04:00): 43, Ch 00000010, engmask 00000101

            Read two files with a 10-second delay (the first file is still being read when the second read starts):

            dd of=/dev/null bs=1M if=/lfstev/admin/aik/iotest/osd5/10.GB

            sleep 10

            dd of=/dev/null bs=1M if=/lfstev/admin/aik/iotest/osd4/10.GB
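For the first read to still be running when the second one starts, the first dd has to run in the background. The following is a minimal sketch of such a two-reader test. It is a hypothetical reconstruction: the attached read-two.sh is not shown in the ticket and may differ, and read_two is a name chosen here for illustration.

```shell
#!/bin/sh
# Hypothetical reconstruction of the two-reader test; the attached
# read-two.sh is not shown in the ticket and may differ.
# Usage: read_two <first_file> <second_file> [delay_seconds]
read_two() {
    first=$1
    second=$2
    delay=${3:-10}

    # Start the first read in the background so it is still running
    # when the second read begins.
    dd of=/dev/null bs=1M if="$first" &
    pid1=$!

    sleep "$delay"

    dd of=/dev/null bs=1M if="$second" &
    pid2=$!

    # Collect both exit codes; succeed only if both reads succeeded.
    wait "$pid1"; rc1=$?
    wait "$pid2"; rc2=$?
    [ "$rc1" -eq 0 ] && [ "$rc2" -eq 0 ]
}

# Example with the paths from the report:
# read_two /lfstev/admin/aik/iotest/osd5/10.GB /lfstev/admin/aik/iotest/osd4/10.GB 10
```

Keeping the reads in a function with the delay as a parameter makes it easy to try the "in parallel" case (delay 0) as well as the 10-second case described above.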

            Client errors:

            Jun 10 16:32:21 ibmpower9 kernel: Lustre: Unmounted lfstev-client

            Jun 10 16:32:23 ibmpower9 kernel: Lustre: Mounted lfstev-client

            Jun 10 16:32:56 ibmpower9 kernel: LustreError: 101884:0:(events.c:200:client_bulk_callback()) event type 2, status -90, desc c000207265a20a00

            Jun 10 16:32:56 ibmpower9 kernel: Lustre: 101916:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1560202375/real 1560202375]  req@c000007fb6779480 x1635984221122096/t0(0) o3->lfstev-OST0005-osc-c0002072ee3bf800@192.168.176.146@o2ib:6/4 lens 488/440 e 0 to 1 dl 1560202382 ref 2 fl Rpc:eX/0/ffffffff rc 0/-1

            Jun 10 16:32:56 ibmpower9 kernel: Lustre: lfstev-OST0005-osc-c0002072ee3bf800: Connection to lfstev-OST0005 (at 192.168.176.146@o2ib) was lost; in progress operations using this service will wait for recovery to complete

            Jun 10 16:32:56 ibmpower9 kernel: LustreError: 101884:0:(events.c:200:client_bulk_callback()) event type 2, status -90, desc c000207265a29c00

            Jun 10 16:32:56 ibmpower9 kernel: Lustre: lfstev-OST0004-osc-c0002072ee3bf800: Connection restored to 192.168.176.145@o2ib (at 192.168.176.145@o2ib)

            Jun 10 16:32:56 ibmpower9 kernel: LustreError: 101882:0:(events.c:200:client_bulk_callback()) event type 2, status -90, desc c000207265a29c00

            Jun 10 16:32:56 ibmpower9 kernel: LustreError: 101883:0:(events.c:200:client_bulk_callback()) event type 2, status -90, desc c000207265a29c00

            Jun 10 16:32:56 ibmpower9 kernel: LustreError: 101883:0:(events.c:200:client_bulk_callback()) event type 2, status -90, desc c000207265a20a00

            ...

            Jun 10 16:32:56 ibmpower9 kernel: LustreError: 101882:0:(events.c:200:client_bulk_callback()) event type 2, status -90, desc c000207265a20a00

            Jun 10 16:32:56 ibmpower9 kernel: Lustre: 101916:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1560202376/real 1560202376]  req@c000007fb6779480 x1635984221122096/t0(0) o3->lfstev-OST0005-osc-c0002072ee3bf800@192.168.176.146@o2ib:6/4 lens 488/440 e 0 to 1 dl 1560202383 ref 2 fl Rpc:eX/2/ffffffff rc 0/-1

            Jun 10 16:32:56 ibmpower9 kernel: Lustre: 101916:0:(client.c:2134:ptlrpc_expire_one_request()) Skipped 62 previous similar messages

            Jun 10 16:32:56 ibmpower9 kernel: Lustre: lfstev-OST0005-osc-c0002072ee3bf800: Connection to lfstev-OST0005 (at 192.168.176.146@o2ib) was lost; in progress operations using this service will wait for recovery to complete

            Jun 10 16:32:56 ibmpower9 kernel: Lustre: Skipped 62 previous similar messages

            Jun 10 16:32:56 ibmpower9 kernel: LustreError: 101882:0:(events.c:200:client_bulk_callback()) event type 2, status -90, desc c000207265a29c00

            Jun 10 16:32:56 ibmpower9 kernel: LustreError: 101881:0:(events.c:200:client_bulk_callback()) event type 2, status -90, desc c000207265a20a00

            The router has a lot of errors like this when the second transfer starts:

            Jun 10 16:33:33 ibmpower9 kernel: LustreError: 101883:0:(events.c:200:client_bulk_callback()) event type 2, status -90, desc c000207265a29c00

            Jun 10 16:33:33 ibmpower9 kernel: LustreError: 101882:0:(events.c:200:client_bulk_callback()) event type 2, status -90, desc c000207265a29c00

            Jun 10 16:33:33 ibmpower9 kernel: LustreError: 101881:0:(events.c:200:client_bulk_callback()) event type 2, status -90, desc c000207265a29c00

            Router errors with "client_bulk_callback()" filtered out:

            Jun 10 16:32:56 newtevnfs kernel: LNetError: 4354:0:(o2iblnd_cb.c:1083:kiblnd_init_rdma()) RDMA has too many fragments for peer_ni 192.168.177.202@o2ib177 (16), src idx/frags: 32/240 dst idx/frags: 0/1

            Jun 10 16:32:56 newtevnfs kernel: LNetError: 4354:0:(o2iblnd_cb.c:1083:kiblnd_init_rdma()) Skipped 126371 previous similar messages

            Jun 10 16:32:56 newtevnfs kernel: LNetError: 4354:0:(o2iblnd_cb.c:433:kiblnd_handle_rx()) Can't setup rdma for PUT to 192.168.177.202@o2ib177: -90

            Jun 10 16:32:56 newtevnfs kernel: LNetError: 4354:0:(o2iblnd_cb.c:433:kiblnd_handle_rx()) Skipped 126396 previous similar messages

            Jun 10 16:32:56 newtevnfs kernel: LNet: 4356:0:(o2iblnd_cb.c:396:kiblnd_handle_rx()) PUT_NACK from 192.168.177.202@o2ib177

            Jun 10 16:32:56 newtevnfs kernel: LNet: 4356:0:(o2iblnd_cb.c:396:kiblnd_handle_rx()) Skipped 356 previous similar messages

            Jun 10 16:34:11 newtevnfs kernel: LNetError: 4353:0:(o2iblnd_cb.c:1083:kiblnd_init_rdma()) RDMA has too many fragments for peer_ni 192.168.177.202@o2ib177 (16), src idx/frags: 32/240 dst idx/frags: 0/1

            Jun 10 16:34:11 newtevnfs kernel: LNetError: 4353:0:(o2iblnd_cb.c:1083:kiblnd_init_rdma()) Skipped 18859 previous similar messages

            Jun 10 16:34:11 newtevnfs kernel: LNetError: 4353:0:(o2iblnd_cb.c:433:kiblnd_handle_rx()) Can't setup rdma for PUT to 192.168.177.202@o2ib177: -90

            Jun 10 16:34:11 newtevnfs kernel: LNetError: 4353:0:(o2iblnd_cb.c:433:kiblnd_handle_rx()) Skipped 18859 previous similar messages
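The numbers in the kiblnd_init_rdma() message can be pulled out for comparison across many log lines. The sketch below uses sed on one line copied from the router log above. The field names (limit, src_idx, src_frags) are labels chosen here from the message wording, and reading the parenthesized value as the per-peer fragment limit is an assumption, not something the ticket confirms. parse_rdma_error is a hypothetical helper.

```shell
#!/bin/sh
# Extract "<peer> <limit> <src_idx> <src_frags>" from a
# kiblnd_init_rdma() "too many fragments" error line.
# (hypothetical helper; field names are our own labels)
parse_rdma_error() {
    printf '%s\n' "$1" | sed -n \
        's|.*too many fragments for peer_ni \([^ ]*\) (\([0-9]*\)), src idx/frags: \([0-9]*\)/\([0-9]*\).*|\1 \2 \3 \4|p'
}

# Sample line copied from the router log in this ticket.
line='LNetError: 4354:0:(o2iblnd_cb.c:1083:kiblnd_init_rdma()) RDMA has too many fragments for peer_ni 192.168.177.202@o2ib177 (16), src idx/frags: 32/240 dst idx/frags: 0/1'

parsed=$(parse_rdma_error "$line")

# Split the four space-separated fields into positional parameters.
set -- $parsed
echo "peer=$1 limit=$2 src_idx=$3 src_frags=$4"
```

For the sample line this gives a limit of 16 against 240 source fragments, with the peer being the POWER9 client's NID.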

            Server errors, first server tevlfs5:

            Jun 10 16:31:09 tevlfs5 kernel: Lustre: lfstev-OST0004: Connection restored to 5cef0352-ac9a-6592-6586-45468e615673 (at 192.168.177.202@o2ib177)

            Jun 10 16:33:52 tevlfs5 kernel: Lustre: lfstev-OST0004: Connection restored to 5cef0352-ac9a-6592-6586-45468e615673 (at 192.168.177.202@o2ib177)

            Jun 10 16:34:18 tevlfs5 kernel: Lustre: lfstev-OST0004: Connection restored to 5cef0352-ac9a-6592-6586-45468e615673 (at 192.168.177.202@o2ib177)

            Jun 10 16:34:25 tevlfs5 kernel: Lustre: lfstev-OST0004: Client 232e0955-aa70-fd53-f988-7299ce54b534 (at 192.168.177.202@o2ib177) reconnecting

            Jun 10 16:34:25 tevlfs5 kernel: Lustre: Skipped 1589 previous similar messages

            Jun 10 16:34:25 tevlfs5 kernel: LustreError: 29667:0:(ldlm_lib.c:3197:target_bulk_io()) @@@ bulk READ failed: rc 107  req@ffff965e3ff53c50 x1635984221122768/t0(0) o3->232e0955-aa70-fd53-f988-7299ce54b534@192.168.177.202@o2ib177:256/0 lens 488/432 e 0 to 0 dl 1560202471 ref 1 fl Interpret:/2/0 rc 0/0

            Jun 10 16:34:25 tevlfs5 kernel: Lustre: lfstev-OST0004: Bulk IO read error with 232e0955-aa70-fd53-f988-7299ce54b534 (at 192.168.177.202@o2ib177), client will retry: rc -107

            Jun 10 16:34:25 tevlfs5 kernel: Lustre: Skipped 1587 previous similar messages

            Jun 10 16:34:25 tevlfs5 kernel: LustreError: 29650:0:(ldlm_lib.c:3247:target_bulk_io()) @@@ Reconnect on bulk READ  req@ffff965e3ff53450 x1635984221122784/t0(0) o3->232e0955-aa70-fd53-f988-7299ce54b534@192.168.177.202@o2ib177:256/0 lens 488/432 e 0 to 0 dl 1560202471 ref 1 fl Interpret:/2/0 rc 0/0

            Jun 10 16:34:25 tevlfs5 kernel: LustreError: 29650:0:(ldlm_lib.c:3247:target_bulk_io()) Skipped 1584 previous similar messages

            Jun 10 16:34:25 tevlfs5 kernel: LustreError: 29667:0:(ldlm_lib.c:3197:target_bulk_io()) Skipped 6 previous similar messages

            Jun 10 16:34:27 tevlfs5 kernel: Lustre: lfstev-OST0004: Connection restored to 5cef0352-ac9a-6592-6586-45468e615673 (at 192.168.177.202@o2ib177)

            Jun 10 16:34:27 tevlfs5 kernel: Lustre: Skipped 221 previous similar messages

            Jun 10 16:34:46 tevlfs5 kernel: Lustre: lfstev-OST0004: Connection restored to 5cef0352-ac9a-6592-6586-45468e615673 (at 192.168.177.202@o2ib177)

            Jun 10 16:34:46 tevlfs5 kernel: Lustre: Skipped 2390 previous similar messages

            Jun 10 16:34:57 tevlfs5 kernel: Lustre: lfstev-OST0004: Client 232e0955-aa70-fd53-f988-7299ce54b534 (at 192.168.177.202@o2ib177) reconnecting

            Jun 10 16:34:57 tevlfs5 kernel: Lustre: Skipped 3995 previous similar messages

            Jun 10 16:34:57 tevlfs5 kernel: Lustre: lfstev-OST0004: Bulk IO read error with 232e0955-aa70-fd53-f988-7299ce54b534 (at 192.168.177.202@o2ib177), client will retry: rc -110

            Jun 10 16:34:57 tevlfs5 kernel: Lustre: Skipped 4021 previous similar messages

            Jun 10 16:34:57 tevlfs5 kernel: LustreError: 29650:0:(ldlm_lib.c:3247:target_bulk_io()) @@@ Reconnect on bulk READ  req@ffff965ce9f00050 x1635984221123728/t0(0) o3->232e0955-aa70-fd53-f988-7299ce54b534@192.168.177.202@o2ib177:312/0 lens 488/432 e 0 to 0 dl 1560202527 ref 1 fl Interpret:/2/0 rc 0/0

            Jun 10 16:34:57 tevlfs5 kernel: LustreError: 29650:0:(ldlm_lib.c:3247:target_bulk_io()) Skipped 3969 previous similar messages

            Jun 10 16:35:23 tevlfs5 kernel: Lustre: lfstev-OST0004: Connection restored to 5cef0352-ac9a-6592-6586-45468e615673 (at 192.168.177.202@o2ib177)

            Jun 10 16:35:23 tevlfs5 kernel: Lustre: Skipped 4735 previous similar messages

            Jun 10 16:36:01 tevlfs5 kernel: Lustre: lfstev-OST0004: Client 232e0955-aa70-fd53-f988-7299ce54b534 (at 192.168.177.202@o2ib177) reconnecting

            Jun 10 16:36:01 tevlfs5 kernel: Lustre: Skipped 8064 previous similar messages

            Jun 10 16:36:01 tevlfs5 kernel: Lustre: lfstev-OST0004: Bulk IO read error with 232e0955-aa70-fd53-f988-7299ce54b534 (at 192.168.177.202@o2ib177), client will retry: rc -110

            Jun 10 16:36:01 tevlfs5 kernel: Lustre: Skipped 8063 previous similar messages

            Jun 10 16:36:01 tevlfs5 kernel: LustreError: 29681:0:(ldlm_lib.c:3247:target_bulk_io()) @@@ Reconnect on bulk READ  req@ffff965d412d8850 x1635984221123728/t0(0) o3->232e0955-aa70-fd53-f988-7299ce54b534@192.168.177.202@o2ib177:376/0 lens 488/432 e 0 to 0 dl 1560202591 ref 1 fl Interpret:/2/0 rc 0/0

            Jun 10 16:36:01 tevlfs5 kernel: LustreError: 29681:0:(ldlm_lib.c:3247:target_bulk_io()) Skipped 8061 previous similar messages

             

             

            Jun 10 16:36:38 tevlfs5 kernel: Lustre: lfstev-OST0004: Connection restored to 5cef0352-ac9a-6592-6586-45468e615673 (at 192.168.177.202@o2ib177)

            Jun 10 16:36:38 tevlfs5 kernel: Lustre: Skipped 9432 previous similar messages

             

            Jun 10 16:38:09 tevlfs5 kernel: Lustre: lfstev-OST0004: Bulk IO read error with 232e0955-aa70-fd53-f988-7299ce54b534 (at 192.168.177.202@o2ib177), client will retry: rc -110

            Jun 10 16:38:09 tevlfs5 kernel: Lustre: Skipped 16127 previous similar messages

            Jun 10 16:38:09 tevlfs5 kernel: Lustre: lfstev-OST0004: Client 232e0955-aa70-fd53-f988-7299ce54b534 (at 192.168.177.202@o2ib177) reconnecting

            Jun 10 16:38:09 tevlfs5 kernel: Lustre: Skipped 16128 previous similar messages

            Jun 10 16:38:09 tevlfs5 kernel: LustreError: 29681:0:(ldlm_lib.c:3247:target_bulk_io()) @@@ Reconnect on bulk READ  req@ffff965e35238050 x1635984221123728/t0(0) o3->232e0955-aa70-fd53-f988-7299ce54b534@192.168.177.202@o2ib177:504/0 lens 488/432 e 0 to 0 dl 1560202719 ref 1 fl Interpret:/2/0 rc 0/0

            Jun 10 16:38:09 tevlfs5 kernel: LustreError: 29681:0:(ldlm_lib.c:3247:target_bulk_io()) Skipped 16141 previous similar messages

            Jun 10 16:39:08 tevlfs5 kernel: Lustre: lfstev-OST0004: Connection restored to 5cef0352-ac9a-6592-6586-45468e615673 (at 192.168.177.202@o2ib177)

            Jun 10 16:39:08 tevlfs5 kernel: Lustre: Skipped 18844 previous similar messages

             

            Jun 10 16:42:25 tevlfs5 kernel: Lustre: lfstev-OST0004: Bulk IO read error with 232e0955-aa70-fd53-f988-7299ce54b534 (at 192.168.177.202@o2ib177), client will retry: rc -110

            Jun 10 16:42:25 tevlfs5 kernel: Lustre: Skipped 32255 previous similar messages

            Jun 10 16:42:25 tevlfs5 kernel: Lustre: lfstev-OST0004: Client 232e0955-aa70-fd53-f988-7299ce54b534 (at 192.168.177.202@o2ib177) reconnecting

            Jun 10 16:42:25 tevlfs5 kernel: Lustre: Skipped 32255 previous similar messages

            Jun 10 16:42:25 tevlfs5 kernel: LustreError: 29650:0:(ldlm_lib.c:3247:target_bulk_io()) @@@ Reconnect on bulk READ  req@ffff965e5d9c8850 x1635984221123728/t0(0) o3->232e0955-aa70-fd53-f988-7299ce54b534@192.168.177.202@o2ib177:5/0 lens 488/432 e 0 to 0 dl 1560202975 ref 1 fl Interpret:/2/0 rc 0/0

            Jun 10 16:42:25 tevlfs5 kernel: LustreError: 29650:0:(ldlm_lib.c:3247:target_bulk_io()) Skipped 32227 previous similar messages

            Jun 10 16:44:08 tevlfs5 kernel: Lustre: lfstev-OST0004: Connection restored to 5cef0352-ac9a-6592-6586-45468e615673 (at 192.168.177.202@o2ib177)

            Jun 10 16:44:08 tevlfs5 kernel: Lustre: Skipped 37797 previous similar messages

            Jun 10 16:32:56 ibmpower9 kernel: LustreError: 101883:0:(events.c:200:client_bulk_callback()) event type 2, status -90, desc c000207265a20a00

            Server errors, second server tevlfs6:

            Jun 10 16:31:45 tevlfs6 kernel: Lustre: lfstev-OST0005: Connection restored to 24e6a956-37a9-6c64-bff4-98a232c29f9a (at 192.168.177.202@o2ib177)

            Jun 10 16:32:18 tevlfs6 kernel: Lustre: lfstev-OST0005: Client 232e0955-aa70-fd53-f988-7299ce54b534 (at 192.168.177.202@o2ib177) reconnecting

            Jun 10 16:32:18 tevlfs6 kernel: Lustre: Skipped 1638 previous similar messages

            Jun 10 16:32:18 tevlfs6 kernel: Lustre: lfstev-OST0005: Connection restored to 24e6a956-37a9-6c64-bff4-98a232c29f9a (at 192.168.177.202@o2ib177)

            Jun 10 16:32:18 tevlfs6 kernel: LustreError: 14016:0:(ldlm_lib.c:3197:target_bulk_io()) @@@ bulk READ failed: rc 107  req@ffff96bb291e4450 x1635984221122544/t0(0) o3->232e0955-aa70-fd53-f988-7299ce54b534@192.168.177.202@o2ib177:129/0 lens 488/432 e 0 to 0 dl 1560202344 ref 1 fl Interpret:/0/0 rc 0/0

            Jun 10 16:32:18 tevlfs6 kernel: LustreError: 14016:0:(ldlm_lib.c:3197:target_bulk_io()) Skipped 122 previous similar messages

            Jun 10 16:32:18 tevlfs6 kernel: Lustre: lfstev-OST0005: Bulk IO read error with 232e0955-aa70-fd53-f988-7299ce54b534 (at 192.168.177.202@o2ib177), client will retry: rc -107

            Jun 10 16:32:18 tevlfs6 kernel: Lustre: Skipped 2 previous similar messages

            Jun 10 16:32:18 tevlfs6 kernel: LustreError: 14006:0:(ldlm_lib.c:3247:target_bulk_io()) @@@ Reconnect on bulk READ  req@ffff96b9f53d4c50 x1635984221122656/t0(0) o3->232e0955-aa70-fd53-f988-7299ce54b534@192.168.177.202@o2ib177:129/0 lens 488/432 e 0 to 0 dl 1560202344 ref 1 fl Interpret:/0/0 rc 0/0

            Jun 10 16:32:18 tevlfs6 kernel: LustreError: 14006:0:(ldlm_lib.c:3247:target_bulk_io()) Skipped 4709 previous similar messages

            Jun 10 16:32:22 tevlfs6 kernel: Lustre: lfstev-OST0005: Connection restored to 24e6a956-37a9-6c64-bff4-98a232c29f9a (at 192.168.177.202@o2ib177)

            Jun 10 16:32:22 tevlfs6 kernel: Lustre: Skipped 501 previous similar messages

            Jun 10 16:32:30 tevlfs6 kernel: Lustre: lfstev-OST0005: Connection restored to 24e6a956-37a9-6c64-bff4-98a232c29f9a (at 192.168.177.202@o2ib177)

            Jun 10 16:32:30 tevlfs6 kernel: Lustre: Skipped 1008 previous similar messages

            Jun 10 16:32:46 tevlfs6 kernel: Lustre: lfstev-OST0005: Connection restored to 24e6a956-37a9-6c64-bff4-98a232c29f9a (at 192.168.177.202@o2ib177)

            Jun 10 16:32:46 tevlfs6 kernel: Lustre: Skipped 2016 previous similar messages

            Jun 10 16:32:55 tevlfs6 kernel: Lustre: lfstev-OST0005: Client 232e0955-aa70-fd53-f988-7299ce54b534 (at 192.168.177.202@o2ib177) reconnecting

            Jun 10 16:32:55 tevlfs6 kernel: Lustre: Skipped 4713 previous similar messages

            Jun 10 16:32:56 tevlfs6 kernel: LustreError: 14018:0:(ldlm_lib.c:3247:target_bulk_io()) @@@ Reconnect on bulk READ  req@ffff96bae7793050 x1635984221122096/t0(0) o3->232e0955-aa70-fd53-f988-7299ce54b534@192.168.177.202@o2ib177:190/0 lens 488/432 e 0 to 0 dl 1560202405 ref 1 fl Interpret:/2/0 rc 0/0

            Jun 10 16:32:56 tevlfs6 kernel: LustreError: 14018:0:(ldlm_lib.c:3247:target_bulk_io()) Skipped 4691 previous similar messages

            Jun 10 16:33:18 tevlfs6 kernel: Lustre: lfstev-OST0005: Connection restored to 24e6a956-37a9-6c64-bff4-98a232c29f9a (at 192.168.177.202@o2ib177)

            Jun 10 16:33:18 tevlfs6 kernel: Lustre: Skipped 4032 previous similar messages

            Jun 10 16:34:10 tevlfs6 kernel: Lustre: lfstev-OST0005: Client 232e0955-aa70-fd53-f988-7299ce54b534 (at 192.168.177.202@o2ib177) reconnecting

            Jun 10 16:34:10 tevlfs6 kernel: Lustre: Skipped 9430 previous similar messages

            Jun 10 16:34:11 tevlfs6 kernel: LustreError: 14018:0:(ldlm_lib.c:3247:target_bulk_io()) @@@ Reconnect on bulk READ  req@ffff96b998978850 x1635984221122096/t0(0) o3->232e0955-aa70-fd53-f988-7299ce54b534@192.168.177.202@o2ib177:265/0 lens 488/432 e 0 to 0 dl 1560202480 ref 1 fl Interpret:/2/0 rc 0/0

            Jun 10 16:34:11 tevlfs6 kernel: LustreError: 14018:0:(ldlm_lib.c:3247:target_bulk_io()) Skipped 9449 previous similar messages

            Jun 10 16:34:22 tevlfs6 kernel: Lustre: lfstev-OST0005: Connection restored to 24e6a956-37a9-6c64-bff4-98a232c29f9a (at 192.168.177.202@o2ib177)

            Jun 10 16:34:22 tevlfs6 kernel: Lustre: Skipped 8064 previous similar messages
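The status and rc values in the client, router, and server logs above are negated Linux errno numbers. The small lookup below covers only the three codes that appear in this ticket; errno_name is a hypothetical helper written for illustration, and the numeric values are the generic Linux ones.

```shell
#!/bin/sh
# Map the negative return codes seen in these logs to Linux errno names.
# Only the three codes that appear in this ticket are covered.
# (hypothetical helper for illustration)
errno_name() {
    # Strip a leading "-" so both "-90" and "90" are accepted.
    case "${1#-}" in
        90)  echo "EMSGSIZE (Message too long)" ;;
        107) echo "ENOTCONN (Transport endpoint is not connected)" ;;
        110) echo "ETIMEDOUT (Connection timed out)" ;;
        *)   echo "unknown" ;;
    esac
}

for rc in -90 -107 -110; do
    echo "rc $rc: $(errno_name "$rc")"
done
```

So the -90 reported by client_bulk_callback() and kiblnd_handle_rx() is EMSGSIZE, while the servers see ENOTCONN and then ETIMEDOUT on the bulk reads.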

            Both processes reading files were killed on client host. ltop reports data being transferred on OSTs:

            Filesystem: lfstev                                                    RECORDING
                Inodes:    169.254m total,      4.819m used (  3%),    164.435m free
                 Space:     83.622t total,      7.306t used (  9%),     76.316t free
               Bytes/s:      0.231g read,       0.000g write,               252 IOPS
               MDops/s:      1 open,        0 close,       0 getattr,     0 setattr
                             0 link,        0 unlink,      0 mkdir,       0 rmdir
                             0 statfs,      0 rename,      0 getxattr
            >OST S        OSS   Exp   CR rMB/s wMB/s  IOPS   LOCKS  LGR  LCR %cpu %mem %spc
            0000      tevlfs1    70    0     0     0     0       0    0    0    0   69    9
            0001      tevlfs2    70    0     0     0     0       0    0    0    1   69    9
            0002      tevlfs3    70    0     0     0     0       0    0    0    1   88    8
            0003      tevlfs4    70    0     0     0     0       0    0    0    0   88    9
            0004      tevlfs5    71  126   118     0   126       0    0    0    0   90    9
            0005      tevlfs6    71  126   118     0   126       0    0    0    1   70    9

            The server and router keep reporting errors until the client host is rebooted:

            Server:

            Jun 10 16:41:40 tevlfs6 kernel: Lustre: lfstev-OST0005: Client 232e0955-aa70-fd53-f988-7299ce54b534 (at 192.168.177.202@o2ib177) reconnecting

            Jun 10 16:41:40 tevlfs6 kernel: Lustre: Skipped 37786 previous similar messages

            Jun 10 16:41:41 tevlfs6 kernel: LustreError: 14018:0:(ldlm_lib.c:3247:target_bulk_io()) @@@ Reconnect on bulk READ  req@ffff96bb07d49050 x1635984221122096/t0(0) o3->232e0955-aa70-fd53-f988-7299ce54b534@192.168.177.202@o2ib177:716/0 lens 488/432 e 0 to 0 dl 1560202931 ref 1 fl Interpret:/2/0 rc 0/0

            Jun 10 16:41:41 tevlfs6 kernel: LustreError: 14018:0:(ldlm_lib.c:3247:target_bulk_io()) Skipped 37799 previous similar messages

            Router errors:

            Jun 10 16:51:41 newtevnfs kernel: LNetError: 4353:0:(o2iblnd_cb.c:1083:kiblnd_init_rdma()) Skipped 151177 previous similar messages

            Jun 10 16:51:41 newtevnfs kernel: LNetError: 4353:0:(o2iblnd_cb.c:433:kiblnd_handle_rx()) Can't setup rdma for PUT to 192.168.177.202@o2ib177: -90

            Jun 10 16:51:41 newtevnfs kernel: LNetError: 4353:0:(o2iblnd_cb.c:433:kiblnd_handle_rx()) Skipped 151180 previous similar messages

            Jun 10 17:01:41 newtevnfs kernel: LNetError: 4355:0:(o2iblnd_cb.c:1083:kiblnd_init_rdma()) RDMA has too many fragments for peer_ni 192.168.177.202@o2ib177 (16), src idx/frags: 32/240 dst idx/frags: 0/1

            Jun 10 17:01:41 newtevnfs kernel: LNetError: 4355:0:(o2iblnd_cb.c:1083:kiblnd_init_rdma()) Skipped 151165 previous similar messages

            Jun 10 17:01:41 newtevnfs kernel: LNetError: 4355:0:(o2iblnd_cb.c:433:kiblnd_handle_rx()) Can't setup rdma for PUT to 192.168.177.202@o2ib177: -90

            Jun 10 17:01:41 newtevnfs kernel: LNetError: 4355:0:(o2iblnd_cb.c:433:kiblnd_handle_rx()) Skipped 151172 previous similar messages

            Jun 10 17:11:41 newtevnfs kernel: LNetError: 4354:0:(o2iblnd_cb.c:1083:kiblnd_init_rdma()) RDMA has too many fragments for peer_ni 192.168.177.202@o2ib177 (16), src idx/frags: 32/240 dst idx/frags: 0/1
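            For triage, the router messages above can be picked apart mechanically. The following is a minimal sketch, not part of the original report: the helper names `parse_fragment_error` and `parse_put_error` are hypothetical, and the regular expressions are written only against the message text as it appears in these logs. The number in parentheses is assumed to be the fragment limit for the connection to the peer, which would match the map_on_demand=16 setting on the power9 client. The errno names are the standard Linux values for 90, 107 and 110.

```python
import re

# Linux errno values (asm-generic/errno.h) for the return codes seen in these logs.
LINUX_ERRNO = {90: "EMSGSIZE", 107: "ENOTCONN", 110: "ETIMEDOUT"}

# kiblnd_init_rdma() message, as printed in the router log above.
FRAG_RE = re.compile(
    r"RDMA has too many fragments for peer_ni (?P<nid>\S+) \((?P<limit>\d+)\), "
    r"src idx/frags: (?P<src_idx>\d+)/(?P<src_frags>\d+) "
    r"dst idx/frags: (?P<dst_idx>\d+)/(?P<dst_frags>\d+)"
)

# kiblnd_handle_rx() message carrying the negative return code.
PUT_RE = re.compile(r"Can't setup rdma for PUT to (?P<nid>\S+): (?P<rc>-\d+)")


def parse_fragment_error(line):
    """Return the peer NID and fragment counters from one log line, or None."""
    m = FRAG_RE.search(line)
    if m is None:
        return None
    fields = {"nid": m.group("nid")}
    for key in ("limit", "src_idx", "src_frags", "dst_idx", "dst_frags"):
        fields[key] = int(m.group(key))
    return fields


def parse_put_error(line):
    """Return the peer NID, return code and errno name from one log line, or None."""
    m = PUT_RE.search(line)
    if m is None:
        return None
    rc = int(m.group("rc"))
    return {"nid": m.group("nid"), "rc": rc, "errno": LINUX_ERRNO.get(-rc, "unknown")}


if __name__ == "__main__":
    frag_line = (
        "Jun 10 17:01:41 newtevnfs kernel: LNetError: "
        "4355:0:(o2iblnd_cb.c:1083:kiblnd_init_rdma()) RDMA has too many fragments "
        "for peer_ni 192.168.177.202@o2ib177 (16), src idx/frags: 32/240 "
        "dst idx/frags: 0/1"
    )
    put_line = (
        "Jun 10 17:01:41 newtevnfs kernel: LNetError: "
        "4355:0:(o2iblnd_cb.c:433:kiblnd_handle_rx()) Can't setup rdma for PUT to "
        "192.168.177.202@o2ib177: -90"
    )
    print(parse_fragment_error(frag_line))
    print(parse_put_error(put_line))
```

            Run against the two sample lines, the sketch reports a limit of 16 against 240 source fragments and decodes -90 as EMSGSIZE. The same table maps the servers' rc -107 and rc -110 to ENOTCONN and ETIMEDOUT.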


            People

              wc-triage WC Triage
              alex.ku Alex Kulyavtsev