[LU-12419] ppc64le: "LNetError: RDMA has too many fragments for peer_ni" when reading two files Created: 11/Jun/19  Updated: 22/Jul/19  Resolved: 22/Jul/19

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.6, Lustre 2.12.1
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Alex Kulyavtsev Assignee: WC Triage
Resolution: Won't Fix Votes: 0
Labels: None
Environment:

Client:
hostname: ibmpower9
NID: 192.168.177.202@o2ib177
kernel: 4.14.0-115.2.2.el7a.ppc64le
Linux: Red Hat Enterprise Linux Server release 7.5 (Maipo)
Architecture: ppc64le
Byte Order: Little Endian Model: 2.1 (pvr 004e 1201)
Model name: POWER9, altivec supported
Lustre (custom rebuild from git distro on this host):
kmod-lustre-client-2.12.1-1.el7.ppc64le
lustre-client-2.12.1-1.el7.ppc64le

Router:
hostname: newtevnfs
NIDs: 192.168.176.28@o2ib
192.168.177.28@o2ib177
kernel: 2.6.32-696.1.1.el6.x86_64
Linux: Scientific Linux Fermi release 6.10 (Ramsey)
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Model name: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
Lustre version (custom rebuild from source rpm on this host):
lustre-client-2.10.6-1.el6.x86_64
kmod-lustre-client-2.10.6-1.el6.x86_64

Server(s): tevlfsa (MDS), tevlfs1-6 (OSS)
tevlfsa 192.168.176.140@o2ib
tevlfs1 192.168.176.141@o2ib
...
tevlfs5 192.168.176.145@o2ib
tevlfs6 192.168.176.146@o2ib
kernel: 3.10.0-862.6.3.el7.x86_64
Linux: Scientific Linux release 7.4 (Nitrogen)
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Model name: Intel(R) Xeon(R) CPU E5420 @ 2.50GHz
Lustre version (custom rebuild from source rpm on this host tevlfs6):
lustre-2.10.6-1.el7.x86_64
zfs-0.7.9-1.el7.x86_64
There are 6 OSTs, one OST per OSS.


Attachments: File client.tgz     File router.tgz     File server.tgz    
Issue Links:
Related
is related to LU-5718 RDMA too fragmented with router Resolved
is related to LU-3322 ko2iblnd support for different map_on... Resolved
is related to LU-7650 ko2iblnd map_on_demand can't negotita... Resolved
is related to LU-6387 Add Power8 support to Lustre Resolved
is related to LU-10300 Can the Lustre 2.10.x clients support... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

I'm trying to install and configure a Lustre client on an IBM POWER9 (ppc64le) machine to mount an existing Lustre file system through a router. The POWER9 host is similar to an ORNL Summit worker node, and we are going to use it for debugging software before running it on leadership facilities, so this case may be of interest to others.

At first there were issues connecting to LNet (lctl ping); they were resolved after explicitly setting options in /etc/modprobe.d/ko2iblnd.conf as per LU-3322:

  map_on_demand=16 - on ibmpower9

  map_on_demand=256 - on x86_64 all servers and router.

Lustre was restarted and modules were reloaded after these changes.
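
As a sketch only, settings like these are normally written as `options` lines for the `ko2iblnd` module. The attached ko2iblnd.conf files contain other options that are not reproduced here, so the lines below are an assumption about the form and not a copy of the actual files.

```
# /etc/modprobe.d/ko2iblnd.conf on ibmpower9 (ppc64le client); other options omitted
options ko2iblnd map_on_demand=16

# /etc/modprobe.d/ko2iblnd.conf on the x86_64 servers and router; other options omitted
options ko2iblnd map_on_demand=256
```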

Now I can mount Lustre on the power9 client and read a SINGLE file with dd. This works for files located on any of the six OSTs, but only with one read transfer at a time. I did not try writes.

When I start two transfers in parallel, or start one transfer and then start the other 10 seconds later, I get an LNet error when the transfer of the second file starts. I can kill -9 the dd processes, but not always all of them; sometimes one of the processes cannot be killed with signal 9. Even after all IO processes ("dd") are killed on the client, the router and servers continue to report errors in the logs, and I still observe IO on both OSTs where the files reside. I cannot unmount Lustre on the power9 client or remove the modules.

"lctl net unconfigure" reports "LNET busy". I have to reboot the power9 client. Only after the client reboot do the errors stop being reported on the servers and router.



 Comments   
Comment by Alex Kulyavtsev [ 11/Jun/19 ]

log after mount on the power9 client:

Jun 10 14:48:50 ibmpower9 kernel: LNet: HW NUMA nodes: 6, HW CPU cores: 128, npartitions: 2

Jun 10 14:48:50 ibmpower9 kernel: alg: No test for adler32 (adler32-zlib)

Jun 10 14:48:51 ibmpower9 kernel: Lustre: Lustre: Build Version: 2.12.1

Jun 10 14:48:51 ibmpower9 kernel: LNet: Using FastReg for registration

Jun 10 14:48:51 ibmpower9 kernel: LNet: Added LNI 192.168.177.202@o2ib177 [8/256/0/180]

Jun 10 14:48:51 ibmpower9 kernel: Lustre: Mounted lfstev-client

Jun 10 14:50:31 ibmpower9 kernel: NVRM: Xid (PCI:0004:04:00): 43, Ch 00000010, engmask 00000101

Jun 10 14:52:55 ibmpower9 kernel: NVRM: Xid (PCI:0004:04:00): 43, Ch 00000010, engmask 00000101

Read two files with 10 sec delay (first file still being read when second read starts)

dd of=/dev/null bs=1M if=/lfstev/admin/aik/iotest/osd5/10.GB

sleep 10

dd of=/dev/null bs=1M if=/lfstev/admin/aik/iotest/osd4/10.GB

Client errors:

Jun 10 16:32:21 ibmpower9 kernel: Lustre: Unmounted lfstev-client

Jun 10 16:32:23 ibmpower9 kernel: Lustre: Mounted lfstev-client

Jun 10 16:32:56 ibmpower9 kernel: LustreError: 101884:0:(events.c:200:client_bulk_callback()) event type 2, status -90, desc c000207265a20a00

Jun 10 16:32:56 ibmpower9 kernel: Lustre: 101916:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1560202375/real 1560202375]  req@c000007fb6779480 x1635984221122096/t0(0) o3->lfstev-OST0005-osc-c0002072ee3bf800@192.168.176.146@o2ib:6/4 lens 488/440 e 0 to 1 dl 1560202382 ref 2 fl Rpc:eX/0/ffffffff rc 0/-1

Jun 10 16:32:56 ibmpower9 kernel: Lustre: lfstev-OST0005-osc-c0002072ee3bf800: Connection to lfstev-OST0005 (at 192.168.176.146@o2ib) was lost; in progress operations using this service will wait for recovery to complete

Jun 10 16:32:56 ibmpower9 kernel: LustreError: 101884:0:(events.c:200:client_bulk_callback()) event type 2, status -90, desc c000207265a29c00

Jun 10 16:32:56 ibmpower9 kernel: Lustre: lfstev-OST0004-osc-c0002072ee3bf800: Connection restored to 192.168.176.145@o2ib (at 192.168.176.145@o2ib)

Jun 10 16:32:56 ibmpower9 kernel: LustreError: 101882:0:(events.c:200:client_bulk_callback()) event type 2, status -90, desc c000207265a29c00

Jun 10 16:32:56 ibmpower9 kernel: LustreError: 101883:0:(events.c:200:client_bulk_callback()) event type 2, status -90, desc c000207265a29c00

Jun 10 16:32:56 ibmpower9 kernel: LustreError: 101883:0:(events.c:200:client_bulk_callback()) event type 2, status -90, desc c000207265a20a00

...

Jun 10 16:32:56 ibmpower9 kernel: LustreError: 101882:0:(events.c:200:client_bulk_callback()) event type 2, status -90, desc c000207265a20a00

Jun 10 16:32:56 ibmpower9 kernel: Lustre: 101916:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1560202376/real 1560202376]  req@c000007fb6779480 x1635984221122096/t0(0) o3->lfstev-OST0005-osc-c0002072ee3bf800@192.168.176.146@o2ib:6/4 lens 488/440 e 0 to 1 dl 1560202383 ref 2 fl Rpc:eX/2/ffffffff rc 0/-1

Jun 10 16:32:56 ibmpower9 kernel: Lustre: 101916:0:(client.c:2134:ptlrpc_expire_one_request()) Skipped 62 previous similar messages

Jun 10 16:32:56 ibmpower9 kernel: Lustre: lfstev-OST0005-osc-c0002072ee3bf800: Connection to lfstev-OST0005 (at 192.168.176.146@o2ib) was lost; in progress operations using this service will wait for recovery to complete

Jun 10 16:32:56 ibmpower9 kernel: Lustre: Skipped 62 previous similar messages

Jun 10 16:32:56 ibmpower9 kernel: LustreError: 101882:0:(events.c:200:client_bulk_callback()) event type 2, status -90, desc c000207265a29c00

Jun 10 16:32:56 ibmpower9 kernel: LustreError: 101881:0:(events.c:200:client_bulk_callback()) event type 2, status -90, desc c000207265a20a00

Router has a lot of errors like this when second transfer starts

Jun 10 16:33:33 ibmpower9 kernel: LustreError: 101883:0:(events.c:200:client_bulk_callback()) event type 2, status -90, desc c000207265a29c00

Jun 10 16:33:33 ibmpower9 kernel: LustreError: 101882:0:(events.c:200:client_bulk_callback()) event type 2, status -90, desc c000207265a29c00

Jun 10 16:33:33 ibmpower9 kernel: LustreError: 101881:0:(events.c:200:client_bulk_callback()) event type 2, status -90, desc c000207265a29c00

Router errors with "client_bulk_callback()" filtered out:

Jun 10 16:32:56 newtevnfs kernel: LNetError: 4354:0:(o2iblnd_cb.c:1083:kiblnd_init_rdma()) RDMA has too many fragments for peer_ni 192.168.177.202@o2ib177 (16), src idx/frags: 32/240 dst idx/frags: 0/1

Jun 10 16:32:56 newtevnfs kernel: LNetError: 4354:0:(o2iblnd_cb.c:1083:kiblnd_init_rdma()) Skipped 126371 previous similar messages

Jun 10 16:32:56 newtevnfs kernel: LNetError: 4354:0:(o2iblnd_cb.c:433:kiblnd_handle_rx()) Can't setup rdma for PUT to 192.168.177.202@o2ib177: -90

Jun 10 16:32:56 newtevnfs kernel: LNetError: 4354:0:(o2iblnd_cb.c:433:kiblnd_handle_rx()) Skipped 126396 previous similar messages

Jun 10 16:32:56 newtevnfs kernel: LNet: 4356:0:(o2iblnd_cb.c:396:kiblnd_handle_rx()) PUT_NACK from 192.168.177.202@o2ib177

Jun 10 16:32:56 newtevnfs kernel: LNet: 4356:0:(o2iblnd_cb.c:396:kiblnd_handle_rx()) Skipped 356 previous similar messages

Jun 10 16:34:11 newtevnfs kernel: LNetError: 4353:0:(o2iblnd_cb.c:1083:kiblnd_init_rdma()) RDMA has too many fragments for peer_ni 192.168.177.202@o2ib177 (16), src idx/frags: 32/240 dst idx/frags: 0/1

Jun 10 16:34:11 newtevnfs kernel: LNetError: 4353:0:(o2iblnd_cb.c:1083:kiblnd_init_rdma()) Skipped 18859 previous similar messages

Jun 10 16:34:11 newtevnfs kernel: LNetError: 4353:0:(o2iblnd_cb.c:433:kiblnd_handle_rx()) Can't setup rdma for PUT to 192.168.177.202@o2ib177: -90

Jun 10 16:34:11 newtevnfs kernel: LNetError: 4353:0:(o2iblnd_cb.c:433:kiblnd_handle_rx()) Skipped 18859 previous similar messages
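
To make the router message easier to compare across runs, here is a minimal sketch that pulls the numbers out of the "too many fragments" line. The field names (`limit`, `src_frags`, and so on) are labels chosen for illustration. Reading the value in parentheses as the per-connection fragment limit is an assumption, based on it matching map_on_demand=16 on the client.

```python
import re

# Matches the kiblnd_init_rdma() message text exactly as it appears in the router log.
RDMA_FRAG_RE = re.compile(
    r"RDMA has too many fragments for peer_ni (?P<nid>\S+) \((?P<limit>\d+)\), "
    r"src idx/frags: (?P<src_idx>\d+)/(?P<src_frags>\d+) "
    r"dst idx/frags: (?P<dst_idx>\d+)/(?P<dst_frags>\d+)"
)


def parse_rdma_frag_error(line):
    """Return the fields of a 'too many fragments' message, or None if absent."""
    m = RDMA_FRAG_RE.search(line)
    if m is None:
        return None
    # Keep the NID as a string; convert every other captured field to int.
    return {k: (v if k == "nid" else int(v)) for k, v in m.groupdict().items()}


line = (
    "Jun 10 16:32:56 newtevnfs kernel: LNetError: "
    "4354:0:(o2iblnd_cb.c:1083:kiblnd_init_rdma()) "
    "RDMA has too many fragments for peer_ni 192.168.177.202@o2ib177 (16), "
    "src idx/frags: 32/240 dst idx/frags: 0/1"
)
info = parse_rdma_frag_error(line)
print(info)
```

In this line the source side reports 240 fragments while the value in parentheses is 16, which is the mismatch the error text complains about.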

Server errors, first server tevlfs5:

Jun 10 16:31:09 tevlfs5 kernel: Lustre: lfstev-OST0004: Connection restored to 5cef0352-ac9a-6592-6586-45468e615673 (at 192.168.177.202@o2ib177)

Jun 10 16:33:52 tevlfs5 kernel: Lustre: lfstev-OST0004: Connection restored to 5cef0352-ac9a-6592-6586-45468e615673 (at 192.168.177.202@o2ib177)

Jun 10 16:34:18 tevlfs5 kernel: Lustre: lfstev-OST0004: Connection restored to 5cef0352-ac9a-6592-6586-45468e615673 (at 192.168.177.202@o2ib177)

Jun 10 16:34:25 tevlfs5 kernel: Lustre: lfstev-OST0004: Client 232e0955-aa70-fd53-f988-7299ce54b534 (at 192.168.177.202@o2ib177) reconnecting

Jun 10 16:34:25 tevlfs5 kernel: Lustre: Skipped 1589 previous similar messages

Jun 10 16:34:25 tevlfs5 kernel: LustreError: 29667:0:(ldlm_lib.c:3197:target_bulk_io()) @@@ bulk READ failed: rc -107  req@ffff965e3ff53c50 x1635984221122768/t0(0) o3->232e0955-aa70-fd53-f988-7299ce54b534@192.168.177.202@o2ib177:256/0 lens 488/432 e 0 to 0 dl 1560202471 ref 1 fl Interpret:/2/0 rc 0/0

Jun 10 16:34:25 tevlfs5 kernel: Lustre: lfstev-OST0004: Bulk IO read error with 232e0955-aa70-fd53-f988-7299ce54b534 (at 192.168.177.202@o2ib177), client will retry: rc -107

Jun 10 16:34:25 tevlfs5 kernel: Lustre: Skipped 1587 previous similar messages

Jun 10 16:34:25 tevlfs5 kernel: LustreError: 29650:0:(ldlm_lib.c:3247:target_bulk_io()) @@@ Reconnect on bulk READ  req@ffff965e3ff53450 x1635984221122784/t0(0) o3->232e0955-aa70-fd53-f988-7299ce54b534@192.168.177.202@o2ib177:256/0 lens 488/432 e 0 to 0 dl 1560202471 ref 1 fl Interpret:/2/0 rc 0/0

Jun 10 16:34:25 tevlfs5 kernel: LustreError: 29650:0:(ldlm_lib.c:3247:target_bulk_io()) Skipped 1584 previous similar messages

Jun 10 16:34:25 tevlfs5 kernel: LustreError: 29667:0:(ldlm_lib.c:3197:target_bulk_io()) Skipped 6 previous similar messages

Jun 10 16:34:27 tevlfs5 kernel: Lustre: lfstev-OST0004: Connection restored to 5cef0352-ac9a-6592-6586-45468e615673 (at 192.168.177.202@o2ib177)

Jun 10 16:34:27 tevlfs5 kernel: Lustre: Skipped 221 previous similar messages

Jun 10 16:34:46 tevlfs5 kernel: Lustre: lfstev-OST0004: Connection restored to 5cef0352-ac9a-6592-6586-45468e615673 (at 192.168.177.202@o2ib177)

Jun 10 16:34:46 tevlfs5 kernel: Lustre: Skipped 2390 previous similar messages

Jun 10 16:34:57 tevlfs5 kernel: Lustre: lfstev-OST0004: Client 232e0955-aa70-fd53-f988-7299ce54b534 (at 192.168.177.202@o2ib177) reconnecting

Jun 10 16:34:57 tevlfs5 kernel: Lustre: Skipped 3995 previous similar messages

Jun 10 16:34:57 tevlfs5 kernel: Lustre: lfstev-OST0004: Bulk IO read error with 232e0955-aa70-fd53-f988-7299ce54b534 (at 192.168.177.202@o2ib177), client will retry: rc -110

Jun 10 16:34:57 tevlfs5 kernel: Lustre: Skipped 4021 previous similar messages

Jun 10 16:34:57 tevlfs5 kernel: LustreError: 29650:0:(ldlm_lib.c:3247:target_bulk_io()) @@@ Reconnect on bulk READ  req@ffff965ce9f00050 x1635984221123728/t0(0) o3->232e0955-aa70-fd53-f988-7299ce54b534@192.168.177.202@o2ib177:312/0 lens 488/432 e 0 to 0 dl 1560202527 ref 1 fl Interpret:/2/0 rc 0/0

Jun 10 16:34:57 tevlfs5 kernel: LustreError: 29650:0:(ldlm_lib.c:3247:target_bulk_io()) Skipped 3969 previous similar messages

Jun 10 16:35:23 tevlfs5 kernel: Lustre: lfstev-OST0004: Connection restored to 5cef0352-ac9a-6592-6586-45468e615673 (at 192.168.177.202@o2ib177)

Jun 10 16:35:23 tevlfs5 kernel: Lustre: Skipped 4735 previous similar messages

Jun 10 16:36:01 tevlfs5 kernel: Lustre: lfstev-OST0004: Client 232e0955-aa70-fd53-f988-7299ce54b534 (at 192.168.177.202@o2ib177) reconnecting

Jun 10 16:36:01 tevlfs5 kernel: Lustre: Skipped 8064 previous similar messages

Jun 10 16:36:01 tevlfs5 kernel: Lustre: lfstev-OST0004: Bulk IO read error with 232e0955-aa70-fd53-f988-7299ce54b534 (at 192.168.177.202@o2ib177), client will retry: rc -110

Jun 10 16:36:01 tevlfs5 kernel: Lustre: Skipped 8063 previous similar messages

Jun 10 16:36:01 tevlfs5 kernel: LustreError: 29681:0:(ldlm_lib.c:3247:target_bulk_io()) @@@ Reconnect on bulk READ  req@ffff965d412d8850 x1635984221123728/t0(0) o3->232e0955-aa70-fd53-f988-7299ce54b534@192.168.177.202@o2ib177:376/0 lens 488/432 e 0 to 0 dl 1560202591 ref 1 fl Interpret:/2/0 rc 0/0

Jun 10 16:36:01 tevlfs5 kernel: LustreError: 29681:0:(ldlm_lib.c:3247:target_bulk_io()) Skipped 8061 previous similar messages

 

 

Jun 10 16:36:38 tevlfs5 kernel: Lustre: lfstev-OST0004: Connection restored to 5cef0352-ac9a-6592-6586-45468e615673 (at 192.168.177.202@o2ib177)

Jun 10 16:36:38 tevlfs5 kernel: Lustre: Skipped 9432 previous similar messages

 

Jun 10 16:38:09 tevlfs5 kernel: Lustre: lfstev-OST0004: Bulk IO read error with 232e0955-aa70-fd53-f988-7299ce54b534 (at 192.168.177.202@o2ib177), client will retry: rc -110

Jun 10 16:38:09 tevlfs5 kernel: Lustre: Skipped 16127 previous similar messages

Jun 10 16:38:09 tevlfs5 kernel: Lustre: lfstev-OST0004: Client 232e0955-aa70-fd53-f988-7299ce54b534 (at 192.168.177.202@o2ib177) reconnecting

Jun 10 16:38:09 tevlfs5 kernel: Lustre: Skipped 16128 previous similar messages

Jun 10 16:38:09 tevlfs5 kernel: LustreError: 29681:0:(ldlm_lib.c:3247:target_bulk_io()) @@@ Reconnect on bulk READ  req@ffff965e35238050 x1635984221123728/t0(0) o3->232e0955-aa70-fd53-f988-7299ce54b534@192.168.177.202@o2ib177:504/0 lens 488/432 e 0 to 0 dl 1560202719 ref 1 fl Interpret:/2/0 rc 0/0

Jun 10 16:38:09 tevlfs5 kernel: LustreError: 29681:0:(ldlm_lib.c:3247:target_bulk_io()) Skipped 16141 previous similar messages

Jun 10 16:39:08 tevlfs5 kernel: Lustre: lfstev-OST0004: Connection restored to 5cef0352-ac9a-6592-6586-45468e615673 (at 192.168.177.202@o2ib177)

Jun 10 16:39:08 tevlfs5 kernel: Lustre: Skipped 18844 previous similar messages

 

Jun 10 16:42:25 tevlfs5 kernel: Lustre: lfstev-OST0004: Bulk IO read error with 232e0955-aa70-fd53-f988-7299ce54b534 (at 192.168.177.202@o2ib177), client will retry: rc -110

Jun 10 16:42:25 tevlfs5 kernel: Lustre: Skipped 32255 previous similar messages

Jun 10 16:42:25 tevlfs5 kernel: Lustre: lfstev-OST0004: Client 232e0955-aa70-fd53-f988-7299ce54b534 (at 192.168.177.202@o2ib177) reconnecting

Jun 10 16:42:25 tevlfs5 kernel: Lustre: Skipped 32255 previous similar messages

Jun 10 16:42:25 tevlfs5 kernel: LustreError: 29650:0:(ldlm_lib.c:3247:target_bulk_io()) @@@ Reconnect on bulk READ  req@ffff965e5d9c8850 x1635984221123728/t0(0) o3->232e0955-aa70-fd53-f988-7299ce54b534@192.168.177.202@o2ib177:5/0 lens 488/432 e 0 to 0 dl 1560202975 ref 1 fl Interpret:/2/0 rc 0/0

Jun 10 16:42:25 tevlfs5 kernel: LustreError: 29650:0:(ldlm_lib.c:3247:target_bulk_io()) Skipped 32227 previous similar messages

Jun 10 16:44:08 tevlfs5 kernel: Lustre: lfstev-OST0004: Connection restored to 5cef0352-ac9a-6592-6586-45468e615673 (at 192.168.177.202@o2ib177)

Jun 10 16:44:08 tevlfs5 kernel: Lustre: Skipped 37797 previous similar messages

Server errors, second server tevlfs6:

Jun 10 16:31:45 tevlfs6 kernel: Lustre: lfstev-OST0005: Connection restored to 24e6a956-37a9-6c64-bff4-98a232c29f9a (at 192.168.177.202@o2ib177)

Jun 10 16:32:18 tevlfs6 kernel: Lustre: lfstev-OST0005: Client 232e0955-aa70-fd53-f988-7299ce54b534 (at 192.168.177.202@o2ib177) reconnecting

Jun 10 16:32:18 tevlfs6 kernel: Lustre: Skipped 1638 previous similar messages

Jun 10 16:32:18 tevlfs6 kernel: Lustre: lfstev-OST0005: Connection restored to 24e6a956-37a9-6c64-bff4-98a232c29f9a (at 192.168.177.202@o2ib177)

Jun 10 16:32:18 tevlfs6 kernel: LustreError: 14016:0:(ldlm_lib.c:3197:target_bulk_io()) @@@ bulk READ failed: rc -107  req@ffff96bb291e4450 x1635984221122544/t0(0) o3->232e0955-aa70-fd53-f988-7299ce54b534@192.168.177.202@o2ib177:129/0 lens 488/432 e 0 to 0 dl 1560202344 ref 1 fl Interpret:/0/0 rc 0/0

Jun 10 16:32:18 tevlfs6 kernel: LustreError: 14016:0:(ldlm_lib.c:3197:target_bulk_io()) Skipped 122 previous similar messages

Jun 10 16:32:18 tevlfs6 kernel: Lustre: lfstev-OST0005: Bulk IO read error with 232e0955-aa70-fd53-f988-7299ce54b534 (at 192.168.177.202@o2ib177), client will retry: rc -107

Jun 10 16:32:18 tevlfs6 kernel: Lustre: Skipped 2 previous similar messages

Jun 10 16:32:18 tevlfs6 kernel: LustreError: 14006:0:(ldlm_lib.c:3247:target_bulk_io()) @@@ Reconnect on bulk READ  req@ffff96b9f53d4c50 x1635984221122656/t0(0) o3->232e0955-aa70-fd53-f988-7299ce54b534@192.168.177.202@o2ib177:129/0 lens 488/432 e 0 to 0 dl 1560202344 ref 1 fl Interpret:/0/0 rc 0/0

Jun 10 16:32:18 tevlfs6 kernel: LustreError: 14006:0:(ldlm_lib.c:3247:target_bulk_io()) Skipped 4709 previous similar messages

Jun 10 16:32:22 tevlfs6 kernel: Lustre: lfstev-OST0005: Connection restored to 24e6a956-37a9-6c64-bff4-98a232c29f9a (at 192.168.177.202@o2ib177)

Jun 10 16:32:22 tevlfs6 kernel: Lustre: Skipped 501 previous similar messages

Jun 10 16:32:30 tevlfs6 kernel: Lustre: lfstev-OST0005: Connection restored to 24e6a956-37a9-6c64-bff4-98a232c29f9a (at 192.168.177.202@o2ib177)

Jun 10 16:32:30 tevlfs6 kernel: Lustre: Skipped 1008 previous similar messages

Jun 10 16:32:46 tevlfs6 kernel: Lustre: lfstev-OST0005: Connection restored to 24e6a956-37a9-6c64-bff4-98a232c29f9a (at 192.168.177.202@o2ib177)

Jun 10 16:32:46 tevlfs6 kernel: Lustre: Skipped 2016 previous similar messages

Jun 10 16:32:55 tevlfs6 kernel: Lustre: lfstev-OST0005: Client 232e0955-aa70-fd53-f988-7299ce54b534 (at 192.168.177.202@o2ib177) reconnecting

Jun 10 16:32:55 tevlfs6 kernel: Lustre: Skipped 4713 previous similar messages

Jun 10 16:32:56 tevlfs6 kernel: LustreError: 14018:0:(ldlm_lib.c:3247:target_bulk_io()) @@@ Reconnect on bulk READ  req@ffff96bae7793050 x1635984221122096/t0(0) o3->232e0955-aa70-fd53-f988-7299ce54b534@192.168.177.202@o2ib177:190/0 lens 488/432 e 0 to 0 dl 1560202405 ref 1 fl Interpret:/2/0 rc 0/0

Jun 10 16:32:56 tevlfs6 kernel: LustreError: 14018:0:(ldlm_lib.c:3247:target_bulk_io()) Skipped 4691 previous similar messages

Jun 10 16:33:18 tevlfs6 kernel: Lustre: lfstev-OST0005: Connection restored to 24e6a956-37a9-6c64-bff4-98a232c29f9a (at 192.168.177.202@o2ib177)

Jun 10 16:33:18 tevlfs6 kernel: Lustre: Skipped 4032 previous similar messages

Jun 10 16:34:10 tevlfs6 kernel: Lustre: lfstev-OST0005: Client 232e0955-aa70-fd53-f988-7299ce54b534 (at 192.168.177.202@o2ib177) reconnecting

Jun 10 16:34:10 tevlfs6 kernel: Lustre: Skipped 9430 previous similar messages

Jun 10 16:34:11 tevlfs6 kernel: LustreError: 14018:0:(ldlm_lib.c:3247:target_bulk_io()) @@@ Reconnect on bulk READ  req@ffff96b998978850 x1635984221122096/t0(0) o3->232e0955-aa70-fd53-f988-7299ce54b534@192.168.177.202@o2ib177:265/0 lens 488/432 e 0 to 0 dl 1560202480 ref 1 fl Interpret:/2/0 rc 0/0

Jun 10 16:34:11 tevlfs6 kernel: LustreError: 14018:0:(ldlm_lib.c:3247:target_bulk_io()) Skipped 9449 previous similar messages

Jun 10 16:34:22 tevlfs6 kernel: Lustre: lfstev-OST0005: Connection restored to 24e6a956-37a9-6c64-bff4-98a232c29f9a (at 192.168.177.202@o2ib177)

Jun 10 16:34:22 tevlfs6 kernel: Lustre: Skipped 8064 previous similar messages
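
The status and rc values in the logs above (-90, -107, -110) are negated Linux errno codes. A small sketch for decoding them follows. The table is hard-coded from the generic Linux errno list so that it gives the same answer on any platform; the helper name `describe_rc` is made up for this example.

```python
# Linux errno values (asm-generic/errno.h) for the codes seen in these logs.
LINUX_ERRNO = {
    90: ("EMSGSIZE", "Message too long"),
    107: ("ENOTCONN", "Transport endpoint is not connected"),
    110: ("ETIMEDOUT", "Connection timed out"),
}


def describe_rc(rc):
    """Translate a negative kernel return code into its errno name and text."""
    name, text = LINUX_ERRNO.get(abs(rc), ("UNKNOWN", "not in this table"))
    return f"rc {rc}: {name} ({text})"


for rc in (-90, -107, -110):
    print(describe_rc(rc))
```

Read this way, -90 (EMSGSIZE) on the router and client is consistent with a transfer that needs more fragments than the connection allows, while -107 and -110 on the servers look like downstream effects of the failed bulk transfers.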

Both processes reading files were killed on the client host, yet ltop still reports data being transferred on the OSTs:

Filesystem: lfstev                                                    RECORDING

    Inodes:    169.254m total,      4.819m used (  3%),    164.435m free

     Space:     83.622t total,      7.306t used (  9%),     76.316t free

   Bytes/s: 0.231g read,       0.000g write,               252 IOPS

   MDops/s: 1 open,        0 close,       0 getattr, 0 setattr

                 0 link,        0 unlink,      0 mkdir,         0 rmdir

                 0 statfs, 0 rename,      0 getxattr

>OST S        OSS   Exp   CR rMB/s wMB/s  IOPS   LOCKS  LGR  LCR %cpu %mem %spc

0000   tevlfs1    70    0     0     0     0       0    0    0    0   69    9

0001   tevlfs2    70    0     0     0     0       0    0    0    1   69    9

0002   tevlfs3    70    0     0     0     0       0    0    0    1   88    8

0003   tevlfs4    70    0     0     0     0       0    0    0    0   88    9

0004   tevlfs5    71  126   118     0   126       0    0    0    0   90    9

0005   tevlfs6    71  126   118     0   126       0    0    0    1   70    9

The servers and router continue to report errors until the client host is rebooted:

Server:

Jun 10 16:41:40 tevlfs6 kernel: Lustre: lfstev-OST0005: Client 232e0955-aa70-fd53-f988-7299ce54b534 (at 192.168.177.202@o2ib177) reconnecting

Jun 10 16:41:40 tevlfs6 kernel: Lustre: Skipped 37786 previous similar messages

Jun 10 16:41:41 tevlfs6 kernel: LustreError: 14018:0:(ldlm_lib.c:3247:target_bulk_io()) @@@ Reconnect on bulk READ  req@ffff96bb07d49050 x1635984221122096/t0(0) o3->232e0955-aa70-fd53-f988-7299ce54b534@192.168.177.202@o2ib177:716/0 lens 488/432 e 0 to 0 dl 1560202931 ref 1 fl Interpret:/2/0 rc 0/0

Jun 10 16:41:41 tevlfs6 kernel: LustreError: 14018:0:(ldlm_lib.c:3247:target_bulk_io()) Skipped 37799 previous similar messages

Router errors:

Jun 10 16:51:41 newtevnfs kernel: LNetError: 4353:0:(o2iblnd_cb.c:1083:kiblnd_init_rdma()) Skipped 151177 previous similar messages

Jun 10 16:51:41 newtevnfs kernel: LNetError: 4353:0:(o2iblnd_cb.c:433:kiblnd_handle_rx()) Can't setup rdma for PUT to 192.168.177.202@o2ib177: -90

Jun 10 16:51:41 newtevnfs kernel: LNetError: 4353:0:(o2iblnd_cb.c:433:kiblnd_handle_rx()) Skipped 151180 previous similar messages

Jun 10 17:01:41 newtevnfs kernel: LNetError: 4355:0:(o2iblnd_cb.c:1083:kiblnd_init_rdma()) RDMA has too many fragments for peer_ni 192.168.177.202@o2ib177 (16), src idx/frags: 32/240 dst idx/frags: 0/1

Jun 10 17:01:41 newtevnfs kernel: LNetError: 4355:0:(o2iblnd_cb.c:1083:kiblnd_init_rdma()) Skipped 151165 previous similar messages

Jun 10 17:01:41 newtevnfs kernel: LNetError: 4355:0:(o2iblnd_cb.c:433:kiblnd_handle_rx()) Can't setup rdma for PUT to 192.168.177.202@o2ib177: -90

Jun 10 17:01:41 newtevnfs kernel: LNetError: 4355:0:(o2iblnd_cb.c:433:kiblnd_handle_rx()) Skipped 151172 previous similar messages

Jun 10 17:11:41 newtevnfs kernel: LNetError: 4354:0:(o2iblnd_cb.c:1083:kiblnd_init_rdma()) RDMA has too many fragments for peer_ni 192.168.177.202@o2ib177 (16), src idx/frags: 32/240 dst idx/frags: 0/1
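
The "Skipped N previous similar messages" lines give a rough idea of how fast the router is failing. The sketch below estimates a rate from two consecutive report times and one skipped count. It assumes the skipped count covers the whole interval since the previous report of the same message, and it assumes the year 2019 because syslog timestamps carry no year.

```python
from datetime import datetime


def suppressed_rate(prev_stamp, this_stamp, skipped, year=2019):
    """Messages per second implied by one 'Skipped N previous similar messages' line."""
    fmt = "%Y %b %d %H:%M:%S"
    start = datetime.strptime(f"{year} {prev_stamp}", fmt)
    end = datetime.strptime(f"{year} {this_stamp}", fmt)
    elapsed = (end - start).total_seconds()
    # +1 counts the message that was actually printed alongside the skipped ones.
    return (skipped + 1) / elapsed


# kiblnd_init_rdma() reports at 16:51:41 and 17:01:41, with 151165 skipped in between.
rate = suppressed_rate("Jun 10 16:51:41", "Jun 10 17:01:41", 151165)
print(f"{rate:.1f} kiblnd_init_rdma() errors per second")
```

Under these assumptions the router logs roughly 250 failures per second, long after the dd processes were killed. That is close to the 252 IOPS shown by ltop earlier, although the match may be coincidental.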

Comment by Alex Kulyavtsev [ 11/Jun/19 ]

Attached configuration and debug files:

$ tar tvf client.tgz

drwxr-xr-x  0 root   root        0 Jun 11 00:01 client/

-rw-------  0 root   root    65260 Jun 10 15:28 client/debug_kernel.2.out

drwxr-xr-x  0 root   root        0 Jun 11 00:02 client/etc/

drwxr-xr-x  0 root   root        0 Jun 11 00:03 client/etc/modprobe.d/

-rw-r--r--  0 root   root      183 Jun  3 18:04 client/etc/modprobe.d/ko2iblnd.conf

-rw-r--r--  0 root   root       88 Mar  7  2018 client/etc/modprobe.d/lustre.conf

-rw-r--r--  0 root   root      467 Jun  3 17:56 client/etc/lnet.conf

-rw-r--r--  0 root   root      450 Jun  3 17:07 client/etc/lnet_routes.conf

-rw-r--r--  0 root   root      903 Jun 10 15:43 client/ibstatus.out

-rw-r--r--  0 root   root    10678 Jun 10 15:08 client/lnetctl.export.out

-rw-r--r--  0 root   root     1422 Jun 10 15:42 client/ibstat.out

-rw-r--r--  0 root   root     2303 Jun 10 15:08 client/systool.lnet.out

-rw-r--r--  0 root   root     1992 Jun 10 15:08 client/systool.ko2iblnd.out

-rw-r--r--  0 root   root 70350426 Jun 10 17:02 client/debug_kernel.3.out

-rw-------  0 root   root   231798 Jun 10 15:15 client/debug_kernel.before.out

-rwxr-xr-x  0 root   root      165 Jun 10 16:30 client/read-two.sh

-rw-r--r--  0 root   root       75 Jun 10 15:38 client/rpms

router:

$ tar tvf router.tgz

drwxr-xr-x  0 root   root        0 Jun 11 00:01 router/

-rw-r--r--  0 root   root     2944 Jun 10 15:43 router/ibv_devinfo.out

drwxr-xr-x  0 root   root        0 Jun 10 23:57 router/etc/

drwxr-xr-x  0 root   root        0 Jun 10 23:58 router/etc/modprobe.d/

-rw-r--r--  0 root   root      252 Jun  4 17:35 router/etc/modprobe.d/ko2iblnd.conf

-rw-r--r--  0 root   root      142 Mar 12  2018 router/etc/modprobe.d/lustre.conf

-rw-r--r--  0 root   root      317 Aug  1  2018 router/etc/lnet.conf

-rw-r--r--  0 root   root      406 Feb 14 13:27 router/etc/lnet_routes.conf

-rw-r--r--  0 root   root      451 Jun 10 15:43 router/ibstatus.out

-rw-r--r--  0 root   root    10493 Jun 10 15:35 router/lnetctl.export.out

-rw-r--r--  0 root   root      705 Jun 10 15:42 router/ibstat.out

-rw-r--r--  0 root   root     1987 Jun 10 15:34 router/systool.lnet.out

-rw-r--r--  0 root   root     1884 Jun 10 15:34 router/systool.ko2iblnd.out

-rw-r--r--  0 root   root 11576985 Jun 10 17:06 router/debug_kernel.3.out

-rw-r--r--  0 root   root       73 Jun 10 15:39 router/rpms

Server:

$ tar tvf server.tgz

drwxr-xr-x  0 root   root        0 Jun 11 00:06 server/

-rw-r--r--  0 root   root     2087 Jun 10 15:43 server/ibv_devinfo.out

drwxr-xr-x  0 root   root        0 Jun 11 00:05 server/etc/

drwxr-xr-x  0 root   root        0 Jun 11 00:04 server/etc/modprobe.d/

-rw-r--r--  0 root   root      252 Jun  4 17:35 server/etc/modprobe.d/ko2iblnd.conf

-rw-r--r--  0 root   root      142 Mar 12  2018 server/etc/modprobe.d/lustre.conf

-rw-r--r--  0 root   root      127 Jul 31  2018 server/etc/lnet.conf

-rw-r--r--  0 root   root      406 Dec 21 16:45 server/etc/lnet_routes.conf

-rw-r--r--  0 root   root      223 Jun 10 15:42 server/ibstatus.out

-rw-r--r--  0 root   root    31618 Jun 10 15:33 server/lnetctl.export.out

-rw-r--r--  0 root   root      351 Jun 10 15:42 server/ibstat.out

-rw-r--r--  0 root   root     2113 Jun 10 15:33 server/systool.lnet.out

-rw-r--r--  0 root   root     2010 Jun 10 15:33 server/systool.ko2iblnd.out

-rw-r--r--  0 root   root 17240520 Jun 10 17:05 server/debug_kernel.3.out

-rw-r--r--  0 root   root      324 Jun 10 15:39 server/rpms

Comment by Alex Kulyavtsev [ 14/Jun/19 ]

  map_on_demand=16 - on client with 64KB page

  map_on_demand=256 - on x86_64 (4KB page) all servers and router.
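
One way to read these two values is as the number of pages needed to describe a single bulk transfer. The sketch below assumes a 1 MiB transfer, which matches the dd block size used in the test but is otherwise an assumption, and the function name is made up for illustration.

```python
def fragments_per_bulk(bulk_bytes, page_bytes):
    """Pages (fragments) needed to describe one bulk transfer, rounding up."""
    return -(-bulk_bytes // page_bytes)


ONE_MIB = 1024 * 1024  # assumed bulk transfer size

for label, page in (
    ("ppc64le client, 64KB pages", 64 * 1024),
    ("x86_64 servers and router, 4KB pages", 4 * 1024),
):
    print(f"{label}: {fragments_per_bulk(ONE_MIB, page)} fragments per 1 MiB")
```

Under that assumption the 64KB-page client needs 16 fragments and the 4KB-page hosts need 256, which lines up with the two map_on_demand settings.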

Comment by Alex Kulyavtsev [ 14/Jun/19 ]

Your router needs wrq_sge=2

Comment by James A Simmons [ 23/Jun/19 ]

Is everything working now for you?

Comment by James A Simmons [ 10/Jul/19 ]

Does moving to 2.12 fix everything for you? Can this ticket be closed?

Comment by James A Simmons [ 22/Jul/19 ]

64K page size will not be supported for 2.10. Moving to 2.12 is the answer.
