Details
-
Bug
-
Resolution: Fixed
-
Minor
-
Lustre 2.1.0, Lustre 2.3.0
-
None
-
RHEL6.1 between a PPC64 node and an x86_64 node, QDR infiniband, o2iblnd
-
3
-
4453
Description
Running the simple lnet-selftest script below, the "lst add_test" line fails with:
add test RPC failed on 12345-172.20.203.24@o2ib1: Unknown error 18446744073709551506
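The "Unknown error" value above looks like a negative return code printed as an unsigned 64-bit integer. A minimal sketch of decoding it follows; the helper name `as_signed64` is hypothetical and this is not part of lst. It assumes the value is a two's-complement errno. On Linux, errno 110 is ETIMEDOUT.

```python
import errno


def as_signed64(value: int) -> int:
    """Reinterpret an unsigned 64-bit value as a two's-complement signed one."""
    value &= (1 << 64) - 1
    return value - (1 << 64) if value >= (1 << 63) else value


# Value printed by the failing "lst add_test" command in this report.
reported = 18446744073709551506
code = as_signed64(reported)
print(code)  # -110

# The symbolic name depends on the errno table of the platform running this.
print(errno.errorcode.get(-code, "unknown"))
```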
I am running the script on the "server1" node, which is an x86_64 RHEL6.1 system. The "ion" node is a ppc64 system, also running RHEL6.1. Note that ppc64 now has a 64k page size by default.
server1 has this message on the console:
LustreError: 21942:0:(lib-move.c:110:lnet_try_match_md()) Matching packet from 12345-172.20.203.24@o2ib1, match 11222407321460450620 length 65536 too big: 4096 left, 4096 allowed
and the ion has this on the console:
LustreError: 14210:0:(framework.c:1298:sfw_bulk_ready()) Bulk transfer failed for RPC: service test service, peer 12345-172.20.250.1@o2ib1, status -61
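The two console messages are consistent with a page-size mismatch between the nodes: the rejected length (65536) equals the 64k ppc64 page size, and the allowed length (4096) equals the usual x86_64 page size. The sketch below only illustrates that arithmetic for the 1M transfer requested in the script. It is an assumption about the cause, not lnet-selftest code, and the function name `fragments` is hypothetical.

```python
def fragments(transfer_bytes: int, page_bytes: int) -> int:
    """Number of page-sized fragments needed to cover a transfer (ceiling division)."""
    return -(-transfer_bytes // page_bytes)


TRANSFER = 1 << 20       # size=1M from the script
PPC64_PAGE = 64 * 1024   # 64k page size noted in the report
X86_64_PAGE = 4096       # assumed x86_64 page size, matching "4096 allowed"

print(fragments(TRANSFER, PPC64_PAGE))   # 16
print(fragments(TRANSFER, X86_64_PAGE))  # 256

# A single ppc64-page-sized fragment does not fit in a 4096-byte buffer,
# which matches "length 65536 too big: 4096 left, 4096 allowed".
print(PPC64_PAGE > X86_64_PAGE)          # True
```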
I have attached lustre kernel logs with "+ net rpctrace" added.
Script:
lst new_session read/write
lst add_group ion 172.20.203.24@o2ib1
lst add_group server1 172.20.250.1@o2ib1
lst add_batch bulk_rw
lst add_test --batch bulk_rw --concurrency 16 --from ion --to server1 brw write size=1M
lst run bulk_rw
lst stat ion & sleep 30; kill $!
lst end_session