Details
Type: Question/Request
Resolution: Cannot Reproduce
Priority: Minor
[VM]
CentOS Linux release 7.5.1804 (Core)
3.10.0-862.14.4.el7.x86_64
CPU: Intel Xeon Processor (Skylake) 6 cores
MemTotal: 40GB
mlx5_core: 4.4-2.0.7 / 4.5-1.0.1.0
Lustre: 2.10.5 / 2.10.6
Description
I created a VM with an InfiniBand (IB) interface, and the Lustre client mounts successfully.
We tested Lustre client I/O from inside the VM (dd if=/dev/zero of=/mnt/lustre/testfile bs=1M).
The Lustre client always hangs over IB when the VM has only 40GB of memory.
The issue cannot be reproduced when the VM has 80GB of memory.
We tested with mlx5_core 4.4-2.0.7 and 4.5-1.0.1.0, and with Lustre 2.10.5 and 2.10.6;
every version shows the same problem when the VM has only 40GB of memory.
The VM syslog prints these messages (see the attached file ib_err.txt):
mlx5_0:dump_cqe:285:(pid 1854): dump error cqe
.....
LustreError: 1854:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff8bafb2976400
Jan 21 17:15:14 slurm-vm-1 kernel: Lustre: 1860:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error:
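
To compare the 40GB and 80GB runs, it may help to count how often each of these messages appears in a captured syslog. The following is a minimal sketch, not part of the original report: the regular expressions are derived only from the three lines quoted above, and the label names and the `scan_file` helper are hypothetical. Status -5 corresponds to -EIO on Linux.

```python
import re
from collections import Counter

# Patterns derived from the log excerpt in this ticket.
# The dictionary keys are illustrative labels, not Lustre or Mellanox terminology.
SIGNATURES = {
    "mlx5_error_cqe": re.compile(
        r"mlx5_\d+:dump_cqe:\d+:\(pid \d+\): dump error cqe"
    ),
    # status -5 is -EIO on Linux
    "lustre_bulk_callback_eio": re.compile(
        r"client_bulk_callback\(\)\) event type \d+, status -5"
    ),
    "lustre_request_network_error": re.compile(
        r"ptlrpc_expire_one_request\(\)\).*failed due to network error"
    ),
}


def count_signatures(lines):
    """Return a Counter of how many lines match each known signature."""
    counts = Counter()
    for line in lines:
        for label, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[label] += 1
    return counts


def scan_file(path):
    """Count signatures in a log file; undecodable bytes are replaced."""
    with open(path, errors="replace") as handle:
        return count_signatures(handle)
```

Calling `scan_file("ib_err.txt")` on the attached log, or on `/var/log/messages` from each VM size, gives per-signature counts that can be compared side by side.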
We received the following response from Mellanox:
>>Our R&D team reviewed the syslog and checked the code; their conclusion is below, FYI. I think
>>this issue is related to the upper-level Lustre design; you can open a defect with their community to get it fixed.
>>Parsing CQE shows that it is local protection error – ERR_EXE_BIND_GAHER_TPT.
>>If an application gets a local protection error, in most cases it means that it is using wrong/insufficient mkey in the WR.
>>In this case the application is Lustre - which is open source and maintained by the community, not Mellanox.