[LU-11972] Lustre IB client always hangs when memory size is smaller than 40GB Created: 15/Feb/19  Updated: 01/Apr/19  Resolved: 01/Apr/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Question/Request Priority: Minor
Reporter: sebg-crd-pm (Inactive) Assignee: Peter Jones
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

[VM]
CentOS Linux release 7.5.1804 (Core)
3.10.0-862.14.4.el7.x86_64
CPU: Intel Xeon Processor (Skylake) 6 cores
MemTotal : 40GB
mlx5_core: 4.4-2.0.7 / 4.5-1.0.1.0
Lustre : 2.10.5 / 2.10.6


Attachments: dbg1.tgz, ib_err.txt

 Description   

I created a VM with an InfiniBand (IB) interface and mounted the Lustre client successfully.

We tested Lustre client I/O in the VM (dd if=/dev/zero of=/mnt/lustre/testfile bs=1M), but the Lustre client over IB always hangs when the VM has only 40 GB of memory.

The issue cannot be reproduced when the VM has 80 GB of memory.

We tested with mlx5_core 4.4-2.0.7 and 4.5-1.0.1.0, and with Lustre 2.10.5 and 2.10.6. All of them show the same problem when the VM has only 40 GB of memory.

The VM syslog prints these messages (see the attached file ib_err.txt):
mlx5_0:dump_cqe:285:(pid 1854): dump error cqe

.....

LustreError: 1854:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff8bafb2976400
Jan 21 17:15:14 slurm-vm-1 kernel: Lustre: 1860:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error:
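
To check whether another syslog shows the same pattern, the three signatures quoted above can be counted with a small filter. This is only a sketch: the function name `scan_ib_errors` is hypothetical, and the search strings are taken verbatim from the excerpt above.

```shell
#!/bin/sh
# Count how often each error signature from this ticket appears in a log file.
# scan_ib_errors is an illustrative helper name, not part of Lustre or MLNX_OFED.
scan_ib_errors() {
    logfile="$1"
    for pattern in 'dump error cqe' 'client_bulk_callback' 'ptlrpc_expire_one_request'; do
        # grep -c prints the number of matching lines; it exits non-zero
        # when there are no matches, so ignore the exit status.
        count=$(grep -c -- "$pattern" "$logfile" || true)
        printf '%s: %s\n' "$pattern" "$count"
    done
}

# Run only when a log path is given on the command line.
if [ "$#" -gt 0 ]; then
    scan_ib_errors "$1"
fi
```

On CentOS 7 the kernel messages normally land in /var/log/messages, so a typical call would pass that path or the attached ib_err.txt as the argument.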

 

We received the following response from Mellanox:

>>Our RnD review the syslog and checked code, they give conclusion below, FYI. I think this issue is related with uplevel Lustre design, you can open defect for their community to fix.
>>Parsing CQE shows that it is  local protection error – ERR_EXE_BIND_GAHER_TPT.
>>If an application gets a local protection error, in most cases it means that it is using wrong/insufficient mkey in the WR.
>>In this case the application is Lustre - which is open source and maintained by the community, not Mellanox.

 



 Comments   
Comment by Patrick Farrell (Inactive) [ 15/Feb/19 ]

In order to understand the error, we'd like to get some Lustre debug logs with appropriate tracing.

Please run these commands on the client:

lctl set_param debug=+rpctrace; lctl set_param debug=+net; lctl clear

lctl mark "debug start"

Run your dd test:

dd if=/dev/zero of=/mnt/lustre/testfile bs=1M

lctl mark "debug finish"

lctl set_param debug=-rpctrace; lctl set_param debug=-net

Write out the log file:

lctl dk > /tmp/log

 

Please attach the log file to this ticket (you may need to compress it first).  This will give us more info to go on.
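
The commands above can also be collected into one script, which makes repeated runs easier. This is a rough sketch, not a tested procedure: the `DRY_RUN`, `TESTFILE`, `LOGFILE`, and `COUNT` variables are illustrative names that do not come from the ticket, and the `count=` limit on dd is an added assumption so the write stops on its own. By default the script only prints the commands; set `DRY_RUN=0` to execute them as root on the client.

```shell
#!/bin/sh
# Collect Lustre debug logs around one dd run.
# DRY_RUN=1 (the default here) prints each command instead of running it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

collect_lustre_debug() {
    testfile="${TESTFILE:-/mnt/lustre/testfile}"
    logfile="${LOGFILE:-/tmp/log}"
    count="${COUNT:-1024}"   # assumption: the dd in the ticket has no count

    # Enable RPC and network tracing, then clear the debug buffer.
    run lctl set_param debug=+rpctrace
    run lctl set_param debug=+net
    run lctl clear

    run lctl mark "debug start"
    run dd if=/dev/zero of="$testfile" bs=1M count="$count"
    run lctl mark "debug finish"

    # Restore the debug mask and dump the kernel debug buffer to a file.
    run lctl set_param debug=-rpctrace
    run lctl set_param debug=-net
    run lctl dk "$logfile"
}

collect_lustre_debug
```

The script deliberately does not stop on errors, so the debug buffer is still dumped if dd fails. If dd hangs outright, the dump has to be taken from a second shell with `lctl dk`.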

Comment by sebg-crd-pm (Inactive) [ 20/Feb/19 ]

Update:

1. This issue now also occurs when the VM memory size is 80 GB.

2. It seems easier to reproduce after we added an LNet router node.

3. The log is attached: dbg1.tgz

Comment by sebg-crd-pm (Inactive) [ 20/Feb/19 ]

dbg1.tgz

Comment by sebg-crd-pm (Inactive) [ 26/Feb/19 ]

Any suggestions?

Comment by sebg-crd-pm (Inactive) [ 01/Apr/19 ]

The issue cannot be reproduced on another server.

It looks like a hardware issue, so you can close this ticket. Thanks.
