[LU-6723] Setting map_on_demand for o2iblnd driver prevents lustre bring up. Created: 15/Jun/15  Updated: 16/Dec/15  Resolved: 16/Dec/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0, Lustre 2.8.0, Lustre 2.5.4
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James A Simmons Assignee: Amir Shehata (Inactive)
Resolution: Won't Fix Votes: 0
Labels: lnet
Environment:

Cray routers running SLES11 SP3. This issue was found to exist in all Lustre versions.


Issue Links:
Duplicate
is duplicated by LU-6748 excessive client reconnect to OSS ser... Resolved
Related
is related to LU-6748 excessive client reconnect to OSS ser... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

While testing map_on_demand with the patch from LU-3322, I discovered that setting map_on_demand to any value on our Cray routers prevented Lustre from functioning. This looks like a bug in the o2iblnd driver which only shows up on our Cray nodes.



 Comments   
Comment by James A Simmons [ 15/Jun/15 ]

As a note, this also happened when the patch from LU-3322 was not applied.

Comment by Andreas Dilger [ 16/Jun/15 ]

James, could you please provide a bit more information about what you mean by "Cray routers prevented Lustre from functioning"? Any errors in the logs? Does "lctl ping" work? Does the IB-level network testing still work?

Is Cray running a customized OFED? It may be that this isn't a Lustre/LNet problem at all.

Comment by James A Simmons [ 17/Jun/15 ]

These are the errors that appeared on the OSS nodes when I enabled map_on_demand on the Cray routers.

00000020:02000400:10.0:1433178825.928974:0:28309:0:(tgt_handler.c:1834:tgt_brw_read()) sultan-OST0034: Bulk IO read error with b9cf5051-0ff9-6cf9-cd67-9364a2516176 (at 30@gni1), client will retry: rc -110
00000020:00000001:10.0:1433178825.947041:0:28309:0:(tgt_handler.c:1851:tgt_brw_read()) Process leaving (rc=18446744073709551506 : -110 : ffffffffffffff92)
00010000:00000080:10.0:1433178825.947043:0:28309:0:(ldlm_lib.c:2427:target_committed_to_req()) @@@ not sending l

The Cray routers are using the mlx5 driver from the OFED 3.12 stack. To find out what the problem is, I need to collect logs from the routers so we know what is really going on. The OSS bulk timeouts are a symptom of the real problem.
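
For readers decoding the "Process leaving" line in the log excerpt above: the three values printed for rc (18446744073709551506, -110, ffffffffffffff92) are the same 64-bit quantity shown as unsigned decimal, signed decimal, and hex. A minimal Python sketch of that conversion follows. The helper name decode_rc and the small errno table are illustrative assumptions, not part of Lustre; the table is hard-coded with the Linux value because Python's errno module reflects the host operating system.

```python
def decode_rc(unsigned_rc: int, bits: int = 64) -> int:
    """Reinterpret an unsigned value as a two's-complement signed integer."""
    if unsigned_rc >= 1 << (bits - 1):
        return unsigned_rc - (1 << bits)
    return unsigned_rc


# Assumption: Linux errno numbering, where 110 is ETIMEDOUT.
LINUX_ERRNO_NAMES = {110: "ETIMEDOUT"}

raw = 18446744073709551506  # value copied from the log line
rc = decode_rc(raw)
print(rc, hex(raw), LINUX_ERRNO_NAMES.get(-rc, "unknown"))
# prints: -110 0xffffffffffffff92 ETIMEDOUT
```

Read this way, rc -110 is a timeout on the bulk transfer, which fits the statement that the OSS errors are a symptom rather than the cause.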

Comment by James A Simmons [ 22/Jun/15 ]

As a small note, the OSS that also had problems when map_on_demand was enabled was running RHEL6.5 with the default distro InfiniBand stack. So it is not an InfiniBand issue.

Comment by Jian Yu [ 08/Oct/15 ]

Hi James,
Does the issue in this ticket still exist?

Comment by James A Simmons [ 08/Oct/15 ]

I haven't tried in a while. Will do.

Comment by James A Simmons [ 16/Dec/15 ]

Just tried it. Now that the OFED stack has been updated to a newer 3.12, the mlx5 driver no longer supports FMR, so this issue has gone away. I will be trying the LU-5783 work very soon on our Cray routers to see if I hit memory issues. In that case the bugs can be reported under that ticket. You can close this ticket.

Comment by Jian Yu [ 16/Dec/15 ]

Thank you James.

Generated at Sat Feb 10 02:02:41 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.