[LU-5950] Clients fail to connect following OSS failover in file system with 8 MDSs Created: 24/Nov/14  Updated: 05/Dec/14  Resolved: 05/Dec/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0, Lustre 2.5.2
Fix Version/s: Lustre 2.7.0

Type: Bug Priority: Major
Reporter: Alexander Boyko Assignee: Mikhail Pershin
Resolution: Fixed Votes: 0
Labels: patch
Environment:

options lnet ip2nets="o2ib8(ib0) 10.150.10.[1-12]; o2ib8000(ib0) 10.150.10.[1-8]; o2ib8002(ib0:1) 10.151.10.[9-12];"


Severity: 3
Rank (Obsolete): 16610

 Description   

The OSS host has both 10.150 and 10.151 IP addresses. The 10.150 address is assigned to the IB NIC, while the 10.151 address is a virtual interface.
The o2ib8 network is used for communication between OSS and MDS nodes. MDS nodes are on o2ib8000 (10.150..). OSS nodes are on o2ib8002 (10.151..).

oss001 kernel: LNet: Added LNI 10.150.10.9@o2ib8 [126/4032/0/0]
oss001 kernel: LNet: Added LNI 10.151.10.9@o2ib8002 [126/4032/0/0]
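
The two LNI lines above are consistent with the ip2nets rules listed under Environment. As a rough illustration of why oss001 ends up with one NID on o2ib8 and another on o2ib8002, here is a simplified matcher. It is only a sketch: the helper names (parse_ip2nets, octet_matches, nets_for_ip) are made up for this example, it handles only the rule syntax that appears in this ticket, and it is not the LNet parser.

```python
import re

# The ip2nets value from the Environment section of this ticket.
IP2NETS = ("o2ib8(ib0) 10.150.10.[1-12]; "
           "o2ib8000(ib0) 10.150.10.[1-8]; "
           "o2ib8002(ib0:1) 10.151.10.[9-12];")


def parse_ip2nets(spec):
    """Split the spec into (net, interface, ip_pattern) tuples."""
    rules = []
    for rule in spec.split(";"):
        rule = rule.strip()
        if not rule:
            continue
        m = re.fullmatch(r"(\w+)\(([^)]*)\)\s+(\S+)", rule)
        if m is None:
            raise ValueError("unsupported rule: %r" % rule)
        rules.append(m.groups())
    return rules


def octet_matches(pattern, octet):
    """Match one octet against a number, '*', or a '[lo-hi]' range."""
    if pattern == "*":
        return True
    m = re.fullmatch(r"\[(\d+)-(\d+)\]", pattern)
    if m:
        return int(m.group(1)) <= octet <= int(m.group(2))
    return pattern.isdigit() and int(pattern) == octet


def nets_for_ip(rules, ip):
    """Return the networks whose IP pattern matches the given address."""
    octets = [int(o) for o in ip.split(".")]
    matched = []
    for net, _iface, pattern in rules:
        parts = pattern.split(".")
        if len(parts) == 4 and all(
                octet_matches(p, o) for p, o in zip(parts, octets)):
            matched.append(net)
    return matched


rules = parse_ip2nets(IP2NETS)
print(nets_for_ip(rules, "10.150.10.9"))   # IB NIC address of oss001
print(nets_for_ip(rules, "10.151.10.9"))   # virtual interface of oss001
```

Under these assumptions the IB NIC address 10.150.10.9 matches only the o2ib8 rule and the virtual address 10.151.10.9 matches only o2ib8002, so the same host is known by a different address on each network.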

The client and MDS nodes fail to complete recovery.
The compute node clients never make the transition to oss002. Throughout the recovery period and afterwards, they continue to send RPCs to oss001. These requests time out (ptlrpc_expire_one_request).
Client errors:

2014-10-29T17:08:36.465005-05:00 c0-0c1s13n3 LustreError: 6909:0:(mgc_request.c:1488:mgc_apply_recover_logs()) mgc: cannot find uuid by nid 10.150.10.10@o2ib8
2014-10-29T17:08:36.465055-05:00 c0-0c1s13n3 Lustre: 6909:0:(mgc_request.c:1649:mgc_process_recover_log()) Process recover log esfprod-cliir error -2

The main problem is that a node has different network addresses (NIDs) on different networks, and the Lustre config-log processing is missing the functionality to handle this.
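
The failure mode can be sketched as a lookup that gives up after the first NID. The following is a hypothetical illustration only, not the code from the patch: the function names, the UUID string, and the second NID (10.151.10.10@o2ib8002) are assumptions made for this example. The only NID taken from the ticket is 10.150.10.10@o2ib8, from the client error above. The idea of trying each NID in turn is inferred from the patch subject "mgc: add nid iteration".

```python
# Hypothetical client-side table: the client knows the failover OSS only
# by its NID on the OSS network (assumed address and UUID).
KNOWN_UUID_BY_NID = {
    "10.151.10.10@o2ib8002": "hypothetical-oss002-UUID",
}

# NIDs advertised for the same node; the first one is the NID from the
# client error message, the second is assumed for illustration.
NODE_NIDS = ["10.150.10.10@o2ib8", "10.151.10.10@o2ib8002"]


def find_uuid_first_nid_only(known, nids):
    """Look up the UUID using only the first NID (the failing behaviour)."""
    uuid = known.get(nids[0])
    if uuid is None:
        raise LookupError("cannot find uuid by nid %s" % nids[0])
    return uuid


def find_uuid_any_nid(known, nids):
    """Try every NID of the node until one is known to the client."""
    for nid in nids:
        uuid = known.get(nid)
        if uuid is not None:
            return uuid
    raise LookupError("cannot find uuid by any nid of %s" % ", ".join(nids))


try:
    find_uuid_first_nid_only(KNOWN_UUID_BY_NID, NODE_NIDS)
except LookupError as err:
    print("first-NID lookup failed:", err)

print("iterating lookup found:", find_uuid_any_nid(KNOWN_UUID_BY_NID, NODE_NIDS))
```

With a single-NID lookup the client reports the same kind of "cannot find uuid by nid" error seen in the log, while iterating over all NIDs of the node finds the entry registered under the other network.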



 Comments   
Comment by Gerrit Updater [ 24/Nov/14 ]

Alexander Boyko (alexander.boyko@seagate.com) uploaded a new patch: http://review.whamcloud.com/12829
Subject: LU-5950 mgc: add nid iteration
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 35a45d2e5b843a6bfb27cfabbb9f24d8611b1b01

Comment by Alexander Boyko [ 24/Nov/14 ]

fix http://review.whamcloud.com/12829
Xyratex-bug-id: MRP-2255

Comment by Gerrit Updater [ 04/Dec/14 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12829/
Subject: LU-5950 mgc: add nid iteration
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 227ed3d87354cc1122343ef4f4e931960d1fe276

Comment by Andreas Dilger [ 05/Dec/14 ]

Patch landed for 2.7.0.
