[LU-5950] Clients fail to connect following OSS failover in file system with 8 MDSs - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Major
Fix Version/s: Lustre 2.7.0
Affects Version/s: Lustre 2.7.0, Lustre 2.5.2
Labels: patch
Environment:
options lnet ip2nets="o2ib8(ib0) 10.150.10.[1-12]; o2ib8000(ib0) 10.150.10.[1-8]; o2ib8002(ib0:1) 10.151.10.[9-12];"

Severity: 3
Rank (Obsolete): 16610

Description

The OSS host has both 10.150 and 10.151 IP addresses. The 10.150 address is assigned to the IB NIC, while the 10.151 address is a virtual interface.
The o2ib8 network is used for communication between OSS and MDS nodes. MDS nodes are on o2ib8000 (10.150..). OSS nodes are on o2ib8002 (10.151..).

oss001 kernel: LNet: Added LNI 10.150.10.9@o2ib8 [126/4032/0/0]
oss001 kernel: LNet: Added LNI 10.151.10.9@o2ib8002 [126/4032/0/0]
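A rough sketch of how the ip2nets rules from the Environment field produce the two LNIs logged above. This is a simplified model for illustration only, not LNet's actual parser: it handles just the `[a-b]` range syntax used in this ticket, and the helper names (`parse_rules`, `octet_matches`, `nids_for`) are hypothetical. It also assumes the NID is built from whichever local IP matched the rule.

```python
import re

# ip2nets string copied from the Environment field of this ticket.
IP2NETS = ("o2ib8(ib0) 10.150.10.[1-12]; "
           "o2ib8000(ib0) 10.150.10.[1-8]; "
           "o2ib8002(ib0:1) 10.151.10.[9-12];")


def parse_rules(spec):
    """Split the spec into (network, ip_pattern) pairs, dropping the interface."""
    rules = []
    for entry in spec.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        net_part, pattern = entry.split()
        rules.append((net_part.split("(")[0], pattern))
    return rules


def octet_matches(pattern, octet):
    """Match one octet against a literal or an inclusive [lo-hi] range."""
    m = re.fullmatch(r"\[(\d+)-(\d+)\]", pattern)
    if m:
        return int(m.group(1)) <= int(octet) <= int(m.group(2))
    return pattern == octet


def nids_for(ips, spec=IP2NETS):
    """Return the NIDs a host with the given local IPs would get (simplified)."""
    nids = []
    for net, pattern in parse_rules(spec):
        for ip in ips:
            pairs = zip(pattern.split("."), ip.split("."))
            if all(octet_matches(p, o) for p, o in pairs):
                nids.append(f"{ip}@{net}")
    return nids


# oss001 holds both addresses, so it ends up with one NID on each network.
print(nids_for(["10.150.10.9", "10.151.10.9"]))
```

Under these assumptions, a host in the 10.150.10.[1-8] range would land on o2ib8 and o2ib8000, while an OSS with a .9 to .12 address lands on o2ib8 and o2ib8002, so the same node is known by different NIDs depending on the network.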

The client and MDS nodes fail to complete recovery.
The compute node clients never make the transition to oss002. Throughout the recovery period and after, they continue to send RPCs to oss001. These requests time out (ptlrpc_expire_one_request).
Client errors:

2014-10-29T17:08:36.465005-05:00 c0-0c1s13n3 LustreError: 6909:0:(mgc_request.c:1488:mgc_apply_recover_logs()) mgc: cannot find uuid by nid 10.150.10.10@o2ib8
2014-10-29T17:08:36.465055-05:00 c0-0c1s13n3 Lustre: 6909:0:(mgc_request.c:1649:mgc_process_recover_log()) Process recover log esfprod-cliir error -2
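For readers unfamiliar with the "error -2" in the second message: Lustre kernel code reports failures as negative errno values. The short sketch below decodes such a value using Python's standard `errno` module; the helper name `decode_kernel_rc` is hypothetical and only serves this illustration.

```python
import errno


def decode_kernel_rc(rc):
    """Map a negative kernel-style return code to its symbolic errno name."""
    return errno.errorcode.get(-rc, "UNKNOWN")


# -2 corresponds to ENOENT, consistent with the failed UUID lookup by NID.
print(decode_kernel_rc(-2))
```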

The main problem is that a node has different network addresses on different networks, and the functionality to handle this is missing from Lustre's config processing.

People

Assignee: Mikhail Pershin

Reporter: Alexander Boyko

Votes: 0

Watchers: 5

Dates

Created: 24/Nov/14 11:11 AM

Updated: 05/Dec/14 6:54 PM

Resolved: 05/Dec/14 6:54 PM