Details
-
Bug
-
Resolution: Fixed
-
Major
-
Lustre 2.7.0, Lustre 2.5.2
-
options lnet ip2nets="o2ib8(ib0) 10.150.10.[1-12]; o2ib8000(ib0) 10.150.10.[1-8]; o2ib8002(ib0:1) 10.151.10.[9-12];"
-
3
-
16610
Description
The OSS host has both 10.150 and 10.151 IP addresses. The 10.150 address is assigned to the IB NIC itself, while the 10.151 address is on a virtual interface.
The o2ib8 network is used for communication between OSS and MDS nodes. MDS nodes are on o2ib8000 (10.150..). OSS nodes are on o2ib8002 (10.151..).
oss001 kernel: LNet: Added LNI 10.150.10.9@o2ib8 [126/4032/0/0]
oss001 kernel: LNet: Added LNI 10.151.10.9@o2ib8002 [126/4032/0/0]
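To make the mapping from the ip2nets option to the two LNIs on oss001 easier to follow, here is a minimal Python sketch of the matching. It is a simplified model, not LNet's parser: it assumes each rule is a network, an interface in parentheses, and a single dotted IP pattern whose octets are literals, `*`, or a `[lo-hi]` range. The function names are hypothetical.

```python
import re

# The ip2nets value from the Environment field of this ticket.
IP2NETS = ("o2ib8(ib0) 10.150.10.[1-12]; "
           "o2ib8000(ib0) 10.150.10.[1-8]; "
           "o2ib8002(ib0:1) 10.151.10.[9-12];")


def parse_ip2nets(spec):
    """Split an ip2nets string into (network, interface, ip_pattern) rules."""
    rules = []
    for entry in spec.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        net_part, pattern = entry.split()
        m = re.fullmatch(r"(\w+)\((.+)\)", net_part)
        rules.append((m.group(1), m.group(2), pattern))
    return rules


def octet_matches(pattern, value):
    """Match one octet against a literal, '*', or a '[lo-hi]' range."""
    if pattern == "*":
        return True
    m = re.fullmatch(r"\[(\d+)-(\d+)\]", pattern)
    if m:
        return int(m.group(1)) <= int(value) <= int(m.group(2))
    return pattern == value


def ip_matches(pattern, ip):
    """Match a dotted IPv4 address against a dotted pattern, octet by octet."""
    p, v = pattern.split("."), ip.split(".")
    return len(p) == len(v) and all(octet_matches(a, b) for a, b in zip(p, v))


def networks_for(spec, local_ips):
    """Return every network whose pattern matches at least one local IP."""
    return [f"{net}({iface})"
            for net, iface, pattern in parse_ip2nets(spec)
            if any(ip_matches(pattern, ip) for ip in local_ips)]


# oss001 holds 10.150.10.9 (IB NIC) and 10.151.10.9 (virtual interface).
print(networks_for(IP2NETS, ["10.150.10.9", "10.151.10.9"]))
```

Under these assumptions the OSS ends up on o2ib8 and o2ib8002 but not on o2ib8000, which agrees with the two "Added LNI" lines above.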
The client and MDS nodes fail to complete recovery.
The compute node clients never make the transition to oss002. Throughout the recovery period and after, they continue to send RPCs to oss001. These requests time out (ptlrpc_expire_one_request).
Client errors:
2014-10-29T17:08:36.465005-05:00 c0-0c1s13n3 LustreError: 6909:0:(mgc_request.c:1488:mgc_apply_recover_logs()) mgc: cannot find uuid by nid 10.150.10.10@o2ib8
2014-10-29T17:08:36.465055-05:00 c0-0c1s13n3 Lustre: 6909:0:(mgc_request.c:1649:mgc_process_recover_log()) Process recover log esfprod-cliir error -2
The main problem is that a node has several different network addresses (NIDs), and the Lustre config-processing code is missing the functionality to handle that.
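One possible reading of the client error is sketched below as a toy model in Python. It is not Lustre code, and it does not claim to reproduce what mgc_apply_recover_logs() does. It assumes the client knows oss002 only by its o2ib8002 NID, while the recovery log entry leads with the o2ib8 NID. The UUID name, the table layout, and the 10.151.10.10@o2ib8002 NID (inferred by analogy with oss001) are all hypothetical.

```python
import errno

# Hypothetical client-side table: server UUID -> NIDs the client knows it by.
# The o2ib8002 NID is an assumption, inferred by analogy with oss001.
known_targets = {
    "oss002_UUID": ["10.151.10.10@o2ib8002"],
}


def find_uuid_by_nid(nid):
    """Return the UUID registered under this NID, or None if there is none."""
    for uuid, nids in known_targets.items():
        if nid in nids:
            return uuid
    return None


def lookup_first_nid_only(entry_nids):
    """Consult only the first NID of a recovery entry (the failing case)."""
    uuid = find_uuid_by_nid(entry_nids[0])
    return uuid if uuid is not None else -errno.ENOENT


def lookup_any_nid(entry_nids):
    """Try every NID of the entry before giving up."""
    for nid in entry_nids:
        uuid = find_uuid_by_nid(nid)
        if uuid is not None:
            return uuid
    return -errno.ENOENT


# Hypothetical recovery entry for oss002 listing both of its NIDs.
entry = ["10.150.10.10@o2ib8", "10.151.10.10@o2ib8002"]

print(lookup_first_nid_only(entry))  # -2 (ENOENT), as in "error -2" above
print(lookup_any_nid(entry))         # oss002_UUID
```

In this model the first lookup fails with -2 (ENOENT), the same code reported in the client log, because 10.150.10.10@o2ib8 is not an address the client has registered for that server. Trying every NID of the node lets the lookup succeed.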