[LU-9823] LNet fails to come up when using lctl but works with lnetctl
Created: 03/Aug/17 Updated: 07/Jan/24 Resolved: 27/Sep/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.1, Lustre 2.11.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | James A Simmons | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Not a Bug | Votes: | 0 |
| Labels: | IPv6 |
| Environment: | Seen on various systems with Lustre 2.10 and Lustre 2.11. |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
On several systems, attempting to bring up a Lustre system produces the following:

[188273.054578] LNet: Added LNI 10.0.1.22@tcp [8/256/0/180]
[188273.054724] LNet: Accept secure, port 988
[191295.504584] Lustre: Lustre: Build Version: 2.10.0_dirty
[191300.858629] Lustre: 22140:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1501789735/real 1501789735] req@ffff800fb09cfc80 x1574740673167376/t0(0) o250->MGC128.219.141.4@tcp@128.219.141.4@tcp:26/25 lens 520/544 e 0 to 1 dl 1501789740 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[191301.858634] LustreError: 22036:0:(mgc_request.c:251:do_config_log_add()) MGC128.219.141.4@tcp: failed processing log, type 1: rc = -5
[191330.858099] Lustre: 22140:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1501789760/real 1501789760] req@ffff800fb4910980 x1574740673167424/t0(0) o250->MGC128.219.141.4@tcp@128.219.141.4@tcp:26/25 lens 520/544 e 0 to 1 dl 1501789770 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[191332.858106] LustreError: 15c-8: MGC128.219.141.4@tcp: The configuration from log 'legs-client' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
[191332.858399] Lustre: Unmounted legs-client
[191332.859241] LustreError: 22036:0:(obd_mount.c:1505:lustre_fill_super()) Unable to mount (-5)

After investigation, this turned out to be a symptom of a communication failure in the LNet layer. It occurs when LNet has been set up with lctl; when lnetctl is used instead, the issue appears to go away. |
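For readers triaging similar logs, the negative return codes in the excerpt above (`rc = -5`, `(-5)`) are negated Linux errno values, so -5 corresponds to EIO. The following is a minimal sketch of decoding them; the helper name `decode_rc` and the regular expression are illustrative assumptions, not part of Lustre or its tooling.

```python
import errno
import os
import re

# Matches the two forms seen in the excerpt: "rc = -5" and "(-5)".
# The pattern is an assumption based only on the lines quoted in this ticket.
RC_PATTERN = re.compile(r"rc = (-\d+)|\((-\d+)\)")


def decode_rc(line):
    """Return (code, errno name, description) for the first negative rc in line.

    Returns None when the line carries no recognizable return code.
    """
    match = RC_PATTERN.search(line)
    if match is None:
        return None
    code = int(match.group(1) or match.group(2))
    name = errno.errorcode.get(-code, "UNKNOWN")
    return code, name, os.strerror(-code)


if __name__ == "__main__":
    sample = "LustreError: ... failed processing log, type 1: rc = -5"
    print(decode_rc(sample))
```

An EIO at mount time is generic, which is consistent with the description's conclusion that the mount failure is only a symptom of the underlying LNet communication problem.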
| Comments |
| Comment by Peter Jones [ 03/Aug/17 ] |
|
James, which version of SLES12 do you mean? |
| Comment by James A Simmons [ 03/Aug/17 ] |
|
cat /etc/SuSE-release
>uname -r |
| Comment by James A Simmons [ 08/Aug/17 ] |
|
Someone also reported this problem on Power8. I gathered a debug log from the client side. I also looked on the server side and saw this error, which is new to me:

[606005.245327] LNetError: 2606:0:(acceptor.c:406:lnet_acceptor()) Refusing connection from 128.219.141.3: insecure port 55084 |
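The "insecure port" message fits the "Accept secure, port 988" line in the description: in secure mode the acceptor is expected to take connections only from privileged source ports. The sketch below models that check under the assumption that the ceiling is 1023 (the usual reserved-port limit); `MAX_RESERVED_PORT` and `would_refuse` are hypothetical names for illustration, not the actual LNet code.

```python
# Assumption: "secure" means the peer's source port must be a reserved
# (privileged) port, i.e. at most 1023. The name is illustrative only.
MAX_RESERVED_PORT = 1023


def would_refuse(peer_port, secure_only=True):
    """Return True if a secure-only acceptor would refuse this source port."""
    return secure_only and peer_port > MAX_RESERVED_PORT


if __name__ == "__main__":
    # Source port taken from the server-side log line above.
    print(would_refuse(55084))
    # A privileged source port passes the check.
    print(would_refuse(1021))
```

Under this assumption, a client connecting from port 55084 is refused, which would also explain the request timeouts seen on the client side.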
| Comment by James A Simmons [ 23/Oct/17 ] |
|
I just attempted to bring up our regular testing file system on standard RHEL7 x86 with the latest Lustre 2.10.1, and I'm seeing this error. I will try Lustre 2.10.54 next. |
| Comment by James A Simmons [ 23/Oct/17 ] |
|
The problem is far worse with the latest master. It takes about 15 minutes to mount any back-end disk. Once everything does mount, which takes many hours on a 56 OST/16 MDT system, the client fails to mount. |
| Comment by James A Simmons [ 23/Oct/17 ] |
|
So on the MDS I see the following lctl dump:

00000100:00080000:5.0:1508798556.578627:0:4131:0:(pinger.c:405:ptlrpc_pinger_add_import()) adding pingable import 19afc095-abef-a794-2f84-9099c3e67329->MGS |
| Comment by James A Simmons [ 25/Oct/17 ] |
|
Okay, I ran git bisect to find when this failure started, and it is due to the multi-rail support landing. As things stand, people moving to Lustre 2.10 might find they can't mount Lustre at all when deploying a production system. |
| Comment by Peter Jones [ 25/Oct/17 ] |
|
Amir, can you please advise? Peter |
| Comment by Amir Shehata (Inactive) [ 25/Oct/17 ] |
|
Do any of the nodes have multiple interfaces? If so, can you please make sure you follow this general Linux routing guideline: |
| Comment by James A Simmons [ 26/Oct/17 ] |
|
For our sultan OSS nodes it was a configuration issue. We placed the two other IB ports on a different subnet, and that seems to have worked. The ARM system does have multiple Ethernet interfaces on the compute nodes, but only one has been set up with an IP address. ip addr show |
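The OSS fix described above amounts to making sure that no two interfaces on a node share a subnet. A minimal sketch of that check follows, using Python's standard `ipaddress` module; the interface names and addresses are hypothetical and are not taken from the affected systems.

```python
import ipaddress
from collections import defaultdict


def shared_subnets(interfaces):
    """Return {network: [interface names]} for networks used by 2+ interfaces.

    `interfaces` maps an interface name to an address in CIDR notation.
    """
    by_network = defaultdict(list)
    for name, cidr in interfaces.items():
        network = ipaddress.ip_interface(cidr).network
        by_network[network].append(name)
    return {
        str(network): sorted(names)
        for network, names in by_network.items()
        if len(names) > 1
    }


if __name__ == "__main__":
    # Hypothetical layout: all three IB ports on one subnet.
    before = {"ib0": "10.10.0.5/24", "ib1": "10.10.0.6/24", "ib2": "10.10.0.7/24"}
    # Hypothetical layout after moving the extra ports to other subnets.
    after = {"ib0": "10.10.0.5/24", "ib1": "10.10.1.6/24", "ib2": "10.10.2.7/24"}
    print(shared_subnets(before))
    print(shared_subnets(after))
```

An empty result means each interface sits on its own subnet, which is the state the sultan OSS nodes were moved to.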
| Comment by James A Simmons [ 26/Oct/17 ] |
|
I think I see why we have a problem. The network interface has both an IPv4 and an IPv6 address. Have you ever tried this setup? |
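One way to spot interfaces in this dual-stack state is to scan the one-line-per-address output of `ip -o addr show`. The sketch below assumes each line has the layout `<index>: <ifname> <family> <address>/<prefix> ...`; the sample text uses documentation addresses and is hypothetical, not output from the affected node.

```python
from collections import defaultdict


def address_families(ip_output):
    """Map each interface name to the set of address families it carries."""
    families = defaultdict(set)
    for line in ip_output.splitlines():
        fields = line.split()
        # Assumed layout: "<index>: <ifname> <family> <address>/<prefix> ..."
        if len(fields) >= 4 and fields[2] in ("inet", "inet6"):
            families[fields[1]].add(fields[2])
    return families


def dual_stack_interfaces(ip_output):
    """Return the sorted names of interfaces with both IPv4 and IPv6 addresses."""
    return sorted(
        name
        for name, found in address_families(ip_output).items()
        if found == {"inet", "inet6"}
    )


# Hypothetical sample; real output carries additional trailing fields.
SAMPLE = """\
2: eth0    inet 192.0.2.22/24 brd 192.0.2.255 scope global eth0
2: eth0    inet6 2001:db8::22/64 scope global
3: eth1    inet 198.51.100.7/24 scope global eth1
"""

if __name__ == "__main__":
    print(dual_stack_interfaces(SAMPLE))
```

Any interface this reports that is also used for LNet would be a candidate for the configuration described in this comment.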
| Comment by Andreas Dilger [ 14/Dec/17 ] |
|
Lustre doesn't support IPv6, though it is definitely something that we should keep in mind moving forward (LU-10391). |
| Comment by Andreas Dilger [ 27/Sep/21 ] |
|
Ended up being a configuration issue. |