Details
- Bug
- Resolution: Unresolved
- Minor
- None
- Lustre 2.10.1
- None
- 3
- 9223372036854775807
Description
We have just started testing Lustre 2.10.1 (recent git) on one of our RHEL7 clients against our 2.7-based servers. One of the file systems I'm testing on currently has its (single) MDT and the MGS running on the second (failover) server, and the default mount command, which lists the first server's NIDs first, doesn't work. It does work if we swap the order of the IPs in the mount command.
So this command does not work:
mount -t lustre 10.144.134.13@o2ib,172.23.134.13@tcp:10.144.134.14@o2ib,172.23.134.14@tcp:/lustre04 /mnt/lustre04
But this command works and mounts the file system:
sudo mount -t lustre 10.144.134.14@o2ib,172.23.134.14@tcp:10.144.134.13@o2ib,172.23.134.13@tcp:/lustre04 /mnt/lustre04
On our older clients running 2.7.3, both commands succeed in mounting the file system, as I expected. I have just re-verified this.
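To make the difference between the two commands easier to see, here is a minimal sketch that splits the device string into its parts. It assumes the conventional reading of the syntax, in which `:` separates MGS nodes, `,` separates the NIDs of one node, and the final `:/` introduces the file system name. The function name `list_mgs_nids` is made up for illustration, and how a given client release actually interprets the string is exactly what this report is questioning.

```shell
#!/bin/sh
# Sketch only: print "<fsname> node<N> <nid>" for every NID in a mount
# device string, ASSUMING ':' separates MGS nodes and ',' separates the
# NIDs of one node. list_mgs_nids is a hypothetical helper name.
# The body runs in a subshell so the IFS changes do not leak out.
list_mgs_nids() (
    spec=$1
    nodes=${spec%%:/*}     # everything before ":/fsname"
    fsname=${spec##*:/}    # everything after the last ":/"
    i=0
    IFS=:
    for node in $nodes; do
        i=$((i + 1))
        IFS=,
        for nid in $node; do
            printf '%s node%d %s\n' "$fsname" "$i" "$nid"
        done
        IFS=:
    done
)

# The failing command's device string from this report:
list_mgs_nids '10.144.134.13@o2ib,172.23.134.13@tcp:10.144.134.14@o2ib,172.23.134.14@tcp:/lustre04'
```

Under that assumption, the failing command lists the two .13 NIDs as node1 and the two .14 NIDs as node2, and the working command simply reverses the node order.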
These clients have NIDs on both the tcp and o2ib networks, and as far as I can tell, communication over both IB and Ethernet works in general on the client where we're testing 2.10.1. I can ping 10.144.134.13 and 172.23.134.13 from the OS, but lctl ping 10.144.134.13@o2ib times out.
The first MDS (10.144.134.13 and 172.23.134.13) does not have any lnet or lustre modules loaded.
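The LNet-level check described above can be repeated for all four NIDs with a short loop. This is a sketch, not a diagnostic tool: `check_nids` is a hypothetical helper name, and the `PING_CMD` variable is an assumption added here only so the loop can be tried on a machine without Lustre installed. By default it runs `lctl ping`, the same command used above.

```shell
#!/bin/sh
# Sketch only: report which NIDs answer an LNet ping.
# check_nids and PING_CMD are illustrative names, not part of Lustre.
# PING_CMD defaults to "lctl ping"; it is deliberately left unquoted so
# that it splits into the command and its subcommand.
check_nids() {
    for nid in "$@"; do
        if ${PING_CMD:-lctl ping} "$nid" >/dev/null 2>&1; then
            echo "$nid reachable"
        else
            echo "$nid unreachable"
        fi
    done
}

# NIDs taken from the mount commands in this report:
check_nids 10.144.134.13@o2ib 172.23.134.13@tcp \
           10.144.134.14@o2ib 172.23.134.14@tcp
```

With no LNet modules loaded on the first MDS, both .13 NIDs would be expected to show as unreachable at the LNet level even though the host answers an ordinary ping.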
Mount attempts:
[bnh65367@cs04r-sc-com99-20 ~]$ sudo mount -t lustre 10.144.134.14@o2ib,172.23.134.14@tcp:10.144.134.13@o2ib,172.23.134.13@tcp:/lustre04 /mnt/lustre04
[bnh65367@cs04r-sc-com99-20 ~]$ echo success | logger
[bnh65367@cs04r-sc-com99-20 ~]$ sudo umount /mnt/lustre04
[bnh65367@cs04r-sc-com99-20 ~]$ echo manually unmounted | logger
[bnh65367@cs04r-sc-com99-20 ~]$ sudo mount -t lustre 10.144.134.13@o2ib,172.23.134.13@tcp:10.144.134.14@o2ib,172.23.134.14@tcp:/lustre04 /mnt/lustre04
mount.lustre: mount 10.144.134.13@o2ib,172.23.134.13@tcp:10.144.134.14@o2ib,172.23.134.14@tcp:/lustre04 at /mnt/lustre04 failed: Input/output error
Is the MGS running?
[bnh65367@cs04r-sc-com99-20 ~]$
Syslog on the MDS does not seem to show anything. The client syslog covering both mount attempts is below.
Oct 25 18:37:04 cs04r-sc-com99-20 kernel: Lustre: 5364:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1508953019/real 0] req@ffff88178fb88300 x1582247825193888/t0(0) o38->lustre04-MDT0000-mdc-ffff88015e6ef800@10.144.134.13@o2ib:12/10 lens 520/544 e 0 to 1 dl 1508953024 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Oct 25 18:37:53 cs04r-sc-com99-20 kernel: LNet: 5344:0:(o2iblnd_cb.c:3186:kiblnd_check_conns()) Timed out tx for 10.144.134.13@o2ib: 4 seconds
Oct 25 18:38:14 cs04r-sc-com99-20 kernel: Lustre: Mounted lustre04-client
Oct 25 18:39:27 cs04r-sc-com99-20 bnh65367: success
Oct 25 18:39:35 cs04r-sc-com99-20 systemd: Unit mnt-lustre04.mount entered failed state.
Oct 25 18:39:35 cs04r-sc-com99-20 kernel: Lustre: Unmounted lustre04-client
Oct 25 18:39:46 cs04r-sc-com99-20 bnh65367: manually unmounted
Oct 25 18:40:04 cs04r-sc-com99-20 kernel: Lustre: 5364:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1508953199/real 0] req@ffff88178fb88900 x1582247825198288/t0(0) o250->MGC10.144.134.13@o2ib@10.144.134.13@o2ib:26/25 lens 520/544 e 0 to 1 dl 1508953204 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Oct 25 18:40:05 cs04r-sc-com99-20 kernel: LustreError: 9064:0:(mgc_request.c:251:do_config_log_add()) MGC10.144.134.13@o2ib: failed processing log, type 1: rc = -5
Oct 25 18:40:36 cs04r-sc-com99-20 kernel: LustreError: 15c-8: MGC10.144.134.13@o2ib: The configuration from log 'lustre04-client' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
Oct 25 18:40:36 cs04r-sc-com99-20 kernel: Lustre: Unmounted lustre04-client
Oct 25 18:40:50 cs04r-sc-com99-20 kernel: LNet: 5344:0:(o2iblnd_cb.c:3186:kiblnd_check_conns()) Timed out tx for 10.144.134.13@o2ib: 6 seconds
Oct 25 18:40:50 cs04r-sc-com99-20 kernel: LustreError: 9064:0:(obd_mount.c:1505:lustre_fill_super()) Unable to mount (-5)
Oct 25 18:46:18 cs04r-sc-com99-20 kernel: LNet: 5344:0:(o2iblnd_cb.c:3186:kiblnd_check_conns()) Timed out tx for 10.144.134.13@o2ib: 3 seconds
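To see at a glance which peers the client is timing out on, the "Timed out tx for" messages can be tallied per NID. The following is a rough sketch that assumes one syslog entry per line on standard input. The function name `count_tx_timeouts` is made up for illustration, and the sample input is copied from the client log above.

```shell
#!/bin/sh
# Sketch only: count LNet "Timed out tx for <nid>" messages per NID.
# Assumes one syslog entry per line on stdin; count_tx_timeouts is a
# hypothetical helper name.
count_tx_timeouts() {
    awk '/Timed out tx for/ {
             sub(/.*Timed out tx for /, "")   # drop everything before the NID
             sub(/:.*/, "")                   # drop ": N seconds"
             n[$0]++
         }
         END { for (nid in n) print nid, n[nid] }'
}

# Sample entries copied from the client syslog in this report:
count_tx_timeouts <<'EOF'
Oct 25 18:37:53 cs04r-sc-com99-20 kernel: LNet: 5344:0:(o2iblnd_cb.c:3186:kiblnd_check_conns()) Timed out tx for 10.144.134.13@o2ib: 4 seconds
Oct 25 18:38:14 cs04r-sc-com99-20 kernel: Lustre: Mounted lustre04-client
Oct 25 18:40:50 cs04r-sc-com99-20 kernel: LNet: 5344:0:(o2iblnd_cb.c:3186:kiblnd_check_conns()) Timed out tx for 10.144.134.13@o2ib: 6 seconds
Oct 25 18:46:18 cs04r-sc-com99-20 kernel: LNet: 5344:0:(o2iblnd_cb.c:3186:kiblnd_check_conns()) Timed out tx for 10.144.134.13@o2ib: 3 seconds
EOF
```

In the log above, every tx timeout is for 10.144.134.13@o2ib, the server with no lnet modules loaded. The -5 in the error lines is errno EIO, which matches the "Input/output error" printed by mount.lustre.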
For now we are only testing this version but if this doesn't work, I'm not sure we can update any production machines.
Issue Links
- is related to LU-8397 take comma as separator of mgsnode's list (Resolved)