Details
- Bug
- Resolution: Fixed
- Blocker
- Lustre 2.2.0, Lustre 2.3.0
- None
- Scientific Linux 5.5, Lustre 2.2.0 on servers, patchless 2.1.1 and 2.2.0 on clients
Description
We are hitting a strange bug while upgrading to 2.2.0. We have already moved all the servers and some clients to 2.2; however, our TCP clients are unable to mount the filesystem because they cannot find a suitable NID to connect to the MDT. 2.1.1 clients work fine.
In our case the o2ib networks are listed first in all configs (MGS/MDT/OST config), and the tcp network is listed third. All the clients that use o2ib work fine, since the first MDT NID they get from the MGS works for them; the TCP clients, however, fail (at least that's what we suppose).
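To make the suspected failure concrete, here is a minimal Python sketch of the check we would expect a client to perform: keep only the MDT NIDs whose LNet network the client is actually configured on. The helper names (`normalize_net`, `nid_net`, `reachable_nids`) are made up for illustration and are not Lustre code; the only assumption about LNet is that a network name without a number means network 0 (so `tcp` is `tcp0`).

```python
import re


def normalize_net(net):
    """Return the LNet network name with an explicit number (tcp -> tcp0)."""
    net = net.strip()
    return net if re.search(r"\d$", net) else net + "0"


def nid_net(nid):
    """Return the normalized network part of a NID such as 172.16.126.1@tcp."""
    _addr, _sep, net = nid.partition("@")
    return normalize_net(net)


def reachable_nids(nids, local_nets):
    """Keep only the NIDs that sit on one of the client's local networks."""
    local = {normalize_net(n) for n in local_nets}
    return [nid for nid in nids if nid_net(nid) in local]


# MDS NIDs in the order the server reports them (taken from the lctl ping output).
mdt_nids = ["172.16.193.1@o2ib", "192.168.193.2@o2ib1", "172.16.126.1@tcp"]

# A TCP-only client, as in our lnet options.
client_nets = ["tcp0"]

print(reachable_nids(mdt_nids, client_nets))      # all NIDs considered
print(reachable_nids(mdt_nids[:1], client_nets))  # only the first NID considered
```

If only the first NID is considered, a TCP-only client is left with nothing it can connect to, which would match the symptom we see.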
Our mds params are:
Parameters: mgsnode=172.16.193.1@o2ib,172.16.126.1@tcp mgsnode=172.16.193.3@o2ib,172.16.126.2@tcp failover.node=172.16.193.3@o2ib,172.16.126.2@tcp mdd.quota_type=ug
Servers have:
options lnet networks="o2ib0(ib0),o2ib1(ib1),tcp0(eth0)"
TCP clients:
options lnet networks="tcp0(eth0)"
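For comparing the two configurations, a small Python sketch like the one below can pull the network/interface pairs out of the `options lnet` lines. The function name `parse_lnet_networks` is hypothetical, and the parser only handles the simple `net(iface)` form used in our configs, not the full `networks` syntax.

```python
import re


def parse_lnet_networks(options_line):
    """Extract (network, interface) pairs from an 'options lnet networks=...' line."""
    match = re.search(r'networks="([^"]*)"', options_line)
    if match is None:
        return []
    pairs = []
    for entry in match.group(1).split(","):
        # Only the simple net(iface) form is recognized in this sketch.
        m = re.fullmatch(r"\s*(\w+)\((\w+)\)\s*", entry)
        if m:
            pairs.append((m.group(1), m.group(2)))
    return pairs


servers = parse_lnet_networks('options lnet networks="o2ib0(ib0),o2ib1(ib1),tcp0(eth0)"')
clients = parse_lnet_networks('options lnet networks="tcp0(eth0)"')

# Networks that both the servers and the TCP clients are configured on.
shared = {net for net, _ in servers} & {net for net, _ in clients}
print(servers)
print(clients)
print(shared)
```

With our settings the only network shared by servers and TCP clients is `tcp0`, which is the third entry on the server side.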
And the client gets this:
[root@n1-4-1 ~]# lctl which_nid 172.16.126.1@tcp
172.16.126.1@tcp
[root@n1-4-1 ~]# lctl ping 172.16.126.1@tcp
12345-0@lo
12345-172.16.193.1@o2ib
12345-192.168.193.2@o2ib1
12345-172.16.126.1@tcp
[root@n1-4-1 ~]# mount -t lustre 172.16.126.1@tcp:/scratch /mnt/lustre/scratch/
mount.lustre: mount 172.16.126.1@tcp:/scratch at /mnt/lustre/scratch failed: No such file or directory
Is the MGS specification correct?
Is the filesystem name correct?
If upgrading, is the copied client log valid? (see upgrade docs)
Dmesg says:
Apr 11 16:55:53 n1-4-1 kernel: Lustre: MGC172.16.126.1@tcp: Reactivating import
Apr 11 16:55:53 n1-4-1 kernel: LustreError: 2469:0:(ldlm_lib.c:381:client_obd_setup()) can't add initial connection
Apr 11 16:55:53 n1-4-1 kernel: LustreError: 2469:0:(obd_config.c:521:class_setup()) setup scratch-MDT0000-mdc-ffff81018d9d6400 failed (-2)
Apr 11 16:55:53 n1-4-1 kernel: LustreError: 2469:0:(obd_config.c:1362:class_config_llog_handler()) Err -2 on cfg command:
Apr 11 16:55:53 n1-4-1 kernel: Lustre: cmd=cf003 0:scratch-MDT0000-mdc 1:scratch-MDT0000_UUID 2:172.16.193.1@o2ib
Apr 11 16:55:53 n1-4-1 kernel: LustreError: 15c-8: MGC172.16.126.1@tcp: The configuration from log 'scratch-client' failed (-2). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
Apr 11 16:55:53 n1-4-1 kernel: LustreError: 2457:0:(llite_lib.c:978:ll_fill_super()) Unable to process log: -2
Apr 11 16:55:53 n1-4-1 kernel: LustreError: 2457:0:(obd_config.c:566:class_cleanup()) Device 3 not setup
Apr 11 16:55:53 n1-4-1 kernel: LustreError: 2457:0:(ldlm_request.c:1170:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
Apr 11 16:55:53 n1-4-1 kernel: LustreError: 2457:0:(ldlm_request.c:1796:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
Apr 11 16:55:53 n1-4-1 kernel: Lustre: client ffff81018d9d6400 umount complete
Apr 11 16:55:53 n1-4-1 kernel: LustreError: 2457:0:(obd_mount.c:2349:lustre_fill_super()) Unable to mount (-2)
I'm also attaching two debug dumps (lctl dk) for 2.1.1 client (works fine) and 2.2.0 client (fails).
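For anyone going through similar logs, the failing config command can be extracted from the `cmd=` line with a short Python sketch like this one. The function name `parse_cfg_command` is made up for illustration, and the parsing is based only on the layout of the dmesg line quoted above.

```python
import re


def parse_cfg_command(log_line):
    """Return (command, {index: value}) from a 'cmd=... 0:... 1:...' log line."""
    match = re.search(r"cmd=(\w+)((?:\s+\d+:\S+)+)", log_line)
    if match is None:
        return None
    # Each argument is printed as <index>:<value>, separated by whitespace.
    args = {int(i): v for i, v in re.findall(r"(\d+):(\S+)", match.group(2))}
    return match.group(1), args


line = (
    "Apr 11 16:55:53 n1-4-1 kernel: Lustre: cmd=cf003 "
    "0:scratch-MDT0000-mdc 1:scratch-MDT0000_UUID 2:172.16.193.1@o2ib"
)

cmd, args = parse_cfg_command(line)
nid = args[2]
network = nid.partition("@")[2]
print(cmd, args)
print(network)
```

Here the NID handed to the setup command is on `o2ib`, while this client only has `tcp0` configured, which is consistent with the "can't add initial connection" error.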