Details
Bug
Resolution: Duplicate
Minor
None
Lustre 2.2.0
None
Lustre 2.2.0 system
3
6413
Description
Hello,
I'm trying to build a new multihomed (Infiniband and Ethernet) Lustre 2.2.0 system.
Everything works fine as long as I'm only using one of the two networks, but I'm unable to use both at the same time
(e.g. with Ethernet-only and Infiniband-only clients).
My test setup looks like this:
n-mds1: Combined MGS/MDT with Infiniband and Ethernet
n-oss01: OSS - also with Infiniband and Ethernet
a9115: Infiniband-only client
a9116: Ethernet-only client
The MDS has two NIDs, one on IB and one on Ethernet:
[root@n-mds1 ~]# cat /etc/modprobe.d/lustre.conf
options lnet networks=o2ib(ib0),tcp(eth0)
[root@n-mds1 ~]# modprobe lustre
[root@n-mds1 ~]# lctl list_nids
10.201.62.13@o2ib
10.201.30.13@tcp
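For readers less familiar with the `networks=` module option, here is a minimal Python sketch of how the strings used in this report break down into (network, interface) pairs, and which networks a client shares with the servers. The helper names `parse_networks` and `shared_networks` are made up for illustration, and the parser only handles the simple `net(iface)` form seen in this ticket, not the full LNet syntax.

```python
import re

def parse_networks(option_line):
    """Split an 'options lnet networks=...' line into (network, interface) pairs."""
    # Take everything after 'networks=' and drop optional surrounding quotes.
    value = option_line.split("networks=", 1)[1].strip().strip('"')
    pairs = []
    for entry in value.split(","):
        # Simplification: exactly one interface per network, e.g. "tcp(eth0)".
        match = re.fullmatch(r"\s*([A-Za-z0-9]+)(?:\(([^)]*)\))?\s*", entry)
        if match is None:
            raise ValueError(f"unrecognised network entry: {entry!r}")
        pairs.append((match.group(1), match.group(2)))
    return pairs

def shared_networks(a, b):
    """Return the network names configured on both nodes, in a's order."""
    nets_b = {net for net, _ in b}
    return [net for net, _ in a if net in nets_b]

# Option lines copied from the lustre.conf files shown in this report.
servers = parse_networks("options lnet networks=o2ib(ib0),tcp(eth0)")
eth_client = parse_networks("options lnet networks=tcp(eth0)")
ib_client = parse_networks('options lnet networks="o2ib(ib0)"')

print(shared_networks(servers, eth_client))  # ['tcp']
print(shared_networks(servers, ib_client))   # ['o2ib']
```

Each client shares exactly one network with the servers, so any server NID on the other network is unreachable from that client.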
The MDS was set up like this:
[root@n-mds1 ~]# mkfs.lustre --fsname=foobar --reformat --mdt --mgs --mgsnode=10.201.62.13@o2ib,10.201.30.13@tcp /dev/mapper/vd01
(I also tried without --mgsnode and with --servicenode=10.201....)
[root@n-mds1 ~]# mount -t lustre /dev/mapper/vd01 /lustre/mds
[root@n-mds1 ~]# lctl dl
0 UP mgs MGS MGS 5
1 UP mgc MGC10.201.62.13@o2ib 3aecba06-ec8b-aeab-2151-47d5a1c1bc47 5
2 UP lov foobar-MDT0000-mdtlov foobar-MDT0000-mdtlov_UUID 4
3 UP mdt foobar-MDT0000 foobar-MDT0000_UUID 3
4 UP mds mdd_obd-foobar-MDT0000 mdd_obd_uuid-foobar-MDT0000 3
The OSS also has two NIDs and is able to ping the MDS:
[root@n-oss01 ~]# cat /etc/modprobe.d/lustre.conf
options lnet networks=o2ib(ib0),tcp(eth0)
[root@n-oss01 ~]# modprobe lustre
[root@n-oss01 ~]# lctl list_nids
10.201.62.31@o2ib
10.201.30.31@tcp
[root@n-oss01 ~]# lctl ping 10.201.62.13@o2ib # mds-ib
12345-0@lo
12345-10.201.62.13@o2ib
12345-10.201.30.13@tcp
[root@n-oss01 ~]# lctl ping 10.201.30.13@tcp # mds-eth
12345-0@lo
12345-10.201.62.13@o2ib
12345-10.201.30.13@tcp
The filesystem on the OSS was created via:
[root@n-oss01 ~]# mkfs.lustre --reformat --fsname=foobar --ost --mgsnode=10.201.62.13@o2ib,10.201.30.13@tcp --index=0 /dev/mapper/vd01
[root@n-oss01 ~]# mount -t lustre /dev/mapper/vd01 /lustre/vd01 && sleep 2 && lctl dl
[root@n-oss01 ~]# lctl dl
0 UP mgc MGC10.201.62.13@o2ib 7408e9c5-b92e-5423-fa52-497d0c540a43 5
1 UP ost OSS OSS_uuid 3
2 UP obdfilter foobar-OST0000 foobar-OST0000_UUID 5
So the OSS seems to be happy, the MDS also looks fine:
[root@n-mds1 ~]# lctl dl
0 UP mgs MGS MGS 7
1 UP mgc MGC10.201.62.13@o2ib 3aecba06-ec8b-aeab-2151-47d5a1c1bc47 5
2 UP lov foobar-MDT0000-mdtlov foobar-MDT0000-mdtlov_UUID 4
3 UP mdt foobar-MDT0000 foobar-MDT0000_UUID 3
4 UP mds mdd_obd-foobar-MDT0000 mdd_obd_uuid-foobar-MDT0000 3
5 UP osc foobar-OST0000-osc-MDT0000 foobar-MDT0000-mdtlov_UUID 5
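To make the difference between the two `lctl dl` listings on the MDS explicit, the following Python sketch parses both and reports which devices appeared after the OST was mounted. The column labels (`index`, `state`, `type`, `name`, `uuid`, `refcount`) are my own reading of the six columns, and `parse_device_list` is a hypothetical helper, not a Lustre tool.

```python
def parse_device_list(text):
    """Parse 'lctl dl' output into a dict keyed by device name."""
    devices = {}
    for line in text.strip().splitlines():
        # Assumed column order: index, state, type, name, uuid, refcount.
        index, state, dev_type, name, uuid, refcount = line.split()
        devices[name] = {
            "index": int(index),
            "state": state,
            "type": dev_type,
            "uuid": uuid,
            "refcount": int(refcount),
        }
    return devices

# MDS listing right after mounting the MDT (copied from this report).
before = """
0 UP mgs MGS MGS 5
1 UP mgc MGC10.201.62.13@o2ib 3aecba06-ec8b-aeab-2151-47d5a1c1bc47 5
2 UP lov foobar-MDT0000-mdtlov foobar-MDT0000-mdtlov_UUID 4
3 UP mdt foobar-MDT0000 foobar-MDT0000_UUID 3
4 UP mds mdd_obd-foobar-MDT0000 mdd_obd_uuid-foobar-MDT0000 3
"""

# MDS listing after the OST was mounted on n-oss01 (copied from this report).
after = """
0 UP mgs MGS MGS 7
1 UP mgc MGC10.201.62.13@o2ib 3aecba06-ec8b-aeab-2151-47d5a1c1bc47 5
2 UP lov foobar-MDT0000-mdtlov foobar-MDT0000-mdtlov_UUID 4
3 UP mdt foobar-MDT0000 foobar-MDT0000_UUID 3
4 UP mds mdd_obd-foobar-MDT0000 mdd_obd_uuid-foobar-MDT0000 3
5 UP osc foobar-OST0000-osc-MDT0000 foobar-MDT0000-mdtlov_UUID 5
"""

old, new = parse_device_list(before), parse_device_list(after)
new_devices = [name for name in new if name not in old]
print(new_devices)  # ['foobar-OST0000-osc-MDT0000']
```

The only new device is the `osc` for `foobar-OST0000`, which is consistent with the MDS having connected to the OST.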
Mounting the filesystem on the IB-only client works just fine now:
[root@a9115 ~]# cat /etc/modprobe.d/lustre.conf
options lnet networks="o2ib(ib0)"
[root@a9115 ~]# modprobe lustre
[root@a9115 ~]# lctl list_nids
10.201.36.34@o2ib
[root@a9115 ~]# lctl ping 10.201.62.13@o2ib
12345-0@lo
12345-10.201.62.13@o2ib
12345-10.201.30.13@tcp
[root@a9115 ~]# mount -t lustre 10.201.62.13@o2ib:/foobar /cluster/scratch
...but the Ethernet-only client fails:
[root@a9116 ~]# cat /etc/modprobe.d/lustre.conf
options lnet networks=tcp(eth0)
[root@a9116 ~]# modprobe lustre
[root@a9116 ~]# lctl list_nids
10.201.4.35@tcp
lctl ping seems to work:
[root@a9116 ~]# lctl ping 10.201.30.13@tcp # mds
12345-0@lo
12345-10.201.62.13@o2ib
12345-10.201.30.13@tcp
[root@a9116 ~]# lctl ping 10.201.30.31@tcp # oss
12345-0@lo
12345-10.201.62.31@o2ib
12345-10.201.30.31@tcp
...but the mount operation fails with:
[root@a9116 ~]# lctl clear
[root@a9116 ~]# mount -t lustre 10.201.30.13@tcp:/foobar /cluster/scratch/
mount.lustre: mount 10.201.30.13@tcp:/foobar at /cluster/scratch failed: No such file or directory
Apr 12 14:10:21 a9116 kernel: Lustre: MGC10.201.30.13@tcp: Reactivating import
Apr 12 14:10:21 a9116 kernel: LustreError: 9130:0:(ldlm_lib.c:381:client_obd_setup()) can't add initial connection
Apr 12 14:10:21 a9116 kernel: LustreError: 9130:0:(obd_config.c:521:class_setup()) setup foobar-MDT0000-mdc-ffff880e3a01f400 failed (-2)
Apr 12 14:10:21 a9116 kernel: LustreError: 9130:0:(obd_config.c:1362:class_config_llog_handler()) Err -2 on cfg command:
Apr 12 14:10:21 a9116 kernel: Lustre: cmd=cf003 0:foobar-MDT0000-mdc 1:foobar-MDT0000_UUID 2:10.201.62.13@o2ib
Apr 12 14:10:21 a9116 kernel: LustreError: 15c-8: MGC10.201.30.13@tcp: The configuration from log 'foobar-client' failed (-2). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
Apr 12 14:10:21 a9116 kernel: LustreError: 9116:0:(llite_lib.c:978:ll_fill_super()) Unable to process log: -2
Apr 12 14:10:21 a9116 kernel: LustreError: 9116:0:(obd_config.c:566:class_cleanup()) Device 3 not setup
Apr 12 14:10:21 a9116 kernel: LustreError: 9116:0:(ldlm_request.c:1170:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
Apr 12 14:10:22 a9116 kernel: LustreError: 9116:0:(ldlm_request.c:1796:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
Is the MGS specification correct?
Is the filesystem name correct?
If upgrading, is the copied client log valid? (see upgrade docs)
Apr 12 14:10:22 a9116 kernel: Lustre: client ffff880e3a01f400 umount complete
Apr 12 14:10:22 a9116 kernel: LustreError: 9116:0:(obd_mount.c:2349:lustre_fill_super()) Unable to mount (-2)
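The return codes in the log above (`-2`, `-108`) look like negated Linux errno values. A small Python sketch to look them up, assuming it is run on a Linux host (errno numbering is platform-specific, so `108` may map to something else elsewhere); `describe_rc` is a name invented here:

```python
import errno
import os

def describe_rc(rc):
    """Map a negative kernel-style return code to (symbolic name, message)."""
    code = -rc
    # errno.errorcode maps numeric values to names such as 'ENOENT'.
    name = errno.errorcode.get(code, "unknown")
    return name, os.strerror(code)

# Return codes that appear in the client log of this report.
for rc in (-2, -108):
    name, message = describe_rc(rc)
    print(f"{rc}: {name} ({message})")
```

The `-2` lookup gives `ENOENT`, which matches the "No such file or directory" text printed by `mount.lustre`.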
Why does the Ethernet client receive (or pick?) the Infiniband NID of the MDS (10.201.62.13@o2ib)?
'lctl dk' reports the same:
00000020:01000000:7.0:1334232621.765857:0:9130:0:(obd_config.c:1217:class_config_llog_handler()) Marker, inst_flg=0x0 mark_flg=0x1
00000020:00000080:7.0:1334232621.765859:0:9130:0:(obd_config.c:915:class_process_config()) processing cmd: cf010
00000020:00000080:7.0:1334232621.765860:0:9130:0:(obd_config.c:984:class_process_config()) marker 5 (0x1) foobar-MDT0000 add mdc
00000020:00000080:7.0:1334232621.765861:0:9130:0:(obd_config.c:915:class_process_config()) processing cmd: cf005
00000020:00000080:7.0:1334232621.765870:0:9130:0:(obd_config.c:926:class_process_config()) adding mapping from uuid 10.201.62.13@o2ib to nid 0x500000ac93e0d (10.201.62.13@o2ib)
00000020:00000080:7.0:1334232621.765873:0:9130:0:(obd_config.c:915:class_process_config()) processing cmd: cf005
00000020:00000080:7.0:1334232621.765874:0:9130:0:(obd_config.c:926:class_process_config()) adding mapping from uuid 10.201.62.13@o2ib to nid 0x200000ac91e0d (10.201.30.13@tcp)
00000020:01000000:7.0:1334232621.765877:0:9130:0:(obd_config.c:1299:class_config_llog_handler()) cmd cf001, instance name: foobar-MDT0000-mdc-ffff880e3a01f400
00000020:00000080:7.0:1334232621.765878:0:9130:0:(obd_config.c:915:class_process_config()) processing cmd: cf001
00000020:00000080:7.0:1334232621.765879:0:9130:0:(obd_config.c:318:class_attach()) attach type mdc name: foobar-MDT0000-mdc-ffff880e3a01f400 uuid: 6a7aaf3a-bbcb-9abf-3516-4320e2718614
00000020:00000080:7.0:1334232621.765936:0:9130:0:(genops.c:348:class_newdev()) Adding new device foobar-MDT0000-mdc-ffff880e3a01f400 (ffff8810358320b8)
00000020:00000080:7.0:1334232621.765938:0:9130:0:(obd_config.c:392:class_attach()) OBD: dev 3 attached type mdc with refcount 1
00000020:01000000:7.0:1334232621.765940:0:9130:0:(obd_config.c:1299:class_config_llog_handler()) cmd cf003, instance name: foobar-MDT0000-mdc-ffff880e3a01f400
00000020:00000080:7.0:1334232621.765941:0:9130:0:(obd_config.c:915:class_process_config()) processing cmd: cf003
00000100:00000100:7.0:1334232621.765957:0:9130:0:(client.c:80:ptlrpc_uuid_to_connection()) cannot find peer 10.201.62.13@o2ib!
00010000:00080000:7.0:1334232621.765959:0:9130:0:(ldlm_lib.c:74:import_set_conn()) can't find connection 10.201.62.13@o2ib
00010000:00020000:7.0:1334232621.765960:0:9130:0:(ldlm_lib.c:381:client_obd_setup()) can't add initial connection
00000020:00000080:7.0:1334232621.793034:0:9130:0:(genops.c:786:class_export_put()) final put ffff88103af10400/6a7aaf3a-bbcb-9abf-3516-4320e2718614
00000020:00020000:7.0:1334232621.793043:0:9130:0:(obd_config.c:521:class_setup()) setup foobar-MDT0000-mdc-ffff880e3a01f400 failed (-2)
00000020:00000080:4.0:1334232621.793043:0:8930:0:(genops.c:915:class_import_destroy()) destroying import ffff881039698800 for foobar-MDT0000-mdc-ffff880e3a01f400
00000020:00000080:4.0:1334232621.793049:0:8930:0:(genops.c:743:class_export_destroy()) destroying export ffff88103af10400/6a7aaf3a-bbcb-9abf-3516-4320e2718614 for foobar-MDT0000-mdc-ffff880e3a01f400
00000020:00020000:7.0:1334232621.823020:0:9130:0:(obd_config.c:1362:class_config_llog_handler()) Err -2 on cfg command:
00000020:02000400:7.0:1334232621.850778:0:9130:0:(obd_config.c:1456:class_config_dump_handler()) cmd=cf003 0:foobar-MDT0000-mdc 1:foobar-MDT0000_UUID 2:10.201.62.13@o2ib
00000020:01000000:31.0:1334232621.850818:0:9116:0:(obd_config.c:1393:class_config_parse_llog()) Processed log foobar-client gen 1-13 (rc=-2)
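The hexadecimal NIDs in the debug log can be decoded by hand. The Python sketch below assumes the 64-bit layout I understand LNet to use in this Lustre generation: the low 32 bits hold the IPv4 address, and the high 32 bits hold the network, itself split into a 16-bit LND type and a 16-bit network number. The type table is limited to the two values seen here (2 for tcp, 5 for o2ib), and `decode_nid` is a hypothetical helper.

```python
import ipaddress

# Only the two LND types that occur in this report.
LND_NAMES = {2: "tcp", 5: "o2ib"}

def decode_nid(nid):
    """Render a 64-bit LNet NID as 'a.b.c.d@net' (assumed layout, see above)."""
    address = nid & 0xFFFFFFFF   # low 32 bits: IPv4 address
    network = nid >> 32          # high 32 bits: network identifier
    lnd_type = network >> 16     # upper half of the network: LND type
    net_number = network & 0xFFFF  # lower half: network number (0 is implicit)
    name = LND_NAMES.get(lnd_type, f"lnd{lnd_type}")
    if net_number:
        name += str(net_number)
    return f"{ipaddress.IPv4Address(address)}@{name}"

# Values taken from the 'adding mapping from uuid ... to nid ...' lines.
print(decode_nid(0x500000ac93e0d))  # 10.201.62.13@o2ib
print(decode_nid(0x200000ac91e0d))  # 10.201.30.13@tcp
```

Both decoded values agree with the textual NIDs printed in parentheses next to them in the log.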
I'm also puzzled about the NIDs in CONFIGS/foobar-[client,MDT0000]:
[root@n-mds1 ~]# llog_reader /lustre/mds/CONFIGS/foobar-client |grep uuid
Target uuid : config_uuid
uuid=foobar-clilov_UUID stripe:cnt=1 size=1048576 offset=18446744073709551615 pattern=0x1
uuid=foobar-clilmv_UUID stripe:cnt=0 size=0 offset=0 pattern=0
#10 (088)add_uuid nid=10.201.62.13@o2ib(0x500000ac93e0d) 0: 1:10.201.62.13@o2ib
#11 (088)add_uuid nid=10.201.30.13@tcp(0x200000ac91e0d) 0: 1:10.201.62.13@o2ib
#20 (088)add_uuid nid=10.201.62.31@o2ib(0x500000ac93e1f) 0: 1:10.201.62.31@o2ib
#21 (088)add_uuid nid=10.201.30.31@tcp(0x200000ac91e1f) 0: 1:10.201.62.31@o2ib
[root@n-mds1 ~]# llog_reader /lustre/mds/CONFIGS/foobar-MDT0000 |grep uuid
Target uuid : config_uuid
uuid=foobar-MDT0000-mdtlov_UUID stripe:cnt=1 size=1048576 offset=18446744073709551615 pattern=0x1
#11 (088)add_uuid nid=10.201.62.31@o2ib(0x500000ac93e1f) 0: 1:10.201.62.31@o2ib
#12 (088)add_uuid nid=10.201.30.31@tcp(0x200000ac91e1f) 0: 1:10.201.62.31@o2ib
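One way to read the `add_uuid` records: the `lctl dk` output above logs them as "adding mapping from uuid ... to nid ...", which suggests the right-hand `1:` field is a UUID label (apparently named after the first NID) and the `nid=` field is the actual NID attached to it. The Python sketch below groups the records from `foobar-client` on that assumption; the regular expression is written against the lines shown here, not against any documented `llog_reader` format.

```python
import re
from collections import defaultdict

# Matches e.g.: add_uuid nid=10.201.30.13@tcp(0x200000ac91e0d) 0: 1:10.201.62.13@o2ib
ADD_UUID = re.compile(r"add_uuid\s+nid=(\S+?)\(0x[0-9a-fA-F]+\)\s+0:\s+1:(\S+)")

def nids_by_uuid(llog_text):
    """Group the nid= values of add_uuid records under their UUID label."""
    mapping = defaultdict(list)
    for match in ADD_UUID.finditer(llog_text):
        nid, uuid = match.groups()
        mapping[uuid].append(nid)
    return dict(mapping)

# add_uuid records copied from the foobar-client log in this report.
client_log = """\
#10 (088)add_uuid nid=10.201.62.13@o2ib(0x500000ac93e0d) 0: 1:10.201.62.13@o2ib
#11 (088)add_uuid nid=10.201.30.13@tcp(0x200000ac91e0d) 0: 1:10.201.62.13@o2ib
#20 (088)add_uuid nid=10.201.62.31@o2ib(0x500000ac93e1f) 0: 1:10.201.62.31@o2ib
#21 (088)add_uuid nid=10.201.30.31@tcp(0x200000ac91e1f) 0: 1:10.201.62.31@o2ib
"""

for uuid, nids in nids_by_uuid(client_log).items():
    print(uuid, "->", nids)
```

Read this way, each server's label carries both its o2ib and its tcp NID, so the tcp NIDs are present in the log even though they never appear on the right-hand side.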
Is it normal that there are no tcp NIDs on the right-hand side? And what stupid mistake did I make while setting up the system?