[LU-1326] Multihomed configuration with lustre 2.2.0 Created: 16/Apr/12  Updated: 17/Apr/12  Resolved: 16/Apr/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.2.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: ETHz Support (Inactive) Assignee: WC Triage
Resolution: Duplicate Votes: 0
Labels: None
Environment:

Lustre 2.2.0 system


Severity: 3
Rank (Obsolete): 6413

 Description   

Hello,

I'm trying to build a new multihomed (InfiniBand and Ethernet) Lustre 2.2.0 system.

Everything works fine as long as I'm using only one of the two networks, but I'm unable to use both at the same time
(e.g., with Ethernet-only and InfiniBand-only clients).

My test setup looks like this:

n-mds1: Combined MGS/MDT with InfiniBand and Ethernet
n-oss01: OSS, also with InfiniBand and Ethernet
a9115: InfiniBand-only client
a9116: Ethernet-only client

The MDS has two NIDs, one IB and one Ethernet:

[root@n-mds1 ~]# cat /etc/modprobe.d/lustre.conf
options lnet networks=o2ib(ib0),tcp(eth0)

[root@n-mds1 ~]# modprobe lustre
[root@n-mds1 ~]# lctl list_nids
10.201.62.13@o2ib
10.201.30.13@tcp

The MDS was set up like this:

[root@n-mds1 ~]# mkfs.lustre --fsname=foobar --reformat --mdt --mgs --mgsnode=10.201.62.13@o2ib,10.201.30.13@tcp /dev/mapper/vd01

# ..also tried without mgsnode and with --servicenode=10.201....

[root@n-mds1 ~]# mount -t lustre /dev/mapper/vd01 /lustre/mds

[root@n-mds1 ~]# lctl dl
0 UP mgs MGS MGS 5
1 UP mgc MGC10.201.62.13@o2ib 3aecba06-ec8b-aeab-2151-47d5a1c1bc47 5
2 UP lov foobar-MDT0000-mdtlov foobar-MDT0000-mdtlov_UUID 4
3 UP mdt foobar-MDT0000 foobar-MDT0000_UUID 3
4 UP mds mdd_obd-foobar-MDT0000 mdd_obd_uuid-foobar-MDT0000 3

The OSS also has two NIDs and is able to ping the MDS:

[root@n-oss01 ~]# cat /etc/modprobe.d/lustre.conf
options lnet networks=o2ib(ib0),tcp(eth0)
[root@n-oss01 ~]# modprobe lustre
[root@n-oss01 ~]# lctl list_nids
10.201.62.31@o2ib
10.201.30.31@tcp

[root@n-oss01 ~]# lctl ping 10.201.62.13@o2ib # mds-ib
12345-0@lo
12345-10.201.62.13@o2ib
12345-10.201.30.13@tcp
[root@n-oss01 ~]# lctl ping 10.201.30.13@tcp # mds-eth
12345-0@lo
12345-10.201.62.13@o2ib
12345-10.201.30.13@tcp

The filesystem on the OSS was created via:

[root@n-oss01 ~]# mkfs.lustre --reformat --fsname=foobar --ost --mgsnode=10.201.62.13@o2ib,10.201.30.13@tcp --index=0 /dev/mapper/vd01
[root@n-oss01 ~]# mount -t lustre /dev/mapper/vd01 /lustre/vd01 && sleep 2 && lctl dl
[root@n-oss01 ~]# lctl dl
0 UP mgc MGC10.201.62.13@o2ib 7408e9c5-b92e-5423-fa52-497d0c540a43 5
1 UP ost OSS OSS_uuid 3
2 UP obdfilter foobar-OST0000 foobar-OST0000_UUID 5

So the OSS seems to be happy, and the MDS also looks fine:

[root@n-mds1 ~]# lctl dl
0 UP mgs MGS MGS 7
1 UP mgc MGC10.201.62.13@o2ib 3aecba06-ec8b-aeab-2151-47d5a1c1bc47 5
2 UP lov foobar-MDT0000-mdtlov foobar-MDT0000-mdtlov_UUID 4
3 UP mdt foobar-MDT0000 foobar-MDT0000_UUID 3
4 UP mds mdd_obd-foobar-MDT0000 mdd_obd_uuid-foobar-MDT0000 3
5 UP osc foobar-OST0000-osc-MDT0000 foobar-MDT0000-mdtlov_UUID 5

Mounting the filesystem on the IB-only client works just fine now:

[root@a9115 ~]# cat /etc/modprobe.d/lustre.conf
options lnet networks="o2ib(ib0)"
[root@a9115 ~]# modprobe lustre
[root@a9115 ~]# lctl list_nids
10.201.36.34@o2ib
[root@a9115 ~]# lctl ping 10.201.62.13@o2ib
12345-0@lo
12345-10.201.62.13@o2ib
12345-10.201.30.13@tcp
[root@a9115 ~]# mount -t lustre 10.201.62.13@o2ib:/foobar /cluster/scratch

...but the Ethernet-only client fails:

[root@a9116 ~]# cat /etc/modprobe.d/lustre.conf
options lnet networks=tcp(eth0)
[root@a9116 ~]# modprobe lustre
[root@a9116 ~]# lctl list_nids
10.201.4.35@tcp

lctl ping seems to work:
[root@a9116 ~]# lctl ping 10.201.30.13@tcp # mds
12345-0@lo
12345-10.201.62.13@o2ib
12345-10.201.30.13@tcp
[root@a9116 ~]# lctl ping 10.201.30.31@tcp # oss
12345-0@lo
12345-10.201.62.31@o2ib
12345-10.201.30.31@tcp

...but the mount operation fails with:

[root@a9116 ~]# lctl clear
[root@a9116 ~]# mount -t lustre 10.201.30.13@tcp:/foobar /cluster/scratch/
mount.lustre: mount 10.201.30.13@tcp:/foobar at /cluster/scratch failed: No such file or directory

Apr 12 14:10:21 a9116 kernel: Lustre: MGC10.201.30.13@tcp: Reactivating import
Apr 12 14:10:21 a9116 kernel: LustreError: 9130:0:(ldlm_lib.c:381:client_obd_setup()) can't add initial connection
Apr 12 14:10:21 a9116 kernel: LustreError: 9130:0:(obd_config.c:521:class_setup()) setup foobar-MDT0000-mdc-ffff880e3a01f400 failed (-2)
Apr 12 14:10:21 a9116 kernel: LustreError: 9130:0:(obd_config.c:1362:class_config_llog_handler()) Err -2 on cfg command:
Apr 12 14:10:21 a9116 kernel: Lustre: cmd=cf003 0:foobar-MDT0000-mdc 1:foobar-MDT0000_UUID 2:10.201.62.13@o2ib
Apr 12 14:10:21 a9116 kernel: LustreError: 15c-8: MGC10.201.30.13@tcp: The configuration from log 'foobar-client' failed (-2). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
Apr 12 14:10:21 a9116 kernel: LustreError: 9116:0:(llite_lib.c:978:ll_fill_super()) Unable to process log: -2
Apr 12 14:10:21 a9116 kernel: LustreError: 9116:0:(obd_config.c:566:class_cleanup()) Device 3 not setup
Apr 12 14:10:21 a9116 kernel: LustreError: 9116:0:(ldlm_request.c:1170:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
Apr 12 14:10:22 a9116 kernel: LustreError: 9116:0:(ldlm_request.c:1796:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
Is the MGS specification correct?
Is the filesystem name correct?
If upgrading, is the copied client log valid? (see upgrade docs)
Apr 12 14:10:22 a9116 kernel: Lustre: client ffff880e3a01f400 umount complete
Apr 12 14:10:22 a9116 kernel: LustreError: 9116:0:(obd_mount.c:2349:lustre_fill_super()) Unable to mount (-2)
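
For readers less used to these logs: the negative numbers in the messages above are kernel-style errno values. -2 is ENOENT, which is why mount.lustre prints "No such file or directory", and on Linux -108 is ESHUTDOWN. A minimal Python sketch for looking them up follows; the helper name describe_rc is made up for illustration, and the numbering for codes such as 108 is platform-specific (Linux is assumed here).

```python
import errno
import os


def describe_rc(rc: int) -> str:
    """Format a negative kernel-style return code as 'rc: NAME (message)'."""
    code = abs(rc)
    # errno.errorcode maps numeric codes to symbolic names on this platform.
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{rc}: {name} ({os.strerror(code)})"


# The two return codes that appear in the client log above.
for rc in (-2, -108):
    print(describe_rc(rc))
```

This only translates the codes; it says nothing about why the lookup that produced -2 failed.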

Why does the Ethernet client receive (or pick?) the InfiniBand NID of the MDS (10.201.62.13@o2ib)?

'lctl dk' reports the same:

00000020:01000000:7.0:1334232621.765857:0:9130:0:(obd_config.c:1217:class_config_llog_handler()) Marker, inst_flg=0x0 mark_flg=0x1
00000020:00000080:7.0:1334232621.765859:0:9130:0:(obd_config.c:915:class_process_config()) processing cmd: cf010
00000020:00000080:7.0:1334232621.765860:0:9130:0:(obd_config.c:984:class_process_config()) marker 5 (0x1) foobar-MDT0000 add mdc
00000020:00000080:7.0:1334232621.765861:0:9130:0:(obd_config.c:915:class_process_config()) processing cmd: cf005
00000020:00000080:7.0:1334232621.765870:0:9130:0:(obd_config.c:926:class_process_config()) adding mapping from uuid 10.201.62.13@o2ib to nid 0x500000ac93e0d (10.201.62.13@o2ib)
00000020:00000080:7.0:1334232621.765873:0:9130:0:(obd_config.c:915:class_process_config()) processing cmd: cf005
00000020:00000080:7.0:1334232621.765874:0:9130:0:(obd_config.c:926:class_process_config()) adding mapping from uuid 10.201.62.13@o2ib to nid 0x200000ac91e0d (10.201.30.13@tcp)
00000020:01000000:7.0:1334232621.765877:0:9130:0:(obd_config.c:1299:class_config_llog_handler()) cmd cf001, instance name: foobar-MDT0000-mdc-ffff880e3a01f400
00000020:00000080:7.0:1334232621.765878:0:9130:0:(obd_config.c:915:class_process_config()) processing cmd: cf001
00000020:00000080:7.0:1334232621.765879:0:9130:0:(obd_config.c:318:class_attach()) attach type mdc name: foobar-MDT0000-mdc-ffff880e3a01f400 uuid: 6a7aaf3a-bbcb-9abf-3516-4320e2718614
00000020:00000080:7.0:1334232621.765936:0:9130:0:(genops.c:348:class_newdev()) Adding new device foobar-MDT0000-mdc-ffff880e3a01f400 (ffff8810358320b8)
00000020:00000080:7.0:1334232621.765938:0:9130:0:(obd_config.c:392:class_attach()) OBD: dev 3 attached type mdc with refcount 1
00000020:01000000:7.0:1334232621.765940:0:9130:0:(obd_config.c:1299:class_config_llog_handler()) cmd cf003, instance name: foobar-MDT0000-mdc-ffff880e3a01f400
00000020:00000080:7.0:1334232621.765941:0:9130:0:(obd_config.c:915:class_process_config()) processing cmd: cf003
00000100:00000100:7.0:1334232621.765957:0:9130:0:(client.c:80:ptlrpc_uuid_to_connection()) cannot find peer 10.201.62.13@o2ib!
00010000:00080000:7.0:1334232621.765959:0:9130:0:(ldlm_lib.c:74:import_set_conn()) can't find connection 10.201.62.13@o2ib
00010000:00020000:7.0:1334232621.765960:0:9130:0:(ldlm_lib.c:381:client_obd_setup()) can't add initial connection
00000020:00000080:7.0:1334232621.793034:0:9130:0:(genops.c:786:class_export_put()) final put ffff88103af10400/6a7aaf3a-bbcb-9abf-3516-4320e2718614
00000020:00020000:7.0:1334232621.793043:0:9130:0:(obd_config.c:521:class_setup()) setup foobar-MDT0000-mdc-ffff880e3a01f400 failed (-2)
00000020:00000080:4.0:1334232621.793043:0:8930:0:(genops.c:915:class_import_destroy()) destroying import ffff881039698800 for foobar-MDT0000-mdc-ffff880e3a01f400
00000020:00000080:4.0:1334232621.793049:0:8930:0:(genops.c:743:class_export_destroy()) destroying export ffff88103af10400/6a7aaf3a-bbcb-9abf-3516-4320e2718614 for foobar-MDT0000-mdc-ffff880e3a01f400
00000020:00020000:7.0:1334232621.823020:0:9130:0:(obd_config.c:1362:class_config_llog_handler()) Err -2 on cfg command:
00000020:02000400:7.0:1334232621.850778:0:9130:0:(obd_config.c:1456:class_config_dump_handler()) cmd=cf003 0:foobar-MDT0000-mdc 1:foobar-MDT0000_UUID 2:10.201.62.13@o2ib
00000020:01000000:31.0:1334232621.850818:0:9116:0:(obd_config.c:1393:class_config_parse_llog()) Processed log foobar-client gen 1-13 (rc=-2)
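
To make the trace above easier to read, here is a small Python sketch that collects the "adding mapping from uuid ... to nid ..." records into a table. It is only a log-reading aid: the function names are hypothetical, the two input lines are copied from the 'lctl dk' output above with the prefix fields trimmed, and it does not reproduce what the Lustre client does internally.

```python
import re
from collections import defaultdict

# Two records copied from the 'lctl dk' output above (prefix fields trimmed).
DEBUG_LINES = [
    "adding mapping from uuid 10.201.62.13@o2ib to nid 0x500000ac93e0d (10.201.62.13@o2ib)",
    "adding mapping from uuid 10.201.62.13@o2ib to nid 0x200000ac91e0d (10.201.30.13@tcp)",
]

MAPPING_RE = re.compile(
    r"adding mapping from uuid (\S+) to nid 0x[0-9a-f]+ \(([^)]+)\)"
)


def uuid_to_nids(lines):
    """Group the NIDs registered in the log by the UUID they were added under."""
    table = defaultdict(list)
    for line in lines:
        match = MAPPING_RE.search(line)
        if match:
            uuid, nid = match.groups()
            table[uuid].append(nid)
    return dict(table)


def nids_on_networks(nids, local_networks):
    """Keep only NIDs whose network part (after '@') is configured locally."""
    return [nid for nid in nids if nid.split("@", 1)[1] in local_networks]


table = uuid_to_nids(DEBUG_LINES)
for uuid, nids in table.items():
    usable = nids_on_networks(nids, {"tcp"})
    print(f"uuid {uuid}: nids={nids}, on tcp={usable}")
```

Read this way, the log shows both MDS NIDs registered under a single UUID that is named after the o2ib NID, and the later "cannot find peer 10.201.62.13@o2ib!" message refers to that UUID even though a tcp NID was registered under it.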

I'm also puzzled about the NIDs in CONFIGS/foobar-[client,MDT0000]:

[root@n-mds1 ~]# llog_reader /lustre/mds/CONFIGS/foobar-client |grep uuid
Target uuid : config_uuid
uuid=foobar-clilov_UUID stripe:cnt=1 size=1048576 offset=18446744073709551615 pattern=0x1
uuid=foobar-clilmv_UUID stripe:cnt=0 size=0 offset=0 pattern=0
#10 (088)add_uuid nid=10.201.62.13@o2ib(0x500000ac93e0d) 0: 1:10.201.62.13@o2ib
#11 (088)add_uuid nid=10.201.30.13@tcp(0x200000ac91e0d) 0: 1:10.201.62.13@o2ib
#20 (088)add_uuid nid=10.201.62.31@o2ib(0x500000ac93e1f) 0: 1:10.201.62.31@o2ib
#21 (088)add_uuid nid=10.201.30.31@tcp(0x200000ac91e1f) 0: 1:10.201.62.31@o2ib
[root@n-mds1 ~]# llog_reader /lustre/mds/CONFIGS/foobar-MDT0000 |grep uuid
Target uuid : config_uuid
uuid=foobar-MDT0000-mdtlov_UUID stripe:cnt=1 size=1048576 offset=18446744073709551615 pattern=0x1
#11 (088)add_uuid nid=10.201.62.31@o2ib(0x500000ac93e1f) 0: 1:10.201.62.31@o2ib
#12 (088)add_uuid nid=10.201.30.31@tcp(0x200000ac91e1f) 0: 1:10.201.62.31@o2ib

Is it normal that there are no tcp NIDs on the right-hand side, and what stupid mistake did I make while setting up the system?
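
As a cross-check on the llog_reader output, the hex value printed in parentheses after each NID can be decoded by hand. The sketch below assumes the usual 64-bit LNet NID layout (upper 32 bits: network, with the LND type in the high 16 bits and the network number in the low 16 bits; lower 32 bits: IPv4 address) and assumes type 2 is the socket LND (tcp) and type 5 is o2ib. That layout is not stated in this ticket, but the values above are consistent with it. The function name decode_nid is made up for illustration.

```python
import ipaddress

# Assumed LND type numbers; only the two that appear in this ticket.
LND_NAMES = {2: "tcp", 5: "o2ib"}


def decode_nid(nid: int) -> str:
    """Decode a 64-bit NID into 'address@network' under the assumed layout."""
    net = (nid >> 32) & 0xFFFFFFFF   # upper half: network
    addr = nid & 0xFFFFFFFF          # lower half: IPv4 address
    lnd_type = net >> 16
    net_num = net & 0xFFFF
    name = LND_NAMES.get(lnd_type, f"lnd{lnd_type}")
    # Network number 0 is printed without a suffix (tcp, not tcp0).
    suffix = str(net_num) if net_num else ""
    return f"{ipaddress.IPv4Address(addr)}@{name}{suffix}"


# The four hex NIDs from the add_uuid records above.
for raw in (0x500000ac93e0d, 0x200000ac91e0d, 0x500000ac93e1f, 0x200000ac91e1f):
    print(hex(raw), "->", decode_nid(raw))
```

Under those assumptions each add_uuid record does carry the NID it claims on the left-hand side, tcp included; the right-hand column is the UUID the NID is filed under, and in every record above that UUID is named after the o2ib NID.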



 Comments   
Comment by ETHz Support (Inactive) [ 16/Apr/12 ]

I think that this is the same problem/bug as in http://jira.whamcloud.com/browse/LU-1308:

Apr 11 16:55:53 n1-4-1 kernel: Lustre: MGC172.16.126.1@tcp: Reactivating import <-- mount starts via TCP
....
Apr 11 16:55:53 n1-4-1 kernel: Lustre: cmd=cf003 0:scratch-MDT0000-mdc 1:scratch-MDT0000_UUID 2:172.16.193.1@o2ib <-- what is o2ib doing here?!

That's exactly the same message that we are getting on our installation.

Comment by Peter Jones [ 16/Apr/12 ]

Yes, I think that you are correct about LU-1308 being a duplicate.

Comment by Peter Jones [ 16/Apr/12 ]

OK, let's track this issue under LU-1308, as that was opened first. I will be assigning that ticket shortly.

Comment by ETHz Support (Inactive) [ 16/Apr/12 ]

Could you give me a workaround, or are you working on a patch?

Comment by ETHz Support (Inactive) [ 16/Apr/12 ]

ok

Comment by Oleg Drokin [ 17/Apr/12 ]

Please try this patch: http://review.whamcloud.com/2561
