[LU-3445] Specifying multiple networks in NIDs no longer works Created: 07/Jun/13  Updated: 09/Jan/14  Resolved: 20/Aug/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.1, Lustre 2.5.0

Type: Bug Priority: Major
Reporter: Oliver Mangold Assignee: Jian Yu
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
is duplicated by LU-3830 mount fails on targets configured wit... Resolved
Severity: 3
Rank (Obsolete): 8593

 Description   

With older versions of Lustre, I used IB as the default network with an Ethernet connection as a fallback, so each NID specified for failover.node or mgs.node contained both networks, e.g.:

failover.node=10.3.0.228@o2ib,192.168.50.128@tcp

When I try this with 2.4.0 it appears the parser does not understand this syntax anymore. I get an error in syslog:

LDISKFS-fs (dm-7): Unrecognized mount option "192.168.50.128@tcp" or missing value
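
The error is consistent with the option string being split on commas, so that the second NID is handed on as a separate option without its key. The following Python sketch only illustrates that reading; the real parsing happens in Lustre's C code, and the function names `split_mount_options` and `bare_tokens` are hypothetical.

```python
def split_mount_options(options: str) -> list[str]:
    """Split a mount option string on commas, as a naive parser would."""
    return [token for token in options.split(",") if token]


def bare_tokens(options: str) -> list[str]:
    """Return tokens that contain '@' but no '=', i.e. NIDs that lost their key."""
    return [t for t in split_mount_options(options) if "@" in t and "=" not in t]


# Option string modeled on the report; the errors= option is only there for context.
opts = "errors=remount-ro,failover.node=10.3.0.228@o2ib,192.168.50.128@tcp"
print(split_mount_options(opts))
print(bare_tokens(opts))
```

The bare token found this way is the same string that appears in the LDISKFS-fs message.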



 Comments   
Comment by Peter Jones [ 10/Jun/13 ]

Yu, Jian

Could you please look into this issue?

Thanks

Peter

Comment by Jian Yu [ 17/Jun/13 ]

A simple test showed that:

# tunefs.lustre --dryrun /dev/vda5
checking for existing Lustre data: found
Reading CONFIGS/mountdata

Read previous values:
Target: lustre-OST0000
Index: 0
Lustre FS: lustre
Mount type: ldiskfs
Flags: 0x62
(OST first_time update )
Persistent mount opts: errors=remount-ro
Parameters: sys.timeout=20 mgsnode=10.3.0.228@o2ib,192.168.50.128@tcp

Permanent disk data:
Target: lustre:OST0000
Index: 0
Lustre FS: lustre
Mount type: ldiskfs
Flags: 0x62
(OST first_time update )
Persistent mount opts: errors=remount-ro
Parameters: sys.timeout=20 mgsnode=10.3.0.228@o2ib,192.168.50.128@tcp

exiting before disk write.

# mount -v -t lustre /dev/vda5 /mnt/ost1
arg[0] = /sbin/mount.lustre
arg[1] = -v
arg[2] = -o
arg[3] = rw
arg[4] = /dev/vda5
arg[5] = /mnt/ost1
source = /dev/vda5 (/dev/vda5), target = /mnt/ost1
options = rw
checking for existing Lustre data: found
Reading CONFIGS/mountdata
Writing CONFIGS/mountdata
mounting device /dev/vda5 at /mnt/ost1, flags=0x1000000 options=osd=osd-ldiskfs,errors=remount-ro,mgsnode=10.3.0.228@o2ib,192.168.50.128@tcp,virgin,update,param=sys.timeout=20,param=mgsnode=10.3.0.228@o2ib,192.168.50.128@tcp,svname=lustre-OST0000,device=/dev/vda5
mount.lustre: mount /dev/vda5 at /mnt/ost1 failed: Invalid argument retries left: 0
mount.lustre: mount /dev/vda5 at /mnt/ost1 failed: Invalid argument
This may have multiple causes.
Are the mount options correct?
Check the syslog for more info.

I'm creating a patch to fix add_param() so that it adds "key" before each comma-separated "sub-val" in "val".
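
The idea can be sketched in Python as follows. This is not the patch itself: the real add_param() is C code, and the helper name `expand_param` and its signature are assumptions made for illustration.

```python
def expand_param(key: str, val: str) -> str:
    """Repeat key in front of every comma-separated sub-value of val.

    Hypothetical helper: it only models the transformation described
    in this comment, not the actual add_param() implementation.
    """
    sub_vals = [s for s in val.split(",") if s]
    return ",".join(f"{key}={s}" for s in sub_vals)


print(expand_param("mgsnode", "10.3.0.228@o2ib,192.168.50.128@tcp"))
```

With the key repeated, every comma-separated token in the resulting option string is a complete key=value pair, so none of them is left as a bare NID.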

Comment by Jian Yu [ 18/Jun/13 ]

Patch for Lustre master branch: http://review.whamcloud.com/6686

Comment by Jian Yu [ 12/Aug/13 ]

Hi Oleg,

Could you please cherry-pick the patch to Lustre b2_4 branch? Thanks.

Comment by Oleg Drokin [ 15/Aug/13 ]

This patch cannot be cherry-picked to b2_4 due to a conflict.

Comment by Bob Glossman (Inactive) [ 15/Aug/13 ]

Backport to b2_4: http://review.whamcloud.com/7344

Comment by Peter Jones [ 20/Aug/13 ]

Landed for 2.4.1 and 2.5

Comment by Aurelien Degremont (Inactive) [ 08/Jan/14 ]

It seems to me this bug is still there for 2 reasons:

- The patch only takes care of mkfs/tunefs, so there is still an upgrade issue if mountdata contains something like failover.node=10.3.0.228@o2ib,192.168.50.128@tcp
The workaround appears to be a writeconf, which has side effects. Are there UPGRADE notes somewhere that cover this?

- The patch seems to turn this kind of string

--failnode=10.3.0.228@o2ib,192.168.50.128@tcp

into

--failnode=10.3.0.228@o2ib --failnode=192.168.50.128@tcp

These are not the same. The first example refers to one failnode with two NIDs to reach it. The second refers to two different failnodes with one NID each.
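
To make the difference concrete, here is a small Python sketch that groups NIDs per failnode under the two readings. The function `failnodes_from_args` is hypothetical and only models the semantics described in this comment; it is not Lustre code.

```python
def failnodes_from_args(args: list[str]) -> list[list[str]]:
    """Model each --failnode argument as one node whose NIDs are comma-separated."""
    prefix = "--failnode="
    nodes = []
    for arg in args:
        if arg.startswith(prefix):
            nodes.append(arg[len(prefix):].split(","))
    return nodes


# One failnode reachable over two networks.
one_node = failnodes_from_args(["--failnode=10.3.0.228@o2ib,192.168.50.128@tcp"])

# Two failnodes with a single NID each.
two_nodes = failnodes_from_args(
    ["--failnode=10.3.0.228@o2ib", "--failnode=192.168.50.128@tcp"]
)

print(len(one_node), one_node)
print(len(two_nodes), two_nodes)
```

The same two NIDs appear in both results, but the first groups them under a single node while the second lists them as separate nodes.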

Comment by Jian Yu [ 09/Jan/14 ]

Thanks Aurelien for pointing this out. I'll look into these issues and figure out whether I should fix the original issue in lmd_parse(). I'll create a new Jira ticket to track the work.

Comment by Aurelien Degremont (Inactive) [ 09/Jan/14 ]

Thanks Jian. For my own tracking I've just created the ticket: LU-4460.
