[LU-17367] Failover on master: Invalid NID string
Created: 14/Dec/23
Updated: 18/Jan/24

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.16.0
Fix Version/s: Lustre 2.16.0

Type: Bug
Priority: Critical
Reporter: Charlie Olmstead
Assignee: James A Simmons
Resolution: Unresolved
Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

All failover sessions on master have been failing Lustre init since build 4455 with Invalid NID string:

2023-12-14T08:22:36 CMD: onyx-96vm2 mkfs.lustre --mgsnode=onyx-82vm11:onyx-82vm13 --fsname=lustre --ost --index=0 --param=sys.timeout=20 --backfstype=zfs --device-size=8388608 --reformat lustre-ost1/ost1 /dev/vg_Role_OSS/ost1
2023-12-14T08:22:36 onyx-96vm2: mkfs.lustre: Invalid NID string 'onyx-82vm11:onyx-82vm13'
2023-12-14T08:22:36 onyx-96vm2: mkfs.lustre: Can't parse NID 'onyx-82vm11:onyx-82vm13'
2023-12-14T08:22:36 onyx-96vm2: mkfs.lustre: exiting with 1 (Operation not permitted)
2023-12-14T08:22:36 pdsh@onyx-82vm4: onyx-96vm2: ssh exited with exit code 1 

 

Latest master build: 

https://testing.whamcloud.com/test_sessions/related?jobs=lustre-master&builds=4486#redirect

 

4455 revision: 2a498f06ccc975fb57214961db6e20a6c1cc2ec7

4454 revision: aa8df6a4a3f50dc86554764f6ccb72db027633f8



 Comments   
Comment by Andreas Dilger [ 15/Dec/23 ]

Charlie, if the NIDs are given as "HOSTNAME@tcp0" does this work again?

James, traditionally the "@tcp0" has been assumed as part of the NID if no nettype is provided.
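
For illustration only (this is not Lustre code), the defaulting behaviour described above could be sketched as follows. The helper name with_default_net and its default_net parameter are hypothetical; the sketch assumes a single NID string and treats any '@' as the start of the net type.

```python
def with_default_net(nid: str, default_net: str = "tcp0") -> str:
    """Append '@<default_net>' when the NID string carries no net type.

    Hypothetical helper: it only mirrors the convention mentioned in the
    comment above and is not the libcfs implementation.
    """
    return nid if "@" in nid else f"{nid}@{default_net}"


if __name__ == "__main__":
    print(with_default_net("onyx-82vm11"))
    print(with_default_net("onyx-82vm11@o2ib"))
```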

Comment by James A Simmons [ 15/Dec/23 ]

Can you point to an exact test that always fails with this? I was looking at the logs and there are many unrelated failures. Normally, if you call libcfs_nidstr() and the string lacks @nettype, it should fill it in.

Oh, I see. It's the problem of how to tell ':' used as a delimiter from ':' used in IPv6 addresses. The use of ':' as a delimiter causes so many headaches.

Comment by Andreas Dilger [ 15/Dec/23 ]

Could we exclude IPv6 NIDs for names that contain chars other than hex digits? Not perfect, but it would handle most cases...

Comment by Charlie Olmstead [ 15/Dec/23 ]

simmonsja - "Can you point to an exact test that always fails with this."

Essentially every failover-part-x or failover-zfs-part-x session triggered by lustre-master going back to 8/24/2023 will have the invalid NID error.

https://testing.whamcloud.com/search?horizon=15552000&jobs%5B%5D=lustre-master&test_groups%5B%5D=failover-part-1&test_groups%5B%5D=failover-part-2&test_groups%5B%5D=failover-part-3&test_groups%5B%5D=failover-zfs-part-1&test_groups%5B%5D=failover-zfs-part-2&test_groups%5B%5D=failover-zfs-part-3&test_set_script_id=5e9346a2-09e0-11e9-a2cc-52540065bddc&source=test_sets#redirect

Comment by Jian Yu [ 17/Jan/24 ]

The regression failure was introduced by the following commit on master branch:

commit 101f6e84889a9b48238ca320557101058d935fb0
Author:     James Simmons <jsimmons@infradead.org>
AuthorDate: Thu Aug 3 16:57:02 2023 -0400
Commit:     Oleg Drokin <green@whamcloud.com>
CommitDate: Thu Aug 24 04:32:05 2023 +0000
    
    LU-10391 obdclass: handle large NIDs for mount strings
    
    Mount strings support using ':' as a delimiter but this is also
    a part of the some NID strings like IPv6, so rework class_parse_value()
    to only look at ':' when it occurs after '@'.
    
    The mount utilities use the function convert_hostnames() to ensure
    the mount string containing an NID is valid. This only works for
    small size nids so migrate the function to handle large NIDs. This
    should allow mounting with IPv6 or other large NID addresses.
    
    In testing the userland  libcfs_ip_str2addr_size() had bugs that
    rendered incorrect NID strings. Fix those issues.
    
    Fixes: b6c702df5d4 ("LU-10391 libcfs: add large-nid string conversion functions.")
    Change-Id: Ic9b2a368456ba75ceb5911ac7f75ae00d6123870
    Signed-off-by: James Simmons <jsimmons@infradead.org>
    Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/50362

The failover test sessions on master branch and the patches with failover test parameters are blocked.
 

Comment by James A Simmons [ 17/Jan/24 ]

The issue is that the ':' delimiter is also part of the IPv6 address spec. Currently there is no way to tell the difference between the two. The way NFS handled this was to introduce "[]" around the addresses. Perhaps we should implement this approach. Sadly, I don't see people writing [myhost1]:[myhost2], which is a problem. Mixing addresses with hostnames can happen, which makes this totally blow up. The problem is that people don't want to change their ways by adding "[]" around myhost, for example. Suggestions?

Comment by Andreas Dilger [ 18/Jan/24 ]

As I mentioned earlier, checking for non-hex characters allows distinguishing between IPv6 and hostnames in most cases. Not perfect, but an improvement. Similarly, "." is not used in IPv6 addresses, so we could detect IPv4 addresses similarly.
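
A minimal sketch of this heuristic, written in Python purely for illustration (the real parsing lives in the Lustre C utilities). The function names looks_like_ipv6 and split_nid_list are hypothetical, and the sketch assumes the input carries no '@nettype' suffixes. It treats the whole string as one IPv6 address only when every ':'-separated group is empty or made of at most four hex digits; anything else, including names containing '.', '-', or other non-hex characters, is split on ':'.

```python
import string

HEX_DIGITS = set(string.hexdigits)


def looks_like_ipv6(text: str) -> bool:
    """Return True if every ':'-separated group could be an IPv6 group.

    Heuristic only: a hostname pair such as 'beef:cafe' is made of hex
    digits and is therefore misclassified as IPv6.
    """
    if ":" not in text:
        return False
    groups = text.split(":")
    return all(len(g) <= 4 and set(g) <= HEX_DIGITS for g in groups)


def split_nid_list(text: str) -> list[str]:
    """Split a ':'-delimited list unless it looks like one IPv6 address."""
    if looks_like_ipv6(text):
        return [text]
    return text.split(":")


if __name__ == "__main__":
    for example in ("onyx-82vm11:onyx-82vm13",
                    "192.168.1.1:192.168.1.2",
                    "fe80::1"):
        print(example, "->", split_nid_list(example))
```

The 'beef:cafe' case is the "not perfect" part: a pair of hostnames made up only of hex digits cannot be told apart from an IPv6 fragment by this check alone.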

If there is something that needs "[]" around it, then it should be the IPv6 NID itself, since that is the "new" case that nobody is using, while "hostname:hostname" or "NID:NID" is the existing case that shouldn't break. My main objection against "[]" in your previous patch was that it was putting "[]" around all of the NIDs like "[nid1:nid2:nid3:nid4]" (where it doesn't actually help parsing the separate NIDs) instead of around each individual NID like "[nid1]:[nid2]:[nid3]:[nid4]" where it would make sense. It should be optional for IPv4 NIDs and hostnames, but possibly required around IPv6 NIDs if they do not also have an "@tcp" to separate them.
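
The per-NID bracketing described above could be parsed roughly as follows. This is a sketch in Python under stated assumptions, not the mount utility code: the function name split_bracketed_nids is hypothetical, brackets are optional, the bracket contents are taken verbatim as one NID (including any '@nettype' inside them), and an unbracketed entry ends at the next ':'.

```python
def split_bracketed_nids(text: str) -> list[str]:
    """Split 'nid1:[nid2]:nid3' into NIDs, keeping ':' inside '[...]'."""
    nids = []
    pos = 0
    while pos < len(text):
        if text[pos] == "[":
            # Bracketed NID: everything up to the closing ']' is one NID.
            end = text.find("]", pos)
            if end == -1:
                raise ValueError(f"unterminated '[' in {text!r}")
            nids.append(text[pos + 1:end])
            pos = end + 1
        else:
            # Unbracketed NID: runs up to the next ':' or end of string.
            end = text.find(":", pos)
            if end == -1:
                end = len(text)
            nids.append(text[pos:end])
            pos = end
        # After each NID, expect a ':' delimiter or the end of the string.
        if pos < len(text):
            if text[pos] != ":":
                raise ValueError(f"expected ':' at offset {pos} in {text!r}")
            pos += 1
    return nids


if __name__ == "__main__":
    print(split_bracketed_nids("onyx-82vm11:onyx-82vm13"))
    print(split_bracketed_nids("[fe80::1@tcp]:[fe80::2@tcp]"))
```

With this shape, the existing "hostname:hostname" form keeps working unchanged, and only IPv6 NIDs need the brackets.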

Generated at Sat Feb 10 03:34:52 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.