[LU-16462] conf-sanity sles12.5 test_43a: lctl: attr.c:201: validate_nla: Assertion `0' failed. Created: 11/Jan/23  Updated: 21/Jul/23  Resolved: 01/May/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.16.0
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Minor
Reporter: Maloo Assignee: James A Simmons
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-10003 lnetctl error "cannot add network: in... Reopened
is related to LU-9680 Improve the user land to kernel space... In Progress
is related to LU-16694 cleanup test-framework.sh, test direc... Reopened
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

This issue relates to the following test suite run on sles12.5 clients:
https://testing.whamcloud.com/test_sets/258bf667-e863-4adb-af68-213f7877b909
https://testing.whamcloud.com/test_sets/e3b187ad-af69-4bd1-b43c-583da240aef3

test_43a failed with the following error:

lctl dl
BUG at file position attr.c:201:validate_nla
lctl: attr.c:201: validate_nla: Assertion `0' failed.

The same failure exists in a number of other subtests that also use "lctl dl":

  • sanity: test_33i, test_104d, test_154d
  • conf-sanity: test_43b, test_70c, test_91

It looks like the validate_nla() function is part of libnl (netlink), so this failure very likely relates to the new usage of netlink in "lctl dl" to get the device list.

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
conf-sanity test_70c - set deactivate failed



 Comments   
Comment by Andreas Dilger [ 12/Jan/23 ]

James, can you please take a look.

I don't think SLES12 clients are in such heavy usage that they need to get the latest netlink functionality, but at one point at least Cray was heavily based on SLES for their client distro, so at least we shouldn't break it gratuitously. If there isn't a straightforward way to fix it, I'd be fine with just configuring out the netlink functionality and always using ioctl/debugfs in this case (which isn't worse than what was available before).

Comment by James A Simmons [ 12/Jan/23 ]

What version of libnl is installed? Do we have a special Test-parameter tag for SUSE12? I suspect that the libnl library is older, so it's lacking proper support for NLA_S64.

Comment by Gerrit Updater [ 12/Jan/23 ]

"James Simmons <jsimmons@infradead.org>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49608
Subject: LU-16462 utils: handle lack of NLA_S64
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a6780268c84f876b98f8d31a6830dcb9acf16077

Comment by Andreas Dilger [ 27/Jan/23 ]

This is preventing patches on e2fsprogs from passing testing, since they run with sles12sp5 clients.

https://testing.whamcloud.com/test_sessions/96f8909f-bbaa-40f2-b969-f261d4b0398f
https://testing.whamcloud.com/test_sessions/1f946f9d-86de-401d-9c45-b3b445eb10b4

Comment by Andreas Dilger [ 06/Mar/23 ]

Comment from Dongyang in the 49608 patch that explains the issue:

lctl dl is triggering an assert inside libnl3 because we use NLA_NUL_STRING.
In the old libnl3, we don't have NLA_NUL_STRING or NLA_S8|16|32|64:

enum {
	NLA_UNSPEC,	/**< Unspecified type, binary data chunk */
	NLA_U8,		/**< 8 bit integer */
	NLA_U16,	/**< 16 bit integer */
	NLA_U32,	/**< 32 bit integer */
	NLA_U64,	/**< 64 bit integer */
	NLA_STRING,	/**< NUL terminated character string */
	NLA_FLAG,	/**< Flag */
	NLA_MSECS,	/**< Micro seconds (64bit) */
	NLA_NESTED,	/**< Nested attributes */
	__NLA_TYPE_MAX,
};

#define NLA_TYPE_MAX (__NLA_TYPE_MAX - 1)

If we try to use any NLA type greater than NLA_TYPE_MAX, it will trigger the assert in validate_nla().
Do we have to use NLA_NUL_STRING instead of NLA_STRING, and the signed NLA types?
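
The range check described in this comment can be modelled with a short C sketch. This is a simplified illustration, not libnl's actual source: the enum is copied from the old libnl3 header quoted in this comment, while policy_type_supported(), validate_policy_type(), and NEWER_TYPE_EXAMPLE are hypothetical names introduced only for the example.

```c
#include <assert.h>

/* Attribute types known to the old libnl3 (from the enum quoted in this ticket). */
enum {
	NLA_UNSPEC,
	NLA_U8,
	NLA_U16,
	NLA_U32,
	NLA_U64,
	NLA_STRING,
	NLA_FLAG,
	NLA_MSECS,
	NLA_NESTED,
	__NLA_TYPE_MAX,
};

#define NLA_TYPE_MAX (__NLA_TYPE_MAX - 1)

/*
 * Assumption: stand-in for a type that only newer headers define
 * (e.g. NLA_NUL_STRING). Its real numeric value is not taken from libnl;
 * the only property that matters here is that it exceeds NLA_TYPE_MAX.
 */
#define NEWER_TYPE_EXAMPLE (NLA_TYPE_MAX + 1)

/* Returns 1 if the old library knows this policy type, 0 otherwise. */
int policy_type_supported(int policy_type)
{
	return policy_type >= 0 && policy_type <= NLA_TYPE_MAX;
}

/*
 * Hypothetical model of the failing check: an unknown policy type is
 * treated as a programming error and aborts the process, which is the
 * behaviour "lctl dl" ran into.
 */
void validate_policy_type(int policy_type)
{
	assert(policy_type_supported(policy_type));
}
```

With these definitions, a policy entry using NLA_STRING passes the check, while one using a type newer than NLA_NESTED would abort the process instead of returning an error to the caller.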

Comment by Andreas Dilger [ 06/Apr/23 ]

Hi James, any thoughts on how to make progress on this issue?

We have e2fsck fixes blocked from landing for a couple of months because the netlink patch has broken "lctl dl" on SLES12 clients. I don't think we need to retroactively add support for SLES12 clients to allow non-root users to run "lctl dl", so it would be fine if the netlink code was completely disabled for older clients that don't have NLA_S32 or NLA_NUL_STRING and only the ioctl fallback was used. It just needs to not break the old code.

Comment by Andreas Dilger [ 06/Apr/23 ]

I'm going to push a patch that disables yaml netlink usage if NLA_NUL_STRING is not defined. This works fine for "lctl dl" in my local testing, but I still need to fix "lctl ping" (sanity test_217).
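
A minimal sketch of what such a compile-time fallback could look like is shown here. It is not the code from the patch. Because the libnl attribute types quoted in this ticket are enum constants rather than preprocessor macros, the sketch assumes a hypothetical macro HAVE_NLA_NUL_STRING that a configure-time check would define. The function names, the stubs, and the choice of -EOPNOTSUPP as the "netlink unavailable" return code are likewise assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical: a configure check would define this only when the
 * installed libnl3 headers provide NLA_NUL_STRING and the signed types.
 * It is left undefined here to model an old libnl3 such as on SLES12.
 */
/* #define HAVE_NLA_NUL_STRING 1 */

/* Call counters so the stubs below can show which path was taken. */
int calls_netlink;
int calls_ioctl;

/* Stub standing in for the netlink-based device listing. */
int stub_netlink_list(void)
{
	calls_netlink++;
	return 0;
}

/* Stub standing in for the older ioctl-based device listing. */
int stub_ioctl_list(void)
{
	calls_ioctl++;
	return 0;
}

/*
 * Use netlink only when it was compiled in; otherwise, or when netlink
 * reports that it is unsupported, fall back to the ioctl path.
 */
int list_devices(int (*netlink_list)(void), int (*ioctl_list)(void))
{
	assert(ioctl_list != NULL);
#ifdef HAVE_NLA_NUL_STRING
	if (netlink_list != NULL) {
		int rc = netlink_list();

		/* Assumption: -EOPNOTSUPP means "netlink not available". */
		if (rc != -EOPNOTSUPP)
			return rc;
	}
#else
	(void)netlink_list;	/* netlink path configured out */
#endif
	return ioctl_list();
}
```

With the macro undefined, list_devices() never touches the netlink path, so an old libnl3 cannot reach the assertion and the behaviour matches what was available before the netlink change.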

Comment by Gerrit Updater [ 12/Apr/23 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50610
Subject: LU-16462 utils: skip netlink for old libnl3
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 5ff26d3c3205baacefb3c0e48c2a03ff0713db39

Comment by Andreas Dilger [ 24/Apr/23 ]

Patch https://review.whamcloud.com/49608 "LU-16462 utils: handle lack of NLA_S64" has been updated to handle the sles12sp5 libnl incompatibility, along with test and tool fixes for the netlink-unavailable fallback case so that "lctl ping" and "lctl list_nids" continue to work.

Comment by James A Simmons [ 25/Apr/23 ]

Thank you, Andreas, for your help.

Comment by Andreas Dilger [ 25/Apr/23 ]

I just ran across an old patch from Amir that is replacing usage of "lctl ping" and "lctl list_nids" with the equivalent "lnetctl" commands.

The output is clunky and needs some awk to parse it into just a NID:

$ lnetctl net show
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: tcp
      local NI(s):
        - nid: 192.168.10.99@tcp
          status: up
          interfaces:
              0: enp0s3

$ lnetctl net show | awk '/nid:/ && $3 != "0@lo" { print $3 }'
192.168.10.99@tcp

Alexey suggested in that patch to put this into a helper function in test-framework.sh instead of having it inline in multiple places. However, users would probably also want to print some of these fields outside of testing, instead of the full YAML.

Having a command-line argument like "lnetctl net show -nid" in this case (but also able to print other fields like "... --status", "nettype", "-interfaces") would be more convenient than users having to use "awk" or "yq" to extract the fields manually.
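
For illustration, the filtering done by the awk one-liner in this comment can be sketched in C. This is only a rough model under stated assumptions: first_remote_nid() is a hypothetical helper, not an existing lnetctl or Lustre function; it assumes the "- nid: <value>" line layout shown in the sample output; and unlike the awk command it returns only the first non-loopback NID rather than all of them.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: scan YAML-like "lnetctl net show" output and copy
 * the first NID that is not the loopback NID "0@lo" into `out`.
 * Returns 1 if a NID was found, 0 otherwise.
 */
int first_remote_nid(const char *yaml, char *out, size_t outlen)
{
	const char *line = yaml;

	assert(yaml != NULL && out != NULL && outlen > 0);

	while (*line != '\0') {
		const char *eol = strchr(line, '\n');
		size_t len = eol ? (size_t)(eol - line) : strlen(line);
		char buf[256];
		char nid[128];

		/* Overlong lines are skipped rather than truncated. */
		if (len < sizeof(buf)) {
			memcpy(buf, line, len);
			buf[len] = '\0';

			/* Match lines of the form "  - nid: <value>". */
			if (sscanf(buf, " - nid: %127s", nid) == 1 &&
			    strcmp(nid, "0@lo") != 0) {
				snprintf(out, outlen, "%s", nid);
				return 1;
			}
		}
		line = eol ? eol + 1 : line + len;
	}
	return 0;
}
```

A built-in option that selects fields would let the tool do this filtering itself, so neither tests nor users would need to depend on the exact YAML layout.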

Comment by Gerrit Updater [ 01/May/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49608/
Subject: LU-16462 utils: handle lack of newer nla_attrs
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: ae1ee11cea0a90631e14d670883528d6ac6e86b7

Comment by Peter Jones [ 01/May/23 ]

Landed for 2.16

Generated at Sat Feb 10 03:27:13 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.