[LU-17414] liblnetconfig printing bogus error numbers for users Created: 11/Jan/24  Updated: 04/Feb/24  Resolved: 04/Feb/24

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.0, Lustre 2.14.0, Lustre 2.16.0, Lustre 2.12.9, Lustre 2.15.0
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Minor
Reporter: Andreas Dilger Assignee: Arshad Hussain
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

I was lead to looking at lustre_lnet_config_ni(), and found that it is returning somewhat "random" error codes to the user and then lnetctl is printing them like "errno: -1". However, it turns out that "-1" is not the POSIX error code "-EPERM" as one might expect, but rather "LUSTRE_CFG_RC_BAD_PARAM", which is IMHO very confusing.

Also, lustre_lnet_config_ni() is also returning POSIX error codes from the l_ioctl() call in the function, so it is impossible to know which error type is actually being returned.

This code should be changed to use existing POSIX error numbers that have similar meanings, or could at least fill the same role (with some creative mapping):

    #define LUSTRE_CFG_RC_NO_ERR                     0  => fine
    #define LUSTRE_CFG_RC_BAD_PARAM                 -1  => -EINVAL
    #define LUSTRE_CFG_RC_MISSING_PARAM             -2  => -EFAULT
    #define LUSTRE_CFG_RC_OUT_OF_RANGE_PARAM        -3  => -ERANGE
    #define LUSTRE_CFG_RC_OUT_OF_MEM                -4  => -ENOMEM
    #define LUSTRE_CFG_RC_GENERIC_ERR               -5  => -ENODATA
    #define LUSTRE_CFG_RC_NO_MATCH                  -6  => -ENOMSG
    #define LUSTRE_CFG_RC_MATCH                     -7  => -EXFULL
    #define LUSTRE_CFG_RC_SKIP                      -8  => -EBADSLT
    #define LUSTRE_CFG_RC_LAST_ELEM                 -9  => -ECHRNG
    #define LUSTRE_CFG_RC_MARSHAL_FAIL              -10 => -ENOSTR

I don't think "overloading" the POSIX error codes to mean something similar is worse than using random numbers to report errors.

It looks like these error definitions are only in userspace, but I'm not 100% sure. The vast majority of LUSTRE_CFG_RC_* errors are only used inside liblnetconfig, so changing them won't affect the library at all.

The only other place these error values are used are in lnetctl.c, which mostly only checking "!= LUSTRE_CFG_RC_NO_ERR", this should be relatively safe to change, and lustre/tests/lutf/python/ which should also be OK to change I think (as long as the error codes from one node are not interpreted by lnetctl on another node).

If binary compatibility is needed, it looks like none of the values currently overlap, so it would be possible to check both the old and new values for the return in the few places that are not checking LUSTRE_CFG_RC_NO_ERR, and then at some point in the future start actually returning them...



 Comments   
Comment by Andreas Dilger [ 11/Jan/24 ]

The only places outside of liblnetconfig that are not using LUSTRE_CFG_RC_NO_ERR are:

lnet/utils/lnetctl.c:2331:				rc = LUSTRE_CFG_RC_BAD_PARAM;
lnet/utils/lnetctl.c:3971:				rc = LUSTRE_CFG_RC_BAD_PARAM;
lnet/utils/lnetctl.c:3980:				rc = LUSTRE_CFG_RC_BAD_PARAM;
lnet/utils/lnetctl.c:3989:				rc = LUSTRE_CFG_RC_BAD_PARAM;
lustre/tests/lutf/python/tests-infra/lnet_helpers.py:294:	if (rc == lnetconfig.LUSTRE_CFG_RC_MISSING_PARAM):
lustre/tests/lutf/python/tests/suite_dlc/test_dlc_libconfig_01.py:26:	if (rc == lnetconfig.LUSTRE_CFG_RC_MISSING_PARAM):
lustre/tests/lutf/python/tests/suite_dlc/test_dlc_libconfig_01.py:36:	if (rc == lnetconfig.LUSTRE_CFG_RC_MISSING_PARAM):
lustre/tests/lutf/python/tests/suite_dlc/test_dlc_libconfig_01.py:46:	if (rc == lnetconfig.LUSTRE_CFG_RC_MISSING_PARAM):

Of the external usage:

  • the first one is cosmetic (it sets the error, then it is only checked if not LUSTRE_CFG_RC_BAD_PARAM).
  • the second one is printed in an error message
  • the last two in lnetctl are actually clobbered after being set (rc = yaml_lnet_peer() is called immediately afterward)
  • I can't really comment about the Python usage, but think lutf is both currently not even tested, and also just a test case, so compatibility is mostly irrelevant
Comment by Gerrit Updater [ 12/Jan/24 ]

"Arshad Hussain <arshad.hussain@aeoncomputing.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53657
Subject: LU-17414 lnet: Use POSIX error for libnetconfig
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2ddbfd4e4d7cc9ea210a3491038c8264a6614c51

Comment by Gerrit Updater [ 04/Feb/24 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/53657/
Subject: LU-17414 lnet: Use POSIX error number for libnetconfig
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 6c3a2841a77c2f11fd424b0b949d0c77a1ee4026

Comment by Peter Jones [ 04/Feb/24 ]

Merged for 2.16

Generated at Sat Feb 10 03:35:15 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.