[LU-15571] iotrace debug mask causing interop testing failures Created: 20/Feb/22  Updated: 27/Mar/22  Resolved: 27/Mar/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.0
Fix Version/s: Lustre 2.15.0

Type: Bug Priority: Major
Reporter: Andreas Dilger Assignee: Patrick Farrell
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
is duplicated by LU-15696 Interop sanity test_24v: set_param -n... Resolved
Related
is related to LU-15317 add iotrace debug Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The addition of iotrace to the default debug mask is causing interop testing failures with new clients against older servers (e.g. 2.14.0) for subtests that restore the debug mask at the end of the test. For example, sanity test_24v:
https://testing.whamcloud.com/test_sets/96c360e4-cbca-46b4-8a8e-0371f9a8f4b4

onyx-61vm3: error: set_param: setting /sys/kernel/debug/lnet/debug=trace inode super iotrace malloc cache info ioctl neterror net warning buffs other dentry nettrace page dlmtrace error emerg ha rpctrace vfstrace reada mmap config console quota sec lfsck hsm snapshot layout: Invalid argument
pdsh@onyx-61vm1: onyx-61vm3: ssh exited with exit code 22

and the console logs show:

cfs_str2mask()) unknown mask 'iotrace'.

This is likely caused by the test-framework applying the client debug mask (which contains iotrace by default) on all of the remote nodes.

It is probably enough to filter out the "iotrace" string from the saved debug mask before applying it on a remote node, if the server version is older than 2.14.57 (or whatever version the patch was included in).
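A minimal bash sketch of that filtering follows. It assumes the mask is a space-separated list as printed by `lctl get_param -n debug`, takes the server version as a plain string instead of querying the node, and uses 2.14.57 as the cutoff only because that is the approximate version mentioned above. The helper names `version_to_code` and `filter_debug_mask` are hypothetical and are not test-framework functions.

```bash
#!/bin/bash
# Hypothetical helper: turn "X.Y.Z" into a comparable integer
# (each component is assumed to be < 256).
version_to_code() {
	local major minor patch
	IFS=. read -r major minor patch <<< "$1"
	echo $(( (${major:-0} << 16) | (${minor:-0} << 8) | ${patch:-0} ))
}

# Hypothetical helper: drop "iotrace" from a space-separated debug mask
# when the target server predates the (assumed) 2.14.57 cutoff.
filter_debug_mask() {
	local mask=$1 server_version=$2
	local cutoff word
	local -a out=()

	cutoff=$(version_to_code 2.14.57)
	if (( $(version_to_code "$server_version") >= cutoff )); then
		# New enough server: it knows "iotrace", restore the mask as-is.
		echo "$mask"
		return
	fi
	for word in $mask; do
		[[ $word == iotrace ]] || out+=("$word")
	done
	echo "${out[*]}"
}
```

For example, `filter_debug_mask "$saved_debug" 2.14.0` would return the saved mask with `iotrace` removed, while a 2.15.0 server would get it unchanged. Keying on the server version keeps the restored mask exact on new servers, at the cost of having to know which release introduced each mask name.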



 Comments   
Comment by Andreas Dilger [ 20/Feb/22 ]

Affected tests include at least sanity 24v, 24A, and 27U, and there may be more.

I'm not sure whether there is a 2.15.0 release tracker that shows the other interop issues, but if so, this ticket should be linked there. At least some sanity-flr interop failures appear to be functional problems rather than just test issues.

Comment by Patrick Farrell [ 25/Feb/22 ]

You said we added it to the default debug mask, but I don't think we did that, did we? The problem is just its existence as a debug mask name at all, right?

Comment by Andreas Dilger [ 26/Feb/22 ]

It may not be in the default debug mask, but iotrace is at least always enabled when these tests are run. The client saves only the output of "lctl get_param debug" from the local node, assumes it is the same across the clients and servers, and then "restores" it on the older servers, where the error is hit. It looks like this is happening in:

# wrappers for createmany and unlinkmany
# to set debug=0 if number of creates is high enough
# this is to speedup testing
function createmany() {
        local count=${!#}

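        (( count > 100 )) && {
                # saved_debug is read from the local client only
                local saved_debug=$($LCTL get_param -n debug)
                local list=$(comma_list $(all_nodes))

                do_nodes $list $LCTL set_param -n debug=0
        }
        $LUSTRE/tests/createmany $*
        local rc=$?
        # the client's mask (including "iotrace") is written back to every
        # node, and an older server rejects the unknown name with EINVAL
        (( count > 100 )) &&
                do_nodes $list "$LCTL set_param -n debug=\\\"$saved_debug\\\""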
        return $rc
}
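The root cause above is that `saved_debug` comes from the local client only. One way to avoid it, sketched here under stated assumptions, is to save each node's own mask and restore that value to the same node, so an older server never receives a mask name it does not know. This is not the merged patch. `get_node_debug` and `set_node_debug` are hypothetical stand-ins for running `$LCTL get_param -n debug` / `$LCTL set_param -n debug=...` on one node (e.g. via `do_node`); here they are simulated with an associative array, and the "old server rejects iotrace" behavior is modeled by hand, so the sketch runs on its own (bash 4.2+).

```bash
#!/bin/bash
# Simulated per-node debug masks (stand-in for real client/server nodes).
declare -A NODE_DEBUG=(
	[client1]="trace iotrace malloc"
	[oldmds]="trace malloc"
)

# Hypothetical stand-ins for reading/writing one node's debug mask.
get_node_debug() { echo "${NODE_DEBUG[$1]}"; }
set_node_debug() {
	local node=$1 mask=$2 word
	# Model an old server that rejects the unknown name with EINVAL (22).
	if [[ $node == oldmds ]]; then
		for word in $mask; do
			[[ $word == iotrace ]] && return 22
		done
	fi
	NODE_DEBUG[$node]=$mask
}

# Save every node's own mask, then disable debugging everywhere.
save_and_clear_debug() {
	local node
	declare -gA SAVED_DEBUG=()
	for node in "$@"; do
		SAVED_DEBUG[$node]=$(get_node_debug "$node")
		set_node_debug "$node" 0
	done
}

# Restore each node from its own saved copy; report the last failure.
restore_debug() {
	local node rc=0
	for node in "$@"; do
		set_node_debug "$node" "${SAVED_DEBUG[$node]}" || rc=$?
	done
	return $rc
}
```

Applied to createmany(), the single `saved_debug` string would become a per-node save before `debug=0` and a per-node restore after the run. This avoids any version checks, at the cost of one extra remote query per node.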

I'll push a patch shortly.

Comment by Peter Jones [ 07/Mar/22 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/46636
Subject: LU-15571 tests: save/restore debug mask for interop
Project: fs/lustre-release
Branch: master
Current Patch Set: 2
Commit: 183e00aa0c2892d977f45cda67d6f3352de3429e

Comment by Gerrit Updater [ 27/Mar/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46636/
Subject: LU-15571 tests: save/restore debug mask for interop
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: f236119e6e264b00c20533336303f694d9cfe766

Comment by Peter Jones [ 27/Mar/22 ]

Landed for 2.15

Generated at Sat Feb 10 03:19:27 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.