[LU-15089] sanity-lnet test_230: expected number of tcp connections 18 Created: 12/Oct/21  Updated: 03/Nov/21  Resolved: 03/Nov/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.0
Fix Version/s: Lustre 2.15.0

Type: Bug Priority: Critical
Reporter: Maloo Assignee: Serguei Smirnov
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-15150 Correct sanity-lnet cleanup Resolved
is related to LU-15152 auster does not report correct test s... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Elena <elena.gryaznova@hpe.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/9d97b8d9-9da1-4f76-a595-f827627ca5b9

test_230 failed with the following error:

expected number of tcp connections 18

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanity-lnet test_230 - expected number of tcp connections 18



 Comments   
Comment by Andreas Dilger [ 21/Oct/21 ]

Serguei, this looks like it is related to the conns_per_peer patch?

Comment by Andreas Dilger [ 21/Oct/21 ]

jamesanunez, something is broken with sanity-lnet. This subtest is continually failing:

test_230 	

    Error: 'expected number of tcp connections 18'
    Failure Rate: 35.00% of most recent 100 runs, 0 skipped (all branches) 

(70/207 in the past 3 days) but the sanity-lnet test run is always being marked as "PASS":
https://testing.whamcloud.com/test_sets/64366b2e-89d8-4fea-a98d-cd5a8af7ef21

I suspect there is something missing at the end of sanity-lnet.sh that is supposed to check for the failure status?

Comment by Andreas Dilger [ 21/Oct/21 ]

The test failure was clearly introduced by patch https://review.whamcloud.com/44417 "LU-12815 socklnd: set conns_per_peer based on link speed" because it was first hit when testing that patch, then immediately started failing regularly on master on 2021-08-17 after it landed.

Comment by Gerrit Updater [ 21/Oct/21 ]

"Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45331
Subject: LU-15089 tests: allow enough time to create tcp connections
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 7a432a1048bb45b0265a3308520c0ff5b7671287

Comment by Andreas Dilger [ 22/Oct/21 ]

It isn't obvious why sanity-lnet.sh is not reporting a failure when test_230() is failing. The script has mostly the same bits at the end as sanity.sh:

complete $SECONDS

cleanup_testsuite
exit_status

and the test logs clearly show that complete() is finding the FAIL in $TESTSUITELOG, and cleanup_testsuite() is run as seen by "Cleaning up LNet":

== sanity-lnet test complete, duration 824 sec =========== 15:31:11 (1634830271)
sanity-lnet: FAIL: test_230 expected number of tcp connections 26
Cleaning up LNet

It then proceeds to format and mount a new filesystem (setupall()), and the last line in the log is from init_param_vars(), and then exit_status() returns "0" (PASS) instead of finding any error:

CMD: trevis-5vm9 /usr/sbin/lctl set_param lod.*.mdt_hash=crush
lod.lustre-MDT0000-mdtlov.mdt_hash=crush
CMD: trevis-5vm9 /usr/sbin/lctl set_param -P *.*.lbug_on_grant_miscount=1

The only thing that I can think of is that $TESTSUITELOG is somehow unset/removed/truncated and the FAIL line is missed? However, I only see $TESTSUITELOG being set/removed in init_test_env() and that is only called at the start of the script.

It might be worthwhile to print out a line in exit_status() to help debug this:

exit_status () {
        local status=0
        local logs="$TESTSUITELOG $1"

+       echo "== checking for failures in logs: '$logs' =="
        for log in $logs; do
                if [ -f "$log" ]; then
                        grep -qw FAIL $log && status=1
                fi
        done

+       echo "== exiting with status=$status =="
        exit $status
}
Comment by Chris Horn [ 22/Oct/21 ]

I think we are deleting $TESTSUITELOG in sanity-lnet.sh:cleanup_testsuite():

cleanup_testsuite() {
        trap "" EXIT
        # Cleanup any tmp files created by the sub tests
        rm -f $TMP/sanity-lnet*
        cleanup_netns
        cleanup_lnet
        if $restore_mounts; then
                setupall || error "Failed to setup Lustre after test execution"
        elif $restore_modules; then
                load_modules ||
                        error "Couldn't load modules after test execution"
        fi
        return 0
}

That rm -f $TMP/sanity-lnet* should probably be rm -f $TMP/sanity-lnet-*.yaml

Comment by Chris Horn [ 22/Oct/21 ]

Opened https://jira.whamcloud.com/browse/LU-15150 for ^

Comment by Gerrit Updater [ 03/Nov/21 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45331/
Subject: LU-15089 tests: allow enough time to create tcp connections
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 5c766b005bf3e0bca0efa9d87ccf230e7cba97cc

Comment by Peter Jones [ 03/Nov/21 ]

Landed for 2.15

Generated at Sat Feb 10 03:15:21 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.