[LU-14936] sanity test_140 returned 1 Created: 13/Aug/21  Updated: 19/Aug/21

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Duplicate
is duplicated by LU-14857 sanity: test_65e returned 1 Open
Related
is related to LU-14773 reduce run_one() overhead Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for S Buisson <sbuisson@ddn.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/ce3dea6f-13cd-4e3d-9615-b25380e9e3ed

test_140 failed with the following error:

test_140 returned 1

It seems to be a bug in sanity test_140 itself, as the output does not differ from that of a successful run:

== sanity test 140: Check reasonable stack depth (shouldn't LBUG) ==================================== 07:30:38 (1628839838)
The symlink depth = 40
open symlink_self returns 40
Resetting fail_loc on all nodes...CMD: trevis-21vm1,trevis-33vm1,trevis-79vm11.trevis.whamcloud.com,trevis-79vm12 lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null
done.
07:30:49 (1628839849) waited for trevis-21vm1 network 5s

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanity test_140 - test_140 returned 1



 Comments   
Comment by Andreas Dilger [ 13/Aug/21 ]

I wonder if this may happen when a stack_trap registered by the subtest returns a non-zero value, since those traps are run after the test subshell exits, but before the test is "finished"?
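As a thought experiment, here is a minimal standalone sketch (hypothetical helper names, not the actual test-framework.sh implementation) of how a failing cleanup command registered by a subtest could turn an otherwise passing test into "returned 1":

#!/bin/bash
# Sketch only: traps registered by the subtest run after the test body has
# finished, so a non-zero status from a cleanup command can become the
# final status of the test.
declare -a TRAPS=()

stack_trap_sketch() {
        # prepend so cleanup runs in reverse registration order
        TRAPS=("$1" "${TRAPS[@]}")
}

run_subtest_sketch() {
        stack_trap_sketch "false"       # cleanup command that happens to fail
        echo "subtest body passed"
        return 0
}

run_one_sketch() {
        run_subtest_sketch
        local rc=$?
        local cmd
        # cleanup runs after the body; a failure here overwrites the result
        for cmd in "${TRAPS[@]}"; do
                eval "$cmd" || rc=$?
        done
        return $rc
}

run_one_sketch || echo "test returned $?"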

Comment by Andreas Dilger [ 14/Aug/21 ]

I also saw this same failure on sanity test_398b: it finished with "waited for trevis-39vm11 network 5s" and then failed with "returned 1".

Comment by Alexander Zarochentsev [ 17/Aug/21 ]

sanity test 4 hit the same failure without an obvious reason:
https://testing.whamcloud.com/test_sets/cb7fa77c-c447-4af8-8294-6846c61a7019

Comment by Emoly Liu [ 19/Aug/21 ]

Maybe this issue is related to the recent commit 67752f6d at https://review.whamcloud.com/44033; it seems to only happen in the review-ldiskfs-arm test group, especially on MDS1.

The test-framework.sh (t-f) change is:

 check_node_health() {
        local nodes=${1:-$(comma_list $(nodes_list))}
-
-       for node in ${nodes//,/ }; do
-               check_network "$node" 5
-               if [ $? -eq 0 ]; then
-                       do_node $node "$LCTL get_param catastrophe 2>&1" |
-                               grep -q "catastrophe=1" &&
-                               error "$node:LBUG/LASSERT detected" || true
-               fi
-       done
+       local health=$TMP/node_health.$$
+
+       do_nodes $nodes "$LCTL get_param catastrophe 2>&1" | tee $health |
+               grep "catastrophe=1" && error "LBUG/LASSERT detected"
+       # Only check/report network health if get_param isn't reported, since
+       # *clearly* the network is working if get_param returned something.
+       if (( $(grep -c catastro $health) != $(wc -w <<< ${nodes//,/ }) )); then
+               for node in ${nodes//,/}; do
+                       check_network $node 5
+               done
+       fi
+       rm -f $health
 }

check_network() {
        local host=$1
        local max=$2
        local sleep=${3:-5}

        [ "$host" = "$HOSTNAME" ] && return 0

        if ! wait_for_function --quiet "ping -c 1 -w 3 $host" $max $sleep; then
                echo "$(date +'%H:%M:%S (%s)') waited for $host network ${max}s"
                exit 1
        fi
}
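If the failure goes through this path, the suspect would be the bare "exit 1" in check_network(): when check_node_health() runs during test cleanup inside the per-test subshell, a failed ping makes the whole subshell exit with status 1 even though the subtest itself reported no error, which would match the "waited for ... network 5s" line immediately followed by "returned 1" in the logs above. A standalone sketch of that effect (hypothetical names and a simulated ping failure, not the actual run_one() code):

#!/bin/bash
# Sketch only: an "exit 1" inside a post-test health check terminates the
# enclosing test subshell, so the test "returns 1" despite a clean test body.
check_network_sketch() {
        local host=$1
        local max=$2
        # simulate "ping -c 1 -w 3 $host" failing, using false for determinism
        if ! false; then
                echo "$(date +'%H:%M:%S (%s)') waited for $host network ${max}s"
                exit 1  # exits the enclosing (sub)shell, not just this function
        fi
}

run_test_sketch() {
        (
                echo "== test body passed =="
                check_network_sketch some-vm 5  # cleanup-time health check
        )
        return $?       # becomes 1 because of the exit inside the subshell
}

run_test_sketch || echo "test returned $?"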
Comment by Andreas Dilger [ 19/Aug/21 ]

I was wondering the same about 44033, but that patch only landed 2 days ago, and failures on master had been happening for at least 6 days when this ticket was opened.
