[LU-14936] sanity test_140 returned 1 Created: 13/Aug/21 Updated: 19/Aug/21 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for S Buisson <sbuisson@ddn.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/ce3dea6f-13cd-4e3d-9615-b25380e9e3ed

test_140 failed with the following error:

test_140 returned 1

It seems to be a bug with sanity test_140 itself, as the output does not differ from a successful run:

== sanity test 140: Check reasonable stack depth (shouldn't LBUG) ==================================== 07:30:38 (1628839838)
The symlink depth = 40
open symlink_self returns 40
Resetting fail_loc on all nodes...CMD: trevis-21vm1,trevis-33vm1,trevis-79vm11.trevis.whamcloud.com,trevis-79vm12 lctl set_param -n fail_loc=0 fail_val=0 2>/dev/null
done.
07:30:49 (1628839849) waited for trevis-21vm1 network 5s

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV |
| Comments |
| Comment by Andreas Dilger [ 13/Aug/21 ] |
|
I wonder if this may happen when a stack_trap registered by this subtest returns a non-zero value, since those traps run after the test subshell exits but before the test is "finished"? |
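A minimal sketch of that scenario, under an assumed, simplified model of the test framework (the stack_trap and run-in-a-subshell helpers below are illustrative stand-ins, not the real test-framework.sh code): cleanup commands registered by stack_trap run from an EXIT trap in the test subshell, and if the trap propagates a failing cleanup's status, the harness sees "test_140 returned 1" even though the test body itself passed.

#!/bin/bash
# Simplified stand-in for stack_trap(): register a cleanup command and
# (re)arm the EXIT trap to run all cleanups, exiting with the worst status.
declare -a CLEANUPS

stack_trap() {
	CLEANUPS+=("$1")
	trap 'rc=0; for c in "${CLEANUPS[@]}"; do eval "$c" || rc=$?; done; exit $rc' EXIT
}

test_140() {
	stack_trap "false"              # stand-in for a cleanup that exits non-zero
	echo "The symlink depth = 40"   # test body itself passes
	return 0
}

( test_140 )                        # run the test in a subshell, as the harness does
echo "test_140 returned $?"         # prints: test_140 returned 1

Whether the real stack_trap handler propagates a cleanup's exit status this way is exactly the question raised above.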
| Comment by Andreas Dilger [ 14/Aug/21 ] |
|
I also saw this same failure on sanity test_398b - it finished with "waited for trevis-39vm11 network 5s" and then failed with "returned 1". |
| Comment by Alexander Zarochentsev [ 17/Aug/21 ] |
|
Sanity test 4 hit the same failure, with no obvious reason. |
| Comment by Emoly Liu [ 19/Aug/21 ] |
|
Maybe this issue is related to the recent commit 67752f6d at https://review.whamcloud.com/44033; it seems to only happen in the review-ldiskfs-arm test group, especially on MDS1. The t-f change shows:
check_node_health() {
	local nodes=${1:-$(comma_list $(nodes_list))}
-
-	for node in ${nodes//,/ }; do
-		check_network "$node" 5
-		if [ $? -eq 0 ]; then
-			do_node $node "$LCTL get_param catastrophe 2>&1" |
-				grep -q "catastrophe=1" &&
-				error "$node:LBUG/LASSERT detected" || true
-		fi
-	done
+	local health=$TMP/node_health.$$
+
+	do_nodes $nodes "$LCTL get_param catastrophe 2>&1" | tee $health |
+		grep "catastrophe=1" && error "LBUG/LASSERT detected"
+	# Only check/report network health if get_param isn't reported, since
+	# *clearly* the network is working if get_param returned something.
+	if (( $(grep -c catastro $health) != $(wc -w <<< ${nodes//,/ }) )); then
+		for node in ${nodes//,/}; do
+			check_network $node 5
+		done
+	fi
+	rm -f $health
 }

check_network() {
	local host=$1
	local max=$2
	local sleep=${3:-5}
	[ "$host" = "$HOSTNAME" ] && return 0
	if ! wait_for_function --quiet "ping -c 1 -w 3 $host" $max $sleep; then
		echo "$(date +'%H:%M:%S (%s)') waited for $host network ${max}s"
		exit 1
	fi
}
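One way the new path could produce exactly the observed "waited for ... network 5s" line followed by "returned 1" (a hedged reading, not confirmed in this ticket): if one node's get_param output never makes it into $health, the count check falls through to check_network(), and its exit 1 terminates the calling shell rather than just returning from the function. A self-contained sketch of that control flow, with do_nodes, wait_for_function and error mocked purely for illustration:

#!/bin/bash
# All helpers below are mocks for illustration only; the control flow of
# check_node_health()/check_network() mirrors the quoted change.
TMP=/tmp
nodes="trevis-21vm1,trevis-33vm1"

do_nodes() {            # mock: pretend one node's get_param output is missing
	echo "trevis-33vm1: catastrophe=0"
}

wait_for_function() {   # mock: pretend pinging the unresponsive node fails
	return 1
}

error() { echo "error: $*"; exit 1; }

check_network() {
	local host=$1 max=$2
	if ! wait_for_function --quiet "ping -c 1 -w 3 $host" $max 5; then
		echo "$(date +'%H:%M:%S (%s)') waited for $host network ${max}s"
		exit 1          # exits the current shell, not just this function
	fi
}

check_node_health() {
	local health=$TMP/node_health.$$

	do_nodes $nodes "lctl get_param catastrophe 2>&1" | tee $health |
		grep "catastrophe=1" && error "LBUG/LASSERT detected"
	# one "catastrophe" line for two nodes -> counts differ -> ping path
	if (( $(grep -c catastro $health) != $(wc -w <<< ${nodes//,/ }) )); then
		for node in ${nodes//,/ }; do
			check_network $node 5
		done
	fi
	rm -f $health
}

check_node_health        # prints "... waited for trevis-21vm1 network 5s"
echo "never reached"     # the exit 1 above means the caller never gets here

Because check_network() uses exit rather than return, a single missing get_param reply plus a failed ping would be enough to fail the whole subtest with status 1 even when no LBUG occurred, which would match the log quoted in the description.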
|
| Comment by Andreas Dilger [ 19/Aug/21 ] |
|
I was wondering the same about 44033, but that patch only landed 2 days ago, and failures on master had been happening for at least 6 days before this ticket was opened. |