Details
-
Bug
-
Resolution: Fixed
-
Minor
-
Lustre 2.9.0
-
Server 2.5.x
Client 2.5.x
4 Node cluster - 1 MDS, 1 OSS, 2 clients
-
3
Description
stdout.log
ost-pools test_1n: @@@@@@ FAIL: LBUG/LASSERT detected
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:4672:error()
= /usr/lib64/lustre/tests/test-framework.sh:4936:run_one()
= /usr/lib64/lustre/tests/test-framework.sh:4968:run_one_logged()
= /usr/lib64/lustre/tests/test-framework.sh:4774:run_test()
= /usr/lib64/lustre/tests/ost-pools.sh:336:main()
Dumping lctl log to /tmp/test_logs/1458556257/ost-pools.test_1n.*.1458556373.log
fre1204: Warning: Permanently added 'fre1203,192.168.112.3' (RSA) to the list of known hosts.
fre1201: Warning: Permanently added 'fre1203,192.168.112.3' (RSA) to the list of known hosts.
fre1202: Warning: Permanently added 'fre1203,192.168.112.3' (RSA) to the list of known hosts.
Resetting fail_loc and fail_val on all nodes...done.
FAIL 1n (117s)
check_catastrophe() defect:
check_catastrophe() {
	local nodes=${1:-$(comma_list $(nodes_list))}

	do_nodes $nodes "rc=0;
val=\\\$($LCTL get_param -n catastrophe 2>&1);
if [[ \\\$? -eq 0 && \\\$val -ne 0 ]]; then
	echo \\\$(hostname -s): \\\$val;
	rc=\\\$val;
fi;
exit \\\$rc"
}
If some node is not accessible, check_catastrophe() returns 255:
fre1202: ssh: connect to host fre1202 port 22: Connection timed out
pdsh@fre1203: fre1202: ssh exited with exit code 255
and run_one() exits with an error even though no LBUG/LASSERT occurred:
run_one()
	check_catastrophe || error "LBUG/LASSERT detected"
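The failure mode can be reproduced without a cluster. The sketch below is only an illustration: check_catastrophe_stub and report are hypothetical names, and the stub simply returns the 255 exit status that ssh produced in the log above. Because the caller tests only for a nonzero status, an unreachable node is reported as an LBUG/LASSERT.

```shell
#!/bin/bash
# Hypothetical stand-in for check_catastrophe() when a node is unreachable:
# ssh exits with 255 and that status propagates up to the caller.
check_catastrophe_stub() {
	return 255
}

# Same shape as the call in run_one(): any nonzero status takes the
# error branch, so a connection timeout is reported as an LBUG/LASSERT.
report() {
	check_catastrophe_stub || echo "LBUG/LASSERT detected (rc=$?)"
}

report
```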
Issue Links
- is related to LU-8805 Failover: recovery-mds-scale test_failover_mds: test_failover_mds returned 4 (Resolved)
-
Getting "val" correctly is fine, but AFAICS the change you have proposed to use "cut" is not necessary: the command runs on the remote node, not the local node, so there shouldn't be a "hostname:" prefix on the output?
The second question is why the check shouldn't be considered a failure if the remote node is unavailable. Would it be better if this function were called check_node_health(), checked both whether the remote node is running and whether there was an LBUG/LASSERT, and printed a proper error message in each case? I would be fine with that, and even better would be to move the error message into check_node_health() so it is obvious which node had the problem.
There are only two places where this function is called: in run_one(), after a test has just completed, and in check_client_load(), which verifies that a client node is healthy and that the test load (e.g. dbench) is still running. It is therefore easy to change the callers.
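A minimal sketch of what the suggested check_node_health() could look like follows. It is not the patch that landed for this ticket. The helper classify_catastrophe_rc and the "runner" argument are hypothetical and exist only to keep the example self-contained; in test-framework.sh the per-node command would presumably go through the existing remote-execution helpers. It also assumes that exit status 255 always means the node could not be reached, as in the ssh log above.

```shell
#!/bin/bash
# Hypothetical helper: map the exit status of the remote catastrophe check
# to a cause. Assumption: 255 means ssh/pdsh could not reach the node.
classify_catastrophe_rc() {
	case $1 in
	0)   echo "healthy" ;;
	255) echo "unreachable" ;;
	*)   echo "catastrophe" ;;
	esac
}

# Sketch of check_node_health(): run the check on each node through a
# caller-supplied command and print which node failed and why.
# Usage: check_node_health RUNNER NODE...
check_node_health() {
	local runner=$1
	shift
	local node rc status=0

	for node in "$@"; do
		"$runner" "$node"
		rc=$?
		case $(classify_catastrophe_rc "$rc") in
		healthy)
			;;
		unreachable)
			echo "$node: node is not reachable (rc=$rc)"
			status=1
			;;
		*)
			echo "$node: LBUG/LASSERT detected (catastrophe=$rc)"
			status=1
			;;
		esac
	done
	return $status
}
```

With the messages produced inside the function, a caller such as run_one() would only need something like check_node_health ... || error "node health check failed", and the log would already name the node and the cause.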