[LU-8226] t-f check_catastrophe() defect Created: 01/Jun/16  Updated: 06/Dec/16  Resolved: 08/Oct/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.9.0
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Minor
Reporter: parinay v kondekar (Inactive) Assignee: Jian Yu
Resolution: Fixed Votes: 0
Labels: patch
Environment:

Server 2.5.x
Client 2.5.x
4 Node cluster - 1 MDS, 1 OSS, 2 clients


Issue Links:
Related
is related to LU-8805 Failover: recovery-mds-scale test_fai... Resolved
Epic/Theme: test
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   
stdout.log
 ost-pools test_1n: @@@@@@ FAIL: LBUG/LASSERT detected 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:4672:error()
  = /usr/lib64/lustre/tests/test-framework.sh:4936:run_one()
  = /usr/lib64/lustre/tests/test-framework.sh:4968:run_one_logged()
  = /usr/lib64/lustre/tests/test-framework.sh:4774:run_test()
  = /usr/lib64/lustre/tests/ost-pools.sh:336:main()
Dumping lctl log to /tmp/test_logs/1458556257/ost-pools.test_1n.*.1458556373.log
fre1204: Warning: Permanently added 'fre1203,192.168.112.3' (RSA) to the list of known hosts.
fre1201: Warning: Permanently added 'fre1203,192.168.112.3' (RSA) to the list of known hosts.
fre1202: Warning: Permanently added 'fre1203,192.168.112.3' (RSA) to the list of known hosts.
Resetting fail_loc and fail_val on all nodes...done.
FAIL 1n (117s)

check_catastrophe() defect:

check_catastrophe() {
        local nodes=${1:-$(comma_list $(nodes_list))}

        do_nodes $nodes "rc=0;
val=\\\$($LCTL get_param -n catastrophe 2>&1);
if [[ \\\$? -eq 0 && \\\$val -ne 0 ]]; then
        echo \\\$(hostname -s): \\\$val;
        rc=\\\$val;
fi;
exit \\\$rc"
}
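The triple-escaped `\\\$` in the function above exists because the script passes through two rounds of expansion: first the local double-quoted string handed to do_nodes, then the remote shell. A minimal demonstration of the same effect, with `bash -c` standing in for the remote shell (the variable names are made up for the sketch):

```shell
# "bash -c" stands in for do_nodes/ssh: the string is expanded once locally,
# then executed by a second shell.
val=local

# \$val survives the local expansion, so the inner shell expands it.
bash -c "val=remote; echo \$val"    # prints: remote

# $val is expanded locally before the inner shell ever runs.
bash -c "val=remote; echo $val"     # prints: local
```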

If some node is not accessible, check_catastrophe() returns 255:

fre1202: ssh: connect to host fre1202 port 22: Connection timed out
pdsh@fre1203: fre1202: ssh exited with exit code 255

and run_one() exits with an error even though no LBUG/LASSERT actually occurred:

run_one()
       check_catastrophe || error "LBUG/LASSERT detected"
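The failure mode can be reproduced outside the test-framework. A minimal standalone sketch (all function names here are hypothetical; the fake probes stand in for `do_nodes ... lctl get_param -n catastrophe`):

```shell
# check_on_node runs a probe and propagates its exit status, just as
# check_catastrophe() propagates the status of do_nodes.
check_on_node() {
    "$@"
}

fake_healthy()     { echo 0; }
fake_lbug()        { echo 1; return 1; }   # catastrophe flag set on the node
fake_unreachable() { echo "ssh: connect to host: Connection timed out" >&2; return 255; }

# The caller, like run_one(), only sees "non-zero", so the merely
# unreachable node is reported as an LBUG even though none occurred.
for probe in fake_healthy fake_lbug fake_unreachable; do
    if check_on_node $probe >/dev/null 2>&1; then
        echo "$probe: ok"
    else
        echo "$probe: LBUG/LASSERT detected"
    fi
done
```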



 Comments   
Comment by Gerrit Updater [ 01/Jun/16 ]

Parinay Kondekar (parinay.kondekar@seagate.com) uploaded a new patch: http://review.whamcloud.com/20539
Subject: LU-8226 tests: Fix check_catastrophe() when node is inaccessible.
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e7f43cc12ee1fba7b472ab44528edd33535a00bd

Comment by Andreas Dilger [ 01/Jun/16 ]

The real question is why the remote node was inaccessible. That would normally be caused by a crash of the node, so shouldn't it be considered a catastrophe in that case?

Comment by parinay v kondekar (Inactive) [ 02/Jun/16 ]

> The real question is why the remote node was inaccessible?

An LBUG is one possible reason for a node being inaccessible, but not the only one, e.g.:
1. The node has rebooted (FAILURE_MODE=hard) but the Lustre modules are not loaded.
2. The node is inaccessible (unable to ping) due to network issues, even though the node itself has not crashed.

The patch takes care not only of the scenario where an LBUG has actually occurred ($LCTL get_param -n catastrophe returns 1; I tested this by introducing an LBUG() in the code and setting panic_on_lbug to 0 in /proc), but also of the false positives caused by the issues mentioned above.

Comment by Oleg Drokin [ 02/Jun/16 ]

So, in all those "false positive" cases, what is the test script supposed to do if the node is not accessible or there are no Lustre modules any more?
If we don't abort now (perhaps the failure message might need to be adjusted a bit), would there be a cryptic failure in the next innocent test instead?

Also case #2 is really fringe and we should not care about it.

Comment by parinay v kondekar (Inactive) [ 02/Jun/16 ]

Oleg,
IMO, the test-framework should not abort test runs for any such cases; that is not the best thing to do. Correct me if I am wrong. The following code takes care of the case where an LBUG is actually hit:

	val=\\\$($LCTL get_param -n catastrophe | cut -d':' -f2 2>&1);
	if [[ \\\$? -eq 0 && \\\$val -ne 0 ]]; then
Comment by Andreas Dilger [ 02/Jun/16 ]

It isn't clear why it is desirable to continue with the testing if the node has rebooted because of LBUG and panic_on_lbug, or network problems? I would consider those events as much a catastrophe as an LBUG on the local client. The only thing that might be considered a defect is if this fails because the remote node's modules are not loaded, but that in itself might be a sign that the node rebooted during the test when it shouldn't have, since check_catastrophe() is only being run at the end of a test.

Comment by Oleg Drokin [ 02/Jun/16 ]

I kind of agree that the most we can charge this code with is that the message is a bit misleading in some cases.

Actually, does this still return exit code 255 if the remote node is up but just does not have the Lustre modules?
Apparently not, so this case hits only if the node is dead. We can therefore adjust the test further: if rc is 255, the node is dead and it is a catastrophe;
if the file is not present, there is a "Lustre modules not loaded" problem (often due to a crash and no automatic module reloading) and a different message might be output.
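The rc-based distinction sketched here could look roughly like the following (a hedged, self-contained sketch only; classify_node is made up, and its probe argument stands in for `ssh $node lctl get_param -n catastrophe`):

```shell
# classify_node runs a probe command and maps its exit status / output onto
# the three cases discussed above. Entirely illustrative, not t-f code.
classify_node() {
    local out rc
    out=$("$@" 2>&1)
    rc=$?
    if [ "$rc" -eq 255 ]; then
        # ssh itself failed: the node is dead, which is itself a catastrophe.
        echo "node dead or unreachable: treat as catastrophe"
    elif [ "$rc" -ne 0 ]; then
        # The command ran but the parameter is missing: modules not loaded.
        echo "lustre modules not loaded"
    elif [ "$out" -ne 0 ]; then
        echo "LBUG/LASSERT detected: catastrophe=$out"
    else
        echo "healthy"
    fi
}

classify_node sh -c 'echo 0'      # prints: healthy
classify_node sh -c 'echo 1'      # prints: LBUG/LASSERT detected: catastrophe=1
classify_node sh -c 'exit 255'    # prints: node dead or unreachable: treat as catastrophe
```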

Comment by parinay v kondekar (Inactive) [ 07/Jun/16 ]

Andreas, Oleg,

The fix takes care of,
1. A false LBUG/LASSERT report from this function if any node in $nodes is inaccessible.
2. Setting the local variable 'val' to the correct value returned by $($LCTL get_param -n catastrophe).

For (1), adding "|| true" is not going to hurt, especially in the case of multiple hosts (e.g. mdts_nodes(), osts_nodes(), nodes_list(), or other places in the test-framework with such a condition) where one of the nodes could be inaccessible for a reason other than an LBUG()/LASSERT.

The case where an LBUG/catastrophe is actually hit is handled correctly. Beyond that, if you want me to change only the error message in this patch, I can do that; let me know.

For (2), the patch takes care of calculating val correctly.

Thanks

Comment by Andreas Dilger [ 08/Jun/16 ]

Getting "val" correctly is fine, but AFAICS the change you have proposed to use "cut" is not necessary, since that is running on the remote node and not the local node, so there shouldn't be a hostname: prefix on the output?

The second question is why the check shouldn't be considered a failure if the remote node is unavailable. Would it be better if this function were called check_node_health() and checked both whether the remote node is running and whether there was an LBUG/LASSERT, printing a proper error message in both cases? I would be fine with that, and even better would be to move the error message into check_node_health() so it is obvious which node had the problem.

There are only two places where this function is called: after a test has just completed, in run_one(), and in check_client_load(), which verifies that a client node is healthy and that the test load (e.g. dbench) is still running; so it is possible to change the callers easily.
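The check_node_health() shape described here could be sketched as follows (hedged and illustrative only; the patch that actually landed in test-framework.sh differs in detail, and probe_node is a hypothetical stand-in for do_node so the sketch is self-contained):

```shell
# probe_node <node> <cmd...>: pretend remote execution; the made-up host
# name "dead1" simulates an unreachable node (ssh exit code 255).
probe_node() {
    local node=$1; shift
    [ "$node" = dead1 ] && return 255
    "$@"
}

check_node_health() {
    local node=$1
    local val

    # Unreachable node: report it as its own, clearly named failure.
    if ! probe_node $node true; then
        echo "Node $node is unreachable" >&2
        return 255
    fi

    # Reachable node: check the catastrophe flag (stubbed as 'echo 0' here;
    # the real check would read lctl get_param -n catastrophe).
    val=$(probe_node $node echo 0)
    if [ "$val" -ne 0 ]; then
        echo "LBUG/LASSERT detected on $node (catastrophe=$val)" >&2
        return 1
    fi
    echo "$node: healthy"
}

check_node_health good1    # prints: good1: healthy
check_node_health dead1    # prints "Node dead1 is unreachable" to stderr
```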

Comment by Gerrit Updater [ 08/Oct/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/20539/
Subject: LU-8226 tests: Change check_catastrophe() to check_node_health()
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 07cdb91faeda3dc88f9d2c26d573c81e2d70ab95

Comment by Peter Jones [ 08/Oct/16 ]

Landed for 2.9
