[LU-8226] t-f check_catastrophe() defect - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Minor
Fix Version/s: Lustre 2.9.0
Affects Version/s: Lustre 2.9.0
Labels:
- patch
Environment:
Server 2.5.x
Client 2.5.x
4 Node cluster - 1 MDS, 1 OSS, 2 clients

Epic/Theme:
- test
Severity:
3
Rank (Obsolete):
9223372036854775807

Description

stdout.log
 ost-pools test_1n: @@@@@@ FAIL: LBUG/LASSERT detected 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:4672:error()
  = /usr/lib64/lustre/tests/test-framework.sh:4936:run_one()
  = /usr/lib64/lustre/tests/test-framework.sh:4968:run_one_logged()
  = /usr/lib64/lustre/tests/test-framework.sh:4774:run_test()
  = /usr/lib64/lustre/tests/ost-pools.sh:336:main()
Dumping lctl log to /tmp/test_logs/1458556257/ost-pools.test_1n.*.1458556373.log
fre1204: Warning: Permanently added 'fre1203,192.168.112.3' (RSA) to the list of known hosts.
fre1201: Warning: Permanently added 'fre1203,192.168.112.3' (RSA) to the list of known hosts.
fre1202: Warning: Permanently added 'fre1203,192.168.112.3' (RSA) to the list of known hosts.
Resetting fail_loc and fail_val on all nodes...done.
FAIL 1n (117s)

check_catastrophe() defect :

check_catastrophe() {
        local nodes=${1:-$(comma_list $(nodes_list))}

        do_nodes $nodes "rc=0;
val=\\\$($LCTL get_param -n catastrophe 2>&1);
if [[ \\\$? -eq 0 && \\\$val -ne 0 ]]; then
        echo \\\$(hostname -s): \\\$val;
        rc=\\\$val;
fi;
exit \\\$rc"
}

If some node is not not accessible check_catastrophe() returns 255:

fre1202: ssh: connect to host fre1202 port 22: Connection timed out
pdsh@fre1203: fre1202: ssh exited with exit code 255

and run_one() exits with error while LBUG/LASSERT does not happen

run_one()
       check_catastrophe || error "LBUG/LASSERT detected"

Attachments

Issue Links

is related to

LU-8805 Failover: recovery-mds-scale test_failover_mds: test_failover_mds returned 4

Resolved

Activity

[LU-8226] t-f check_catastrophe() defect

Andreas Dilger added a comment - 08/Jun/16 4:19 AM

Getting "val" correctly is fine, but AFAICS the change you have proposed to use "cut" is not necessary, since that is running on the remote node and not the local node, so there shouldn't be a hostname: prefix on the output?

The second question is why shouldn't the check be considered a failure if the remote node is unavailable? Would it be better if this function was called check_node_health() that checked both whether the remote node was running, as well as whether there was an LBUG/LASSERT, and prints a proper error message in both cases? I would be fine with that, and even better would be to move the error message into check_node_health() so it was obvious which node had the problem.

There are only two places that this function is called - after a test was just completed in run_one() and in check_client_load() that verifies that a client node is health and that the test load (e.g. dbench) is still running, so it is possible to change the callers easily.

Andreas Dilger added a comment - 08/Jun/16 4:19 AM Getting "val" correctly is fine, but AFAICS the change you have proposed to use "cut" is not necessary, since that is running on the remote node and not the local node, so there shouldn't be a hostname: prefix on the output? The second question is why shouldn't the check be considered a failure if the remote node is unavailable? Would it be better if this function was called check_node_health() that checked both whether the remote node was running, as well as whether there was an LBUG/LASSERT, and prints a proper error message in both cases? I would be fine with that, and even better would be to move the error message into check_node_health() so it was obvious which node had the problem. There are only two places that this function is called - after a test was just completed in run_one() and in check_client_load() that verifies that a client node is health and that the test load (e.g. dbench) is still running, so it is possible to change the callers easily.

parinay v kondekar (Inactive) added a comment - 07/Jun/16 9:37 AM - edited

Andreas, Oleg,

The fix takes care of,
1. false LBUG/LASSERT due to this function if any node in $nodes, is inaccessible.
2. setting the local variable 'val' to the correct value returned by $($LCTL get_param -n catastrophe)

For (1), adding " || true" is not going to hurt, especially in case of multiple hosts (e.g. mdts_nodes(), osts_nodes(), node_list() OR other instances where we have such condition present in t-f ) where one of the nodes could be inaccessible && the reason is not LBUG() LASSERT.

The case where LBUG / catastrophe is hit is taken cared of rightly. Other than this, if you want me to change the error message, only, I can do that in this patch, let me know, if so.

For (2), the patch takes care of calculating val correctly.

Thanks

parinay v kondekar (Inactive) added a comment - 07/Jun/16 9:37 AM - edited Andreas, Oleg, The fix takes care of, 1. false LBUG/LASSERT due to this function if any node in $nodes, is inaccessible. 2. setting the local variable 'val' to the correct value returned by $($LCTL get_param -n catastrophe) For (1), adding " || true" is not going to hurt, especially in case of multiple hosts (e.g. mdts_nodes(), osts_nodes(), node_list() OR other instances where we have such condition present in t-f ) where one of the nodes could be inaccessible && the reason is not LBUG() LASSERT . The case where LBUG / catastrophe is hit is taken cared of rightly. Other than this, if you want me to change the error message, only , I can do that in this patch, let me know, if so. For (2), the patch takes care of calculating val correctly. Thanks

Oleg Drokin added a comment - 02/Jun/16 4:43 PM

I kind of agree that the most we can charge this code with is that the message is a bit misleading in some cases.

Actually does this still return retcode 255 if remote node is up, just does not have lustre modules?
Apparently not, so only if the node is dead will this case hit, so we can adjust the test further if rc is 255, then the node is dead and it is a catastrophe
if the file is not present, there is a lustre modules not loaded problem (often due to crash and no automatic module reloading) and a different message might be output.

Oleg Drokin added a comment - 02/Jun/16 4:43 PM I kind of agree that the most we can charge this code with is that the message is a bit misleading in some cases. Actually does this still return retcode 255 if remote node is up, just does not have lustre modules? Apparently not, so only if the node is dead will this case hit, so we can adjust the test further if rc is 255, then the node is dead and it is a catastrophe if the file is not present, there is a lustre modules not loaded problem (often due to crash and no automatic module reloading) and a different message might be output.

Andreas Dilger added a comment - 02/Jun/16 7:21 AM

It isn't clear why it is desirable to continue with the testing if the node has rebooted because of LBUG and panic_on_lbug, or network problems? I would consider those events as much a catastrophe as an LBUG on the local client. The only thing that might be considered a defect is if this fails because the remote node's modules are not loaded, but that in itself might be a sign that the node rebooted during the test when it shouldn't have, since check_catastrophe() is only being run at the end of a test.

Andreas Dilger added a comment - 02/Jun/16 7:21 AM It isn't clear why it is desirable to continue with the testing if the node has rebooted because of LBUG and panic_on_lbug, or network problems? I would consider those events as much a catastrophe as an LBUG on the local client. The only thing that might be considered a defect is if this fails because the remote node's modules are not loaded, but that in itself might be a sign that the node rebooted during the test when it shouldn't have, since check_catastrophe() is only being run at the end of a test.

parinay v kondekar (Inactive) added a comment - 02/Jun/16 5:05 AM

Oleg,
IMO, the t-f should not fail aborting test runs for any such cases, that is not the best thing to do. Correct me if wrong. The following code, is taking care of the case where LBUG is actually hit

	val=\\\$($LCTL get_param -n catastrophe | cut -d':' -f2 2>&1);
	if [[ \\\$? -eq 0 && \\\$val -ne 0 ]]; then

parinay v kondekar (Inactive) added a comment - 02/Jun/16 5:05 AM Oleg, IMO, the t-f should not fail aborting test runs for any such cases, that is not the best thing to do. Correct me if wrong. The following code, is taking care of the case where LBUG is actually hit val=\\\$($LCTL get_param -n catastrophe | cut -d ':' -f2 2>&1); if [[ \\\$? -eq 0 && \\\$val -ne 0 ]]; then

Oleg Drokin added a comment - 02/Jun/16 4:52 AM

So, in all those "false positive" cases, what's the script tests to do if the node is not accessible or there are no more lustre modules?
If we don't abort now (perhaps a failure message might need to be adjusted a bit), would there be a cryptic failure in the net innocent test instead?

Also case #2 is really fringe and we should not care about it.

Oleg Drokin added a comment - 02/Jun/16 4:52 AM So, in all those "false positive" cases, what's the script tests to do if the node is not accessible or there are no more lustre modules? If we don't abort now (perhaps a failure message might need to be adjusted a bit), would there be a cryptic failure in the net innocent test instead? Also case #2 is really fringe and we should not care about it.

parinay v kondekar (Inactive) added a comment - 02/Jun/16 2:15 AM

> The real question is why the remove node was inaccessible?
LBUG could be one of the reason not the only reason why the node can be inaccessible.
e.g.
1. node has rebooted (FAILURE_MODE=hard) but lustre modules are not loaded.
2. Node is inaccessible ( unable to ping) due to n/w issues, while the node itself has not resulted in any crash.

The patch takes care of not only the scenario where LBUG has actually occurred ( $LCTL get_param -n catastrophe is returning 1 - I tested this by introducing a LBUG() in the code && setting panic_on_lbug to 0 in /proc ) but the false negatives due to above mentioned issues.

parinay v kondekar (Inactive) added a comment - 02/Jun/16 2:15 AM > The real question is why the remove node was inaccessible? LBUG could be one of the reason not the only reason why the node can be inaccessible. e.g. 1. node has rebooted (FAILURE_MODE=hard) but lustre modules are not loaded. 2. Node is inaccessible ( unable to ping) due to n/w issues, while the node itself has not resulted in any crash. The patch takes care of not only the scenario where LBUG has actually occurred ( $LCTL get_param -n catastrophe is returning 1 - I tested this by introducing a LBUG() in the code && setting panic_on_lbug to 0 in /proc ) but the false negatives due to above mentioned issues.

Andreas Dilger added a comment - 01/Jun/16 7:06 PM

The real question is why the remove node was inaccessible? That would normally be caused by a crash of the node, so it should be considered a catastrophe in that case?

Andreas Dilger added a comment - 01/Jun/16 7:06 PM The real question is why the remove node was inaccessible? That would normally be caused by a crash of the node, so it should be considered a catastrophe in that case?

Gerrit Updater added a comment - 01/Jun/16 12:05 PM

Parinay Kondekar (parinay.kondekar@seagate.com) uploaded a new patch: http://review.whamcloud.com/20539
Subject: ~~LU-8226~~ tests: Fix check_catastrophe() when node is inaccessible.
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e7f43cc12ee1fba7b472ab44528edd33535a00bd

Gerrit Updater added a comment - 01/Jun/16 12:05 PM Parinay Kondekar (parinay.kondekar@seagate.com) uploaded a new patch: http://review.whamcloud.com/20539 Subject: LU-8226 tests: Fix check_catastrophe() when node is inaccessible. Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: e7f43cc12ee1fba7b472ab44528edd33535a00bd

People

Assignee:: Jian Yu

Reporter:: parinay v kondekar (Inactive)

Votes:: 0 Vote for this issue

Watchers:: 8 Start watching this issue

Dates

Created:: 01/Jun/16 12:03 PM

Updated:: 06/Dec/16 3:20 PM

Resolved:: 08/Oct/16 7:13 PM