[LU-8226] t-f check_catastrophe() defect Created: 01/Jun/16 Updated: 06/Dec/16 Resolved: 08/Oct/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.9.0 |
| Fix Version/s: | Lustre 2.9.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | parinay v kondekar (Inactive) | Assignee: | Jian Yu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | patch | ||
| Environment: |
Server 2.5.x |
||
| Issue Links: |
|
||||||||
| Epic/Theme: | test | ||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
stdout.log ost-pools test_1n: @@@@@@ FAIL: LBUG/LASSERT detected Trace dump: = /usr/lib64/lustre/tests/test-framework.sh:4672:error() = /usr/lib64/lustre/tests/test-framework.sh:4936:run_one() = /usr/lib64/lustre/tests/test-framework.sh:4968:run_one_logged() = /usr/lib64/lustre/tests/test-framework.sh:4774:run_test() = /usr/lib64/lustre/tests/ost-pools.sh:336:main() Dumping lctl log to /tmp/test_logs/1458556257/ost-pools.test_1n.*.1458556373.log fre1204: Warning: Permanently added 'fre1203,192.168.112.3' (RSA) to the list of known hosts. fre1201: Warning: Permanently added 'fre1203,192.168.112.3' (RSA) to the list of known hosts. fre1202: Warning: Permanently added 'fre1203,192.168.112.3' (RSA) to the list of known hosts. Resetting fail_loc and fail_val on all nodes...done. FAIL 1n (117s) check_catastrophe() defect :
check_catastrophe() {
local nodes=${1:-$(comma_list $(nodes_list))}
do_nodes $nodes "rc=0;
val=\\\$($LCTL get_param -n catastrophe 2>&1);
if [[ \\\$? -eq 0 && \\\$val -ne 0 ]]; then
echo \\\$(hostname -s): \\\$val;
rc=\\\$val;
fi;
exit \\\$rc"
}
If some node is not not accessible check_catastrophe() returns 255: fre1202: ssh: connect to host fre1202 port 22: Connection timed out pdsh@fre1203: fre1202: ssh exited with exit code 255 and run_one() exits with error while LBUG/LASSERT does not happen run_one()
check_catastrophe || error "LBUG/LASSERT detected"
|
| Comments |
| Comment by Gerrit Updater [ 01/Jun/16 ] |
|
Parinay Kondekar (parinay.kondekar@seagate.com) uploaded a new patch: http://review.whamcloud.com/20539 |
| Comment by Andreas Dilger [ 01/Jun/16 ] |
|
The real question is why the remove node was inaccessible? That would normally be caused by a crash of the node, so it should be considered a catastrophe in that case? |
| Comment by parinay v kondekar (Inactive) [ 02/Jun/16 ] |
|
> The real question is why the remove node was inaccessible? The patch takes care of not only the scenario where LBUG has actually occurred ( $LCTL get_param -n catastrophe is returning 1 - I tested this by introducing a LBUG() in the code && setting panic_on_lbug to 0 in /proc ) but the false negatives due to above mentioned issues. |
| Comment by Oleg Drokin [ 02/Jun/16 ] |
|
So, in all those "false positive" cases, what's the script tests to do if the node is not accessible or there are no more lustre modules? Also case #2 is really fringe and we should not care about it. |
| Comment by parinay v kondekar (Inactive) [ 02/Jun/16 ] |
|
Oleg, val=\\\$($LCTL get_param -n catastrophe | cut -d':' -f2 2>&1); if [[ \\\$? -eq 0 && \\\$val -ne 0 ]]; then |
| Comment by Andreas Dilger [ 02/Jun/16 ] |
|
It isn't clear why it is desirable to continue with the testing if the node has rebooted because of LBUG and panic_on_lbug, or network problems? I would consider those events as much a catastrophe as an LBUG on the local client. The only thing that might be considered a defect is if this fails because the remote node's modules are not loaded, but that in itself might be a sign that the node rebooted during the test when it shouldn't have, since check_catastrophe() is only being run at the end of a test. |
| Comment by Oleg Drokin [ 02/Jun/16 ] |
|
I kind of agree that the most we can charge this code with is that the message is a bit misleading in some cases. Actually does this still return retcode 255 if remote node is up, just does not have lustre modules? |
| Comment by parinay v kondekar (Inactive) [ 07/Jun/16 ] |
|
Andreas, Oleg, The fix takes care of, For (1), adding " || true" is not going to hurt, especially in case of multiple hosts (e.g. mdts_nodes(), osts_nodes(), node_list() OR other instances where we have such condition present in t-f ) where one of the nodes could be inaccessible && the reason is not LBUG() LASSERT. The case where LBUG / catastrophe is hit is taken cared of rightly. Other than this, if you want me to change the error message, only, I can do that in this patch, let me know, if so. For (2), the patch takes care of calculating val correctly. Thanks |
| Comment by Andreas Dilger [ 08/Jun/16 ] |
|
Getting "val" correctly is fine, but AFAICS the change you have proposed to use "cut" is not necessary, since that is running on the remote node and not the local node, so there shouldn't be a hostname: prefix on the output? The second question is why shouldn't the check be considered a failure if the remote node is unavailable? Would it be better if this function was called check_node_health() that checked both whether the remote node was running, as well as whether there was an LBUG/LASSERT, and prints a proper error message in both cases? I would be fine with that, and even better would be to move the error message into check_node_health() so it was obvious which node had the problem. There are only two places that this function is called - after a test was just completed in run_one() and in check_client_load() that verifies that a client node is health and that the test load (e.g. dbench) is still running, so it is possible to change the callers easily. |
| Comment by Gerrit Updater [ 08/Oct/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/20539/ |
| Comment by Peter Jones [ 08/Oct/16 ] |
|
Landed for 2.9 |