Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version: Lustre 2.9.0
    • Affects Version: Lustre 2.9.0
    • Environment: Server 2.5.x
      Client 2.5.x
      4 Node cluster - 1 MDS, 1 OSS, 2 clients
    • Severity: 3

    Description

      stdout.log
       ost-pools test_1n: @@@@@@ FAIL: LBUG/LASSERT detected 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:4672:error()
        = /usr/lib64/lustre/tests/test-framework.sh:4936:run_one()
        = /usr/lib64/lustre/tests/test-framework.sh:4968:run_one_logged()
        = /usr/lib64/lustre/tests/test-framework.sh:4774:run_test()
        = /usr/lib64/lustre/tests/ost-pools.sh:336:main()
      Dumping lctl log to /tmp/test_logs/1458556257/ost-pools.test_1n.*.1458556373.log
      fre1204: Warning: Permanently added 'fre1203,192.168.112.3' (RSA) to the list of known hosts.
      fre1201: Warning: Permanently added 'fre1203,192.168.112.3' (RSA) to the list of known hosts.
      fre1202: Warning: Permanently added 'fre1203,192.168.112.3' (RSA) to the list of known hosts.
      Resetting fail_loc and fail_val on all nodes...done.
      FAIL 1n (117s)
      

      check_catastrophe() defect:

      check_catastrophe() {
              local nodes=${1:-$(comma_list $(nodes_list))}
      
              do_nodes $nodes "rc=0;
      val=\\\$($LCTL get_param -n catastrophe 2>&1);
      if [[ \\\$? -eq 0 && \\\$val -ne 0 ]]; then
              echo \\\$(hostname -s): \\\$val;
              rc=\\\$val;
      fi;
      exit \\\$rc"
      }
      

      If some node is not accessible, check_catastrophe() returns 255:

      fre1202: ssh: connect to host fre1202 port 22: Connection timed out
      pdsh@fre1203: fre1202: ssh exited with exit code 255
      

      and run_one() exits with an error even though no LBUG/LASSERT actually happened:

      run_one()
             check_catastrophe || error "LBUG/LASSERT detected"
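
      For illustration, a minimal self-contained sketch of how the false failure arises (hypothetical helper and host names, not test-framework code): when a node is unreachable the remote command never runs, ssh itself exits with 255, and the caller treats any non-zero status as an LBUG:

      #!/bin/bash
      # Hypothetical stand-alone reproduction (not test-framework code).

      # Stand-in for check_catastrophe() on a single node: run the same
      # remote snippet over ssh and exit with the 'catastrophe' value.
      check_catastrophe_sketch() {
              ssh "$1" 'rc=0
                      val=$(lctl get_param -n catastrophe 2>&1)
                      if [ $? -eq 0 ] && [ "$val" -ne 0 ]; then
                              echo "$(hostname -s): $val"
                              rc=$val
                      fi
                      exit $rc'
      }

      # If the node is unreachable, ssh exits with 255 before lctl ever
      # runs, so the caller sees a non-zero status and misreports it:
      check_catastrophe_sketch some-unreachable-node ||
              echo "LBUG/LASSERT detected"        # false positive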
      
      

      Attachments

      Issue Links

      Activity

            [LU-8226] t-f check_catastrophe() defect
            pjones Peter Jones added a comment -

            Landed for 2.9


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/20539/
            Subject: LU-8226 tests: Change check_catastrophe() to check_node_health()
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 07cdb91faeda3dc88f9d2c26d573c81e2d70ab95


            adilger Andreas Dilger added a comment -

            Getting "val" correctly is fine, but AFAICS the change you have proposed to use "cut" is not necessary, since that is running on the remote node and not the local node, so there shouldn't be a hostname: prefix on the output?

            The second question is why shouldn't the check be considered a failure if the remote node is unavailable? Would it be better if this function was called check_node_health() that checked both whether the remote node was running, as well as whether there was an LBUG/LASSERT, and prints a proper error message in both cases? I would be fine with that, and even better would be to move the error message into check_node_health() so it was obvious which node had the problem.

            There are only two places from which this function is called - after a test has just completed in run_one(), and in check_client_load(), which verifies that a client node is healthy and that the test load (e.g. dbench) is still running - so it is possible to change the callers easily.
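
            A rough sketch of the check_node_health()-style helper described above (illustrative only; the function name and messages are assumptions, not the patch that landed). It reuses the existing t-f helpers comma_list(), nodes_list(), do_node() and error(), and reports which node had which problem:

            check_node_health_sketch() {
                    local nodes=${1:-$(comma_list $(nodes_list))}
                    local node val

                    for node in ${nodes//,/ }; do
                            # distinguish "node unreachable" from "LBUG hit"
                            if ! do_node $node true; then
                                    error "$node is not reachable"
                            fi
                            val=$(do_node $node $LCTL get_param -n catastrophe)
                            if [ -n "$val" ] && [ "$val" != "0" ]; then
                                    error "LBUG/LASSERT detected on $node"
                            fi
                    done
            }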


            parinay parinay v kondekar (Inactive) added a comment - edited

            Andreas, Oleg,

            The fix takes care of:
            1. the false LBUG/LASSERT from this function if any node in $nodes is inaccessible.
            2. setting the local variable 'val' to the correct value returned by $($LCTL get_param -n catastrophe)

            For (1), adding " || true" is not going to hurt, especially in the case of multiple hosts (e.g. mdts_nodes(), osts_nodes(), node_list(), or other places in t-f where such a condition is present) where one of the nodes could be inaccessible and the reason is not an LBUG()/LASSERT().

            The case where an LBUG/catastrophe is hit is taken care of correctly. Other than this, if you want me to change only the error message, I can do that in this patch; let me know if so.

            For (2), the patch takes care of calculating val correctly.

            Thanks

            green Oleg Drokin added a comment -

            I kind of agree that the most we can charge this code with is that the message is a bit misleading in some cases.

            Actually, does this still return retcode 255 if the remote node is up but just does not have the lustre modules loaded?
            Apparently not, so this case is hit only if the node is dead. So we can adjust the test further: if rc is 255, then the node is dead and it is a catastrophe;
            if the file is not present, there is a "lustre modules not loaded" problem (often due to a crash and no automatic module reloading) and a different message might be output.
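
            A small sketch of that case split (assumed handling of the do_node/ssh exit status, with $node and $LCTL as used elsewhere in t-f; not the landed code):

            # ssh/pdsh exits with 255 when the node cannot be reached at all;
            # any other failure from lctl means the 'catastrophe' parameter is
            # missing, i.e. the lustre modules are not loaded.
            val=$(do_node $node "$LCTL get_param -n catastrophe" 2>&1)
            rc=$?
            if [ $rc -eq 255 ]; then
                    error "$node is unreachable - node crashed or rebooted?"
            elif [ $rc -ne 0 ]; then
                    error "$node: no 'catastrophe' parameter - lustre modules not loaded?"
            elif [ "$val" -ne 0 ]; then
                    error "LBUG/LASSERT detected on $node"
            fi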


            adilger Andreas Dilger added a comment -

            It isn't clear why it is desirable to continue with the testing if the node has rebooted because of an LBUG and panic_on_lbug, or because of network problems. I would consider those events as much of a catastrophe as an LBUG on the local client. The only thing that might be considered a defect is if this fails because the remote node's modules are not loaded, but that in itself might be a sign that the node rebooted during the test when it shouldn't have, since check_catastrophe() is only being run at the end of a test.


            parinay parinay v kondekar (Inactive) added a comment -

            Oleg,
            IMO, the t-f should not fail and abort test runs for any such cases; that is not the best thing to do. Correct me if I am wrong. The following code takes care of the case where an LBUG is actually hit:

            	val=\\\$($LCTL get_param -n catastrophe | cut -d':' -f2 2>&1);
            	if [[ \\\$? -eq 0 && \\\$val -ne 0 ]]; then
            
            green Oleg Drokin added a comment -

            So, in all those "false positive" cases, what are the test scripts supposed to do if the node is not accessible or there are no more lustre modules?
            If we don't abort now (perhaps the failure message might need to be adjusted a bit), would there be a cryptic failure in the next innocent test instead?

            Also case #2 is really fringe and we should not care about it.


            parinay parinay v kondekar (Inactive) added a comment -

            > The real question is why the remote node was inaccessible?
            An LBUG could be one of the reasons, not the only reason, why the node can be inaccessible.
            e.g.
            1. The node has rebooted (FAILURE_MODE=hard) but the lustre modules are not loaded.
            2. The node is inaccessible (unable to ping) due to network issues, while the node itself has not crashed.

            The patch takes care not only of the scenario where an LBUG has actually occurred ($LCTL get_param -n catastrophe returns 1 - I tested this by introducing an LBUG() in the code and setting panic_on_lbug to 0 in /proc) but also of the false positives due to the above-mentioned issues.


            adilger Andreas Dilger added a comment -

            The real question is why the remote node was inaccessible. That would normally be caused by a crash of the node, so shouldn't it be considered a catastrophe in that case?


            People

              yujian Jian Yu
              parinay parinay v kondekar (Inactive)
              Votes: 0
              Watchers: 8
