Lustre / LU-8017

All Nodes report NOT HEALTHY, system is healthy

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.9.0
    • Affects Version/s: Lustre 2.9.0
    • Severity: 3

    Description

      Current build installed: https://build.hpdd.intel.com/job/lustre-reviews/38245/
      This issue has persisted for the last two builds.
      After mounting the filesystem, all nodes report NOT HEALTHY in /proc/fs/lustre/health_check.

        pdsh -g server 'lctl get_param health_check' | dshbak -c
        ----------------
        lola-[2-11]
        ----------------
        health_check=healthy
        NOT HEALTHY

      The filesystem otherwise operates normally: jobs run and results are created.
      We were using health_check as part of our monitoring; this has been discontinued.
      We are uncertain of the cause, as all operations we can test work fine and no errors are reported.
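
      For reference, a minimal sketch of the kind of monitoring poll described above (assumptions: a pdsh host group named "server" as in the command above; the warning and exit handling is illustrative, not the monitoring setup referred to here):

        #!/bin/bash
        # Poll health_check on every server and flag any node reporting
        # NOT HEALTHY. Assumes a pdsh host group named "server", as used
        # in the command above; dshbak -c only folds identical output.
        out=$(pdsh -g server 'lctl get_param health_check' 2>/dev/null | dshbak -c)
        echo "$out"
        if echo "$out" | grep -q 'NOT HEALTHY'; then
                echo "WARNING: one or more nodes report NOT HEALTHY" >&2
                exit 1
        fi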

          Activity


            simmonsja James A Simmons added a comment - Shall we close this ticket again?

            simmonsja James A Simmons added a comment - Thankfully the window of failure with broken server version is very small.
            bogl Bob Glossman (Inactive) added a comment - - edited

            I can avoid the failure by commenting out or deleting the new health check in test-framework.sh on the client.
            I'm pretty sure I could also avoid the failure by installing a newer build on the servers.
            I wasn't sure how exposed we are to hitting this sort of failure in general, in interop between master and other versions.

            Andreas, I will take your advice.


            adilger Andreas Dilger added a comment - Bob, in that case it wouldn't even be possible to have a version check, even if we did that for development versions, since they both have the same version number. I would just update the old nodes and move on.

            bogl Bob Glossman (Inactive) added a comment

            Andreas,
            The client is from master, v2.8.52, built yesterday; it has the LU-8017 fix.
            The servers are also from master, but older: also v2.8.52, but without the fix.

            James,
            example error from runtests:

            subsystem_debug=all -lnet -lnd -pinger
            Setup mgs, mdt, osts
            Starting mds1:   /dev/sdb /mnt/mds1
             runtests test_1: @@@@@@ FAIL: mds1 is in a unhealthy state 
              Trace dump:
              = /usr/lib64/lustre/tests/test-framework.sh:4769:error()
              = /usr/lib64/lustre/tests/test-framework.sh:1281:mount_facet()
              = /usr/lib64/lustre/tests/test-framework.sh:1344:start()
              = /usr/lib64/lustre/tests/test-framework.sh:3649:setupall()
              = /usr/lib64/lustre/tests/runtests:90:test_1()
              = /usr/lib64/lustre/tests/test-framework.sh:5033:run_one()
              = /usr/lib64/lustre/tests/test-framework.sh:5072:run_one_logged()
              = /usr/lib64/lustre/tests/test-framework.sh:4919:run_test()
              = /usr/lib64/lustre/tests/runtests:135:main()
            Dumping lctl log to /tmp/test_logs/2016-05-10/154232/runtests.test_1.*.1462920188.log
            Resetting fail_loc on all nodes...done.
            FAIL 1 (32s)
            

            lctl get_param health_check on the servers shows:

            # lctl get_param health_check
            health_check=healthy
            NOT HEALTHY
            
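            For context, the check that trips in mount_facet() above is roughly of the following shape. This is a paraphrased sketch, not the verbatim test-framework.sh code; do_facet, $LCTL and error are existing test-framework helpers, and the message mirrors the one in the trace above.

            # Paraphrased sketch of the post-mount health check in
            # test-framework.sh (not the verbatim code).
            health=$(do_facet $facet "$LCTL get_param -n health_check")
            if [[ "$health" != "healthy" ]]; then
                    error "$facet is in a unhealthy state"
            fi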

            adilger Andreas Dilger added a comment - Bob, which specific versions are you testing? I thought this only affected 2.8.51 and was fixed in 2.8.52, and we typically do not test interop between point releases during development. Was the 16933 patch backported to some maintenance branch? If so, the right answer is to also backport 19537 to that same branch, or any HA system that checks health_check will fail.

            simmonsja James A Simmons added a comment - Do you have an example log of the failure? Also, what does lctl get_param health_check show?

            bogl Bob Glossman (Inactive) added a comment - Running a client build that has the new check added to test-framework.sh by this fix against a server that still has the old problem of returning an incorrect health status causes failures from test-framework nearly everywhere. Do we need some version test around the health check in test-framework to avoid phony failures in interop?
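
            The kind of version guard being asked about here would look roughly like the sketch below (hypothetical; lustre_version_code and version_code are existing test-framework.sh helpers). Note that, as Andreas points out in a later comment, it would not help in this particular case, because the fixed and unfixed builds both report v2.8.52.

            # Hypothetical version guard around the health check (sketch only);
            # it cannot distinguish two master builds that both report 2.8.52.
            if [ "$(lustre_version_code $facet)" -ge "$(version_code 2.8.52)" ]; then
                    health=$(do_facet $facet "$LCTL get_param -n health_check")
                    [[ "$health" == "healthy" ]] ||
                            error "$facet is in a unhealthy state"
            fi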

            simmonsja James A Simmons added a comment - Patch has landed.

            gerrit Gerrit Updater added a comment

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/19537/
            Subject: LU-8017 obd: report correct health state of a node
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: c28933602a6971739cb5ec3a1e920409ff19b01e


            People

              Assignee: simmonsja James A Simmons
              Reporter: cliffw Cliff White (Inactive)
              Votes: 0
              Watchers: 6
