Lustre / LU-8017

All Nodes report NOT HEALTHY, system is healthy

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.9.0
    • Affects Version/s: Lustre 2.9.0
    • Severity: 3

    Description

      Current build installed: https://build.hpdd.intel.com/job/lustre-reviews/38245/
      This issue has persisted for the last two builds.
      After mounting the filesystem, all nodes report NOT HEALTHY in /proc/fs/lustre/health_check.

        pdsh -g server 'lctl get_param health_check' | dshbak -c
        ----------------
        lola-[2-11]
        ----------------
        health_check=healthy
        NOT HEALTHY

      The filesystem otherwise operates normally: jobs run and results are created.
      We were using health_check as part of our monitoring; this has been discontinued.
      We are uncertain of the cause, as all operations we can test work fine and no errors are reported.
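
      For reference, a minimal sketch of the kind of monitoring poll described above (assumptions: a pdsh host group named "server" as in the command above; the warning and exit handling is illustrative, not the monitoring setup referred to here):

        #!/bin/bash
        # Poll health_check on every server and flag any node reporting
        # NOT HEALTHY. Assumes a pdsh host group named "server", as used
        # in the command above; dshbak -c only folds identical output.
        out=$(pdsh -g server 'lctl get_param health_check' 2>/dev/null | dshbak -c)
        echo "$out"
        if echo "$out" | grep -q 'NOT HEALTHY'; then
                echo "WARNING: one or more nodes report NOT HEALTHY" >&2
                exit 1
        fi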

          Activity


            simmonsja James A Simmons added a comment - Shall we close this ticket again?

            simmonsja James A Simmons added a comment - Thankfully the window of failure with broken server version is very small.
            bogl Bob Glossman (Inactive) added a comment - - edited

            I can avoid the failure by commenting out or deleting the new health check in test-framework.sh on the client.
            I'm pretty sure I could also avoid the failure by installing a newer build on the servers.
            I wasn't sure how exposed we are to hitting this sort of failure in general, in interop between master and other versions.

            Andreas, I will take your advice.


            adilger Andreas Dilger added a comment - Bob, in that case it wouldn't even be possible to have a version check, even if we did that for development versions, since they both have the same version number. I would just update the old nodes and move on.

            bogl Bob Glossman (Inactive) added a comment

            Andreas,
            The client is from master, v2.8.52, built yesterday; it has the LU-8017 fix.
            The servers are also from master, but older: also v2.8.52, but without the fix.

            James,
            example error from runtests:

            subsystem_debug=all -lnet -lnd -pinger
            Setup mgs, mdt, osts
            Starting mds1:   /dev/sdb /mnt/mds1
             runtests test_1: @@@@@@ FAIL: mds1 is in a unhealthy state 
              Trace dump:
              = /usr/lib64/lustre/tests/test-framework.sh:4769:error()
              = /usr/lib64/lustre/tests/test-framework.sh:1281:mount_facet()
              = /usr/lib64/lustre/tests/test-framework.sh:1344:start()
              = /usr/lib64/lustre/tests/test-framework.sh:3649:setupall()
              = /usr/lib64/lustre/tests/runtests:90:test_1()
              = /usr/lib64/lustre/tests/test-framework.sh:5033:run_one()
              = /usr/lib64/lustre/tests/test-framework.sh:5072:run_one_logged()
              = /usr/lib64/lustre/tests/test-framework.sh:4919:run_test()
              = /usr/lib64/lustre/tests/runtests:135:main()
            Dumping lctl log to /tmp/test_logs/2016-05-10/154232/runtests.test_1.*.1462920188.log
            Resetting fail_loc on all nodes...done.
            FAIL 1 (32s)
            

            lctl get_param health_check on the servers shows:

            # lctl get_param health_check
            health_check=healthy
            NOT HEALTHY
            
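            For context, the check that trips in mount_facet() above is roughly of the following shape. This is a paraphrased sketch, not the verbatim test-framework.sh code; do_facet, $LCTL and error are existing test-framework helpers, and the message mirrors the one in the trace above.

            # Paraphrased sketch of the post-mount health check in
            # test-framework.sh (not the verbatim code).
            health=$(do_facet $facet "$LCTL get_param -n health_check")
            if [[ "$health" != "healthy" ]]; then
                    error "$facet is in a unhealthy state"
            fi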

            adilger Andreas Dilger added a comment - Bob, which specific versions are you testing? I thought this only affected 2.8.51 and was fixed in 2.8.52, and we typically do not test interop between point releases during development. Was the 16933 patch backported to some maintenance branch? If so, the right answer is to also backport 19537 to that same branch, or any HA system that checks health_check will fail.

            simmonsja James A Simmons added a comment - Do you have an example log of the failure? Also, what does lctl get_param health_check show?

            bogl Bob Glossman (Inactive) added a comment - Running a client build that has the new check added to test-framework.sh by this fix against a server that still has the old problem of returning an incorrect health status causes failures from test-framework nearly everywhere. Do we need some version test around the health check in test-framework to avoid phony failures in interop?
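
            The kind of version guard being asked about here would look roughly like the sketch below (hypothetical; lustre_version_code and version_code are existing test-framework.sh helpers). Note that, as Andreas points out in a later comment, it would not help in this particular case, because the fixed and unfixed builds both report v2.8.52.

            # Hypothetical version guard around the health check (sketch only);
            # it cannot distinguish two master builds that both report 2.8.52.
            if [ "$(lustre_version_code $facet)" -ge "$(version_code 2.8.52)" ]; then
                    health=$(do_facet $facet "$LCTL get_param -n health_check")
                    [[ "$health" == "healthy" ]] ||
                            error "$facet is in a unhealthy state"
            fi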

            simmonsja James A Simmons added a comment - Patch has landed.

            gerrit Gerrit Updater added a comment

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/19537/
            Subject: LU-8017 obd: report correct health state of a node
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: c28933602a6971739cb5ec3a1e920409ff19b01e


            People

              Assignee: simmonsja James A Simmons
              Reporter: cliffw Cliff White (Inactive)
              Votes: 0
              Watchers: 6
