[LU-8017] All Nodes report NOT HEALTHY, system is healthy Created: 13/Apr/16  Updated: 07/Dec/16  Resolved: 07/Jun/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.9.0
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Critical
Reporter: Cliff White (Inactive) Assignee: James A Simmons
Resolution: Fixed Votes: 0
Labels: soak

Issue Links:
Related
is related to LU-8066 Move lustre procfs handling to sysfs ... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Current build installed: https://build.hpdd.intel.com/job/lustre-reviews/38245/
This issue has persisted for the last two builds.
After mounting the filesystem, all nodes report NOT HEALTHY in /proc/fs/lustre/health_check.

  pdsh -g server 'lctl get_param health_check' | dshbak -c
    ----------------
    lola-[2-11]
    ----------------
    health_check=healthy
    NOT HEALTHY

The filesystem otherwise operates normally: jobs run and results are created.
We were using health_check as part of our monitoring; that has had to be discontinued.
We are uncertain of the cause, as all operations we can test work fine and no errors are reported.



 Comments   
Comment by Andreas Dilger [ 13/Apr/16 ]

This looks like a bug introduced by http://review.whamcloud.com/16933 "LU-6215 lprocfs: handle seq_printf api change".

lustre/obdclass/linux/linux-module.c
@@ -275,7 +277,7 @@ static int obd_proc_health_seq_show(struct seq_file *m, void *data)
        read_unlock(&obd_dev_lock);
 
        if (healthy)
-               return seq_printf(m, "healthy\n");
+               seq_puts(m, "healthy\n");
 
        seq_printf(m, "NOT HEALTHY\n");
        return 0;

It should still have returned after printing "healthy" instead of falling through to "NOT HEALTHY":

        if (healthy) {
                seq_puts(m, "healthy\n");
                return 0;
        }
Comment by Di Wang [ 13/Apr/16 ]

Even in a healthy environment, it still shows "NOT HEALTHY":

[root@testnode tests]# MDSCOUNT=4 sh llmount.sh 
Stopping clients: testnode /mnt/lustre (opts:)
Stopping clients: testnode /mnt/lustre2 (opts:)
Loading modules from /work/lustre-new/lustre-release/lustre/tests/..
........
[root@testnode tests]# cat /proc/fs/lustre/health_check 
healthy
NOT HEALTHY

I checked the code, and it looks like a typo:

static int obd_proc_health_seq_show(struct seq_file *m, void *data)
{
    ............
        if (healthy)
                seq_puts(m, "healthy\n");
                /* probably an "else" is missing here */
        seq_printf(m, "NOT HEALTHY\n");
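
For illustration, here is how the tail of the function might look with the missing branch restored (a sketch only; the walk over the OBD device list that computes healthy in lustre/obdclass/linux/linux-module.c is elided):

static int obd_proc_health_seq_show(struct seq_file *m, void *data)
{
        bool healthy = true;

        /* elided: walk the OBD device list under obd_dev_lock and
         * clear 'healthy' if any attached device fails its check */
        ............

        if (healthy)
                seq_puts(m, "healthy\n");
        else
                seq_puts(m, "NOT HEALTHY\n");

        return 0;
}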
Comment by Cliff White (Inactive) [ 13/Apr/16 ]

This also indicates a gap in test coverage, since we should be checking health_check in autotest.

Comment by Andreas Dilger [ 13/Apr/16 ]

I'd be fine with an "else" also.

Please also add a test that this is working properly.

Comment by Cliff White (Inactive) [ 13/Apr/16 ]

I think the QA team can do this.

Comment by James A Simmons [ 14/Apr/16 ]

Oops, I missed fixing up a sed change. Will fix. Sorry I didn't add a test with this patch, but it is a really good idea; this way we can also see whether the upstream client behaves properly. Which test should it go into?

Comment by Gerrit Updater [ 14/Apr/16 ]

James Simmons (uja.ornl@yahoo.com) uploaded a new patch: http://review.whamcloud.com/19537
Subject: LU-8017 obd: report correct health state of a node
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ad8fb18517e31bc3878b13319c428b27c7279c16

Comment by Andreas Dilger [ 14/Apr/16 ]

The test can go into sanity.sh. It should be enough to have a simple test that checks that "healthy" is visible on all nodes and all services, AND that "NOT HEALTHY" is not present.
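
A minimal sketch of such a check, written against the test-framework.sh helpers (the test name and exact helper usage are illustrative, not the test that eventually landed):

# hypothetical sanity.sh test; do_facet/get_facets/error come from
# test-framework.sh
test_healthcheck() {
        local facets=$(get_facets)
        local facet

        for facet in ${facets//,/ }; do
                local health=$(do_facet $facet \
                        "$LCTL get_param -n health_check")
                [ "$health" = "healthy" ] ||
                        error "$facet health_check reports '$health'"
        done
}
run_test healthcheck "health_check reports healthy on all facets"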

Comment by Gerrit Updater [ 08/May/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/19537/
Subject: LU-8017 obd: report correct health state of a node
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: c28933602a6971739cb5ec3a1e920409ff19b01e

Comment by James A Simmons [ 08/May/16 ]

Patch has landed.

Comment by Bob Glossman (Inactive) [ 10/May/16 ]

Running a client build that has the new health check (added to test-framework.sh by this fix) against a server that still has the old problem of returning incorrect health status causes failures from test-framework nearly everywhere. Do we need a version check around the health check in test-framework to avoid phony failures in interop?

Comment by James A Simmons [ 11/May/16 ]

Do you have an example log of the failure? Also, what does lctl get_param health_check show?

Comment by Andreas Dilger [ 11/May/16 ]

Bob, which specific versions are you testing? I thought this only affected 2.8.51 and was fixed in 2.8.52, and we typically do not test interop between point releases during development. Was the 16933 patch backported to some maintenance branch? In that case, the right answer is to also backport 19537 to that same branch, or any HA system that checks health_check will fail.

Comment by Bob Glossman (Inactive) [ 11/May/16 ]

Andreas,
The client is from master, v2.8.52 built yesterday; it has the LU-8017 fix.
The servers are also from master and also v2.8.52, but older, and don't have the fix.

James,
an example error from runtests:

subsystem_debug=all -lnet -lnd -pinger
Setup mgs, mdt, osts
Starting mds1:   /dev/sdb /mnt/mds1
 runtests test_1: @@@@@@ FAIL: mds1 is in a unhealthy state 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:4769:error()
  = /usr/lib64/lustre/tests/test-framework.sh:1281:mount_facet()
  = /usr/lib64/lustre/tests/test-framework.sh:1344:start()
  = /usr/lib64/lustre/tests/test-framework.sh:3649:setupall()
  = /usr/lib64/lustre/tests/runtests:90:test_1()
  = /usr/lib64/lustre/tests/test-framework.sh:5033:run_one()
  = /usr/lib64/lustre/tests/test-framework.sh:5072:run_one_logged()
  = /usr/lib64/lustre/tests/test-framework.sh:4919:run_test()
  = /usr/lib64/lustre/tests/runtests:135:main()
Dumping lctl log to /tmp/test_logs/2016-05-10/154232/runtests.test_1.*.1462920188.log
Resetting fail_loc on all nodes...done.
FAIL 1 (32s)

lctl get_param health_check on the servers shows:

# lctl get_param health_check
health_check=healthy
NOT HEALTHY
Comment by Andreas Dilger [ 11/May/16 ]

Bob, in that case it wouldn't even be possible to have a version check, even if we did that for development versions, since they both have the same version number. I would just update the old nodes and move on.

Comment by Bob Glossman (Inactive) [ 11/May/16 ]

I can avoid the failure by commenting out or deleting the new health check test in test-framework.sh on the client.
Pretty sure I could also avoid the failure by installing a newer build on the servers.
I wasn't sure how exposed we are to hitting this sort of failure in general, in interop between master and other versions.

Andreas, I will take your advice.

Comment by James A Simmons [ 11/May/16 ]

Thankfully the window of failure with a broken server version is very small.

Comment by James A Simmons [ 07/Jun/16 ]

Shall we close this ticket again?
