[LU-8017] All Nodes report NOT HEALTHY, system is healthy Created: 13/Apr/16 Updated: 07/Dec/16 Resolved: 07/Jun/16
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.9.0 |
| Fix Version/s: | Lustre 2.9.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Cliff White (Inactive) | Assignee: | James A Simmons |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | soak |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Current build installed: https://build.hpdd.intel.com/job/lustre-reviews/38245/
All nodes report NOT HEALTHY from /proc/fs/lustre/health_check, yet the filesystem otherwise operates normally: jobs run and results are created. |
| Comments |
| Comment by Andreas Dilger [ 13/Apr/16 ] |
|
This looks like a bug introduced by http://review.whamcloud.com/16933:

        lustre/obdclass/linux/linux-module.c
        @@ -275,7 +277,7 @@ static int obd_proc_health_seq_show(struct seq_file *m, void *data)
                read_unlock(&obd_dev_lock);

                if (healthy)
        -               return seq_printf(m, "healthy\n");
        +               seq_puts(m, "healthy\n");

                seq_printf(m, "NOT HEALTHY\n");
                return 0;

It should still have returned after printing "healthy" instead of continuing on to "NOT HEALTHY":

        if (healthy) {
                seq_puts(m, "healthy\n");
                return;
        }
|
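For reference, a minimal sketch of the corrected function as a whole (my reconstruction, not the landed patch; the device walk under obd_dev_lock is elided). Note that since the function returns int, the early exit needs "return 0" rather than the bare "return" in the snippet above:

        static int obd_proc_health_seq_show(struct seq_file *m, void *data)
        {
                bool healthy = true;

                /* ... iterate obd_devs under obd_dev_lock, clearing
                 * healthy if any device reports a problem ... */

                if (healthy) {
                        seq_puts(m, "healthy\n");
                        return 0;
                }

                seq_printf(m, "NOT HEALTHY\n");
                return 0;
        }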
| Comment by Di Wang [ 13/Apr/16 ] |
|
Even in a healthy environment, it still shows "NOT HEALTHY":

        [root@testnode tests]# MDSCOUNT=4 sh llmount.sh
        Stopping clients: testnode /mnt/lustre (opts:)
        Stopping clients: testnode /mnt/lustre2 (opts:)
        Loading modules from /work/lustre-new/lustre-release/lustre/tests/..
        ........
        [root@testnode tests]# cat /proc/fs/lustre/health_check
        healthy
        NOT HEALTHY

I checked the code, and it looks like a typo:

        static int obd_proc_health_seq_show(struct seq_file *m, void *data)
        {
                ............
                if (healthy)
                        seq_puts(m, "healthy\n");
                ---------------------------> probably an "else" is missing here
                seq_printf(m, "NOT HEALTHY\n");
|
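Either form fixes the bug; a sketch of the "else" variant Di Wang suggests, with the rest of the function unchanged:

        if (healthy)
                seq_puts(m, "healthy\n");
        else
                seq_printf(m, "NOT HEALTHY\n");

        return 0;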
| Comment by Cliff White (Inactive) [ 13/Apr/16 ] |
|
This also points to a gap in test coverage, since we should be checking health_check in automated testing. |
| Comment by Andreas Dilger [ 13/Apr/16 ] |
|
I'd be fine with an "else" also. Please also add a test that this is working properly. |
| Comment by Cliff White (Inactive) [ 13/Apr/16 ] |
|
I think QA team can do this. |
| Comment by James A Simmons [ 14/Apr/16 ] |
|
Oops, I missed fixing up a sed change. Will fix. Sorry I didn't add a test with this patch, but it is a really good idea. This way we can see if the upstream client also behaves properly. Which test should it go into? |
| Comment by Gerrit Updater [ 14/Apr/16 ] |
|
James Simmons (uja.ornl@yahoo.com) uploaded a new patch: http://review.whamcloud.com/19537 |
| Comment by Andreas Dilger [ 14/Apr/16 ] |
|
The test can go into sanity.sh. It should be enough to have a simple test that checks for "healthy" visible on all nodes and all services, AND that "NOT HEALTHY" is not present. |
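A sketch of what such a sanity.sh test might look like, assuming the standard test-framework.sh helpers (do_node, nodes_list, error, run_test); the test number is illustrative and the landed patch may implement the check differently:

        test_420() {
                local node health

                for node in $(nodes_list); do
                        health=$(do_node $node "lctl get_param -n health_check")
                        # exact match covers both requirements: "healthy" is
                        # present and "NOT HEALTHY" is absent
                        [ "$health" = "healthy" ] ||
                                error "$node reports '$health', expected 'healthy'"
                done
        }
        run_test 420 "health_check reports healthy on all nodes"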
| Comment by Gerrit Updater [ 08/May/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/19537/ |
| Comment by James A Simmons [ 08/May/16 ] |
|
Patch has landed. |
| Comment by Bob Glossman (Inactive) [ 10/May/16 ] |
|
Running a client build that includes the new health check added to test-framework.sh by this fix, against a server that still has the old bug of returning incorrect health status, causes failures from test-framework nearly everywhere. Do we need some version check around the health check in test-framework to avoid phony failures in interop testing? |
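For illustration, an interop guard of the kind Bob describes typically uses the existing version helpers in test-framework.sh; the 2.8.52 cutoff here is an assumption based on Andreas's comments below, which also explain why such a guard cannot help in this particular case:

        # hypothetical guard; would not work here since both builds
        # report the same development version
        [ $(lustre_version_code $SINGLEMDS) -lt $(version_code 2.8.52) ] &&
                skip "server lacks the health_check fix" && return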
| Comment by James A Simmons [ 11/May/16 ] |
|
Do you have an example log of the failure? Also, what does lctl get_param health_check show? |
| Comment by Andreas Dilger [ 11/May/16 ] |
|
Bob, which specific versions are you testing? I thought this only affected 2.8.51 and was fixed in 2.8.52, and we typically do not test interop between point releases during development? Was the 16933 patch backported to some maintenance branch? In that case, the right answer is to also backport 19537 to that same branch, or any HA system that checks health_check will fail. |
| Comment by Bob Glossman (Inactive) [ 11/May/16 ] |
|
Andreas, James,

        subsystem_debug=all -lnet -lnd -pinger
        Setup mgs, mdt, osts
        Starting mds1: /dev/sdb /mnt/mds1
        runtests test_1: @@@@@@ FAIL: mds1 is in a unhealthy state
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:4769:error()
        = /usr/lib64/lustre/tests/test-framework.sh:1281:mount_facet()
        = /usr/lib64/lustre/tests/test-framework.sh:1344:start()
        = /usr/lib64/lustre/tests/test-framework.sh:3649:setupall()
        = /usr/lib64/lustre/tests/runtests:90:test_1()
        = /usr/lib64/lustre/tests/test-framework.sh:5033:run_one()
        = /usr/lib64/lustre/tests/test-framework.sh:5072:run_one_logged()
        = /usr/lib64/lustre/tests/test-framework.sh:4919:run_test()
        = /usr/lib64/lustre/tests/runtests:135:main()
        Dumping lctl log to /tmp/test_logs/2016-05-10/154232/runtests.test_1.*.1462920188.log
        Resetting fail_loc on all nodes...done.
        FAIL 1 (32s)

lctl get_param health_check on the servers shows:

        # lctl get_param health_check
        health_check=healthy
        NOT HEALTHY
|
| Comment by Andreas Dilger [ 11/May/16 ] |
|
Bob, in that case it wouldn't even be possible to have a version check, even if we did that for development versions, since they both have the same version number. I would just update the old nodes and move on. |
| Comment by Bob Glossman (Inactive) [ 11/May/16 ] |
|
I can avoid the failure by commenting out or deleting the new health check in test-framework.sh on the client. Andreas, I will take your advice. |
| Comment by James A Simmons [ 11/May/16 ] |
|
Thankfully the window of failure with the broken server version is very small. |
| Comment by James A Simmons [ 07/Jun/16 ] |
|
Shall we close this ticket again? |