[LU-17216] enable_health_write, health_check improvements Created: 21/Oct/23  Updated: 23/Jan/24

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement
Priority: Minor
Reporter: Tim Day
Assignee: Tim Day
Resolution: Unresolved
Votes: 0
Labels: None

Issue Links:
Related
is related to LU-17450 sanity: interop test failures with ma... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

enable_health_write should be a runtime tunable rather than a compile-time option. This makes it easier to test and lets admins try it out without recompiling their Lustre servers. It will still be disabled by default.

The health write should also be enabled for MDT/MGT, especially since DNE means there are many more metadata-related disks.

Getting more verbose info from health checks would be useful. Lustre should report health per OBD device, and it should also say what is wrong. To implement this, the health check functions could return an enum indicating the root cause of the health check failure (disk IO, ptlrpc, etc.). Each individual check then only needs to return the appropriate enum value.
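
The enum-based reporting described above could look roughly like the following. This is a minimal userspace sketch, not code from any of the patches on this ticket: every name in it (enum health_status, struct health_dev, health_report, the check_* stubs) is hypothetical, and the device names are only examples.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical root causes; the names are illustrative, not Lustre's. */
enum health_status {
	HEALTH_OK = 0,
	HEALTH_DISK_IO,
	HEALTH_PTLRPC,
	HEALTH_UNKNOWN,
};

/* One entry per OBD device; check() stands in for that device's check. */
struct health_dev {
	const char *name;
	enum health_status (*check)(void);
};

/* Map a status to a short human-readable cause. */
static const char *health_status_str(enum health_status st)
{
	switch (st) {
	case HEALTH_OK:
		return "healthy";
	case HEALTH_DISK_IO:
		return "disk IO";
	case HEALTH_PTLRPC:
		return "ptlrpc";
	default:
		return "unknown";
	}
}

/*
 * Write one "<device>: <cause>" line per device into buf and return the
 * number of unhealthy devices. Output is truncated if buf is too small.
 */
static int health_report(const struct health_dev *devs, size_t ndevs,
			 char *buf, size_t len)
{
	size_t used = 0;
	int unhealthy = 0;

	if (len > 0)
		buf[0] = '\0';

	for (size_t i = 0; i < ndevs; i++) {
		enum health_status st = devs[i].check();
		int n;

		if (st != HEALTH_OK)
			unhealthy++;
		if (used >= len)
			continue; /* buffer full; keep counting failures */

		n = snprintf(buf + used, len - used, "%s: %s\n",
			     devs[i].name, health_status_str(st));
		if (n > 0)
			used += (size_t)n;
	}
	return unhealthy;
}

/* Stub checks for demonstration; a real check would probe the device. */
static enum health_status check_ok(void)
{
	return HEALTH_OK;
}

static enum health_status check_disk_fail(void)
{
	return HEALTH_DISK_IO;
}
```

With this shape, an individual check only has to pick the right enum value, and the reporting code decides how to present it per device.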



 Comments   
Comment by Gerrit Updater [ 21/Oct/23 ]

"Timothy Day <timday@amazon.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52782
Subject: LU-17216 ofd: make enable_health_write tunable
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 7b6ff85f64080c5040f123ad18f86a20a8914138

Comment by Gerrit Updater [ 22/Oct/23 ]

"Timothy Day <timday@amazon.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52785
Subject: LU-17216 mdt: implement generic health writes
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c9a70eb100ff8299fc3f1c8e194c05da991016c4

Comment by Gerrit Updater [ 29/Nov/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/52782/
Subject: LU-17216 ofd: make enable_health_write tunable
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: e383791b1ca45ea14f38c98b9ff55faf7590ef30

Comment by Andreas Dilger [ 23/Jan/24 ]

The test added in 52782 is failing interop testing between master (2.15.60.20) and 2.15.4. Please review the failure and push a patch that either skips the test, since it is not expected to work with old servers, or fixes it as needed:
https://testing.whamcloud.com/test_sets/dc77145c-b7d3-4010-a7a2-f8435f9353ff

Comment by Gerrit Updater [ 23/Jan/24 ]

"Timothy Day <timday@amazon.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53770
Subject: LU-17216 ofd: skip sanity/70a on old OSTs
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 882188c51ce7ec5bf66e0159fe790b9aba3013c9

Generated at Sat Feb 10 03:33:36 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.