[LU-16479] Automatically manage/control DEGRADED ZFS OST's Created: 16/Jan/23  Updated: 17/Feb/23  Resolved: 14/Feb/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Upstream, Lustre 2.15.0
Fix Version/s: Lustre 2.16.0

Type: Improvement Priority: Minor
Reporter: Akash B Assignee: Akash B
Resolution: Fixed Votes: 0
Labels: None
Environment:

Lustre filesystem with ZFS as the backend filesystem for OSTs.


Epic/Theme: zfs
Epic: server, zfs
Rank (Obsolete): 9223372036854775807

 Description   

The obdfilter.testfs-OST000*.degraded value is currently set and unset by a zedlet (/etc/zfs/zed.d/statechange-lustre.sh) according to whether the zpool is DEGRADED or ONLINE. We'd like to be able to enable or disable this behavior through an option, so that I/O and new allocations can still go to DEGRADED OSTs and the net bandwidth of the filesystem does not drop because of them.
 
Introduce a new Lustre-specific ZFS dataset user property (lustre:autodegrade=on|off) for this purpose. Update the Lustre zedlet and extend the mkfs.lustre utility to add this property by default when creating a new Lustre server (ZFS OSTs only). The default behavior remains the same (lustre:autodegrade=on): new allocations to DEGRADED OSTs are disabled.
Creating a user property has a few advantages:

  1. User properties are a generic ZFS feature and won't be interpreted by ZFS itself. No ZFS changes are needed.
  2. The property can be set per dataset providing more granularity.
  3. The property is persistent and will survive reboots.
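
As an illustration of how the proposed property could be read, here is a minimal sketch. The function name autodegrade_setting and the ZFS override variable are hypothetical helpers, not part of the patch, and treating an unset property as "on" is an assumption based on the stated default.

```shell
#!/bin/sh
# Sketch only: helper names here are illustrative, not taken from the patch.
ZFS="${ZFS:-zfs}"

# Print the effective autodegrade setting for dataset $1.
# Assumption: anything other than an explicit "off" (including an unset
# property, which zfs reports as "-") is treated as the default "on".
autodegrade_setting() {
    value=$("$ZFS" get -H -o value lustre:autodegrade "$1" 2>/dev/null)
    case "$value" in
        off) echo off ;;
        *)   echo on ;;
    esac
}
```

On a real system an administrator would toggle the property with the standard commands, for example "zfs set lustre:autodegrade=off testpool/ost1" (testpool/ost1 is a placeholder dataset name), inspect it with "zfs get lustre:autodegrade testpool/ost1", and clear it again with "zfs inherit lustre:autodegrade testpool/ost1".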


 Comments   
Comment by Gerrit Updater [ 17/Jan/23 ]

"Akash B <akash-b@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49660
Subject: LU-16479 utils: Add option to manage degraded ZFS OST
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ccf612d181952744e1dd13a842c02bb28ba181d1

Comment by Olaf Faaland [ 30/Jan/23 ]

Hi Akash,
What motivated this? Did you see disk failures during testing that affected results unnecessarily?
thanks

Comment by Akash B [ 08/Feb/23 ]

Hi Olaf,

By default, for ZFS OSTs the Lustre zedlet (statechange-lustre.sh) automatically manipulates the Lustre degraded state. When writing to multiple OSS nodes where one or two OSTs are in a degraded state (shown as D in lfs df -h on the client side), the overall bandwidth of the filesystem is reduced.

We'd like to be able to enable or disable this behavior through an option. I talked to Brian Behlendorf about exposing it as a user property that administrators can use to turn the behavior on or off. His reply:

Hey Akash,
 
On occasion we've also wanted to be able to administratively disable this behavior, supporting this sounds great.
 
I agree that splitting portions of the patch between Lustre and ZFS is awkward.  It'd be nice to handle it entirely on the Lustre side.  An alternate solution would be to introduce a new Lustre-specific ZFS dataset user property.  As I'm sure you've noticed, Lustre already adds the following dataset properties, which are used for configuration:
 
kern3/ost1  lustre:flags          4130            local
kern3/ost1  lustre:fsname         lslide          local
kern3/ost1  lustre:version        1               local
kern3/ost1  lustre:mgsnode        7@kfi:9@kfi     local
kern3/ost1  lustre:index          0               local
kern3/ost1  lustre:failover.node  21@kfi:3@kfi    local
kern3/ost1  lustre:svname         lslide-OST0000  local
 
We could add a new lustre:autodegrade=<on|off> user property (see "User Properties" in zfsprops(7)).  The statechange-lustre.sh zedlet could then check this property on the dataset to control the behavior.  This has a few advantages:
1. User properties are a generic ZFS feature and won't be interpreted by ZFS itself.  No ZFS changes are needed.
2. The property can be set per dataset providing more granularity.
3. The property is persistent and will survive reboots.  
4. This mechanism is already used within the zedlet to identify Lustre datasets,
e.g.: zfs get -rH -s local -t filesystem -o name lustre:svname ${ZEVENT_POOL}
5. You can add the property at any time to an existing MDT/OST.
What do you think?  If you were to implement this, I'd suggest not only updating the Lustre zedlet, but also extending the mkfs.lustre utility to add this property by default when creating a new Lustre server.
 
Thanks,
Brian
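
The zedlet check suggested above could look roughly like the following. This is a sketch under stated assumptions, not the merged statechange-lustre.sh: the function name update_degraded and the ZFS/LCTL override variables are hypothetical, and it assumes every dataset found is an OST whose obdfilter.<svname>.degraded parameter exists.

```shell
#!/bin/sh
# Sketch only: not the merged zedlet. Helper and variable names are illustrative.
ZFS="${ZFS:-zfs}"
LCTL="${LCTL:-lctl}"

# Set obdfilter.<svname>.degraded on every Lustre dataset in a pool,
# skipping datasets whose lustre:autodegrade property is "off".
#   $1 = pool name (ZEVENT_POOL inside a zedlet)
#   $2 = 1 to mark degraded, 0 to clear
update_degraded() {
    pool=$1
    state=$2
    # Lustre datasets are the ones with a locally set lustre:svname.
    "$ZFS" get -rH -s local -t filesystem -o name lustre:svname "$pool" |
    while read -r dataset; do
        auto=$("$ZFS" get -H -o value lustre:autodegrade "$dataset")
        # Administrator opted out for this dataset: leave its state alone.
        [ "$auto" = "off" ] && continue
        svname=$("$ZFS" get -H -o value lustre:svname "$dataset")
        "$LCTL" set_param "obdfilter.${svname}.degraded=${state}"
    done
}
```

A zedlet would call something like update_degraded "${ZEVENT_POOL}" 1 when the pool becomes DEGRADED and update_degraded "${ZEVENT_POOL}" 0 when it returns to ONLINE. Because the property is checked per dataset, one OST in a pool can opt out while others keep the default behavior.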
Comment by Gerrit Updater [ 14/Feb/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49660/
Subject: LU-16479 utils: Add option to manage degraded ZFS OST
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: a2de6af65d21bff0d9357c30e6eb4ba049ff2059

Generated at Sat Feb 10 03:27:22 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.