Lustre / LU-7880

add OST/MDT performance statistics to obd_statfs

Details

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Major

    Description

      In order to make OST and MDT performance statistics available to userspace applications (global NRS scheduling, SCR checkpoint scheduling, QOS and allocation decisions on the MDS, etc.), it is useful to transport them to the clients via obd_statfs.

      The statistics should include <peak, decaying average of current> for each of <read IOPS, write IOPS, read KiB/s, write KiB/s>.

      The OSS and MDS already collect these statistics for presentation via /proc, and it should be possible to include them in struct obd_statfs as newly-added fields at the end of the struct.

      The stats should be fetched and printed with the lfs df --stats command for all targets, but not necessarily for regular statfs() requests. With LU-10018 "MDT as a statfs() proxy", the MDT_STATFS request now carries an mdt_body in the request, which can be used to request different behaviour for the RPC.
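
      As a rough illustration (not a finalized layout), the appended fields might look like the following; the names echo the proposal discussed in the comments below, the units are still to be decided, and note the discussion there about fitting eight values into the remaining reserved space at the end of struct obd_statfs:

        __u32 os_read_bytes_peak;   /* peak read bandwidth, e.g. KiB/s */
        __u32 os_write_bytes_peak;  /* peak write bandwidth */
        __u32 os_read_io_peak;      /* peak read IOPS */
        __u32 os_write_io_peak;     /* peak write IOPS */
        __u32 os_read_bytes_avg;    /* decaying average read bandwidth */
        __u32 os_write_bytes_avg;   /* decaying average write bandwidth */
        __u32 os_read_io_avg;       /* decaying average read IOPS */
        __u32 os_write_io_avg;      /* decaying average write IOPS */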

          Activity

            [LU-7880] add OST/MDT performance statistics to obd_statfs
            gerrit Gerrit Updater added a comment -

            "George Zhao <georgezhaojobs@gmail.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/59547
            Subject: LU-7880 ptlrpc: expand obd_statfs size by 64 bits
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 8db7777311c95fa100daf7acbf9f6f3cd74834af


            adilger Andreas Dilger added a comment -

            The statfs data is cached at each level of the stack, so that reading from multiple "kbytesfree", "kbytestotal", "filesfree", etc. parameters doesn't generate separate RPCs for each one.

            I don't think it matters which one you use. Depending on where the stats are available, you could fill in the stats at one level and they will be accessible up the stack.

            Each of the main lprocfs stats structures has its own timestamp since it was last reset (e.g. lprocfs_stats.ls_init), which is printed by lprocfs_stats_header(), so that should be used instead of the mount time. This ensures the time range of the stats matches the values that are accumulated there.

            As for the decay factor, I don't have a fixed number in mind. We typically try to avoid hard-coding constants into the code, but I'm not sure whether this needs to be configurable or not. We need both the decay factor and the decay interval. If we calculate the stats every 5s to determine "instantaneous" peak IOPS/BW, and want them to decay by about 0.4 after 1 minute, then we need 0.6 = a^(60/5); a = 245/256 works out to 41% decay after a minute and 93% decay after 5 minutes, which seems reasonable.
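
            For reference, a quick userspace sketch (not Lustre code) of the arithmetic above, with a = 245/256 applied at every 5-second sample:

            #include <math.h>
            #include <stdio.h>

            int main(void)
            {
                    /* remaining weight after t seconds is a^(t/5) */
                    double a = 245.0 / 256.0;

                    printf("decay after 1 minute:  %.0f%%\n", (1.0 - pow(a, 60 / 5)) * 100.0);   /* ~41% */
                    printf("decay after 5 minutes: %.0f%%\n", (1.0 - pow(a, 300 / 5)) * 100.0);  /* ~93% */
                    return 0;
            }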

            georgezhaojobs George Zhao added a comment - edited

            I'm trying to populate and cache the new fields in obd_statfs:

            __u32           os_read_bytes_peak;
            __u32           os_write_bytes_peak;
            __u32           os_read_io_peak;
            __u32           os_write_io_peak;
            __u32           os_read_bytes_avg;
            __u32           os_write_bytes_avg;
            __u32           os_read_io_avg;
            __u32           os_write_io_avg;
            

            If I understood correctly, the logic should be in ofd_statfs()->tgt_statfs_internal().

            What confused me is that I found tgd_osfs cached in tg_grants_data, and obd_osfs cached in obd_device. Which one should I use?
            Another question: I'm going to add an obd_device argument to tgt_statfs_internal() so I can get at obd_stats. Please correct me if I'm wrong.

            Two other questions need a design decision:

            1. To get the first average value, how about lprocfs_counter.lc_sum/(current_time - mount_time)? Where can I get the mount_time for ofd/mdt?
            2. For decaying average, what's the decay factor a? Is it configurable?
              time_delta = current_time - obd->obd_osfs_age
              new_avg = new_sample/time_delta
              new_d_avg = old_d_avg * a + new_avg * (1-a)
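
            A minimal userspace sketch of how both points might fit together, assuming the 245/256 decay factor suggested above and integer-only arithmetic (kernel code cannot use floating point). The struct and function names here are hypothetical, not existing Lustre symbols; the first average falls out naturally if the state is seeded with sum = 0 at the stats-reset time (lprocfs_stats.ls_init) rather than the mount time:

            #include <stdint.h>

            /* hypothetical per-target state for one metric (e.g. read IOPS) */
            struct tgt_rate_stat {
                    uint64_t trs_last_sum;   /* counter sum (lc_sum) at last sample */
                    int64_t  trs_last_time;  /* time of last sample, in seconds */
                    uint32_t trs_avg;        /* decaying average rate, events/sec */
                    uint32_t trs_peak;       /* peak observed rate, events/sec */
            };

            /* fold a new counter reading into the decaying average and peak;
             * decay factor a = 245/256 per sample, computed in integer math */
            static void tgt_rate_update(struct tgt_rate_stat *trs, uint64_t sum, int64_t now)
            {
                    int64_t delta = now - trs->trs_last_time;
                    uint32_t rate;

                    if (delta <= 0)
                            return;

                    rate = (uint32_t)((sum - trs->trs_last_sum) / delta);
                    /* new_avg = old_avg * 245/256 + rate * 11/256 */
                    trs->trs_avg = (uint32_t)(((uint64_t)trs->trs_avg * 245 +
                                               (uint64_t)rate * 11) >> 8);
                    if (rate > trs->trs_peak)
                            trs->trs_peak = rate;

                    trs->trs_last_sum = sum;
                    trs->trs_last_time = now;
            }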

            adilger Andreas Dilger added a comment -

            Actually, it should be possible to increase the size of the obd_statfs structure in the STATFS RPC relatively easily, so long as the nodes handling it do not try to access beyond the actual size requested/replied. I don't think a 16-bit value would be granular enough, no matter what units are chosen.

            georgezhaojobs George Zhao added a comment -

            In order to reuse the current polling mechanism, I suppose we need to figure out how to fit 8 metrics into 7 (or fewer) fields.

            Maybe compress the “average read/write io count” into one u32 field? The max would then be 65535 per direction over 5 sec. Does that sound acceptable?
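
            If two 16-bit averages were packed into a single u32, hypothetical pack/unpack helpers (names are illustrative only, not existing Lustre code) could look like:

            #include <stdint.h>

            /* read average in the high 16 bits, write average in the low 16 bits;
             * values above 65535 saturate rather than wrap */
            static inline uint32_t os_io_avg_pack(uint32_t read_avg, uint32_t write_avg)
            {
                    if (read_avg > 0xffff)
                            read_avg = 0xffff;
                    if (write_avg > 0xffff)
                            write_avg = 0xffff;
                    return (read_avg << 16) | write_avg;
            }

            static inline uint32_t os_io_avg_read(uint32_t packed)
            {
                    return packed >> 16;
            }

            static inline uint32_t os_io_avg_write(uint32_t packed)
            {
                    return packed & 0xffff;
            }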

            1. "The MDS is already polling the OSTs at 5s intervals" - this is done via the "LOD->OSP" code on the MDS:
              • lod_qos_statfs_update()->lod_statfs_and_check()
              •  
            2. yes
            3. hmm, yes, except when I wrote this ticket many years ago there were more than 8 reserved fields, and now there are only 7 left.  All fields need to fit into u32 values comfortably, so units should be chosen carefully.  If KB/s this would only give 4 TB/s peak, and that may not be large enough in the future.
            4. time interval is already configurable by lod.*.qos_maxage. However, note that every MDT (10-100 today) needs to fetch this information from every OST (8-1000+), so it shouldn't be done too frequently. However, that part is mostly irrelevant, since the statfs() data will be fetched on demand at the client
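
            A quick userspace check of the u32 headroom for different bandwidth units (the 4 TB/s figure above corresponds to KB/s units):

            #include <stdint.h>
            #include <stdio.h>

            int main(void)
            {
                    /* largest bandwidth a __u32 can represent for a given unit */
                    printf("u32 in KB/s:  ~%.1f TB/s max\n", (double)UINT32_MAX * 1e3 / 1e12);
                    printf("u32 in MB/s:  ~%.1f PB/s max\n", (double)UINT32_MAX * 1e6 / 1e15);
                    printf("u32 in KiB/s: ~%.1f TiB/s max\n",
                           (double)UINT32_MAX * 1024.0 / (1024.0 * 1024 * 1024 * 1024));
                    return 0;
            }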
            adilger Andreas Dilger added a comment - "The MDS is already polling the OSTs at 5s intervals" - this is done via the "LOD->OSP" code on the MDS: lod_qos_statfs_update() -> lod_statfs_and_check()   yes hmm, yes, except when I wrote this ticket many years ago there were more than 8 reserved fields, and now there are only 7 left.  All fields need to fit into u32 values comfortably, so units should be chosen carefully.  If KB/s this would only give 4 TB/s peak, and that may not be large enough in the future. time interval is already configurable by lod.*.qos_maxage . However, note that every MDT (10-100 today) needs to fetch this information from every OST (8-1000+), so it shouldn't be done too frequently. However, that part is mostly irrelevant, since the statfs() data will be fetched on demand at the client
            georgezhaojobs George Zhao added a comment -

            Before starting the implementation, I want to clarify a few design considerations. Please correct me if a question doesn't make sense.

            1. Where can I find this logic? "The MDS is already polling the OSTs at 5s intervals".
            2. Where to calculate: Each target maintains its own metrics and fills in obd_statfs in ofd_statfs() and mdt_statfs()? Also, each target records only its most recent stats.
            3. struct obd_statfs Changes: Are we adding 8 fields (peak, avg) X (read, write) X (io, bandwidth)? 
            4. Time Window: 5 seconds, or make it configurable?

            Any guidance or suggestions on these points would be greatly appreciated.


            People

              Assignee: wc-triage WC Triage
              Reporter: adilger Andreas Dilger
              Votes: 0
              Watchers: 12
