
add OST/MDT performance statistics to obd_statfs

Details

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Major

    Description

      In order to make OST and MDT performance statistics available to userspace applications, such as global NRS scheduling, SCR checkpoint scheduling, and QOS/allocation decisions on the MDS, it is useful to transport them to the clients via obd_statfs.

      The statistics should include {peak, decaying average of current} × {read IOPS, write IOPS, read KiB/s, write KiB/s}, i.e. eight values.

      The OSS and MDS already collect these statistics for presentation via /proc, and it should be possible to include them in struct obd_statfs as newly added fields at the end of the struct.
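
      As a rough illustration only (the struct and field names below are invented for this sketch, not taken from any patch), the eight values could be carried as eight __u32 fields at the end of struct obd_statfs; the standalone struct here just shows one possible set of fields and units:

      #include <linux/types.h>

      /* hypothetical layout of the eight new values; a real patch would
       * repurpose the reserved fields at the end of struct obd_statfs */
      struct obd_statfs_perf {
              __u32 osp_iops_read_peak;       /* peak read IOPS since mount */
              __u32 osp_iops_write_peak;      /* peak write IOPS since mount */
              __u32 osp_bw_read_peak;         /* peak read bandwidth, KiB/s */
              __u32 osp_bw_write_peak;        /* peak write bandwidth, KiB/s */
              __u32 osp_iops_read_avg;        /* decaying average read IOPS */
              __u32 osp_iops_write_avg;       /* decaying average write IOPS */
              __u32 osp_bw_read_avg;          /* decaying average read KiB/s */
              __u32 osp_bw_write_avg;         /* decaying average write KiB/s */
      };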

      The stats should be fetched and printed with the lfs df --stats command for all targets, but not necessarily for regular statfs() requests. With LU-10018 "MDT as a statfs() proxy", the MDT_STATFS request now carries an mdt_body, which can be used to request different behaviour for the RPC.

          Activity

            [LU-7880] add OST/MDT performance statistics to obd_statfs
            georgezhaojobs George Zhao added a comment -

            In order to reuse the current polling mechanism, I suppose we need to figure out how to fit 8 metrics into 7 (or fewer) fields.

            Maybe compress the "average read/write IO count" into one u32 field? The max would be 65536 in 5 sec. Does that sound acceptable?
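
            A minimal sketch of that kind of packing, assuming each average IO count is clamped to 16 bits (the macro names are made up for illustration):

            #include <stdint.h>

            /* pack the average read/write IO counts into a single 32-bit
             * field, clamping each value to 16 bits (max 65535) */
            #define OS_IOPS_PACK(rd, wr)                                    \
                    (((uint32_t)((rd) > 0xffff ? 0xffff : (rd)) << 16) |    \
                      (uint32_t)((wr) > 0xffff ? 0xffff : (wr)))
            #define OS_IOPS_READ(packed)    ((uint32_t)(packed) >> 16)
            #define OS_IOPS_WRITE(packed)   ((uint32_t)(packed) & 0xffff)

            The obvious trade-off is that anything above 65535 IOs per sampling interval would be reported as saturated.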

            1. "The MDS is already polling the OSTs at 5s intervals" - this is done via the "LOD->OSP" code on the MDS:
              • lod_qos_statfs_update()->lod_statfs_and_check()
              •  
            2. yes
            3. hmm, yes, except when I wrote this ticket many years ago there were more than 8 reserved fields, and now there are only 7 left.  All fields need to fit into u32 values comfortably, so units should be chosen carefully.  If KB/s this would only give 4 TB/s peak, and that may not be large enough in the future.
            4. time interval is already configurable by lod.*.qos_maxage. However, note that every MDT (10-100 today) needs to fetch this information from every OST (8-1000+), so it shouldn't be done too frequently. However, that part is mostly irrelevant, since the statfs() data will be fetched on demand at the client
            adilger Andreas Dilger added a comment - "The MDS is already polling the OSTs at 5s intervals" - this is done via the "LOD->OSP" code on the MDS: lod_qos_statfs_update() -> lod_statfs_and_check()   yes hmm, yes, except when I wrote this ticket many years ago there were more than 8 reserved fields, and now there are only 7 left.  All fields need to fit into u32 values comfortably, so units should be chosen carefully.  If KB/s this would only give 4 TB/s peak, and that may not be large enough in the future. time interval is already configurable by lod.*.qos_maxage . However, note that every MDT (10-100 today) needs to fetch this information from every OST (8-1000+), so it shouldn't be done too frequently. However, that part is mostly irrelevant, since the statfs() data will be fetched on demand at the client
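
            As a quick unit check for point 3 (plain arithmetic, not code from any patch): a __u32 rate field tops out just below 2^32 units per second, so the choice of unit sets the headroom.

            #include <stdint.h>
            #include <stdio.h>

            int main(void)
            {
                    /* 2^32 KiB/s = 4 TiB/s, 2^32 MiB/s = 4 PiB/s (1 TiB = 2^30 KiB) */
                    uint64_t steps = (uint64_t)UINT32_MAX + 1;

                    printf("u32 in KiB/s caps at %llu TiB/s\n",
                           (unsigned long long)(steps >> 30));
                    printf("u32 in MiB/s caps at %llu PiB/s\n",
                           (unsigned long long)(steps >> 30));
                    return 0;
            }
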
            georgezhaojobs George Zhao added a comment -

            Before starting the implementation, I want to clarify a few design considerations. Please correct me if a question doesn't make sense.

            1. Where can I find this logic: "The MDS is already polling the OSTs at 5s intervals"?
            2. Where to calculate: each target maintains its own metrics and fills obd_statfs in ofd_statfs() and mdt_statfs()? Also, each target records only its last stats snapshot?
            3. struct obd_statfs changes: are we adding 8 fields, (peak, avg) × (read, write) × (IOPS, bandwidth)?
            4. Time window: 5 seconds, or make it configurable?

            Any guidance or suggestions on these points would be greatly appreciated.


            adilger Andreas Dilger added a comment -

            The MDS is already polling the OSTs at 5s intervals in order to fetch the free blocks and inode counters to make QOS object allocation decisions. Including the RPC performance counters in obd_statfs will not add significant overhead to this operation.

            Returning a 5-second running average for the "current" performance, and the "peak" performance ever seen since mount, seems reasonable, though I'm open to suggestions.

            nrutman Nathan Rutman added a comment -

            I suppose you can get instantaneous rates for the last 5 seconds if you only record the stats when called by the MDT. I think 60-second averages are more useful, so we don't have to poll statfs so often; we could record the stats only if the last record is more than 60 seconds old, so we would effectively have 60-second epochs.
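
            A minimal sketch of the 60-second-epoch idea, with hypothetical structure and function names (nothing here is from the Lustre tree):

            #include <stdbool.h>
            #include <stdint.h>
            #include <time.h>

            #define PERF_EPOCH_SECS 60      /* take a new sample at most once per minute */

            struct perf_epoch {
                    time_t   pe_last;        /* start of the current epoch */
                    uint64_t pe_read_bytes;  /* cumulative counters snapshotted at pe_last */
                    uint64_t pe_write_bytes;
            };

            /* roll over to a new epoch only if the previous sample is old enough,
             * so frequent statfs calls keep measuring against the same snapshot */
            static bool perf_epoch_maybe_roll(struct perf_epoch *pe, time_t now,
                                              uint64_t read_bytes, uint64_t write_bytes)
            {
                    if (now - pe->pe_last < PERF_EPOCH_SECS)
                            return false;

                    pe->pe_last = now;
                    pe->pe_read_bytes = read_bytes;
                    pe->pe_write_bytes = write_bytes;
                    return true;
            }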

            adilger Andreas Dilger added a comment -

            We already track stats on the OST and MDT for RPCs, read/write calls with min/max duration, and read_bytes/write_bytes with sums. It should be fairly straightforward to use the existing stats counters to generate peak performance and decaying average performance, either directly or by doing simple delta calculations when statfs is called (e.g. save the last time and last stats and do a simple rate calculation over the past minute or whatever). The MDS is already calling statfs in the background every 5s, so that is often enough to keep this updated.
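
            A minimal sketch of that delta calculation, assuming a saved snapshot of the cumulative byte counters (all names are hypothetical, not the actual Lustre stats API):

            #include <stdint.h>
            #include <time.h>

            struct perf_snap {
                    time_t   ps_time;        /* when the snapshot was taken */
                    uint64_t ps_read_bytes;  /* cumulative counters at that time */
                    uint64_t ps_write_bytes;
            };

            /* report the average rate since the last snapshot in KiB/s, then
             * advance the snapshot so the next call measures the next interval */
            static void perf_rates_update(struct perf_snap *ps, time_t now,
                                          uint64_t read_bytes, uint64_t write_bytes,
                                          uint32_t *read_kibps, uint32_t *write_kibps)
            {
                    uint64_t delta = now > ps->ps_time ? (uint64_t)(now - ps->ps_time) : 0;

                    if (delta > 0) {
                            *read_kibps = (uint32_t)(((read_bytes - ps->ps_read_bytes) /
                                                      delta) >> 10);
                            *write_kibps = (uint32_t)(((write_bytes - ps->ps_write_bytes) /
                                                       delta) >> 10);
                    }

                    ps->ps_time = now;
                    ps->ps_read_bytes = read_bytes;
                    ps->ps_write_bytes = write_bytes;
            }

            A decaying average can then be built as an exponentially weighted combination of successive interval rates, and the peak is simply the maximum seen so far.
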
            bzzz Alex Zhuravlev added a comment - edited

            Do I understand correctly that the OFD should track performance on its own? Something like a separate thread (or timer-driven callback) collecting stats from the OSD and maintaining a history of average/peak throughput and RPC rate?

            AFAIU, we don't track an average for the last few seconds, just the average since start or reset.

            People

              Assignee: wc-triage WC Triage
              Reporter: adilger Andreas Dilger
              Votes: 0
              Watchers: 12
