[LU-12872] Adding more stats into JOBSTATS Created: 17/Oct/19 Updated: 24/Apr/23 Resolved: 24/Apr/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Minor |
| Reporter: | Shuichi Ihara | Assignee: | Feng Lei |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
JOBSTATS has been very useful for understanding what type of IO is coming from an application per JOBID/UID/GID, but it would also be nice to have more stats in JOBSTATS (e.g. RPC size, discontiguous pages, and other information that is covered by "brw_stats" today) to understand the detailed IO workload and IO sizes per JOBID.
|
| Comments |
| Comment by Gerrit Updater [ 17/Aug/22 ] |
|
|
| Comment by Andreas Dilger [ 17/Aug/22 ] |
|
Shuichi, can you please confirm: I think the request here is to include the same or similar information from osd-ldiskfs..brw_stats in obdfilter..job_stats. My understanding is that the "RPC size" you requested is the bulk data size, and not the size of the RPCs themselves?

Feng Lei, it is important to note that, unlike some network filesystems, most Lustre bulk read/write RPCs sent from the clients do not contain the actual data, only a description of the object being read/written and the offsets and byte counts for each fragment in the request. Only in a few cases, when the read/write request is very small, is the data packed directly into the RPC. Normally, for larger RPCs (anything over 4KB), the server sets up the RDMA descriptors for the request only when it is processing the bulk read/write RPC, and then the IB HCA transfers the data directly from client memory to server memory. That avoids server memory being filled by data for requests that are still queued, and avoids copying data from the network request buffers to the filesystem pages for IO.

To fit into the job_stats file, this would need to add several sub-items to each of the job_stats entries:

obdfilter.testfs-OST0001.job_stats=
job_stats:
- job_id: grep.0
  snapshot_time : 4562769.380032450 secs.nsecs
  start_time : 4562769.053337605 secs.nsecs
  elapsed_time : 0.326694845 secs.nsecs
  read_bytes: { samples: 5, unit: bytes, min: 32768, max: 4194304, sum: 16777216, sumsq: 70096013754368 }
  write_bytes: { samples: 0, unit: bytes, min: 0, max: 0, sum: 0, sumsq: 0 }
  read: { samples: 5, unit: usecs, min: 136, max: 4072, sum: 11871, sumsq: 36286067 }
  write: { samples: 0, unit: usecs, min: 0, max: 0, sum: 0, sumsq: 0 }
  :
  brw_stats: {
    pages_per_bulk: { 8: nnn 16: mmm 32: ppp ... }
    discontig_pages: { }
    discontig_blocks: { }
    disk_fragments: { }
    disk_io_inflight: { }
    disk_io_time: { }
    disk_io_size: { }
  }
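For illustration only, one way to keep such nested histograms both compact and machine-parsable is to emit each one as a single-line YAML flow mapping. The following is a rough userspace sketch; the helper and variable names are hypothetical and are not the actual job_stats output code:

/* Sketch only: print a power-of-two histogram as a one-line YAML flow
 * mapping. The helper below is hypothetical and is not the actual
 * job_stats output code. */
#include <stdio.h>
#include <stdint.h>

/* buckets[i] holds the number of samples that fell near 2^i */
static void print_hist(const char *name, const uint64_t *buckets, int nr)
{
        int i, first = 1;

        printf("      %s: {", name);
        for (i = 0; i < nr; i++) {
                if (buckets[i] == 0)
                        continue;
                printf("%s %llu: %llu", first ? "" : ",",
                       1ULL << i, (unsigned long long)buckets[i]);
                first = 0;
        }
        printf(" }\n");
}

int main(void)
{
        /* e.g. 12 bulks of 8 pages, 3 of 16 pages, 40 of 256 pages */
        uint64_t pages_per_bulk[16] = { 0 };

        pages_per_bulk[3] = 12;         /* 2^3 = 8 pages */
        pages_per_bulk[4] = 3;          /* 2^4 = 16 pages */
        pages_per_bulk[8] = 40;         /* 2^8 = 256 pages */

        printf("    brw_stats:\n");
        print_hist("pages_per_bulk", pages_per_bulk, 16);
        return 0;
}

With the sample values above this prints "      pages_per_bulk: { 8: 12, 16: 3, 256: 40 }", which a YAML parser can consume while still being easy to scan by eye.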
I would like this output to still be reasonably easy for humans to read, though it should also be properly parsable, unlike the current brw_stats. However, Joe or someone more familiar with YAML parsing than I am should provide the actual layout, so that the existing job_stats parser does not explode if it sees these new fields. In LU-13123 I would also like to add a list of client NIDs that sent RPCs for this JobID, so that it can be isolated to specific clients (without the need to access an external job scheduler); this should also be taken into consideration. |
| Comment by Shuichi Ihara [ 18/Aug/22 ] |
|
Yes, we want to see the RPC/IO size for bulk IO to the OSTs in Jobstats (e.g. per JOBID, UID); even per NID would be useful, since such stats don't exist in the per-NID "export" stats today. |
| Comment by Feng Lei [ 19/Aug/22 ] |
|
I'm going to add:
enum {
        LPROCFS_CNTR_EXTERNALLOCK = 0x0001,
        LPROCFS_CNTR_AVGMINMAX = 0x0002,
        LPROCFS_CNTR_STDDEV = 0x0004,
        LPROCFS_CNTR_HISTGRAM = 0x0008,
        ...
        LPROCFS_CNTR_RPC_READ_PAGES = LPROCFS_TYPE_PAGES | LPROCFS_CNTR_HISTGRAM,
The output may look like:
rpc_read_pages: { samples: 0, unit: pages, min: 0, max: 0, sum: 0, sumsq: 0, histgram: {1: xxx, 2: yyy, 4: zzz, ...} }
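For illustration, here is a minimal userspace sketch of how a counter flagged this way could accumulate the usual samples/min/max/sum/sumsq together with power-of-two histogram buckets; the struct, field, and function names below are hypothetical and do not match the real lprocfs code:

/* Sketch only: a counter that keeps samples/min/max/sum/sumsq and, when a
 * histogram flag is set, power-of-two buckets. The names below are
 * hypothetical and do not match the real lprocfs structures. */
#include <stdint.h>

#define HIST_BUCKETS    32
#define CNTR_HISTGRAM   0x0008          /* stand-in for LPROCFS_CNTR_HISTGRAM */

struct hist_counter {
        unsigned int hc_flags;
        uint64_t hc_samples;
        uint64_t hc_min;
        uint64_t hc_max;
        uint64_t hc_sum;
        uint64_t hc_sumsq;
        uint64_t hc_buckets[HIST_BUCKETS];      /* values 0 and 1 share bucket 0 */
};

static void hist_counter_add(struct hist_counter *c, uint64_t value)
{
        c->hc_samples++;
        if (c->hc_samples == 1 || value < c->hc_min)
                c->hc_min = value;
        if (value > c->hc_max)
                c->hc_max = value;
        c->hc_sum += value;
        c->hc_sumsq += value * value;

        if (c->hc_flags & CNTR_HISTGRAM) {
                unsigned int idx = 0;

                /* bucket index is floor(log2(value)) for value >= 1 */
                while (value > 1 && idx < HIST_BUCKETS - 1) {
                        value >>= 1;
                        idx++;
                }
                /* dumped later as "histgram: { 1: xxx, 2: yyy, 4: zzz, ... }" */
                c->hc_buckets[idx]++;
        }
}

int main(void)
{
        struct hist_counter rpc_read_pages = { .hc_flags = CNTR_HISTGRAM };

        hist_counter_add(&rpc_read_pages, 256); /* a 256-page bulk read */
        hist_counter_add(&rpc_read_pages, 8);   /* an 8-page bulk read */
        return 0;
}

With something like this in place, dumping only the non-empty buckets would produce the kind of histgram: { ... } mapping proposed above, while the existing samples/min/max/sum/sumsq fields stay unchanged for current consumers.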
adilger sihara Please feel free to comment.
Move this topic to |
| Comment by Andreas Dilger [ 22/Aug/22 ] |
|
One thing I realized is that job_stats are tracked at the ofd level, while brw_stats are tracked at the osd level because they contain on-disk allocation information. That means the current job_stats file could add the RPC pages_per_bulk, latency, and discontiguous_pages histograms, but not the discontiguous disk blocks or disk IO size. I think that is probably OK, since we can check the disk fragmentation from the main brw_stats (these should not be specific to the application). |
| Comment by Andreas Dilger [ 24/Apr/23 ] |
|
Was handled by |