Type: New Feature
Resolution: Fixed
Priority: Minor
Fix Version/s: Lustre 2.16.0
Affects Version/s: None
Labels:
None

In DDN-3356, by Andreas Dilger:

We don't have a tool to do this today, but it would make sense to write a simple tool "lljobstat" to show the top jobs on a server in order to simplify debugging of high load problems, since this is a reasonably frequent request.

It should be included with the base Lustre RPMs, so it must not have any complex external dependencies that are not included in the base OS distro (el7, el8, sles15, ubuntu22).

It should read the local "..job_stats" files (all by default, or selected with --ost or --mdt, or a specific job_stats file if given as an argument) at a 10s interval (configurable, either "-i N" or as the last argument) and print the top jobs, 5 by default (configurable with "-c N"), one line per job, similar to "iostat -x -k -z 10". It should show something useful when run with minimal arguments (e.g. just the interval), so that users can easily determine which jobs are driving the most load.
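The command line described above could be parsed roughly as follows. This is only a sketch using Python's standard argparse module: the function name `parse_args`, the program name, and the rule that a trailing all-digit argument is taken as the interval are assumptions made for illustration, not the behavior of the final tool.

```python
import argparse


def parse_args(argv=None):
    """Parse a hypothetical lljobstat-style command line (illustrative only)."""
    parser = argparse.ArgumentParser(prog="lljobstat-sketch")
    parser.add_argument("-i", dest="interval", type=int, default=10,
                        help="seconds between reports (default 10)")
    parser.add_argument("-c", dest="count", type=int, default=5,
                        help="number of top jobs to show (default 5)")
    parser.add_argument("--ost", action="store_true",
                        help="read OST job_stats files")
    parser.add_argument("--mdt", action="store_true",
                        help="read MDT job_stats files")
    parser.add_argument("args", nargs="*",
                        help="job_stats files, optionally followed by the interval")
    ns = parser.parse_args(argv)

    # Assumption: a trailing all-digit positional argument is the interval,
    # mirroring the "iostat ... 10" style mentioned in the description.
    files = list(ns.args)
    if files and files[-1].isdigit():
        ns.interval = int(files.pop())
    ns.files = files
    return ns
```

With no arguments this yields an interval of 10 and a count of 5, so running the tool with just an interval (for example `20`) stays a one-word command.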

Since job_stats has a large number of stats, it is not possible to fit all of them on a single 80-column line, so any operation with samples = 0 should not be shown. Display priority should be read and write (counts, if non-zero), then read_bytes and write_bytes (in MiB/s, if non-zero), then the top metadata ops by count. It probably makes sense to abbreviate the names, as llobdstat does, so that more fit on the line (cx: create, dx: destroy, st: statfs, pu: punch, etc.). The newer llstat and llobdstat check whether the terminal is wider than 80 columns and show more fields, but this doesn't have to be in the first version.
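The display priority above can be sketched as a small field-selection function. The input shape (a dict mapping each operation name to a dict with `samples` and, for byte counters, `sum`) is an assumption, as is the idea that the values are already per-interval deltas. The abbreviations `rd`, `wr`, `rMiB/s`, and `wMiB/s` are made up for this example; only `cx`, `dx`, `st`, and `pu` come from the description.

```python
# Abbreviations: cx/dx/st/pu are from the description; the rest are assumed.
ABBREV = {
    "read": "rd", "write": "wr",
    "read_bytes": "rMiB/s", "write_bytes": "wMiB/s",
    "create": "cx", "destroy": "dx", "statfs": "st", "punch": "pu",
}
PRIORITY = ("read", "write", "read_bytes", "write_bytes")


def select_fields(stats, interval, max_fields=8):
    """Return (label, value) pairs for one job, highest priority first."""
    fields = []
    # 1. read/write counts, only if non-zero
    for op in ("read", "write"):
        samples = stats.get(op, {}).get("samples", 0)
        if samples:
            fields.append((ABBREV[op], str(samples)))
    # 2. read/write bandwidth in MiB/s, only if non-zero
    for op in ("read_bytes", "write_bytes"):
        total = stats.get(op, {}).get("sum", 0)
        if total:
            fields.append((ABBREV[op], f"{total / (1 << 20) / interval:.1f}"))
    # 3. remaining ops with samples > 0, largest count first
    rest = [(op, val.get("samples", 0)) for op, val in stats.items()
            if op not in PRIORITY and val.get("samples", 0) > 0]
    rest.sort(key=lambda item: item[1], reverse=True)
    fields.extend((ABBREV.get(op, op), str(n)) for op, n in rest)
    # Truncate so the line has a chance of fitting in 80 columns.
    return fields[:max_fields]
```

The `max_fields` cut-off is a crude stand-in for a real width check; a later version could raise it when the terminal is wider than 80 columns.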

To determine the "top" jobs, it probably makes sense to sum the operations for the same job name across all watched job_stats files, then sort by total count of operations (read+write, but not bytes) and include this as the second item shown ("ops: N") after the job name ("job: name", with escaping/quoting if needed). The timestamp should be shown for each interval.
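A minimal sketch of the ranking step, assuming each job_stats file has already been parsed into a list of dicts with a `job_id` key and one `{"samples": N, ...}` dict per operation. The function name `top_jobs` is hypothetical, and skipping every `*_bytes` entry is just one reading of "but not bytes" above.

```python
from collections import defaultdict


def top_jobs(files_stats, count=5):
    """Sum per-job op counts across files and return the busiest jobs.

    files_stats: iterable of parsed job_stats files, each a list of job dicts
    (assumed shape, see the lead-in). Returns (job_id, total_ops, ops) tuples.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for jobs in files_stats:
        for job in jobs:
            for op, val in job.items():
                # Skip scalar fields such as job_id, and byte counters,
                # so that only operation counts contribute to the ranking.
                if not isinstance(val, dict) or op.endswith("_bytes"):
                    continue
                totals[job["job_id"]][op] += val.get("samples", 0)

    ranked = [(job_id, sum(ops.values()), dict(ops))
              for job_id, ops in totals.items()]
    ranked.sort(key=lambda entry: entry[1], reverse=True)
    return ranked[:count]
```

Summing before sorting means a job that is moderately busy on many targets can outrank one that is very busy on a single target, which matches the goal of finding the jobs driving the most load on the server as a whole.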

Given that the input is YAML, the output could also be YAML, but only if it can be formatted nicely for human readability (one line per job, no excessive quoting). The main users of this will be people, since monitoring tools will likely read and process all of the job_stats output directly.
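One possible output layout that is still valid YAML uses a flow mapping per job, so each job stays on one line. The sketch below avoids any external YAML library and quotes a job name only when it is not obviously a plain scalar. The layout, the key name `top_jobs`, and the quoting rule are assumptions for illustration, not a decided format.

```python
import json
import re

# Conservative rule (assumed): unquoted only if it starts with a letter and
# uses a small safe character set, and is not a YAML boolean/null keyword.
_PLAIN = re.compile(r"^[A-Za-z][A-Za-z0-9_.\-]*$")
_RESERVED = {"true", "false", "null", "yes", "no", "on", "off", "y", "n"}


def yaml_scalar(text):
    """Return text unquoted if clearly safe, else as a double-quoted string."""
    if _PLAIN.match(text) and text.lower() not in _RESERVED:
        return text
    # JSON string syntax is accepted as a YAML double-quoted scalar.
    return json.dumps(text)


def format_interval(timestamp, jobs):
    """Format one reporting interval.

    jobs: list of (job_name, total_ops, fields) where fields maps an
    abbreviated op name to a number, in display order.
    """
    lines = [f"timestamp: {timestamp}", "top_jobs:"]
    for name, ops, fields in jobs:
        items = [f"job: {yaml_scalar(name)}", f"ops: {ops}"]
        items += [f"{key}: {value}" for key, value in fields.items()]
        lines.append("- {" + ", ".join(items) + "}")
    return "\n".join(lines)
```

For example, `format_interval(1700000000, [("cp.1000", 42, {"rd": 30, "wr": 12})])` produces a `timestamp:` line, a `top_jobs:` line, and `- {job: cp.1000, ops: 42, rd: 30, wr: 12}`.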

Attachments:
glljobstat, 17/Aug/23 4:38 AM, 16 kB, Andreas Dilger
lljobstat, 18/Aug/23 7:03 AM, 8 kB, Bjoern Olausson

is related to

LU-19861 sanity test_850: Ubuntu24.04 cannot import name 'CLoader' from 'yaml' (Open)

is related to

LU-16231 Lustre stats header incorrectly using boot time (Resolved)
LU-16251 Fill jobid in an atomic way (Resolved)
LU-16110 Make output of jobs_stats and rename_stats valid YAML (Resolved)
LU-17352 Enhance lljobstat to read existing job_stats files (Resolved)

Assignee: Feng Lei
Reporter: Feng Lei
Votes: 0
Watchers: 10
Created: 09/Oct/22 6:54 AM
Updated: 07/Feb/26 5:02 PM
Resolved: 27/Jan/23 4:19 AM