[LU-16228] create lljobstats command Created: 09/Oct/22 Updated: 07/Feb/24 Resolved: 27/Jan/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | New Feature | Priority: | Minor |
| Reporter: | Feng Lei | Assignee: | Feng Lei |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Issue Links: |
|
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||||||||||
| Description |
|
In DDN-3356, by Andreas Dilger:
|
| Comments |
| Comment by Feng Lei [ 09/Oct/22 ] |
|
adilger What about such a format of output?
timestamp: 20221010090000
jobs:
- {job: mkdir.100, ops: 3, cr: 1, dt: 2}
- {job: rm.101, ops: 1, dt: 1}
|
| Comment by Andreas Dilger [ 10/Oct/22 ] |
|
The timestamp should be Unix seconds, like the other timestamps reported by Lustre. That avoids time zone issues and simplifies log correlation. |
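A minimal Python sketch of the time zone point, not taken from lljobstat itself: the helper name `to_epoch` and the UTC+8 zone are illustrative assumptions. It shows that one formatted string can name two different instants depending on the assumed zone, which Unix seconds avoid.

```python
from datetime import datetime, timedelta, timezone

def to_epoch(stamp: str, tz: timezone) -> int:
    """Interpret a YYYYmmddHHMMSS string in the given zone and return Unix seconds."""
    local = datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(tzinfo=tz)
    return int(local.timestamp())

stamp = "20221010090000"  # the formatted timestamp proposed earlier in this ticket
as_utc = to_epoch(stamp, timezone.utc)
as_utc8 = to_epoch(stamp, timezone(timedelta(hours=8)))  # hypothetical server zone

# The same string maps to two instants that are eight hours apart.
print(as_utc - as_utc8)
```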
| Comment by Feng Lei [ 11/Oct/22 ] |
|
Command Synopsis:
lljobstat [-i|--interval NUM] [-c|--count NUM] [--mdt|--ost|--param PARAM_PATH]
-i NUM: interval in seconds, default 10
-c NUM: how many jobs are displayed, default 5
--mdt: check only mdt job_stats
--ost: check only ost job_stats
--param PARAM_PATH: check specified PARAM_PATH, e.g., *.lustre-*.job_stats
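The synopsis above could be expressed with `argparse` roughly as follows. This is only a sketch: it assumes the three selectors are mutually exclusive (as the `|` notation suggests), and `build_parser` is a hypothetical name, not necessarily how the landed tool is structured.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the synopsis; defaults are taken from the comment above."""
    parser = argparse.ArgumentParser(prog="lljobstat")
    parser.add_argument("-i", "--interval", type=int, default=10, metavar="NUM",
                        help="interval in seconds")
    parser.add_argument("-c", "--count", type=int, default=5, metavar="NUM",
                        help="how many jobs are displayed")
    # Assumption: only one source selector may be given at a time.
    source = parser.add_mutually_exclusive_group()
    source.add_argument("--mdt", action="store_true", help="check only mdt job_stats")
    source.add_argument("--ost", action="store_true", help="check only ost job_stats")
    source.add_argument("--param", metavar="PARAM_PATH",
                        help="check the specified PARAM_PATH")
    return parser

args = build_parser().parse_args(["-c", "3", "--param", "*.lustre-*.job_stats"])
print(args.interval, args.count, args.param)
```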
|
| Comment by Feng Lei [ 11/Oct/22 ] |
|
adilger Just to confirm: is snapshot_time designed to be uptime (seconds since the last OS boot) rather than wall-clock time? For example:
# lctl get_param *.*.job_stats | grep snapshot
snapshot_time: 5754772.790688109 secs.nsecs
It is significantly different from epoch seconds:
# date +%s
1665461988
But similar to system uptime:
# cat /proc/uptime
5755466.00 22244003.59
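The comparison above can be sketched in Python. The function name `looks_boot_relative` is hypothetical, and the heuristic (whichever reference the value is closer to) is just one simple way to tell the two clocks apart.

```python
def looks_boot_relative(snapshot_time: float, uptime: float, now: float) -> bool:
    """Return True if snapshot_time is closer to the system uptime than to the wall clock."""
    return abs(snapshot_time - uptime) < abs(snapshot_time - now)

# Values copied from the commands shown above.
snapshot_time = 5754772.790688109   # from job_stats
uptime = 5755466.00                 # first field of /proc/uptime
now = 1665461988                    # date +%s

print(looks_boot_relative(snapshot_time, uptime, now))
```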
|
| Comment by Andreas Dilger [ 11/Oct/22 ] |
|
No, the time should be the current Unix timestamp in seconds:
# lctl get_param llite.*.stats
llite.testfs-ffff89b1b9c27000.stats=
snapshot_time 1665476432.161461498 secs.nsecs
ioctl 502 samples [reqs]
getattr 290 samples [usec] 56 1059 48623 11761597
getxattr 2 samples [usec] 975 30159 31134 910515906
inode_permission 298 samples [usec] 61 566 52783 11517621
opencount 295 samples [reqs] 1 1 295 295
# date +%s
1665476439
There is a bug on master that the timestamp is incorrectly printing the boot-relative time instead of the wallclock time. See |
| Comment by Feng Lei [ 12/Oct/22 ] |
|
adilger Is such an output OK?
# ./lljobstat
# Abbr.:
# cr: create, op: open, cl: close, mn: mknod, lk: link,
# ul: unlink, mk: mkdir, rm: rmdir, mv: rename, ga: getattr,
# sa: setattr, gx: getxattr, sx: setxattr, st: statfs, sy: sync,
# rd: read, wr: write, pu: punch, mi: migrate, fa: fallocate,
# dt: destroy, gi: get_info, si: set_info, qc: quotactl, pa: prealloc,
timestamp: 1665557039
top jobs:
- touch.500: {ops: 6, op: 1, cl: 1, mn: 1, ga: 1, sa: 2}
- rm.0: {ops: 6, cl: 2, ul: 1, rm: 1, ga: 1, st: 1}
- chown.0: {ops: 3, ga: 2, sa: 1}
- bash.0: {ops: 2, ga: 2}
- mkdir.0: {ops: 2, mk: 1, st: 1}
|
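For illustration, the aggregation behind such an output might look like the sketch below. The function names and the sample records are hypothetical, and this is not the code from the patch: it simply sums per-operation counters by job, ranks jobs by total operations, and prints them in the same one-line form.

```python
from collections import Counter

def top_jobs(records, count=5):
    """Sum per-operation counters by job and return the `count` busiest jobs."""
    totals = {}
    for job, ops in records:
        totals.setdefault(job, Counter()).update(ops)
    ranked = sorted(totals.items(),
                    key=lambda item: sum(item[1].values()),
                    reverse=True)
    return ranked[:count]

def format_job(job, ops):
    """Render one job as '- name: {ops: total, abbr: n, ...}'."""
    fields = [f"ops: {sum(ops.values())}"] + [f"{k}: {v}" for k, v in ops.items()]
    return f"- {job}: {{{', '.join(fields)}}}"

# Hypothetical records, e.g. the same job seen on two different targets.
records = [
    ("touch.500", {"op": 1, "cl": 1}),
    ("touch.500", {"sa": 2}),
    ("bash.0", {"ga": 2}),
]
for job, ops in top_jobs(records, count=2):
    print(format_job(job, ops))
```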
| Comment by Andreas Dilger [ 12/Oct/22 ] |
|
Feng Lei, this looks mostly good. I would say that the comment is large enough that it shouldn't be printed each time; maybe just document the abbreviations in the man page, or print them if "-h" is used. I would suggest "ln" for link (to match the command name). The "top_jobs:" key should have an underscore so it is a single word, even though I know YAML does not require this, since it makes parsing easier with scripts (e.g. "awk '/keyname:/ { print $2 }'"). |
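A small Python sketch of why a single-word key helps field-based scripts. The helper `field_value` is hypothetical and only approximates the awk one-liner: it matches on the first whitespace-separated field and returns the second field of the first matching line.

```python
def field_value(text: str, key: str):
    """Return field 2 of the first line whose first field is 'key:', or None if no line matches."""
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] == key + ":":
            return fields[1] if len(fields) > 1 else ""
    return None

with_underscore = "timestamp: 1665557039\ntop_jobs:\n"
with_space = "timestamp: 1665557039\ntop jobs:\n"

print(field_value(with_underscore, "timestamp"))  # value follows the key on the same line
print(field_value(with_underscore, "top_jobs"))   # key found, no value on that line
print(field_value(with_space, "top jobs"))        # key is split across two fields, never matched
```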
| Comment by Gerrit Updater [ 17/Oct/22 ] |
|
"Feng Lei <flei@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/48888 |
| Comment by Andreas Dilger [ 25/Jan/23 ] |
|
It looks like the newly-added sanity.sh test_205e needs to add a version check for interop testing:
trevis-82vm3: sh: lljobstat: command not found
There is a version check in test_205d already. |
| Comment by Gerrit Updater [ 27/Jan/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48888/ |
| Comment by Peter Jones [ 27/Jan/23 ] |
|
Landed for 2.16 |
| Comment by Andreas Dilger [ 17/Aug/23 ] |
|
bolausson, I pushed the "simple" version of your patch, but it is reporting an error that is causing test failures:
lljobstat -n 1 -i 0 -c 1000
Traceback (most recent call last):
  File "/usr/bin/lljobstat", line 15, in <module>
    from yaml import CLoader as Loader, CDumper as Dumper
ImportError: cannot import name 'CLoader'
|
| Comment by Bjoern Olausson [ 17/Aug/23 ] |
|
See this link for the solution: https://github.com/yaml/pyyaml/issues/108#issuecomment-370459912
Essentially, libyaml-dev is missing on your system. It is required for the CLoader (which replaces the slow pure-Python loader).
Greetings, Bjoern |
| Comment by Andreas Dilger [ 17/Aug/23 ] |
|
Bjoern, is there a way to "try" loading the libyaml-dev CLoader, but fall back to the regular Loader if it is not installed? |
| Comment by Bjoern Olausson [ 17/Aug/23 ] |
|
Yes, this is possible with a try/except construct. The CLoader worked perfectly fine on a default EXAScaler 5.2.7 install.
python3 -m venv lljobstat
. ./lljobstat/bin/activate
python3 -m pip install pyyaml
python3 -m pip install paramiko
python3 -m pip install urllib3
By the way, I added my enhanced version to the DDNeu GitHub repo: Cheers, |
| Comment by Bjoern Olausson [ 17/Aug/23 ] |
|
Here the lines you would need to change:
#!/bin/env python3
'''
lljobstat command. Read job_stats files, parse and aggregate data of every
job on multiple OSS/MDS, show top jobs
'''
import argparse
import errno
import subprocess
import sys
import time
import yaml
import signal
import urllib3
import warnings
import configparser
from multiprocessing import Process, Queue, Pool, Manager, active_children, Pipe
from subprocess import Popen, PIPE, STDOUT
from pprint import pprint
from os.path import expanduser
from pathlib import Path

try:
    from yaml import CLoader as Loader, CDumper as Dumper
except ImportError:
    pass

warnings.filterwarnings(action='ignore',module='.*paramiko.*')
urllib3.disable_warnings()
[...]
|
| Comment by Feng Lei [ 18/Aug/23 ] |
It can be checked at runtime:
if hasattr(yaml, "CLoader"):
    yaml_obj = yaml.load(output, Loader=yaml.CLoader)
else:
    yaml_obj = yaml.safe_load(output)
| Comment by Bjoern Olausson [ 18/Aug/23 ] |
|
That works as well, but it has the disadvantage that you have to repeat the conditional check wherever you call yaml.load() in the code. This is only required once:
try:
    from yaml import CLoader as Loader
except ImportError:
    from yaml import Loader
And you could print a note once at each start of the program:
try:
    from yaml import CLoader as Loader
except ImportError:
    print("Install libyaml-dev for faster processing", file=sys.stderr)
    from yaml import Loader
Example:
(lljobstat) [root@n2admin1 bolausson]# ./glljobstat.py -n1 -c3
Install libyaml-dev for faster processing
---
timestamp: 1692341521
top_jobs:
- .0@n2oss4: {ops: 499163955, op: 11394216, cl: 41637516, mn: 9374215, ga: 191342407, sa: 88644483, gx: 6939749, sx: 146146, st: 2610083, sy: 36495657, rd: 65229069, wr: 42911419, pu: 2438995}
- .0@n2oss8: {ops: 473355574, op: 7909593, cl: 31620149, mn: 6376866, ga: 82344877, sa: 97854466, gx: 6512529, sx: 29034, st: 51, sy: 39334661, rd: 130433638, wr: 66882172, pu: 4057538}
- .0@n2oss7: {ops: 419629946, op: 7035889, cl: 27444959, mn: 5526838, ga: 78507580, sa: 94406102, gx: 5645268, sx: 20790, st: 34, sy: 37915437, rd: 93283959, wr: 66197236, pu: 3645854}
...
(lljobstat) [root@n2admin1 bolausson]#
Attached the modified lljobstat: Cheers, |
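As a further illustration of choosing the loader once at startup, the runtime `hasattr` check can be wrapped in a small helper. This is only a sketch: `pick_attr` is a hypothetical name, and the demonstration uses the stdlib `json` module as a stand-in so that it runs without PyYAML installed.

```python
import importlib

def pick_attr(module_name: str, preferred: str, fallback: str):
    """Return module.preferred if the module provides it, otherwise module.fallback."""
    module = importlib.import_module(module_name)
    if hasattr(module, preferred):
        return getattr(module, preferred)
    return getattr(module, fallback)

# Stand-in demonstration: `json` has no attribute named "CLoader",
# so the fallback `loads` is selected.
loader = pick_attr("json", "CLoader", "loads")
print(loader('{"ops": 3}'))
```

With PyYAML installed, the equivalent call would be `Loader = pick_attr("yaml", "CLoader", "Loader")`, after which `yaml.load(text, Loader=Loader)` can be used everywhere without repeating the check.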
| Comment by Andreas Dilger [ 18/Aug/23 ] |
|
I think the best approach is to Suggest: or Recommend: the faster libyaml-dev in lustre.spec.in (for all except el7.9 which doesn't support this, see other similar checks therein), and keep the try/except for fallback if it isn't installed. However, I do not think it makes sense to print a message in that case, as it breaks the output, and I don't think users care so much if it "just works" for them.
Feng Lei, can you please also backport the "fix YAML printing of jobstats" patches to b_es5_2 (there are about 3 of them, but not the stats header or histogram patches), so that we get proper quoting of the jobid name in the job_stats output. While the "@" substitution will fix the one case running with DDN Insight, it will not handle all cases of bad jobid names. |
| Comment by Bjoern Olausson [ 18/Aug/23 ] |
|
Makes sense. Thanks, Andreas! |
| Comment by Bjoern Olausson [ 19/Aug/23 ] |
|
Okay, now we are getting to something that is actually pretty useful: https://github.com/DDNeu/global-lustre-jobstats
It is several times faster! If you don't want all the bells and whistles because of the additional modules (paramiko), you might want to try the naive parser with parallel parsing instead of yaml.load(). It is a drop-in replacement; no other code changes are required.
My naive parser:
(lljobstat) [root@n2oss1 bolausson]# time ./glljobstat_testing.py -n 1 -c 2
SSH time : 0.837817907333374
Bjoern time : 2.07401442527771
---
timestamp: 1692439321
servers_queried: 8
total_jobs: 2601
top_2_jobs:
- 4635385@46526@n2cn0225: {ops: 589959692, rd: 589959689, wr: 3}
- @0@n2oss4: {ops: 485340474, op: 10540091, cl: 34838831, mn: 8118882, ga: 191221978, sa: 84975235, gx: 5400547, sx: 145756, st: 2610082, sy: 34832893, rd: 66827250, wr: 43403088, pu: 2425841}
...
real 0m4.603s
user 0m10.878s
sys 0m1.994s
yaml.load() with CLoader:
(lljobstat) [root@n2oss1 bolausson]# time ./glljobstat.py -n 1 -c 2
SSH time : 0.8781006336212158
yaml CLoader time: 9.084490060806274
---
timestamp: 1692439328
servers_queried: 8
total_jobs: 2601
top_2_jobs:
- 4635385@46526@n2cn0225: {ops: 589957196, rd: 589957193, wr: 3}
- .0@n2oss4: {ops: 485340452, op: 10540089, cl: 34838826, mn: 8118881, ga: 191221973, sa: 84975231, gx: 5400546, sx: 145756, st: 2610082, sy: 34832891, rd: 66827249, wr: 43403087, pu: 2425841}
...
real 0m11.095s
user 0m55.775s
sys 0m4.393s
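To put the two runs side by side, the ratios can be computed directly from the figures printed above. A small sketch follows; the `speedup` helper is a hypothetical name, and the numbers are copied from the two transcripts.

```python
def speedup(baseline_seconds: float, improved_seconds: float) -> float:
    """Return how many times faster the improved run is than the baseline."""
    return baseline_seconds / improved_seconds

# Parse time: "yaml CLoader time" versus "Bjoern time" from the runs above.
parse = speedup(9.084490060806274, 2.07401442527771)
# Wall-clock time: the "real" values reported by `time` (11.095 s versus 4.603 s).
wall = speedup(11.095, 4.603)

print(f"parse: {parse:.1f}x, wall clock: {wall:.1f}x")
```

Both runs queried the same 8 servers and reported the same 2601 jobs, so the comparison is between like workloads, although it reflects a single measurement of each variant.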
|
| Comment by Gerrit Updater [ 07/Feb/24 ] |
|
"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/doc/manual/+/53948 |
| Comment by Gerrit Updater [ 07/Feb/24 ] |
|
"Andreas Dilger <adilger@whamcloud.com>" merged in patch https://review.whamcloud.com/c/doc/manual/+/53948/ |