[LU-15870] jobstats name is sometimes corrupted. Created: 18/May/22  Updated: 18/May/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.6
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Beal Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Attachments: PNG File FireShot Capture 004 - Lustre stats Copy jb23 - Grafana - metrics.internal.sanger.ac.uk.png     PNG File FireShot Capture 005 - Lustre stats Copy jb23 - Grafana - metrics.internal.sanger.ac.uk.png    
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Sometimes the job stat name becomes corrupted. This is an example:

 

lctl get_param obdfilter.*.job_stats | grep job_id | grep kworker | grep grp
- job_id:          kworker/I/8j,team78-grp
- job_id:          kworker/u515:16am78-grp
- job_id:          kworker/u515:16.078-grp
- job_id:          kworker/u515:0eam78-grp
- job_id:          kworker/u516:4eam78-grp
- job_id:          kworker/u515:6eam78-grp
- job_id:          kworker/u515:3eam78-grp
- job_id:          kworker/u515:0.0m78-grp
- job_id:          kworker/u519:4eam78-grp
- job_id:          kworker/u516:0eam78-grp
- job_id:          kworker/u515:21am78-grp
- job_id:          kworker/u514:1.0m78-grp
- job_id:          kworker/u515:3.0m78-grp
- job_id:          kworker/u58j,team78-grp
- job_id:          kworker/u517:7eam78-grp
- job_id:          kworker/u516:3.0m78-grp
- job_id:          kworker/u516:5eam78-grp
- job_id:          kworker/u517:4.0m78-grp
- job_id:          kworker/u516:2eam78-grp
- job_id:          kworker/u515:2eam78-grp
- job_id:          kworker/u519:8eam78-grp
- job_id:          kworker/u517:6eam78-grp
- job_id:          kworker/u518:3eam78-grp
- job_id:          kworker/u515:21.078-grp
- job_id:          kworker/u519:1.0m78-grp
- job_id:          kworker/u517:2.0m78-grp
- job_id:          kworker/u523:1eam78-grp
- job_id:          kworker/u513:6eam78-grp
- job_id:          kworker/u516:1eam78-grp
- job_id:          kworker/u528:4eam78-grp
- job_id:          kworker/u515:2.0m78-grp
- job_id:          kworker/u515:11.078-grp
- job_id:          kworker/u516:12.078-grp
- job_id:          kworker/u513:4eam78-grp
- job_id:          kworker/u516:6eam78-grp
- job_id:          kworker/u518:6eam78-grp
....
[root@lus24-oss6 ~]# lctl get_param obdfilter.*.job_stats | grep job_id | grep SA1
- job_id:          SA1CUzMGBwYH,analysis-cgp
- job_id:          SA1TSk+PO487,team311
- job_id:          SA1FJ03/I/8j,team78-grp
- job_id:          SA1FJ03/I/8j,tea0
- job_id:          SA1FJ03/I/8j,te0
- job_id:          SA1FrkxcBFwE,team113-grp
- job_id:          SA1TNFAwOzA7,team311
- job_id:          SA1FJ03/I/8j,t.0
- job_id:          SA1FJ03/I/8j,t.
- job_id:          SA1FJ03/u515:3.0
- job_id:          SA1FJ03/u515:6.
- job_id:          SA1FJ03/u515:7.
- job_id:          SA1FJ03/u516:8.0
- job_id:          SA1FJ03/I/8j,tea
- job_id:          SA1FJ03/u513:0.
- job_id:          SA1FJ03/u513:2.0
- job_id:          SA1FJ03/u516:12.
- job_id:          SA1FJ03/I/8j,team
- job_id:          SA1FJ03/u518:7.
- job_id:          SA1FJ03/u517:0.
- job_id:          SA1FJ03/I/8j,te0m78-grp
- job_id:          SA1CUzMGBwYH,analysis-cgp
- job_id:          SA1TSk+PO487,team311
- job_id:          SA1FJ03/I/8j,team78-grp
- job_id:          SA1FJ03/I/8j,tea0
- job_id:          SA1FJ03/I/8j,te0
- job_id:          SA1FJ03/I/8j,t.
- job_id:          SA1FJ03/u516:8.0
- job_id:          SA1FJ03/I/8j,t.0
- job_id:          SA1FrkxcBFwE,team113-grp
- job_id:          SA1FJ03/u515:6.
- job_id:          SA1FJ03/u515:7.
- job_id:          SA1FJ03/u515:6.0
- job_id:          SA1FJ03/I/8j,tea
- job_id:          SA1FJ03/u513:0.
- job_id:          SA1FJ03/u516:12.
- job_id:          SA1FJ03/u515:19.0
- job_id:          SA1FJ03/u515:16.
- job_id:          SA1FJ03/I/8j,te.0
- job_id:          SA1FJ03/u518:7.
- job_id:          SA1FJ03/u517:0.
- job_id:          SA1FJ03/I/8j,te0m78-grp

 

The jobs whose IDs start with SA1 come from our batch system. The format puts the "project" after a comma, so for example "SA1FrkxcBFwE,team113-grp" looks good, while "SA1FJ03/I/8j,te0m78-grp" has been corrupted.
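
A minimal sketch of how the corrupted entries could be flagged automatically, assuming the format described above (an ID starting with SA1, then a comma, then a project name). The function name classify_job_ids and the KNOWN_PROJECTS set are hypothetical; the set only contains the project names that appear intact in the output above.

```python
import re

# Hypothetical allow-list: project names that appear intact in the output above.
KNOWN_PROJECTS = {"analysis-cgp", "team311", "team78-grp", "team113-grp"}

# Matches lines such as "- job_id:          SA1FrkxcBFwE,team113-grp".
JOB_ID_LINE = re.compile(r"^\s*-\s*job_id:\s*(\S+)\s*$")


def classify_job_ids(lines, known_projects=KNOWN_PROJECTS):
    """Split job_id values into (well_formed, suspect) lists."""
    good, suspect = [], []
    for line in lines:
        match = JOB_ID_LINE.match(line)
        if not match:
            continue  # not a job_id line, ignore it
        job_id = match.group(1)
        token, sep, project = job_id.partition(",")
        if token.startswith("SA1") and sep and project in known_projects:
            good.append(job_id)
        else:
            suspect.append(job_id)
    return good, suspect


sample = [
    "- job_id:          SA1FrkxcBFwE,team113-grp",
    "- job_id:          SA1FJ03/I/8j,te0m78-grp",
    "- job_id:          SA1FJ03/u515:3.0",
    "- job_id:          kworker/u515:3eam78-grp",
]
good, suspect = classify_job_ids(sample)
print("good:", good)
print("suspect:", suspect)
```

This treats anything that is not a batch job ID with a known project as suspect, so genuine non-batch job IDs would also land in the suspect list and would need to be filtered separately.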

I am also a bit surprised to see kworkers being reported; I would hope to see the I/O reported against whatever started the work.
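
Several of the corrupted values in the listings above look like one job ID written over the start of another, with the tail of the longer string left behind. The sketch below checks that reading against values copied from the output. It is only an observation about the strings, not a diagnosis of where this happens in Lustre. The helper name overlay is hypothetical, and the kworker prefixes are inferred from the corrupted values rather than seen intact.

```python
def overlay(old: str, new: str) -> str:
    """Return `new` written over the start of `old`, keeping the rest of `old`."""
    return new + old[len(new):]


batch_id = "SA1FJ03/I/8j,team78-grp"  # intact value from the output

# (prefix assumed to have been written, corrupted value seen in the output)
cases = [
    ("kworker/u515:3", "kworker/u515:3eam78-grp"),
    ("kworker/u515:16.0", "kworker/u515:16.078-grp"),
]
for prefix, observed in cases:
    result = overlay(batch_id, prefix)
    print(prefix, "->", result, "matches observed:", result == observed)

# The same splice in the other direction: a batch prefix over a kworker-style ID.
print(overlay("kworker/u515:3.0", "SA1FJ03"))
```

If this reading is right, the two kinds of job ID are being mixed in both directions, which would explain both the kworker entries ending in "78-grp" and the SA1 entries ending in "/u515:3.0".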

This happens often enough that we have no confidence in showing our statistics to our users. I will include sample plots to show the effect: one from when there was no corruption and a later one from when there was.

