[LU-15870] jobstats name is sometimes corrupted. Created: 18/May/22 Updated: 18/May/22 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Beal | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Sometimes the job stat name becomes corrupted. This is an example:
lctl get_param obdfilter.*.job_stats | grep job_id | grep kworker | grep grp
- job_id: kworker/I/8j,team78-grp
- job_id: kworker/u515:16am78-grp
- job_id: kworker/u515:16.078-grp
- job_id: kworker/u515:0eam78-grp
- job_id: kworker/u516:4eam78-grp
- job_id: kworker/u515:6eam78-grp
- job_id: kworker/u515:3eam78-grp
- job_id: kworker/u515:0.0m78-grp
- job_id: kworker/u519:4eam78-grp
- job_id: kworker/u516:0eam78-grp
- job_id: kworker/u515:21am78-grp
- job_id: kworker/u514:1.0m78-grp
- job_id: kworker/u515:3.0m78-grp
- job_id: kworker/u58j,team78-grp
- job_id: kworker/u517:7eam78-grp
- job_id: kworker/u516:3.0m78-grp
- job_id: kworker/u516:5eam78-grp
- job_id: kworker/u517:4.0m78-grp
- job_id: kworker/u516:2eam78-grp
- job_id: kworker/u515:2eam78-grp
- job_id: kworker/u519:8eam78-grp
- job_id: kworker/u517:6eam78-grp
- job_id: kworker/u518:3eam78-grp
- job_id: kworker/u515:21.078-grp
- job_id: kworker/u519:1.0m78-grp
- job_id: kworker/u517:2.0m78-grp
- job_id: kworker/u523:1eam78-grp
- job_id: kworker/u513:6eam78-grp
- job_id: kworker/u516:1eam78-grp
- job_id: kworker/u528:4eam78-grp
- job_id: kworker/u515:2.0m78-grp
- job_id: kworker/u515:11.078-grp
- job_id: kworker/u516:12.078-grp
- job_id: kworker/u513:4eam78-grp
- job_id: kworker/u516:6eam78-grp
- job_id: kworker/u518:6eam78-grp
....
17:27 [root@lus24-oss6 ~]# lctl get_param obdfilter.*.job_stats | grep job_id | grep SA1
- job_id: SA1CUzMGBwYH,analysis-cgp
- job_id: SA1TSk+PO487,team311
- job_id: SA1FJ03/I/8j,team78-grp
- job_id: SA1FJ03/I/8j,tea0
- job_id: SA1FJ03/I/8j,te0
- job_id: SA1FrkxcBFwE,team113-grp
- job_id: SA1TNFAwOzA7,team311
- job_id: SA1FJ03/I/8j,t.0
- job_id: SA1FJ03/I/8j,t.
- job_id: SA1FJ03/u515:3.0
- job_id: SA1FJ03/u515:6.
- job_id: SA1FJ03/u515:7.
- job_id: SA1FJ03/u516:8.0
- job_id: SA1FJ03/I/8j,tea
- job_id: SA1FJ03/u513:0.
- job_id: SA1FJ03/u513:2.0
- job_id: SA1FJ03/u516:12.
- job_id: SA1FJ03/I/8j,team
- job_id: SA1FJ03/u518:7.
- job_id: SA1FJ03/u517:0.
- job_id: SA1FJ03/I/8j,te0m78-grp
- job_id: SA1CUzMGBwYH,analysis-cgp
- job_id: SA1TSk+PO487,team311
- job_id: SA1FJ03/I/8j,team78-grp
- job_id: SA1FJ03/I/8j,tea0
- job_id: SA1FJ03/I/8j,te0
- job_id: SA1FJ03/I/8j,t.
- job_id: SA1FJ03/u516:8.0
- job_id: SA1FJ03/I/8j,t.0
- job_id: SA1FrkxcBFwE,team113-grp
- job_id: SA1FJ03/u515:6.
- job_id: SA1FJ03/u515:7.
- job_id: SA1FJ03/u515:6.0
- job_id: SA1FJ03/I/8j,tea
- job_id: SA1FJ03/u513:0.
- job_id: SA1FJ03/u516:12.
- job_id: SA1FJ03/u515:19.0
- job_id: SA1FJ03/u515:16.
- job_id: SA1FJ03/I/8j,te.0
- job_id: SA1FJ03/u518:7.
- job_id: SA1FJ03/u517:0.
- job_id: SA1FJ03/I/8j,te0m78-grp
The jobs whose IDs start with SA1 come from our batch system. The format puts the "project" after a comma, so for example "SA1FrkxcBFwE,team113-grp" looks correct, while "SA1FJ03/I/8j,te0m78-grp" has been corrupted. I am also a bit surprised to see kworker threads being reported; I would expect the I/O to be attributed to the job that started the work. This happens often enough that we have no confidence in showing these statistics to our users. I will include sample plots to show the effect: one from a period with no corruption and a later one from a period with corruption. |
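As a rough aid for spotting the affected entries, the check described above (an "SA1" identifier, a comma, then a known project) can be sketched in Python. This is only a sketch under stated assumptions: the pattern of "SA1" followed by nine base64-style characters is inferred from the samples in this report, and `known_projects` is a hypothetical list that a site would have to supply itself. Neither is part of Lustre.

```python
import re

# Assumption: a healthy batch job ID is "SA1" plus nine base64-style
# characters, a comma, and a project name (inferred from the samples above).
BATCH_ID = re.compile(r"^SA1[A-Za-z0-9+/]{9},(?P<project>.+)$")


def extract_job_ids(text):
    """Return every job_id value found in job_stats output."""
    return re.findall(r"job_id:\s*(\S+)", text)


def is_well_formed(job_id, known_projects):
    """True only if the ID matches the assumed shape and names a known project."""
    match = BATCH_ID.match(job_id)
    return match is not None and match.group("project") in known_projects


# Hypothetical project list, taken from the intact IDs in this report.
known_projects = {"analysis-cgp", "team311", "team78-grp", "team113-grp"}

sample = (
    "- job_id: SA1FrkxcBFwE,team113-grp\n"
    "- job_id: SA1FJ03/I/8j,te0m78-grp\n"
    "- job_id: SA1FJ03/u515:3.0\n"
)

for job_id in extract_job_ids(sample):
    status = "ok" if is_well_formed(job_id, known_projects) else "suspect"
    print(job_id, status)
```

The project list matters because several corrupted IDs, such as "SA1FJ03/I/8j,tea0", still have a plausible shape and would pass a pattern-only check.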