LU-15870

jobstats name is sometimes corrupted.

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.6
    • Labels: None
    • Severity: 3

    Description

      Sometimes the job_id reported by jobstats becomes corrupted. For example:

       

      lctl get_param obdfilter.*.job_stats | grep job_id | grep kworker | grep grp
      - job_id:          kworker/I/8j,team78-grp
      - job_id:          kworker/u515:16am78-grp
      - job_id:          kworker/u515:16.078-grp
      - job_id:          kworker/u515:0eam78-grp
      - job_id:          kworker/u516:4eam78-grp
      - job_id:          kworker/u515:6eam78-grp
      - job_id:          kworker/u515:3eam78-grp
      - job_id:          kworker/u515:0.0m78-grp
      - job_id:          kworker/u519:4eam78-grp
      - job_id:          kworker/u516:0eam78-grp
      - job_id:          kworker/u515:21am78-grp
      - job_id:          kworker/u514:1.0m78-grp
      - job_id:          kworker/u515:3.0m78-grp
      - job_id:          kworker/u58j,team78-grp
      - job_id:          kworker/u517:7eam78-grp
      - job_id:          kworker/u516:3.0m78-grp
      - job_id:          kworker/u516:5eam78-grp
      - job_id:          kworker/u517:4.0m78-grp
      - job_id:          kworker/u516:2eam78-grp
      - job_id:          kworker/u515:2eam78-grp
      - job_id:          kworker/u519:8eam78-grp
      - job_id:          kworker/u517:6eam78-grp
      - job_id:          kworker/u518:3eam78-grp
      - job_id:          kworker/u515:21.078-grp
      - job_id:          kworker/u519:1.0m78-grp
      - job_id:          kworker/u517:2.0m78-grp
      - job_id:          kworker/u523:1eam78-grp
      - job_id:          kworker/u513:6eam78-grp
      - job_id:          kworker/u516:1eam78-grp
      - job_id:          kworker/u528:4eam78-grp
      - job_id:          kworker/u515:2.0m78-grp
      - job_id:          kworker/u515:11.078-grp
      - job_id:          kworker/u516:12.078-grp
      - job_id:          kworker/u513:4eam78-grp
      - job_id:          kworker/u516:6eam78-grp
      - job_id:          kworker/u518:6eam78-grp
      ....
      17:27
      [root@lus24-oss6 ~]# lctl get_param obdfilter.*.job_stats | grep job_id | grep SA1
      - job_id:          SA1CUzMGBwYH,analysis-cgp
      - job_id:          SA1TSk+PO487,team311
      - job_id:          SA1FJ03/I/8j,team78-grp
      - job_id:          SA1FJ03/I/8j,tea0
      - job_id:          SA1FJ03/I/8j,te0
      - job_id:          SA1FrkxcBFwE,team113-grp
      - job_id:          SA1TNFAwOzA7,team311
      - job_id:          SA1FJ03/I/8j,t.0
      - job_id:          SA1FJ03/I/8j,t.
      - job_id:          SA1FJ03/u515:3.0
      - job_id:          SA1FJ03/u515:6.
      - job_id:          SA1FJ03/u515:7.
      - job_id:          SA1FJ03/u516:8.0
      - job_id:          SA1FJ03/I/8j,tea
      - job_id:          SA1FJ03/u513:0.
      - job_id:          SA1FJ03/u513:2.0
      - job_id:          SA1FJ03/u516:12.
      - job_id:          SA1FJ03/I/8j,team
      - job_id:          SA1FJ03/u518:7.
      - job_id:          SA1FJ03/u517:0.
      - job_id:          SA1FJ03/I/8j,te0m78-grp
      - job_id:          SA1CUzMGBwYH,analysis-cgp
      - job_id:          SA1TSk+PO487,team311
      - job_id:          SA1FJ03/I/8j,team78-grp
      - job_id:          SA1FJ03/I/8j,tea0
      - job_id:          SA1FJ03/I/8j,te0
      - job_id:          SA1FJ03/I/8j,t.
      - job_id:          SA1FJ03/u516:8.0
      - job_id:          SA1FJ03/I/8j,t.0
      - job_id:          SA1FrkxcBFwE,team113-grp
      - job_id:          SA1FJ03/u515:6.
      - job_id:          SA1FJ03/u515:7.
      - job_id:          SA1FJ03/u515:6.0
      - job_id:          SA1FJ03/I/8j,tea
      - job_id:          SA1FJ03/u513:0.
      - job_id:          SA1FJ03/u516:12.
      - job_id:          SA1FJ03/u515:19.0
      - job_id:          SA1FJ03/u515:16.
      - job_id:          SA1FJ03/I/8j,te.0
      - job_id:          SA1FJ03/u518:7.
      - job_id:          SA1FJ03/u517:0.
      - job_id:          SA1FJ03/I/8j,te0m78-grp

       

      Job IDs that start with SA1 come from our batch system; the format places the "project" after a comma. For example, "SA1FrkxcBFwE,team113-grp" looks good, while "SA1FJ03/I/8j,te0m78-grp" has been corrupted.
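The format described above can be checked mechanically. The following is a minimal sketch, not a definitive tool: it assumes a batch job ID is "SA1" plus nine base64-alphabet characters, a comma, and a project name, which is inferred from the sample output only. The names `BATCH_ID`, `KNOWN_PROJECTS`, `extract_job_ids` and `classify` are hypothetical, and the project list is simply the set of well-formed projects visible in the listing.

```python
import re

# Assumed shape of a batch-system job ID: "SA1", nine characters from the
# base64 alphabet, a comma, then the project. Inferred from the sample
# output above, not from any Lustre specification.
BATCH_ID = re.compile(r"^SA1[A-Za-z0-9+/]{9},(?P<project>.+)$")

# Hypothetical allow-list: the well-formed projects seen in the listing.
KNOWN_PROJECTS = {"analysis-cgp", "team311", "team78-grp", "team113-grp"}


def extract_job_ids(text):
    """Return the job_id values from 'lctl get_param ... job_stats' output."""
    ids = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("- job_id:"):
            # Split on the first colon only; job IDs may contain colons.
            ids.append(line.split(":", 1)[1].strip())
    return ids


def classify(job_id, known_projects=KNOWN_PROJECTS):
    """Label a job_id as 'ok', 'suspect', or 'other'."""
    if not job_id.startswith("SA1"):
        return "other"  # e.g. kworker entries; not judged by this sketch
    match = BATCH_ID.match(job_id)
    if match and match.group("project") in known_projects:
        return "ok"
    return "suspect"


if __name__ == "__main__":
    import sys

    # Usage: lctl get_param obdfilter.*.job_stats | python3 this_script.py
    for job_id in extract_job_ids(sys.stdin.read()):
        print(classify(job_id), job_id)
```

With this check, "SA1FrkxcBFwE,team113-grp" is labelled ok, while "SA1FJ03/I/8j,te0m78-grp" and "SA1FJ03/u515:3.0" are labelled suspect. Names that do not start with SA1, such as the kworker entries, are left as other because the sketch has no expected format for them.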

      I am also a bit surprised to see kworkers being reported; I would hope to see the I/O attributed to the job that started the work.

      This happens often enough that we have no confidence in showing our statistics to our users. I will include sample plots to show the effect: one from a period with no corruption and a later one from a period with corruption.

      Attachments

        Activity

          People

            Assignee: WC Triage
            Reporter: James Beal