Lustre / LU-4935

Collect job stats by both procname_uid and scheduler job ID

Details

    • Improvement
    • Resolution: Fixed
    • Minor
    • 13639

    Description

      I have no insight into the code, so there may be some reasons this is unworkable, but here's what I'm thinking.

      1) If you set jobid_var to procname_uid, you capture every process that uses the file system. This is nice for debugging, but not that useful for job statistics, since processes in different jobs can easily share the same name and UID.

      2) If you set jobid_var to the variable for the scheduler of your choice, like SLURM_JOB_ID, you of course get per-job statistics. But commands issued outside the scheduler, for example by someone sitting on a submit node, aren't seen.

      Would it be possible to enable collection on both? If every request has the job_id, process name, and UID packed in, why not get it all?

      Say you have a job with a scheduler job ID of "123", run by uid 555, and it runs 2 processes, process1 and process2.

      Could you then have jobstats report job IDs that look like:

      123.process1.555
      123.process2.555

      A process 'myscript' run by uid 561 outside the scheduler could be:

      0.myscript.561
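A minimal sketch of the naming scheme proposed above, written in Python purely for illustration. `compose_jobid` is a hypothetical helper, not a Lustre function, and using "0" as the placeholder for "no scheduler job" simply follows the example in this request.

```python
from typing import Optional


def compose_jobid(scheduler_id: Optional[str], procname: str, uid: int) -> str:
    """Join scheduler job ID, process name, and UID with '.' as proposed."""
    # "0" stands in for processes that were not started by a scheduler.
    sched = scheduler_id if scheduler_id else "0"
    return f"{sched}.{procname}.{uid}"


print(compose_jobid("123", "process1", 555))  # 123.process1.555
print(compose_jobid("123", "process2", 555))  # 123.process2.555
print(compose_jobid(None, "myscript", 561))   # 0.myscript.561
```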

      Besides not losing the statistics for "non-scheduler" Lustre requests, you possibly gain a little more insight into your job if it's a multi-step type job.

      Finally, to take it to the extreme: consider that we run file systems which may be accessed by different schedulers, say Slurm and SGE on different systems (yes, this happens!). Why not include every possible scheduler scheme? So you end up with something like:

      SLURM_JOB_ID.JOB_ID.LSB_JOBID.LOADL_STEP_ID.PBS_JOBID.ALPS_APP_ID.procname.uid

      So the example above would be:

      123.0.0.0.0.0.process1.555
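The multi-scheduler layout above could be assembled from the environment roughly as follows. This is only a sketch under stated assumptions: `SCHEDULER_VARS` and `composite_jobid` are hypothetical names, the field order is the one listed in this request, and an unset or empty variable is assumed to map to "0".

```python
from typing import Mapping

# Field order taken from the layout proposed in this request.
SCHEDULER_VARS = (
    "SLURM_JOB_ID",
    "JOB_ID",
    "LSB_JOBID",
    "LOADL_STEP_ID",
    "PBS_JOBID",
    "ALPS_APP_ID",
)


def composite_jobid(env: Mapping[str, str], procname: str, uid: int) -> str:
    """Build one '.'-separated ID with a slot for every scheduler variable."""
    # Unset or empty scheduler variables are reported as "0" (assumption).
    fields = [env.get(var) or "0" for var in SCHEDULER_VARS]
    fields += [procname, str(uid)]
    return ".".join(fields)


print(composite_jobid({"SLURM_JOB_ID": "123"}, "process1", 555))
# 123.0.0.0.0.0.process1.555
```

Passing `os.environ` as `env` would read the variables of the current process; taking a plain mapping keeps the sketch easy to test.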

      I would not be surprised if this is potentially stupid. For one thing, it's overloading a variable to be an array of data. It's also using a character that is valid in filenames, ".", as a field separator.
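The separator concern can be made concrete with a small parsing sketch. `parse_composite_jobid` and `ParsedJobId` are hypothetical names. The sketch assumes exactly six scheduler fields and that none of them contains a ".", so that any extra dots belong to the process name. That assumption may not hold in practice (for instance, PBS job IDs often take the form sequence.server), which is exactly the weakness noted above.

```python
from typing import List, NamedTuple

N_SCHED_FIELDS = 6  # number of scheduler slots in the proposed layout


class ParsedJobId(NamedTuple):
    scheduler_ids: List[str]
    procname: str
    uid: int


def parse_composite_jobid(jobid: str) -> ParsedJobId:
    """Split a composite ID, tolerating dots inside the process name."""
    # Peel off the scheduler fields from the left, leaving "procname.uid".
    parts = jobid.split(".", N_SCHED_FIELDS)
    if len(parts) != N_SCHED_FIELDS + 1:
        raise ValueError(f"too few fields in {jobid!r}")
    # The UID is the last field; everything before it is the process name.
    procname, sep, uid = parts[-1].rpartition(".")
    if not sep or not procname or not uid.isdigit():
        raise ValueError(f"cannot find procname and uid in {jobid!r}")
    return ParsedJobId(parts[:N_SCHED_FIELDS], procname, int(uid))


parsed = parse_composite_jobid("123.0.0.0.0.0.python3.11.555")
print(parsed.procname, parsed.uid)  # python3.11 555
```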

      Scott

            People

              wc-triage WC Triage
              sknolin Scott Nolin
