Race in job stats code results in untracked I/O

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.10.0

    Description

      There is a race condition between updating the job id in lustre_get_jobid and setting the job id in outbound RPCs (primarily when getting the job id from an environment variable is enabled).

      The function lustre_get_jobid is used near the beginning of every I/O to set the job id in the Lustre inode info (lli_jobid, from vvp_io_init), and the job id is then read back from there when building an RPC (osc_build_rpc, cl_req_attr_set, vvp_req_attr_set, and finally lustre_msg_set_jobid).

      lustre_get_jobid starts out by memsetting the jobid to zero, then re-reading it from the source. Since osc_build_rpc is asynchronous from this and happens in another thread, it can read the jobid at any time, including while it's zero.
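To make the window concrete, here is a minimal user-space sketch of the zero-then-refill pattern. It is not Lustre code: JOBID_SIZE, window_hook_t, racy_update_jobid and snapshot_reader are hypothetical names, and the hook callback stands in for an RPC-building thread that happens to read the shared buffer while the slow lookup is still in progress.

```c
#include <string.h>

#define JOBID_SIZE 32	/* hypothetical buffer size, chosen for illustration */

/* Hypothetical hook: stands in for a concurrent reader of the shared jobid. */
typedef void (*window_hook_t)(const char *shared_jobid, void *arg);

/*
 * Zero the shared buffer first, then refill it from the source.
 * Any reader that runs between the two steps sees an empty job id.
 */
static void racy_update_jobid(char *shared, const char *src,
			      window_hook_t hook, void *arg)
{
	memset(shared, 0, JOBID_SIZE);		/* shared jobid is now empty */
	if (hook)
		hook(shared, arg);		/* reader runs during the slow lookup */
	strncpy(shared, src, JOBID_SIZE - 1);	/* lookup finishes, jobid restored */
}

/* Copies whatever the shared buffer holds, as an RPC builder would. */
static void snapshot_reader(const char *shared_jobid, void *arg)
{
	memcpy(arg, shared_jobid, JOBID_SIZE);
}
```

Calling racy_update_jobid(shared, "job.42", snapshot_reader, seen) leaves seen empty even though shared held "job.42" both before and after the call, which is how an RPC ends up with no job id.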

      Since cfs_get_environ is a very expensive operation, the jobid stays zeroed for a long time on every update, so with small IO operations an RPC is often built while it is still zero.

      In particular, with 4k write operations, we see up to 2/3 of our IOs with a null job id, so they are not tracked.

      Using a lock or other hard synchronization here would be far too expensive, and it's OK if job stats are occasionally inaccurate. So my proposed patch just cuts the window in which the jobid will be invalid from very large to very small. (Also, in practice, the job id should not change much in the cases we really care about, namely when set from a job scheduler.)
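One way to narrow the window without locking can be sketched as follows. This is an illustration of the idea only, not the merged patch: JOBID_BUF_SIZE, observe_fn, narrowed_update_jobid and copy_observer are hypothetical names. The slow lookup fills a private buffer, and the shared buffer is touched only at the end, and only if the value changed.

```c
#include <string.h>

#define JOBID_BUF_SIZE 32	/* hypothetical buffer size, chosen for illustration */

/* Hypothetical hook: stands in for a concurrent reader of the shared jobid. */
typedef void (*observe_fn)(const char *shared_jobid, void *arg);

/*
 * Do the slow lookup into a private buffer, then publish the result.
 * A reader that runs during the lookup still sees the previous job id.
 */
static void narrowed_update_jobid(char *shared, const char *src,
				  observe_fn hook, void *arg)
{
	char tmp[JOBID_BUF_SIZE] = { 0 };

	strncpy(tmp, src, JOBID_BUF_SIZE - 1);	/* slow lookup, private buffer */
	if (hook)
		hook(shared, arg);		/* reader runs during the slow lookup */

	/* Only write the shared buffer when the job id actually changed. */
	if (strcmp(shared, tmp) != 0)
		memcpy(shared, tmp, JOBID_BUF_SIZE);
}

/* Copies whatever the shared buffer holds, as an RPC builder would. */
static void copy_observer(const char *shared_jobid, void *arg)
{
	memcpy(arg, shared_jobid, JOBID_BUF_SIZE);
}
```

The final copy is still not atomic, so a reader can in principle see a partly written job id when the value changes. That residual window is only as long as one short copy, and when the job id is unchanged the shared buffer is never written at all.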

        Activity

          [LU-8926] Race in job stats code results in untracked I/O
          pjones Peter Jones added a comment -

          Landed for 2.10

          gerrit Gerrit Updater added a comment -

          Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/24253/
          Subject: LU-8926 llite: reduce jobstats race window
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 8eca92b365fd3efd1541a48b1bb239926838d947

          gerrit Gerrit Updater added a comment -

          Patrick Farrell (paf@cray.com) uploaded a new patch: https://review.whamcloud.com/24253
          Subject: LU-8926 llite: reduce jobstats race window
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 6de41ac2a36d89fa5e6864c7497e922de24289cf

          People

            Assignee: Patrick Farrell (Inactive)
            Reporter: Patrick Farrell (Inactive)