[LU-8926] Race in job stats code results in untracked I/O - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Minor
Fix Version/s: Lustre 2.10.0
Affects Version/s: None
Labels:
None

Severity:
3
Rank (Obsolete):
9223372036854775807

Description

There is a race condition between updating the job id in lustre_get_jobid and setting the job id in outbound RPCs (primarily when reading the job id from an environment variable is enabled).

The function lustre_get_jobid is called near the beginning of every I/O (from vvp_io_init) to set the job id in the Lustre inode info (lli_jobid). The job id is then read back from there when building an RPC: osc_build_rpc calls cl_req_attr_set, which calls vvp_req_attr_set, and the value is finally used in lustre_msg_set_jobid.

lustre_get_jobid starts out by memsetting the jobid to zero, then re-reading it from the source. Since osc_build_rpc is asynchronous from this and happens in another thread, it can read the jobid at any time, including while it's zero.
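The clear-then-refill pattern described above can be sketched in a few lines of user-space C. This is only an illustration, not the Lustre code: shared_jobid, slow_lookup, rpc_reader_snapshot, racy_update and the buffer size are hypothetical names and values, and the concurrent RPC thread is simulated by a callback invoked at the point where another thread could run.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define JOBID_SIZE 32 /* assumed size, chosen for this sketch only */

/* Hypothetical stand-in for the per-inode job id buffer (lli_jobid). */
static char shared_jobid[JOBID_SIZE] = "job.1234";

/* Hypothetical stand-in for the slow lookup (cfs_get_environ in the ticket). */
static void slow_lookup(char *out, size_t size)
{
	assert(size > 0);
	strncpy(out, "job.1234", size - 1);
	out[size - 1] = '\0';
}

/* Length of the job id an RPC builder would see if it read the buffer now. */
static size_t observed_len;

static void rpc_reader_snapshot(void)
{
	observed_len = strlen(shared_jobid);
}

/*
 * Racy pattern: the shared buffer is zeroed first and refilled afterwards.
 * `concurrent` is called where another thread (osc_build_rpc in the ticket)
 * could run, so it observes whatever is in the buffer at that moment.
 */
static void racy_update(void (*concurrent)(void))
{
	memset(shared_jobid, 0, sizeof(shared_jobid));
	if (concurrent != NULL)
		concurrent();
	slow_lookup(shared_jobid, sizeof(shared_jobid));
}
```

Calling racy_update(rpc_reader_snapshot) leaves observed_len at 0: the simulated reader sees an empty job id even though the buffer held a valid value before the call and holds one again after it.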

Since cfs_get_environ is a very expensive operation, the job id stays zeroed for a long time on each call, so for small I/O operations an RPC is frequently built while it is still zero.

In particular, with 4k write operations, we see up to 2/3 of our IOs with a null job id, so they are not tracked.

Using a lock or other hard synchronization here would be far too expensive, and it's OK if job stats are occasionally inaccurate. So my proposed patch just cuts the window in which the jobid will be invalid from very large to very small. (Also, in practice, the job id should not change much in the cases we really care about, namely when set from a job scheduler.)
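One way to shrink the window without adding a lock is to do the slow lookup into a local buffer and touch the shared buffer only at the end, and only if the value changed. The sketch below illustrates that idea in user-space C under stated assumptions. It is not necessarily how the merged patch does it, and shared_jobid, lookup_result, slow_lookup, rpc_reader_snapshot and narrowed_update are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define JOBID_SIZE 32 /* assumed size, chosen for this sketch only */

/* Hypothetical stand-in for the per-inode job id buffer (lli_jobid). */
static char shared_jobid[JOBID_SIZE] = "job.1234";

/* Value the simulated slow lookup returns; tests may change it. */
static const char *lookup_result = "job.1234";

/* Hypothetical stand-in for the slow lookup (cfs_get_environ in the ticket). */
static void slow_lookup(char *out, size_t size)
{
	assert(size > 0);
	strncpy(out, lookup_result, size - 1);
	out[size - 1] = '\0';
}

/* Length of the job id an RPC builder would see if it read the buffer now. */
static size_t observed_len;

static void rpc_reader_snapshot(void)
{
	observed_len = strlen(shared_jobid);
}

/*
 * Narrowed-window pattern: the expensive lookup fills a local buffer, so
 * the shared buffer keeps its previous value for the whole slow part.
 * `concurrent` marks where another thread could run after the lookup.
 */
static void narrowed_update(void (*concurrent)(void))
{
	char tmp[JOBID_SIZE];

	memset(tmp, 0, sizeof(tmp));
	slow_lookup(tmp, sizeof(tmp));
	if (concurrent != NULL)
		concurrent();
	/* Write the shared buffer only when the job id actually changed. */
	if (strcmp(shared_jobid, tmp) != 0)
		memcpy(shared_jobid, tmp, sizeof(shared_jobid));
}
```

The race is still there in this sketch, since a reader can overlap the final memcpy and see a partly updated value, but the exposure shrinks from the duration of the lookup to the duration of one short copy, and it disappears entirely when the job id has not changed.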

Activity

Peter Jones added a comment - 24/Jan/17 2:33 PM

Landed for 2.10

Gerrit Updater added a comment - 24/Jan/17 5:23 AM

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/24253/
Subject: LU-8926 llite: reduce jobstats race window
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 8eca92b365fd3efd1541a48b1bb239926838d947

Gerrit Updater added a comment - 08/Dec/16 9:49 PM

Patrick Farrell (paf@cray.com) uploaded a new patch: https://review.whamcloud.com/24253
Subject: LU-8926 llite: reduce jobstats race window
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 6de41ac2a36d89fa5e6864c7497e922de24289cf

People

Assignee: Patrick Farrell (Inactive)

Reporter: Patrick Farrell (Inactive)

Votes: 0

Watchers: 4

Dates

Created: 08/Dec/16 9:29 PM

Updated: 24/Jan/17 2:33 PM

Resolved: 24/Jan/17 2:33 PM