[LU-13490] readahead thread breaks read stats in jobstats Created: 29/Apr/20  Updated: 23/Sep/21  Resolved: 14/May/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0
Fix Version/s: Lustre 2.14.0

Type: Bug Priority: Major
Reporter: Shuichi Ihara Assignee: Wang Shilong (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
Severity: 2
Rank (Obsolete): 9223372036854775807

 Description   

Parallel readahead was introduced by LU-12043 (commit c279167), and kernel threads now perform readahead in parallel, but this caused a regression that broke read stats in jobstats.
Here is a reproducer.

[root@mgs ~]# lctl conf_param vLustre.sys.jobid_var=procname_uid

[root@client ~]# ior -w -t 1m -b 1g -e -o /vLustre/out/file -k
[root@client ~]# echo 3 > /proc/sys/vm/drop_caches 
[root@client ~]# ior -r -t 1m -b 1g -e -o /vLustre/out/file -k

[root@oss1 ~]# lctl get_param obdfilter.*.job_stats
obdfilter.vLustre-OST0000.job_stats=
job_stats:
- job_id:          ior.0
  snapshot_time:   1588138284
  read_bytes:      { samples:          16, unit: bytes, min: 1048576, max: 4194304, sum:        62914560 }
  write_bytes:     { samples:         256, unit: bytes, min: 4194304, max: 4194304, sum:      1073741824 }
  getattr:         { samples:           0, unit:  reqs }
  setattr:         { samples:           0, unit:  reqs }
  punch:           { samples:           0, unit:  reqs }
  sync:            { samples:           1, unit:  reqs }
  destroy:         { samples:           0, unit:  reqs }
  create:          { samples:           0, unit:  reqs }
  statfs:          { samples:           0, unit:  reqs }
  get_info:        { samples:           0, unit:  reqs }
  set_info:        { samples:           0, unit:  reqs }
  quotactl:        { samples:           0, unit:  reqs }
- job_id:          kworker/u4:1.0
  snapshot_time:   1588138285
  read_bytes:      { samples:         135, unit: bytes, min: 4194304, max: 4194304, sum:       566231040 }
  write_bytes:     { samples:           0, unit: bytes, min:       0, max:       0, sum:               0 }
  getattr:         { samples:           0, unit:  reqs }
  setattr:         { samples:           0, unit:  reqs }
  punch:           { samples:           0, unit:  reqs }
  sync:            { samples:           0, unit:  reqs }
  destroy:         { samples:           0, unit:  reqs }
  create:          { samples:           0, unit:  reqs }
  statfs:          { samples:           0, unit:  reqs }
  get_info:        { samples:           0, unit:  reqs }
  set_info:        { samples:           0, unit:  reqs }
  quotactl:        { samples:           0, unit:  reqs }
- job_id:          kworker/u4:3.0
  snapshot_time:   1588138284
  read_bytes:      { samples:         106, unit: bytes, min: 4194304, max: 4194304, sum:       444596224 }
  write_bytes:     { samples:           0, unit: bytes, min:       0, max:       0, sum:               0 }
  getattr:         { samples:           0, unit:  reqs }
  setattr:         { samples:           0, unit:  reqs }
  punch:           { samples:           0, unit:  reqs }
  sync:            { samples:           0, unit:  reqs }
  destroy:         { samples:           0, unit:  reqs }
  create:          { samples:           0, unit:  reqs }
  statfs:          { samples:           0, unit:  reqs }
  get_info:        { samples:           0, unit:  reqs }
  set_info:        { samples:           0, unit:  reqs }
  quotactl:        { samples:           0, unit:  reqs }

Tracking read stats per kernel thread rather than per real application PID is a bad idea, because read stats can no longer be seen per job ID.
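To make the misattribution concrete, the read_bytes sums from the job_stats output above can be added up. This is only a small arithmetic check in Python; the numbers are copied from the output, and the comparison against 1 GiB assumes the ior block size given by -b 1g.

```python
# read_bytes "sum" values copied from the job_stats output above, keyed by job_id.
read_bytes = {
    "ior.0": 62914560,
    "kworker/u4:1.0": 566231040,
    "kworker/u4:3.0": 444596224,
}

# Total bytes read across all three job ids.
total = sum(read_bytes.values())

# Bytes that were accounted to kworker threads instead of the ior job.
kworker = sum(v for k, v in read_bytes.items() if k.startswith("kworker/"))

print(total)             # 1073741824
print(total == 1 << 30)  # True: matches the 1 GiB written and read back by ior
print(f"{kworker / total:.1%} of read bytes attributed to kworker threads")
```

The three entries together account for the full 1 GiB read, but only 60 MiB of it is reported under ior.0; the rest is reported under the two kworker job ids.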



 Comments   
Comment by Andreas Dilger [ 29/Apr/20 ]

There is an exception table for jobid that skips specific thread names, but I don't think it can work in this case. Maybe it would be better to check if the thread has PF_KTHREAD set? Also, I think it is possible to cache the jobid in struct ll_inode_info, so if that was set then it would report the correct jobid to the OSS.

Comment by Peter Jones [ 29/Apr/20 ]

Shilong

Could you please advise?

Thanks

Peter

Comment by Wang Shilong (Inactive) [ 30/Apr/20 ]

One possible way to solve this issue is to pass the original task_struct to job_id and use the passed task_struct to obtain the jobid information, which should fix this problem.
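The idea can be illustrated with a toy model in Python. This is not Lustre code and does not reflect the actual patch: the function names, the "ior.0" job id, and the use of a thread pool as a stand-in for the readahead kworker threads are all assumptions made for illustration. The point is only that the worker should account I/O under an identity captured from the submitting task, not under its own name.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

CHUNK = 4194304  # 4 MiB, the read size seen in the job_stats output


def account(stats, lock, nbytes, jobid=None):
    """Record nbytes under jobid; fall back to the worker's own name if none is passed."""
    if jobid is None:
        # Broken behaviour: the worker derives the job id from itself.
        jobid = f"{threading.current_thread().name}.0"
    with lock:
        stats[jobid] = stats.get(jobid, 0) + nbytes


def run(fixed, app_jobid="ior.0", chunks=4):
    """Submit readahead-like work to a pool; app_jobid is a hypothetical caller identity."""
    stats = {}
    lock = threading.Lock()
    # The pool stands in for kernel worker threads doing async readahead.
    with ThreadPoolExecutor(max_workers=2, thread_name_prefix="kworker") as pool:
        for _ in range(chunks):
            # Fixed behaviour: the submitter passes its own job id along with the work.
            pool.submit(account, stats, lock, CHUNK, app_jobid if fixed else None)
    return stats


if __name__ == "__main__":
    print("without passing caller identity:", run(fixed=False))
    print("with caller identity passed:   ", run(fixed=True))
```

In the first case the bytes end up under worker-thread names, as in the reproducer; in the second case they are all accounted to the submitting job.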

Comment by Gerrit Updater [ 30/Apr/20 ]

Wang Shilong (wshilong@ddn.com) uploaded a new patch: https://review.whamcloud.com/38426
Subject: LU-13490 lustre: fix to make jobstats work for async ra
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: cfdc2b04d3d89a1629e5df1dde882af34d18e4aa

Comment by Gerrit Updater [ 14/May/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38426/
Subject: LU-13490 lustre: fix to make jobstats work for async ra
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: da8972322134ae5741e7176312fca1f980c0f69a

Comment by Peter Jones [ 14/May/20 ]

Landed for 2.14

Generated at Sat Feb 10 03:01:43 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.