[LU-14793] HSM: record the index when scanning the HSM action llog for new HSM requests Created: 25/Jun/21  Updated: 14/Jan/22  Resolved: 20/Nov/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.15.0

Type: Bug Priority: Minor
Reporter: Qian Yingjin Assignee: Qian Yingjin
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-13689 Replace cdt_state_lock with cdt_llog_... Open
is related to LU-13651 Conditionally skip finding compatible... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

After analyzing an HSM workload online, we found the following contention between HSM archive requests and the hsm_cdtr thread:

hsm_archive:
  mdt_hsm_add_actions
  mdt_hsm_register_hal
  mdt_agent_record_add
    down_write(&cdt->cdt_llog_lock);
    llog_cat_add()
    up_write(&cdt->cdt_llog_lock);


// hsm_cdtr kernel daemon thread:
mdt_coordinator()
  cdt_llog_process(mti->mti_env, mdt, mdt_coordinator_cb,
                   &hsd, 0, 0, WRITE);
    down_write(&cdt->cdt_llog_lock);
    llog_cat_process
    up_write(&cdt->cdt_llog_lock);

# top
PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 7675 root      20   0       0      0      0 R 100.0  0.0   8729:13 hsm_cdtr
    1 root      20   0   53984   6232   2604 S   0.0  0.0   3:17.86 systemd
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.62 kthreadd
    4 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/0:0H

The HSM archive path and the hsm_cdtr kernel thread both contend for cdt->cdt_llog_lock in order to add to or update the hsm_action llog.

The test used max_requests = 10000:

mdt.isg-tiny-MDT0000.hsm.max_requests=10000

This means the "hsm_cdtr" kernel thread scans 10000 llog actions while holding the write lock for the whole pass. Reducing max_requests to a small value (e.g. 256) should mitigate the problem. Alternatively, we can release the write lock after scanning a certain number of HSM actions (e.g. 128) in the llog, reschedule, and then re-acquire the write lock to continue scanning the HSM action llog.
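
The release-and-reacquire idea can be sketched in userspace C as follows. This is only an illustration under stated assumptions, not the patch that landed: struct mock_rwsem and its helpers stand in for cdt->cdt_llog_lock with down_write()/up_write(), sched_yield() stands in for a kernel reschedule point, and struct action_log, scan_in_batches() and SCAN_BATCH are hypothetical names.

```c
#include <assert.h>
#include <sched.h>
#include <stddef.h>

#define SCAN_BATCH 128	/* records handled per lock hold (value from the description) */

/* Hypothetical single-threaded stand-in for the kernel rw_semaphore. */
struct mock_rwsem {
	int held;
};

static void mock_down_write(struct mock_rwsem *sem)
{
	assert(!sem->held);	/* catch a double acquire in this sketch */
	sem->held = 1;
}

static void mock_up_write(struct mock_rwsem *sem)
{
	assert(sem->held);	/* catch an unbalanced release */
	sem->held = 0;
}

/* Hypothetical model of the HSM action log. */
struct action_log {
	struct mock_rwsem lock;	/* stand-in for cdt_llog_lock */
	size_t nr_records;	/* records currently in the log */
	size_t nr_handled;	/* records visited by the scanner */
};

/*
 * Scan every record, but drop the write lock after each `batch` records so
 * that a writer (the archive path) can append in between. Returns how many
 * times the lock was acquired.
 */
static size_t scan_in_batches(struct action_log *log, size_t batch)
{
	size_t idx = 0;
	size_t acquisitions = 0;
	int done;

	assert(batch > 0);
	do {
		size_t in_batch = 0;

		mock_down_write(&log->lock);
		acquisitions++;
		while (idx < log->nr_records && in_batch < batch) {
			log->nr_handled++;	/* placeholder for per-record work */
			idx++;
			in_batch++;
		}
		done = idx >= log->nr_records;
		mock_up_write(&log->lock);

		if (!done)
			sched_yield();	/* let other threads run before re-locking */
	} while (!done);

	return acquisitions;
}
```

With 10000 records and a batch of 128, the lock is taken 79 times instead of being held once for the whole scan. Because the log may change while the lock is dropped, a real implementation would also have to re-validate its position after re-acquiring the lock.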

In addition, when "hsm_cdtr" scans the hsm_action llog, we should record the latest cookie that has been handled, so that the next scan starts from the previous end position.
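
A minimal userspace sketch of resuming a scan from the recorded position is shown below. It assumes an append-only array of records and ignores record cancellation and catalog wrap-around, which a real llog has to handle. struct hsm_record, struct scan_state and scan_new_records() are hypothetical names, not Lustre APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical simplified log record. */
struct hsm_record {
	uint64_t cookie;	/* identifier of the HSM action */
};

/* State the scanner keeps between passes. */
struct scan_state {
	size_t next_idx;	/* first index that has not been scanned yet */
	uint64_t last_cookie;	/* cookie of the last record handled */
};

/*
 * Handle only the records appended since the previous pass.
 * `nr` is the current number of records in `recs`.
 * Returns how many records were handled in this pass.
 */
static size_t scan_new_records(const struct hsm_record *recs, size_t nr,
			       struct scan_state *st)
{
	size_t handled = 0;
	size_t i;

	assert(st->next_idx <= nr);	/* append-only assumption */
	for (i = st->next_idx; i < nr; i++) {
		st->last_cookie = recs[i].cookie;	/* remember progress */
		handled++;
	}
	st->next_idx = nr;	/* next pass starts where this one ended */

	return handled;
}
```

Each pass then costs time proportional to the number of new records rather than to the total length of the log, which shortens the time the write lock is held.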



 Comments   
Comment by Gerrit Updater [ 25/Jun/21 ]

Yingjin Qian (qian@ddn.com) uploaded a new patch: https://review.whamcloud.com/44077
Subject: LU-14793 hsm: record index for further HSM action scanning
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1238bf2e3a1f29cf9eb7d1af3610a709fbce3865

Comment by Gerrit Updater [ 20/Nov/21 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/44077/
Subject: LU-14793 hsm: record index for further HSM action scanning
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: a15a5432f8063e3a04a87d74eafac0060a8f9d26

Comment by Peter Jones [ 20/Nov/21 ]

Landed for 2.15

Generated at Sat Feb 10 03:12:50 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.