HSM _not only_ small fixes and to do list goes here (LU-3647)

[LU-3876] flow control of HSM requests Created: 04/Sep/13  Updated: 24/Sep/13  Resolved: 24/Sep/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.5.0

Type: Technical task Priority: Blocker
Reporter: Jinshan Xiong (Inactive) Assignee: Jinshan Xiong (Inactive)
Resolution: Fixed Votes: 0
Labels: HSM

Rank (Obsolete): 10057

 Description   

In a stress test I ran today, I created 40K files and archived them with 2 clients. The requests were queued on the MDT successfully, but this caused other problems.

The first problem is the lprocfs implementation of agent_actions. The symptom is:

[root@mds01 ~]# lctl get_param mdt.*.hsm.agent_actions
error: get_param: read('/proc/fs/lustre/mdt/hsm-MDT0000/hsm/agent_actions') failed: Cannot allocate memory

I haven't looked into it yet, but I suspect the root cause is that the llog is so long that the read fails.
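To illustrate the suspicion above: a dump that builds one buffer for the whole log grows with the number of queued requests, while a dump that emits bounded chunks does not. The following is a minimal Python sketch of the chunked approach, not the Lustre lprocfs code and not the actual fix; `dump_in_chunks`, `records` and `chunk_chars` are hypothetical names chosen for the example.

```python
def dump_in_chunks(records, chunk_chars=4096):
    """Yield text chunks built from records, one line per record.

    Each chunk holds whole lines and stays within chunk_chars, unless a
    single line is longer than the limit on its own.
    """
    chunk = []
    size = 0
    for record in records:
        line = f"{record}\n"
        # Flush the current chunk before it would exceed the limit.
        if chunk and size + len(line) > chunk_chars:
            yield "".join(chunk)
            chunk, size = [], 0
        chunk.append(line)
        size += len(line)
    if chunk:
        yield "".join(chunk)


if __name__ == "__main__":
    # 40K placeholder records, mirroring the size of the stress test.
    fake_records = (f"record {i}" for i in range(40000))
    for piece in dump_in_chunks(fake_records):
        pass  # a real reader would write each piece out as it arrives
```

With this shape, peak memory depends on `chunk_chars` rather than on how many requests are queued.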

I think the more severe problem is flow control. Requests should not sit in the queue for so long; at the very least, we need a parameter that limits the maximum length of the queue.
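As a rough illustration of what such a parameter could mean, the sketch below models a request queue that refuses new submissions once a configurable limit is reached. It is a toy model in Python, not coordinator code; `BoundedRequestQueue` and `max_queued` are hypothetical names and do not correspond to any existing Lustre tunable.

```python
from collections import deque


class BoundedRequestQueue:
    """Toy model of a request queue with a configurable maximum length."""

    def __init__(self, max_queued):
        if max_queued < 1:
            raise ValueError("max_queued must be at least 1")
        # Hypothetical tunable: upper bound on queued requests.
        self.max_queued = max_queued
        self._pending = deque()

    def submit(self, request):
        """Queue a request; return False if the queue is already full."""
        if len(self._pending) >= self.max_queued:
            return False  # the caller is expected to back off and retry
        self._pending.append(request)
        return True

    def next_request(self):
        """Pop the oldest queued request, or None if the queue is empty."""
        return self._pending.popleft() if self._pending else None

    def __len__(self):
        return len(self._pending)


if __name__ == "__main__":
    queue = BoundedRequestQueue(max_queued=2)
    for name in ("file-a", "file-b", "file-c"):
        accepted = queue.submit(name)
        print(name, "queued" if accepted else "rejected, queue full")
```

Rejecting at submission time pushes the backlog back to the clients instead of letting it accumulate on the server; whether rejection, blocking, or throttling is the right policy is left open here.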

Another problem I saw in the test is that:

LustreError: 27319:0:(mdt_coordinator.c:1418:mdt_hsm_update_request_state()) hsm-MDT0000: Cannot find running request for cookie 0x5226bb27 on fid=[0x200000400:0xee5:0x0]
LustreError: 27319:0:(mdt_coordinator.c:1418:mdt_hsm_update_request_state()) Skipped 74 previous similar messages

There were a huge number of these messages. I will dig into it tomorrow.



 Comments   
Comment by Jinshan Xiong (Inactive) [ 10/Sep/13 ]

The patch is at: http://review.whamcloud.com/7589

It only fixes the ENOMEM problem. More work will be needed to add flow control.

Comment by John Hammond [ 10/Sep/13 ]

From the autotest logs, I have also seen this file return -EIO, causing sanity-hsm test 40 to pass when it should have failed. Does anyone have any idea why it might do so?

Comment by Jinshan Xiong (Inactive) [ 18/Sep/13 ]

In 2.5, we're only going to fix the problem of dumping a huge number of agent_actions. Real flow control will be addressed in 2.6 due to limited resources.

Comment by Jodi Levi (Inactive) [ 24/Sep/13 ]

Patch landed to Master. Follow-on work for 2.6 is being tracked in LU-4004.
