Details
-
Bug
-
Resolution: Won't Fix
-
Major
-
None
-
None
-
3
-
10716
Description
In a stress test I ran today, I created 40K files and archived them with 2 clients. The requests were queued on the MDT successfully, but this caused other problems.
The first problem is the lprocfs implementation of agent_actions. The symptom is:
[root@mds01 ~]# lctl get_param mdt.*.hsm.agent_actions
error: get_param: read('/proc/fs/lustre/mdt/hsm-MDT0000/hsm/agent_actions') failed: Cannot allocate memory
I have not looked into it yet, but I suspect the root cause is that the llog is too long, so reading it runs into a problem.
I think the more severe problem is flow control. It is not good to keep requests in the queue for so long; at the least, we should have a parameter that limits the maximum length of the queue.
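To make the flow control suggestion concrete, here is a minimal sketch of a request queue with a hard length cap. It is only an illustration of the idea, not the coordinator's code: `BoundedRequestQueue` and `max_len` are hypothetical names, and no such tunable is claimed to exist in Lustre.

```python
from collections import deque


class BoundedRequestQueue:
    """Hypothetical FIFO of pending HSM requests with a hard length cap."""

    def __init__(self, max_len):
        if max_len <= 0:
            raise ValueError("max_len must be positive")
        # Illustrative tunable only; not a real Lustre parameter.
        self.max_len = max_len
        self._pending = deque()

    def submit(self, request):
        """Queue a request, or return False so the caller can back off."""
        if len(self._pending) >= self.max_len:
            return False
        self._pending.append(request)
        return True

    def next_request(self):
        """Return the oldest pending request, or None if the queue is empty."""
        return self._pending.popleft() if self._pending else None

    def __len__(self):
        return len(self._pending)
```

With a cap like this, a client that submits 40K archive requests at once would get an immediate "queue full" answer instead of having its requests sit in the queue indefinitely.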
Another problem I saw in the test is that:
LustreError: 27319:0:(mdt_coordinator.c:1418:mdt_hsm_update_request_state()) hsm-MDT0000: Cannot find running request for cookie 0x5226bb27 on fid=[0x200000400:0xee5:0x0]
LustreError: 27319:0:(mdt_coordinator.c:1418:mdt_hsm_update_request_state()) Skipped 74 previous similar messages
There were a huge number of these warnings. I will dig into it tomorrow.
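To get a rough count of how many of these messages the MDS produced, the syslog can be summarized with a small script. The sketch below assumes the message format shown above and assumes that the "Skipped N previous similar messages" lines from the same call site refer to the same warning; `summarize`, `CALL_SITE`, and `SAMPLE` are illustrative names.

```python
import re
from collections import Counter

# Call site taken from the log lines quoted in this ticket.
CALL_SITE = "mdt_hsm_update_request_state()"
MISSING_RE = re.compile(
    r"Cannot find running request for cookie (0x[0-9a-fA-F]+) on fid=(\[[^\]]+\])"
)
SKIPPED_RE = re.compile(r"Skipped (\d+) previous similar messages")

SAMPLE = [
    "LustreError: 27319:0:(mdt_coordinator.c:1418:mdt_hsm_update_request_state()) "
    "hsm-MDT0000: Cannot find running request for cookie 0x5226bb27 "
    "on fid=[0x200000400:0xee5:0x0]",
    "LustreError: 27319:0:(mdt_coordinator.c:1418:mdt_hsm_update_request_state()) "
    "Skipped 74 previous similar messages",
]


def summarize(lines):
    """Return (logged messages per FID, total of rate-limited messages)."""
    per_fid = Counter()
    skipped = 0
    for line in lines:
        if CALL_SITE not in line:
            continue  # ignore messages from other call sites
        missing = MISSING_RE.search(line)
        if missing:
            per_fid[missing.group(2)] += 1
            continue
        suppressed = SKIPPED_RE.search(line)
        if suppressed:
            skipped += int(suppressed.group(1))
    return per_fid, skipped


if __name__ == "__main__":
    per_fid, skipped = summarize(SAMPLE)
    print(dict(per_fid), skipped)
```

The logged count plus the skipped count gives a lower bound on how often the coordinator failed to find the running request.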
I didn't find a new ticket for this, and this one looks closely related to a problem we saw yesterday. I hope I'm not wrong here.
There still seems to be a flow control issue with HSM requests, and I have a test case that seems to reproduce it reliably:
Environment: 1 MDS, 2 OSSs, 2 clients, RHEL 6.5, Lustre 2.6.0, all VMs hosted on KVM.
Boot machines, start Lustre services (MDS, OSS1, OSS2, mount on both clients)
Start copytool on client 2:
Start test script on client 1:
Run test script
while watching the MDS syslog:
and the copytool:
Immediately after the test, mdt/*/hsm/actions shows the following:
A second, later retry of the archive request works fine, so there are ways to work around this issue, but it would still be nice if archival of a changed file worked on the first try.
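The retry workaround could be scripted roughly as follows. This is a sketch under assumptions: `retry_archive` and its parameters are hypothetical, and `submit` stands for whatever the site uses to request archival and confirm the result (for example, a wrapper around `lfs hsm_archive` followed by a check with `lfs hsm_state`).

```python
import time


def retry_archive(submit, path, attempts=3, delay_s=5.0, sleep=time.sleep):
    """Call submit(path) until it returns True or the attempts run out.

    submit is a caller-supplied callable (hypothetical) that requests
    archival of path and returns True once the file is archived.
    Returns the number of the successful attempt, or None on failure.
    """
    for attempt in range(1, attempts + 1):
        if submit(path):
            return attempt
        if attempt < attempts:
            sleep(delay_s)  # give the coordinator time before retrying
    return None
```

Passing `sleep` as a parameter keeps the helper easy to exercise without a real delay or a real Lustre mount.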