Details
Technical task
Resolution: Fixed
Blocker
None
10057
Description
In a stress test I ran today, I created 40K files and archived them with 2 clients. The requests were queued on the MDT successfully, but this caused other problems.
The first problem is in the lprocfs implementation of agent_actions. The symptom is:
[root@mds01 ~]# lctl get_param mdt.*.hsm.agent_actions
error: get_param: read('/proc/fs/lustre/mdt/hsm-MDT0000/hsm/agent_actions') failed: Cannot allocate memory
I haven't looked into it yet, but I think the root cause is that the llog is too long, so the read runs into a problem for some reason.
I think the more severe problem is flow control. It's not good to keep requests in the queue for so long; at the very least we should have a parameter to limit the maximum length of the queue.
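As an illustration of the kind of limit suggested here, the following is a minimal sketch in C of admission control on a request queue. All names (`hsm_req_queue`, `max_queued`, `hsm_queue_admit`, `hsm_queue_release`) are hypothetical and are not taken from the Lustre source. The choice of `-EBUSY` as the refusal code and of 0 as "unlimited" are also assumptions made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical bounded queue of pending HSM requests (illustration only). */
struct hsm_req_queue {
	size_t queued;      /* requests currently waiting */
	size_t max_queued;  /* hypothetical tunable; 0 means unlimited */
};

static void hsm_queue_init(struct hsm_req_queue *q, size_t max_queued)
{
	assert(q != NULL);
	q->queued = 0;
	q->max_queued = max_queued;
}

/* Admit one request, or refuse with -EBUSY when the queue is full. */
static int hsm_queue_admit(struct hsm_req_queue *q)
{
	assert(q != NULL);
	if (q->max_queued != 0 && q->queued >= q->max_queued)
		return -EBUSY;
	q->queued++;
	return 0;
}

/* Called when a queued request is handed to an agent or cancelled. */
static void hsm_queue_release(struct hsm_req_queue *q)
{
	assert(q != NULL && q->queued > 0);
	q->queued--;
}
```

With a limit like this, a client submitting 40K archive requests would get an early error once the queue is full and could retry later, instead of every request sitting in the queue indefinitely.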
Another problem I saw in the test is:
LustreError: 27319:0:(mdt_coordinator.c:1418:mdt_hsm_update_request_state()) hsm-MDT0000: Cannot find running request for cookie 0x5226bb27 on fid=[0x200000400:0xee5:0x0]
LustreError: 27319:0:(mdt_coordinator.c:1418:mdt_hsm_update_request_state()) Skipped 74 previous similar messages
There were a huge number of these warnings. I will dig into it tomorrow.