Details

    • Technical task
    • Resolution: Fixed
    • Blocker
    • Lustre 2.5.0
    • None
    • 10057

    Description

      In a stress test I did today, I created 40K files and archive them with 2 clients. The requests were queued into MDT successfully but it caused other problems.

      the first problem is the lprocfs implementation of agent_action. The symptom is:

      [root@mds01 ~]# lctl get_param mdt.*.hsm.agent_actions
      error: get_param: read('/proc/fs/lustre/mdt/hsm-MDT0000/hsm/agent_actions') failed: Cannot allocate memory
      

      Though I didn't look at it yet, I think the root cause is that the llog is too long so it ran into a problem for some reason.

      I think the more severe problem is flow control. It's not good to keep the requests in queue so much long, at least we should have a parameter to control how long the maximum length of queue will be.

      Another problem I saw in the test is that:

      LustreError: 27319:0:(mdt_coordinator.c:1418:mdt_hsm_update_request_state()) hsm-MDT0000: Cannot find running request for cookie 0x5226bb27 on fid=[0x200000400:0xee5:0x0]
      LustreError: 27319:0:(mdt_coordinator.c:1418:mdt_hsm_update_request_state()) Skipped 74 previous similar messages
      

      There were a huge number of this warning. I will dig it tomorrow

      Attachments

        Activity

          People

            jay Jinshan Xiong (Inactive)
            jay Jinshan Xiong (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: