Lustre / LU-11936

High ldlm load, slow/unusable filesystem


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Blocker
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.0
    • Labels: None
    • Environment: CentOS 7.6, Lustre 2.12.0, 3.10.0-957.1.3.el7_lustre.x86_64
    • Severity: 3

    Description

      We upgraded our cluster Sherlock to Lustre 2.12 and put Fir (our Lustre 2.12 servers) into production yesterday, but this morning the filesystem is unusable due to a very high load of ldlm threads on the MDS servers.
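
      For reference, a rough way to gauge this kind of ldlm load on an MDS is sketched below (standard lctl/procfs paths; the exact namespace names depend on the targets):

          # per-namespace lock counts held by the ldlm on this server
          lctl get_param ldlm.namespaces.*.lock_count
          # granted locks as seen by the ldlm pool for each namespace
          lctl get_param ldlm.namespaces.*.pool.granted
          # count threads stuck in uninterruptible sleep (D state)
          ps -eo state,comm | awk '$1=="D"' | sort | uniq -c | sort -rn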

      I can see plenty of these on the MDS:

      [Wed Feb  6 09:19:25 2019][1695265.110496] LNet: Service thread pid 35530 was inactive for 350.85s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [Wed Feb  6 09:19:25 2019][1695265.127609] Pid: 35530, comm: mdt02_024 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
      [Wed Feb  6 09:19:25 2019][1695265.137522] Call Trace:
      [Wed Feb  6 09:19:25 2019][1695265.140184]  [<ffffffffc0e3a0bd>] ldlm_completion_ast+0x63d/0x920 [ptlrpc]
      [Wed Feb  6 09:19:25 2019][1695265.147339]  [<ffffffffc0e3adcc>] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]
      [Wed Feb  6 09:19:25 2019][1695265.154723]  [<ffffffffc15164ab>] mdt_object_local_lock+0x50b/0xb20 [mdt]
      [Wed Feb  6 09:19:25 2019][1695265.161731]  [<ffffffffc1516b30>] mdt_object_lock_internal+0x70/0x3e0 [mdt]
      [Wed Feb  6 09:19:25 2019][1695265.168906]  [<ffffffffc1517d1a>] mdt_getattr_name_lock+0x90a/0x1c30 [mdt]
      [Wed Feb  6 09:19:25 2019][1695265.176002]  [<ffffffffc151fbb5>] mdt_intent_getattr+0x2b5/0x480 [mdt]
      [Wed Feb  6 09:19:25 2019][1695265.182750]  [<ffffffffc151ca18>] mdt_intent_policy+0x2e8/0xd00 [mdt]
      [Wed Feb  6 09:19:25 2019][1695265.189403]  [<ffffffffc0e20ec6>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
      [Wed Feb  6 09:19:25 2019][1695265.196349]  [<ffffffffc0e498a7>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
      [Wed Feb  6 09:19:25 2019][1695265.203643]  [<ffffffffc0ed0302>] tgt_enqueue+0x62/0x210 [ptlrpc]
      [Wed Feb  6 09:19:25 2019][1695265.210001]  [<ffffffffc0ed735a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      [Wed Feb  6 09:19:25 2019][1695265.217133]  [<ffffffffc0e7b92b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [Wed Feb  6 09:19:25 2019][1695265.225045]  [<ffffffffc0e7f25c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
      [Wed Feb  6 09:19:25 2019][1695265.231563]  [<ffffffffadcc1c31>] kthread+0xd1/0xe0
      [Wed Feb  6 09:19:25 2019][1695265.236657]  [<ffffffffae374c24>] ret_from_fork_nospec_begin+0xe/0x21
      [Wed Feb  6 09:19:25 2019][1695265.243306]  [<ffffffffffffffff>] 0xffffffffffffffff
      [Wed Feb  6 09:19:25 2019][1695265.248518] LustreError: dumping log to /tmp/lustre-log.1549473603.35530
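
      The last line above refers to a binary Lustre debug dump; it can be converted to readable text with lctl, for example (input path taken from the message above, output filename is arbitrary):

          lctl debug_file /tmp/lustre-log.1549473603.35530 /tmp/lustre-log.1549473603.35530.txt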
      

      On Fir, we have two Lustre 2.12 MDS servers, fir-md1-s1 and fir-md1-s2, each with 2 MDTs. I dumped the current tasks to the console using sysrq and I'm attaching the full console log for both MDS servers. The servers don't crash, but clients become unusable or blocked; from time to time we can still access the filesystem. Any help would be appreciated. Thanks!
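
      (For completeness, the task dump was triggered with the standard sysrq interface on each MDS, roughly as below; the resulting console/dmesg output is what is attached:)

          # dump the state and stack of all tasks to the kernel log
          echo t > /proc/sysrq-trigger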

      Stephane

      Attachments

        1. fir-md1-s1_20191022.log
          2.94 MB
        2. fir-md1-s1-console.log
          4.22 MB
        3. fir-md1-s2-console.log
          2.85 MB
        4. lustre-log.1549473603.35530.gz
          1.08 MB


            People

              Assignee: ssmirnov (Serguei Smirnov)
              Reporter: sthiell (Stephane Thiell)
              Votes: 0
              Watchers: 9
