Lustre / LU-13447

Sudden slow file create (MDS problem)

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.4
    • Labels: None
    • Environment: CentOS 7.6
    • Severity: 3

    Description

      We hit a sudden MDS problem on Lustre 2.12.4, a few hours after LU-13442. Users reported slow file creation, for example:

      sh02-ln03:bp86 09:03:28> time touch $SCRATCH/asdf
      
      real 0m26.923s
      user 0m0.000s
      sys 0m0.003s
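
      To track this symptom over time rather than by hand, the create latency can be sampled from a script. The following is a minimal sketch in Python, not an existing Lustre tool; the function name `time_create` and the probe file name are made up for illustration.

```python
import os
import time


def time_create(directory, name="create_probe.tmp"):
    """Return the seconds taken to create and close a new empty file.

    This times only the create step, which is the part of `touch`
    that was slow in the report above.
    """
    path = os.path.join(directory, name)
    start = time.monotonic()
    # O_EXCL forces a real create instead of opening an existing file.
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    os.close(fd)
    elapsed = time.monotonic() - start
    os.unlink(path)  # clean up so the probe can be repeated
    return elapsed
```

      For example, `time_create(os.environ["SCRATCH"])` would probe the same directory as the `touch` above.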
       

      Looking at the MDS in question, we could see backtraces like these:

      [1803096.937756] LNet: Service thread pid 41461 completed after 237.28s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      [1805310.391443] LNet: Service thread pid 20879 was inactive for 200.02s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [1805310.408554] Pid: 20879, comm: mdt03_010 3.10.0-957.27.2.el7_lustre.pl2.x86_64 #1 SMP Thu Nov 7 15:26:16 PST 2019
      [1805310.418900] Call Trace:
      [1805310.421543]  [<ffffffffc1855fa8>] osp_precreate_reserve+0x2e8/0x800 [osp]
      [1805310.428575]  [<ffffffffc184a949>] osp_declare_create+0x199/0x5f0 [osp]
      [1805310.435312]  [<ffffffffc179269f>] lod_sub_declare_create+0xdf/0x210 [lod]
      [1805310.442330]  [<ffffffffc178a86e>] lod_qos_declare_object_on+0xbe/0x3a0 [lod]
      [1805310.449586]  [<ffffffffc178d80e>] lod_alloc_rr.constprop.19+0xeee/0x1490 [lod]
      [1805310.457012]  [<ffffffffc179192d>] lod_qos_prep_create+0x12fd/0x1890 [lod]
      [1805310.464007]  [<ffffffffc177296a>] lod_declare_instantiate_components+0x9a/0x1d0 [lod]
      [1805310.472042]  [<ffffffffc1785725>] lod_declare_layout_change+0xb65/0x10f0 [lod]
      [1805310.479468]  [<ffffffffc17f7f82>] mdd_declare_layout_change+0x62/0x120 [mdd]
      [1805310.486724]  [<ffffffffc1800ec6>] mdd_layout_change+0xb46/0x16a0 [mdd]
      [1805310.493473]  [<ffffffffc166135f>] mdt_layout_change+0x2df/0x480 [mdt]
      [1805310.500130]  [<ffffffffc16697d0>] mdt_intent_layout+0x8a0/0xe00 [mdt]
      [1805310.506787]  [<ffffffffc1666d35>] mdt_intent_policy+0x435/0xd80 [mdt]
      [1805310.513459]  [<ffffffffc0ffbe06>] ldlm_lock_enqueue+0x356/0xa20 [ptlrpc]
      [1805310.520394]  [<ffffffffc10244f6>] ldlm_handle_enqueue0+0xa56/0x15f0 [ptlrpc]
      [1805310.527688]  [<ffffffffc10acb12>] tgt_enqueue+0x62/0x210 [ptlrpc]
      [1805310.534034]  [<ffffffffc10b564a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
      [1805310.541167]  [<ffffffffc105843b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [1805310.549063]  [<ffffffffc105bda4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
      [1805310.555596]  [<ffffffff9fac2e81>] kthread+0xd1/0xe0
      [1805310.560688]  [<ffffffffa0177c24>] ret_from_fork_nospec_begin+0xe/0x21
      [1805310.567340]  [<ffffffffffffffff>] 0xffffffffffffffff
      [1805310.572544] LustreError: dumping log to /tmp/lustre-log.1586359839.20879
      [1805316.373786] LNet: Service thread pid 20879 completed after 206.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
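
      When going through a long dmesg such as the attached one, it can help to extract the slow service thread events and the stack frames mechanically. The following is a rough sketch in Python that assumes the message wording shown above; the names `parse_slow_threads` and `parse_frames` are illustrative only.

```python
import re

# Matches the two LNet service thread messages quoted above.
SLOW_RE = re.compile(
    r"^\s*\[\s*(?P<ts>\d+\.\d+)\]\s+LNet: Service thread pid (?P<pid>\d+) "
    r"(?P<event>completed after|was inactive for) (?P<secs>\d+(?:\.\d+)?)s"
)

# Matches a stack frame such as:
#   [<ffffffffc1855fa8>] osp_precreate_reserve+0x2e8/0x800 [osp]
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def parse_slow_threads(lines):
    """Return one dict per slow or inactive service thread message."""
    events = []
    for line in lines:
        m = SLOW_RE.match(line)
        if m:
            events.append({
                "timestamp": float(m.group("ts")),
                "pid": int(m.group("pid")),
                "event": "completed" if m.group("event").startswith("completed") else "inactive",
                "seconds": float(m.group("secs")),
            })
    return events


def parse_frames(lines):
    """Return (function, module) pairs for each stack frame line.

    The module is None for frames in the core kernel, e.g. kthread.
    """
    frames = []
    for line in lines:
        m = FRAME_RE.search(line)
        if m:
            frames.append((m.group("func"), m.group("module")))
    return frames
```

      In the trace above, the first frame returned by `parse_frames` is `osp_precreate_reserve` in the `osp` module, i.e. the MDT thread was waiting in the OST object precreation path.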
      

      We took a crash dump, which is available on the Whamcloud FTP as fir-md1-s3_20200408_vmcore.

      Attached are the dmesg as fir-md1-s3_20200408_vmcore-dmesg.txt and the "foreach bt" output from the crash dump as fir-md1-s3_20200408_foreach_bt.txt.

      A restart of the MDS has fixed the issue for now.

          People

            tappro Mikhail Pershin
            sthiell Stephane Thiell