  Lustre / LU-14221

Client hangs when using DoM with a fixed mdc lru_size


Details

    • Type: Bug
    • Resolution: Won't Do
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.5, Lustre 2.12.6
    • Severity: 3

    Description

      After enabling DoM and beginning to use one of our file systems more heavily recently, we discovered a bug seemingly related to locking.

      Basically, with any fixed `lru_size`, everything works normally until the number of locks hits the `lru_size`. From that point, everything hangs until the `lru_max_age` is reached, at which point the locks are cleared and work proceeds, until the LRU fills again. We confirmed this by setting `lru_size` fairly low, setting a short (10s) `lru_max_age`, and kicking off a tar extraction. The tar would extract until the `lock_count` hit our `lru_size` value (basically 1-for-1 with the number of files), then hang for 10s, then continue with another batch after the locks had been cleared. The same behavior can be reproduced by letting it hang and then running `lctl set_param ldlm.namespaces.mdc.lru_size=clear`, which also frees up the process temporarily.
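      A minimal sketch of that reproducer, assuming a client mount at /mnt/lustre with a DoM-enabled directory (the paths, archive name, and lru_size/lru_max_age values are illustrative; lru_max_age may be expressed in milliseconds depending on the client version, so verify the units with get_param first):

      # Fix the mdc lock LRU at a small size and a short max age (illustrative values).
      lctl set_param ldlm.namespaces.*mdc*.lru_size=200
      lctl set_param ldlm.namespaces.*mdc*.lru_max_age=10000
      # Extract an archive containing more files than lru_size into a DoM directory.
      tar -xf archive.tar -C /mnt/lustre/dom_dir &
      # lock_count climbs to lru_size, the extraction stalls, then resumes once lru_max_age expires.
      watch -n1 'lctl get_param ldlm.namespaces.*mdc*.lock_count'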

       

      Our current workaround is to set `lru_size` to 0 and `lru_max_age` to 30s to keep the number of locks at a manageable level.
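      A sketch of applying that workaround on a client (same unit caveat for lru_max_age as above; plain set_param does not survive a remount, so a persistent setting would need lctl set_param -P run on the MGS, or an init-time script):

      # Workaround sketch: let the LRU size float (0 = dynamic sizing) and age locks out quickly.
      lctl set_param ldlm.namespaces.*mdc*.lru_size=0
      lctl set_param ldlm.namespaces.*mdc*.lru_max_age=30000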

       

      This appears to only occur on our SLES clients; RHEL clients running the same Lustre version encounter no such problems. This may be due to the difference in kernel versions between SLES (4.12.14-197) and RHEL (3.10.0-1160).

       

      James believes this may be related to LU-11518.

       

      lru_size and lock_count while it's stuck:

      lctl get_param ldlm.namespaces.*.lru_size
      ldlm.namespaces.cyclone-MDT0000-mdc-ffff88078946d800.lru_size=200
      lctl get_param ldlm.namespaces.*.lock_count
      ldlm.namespaces.cyclone-MDT0000-mdc-ffff88078946d800.lock_count=201
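      While it is in this state, clearing the mdc lock LRU (as described above) temporarily frees the stuck process, for example:

      # Cancel unused mdc DLM locks; the hung write resumes until the LRU fills again.
      lctl set_param ldlm.namespaces.*mdc*.lru_size=clear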

       

      Process stack while it's stuck:

      [<ffffffffa0ad1932>] ptlrpc_set_wait+0x362/0x700 [ptlrpc]
      [<ffffffffa0ad1d57>] ptlrpc_queue_wait+0x87/0x230 [ptlrpc]
      [<ffffffffa0ab7217>] ldlm_cli_enqueue+0x417/0x8f0 [ptlrpc]
      [<ffffffffa0a6105d>] mdc_enqueue_base+0x3ad/0x1990 [mdc]
      [<ffffffffa0a62e38>] mdc_intent_lock+0x288/0x4c0 [mdc]
      [<ffffffffa0bf29ca>] lmv_intent_lock+0x9ca/0x1670 [lmv]
      [<ffffffffa0cfea99>] ll_layout_intent+0x319/0x660 [lustre]
      [<ffffffffa0d09fe2>] ll_layout_refresh+0x282/0x11d0 [lustre]
      [<ffffffffa0d47c73>] vvp_io_init+0x233/0x370 [lustre]
      [<ffffffffa085d4d1>] cl_io_init0.isra.15+0xa1/0x150 [obdclass]
      [<ffffffffa085d641>] cl_io_init+0x41/0x80 [obdclass]
      [<ffffffffa085fb64>] cl_io_rw_init+0x104/0x200 [obdclass]
      [<ffffffffa0d02c5b>] ll_file_io_generic+0x2cb/0xb70 [lustre]
      [<ffffffffa0d03825>] ll_file_write_iter+0x125/0x530 [lustre]
      [<ffffffff81214c9b>] __vfs_write+0xdb/0x130
      [<ffffffff81215581>] vfs_write+0xb1/0x1a0
      [<ffffffff81216ac6>] SyS_write+0x46/0xa0
      [<ffffffff81002af5>] do_syscall_64+0x75/0xf0
      [<ffffffff8160008f>] entry_SYSCALL_64_after_hwframe+0x42/0xb7
      [<ffffffffffffffff>] 0xffffffffffffffff

      I can reproduce and provide any other debug data as necessary.
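      A sketch of how that data could be gathered on the client (assuming tar is the stuck process; the PID lookup and log path are illustrative):

      # Kernel stack of the stuck process (generic Linux interface).
      cat /proc/$(pidof -s tar)/stack
      # Dump the Lustre kernel debug log for additional context.
      lctl dk /tmp/lustre-debug.log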

    People

      Assignee: Mikhail Pershin (tappro)
      Reporter: Jeff Niles (nilesj)
