LU-15132: Parallel data accesses on a released file could hang an MDT


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.16.0
    • Environment: Production Lustre 2.12.6 + patches
    • Severity: 3

    Description

      This issue was seen on a production filesystem.
      An MDT became unresponsive to clients after a user mistakenly started a job on HSM-released files.

      To resolve the issue, the administrator had to crash the MDT and the copytools, and stop the problematic job.

      dmesg

      LustreError: 2989:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) ###
      lock timed out (enqueued at 1634368802, 300s ago); not entering recovery in
      server code, just going back to sleep ns: mdt-lustre1-MDT0001_UUID lock: ffff8a89b9d48fc0/0x936f3f94aa7a15ba lrc: 3/0,1 mode: --/EX res: [0xc40051b70:0xa:0x0].0x0 bits 0x8/0x0 rrc: 378 type: IBT flags: 0x40210400000020 nid: local remote: 0x0 expref: -99 pid: 2989 timeout: 0 lvb_type: 0
      

      stack

      [1371853.573298] LNet: Service thread pid 3160 was inactive for 200.47s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [1371853.590331] LNet: Skipped 3 previous similar messages
      [1371853.595565] Pid: 3160, comm: mdt01_083 3.10.0-1062.18.1.el7.x86_64 #1 SMP Tue Mar 17 23:49:17 UTC 2020
      [1371853.595567] Call Trace:
      [1371853.595584]  [<ffffffffc113a220>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
      [1371853.602747]  [<ffffffffc113b101>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
      [1371853.610134]  [<ffffffffc170b69b>] mdt_object_local_lock+0x50b/0xb20 [mdt]
      [1371853.617154]  [<ffffffffc170bd20>] mdt_object_lock_internal+0x70/0x360 [mdt]
      [1371853.624325]  [<ffffffffc170ebaa>] mdt_object_find_lock+0x6a/0x1a0 [mdt]
      [1371853.631160]  [<ffffffffc175d56f>] cdt_restore_handle_add+0xdf/0x540 [mdt]
      [1371853.638167]  [<ffffffffc175422b>] mdt_hsm_register_hal+0x59b/0xb10 [mdt]
      [1371853.645082]  [<ffffffffc1754d1c>] mdt_hsm_add_actions+0x57c/0x6c0 [mdt]
      [1371853.651910]  [<ffffffffc1744b8a>] mdt_hsm_request+0x4da/0xa30 [mdt]
      [1371853.658402]  [<ffffffffc11d8aaa>] tgt_request_handle+0xada/0x1570 [ptlrpc]
      [1371853.665549]  [<ffffffffc117d9cb>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [1371853.673461]  [<ffffffffc1181334>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
      [1371853.679982]  [<ffffffff998c6321>] kthread+0xd1/0xe0
      [1371853.685073]  [<ffffffff99f8dd37>] ret_from_fork_nospec_end+0x0/0x39
      [1371853.691545]  [<ffffffffffffffff>] 0xffffffffffffffff
      

      Analysis
      I was able to find the ldlm_lock and ldlm_resource in the crash dump.
      The released file was big (+430GB), so the copytool took a long time to retrieve its data.

      Here is the granted lock for the resource:

      crash> ldlm_resource.lr_granted -o 0xffff8a8a352e3b00
      struct ldlm_resource {
        [ffff8a8a352e3b20] struct list_head lr_granted;
      }
      crash> list -H ffff8a8a352e3b20 -s ldlm_lock ldlm_lock.l_res_link
      ffff8a89806018c0
      struct ldlm_lock {
        l_handle = {
          h_link = {
            next = 0xffff8a6c38bf5000,
            prev = 0xffff8a6c6aa204e0
          },
          h_cookie = 10623779953738126729,
          h_owner = 0x0,
          h_ops = 0xffffffffc124e730 <lock_handle_ops>,
          h_rcu = {
            next = 0x0,
            func = 0x0
          },
          h_lock = {
            {
              rlock = {
                raw_lock = {
                  val = {
                    counter = 0
                  }
                }
              }
            }
          },
          h_size = 0,
          h_in = 1
        },
        l_refc = {
          counter = 2
        },
        l_lock = {
          {
            rlock = {
              raw_lock = {
                val = {
                  counter = 0
                }
              }
            }
          }
        },
        l_resource = 0xffff8a8a352e3b00,
        l_lru = {
          next = 0xffff8a8980601910,
          prev = 0xffff8a8980601910
        },
        l_res_link = {
          next = 0xffff8a8a352e3b20,
          prev = 0xffff8a8a352e3b20
        },
        {
          l_tree_node = 0xffff8a898ec77a80,
          l_ibits_node = 0xffff8a898ec77a80
        },
        l_exp_hash = {
          next = 0x0,
          pprev = 0x0
        },
        l_exp_flock_hash = {
          next = 0x0,
          pprev = 0x0
        },
        l_req_mode = LCK_EX,
        l_granted_mode = LCK_EX,
        l_completion_ast = 0xffffffffc1139df0 <ldlm_completion_ast>,
        l_blocking_ast = 0xffffffffc16fb960 <mdt_blocking_ast>,
        l_glimpse_ast = 0x0,
        l_export = 0x0,
        l_conn_export = 0x0,
        l_remote_handle = {
          cookie = 0
        },
        l_policy_data = {
          l_extent = {
            start = 8,
            end = 0,
            gid = 0
          },
          l_flock = {
            start = 8,
            end = 0,
            owner = 0,
            blocking_owner = 0,
            blocking_export = 0x0,
            blocking_refs = {
              counter = 0
            },
            pid = 0
          },
          l_inodebits = {
            bits = 8,
            {
              try_bits = 0,
              cancel_bits = 0
            }
          }
        },
        l_flags = 18050699573067808,
        l_readers = 0,
        l_writers = 1,
        l_waitq = {
          lock = {
            {
              rlock = {
                raw_lock = {
                  val = {
                    counter = 0
                  }
                }
              }
            }
          },
          task_list = {
            next = 0xffff8a89806019d8,
            prev = 0xffff8a89806019d8
          }
        },
        l_last_used = {
          tv64 = 1371509486081554
        },
        l_req_extent = {
          start = 0,
          end = 0,
          gid = 0
        },
        l_lvb_type = LVB_T_NONE,
        l_lvb_len = 0,
        l_lvb_data = 0x0,
        l_ast_data = 0x0,
        {
          l_activity = 1634368802,
          l_blast_sent = 1634368802
        },
        l_ost_lvb = {
          lvb_size = 0,
          lvb_mtime = 0,
          lvb_atime = 0,
          lvb_ctime = 0,
          lvb_blocks = 0,
          lvb_mtime_ns = 0,
          lvb_atime_ns = 0,
          lvb_ctime_ns = 0,
          lvb_padding = 0
        },
        l_client_cookie = 10623779952545863461,
        l_pending_chain = {
          next = 0xffff8a8980601a68,
          prev = 0xffff8a8980601a68
        },
        l_callback_timeout = 0,
        l_pid = 2930,
        l_bl_ast_run = 1,
        l_bl_ast = {
          next = 0xffff8a8980601a88,
          prev = 0xffff8a8980601a88
        },
        l_cp_ast = {
          next = 0xffff8a8980601a98,
          prev = 0xffff8a8980601a98
        },
        l_rk_ast = {
          next = 0xffff8a8980601aa8,
          prev = 0xffff8a8980601aa8
        },
        l_blocking_lock = 0x0,
        l_sl_mode = {
          next = 0xffff8a8980601ac0,
          prev = 0xffff8a8980601ac0
        },
        l_sl_policy = {
          next = 0xffff8a8980601ad0,
          prev = 0xffff8a8980601ad0
        },
        l_reference = {<No data fields>},
        l_exp_list = {
          next = 0xffff8a8980601ae0,
          prev = 0xffff8a8980601ae0
        }
      }
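
      A few fields of this dump are worth decoding: l_granted_mode = LCK_EX with l_export = 0x0 indicates an EX lock taken locally by the MDT itself, l_inodebits.bits = 8 is the LAYOUT inodebit, l_flags = 18050699573067808 is 0x40210400000020 in hexadecimal, and l_pid = 2930 matches one of the threads listed below (mdt00_031). A small standalone check (the MDS_INODELOCK_* values are the ones defined in the Lustre headers, everything else is copied from the dump):

      /* Standalone decode of a few fields from the ldlm_lock dump above. */
      #include <stdio.h>
      #include <stdint.h>

      /* inodebit values as defined in the Lustre headers */
      #define MDS_INODELOCK_LOOKUP 0x01ULL
      #define MDS_INODELOCK_UPDATE 0x02ULL
      #define MDS_INODELOCK_OPEN   0x04ULL
      #define MDS_INODELOCK_LAYOUT 0x08ULL

      int main(void)
      {
              uint64_t l_flags = 18050699573067808ULL; /* from the dump */
              uint64_t bits    = 8;                    /* l_inodebits.bits */

              printf("l_flags = %#llx\n", (unsigned long long)l_flags);
              printf("LAYOUT bit: %s\n",
                     (bits & MDS_INODELOCK_LAYOUT) ? "set" : "clear");
              return 0;
      }

      So the contended resource is the LAYOUT bit of the released file, held in EX mode for the duration of the restore.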
      

      There are a lot of EX locks in the waiting list:

      crash> ldlm_resource.lr_waiting -o 0xffff8a8a352e3b00
      struct ldlm_resource {
        [ffff8a8a352e3b30] struct list_head lr_waiting;
      }
      crash> list -H ffff8a8a352e3b30 -s ldlm_lock.l_req_mode ldlm_lock.l_res_link | grep EX | wc -l
      243
      

      For each lock, the creation pid (l_pid) matches one of the hung threads. All the corresponding threads are in cdt_restore_handle_add().

      2021-10-16_12:48:44 [1384176467802037] [IN]  PID: 3149   TASK: ffff8a6bd74aa0e0  CPU: 15  COMMAND: "mdt01_081"
      2021-10-16_12:48:44 [1384176467755519] [IN]  PID: 3031   TASK: ffff8a6c61f7b150  CPU: 14  COMMAND: "mdt01_049"
      2021-10-16_12:48:44 [1384176460239208] [IN]  PID: 3186   TASK: ffff8a8cbcae1070  CPU: 2   COMMAND: "mdt00_107"
      2021-10-16_12:48:44 [1384176460192173] [IN]  PID: 2983   TASK: ffff8a8c5f215230  CPU: 0   COMMAND: "mdt00_049"
      2021-10-16_12:48:44 [1384176330809933] [IN]  PID: 2930   TASK: ffff8a8cbb52e2a0  CPU: 5   COMMAND: "mdt00_031"
      
      2021-10-16_09:25:01 [1371953151468039] [IN]  PID: 3259   TASK: ffff8a8cbb613150  CPU: 11  COMMAND: "mdt01_121"
      2021-10-16_09:25:01 [1371953151453579] [IN]  PID: 3115   TASK: ffff8a8c60a841c0  CPU: 11  COMMAND: "mdt01_072"
      ...
      ...
      2021-10-16_09:25:01 [1371953111405844] [IN]  PID: 3005   TASK: ffff8a8c7638a0e0  CPU: 1   COMMAND: "mdt00_056"
      2021-10-16_09:25:01 [1371953111403702] [IN]  PID: 2989   TASK: ffff8a6cbcf28000  CPU: 13  COMMAND: "mdt01_037"
      

      Only 5 of the 248 mdt threads were still able to handle client requests (3 on CPT0 and 2 on CPT1).

      For now, I think there is a window where several clients can send HSM restore requests before the EX layout lock has been taken by the MDT.
      The first mdt thread takes the EX lock via cdt_restore_handle_add(), and the other threads wait for that lock until the RESTORE is finished.
      This can cause all the available threads to hang (waiting for the first RESTORE request to finish).
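
      To illustrate the failure mode, here is a toy userspace model (plain pthreads, not Lustre code): the first handler holds an exclusive lock for as long as the copytool needs to restage the file, every other handler for the same file blocks on that lock, and the worker pool drains. Build with "cc -pthread".

      /* Toy model of the MDT thread pool exhaustion (userspace sketch,
       * not Lustre code): a pthread mutex stands in for the EX layout
       * lock taken in cdt_restore_handle_add(). */
      #include <pthread.h>
      #include <stdio.h>
      #include <unistd.h>

      #define NB_WORKERS 8    /* stands in for the mdt service threads */

      static pthread_mutex_t layout_lock = PTHREAD_MUTEX_INITIALIZER;

      static void *worker(void *arg)
      {
              long id = (long)arg;

              /* Blocks here, like the mdt threads dumped above. */
              pthread_mutex_lock(&layout_lock);
              printf("worker %ld unblocked after the restore finished\n", id);
              pthread_mutex_unlock(&layout_lock);
              return NULL;
      }

      int main(void)
      {
              pthread_t tid[NB_WORKERS];
              long i;

              /* First restore request: the EX layout lock is taken and
               * held while the copytool restages the released file. */
              pthread_mutex_lock(&layout_lock);

              /* Every other thread handling a request on the same file
               * blocks right away: the pool drains. */
              for (i = 0; i < NB_WORKERS; i++)
                      pthread_create(&tid[i], NULL, worker, (void *)i);

              printf("restore in progress, workers piling up on the lock\n");
              sleep(10);      /* data movement for the +430GB file */
              pthread_mutex_unlock(&layout_lock);

              for (i = 0; i < NB_WORKERS; i++)
                      pthread_join(tid[i], NULL);
              return 0;
      }

      In the real case the "sleep" lasted for the whole copytool restore of the +430GB file, and the waiting list above shows 243 EX locks parked the same way.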

            People

              Assignee: Etienne Aujames
              Reporter: Etienne Aujames
