Details
Type: Bug
Resolution: Fixed
Priority: Major
Environment: Production Lustre 2.12.6 + patches
Severity: 3
Description
This issue was seen on a production filesystem.
An MDT became unresponsive (as seen from the clients) after a user mistakenly started a job on HSM-released files.
To resolve the issue, the administrator had to crash the MDT and the copytools and stop the problematic job.
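For context, here is a minimal user-space sketch of what such a job ends up doing (an assumption about the job's behaviour, not taken from the actual job): simply reading an HSM-released file makes the client send an implicit RESTORE request to the coordinator, and the read blocks until the copytool has staged the data back in.

/* Minimal sketch (assumption about the job, not the actual job code):
 * reading an HSM-released file triggers an implicit RESTORE and blocks
 * until the copytool has restored the data. Running this on many
 * released files in parallel produces many concurrent RESTORE requests. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	char buf[4096];
	ssize_t n;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <released file>\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Blocks here until the HSM copytool has restored the file data. */
	n = read(fd, buf, sizeof(buf));
	printf("read %zd bytes\n", n);
	close(fd);
	return 0;
}

Run from many clients at once, reads like this produce many simultaneous RESTORE requests on the coordinator, which is the situation analysed below.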
dmesg
LustreError: 2989:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1634368802, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-lustre1-MDT0001_UUID lock: ffff8a89b9d48fc0/0x936f3f94aa7a15ba lrc: 3/0,1 mode: --/EX res: [0xc40051b70:0xa:0x0].0x0 bits 0x8/0x0 rrc: 378 type: IBT flags: 0x40210400000020 nid: local remote: 0x0 expref: -99 pid: 2989 timeout: 0 lvb_type: 0
stack
[1371853.573298] LNet: Service thread pid 3160 was inactive for 200.47s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[1371853.590331] LNet: Skipped 3 previous similar messages
[1371853.595565] Pid: 3160, comm: mdt01_083 3.10.0-1062.18.1.el7.x86_64 #1 SMP Tue Mar 17 23:49:17 UTC 2020
[1371853.595567] Call Trace:
[1371853.595584] [<ffffffffc113a220>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
[1371853.602747] [<ffffffffc113b101>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
[1371853.610134] [<ffffffffc170b69b>] mdt_object_local_lock+0x50b/0xb20 [mdt]
[1371853.617154] [<ffffffffc170bd20>] mdt_object_lock_internal+0x70/0x360 [mdt]
[1371853.624325] [<ffffffffc170ebaa>] mdt_object_find_lock+0x6a/0x1a0 [mdt]
[1371853.631160] [<ffffffffc175d56f>] cdt_restore_handle_add+0xdf/0x540 [mdt]
[1371853.638167] [<ffffffffc175422b>] mdt_hsm_register_hal+0x59b/0xb10 [mdt]
[1371853.645082] [<ffffffffc1754d1c>] mdt_hsm_add_actions+0x57c/0x6c0 [mdt]
[1371853.651910] [<ffffffffc1744b8a>] mdt_hsm_request+0x4da/0xa30 [mdt]
[1371853.658402] [<ffffffffc11d8aaa>] tgt_request_handle+0xada/0x1570 [ptlrpc]
[1371853.665549] [<ffffffffc117d9cb>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[1371853.673461] [<ffffffffc1181334>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
[1371853.679982] [<ffffffff998c6321>] kthread+0xd1/0xe0
[1371853.685073] [<ffffffff99f8dd37>] ret_from_fork_nospec_end+0x0/0x39
[1371853.691545] [<ffffffffffffffff>] 0xffffffffffffffff
Analysis
I was able to find the ldlm_lock and the ldlm_resource in the crash dump.
The released file was big (over 430 GB), so the copytool took a long time to retrieve its data.
Here is the granted lock for the resource:
crash> ldlm_resource.lr_granted -o 0xffff8a8a352e3b00
struct ldlm_resource {
  [ffff8a8a352e3b20] struct list_head lr_granted;
}
crash> list -H ffff8a8a352e3b20 -s ldlm_lock ldlm_lock.l_res_link
ffff8a89806018c0
struct ldlm_lock {
  l_handle = { h_link = { next = 0xffff8a6c38bf5000, prev = 0xffff8a6c6aa204e0 }, h_cookie = 10623779953738126729, h_owner = 0x0, h_ops = 0xffffffffc124e730 <lock_handle_ops>, h_rcu = { next = 0x0, func = 0x0 }, h_lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } }, h_size = 0, h_in = 1 },
  l_refc = { counter = 2 },
  l_lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } },
  l_resource = 0xffff8a8a352e3b00,
  l_lru = { next = 0xffff8a8980601910, prev = 0xffff8a8980601910 },
  l_res_link = { next = 0xffff8a8a352e3b20, prev = 0xffff8a8a352e3b20 },
  { l_tree_node = 0xffff8a898ec77a80, l_ibits_node = 0xffff8a898ec77a80 },
  l_exp_hash = { next = 0x0, pprev = 0x0 },
  l_exp_flock_hash = { next = 0x0, pprev = 0x0 },
  l_req_mode = LCK_EX,
  l_granted_mode = LCK_EX,
  l_completion_ast = 0xffffffffc1139df0 <ldlm_completion_ast>,
  l_blocking_ast = 0xffffffffc16fb960 <mdt_blocking_ast>,
  l_glimpse_ast = 0x0,
  l_export = 0x0,
  l_conn_export = 0x0,
  l_remote_handle = { cookie = 0 },
  l_policy_data = { l_extent = { start = 8, end = 0, gid = 0 }, l_flock = { start = 8, end = 0, owner = 0, blocking_owner = 0, blocking_export = 0x0, blocking_refs = { counter = 0 }, pid = 0 }, l_inodebits = { bits = 8, { try_bits = 0, cancel_bits = 0 } } },
  l_flags = 18050699573067808,
  l_readers = 0,
  l_writers = 1,
  l_waitq = { lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } }, task_list = { next = 0xffff8a89806019d8, prev = 0xffff8a89806019d8 } },
  l_last_used = { tv64 = 1371509486081554 },
  l_req_extent = { start = 0, end = 0, gid = 0 },
  l_lvb_type = LVB_T_NONE,
  l_lvb_len = 0,
  l_lvb_data = 0x0,
  l_ast_data = 0x0,
  { l_activity = 1634368802, l_blast_sent = 1634368802 },
  l_ost_lvb = { lvb_size = 0, lvb_mtime = 0, lvb_atime = 0, lvb_ctime = 0, lvb_blocks = 0, lvb_mtime_ns = 0, lvb_atime_ns = 0, lvb_ctime_ns = 0, lvb_padding = 0 },
  l_client_cookie = 10623779952545863461,
  l_pending_chain = { next = 0xffff8a8980601a68, prev = 0xffff8a8980601a68 },
  l_callback_timeout = 0,
  l_pid = 2930,
  l_bl_ast_run = 1,
  l_bl_ast = { next = 0xffff8a8980601a88, prev = 0xffff8a8980601a88 },
  l_cp_ast = { next = 0xffff8a8980601a98, prev = 0xffff8a8980601a98 },
  l_rk_ast = { next = 0xffff8a8980601aa8, prev = 0xffff8a8980601aa8 },
  l_blocking_lock = 0x0,
  l_sl_mode = { next = 0xffff8a8980601ac0, prev = 0xffff8a8980601ac0 },
  l_sl_policy = { next = 0xffff8a8980601ad0, prev = 0xffff8a8980601ad0 },
  l_reference = {<No data fields>},
  l_exp_list = { next = 0xffff8a8980601ae0, prev = 0xffff8a8980601ae0 }
}
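In this dump the granted mode is LCK_EX with l_inodebits.bits = 0x8, created by pid 2930 (l_pid). As a quick sanity check that 0x8 is the layout bit, here is a tiny decoder; the MDS_INODELOCK_* defines mirror my reading of Lustre's lustre_idl.h (an assumption for this 2.12 tree, not taken from the dump):

/* Tiny decoder for the ibits value seen in the lock dump above.
 * The MDS_INODELOCK_* values mirror Lustre's lustre_idl.h (assumed
 * unchanged in this 2.12 tree). */
#include <stdio.h>

#define MDS_INODELOCK_LOOKUP 0x01 /* name/dentry */
#define MDS_INODELOCK_UPDATE 0x02 /* attributes */
#define MDS_INODELOCK_OPEN   0x04 /* open handle */
#define MDS_INODELOCK_LAYOUT 0x08 /* file layout */

int main(void)
{
	unsigned long long bits = 0x8; /* l_inodebits.bits from the dump */

	printf("lookup:%d update:%d open:%d layout:%d\n",
	       !!(bits & MDS_INODELOCK_LOOKUP),
	       !!(bits & MDS_INODELOCK_UPDATE),
	       !!(bits & MDS_INODELOCK_OPEN),
	       !!(bits & MDS_INODELOCK_LAYOUT));
	return 0;
}

This prints "lookup:0 update:0 open:0 layout:1", matching the "bits 0x8" in the dmesg message above: the EX lock held (and waited for) on this resource is the layout lock.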
There are a lot of EX locks in the waiting list:
crash> ldlm_resource.lr_waiting -o 0xffff8a8a352e3b00
struct ldlm_resource {
  [ffff8a8a352e3b30] struct list_head lr_waiting;
}
crash> list -H ffff8a8a352e3b30 -s ldlm_lock.l_req_mode ldlm_lock.l_res_link | grep EX | wc -l
243
For each waiting lock, the creating PID (l_pid) corresponds to a hung thread. All of these threads are in cdt_restore_handle_add().
2021-10-16_12:48:44 [1384176467802037] [IN] PID: 3149  TASK: ffff8a6bd74aa0e0  CPU: 15  COMMAND: "mdt01_081"
2021-10-16_12:48:44 [1384176467755519] [IN] PID: 3031  TASK: ffff8a6c61f7b150  CPU: 14  COMMAND: "mdt01_049"
2021-10-16_12:48:44 [1384176460239208] [IN] PID: 3186  TASK: ffff8a8cbcae1070  CPU: 2   COMMAND: "mdt00_107"
2021-10-16_12:48:44 [1384176460192173] [IN] PID: 2983  TASK: ffff8a8c5f215230  CPU: 0   COMMAND: "mdt00_049"
2021-10-16_12:48:44 [1384176330809933] [IN] PID: 2930  TASK: ffff8a8cbb52e2a0  CPU: 5   COMMAND: "mdt00_031"
2021-10-16_09:25:01 [1371953151468039] [IN] PID: 3259  TASK: ffff8a8cbb613150  CPU: 11  COMMAND: "mdt01_121"
2021-10-16_09:25:01 [1371953151453579] [IN] PID: 3115  TASK: ffff8a8c60a841c0  CPU: 11  COMMAND: "mdt01_072"
...
...
2021-10-16_09:25:01 [1371953111405844] [IN] PID: 3005  TASK: ffff8a8c7638a0e0  CPU: 1   COMMAND: "mdt00_056"
2021-10-16_09:25:01 [1371953111403702] [IN] PID: 2989  TASK: ffff8a6cbcf28000  CPU: 13  COMMAND: "mdt01_037"
Only 5 of the 248 MDT service threads were still able to handle client requests (3 on CPT0 and 2 on CPT1).
For now, I think there is a window in which several clients can send HSM restore requests for a file whose EX layout lock has not yet been taken by the MDT.
The first MDT thread takes the EX lock in cdt_restore_handle_add(); the other threads then wait for that lock until the RESTORE is finished.
This can cause all the available service threads to hang, waiting for the first RESTORE request to finish.
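To make this failure mode concrete, here is a small user-space model (an analogy only, not Lustre code and not a proposed fix): a pool of "service threads" all receive a restore request for the same file; the first one takes an exclusive lock and holds it for the whole slow restore, and every other thread blocks on that lock, draining the pool.

/* User-space analogy of the hang (not Lustre code): THREADS "service
 * threads" all receive a restore request for the same file. The first
 * one takes an exclusive lock and holds it for the whole (slow) restore;
 * the others block on the lock, so no thread is left for other work.
 * Build with: cc -pthread model.c */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define THREADS 8

static pthread_mutex_t layout_lock = PTHREAD_MUTEX_INITIALIZER; /* EX layout lock analogue */
static int restored; /* protected by layout_lock */

static void *service_thread(void *arg)
{
	long id = (long)arg;

	printf("thread %ld: restore request, waiting for layout lock\n", id);
	pthread_mutex_lock(&layout_lock);   /* cdt_restore_handle_add() analogue */
	if (!restored) {
		printf("thread %ld: restoring (slow copytool)...\n", id);
		sleep(10);                  /* stands in for staging a huge file */
		restored = 1;
	}
	pthread_mutex_unlock(&layout_lock);
	printf("thread %ld: done\n", id);
	return NULL;
}

int main(void)
{
	pthread_t tid[THREADS];
	long i;

	/* All service threads get a restore request for the same file:
	 * only the first makes progress, the rest are stuck on the lock
	 * and cannot serve any other client request in the meantime. */
	for (i = 0; i < THREADS; i++)
		pthread_create(&tid[i], NULL, service_thread, (void *)i);
	for (i = 0; i < THREADS; i++)
		pthread_join(tid[i], NULL);
	return 0;
}

With THREADS scaled up to the size of the MDT service-thread pool, this reproduces the observed state: one thread doing the restore, everything else parked behind a single exclusive lock for the duration of the copytool transfer.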