[LU-15132] Parallel data accesses on a release file could hang a MDT Created: 19/Oct/21  Updated: 12/Jan/24  Resolved: 18/Jul/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Major
Reporter: Etienne Aujames Assignee: Etienne Aujames
Resolution: Fixed Votes: 0
Labels: None
Environment:

Production Lustre 2.12.6 + patches


Issue Links:
Related
is related to LU-16356 high contention on cdt_request_lock c... Open
is related to LU-15145 hsm_cancel on an inactive HSM restore... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was seen on a production filesystem.
An MDT became unresponsive (from the clients' point of view) after a user mistakenly started a job accessing HSM-released files.

To recover, the admin had to crash the MDT and the copytools, and stop the problematic job.

dmesg

LustreError: 2989:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1634368802, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-lustre1-MDT0001_UUID lock: ffff8a89b9d48fc0/0x936f3f94aa7a15ba lrc: 3/0,1 mode: --/EX res: [0xc40051b70:0xa:0x0].0x0 bits 0x8/0x0 rrc: 378 type: IBT flags: 0x40210400000020 nid: local remote: 0x0 expref: -99 pid: 2989 timeout: 0 lvb_type: 0

stack

[1371853.573298] LNet: Service thread pid 3160 was inactive for 200.47s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[1371853.590331] LNet: Skipped 3 previous similar messages
[1371853.595565] Pid: 3160, comm: mdt01_083 3.10.0-1062.18.1.el7.x86_64 #1 SMP Tue Mar 17 23:49:17 UTC 2020
[1371853.595567] Call Trace:
[1371853.595584]  [<ffffffffc113a220>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
[1371853.602747]  [<ffffffffc113b101>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
[1371853.610134]  [<ffffffffc170b69b>] mdt_object_local_lock+0x50b/0xb20 [mdt]
[1371853.617154]  [<ffffffffc170bd20>] mdt_object_lock_internal+0x70/0x360 [mdt]
[1371853.624325]  [<ffffffffc170ebaa>] mdt_object_find_lock+0x6a/0x1a0 [mdt]
[1371853.631160]  [<ffffffffc175d56f>] cdt_restore_handle_add+0xdf/0x540 [mdt]
[1371853.638167]  [<ffffffffc175422b>] mdt_hsm_register_hal+0x59b/0xb10 [mdt]
[1371853.645082]  [<ffffffffc1754d1c>] mdt_hsm_add_actions+0x57c/0x6c0 [mdt]
[1371853.651910]  [<ffffffffc1744b8a>] mdt_hsm_request+0x4da/0xa30 [mdt]
[1371853.658402]  [<ffffffffc11d8aaa>] tgt_request_handle+0xada/0x1570 [ptlrpc]
[1371853.665549]  [<ffffffffc117d9cb>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[1371853.673461]  [<ffffffffc1181334>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
[1371853.679982]  [<ffffffff998c6321>] kthread+0xd1/0xe0
[1371853.685073]  [<ffffffff99f8dd37>] ret_from_fork_nospec_end+0x0/0x39
[1371853.691545]  [<ffffffffffffffff>] 0xffffffffffffffff

Analysis
I was able to find the ldlm_lock and ldlm_resource in the crash dump.
The released file was big (+430GB), so it took the copytool a long time to retrieve the data.

Here is the granted lock for the resource:

crash> ldlm_resource.lr_granted -o 0xffff8a8a352e3b00
struct ldlm_resource {
  [ffff8a8a352e3b20] struct list_head lr_granted;
}
crash> list -H ffff8a8a352e3b20 -s ldlm_lock ldlm_lock.l_res_link
ffff8a89806018c0
struct ldlm_lock {
  l_handle = {
    h_link = {
      next = 0xffff8a6c38bf5000,
      prev = 0xffff8a6c6aa204e0
    },
    h_cookie = 10623779953738126729,
    h_owner = 0x0,
    h_ops = 0xffffffffc124e730 <lock_handle_ops>,
    h_rcu = {
      next = 0x0,
      func = 0x0
    },
    h_lock = {
      {
        rlock = {
          raw_lock = {
            val = {
              counter = 0
            }
          }
        }
      }
    },
    h_size = 0,
    h_in = 1
  },
  l_refc = {
    counter = 2
  },
  l_lock = {
    {
      rlock = {
        raw_lock = {
          val = {
            counter = 0
          }
        }
      }
    }
  },
  l_resource = 0xffff8a8a352e3b00,
  l_lru = {
    next = 0xffff8a8980601910,
    prev = 0xffff8a8980601910
  },
  l_res_link = {
    next = 0xffff8a8a352e3b20,
    prev = 0xffff8a8a352e3b20
  },
  {
    l_tree_node = 0xffff8a898ec77a80,
    l_ibits_node = 0xffff8a898ec77a80
  },
  l_exp_hash = {
    next = 0x0,
    pprev = 0x0
  },
  l_exp_flock_hash = {
    next = 0x0,
    pprev = 0x0
  },
  l_req_mode = LCK_EX,
  l_granted_mode = LCK_EX,
  l_completion_ast = 0xffffffffc1139df0 <ldlm_completion_ast>,
  l_blocking_ast = 0xffffffffc16fb960 <mdt_blocking_ast>,
  l_glimpse_ast = 0x0,
  l_export = 0x0,
  l_conn_export = 0x0,
  l_remote_handle = {
    cookie = 0
  },
  l_policy_data = {
    l_extent = {
      start = 8,
      end = 0,
      gid = 0
    },
    l_flock = {
      start = 8,
      end = 0,
      owner = 0,
      blocking_owner = 0,
      blocking_export = 0x0,
      blocking_refs = {
        counter = 0
      },
      pid = 0
    },
    l_inodebits = {
      bits = 8,
      {
        try_bits = 0,
        cancel_bits = 0
      }
    }
  },
  l_flags = 18050699573067808,
  l_readers = 0,
  l_writers = 1,
  l_waitq = {
    lock = {
      {
        rlock = {
          raw_lock = {
            val = {
              counter = 0
            }
          }
        }
      }
    },
    task_list = {
      next = 0xffff8a89806019d8,
      prev = 0xffff8a89806019d8
    }
  },
  l_last_used = {
    tv64 = 1371509486081554
  },
  l_req_extent = {
    start = 0,
    end = 0,
    gid = 0
  },
  l_lvb_type = LVB_T_NONE,
  l_lvb_len = 0,
  l_lvb_data = 0x0,
  l_ast_data = 0x0,
  {
    l_activity = 1634368802,
    l_blast_sent = 1634368802
  },
  l_ost_lvb = {
    lvb_size = 0,
    lvb_mtime = 0,
    lvb_atime = 0,
    lvb_ctime = 0,
    lvb_blocks = 0,
    lvb_mtime_ns = 0,
    lvb_atime_ns = 0,
    lvb_ctime_ns = 0,
    lvb_padding = 0
  },
  l_client_cookie = 10623779952545863461,
  l_pending_chain = {
    next = 0xffff8a8980601a68,
    prev = 0xffff8a8980601a68
  },
  l_callback_timeout = 0,
  l_pid = 2930,
  l_bl_ast_run = 1,
  l_bl_ast = {
    next = 0xffff8a8980601a88,
    prev = 0xffff8a8980601a88
  },
  l_cp_ast = {
    next = 0xffff8a8980601a98,
    prev = 0xffff8a8980601a98
  },
  l_rk_ast = {
    next = 0xffff8a8980601aa8,
    prev = 0xffff8a8980601aa8
  },
  l_blocking_lock = 0x0,
  l_sl_mode = {
    next = 0xffff8a8980601ac0,
    prev = 0xffff8a8980601ac0
  },
  l_sl_policy = {
    next = 0xffff8a8980601ad0,
    prev = 0xffff8a8980601ad0
  },
  l_reference = {<No data fields>},
  l_exp_list = {
    next = 0xffff8a8980601ae0,
    prev = 0xffff8a8980601ae0
  }
}
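
The interesting fields in this dump are l_granted_mode = LCK_EX, l_inodebits.bits = 8 (the layout bit) and l_export = 0x0, i.e. a lock taken locally by the MDT itself (l_pid = 2930, one of the threads listed further down). Below is a minimal C sketch to decode the ibits value; the MDS_INODELOCK_* constants are assumed to match the usual definitions in lustre_idl.h.

/*
 * Illustrative sketch: decode the ibits value from the dump above
 * (l_inodebits.bits = 8). The MDS_INODELOCK_* values are assumed to
 * match the standard definitions in lustre_idl.h.
 */
#include <stdio.h>

#define MDS_INODELOCK_LOOKUP 0x01
#define MDS_INODELOCK_UPDATE 0x02
#define MDS_INODELOCK_OPEN   0x04
#define MDS_INODELOCK_LAYOUT 0x08

static void print_ibits(unsigned long long bits)
{
        if (bits & MDS_INODELOCK_LOOKUP) printf("LOOKUP ");
        if (bits & MDS_INODELOCK_UPDATE) printf("UPDATE ");
        if (bits & MDS_INODELOCK_OPEN)   printf("OPEN ");
        if (bits & MDS_INODELOCK_LAYOUT) printf("LAYOUT ");
        printf("\n");
}

int main(void)
{
        print_ibits(0x8);       /* l_inodebits.bits of the granted lock */
        return 0;
}

This prints "LAYOUT" for bits = 0x8: the granted EX lock covers only the layout bit, consistent with the "bits 0x8" in the dmesg message above.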

There are a lot of EX locks in the waiting list:

crash> ldlm_resource.lr_waiting -o 0xffff8a8a352e3b00
struct ldlm_resource {
  [ffff8a8a352e3b30] struct list_head lr_waiting;
}
crash> list -H ffff8a8a352e3b30 -s ldlm_lock.l_req_mode ldlm_lock.l_res_link | grep EX | wc -l
243

For each waiting lock, the creation pid (l_pid) corresponds to a hung MDT thread. All of these threads are stuck in cdt_restore_handle_add().

2021-10-16_12:48:44 [1384176467802037] [IN]  PID: 3149   TASK: ffff8a6bd74aa0e0  CPU: 15  COMMAND: "mdt01_081"
2021-10-16_12:48:44 [1384176467755519] [IN]  PID: 3031   TASK: ffff8a6c61f7b150  CPU: 14  COMMAND: "mdt01_049"
2021-10-16_12:48:44 [1384176460239208] [IN]  PID: 3186   TASK: ffff8a8cbcae1070  CPU: 2   COMMAND: "mdt00_107"
2021-10-16_12:48:44 [1384176460192173] [IN]  PID: 2983   TASK: ffff8a8c5f215230  CPU: 0   COMMAND: "mdt00_049"
2021-10-16_12:48:44 [1384176330809933] [IN]  PID: 2930   TASK: ffff8a8cbb52e2a0  CPU: 5   COMMAND: "mdt00_031"

2021-10-16_09:25:01 [1371953151468039] [IN]  PID: 3259   TASK: ffff8a8cbb613150  CPU: 11  COMMAND: "mdt01_121"
2021-10-16_09:25:01 [1371953151453579] [IN]  PID: 3115   TASK: ffff8a8c60a841c0  CPU: 11  COMMAND: "mdt01_072"
...
...
2021-10-16_09:25:01 [1371953111405844] [IN]  PID: 3005   TASK: ffff8a8c7638a0e0  CPU: 1   COMMAND: "mdt00_056"
2021-10-16_09:25:01 [1371953111403702] [IN]  PID: 2989   TASK: ffff8a6cbcf28000  CPU: 13  COMMAND: "mdt01_037"

Only 5 of the 248 MDT service threads were still able to handle client requests (3 on CPT0 and 2 on CPT1); all the others were stuck in the HSM restore path.

For now, I think there is a window where several clients can send HSM restore requests before the EX layout lock has been taken by the MDT.
The first MDT thread takes the EX lock via cdt_restore_handle_add(); the other threads then wait for the same lock until the first RESTORE is finished.
This can cause all the available service threads to hang (waiting for the first RESTORE request to finish), as illustrated by the sketch below.
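
The sketch below is a minimal, standalone illustration of this window and of one way to close it, in the spirit of patch 45367 ("Protect against parallel HSM restore requests"): register the restore for a FID under a dedicated lock first, and only let the thread that actually registered it enqueue the EX layout lock. All names here (restore_table, restore_handle_add, ...) are illustrative, not the actual Lustre code.

/*
 * Illustrative sketch of the hang described above and of a
 * "check before locking" approach. Not the Lustre implementation.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

#define MAX_RESTORES 16

struct restore_table {
        pthread_mutex_t lock;           /* plays the role of cdt_restore_lock */
        unsigned long   fids[MAX_RESTORES];
        int             count;
};

static struct restore_table tbl = { .lock = PTHREAD_MUTEX_INITIALIZER };

/*
 * Before the fix: every service thread handling a RESTORE for the same
 * FID enqueued the EX layout lock and blocked behind the first one,
 * eventually exhausting the thread pool.
 *
 * After the fix (sketch): register the restore under the table lock
 * first; only the thread that actually inserted the FID goes on to take
 * the EX layout lock, the others return immediately.
 */
static bool restore_handle_add(unsigned long fid)
{
        bool added = false;
        int i;

        pthread_mutex_lock(&tbl.lock);
        for (i = 0; i < tbl.count; i++)
                if (tbl.fids[i] == fid)
                        goto out;               /* restore already running */
        if (tbl.count < MAX_RESTORES) {
                tbl.fids[tbl.count++] = fid;
                added = true;
        }
out:
        pthread_mutex_unlock(&tbl.lock);
        return added;
}

int main(void)
{
        /* Two "service threads" receive a RESTORE for the same file. */
        unsigned long fid = 0xc40051b70;

        printf("thread 1: %s\n", restore_handle_add(fid) ?
               "takes EX layout lock, starts restore" :
               "restore already in progress, returns");
        printf("thread 2: %s\n", restore_handle_add(fid) ?
               "takes EX layout lock, starts restore" :
               "restore already in progress, returns");
        return 0;
}

With such a check, a second RESTORE request for the same file returns immediately instead of parking a service thread behind the EX layout lock; see the Gerrit changes in the comments below for the actual implementation.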



 Comments   
Comment by Gerrit Updater [ 25/Oct/21 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/45359
Subject: LU-15132 hsm: reproducer
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e122c3d98637b451341a7c6992ebc801fb9a6d27

Comment by Gerrit Updater [ 26/Oct/21 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/45367
Subject: LU-15132 hsm: Protect against parallel HSM restore requests
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c850fd8f6c13cb9db7b3d8fefa3cbbb05da21360

Comment by Gerrit Updater [ 02/May/22 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/47181
Subject: LU-15132 mdc: Use early cancels for hsm requests
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 18e96bfbfaf23bf7b9637243ed2ea2a440c436b3

Comment by Gerrit Updater [ 06/Jun/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47181/
Subject: LU-15132 mdc: Use early cancels for hsm requests
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 60d2a4b0efa4a944b558bd9b63b6334f7e70419b

Comment by Oleg Drokin [ 11/Jul/22 ]

It looks like Vitaly had a very sensible comment on the patch about how to better do it that I missed, so perhaps we should not close this ticket and instead implement his suggestions, which would be a lighter-weight approach than reverting this patch and redoing it altogether.

Comment by Gerrit Updater [ 18/Jul/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45367/
Subject: LU-15132 hsm: Protect against parallel HSM restore requests
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 66b3e74bccf1451d135b7f331459b6af1c06431b

Comment by Peter Jones [ 18/Jul/22 ]

Landed for 2.16

Comment by Gerrit Updater [ 12/Jan/24 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53658
Subject: LU-15132 mdc: Use early cancels for hsm requests
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: 339eb17a997bc3335b6e9896c28621cbd68ac0fd

Comment by Gerrit Updater [ 12/Jan/24 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53659
Subject: LU-15132 hsm: Protect against parallel HSM restore requests
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: 485eec754f5c1978ceffb81db655962c43c513f9
