[LU-5886] DNE II Testing: Hung threads on MDS; evictions Created: 07/Nov/14  Updated: 10/Oct/21  Resolved: 10/Oct/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Patrick Farrell (Inactive) Assignee: Di Wang
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

Versions of 2.6.54 on clients & servers.
Cray SLES11SP3 clients, CentOS servers (2.6.32-431.5.1.el6.x86_64).

Most recent commit on clients:
Ie7a2a98be8cc97db9af7a64476c06fc7321544eb
http://review.whamcloud.com/12142

Most recent commit on servers:
If24443955290b091fd22905dfb74b0d6a6d1b4e8
http://review.whamcloud.com/12490


Severity: 3
Rank (Obsolete): 16457

 Description   

During DNE II testing (same run as LU-5883), both of our MDSes reported hung threads here:
Nov 6 20:22:21 perses-esf-mds001 kernel: Pid: 21117, comm: mdt02_008
Nov 6 20:22:21 perses-esf-mds001 kernel:
Nov 6 20:22:21 perses-esf-mds001 kernel: Call Trace:
Nov 6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa0aaa1a1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
Nov 6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa0e32230>] ? ldlm_expired_completion_wait+0x0/0x360 [ptlrpc]
Nov 6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa0e36df5>] ldlm_completion_ast+0x665/0x9a0 [ptlrpc]
Nov 6 20:22:21 perses-esf-mds001 kernel: [<ffffffff81065df0>] ? default_wake_function+0x0/0x20
Nov 6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa0e3619e>] ldlm_cli_enqueue_local+0x21e/0x810 [ptlrpc]
Nov 6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa0e36790>] ? ldlm_completion_ast+0x0/0x9a0 [ptlrpc]
Nov 6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa1559d70>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Nov 6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa156240d>] mdt_object_local_lock+0x1bd/0xa80 [mdt]
Nov 6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa1559d70>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Nov 6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa0e36790>] ? ldlm_completion_ast+0x0/0x9a0 [ptlrpc]
Nov 6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa1562d35>] mdt_object_lock_internal+0x65/0x360 [mdt]
Nov 6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa15630f4>] mdt_object_lock+0x14/0x20 [mdt]
Nov 6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa156dd6c>] mdt_getattr_name_lock+0xd9c/0x1a50 [mdt]
Nov 6 20:22:21 perses-esf-mds001 kernel: [<ffffffff8128a0ea>] ? strlcpy+0x4a/0x60
Nov 6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa0e63774>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc]
Nov 6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa0e65d40>] ? lustre_swab_ldlm_reply+0x0/0x40 [ptlrpc]
Nov 6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa156ef42>] mdt_intent_getattr+0x292/0x470 [mdt]
Nov 6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa155bc44>] mdt_intent_policy+0x494/0xcf0 [mdt]
Nov 6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa0e16549>] ldlm_lock_enqueue+0x129/0x9d0 [ptlrpc]
Nov 6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa0e4235b>] ldlm_handle_enqueue0+0x51b/0x1340 [ptlrpc]
Nov 6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa0a9a4ce>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Nov 6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa0ec3842>] tgt_enqueue+0x62/0x1d0 [ptlrpc]
Nov 6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa0ec40ce>] tgt_request_handle+0x71e/0xb10 [ptlrpc]
Nov 6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa0e73964>] ptlrpc_main+0xe64/0x1990 [ptlrpc]
Nov 6 20:22:21 perses-esf-mds001 kernel: [<ffffffff810096f0>] ? __switch_to+0xd0/0x320
Nov 6 20:22:21 perses-esf-mds001 kernel: [<ffffffff81527c20>] ? thread_return+0x4e/0x76e
Nov 6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa0e72b00>] ? ptlrpc_main+0x0/0x1990 [ptlrpc]
Nov 6 20:22:21 perses-esf-mds001 kernel: [<ffffffff8109aee6>] kthread+0x96/0xa0
Nov 6 20:22:21 perses-esf-mds001 kernel: [<ffffffff8100c20a>] child_rip+0xa/0x20
Nov 6 20:22:21 perses-esf-mds001 kernel: [<ffffffff8109ae50>] ? kthread+0x0/0xa0
Nov 6 20:22:21 perses-esf-mds001 kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20
Nov 6 20:22:21 perses-esf-mds001 kernel:



 Comments   
Comment by Patrick Farrell (Inactive) [ 07/Nov/14 ]

Partial Lustre logs and messages logs from both MDSes are here:
ftp.whamcloud.com/uploads/LU-5886/lustre_logs_and_messages.tar.gz

Comment by Jodi Levi (Inactive) [ 10/Nov/14 ]

Di,
Would you be able to comment on this one?
Thank you!

Comment by Di Wang [ 12/Nov/14 ]

According to the console log, both MDTs are waiting for the lock of a remote directory.

mds001

Nov  6 20:22:21 perses-esf-mds001 kernel: Call Trace:
Nov  6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa0aaa1a1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
Nov  6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa0e32230>] ? ldlm_expired_completion_wait+0x0/0x360 [ptlrpc]
Nov  6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa0e36df5>] ldlm_completion_ast+0x665/0x9a0 [ptlrpc]
Nov  6 20:22:21 perses-esf-mds001 kernel: [<ffffffff81065df0>] ? default_wake_function+0x0/0x20
Nov  6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa0e3619e>] ldlm_cli_enqueue_local+0x21e/0x810 [ptlrpc]
Nov  6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa0e36790>] ? ldlm_completion_ast+0x0/0x9a0 [ptlrpc]
Nov  6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa1559d70>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Nov  6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa156240d>] mdt_object_local_lock+0x1bd/0xa80 [mdt]
Nov  6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa1559d70>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Nov  6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa0e36790>] ? ldlm_completion_ast+0x0/0x9a0 [ptlrpc]
Nov  6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa1562d35>] mdt_object_lock_internal+0x65/0x360 [mdt]
Nov  6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa15630f4>] mdt_object_lock+0x14/0x20 [mdt]
Nov  6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa156dd6c>] mdt_getattr_name_lock+0xd9c/0x1a50 [mdt]
Nov  6 20:22:21 perses-esf-mds001 kernel: [<ffffffff8128a0ea>] ? strlcpy+0x4a/0x60
Nov  6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa0e63774>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc]
Nov  6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa0e65d40>] ? lustre_swab_ldlm_reply+0x0/0x40 [ptlrpc]
Nov  6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa156ef42>] mdt_intent_getattr+0x292/0x470 [mdt]
Nov  6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa155bc44>] mdt_intent_policy+0x494/0xcf0 [mdt]
Nov  6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa0e16549>] ldlm_lock_enqueue+0x129/0x9d0 [ptlrpc]
Nov  6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa0e4235b>] ldlm_handle_enqueue0+0x51b/0x1340 [ptlrpc]
Nov  6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa0a9a4ce>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Nov  6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa0ec3842>] tgt_enqueue+0x62/0x1d0 [ptlrpc]
Nov  6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa0ec40ce>] tgt_request_handle+0x71e/0xb10 [ptlrpc]
Nov  6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa0e73964>] ptlrpc_main+0xe64/0x1990 [ptlrpc]
Nov  6 20:22:21 perses-esf-mds001 kernel: [<ffffffff810096f0>] ? __switch_to+0xd0/0x320
Nov  6 20:22:21 perses-esf-mds001 kernel: [<ffffffff81527c20>] ? thread_return+0x4e/0x76e
Nov  6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa0e72b00>] ? ptlrpc_main+0x0/0x1990 [ptlrpc]
Nov  6 20:22:21 perses-esf-mds001 kernel: [<ffffffff8109aee6>] kthread+0x96/0xa0
Nov  6 20:22:21 perses-esf-mds001 kernel: [<ffffffff8100c20a>] child_rip+0xa/0x20
Nov  6 20:22:21 perses-esf-mds001 kernel: [<ffffffff8109ae50>] ? kthread+0x0/0xa0
Nov  6 20:22:21 perses-esf-mds001 kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20
Nov  6 20:22:21 perses-esf-mds001 kernel: 

mds002

Nov  6 20:22:16 perses-esf-mds002 kernel:  req@ffff8803330e1c00 x1484073163441132/t0(0) o36->ca4f794c-56f5-709e-3a90-0c25c515ac58@30@gni:0/0 lens 552/568 e 18 to 0 dl 1415326941 ref 2 fl Interpret:/0/0 rc 0/0
Nov  6 20:22:21 perses-esf-mds002 kernel: LNet: Service thread pid 10838 was inactive for 600.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Nov  6 20:22:21 perses-esf-mds002 kernel: Pid: 10838, comm: mdt02_000
Nov  6 20:22:21 perses-esf-mds002 kernel: 
Nov  6 20:22:21 perses-esf-mds002 kernel: Call Trace:
Nov  6 20:22:21 perses-esf-mds002 kernel: [<ffffffffa0e01230>] ? ldlm_expired_completion_wait+0x0/0x360 [ptlrpc]
Nov  6 20:22:21 perses-esf-mds002 kernel: [<ffffffffa0e05df5>] ldlm_completion_ast+0x665/0x9a0 [ptlrpc]
Nov  6 20:22:21 perses-esf-mds002 kernel: [<ffffffff81065df0>] ? default_wake_function+0x0/0x20
Nov  6 20:22:21 perses-esf-mds002 kernel: [<ffffffffa0dffe76>] ldlm_cli_enqueue_fini+0x926/0xe60 [ptlrpc]
Nov  6 20:22:21 perses-esf-mds002 kernel: [<ffffffffa0e00761>] ldlm_cli_enqueue+0x3b1/0x860 [ptlrpc]
Nov  6 20:22:21 perses-esf-mds002 kernel: [<ffffffffa0e05790>] ? ldlm_completion_ast+0x0/0x9a0 [ptlrpc]
Nov  6 20:22:21 perses-esf-mds002 kernel: [<ffffffffa1502b60>] ? mdt_remote_blocking_ast+0x0/0x210 [mdt]
Nov  6 20:22:21 perses-esf-mds002 kernel: [<ffffffffa16d94a8>] osp_md_object_lock+0x188/0x210 [osp]
Nov  6 20:22:21 perses-esf-mds002 kernel: [<ffffffffa15f70bf>] lod_object_lock+0x13f/0x730 [lod]
Nov  6 20:22:21 perses-esf-mds002 kernel: [<ffffffffa166913b>] mdd_object_lock+0x3b/0xd0 [mdd]
Nov  6 20:22:21 perses-esf-mds002 kernel: [<ffffffffa150b07c>] mdt_remote_object_lock+0x14c/0x320 [mdt]
Nov  6 20:22:21 perses-esf-mds002 kernel: [<ffffffffa150bdee>] mdt_object_lock_internal+0x11e/0x360 [mdt]
Nov  6 20:22:21 perses-esf-mds002 kernel: [<ffffffffa150c0f4>] mdt_object_lock+0x14/0x20 [mdt]
Nov  6 20:22:21 perses-esf-mds002 kernel: [<ffffffffa1526f80>] mdt_reint_create+0x700/0xd40 [mdt]
Nov  6 20:22:21 perses-esf-mds002 kernel: [<ffffffffa0c216b0>] ? lu_ucred+0x20/0x30 [obdclass]
Nov  6 20:22:21 perses-esf-mds002 kernel: [<ffffffffa1501ed5>] ? mdt_ucred+0x15/0x20 [mdt]
Nov  6 20:22:21 perses-esf-mds002 kernel: [<ffffffffa151deec>] ? mdt_root_squash+0x2c/0x3f0 [mdt]
Nov  6 20:22:21 perses-esf-mds002 kernel: [<ffffffffa0e57a56>] ? __req_capsule_get+0x166/0x6e0 [ptlrpc]
Nov  6 20:22:21 perses-esf-mds002 kernel: [<ffffffff8128a0ea>] ? strlcpy+0x4a/0x60
Nov  6 20:22:21 perses-esf-mds002 kernel: [<ffffffffa15220fd>] mdt_reint_rec+0x5d/0x200 [mdt]
Nov  6 20:22:21 perses-esf-mds002 kernel: [<ffffffffa15061a3>] mdt_reint_internal+0x4d3/0x7b0 [mdt]
Nov  6 20:22:21 perses-esf-mds002 kernel: [<ffffffffa1506a0b>] mdt_reint+0x6b/0x120 [mdt]
Nov  6 20:22:21 perses-esf-mds002 kernel: [<ffffffffa0e930ce>] tgt_request_handle+0x71e/0xb10 [ptlrpc]
Nov  6 20:22:21 perses-esf-mds002 kernel: [<ffffffffa0e42964>] ptlrpc_main+0xe64/0x1990 [ptlrpc]
Nov  6 20:22:21 perses-esf-mds002 kernel: [<ffffffff810096f0>] ? __switch_to+0xd0/0x320
Nov  6 20:22:21 perses-esf-mds002 kernel: [<ffffffff81527c20>] ? thread_return+0x4e/0x76e
Nov  6 20:22:21 perses-esf-mds002 kernel: [<ffffffffa0e41b00>] ? ptlrpc_main+0x0/0x1990 [ptlrpc]
Nov  6 20:22:21 perses-esf-mds002 kernel: [<ffffffff8109aee6>] kthread+0x96/0xa0
Nov  6 20:22:21 perses-esf-mds002 kernel: [<ffffffff8100c20a>] child_rip+0xa/0x20
Nov  6 20:22:21 perses-esf-mds002 kernel: [<ffffffff8109ae50>] ? kthread+0x0/0xa0
Nov  6 20:22:21 perses-esf-mds002 kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20

Unfortunately, the debug log you posted is about 1 hour earlier, could you be able to find these debug log

Nov  6 20:22:21 perses-esf-mds002 kernel: 
Nov  6 20:22:21 perses-esf-mds002 kernel: LustreError: dumping log to /tmp/lustre-log.1415326941.10838
Nov  6 20:22:21 perses-esf-mds001 kernel: LustreError: dumping log to /tmp/lustre-log.1415326941.21117
Nov  6 20:24:27 perses-esf-mds001 kernel: Lustre: lock timed out (enqueued at 1415326767, 300s ago)

Though I am not sure there are valuable information there, but worth try. Thanks.

Comment by Di Wang [ 26/Aug/15 ]

Patrick, could you please try current master to see if you can still reproduce the problem? Thanks.

Generated at Sat Feb 10 01:55:22 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.