[LU-5886] DNE II Testing: Hung threads on MDS; evictions Created: 07/Nov/14 Updated: 10/Oct/21 Resolved: 10/Oct/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Patrick Farrell (Inactive) | Assignee: | Di Wang |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Versions of 2.6.54 on clients & servers. Most recent commit on clients: Most recent commit on servers: |
||
| Severity: | 3 |
| Rank (Obsolete): | 16457 |
| Description |
|
During DNE II testing (same run as |
| Comments |
| Comment by Patrick Farrell (Inactive) [ 07/Nov/14 ] |
|
Partial Lustre logs and "messages" logs from both MDSes are here: |
| Comment by Jodi Levi (Inactive) [ 10/Nov/14 ] |
|
Di, |
| Comment by Di Wang [ 12/Nov/14 ] |
|
According to the console log, both MDTs are waiting for the lock of a remote directory.

mds001

Nov 6 20:22:21 perses-esf-mds001 kernel: Call Trace:
Nov 6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa0aaa1a1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
Nov 6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa0e32230>] ? ldlm_expired_completion_wait+0x0/0x360 [ptlrpc]
Nov 6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa0e36df5>] ldlm_completion_ast+0x665/0x9a0 [ptlrpc]
Nov 6 20:22:21 perses-esf-mds001 kernel: [<ffffffff81065df0>] ? default_wake_function+0x0/0x20
Nov 6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa0e3619e>] ldlm_cli_enqueue_local+0x21e/0x810 [ptlrpc]
Nov 6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa0e36790>] ? ldlm_completion_ast+0x0/0x9a0 [ptlrpc]
Nov 6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa1559d70>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Nov 6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa156240d>] mdt_object_local_lock+0x1bd/0xa80 [mdt]
Nov 6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa1559d70>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Nov 6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa0e36790>] ? ldlm_completion_ast+0x0/0x9a0 [ptlrpc]
Nov 6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa1562d35>] mdt_object_lock_internal+0x65/0x360 [mdt]
Nov 6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa15630f4>] mdt_object_lock+0x14/0x20 [mdt]
Nov 6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa156dd6c>] mdt_getattr_name_lock+0xd9c/0x1a50 [mdt]
Nov 6 20:22:21 perses-esf-mds001 kernel: [<ffffffff8128a0ea>] ? strlcpy+0x4a/0x60
Nov 6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa0e63774>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc]
Nov 6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa0e65d40>] ? lustre_swab_ldlm_reply+0x0/0x40 [ptlrpc]
Nov 6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa156ef42>] mdt_intent_getattr+0x292/0x470 [mdt]
Nov 6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa155bc44>] mdt_intent_policy+0x494/0xcf0 [mdt]
Nov 6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa0e16549>] ldlm_lock_enqueue+0x129/0x9d0 [ptlrpc]
Nov 6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa0e4235b>] ldlm_handle_enqueue0+0x51b/0x1340 [ptlrpc]
Nov 6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa0a9a4ce>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Nov 6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa0ec3842>] tgt_enqueue+0x62/0x1d0 [ptlrpc]
Nov 6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa0ec40ce>] tgt_request_handle+0x71e/0xb10 [ptlrpc]
Nov 6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa0e73964>] ptlrpc_main+0xe64/0x1990 [ptlrpc]
Nov 6 20:22:21 perses-esf-mds001 kernel: [<ffffffff810096f0>] ? __switch_to+0xd0/0x320
Nov 6 20:22:21 perses-esf-mds001 kernel: [<ffffffff81527c20>] ? thread_return+0x4e/0x76e
Nov 6 20:22:21 perses-esf-mds001 kernel: [<ffffffffa0e72b00>] ? ptlrpc_main+0x0/0x1990 [ptlrpc]
Nov 6 20:22:21 perses-esf-mds001 kernel: [<ffffffff8109aee6>] kthread+0x96/0xa0
Nov 6 20:22:21 perses-esf-mds001 kernel: [<ffffffff8100c20a>] child_rip+0xa/0x20
Nov 6 20:22:21 perses-esf-mds001 kernel: [<ffffffff8109ae50>] ? kthread+0x0/0xa0
Nov 6 20:22:21 perses-esf-mds001 kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20
Nov 6 20:22:21 perses-esf-mds001 kernel:

mds002

Nov 6 20:22:16 perses-esf-mds002 kernel: req@ffff8803330e1c00 x1484073163441132/t0(0) o36->ca4f794c-56f5-709e-3a90-0c25c515ac58@30@gni:0/0 lens 552/568 e 18 to 0 dl 1415326941 ref 2 fl Interpret:/0/0 rc 0/0
Nov 6 20:22:21 perses-esf-mds002 kernel: LNet: Service thread pid 10838 was inactive for 600.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Nov 6 20:22:21 perses-esf-mds002 kernel: Pid: 10838, comm: mdt02_000
Nov 6 20:22:21 perses-esf-mds002 kernel:
Nov 6 20:22:21 perses-esf-mds002 kernel: Call Trace:
Nov 6 20:22:21 perses-esf-mds002 kernel: [<ffffffffa0e01230>] ? ldlm_expired_completion_wait+0x0/0x360 [ptlrpc]
Nov 6 20:22:21 perses-esf-mds002 kernel: [<ffffffffa0e05df5>] ldlm_completion_ast+0x665/0x9a0 [ptlrpc]
Nov 6 20:22:21 perses-esf-mds002 kernel: [<ffffffff81065df0>] ? default_wake_function+0x0/0x20
Nov 6 20:22:21 perses-esf-mds002 kernel: [<ffffffffa0dffe76>] ldlm_cli_enqueue_fini+0x926/0xe60 [ptlrpc]
Nov 6 20:22:21 perses-esf-mds002 kernel: [<ffffffffa0e00761>] ldlm_cli_enqueue+0x3b1/0x860 [ptlrpc]
Nov 6 20:22:21 perses-esf-mds002 kernel: [<ffffffffa0e05790>] ? ldlm_completion_ast+0x0/0x9a0 [ptlrpc]
Nov 6 20:22:21 perses-esf-mds002 kernel: [<ffffffffa1502b60>] ? mdt_remote_blocking_ast+0x0/0x210 [mdt]
Nov 6 20:22:21 perses-esf-mds002 kernel: [<ffffffffa16d94a8>] osp_md_object_lock+0x188/0x210 [osp]
Nov 6 20:22:21 perses-esf-mds002 kernel: [<ffffffffa15f70bf>] lod_object_lock+0x13f/0x730 [lod]
Nov 6 20:22:21 perses-esf-mds002 kernel: [<ffffffffa166913b>] mdd_object_lock+0x3b/0xd0 [mdd]
Nov 6 20:22:21 perses-esf-mds002 kernel: [<ffffffffa150b07c>] mdt_remote_object_lock+0x14c/0x320 [mdt]
Nov 6 20:22:21 perses-esf-mds002 kernel: [<ffffffffa150bdee>] mdt_object_lock_internal+0x11e/0x360 [mdt]
Nov 6 20:22:21 perses-esf-mds002 kernel: [<ffffffffa150c0f4>] mdt_object_lock+0x14/0x20 [mdt]
Nov 6 20:22:21 perses-esf-mds002 kernel: [<ffffffffa1526f80>] mdt_reint_create+0x700/0xd40 [mdt]
Nov 6 20:22:21 perses-esf-mds002 kernel: [<ffffffffa0c216b0>] ? lu_ucred+0x20/0x30 [obdclass]
Nov 6 20:22:21 perses-esf-mds002 kernel: [<ffffffffa1501ed5>] ? mdt_ucred+0x15/0x20 [mdt]
Nov 6 20:22:21 perses-esf-mds002 kernel: [<ffffffffa151deec>] ? mdt_root_squash+0x2c/0x3f0 [mdt]
Nov 6 20:22:21 perses-esf-mds002 kernel: [<ffffffffa0e57a56>] ? __req_capsule_get+0x166/0x6e0 [ptlrpc]
Nov 6 20:22:21 perses-esf-mds002 kernel: [<ffffffff8128a0ea>] ? strlcpy+0x4a/0x60
Nov 6 20:22:21 perses-esf-mds002 kernel: [<ffffffffa15220fd>] mdt_reint_rec+0x5d/0x200 [mdt]
Nov 6 20:22:21 perses-esf-mds002 kernel: [<ffffffffa15061a3>] mdt_reint_internal+0x4d3/0x7b0 [mdt]
Nov 6 20:22:21 perses-esf-mds002 kernel: [<ffffffffa1506a0b>] mdt_reint+0x6b/0x120 [mdt]
Nov 6 20:22:21 perses-esf-mds002 kernel: [<ffffffffa0e930ce>] tgt_request_handle+0x71e/0xb10 [ptlrpc]
Nov 6 20:22:21 perses-esf-mds002 kernel: [<ffffffffa0e42964>] ptlrpc_main+0xe64/0x1990 [ptlrpc]
Nov 6 20:22:21 perses-esf-mds002 kernel: [<ffffffff810096f0>] ? __switch_to+0xd0/0x320
Nov 6 20:22:21 perses-esf-mds002 kernel: [<ffffffff81527c20>] ? thread_return+0x4e/0x76e
Nov 6 20:22:21 perses-esf-mds002 kernel: [<ffffffffa0e41b00>] ? ptlrpc_main+0x0/0x1990 [ptlrpc]
Nov 6 20:22:21 perses-esf-mds002 kernel: [<ffffffff8109aee6>] kthread+0x96/0xa0
Nov 6 20:22:21 perses-esf-mds002 kernel: [<ffffffff8100c20a>] child_rip+0xa/0x20
Nov 6 20:22:21 perses-esf-mds002 kernel: [<ffffffff8109ae50>] ? kthread+0x0/0xa0
Nov 6 20:22:21 perses-esf-mds002 kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20
Unfortunately, the debug log you posted is from about one hour earlier. Could you find the debug logs referenced here?

Nov 6 20:22:21 perses-esf-mds002 kernel:
Nov 6 20:22:21 perses-esf-mds002 kernel: LustreError: dumping log to /tmp/lustre-log.1415326941.10838
Nov 6 20:22:21 perses-esf-mds001 kernel: LustreError: dumping log to /tmp/lustre-log.1415326941.21117
Nov 6 20:24:27 perses-esf-mds001 kernel: Lustre: lock timed out (enqueued at 1415326767, 300s ago)

I am not sure there is valuable information in them, but it is worth a try. Thanks. |
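For context on the analysis above: the mds001 stack is blocked in ldlm_cli_enqueue_local (waiting for one of its own locks), while the mds002 stack is blocked in osp_md_object_lock -> ldlm_cli_enqueue (waiting for a lock held on the other MDT). That is consistent with, though not proof of, a cross-MDT ABBA lock-ordering wait. The following minimal sketch illustrates that general pattern only; it uses plain pthreads with invented names ("mdt0"/"mdt1") and is not Lustre code.

/*
 * Analogy only: two threads each take "their" lock first, then wait
 * for the other one, so both block forever once the windows overlap.
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock_on_mdt0 = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_on_mdt1 = PTHREAD_MUTEX_INITIALIZER;

/* "mds001": local lock first (getattr path), then the remote one. */
static void *mdt0_request(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&lock_on_mdt0);      /* local lock */
        sleep(1);                               /* widen the race window */
        pthread_mutex_lock(&lock_on_mdt1);      /* "remote" lock: never granted */
        pthread_mutex_unlock(&lock_on_mdt1);
        pthread_mutex_unlock(&lock_on_mdt0);
        return NULL;
}

/* "mds002": the opposite order (create path needing a remote directory). */
static void *mdt1_request(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&lock_on_mdt1);      /* local lock */
        sleep(1);
        pthread_mutex_lock(&lock_on_mdt0);      /* "remote" lock: never granted */
        pthread_mutex_unlock(&lock_on_mdt0);
        pthread_mutex_unlock(&lock_on_mdt1);
        return NULL;
}

int main(void)
{
        pthread_t t0, t1;

        pthread_create(&t0, NULL, mdt0_request, NULL);
        pthread_create(&t1, NULL, mdt1_request, NULL);
        pthread_join(t0, NULL);                 /* never returns: ABBA deadlock */
        pthread_join(t1, NULL);
        printf("not reached\n");
        return 0;
}

If something of this shape happened here, the binary dumps named in the console log (/tmp/lustre-log.1415326941.10838 and /tmp/lustre-log.1415326941.21117) may show which resource each enqueue was waiting on; assuming the files were preserved on the MDS nodes, they can normally be converted to readable text with "lctl debug_file <input> <output>".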
| Comment by Di Wang [ 26/Aug/15 ] |
|
Patrick, could you please try current master to see if you can still reproduce the problem? Thanks. |