Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • Affects Version/s: Lustre 2.7.0, Lustre 2.5.3
    • Fix Version/s: None
    • Severity: 3

    Description

      An LNet service thread hung on the MDT. The dumped stack traces follow:

      Jan 20 13:42:10 nbp8-mds1 kernel: LNet: Service thread pid 17862 was inactive for 424.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Jan 20 13:42:10 nbp8-mds1 kernel: Pid: 17862, comm: mdt00_057
      Jan 20 13:42:14 nbp8-mds1 kernel: 
      Jan 20 13:42:14 nbp8-mds1 kernel: Call Trace:
      Jan 20 13:42:14 nbp8-mds1 kernel: [<ffffffff8155a9c2>] schedule_timeout+0x192/0x2e0
      Jan 20 13:42:14 nbp8-mds1 kernel: [<ffffffff81083300>] ? process_timeout+0x0/0x10
      Jan 20 13:42:14 nbp8-mds1 kernel: [<ffffffffa078af70>] ? ldlm_expired_completion_wait+0x0/0x360 [ptlrpc]
      Jan 20 13:42:14 nbp8-mds1 kernel: [<ffffffffa078f7a1>] ldlm_completion_ast+0x4b1/0x920 [ptlrpc]
      Jan 20 13:42:14 nbp8-mds1 kernel: [<ffffffff81061fe0>] ? default_wake_function+0x0/0x20
      Jan 20 13:42:14 nbp8-mds1 kernel: [<ffffffffa078ef00>] ldlm_cli_enqueue_local+0x1f0/0x5e0 [ptlrpc]
      Jan 20 13:42:14 nbp8-mds1 kernel: [<ffffffffa078f2f0>] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]
      Jan 20 13:42:14 nbp8-mds1 kernel: [<ffffffffa0e72de0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa0e7cde4>] mdt_object_lock0+0x394/0xb30 [mdt]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa0e72de0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa078f2f0>] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa0e7d644>] mdt_object_lock+0x14/0x20 [mdt]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa0e7d801>] mdt_object_find_lock+0x61/0x170 [mdt]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa0ea982c>] mdt_reint_open+0x88c/0x21a0 [mdt]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa04fb5d6>] ? upcall_cache_get_entry+0x296/0x880 [libcfs]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa0652ef0>] ? lu_ucred+0x20/0x30 [obdclass]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa0e71935>] ? mdt_ucred+0x15/0x20 [mdt]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa0e8e51c>] ? mdt_root_squash+0x2c/0x3f0 [mdt]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa07df766>] ? __req_capsule_get+0x166/0x710 [ptlrpc]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa0e92481>] mdt_reint_rec+0x41/0xe0 [mdt]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa0e77ed3>] mdt_reint_internal+0x4c3/0x780 [mdt]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa0e7845e>] mdt_intent_reint+0x1ee/0x410 [mdt]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa0e75c3e>] mdt_intent_policy+0x3ae/0x770 [mdt]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa076f2c5>] ldlm_lock_enqueue+0x135/0x980 [ptlrpc]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa0798ebb>] ldlm_handle_enqueue0+0x51b/0x10c0 [ptlrpc]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa0e76106>] mdt_enqueue+0x46/0xe0 [mdt]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa0e7aada>] mdt_handle_common+0x52a/0x1470 [mdt]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa0eb74a5>] mds_regular_handle+0x15/0x20 [mdt]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa07c80c5>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa04f08d5>] ? lc_watchdog_touch+0x65/0x170 [libcfs]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa07c0a69>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa07ca89d>] ptlrpc_main+0xafd/0x1780 [ptlrpc]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffff8100c20a>] child_rip+0xa/0x20
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa07c9da0>] ? ptlrpc_main+0x0/0x1780 [ptlrpc]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20
      
      Jan 20 13:42:18 nbp8-mds1 kernel: Pid: 17780, comm: mdt00_044
      Jan 20 13:42:18 nbp8-mds1 kernel: 
      Jan 20 13:42:18 nbp8-mds1 kernel: Call Trace:
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffff8155a9c2>] schedule_timeout+0x192/0x2e0
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffff81083300>] ? process_timeout+0x0/0x10
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa078af70>] ? ldlm_expired_completion_wait+0x0/0x360 [ptlrpc]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa078f7a1>] ldlm_completion_ast+0x4b1/0x920 [ptlrpc]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffff81061fe0>] ? default_wake_function+0x0/0x20
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa078ef00>] ldlm_cli_enqueue_local+0x1f0/0x5e0 [ptlrpc]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa078f2f0>] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa0e72de0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa0e7cc06>] mdt_object_lock0+0x1b6/0xb30 [mdt]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa0e72de0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa078f2f0>] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa0e7d644>] mdt_object_lock+0x14/0x20 [mdt]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa0e85b8e>] mdt_getattr_name_lock+0x8fe/0x19d0 [mdt]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa07df766>] ? __req_capsule_get+0x166/0x710 [ptlrpc]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa07ba7b4>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa0e86ef9>] mdt_intent_getattr+0x299/0x480 [mdt]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa0e75c3e>] mdt_intent_policy+0x3ae/0x770 [mdt]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa076f2c5>] ldlm_lock_enqueue+0x135/0x980 [ptlrpc]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa0798ebb>] ldlm_handle_enqueue0+0x51b/0x10c0 [ptlrpc]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa0e76106>] mdt_enqueue+0x46/0xe0 [mdt]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa0e7aada>] mdt_handle_common+0x52a/0x1470 [mdt]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa0eb74a5>] mds_regular_handle+0x15/0x20 [mdt]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa07c80c5>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffff81061ff2>] ? default_wake_function+0x12/0x20
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa07ca89d>] ptlrpc_main+0xafd/0x1780 [ptlrpc]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffff8100c20a>] child_rip+0xa/0x20
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa07c9da0>] ? ptlrpc_main+0x0/0x1780 [ptlrpc]
      Jan 20 13:42:19 nbp8-mds1 kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20
      Jan 20 13:42:19 nbp8-mds1 kernel: 
      Jan 20 13:42:19 nbp8-mds1 kernel: Pid: 16840, comm: mdt03_017
      Jan 20 13:42:19 nbp8-mds1 kernel: 
      Jan 20 13:42:19 nbp8-mds1 kernel: Call Trace:
      Jan 20 13:42:19 nbp8-mds1 kernel: [<ffffffffa076a885>] ? _ldlm_lock_debug+0x2d5/0x660 [ptlrpc]
      Jan 20 13:42:19 nbp8-mds1 kernel: [<ffffffff8155a9c2>] schedule_timeout+0x192/0x2e0
      Jan 20 13:42:19 nbp8-mds1 kernel: [<ffffffff81083300>] ? process_timeout+0x0/0x10
      Jan 20 13:42:19 nbp8-mds1 kernel: [<ffffffffa078af70>] ? ldlm_expired_completion_wait+0x0/0x360 [ptlrpc]
      Jan 20 13:42:19 nbp8-mds1 kernel: [<ffffffffa078f7a1>] ldlm_completion_ast+0x4b1/0x920 [ptlrpc]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffff81061fe0>] ? default_wake_function+0x0/0x20
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa078ef00>] ldlm_cli_enqueue_local+0x1f0/0x5e0 [ptlrpc]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa078f2f0>] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa0e72de0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa0e7cc06>] mdt_object_lock0+0x1b6/0xb30 [mdt]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa0e72de0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa078f2f0>] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa0e7d644>] mdt_object_lock+0x14/0x20 [mdt]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa0e85b8e>] mdt_getattr_name_lock+0x8fe/0x19d0 [mdt]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa07df766>] ? __req_capsule_get+0x166/0x710 [ptlrpc]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa07ba7b4>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa0e86ef9>] mdt_intent_getattr+0x299/0x480 [mdt]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa0e75c3e>] mdt_intent_policy+0x3ae/0x770 [mdt]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa076f2c5>] ldlm_lock_enqueue+0x135/0x980 [ptlrpc]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa0798ebb>] ldlm_handle_enqueue0+0x51b/0x10c0 [ptlrpc]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa0e76106>] mdt_enqueue+0x46/0xe0 [mdt]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa0e7aada>] mdt_handle_common+0x52a/0x1470 [mdt]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa0eb74a5>] mds_regular_handle+0x15/0x20 [mdt]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa07c80c5>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa04f08d5>] ? lc_watchdog_touch+0x65/0x170 [libcfs]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa07c0a69>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa07ca89d>] ptlrpc_main+0xafd/0x1780 [ptlrpc]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffff8100c20a>] child_rip+0xa/0x20
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa07c9da0>] ? ptlrpc_main+0x0/0x1780 [ptlrpc]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20
      

      Looks the same as LU-7232

      I can upload debug logs if needed.

      Attachments

        1. lustre-log.1453326130.17862.gz
          0.3 kB
        2. mds.dmesg
          237 kB
        3. mds.processtrace
          2.01 MB

        Issue Links

          Activity

            [LU-7692] LNet: Service thread Hung

            I need to understand the problem first and then propose a solution.

            jay Jinshan Xiong (Inactive) added a comment

            Can you provide any additional info on a permanent solution?

            mhanafi Mahmoud Hanafi added a comment
            jay Jinshan Xiong (Inactive) added a comment - edited

            From the backtrace, ALL MDT threads were blocked at LDLM lock enqueue. This implies a deadlock: the DLM lock blocking those MDT threads may have been canceled by ELC, with the cancel piggy-backed on an RPC request that could not be served for lack of MDT threads. This is quite possible given the behavior of that user's application.

            Let's try disabling ELC by running this:

            lctl set_param ldlm.namespaces.*-mdc-*.early_lock_cancel=0
            

            on all client nodes.

            This may have a negative impact on performance, so it definitely should not be tried at peak hours.
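
            For reference, a minimal sketch of checking and applying this setting across the fleet, assuming pdsh with a hypothetical "clients" host group is available (the fan-out tool and group name are site-specific):

            # Check the current value on a client (1 = ELC enabled)
            lctl get_param ldlm.namespaces.*-mdc-*.early_lock_cancel
            # Disable ELC for the MDC namespaces on this client
            lctl set_param ldlm.namespaces.*-mdc-*.early_lock_cancel=0
            # Push the change to all clients at once (illustrative)
            pdsh -g clients 'lctl set_param ldlm.namespaces.*-mdc-*.early_lock_cancel=0'

            Note that values changed with lctl set_param are not persistent across a client remount, so the setting would need to be reapplied afterwards.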

            One thing I have noticed is this message from the MDS:

            <6>Lustre: nbp8-MDT0000: Recovery over after 2:29, of 12307 clients 12307 recovered and 0 were evicted.
            

            12307 clients were recovered in 2 minutes 29 seconds; that is really impressive.


            Attaching the process trace and dmesg output.

            mds.processtrace
            mds.dmesg

            mhanafi Mahmoud Hanafi added a comment

            Hi Jay,

            can you please extract the log of all processes from the crash dump and post it here?

            This doesn't look like a duplicate of LU-7372.

            From the backtrace, the MDT thread is doing an open; it is also conflicting with a PR lock with LOOKUP|UPDATE|LAYOUT bits, so the evicted client had opened the file for execution. However, I don't know what type of lock the stuck MDT thread was requesting.

            jay Jinshan Xiong (Inactive) added a comment
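
            For reference, one way such a per-process trace can be produced from a crash dump is with the crash utility; a minimal sketch, assuming a debuginfo vmlinux matching the running kernel is available (paths and the output file name are illustrative):

            # Open the dump with matching debug symbols
            crash /usr/lib/debug/lib/modules/<kernel-version>/vmlinux /var/crash/vmcore
            # Inside crash, dump a backtrace of every task and redirect it to a file
            crash> foreach bt > processtrace.txt

            On a live system, "echo t > /proc/sysrq-trigger" dumps all task stacks to the kernel log instead.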

            I checked the difference between patch set 3 and patch set 4 of
            http://review.whamcloud.com/#/c/17853/
            There is no difference; they are the same.

            Assuming
            http://review.whamcloud.com/17976 is a b2_5_fe port of 17853 patch set 3, and given that we already have that patch on our MDS/MGS, the patch did not address our problem.

            jaylan Jay Lan (Inactive) added a comment
            jaylan Jay Lan (Inactive) added a comment - edited

            I just posted a comment to LU-7372:

            "
            Zhenyu Xu ported an earlier version (patch set 3, IIRC) of #17853 for b2_5_fe at
            http://review.whamcloud.com/17976 (see comment at 12/Jan/16 9:26 PM)

            Could someone from Intel update 17976 to the latest patch set #4 for us? Thanks!
            "


            I always wonder how I can tell which patch set of a review was cherry-picked into our tree. Note that we cherry-picked some patches that are not in the b2_5_fe repo, so the commit id differs from the one I cherry-picked from. I have been judging by the date of the cherry-pick and by examining the code. Is there a good way to tell?
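
            For reference, a possible way to check this, assuming the change can be fetched from the Gerrit instance on review.whamcloud.com (the project path below is an assumption): Gerrit keeps every patch set under refs/changes/<last two digits of the change number>/<change number>/<patch set>, so each one can be fetched and compared against the local tree regardless of commit ids.

            # Fetch patch set 4 of change 17853 (refs/changes/53/17853/4)
            git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/53/17853/4
            git show FETCH_HEAD > ps4.patch
            # If the patch reverse-applies cleanly, its content is already in the tree
            git apply --reverse --check ps4.patch && echo "patch set 4 is present" || echo "patch set 4 is not (fully) present"

            Repeating this for each patch set of interest identifies which one matches the cherry-picked content, independent of the commit id.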

            As for http://review.whamcloud.com/#/c/17853/, I picked patch set #2 of 17853 (see LU-7640), but the MDS still hung. I noticed that the latest patch set is #4. I will cherry-pick #4 when it passes your internal testing.

            jaylan Jay Lan (Inactive) added a comment
            bobijam Zhenyu Xu added a comment

            I think the patch http://review.whamcloud.com/#/c/17853/ addresses the MDT thread lockup issue, where the thread never comes out of the ldlm_completion_ast() wait.

            Jan 21 13:17:38 nbp8-mds1 kernel: [<ffffffffa076a885>] ? _ldlm_lock_debug+0x2d5/0x660 [ptlrpc]
            Jan 21 13:17:38 nbp8-mds1 kernel: [<ffffffff8155a9c2>] schedule_timeout+0x192/0x2e0
            Jan 21 13:17:38 nbp8-mds1 kernel: [<ffffffff81083300>] ? process_timeout+0x0/0x10
            Jan 21 13:17:38 nbp8-mds1 kernel: [<ffffffffa078af70>] ? ldlm_expired_completion_wait+0x0/0x360 [ptlrpc]
            Jan 21 13:17:38 nbp8-mds1 kernel: [<ffffffffa078f7a1>] ldlm_completion_ast+0x4b1/0x920 [ptlrpc]
            ...
            

            ldlm_expired_completion_wait() should return -ETIMEDOUT instead of 0; otherwise ldlm_completion_ast() never finishes and keeps waiting.


            This issue is triggered when our remote cluster loses connectivity to the servers. We are trying to find the root cause of the network errors, but the MDT threads shouldn't lock up.

            mhanafi Mahmoud Hanafi added a comment
            green Oleg Drokin added a comment

            The relationship between the message and the eviction is more or less direct.

            What happens here is:
            1. A client gets a lock and starts to do IO under the lock.
            2. Somebody else wants the lock, so it asks the above client to finish the IO and drop the lock.
            3. The IO encounters a timeout, and the client is evicted.
            4. The lock is cancelled.
            5. If some thread spent a long time waiting for the lock, you'll get the message.

            Now in your case the eviction is on one server (an OST), but the message is on another (the MDT). They should not really influence each other as such.
            But if some message dropping is going on and the blocking AST is lost in transit, you'll see exactly this: "lock callback timer expired after" on the server, which then evicts the offending client. The thread that was waiting for the lock is then able to get it, do whatever processing it needed, and finally complain that the processing took too long, all thanks to the long wait to get the lock.


            People

              Assignee:
              bobijam Zhenyu Xu
              Reporter:
              mhanafi Mahmoud Hanafi
              Votes:
              0
              Watchers:
              8
