
conf-sanity test_23b timed out during mount: RPCs in "Unregistering" phase found

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major

    Description

      This issue was created by maloo for Andreas Dilger <andreas.dilger@intel.com>

      This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/f71e120a-4a59-11e2-8523-52540035b04c.

      The sub-test test_23b failed with the following errors in the client console log:

      15:05:02:Lustre: *** cfs_fail_loc=313, val=0***
      15:05:02:LustreError: 26376:0:(llite_lib.c:537:client_common_fill_super()) cannot start close thread: rc -4
      15:05:23:LustreError: 25912:0:(import.c:324:ptlrpc_invalidate_import()) lustre-OST0000_UUID: rc = -110 waiting for callback (1 != 0)
      15:05:23:LustreError: 25912:0:(import.c:350:ptlrpc_invalidate_import()) @@@ still on sending list req@ffff880077b10000 x1421825299972112/t0(0) o8->lustre-OST0000-osc-ffff8800767e7000@10.10.4.147@tcp:28/4 lens 400/264 e 0 to 0 dl 1355958298 ref 1 fl Interpret:REN/0/0 rc -5/0
      15:05:23:LustreError: 25912:0:(import.c:366:ptlrpc_invalidate_import()) lustre-OST0000_UUID: RPCs in "Unregistering" phase found (0). Network is sluggish? Waiting them to error out.
      [the above messages repeat]
      15:08:35:INFO: task mount.lustre:26376 blocked for more than 120 seconds.
      15:08:35:mount.lustre D 0000000000000000 0 26376 26375 0x00000080
      15:08:35: ffff880076a8d7e8 0000000000000086 0000000050d24816 ffff880076a8d7b8
      15:08:35: 0000000000000000 ffff88007b5e1ba0 ffffffffa0717f80 0000000000000054
      15:08:35: ffff88007cd765f8 ffff880076a8dfd8 000000000000fb88 ffff88007cd765f8
      15:08:35:Call Trace:
      15:08:35: [<ffffffff81500605>] rwsem_down_failed_common+0x95/0x1d0
      15:08:35: [<ffffffff81500763>] rwsem_down_write_failed+0x23/0x30
      15:08:35: [<ffffffff8127ea53>] call_rwsem_down_write_failed+0x13/0x20
      15:08:35: [<ffffffff814ffc62>] ? down_write+0x32/0x40
      15:08:35: [<ffffffffa0642061>] client_disconnect_export+0x61/0x460 [ptlrpc]
      15:08:35: [<ffffffffa0877335>] osc_disconnect+0xa5/0x2a0 [osc]
      15:08:35: [<ffffffffa0504ff7>] class_disconnect_export_list+0x337/0x670 [obdclass]
      15:08:35: [<ffffffffa05057b6>] class_disconnect_exports+0x116/0x2f0 [obdclass]
      15:08:35: [<ffffffffa052e60f>] class_cleanup+0x16f/0xdc0 [obdclass]
      15:08:35: [<ffffffffa0502fd6>] ? class_name2dev+0x56/0xe0 [obdclass]
      15:08:35: [<ffffffffa0530305>] class_process_config+0x10a5/0x1ca0 [obdclass]
      15:08:35: [<ffffffffa07a2028>] ? libcfs_log_return+0x28/0x40 [libcfs]
      15:08:35: [<ffffffffa0529b01>] ? lustre_cfg_new+0x391/0x7e0 [obdclass]
      15:08:35: [<ffffffffa0531079>] class_manual_cleanup+0x179/0x6f0 [obdclass]
      15:08:35: [<ffffffffa0a70f40>] ll_put_super+0xf0/0x370 [lustre]
      15:08:35: [<ffffffffa0a7bf29>] ll_fill_super+0x7f9/0x1500 [lustre]
      15:08:35: [<ffffffffa05477cc>] lustre_fill_super+0x12c/0x1b00 [obdclass]
      15:08:35: [<ffffffff8117d256>] ? set_anon_super+0x56/0x100
      15:08:36: [<ffffffff8117e123>] ? sget+0x3e3/0x480
      15:08:36: [<ffffffff8117d200>] ? set_anon_super+0x0/0x100
      15:08:36: [<ffffffffa05476a0>] ? lustre_fill_super+0x0/0x1b00 [obdclass]
      15:08:36: [<ffffffff8117e66f>] get_sb_nodev+0x5f/0xa0
      15:08:36: [<ffffffffa0532d15>] lustre_get_sb+0x25/0x30 [obdclass]
      15:08:36: [<ffffffff8117e2cb>] vfs_kern_mount+0x7b/0x1b0
      15:08:36: [<ffffffff8117e472>] do_kern_mount+0x52/0x130
      15:08:36: [<ffffffff8119cb42>] do_mount+0x2d2/0x8d0
      15:08:36: [<ffffffff8119d1d0>] sys_mount+0x90/0xe0

      Info required for matching: conf-sanity 23b


          Activity


            We saw a burst of the debug messages on March 10th, 2015:

            2015-03-09 17:16:52 Lustre: lcrash-MDT0000-mdc-ffff8804462fd400: Connection to lcrash-MDT0000 (at 10.1.1.120@o2ib9) was lost; in progress operations using this service will wait for recovery to complete
            2015-03-09 17:16:52 Lustre: Skipped 3449155 previous similar messages
            2015-03-09 17:18:38 Lustre: lcrash-MDT0000-mdc-ffff8804462fd400: Connection restored to lcrash-MDT0000 (at 10.1.1.120@o2ib9)
            2015-03-09 17:18:38 Lustre: Skipped 3462984 previous similar messages
            2015-03-09 17:19:46 LustreError: 11-0: lcrash-MDT0000-mdc-ffff8804462fd400: Communicating with 10.1.1.120@o2ib9, operation llog_origin_handle_destroy failed with -19.
            2015-03-09 17:19:46 LustreError: Skipped 8882876 previous similar messages
            2015-03-09 17:26:52 Lustre: lcrash-MDT0000-mdc-ffff8804462fd400: Connection to lcrash-MDT0000 (at 10.1.1.120@o2ib9) was lost; in progress operations using this service will wait for recovery to complete
            2015-03-09 17:26:52 Lustre: Skipped 3461504 previous similar messages
            2015-03-09 17:26:55 LNetError: 28827:0:(lib-md.c:407:LNetMDCheck()) LU-2522: md(ffff88038f4694c0)->md_flags has LNET_MD_FLAG_ZOMBIE set
            2015-03-09 17:28:38 Lustre: lcrash-MDT0000-mdc-ffff8804462fd400: Connection restored to lcrash-MDT0000 (at 10.1.1.120@o2ib9)
            2015-03-09 17:28:38 Lustre: Skipped 3472891 previous similar messages
            2015-03-09 17:29:18 LNetError: 28829:0:(lib-md.c:407:LNetMDCheck()) LU-2522: md(ffff880305a45a40)->md_flags has LNET_MD_FLAG_ZOMBIE set
            2015-03-09 17:29:46 LustreError: 11-0: lcrash-MDT0000-mdc-ffff8804462fd400: Communicating with 10.1.1.120@o2ib9, operation llog_origin_handle_destroy failed with -19.
            2015-03-09 17:29:46 LustreError: Skipped 8232155 previous similar messages
            2015-03-09 17:36:52 Lustre: lcrash-MDT0000-mdc-ffff8804462fd400: Connection to lcrash-MDT0000 (at 10.1.1.120@o2ib9) was lost; in progress operations using this service will wait for recovery to complete
            2015-03-09 17:36:52 Lustre: Skipped 3484395 previous similar messages
            2015-03-09 17:37:30 LNetError: 28829:0:(lib-md.c:407:LNetMDCheck()) LU-2522: md(ffff8802255738c0)->md_flags has LNET_MD_FLAG_ZOMBIE set
            2015-03-09 17:38:26 LNetError: 28843:0:(lib-md.c:407:LNetMDCheck()) LU-2522: md(ffff8802d20ecb40)->md_flags has LNET_MD_FLAG_ZOMBIE set
            2015-03-09 17:38:38 Lustre: lcrash-MDT0000-mdc-ffff8804462fd400: Connection restored to lcrash-MDT0000 (at 10.1.1.120@o2ib9)
            2015-03-09 17:38:38 Lustre: Skipped 3475352 previous similar messages
            2015-03-09 17:39:46 LustreError: 11-0: lcrash-MDT0000-mdc-ffff8804462fd400: Communicating with 10.1.1.120@o2ib9, operation llog_origin_handle_destroy failed with -19.
            2015-03-09 17:39:46 LustreError: Skipped 8602702 previous similar messages
            2015-03-09 17:46:52 Lustre: lcrash-MDT0000-mdc-ffff8804462fd400: Connection to lcrash-MDT0000 (at 10.1.1.120@o2ib9) was lost; in progress operations using this service will wait for recovery to complete
            2015-03-09 17:46:52 Lustre: Skipped 3456170 previous similar messages
            

            I don't know exactly which version we were running at the time. Chris, do you happen to know?

            marc@llnl.gov D. Marc Stearman (Inactive) added a comment -
            mdiep Minh Diep added a comment -

            I also cannot reproduce this on b2_3.

            mdiep Minh Diep added a comment -

            I also cannot reproduce this on b2_4 with http://review.whamcloud.com/10653.

            I am working on b2_3 without any debug patch.

            bobijam Zhenyu Xu added a comment -

            The port of the debug patch to b2_4 is at http://review.whamcloud.com/10653.

            pjones Peter Jones added a comment -

            Could we try porting this patch to b2_4 and see whether it reproduces there?

            mdiep Minh Diep added a comment -

            I haven't been able to hit this issue again on master


            ashehata Amir Shehata (Inactive) added a comment -

            I updated patch http://review.whamcloud.com/10427 with debug code that walks the MDs in the system and checks whether any of them are in the LNET_MD_FLAG_ZOMBIE state. This is supposed to be a transitional state: if an MD stays in it for a period of time, then md_refcount > 0, which in turn means that a message is being processed on that MD but lnet_finalize() was never called on it. We suspect that reply_in_callback(), which sets the flags of interest to 0, is not being called. As far as I can tell, that would happen if lnet_finalize() is not called; and if lnet_finalize() is not called, we would leak an MD, which would remain in the LNET_MD_FLAG_ZOMBIE state. This patch should determine whether that is the case: if such an MD is found, a CERROR is emitted.
            bobijam Zhenyu Xu added a comment -

            Minh,

            Can you help to reproduce the issue with the debug patch http://review.whamcloud.com/10427 ?

            ashehata Amir Shehata (Inactive) added a comment - - edited

            In ptl_send_rpc() it does the following:

                    /* If the MD attach succeeds, there _will_ be a reply_in callback */
                    request->rq_receiving_reply = !noreply;
                    /* We are responsible for unlinking the reply buffer */
                    request->rq_must_unlink = !noreply;
            

            This means both flags will be 1 whenever a reply is expected.

            They are reset to 0 when the reply callback runs in reply_in_callback().

            I'm thinking that if the reply is delayed, these flags would never be set back to 0. Could that lead to the observed behavior?

            It would also be good to verify that no messages are being dropped.

            bobijam Zhenyu Xu added a comment -

            I think the MD was still occupied by the send. I'm guessing request::rq_must_unlink never got the chance to be set to 0. I've pushed a debug patch to reveal the ::rq_must_unlink and ::rq_receiving_reply flags.


            ashehata Amir Shehata (Inactive) added a comment -

            LNetMDUnlink() unlinks the memory descriptor from the match entry. This frees the buffers and any messages still using them. The only reason the MD would not be unlinked immediately is that md->md_refcount > 0. md_refcount is incremented while the MD is in use, which happens when a message is being sent using the MD or received into it. Which of those two cases applies in this problem?

            People

              Assignee: bobijam Zhenyu Xu
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 13
