[LU-2522] conf-sanity test_23b timed out during mount: RPCs in "Unregistering" phase found Created: 22/Dec/12  Updated: 14/Dec/21  Resolved: 14/Dec/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Maloo Assignee: Zhenyu Xu
Resolution: Cannot Reproduce Votes: 0
Labels: llnl

Issue Links:
Duplicate
duplicates LU-5259 request gets stuck in UNREGISTERING p... Resolved
Related
is related to LU-7434 lost bulk leads to a hang Resolved
Severity: 3
Rank (Obsolete): 5938

 Description   

This issue was created by maloo for Andreas Dilger <andreas.dilger@intel.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/f71e120a-4a59-11e2-8523-52540035b04c.

The sub-test test_23b failed with the following errors in the client console log:

15:05:02:Lustre: *** cfs_fail_loc=313, val=0***
15:05:02:LustreError: 26376:0:(llite_lib.c:537:client_common_fill_super()) cannot start close thread: rc -4
15:05:23:LustreError: 25912:0:(import.c:324:ptlrpc_invalidate_import()) lustre-OST0000_UUID: rc = -110 waiting for callback (1 != 0)
15:05:23:LustreError: 25912:0:(import.c:350:ptlrpc_invalidate_import()) @@@ still on sending list req@ffff880077b10000 x1421825299972112/t0(0) o8->lustre-OST0000-osc-ffff8800767e7000@10.10.4.147@tcp:28/4 lens 400/264 e 0 to 0 dl 1355958298 ref 1 fl Interpret:REN/0/0 rc -5/0
15:05:23:LustreError: 25912:0:(import.c:366:ptlrpc_invalidate_import()) lustre-OST0000_UUID: RPCs in "Unregistering" phase found (0). Network is sluggish? Waiting them to error out.
:
[repeats]
:
15:08:35:INFO: task mount.lustre:26376 blocked for more than 120 seconds.
15:08:35:mount.lustre D 0000000000000000 0 26376 26375 0x00000080
15:08:35: ffff880076a8d7e8 0000000000000086 0000000050d24816 ffff880076a8d7b8
15:08:35: 0000000000000000 ffff88007b5e1ba0 ffffffffa0717f80 0000000000000054
15:08:35: ffff88007cd765f8 ffff880076a8dfd8 000000000000fb88 ffff88007cd765f8
15:08:35:Call Trace:
15:08:35: [<ffffffff81500605>] rwsem_down_failed_common+0x95/0x1d0
15:08:35: [<ffffffff81500763>] rwsem_down_write_failed+0x23/0x30
15:08:35: [<ffffffff8127ea53>] call_rwsem_down_write_failed+0x13/0x20
15:08:35: [<ffffffff814ffc62>] ? down_write+0x32/0x40
15:08:35: [<ffffffffa0642061>] client_disconnect_export+0x61/0x460 [ptlrpc]
15:08:35: [<ffffffffa0877335>] osc_disconnect+0xa5/0x2a0 [osc]
15:08:35: [<ffffffffa0504ff7>] class_disconnect_export_list+0x337/0x670 [obdclass]
15:08:35: [<ffffffffa05057b6>] class_disconnect_exports+0x116/0x2f0 [obdclass]
15:08:35: [<ffffffffa052e60f>] class_cleanup+0x16f/0xdc0 [obdclass]
15:08:35: [<ffffffffa0502fd6>] ? class_name2dev+0x56/0xe0 [obdclass]
15:08:35: [<ffffffffa0530305>] class_process_config+0x10a5/0x1ca0 [obdclass]
15:08:35: [<ffffffffa07a2028>] ? libcfs_log_return+0x28/0x40 [libcfs]
15:08:35: [<ffffffffa0529b01>] ? lustre_cfg_new+0x391/0x7e0 [obdclass]
15:08:35: [<ffffffffa0531079>] class_manual_cleanup+0x179/0x6f0 [obdclass]
15:08:35: [<ffffffffa0a70f40>] ll_put_super+0xf0/0x370 [lustre]
15:08:35: [<ffffffffa0a7bf29>] ll_fill_super+0x7f9/0x1500 [lustre]
15:08:35: [<ffffffffa05477cc>] lustre_fill_super+0x12c/0x1b00 [obdclass]
15:08:35: [<ffffffff8117d256>] ? set_anon_super+0x56/0x100
15:08:36: [<ffffffff8117e123>] ? sget+0x3e3/0x480
15:08:36: [<ffffffff8117d200>] ? set_anon_super+0x0/0x100
15:08:36: [<ffffffffa05476a0>] ? lustre_fill_super+0x0/0x1b00 [obdclass]
15:08:36: [<ffffffff8117e66f>] get_sb_nodev+0x5f/0xa0
15:08:36: [<ffffffffa0532d15>] lustre_get_sb+0x25/0x30 [obdclass]
15:08:36: [<ffffffff8117e2cb>] vfs_kern_mount+0x7b/0x1b0
15:08:36: [<ffffffff8117e472>] do_kern_mount+0x52/0x130
15:08:36: [<ffffffff8119cb42>] do_mount+0x2d2/0x8d0
15:08:36: [<ffffffff8119d1d0>] sys_mount+0x90/0xe0

Info required for matching: conf-sanity 23b



 Comments   
Comment by nasf (Inactive) [ 29/Dec/12 ]

Another failure instance:

https://maloo.whamcloud.com/test_sets/3a25305c-514d-11e2-b56e-52540035b04c

Comment by Minh Diep [ 16/May/13 ]

I can reliably reproduce this issue on 2.3 with the following:

./llmount.sh
run mdtest
./llmountcleanup.sh

Comment by Brian Behlendorf [ 30/Apr/14 ]

We've been observing this for some time in production on our 2.4 clients. I just wanted to increase the visibility of this so you're aware this issue is happening on real systems.

Comment by Christopher Morrone [ 20/May/14 ]

As Brian mentioned we have been seeing this in production. We would really like our clients to be able to reconnect instead of blocking forever.

Comment by Matt Ezell [ 21/May/14 ]

We've also seen it in production at ORNL on Cray 2.3 clients

Comment by Peter Jones [ 22/May/14 ]

Bobijam

Could you please advise on this one?

Thanks

Peter

Comment by Zhenyu Xu [ 23/May/14 ]

Pushed http://review.whamcloud.com/10427 to add more request status flags to the log messages to help debug this issue.

I think I need a network expert to help diagnose this issue. In ptlrpc_check_set(), once the OST_CONNECT request has been moved to the RQ_PHASE_UNREGISTERING phase, LNetMDUnlink() (which I believe unlinks the request's reply buffer and should eventually clear request::rq_receiving_reply and ::rq_must_unlink) should cause ptlrpc_client_recv_or_unlink() to return false for the request, so the request would move to RQ_PHASE_COMPLETE. The repeated 'RPCs in "Unregistering" phase found' messages show that this is not happening.

I suspect that LNetMDUnlink() does not accomplish what it should do in some corner cases.
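
To make the flow I expect concrete, here is a rough standalone model of the phase logic (this is not the actual ptlrpc source; the struct and helpers are simplified stand-ins, only the field and phase names mirror the ones quoted above):

#include <stdbool.h>
#include <stdio.h>

enum rq_phase { RQ_PHASE_UNREGISTERING, RQ_PHASE_COMPLETE };

struct fake_request {
        bool rq_receiving_reply;   /* reply event has not arrived yet */
        bool rq_must_unlink;       /* we still own the reply buffer MD */
        enum rq_phase rq_phase;
};

/* Stand-in for ptlrpc_client_recv_or_unlink(): true while the reply MD is
 * still busy, i.e. the request cannot leave the "Unregistering" phase. */
static bool recv_or_unlink(const struct fake_request *req)
{
        return req->rq_receiving_reply || req->rq_must_unlink;
}

static void check_one_request(struct fake_request *req)
{
        if (req->rq_phase != RQ_PHASE_UNREGISTERING)
                return;
        if (recv_or_unlink(req)) {
                /* The repeating "RPCs in Unregistering phase found" case:
                 * the unlink never completes, so we keep waiting. */
                printf("still unregistering, waiting\n");
                return;
        }
        req->rq_phase = RQ_PHASE_COMPLETE;
}

int main(void)
{
        struct fake_request req = { true, true, RQ_PHASE_UNREGISTERING };

        check_one_request(&req);        /* stuck: both flags still set */
        req.rq_receiving_reply = false; /* what the reply event should do */
        req.rq_must_unlink = false;
        check_one_request(&req);        /* now moves to COMPLETE */
        return req.rq_phase == RQ_PHASE_COMPLETE ? 0 : 1;
}

If LNetMDUnlink() never results in those two flags being cleared, a loop like this never leaves the "still unregistering" branch, which matches the repeated console messages in the description.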

Comment by Amir Shehata (Inactive) [ 25/May/14 ]

LNetMDUnlink() unlinks the memory descriptor (MD) from the match entry. This frees the buffers and cleans up any messages still using them. The only reason the MD would not be unlinked immediately is if md->md_refcount > 0. md_refcount is incremented while the MD is in use, which happens either when a message is being sent using the MD or when a message is being received into it. Which of these cases applies here?
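
Purely for illustration (this is not the LNet source; the names here are made up), the refcount rule amounts to something like:

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

struct fake_md {
        int  md_refcount;  /* > 0 while a send or receive is using the MD */
        bool md_zombie;    /* unlinked, but a reference is still held */
};

/* Stand-in for the unlink path: free immediately only if nothing is using
 * the MD; otherwise mark it and defer the free to the last reference. */
static void fake_md_unlink(struct fake_md *md)
{
        if (md->md_refcount > 0) {
                md->md_zombie = true;
                return;
        }
        free(md);
}

/* Stand-in for message completion (what lnet_finalize() triggers): drop the
 * reference and finish any deferred unlink. If this never runs, e.g. because
 * a reply was lost, the MD stays a "zombie" forever. */
static void fake_md_msg_done(struct fake_md *md)
{
        if (--md->md_refcount == 0 && md->md_zombie)
                free(md);
}

int main(void)
{
        struct fake_md *md = calloc(1, sizeof(*md));

        if (md == NULL)
                return 1;
        md->md_refcount = 1;   /* a reply is supposedly inbound */
        fake_md_unlink(md);    /* deferred: MD becomes a zombie */
        fake_md_msg_done(md);  /* completion finally frees it */
        printf("deferred unlink completed\n");
        return 0;
}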

Comment by Zhenyu Xu [ 26/May/14 ]

I think the MD was occupied by sending. I'm guessing that request::rq_must_unlink never gets a chance to be set to 0. I've pushed a debug patch to reveal the ::rq_must_unlink and ::rq_receiving_reply flags.

Comment by Amir Shehata (Inactive) [ 26/May/14 ]

In ptl_send_rpc() it does the following:

        /* If the MD attach succeeds, there _will_ be a reply_in callback */
        request->rq_receiving_reply = !noreply;
        /* We are responsible for unlinking the reply buffer */
        request->rq_must_unlink = !noreply;

This means both flags will be 1 whenever a reply is expected.

These flags are reset to 0 when the reply callback is executed in reply_in_callback().

I'm thinking that if the reply is delayed, these flags would not be set back to 0. Would that lead to the observed behavior?

It would be good to try and make sure that there are no dropped messages.
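
For completeness, the receive side amounts to something like the following simplified model (not the real reply_in_callback(); the event structure here is invented):

#include <stdbool.h>

struct fake_request {
        bool rq_receiving_reply;
        bool rq_must_unlink;
};

struct fake_event {
        bool ev_is_reply;   /* a reply landed in the reply buffer */
        bool ev_unlinked;   /* the MD was unlinked as part of this event */
};

/* Simplified model of the callback that clears the two flags set in
 * ptl_send_rpc() above. If no reply event is ever delivered (e.g. a dropped
 * message), neither flag goes back to 0 and the request stays in the
 * "Unregistering" phase. */
static void fake_reply_in_callback(const struct fake_event *ev,
                                   struct fake_request *req)
{
        if (ev->ev_is_reply)
                req->rq_receiving_reply = false;
        if (ev->ev_unlinked)
                req->rq_must_unlink = false;
}

int main(void)
{
        struct fake_request req = { true, true };
        struct fake_event ev = { true, true };

        fake_reply_in_callback(&ev, &req);
        return (req.rq_receiving_reply || req.rq_must_unlink) ? 1 : 0;
}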

Comment by Zhenyu Xu [ 27/May/14 ]

Minh,

Can you help to reproduce the issue with the debug patch http://review.whamcloud.com/10427 ?

Comment by Amir Shehata (Inactive) [ 27/May/14 ]

I updated patch http://review.whamcloud.com/10427 with debug code to go over the MDs in the system and see if any of them are in the LNET_MD_FLAG_ZOMBIE state. This is supposed to be a transitional state; if an MD stays in it for a period of time, that means md_refcount > 0, which in turn means there is a message being processed on that MD but lnet_finalize() was never called on it. We suspect that reply_in_callback(), which sets the flags of interest back to 0, is not being called. As far as I can tell at the moment, that would happen if lnet_finalize() is not being called. If lnet_finalize() is not called, we leak an MD, which remains in the LNET_MD_FLAG_ZOMBIE state. This patch should determine whether that is the case: if one such MD is found, a CERROR is emitted.
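
The check amounts to roughly the following (a sketch only; the real code is in the patch linked above, and the names here are simplified stand-ins):

#include <stdio.h>

#define FAKE_MD_FLAG_ZOMBIE 0x1  /* unlinked but md_refcount still > 0 */

struct fake_md {
        unsigned int    md_flags;
        int             md_refcount;
        struct fake_md *md_next;
};

/* Walk the MDs and complain about any that linger in the transitional
 * zombie state: that implies a message on the MD for which lnet_finalize()
 * was never called, i.e. a leaked MD. */
static void scan_for_zombie_mds(const struct fake_md *head)
{
        for (const struct fake_md *md = head; md != NULL; md = md->md_next) {
                if (md->md_flags & FAKE_MD_FLAG_ZOMBIE)
                        fprintf(stderr,
                                "md %p has ZOMBIE set (refcount %d)\n",
                                (const void *)md, md->md_refcount);
        }
}

int main(void)
{
        struct fake_md leaked = { FAKE_MD_FLAG_ZOMBIE, 1, NULL };
        struct fake_md healthy = { 0, 0, &leaked };

        scan_for_zombie_mds(&healthy);  /* reports only the leaked MD */
        return 0;
}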

Comment by Minh Diep [ 05/Jun/14 ]

I haven't been able to hit this issue again on master

Comment by Peter Jones [ 05/Jun/14 ]

Could we try porting this patch to b2_4 and seeing if it will reproduce then?

Comment by Zhenyu Xu [ 09/Jun/14 ]

Porting of the debug patch to b2_4 is at http://review.whamcloud.com/10653

Comment by Minh Diep [ 10/Jun/14 ]

I also cannot reproduce this on b2_4 with http://review.whamcloud.com/10653

I am working on b2_3 without any debug patch

Comment by Minh Diep [ 10/Jun/14 ]

I also cannot reproduce this on b2_3

Comment by D. Marc Stearman (Inactive) [ 11/May/15 ]

We saw a burst of the debug messages on March 10th, 2015:

2015-03-09 17:16:52 Lustre: lcrash-MDT0000-mdc-ffff8804462fd400: Connection to lcrash-MDT0000 (at 10.1.1.120@o2ib9) was lost; in progress operations using this service will wait for recovery to complete
2015-03-09 17:16:52 Lustre: Skipped 3449155 previous similar messages
2015-03-09 17:18:38 Lustre: lcrash-MDT0000-mdc-ffff8804462fd400: Connection restored to lcrash-MDT0000 (at 10.1.1.120@o2ib9)
2015-03-09 17:18:38 Lustre: Skipped 3462984 previous similar messages
2015-03-09 17:19:46 LustreError: 11-0: lcrash-MDT0000-mdc-ffff8804462fd400: Communicating with 10.1.1.120@o2ib9, operation llog_origin_handle_destroy failed with -19.
2015-03-09 17:19:46 LustreError: Skipped 8882876 previous similar messages
2015-03-09 17:26:52 Lustre: lcrash-MDT0000-mdc-ffff8804462fd400: Connection to lcrash-MDT0000 (at 10.1.1.120@o2ib9) was lost; in progress operations using this service will wait for recovery to complete
2015-03-09 17:26:52 Lustre: Skipped 3461504 previous similar messages
2015-03-09 17:26:55 LNetError: 28827:0:(lib-md.c:407:LNetMDCheck()) LU-2522: md(ffff88038f4694c0)->md_flags has LNET_MD_FLAG_ZOMBIE set
2015-03-09 17:28:38 Lustre: lcrash-MDT0000-mdc-ffff8804462fd400: Connection restored to lcrash-MDT0000 (at 10.1.1.120@o2ib9)
2015-03-09 17:28:38 Lustre: Skipped 3472891 previous similar messages
2015-03-09 17:29:18 LNetError: 28829:0:(lib-md.c:407:LNetMDCheck()) LU-2522: md(ffff880305a45a40)->md_flags has LNET_MD_FLAG_ZOMBIE set
2015-03-09 17:29:46 LustreError: 11-0: lcrash-MDT0000-mdc-ffff8804462fd400: Communicating with 10.1.1.120@o2ib9, operation llog_origin_handle_destroy failed with -19.
2015-03-09 17:29:46 LustreError: Skipped 8232155 previous similar messages
2015-03-09 17:36:52 Lustre: lcrash-MDT0000-mdc-ffff8804462fd400: Connection to lcrash-MDT0000 (at 10.1.1.120@o2ib9) was lost; in progress operations using this service will wait for recovery to complete
2015-03-09 17:36:52 Lustre: Skipped 3484395 previous similar messages
2015-03-09 17:37:30 LNetError: 28829:0:(lib-md.c:407:LNetMDCheck()) LU-2522: md(ffff8802255738c0)->md_flags has LNET_MD_FLAG_ZOMBIE set
2015-03-09 17:38:26 LNetError: 28843:0:(lib-md.c:407:LNetMDCheck()) LU-2522: md(ffff8802d20ecb40)->md_flags has LNET_MD_FLAG_ZOMBIE set
2015-03-09 17:38:38 Lustre: lcrash-MDT0000-mdc-ffff8804462fd400: Connection restored to lcrash-MDT0000 (at 10.1.1.120@o2ib9)
2015-03-09 17:38:38 Lustre: Skipped 3475352 previous similar messages
2015-03-09 17:39:46 LustreError: 11-0: lcrash-MDT0000-mdc-ffff8804462fd400: Communicating with 10.1.1.120@o2ib9, operation llog_origin_handle_destroy failed with -19.
2015-03-09 17:39:46 LustreError: Skipped 8602702 previous similar messages
2015-03-09 17:46:52 Lustre: lcrash-MDT0000-mdc-ffff8804462fd400: Connection to lcrash-MDT0000 (at 10.1.1.120@o2ib9) was lost; in progress operations using this service will wait for recovery to complete
2015-03-09 17:46:52 Lustre: Skipped 3456170 previous similar messages

I don't know exactly which version we were running at the time. Chris, do you happen to know?

Comment by Zhenyu Xu [ 01/Jun/15 ]

The latest debug messages do not fit this issue. The "llog_origin_handle_destroy failed with -19" error means that the MDT is not functioning as the client expects (the MDT cannot find the llog context) and cannot handle this client's config llog.

Comment by Christopher Morrone [ 30/Jun/15 ]

I agree; I do not see a connection between Marc's log and this ticket. If the only place Marc could find this happening was lcrash (which is a developer crash-and-burn development system, so it could have occurred there for any number of reasons), then I don't think we have grounds to keep this as a top LLNL issue. I am removing the topllnl label.
