
conf-sanity test_23b timed out during mount: RPCs in "Unregistering" phase found

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major

    Description

      This issue was created by maloo for Andreas Dilger <andreas.dilger@intel.com>

      This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/f71e120a-4a59-11e2-8523-52540035b04c.

      The sub-test test_23b failed with the following errors in the client console log:

      15:05:02:Lustre: *** cfs_fail_loc=313, val=0***
      15:05:02:LustreError: 26376:0:(llite_lib.c:537:client_common_fill_super()) cannot start close thread: rc -4
      15:05:23:LustreError: 25912:0:(import.c:324:ptlrpc_invalidate_import()) lustre-OST0000_UUID: rc = -110 waiting for callback (1 != 0)
      15:05:23:LustreError: 25912:0:(import.c:350:ptlrpc_invalidate_import()) @@@ still on sending list req@ffff880077b10000 x1421825299972112/t0(0) o8->lustre-OST0000-osc-ffff8800767e7000@10.10.4.147@tcp:28/4 lens 400/264 e 0 to 0 dl 1355958298 ref 1 fl Interpret:REN/0/0 rc -5/0
      15:05:23:LustreError: 25912:0:(import.c:366:ptlrpc_invalidate_import()) lustre-OST0000_UUID: RPCs in "Unregistering" phase found (0). Network is sluggish? Waiting them to error out.
      [the above messages repeat]
      15:08:35:INFO: task mount.lustre:26376 blocked for more than 120 seconds.
      15:08:35:mount.lustre D 0000000000000000 0 26376 26375 0x00000080
      15:08:35: ffff880076a8d7e8 0000000000000086 0000000050d24816 ffff880076a8d7b8
      15:08:35: 0000000000000000 ffff88007b5e1ba0 ffffffffa0717f80 0000000000000054
      15:08:35: ffff88007cd765f8 ffff880076a8dfd8 000000000000fb88 ffff88007cd765f8
      15:08:35:Call Trace:
      15:08:35: [<ffffffff81500605>] rwsem_down_failed_common+0x95/0x1d0
      15:08:35: [<ffffffff81500763>] rwsem_down_write_failed+0x23/0x30
      15:08:35: [<ffffffff8127ea53>] call_rwsem_down_write_failed+0x13/0x20
      15:08:35: [<ffffffff814ffc62>] ? down_write+0x32/0x40
      15:08:35: [<ffffffffa0642061>] client_disconnect_export+0x61/0x460 [ptlrpc]
      15:08:35: [<ffffffffa0877335>] osc_disconnect+0xa5/0x2a0 [osc]
      15:08:35: [<ffffffffa0504ff7>] class_disconnect_export_list+0x337/0x670 [obdclass]
      15:08:35: [<ffffffffa05057b6>] class_disconnect_exports+0x116/0x2f0 [obdclass]
      15:08:35: [<ffffffffa052e60f>] class_cleanup+0x16f/0xdc0 [obdclass]
      15:08:35: [<ffffffffa0502fd6>] ? class_name2dev+0x56/0xe0 [obdclass]
      15:08:35: [<ffffffffa0530305>] class_process_config+0x10a5/0x1ca0 [obdclass]
      15:08:35: [<ffffffffa07a2028>] ? libcfs_log_return+0x28/0x40 [libcfs]
      15:08:35: [<ffffffffa0529b01>] ? lustre_cfg_new+0x391/0x7e0 [obdclass]
      15:08:35: [<ffffffffa0531079>] class_manual_cleanup+0x179/0x6f0 [obdclass]
      15:08:35: [<ffffffffa0a70f40>] ll_put_super+0xf0/0x370 [lustre]
      15:08:35: [<ffffffffa0a7bf29>] ll_fill_super+0x7f9/0x1500 [lustre]
      15:08:35: [<ffffffffa05477cc>] lustre_fill_super+0x12c/0x1b00 [obdclass]
      15:08:35: [<ffffffff8117d256>] ? set_anon_super+0x56/0x100
      15:08:36: [<ffffffff8117e123>] ? sget+0x3e3/0x480
      15:08:36: [<ffffffff8117d200>] ? set_anon_super+0x0/0x100
      15:08:36: [<ffffffffa05476a0>] ? lustre_fill_super+0x0/0x1b00 [obdclass]
      15:08:36: [<ffffffff8117e66f>] get_sb_nodev+0x5f/0xa0
      15:08:36: [<ffffffffa0532d15>] lustre_get_sb+0x25/0x30 [obdclass]
      15:08:36: [<ffffffff8117e2cb>] vfs_kern_mount+0x7b/0x1b0
      15:08:36: [<ffffffff8117e472>] do_kern_mount+0x52/0x130
      15:08:36: [<ffffffff8119cb42>] do_mount+0x2d2/0x8d0
      15:08:36: [<ffffffff8119d1d0>] sys_mount+0x90/0xe0

      Info required for matching: conf-sanity 23b


          Activity


            We saw a burst of the debug messages on March 10th, 2015:

            2015-03-09 17:16:52 Lustre: lcrash-MDT0000-mdc-ffff8804462fd400: Connection to lcrash-MDT0000 (at 10.1.1.120@o2ib9) was lost; in progress operations using this service will wait for recovery to complete
            2015-03-09 17:16:52 Lustre: Skipped 3449155 previous similar messages
            2015-03-09 17:18:38 Lustre: lcrash-MDT0000-mdc-ffff8804462fd400: Connection restored to lcrash-MDT0000 (at 10.1.1.120@o2ib9)
            2015-03-09 17:18:38 Lustre: Skipped 3462984 previous similar messages
            2015-03-09 17:19:46 LustreError: 11-0: lcrash-MDT0000-mdc-ffff8804462fd400: Communicating with 10.1.1.120@o2ib9, operation llog_origin_handle_destroy failed with -19.
            2015-03-09 17:19:46 LustreError: Skipped 8882876 previous similar messages
            2015-03-09 17:26:52 Lustre: lcrash-MDT0000-mdc-ffff8804462fd400: Connection to lcrash-MDT0000 (at 10.1.1.120@o2ib9) was lost; in progress operations using this service will wait for recovery to complete
            2015-03-09 17:26:52 Lustre: Skipped 3461504 previous similar messages
            2015-03-09 17:26:55 LNetError: 28827:0:(lib-md.c:407:LNetMDCheck()) LU-2522: md(ffff88038f4694c0)->md_flags has LNET_MD_FLAG_ZOMBIE set
            2015-03-09 17:28:38 Lustre: lcrash-MDT0000-mdc-ffff8804462fd400: Connection restored to lcrash-MDT0000 (at 10.1.1.120@o2ib9)
            2015-03-09 17:28:38 Lustre: Skipped 3472891 previous similar messages
            2015-03-09 17:29:18 LNetError: 28829:0:(lib-md.c:407:LNetMDCheck()) LU-2522: md(ffff880305a45a40)->md_flags has LNET_MD_FLAG_ZOMBIE set
            2015-03-09 17:29:46 LustreError: 11-0: lcrash-MDT0000-mdc-ffff8804462fd400: Communicating with 10.1.1.120@o2ib9, operation llog_origin_handle_destroy failed with -19.
            2015-03-09 17:29:46 LustreError: Skipped 8232155 previous similar messages
            2015-03-09 17:36:52 Lustre: lcrash-MDT0000-mdc-ffff8804462fd400: Connection to lcrash-MDT0000 (at 10.1.1.120@o2ib9) was lost; in progress operations using this service will wait for recovery to complete
            2015-03-09 17:36:52 Lustre: Skipped 3484395 previous similar messages
            2015-03-09 17:37:30 LNetError: 28829:0:(lib-md.c:407:LNetMDCheck()) LU-2522: md(ffff8802255738c0)->md_flags has LNET_MD_FLAG_ZOMBIE set
            2015-03-09 17:38:26 LNetError: 28843:0:(lib-md.c:407:LNetMDCheck()) LU-2522: md(ffff8802d20ecb40)->md_flags has LNET_MD_FLAG_ZOMBIE set
            2015-03-09 17:38:38 Lustre: lcrash-MDT0000-mdc-ffff8804462fd400: Connection restored to lcrash-MDT0000 (at 10.1.1.120@o2ib9)
            2015-03-09 17:38:38 Lustre: Skipped 3475352 previous similar messages
            2015-03-09 17:39:46 LustreError: 11-0: lcrash-MDT0000-mdc-ffff8804462fd400: Communicating with 10.1.1.120@o2ib9, operation llog_origin_handle_destroy failed with -19.
            2015-03-09 17:39:46 LustreError: Skipped 8602702 previous similar messages
            2015-03-09 17:46:52 Lustre: lcrash-MDT0000-mdc-ffff8804462fd400: Connection to lcrash-MDT0000 (at 10.1.1.120@o2ib9) was lost; in progress operations using this service will wait for recovery to complete
            2015-03-09 17:46:52 Lustre: Skipped 3456170 previous similar messages
            

            I don't know exactly which version we were running at the time. Chris, do you happen to know?

            marc@llnl.gov D. Marc Stearman (Inactive) added a comment -
            mdiep Minh Diep added a comment -

            I also cannot reproduce this on b2_3.

            mdiep Minh Diep added a comment -

            I also cannot reproduce this on b2_4 with http://review.whamcloud.com/10653.

            I am working on b2_3 without any debug patch.

            bobijam Zhenyu Xu added a comment -

            The port of the debug patch to b2_4 is at http://review.whamcloud.com/10653.

            pjones Peter Jones added a comment -

            Could we try porting this patch to b2_4 and see whether it reproduces there?

            mdiep Minh Diep added a comment -

            I haven't been able to hit this issue again on master


            ashehata Amir Shehata (Inactive) added a comment -

            I updated patch http://review.whamcloud.com/10427 with debug code that walks the MDs in the system and checks whether any of them are in the LNET_MD_FLAG_ZOMBIE state. This is supposed to be a transitional state: if an MD stays in it for a period of time, then md_refcount > 0, which in turn means that a message is being processed on that MD but lnet_finalize() was never called on it. We suspect that reply_in_callback(), which sets the flags of interest to 0, is not being called. As far as I can tell, that would happen if lnet_finalize() is not called; and if lnet_finalize() is not called, we would leak an MD, which would remain in the LNET_MD_FLAG_ZOMBIE state. This patch should determine whether that is the case: if such an MD is found, a CERROR is emitted.
            bobijam Zhenyu Xu added a comment -

            Minh,

            Can you help to reproduce the issue with the debug patch http://review.whamcloud.com/10427 ?

            ashehata Amir Shehata (Inactive) added a comment - - edited

            In ptl_send_rpc() it does the following:

                    /* If the MD attach succeeds, there _will_ be a reply_in callback */
                    request->rq_receiving_reply = !noreply;
                    /* We are responsible for unlinking the reply buffer */
                    request->rq_must_unlink = !noreply;
            

            This means both flags will be 1 whenever a reply is expected.

            They are reset to 0 when the reply callback runs in reply_in_callback().

            I'm thinking that if the reply is delayed, these flags would never be set back to 0. Could that lead to the observed behavior?

            It would also be good to verify that no messages are being dropped.

            bobijam Zhenyu Xu added a comment -

            I think the MD was still occupied by the send. I'm guessing request::rq_must_unlink never got the chance to be set to 0. I've pushed a debug patch to reveal the ::rq_must_unlink and ::rq_receiving_reply flags.


            ashehata Amir Shehata (Inactive) added a comment -

            LNetMDUnlink() unlinks the memory descriptor from the match entry. This frees the buffers and any messages still using them. The only reason the MD would not be unlinked immediately is that md->md_refcount > 0. md_refcount is incremented while the MD is in use, which happens when a message is being sent using the MD or received into it. Which of those two cases applies in this problem?

            People

              Assignee: bobijam Zhenyu Xu
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 13
