[LU-2522] conf-sanity test_23b timed out during mount: RPCs in "Unregistering" phase found

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major

    Description

      This issue was created by maloo for Andreas Dilger <andreas.dilger@intel.com>

      This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/f71e120a-4a59-11e2-8523-52540035b04c.

      The sub-test test_23b failed with the following errors in the client console log:

      15:05:02:Lustre: *** cfs_fail_loc=313, val=0***
      15:05:02:LustreError: 26376:0:(llite_lib.c:537:client_common_fill_super()) cannot start close thread: rc -4
      15:05:23:LustreError: 25912:0:(import.c:324:ptlrpc_invalidate_import()) lustre-OST0000_UUID: rc = -110 waiting for callback (1 != 0)
      15:05:23:LustreError: 25912:0:(import.c:350:ptlrpc_invalidate_import()) @@@ still on sending list req@ffff880077b10000 x1421825299972112/t0(0) o8->lustre-OST0000-osc-ffff8800767e7000@10.10.4.147@tcp:28/4 lens 400/264 e 0 to 0 dl 1355958298 ref 1 fl Interpret:REN/0/0 rc -5/0
      15:05:23:LustreError: 25912:0:(import.c:366:ptlrpc_invalidate_import()) lustre-OST0000_UUID: RPCs in "Unregistering" phase found (0). Network is sluggish? Waiting them to error out.
      :
      [repeats]
      :
      15:08:35:INFO: task mount.lustre:26376 blocked for more than 120 seconds.
      15:08:35:mount.lustre D 0000000000000000 0 26376 26375 0x00000080
      15:08:35: ffff880076a8d7e8 0000000000000086 0000000050d24816 ffff880076a8d7b8
      15:08:35: 0000000000000000 ffff88007b5e1ba0 ffffffffa0717f80 0000000000000054
      15:08:35: ffff88007cd765f8 ffff880076a8dfd8 000000000000fb88 ffff88007cd765f8
      15:08:35:Call Trace:
      15:08:35: [<ffffffff81500605>] rwsem_down_failed_common+0x95/0x1d0
      15:08:35: [<ffffffff81500763>] rwsem_down_write_failed+0x23/0x30
      15:08:35: [<ffffffff8127ea53>] call_rwsem_down_write_failed+0x13/0x20
      15:08:35: [<ffffffff814ffc62>] ? down_write+0x32/0x40
      15:08:35: [<ffffffffa0642061>] client_disconnect_export+0x61/0x460 [ptlrpc]
      15:08:35: [<ffffffffa0877335>] osc_disconnect+0xa5/0x2a0 [osc]
      15:08:35: [<ffffffffa0504ff7>] class_disconnect_export_list+0x337/0x670 [obdclass]
      15:08:35: [<ffffffffa05057b6>] class_disconnect_exports+0x116/0x2f0 [obdclass]
      15:08:35: [<ffffffffa052e60f>] class_cleanup+0x16f/0xdc0 [obdclass]
      15:08:35: [<ffffffffa0502fd6>] ? class_name2dev+0x56/0xe0 [obdclass]
      15:08:35: [<ffffffffa0530305>] class_process_config+0x10a5/0x1ca0 [obdclass]
      15:08:35: [<ffffffffa07a2028>] ? libcfs_log_return+0x28/0x40 [libcfs]
      15:08:35: [<ffffffffa0529b01>] ? lustre_cfg_new+0x391/0x7e0 [obdclass]
      15:08:35: [<ffffffffa0531079>] class_manual_cleanup+0x179/0x6f0 [obdclass]
      15:08:35: [<ffffffffa0a70f40>] ll_put_super+0xf0/0x370 [lustre]
      15:08:35: [<ffffffffa0a7bf29>] ll_fill_super+0x7f9/0x1500 [lustre]
      15:08:35: [<ffffffffa05477cc>] lustre_fill_super+0x12c/0x1b00 [obdclass]
      15:08:35: [<ffffffff8117d256>] ? set_anon_super+0x56/0x100
      15:08:36: [<ffffffff8117e123>] ? sget+0x3e3/0x480
      15:08:36: [<ffffffff8117d200>] ? set_anon_super+0x0/0x100
      15:08:36: [<ffffffffa05476a0>] ? lustre_fill_super+0x0/0x1b00 [obdclass]
      15:08:36: [<ffffffff8117e66f>] get_sb_nodev+0x5f/0xa0
      15:08:36: [<ffffffffa0532d15>] lustre_get_sb+0x25/0x30 [obdclass]
      15:08:36: [<ffffffff8117e2cb>] vfs_kern_mount+0x7b/0x1b0
      15:08:36: [<ffffffff8117e472>] do_kern_mount+0x52/0x130
      15:08:36: [<ffffffff8119cb42>] do_mount+0x2d2/0x8d0
      15:08:36: [<ffffffff8119d1d0>] sys_mount+0x90/0xe0

      Info required for matching: conf-sanity 23b
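      For convenience when reading the console log above, the negative return codes are standard Linux errno values. The small self-contained program below (illustrative only, not part of the test suite or any patch) just prints the mapping:

      /* errno reference for the rc values seen in the log above */
      #include <errno.h>
      #include <stdio.h>

      int main(void)
      {
              printf("rc -%d   = -EINTR     (close thread start interrupted)\n", EINTR);        /* -4 */
              printf("rc -%d   = -EIO       (request failed on the sending list)\n", EIO);      /* -5 */
              printf("rc -%d = -ETIMEDOUT (timed out waiting for the callback)\n", ETIMEDOUT);  /* -110 */
              return 0;
      }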

      Attachments

        Issue Links

          Activity


            morrone Christopher Morrone (Inactive) added a comment -

            I agree, I do not see a connection between Marc's log and this ticket. If the only place Marc could find this happening was lcrash (which is a developer crash-and-burn development system, so it could have occurred there for any number of reasons), then I don't think we have grounds to keep this as a top LLNL issue. I am removing the topllnl label.
            bobijam Zhenyu Xu added a comment -

            The latest debug messages do not fit this issue. The "llog_origin_handle_destroy failed with -19" error means that the MDT is not functioning as the client expects (the MDT cannot find the llog context), so it cannot handle this client's config llog.
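            (For reference, a tiny self-contained check, not part of any patch here: -19 on Linux is -ENODEV, "no such device", which matches the MDT being unable to locate the llog context.)

            #include <assert.h>
            #include <errno.h>

            int main(void)
            {
                    /* llog_origin_handle_destroy failed with -19 == -ENODEV */
                    assert(ENODEV == 19);
                    return 0;
            }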


            marc@llnl.gov D. Marc Stearman (Inactive) added a comment -

            We saw a burst of the debug messages on March 10th, 2015:

            2015-03-09 17:16:52 Lustre: lcrash-MDT0000-mdc-ffff8804462fd400: Connection to lcrash-MDT0000 (at 10.1.1.120@o2ib9) was lost; in progress operations using this service will wait for recovery to complete
            2015-03-09 17:16:52 Lustre: Skipped 3449155 previous similar messages
            2015-03-09 17:18:38 Lustre: lcrash-MDT0000-mdc-ffff8804462fd400: Connection restored to lcrash-MDT0000 (at 10.1.1.120@o2ib9)
            2015-03-09 17:18:38 Lustre: Skipped 3462984 previous similar messages
            2015-03-09 17:19:46 LustreError: 11-0: lcrash-MDT0000-mdc-ffff8804462fd400: Communicating with 10.1.1.120@o2ib9, operation llog_origin_handle_destroy failed with -19.
            2015-03-09 17:19:46 LustreError: Skipped 8882876 previous similar messages
            2015-03-09 17:26:52 Lustre: lcrash-MDT0000-mdc-ffff8804462fd400: Connection to lcrash-MDT0000 (at 10.1.1.120@o2ib9) was lost; in progress operations using this service will wait for recovery to complete
            2015-03-09 17:26:52 Lustre: Skipped 3461504 previous similar messages
            2015-03-09 17:26:55 LNetError: 28827:0:(lib-md.c:407:LNetMDCheck()) LU-2522: md(ffff88038f4694c0)->md_flags has LNET_MD_FLAG_ZOMBIE set
            2015-03-09 17:28:38 Lustre: lcrash-MDT0000-mdc-ffff8804462fd400: Connection restored to lcrash-MDT0000 (at 10.1.1.120@o2ib9)
            2015-03-09 17:28:38 Lustre: Skipped 3472891 previous similar messages
            2015-03-09 17:29:18 LNetError: 28829:0:(lib-md.c:407:LNetMDCheck()) LU-2522: md(ffff880305a45a40)->md_flags has LNET_MD_FLAG_ZOMBIE set
            2015-03-09 17:29:46 LustreError: 11-0: lcrash-MDT0000-mdc-ffff8804462fd400: Communicating with 10.1.1.120@o2ib9, operation llog_origin_handle_destroy failed with -19.
            2015-03-09 17:29:46 LustreError: Skipped 8232155 previous similar messages
            2015-03-09 17:36:52 Lustre: lcrash-MDT0000-mdc-ffff8804462fd400: Connection to lcrash-MDT0000 (at 10.1.1.120@o2ib9) was lost; in progress operations using this service will wait for recovery to complete
            2015-03-09 17:36:52 Lustre: Skipped 3484395 previous similar messages
            2015-03-09 17:37:30 LNetError: 28829:0:(lib-md.c:407:LNetMDCheck()) LU-2522: md(ffff8802255738c0)->md_flags has LNET_MD_FLAG_ZOMBIE set
            2015-03-09 17:38:26 LNetError: 28843:0:(lib-md.c:407:LNetMDCheck()) LU-2522: md(ffff8802d20ecb40)->md_flags has LNET_MD_FLAG_ZOMBIE set
            2015-03-09 17:38:38 Lustre: lcrash-MDT0000-mdc-ffff8804462fd400: Connection restored to lcrash-MDT0000 (at 10.1.1.120@o2ib9)
            2015-03-09 17:38:38 Lustre: Skipped 3475352 previous similar messages
            2015-03-09 17:39:46 LustreError: 11-0: lcrash-MDT0000-mdc-ffff8804462fd400: Communicating with 10.1.1.120@o2ib9, operation llog_origin_handle_destroy failed with -19.
            2015-03-09 17:39:46 LustreError: Skipped 8602702 previous similar messages
            2015-03-09 17:46:52 Lustre: lcrash-MDT0000-mdc-ffff8804462fd400: Connection to lcrash-MDT0000 (at 10.1.1.120@o2ib9) was lost; in progress operations using this service will wait for recovery to complete
            2015-03-09 17:46:52 Lustre: Skipped 3456170 previous similar messages
            

            I don't know exactly which version we were running at the time. Chris, do you happen to know?

            mdiep Minh Diep added a comment -

            I also cannot reproduce this on b2_3.

            mdiep Minh Diep added a comment -

            I also cannot reproduce this on b2_4 with the debug patch http://review.whamcloud.com/10653 applied.

            I am working on b2_3 without any debug patch.

            bobijam Zhenyu Xu added a comment -

            The port of the debug patch to b2_4 is at http://review.whamcloud.com/10653

            pjones Peter Jones added a comment -

            Could we try porting this patch to b2_4 and see whether it reproduces then?

            mdiep Minh Diep added a comment -

            I haven't been able to hit this issue again on master


            ashehata Amir Shehata (Inactive) added a comment -

            I updated patch http://review.whamcloud.com/10427 with debug code that goes over the MDs in the system and checks whether any of them are in the LNET_MD_FLAG_ZOMBIE state. This is supposed to be a transitional state. If an MD stays in that state for a period of time, it means that md_refcount > 0, which in turn means that there is a message being processed on that MD but lnet_finalize() was never called on it. We suspect that reply_in_callback(), which sets the flags of interest to 0, is not being called. As far as I can tell at the moment, this would happen if lnet_finalize() is not being called. If lnet_finalize() is not called, we would leak an MD, which would remain in the LNET_MD_FLAG_ZOMBIE state. This patch should determine whether that is the case. If one such MD is found, a CERROR is output.
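            As a rough, self-contained C sketch of the kind of scan described above (illustrative only: the struct layout, the flag value, and the walker are assumptions made for this sketch, not the actual LNet data structures or the contents of the patch):

            #include <stdio.h>

            #define LNET_MD_FLAG_ZOMBIE (1 << 0)      /* placeholder value for the sketch */

            struct md_sketch {                        /* stand-in for LNet's MD descriptor */
                    unsigned int md_flags;
                    int          md_refcount;
            };

            /* Walk a set of MDs and warn about any stuck in the transitional
             * ZOMBIE state with a non-zero refcount, i.e. a message that was
             * never passed to lnet_finalize() and therefore leaks the MD. */
            static void check_zombie_mds(struct md_sketch *mds, int nr)
            {
                    int i;

                    for (i = 0; i < nr; i++) {
                            struct md_sketch *md = &mds[i];

                            if ((md->md_flags & LNET_MD_FLAG_ZOMBIE) &&
                                md->md_refcount > 0)
                                    fprintf(stderr,
                                            "LU-2522: md(%p)->md_flags has LNET_MD_FLAG_ZOMBIE set\n",
                                            (void *)md);
                    }
            }

            int main(void)
            {
                    struct md_sketch mds[2] = {
                            { 0, 0 },                   /* healthy MD */
                            { LNET_MD_FLAG_ZOMBIE, 1 }, /* leaked MD: never finalized */
                    };

                    check_zombie_mds(mds, 2);
                    return 0;
            }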

            bobijam Zhenyu Xu added a comment -

            Minh,

            Can you help reproduce the issue with the debug patch http://review.whamcloud.com/10427 applied?


            People

              Assignee: bobijam Zhenyu Xu
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 13
