LU-4413

Test failure on test suite conf-sanity, subtest test_56

Details

    Description

      This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

      This issue relates to the following test suite runs:
      http://maloo.whamcloud.com/test_sets/d5f1f7c2-6a01-11e3-9248-52540035b04c
      https://maloo.whamcloud.com/test_sets/e4cbde82-6a35-11e3-8e21-52540035b04c

      The sub-test test_56 failed with the following error:

      test failed to respond and timed out

      Info required for matching: conf-sanity 56

    Activity

            jlevi Jodi Levi (Inactive) added a comment -

            Patches landed to Master. New tickets will be opened if additional fixes are required.
            di.wang Di Wang added a comment -

            Hmm, right now the OST cannot tell whether MDT0 is being umounted or simply has not been set up yet, so the OST will retry if the request returns -ENOTREADY or -EBUSY. But I agree something needs to be fixed here, probably not in this patch.
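
            For reference, a minimal self-contained sketch of the retry rule described above. The helper name is hypothetical, and "ENOTREADY" is taken from the comment rather than from a standard errno header, so a placeholder is defined just to keep the sketch compilable:

            #include <errno.h>
            #include <stdbool.h>

            #ifndef ENOTREADY
            #define ENOTREADY EAGAIN  /* placeholder; name comes from the comment above */
            #endif

            /* The OST cannot distinguish "MDT0 not set up yet" from "MDT0 being
             * umounted", so both errors are treated as "try again later". */
            static bool seq_rpc_should_retry(int rc)
            {
                return rc == -ENOTREADY || rc == -EBUSY;
            }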


            yong.fan nasf (Inactive) added a comment -

            What happens if the OST tries to get a sequence from MDT0 which has been umounted? It will either block or return a failure; I think the latter is reasonable. If the OST gets a failure from MDT0, it should forward the failure back to MDS2, so MDS2 also should not be blocked forever. Did I miss anything?
            di.wang Di Wang added a comment - http://review.whamcloud.com/8997
            di.wang Di Wang added a comment -

            Ah, the connecting RPC is being handled by the ptlrpcd_rcv thread, but ptlrpcd_rcv is stuck at:

            ptlrpcd_rcv   S 0000000000000001     0 15409      2 0x00000080
             ffff88005ef2b8d0 0000000000000046 0000000000000000 00000000fffffffe
             ffff88005ef2b890 0000000000000000 ffff880079ddea80 ffff880079ddeab0
             ffff8800377a7af8 ffff88005ef2bfd8 000000000000fb88 ffff8800377a7af8
            Call Trace:
             [<ffffffff8150f3f2>] schedule_timeout+0x192/0x2e0
             [<ffffffff810811e0>] ? process_timeout+0x0/0x10
             [<ffffffffa0b3f99a>] ptlrpc_set_wait+0x2da/0x860 [ptlrpc]
             [<ffffffff81063990>] ? default_wake_function+0x0/0x20
             [<ffffffffa0b49c06>] ? lustre_msg_set_jobid+0xb6/0x140 [ptlrpc]
             [<ffffffffa0b3ffa7>] ptlrpc_queue_wait+0x87/0x220 [ptlrpc]
             [<ffffffffa0292bea>] seq_client_rpc+0x12a/0x910 [fid]
             [<ffffffffa06efa81>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
             [<ffffffffa02937cf>] seq_client_alloc_seq+0x3ff/0x480 [fid]
             [<ffffffffa02929c3>] ? seq_fid_alloc_prep+0x43/0xc0 [fid]
             [<ffffffffa02938c5>] seq_client_get_seq+0x75/0x1f0 [fid]
             [<ffffffff81063990>] ? default_wake_function+0x0/0x20
             [<ffffffffa0914263>] ? lu_context_init+0xa3/0x240 [obdclass]
             [<ffffffffa0debe12>] osp_init_pre_fid+0x3c2/0x500 [osp]
             [<ffffffffa0de1b3f>] osp_import_event+0xaf/0x4e0 [osp]
             [<ffffffffa0b63524>] ptlrpc_activate_import+0xb4/0x260 [ptlrpc]
             [<ffffffffa0b68d22>] ptlrpc_connect_interpret+0x1912/0x2160 [ptlrpc]
             [<ffffffffa0b3de44>] ptlrpc_check_set+0x2c4/0x1b40 [ptlrpc]
             [<ffffffffa0b6a4cb>] ptlrpcd_check+0x53b/0x560 [ptlrpc]
             [<ffffffffa0b6ab1b>] ptlrpcd+0x33b/0x3f0 [ptlrpc]
             [<ffffffff81063990>] ? default_wake_function+0x0/0x20
             [<ffffffffa0b6a7e0>] ? ptlrpcd+0x0/0x3f0 [ptlrpc]
             [<ffffffff81096a36>] kthread+0x96/0xa0
             [<ffffffff8100c0ca>] child_rip+0xa/0x20
             [<ffffffff810969a0>] ? kthread+0x0/0xa0
             [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
            

            So the whole process looks like a deadlock to me:

            1. MDS1 is being umounted.
            2. The LWP on MDS2 is trying to reconnect to MDS1, which adds a connect RPC to ptlrpcd_rcv and sets the import to CONNECTING.
            3. MDS2 is then umounted, so the LWP tries to abort all RPCs related to the import. But ptlrpcd_rcv is stuck as shown above, i.e. MDS2 is trying to allocate a sequence from one of the OSTs, but the OST cannot get a super sequence from MDT0 because MDT0 has already been umounted, so that RPC is stuck.
            4. Since the connect RPC never gets a chance to be handled in ptlrpcd_rcv, the LWP import stays in CONNECTING forever.

            So ptlrpc_disconnect_import() will wait on the recovering import forever.
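
            For reference, a minimal user-space sketch of the usual way out of this kind of cycle (hypothetical names, with pthreads standing in for a kernel worker; this is not the Lustre fix): never block the completion thread on another RPC, hand the blocking work off to a separate worker instead.

            #include <pthread.h>
            #include <stddef.h>

            /* Stand-in for the blocking sequence request to MDT0 seen in the
             * backtrace (seq_client_rpc() -> ptlrpc_queue_wait()). */
            static void *alloc_sequence_worker(void *arg)
            {
                /* ... issue the possibly long-blocking sequence request here ... */
                return NULL;
            }

            /* Stand-in for an import-event callback that runs inside the
             * completion thread (ptlrpcd_rcv in the backtrace above). */
            static void import_event_sketch(void)
            {
                pthread_t tid;

                /* Defer the blocking work so the completion thread can keep
                 * processing (and aborting) other RPCs, e.g. the LWP connect. */
                if (pthread_create(&tid, NULL, alloc_sequence_worker, NULL) == 0)
                    pthread_detach(tid);
            }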

            di.wang Di Wang added a comment -

            Yes, I can see from the debug log that it is a race condition, i.e. MDS2 is doing an umount while its LWP connection to MDS1 has not been set up yet (the import is in CONNECTING state), and in the meantime MDS1 has been umounted. But there is one thing I do not understand.

            1. MDS2 is trying to set up a connection with MDS1, which means there must be a connect RPC in flight or sitting somewhere in ptlrpcd.
            2. Before lwp_disconnect() calls ptlrpc_disconnect_import(), it is supposed to call ptlrpc_deactivate_import(), which IMHO drains all RPCs related to this import; the connect interpret callback should then set the import to DISCONNECTED, and ptlrpc_disconnect_import() should be woken up instead of waiting for a non-recovery state.

            If my understanding is correct, then we probably need to find out why the connect RPC is not being interrupted during the umount process.
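
            For reference, a simplified sketch of the teardown ordering described above; the struct and helpers below are stand-ins, not the actual lwp_disconnect() code:

            /* Hypothetical stand-ins for the real import and its helpers. */
            struct sketch_import { int refcount; };

            static void deactivate_import(struct sketch_import *imp)
            { /* abort/drain in-flight RPCs, including a pending connect */ }

            static void pinger_del_import(struct sketch_import *imp)
            { /* stop automatic reconnect attempts */ }

            static void disconnect_import(struct sketch_import *imp)
            { /* send DISCONNECT; should not need to wait for recovery */ }

            static void lwp_disconnect_sketch(struct sketch_import *imp)
            {
                /* Deactivate first, so the connect-interpret callback moves the
                 * import to DISCONNECTED and the final step has nothing to wait on. */
                deactivate_import(imp);
                pinger_del_import(imp);
                disconnect_import(imp);
            }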


            adilger Andreas Dilger added a comment -

            Di, would it be possible to just change ptlrpc_disconnect_import() to not wait for recovery if obd_no_recov is set (this is done in ptlrpc_pinger_del_import())? Otherwise, it seems LWP connections will never reconnect, because the import was removed from the pinger via ptlrpc_pinger_del_import() in lwp_disconnect(). That means nothing is going to try to reconnect the import, and it will just wait there forever.

            I've pushed http://review.whamcloud.com/8996 which may or may not fix this problem. I think what is happening is that the MDS is trying to unmount while it still has a failed LWP connection to another MDS that was recently stopped. It is racy whether that connection has recovered yet or not, but if it hasn't, then lwp_disconnect()->ptlrpc_pinger_del_import() means that the connection will never be recovered, and ptlrpc_disconnect_import() will wait forever for that to happen.
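
            For reference, a self-contained sketch of the check being suggested here; the struct definitions are simplified stand-ins, not the real obd_import/obd_device, and this is not the actual patch:

            #include <stdbool.h>

            /* Simplified stand-ins for the real structures. */
            struct sketch_obd_device { bool obd_no_recov; };
            struct sketch_obd_import {
                struct sketch_obd_device *imp_obd;
                bool                      imp_connecting;  /* "still in recovery" */
            };

            /* Should ptlrpc_disconnect_import()-like code wait for recovery? */
            static bool disconnect_should_wait(const struct sketch_obd_import *imp)
            {
                /* If recovery is administratively disabled, nothing will ever
                 * reconnect this import, so waiting would hang forever. */
                if (imp->imp_obd->obd_no_recov)
                    return false;

                return imp->imp_connecting;
            }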


            adilger Andreas Dilger added a comment -

            Update from Di on this topic:

            [1/24/2014, 2:08:11 PM] Wang Di:  here is the umount process stack
            [1/24/2014, 2:08:13 PM] Wang Di: umount        S 0000000000000000     0 25182  25181 0x00000080
             ffff8800647cdc68 0000000000000086 0000000000000000 ffffffff8150f3fa
             00000000ffffff0a 0000000000000282 0000000000000000 dead000000200200
             ffff880037eb2638 ffff8800647cdfd8 000000000000fb88 ffff880037eb2638
            Call Trace:
             [<ffffffff8150f3fa>] ? schedule_timeout+0x19a/0x2e0
             [<ffffffffa0b6326d>] ptlrpc_disconnect_import+0x77d/0x7a0 [ptlrpc]
             [<ffffffff81063990>] ? default_wake_function+0x0/0x20
             [<ffffffffa06ea4e8>] ? libcfs_log_return+0x28/0x40 [libcfs]
             [<ffffffffa0df94e9>] lwp_disconnect+0x69/0x160 [osp]
             [<ffffffffa0dfb250>] lwp_process_config+0x150/0x260 [osp]
             [<ffffffffa0937e85>] lustre_disconnect_lwp+0x1c5/0xe40 [obdclass]
             [<ffffffffa093ec40>] server_put_super+0x630/0xf60 [obdclass]
             [<ffffffff8119d906>] ? invalidate_inodes+0xf6/0x190
             [<ffffffff8118366b>] generic_shutdown_super+0x5b/0xe0
             [<ffffffff81183756>] kill_anon_super+0x16/0x60
             [<ffffffffa0906506>] lustre_kill_super+0x36/0x60 [obdclass]
             [<ffffffff81183ef7>] deactivate_super+0x57/0x80
             [<ffffffff811a21ef>] mntput_no_expire+0xbf/0x110
             [<ffffffff811a2c5b>] sys_umount+0x7b/0x3a0
             [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
            

            It seems the LWP is waiting for recovery, but the LWP should not be in recovery states at all. I think the problem is that the state of the LWP should only ever be FULL, CLOSED, or DISCONN, which ensures the LWP will not enter recovery states, but somehow something changed the state of the LWP to some other state. Trying to figure out why.
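
            For reference, a self-contained sketch of that invariant; the enum and helper below are hypothetical, not the Lustre import-state definitions:

            #include <stdbool.h>

            /* Names mirror the states mentioned above; not the real enum. */
            enum lwp_state { LWP_FULL, LWP_CLOSED, LWP_DISCONN, LWP_OTHER };

            /* An LWP import should only ever be FULL, CLOSED or DISCONN; any
             * other (recovery-related) state indicates the bug described here. */
            static bool lwp_state_is_expected(enum lwp_state state)
            {
                return state == LWP_FULL ||
                       state == LWP_CLOSED ||
                       state == LWP_DISCONN;
            }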


            adilger Andreas Dilger added a comment -

            If you do a search for TIMEOUT on conf-sanity for master, there are problems in conf-sanity on many different tests - 58, 160a, 17n, 72, 80f, 56, 12, 28a, 70b, 90, 161a, 47, 0, 2b, 12, 50h, 230d, 133f - just on a single page of results.

            This is a major problem and needs to be investigated as the highest priority.

            utopiabound Nathaniel Clark added a comment -

            Seeing this on review-dne a lot (like LU-4508):
            conf-sanity/72 https://maloo.whamcloud.com/test_sets/e70aef8c-8464-11e3-bab5-52540035b04c
            conf-sanity/58 https://maloo.whamcloud.com/test_sets/990ad45a-84ff-11e3-86af-52540035b04c
            conf-sanity/56 https://maloo.whamcloud.com/test_sets/6498f574-84cf-11e3-8da9-52540035b04c

            People

              di.wang Di Wang
              maloo Maloo
              Votes: 0
              Watchers: 7
