Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.11.0
    • Affects Version/s: Lustre 2.7.0
    • 3
    • 12403

    Description

      This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

      This issue relates to the following test suite runs:
      http://maloo.whamcloud.com/test_sets/c57a95dc-7c91-11e3-b3fa-52540035b04c
      https://maloo.whamcloud.com/test_sets/f73662fc-83f0-11e3-bab5-52540035b04c

      The sub-test test_65ic failed with the following error:

      test failed to respond and timed out

      Info required for matching: sanity 65ic
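
      For anyone trying to reproduce this outside autotest, the failing sub-test can be run on its own with the standard Lustre test framework (a rough sketch; it assumes a lustre/tests checkout and an already set-up test filesystem):

        # Run only sanity test 65ic; ONLY= restricts sanity.sh to the named sub-test(s).
        cd lustre/tests
        ONLY=65ic bash sanity.sh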

      Attachments

      Issue Links

      Activity

            [LU-4536] sanity test_65ic
            green Oleg Drokin added a comment -

            So if you run with increased debugging, can you trace by RPC xid to see what happened with that request? Could it be the MDS replied and the client missed it? Implausible, but a start to see what's going on.

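            For reference, one way to capture the trace Oleg is describing on the client (a sketch only; the xid value and log path are placeholders):

              # Enable RPC tracing and enlarge the debug buffer before re-running the test.
              lctl set_param debug=+rpctrace
              lctl set_param debug_mb=512
              lctl clear

              # ...reproduce the hang, then dump the debug log and follow the request
              # by its xid to see whether an MDS reply ever arrived.
              lctl dk /tmp/client-debug.log
              grep <xid> /tmp/client-debug.log
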
            green Oleg Drokin added a comment -

            Actually, it's only in the console logs that there's no lfs; in the syslog the lfs is there, which is pretty strange.
            The MDS still appears to be totally idle.


            adilger Andreas Dilger added a comment -

            One possibility here is that the -1 stripe count is causing the MDS to try to access a layout with (__u16)-1 stripes, and this is causing it to be slow? I can't see any other reason why this test might time out only on ZFS.
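
            To illustrate the truncation Andreas is referring to (plain shell arithmetic, not Lustre code): a stripe count of -1 means "stripe over all OSTs", but stored in an unsigned 16-bit field it reads back as 65535:

              # (__u16)-1 == 0xffff == 65535
              printf '%d\n' $(( -1 & 0xffff ))    # prints 65535

              # a layout like that would typically be created with something like:
              #   lfs setstripe -c -1 <dir>
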
            green Oleg Drokin added a comment -

            I looked at the last two reports referenced. It's interesting that both the MDS and the clients are completely idle; the lfs command is nowhere to be found, so I assume it's already terminated?

            Why the test is stuck then is a complete mystery too.


            utopiabound Nathaniel Clark added a comment -

            The last three instances of this bug all have the same signature for lfs:
            https://testing.hpdd.intel.com/test_sets/29809bf4-86ae-11e4-87d3-5254006e85c2
            https://testing.hpdd.intel.com/test_sets/45034a76-8752-11e4-a70f-5254006e85c2
            https://testing.hpdd.intel.com/test_sets/8c14502e-8769-11e4-b712-5254006e85c2

            lfs           S 0000000000000001     0  2499   2350 0x00000080
            ffff88007304d978 0000000000000086 ffff88007bc392f0 ffff88007bc392c0
            ffff88007c9e9800 ffff88007bc392f0 ffff88007304d948 ffffffffa03ee1e1
            ffff88007a8fdab8 ffff88007304dfd8 000000000000fbc8 ffff88007a8fdab8
            Call Trace:
            [<ffffffffa03ee1e1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
            [<ffffffff81529c72>] schedule_timeout+0x192/0x2e0
            [<ffffffff81083f30>] ? process_timeout+0x0/0x10
            [<ffffffffa07524d2>] ptlrpc_set_wait+0x2b2/0x890 [ptlrpc]
            [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
            [<ffffffffa075c576>] ? lustre_msg_set_jobid+0xb6/0x140 [ptlrpc]
            [<ffffffffa0752b31>] ptlrpc_queue_wait+0x81/0x220 [ptlrpc]
            [<ffffffffa0992eac>] mdc_getattr_common+0xfc/0x420 [mdc]
            [<ffffffffa0996327>] mdc_getattr_name+0x147/0x2f0 [mdc]
            [<ffffffffa095c279>] lmv_getattr_name+0x209/0x970 [lmv]
            [<ffffffffa0b00090>] ll_lov_getstripe_ea_info+0x150/0x660 [lustre]
            [<ffffffffa0afa4f9>] ll_dir_ioctl+0x3c09/0x64d0 [lustre]
            [<ffffffffa03edba3>] ? libcfs_debug_vmsg2+0x5e3/0xbe0 [libcfs]
            [<ffffffffa03ee1e1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
            [<ffffffffa03ee1e1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
            [<ffffffff8119e4e2>] vfs_ioctl+0x22/0xa0
            [<ffffffff8119e684>] do_vfs_ioctl+0x84/0x580
            [<ffffffff81188ec2>] ? vfs_write+0x132/0x1a0
            [<ffffffff8119ec01>] sys_ioctl+0x81/0xa0
            [<ffffffff810e1bfe>] ? __audit_syscall_exit+0x25e/0x290
            [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
            
            laisiyao Lai Siyao added a comment -

            The debug log shows the statahead thread was successfully created, and statahead went well, so it didn't get stuck in the statahead code.

            I also checked several other logs and found no connection with statahead.


            adilger Andreas Dilger added a comment -

            Lai or Fan Yong, this problem seems to be related to statahead. Is there anything you can comment on to get this bug moving?

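            If it helps narrow this down, the client-side statahead state can be inspected, and statahead disabled entirely for a run, via the usual llite tunables (a sketch; exact output format varies by release):

              # Current statahead settings and per-mount statistics.
              lctl get_param llite.*.statahead_max
              lctl get_param llite.*.statahead_stats

              # Disable statahead to see whether the hang still reproduces without it.
              lctl set_param llite.*.statahead_max=0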

            gerrit Gerrit Updater added a comment -

            Andreas Dilger (andreas.dilger@intel.com) merged in patch http://review.whamcloud.com/13114/
            Subject: LU-4536 tests: disable sanity test_65ic for ZFS
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 82169531176569f1202878b9d68c3d7cc499aa15

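            For context, ZFS-only exclusions in sanity.sh normally go through the exception list, roughly as follows (a sketch of the mechanism only, not the exact hunk in the merged patch):

              # Skip test_65ic on ZFS-backed MDS setups until this ticket is resolved.
              if [ "$(facet_fstype $SINGLEMDS)" = "zfs" ]; then
                      # bug number for skipped test: LU-4536
                      ALWAYS_EXCEPT="$ALWAYS_EXCEPT 65ic"
              fi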

            gerrit Gerrit Updater added a comment -

            Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/13114
            Subject: LU-4536 tests: disable sanity test_65ic for ZFS
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 9a49ed6c2e8c32b7ce22302345ea6a2edceb71f7


            adilger Andreas Dilger added a comment -

            I'm trying to get this bug moving forward, since it seems to be stalled at the moment but is probably the #2 or #3 reason for review-zfs test failures. Looking at the client stack traces, they all appear to have lfs stuck either waiting on the MDS to reply to an RPC:
            https://testing.hpdd.intel.com/test_sets/603bbaca-5eae-11e4-a2a3-5254006e85c2
            https://testing.hpdd.intel.com/test_sets/a62db834-4ffa-11e4-9892-5254006e85c2
            https://testing.hpdd.intel.com/test_sets/7c03803e-8529-11e4-8e46-5254006e85c2

            21:41:03:lfs           S 0000000000000000     0 23186  23035 0x00000080
            21:41:03:Call Trace:
            21:41:03: [<ffffffff81529c72>] schedule_timeout+0x192/0x2e0
            21:41:03: [<ffffffffa075f71a>] ptlrpc_set_wait+0x2da/0x860 [ptlrpc]
            21:41:03: [<ffffffffa075fd27>] ptlrpc_queue_wait+0x87/0x220 [ptlrpc]
            21:41:03: [<ffffffffa0740f8e>] ldlm_cli_enqueue+0x36e/0x860 [ptlrpc]
            21:41:03: [<ffffffffa09b4f90>] mdc_enqueue+0x2d0/0x1a70 [mdc]
            21:41:03: [<ffffffffa09b6912>] mdc_intent_lock+0x1e2/0x5f9 [mdc]
            21:41:03: [<ffffffffa096ea4b>] lmv_intent_open+0x31b/0x9f0 [lmv]
            21:41:03: [<ffffffffa096f3ff>] lmv_intent_lock+0x2df/0x11c0 [lmv]
            21:41:03: [<ffffffffa0b63d27>] ll_lookup_it+0x257/0xad0 [lustre]
            21:41:03: [<ffffffffa0b64629>] ll_lookup_nd+0x89/0x5e0 [lustre]
            21:41:03: [<ffffffff81198a35>] do_lookup+0x1a5/0x230
            21:41:03: [<ffffffff811996a4>] __link_path_walk+0x7a4/0x1000
            21:41:03: [<ffffffff8119a1ba>] path_walk+0x6a/0xe0
            21:41:03: [<ffffffff8119a3cb>] filename_lookup+0x6b/0xc0
            21:41:04: [<ffffffff8119b8a4>] do_filp_open+0x104/0xd20
            21:41:04: [<ffffffff81185be9>] do_sys_open+0x69/0x140
            21:41:04: [<ffffffff81185d00>] sys_open+0x20/0x30
            

            though others appear to be waiting on the client-side statahead thread to start, so there may be two related issues here:
            https://testing.hpdd.intel.com/test_sets/c75ba76e-5ee6-11e4-badb-5254006e85c2
            https://testing.hpdd.intel.com/test_sets/6b3726e2-8516-11e4-985f-5254006e85c2

            03:58:24:lfs           R  running task        0 22884  22735 0x00000080
            03:58:24:Call Trace:
            03:58:24: [<ffffffff81529840>] _cond_resched+0x30/0x40
            03:58:24: [<ffffffff81529884>] wait_for_common+0x34/0x180
            03:58:24: [<ffffffff81529a8d>] wait_for_completion+0x1d/0x20
            03:58:24: [<ffffffff8109ad69>] kthread_create+0x99/0x120
            03:58:24: [<ffffffffa0b87960>] do_statahead_enter+0xb50/0x1220 [lustre]
            03:58:24: [<ffffffffa0b6f328>] ll_lookup_it+0x858/0xad0 [lustre]
            03:58:24: [<ffffffffa0b6f629>] ll_lookup_nd+0x89/0x5e0 [lustre]
            03:58:24: [<ffffffff81198a35>] do_lookup+0x1a5/0x230
            03:58:24: [<ffffffff811996a4>] __link_path_walk+0x7a4/0x1000
            03:58:24: [<ffffffff8119a1ba>] path_walk+0x6a/0xe0
            03:58:24: [<ffffffff8119a3cb>] filename_lookup+0x6b/0xc0
            03:58:24: [<ffffffff8119b4f7>] user_path_at+0x57/0xa0
            03:58:24: [<ffffffff8118e990>] vfs_fstatat+0x50/0xa0
            03:58:24: [<ffffffff8118ea4e>] vfs_lstat+0x1e/0x20
            03:58:24: [<ffffffff8118ea74>] sys_newlstat+0x24/0x50
            

            Unfortunately, I don't see anything in progress on the MDT at all, though the chance of catching an RPC in progress would be small. I also couldn't find the statahead thread in the client stack traces. I haven't looked at the client debug logs yet, but I expect that they will help narrow down exactly what is happening on the client (is it stuck on the client during first access/statahead, or is there some long-running problem that only makes itself seen when a large number of files have already been accessed?).

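            A follow-up check that might show which of the two paths a hung client is in (a sketch; the ll_sa thread naming and sysrq availability on the test nodes are assumptions):

              # Statahead threads, if any were created, normally show up as ll_sa_<pid>.
              ps -e -o pid,stat,wchan:30,comm | grep -E 'lfs|ll_sa'

              # Dump all task stacks to the console so the statahead thread (or its
              # absence) is visible alongside the stuck lfs process.
              echo t > /proc/sysrq-trigger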

            adilger Andreas Dilger added a comment -

            The submitted patch only improves the debugging messages and will not resolve the problem in this bug, so don't close it.


            People

              Assignee: utopiabound Nathaniel Clark
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 12

              Dates

                Created:
                Updated:
                Resolved: