Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Lustre 2.11.0
    • Lustre 2.7.0
    • 3
    • 12403

    Description

      This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

      This issue relates to the following test suite run:
      http://maloo.whamcloud.com/test_sets/c57a95dc-7c91-11e3-b3fa-52540035b04c
      https://maloo.whamcloud.com/test_sets/f73662fc-83f0-11e3-bab5-52540035b04c

      The sub-test test_65ic failed with the following error:

      test failed to respond and timed out

      Info required for matching: sanity 65ic

      Attachments

        Issue Links

          Activity

            [LU-4536] sanity test_65ic

            adilger Andreas Dilger added a comment -

            This hasn't been seen because the test is currently always being skipped.

            pjones Peter Jones added a comment -

            Has not been seen in many months


            simmonsja James A Simmons added a comment -

            Hm. Looks like I need to push a patch to clean up the time wrappers.

            bevans Ben Evans (Inactive) added a comment -

            Just ran into this (or something quite similar) on a 2.5.2 build.

            I think the ldlm_pool_recalc may help, but I also noticed that internally ldlm_pool_recalc is working with time_t, and returns an int. I'm wondering if there are 32/64-bit issues here. In my case, I know time_t is 64 bit, and int is 32 bit.
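
            As a side note on the narrowing Ben describes, the sketch below is purely illustrative: it assumes a 64-bit time_t, and recalc_interval is a made-up helper, not the actual ldlm_pool_recalc code. It only shows how computing an interval in time_t and returning it as int silently truncates large values.

            #include <stdio.h>
            #include <time.h>

            /*
             * Illustrative only, not the real ldlm_pool_recalc(): a recalc-style
             * helper that does its arithmetic in time_t (64-bit here) but returns
             * the result as int (32-bit), mirroring the mismatch described above.
             */
            static int recalc_interval(time_t last_recalc)
            {
                time_t now = time(NULL);
                time_t delta = now - last_recalc;   /* full 64-bit difference */

                return delta;   /* implicitly narrowed to int on return */
            }

            int main(void)
            {
                /* An artificially old timestamp makes the difference exceed INT_MAX. */
                time_t stale = (time_t)-5000000000LL;
                time_t full = time(NULL) - stale;

                printf("time_t delta: %lld seconds\n", (long long)full);
                printf("int return:   %d seconds\n", recalc_interval(stale));
                return 0;
            }

            Whether such a truncation is actually reachable in the ldlm pool code is a separate question; this only demonstrates the type mismatch.
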

            gerrit Gerrit Updater added a comment -

            Nathaniel Clark (nathaniel.l.clark@intel.com) uploaded a new patch: http://review.whamcloud.com/13512
            Subject: LU-4536 tests: Add debugging to sanity/65ic
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: d2c638d2e7edaca81699249f35b6b2567a47dd7d

            green Oleg Drokin added a comment -

            So, can you run with increased debugging so that you can trace by RPC xid and see what happened with that request? Could it be that the MDS replied and the client missed it? Implausible, but it's a start to see what's going on.

            green Oleg Drokin added a comment -

            Actually, it's only in the console logs that there's no lfs; in syslog the lfs is there, which is pretty strange.
            The MDS still appears to be totally idle.

            adilger Andreas Dilger added a comment -

            One possibility here is that the -1 stripe count is causing the MDS to try to access a layout with (__u16)-1 stripes, and this is causing it to be slow? I can't see any other reason why this test might time out only on ZFS.
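
            For context on the (__u16)-1 remark: a stripe count of -1 stored into an unsigned 16-bit field becomes 65535, so any code that sizes or walks a layout by that raw value would iterate far more than intended. A minimal, hypothetical sketch of that cast, assuming nothing about the actual MDS code paths:

            #include <stdio.h>
            #include <stdint.h>

            /* Stand-in for the kernel's __u16 type. */
            typedef uint16_t u16;

            int main(void)
            {
                int requested = -1;             /* "stripe over all OSTs" */
                u16 stored = (u16)requested;    /* what (__u16)-1 becomes */

                /* -1 wraps to the maximum 16-bit value, 65535 (0xffff). */
                printf("requested=%d stored=%u\n", requested, stored);

                /* Any code that sizes or walks a layout by the raw value would
                 * iterate 65535 times instead of the real OST count. */
                size_t bytes_per_stripe = 24;   /* purely illustrative size */
                printf("naive per-stripe allocation: %zu bytes\n",
                       (size_t)stored * bytes_per_stripe);
                return 0;
            }

            Presumably the real code special-cases this value; the sketch only shows the magnitude involved if it did not.
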
            green Oleg Drokin added a comment -

            I looked at the last two reports referenced. It's interesting that both the MDS and the clients are completely idle; the lfs command is nowhere to be found, so I assume it has already terminated?

            Now why the test is stuck then is a complete mystery too.


            utopiabound Nathaniel Clark added a comment -

            The last three instances of this bug all have the same signature for lfs:
            https://testing.hpdd.intel.com/test_sets/29809bf4-86ae-11e4-87d3-5254006e85c2
            https://testing.hpdd.intel.com/test_sets/45034a76-8752-11e4-a70f-5254006e85c2
            https://testing.hpdd.intel.com/test_sets/8c14502e-8769-11e4-b712-5254006e85c2

            lfs           S 0000000000000001     0  2499   2350 0x00000080
            ffff88007304d978 0000000000000086 ffff88007bc392f0 ffff88007bc392c0
            ffff88007c9e9800 ffff88007bc392f0 ffff88007304d948 ffffffffa03ee1e1
            ffff88007a8fdab8 ffff88007304dfd8 000000000000fbc8 ffff88007a8fdab8
            Call Trace:
            [<ffffffffa03ee1e1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
            [<ffffffff81529c72>] schedule_timeout+0x192/0x2e0
            [<ffffffff81083f30>] ? process_timeout+0x0/0x10
            [<ffffffffa07524d2>] ptlrpc_set_wait+0x2b2/0x890 [ptlrpc]
            [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
            [<ffffffffa075c576>] ? lustre_msg_set_jobid+0xb6/0x140 [ptlrpc]
            [<ffffffffa0752b31>] ptlrpc_queue_wait+0x81/0x220 [ptlrpc]
            [<ffffffffa0992eac>] mdc_getattr_common+0xfc/0x420 [mdc]
            [<ffffffffa0996327>] mdc_getattr_name+0x147/0x2f0 [mdc]
            [<ffffffffa095c279>] lmv_getattr_name+0x209/0x970 [lmv]
            [<ffffffffa0b00090>] ll_lov_getstripe_ea_info+0x150/0x660 [lustre]
            [<ffffffffa0afa4f9>] ll_dir_ioctl+0x3c09/0x64d0 [lustre]
            [<ffffffffa03edba3>] ? libcfs_debug_vmsg2+0x5e3/0xbe0 [libcfs]
            [<ffffffffa03ee1e1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
            [<ffffffffa03ee1e1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
            [<ffffffff8119e4e2>] vfs_ioctl+0x22/0xa0
            [<ffffffff8119e684>] do_vfs_ioctl+0x84/0x580
            [<ffffffff81188ec2>] ? vfs_write+0x132/0x1a0
            [<ffffffff8119ec01>] sys_ioctl+0x81/0xa0
            [<ffffffff810e1bfe>] ? __audit_syscall_exit+0x25e/0x290
            [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
            
            laisiyao Lai Siyao added a comment -

            The debug log shows the statahead thread was successfully created and statahead went well, so it isn't stuck in the statahead code.

            I also checked several other logs and found no connection with statahead.

            People

              Assignee: utopiabound Nathaniel Clark
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 12

              Dates

                Created:
                Updated:
                Resolved: