Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Lustre 2.11.0
    • Lustre 2.7.0
    • 3
    • 12403

    Description

      This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

      This issue relates to the following test suite run:
      http://maloo.whamcloud.com/test_sets/c57a95dc-7c91-11e3-b3fa-52540035b04c
      https://maloo.whamcloud.com/test_sets/f73662fc-83f0-11e3-bab5-52540035b04c

      The sub-test test_65ic failed with the following error:

      test failed to respond and timed out

      Info required for matching: sanity 65ic

      Attachments

        Issue Links

          Activity

            [LU-4536] sanity test_65ic

            adilger Andreas Dilger added a comment -

            This hasn't been seen because the test is currently always being skipped.

            pjones Peter Jones added a comment -

            Has not been seen in many months


            simmonsja James A Simmons added a comment -

            Hm. Looks like I need to push a patch to clean up the time wrappers.

            bevans Ben Evans (Inactive) added a comment -

            Just ran into this (or something quite similar) on a 2.5.2 build.

            I think the ldlm_pool_recalc may help, but I also noticed that internally ldlm_pool_recalc is working with time_t, and returns an int. I'm wondering if there are 32/64-bit issues here. In my case, I know time_t is 64 bit, and int is 32 bit.
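
            As a side note on the narrowing Ben describes, the sketch below is purely illustrative: it assumes a 64-bit time_t, and recalc_interval is a made-up helper, not the actual ldlm_pool_recalc code. It only shows how computing an interval in time_t and returning it as int silently truncates large values.

            #include <stdio.h>
            #include <time.h>

            /*
             * Illustrative only, not the real ldlm_pool_recalc(): a recalc-style
             * helper that does its arithmetic in time_t (64-bit here) but returns
             * the result as int (32-bit), mirroring the mismatch described above.
             */
            static int recalc_interval(time_t last_recalc)
            {
                time_t now = time(NULL);
                time_t delta = now - last_recalc;   /* full 64-bit difference */

                return delta;   /* implicitly narrowed to int on return */
            }

            int main(void)
            {
                /* An artificially old timestamp makes the difference exceed INT_MAX. */
                time_t stale = (time_t)-5000000000LL;
                time_t full = time(NULL) - stale;

                printf("time_t delta: %lld seconds\n", (long long)full);
                printf("int return:   %d seconds\n", recalc_interval(stale));
                return 0;
            }

            Whether such a truncation is actually reachable in the ldlm pool code is a separate question; this only demonstrates the type mismatch.
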

            gerrit Gerrit Updater added a comment -

            Nathaniel Clark (nathaniel.l.clark@intel.com) uploaded a new patch: http://review.whamcloud.com/13512
            Subject: LU-4536 tests: Add debugging to sanity/65ic
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: d2c638d2e7edaca81699249f35b6b2567a47dd7d

            green Oleg Drokin added a comment -

            So, can you run with increased debugging so that you can trace by RPC xid and see what happened with that request? Could it be that the MDS replied and the client missed it? Implausible, but it's a start to see what's going on.

            green Oleg Drokin added a comment -

            Actually, it's only in the console logs that there's no lfs; in syslog the lfs is there, which is pretty strange.
            The MDS still appears to be totally idle.

            adilger Andreas Dilger added a comment -

            One possibility here is that the -1 stripe count is causing the MDS to try to access a layout with (__u16)-1 stripes, and this is causing it to be slow? I can't see any other reason why this test might time out only on ZFS.
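
            For context on the (__u16)-1 remark: a stripe count of -1 stored into an unsigned 16-bit field becomes 65535, so any code that sizes or walks a layout by that raw value would iterate far more than intended. A minimal, hypothetical sketch of that cast, assuming nothing about the actual MDS code paths:

            #include <stdio.h>
            #include <stdint.h>

            /* Stand-in for the kernel's __u16 type. */
            typedef uint16_t u16;

            int main(void)
            {
                int requested = -1;             /* "stripe over all OSTs" */
                u16 stored = (u16)requested;    /* what (__u16)-1 becomes */

                /* -1 wraps to the maximum 16-bit value, 65535 (0xffff). */
                printf("requested=%d stored=%u\n", requested, stored);

                /* Any code that sizes or walks a layout by the raw value would
                 * iterate 65535 times instead of the real OST count. */
                size_t bytes_per_stripe = 24;   /* purely illustrative size */
                printf("naive per-stripe allocation: %zu bytes\n",
                       (size_t)stored * bytes_per_stripe);
                return 0;
            }

            Presumably the real code special-cases this value; the sketch only shows the magnitude involved if it did not.
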
            green Oleg Drokin added a comment -

            I looked at the last two reports referenced. It's interesting that both the MDS and the clients are completely idle; the lfs command is nowhere to be found, so I assume it has already terminated?

            Now why the test is stuck then is a complete mystery too.


            utopiabound Nathaniel Clark added a comment -

            The last three instances of this bug all have the same signature for lfs:
            https://testing.hpdd.intel.com/test_sets/29809bf4-86ae-11e4-87d3-5254006e85c2
            https://testing.hpdd.intel.com/test_sets/45034a76-8752-11e4-a70f-5254006e85c2
            https://testing.hpdd.intel.com/test_sets/8c14502e-8769-11e4-b712-5254006e85c2

            lfs           S 0000000000000001     0  2499   2350 0x00000080
            ffff88007304d978 0000000000000086 ffff88007bc392f0 ffff88007bc392c0
            ffff88007c9e9800 ffff88007bc392f0 ffff88007304d948 ffffffffa03ee1e1
            ffff88007a8fdab8 ffff88007304dfd8 000000000000fbc8 ffff88007a8fdab8
            Call Trace:
            [<ffffffffa03ee1e1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
            [<ffffffff81529c72>] schedule_timeout+0x192/0x2e0
            [<ffffffff81083f30>] ? process_timeout+0x0/0x10
            [<ffffffffa07524d2>] ptlrpc_set_wait+0x2b2/0x890 [ptlrpc]
            [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
            [<ffffffffa075c576>] ? lustre_msg_set_jobid+0xb6/0x140 [ptlrpc]
            [<ffffffffa0752b31>] ptlrpc_queue_wait+0x81/0x220 [ptlrpc]
            [<ffffffffa0992eac>] mdc_getattr_common+0xfc/0x420 [mdc]
            [<ffffffffa0996327>] mdc_getattr_name+0x147/0x2f0 [mdc]
            [<ffffffffa095c279>] lmv_getattr_name+0x209/0x970 [lmv]
            [<ffffffffa0b00090>] ll_lov_getstripe_ea_info+0x150/0x660 [lustre]
            [<ffffffffa0afa4f9>] ll_dir_ioctl+0x3c09/0x64d0 [lustre]
            [<ffffffffa03edba3>] ? libcfs_debug_vmsg2+0x5e3/0xbe0 [libcfs]
            [<ffffffffa03ee1e1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
            [<ffffffffa03ee1e1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
            [<ffffffff8119e4e2>] vfs_ioctl+0x22/0xa0
            [<ffffffff8119e684>] do_vfs_ioctl+0x84/0x580
            [<ffffffff81188ec2>] ? vfs_write+0x132/0x1a0
            [<ffffffff8119ec01>] sys_ioctl+0x81/0xa0
            [<ffffffff810e1bfe>] ? __audit_syscall_exit+0x25e/0x290
            [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
            
            laisiyao Lai Siyao added a comment -

            The debug log shows the statahead thread was successfully created and statahead went well, so it isn't stuck in the statahead code.

            I also checked several other logs and found no connection with statahead.

            People

              Assignee: utopiabound Nathaniel Clark
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 12

              Dates

                Created:
                Updated:
                Resolved: