Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.11.0
    • Affects Version/s: Lustre 2.7.0
    • 3
    • 12403

    Description

      This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

      This issue relates to the following test suite runs:
      http://maloo.whamcloud.com/test_sets/c57a95dc-7c91-11e3-b3fa-52540035b04c
      https://maloo.whamcloud.com/test_sets/f73662fc-83f0-11e3-bab5-52540035b04c

      The sub-test test_65ic failed with the following error:

      test failed to respond and timed out

      Info required for matching: sanity 65ic
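
      For anyone trying to reproduce this outside autotest, the failing sub-test can be run on its own with the standard Lustre test framework (a rough sketch; it assumes a lustre/tests checkout and an already set-up test filesystem):

        # Run only sanity test 65ic; ONLY= restricts sanity.sh to the named sub-test(s).
        cd lustre/tests
        ONLY=65ic bash sanity.sh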

      Attachments

      Issue Links

      Activity

            [LU-4536] sanity test_65ic
            green Oleg Drokin added a comment -

            So if you run with increased debugging, can you trace by RPC xid to see what happened with that request? Could it be the MDS replied and the client missed it? Implausible, but a start to see what's going on.

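            For reference, one way to capture the trace Oleg is describing on the client (a sketch only; the xid value and log path are placeholders):

              # Enable RPC tracing and enlarge the debug buffer before re-running the test.
              lctl set_param debug=+rpctrace
              lctl set_param debug_mb=512
              lctl clear

              # ...reproduce the hang, then dump the debug log and follow the request
              # by its xid to see whether an MDS reply ever arrived.
              lctl dk /tmp/client-debug.log
              grep <xid> /tmp/client-debug.log
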
            green Oleg Drokin added a comment -

            Actually, it's only in the console logs that there's no lfs; in the syslog the lfs is there, which is pretty strange.
            The MDS still appears to be totally idle.


            adilger Andreas Dilger added a comment -

            One possibility here is that the -1 stripe count is causing the MDS to try to access a layout with (__u16)-1 stripes, and this is causing it to be slow? I can't see any other reason why this test might time out only on ZFS.
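
            To illustrate the truncation Andreas is referring to (plain shell arithmetic, not Lustre code): a stripe count of -1 means "stripe over all OSTs", but stored in an unsigned 16-bit field it reads back as 65535:

              # (__u16)-1 == 0xffff == 65535
              printf '%d\n' $(( -1 & 0xffff ))    # prints 65535

              # a layout like that would typically be created with something like:
              #   lfs setstripe -c -1 <dir>
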
            green Oleg Drokin added a comment -

            I looked at the last two reports referenced. It's interesting that both the MDS and the clients are completely idle; the lfs command is nowhere to be found, so I assume it's already terminated?

            Why the test is stuck then is a complete mystery too.


            utopiabound Nathaniel Clark added a comment -

            The last three instances of this bug all have the same signature for lfs:
            https://testing.hpdd.intel.com/test_sets/29809bf4-86ae-11e4-87d3-5254006e85c2
            https://testing.hpdd.intel.com/test_sets/45034a76-8752-11e4-a70f-5254006e85c2
            https://testing.hpdd.intel.com/test_sets/8c14502e-8769-11e4-b712-5254006e85c2

            lfs           S 0000000000000001     0  2499   2350 0x00000080
            ffff88007304d978 0000000000000086 ffff88007bc392f0 ffff88007bc392c0
            ffff88007c9e9800 ffff88007bc392f0 ffff88007304d948 ffffffffa03ee1e1
            ffff88007a8fdab8 ffff88007304dfd8 000000000000fbc8 ffff88007a8fdab8
            Call Trace:
            [<ffffffffa03ee1e1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
            [<ffffffff81529c72>] schedule_timeout+0x192/0x2e0
            [<ffffffff81083f30>] ? process_timeout+0x0/0x10
            [<ffffffffa07524d2>] ptlrpc_set_wait+0x2b2/0x890 [ptlrpc]
            [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
            [<ffffffffa075c576>] ? lustre_msg_set_jobid+0xb6/0x140 [ptlrpc]
            [<ffffffffa0752b31>] ptlrpc_queue_wait+0x81/0x220 [ptlrpc]
            [<ffffffffa0992eac>] mdc_getattr_common+0xfc/0x420 [mdc]
            [<ffffffffa0996327>] mdc_getattr_name+0x147/0x2f0 [mdc]
            [<ffffffffa095c279>] lmv_getattr_name+0x209/0x970 [lmv]
            [<ffffffffa0b00090>] ll_lov_getstripe_ea_info+0x150/0x660 [lustre]
            [<ffffffffa0afa4f9>] ll_dir_ioctl+0x3c09/0x64d0 [lustre]
            [<ffffffffa03edba3>] ? libcfs_debug_vmsg2+0x5e3/0xbe0 [libcfs]
            [<ffffffffa03ee1e1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
            [<ffffffffa03ee1e1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
            [<ffffffff8119e4e2>] vfs_ioctl+0x22/0xa0
            [<ffffffff8119e684>] do_vfs_ioctl+0x84/0x580
            [<ffffffff81188ec2>] ? vfs_write+0x132/0x1a0
            [<ffffffff8119ec01>] sys_ioctl+0x81/0xa0
            [<ffffffff810e1bfe>] ? __audit_syscall_exit+0x25e/0x290
            [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
            
            laisiyao Lai Siyao added a comment -

            The debug log shows the statahead thread was successfully created, and statahead went well, so it didn't get stuck in the statahead code.

            I also checked several other logs and found no connection with statahead.


            adilger Andreas Dilger added a comment -

            Lai or Fan Yong, this problem seems to be related to statahead. Is there anything you can comment on to get this bug moving?

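            If it helps narrow this down, the client-side statahead state can be inspected, and statahead disabled entirely for a run, via the usual llite tunables (a sketch; exact output format varies by release):

              # Current statahead settings and per-mount statistics.
              lctl get_param llite.*.statahead_max
              lctl get_param llite.*.statahead_stats

              # Disable statahead to see whether the hang still reproduces without it.
              lctl set_param llite.*.statahead_max=0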

            gerrit Gerrit Updater added a comment -

            Andreas Dilger (andreas.dilger@intel.com) merged in patch http://review.whamcloud.com/13114/
            Subject: LU-4536 tests: disable sanity test_65ic for ZFS
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 82169531176569f1202878b9d68c3d7cc499aa15

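            For context, ZFS-only exclusions in sanity.sh normally go through the exception list, roughly as follows (a sketch of the mechanism only, not the exact hunk in the merged patch):

              # Skip test_65ic on ZFS-backed MDS setups until this ticket is resolved.
              if [ "$(facet_fstype $SINGLEMDS)" = "zfs" ]; then
                      # bug number for skipped test: LU-4536
                      ALWAYS_EXCEPT="$ALWAYS_EXCEPT 65ic"
              fi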

            gerrit Gerrit Updater added a comment -

            Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/13114
            Subject: LU-4536 tests: disable sanity test_65ic for ZFS
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 9a49ed6c2e8c32b7ce22302345ea6a2edceb71f7


            adilger Andreas Dilger added a comment -

            I'm trying to get this bug moving forward, since it seems to be stalled at the moment but is probably the #2 or #3 reason for review-zfs test failures. Looking at the client stack traces, they all appear to have lfs stuck either waiting on the MDS to reply to an RPC:
            https://testing.hpdd.intel.com/test_sets/603bbaca-5eae-11e4-a2a3-5254006e85c2
            https://testing.hpdd.intel.com/test_sets/a62db834-4ffa-11e4-9892-5254006e85c2
            https://testing.hpdd.intel.com/test_sets/7c03803e-8529-11e4-8e46-5254006e85c2

            21:41:03:lfs           S 0000000000000000     0 23186  23035 0x00000080
            21:41:03:Call Trace:
            21:41:03: [<ffffffff81529c72>] schedule_timeout+0x192/0x2e0
            21:41:03: [<ffffffffa075f71a>] ptlrpc_set_wait+0x2da/0x860 [ptlrpc]
            21:41:03: [<ffffffffa075fd27>] ptlrpc_queue_wait+0x87/0x220 [ptlrpc]
            21:41:03: [<ffffffffa0740f8e>] ldlm_cli_enqueue+0x36e/0x860 [ptlrpc]
            21:41:03: [<ffffffffa09b4f90>] mdc_enqueue+0x2d0/0x1a70 [mdc]
            21:41:03: [<ffffffffa09b6912>] mdc_intent_lock+0x1e2/0x5f9 [mdc]
            21:41:03: [<ffffffffa096ea4b>] lmv_intent_open+0x31b/0x9f0 [lmv]
            21:41:03: [<ffffffffa096f3ff>] lmv_intent_lock+0x2df/0x11c0 [lmv]
            21:41:03: [<ffffffffa0b63d27>] ll_lookup_it+0x257/0xad0 [lustre]
            21:41:03: [<ffffffffa0b64629>] ll_lookup_nd+0x89/0x5e0 [lustre]
            21:41:03: [<ffffffff81198a35>] do_lookup+0x1a5/0x230
            21:41:03: [<ffffffff811996a4>] __link_path_walk+0x7a4/0x1000
            21:41:03: [<ffffffff8119a1ba>] path_walk+0x6a/0xe0
            21:41:03: [<ffffffff8119a3cb>] filename_lookup+0x6b/0xc0
            21:41:04: [<ffffffff8119b8a4>] do_filp_open+0x104/0xd20
            21:41:04: [<ffffffff81185be9>] do_sys_open+0x69/0x140
            21:41:04: [<ffffffff81185d00>] sys_open+0x20/0x30
            

            though others appear to be waiting on the client-side statahead thread to start, so there may be two related issues here:
            https://testing.hpdd.intel.com/test_sets/c75ba76e-5ee6-11e4-badb-5254006e85c2
            https://testing.hpdd.intel.com/test_sets/6b3726e2-8516-11e4-985f-5254006e85c2

            03:58:24:lfs           R  running task        0 22884  22735 0x00000080
            03:58:24:Call Trace:
            03:58:24: [<ffffffff81529840>] _cond_resched+0x30/0x40
            03:58:24: [<ffffffff81529884>] wait_for_common+0x34/0x180
            03:58:24: [<ffffffff81529a8d>] wait_for_completion+0x1d/0x20
            03:58:24: [<ffffffff8109ad69>] kthread_create+0x99/0x120
            03:58:24: [<ffffffffa0b87960>] do_statahead_enter+0xb50/0x1220 [lustre]
            03:58:24: [<ffffffffa0b6f328>] ll_lookup_it+0x858/0xad0 [lustre]
            03:58:24: [<ffffffffa0b6f629>] ll_lookup_nd+0x89/0x5e0 [lustre]
            03:58:24: [<ffffffff81198a35>] do_lookup+0x1a5/0x230
            03:58:24: [<ffffffff811996a4>] __link_path_walk+0x7a4/0x1000
            03:58:24: [<ffffffff8119a1ba>] path_walk+0x6a/0xe0
            03:58:24: [<ffffffff8119a3cb>] filename_lookup+0x6b/0xc0
            03:58:24: [<ffffffff8119b4f7>] user_path_at+0x57/0xa0
            03:58:24: [<ffffffff8118e990>] vfs_fstatat+0x50/0xa0
            03:58:24: [<ffffffff8118ea4e>] vfs_lstat+0x1e/0x20
            03:58:24: [<ffffffff8118ea74>] sys_newlstat+0x24/0x50
            

            Unfortunately, I don't see anything in progress on the MDT at all, though the chance of catching an RPC in progress would be small. I also couldn't find the statahead thread in the client stack traces. I haven't looked at the client debug logs yet, but I expect that they will help narrow down exactly what is happening on the client (is it stuck on the client during first access/statahead, or is there some long-running problem that only makes itself seen when a large number of files have already been accessed?).

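            A follow-up check that might show which of the two paths a hung client is in (a sketch; the ll_sa thread naming and sysrq availability on the test nodes are assumptions):

              # Statahead threads, if any were created, normally show up as ll_sa_<pid>.
              ps -e -o pid,stat,wchan:30,comm | grep -E 'lfs|ll_sa'

              # Dump all task stacks to the console so the statahead thread (or its
              # absence) is visible alongside the stuck lfs process.
              echo t > /proc/sysrq-trigger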

            adilger Andreas Dilger added a comment -

            The submitted patch only improves the debugging messages and will not resolve the problem in this bug, so don't close it.


            People

              Assignee: utopiabound Nathaniel Clark
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 12

              Dates

                Created:
                Updated:
                Resolved: