[LU-4536] sanity test_65ic Created: 24/Jan/14  Updated: 22/Dec/17  Resolved: 02/Aug/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: Lustre 2.11.0

Type: Bug Priority: Major
Reporter: Maloo Assignee: Nathaniel Clark
Resolution: Fixed Votes: 0
Labels: HB, zfs

Issue Links:
Related
is related to LU-5415 High ldlm_poold load on client Resolved
is related to LU-2524 Tests regressions: tests interrelatio... Closed
is related to LU-9019 Migrate lustre to standard 64 bit tim... Resolved
Severity: 3
Rank (Obsolete): 12403

 Description   

This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

This issue relates to the following test suite run:
http://maloo.whamcloud.com/test_sets/c57a95dc-7c91-11e3-b3fa-52540035b04c
https://maloo.whamcloud.com/test_sets/f73662fc-83f0-11e3-bab5-52540035b04c

The sub-test test_65ic failed with the following error:

test failed to respond and timed out

Info required for matching: sanity 65ic



 Comments   
Comment by Andreas Dilger [ 27/Jan/14 ]

This may be related to the patch from LU-2524 no longer deleting the large number of old files left in the filesystem by test_51b, causing the "find" to access many more files than previously.

Comment by Jian Yu [ 09/Oct/14 ]

One more instance on master branch:
https://testing.hpdd.intel.com/test_sets/a4cc14a6-4f9e-11e4-8e65-5254006e85c2

Comment by Jian Yu [ 09/Oct/14 ]

This is blocking patch review testing on master branch:
https://testing.hpdd.intel.com/test_sets/a62db834-4ffa-11e4-9892-5254006e85c2

Comment by nasf (Inactive) [ 28/Oct/14 ]

More failure instances:
https://testing.hpdd.intel.com/test_sets/c75ba76e-5ee6-11e4-badb-5254006e85c2
https://testing.hpdd.intel.com/test_sets/603bbaca-5eae-11e4-a2a3-5254006e85c2

Comment by Nathaniel Clark [ 04/Nov/14 ]

Fix CDEBUG error in MDS logs:

format at ldlm_pool.c:612:ldlm_pool_recalc doesn't end in newline

http://review.whamcloud.com/12547
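
For context, Lustre's CDEBUG() macros expect their format strings to end with a newline; the console warning quoted above is printed when one doesn't. A minimal sketch of that kind of fix (the format string and arguments here are illustrative, not the actual patch hunk):

/* Before: the format string lacks a trailing newline, so the debug
 * subsystem warns "doesn't end in newline" at this call site. */
CDEBUG(D_DLMTRACE, "%s: recalc interval %lu", pl->pl_name, interval);

/* After: terminate the format string with '\n'. */
CDEBUG(D_DLMTRACE, "%s: recalc interval %lu\n", pl->pl_name, interval);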

Comment by Gerrit Updater [ 16/Dec/14 ]

Andreas Dilger (andreas.dilger@intel.com) merged in patch http://review.whamcloud.com/12547/
Subject: LU-4536 ldlm: Recalculate interval in ldlm_pool_recalc()
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 6a8181a7262c59337de89d0f7245e46d6e2f8c26

Comment by Andreas Dilger [ 16/Dec/14 ]

The submitted patch only improves the debugging messages and will not resolve the problem for this bug, so don't close it.

Comment by Andreas Dilger [ 17/Dec/14 ]

I'm trying to get this bug moving forward, since it seems to be stalled at the moment but is probably the #2 or #3 reason for review-zfs test failures. Looking at the client stack traces, they all appear to show lfs stuck, either waiting on the MDS to reply to an RPC:
https://testing.hpdd.intel.com/test_sets/603bbaca-5eae-11e4-a2a3-5254006e85c2
https://testing.hpdd.intel.com/test_sets/a62db834-4ffa-11e4-9892-5254006e85c2
https://testing.hpdd.intel.com/test_sets/7c03803e-8529-11e4-8e46-5254006e85c2

21:41:03:lfs           S 0000000000000000     0 23186  23035 0x00000080
21:41:03:Call Trace:
21:41:03: [<ffffffff81529c72>] schedule_timeout+0x192/0x2e0
21:41:03: [<ffffffffa075f71a>] ptlrpc_set_wait+0x2da/0x860 [ptlrpc]
21:41:03: [<ffffffffa075fd27>] ptlrpc_queue_wait+0x87/0x220 [ptlrpc]
21:41:03: [<ffffffffa0740f8e>] ldlm_cli_enqueue+0x36e/0x860 [ptlrpc]
21:41:03: [<ffffffffa09b4f90>] mdc_enqueue+0x2d0/0x1a70 [mdc]
21:41:03: [<ffffffffa09b6912>] mdc_intent_lock+0x1e2/0x5f9 [mdc]
21:41:03: [<ffffffffa096ea4b>] lmv_intent_open+0x31b/0x9f0 [lmv]
21:41:03: [<ffffffffa096f3ff>] lmv_intent_lock+0x2df/0x11c0 [lmv]
21:41:03: [<ffffffffa0b63d27>] ll_lookup_it+0x257/0xad0 [lustre]
21:41:03: [<ffffffffa0b64629>] ll_lookup_nd+0x89/0x5e0 [lustre]
21:41:03: [<ffffffff81198a35>] do_lookup+0x1a5/0x230
21:41:03: [<ffffffff811996a4>] __link_path_walk+0x7a4/0x1000
21:41:03: [<ffffffff8119a1ba>] path_walk+0x6a/0xe0
21:41:03: [<ffffffff8119a3cb>] filename_lookup+0x6b/0xc0
21:41:04: [<ffffffff8119b8a4>] do_filp_open+0x104/0xd20
21:41:04: [<ffffffff81185be9>] do_sys_open+0x69/0x140
21:41:04: [<ffffffff81185d00>] sys_open+0x20/0x30

though others appear to be waiting on the client-side statahead thread to start, so there may be two related issues here:
https://testing.hpdd.intel.com/test_sets/c75ba76e-5ee6-11e4-badb-5254006e85c2
https://testing.hpdd.intel.com/test_sets/6b3726e2-8516-11e4-985f-5254006e85c2

03:58:24:lfs           R  running task        0 22884  22735 0x00000080
03:58:24:Call Trace:
03:58:24: [<ffffffff81529840>] _cond_resched+0x30/0x40
03:58:24: [<ffffffff81529884>] wait_for_common+0x34/0x180
03:58:24: [<ffffffff81529a8d>] wait_for_completion+0x1d/0x20
03:58:24: [<ffffffff8109ad69>] kthread_create+0x99/0x120
03:58:24: [<ffffffffa0b87960>] do_statahead_enter+0xb50/0x1220 [lustre]
03:58:24: [<ffffffffa0b6f328>] ll_lookup_it+0x858/0xad0 [lustre]
03:58:24: [<ffffffffa0b6f629>] ll_lookup_nd+0x89/0x5e0 [lustre]
03:58:24: [<ffffffff81198a35>] do_lookup+0x1a5/0x230
03:58:24: [<ffffffff811996a4>] __link_path_walk+0x7a4/0x1000
03:58:24: [<ffffffff8119a1ba>] path_walk+0x6a/0xe0
03:58:24: [<ffffffff8119a3cb>] filename_lookup+0x6b/0xc0
03:58:24: [<ffffffff8119b4f7>] user_path_at+0x57/0xa0
03:58:24: [<ffffffff8118e990>] vfs_fstatat+0x50/0xa0
03:58:24: [<ffffffff8118ea4e>] vfs_lstat+0x1e/0x20
03:58:24: [<ffffffff8118ea74>] sys_newlstat+0x24/0x50

Unfortunately, I don't see anything in progress on the MDT at all, though the chance of catching an RPC in progress would be small. I also couldn't find the statahead thread in the client stack traces. I haven't looked at the client debug logs yet, but I expect that they will help narrow down exactly what is happening on the client (is it stuck on the client during first access/statahead, or is there some long-running problem that only makes itself seen when a large number of files have already been accessed?).

Comment by Gerrit Updater [ 18/Dec/14 ]

Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/13114
Subject: LU-4536 tests: disable sanity test_65ic for ZFS
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 9a49ed6c2e8c32b7ce22302345ea6a2edceb71f7

Comment by Gerrit Updater [ 18/Dec/14 ]

Andreas Dilger (andreas.dilger@intel.com) merged in patch http://review.whamcloud.com/13114/
Subject: LU-4536 tests: disable sanity test_65ic for ZFS
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 82169531176569f1202878b9d68c3d7cc499aa15

Comment by Andreas Dilger [ 20/Jan/15 ]

Lai or Fan Yong, this problem seems to be related to statahead. Is there anything you can comment on to get this bug moving?

Comment by Lai Siyao [ 21/Jan/15 ]

The debug log shows the statahead thread was successfully created, and statahead went well, so it isn't stuck in the statahead code.

I also checked several other logs and found no connection with statahead.

Comment by Nathaniel Clark [ 21/Jan/15 ]

The last three instances of this bug all have the same signature for lfs:
https://testing.hpdd.intel.com/test_sets/29809bf4-86ae-11e4-87d3-5254006e85c2
https://testing.hpdd.intel.com/test_sets/45034a76-8752-11e4-a70f-5254006e85c2
https://testing.hpdd.intel.com/test_sets/8c14502e-8769-11e4-b712-5254006e85c2

lfs           S 0000000000000001     0  2499   2350 0x00000080
ffff88007304d978 0000000000000086 ffff88007bc392f0 ffff88007bc392c0
ffff88007c9e9800 ffff88007bc392f0 ffff88007304d948 ffffffffa03ee1e1
ffff88007a8fdab8 ffff88007304dfd8 000000000000fbc8 ffff88007a8fdab8
Call Trace:
[<ffffffffa03ee1e1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
[<ffffffff81529c72>] schedule_timeout+0x192/0x2e0
[<ffffffff81083f30>] ? process_timeout+0x0/0x10
[<ffffffffa07524d2>] ptlrpc_set_wait+0x2b2/0x890 [ptlrpc]
[<ffffffff81061d00>] ? default_wake_function+0x0/0x20
[<ffffffffa075c576>] ? lustre_msg_set_jobid+0xb6/0x140 [ptlrpc]
[<ffffffffa0752b31>] ptlrpc_queue_wait+0x81/0x220 [ptlrpc]
[<ffffffffa0992eac>] mdc_getattr_common+0xfc/0x420 [mdc]
[<ffffffffa0996327>] mdc_getattr_name+0x147/0x2f0 [mdc]
[<ffffffffa095c279>] lmv_getattr_name+0x209/0x970 [lmv]
[<ffffffffa0b00090>] ll_lov_getstripe_ea_info+0x150/0x660 [lustre]
[<ffffffffa0afa4f9>] ll_dir_ioctl+0x3c09/0x64d0 [lustre]
[<ffffffffa03edba3>] ? libcfs_debug_vmsg2+0x5e3/0xbe0 [libcfs]
[<ffffffffa03ee1e1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
[<ffffffffa03ee1e1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
[<ffffffff8119e4e2>] vfs_ioctl+0x22/0xa0
[<ffffffff8119e684>] do_vfs_ioctl+0x84/0x580
[<ffffffff81188ec2>] ? vfs_write+0x132/0x1a0
[<ffffffff8119ec01>] sys_ioctl+0x81/0xa0
[<ffffffff810e1bfe>] ? __audit_syscall_exit+0x25e/0x290
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

Comment by Oleg Drokin [ 21/Jan/15 ]

I looked at the last two reports referenced. It's interesting that both the MDS and clients are completely idle, and the lfs command is nowhere to be found, so I assume it has already terminated?

Why the test is stuck, then, is a complete mystery too.

Comment by Andreas Dilger [ 21/Jan/15 ]

One possibility here is that the -1 stripe count is causing the MDS to try to access a layout with (__u16)-1 = 65535 stripes, and this is making it slow. I can't see any other reason why this test might time out only on ZFS.
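
To illustrate the wraparound being suggested: a stripe count of -1 (meaning "stripe over all OSTs") that ends up in an unsigned 16-bit field becomes 65535, so any code iterating up to that count would walk far more stripes than actually exist. A standalone sketch of the arithmetic (hypothetical example, not Lustre source):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	int stripe_count = -1;	/* "use all available OSTs" */
	uint16_t wire_count = (uint16_t)stripe_count;

	/* Prints 65535: a loop bounded by this runs 65535 times. */
	printf("(__u16)-1 = %u\n", wire_count);
	return 0;
}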

Comment by Oleg Drokin [ 21/Jan/15 ]

Actually, it's only in the console logs that there's no lfs; in the syslog the lfs is there, which is pretty strange.
MDS still appears to be totally idle.

Comment by Oleg Drokin [ 21/Jan/15 ]

So what if we run with increased debugging, so that we can trace by RPC XID and see what happened with that request? Could it be the MDS replied and the client missed it? Implausible, but it's a start toward seeing what's going on.

Comment by Gerrit Updater [ 23/Jan/15 ]

Nathaniel Clark (nathaniel.l.clark@intel.com) uploaded a new patch: http://review.whamcloud.com/13512
Subject: LU-4536 tests: Add debugging to sanity/65ic
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d2c638d2e7edaca81699249f35b6b2567a47dd7d

Comment by Ben Evans (Inactive) [ 19/Aug/15 ]

Just ran into this (or something quite similar) on a 2.5.2 build

I think the ldlm_pool_recalc patch may help, but I also noticed that internally ldlm_pool_recalc works with time_t yet returns an int. I'm wondering if there are 32/64-bit issues here. In my case, I know time_t is 64-bit and int is 32-bit.
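
Ben's concern can be demonstrated in a few lines of standalone C (a hypothetical sketch of the truncation pattern, not the actual ldlm_pool_recalc() code):

#include <stdio.h>
#include <time.h>

/* A 64-bit time_t difference returned through a 32-bit int silently
 * drops the upper bits. */
static int recalc_interval(time_t next_recalc, time_t now)
{
	return next_recalc - now;	/* 64-bit value truncated to int */
}

int main(void)
{
	time_t now = time(NULL);
	/* A bogus "next recalc" time 2^33 seconds in the future... */
	time_t next = now + (1LL << 33);

	/* ...prints 0, not 8589934592: the interval truncates away. */
	printf("interval = %d\n", recalc_interval(next, now));
	return 0;
}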

Comment by James A Simmons [ 19/Aug/15 ]

Hm. Looks like I need to push a patch to clean up the time wrappers.

Comment by Peter Jones [ 05/Aug/16 ]

Has not been seen in many months

Comment by Andreas Dilger [ 29/May/17 ]

This hasn't been seen because the test is currently always being skipped.

Comment by James A Simmons [ 18/Jun/17 ]

Since Ben thinks this might be a 64-bit timer issue, I linked it to LU-9019.

Comment by Gerrit Updater [ 23/Jun/17 ]

Nathaniel Clark (nathaniel.l.clark@intel.com) uploaded a new patch: https://review.whamcloud.com/27803
Subject: LU-4536 tests: Re-enable sanity/65ic
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d7b2f7547eb7fcd4e276baf223b9ca61babe1d94

Comment by Gerrit Updater [ 01/Aug/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27803/
Subject: LU-4536 tests: Re-enable sanity/65ic
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 10f7f715e23c75b76c9af38c815311e92dc0ef81

Comment by James A Simmons [ 02/Aug/17 ]

The patch has landed, and the 64-bit time work fixed the problems.
