[LU-4536] sanity test_65ic Created: 24/Jan/14 Updated: 22/Dec/17 Resolved: 02/Aug/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.7.0 |
| Fix Version/s: | Lustre 2.11.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Maloo | Assignee: | Nathaniel Clark |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | HB, zfs |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 12403 |
| Description |
|
This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>. This issue relates to the following test suite run: The sub-test test_65ic failed with the following error:
Info required for matching: sanity 65ic |
| Comments |
| Comment by Andreas Dilger [ 27/Jan/14 ] |
|
This may be related to the patch from |
| Comment by Jian Yu [ 09/Oct/14 ] |
|
One more instance on master branch: |
| Comment by Jian Yu [ 09/Oct/14 ] |
|
This is blocking patch review testing on master branch: |
| Comment by nasf (Inactive) [ 28/Oct/14 ] |
|
More failure instances: |
| Comment by Nathaniel Clark [ 04/Nov/14 ] |
|
Fix CDEBUG error in MDS logs: format at ldlm_pool.c:612:ldlm_pool_recalc doesn't end in newline |
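A minimal sketch of the kind of fix involved, assuming a CDEBUG call inside ldlm_pool_recalc(); pl->pl_name and recalc_interval_sec are assumptions based on typical ldlm_pool.c code, not the actual line:

  /* Lustre's debug subsystem warns when a CDEBUG format string lacks a
   * trailing newline, producing exactly the "doesn't end in newline"
   * error quoted above. The message text here is illustrative, not the
   * actual ldlm_pool.c:612 line. */
  CDEBUG(D_DLMTRACE, "%s: pool recalc interval %d sec\n", /* note the \n */
         pl->pl_name, (int)recalc_interval_sec);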
| Comment by Gerrit Updater [ 16/Dec/14 ] |
|
Andreas Dilger (andreas.dilger@intel.com) merged in patch http://review.whamcloud.com/12547/ |
| Comment by Andreas Dilger [ 16/Dec/14 ] |
|
The submitted patch only improves the debugging messages and will not resolve the problem for this bug, so don't close it. |
| Comment by Andreas Dilger [ 17/Dec/14 ] |
|
I'm trying to get this bug moving forward, since it seems to be stalled at the moment but is probably the #2 or #3 reason for review-zfs test failures. Looking at the client stack traces, they all appear to have lfs stuck either waiting on the MDS to reply to an RPC:

21:41:03:lfs S 0000000000000000 0 23186 23035 0x00000080
21:41:03:Call Trace:
21:41:03: [<ffffffff81529c72>] schedule_timeout+0x192/0x2e0
21:41:03: [<ffffffffa075f71a>] ptlrpc_set_wait+0x2da/0x860 [ptlrpc]
21:41:03: [<ffffffffa075fd27>] ptlrpc_queue_wait+0x87/0x220 [ptlrpc]
21:41:03: [<ffffffffa0740f8e>] ldlm_cli_enqueue+0x36e/0x860 [ptlrpc]
21:41:03: [<ffffffffa09b4f90>] mdc_enqueue+0x2d0/0x1a70 [mdc]
21:41:03: [<ffffffffa09b6912>] mdc_intent_lock+0x1e2/0x5f9 [mdc]
21:41:03: [<ffffffffa096ea4b>] lmv_intent_open+0x31b/0x9f0 [lmv]
21:41:03: [<ffffffffa096f3ff>] lmv_intent_lock+0x2df/0x11c0 [lmv]
21:41:03: [<ffffffffa0b63d27>] ll_lookup_it+0x257/0xad0 [lustre]
21:41:03: [<ffffffffa0b64629>] ll_lookup_nd+0x89/0x5e0 [lustre]
21:41:03: [<ffffffff81198a35>] do_lookup+0x1a5/0x230
21:41:03: [<ffffffff811996a4>] __link_path_walk+0x7a4/0x1000
21:41:03: [<ffffffff8119a1ba>] path_walk+0x6a/0xe0
21:41:03: [<ffffffff8119a3cb>] filename_lookup+0x6b/0xc0
21:41:04: [<ffffffff8119b8a4>] do_filp_open+0x104/0xd20
21:41:04: [<ffffffff81185be9>] do_sys_open+0x69/0x140
21:41:04: [<ffffffff81185d00>] sys_open+0x20/0x30

though others appear to be waiting on the client-side statahead thread to start, so there may be two related issues here:

03:58:24:lfs R running task 0 22884 22735 0x00000080
03:58:24:Call Trace:
03:58:24: [<ffffffff81529840>] _cond_resched+0x30/0x40
03:58:24: [<ffffffff81529884>] wait_for_common+0x34/0x180
03:58:24: [<ffffffff81529a8d>] wait_for_completion+0x1d/0x20
03:58:24: [<ffffffff8109ad69>] kthread_create+0x99/0x120
03:58:24: [<ffffffffa0b87960>] do_statahead_enter+0xb50/0x1220 [lustre]
03:58:24: [<ffffffffa0b6f328>] ll_lookup_it+0x858/0xad0 [lustre]
03:58:24: [<ffffffffa0b6f629>] ll_lookup_nd+0x89/0x5e0 [lustre]
03:58:24: [<ffffffff81198a35>] do_lookup+0x1a5/0x230
03:58:24: [<ffffffff811996a4>] __link_path_walk+0x7a4/0x1000
03:58:24: [<ffffffff8119a1ba>] path_walk+0x6a/0xe0
03:58:24: [<ffffffff8119a3cb>] filename_lookup+0x6b/0xc0
03:58:24: [<ffffffff8119b4f7>] user_path_at+0x57/0xa0
03:58:24: [<ffffffff8118e990>] vfs_fstatat+0x50/0xa0
03:58:24: [<ffffffff8118ea4e>] vfs_lstat+0x1e/0x20
03:58:24: [<ffffffff8118ea74>] sys_newlstat+0x24/0x50

Unfortunately, I don't see anything in progress on the MDT at all, though the chance of catching an RPC in progress would be small. I also couldn't find the statahead thread in the client stack traces. I haven't looked at the client debug logs yet, but I expect they will help narrow down exactly what is happening on the client (is it stuck on the client during first access/statahead, or is there some long-running problem that only makes itself seen when a large number of files have already been accessed?). |
| Comment by Gerrit Updater [ 18/Dec/14 ] |
|
Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/13114 |
| Comment by Gerrit Updater [ 18/Dec/14 ] |
|
Andreas Dilger (andreas.dilger@intel.com) merged in patch http://review.whamcloud.com/13114/ |
| Comment by Andreas Dilger [ 20/Jan/15 ] |
|
Lai or Fan Yong, this problem seems to be related to statahead. Is there anything you can comment on to get this bug moving? |
| Comment by Lai Siyao [ 21/Jan/15 ] |
|
The debug log shows the statahead thread was successfully created and statahead went well, so it didn't get stuck in the statahead code. I also checked several other logs and found no connection with statahead. |
| Comment by Nathaniel Clark [ 21/Jan/15 ] |
|
The last three instances of this bug all have the same signature for lfs:

lfs S 0000000000000001 0 2499 2350 0x00000080
 ffff88007304d978 0000000000000086 ffff88007bc392f0 ffff88007bc392c0
 ffff88007c9e9800 ffff88007bc392f0 ffff88007304d948 ffffffffa03ee1e1
 ffff88007a8fdab8 ffff88007304dfd8 000000000000fbc8 ffff88007a8fdab8
Call Trace:
 [<ffffffffa03ee1e1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffff81529c72>] schedule_timeout+0x192/0x2e0
 [<ffffffff81083f30>] ? process_timeout+0x0/0x10
 [<ffffffffa07524d2>] ptlrpc_set_wait+0x2b2/0x890 [ptlrpc]
 [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
 [<ffffffffa075c576>] ? lustre_msg_set_jobid+0xb6/0x140 [ptlrpc]
 [<ffffffffa0752b31>] ptlrpc_queue_wait+0x81/0x220 [ptlrpc]
 [<ffffffffa0992eac>] mdc_getattr_common+0xfc/0x420 [mdc]
 [<ffffffffa0996327>] mdc_getattr_name+0x147/0x2f0 [mdc]
 [<ffffffffa095c279>] lmv_getattr_name+0x209/0x970 [lmv]
 [<ffffffffa0b00090>] ll_lov_getstripe_ea_info+0x150/0x660 [lustre]
 [<ffffffffa0afa4f9>] ll_dir_ioctl+0x3c09/0x64d0 [lustre]
 [<ffffffffa03edba3>] ? libcfs_debug_vmsg2+0x5e3/0xbe0 [libcfs]
 [<ffffffffa03ee1e1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffffa03ee1e1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffff8119e4e2>] vfs_ioctl+0x22/0xa0
 [<ffffffff8119e684>] do_vfs_ioctl+0x84/0x580
 [<ffffffff81188ec2>] ? vfs_write+0x132/0x1a0
 [<ffffffff8119ec01>] sys_ioctl+0x81/0xa0
 [<ffffffff810e1bfe>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b |
| Comment by Oleg Drokin [ 21/Jan/15 ] |
|
I looked at the last two referenced reports. It's interesting that both the MDS and the clients are completely idle; the lfs command is nowhere to be found, so I assume it has already terminated? Why the test is stuck then is a complete mystery too. |
| Comment by Andreas Dilger [ 21/Jan/15 ] |
|
One possibility here is that the -1 stripe count is causing the MDS to try to access a layout with (__u16)-1 stripes, and this is causing it to be slow (see the sketch below)? I can't see any other reason why this test might time out only on ZFS. |
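For illustration, a minimal standalone sketch of the wraparound being described; the variable name is hypothetical:

  #include <stdio.h>
  #include <stdint.h>

  /* A stripe count of -1 ("use all available OSTs") stored in an
   * unsigned 16-bit field, like the kernel's __u16, becomes 65535.
   * Any code path that loops up to the stored value rather than the
   * real OST count would do far more work than intended. */
  int main(void)
  {
          uint16_t stripe_count = (uint16_t)-1;        /* mirrors (__u16)-1 */
          printf("stripe_count = %u\n", stripe_count); /* prints 65535 */
          return 0;
  }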
| Comment by Oleg Drokin [ 21/Jan/15 ] |
|
Actually, it's only in the console logs that there's no lfs; in the syslog the lfs process is there, which is pretty strange. |
| Comment by Oleg Drokin [ 21/Jan/15 ] |
|
So what if you run with increased debugging, so that you can trace by RPC XID to see what happened with that request? Could it be the MDS replied and the client missed it? Implausible, but a start to see what's going on. |
| Comment by Gerrit Updater [ 23/Jan/15 ] |
|
Nathaniel Clark (nathaniel.l.clark@intel.com) uploaded a new patch: http://review.whamcloud.com/13512 |
| Comment by Ben Evans (Inactive) [ 19/Aug/15 ] |
|
Just ran into this (or something quite similar) on a 2.5.2 build. I think the ldlm_pool_recalc patch may help, but I also noticed that internally ldlm_pool_recalc works with time_t yet returns an int. I'm wondering if there are 32/64-bit issues here. In my case, I know time_t is 64-bit and int is 32-bit (see the sketch below). |
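A minimal standalone sketch of the suspected truncation; recalc_interval() is a hypothetical stand-in for the ldlm_pool_recalc() internals, not the actual Lustre code:

  #include <stdio.h>
  #include <time.h>

  /* Compute with time_t (64-bit here) but return int (32-bit), so the
   * result is silently truncated on return. */
  static int recalc_interval(time_t last_recalc)
  {
          time_t interval = time(NULL) - last_recalc; /* 64-bit arithmetic */
          return interval;                            /* truncated to 32 bits */
  }

  int main(void)
  {
          /* An interval just past 2^31 seconds comes back negative. */
          time_t last = time(NULL) - (((time_t)1 << 31) + 100);
          printf("interval = %d\n", recalc_interval(last));
          return 0;
  }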
| Comment by James A Simmons [ 19/Aug/15 ] |
|
Hm. Looks like I need to push a patch to clean up the time wrappers. |
| Comment by Peter Jones [ 05/Aug/16 ] |
|
Has not been seen in many months |
| Comment by Andreas Dilger [ 29/May/17 ] |
|
This hasn't been seen because the test is currently always being skipped. |
| Comment by James A Simmons [ 18/Jun/17 ] |
|
Since Ben thinks this might be a 64-bit timer issue, I linked it to |
| Comment by Gerrit Updater [ 23/Jun/17 ] |
|
Nathaniel Clark (nathaniel.l.clark@intel.com) uploaded a new patch: https://review.whamcloud.com/27803 |
| Comment by Gerrit Updater [ 01/Aug/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27803/ |
| Comment by James A Simmons [ 02/Aug/17 ] |
|
The patch has landed, and the 64-bit time work fixed the problems. |