[LU-4011] problems with upstream lustre client code Created: 25/Sep/13 Updated: 23/Jul/17 Resolved: 23/Jul/17 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Bob Glossman (Inactive) | Assignee: | Bob Glossman (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
fc19, 3.11 kernel |
||
| Severity: | 3 |
| Rank (Obsolete): | 10743 |
| Description |
|
This ticket is to track issues with the upstream lustre client code that is part of the 3.11 kernel source in fc19. Making a separate ticket as suggested by Andreas in https://jira.hpdd.intel.com/browse/LU-3974?focusedCommentId=67574&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-67574 |
| Comments |
| Comment by Bob Glossman (Inactive) [ 25/Sep/13 ] |
|
I ran into a few difficulties just trying to build the Lustre client code found under drivers/staging/lustre in the current (3.11.1-200) version of the kernel source in fc19.
1) Lustre options don't even show up in the kernel config menus presented by common commands like 'make menuconfig' or 'make nconfig'. This appears to be because all Lustre-related config settings are conditioned on CONFIG_BROKEN. However, I can't find a menu option that enables CONFIG_BROKEN. The only way I could do it was to manually edit the file init/Kconfig, adding the line 'default y' to the CONFIG_BROKEN section so it reads:
config BROKEN
bool
default y
This enables many experimental options in the config menus, including the Lustre ones.
2) After enabling various Lustre options in Staging Drivers, the Lustre code starts to compile during the kernel build, but fails. Error seen:
CC [M] drivers/staging/lustre/lustre/fid/fid_handler.o
In file included from drivers/staging/lustre/lustre/fid/../include/linux/lustre_compat25.h:44:0,
from drivers/staging/lustre/lustre/fid/../include/linux/lvfs.h:48,
from drivers/staging/lustre/lustre/fid/../include/lvfs.h:45,
from drivers/staging/lustre/lustre/fid/../include/obd_support.h:41,
from drivers/staging/lustre/lustre/fid/../include/linux/obd.h:44,
from drivers/staging/lustre/lustre/fid/../include/obd.h:40,
from drivers/staging/lustre/lustre/fid/fid_handler.c:48:
drivers/staging/lustre/lustre/fid/../include/linux/lustre_patchless_compat.h: In function ‘truncate_complete_page’:
drivers/staging/lustre/lustre/fid/../include/linux/lustre_patchless_compat.h:56:3: error: too few arguments to function ‘page->mapping->a_ops->invalidatepage’
page->mapping->a_ops->invalidatepage(page, 0);
^
make[5]: *** [drivers/staging/lustre/lustre/fid/fid_handler.o] Error 1
make[4]: *** [drivers/staging/lustre/lustre/fid] Error 2
make[3]: *** [drivers/staging/lustre/lustre] Error 2
make[2]: *** [drivers/staging/lustre] Error 2
make[1]: *** [drivers/staging] Error 2
make: *** [drivers] Error 2
The specific errors suggest that at least one of the mods from
3) Casual inspection of the client code reveals that many references to num_physpages are still there. Comments by Peng Tao in http://review.whamcloud.com/#/c/7726 suggest he has already fixed this in the upstream client code. If so, the fix isn't in this snapshot of the upstream kernel source. |
| Comment by Peng Tao [ 26/Sep/13 ] |
|
Bob, thanks for trying out the upstream kernel client. It is great to have multiple eyes on it. I think that for "upstream" it is better to look at Linus' tree rather than Fedora kernels. For your issues:
1. CONFIG_BROKEN is dropped by commit 22eb2c3d900b558ea1e300cbf2f74a9edaef6ecc
2. The invalidatepage() prototype change is fixed by commit 5237c44194a5605257b09af5b421dd6995645e65
3. num_physpages is replaced by commit 4f6cc9ab5337879c4a79564b3aed4fa429d1cd12
All these commits are in Linus' tree. They didn't appear in the v3.11 release because Greg pushed them to Linus in the 3.12 merge window, which opened after v3.11 was released. |
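For readers following item 2: the "too few arguments" error is what you would expect if the invalidatepage hook takes a length as well as an offset, while the compat header still passes only (page, 0). The following is a minimal userspace sketch of that situation, not the code from the commit above. Every name in it (mock_page, mock_aops, MOCK_PAGE_SIZE, mock_truncate_complete_page) is a hypothetical stand-in, and the three-argument hook shape is an assumption about the newer kernel interface.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins; none of these are the real kernel definitions. */
#define MOCK_PAGE_SIZE 4096u

struct mock_page;

struct mock_aops {
	/* Assumed newer-style hook: offset plus length of the range. */
	void (*invalidatepage)(struct mock_page *page,
			       unsigned int offset, unsigned int length);
};

struct mock_page {
	const struct mock_aops *a_ops;
	unsigned int inval_offset;	/* last offset passed to the hook */
	unsigned int inval_length;	/* last length passed to the hook */
	int inval_calls;		/* how many times the hook ran */
};

/* Hypothetical hook that only records what it was asked to invalidate. */
static void mock_invalidatepage(struct mock_page *page,
				unsigned int offset, unsigned int length)
{
	page->inval_offset = offset;
	page->inval_length = length;
	page->inval_calls++;
}

static const struct mock_aops mock_ops = {
	.invalidatepage = mock_invalidatepage,
};

/* Compat-style wrapper: invalidate the whole page if a hook is set. */
static void mock_truncate_complete_page(struct mock_page *page)
{
	assert(page != NULL && page->a_ops != NULL);
	if (page->a_ops->invalidatepage != NULL)
		page->a_ops->invalidatepage(page, 0, MOCK_PAGE_SIZE);
}
```

A caller with the old two-argument form, invalidatepage(page, 0), would fail to compile against this hook type with the same kind of error as in the build log; passing the full page size as the length keeps the "invalidate everything" meaning.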
| Comment by Andreas Dilger [ 30/Oct/13 ] |
|
I think it makes more sense for any effort to be made against the actual vanilla kernel, instead of what is in FC19. I've previously filed TEI-701 (formerly TT-1719) to get a Gerrit repo set up for the vanilla kernel that includes the staging branch, but I haven't had enough time to actually investigate this. |
| Comment by Dmitry Eremin (Inactive) [ 12/Dec/13 ] |
|
I ran sanity on the latest staging tree and consistently got the following crash:
[ 809.119955] LustreError: 3283:0:(cl_lock.c:315:cl_lock_get()) ASSERTION( cl_lock_invariant(((void *)0), lock) ) failed:
[ 809.120101] LustreError: 5838:0:(cl_lock.c:463:cl_lock_fits_into()) ASSERTION( cl_lock_invariant_trusted(env, lock) ) failed:
[ 809.120114] LustreError: 5838:0:(cl_lock.c:463:cl_lock_fits_into()) LBUG
[ 809.120120] CPU: 0 PID: 5838 Comm: mc Tainted: G C 3.13.0-rc2+ #1
[ 809.120126] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[ 809.120137] ffff8800cfbcfee0 ffff8800cfceda68 ffffffff817b61a2 ffffffffa014fc40
[ 809.120147] ffff8800cfceda88 ffffffffa0000a76 ffff880204e9a620 ffff8800cfbcfee0
[ 809.120157] ffff8800cfcedac0 ffffffffa0118edc ffff8800cfbcfee0 0000000000000000
[ 809.120162] Call Trace:
[ 809.120179] [<ffffffff817b61a2>] dump_stack+0x45/0x56
[ 809.120207] [<ffffffffa0000a76>] lbug_with_loc+0x46/0xb0 [libcfs]
[ 809.120224] [<ffffffffa0118edc>] cl_lock_fits_into+0xac/0xb0 [obdclass]
[ 809.120240] [<ffffffffa0119d05>] cl_lock_lookup+0x1b5/0x200 [obdclass]
[ 809.120256] [<ffffffffa011eb09>] ? cl_io_sub_init+0x49/0xa0 [obdclass]
[ 809.120271] [<ffffffffa011c26d>] cl_lock_hold_mutex.isra.46+0x8d/0x430 [obdclass]
[ 809.120286] [<ffffffffa011c61f>] cl_lock_hold+0xf/0x30 [obdclass]
[ 809.120299] [<ffffffffa0378620>] lov_sublock_alloc.isra.22+0xe0/0x390 [lov]
[ 809.120313] [<ffffffffa037ada8>] lov_lock_init_raid0+0x398/0xba0 [lov]
[ 809.120326] [<ffffffffa0374405>] lov_lock_init+0x25/0x60 [lov]
[ 809.120340] [<ffffffffa011c3c3>] cl_lock_hold_mutex.isra.46+0x1e3/0x430 [obdclass]
[ 809.120355] [<ffffffffa011d1ef>] cl_lock_request+0x3f/0x1c0 [obdclass]
[ 809.120370] [<ffffffffa0440e12>] cl_glimpse_lock+0xf2/0x310 [lustre]
[ 809.120383] [<ffffffffa04410f9>] cl_glimpse_size0+0xc9/0xf0 [lustre]
[ 809.120397] [<ffffffffa040c43a>] ll_inode_revalidate_it+0x7a/0xa0 [lustre]
[ 809.120410] [<ffffffffa040c491>] ll_getattr_it+0x31/0x140 [lustre]
[ 809.120422] [<ffffffffa040c5cf>] ll_getattr+0x2f/0x40 [lustre]
[ 809.120439] [<ffffffff81142244>] vfs_getattr_nosec+0x24/0x40
[ 809.120447] [<ffffffff811422f8>] vfs_getattr+0x28/0x30
[ 809.120455] [<ffffffff811423bd>] vfs_fstatat+0x5d/0xa0
[ 809.120463] [<ffffffff811427f2>] SYSC_newlstat+0x22/0x40
[ 809.120477] [<ffffffff810c0bde>] ? __audit_syscall_exit+0x22e/0x2d0
[ 809.120486] [<ffffffff811429d9>] SyS_newlstat+0x9/0x10
[ 809.120498] [<ffffffff817c6ae2>] system_call_fastpath+0x16/0x1b
void cl_lock_get(struct cl_lock *lock)
{
	LINVRNT(cl_lock_invariant(NULL, lock)); <-- cl_lock.c:315
	CDEBUG(D_TRACE, "acquiring reference: %d %p %lu\n",
	       atomic_read(&lock->cll_ref), lock, RETIP);
	atomic_inc(&lock->cll_ref);
}

static int cl_lock_fits_into(const struct lu_env *env,
			     const struct cl_lock *lock,
			     const struct cl_lock_descr *need,
			     const struct cl_io *io)
{
	const struct cl_lock_slice *slice;

	LINVRNT(cl_lock_invariant_trusted(env, lock)); <-- cl_lock.c:463
	list_for_each_entry(slice, &lock->cll_layers, cls_linkage) {
		if (slice->cls_ops->clo_fits_into != NULL &&
		    !slice->cls_ops->clo_fits_into(env, slice, need, io))
			return 0;
	}
	return 1;
}
|
| Comment by Dmitry Eremin (Inactive) [ 12/Dec/13 ] |
|
The other crash stack is:
(gdb) where
#0 ?? () at kernel/debug/debug_core.c:1042
#1 0xffffffff810c5994 in ?? () at kernel/debug/debug_core.c:817
#2 0xffffffff817cf4dc in ?? () at kernel/notifier.c:93
#3 0xffffffff817cf525 in ?? () at kernel/notifier.c:182
#4 0xffffffff817bece8 in ?? () at kernel/panic.c:130
#5 0xffffffffa0000ad4 in lbug_with_loc ()
#6 0xffffffffa011994f in cl_lock_get ()
#7 0xffffffffa011999a in cl_lock_hold_add ()
#8 0xffffffffa0119aa1 in cl_lock_intransit ()
#9 0xffffffffa011b9c5 in cl_unuse_try ()
#10 0xffffffffa030899e in osc_lock_upcall ()
at drivers/staging/lustre/lustre/osc/osc_lock.c:562
#11 0xffffffffa02f193b in osc_enqueue_fini ()
at drivers/staging/lustre/lustre/osc/osc_request.c:2331
#12 0xffffffffa02f1b13 in osc_enqueue_interpret ()
at drivers/staging/lustre/lustre/osc/osc_request.c:2376
#13 0xffffffffa01d5d3d in ptlrpc_check_set.part.21 ()
#14 0xffffffffa01d81b5 in ptlrpc_check_set ()
#15 0xffffffffa01f9e2b in ptlrpcd_check ()
#16 0xffffffffa01fa133 in ptlrpcd ()
#17 0xffffffff81064a5d in ?? ()
#18 <signal handler called>
#19 mo_xattr_get (name=Unhandled dwarf expression opcode 0xfa
)
at drivers/staging/lustre/lustre/obdecho/../include/md_object.h:567
#20 0x0000000000000000 in ?? ()
|
| Comment by Peng Tao [ 16/Dec/13 ] |
|
Dmitry, in which sanity case did you see the LBUG? There is one known issue (fixed by Yang Sheng in http://review.whamcloud.com/#/c/8110/4) that can cause an LBUG on client umount, but it doesn't seem to be the same one you saw. |
| Comment by Dmitry Eremin (Inactive) [ 16/Dec/13 ] |
|
I just launched acceptance-small.sh and the first test produced this crash. The patch from Yang Sheng that you mentioned doesn't help. The crash happens during "remove /mnt/lustre/d0.runtests/d1" after test_1. |
| Comment by Dmitry Eremin (Inactive) [ 16/Dec/13 ] |
Lustre: Echo OBD driver; http://www.lustre.org/
Lustre: Layout lock feature supported.
Lustre: Mounted lustre-client
Lustre: DEBUG MARKER: Using TIMEOUT=20
Lustre: DEBUG MARKER: -----============= acceptance-small: runtests ======
Lustre: DEBUG MARKER: Using TIMEOUT=20
Lustre: DEBUG MARKER: == runtests test 1: All Runtests ===================
Lustre: DEBUG MARKER: touching /mnt/lustre at Mon Dec 16 16:52:56 MSK 2013
Lustre: DEBUG MARKER: create an empty file /mnt/lustre/hosts.10496
Lustre: DEBUG MARKER: copying /etc/hosts to /mnt/lustre/hosts.10496
Lustre: DEBUG MARKER: comparing /etc/hosts and /mnt/lustre/hosts.10496
Lustre: DEBUG MARKER: renaming /mnt/lustre/hosts.10496 to /mnt/lustre/hosts.10496.ren
Lustre: DEBUG MARKER: copying /etc/hosts to /mnt/lustre/hosts.10496 again
Lustre: DEBUG MARKER: truncating /mnt/lustre/hosts.10496
Lustre: DEBUG MARKER: removing /mnt/lustre/hosts.10496
Lustre: DEBUG MARKER: copying /etc/hosts to /mnt/lustre/hosts.10496.2
Lustre: DEBUG MARKER: truncating /mnt/lustre/hosts.10496.2 to 123 bytes
Lustre: DEBUG MARKER: creating /mnt/lustre/d0.runtests/d1
Lustre: DEBUG MARKER: copying 1000 files from /etc /bin to /mnt/lustre/d0.runtests/d1/etc /bin at Mon Dec 16 16:52:58 MSK 2013
Lustre: DEBUG MARKER: comparing 1000 newly copied files at Mon Dec 16 16:53:14 MSK 2013
Lustre: DEBUG MARKER: finished at Mon Dec 16 16:53:20 MSK 2013 (24)
Lustre: Unmounted lustre-client
Lustre: Layout lock feature supported.
Lustre: Mounted lustre-client
Lustre: DEBUG MARKER: Using TIMEOUT=20
Lustre: DEBUG MARKER: comparing 1000 previously copied files
Lustre: DEBUG MARKER: runtests test_1: @@@@@@ FAIL: old and new files are different: rc=22
Lustre: Unmounted lustre-client
Lustre: Layout lock feature supported.
Lustre: Mounted lustre-client
Lustre: DEBUG MARKER: Using TIMEOUT=20
Lustre: DEBUG MARKER: removing /mnt/lustre/d0.runtests/d1
LustreError: 9679:0:(cl_lock.c:315:cl_lock_get()) ASSERTION( cl_lock_invariant(((void *)0), lock) ) failed: |
| Comment by Dmitry Eremin (Inactive) [ 14/Jan/14 ] |
|
This LBUG was fixed by the following small patch:
diff --git a/lustre/obdclass/cl_lock.c b/lustre/obdclass/cl_lock.c
index d440da9..2544053 100644
--- a/lustre/obdclass/cl_lock.c
+++ b/lustre/obdclass/cl_lock.c
@@ -2053,8 +2053,8 @@ void cl_lock_hold_add(const struct lu_env *env, struct cl_lock *lock,
 	LASSERT(lock->cll_state != CLS_FREEING);
 	ENTRY;
-	cl_lock_hold_mod(env, lock, +1);
 	cl_lock_get(lock);
+	cl_lock_hold_mod(env, lock, +1);
 	lu_ref_add(&lock->cll_holders, scope, source);
 	lu_ref_add(&lock->cll_reference, scope, source);
 	EXIT;
Also, the crash happens only if CONFIG_LUSTRE_DEBUG_EXPENSIVE_CHECK=y is set. |
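To illustrate why swapping the two calls matters, here is a toy userspace model, not the Lustre code itself. It assumes the invariant checked on entry to cl_lock_get() requires the reference count to be at least the hold count; the toy_* names and fields are hypothetical. In the real code the check only runs when the expensive debug checks are enabled, which would explain why the crash is tied to that config option; the toy always checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of a lock with a reference count and a hold count. */
struct toy_lock {
	int refs;
	int holds;
	bool invariant_broken;	/* set when a check observes a bad state */
};

/* Assumed invariant: every hold is backed by a reference. */
static bool toy_invariant(const struct toy_lock *lock)
{
	return lock->refs >= lock->holds && lock->holds >= 0;
}

/* Like cl_lock_get(): check the invariant on entry, then take a ref. */
static void toy_get(struct toy_lock *lock)
{
	assert(lock != NULL);
	if (!toy_invariant(lock))
		lock->invariant_broken = true;
	lock->refs++;
}

static void toy_hold_mod(struct toy_lock *lock, int delta)
{
	lock->holds += delta;
}

/* Order before the patch: the hold is counted before the reference. */
static void toy_hold_add_old(struct toy_lock *lock)
{
	toy_hold_mod(lock, +1);
	toy_get(lock);
}

/* Order after the patch: the reference is taken first. */
static void toy_hold_add_new(struct toy_lock *lock)
{
	toy_get(lock);
	toy_hold_mod(lock, +1);
}
```

Starting from a lock with refs == holds, the old order briefly has holds exceeding refs at the moment the check runs, while the new order never does. Both orders end in the same final state, so the difference is only visible to the check.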
| Comment by Peng Tao [ 16/Jan/14 ] |
|
Dmitry, thanks for digging up the patch. Your patch also applies to Lustre master. Do you see the same crash with master? |
| Comment by Dmitry Eremin (Inactive) [ 16/Jan/14 ] |
|
The situation on master is even worse: it doesn't even compile. I have submitted http://review.whamcloud.com/#/c/8853/. After this fix I observe another crash |
| Comment by James Nunez (Inactive) [ 01/Jun/15 ] |
|
We're seeing a similar (same?) failure with runtests test_1 again at: |
| Comment by James Nunez (Inactive) [ 29/Jul/15 ] |
|
Another runtests test 1 failure in review-dne-part-2: |
| Comment by James A Simmons [ 29/Jul/15 ] |
|
Nunez, are you testing the upstream client? |
| Comment by James Nunez (Inactive) [ 29/Jul/15 ] |
|
No, but found these failures in our regular autotest runs on master. |
| Comment by James A Simmons [ 13/Jun/17 ] |
|
With the latest 4.12-rc5 upstream client as of today, in my testing we fail the following sanity tests. Besides these test the patch sanity 27z, 27D, 29, 77c, 101g, 102a, 102b, 102n, 103a, 125, 133h, 154B, 154a, 154g, 160a, 160c, 160e, 161c, 161d, 162a, 205, 215, 226a, 242, 251, 405, 900 |
| Comment by Bob Glossman (Inactive) [ 13/Jun/17 ] |
|
James, |
| Comment by James A Simmons [ 18/Jun/17 ] |
|
Bob, can you close this ticket?