[LU-4011] problems with upstream lustre client code Created: 25/Sep/13  Updated: 23/Jul/17  Resolved: 23/Jul/17

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Bob Glossman (Inactive) Assignee: Bob Glossman (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

fc19, 3.11 kernel


Issue Links:
Related
is related to LU-2355 orph_index_delete()) ASSERTION(obj->m... Resolved
is related to LU-4530 Mainline kernel client (3.12-3.14): l... Resolved
is related to LU-3974 Support for linux 3.11 kernel Resolved
is related to LU-4451 Kernel Oops with NFS reexport using m... Resolved
is related to LU-6204 modinfo data is stale, and would be n... Resolved
is related to LU-6285 Assert fails in staging client module... Resolved
is related to LU-7747 sanity test_56w: dataversion changed ... Resolved
is related to LU-6209 remove old LNDs - ralnd, mxlnd Resolved
Sub-Tasks:
Key | Summary | Type | Status | Assignee
LU-6201 | remove duplicate fiemap code/defines | Technical task | Resolved | Zhenyu Xu
Severity: 3
Rank (Obsolete): 10743

 Description   

This ticket is to track issues with the upstream lustre client code that is part of the 3.11 kernel source in fc19.

Making a separate ticket as suggested by Andreas in https://jira.hpdd.intel.com/browse/LU-3974?focusedCommentId=67574&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-67574



 Comments   
Comment by Bob Glossman (Inactive) [ 25/Sep/13 ]

Encountered a few difficulties in just trying to build the lustre client code found under drivers/staging/lustre in the current (3.11.1-200) version of kernel source in fc19.

1) Lustre options don't even show up in the kernel config menus presented by common commands like 'make menuconfig' or 'make nconfig'. This appears to be because all Lustre-related config settings depend on CONFIG_BROKEN, and I can't find any menu option that enables CONFIG_BROKEN. The only way I could do it was to manually edit the file init/Kconfig, adding the line 'default y' to the CONFIG_BROKEN entry so it reads:

config BROKEN
        bool
        default y

This enables many experimental options in the config menus, including lustre ones.

2) After enabling various Lustre options in Staging Drivers, the Lustre code starts to compile during the kernel build, but fails. Error seen:

  CC [M]  drivers/staging/lustre/lustre/fid/fid_handler.o
In file included from drivers/staging/lustre/lustre/fid/../include/linux/lustre_compat25.h:44:0,
                 from drivers/staging/lustre/lustre/fid/../include/linux/lvfs.h:48,
                 from drivers/staging/lustre/lustre/fid/../include/lvfs.h:45,
                 from drivers/staging/lustre/lustre/fid/../include/obd_support.h:41,
                 from drivers/staging/lustre/lustre/fid/../include/linux/obd.h:44,
                 from drivers/staging/lustre/lustre/fid/../include/obd.h:40,
                 from drivers/staging/lustre/lustre/fid/fid_handler.c:48:
drivers/staging/lustre/lustre/fid/../include/linux/lustre_patchless_compat.h: In function ‘truncate_complete_page’:
drivers/staging/lustre/lustre/fid/../include/linux/lustre_patchless_compat.h:56:3: error: too few arguments to function ‘page->mapping->a_ops->invalidatepage’
   page->mapping->a_ops->invalidatepage(page, 0);
   ^
make[5]: *** [drivers/staging/lustre/lustre/fid/fid_handler.o] Error 1
make[4]: *** [drivers/staging/lustre/lustre/fid] Error 2
make[3]: *** [drivers/staging/lustre/lustre] Error 2
make[2]: *** [drivers/staging/lustre] Error 2
make[1]: *** [drivers/staging] Error 2
make: *** [drivers] Error 2

The specific errors suggest that at least one of the mods from LU-3974 is needed and isn't present in the upstream client code.

3) Casual inspection of the client code reveals that many references to num_physpages are still there. Comments by Peng Tao in http://review.whamcloud.com/#/c/7726 suggest he has already fixed this in the upstream client code. If so, the fix isn't in this snapshot of the upstream kernel source.

Comment by Peng Tao [ 26/Sep/13 ]

Bob, thanks for trying out the upstream kernel client. It is great to have multiple eyes on it. I think "upstream" should mean Linus' tree rather than Fedora kernels. For your issues:

1. CONFIG_BROKEN is dropped by commit 22eb2c3d900b558ea1e300cbf2f74a9edaef6ecc

2. invalidatepage() prototype change is fixed by commit 5237c44194a5605257b09af5b421dd6995645e65

3. num_physpages was replaced by commit 4f6cc9ab5337879c4a79564b3aed4fa429d1cd12

All these commits are in Linus' tree. They didn't appear in the v3.11 release because Greg pushed them to Linus in the 3.12 merge window, which opened after v3.11 was released.
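
To illustrate the build failure in item 2: the compiler error is consistent with the invalidatepage() callback taking more arguments than the two that lustre_patchless_compat.h passes. The following is a minimal userspace sketch, not the actual fix commit. It assumes the newer callback takes an offset and a length, and every name in it (mock_page, mock_aops, MOCK_PAGE_SIZE, mock_truncate_complete_page) is a hypothetical stand-in for the kernel types.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel page size constant. */
#define MOCK_PAGE_SIZE 4096u

struct mock_page;

/* Hypothetical stand-in for struct address_space_operations. */
struct mock_aops {
	/* Assumed newer form: page, offset into the page, length to invalidate. */
	void (*invalidatepage)(struct mock_page *page, unsigned int offset,
			       unsigned int length);
};

/* Hypothetical stand-in for struct page (mapping indirection omitted). */
struct mock_page {
	const struct mock_aops *a_ops;
	unsigned int inval_offset;	/* recorded by the mock callback */
	unsigned int inval_length;	/* recorded by the mock callback */
};

/* Mock callback that only records what it was asked to invalidate. */
static void mock_invalidatepage(struct mock_page *page, unsigned int offset,
				unsigned int length)
{
	page->inval_offset = offset;
	page->inval_length = length;
}

static const struct mock_aops mock_ops = {
	.invalidatepage = mock_invalidatepage,
};

/*
 * Counterpart of the failing helper. Calling invalidatepage(page, 0) here
 * would produce "too few arguments"; passing the whole-page length compiles.
 */
static void mock_truncate_complete_page(struct mock_page *page)
{
	assert(page != NULL);
	if (page->a_ops != NULL && page->a_ops->invalidatepage != NULL)
		page->a_ops->invalidatepage(page, 0, MOCK_PAGE_SIZE);
}
```

Calling mock_truncate_complete_page() on a page whose a_ops points at mock_ops records an offset of 0 and a length of MOCK_PAGE_SIZE, i.e. the whole page is invalidated, which is what the old two-argument call expressed implicitly.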

Comment by Andreas Dilger [ 30/Oct/13 ]

I think it makes more sense for any effort to be made against the actual vanilla kernel, instead of what is in FC19. I've previously filed TEI-701 (formerly TT-1719) to get a Gerrit repo set up for the vanilla kernel that includes the staging branch, but I haven't had enough time to actually investigate this.

Comment by Dmitry Eremin (Inactive) [ 12/Dec/13 ]

I ran sanity on the latest staging tree and consistently got the following crash:

[  809.119955] LustreError: 3283:0:(cl_lock.c:315:cl_lock_get()) ASSERTION( cl_lock_invariant(((void *)0), lock) ) failed:
[  809.120101] LustreError: 5838:0:(cl_lock.c:463:cl_lock_fits_into()) ASSERTION( cl_lock_invariant_trusted(env, lock) ) failed:
[  809.120114] LustreError: 5838:0:(cl_lock.c:463:cl_lock_fits_into()) LBUG
[  809.120120] CPU: 0 PID: 5838 Comm: mc Tainted: G         C   3.13.0-rc2+ #1
[  809.120126] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[  809.120137]  ffff8800cfbcfee0 ffff8800cfceda68 ffffffff817b61a2 ffffffffa014fc40
[  809.120147]  ffff8800cfceda88 ffffffffa0000a76 ffff880204e9a620 ffff8800cfbcfee0
[  809.120157]  ffff8800cfcedac0 ffffffffa0118edc ffff8800cfbcfee0 0000000000000000
[  809.120162] Call Trace:
[  809.120179]  [<ffffffff817b61a2>] dump_stack+0x45/0x56
[  809.120207]  [<ffffffffa0000a76>] lbug_with_loc+0x46/0xb0 [libcfs]
[  809.120224]  [<ffffffffa0118edc>] cl_lock_fits_into+0xac/0xb0 [obdclass]
[  809.120240]  [<ffffffffa0119d05>] cl_lock_lookup+0x1b5/0x200 [obdclass]
[  809.120256]  [<ffffffffa011eb09>] ? cl_io_sub_init+0x49/0xa0 [obdclass]
[  809.120271]  [<ffffffffa011c26d>] cl_lock_hold_mutex.isra.46+0x8d/0x430 [obdclass]
[  809.120286]  [<ffffffffa011c61f>] cl_lock_hold+0xf/0x30 [obdclass]
[  809.120299]  [<ffffffffa0378620>] lov_sublock_alloc.isra.22+0xe0/0x390 [lov]
[  809.120313]  [<ffffffffa037ada8>] lov_lock_init_raid0+0x398/0xba0 [lov]
[  809.120326]  [<ffffffffa0374405>] lov_lock_init+0x25/0x60 [lov]
[  809.120340]  [<ffffffffa011c3c3>] cl_lock_hold_mutex.isra.46+0x1e3/0x430 [obdclass]
[  809.120355]  [<ffffffffa011d1ef>] cl_lock_request+0x3f/0x1c0 [obdclass]
[  809.120370]  [<ffffffffa0440e12>] cl_glimpse_lock+0xf2/0x310 [lustre]
[  809.120383]  [<ffffffffa04410f9>] cl_glimpse_size0+0xc9/0xf0 [lustre]
[  809.120397]  [<ffffffffa040c43a>] ll_inode_revalidate_it+0x7a/0xa0 [lustre]
[  809.120410]  [<ffffffffa040c491>] ll_getattr_it+0x31/0x140 [lustre]
[  809.120422]  [<ffffffffa040c5cf>] ll_getattr+0x2f/0x40 [lustre]
[  809.120439]  [<ffffffff81142244>] vfs_getattr_nosec+0x24/0x40
[  809.120447]  [<ffffffff811422f8>] vfs_getattr+0x28/0x30
[  809.120455]  [<ffffffff811423bd>] vfs_fstatat+0x5d/0xa0
[  809.120463]  [<ffffffff811427f2>] SYSC_newlstat+0x22/0x40
[  809.120477]  [<ffffffff810c0bde>] ? __audit_syscall_exit+0x22e/0x2d0
[  809.120486]  [<ffffffff811429d9>] SyS_newlstat+0x9/0x10
[  809.120498]  [<ffffffff817c6ae2>] system_call_fastpath+0x16/0x1b

void cl_lock_get(struct cl_lock *lock)
{
        LINVRNT(cl_lock_invariant(NULL, lock));             <-- cl_lock.c:315
        CDEBUG(D_TRACE, "acquiring reference: %d %p %lu\n",
               atomic_read(&lock->cll_ref), lock, RETIP);
        atomic_inc(&lock->cll_ref);
}

static int cl_lock_fits_into(const struct lu_env *env,
                             const struct cl_lock *lock,
                             const struct cl_lock_descr *need,
                             const struct cl_io *io)
{
        const struct cl_lock_slice *slice;
 
        LINVRNT(cl_lock_invariant_trusted(env, lock));                     <-- cl_lock.c:463
        list_for_each_entry(slice, &lock->cll_layers, cls_linkage) {
                if (slice->cls_ops->clo_fits_into != NULL &&
                    !slice->cls_ops->clo_fits_into(env, slice, need, io))
                        return 0;
        }
        return 1;
}

Comment by Dmitry Eremin (Inactive) [ 12/Dec/13 ]

The other crash stack is:

(gdb) where
#0  ?? () at kernel/debug/debug_core.c:1042
#1  0xffffffff810c5994 in ?? () at kernel/debug/debug_core.c:817
#2  0xffffffff817cf4dc in ?? () at kernel/notifier.c:93
#3  0xffffffff817cf525 in ?? () at kernel/notifier.c:182
#4  0xffffffff817bece8 in ?? () at kernel/panic.c:130
#5  0xffffffffa0000ad4 in lbug_with_loc ()
#6  0xffffffffa011994f in cl_lock_get ()
#7  0xffffffffa011999a in cl_lock_hold_add ()
#8  0xffffffffa0119aa1 in cl_lock_intransit ()
#9  0xffffffffa011b9c5 in cl_unuse_try ()
#10 0xffffffffa030899e in osc_lock_upcall ()
    at drivers/staging/lustre/lustre/osc/osc_lock.c:562
#11 0xffffffffa02f193b in osc_enqueue_fini ()
    at drivers/staging/lustre/lustre/osc/osc_request.c:2331
#12 0xffffffffa02f1b13 in osc_enqueue_interpret ()
    at drivers/staging/lustre/lustre/osc/osc_request.c:2376
#13 0xffffffffa01d5d3d in ptlrpc_check_set.part.21 ()
#14 0xffffffffa01d81b5 in ptlrpc_check_set ()
#15 0xffffffffa01f9e2b in ptlrpcd_check ()
#16 0xffffffffa01fa133 in ptlrpcd ()
#17 0xffffffff81064a5d in ?? ()
#18 <signal handler called>
#19 mo_xattr_get (name=Unhandled dwarf expression opcode 0xfa
)
    at drivers/staging/lustre/lustre/obdecho/../include/md_object.h:567
#20 0x0000000000000000 in ?? ()

Comment by Peng Tao [ 16/Dec/13 ]

Dmitry, in which sanity test case did you see the LBUG? There is one known issue (fixed by Yang Sheng in http://review.whamcloud.com/#/c/8110/4) that can cause an LBUG on client umount, but it doesn't seem to be the same one you saw.

Comment by Dmitry Eremin (Inactive) [ 16/Dec/13 ]

I just launched acceptance-small.sh and the first test produced this crash. The patch from Yang Sheng that you mentioned doesn't help. The crash happens during "removing /mnt/lustre/d0.runtests/d1" after test_1.

Comment by Dmitry Eremin (Inactive) [ 16/Dec/13 ]

Lustre: Echo OBD driver; http://www.lustre.org/
Lustre: Layout lock feature supported.
Lustre: Mounted lustre-client
Lustre: DEBUG MARKER: Using TIMEOUT=20
Lustre: DEBUG MARKER: -----============= acceptance-small: runtests ======
Lustre: DEBUG MARKER: Using TIMEOUT=20
Lustre: DEBUG MARKER: == runtests test 1: All Runtests ===================
Lustre: DEBUG MARKER: touching /mnt/lustre at Mon Dec 16 16:52:56 MSK 2013
Lustre: DEBUG MARKER: create an empty file /mnt/lustre/hosts.10496
Lustre: DEBUG MARKER: copying /etc/hosts to /mnt/lustre/hosts.10496
Lustre: DEBUG MARKER: comparing /etc/hosts and /mnt/lustre/hosts.10496
Lustre: DEBUG MARKER: renaming /mnt/lustre/hosts.10496 to /mnt/lustre/hosts.10496.ren
Lustre: DEBUG MARKER: copying /etc/hosts to /mnt/lustre/hosts.10496 again
Lustre: DEBUG MARKER: truncating /mnt/lustre/hosts.10496
Lustre: DEBUG MARKER: removing /mnt/lustre/hosts.10496
Lustre: DEBUG MARKER: copying /etc/hosts to /mnt/lustre/hosts.10496.2
Lustre: DEBUG MARKER: truncating /mnt/lustre/hosts.10496.2 to 123 bytes
Lustre: DEBUG MARKER: creating /mnt/lustre/d0.runtests/d1
Lustre: DEBUG MARKER: copying 1000 files from /etc /bin to /mnt/lustre/d0.runtests/d1/etc /bin at Mon Dec 16 16:52:58 MSK 2013
Lustre: DEBUG MARKER: comparing 1000 newly copied files at Mon Dec 16 16:53:14 MSK 2013
Lustre: DEBUG MARKER: finished at Mon Dec 16 16:53:20 MSK 2013 (24)
Lustre: Unmounted lustre-client
Lustre: Layout lock feature supported.
Lustre: Mounted lustre-client
Lustre: DEBUG MARKER: Using TIMEOUT=20
Lustre: DEBUG MARKER: comparing 1000 previously copied files
Lustre: DEBUG MARKER: runtests test_1: @@@@@@ FAIL: old and new files are different: rc=22
Lustre: Unmounted lustre-client
Lustre: Layout lock feature supported.
Lustre: Mounted lustre-client
Lustre: DEBUG MARKER: Using TIMEOUT=20
Lustre: DEBUG MARKER: removing /mnt/lustre/d0.runtests/d1
LustreError: 9679:0:(cl_lock.c:315:cl_lock_get()) ASSERTION( cl_lock_invariant(((void *)0), lock) ) failed:

Comment by Dmitry Eremin (Inactive) [ 14/Jan/14 ]

This LBUG was fixed by the following small patch:

diff --git a/lustre/obdclass/cl_lock.c b/lustre/obdclass/cl_lock.c
index d440da9..2544053 100644
--- a/lustre/obdclass/cl_lock.c
+++ b/lustre/obdclass/cl_lock.c
@@ -2053,8 +2053,8 @@ void cl_lock_hold_add(const struct lu_env *env, struct cl_lock *lock,
         LASSERT(lock->cll_state != CLS_FREEING);

         ENTRY;
-        cl_lock_hold_mod(env, lock, +1);
         cl_lock_get(lock);
+        cl_lock_hold_mod(env, lock, +1);
         lu_ref_add(&lock->cll_holders, scope, source);
         lu_ref_add(&lock->cll_reference, scope, source);
         EXIT;

Also, the crash happens only if CONFIG_LUSTRE_DEBUG_EXPENSIVE_CHECK=y is set.
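
One way to see why swapping the two calls could matter is sketched below as a single-threaded userspace model. It rests on an assumption the thread does not confirm: that the invariant checked on entry to cl_lock_get() includes a condition of the form "reference count >= hold count". All names (mock_lock, mock_lock_invariant, mock_hold_add_old, mock_hold_add_new) are hypothetical, and the model counts failed checks instead of calling LBUG.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, single-threaded stand-in for struct cl_lock. */
struct mock_lock {
	int ref;		/* plays the role of cll_ref */
	int holds;		/* plays the role of cll_holds */
	int invariant_failures;	/* failed checks, counted instead of LBUG */
};

/* Assumed invariant: there are never more holds than references. */
static bool mock_lock_invariant(const struct mock_lock *lock)
{
	return lock->ref >= lock->holds && lock->holds >= 0;
}

/* Like cl_lock_get(): check the invariant on entry, then take a reference. */
static void mock_lock_get(struct mock_lock *lock)
{
	if (!mock_lock_invariant(lock))
		lock->invariant_failures++;
	lock->ref++;
}

/* Like cl_lock_hold_mod(): adjust the hold count by delta. */
static void mock_lock_hold_mod(struct mock_lock *lock, int delta)
{
	assert(lock->holds + delta >= 0);
	lock->holds += delta;
}

/* Ordering before the patch: hold count first, then reference. */
static void mock_hold_add_old(struct mock_lock *lock)
{
	mock_lock_hold_mod(lock, +1);
	mock_lock_get(lock);
}

/* Ordering after the patch: reference first, then hold count. */
static void mock_hold_add_new(struct mock_lock *lock)
{
	mock_lock_get(lock);
	mock_lock_hold_mod(lock, +1);
}
```

Starting from a lock where ref equals holds, the old ordering raises holds above ref just before the invariant is evaluated, so the check fails once; the new ordering never exposes that intermediate state. In this model the check always runs; in the report above the crash appears only with CONFIG_LUSTRE_DEBUG_EXPENSIVE_CHECK=y.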

Comment by Peng Tao [ 16/Jan/14 ]

Dmitry, thanks for digging out the patch. Your patch also applies to Lustre master. Do you see the same crash with master?

Comment by Dmitry Eremin (Inactive) [ 16/Jan/14 ]

The situation on master is even worse: it doesn't even compile. I have submitted http://review.whamcloud.com/#/c/8853/. With that fix applied, I observe another crash, LU-4489.

Comment by James Nunez (Inactive) [ 01/Jun/15 ]

We're seeing a similar (same?) failure with runtests test_1 again at:
2015-05-29 16:08:46 - https://testing.hpdd.intel.com/test_sets/21fd7836-0668-11e5-bf9f-5254006e85c2

Comment by James Nunez (Inactive) [ 29/Jul/15 ]

Another runtests test 1 failure in review-dne-part-2:
2015-07-27 11:06:17 - https://testing.hpdd.intel.com/test_sets/558dd67a-3497-11e5-a9b3-5254006e85c2

Comment by James A Simmons [ 29/Jul/15 ]

Nunez, are you testing the upstream client?

Comment by James Nunez (Inactive) [ 29/Jul/15 ]

No, but found these failures in our regular autotest runs on master.

Comment by James A Simmons [ 13/Jun/17 ]

With the latest 4.12-rc5 upstream client as of today, the following sanity tests fail in my testing. Besides these test failures, the patch from LU-8680 needs to be applied to make the Lustre client stable.

sanity 27z, 27D, 29, 77c, 101g, 102a, 102b, 102n, 103a, 125, 133h, 154B, 154a, 154g, 160a, 160c, 160e, 161c, 161d, 162a, 205, 215, 226a, 242, 251, 405, 900

Comment by Bob Glossman (Inactive) [ 13/Jun/17 ]

James,
I think this ticket was meant to track only the old fc19 upstream client.
I would suggest a new, distinct ticket for upstream clients in 4.12 kernels.

Comment by James A Simmons [ 18/Jun/17 ]

Bob, can you close this ticket?
