[LU-4601] Failure on test suite parallel-scale-nfsv3 test_compilebench Created: 07/Feb/14  Updated: 27/May/14  Resolved: 27/May/14

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Zhenyu Xu
Resolution: Fixed Votes: 0
Labels: patch
Environment:

client and server: lustre-master build # 1876 RHEL6 ldiskfs


Issue Links:
Related
is related to LU-4603 NFS reexport leads to problems of "ls" Resolved
Severity: 3
Rank (Obsolete): 12586

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/cada8aee-8e18-11e3-9383-52540035b04c.

The sub-test test_compilebench failed with the following error:

compilebench failed: 1

== parallel-scale-nfsv3 test compilebench: compilebench == 14:45:47 (1391553947)
OPTIONS:
cbench_DIR=/usr/bin
cbench_IDIRS=2
cbench_RUNS=2
client-16vm1
client-16vm2.lab.whamcloud.com
./compilebench -D /mnt/lustre/d0.compilebench -i 2         -r 2 --makej
using working directory /mnt/lustre/d0.compilebench, 2 intial dirs 2 runs
native unpatched native-0 222MB in 189.77 seconds (1.17 MB/s)
Traceback (most recent call last):
  File "./compilebench", line 567, in <module>
    dset = dataset(options.sources, rnd)
  File "./compilebench", line 319, in __init__
    self.unpatched = native_order(self.unpatched, "unpatched")
  File "./compilebench", line 104, in native_order
    os.rmdir(fullpath)
OSError: [Errno 39] Directory not empty: '/mnt/lustre/d0.compilebench/native-0'
 parallel-scale-nfsv3 test_compilebench: @@@@@@ FAIL: compilebench failed: 1 


 Comments   
Comment by Sarah Liu [ 07/Feb/14 ]

in mds dmesg found

------------[ cut here ]------------
WARNING: at fs/proc/generic.c:590 proc_register+0x129/0x220() (Not tainted)
Hardware name: KVM
proc_dir_entry 'lustre/osc' already registered
Modules linked in: osc(+)(U) lmv(U) osp(U) mdd(U) lod(U) lfsck(U) mdt(U) mgs(U) mgc(U) nodemap(U) osd_ldiskfs(U) lquota(U) lustre(U) lov(U) mdc(U) fid(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) libcfs(U) ldiskfs(U) sha512_generic sha256_generic jbd2 nfs fscache nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs autofs4 ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk pata_acpi ata_generic ata_piix virtio_pci virtio_ring virtio dm_mirror dm_region_hash dm_log dm_mod [last unloaded: lnet_selftest]
Pid: 16973, comm: modprobe Not tainted 2.6.32-358.23.2.el6_lustre.x86_64 #1
Call Trace:
 [<ffffffff8106e3e7>] ? warn_slowpath_common+0x87/0xc0
 [<ffffffff8106e4d6>] ? warn_slowpath_fmt+0x46/0x50
 [<ffffffff811f0299>] ? proc_register+0x129/0x220
 [<ffffffff811f05c2>] ? proc_mkdir_mode+0x42/0x60
 [<ffffffff811f05f6>] ? proc_mkdir+0x16/0x20
 [<ffffffffa09cec30>] ? lprocfs_seq_register+0x20/0x80 [obdclass]
 [<ffffffffa09c7407>] ? class_register_type+0xa07/0xe10 [obdclass]
 [<ffffffffa022c000>] ? init_module+0x0/0x1de [osc]
 [<ffffffffa022c07c>] ? init_module+0x7c/0x1de [osc]
 [<ffffffff8100204c>] ? do_one_initcall+0x3c/0x1d0
 [<ffffffff810b75c1>] ? sys_init_module+0xe1/0x250
 [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
---[ end trace e197a8d70f3ab62f ]---
Lustre: Mounted lustre-client

client dmesg shows

Lustre: DEBUG MARKER: ./compilebench -D /mnt/lustre/d0.compilebench -i 2 -r 2 --makej
NFS: directory d0.compilebench/native-0 contains a readdir loop.Please contact your server vendor.  The file: Makefile has duplicate cookie 8794493083554441875
NFS: directory d0.compilebench/native-0 contains a readdir loop.Please contact your server vendor.  The file: Makefile has duplicate cookie 8794493083554441875
Comment by James A Simmons [ 10/Feb/14 ]

Can you tell which snapshot of master this occurred on? What is the file system setup? Peng pointed out a potential race in the osc/osp module loading: if osp gets loaded first, it conflicts with the osc loading. Is this the case?

Comment by Peng Tao [ 15/Feb/14 ]

I think it is the proc race, as the dmesg says:

proc_dir_entry 'lustre/osc' already registered
Comment by James A Simmons [ 17/Feb/14 ]

Lustre manages its own internal proc_dir_entry list, which has the benefit of being searchable with lprocfs_srch. As we move to using the Linux kernel's own internal list, we lose the ability to check whether a directory is already registered, and worse yet, the kernel will allow more than one directory to be created with the same name under the same parent directory. That is why you see two osc directories if the client and server run on the same node. You also see this problem with lod/lov.
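For reference, a minimal sketch (not Lustre source) of how the duplicate 'lustre/osc' entry arises once there is no private list to consult before registering; proc_lustre_root, osp_register_proc and osc_register_proc are assumed/hypothetical names standing in for the real symbols:

#include <linux/proc_fs.h>

/* Assumed to point at the /proc/fs/lustre directory, set up elsewhere. */
extern struct proc_dir_entry *proc_lustre_root;

/* osp.ko loads first and, per the comments above, registers the shared
 * "osc" proc directory on behalf of both device types. */
static void osp_register_proc(void)
{
	proc_mkdir("osc", proc_lustre_root);	/* creates lustre/osc */
}

/* osc.ko loads later.  With lprocfs_srch() gone there is no lookup of an
 * existing child entry, so it registers the same name again and the kernel
 * emits "proc_dir_entry 'lustre/osc' already registered". */
static void osc_register_proc(void)
{
	proc_mkdir("osc", proc_lustre_root);	/* duplicate -> WARN in proc_register() */
}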

I was already seeing module loading race conditions with symlinks so I folded the fix into patch

http://review.whamcloud.com/#/c/9038

from LU-3319. To handle this problem I created a new proc entry in struct obd_type called typ_procsym. So, for example, when the osp module is loaded first, both the osp and osc proc entries are registered at that point. When the osc module is loaded later, we look up the obd_type for OSP, if it exists, and check whether typ_procsym is set; if it is, we use it as the proc root for osc. The basic idea is to make the proc root registration conditional in class_register_type.
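A rough sketch of that idea (paraphrased, not the actual 9038 patch; class_search_type(), proc_lustre_root and the typ_procroot/typ_procsym fields follow the names used above and should be treated as assumptions):

#include <linux/errno.h>
#include <linux/proc_fs.h>

extern struct proc_dir_entry *proc_lustre_root;	/* assumed /proc/fs/lustre root */

struct obd_type {
	struct proc_dir_entry *typ_procroot;	/* proc dir owned by this type */
	struct proc_dir_entry *typ_procsym;	/* proc dir pre-registered for a peer type */
	/* ... */
};

/* Assumed helper: look up an already-registered obd_type by name. */
extern struct obd_type *class_search_type(const char *name);

/* Conditional proc registration for class_register_type(): reuse the entry
 * a peer module (e.g. osp) already created instead of creating it twice. */
static int register_type_proc(struct obd_type *type, const char *name)
{
	struct obd_type *peer = class_search_type("osp");

	if (peer != NULL && peer->typ_procsym != NULL) {
		/* osp already registered "lustre/osc"; adopt it as our root. */
		type->typ_procroot = peer->typ_procsym;
		return 0;
	}

	/* Otherwise register normally (the real code goes through
	 * lprocfs_seq_register(), per the stack trace in the description). */
	type->typ_procroot = proc_mkdir(name, proc_lustre_root);
	return type->typ_procroot != NULL ? 0 : -ENOMEM;
}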

Comment by Jian Yu [ 18/Feb/14 ]

Lustre Build: http://build.whamcloud.com/job/lustre-master/1890/

The same failure occurred:
https://maloo.whamcloud.com/test_sets/4ab871b0-9687-11e3-bc3b-52540035b04c
https://maloo.whamcloud.com/test_sets/68f2c766-9687-11e3-bc3b-52540035b04c
https://maloo.whamcloud.com/test_sets/56a578e4-9680-11e3-a009-52540035b04c
https://maloo.whamcloud.com/test_sets/68db57f4-9680-11e3-a009-52540035b04c

Comment by Cliff White (Inactive) [ 19/Feb/14 ]

James, is there anything you can do to address the Maloo failures?

Comment by James A Simmons [ 19/Feb/14 ]

The problem is that one module can be loaded before another when the server and client stacks are on the same node; in this case it is osc/osp. A workaround could be for the test framework to load osc.ko first and osp.ko later. Also, please try my 9038 patch to make sure on your end that it passes the failing test. My testing is going okay, but you never know.

Comment by Cliff White (Inactive) [ 19/Feb/14 ]

bobijam, can you look at the Maloo issues?

Comment by James A Simmons [ 19/Feb/14 ]

Better yet, I will push a separate 9038 patch with the proper test suite string to see if it addresses this problem. What is the Maloo test string I should push to cover the above failures?

Comment by James A Simmons [ 16/May/14 ]

Is this bug still being seen? Now that patch 9038 has landed, it should have gone away. If this bug is no longer showing up, please close this ticket.

Comment by Sarah Liu [ 27/May/14 ]

Closing this bug since I didn't see it in the latest tests; the tests now fail due to LU-5109.
