[LU-4601] Failure on test suite parallel-scale-nfsv3 test_compilebench Created: 07/Feb/14 Updated: 27/May/14 Resolved: 27/May/14 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.6.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Maloo | Assignee: | Zhenyu Xu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | patch | ||
| Environment: |
client and server: lustre-master build # 1876 RHEL6 ldiskfs |
||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 12586 | ||||||||
| Description |
|
This issue was created by maloo for sarah <sarah@whamcloud.com>. This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/cada8aee-8e18-11e3-9383-52540035b04c. The sub-test test_compilebench failed with the following error:
== parallel-scale-nfsv3 test compilebench: compilebench == 14:45:47 (1391553947)
OPTIONS:
cbench_DIR=/usr/bin
cbench_IDIRS=2
cbench_RUNS=2
client-16vm1
client-16vm2.lab.whamcloud.com
./compilebench -D /mnt/lustre/d0.compilebench -i 2 -r 2 --makej
using working directory /mnt/lustre/d0.compilebench, 2 intial dirs 2 runs
native unpatched native-0 222MB in 189.77 seconds (1.17 MB/s)
Traceback (most recent call last):
File "./compilebench", line 567, in <module>
dset = dataset(options.sources, rnd)
File "./compilebench", line 319, in __init__
self.unpatched = native_order(self.unpatched, "unpatched")
File "./compilebench", line 104, in native_order
os.rmdir(fullpath)
OSError: [Errno 39] Directory not empty: '/mnt/lustre/d0.compilebench/native-0'
parallel-scale-nfsv3 test_compilebench: @@@@@@ FAIL: compilebench failed: 1
|
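The `OSError: [Errno 39] Directory not empty` above comes from compilebench's cleanup path: `os.rmdir()` only removes empty directories, and the NFS client's readdir loop (duplicate cookies, noted in the comments) can leave entries behind that the cleanup never saw. A minimal, self-contained sketch of that failure mode — the directory and file names here are illustrative, not taken from the test environment:

```python
import errno
import os
import tempfile

# Simulate the cleanup failure: a stray "Makefile" is still present when
# rmdir is attempted, so the call raises OSError with errno ENOTEMPTY
# (39 on Linux), exactly the error compilebench's native_order() hit.
workdir = tempfile.mkdtemp(prefix="native-0.")
open(os.path.join(workdir, "Makefile"), "w").close()

try:
    os.rmdir(workdir)  # fails: directory still has an entry
except OSError as e:
    assert e.errno == errno.ENOTEMPTY
    print("rmdir failed:", e.strerror)

# Correct cleanup removes the contents first, then the directory.
os.unlink(os.path.join(workdir, "Makefile"))
os.rmdir(workdir)
```

The point is that compilebench's `os.rmdir()` is correct under normal POSIX semantics; it only fails here because the NFS readdir loop made earlier directory traversal skip entries, leaving the directory non-empty at cleanup time.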
| Comments |
| Comment by Sarah Liu [ 07/Feb/14 ] |
|
in mds dmesg found:

------------[ cut here ]------------
WARNING: at fs/proc/generic.c:590 proc_register+0x129/0x220() (Not tainted)
Hardware name: KVM
proc_dir_entry 'lustre/osc' already registered
Modules linked in: osc(+)(U) lmv(U) osp(U) mdd(U) lod(U) lfsck(U) mdt(U) mgs(U) mgc(U) nodemap(U) osd_ldiskfs(U) lquota(U) lustre(U) lov(U) mdc(U) fid(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) libcfs(U) ldiskfs(U) sha512_generic sha256_generic jbd2 nfs fscache nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs autofs4 ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk pata_acpi ata_generic ata_piix virtio_pci virtio_ring virtio dm_mirror dm_region_hash dm_log dm_mod [last unloaded: lnet_selftest]
Pid: 16973, comm: modprobe Not tainted 2.6.32-358.23.2.el6_lustre.x86_64 #1
Call Trace:
[<ffffffff8106e3e7>] ? warn_slowpath_common+0x87/0xc0
[<ffffffff8106e4d6>] ? warn_slowpath_fmt+0x46/0x50
[<ffffffff811f0299>] ? proc_register+0x129/0x220
[<ffffffff811f05c2>] ? proc_mkdir_mode+0x42/0x60
[<ffffffff811f05f6>] ? proc_mkdir+0x16/0x20
[<ffffffffa09cec30>] ? lprocfs_seq_register+0x20/0x80 [obdclass]
[<ffffffffa09c7407>] ? class_register_type+0xa07/0xe10 [obdclass]
[<ffffffffa022c000>] ? init_module+0x0/0x1de [osc]
[<ffffffffa022c07c>] ? init_module+0x7c/0x1de [osc]
[<ffffffff8100204c>] ? do_one_initcall+0x3c/0x1d0
[<ffffffff810b75c1>] ? sys_init_module+0xe1/0x250
[<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
---[ end trace e197a8d70f3ab62f ]---
Lustre: Mounted lustre-client

client dmesg shows:

Lustre: DEBUG MARKER: ./compilebench -D /mnt/lustre/d0.compilebench -i 2 -r 2 --makej
NFS: directory d0.compilebench/native-0 contains a readdir loop.Please contact your server vendor.
The file: Makefile has duplicate cookie 8794493083554441875
NFS: directory d0.compilebench/native-0 contains a readdir loop.Please contact your server vendor.
The file: Makefile has duplicate cookie 8794493083554441875 |
| Comment by James A Simmons [ 10/Feb/14 ] |
|
Can you tell what snapshot of master this occurred on? What is the file system setup? Peng pointed out the potential for a race in the osc/osp module loading: if osp gets loaded first, it conflicts with the osc loading. Is this the case? |
| Comment by Peng Tao [ 15/Feb/14 ] |
|
I think it is the proc race, as the dmesg says: proc_dir_entry 'lustre/osc' already registered
|
| Comment by James A Simmons [ 17/Feb/14 ] |
|
Lustre manages its own internal proc_dir_entry list, which has the benefit of being searchable with lprocfs_srch. As we move to using the Linux kernel's own internal list, we lose the ability to check whether a directory is already registered; worse yet, the Linux kernel will allow more than one directory to be created with the same name in the same parent directory. That is why you see two osc directories if the client and server are run on the same node. You also see this problem with lod/lov. I was already seeing module loading race conditions with symlinks, so I folded the fix into patch http://review.whamcloud.com/#/c/9038 from |
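The duplicate-registration problem described here is a check-then-create race: two modules (osc and osp) each look up 'lustre/osc', find nothing, and both proceed to create it. A minimal user-space sketch of the guarded lookup-and-register behavior that Lustre's private list provided via lprocfs_srch — class and method names here are hypothetical, not the kernel API:

```python
import threading

class ProcRegistry:
    """Toy model of a proc-directory registry that rejects duplicates.

    The existence check and the insertion happen under one lock, so two
    modules racing to create 'lustre/osc' cannot both succeed. This is a
    sketch of the idea only, not the kernel's proc implementation.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._entries = set()

    def register(self, path):
        with self._lock:
            if path in self._entries:   # search and insert are atomic
                return False            # 'already registered'
            self._entries.add(path)
            return True

registry = ProcRegistry()
assert registry.register("lustre/osc") is True
# A second registration attempt (e.g. the other module losing the race)
# is refused instead of silently creating a duplicate directory.
assert registry.register("lustre/osc") is False
```

Without the lock-protected check, the kernel-side equivalent (plain proc_mkdir) happily creates a second 'lustre/osc' entry, which is exactly the WARNING seen in the MDS dmesg.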
| Comment by Jian Yu [ 18/Feb/14 ] |
|
Lustre Build: http://build.whamcloud.com/job/lustre-master/1890/ The same failure occurred: |
| Comment by Cliff White (Inactive) [ 19/Feb/14 ] |
|
James, is there anything you can do to address the Maloo failures? |
| Comment by James A Simmons [ 19/Feb/14 ] |
|
The problem is that one module can load before another when the server/client stack is on the same node; in this case it is osc/osp. A workaround could be for the test framework to load osc.ko first and osp.ko later. Also, please try my 9038 patch to make sure on your end that it passes the failing test. My testing is going okay, but you never know. |
| Comment by Cliff White (Inactive) [ 19/Feb/14 ] |
|
bobijam, can you look at the Maloo issues? |
| Comment by James A Simmons [ 19/Feb/14 ] |
|
Better yet, I will push a separate 9038 patch to see if it addresses this problem with the proper test suite string. What is the Maloo test string I should push to cover the above failures? |
| Comment by James A Simmons [ 16/May/14 ] |
|
Is this bug still seen? Now that patch 9038 has landed, it should have gone away. If this bug is no longer showing up, please close this ticket. |
| Comment by Sarah Liu [ 27/May/14 ] |
|
Closing this bug since we didn't see it in the latest tests. The tests failed due to |