[LU-3874] parallel-scale-nfsv3 test_iorfpp: ERROR in aiori-POSIX.c (line 235): seek failed Created: 04/Sep/13  Updated: 04/Apr/14  Resolved: 23/Sep/13

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Jian Yu Assignee: Oleg Drokin
Resolution: Duplicate Votes: 0
Labels: None
Environment:

Lustre build: http://build.whamcloud.com/job/lustre-b2_4/44/ (2.4.1 RC1)
Distro/Arch: SLES11SP2/x86_64 (client), RHEL6.4/x86_64 (server)


Issue Links:
Related
is related to LU-3027 Failure on test suite parallel-scale ... Resolved
Severity: 3
Rank (Obsolete): 10055

 Description   

parallel-scale-nfsv3 test iorfpp failed as follows:

** error **
ERROR in aiori-POSIX.c (line 235): seek failed.
ERROR: Permission denied
** exiting **
** error **
ERROR in aiori-POSIX.c (line 235): seek failed.
ERROR: Permission denied
** exiting **

Maloo report: https://maloo.whamcloud.com/test_sets/6fc200da-14b3-11e3-8ae7-52540035b04c

This is a regression on the Lustre b2_4 branch.
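
For context, the "seek failed" / "Permission denied" pair above is IOR's standard errno report: the POSIX backend seeks to the task's file offset before each transfer and aborts with strerror(errno) when the seek fails, so "Permission denied" is an EACCES returned to IOR by the NFS client. The fragment below is only a simplified sketch of that reporting pattern (not the actual aiori-POSIX.c source); posix_seek() and the ERR macro body are illustrative stand-ins.

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Mirrors the "** error **" / "ERROR: <strerror(errno)>" output above. */
#define ERR(msg)                                                        \
        do {                                                            \
                fprintf(stderr, "** error **\n");                       \
                fprintf(stderr, "ERROR in %s (line %d): %s.\n",         \
                        __FILE__, __LINE__, msg);                       \
                fprintf(stderr, "ERROR: %s\n", strerror(errno));        \
                fprintf(stderr, "** exiting **\n");                     \
                exit(EXIT_FAILURE);                                     \
        } while (0)

/* Seek to this task's offset before transferring the next block. */
static void posix_seek(int fd, off_t offset)
{
        if (lseek(fd, offset, SEEK_SET) == (off_t)-1)
                ERR("seek failed");
}

int main(void)
{
        /* An invalid fd exercises the same error path (EBADF here,
         * rather than the EACCES seen over NFS in the test). */
        posix_seek(-1, 0);
        return 0;
}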



 Comments   
Comment by Jian Yu [ 04/Sep/13 ]

Searching on Maloo, I found that a similar failure started occurring with Lustre b2_4 build #42:

Lustre build: https://build.whamcloud.com/job/lustre-b2_4/42
Distro/Arch: RHEL6.4/x86_64

parallel-scale-nfsv3 test iorssf:

** error **
ERROR in aiori-POSIX.c (line 256): transfer failed.
ERROR: Permission denied
** exiting **
** error **
ERROR in aiori-POSIX.c (line 256): transfer failed.
ERROR: Permission denied
** exiting **

Maloo reports:
https://maloo.whamcloud.com/test_sets/15807156-1355-11e3-a1df-52540035b04c
https://maloo.whamcloud.com/test_sets/06ce48c2-13f5-11e3-9e61-52540035b04c

Comment by Peter Jones [ 04/Sep/13 ]

Fan Yong

Is this related to the most recent LU-2904 fix?

Peter

Comment by nasf (Inactive) [ 04/Sep/13 ]

I found the following failure on the MDS side in each of the failed cases:

==============================
08:08:02:Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test iorfpp: iorfpp ========================================================== 08:07:55 (1378220875)
08:08:02:Lustre: DEBUG MARKER: lfs setstripe /mnt/lustre/d0.ior.fpp -c -1
08:08:02:LustreError: 12896:0:(lov_lock.c:674:lov_lock_enqueue()) ASSERTION( sublock->cll_conflict == ((void *)0) ) failed:
08:08:03:LustreError: 12896:0:(lov_lock.c:674:lov_lock_enqueue()) LBUG
08:08:03:Pid: 12896, comm: nfsd
08:08:04:
08:08:05:Call Trace:
08:08:05: [<ffffffffa06e7895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
08:08:06: [<ffffffffa06e7e97>] lbug_with_loc+0x47/0xb0 [libcfs]
08:08:06: [<ffffffffa0c305ee>] lov_lock_enqueue+0x76e/0x850 [lov]
08:08:06: [<ffffffffa082351c>] cl_enqueue_try+0xfc/0x300 [obdclass]
08:08:06: [<ffffffffa082490f>] cl_enqueue_locked+0x6f/0x1f0 [obdclass]
08:08:06: [<ffffffffa082557e>] cl_lock_request+0x7e/0x270 [obdclass]
08:08:06: [<ffffffffa0cfd750>] cl_glimpse_lock+0x180/0x490 [lustre]
08:08:07: [<ffffffffa0cfdfc5>] cl_glimpse_size0+0x1a5/0x1d0 [lustre]
08:08:08: [<ffffffffa0cb1478>] ll_inode_revalidate_it+0x198/0x1c0 [lustre]
08:08:09: [<ffffffffa0cb14e9>] ll_getattr_it+0x49/0x170 [lustre]
08:08:09: [<ffffffffa0cb1647>] ll_getattr+0x37/0x40 [lustre]
08:08:09: [<ffffffff8121d1a3>] ? security_inode_getattr+0x23/0x30
08:08:10: [<ffffffff81186d81>] vfs_getattr+0x51/0x80
08:08:11: [<ffffffff81182bc1>] ? __fput+0x1a1/0x210
08:08:12: [<ffffffffa02f8d8d>] encode_post_op_attr+0x5d/0xc0 [nfsd]
08:08:13: [<ffffffff81182c55>] ? fput+0x25/0x30
08:08:13: [<ffffffffa02ef123>] ? nfsd_write+0xf3/0x100 [nfsd]
08:08:13: [<ffffffffa02f9752>] encode_wcc_data+0x72/0xd0 [nfsd]
08:08:13: [<ffffffffa02f987b>] nfs3svc_encode_writeres+0x1b/0x80 [nfsd]
08:08:14: [<ffffffffa02e84e6>] nfsd_dispatch+0x1a6/0x240 [nfsd]
08:08:14: [<ffffffffa0260614>] svc_process_common+0x344/0x640 [sunrpc]
08:08:15: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
08:08:15: [<ffffffffa0260c50>] svc_process+0x110/0x160 [sunrpc]
08:08:16: [<ffffffffa02e8b62>] nfsd+0xc2/0x160 [nfsd]
08:08:16: [<ffffffffa02e8aa0>] ? nfsd+0x0/0x160 [nfsd]
08:08:16: [<ffffffff81096a36>] kthread+0x96/0xa0
08:08:17: [<ffffffff8100c0ca>] child_rip+0xa/0x20
08:08:18: [<ffffffff810969a0>] ? kthread+0x0/0xa0
08:08:19: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
==============================

I suspect it is related to the following patch, which newly landed in Lustre build https://build.whamcloud.com/job/lustre-b2_4/42:

==============================
Commit 3781d4465c9fa72120f35084d4eb5edb55a0b66a by Oleg Drokin
LU-3027 clio: Do not shrink sublock at cancel

Shrinking sublock at ldlm lock cancel time means whoever happened
to attach to this lock just before will reenqueue the wrong lock.

Test-Parameters: envdefinitions=SLOW=yes,ONLY=write_disjoint \
clientdistro=el6 serverdistro=el6 clientarch=x86_64 \
serverarch=x86_64 testlist=parallel-scale

Change-Id: I8f2de683812621fb2f8d761cf2aceebc12868d75
Signed-off-by: Oleg Drokin <oleg.drokin@intel.com>
Signed-off-by: Jian Yu <jian.yu@intel.com>
Reviewed-on: http://review.whamcloud.com/7481
Tested-by: Hudson
Tested-by: Maloo <whamcloud.maloo@gmail.com>
Reviewed-by: Jinshan Xiong <jinshan.xiong@intel.com>
Reviewed-by: Bobi Jam <bobijam@gmail.com>
==============================

Yujian, my suggestion is to revert the above patch to verify.
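
For readers less familiar with the clio lock code, the toy program below illustrates the invariant that the ASSERTION in the trace above enforces: a sub-lock must have no conflicting lock still recorded in cll_conflict when it is enqueued, and a left-over conflict pointer aborts the thread (in the kernel this is the LBUG). This is only a hedged userspace sketch; struct toy_sublock and toy_enqueue() are invented names for illustration, not the lov_lock.c code.

==============================
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Toy stand-in for the sub-lock state tracked by cl_lock. */
struct toy_sublock {
        struct toy_sublock *cll_conflict; /* conflicting lock noted earlier */
};

/* Mirrors ASSERTION( sublock->cll_conflict == ((void *)0) ) at enqueue. */
static void toy_enqueue(struct toy_sublock *sublock)
{
        assert(sublock->cll_conflict == NULL);
        printf("sub-lock enqueued\n");
}

int main(void)
{
        struct toy_sublock a = { .cll_conflict = NULL };
        struct toy_sublock b = { .cll_conflict = NULL };

        toy_enqueue(&a);     /* clean sub-lock: enqueue succeeds          */

        a.cll_conflict = &b; /* a conflict survives lock cancellation     */
        toy_enqueue(&a);     /* asserts, analogous to the LBUG on the MDS */
        return 0;
}
==============================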

Comment by Jian Yu [ 05/Sep/13 ]

Lustre build: https://build.whamcloud.com/job/lustre-b2_4/44 (2.4.1 RC1)
Distro/Arch: RHEL6.4/x86_64

parallel-scale-nfsv3 test iorssf hit the same failure:
https://maloo.whamcloud.com/test_sets/3098fb4c-15e0-11e3-9aab-52540035b04c

Comment by Jian Yu [ 05/Sep/13 ]

Lustre build: http://build.whamcloud.com/job/lustre-b2_4/44 (2.4.1 RC1)
Distro/Arch: RHEL6.4/x86_64
FSTYPE=zfs

parallel-scale-nfsv3 test iorfpp hit the same failure:
https://maloo.whamcloud.com/test_sets/a07679fc-155a-11e3-8938-52540035b04c

Comment by Sarah Liu [ 08/Sep/13 ]

I hit the same error when running parallel-scale-nfsv4 subtest iorssf on a SLES11 SP2 client:

https://maloo.whamcloud.com/test_sets/5daf4960-174d-11e3-9d30-52540035b04c

MDS console:

15:43:02:Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test iorssf: iorssf ========================================================== 15:42:54 (1378507374)
15:43:02:Lustre: DEBUG MARKER: lfs setstripe /mnt/lustre/d0.ior.ssf -c -1
15:43:02:LustreError: 27445:0:(lov_lock.c:674:lov_lock_enqueue()) ASSERTION( sublock->cll_conflict == ((void *)0) ) failed: 
15:43:02:LustreError: 27445:0:(lov_lock.c:674:lov_lock_enqueue()) LBUG
15:43:02:Pid: 27445, comm: nfsd
15:43:03:
15:43:03:Call Trace:
15:43:03: [<ffffffffa0479895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
15:43:03: [<ffffffffa0479e97>] lbug_with_loc+0x47/0xb0 [libcfs]
15:43:03: [<ffffffffa0a944ee>] lov_lock_enqueue+0x76e/0x850 [lov]
15:43:04: [<ffffffffa06033dc>] cl_enqueue_try+0xfc/0x300 [obdclass]
15:43:05: [<ffffffffa06047cf>] cl_enqueue_locked+0x6f/0x1f0 [obdclass]
15:43:05: [<ffffffffa060541e>] cl_lock_request+0x7e/0x270 [obdclass]
15:43:06: [<ffffffffa0b64ab0>] cl_glimpse_lock+0x180/0x490 [lustre]
15:43:06: [<ffffffffa0b65325>] cl_glimpse_size0+0x1a5/0x1d0 [lustre]
15:43:06: [<ffffffffa0b15765>] ll_inode_revalidate_it+0x1a5/0x1d0 [lustre]
15:43:08: [<ffffffffa05eddc4>] ? keys_fini+0xe4/0x130 [obdclass]
15:43:08: [<ffffffffa0b157d9>] ll_getattr_it+0x49/0x170 [lustre]
15:43:08: [<ffffffffa0b15937>] ll_getattr+0x37/0x40 [lustre]
15:43:09: [<ffffffff8121d1e3>] ? security_inode_getattr+0x23/0x30
15:43:09: [<ffffffff81186d81>] vfs_getattr+0x51/0x80
15:43:09: [<ffffffffa0303541>] nfsd4_encode_fattr+0x181/0x1390 [nfsd]
15:43:09: [<ffffffff810096f0>] ? __switch_to+0xd0/0x320
15:43:09: [<ffffffff8150e570>] ? thread_return+0x4e/0x76e
15:43:09: [<ffffffff81064e6a>] ? __cond_resched+0x2a/0x40
15:43:10: [<ffffffff8121c436>] ? security_task_setgroups+0x16/0x20
15:43:10: [<ffffffff8109fbc5>] ? set_groups+0x25/0x190
15:43:11: [<ffffffffa02f3926>] ? nfsd_setuser+0x126/0x2c0 [nfsd]
15:43:11: [<ffffffffa02eb992>] ? nfsd_setuser_and_check_port+0x62/0xb0 [nfsd]
15:43:12: [<ffffffffa03047b9>] nfsd4_encode_getattr+0x69/0x80 [nfsd]
15:43:12: [<ffffffffa02fcd35>] nfsd4_encode_operation+0x75/0x180 [nfsd]
15:43:12: [<ffffffffa02fad35>] nfsd4_proc_compound+0x195/0x490 [nfsd]
15:43:13: [<ffffffffa02e843e>] nfsd_dispatch+0xfe/0x240 [nfsd]
15:43:13: [<ffffffffa0260614>] svc_process_common+0x344/0x640 [sunrpc]
15:43:15: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
15:43:15: [<ffffffffa0260c50>] svc_process+0x110/0x160 [sunrpc]
15:43:15: [<ffffffffa02e8b62>] nfsd+0xc2/0x160 [nfsd]
15:43:17: [<ffffffffa02e8aa0>] ? nfsd+0x0/0x160 [nfsd]
15:43:17: [<ffffffff81096a36>] kthread+0x96/0xa0
15:43:17: [<ffffffff8100c0ca>] child_rip+0xa/0x20
15:43:18: [<ffffffff810969a0>] ? kthread+0x0/0xa0
15:43:18: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
15:43:18:
15:43:18:Kernel panic - not syncing: LBUG
Comment by Jian Yu [ 09/Sep/13 ]

After patch http://review.whamcloud.com/7481 was reverted from the Lustre b2_4 branch, the failure did not occur on Lustre 2.4.1 RC2.

Comment by Jodi Levi (Inactive) [ 23/Sep/13 ]

Duplicate of LU-3027

Comment by Patrick Farrell (Inactive) [ 04/Apr/14 ]

Just FYI:
I see the LU-3027 clio patch was reverted from master (http://review.whamcloud.com/#/c/7749/) by Oleg to resolve this issue. Cray has continued carrying that patch in our 2.5 branch without any problems.

I just tested my original reproducer for LU-3027, which was fixed by the now-reverted LU-3027 clio patch, on current master. (The reproducer is described in LU-3889.)
That reproducer no longer hits any bugs, so it looks like the problem has been fixed by some other patch that has since landed.
