I found the following failure on the MDS side in each of the failure cases:
==============================
08:08:02:Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test iorfpp: iorfpp ========================================================== 08:07:55 (1378220875)
08:08:02:Lustre: DEBUG MARKER: lfs setstripe /mnt/lustre/d0.ior.fpp -c -1
08:08:02:LustreError: 12896:0:(lov_lock.c:674:lov_lock_enqueue()) ASSERTION( sublock->cll_conflict == ((void *)0) ) failed:
08:08:03:LustreError: 12896:0:(lov_lock.c:674:lov_lock_enqueue()) LBUG
08:08:03:Pid: 12896, comm: nfsd
08:08:04:
08:08:05:Call Trace:
08:08:05: [<ffffffffa06e7895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
08:08:06: [<ffffffffa06e7e97>] lbug_with_loc+0x47/0xb0 [libcfs]
08:08:06: [<ffffffffa0c305ee>] lov_lock_enqueue+0x76e/0x850 [lov]
08:08:06: [<ffffffffa082351c>] cl_enqueue_try+0xfc/0x300 [obdclass]
08:08:06: [<ffffffffa082490f>] cl_enqueue_locked+0x6f/0x1f0 [obdclass]
08:08:06: [<ffffffffa082557e>] cl_lock_request+0x7e/0x270 [obdclass]
08:08:06: [<ffffffffa0cfd750>] cl_glimpse_lock+0x180/0x490 [lustre]
08:08:07: [<ffffffffa0cfdfc5>] cl_glimpse_size0+0x1a5/0x1d0 [lustre]
08:08:08: [<ffffffffa0cb1478>] ll_inode_revalidate_it+0x198/0x1c0 [lustre]
08:08:09: [<ffffffffa0cb14e9>] ll_getattr_it+0x49/0x170 [lustre]
08:08:09: [<ffffffffa0cb1647>] ll_getattr+0x37/0x40 [lustre]
08:08:09: [<ffffffff8121d1a3>] ? security_inode_getattr+0x23/0x30
08:08:10: [<ffffffff81186d81>] vfs_getattr+0x51/0x80
08:08:11: [<ffffffff81182bc1>] ? __fput+0x1a1/0x210
08:08:12: [<ffffffffa02f8d8d>] encode_post_op_attr+0x5d/0xc0 [nfsd]
08:08:13: [<ffffffff81182c55>] ? fput+0x25/0x30
08:08:13: [<ffffffffa02ef123>] ? nfsd_write+0xf3/0x100 [nfsd]
08:08:13: [<ffffffffa02f9752>] encode_wcc_data+0x72/0xd0 [nfsd]
08:08:13: [<ffffffffa02f987b>] nfs3svc_encode_writeres+0x1b/0x80 [nfsd]
08:08:14: [<ffffffffa02e84e6>] nfsd_dispatch+0x1a6/0x240 [nfsd]
08:08:14: [<ffffffffa0260614>] svc_process_common+0x344/0x640 [sunrpc]
08:08:15: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
08:08:15: [<ffffffffa0260c50>] svc_process+0x110/0x160 [sunrpc]
08:08:16: [<ffffffffa02e8b62>] nfsd+0xc2/0x160 [nfsd]
08:08:16: [<ffffffffa02e8aa0>] ? nfsd+0x0/0x160 [nfsd]
08:08:16: [<ffffffff81096a36>] kthread+0x96/0xa0
08:08:17: [<ffffffff8100c0ca>] child_rip+0xa/0x20
08:08:18: [<ffffffff810969a0>] ? kthread+0x0/0xa0
08:08:19: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
==============================
I suspect it is related to the following patch, which was newly landed in Lustre build: https://build.whamcloud.com/job/lustre-b2_4/42
==============================
Commit 3781d4465c9fa72120f35084d4eb5edb55a0b66a by Oleg Drokin
LU-3027 clio: Do not shrink sublock at cancel
Shrinking sublock at ldlm lock cancel time means whoever happened
to attach to this lock just before will reenqueue the wrong lock.
Test-Parameters: envdefinitions=SLOW=yes,ONLY=write_disjoint \
clientdistro=el6 serverdistro=el6 clientarch=x86_64 \
serverarch=x86_64 testlist=parallel-scale
Change-Id: I8f2de683812621fb2f8d761cf2aceebc12868d75
Signed-off-by: Oleg Drokin <oleg.drokin@intel.com>
Signed-off-by: Jian Yu <jian.yu@intel.com>
Reviewed-on: http://review.whamcloud.com/7481
Tested-by: Hudson
Tested-by: Maloo <whamcloud.maloo@gmail.com>
Reviewed-by: Jinshan Xiong <jinshan.xiong@intel.com>
Reviewed-by: Bobi Jam <bobijam@gmail.com>
==============================
Yujian, my suggestion is to revert the above patch to verify.
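If it helps, a local revert for verification could look roughly like the following (this is only a sketch; I am assuming a checkout of the Lustre tree on the b2_4 branch, the branch name is my guess from the build URL above):

    git checkout b2_4
    git revert 3781d4465c9fa72120f35084d4eb5edb55a0b66a

The reverted build could then be run against the parallel-scale-nfsv3 iorfpp test from the log above to confirm the LBUG no longer triggers.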
Just FYI:
I see LU-3027 clio was reverted from master (http://review.whamcloud.com/#/c/7749/) by Oleg to resolve this issue. Cray has continued carrying that patch in our 2.5 without any problems.

I just tested my original reproducer for LU-3027, which was fixed by the now-reverted LU-3027 clio patch, on current master. (The reproducer is described in LU-3889.) That reproducer no longer hits any bugs, so it looks like that problem has been fixed by some other patch that has since landed.