[LU-2645] 1.8<->2.4 interop: enqueue objid 0x2 subobj 0x1 on OST idx 0: rc -5 Created: 18/Jan/13  Updated: 05/Mar/13  Resolved: 05/Mar/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0, Lustre 1.8.9
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Blocker
Reporter: Jian Yu Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: HB
Environment:

Lustre Client: b1_8
Lustre Client Build: http://build.whamcloud.com/job/lustre-b1_8/245/

Lustre Server: master
Lustre Server Build: http://build.whamcloud.com/job/lustre-master/1172/

Distro/Arch: RHEL6.3/x86_64


Severity: 3
Rank (Obsolete): 6181

Description

The runtests test failed on Lustre b1_8 clients with master servers, as follows:

copying /etc/hosts to /mnt/lustre/hosts.9085 again
cp: writing `/mnt/lustre/hosts.9085': Input/output error
 runtests : @@@@@@ FAIL: can't cp /etc/hosts to /mnt/lustre/hosts.9085 again 6 

Dmesg on the client node client-12vm1 showed that:

Lustre: DEBUG MARKER: copying /etc/hosts to /mnt/lustre/hosts.9085 again
LustreError: 11-0: an error occurred while communicating with 10.10.4.209@tcp. The obd_ping operation failed with -107
Lustre: lustre-OST0000-osc-ffff88007cea8800: Connection to service lustre-OST0000 via nid 10.10.4.209@tcp was lost; in progress operations using this service will wait for recovery to complete.
LustreError: 167-0: This client was evicted by lustre-OST0000; in progress operations using this service will fail.
Lustre: Server lustre-OST0000_UUID version (2.3.58.0) is much newer than client version (1.8.8.60)
Lustre: Skipped 7 previous similar messages
LustreError: 10880:0:(ldlm_resource.c:519:ldlm_namespace_cleanup()) Namespace lustre-OST0000-osc-ffff88007cea8800 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
LustreError: 10880:0:(ldlm_resource.c:524:ldlm_namespace_cleanup()) Resource: ffff88007b9de380 (1/0/0/0) (rc: 1)
Lustre: lustre-OST0000-osc-ffff88007cea8800: Connection restored to service lustre-OST0000 using nid 10.10.4.209@tcp.
LustreError: 10869:0:(lov_request.c:211:lov_update_enqueue_set()) enqueue objid 0x2 subobj 0x1 on OST idx 0: rc -5
Lustre: DEBUG MARKER: /usr/sbin/lctl mark  runtests : @@@@@@ FAIL: can\'t cp \/etc\/hosts to \/mnt\/lustre\/hosts.9085 again 6 

Dmesg on the OSS node client-12vm4 showed that:

Lustre: DEBUG MARKER: copying /etc/hosts to /mnt/lustre/hosts.9085 again
LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 151s: evicting client at 10.10.4.206@tcp  ns: filter-ffff880037b92000 lock: ffff88007b5f6000/0xf7c07c2f873c39f lrc: 3/0,0 mode: PR/PR res: 1/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x20 nid: 10.10.4.206@tcp remote: 0x904b6ab232a7b36 expref: 5 pid: 4268 timeout: 4296500377 lvb_type: 1
Lustre: DEBUG MARKER: /usr/sbin/lctl mark  runtests : @@@@@@ FAIL: can\'t cp \/etc\/hosts to \/mnt\/lustre\/hosts.9085 again 6 

Maloo report: https://maloo.whamcloud.com/test_sets/9ad0fc8a-6181-11e2-be04-52540035b04c
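
For reading the excerpts above: the negative numbers in the client log (`failed with -107`, `rc -5`) are negated Linux errno values, and -5 is the same EIO that cp reports as "Input/output error". The sketch below decodes them. The function name `decode_rc`, the pattern, and the two-entry table are illustrative assumptions and are not part of Lustre.

```python
import re

# Linux errno values seen in the logs above (reported negated).
# Hard-coded so the example does not depend on the host's errno table.
LINUX_ERRNO = {
    5: ("EIO", "Input/output error"),
    107: ("ENOTCONN", "Transport endpoint is not connected"),
}

# Matches "rc -5" and "failed with -107" as they appear in the dmesg excerpts.
RC_PATTERN = re.compile(r"(?:\brc|failed with)\s+(-\d+)")


def decode_rc(line):
    """Return (rc, symbolic name, description) for the first code in line, or None."""
    match = RC_PATTERN.search(line)
    if match is None:
        return None
    rc = int(match.group(1))
    name, text = LINUX_ERRNO.get(-rc, ("UNKNOWN", "not in this table"))
    return rc, name, text


if __name__ == "__main__":
    samples = [
        "The obd_ping operation failed with -107",
        "enqueue objid 0x2 subobj 0x1 on OST idx 0: rc -5",
    ]
    for sample in samples:
        print(decode_rc(sample))
```

Read this way, the sequence is: the OSS evicts the client after the lock callback timer expires, the client's ping then fails with ENOTCONN, and the in-progress enqueue fails with EIO.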



Comments
Comment by Jian Yu [ 18/Jan/13 ]

This issue is blocking Lustre b1_8<->master interop testing:

sanity-quota: https://maloo.whamcloud.com/test_sets/f7e6eb7e-6158-11e2-be04-52540035b04c
sanity: https://maloo.whamcloud.com/test_sets/b794edde-616a-11e2-be04-52540035b04c

Comment by Peter Jones [ 18/Jan/13 ]

Bob will look into this one

Comment by Bob Glossman (Inactive) [ 21/Jan/13 ]

Looks like the problem is on the server side. Varying the client version has no effect; the failure still happens. So far I've confirmed that the problem doesn't happen with pure b2_3 servers. Working to narrow down exactly which change first introduces the failure.

Comment by Bob Glossman (Inactive) [ 21/Jan/13 ]

The problem first appeared sometime between the v2_3_57 and v2_3_58 tags. No failure in 2.3.57.

Comment by Bob Glossman (Inactive) [ 21/Jan/13 ]

I think I've narrowed it down to something that went in on 12/17. Unfortunately, that was a very active day, with more than two dozen commits.

I'll keep trying to narrow it down to a specific commit, but maybe those who did commits on that day could start looking at them?

Comment by Andreas Dilger [ 21/Jan/13 ]

Bob, can you please add all of the committers from 12/17 to the CC list of this bug, and paste a list of commits using git log --pretty=short for that day. Maybe someone will find this issue more quickly.
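
As a hedged illustration of the request above, the sketch below builds a `git log --pretty=short` command limited to one calendar day. `--since` and `--until` are standard git options that, as far as I know, filter on the committer date rather than the author date, which matters for patches authored earlier and landed later. The helper name `git_log_args_for_day` is hypothetical.

```python
import datetime


def git_log_args_for_day(day):
    """Build a `git log` argument list covering one calendar day (local time)."""
    next_day = day + datetime.timedelta(days=1)
    return [
        "git", "log", "--pretty=short",
        "--since=" + day.isoformat() + " 00:00:00",
        "--until=" + next_day.isoformat() + " 00:00:00",
    ]


if __name__ == "__main__":
    # Print the command for 17 Dec 2012; run it from inside a checkout.
    print(" ".join(git_log_args_for_day(datetime.date(2012, 12, 17))))
```

The list can be passed directly to `subprocess.run(...)` from within a checkout of the tree being examined.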

Comment by Bob Glossman (Inactive) [ 21/Jan/13 ]

Andreas,
I could do that but I'm just on the verge of pinning down the specific commit.
It sure looks like the offending commit is

LU-2378 lma: move HSM & SOM attributes to dedicated xattrs

author Johann Lombardi <johann.lombardi@intel.com>
Fri, 23 Nov 2012 14:23:29 +0000 (15:23 +0100)
committer Oleg Drokin <green@whamcloud.com>
Mon, 17 Dec 2012 03:13:15 +0000 (22:13 -0500)
commit d0c104aa0e96b1f1d2366bc1e4715fe3830f41b4

If I've done the right builds and checks so far, a build with that commit fails, while a build with all commits up to that one doesn't.

Double checking now.

Comment by Bob Glossman (Inactive) [ 21/Jan/13 ]

Just for reference here is the short log for 12/17:

2012-12-17	Alex Zhuravlev	LU-2100 ofd: small batched precreation on a small system
2012-12-17	Peng Tao	LU-1994 kernel: 3.6 dentry_open uses struct path as...
2012-12-17	yang sheng	LU-1994 kernel: kernel 3.6 changes i_dentry/d_alias...
2012-12-17	Peng Tao	LU-1994 kernel: 3.5 kernel encode_fh passes in parent...
2012-12-17	yang sheng	LU-1994 llite: kernel 3.5 renames end_writeback to...
2012-12-17	Niu Yawei	LU-2329 quota: wait longer in test_7c
2012-12-17	Bobi Jam	LU-1741 test: fix conf_sanity test_18 test case
2012-12-17	Daniel Kobras	LU-2302 scripts: prevent lfs_migrate data disclosure
2012-12-17	Daniel Kobras	LU-2302 scripts: null-terminated file lists in lfs_migrate
2012-12-17	Daniel Kobras	LU-2302 scripts: fix lfs_migrate with non-English locale
2012-12-17	Johann Lombardi	LU-2371 ptlrpc: get new xid for resend on EINPROGRESS
2012-12-17	Lai Siyao	LU-2388 statahead: don't statahead if it's stopped
2012-12-17	Johann Lombardi	LU-2361 quota: keep slave's glb idx consistent with...
2012-12-17	Niu Yawei	LU-2346 quota: set default grace time
2012-12-17	John L. Hammond	LU-2358 procfs: Implement /proc/fs/lustre/mgs/MGS/fstyp...
2012-12-17	Lai Siyao	LU-1287 mountconf: write failover nid config correctly
2012-12-17	wangdi	LU-1632 fid: remove fid_delete in delete_inode procedure
2012-12-17	John L. Hammond	LU-2363 lod: Fix statfs entries in lod procfs
2012-12-17	Peng Tao	LU-1756 kernel: cleanup lustre_compat25.h
2012-12-17	Peng Tao	LU-1484 kernel: fix build error with 2.6.18 kernel
2012-12-17	Lai Siyao	LU-1887 ptlrpc: grant shrink rpc format is special
2012-12-17	Thomas Stibor	LU-1924 build: configure can not find libgssapi_krb5.so
2012-12-17	Nikitas Angelinas	LU-398 libcfs: Add libcfs heap, a binary heap implement...
2012-12-17	jcl	LU-2016 mdd: add layout swap between 2 objects
2012-12-17	Jinshan Xiong	LU-744 obdclass: revise cl_page refcount
2012-12-17	Nathaniel Clark	LU-2194 tests: Wait for reconnect in recovery-small/19
2012-12-17	Johann Lombardi	LU-2378 lma: move HSM & SOM attributes to dedicated...

Comment by Bob Glossman (Inactive) [ 21/Jan/13 ]

Looks like I was hasty in pointing the finger. The failure definitely starts after the v2_3_57 tag, but I now see failures from earlier than 12/17. Almost certainly not Johann's change for LU-2378. Will keep on trying to narrow it down.

Comment by Bob Glossman (Inactive) [ 21/Jan/13 ]

I have a winner. Pretty sure I got it right this time as I double checked the result before reporting it here. I believe the offending commit is:

LU-1710 lvb: variable sized LVB support

author Jinshan Xiong <jinshan.xiong@intel.com>
Thu, 6 Dec 2012 17:03:27 +0000 (09:03 -0800)
committer Oleg Drokin <green@whamcloud.com>
Wed, 12 Dec 2012 17:21:07 +0000 (12:21 -0500)
commit 929ec628e6fef5609e55d519a1eb9e2cbbf1f1e8
tree 84a5009309b9ed1700b94e3d285aeb0aa67a0ac1
parent caf5bdffb4eb6e3fb31724a1cb037cecfeb6ae6c

A build of the immediate parent in the tree, commit caf5bdffb4eb6e3fb31724a1cb037cecfeb6ae6c, succeeds while a build of this commit fails.
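
The result above (parent builds pass, this commit fails) is the end condition of a search for the first bad commit. One common way to run such a search is binary search over the ordered history, which is what `git bisect` automates. The sketch below is a minimal illustration of that technique, not necessarily the procedure used here. It assumes every commit before the culprit passes and every commit from the culprit onward fails, so a flaky test would mislead it. The function and variable names are hypothetical.

```python
def first_bad_commit(commits, is_bad):
    """Return the first commit for which is_bad() is true.

    `commits` is ordered oldest to newest and its last entry is known bad.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[hi] is bad
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid      # the culprit is mid or something earlier
        else:
            lo = mid + 1  # the culprit is after mid
    return commits[lo]


if __name__ == "__main__":
    # Abbreviated hashes from this ticket plus placeholder entries; the
    # placeholders and the exact ordering are illustrative only.
    history = ["v2_3_57", "placeholder-a", "caf5bdff", "929ec628",
               "placeholder-b", "d0c104aa", "placeholder-c"]
    culprit_index = history.index("929ec628")

    def build_and_test_fails(commit):
        # Stand-in for "build this commit and run runtests against it".
        return history.index(commit) >= culprit_index

    print(first_bad_commit(history, build_and_test_fails))
```

Each probe costs a full build and test run, so halving the range keeps the number of builds logarithmic in the number of commits between the good and bad tags.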

Comment by Peter Jones [ 23/Jan/13 ]

Jinshan

Could you please comment on this one?

Thanks

Peter

Comment by Jinshan Xiong (Inactive) [ 24/Jan/13 ]

Actually, Bob pinged me about this issue on Skype. Fanyong is the right person to take a look at this issue. I'll ping him.

Comment by nasf (Inactive) [ 25/Jan/13 ]

Since it is related to the variable-sized LVB patch, I will take it and fix it.

Comment by nasf (Inactive) [ 06/Feb/13 ]

I am working on it.

Comment by Jian Yu [ 17/Feb/13 ]

Lustre Client: b1_8
Lustre Client Build: http://build.whamcloud.com/job/lustre-b1_8/256

Lustre Server: master
Lustre Server Build: http://build.whamcloud.com/job/lustre-master/1256

Distro/Arch: RHEL6.3/x86_64

A full test session:
https://maloo.whamcloud.com/test_sessions/487f5cd4-7805-11e2-9928-52540035b04c

Most of the tests failed with the issue in this ticket.

Comment by nasf (Inactive) [ 18/Feb/13 ]

This is the patch for master:

http://review.whamcloud.com/#change,5459

Yujian, would you please verify the patch? Thanks!

Comment by Jian Yu [ 18/Feb/13 ]

Yujian, would you please verify the patch? Thanks!

Please add the following test parameters to the commit message to verify the patch. Thanks.

Test-Parameters: envdefinitions=SLOW=yes,ENABLE_QUOTA=yes \
clientjob=lustre-b1_8 clientbuildno=256 testlist=sanity

Comment by nasf (Inactive) [ 20/Feb/13 ]

The patch has passed runtests on Maloo in interoperability mode:

https://maloo.whamcloud.com/test_sessions/649fb4c4-7aba-11e2-b5c8-52540035b04c

Comment by Peter Jones [ 05/Mar/13 ]

Landed for 2.4
