[LU-2645] 1.8<->2.4 interop: enqueue objid 0x2 subobj 0x1 on OST idx 0: rc -5 Created: 18/Jan/13 Updated: 05/Mar/13 Resolved: 05/Mar/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0, Lustre 1.8.9 |
| Fix Version/s: | Lustre 2.4.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Jian Yu | Assignee: | nasf (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | HB | ||
| Environment: |
Lustre Client: b1_8 Lustre Server: master Distro/Arch: RHEL6.3/x86_64 |
||
| Severity: | 3 |
| Rank (Obsolete): | 6181 |
| Description |
|
While running runtests test on Lustre b1_8 clients with master servers, it failed as follows:

copying /etc/hosts to /mnt/lustre/hosts.9085 again
cp: writing `/mnt/lustre/hosts.9085': Input/output error
runtests : @@@@@@ FAIL: can't cp /etc/hosts to /mnt/lustre/hosts.9085 again 6

Dmesg on the client node client-12vm1 showed that:

Lustre: DEBUG MARKER: copying /etc/hosts to /mnt/lustre/hosts.9085 again
LustreError: 11-0: an error occurred while communicating with 10.10.4.209@tcp. The obd_ping operation failed with -107
Lustre: lustre-OST0000-osc-ffff88007cea8800: Connection to service lustre-OST0000 via nid 10.10.4.209@tcp was lost; in progress operations using this service will wait for recovery to complete.
LustreError: 167-0: This client was evicted by lustre-OST0000; in progress operations using this service will fail.
Lustre: Server lustre-OST0000_UUID version (2.3.58.0) is much newer than client version (1.8.8.60)
Lustre: Skipped 7 previous similar messages
LustreError: 10880:0:(ldlm_resource.c:519:ldlm_namespace_cleanup()) Namespace lustre-OST0000-osc-ffff88007cea8800 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
LustreError: 10880:0:(ldlm_resource.c:524:ldlm_namespace_cleanup()) Resource: ffff88007b9de380 (1/0/0/0) (rc: 1)
Lustre: lustre-OST0000-osc-ffff88007cea8800: Connection restored to service lustre-OST0000 using nid 10.10.4.209@tcp.
LustreError: 10869:0:(lov_request.c:211:lov_update_enqueue_set()) enqueue objid 0x2 subobj 0x1 on OST idx 0: rc -5
Lustre: DEBUG MARKER: /usr/sbin/lctl mark runtests : @@@@@@ FAIL: can\'t cp \/etc\/hosts to \/mnt\/lustre\/hosts.9085 again 6

Dmesg on the OSS node client-12vm4 showed that:

Lustre: DEBUG MARKER: copying /etc/hosts to /mnt/lustre/hosts.9085 again
LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 151s: evicting client at 10.10.4.206@tcp ns: filter-ffff880037b92000 lock: ffff88007b5f6000/0xf7c07c2f873c39f lrc: 3/0,0 mode: PR/PR res: 1/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x20 nid: 10.10.4.206@tcp remote: 0x904b6ab232a7b36 expref: 5 pid: 4268 timeout: 4296500377 lvb_type: 1
Lustre: DEBUG MARKER: /usr/sbin/lctl mark runtests : @@@@@@ FAIL: can\'t cp \/etc\/hosts to \/mnt\/lustre\/hosts.9085 again 6

Maloo report: https://maloo.whamcloud.com/test_sets/9ad0fc8a-6181-11e2-be04-52540035b04c |
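The numeric codes in the excerpts above are negated Linux errno values: -5 is EIO (the "Input/output error" that cp reported) and -107 is ENOTCONN. As a reading aid only, here is a minimal Python sketch that pulls such codes out of log text and names them. The helper name `decode_return_codes`, the regular expression, and the two-entry table are assumptions made for illustration; they are not part of Lustre or its tooling.

```python
import re

# Linux errno numbers seen in the excerpts. These values are specific to
# Linux; other platforms may number ENOTCONN differently.
LINUX_ERRNO = {
    5: ("EIO", "Input/output error"),
    107: ("ENOTCONN", "Transport endpoint is not connected"),
}

# Matches "rc -5" and "failed with -107" as they appear in the excerpts.
# It deliberately does not match the "(rc: 1)" refcount-style field.
RC_PATTERN = re.compile(r"(?:\brc|failed with)\s+(-\d+)")


def decode_return_codes(log_text):
    """Return a (code, name, message) tuple for each negative code found."""
    decoded = []
    for match in RC_PATTERN.finditer(log_text):
        code = int(match.group(1))
        name, message = LINUX_ERRNO.get(-code, ("UNKNOWN", "not in table"))
        decoded.append((code, name, message))
    return decoded


if __name__ == "__main__":
    sample = (
        "The obd_ping operation failed with -107\n"
        "enqueue objid 0x2 subobj 0x1 on OST idx 0: rc -5\n"
    )
    for code, name, message in decode_return_codes(sample):
        print(f"{code}: {name} ({message})")
```

The table is hardcoded rather than taken from the running platform so that the decoding stays tied to the Linux nodes that produced the logs.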
| Comments |
| Comment by Jian Yu [ 18/Jan/13 ] |
|
This issue is blocking Lustre b1_8<->master interop testing: sanity-quota: https://maloo.whamcloud.com/test_sets/f7e6eb7e-6158-11e2-be04-52540035b04c |
| Comment by Peter Jones [ 18/Jan/13 ] |
|
Bob will look into this one |
| Comment by Bob Glossman (Inactive) [ 21/Jan/13 ] |
|
Looks like the problem is on the server side. Varying the client version has no effect; the failure still happens. So far I've confirmed the problem doesn't happen with pure b2_3 servers. Working to narrow down exactly which change first introduces the failure. |
| Comment by Bob Glossman (Inactive) [ 21/Jan/13 ] |
|
The problem first appeared sometime between v2_3_57 and v2_3_58 tags. No failure in 2.3.57 |
| Comment by Bob Glossman (Inactive) [ 21/Jan/13 ] |
|
I think I've narrowed it down to something that went in on 12/17. Unfortunately that was a very active day, with more than two dozen commits. I'll keep trying to narrow it down to a specific commit, but maybe those who did commits on that day could start looking at them? |
| Comment by Andreas Dilger [ 21/Jan/13 ] |
|
Bob, can you please add all of the committers from 12/17 to the CC list of this bug, and paste a list of commits using git log --pretty=short for that day. Maybe someone will find this issue more quickly. |
| Comment by Bob Glossman (Inactive) [ 21/Jan/13 ] |
|
Andreas,
author Johann Lombardi <johann.lombardi@intel.com>
If I've done the right builds and checks so far, a build with that commit fails while a build with all commits up until that one doesn't fail. Double-checking now. |
| Comment by Bob Glossman (Inactive) [ 21/Jan/13 ] |
|
Just for reference here is the short log for 12/17:

2012-12-17 Alex Zhuravlev LU-2100 ofd: small batched precreation on a small system
2012-12-17 Peng Tao LU-1994 kernel: 3.6 dentry_open uses struct path as...
2012-12-17 yang sheng LU-1994 kernel: kernel 3.6 changes i_dentry/d_alias...
2012-12-17 Peng Tao LU-1994 kernel: 3.5 kernel encode_fh passes in parent...
2012-12-17 yang sheng LU-1994 llite: kernel 3.5 renames end_writeback to...
2012-12-17 Niu Yawei LU-2329 quota: wait longer in test_7c
2012-12-17 Bobi Jam LU-1741 test: fix conf_sanity test_18 test case
2012-12-17 Daniel Kobras LU-2302 scripts: prevent lfs_migrate data disclosure
2012-12-17 Daniel Kobras LU-2302 scripts: null-terminated file lists in lfs_migrate
2012-12-17 Daniel Kobras LU-2302 scripts: fix lfs_migrate with non-English locale
2012-12-17 Johann Lombardi LU-2371 ptlrpc: get new xid for resend on EINPROGRESS
2012-12-17 Lai Siyao LU-2388 statahead: don't statahead if it's stopped
2012-12-17 Johann Lombardi LU-2361 quota: keep slave's glb idx consistent with...
2012-12-17 Niu Yawei LU-2346 quota: set default grace time
2012-12-17 John L. Hammond LU-2358 procfs: Implement /proc/fs/lustre/mgs/MGS/fstyp...
2012-12-17 Lai Siyao LU-1287 mountconf: write failover nid config correctly
2012-12-17 wangdi LU-1632 fid: remove fid_delete in delete_inode procedure
2012-12-17 John L. Hammond LU-2363 lod: Fix statfs entries in lod procfs
2012-12-17 Peng Tao LU-1756 kernel: cleanup lustre_compat25.h
2012-12-17 Peng Tao LU-1484 kernel: fix build error with 2.6.18 kernel
2012-12-17 Lai Siyao LU-1887 ptlrpc: grant shrink rpc format is special
2012-12-17 Thomas Stibor LU-1924 build: configure can not find libgssapi_krb5.so
2012-12-17 Nikitas Angelinas LU-398 libcfs: Add libcfs heap, a binary heap implement...
2012-12-17 jcl LU-2016 mdd: add layout swap between 2 objects
2012-12-17 Jinshan Xiong LU-744 obdclass: revise cl_page refcount
2012-12-17 Nathaniel Clark LU-2194 tests: Wait for reconnect in recovery-small/19
2012-12-17 Johann Lombardi LU-2378 lma: move HSM & SOM attributes to dedicated... |
| Comment by Bob Glossman (Inactive) [ 21/Jan/13 ] |
|
Looks like I was hasty in pointing the finger. The failure definitely starts after the v2_3_57 tag, but I now see failures from earlier than 12/17. Almost certainly not Johann's change for
| Comment by Bob Glossman (Inactive) [ 21/Jan/13 ] |
|
I have a winner. Pretty sure I got it right this time as I double checked the result before reporting it here. I believe the offending commit is:
author Jinshan Xiong <jinshan.xiong@intel.com> A build of the immediate parent in the tree, commit caf5bdffb4eb6e3fb31724a1cb037cecfeb6ae6c, succeeds while a build of this commit fails. |
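The narrowing described in the comments above comes down to finding the first commit whose build fails while its immediate parent's build succeeds. The ticket does not say how the builds were chosen, so the following is only a sketch of a binary search over a linear, oldest-to-newest commit list, in the spirit of `git bisect`. The function name `first_bad_commit`, the `build_fails` callback, and the placeholder commit names are hypothetical. The search also assumes that once the failure appears, every later commit fails too.

```python
def first_bad_commit(commits, build_fails):
    """Return the first failing commit in an oldest-to-newest list.

    Assumptions (all hypothetical, for illustration):
      * commits[0] is known good and commits[-1] is known bad;
      * the failure, once introduced, persists in every later commit;
      * build_fails(commit) builds and tests one commit, returning True
        when the interop test fails.
    """
    if len(commits) < 2:
        raise ValueError("need at least one good and one bad commit")

    good, bad = 0, len(commits) - 1  # indexes of known-good / known-bad
    while bad - good > 1:
        mid = (good + bad) // 2
        if build_fails(commits[mid]):
            bad = mid    # failure is already present at mid
        else:
            good = mid   # failure was introduced after mid
    return commits[bad]


if __name__ == "__main__":
    # Placeholder history between the two tags named in the ticket.
    history = ["v2_3_57", "commit-a", "commit-b", "commit-c", "v2_3_58"]
    failing = {"commit-b", "commit-c", "v2_3_58"}
    print(first_bad_commit(history, lambda c: c in failing))
```

With N candidate commits this needs about log2(N) builds, which matters when each check involves a full server build and an interop test run.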
| Comment by Peter Jones [ 23/Jan/13 ] |
|
Jinshan Could you please comment on this one? Thanks Peter |
| Comment by Jinshan Xiong (Inactive) [ 24/Jan/13 ] |
|
Actually, Bob pinged me about this issue on Skype. Fanyong is the right person to take a look at this issue. I'll ping him. |
| Comment by nasf (Inactive) [ 25/Jan/13 ] |
|
Since it is related to the variable-sized LVB patch, I will take it and fix it. |
| Comment by nasf (Inactive) [ 06/Feb/13 ] |
|
I am working on it. |
| Comment by Jian Yu [ 17/Feb/13 ] |
|
Lustre Client: b1_8 Lustre Server: master Distro/Arch: RHEL6.3/x86_64 A full test session: Most of the tests failed with the issue in this ticket. |
| Comment by nasf (Inactive) [ 18/Feb/13 ] |
|
This is the patch for master: http://review.whamcloud.com/#change,5459 Yujian, would you please verify the patch? Thanks! |
| Comment by Jian Yu [ 18/Feb/13 ] |
Please add the following test parameters into the commit message to verify the patch. Thanks.
Test-Parameters: envdefinitions=SLOW=yes,ENABLE_QUOTA=yes \
clientjob=lustre-b1_8 clientbuildno=256 testlist=sanity |
| Comment by nasf (Inactive) [ 20/Feb/13 ] |
|
The patch has passed runtests on Maloo under interoperability mode: https://maloo.whamcloud.com/test_sessions/649fb4c4-7aba-11e2-b5c8-52540035b04c |
| Comment by Peter Jones [ 05/Mar/13 ] |
|
Landed for 2.4 |