[LU-2269] Test failure on test suite recovery-small, subtest test_50: writemany returned rc 2 Created: 03/Nov/12  Updated: 19/Apr/13  Resolved: 19/Nov/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Nathaniel Clark
Resolution: Duplicate Votes: 0
Labels: NFBlocker
Environment:

lustre master build #1011 SLES11 SP2 client


Severity: 3
Rank (Obsolete): 5428

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/163db52a-253c-11e2-9e7c-52540035b04c.

The sub-test test_50 failed with the following error:

writemany returned rc 2

ost console log:

01:34:03:Lustre: DEBUG MARKER: == recovery-small test 50: failover MDS under load =================================================== 01:34:02 (1351845242)
01:36:25:LustreError: 20350:0:(qsd_reint.c:58:qsd_reint_completion()) lustre-OST0000: failed to enqueue global quota lock, glb fid:[0x200000006:0x20000:0x0], rc:-5
01:36:25:LustreError: 20350:0:(qsd_reint.c:58:qsd_reint_completion()) Skipped 13 previous similar messages
01:38:16:Lustre: DEBUG MARKER: /usr/sbin/lctl mark  recovery-small test_50: @@@@@@ IGNORE \(bz13652\): writemany returned rc 2 
01:38:16:Lustre: DEBUG MARKER: recovery-small test_50: @@@@@@ IGNORE (bz13652): writemany returned rc 2
01:38:16:Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /logdir/test_logs/2012-11-01/lustre-master-el6-x86_64-sles11sp2-x86_64__1011__-69983126486880-163552/recovery-small.test_50.debug_log.$(hostname -s).1351845491.log;
01:38:16:         dmesg > /logdir/test_logs/2012-11-01/lustre-master-el6-x86_64
01:38:16:Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 2>/dev/null || true
01:38:16:Lustre: DEBUG MARKER: rc=$([ -f /proc/sys/lnet/catastrophe ] && echo $(< /proc/sys/lnet/catastrophe) || echo 0);
01:38:16:		if [ $rc -ne 0 ]; then echo $(hostname): $rc; fi
01:38:16:		exit $rc;
01:38:17:Lustre: DEBUG MARKER: /usr/sbin/lctl mark == recovery-small test 51: failover MDS during recovery ============================================== 01:38:16 \(1351845496\)


 Comments   
Comment by Sarah Liu [ 05/Nov/12 ]

another failure: https://maloo.whamcloud.com/test_sets/4fc66cd2-2731-11e2-b04c-52540035b04c

Comment by Nathaniel Clark [ 13/Nov/12 ]

The line that appears in the test log when either test 50 or 51 fails looks like the following (from dmesg on the OSS):

Lustre: 18518:0:(ofd_obd.c:1069:ofd_orphans_destroy()) lustre-OST0000: deleting orphan objects from 5376 to 5419
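The precreated-object range can be pulled out of that message with a quick one-liner (a minimal sketch against the exact line quoted above; the message format is taken from this log excerpt, not from a guaranteed-stable interface):

```shell
# Extract the start/end of the orphan range from an ofd_orphans_destroy line.
# $line is the sample message from the OSS dmesg excerpt above.
line='Lustre: 18518:0:(ofd_obd.c:1069:ofd_orphans_destroy()) lustre-OST0000: deleting orphan objects from 5376 to 5419'
echo "$line" | sed -n 's/.*deleting orphan objects from \([0-9]*\) to \([0-9]*\).*/\1 \2/p'
```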

Comment by Andreas Dilger [ 16/Nov/12 ]

Nathaniel, the "deleting orphans" message itself is normal operation during recovery. It indicates that the MDS requested that the OST delete objects it had precreated, because they were no longer needed by the MDS after it crashed but might contain garbage data.

Li Wei, could this potentially relate to the "deleting too many orphans" problem? I haven't really looked into the logs to determine the root cause.

Comment by Li Wei (Inactive) [ 16/Nov/12 ]

Andreas, it is not obvious from the console logs, but I'll download and take a look at the debug logs.

Comment by Li Wei (Inactive) [ 18/Nov/12 ]

I didn't find much in the debug logs either. However, there were lines like the following:

LustreError: 19991:0:(osp_precreate.c:274:osp_precreate_send()) lustre-OST0000-osc-MDT0000: can't precreate: rc = -5
LustreError: 19991:0:(osp_precreate.c:609:osp_precreate_thread()) lustre-OST0000-osc-MDT0000: cannot precreate objects: rc = -5
LustreError: 19994:0:(osp_precreate.c:274:osp_precreate_send()) lustre-OST0001-osc-MDT0000: can't precreate: rc = -5
LustreError: 19994:0:(osp_precreate.c:609:osp_precreate_thread()) lustre-OST0001-osc-MDT0000: cannot precreate objects: rc = -5
[...]
LustreError: 20557:0:(osp_precreate.c:563:osp_precreate_thread()) lustre-OST0000-osc-MDT0000: cannot cleanup orphans: rc = -5

with quota error messages above them.
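For context, rc = -5 corresponds to -EIO on Linux, the same code reported by the failed global quota lock enqueue in the OST console log, which is consistent with the quota errors blocking precreation. A quick check of the errno mapping (a sketch added for reference, not part of the original logs):

```shell
# EIO is errno 5 on Linux; confirm the number and its message string.
python3 -c 'import errno, os; print(errno.EIO, os.strerror(errno.EIO))'
```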

Comment by Nathaniel Clark [ 19/Nov/12 ]

The fix in LU-2285 seems to have cleared up this issue. I can no longer reproduce it after using a build including that fix.

Generated at Sat Feb 10 01:23:47 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.