[LU-655] sanity test 27q hung Created: 01/Sep/11  Updated: 25/Oct/11  Resolved: 25/Oct/11

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Jian Yu Assignee: Hongchao Zhang
Resolution: Duplicate Votes: 0
Labels: None
Environment:

Lustre Branch: master
Lustre Build: http://newbuild.whamcloud.com/job/lustre-master/275/
Distro/Arch: RHEL6/x86_64


Severity: 3
Rank (Obsolete): 5473

 Description   

While running sanity tests on a single test node, test 27q hung as follows:

== sanity test 27q: append to truncated file with all OSTs full (should error) ===== 21:56:29 (1314852989)
fail_loc=0
/mnt/lustre/d0.sanity/d27/f27q has size 80000000 OK
OSTIDX=0 MDSIDX=1
osc.lustre-OST0000-osc-MDT0000.prealloc_last_id=65
osc.lustre-OST0000-osc-MDT0000.prealloc_next_id=65
osc.lustre-OST0001-osc-MDT0000.prealloc_last_id=513
osc.lustre-OST0001-osc-MDT0000.prealloc_next_id=292
fail_val=-1
fail_loc=0x215
Creating to objid 65 on ost lustre-OST0000...

The console log showed:

Lustre: DEBUG MARKER: == sanity test 27q: append to truncated file with all OSTs full (should error) ===== 21:56:29 (1314852989)
LustreError: 8011:0:(lov_request.c:569:lov_update_create_set()) error creating fid 0x2917 sub-object on OST idx 0/1: rc = -28
LustreError: 8011:0:(lov_request.c:569:lov_update_create_set()) Skipped 1 previous similar message
LustreError: 9113:0:(libcfs_fail.h:81:cfs_fail_check_set()) *** cfs_fail_loc=215 ***
LustreError: 4276:0:(libcfs_fail.h:81:cfs_fail_check_set()) *** cfs_fail_loc=215 ***
LustreError: 4276:0:(ldlm_lib.c:2128:target_send_reply_msg()) @@@ processing error (-28)  req@ffff88060eafe400 x1378722884187257/t0(0) o-1->dda7060d-4cff-074b-9616-5a5558ff548b@NET_0x9000000000000_UUID:0/0 lens 456/0 e 0 to 0 dl 1314853018 ref 1 fl Interpret:/ffffffff/ffffffff rc 0/-1
LustreError: 11-0: an error occurred while communicating with 0@lo. The ost_write operation failed with -28
LustreError: 4151:0:(libcfs_fail.h:81:cfs_fail_check_set()) *** cfs_fail_loc=215 ***
LustreError: 4151:0:(libcfs_fail.h:81:cfs_fail_check_set()) Skipped 49013 previous similar messages
LustreError: 4151:0:(libcfs_fail.h:81:cfs_fail_check_set()) *** cfs_fail_loc=215 ***
LustreError: 4151:0:(libcfs_fail.h:81:cfs_fail_check_set()) Skipped 100123 previous similar messages
Lustre: Service thread pid 8011 was inactive for 62.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Pid: 8011, comm: mdt_04

Call Trace:
 [<ffffffffa03076fe>] cfs_waitq_wait+0xe/0x10 [libcfs]
 [<ffffffffa076f79e>] lov_create+0xbce/0x1580 [lov]
 [<ffffffff8105dc60>] ? default_wake_function+0x0/0x20
 [<ffffffffa09a64dd>] mdd_lov_create+0xadd/0x21f0 [mdd]
 [<ffffffffa09b79f2>] mdd_create+0xca2/0x1db0 [mdd]
 [<ffffffffa0995208>] ? mdd_version_get+0x68/0xa0 [mdd]
 [<ffffffffa0a792bc>] cml_create+0xbc/0x280 [cmm]
 [<ffffffffa0a1eb7a>] mdt_reint_open+0x1bca/0x2c80 [mdt]
 [<ffffffffa0a069df>] mdt_reint_rec+0x3f/0x100 [mdt]
 [<ffffffffa09fee84>] mdt_reint_internal+0x6d4/0x9f0 [mdt]
 [<ffffffffa09ff505>] mdt_intent_reint+0x245/0x600 [mdt]
 [<ffffffffa09f7410>] mdt_intent_policy+0x3c0/0x6b0 [mdt]
 [<ffffffffa0510afa>] ldlm_lock_enqueue+0x2da/0xa50 [ptlrpc]
 [<ffffffffa052f305>] ? ldlm_export_lock_get+0x15/0x20 [ptlrpc]
 [<ffffffffa03155e2>] ? cfs_hash_bd_add_locked+0x62/0x90 [libcfs]
 [<ffffffffa05371f7>] ldlm_handle_enqueue0+0x447/0x1090 [ptlrpc]
 [<ffffffffa09ea53f>] ? mdt_unpack_req_pack_rep+0xcf/0x5d0 [mdt]
 [<ffffffffa09f782a>] mdt_enqueue+0x4a/0x110 [mdt]
 [<ffffffffa09f2bb5>] mdt_handle_common+0x8d5/0x1810 [mdt]
 [<ffffffffa0555104>] ? lustre_msg_get_opc+0x94/0x100 [ptlrpc]
 [<ffffffffa09f3bc5>] mdt_regular_handle+0x15/0x20 [mdt]
 [<ffffffffa0565c7e>] ptlrpc_main+0xb8e/0x1900 [ptlrpc]
 [<ffffffffa05650f0>] ? ptlrpc_main+0x0/0x1900 [ptlrpc]
 [<ffffffff8100c1ca>] child_rip+0xa/0x20
 [<ffffffffa05650f0>] ? ptlrpc_main+0x0/0x1900 [ptlrpc]
 [<ffffffffa05650f0>] ? ptlrpc_main+0x0/0x1900 [ptlrpc]
 [<ffffffff8100c1c0>] ? child_rip+0x0/0x20

LustreError: dumping log to /tmp/lustre-log.1314853064.8011

Maloo report: https://maloo.whamcloud.com/test_sets/7ce0bc08-d45e-11e0-8d02-52540025f9af

This can easily be reproduced by running 'REFORMAT="--reformat" bash sanity.sh' on a single node.



 Comments   
Comment by Oleg Drokin [ 01/Sep/11 ]

This was introduced somewhere in the last 3 weeks.

Comment by John Hammond [ 21/Sep/11 ]

filter_handle_precreate() is setting oa->o_valid to OBD_MD_FLID|OBD_MD_FLGROUP, clobbering the setting of OBD_MD_FLFLAGS in filter_precreate().
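
For illustration, here is a minimal standalone sketch of that interaction, with placeholder flag values and simplified function bodies (this is not the actual Lustre source):

#include <stdint.h>
#include <stdio.h>

#define OBD_MD_FLID     0x1ULL   /* placeholder values, not the real flag bits */
#define OBD_MD_FLGROUP  0x2ULL
#define OBD_MD_FLFLAGS  0x4ULL

struct obdo { uint64_t o_valid; };

/* filter_precreate() marks OBD_MD_FLFLAGS so the flags it filled in are
 * reported back to the caller */
static void precreate_set_flags(struct obdo *oa)
{
        oa->o_valid |= OBD_MD_FLFLAGS;
}

/* filter_handle_precreate() then assigns o_valid outright, so the
 * OBD_MD_FLFLAGS bit set above is lost; OR-ing would have preserved it */
static void handle_precreate_clobber(struct obdo *oa)
{
        oa->o_valid = OBD_MD_FLID | OBD_MD_FLGROUP;
}

int main(void)
{
        struct obdo oa = { 0 };

        precreate_set_flags(&oa);
        handle_precreate_clobber(&oa);
        printf("FLFLAGS still set: %s\n",
               (oa.o_valid & OBD_MD_FLFLAGS) ? "yes" : "no");  /* prints "no" */
        return 0;
}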

Comment by Alexey Lyashkov [ 26/Sep/11 ]

That looks like a regression introduced by:
commit 9f3f665577797660984bc1b6cbd443111dceef49
Author: hongchao.zhang <hongchao.zhang@whamcloud.com>
Date:   Fri Aug 12 21:25:13 2011 +0800

   LU-325 using preallocated objects if OST has enough disk space
...

The OST reported ENOSPC to the MDT, but the MDT does not return an error to the client and instead resends the create request many times.

Comment by Alexey Lyashkov [ 26/Sep/11 ]

I reverted that commit in my local tree and the bug goes away.

Comment by Alexey Lyashkov [ 26/Sep/11 ]

The high CPU load is the result of a loop of OSC create requests caused by

if (handle_async_create(fake_req, rc) == -EAGAIN) {
        oscc_internal_create(oscc);
...

when the OST returns ENOSPC to the create request. Before that change, handle_async_create returned an error in the same case:

if (oscc->oscc_flags & OSCC_FLAG_NOSPC)
        GOTO(out_wake, rc = -ENOSPC);
...

but now -EAGAIN is returned, which produces the loop of requests.
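
To make the loop concrete, here is a minimal standalone sketch of the old and new behaviour (the names and structure are simplified stand-ins for the OSC create path, not the actual Lustre code):

#include <errno.h>
#include <stdio.h>

#define OSCC_FLAG_NOSPC 0x1

struct osc_creator { int oscc_flags; };

/* old behaviour: once the OST has reported ENOSPC, fail the create
 * immediately instead of queueing yet another precreate request */
static int handle_async_create_old(struct osc_creator *oscc)
{
        if (oscc->oscc_flags & OSCC_FLAG_NOSPC)
                return -ENOSPC;
        return -EAGAIN;
}

/* new behaviour after the LU-325 change: -EAGAIN is returned even in the
 * NOSPC case, so the caller keeps issuing create requests forever */
static int handle_async_create_new(struct osc_creator *oscc)
{
        (void)oscc;
        return -EAGAIN;
}

int main(void)
{
        struct osc_creator oscc = { .oscc_flags = OSCC_FLAG_NOSPC };
        int retries;

        /* the old code path exits the loop immediately with -ENOSPC ... */
        for (retries = 0; handle_async_create_old(&oscc) == -EAGAIN; retries++)
                ;
        printf("old: %d retries\n", retries);

        /* ... the new one would spin; cap it so the demo terminates */
        for (retries = 0; handle_async_create_new(&oscc) == -EAGAIN &&
                          retries < 5; retries++)
                ;
        printf("new: still retrying after %d attempts\n", retries);
        return 0;
}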

Comment by Oleg Drokin [ 22/Oct/11 ]

Now that we are in the 2.2 cycle it would be nice to have this fixed.
Can you please take a look?

Comment by Peter Jones [ 25/Oct/11 ]

Being worked on under LU-791.
