[LU-655] sanity test 27q hung Created: 01/Sep/11 Updated: 25/Oct/11 Resolved: 25/Oct/11 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Jian Yu | Assignee: | Hongchao Zhang |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None |
| Environment: | Lustre Branch: master |
| Severity: | 3 |
| Rank (Obsolete): | 5473 |
| Description |
|
While running sanity tests on a single test node, test 27q hung as follows:

== sanity test 27q: append to truncated file with all OSTs full (should error) ===== 21:56:29 (1314852989)
fail_loc=0
/mnt/lustre/d0.sanity/d27/f27q has size 80000000 OK
OSTIDX=0 MDSIDX=1
osc.lustre-OST0000-osc-MDT0000.prealloc_last_id=65
osc.lustre-OST0000-osc-MDT0000.prealloc_next_id=65
osc.lustre-OST0001-osc-MDT0000.prealloc_last_id=513
osc.lustre-OST0001-osc-MDT0000.prealloc_next_id=292
fail_val=-1
fail_loc=0x215
Creating to objid 65 on ost lustre-OST0000...

Console log showed that:

Lustre: DEBUG MARKER: == sanity test 27q: append to truncated file with all OSTs full (should error) ===== 21:56:29 (1314852989)
LustreError: 8011:0:(lov_request.c:569:lov_update_create_set()) error creating fid 0x2917 sub-object on OST idx 0/1: rc = -28
LustreError: 8011:0:(lov_request.c:569:lov_update_create_set()) Skipped 1 previous similar message
LustreError: 9113:0:(libcfs_fail.h:81:cfs_fail_check_set()) *** cfs_fail_loc=215 ***
LustreError: 4276:0:(libcfs_fail.h:81:cfs_fail_check_set()) *** cfs_fail_loc=215 ***
LustreError: 4276:0:(ldlm_lib.c:2128:target_send_reply_msg()) @@@ processing error (-28) req@ffff88060eafe400 x1378722884187257/t0(0) o-1->dda7060d-4cff-074b-9616-5a5558ff548b@NET_0x9000000000000_UUID:0/0 lens 456/0 e 0 to 0 dl 1314853018 ref 1 fl Interpret:/ffffffff/ffffffff rc 0/-1
LustreError: 11-0: an error occurred while communicating with 0@lo. The ost_write operation failed with -28
LustreError: 4151:0:(libcfs_fail.h:81:cfs_fail_check_set()) *** cfs_fail_loc=215 ***
LustreError: 4151:0:(libcfs_fail.h:81:cfs_fail_check_set()) Skipped 49013 previous similar messages
LustreError: 4151:0:(libcfs_fail.h:81:cfs_fail_check_set()) *** cfs_fail_loc=215 ***
LustreError: 4151:0:(libcfs_fail.h:81:cfs_fail_check_set()) Skipped 100123 previous similar messages
Lustre: Service thread pid 8011 was inactive for 62.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Pid: 8011, comm: mdt_04

Call Trace:
 [<ffffffffa03076fe>] cfs_waitq_wait+0xe/0x10 [libcfs]
 [<ffffffffa076f79e>] lov_create+0xbce/0x1580 [lov]
 [<ffffffff8105dc60>] ? default_wake_function+0x0/0x20
 [<ffffffffa09a64dd>] mdd_lov_create+0xadd/0x21f0 [mdd]
 [<ffffffffa09b79f2>] mdd_create+0xca2/0x1db0 [mdd]
 [<ffffffffa0995208>] ? mdd_version_get+0x68/0xa0 [mdd]
 [<ffffffffa0a792bc>] cml_create+0xbc/0x280 [cmm]
 [<ffffffffa0a1eb7a>] mdt_reint_open+0x1bca/0x2c80 [mdt]
 [<ffffffffa0a069df>] mdt_reint_rec+0x3f/0x100 [mdt]
 [<ffffffffa09fee84>] mdt_reint_internal+0x6d4/0x9f0 [mdt]
 [<ffffffffa09ff505>] mdt_intent_reint+0x245/0x600 [mdt]
 [<ffffffffa09f7410>] mdt_intent_policy+0x3c0/0x6b0 [mdt]
 [<ffffffffa0510afa>] ldlm_lock_enqueue+0x2da/0xa50 [ptlrpc]
 [<ffffffffa052f305>] ? ldlm_export_lock_get+0x15/0x20 [ptlrpc]
 [<ffffffffa03155e2>] ? cfs_hash_bd_add_locked+0x62/0x90 [libcfs]
 [<ffffffffa05371f7>] ldlm_handle_enqueue0+0x447/0x1090 [ptlrpc]
 [<ffffffffa09ea53f>] ? mdt_unpack_req_pack_rep+0xcf/0x5d0 [mdt]
 [<ffffffffa09f782a>] mdt_enqueue+0x4a/0x110 [mdt]
 [<ffffffffa09f2bb5>] mdt_handle_common+0x8d5/0x1810 [mdt]
 [<ffffffffa0555104>] ? lustre_msg_get_opc+0x94/0x100 [ptlrpc]
 [<ffffffffa09f3bc5>] mdt_regular_handle+0x15/0x20 [mdt]
 [<ffffffffa0565c7e>] ptlrpc_main+0xb8e/0x1900 [ptlrpc]
 [<ffffffffa05650f0>] ? ptlrpc_main+0x0/0x1900 [ptlrpc]
 [<ffffffff8100c1ca>] child_rip+0xa/0x20
 [<ffffffffa05650f0>] ? ptlrpc_main+0x0/0x1900 [ptlrpc]
 [<ffffffffa05650f0>] ? ptlrpc_main+0x0/0x1900 [ptlrpc]
 [<ffffffff8100c1c0>] ? child_rip+0x0/0x20

LustreError: dumping log to /tmp/lustre-log.1314853064.8011

Maloo report: https://maloo.whamcloud.com/test_sets/7ce0bc08-d45e-11e0-8d02-52540025f9af

This could be reproduced easily by just running 'REFORMAT="--reformat" bash sanity.sh' on a single node. |
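For context, test 27q simulates full OSTs with Lustre's fault-injection mechanism: fail_loc=0x215 corresponds to OBD_FAIL_OST_ENOSPC, so the check in cfs_fail_check_set() makes OST operations fail with -28 (-ENOSPC). A minimal sketch of the pattern, where the surrounding handler is an assumption for illustration rather than the actual Lustre call site:

    /* Sketch only: OBD_FAIL_CHECK()/cfs_fail_check_set() and the
     * OBD_FAIL_OST_ENOSPC value are real Lustre infrastructure, but
     * this handler is a hypothetical stand-in for the OST paths. */
    #define OBD_FAIL_OST_ENOSPC 0x215   /* matches "fail_loc=0x215" above */

    static int ost_handle_request(void)
    {
            /* When "lctl set_param fail_loc=0x215" is active, the check
             * fires, logs "*** cfs_fail_loc=215 ***", and the request
             * fails with -ENOSPC (-28), the "processing error (-28)"
             * seen in the console log. */
            if (OBD_FAIL_CHECK(OBD_FAIL_OST_ENOSPC))
                    return -ENOSPC;

            /* ... normal request processing ... */
            return 0;
    }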
| Comments |
| Comment by Oleg Drokin [ 01/Sep/11 ] |
|
This was introduced somewhere in the last 3 weeks. |
| Comment by John Hammond [ 21/Sep/11 ] |
|
filter_handle_precreate() is setting oa->o_valid to OBD_MD_FLID|OBD_MD_FLGROUP, clobbering the setting of OBD_MD_FLFLAGS in filter_precreate(). |
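In other words, a valid-bit set by the inner function is lost because the caller assigns to o_valid instead of OR-ing into it. A simplified sketch of the described interaction (the exact statements in the obdfilter code may differ):

    /* filter_precreate() marks the flags field valid, e.g. so the MDT
     * can see the OST's out-of-space state: */
    oa->o_valid |= OBD_MD_FLFLAGS;

    /* ...but filter_handle_precreate() then assigns rather than ORs,
     * dropping the OBD_MD_FLFLAGS bit set above: */
    oa->o_valid = OBD_MD_FLID | OBD_MD_FLGROUP;

    /* preserving existing bits would avoid the clobbering: */
    oa->o_valid |= OBD_MD_FLID | OBD_MD_FLGROUP;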
| Comment by Alexey Lyashkov [ 26/Sep/11 ] |
|
This looks like a regression: the OST reports ENOSPC to the MDT, but the MDT does not return the error to the client and instead resends the create request many times. |
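A hypothetical illustration of that failure mode (this loop is the shape of the reported behavior, not the actual MDT code):

    /* Illustration only: the MDT-side create path keeps retrying object
     * creation instead of propagating -ENOSPC, so the client never sees
     * the error test 27q expects and the mdt service thread appears hung
     * (the cfs_waitq_wait frame in the stack trace above). */
    static int mdt_create_objects(struct obd_export *osc)  /* hypothetical */
    {
            int rc;

            do {
                    rc = request_object_create(osc);  /* hypothetical helper;
                                                       * keeps failing with -28 */
            } while (rc == -ENOSPC);  /* bug: spins while every OST is full */

            /* expected behavior: once all OSTs report no space, treat
             * -ENOSPC as fatal and return it to the client */
            return rc;
    }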
| Comment by Alexey Lyashkov [ 26/Sep/11 ] |
|
I reverted that commit on my local tree and the bug went away. |
| Comment by Alexey Lyashkov [ 26/Sep/11 ] |
|
The high CPU load is produced as a result of the loop of OSC create requests. |
| Comment by Oleg Drokin [ 22/Oct/11 ] |
|
Now that we are in the 2.2 cycle, it would be nice to have this fixed. |
| Comment by Peter Jones [ 25/Oct/11 ] |
|
Being worked on under LU-791 |