[LU-4758] parallel-scale test_metabench: metabench failed with 1 Created: 12/Mar/14 Updated: 14/May/14 Resolved: 14/May/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.6.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Sarah Liu | Assignee: | Hongchao Zhang |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Environment: |
client and server: lustre-master build # 1911 RHEL6 ldiskfs |
||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 13091 |
| Description |
|
https://maloo.whamcloud.com/test_sets/2254277a-a030-11e3-947c-52540035b04c

test log shows:

[02/27/2014 12:30:13] Entering par_create_multidir to create 4343 files in 1 dirs
Removed 30400 files in 205.132 seconds
[02/27/2014 12:34:02] Leaving par_create_multidir
Parallel file creation by 7 processes in seperate directories
Process      Files       Time       Rate
--------  ----------  ----------  ----------
001/008       4343      23.758     182.805
002/008       4343      18.392     236.135
003/008       4343      23.868     181.957
004/008       4343      18.325     236.996
005/008       4343      23.885     181.828
006/008       4343      18.388     236.186
007/008       4343      23.927     181.510
--------  ----------  ----------  ----------
Total        30401      23.927    1270.568
Elapsed      30401      23.928    1270.507
--------  ----------  ----------  ----------
Average       4343      21.506     205.345
Std Dev                  2.718      26.932
                      ( 12.64%)  ( 13.12%)
[02/27/2014 12:36:47] FATAL error on process 0
Proc 0: Unable to stat file [/mnt/lustre/d0.metabench/CREATE_MD_008.000/nSUFhBISq]: No such file or directory
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 29305 on node client-32vm5 exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in the job did. This can cause a job to hang indefinitely while it waits for all processes to call "init". By rule, if one process calls "init", then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize". By rule, all processes that call "init" MUST call "finalize" prior to exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
parallel-scale test_metabench: @@@@@@ FAIL: metabench failed! 1 |
| Comments |
| Comment by Peter Jones [ 21/Mar/14 ] |
|
Hongchao, could you please look into this one? Thanks, Peter |
| Comment by Hongchao Zhang [ 26/Mar/14 ] |
|
Status update: I checked some of the results of the recent failed tests in Maloo and found no -ENOENT (-2) in the logs of either the client or the server. |
| Comment by Hongchao Zhang [ 29/Mar/14 ] |
|
The debug patch for collecting more specific logs is tracked at http://review.whamcloud.com/#/c/9813/ |
| Comment by Hongchao Zhang [ 31/Mar/14 ] |
|
In test report https://maloo.whamcloud.com/sub_tests/2d11e162-b87b-11e3-ac5f-52540035b04c, the missing file (/mnt/lustre/d0.metabench/STAT_MD_008.000/3SoD) was deleted just before it was stat'ed:

1396208726.881017:0:5933:0:(file.c:3517:ll_inode_permission()) VFS Op:inode=[0x200005221:0x98be:0x0](ffff88000a887b38), inode mode 41e8 mask 1
1396208726.881019:0:5933:0:(file.c:3517:ll_inode_permission()) VFS Op:inode=[0x200005221:0x98be:0x0](ffff88000a887b38), inode mode 41e8 mask 1
1396208726.881020:0:5933:0:(dcache.c:421:ll_revalidate_nd()) VFS Op:name=3SoD, flags=0
1396208726.881021:0:5933:0:(file.c:3517:ll_inode_permission()) VFS Op:inode=[0x200005221:0x98be:0x0](ffff88000a887b38), inode mode 41e8 mask 3
1396208726.881022:0:5933:0:(namei.c:1297:ll_unlink_generic()) VFS Op:name=3SoD, dir=[0x200005221:0x98be:0x0](ffff88000a887b38)
...
1396208726.955788:0:5933:0:(file.c:3517:ll_inode_permission()) VFS Op:inode=[0x200005221:0x65d5:0x0](ffff88005c63abb8), inode mode 41e8 mask 1
1396208726.955790:0:5933:0:(namei.c:568:ll_lookup_it()) VFS Op:name=3SoD, dir=[0x200005221:0x65d5:0x0](ffff88005c63abb8), intent=getattr
1396208726.955793:0:5933:0:(mdc_locks.c:1173:mdc_intent_lock()) (name: 3SoD,[0x0:0x0:0x0]) in obj [0x200005221:0x65d5:0x0], intent: getattr flags 00
...
1396208727.630480:0:6049:0:(debug.c:345:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl mark parallel-scale test_metabench: @@@@@@ FAIL: metabench failed! 1
1396208727.744342:0:6057:0:(debug.c:345:libcfs_debug_mark_buffer()) DEBUG MARKER: parallel-scale test_metabench: @@@@@@ FAIL: metabench failed! 1

This issue could be related to the test script or to metabench itself. |
| Comment by Hongchao Zhang [ 31/Mar/14 ] |
|
There is another type of failure in this ticket:

Proc 0: Unable to stat file [/mnt/lustre/d0.metabench/STAT_MD_008.000/STAT_007_008/Q4oDt.1EWkTu]: Input/output error

It is caused by the client being evicted by the OST because its ptlrpc_request for LDLM_BL_CALLBACK (opcode 104) timed out:

1394542614.447124:0:21137:0:(client.c:2121:ptlrpc_set_wait()) set ffff88001b35c880 going to sleep for 6 seconds
1394542614.447131:0:24355:0:(client.c:1901:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1394542606/real 1394542606] req@ffff88000afc3000 x1462244279669228/t0(0) o104->lustre-OST0003@10.10.4.120@tcp:15/16 lens 296/224 e 0 to 1 dl 1394542613 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
1394542614.447152:0:24355:0:(client.c:1901:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1394542606/real 1394542613] req@ffff88001ef28400 x1462244279669232/t0(0) o104->lustre-OST0003@10.10.4.121@tcp:15/16 lens 296/224 e 0 to 1 dl 1394542613 ref 1 fl Rpc:RXN/0/ffffffff rc 0/-1
1394542614.447157:0:24355:0:(client.c:2121:ptlrpc_set_wait()) set ffff880040e87ec0 going to sleep for 0 seconds
1394542614.447161:0:24355:0:(ldlm_lockd.c:523:ldlm_del_waiting_lock()) ### removed ns: filter-lustre-OST0003_UUID lock: ffff880072554340/0xd73a284e105aae19 lrc: 3/0,0 mode: PR/PR res: [0x543b1:0x0:0x0].0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000000010020 nid: 10.10.4.120@tcp remote: 0xfd059f9b5a2f235d expref: 1918 pid: 21114 timeout: 4304418095 lvb_type: 1
1394542614.447166:0:24355:0:(ldlm_lockd.c:594:ldlm_failed_ast()) 138-a: lustre-OST0003: A client on nid 10.10.4.120@tcp was evicted due to a lock blocking callback time out: rc -107

The request with rq_xid="x1462244279669228" was indeed received by the client:

1394542606.112063:0:19904:0:(service.c:1866:ptlrpc_server_handle_req_in()) got req x1462244279669228
1394542606.112066:0:19904:0:(service.c:1079:ptlrpc_update_export_timer()) updating export LOV_OSC_UUID at 1394542606 exp ffff880064797000
1394542606.112080:0:19904:0:(nrs_fifo.c:182:nrs_fifo_req_get()) NRS start fifo request from 12345-10.10.4.123@tcp, seq: 88055
1394542606.112086:0:19904:0:(service.c:2011:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cb00_001:LOV_OSC_UUID+4:24355:x1462244279669228:12345-10.10.4.123@tcp:104
1394542606.112112:0:19904:0:(service.c:2055:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cb00_001:LOV_OSC_UUID+4:24355:x1462244279669228:12345-10.10.4.123@tcp:104 Request procesed in 32us (61us total) trans 0 rc 0/0
1394542606.112118:0:19904:0:(nrs_fifo.c:244:nrs_fifo_req_stop()) NRS stop fifo request from 12345-10.10.4.123@tcp, seq: 88055

and the corresponding lock was subsequently cancelled:

1394542606.112122:0:19905:0:(ldlm_lockd.c:1655:ldlm_handle_bl_callback()) ### client blocking AST callback handler ns: lustre-OST0003-osc-ffff88007013b000 lock: ffff88004d2ae180/0xfd059f9b5a2f235d lrc: 3/0,0 mode: PR/PR res: [0x543b1:0x0:0x0].0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x420000000000 nid: local remote: 0xd73a284e105aae19 expref: -99 pid: 24556 timeout: 0 lvb_type: 1
1394542606.112127:0:19905:0:(ldlm_lockd.c:1668:ldlm_handle_bl_callback()) Lock ffff88004d2ae180 already unused, calling callback (ffffffffa0a55a30)
1394542606.112133:0:19905:0:(cl_lock.c:151:cl_lock_trace0()) cancel lock: ffff880028d504b0@(1 ffff8800654780c0 1 5 0 0 0 0)(ffff88007a495b18/1/1) at cl_lock_cancel():1834
1394542606.112137:0:19905:0:(vvp_io.c:1165:vvp_io_init()) [0x2000013aa:0x4476:0x0] ignore/verify layout 1/0, layout version 0 restore needed 0
1394542606.112143:0:19905:0:(vvp_io.c:153:vvp_io_fini()) [0x2000013aa:0x4476:0x0] ignore/verify layout 1/0, layout version 0 restore needed 0
1394542606.112146:0:19905:0:(ldlm_request.c:1127:ldlm_cli_cancel_local()) ### client-side cancel ns: lustre-OST0003-osc-ffff88007013b000 lock: ffff88004d2ae180/0xfd059f9b5a2f235d lrc: 4/0,0 mode: PR/PR res: [0x543b1:0x0:0x0].0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x428400000000 nid: local remote: 0xd73a284e105aae19 expref: -99 pid: 24556 timeout: 0 lvb_type: 1
1394542606.112154:0:19905:0:(cl_lock.c:151:cl_lock_trace0()) cancel lock: ffff880028d504b0@(2 ffff8800654780c0 2 5 0 0 0 1)(ffff88007a495920/1/1) at cl_lock_cancel():1834
1394542606.112157:0:19905:0:(cl_lock.c:151:cl_lock_trace0()) delete lock: ffff880028d504b0@(2 ffff8800654780c0 2 5 0 0 0 1)(ffff88007a495920/1/1) at cl_lock_delete():1781
1394542606.112171:0:19905:0:(ldlm_request.c:1186:ldlm_cancel_pack()) ### packing ns: lustre-OST0003-osc-ffff88007013b000 lock: ffff88004d2ae180/0xfd059f9b5a2f235d lrc: 2/0,0 mode: --/PR res: [0x543b1:0x0:0x0].0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x4c69400000000 nid: local remote: 0xd73a284e105aae19 expref: -99 pid: 24556 timeout: 0 lvb_type: 1
1394542606.112175:0:19905:0:(ldlm_request.c:1190:ldlm_cancel_pack()) 1 locks packed
1394542606.112180:0:19905:0:(cl_lock.c:151:cl_lock_trace0()) delete lock: ffff880028d504b0@(1 ffff8800654780c0 1 6 0 0 0 1)(ffff88007a495b18/1/1) at cl_lock_delete():1781
1394542606.112182:0:19905:0:(cl_lock.c:151:cl_lock_trace0()) free lock: ffff880028d504b0@(0 (null) 0 6 0 0 0 1)(ffff88007a495b18/1/0) at cl_lock_free():270
1394542606.112186:0:19905:0:(ldlm_lockd.c:1677:ldlm_handle_bl_callback()) ### client blocking callback handler END ns: lustre-OST0003-osc-ffff88007013b000 lock: ffff88004d2ae180/0xfd059f9b5a2f235d lrc: 1/0,0 mode: --/PR res: [0x543b1:0x0:0x0].0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x4c69400000000 nid: local remote: 0xd73a284e105aae19 expref: -99 pid: 24556 timeout: 0 lvb_type: 1
1394542606.112190:0:19905:0:(ldlm_lock.c:219:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: lustre-OST0003-osc-ffff88007013b000 lock: ffff88004d2ae180/0xfd059f9b5a2f235d lrc: 0/0,0 mode: --/PR res: [0x543b1:0x0:0x0].0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x4c69400000000 nid: local remote: 0xd73a284e105aae19 expref: -99 pid: 24556 timeout: 0 lvb_type: 1
1394542606.112200:0:2899:0:(client.c:1473:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_0:4581b497-cf77-5ea1-5ebc-c8dd032f07ff:2899:1462244323541000:10.10.4.123@tcp:103
1394542607.731980:0:2899:0:(ptlrpcd.c:387:ptlrpcd_check()) transfer 1 async RPCs [1->0]

However, the lock cancellation request (rq_xid=1462244323541000) was received by the server only after the above "LDLM_BL_CALLBACK" request had timed out:

1394542614.454055:0:17823:0:(service.c:1866:ptlrpc_server_handle_req_in()) got req x1462244323540996
1394542614.454067:0:25143:0:(service.c:1866:ptlrpc_server_handle_req_in()) got req x1462244323541000
1394542614.454071:0:25143:0:(service.c:1866:ptlrpc_server_handle_req_in()) got req x1462244323541008
1394542614.454074:0:25143:0:(service.c:1866:ptlrpc_server_handle_req_in()) got req x1462244323541044

So this kind of failure is most likely caused by heavy network traffic or a network problem. |
| Comment by Hongchao Zhang [ 01/Apr/14 ] |
|
According to the source code of metabench, this ticket could be caused by reading stale directory data in the "MDS_READPAGE" request. Code snippet from metabench/util.c:

void clear_dir(char* path, int recurse)
{
    ...
    struct dirent *entry;
    ...
    while ( (entry = readdir(dir)) != NULL ) {
        if (lstat(entry->d_name,&st)) {
            fatal("Unable to stat file [%s/%s]",path,entry->d_name);
        }
        if (S_ISDIR(st.st_mode) ) {
            if (! recurse) continue;
            if (strcmp(entry->d_name,".") == 0) continue;
            if (strcmp(entry->d_name,"..") == 0) continue;
            sprintf(newpath,"%s/%s",path,entry->d_name);
            level++;
            clear_dir(newpath,recurse);
            level--;
            if (testing) {
                printf("Would remove dir %s\n",newpath);
            } else if (rmdir(newpath)) {
                fatal("Cant remove directory [%s]",newpath);
            }
            dirs++;
        } else {
            if (testing) {
                printf("Would remove file %s/%s\n",path,entry->d_name);
            } else if (unlink(entry->d_name)) {
                fatal("Unable to remove file [%s]",entry->d_name);
            }
            files++;
        }
        num_files++;
    }
    ...
}

See test report https://maloo.whamcloud.com/sub_tests/2d11e162-b87b-11e3-ac5f-52540035b04c |
| Comment by Oleg Drokin [ 01/Apr/14 ] |
|
Well, if metabench already deleted the file, then a subsequent readdir should not have been able to see it. Lustre drops the readdir cache, and while glibc can cache the readdir result, is there any evidence that metabench reuses the same readdir stream by doing a rewind and readdir again?
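For reference, the "rewind and readdir again" pattern in question would look roughly like the following (hypothetical illustration only; there is no evidence that metabench does this):

#include <dirent.h>
#include <stdio.h>

/* Reuse a single directory stream for two passes via rewinddir().
 * glibc fills its readdir buffer from the kernel in batches, so
 * entries handed out late in a batch may already have been unlinked
 * by the time the caller acts on them. */
static void list_twice(const char *path)
{
	DIR *dir = opendir(path);
	struct dirent *entry;

	if (dir == NULL)
		return;
	while ((entry = readdir(dir)) != NULL)
		printf("first pass:  %s\n", entry->d_name);

	rewinddir(dir);		/* second pass over the same DIR stream */
	while ((entry = readdir(dir)) != NULL)
		printf("second pass: %s\n", entry->d_name);
	closedir(dir);
}
|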
| Comment by Hongchao Zhang [ 02/Apr/14 ] |
|
I have extracted the directory cleanup code from "metabench" into a new test that only deletes a directory tree, and found that this issue can be reproduced locally:

LustreError: 20820:0:(dir.c:259:ll_dir_read()) [0x200000400:0x776:0x0] filldir: K75udev-post
LustreError: 20820:0:(dir.c:259:ll_dir_read()) [0x200000400:0x776:0x0] filldir: K01certmonger
LustreError: 20820:0:(dir.c:307:ll_readdir()) VFS Op:inode=[0x200000400:0x776:0x0](ffff88003cba4678) pos/size9174905734640535999/4096 32bit_api 0
LustreError: 20820:0:(dir.c:307:ll_readdir()) VFS Op:inode=[0x200000400:0x5f3:0x0](ffff8800652bd138) pos/size8883493503157300815/4096 32bit_api 0
LustreError: 20820:0:(dir.c:259:ll_dir_read()) [0x200000400:0x5f3:0x0] filldir: K75udev-post

The file "K75udev-post" is not in "[0x200000400:0x5f3:0x0]" but in "[0x200000400:0x776:0x0]".

Steps to reproduce the issue:
1. copy /etc/ to /mnt/lustre/test/
2. use the new test (a sketch of it follows below) to clean up /mnt/lustre/test; it fails just as in this ticket (-ENOENT):

[root@zhanghc metabench]# ./test
[03/28/2014 02:55:43] FATAL error on process 0
Proc 0: Unable to stat file [/mnt/lustre/test/xinetd.d/environment]: No such file or directory   <-- Note: there is no "environment" file in the original "/etc/xinetd.d"

By the way, the command "rm -fr /mnt/lustre/test" succeeds in cleaning up the directory.

Will update the status soon.
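A minimal sketch of such a standalone cleanup test, assuming it simply mirrors metabench's clear_dir() loop (illustrative only, not the exact test used here):

#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

static void fatal(const char *msg)
{
	perror(msg);
	exit(1);
}

/* Walk a directory the same way metabench's clear_dir() does:
 * readdir(), lstat() each entry, recurse into subdirectories,
 * unlink files, and abort on the first stat failure. */
static void clear_dir(const char *path)
{
	DIR *dir = opendir(path);
	struct dirent *entry;
	char newpath[4096];
	struct stat st;

	if (dir == NULL)
		fatal(path);

	while ((entry = readdir(dir)) != NULL) {
		if (strcmp(entry->d_name, ".") == 0 ||
		    strcmp(entry->d_name, "..") == 0)
			continue;
		snprintf(newpath, sizeof(newpath), "%s/%s", path, entry->d_name);
		/* this is where the ticket fails: readdir() returned an
		 * entry that no longer exists, so lstat() gets -ENOENT */
		if (lstat(newpath, &st))
			fatal(newpath);
		if (S_ISDIR(st.st_mode)) {
			clear_dir(newpath);
			if (rmdir(newpath))
				fatal(newpath);
		} else if (unlink(newpath)) {
			fatal(newpath);
		}
	}
	closedir(dir);
}

int main(int argc, char **argv)
{
	clear_dir(argc > 1 ? argv[1] : "/mnt/lustre/test");
	return 0;
}
|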
| Comment by Hongchao Zhang [ 02/Apr/14 ] |
|
Status update: this issue is caused by the patch http://review.whamcloud.com/#/c/7196/ |
| Comment by Hongchao Zhang [ 03/Apr/14 ] |
|
There is a problem in the following hunk of the patch http://review.whamcloud.com/#/c/7196/:

diff --git a/lustre/osd-ldiskfs/osd_handler.c b/lustre/osd-ldiskfs/osd_handler.c
@@ -4831,14 +4836,16 @@ static int osd_ldiskfs_it_fill(const struct lu_env *env,
 	else
 		up_read(&obj->oo_ext_idx_sem);
 
-	if (it->oie_rd_dirent == 0) {
-		result = -EIO;
-	} else {
-		it->oie_dirent = it->oie_buf;
-		it->oie_it_dirent = 1;
-	}
+	if (it->oie_rd_dirent == 0) {
+		/*If it does not get any dirent, it means it has been reached
+		 *to the end of the dir */
+		it->oie_file.f_pos = ldiskfs_get_htree_eof(&it->oie_file);   <-- here, we should return an error atm, otherwise the caller may keep using the "it"
+	} else {
+		it->oie_dirent = it->oie_buf;
+		it->oie_it_dirent = 1;
+	}
 
-	RETURN(result);
+	RETURN(result);
 }

If no more entries are found in the directory, we should return some kind of error to tell the caller not to use the "it". By the way, this only fixes the -ENOENT problem (No such file or directory); the other kind of problem in this ticket (Input/output error) is likely caused by the network issue described in the earlier comment.
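For illustration, one possible shape of the change being suggested here (a sketch only: the +1 "end of directory" return value is an assumption, and the actual fix is the patch referenced in the next comment):

	if (it->oie_rd_dirent == 0) {
		/* no dirent was read: the end of the directory has been
		 * reached, so report that to the caller instead of
		 * returning success, and do not let it keep consuming
		 * the (now empty/stale) entry buffer in "it" */
		it->oie_file.f_pos = ldiskfs_get_htree_eof(&it->oie_file);
		result = 1;	/* assumed convention for "no more entries" */
	} else {
		it->oie_dirent = it->oie_buf;
		it->oie_it_dirent = 1;
	}

	RETURN(result);
|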
| Comment by Hongchao Zhang [ 03/Apr/14 ] |
|
the patch is tracked at http://review.whamcloud.com/#/c/9880/ |
| Comment by Hongchao Zhang [ 14/May/14 ] |
|
The patch for this issue (http://review.whamcloud.com/#/c/9880/) has been included in http://review.whamcloud.com/#/c/9511/, which has landed on master. |
| Comment by Peter Jones [ 14/May/14 ] |
|
It sounds like this issue is a duplicate of |