[LU-4125] sanity-hsm test_228 failure: 'request on 0x20000040b:0x61:0x0 is not SUCCEED' Created: 18/Oct/13 Updated: 03/Jun/14 Resolved: 31/Mar/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | Bruno Faccini (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | HSM |
| Environment: |
Lustre 2.5.0-RC1, el6; OpenSFS cluster with combined MGS/MDS (c03), single OSS (c04) with two OSTs, archive MGS/MDS (c05), archive OSS (c06) with two OSTs, archive OSS2 (c07) with two OSTs, and eight clients: one agent + client (c08), one robinhood/db + client (c09), and the others running as plain Lustre clients (c10, c11, c12, c13, c14, c15) |
||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 11134 |
| Description |
|
Test results at: https://maloo.whamcloud.com/test_sessions/af9dfc14-3834-11e3-8bc4-52540035b04c

From the test_log:

== sanity-hsm test 228: On released file, return extend to FIEMAP. For [cp,tar] --sparse == 08:09:43 (1382108983)
pdsh@c15: c08: ssh exited with exit code 1
Purging archive on c08
Starting copytool agt1 on c08
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.145494 s, 7.2 MB/s
Changed after 0s: from '' to 'STARTED'
Waiting 100 secs for update
Changed after 1s: from 'STARTED' to 'FAILED'
Waiting 90 secs for update
Waiting 80 secs for update
Waiting 70 secs for update
Waiting 60 secs for update
Waiting 50 secs for update
Changed after 60s: from 'FAILED' to ''
Waiting 40 secs for update
Waiting 30 secs for update
Waiting 20 secs for update
Waiting 10 secs for update
Update not seen after 100s: wanted 'SUCCEED' got ''
sanity-hsm test_228: @@@@@@ FAIL: request on 0x20000040b:0x61:0x0 is not SUCCEED
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:4264:error_noexit()
  = /usr/lib64/lustre/tests/test-framework.sh:4291:error()
  = /usr/lib64/lustre/tests/sanity-hsm.sh:474:wait_request_state()
  = /usr/lib64/lustre/tests/sanity-hsm.sh:3085:test_228()
  = /usr/lib64/lustre/tests/test-framework.sh:4530:run_one()
  = /usr/lib64/lustre/tests/test-framework.sh:4563:run_one_logged()
  = /usr/lib64/lustre/tests/test-framework.sh:4433:run_test()
  = /usr/lib64/lustre/tests/sanity-hsm.sh:3111:main()
Dumping lctl log to /tmp/test_logs/2013-10-18/074316/sanity-hsm.test_228.*.1382109102.log
Copytool is stopped on c08

From the copytool_log on c08:

lhsmtool_posix[26046]: action=0 src=(null) dst=(null) mount_point=/lustre/scratch
lhsmtool_posix[26047]: waiting for message from kernel
lhsmtool_posix[26047]: copytool fs=scratch archive#=2 item_count=1
lhsmtool_posix[26047]: waiting for message from kernel
lhsmtool_posix[26048]: '[0x20000040b:0x61:0x0]' action ARCHIVE reclen 72, cookie=0x52614df7
lhsmtool_posix[26048]: processing file 'f.sanity-hsm.228'
lhsmtool_posix[26048]: archiving '/lustre/scratch/.lustre/fid/0x20000040b:0x61:0x0' to '/lustre/archive/0061/0000/040b/0000/0002/0000/0x20000040b:0x61:0x0_tmp'
lhsmtool_posix[26048]: saving stripe info of '/lustre/scratch/.lustre/fid/0x20000040b:0x61:0x0' in /lustre/archive/0061/0000/040b/0000/0002/0000/0x20000040b:0x61:0x0_tmp.lov
lhsmtool_posix[26048]: going to copy data from '/lustre/scratch/.lustre/fid/0x20000040b:0x61:0x0' to '/lustre/archive/0061/0000/040b/0000/0002/0000/0x20000040b:0x61:0x0_tmp'
lhsmtool_posix[26048]: progress ioctl for copy '/lustre/scratch/.lustre/fid/0x20000040b:0x61:0x0'->'/lustre/archive/0061/0000/040b/0000/0002/0000/0x20000040b:0x61:0x0_tmp' failed: No such file or directory (2)
lhsmtool_posix[26048]: data copy failed from '/lustre/scratch/.lustre/fid/0x20000040b:0x61:0x0' to '/lustre/archive/0061/0000/040b/0000/0002/0000/0x20000040b:0x61:0x0_tmp': No such file or directory (2)
lhsmtool_posix[26048]: Action completed, notifying coordinator cookie=0x52614df7, FID=[0x20000040b:0x61:0x0], hp_flags=0 err=2
lhsmtool_posix[26048]: llapi_hsm_action_end() on '/lustre/scratch/.lustre/fid/0x20000040b:0x61:0x0' failed: No such file or directory (2)
exiting: Interrupt

From dmesg on the MDS (c03):

Lustre: DEBUG MARKER: == sanity-hsm test 228: On released file, return extend to FIEMAP. For [cp,tar] --sparse == 08:09:43 (1382108983)
LustreError: 7622:0:(mdt_coordinator.c:1448:mdt_hsm_update_request_state()) scratch-MDT0000: Cannot find running request for cookie 0x52614df7 on fid=[0x20000040b:0x61:0x0]
LustreError: 7622:0:(mdt_coordinator.c:1448:mdt_hsm_update_request_state()) scratch-MDT0000: Cannot find running request for cookie 0x52614df7 on fid=[0x20000040b:0x61:0x0]
Lustre: DEBUG MARKER: sanity-hsm test_228: @@@@@@ FAIL: request on 0x20000040b:0x61:0x0 is not SUCCEED |
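The state transitions above ('' -> STARTED -> FAILED -> '') are reported by wait_request_state() in sanity-hsm.sh, which polls the HSM coordinator's action list on the MDS. The same state can be checked by hand with something like the sketch below (the fsname 'scratch' and the FID are taken from the logs above; the exact field layout of hsm.actions may differ between Lustre versions):

  # on the MDS: dump all requests currently known to the coordinator
  lctl get_param -n mdt.scratch-MDT0000.hsm.actions

  # show only the status field of the request for the failing FID
  lctl get_param -n mdt.scratch-MDT0000.hsm.actions |
      grep '0x20000040b:0x61:0x0' |
      sed -e 's/.*status=\([A-Z]*\).*/\1/'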
| Comments |
| Comment by Bruno Faccini (Inactive) [ 25/Oct/13 ] |
|
James, did you encounter the same kind of error multiple times during your testing on the OpenSFS cluster?

These particular lines/errors seen by the copytool indicate a real problem during ARCHIVE in sanity-hsm/test_228:

lhsmtool_posix[26048]: progress ioctl for copy '/lustre/scratch/.lustre/fid/0x20000040b:0x61:0x0'->'/lustre/archive/0061/0000/040b/0000/0002/0000/0x20000040b:0x61:0x0_tmp' failed: No such file or directory (2)
lhsmtool_posix[26048]: data copy failed from '/lustre/scratch/.lustre/fid/0x20000040b:0x61:0x0' to '/lustre/archive/0061/0000/040b/0000/0002/0000/0x20000040b:0x61:0x0_tmp': No such file or directory (2)
lhsmtool_posix[26048]: Action completed, notifying coordinator cookie=0x52614df7, FID=[0x20000040b:0x61:0x0], hp_flags=0 err=2
lhsmtool_posix[26048]: llapi_hsm_action_end() on '/lustre/scratch/.lustre/fid/0x20000040b:0x61:0x0' failed: No such file or directory (2)

And it seems to correspond to the following error in the agent node's dmesg:

Lustre: DEBUG MARKER: == sanity-hsm test 228: On released file, return extend to FIEMAP. For [cp,tar] --sparse == 08:09:43 (1382108983)
LustreError: 11-0: scratch-OST0000-osc-ffff8806309c0400: Communicating with 192.168.2.104@o2ib, operation ost_getattr failed with -12.
Lustre: DEBUG MARKER: sanity-hsm test_228: @@@@@@ FAIL: request on 0x20000040b:0x61:0x0 is not SUCCEED

And also to the corresponding error in the OSS dmesg:

Lustre: DEBUG MARKER: == sanity-hsm test 228: On released file, return extend to FIEMAP. For [cp,tar] --sparse == 08:09:43 (1382108983)
LustreError: 11216:0:(ldlm_resource.c:1188:ldlm_resource_get()) scratch-OST0000: lvbo_init failed for resource 0x18c:0x0: rc = -2
Lustre: DEBUG MARKER: sanity-hsm test_228: @@@@@@ FAIL: request on 0x20000040b:0x61:0x0 is not SUCCEED |
| Comment by Bruno Faccini (Inactive) [ 04/Nov/13 ] |
|
According to Maloo/auto-test stats, this problem has not reoccurred yet. What about on your side, James? |
| Comment by James Nunez (Inactive) [ 04/Nov/13 ] |
|
I just installed the latest master build and ran sanity-hsm; only tests 90 and 300 failed. I then ran sanity-hsm test 228 alone and it failed with the same error message as above. So, I can get this test to fail, but not consistently. |
| Comment by Bruno Faccini (Inactive) [ 02/Dec/13 ] |
|
James, |
| Comment by James Nunez (Inactive) [ 03/Dec/13 ] |
|
Bruno,

I just ran sanity-hsm on the OpenSFS cluster and was able to reproduce the test 228 sanity-hsm failure. Here's the setup: Lustre file system with combined MDS/MGS, single OSS with two OSTs.

Using Lustre:

First I ran the whole sanity-hsm suite of tests and got one failure:

./auster -v -k -f ncli sanity-hsm
...
== sanity-hsm test complete, duration 1970 sec == 10:47:42 (1386096462)
sanity-hsm: FAIL: test_300 hsm_control state is not 'stopped' on mds1

I then ran sanity-hsm again and got several failures. I did nothing to the nodes in between these two runs of sanity-hsm:

./auster -v -k -f ncli sanity-hsm
...
== sanity-hsm test complete, duration 2117 sec == 11:32:11 (1386099131)
sanity-hsm: FAIL: test_58 truncate 3158 does not trig restore, state =
sanity-hsm: FAIL: test_228 request on 0x200000406:0x63:0x0 is not SUCCEED
sanity-hsm: FAIL: test_300 hsm_control state is not 'stopped' on mds1

There may be an issue with some file left behind in the file system or archive that is impacting test 228. I say this because, in my limited testing, test 228 doesn't fail on the first run of sanity-hsm, only after a full run of sanity-hsm has completed. Maybe something with test 300 failing influences test 228? Does test 300 fail for you?

Keeping everything else the same, I changed to using c12, c13 and c15 as clients because I didn't want to run anything on the DB node. When I made that change, the failures were the same: I still get test 300 failing on the first pass of sanity-hsm, and tests 228 and 300 failing on the second run of sanity-hsm. Test 58 did not fail this time. Running test 228 alone after these two sanity-hsm runs fails with the same error message.

I have not tried to reproduce this error on any other cluster. I'll leave the nodes alone for a while and can provide more configuration information if you tell me what you need. I'm happy to try some other configuration, try some other tests/test sequence, or let you get on the nodes to look around.

I've added the following to cfg/local.sh:

AGTCOUNT=1
AGTDEV1="/lustre/archive"
agt1_HOST="c12"
HSMTOOL_VERBOSE="-vvvvvv"
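For reference, the failing subtest can be re-run on its own, without the rest of the suite; a sketch, using the standard ONLY variable from test-framework.sh (auster's --only option may not be available in older versions):

  # directly through the test script
  ONLY=228 /usr/lib64/lustre/tests/sanity-hsm.sh

  # or through auster
  ./auster -v -k -f ncli sanity-hsm --only 228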
|
| Comment by Bruno Faccini (Inactive) [ 04/Dec/13 ] |
|
Hello James,

Also, it is true that when I run sanity-hsm full/sub-tests in a loop, I always clear the Lustre and archive file systems between loops, since I have also found that left-over files/archives can impact the next runs/loops … |
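A minimal sketch of such a between-loop cleanup, assuming the POSIX copytool backend at /lustre/archive as configured above and that no copytool is running while it is purged:

  # on the agent node: purge the archive backend, including the
  # shadow tree the POSIX copytool maintains
  rm -rf /lustre/archive/*

  # on the MDS: drop any requests still queued in the coordinator
  # (sanity-hsm.sh itself uses 'purge' for this)
  lctl set_param mdt.scratch-MDT0000.hsm_control=purge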
| Comment by James Nunez (Inactive) [ 04/Dec/13 ] |
|
The testing reported above was run with the latest master at the time the nodes were set up, build #1790. The patch for |
| Comment by James Nunez (Inactive) [ 05/Dec/13 ] |
|
I just uploaded some results for sanity-hsm test 228 failing on the OpenSFS cluster at https://maloo.whamcloud.com/test_sessions/643460bc-5d4e-11e3-956b-52540035b04c . I don't know if there is any new information here; a quick review of the logs looks the same as what was posted previously. The node setup is the same as described in my comment above from Dec 03, and these sanity-hsm results were from the second run of sanity-hsm: one run where only test 300 failed, and then a second run, the results above, where 300 and 228 failed.

I then ran test 228 alone and it failed. I then cleaned out the archive and test 228 passed. Even when it passes, test 228 doesn't clean up after itself. So, I ran test 228 again, with something in the archive, and it failed. Cleaning out the archive again, test 228 passes.

With some more digging, it looks like the test fails when the shadow directory is in the archive when test 228 starts. The key here is what is in the shadow directory. Running test 228 with /lustre/archive/shadow/<file name that test 228 is writing to> present, the test fails. If there is no shadow directory, or if the file in the shadow directory does not have the same name as the one test 228 is trying to write to, test 228 passes. |
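A quick way to check for that condition before starting the test, as a sketch (the file name 'f.sanity-hsm.228' comes from the copytool log above; the exact sub-path under shadow/ mirrors the file's path inside the Lustre mount, so it may sit below a test directory rather than at the top level):

  # on the agent node: is there a stale shadow entry for test 228's file?
  find /lustre/archive/shadow -name 'f.sanity-hsm.228' 2>/dev/null

  # removing it (or clearing the archive entirely) before the run
  # matches the condition under which test 228 passes
  find /lustre/archive/shadow -name 'f.sanity-hsm.228' -exec rm -rf {} +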
| Comment by Bruno Faccini (Inactive) [ 09/Dec/13 ] |
|
James, thanks for your searches and comments! |
| Comment by Bruno Faccini (Inactive) [ 12/Dec/13 ] |
|
Also I wonder if the patch you pushed for |
| Comment by Bruno Faccini (Inactive) [ 11/Mar/14 ] |
|
James, I did some analysis of the 24 sanity-hsm/test_228 failures in auto-tests since mid-December and I did not find any that could be linked to the situation described in this ticket. In fact, these auto-test failures occurred after almost all of the preceding subtests (starting with test_8!) in sanity-hsm had already failed for the same reason, as if every ARCHIVE request failed.

So again, I still strongly suspect that the cleanup you introduced as part of

What do you think? |
| Comment by James Nunez (Inactive) [ 12/Mar/14 ] |
|
Bruno, I'm testing 2.5.1-RC3 by running sanity-hsm on the OpenSFS cluster. I am not able to recreate this problem. Test 228 succeeds both when running the full sanity-hsm suite and when running just test 228 alone. James |
| Comment by Bruno Faccini (Inactive) [ 21/Mar/14 ] |
|
Ok so, do you agree to close this ticket as a dup of |
| Comment by James Nunez (Inactive) [ 21/Mar/14 ] |
|
Since I can't reproduce it anymore, I'm fine with closing this ticket. |