[LU-4125] sanity-hsm test_228 failure: 'request on 0x20000040b:0x61:0x0 is not SUCCEED' Created: 18/Oct/13  Updated: 03/Jun/14  Resolved: 31/Mar/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: Bruno Faccini (Inactive)
Resolution: Duplicate Votes: 0
Labels: HSM
Environment:

Lustre 2.5.0-RC1, el6

OpenSFS cluster with combined MGS/MDS (c03), single OSS (c04) with two OSTs, archive MGS/MDS (c05), archive OSS (c06) with two OSTs, archive OSS2 (c07) with two OSTs, and eight clients: one agent + client (c08), one robinhood/db + client (c09), and the others running as plain Lustre clients (c10, c11, c12, c13, c14, c15)


Issue Links:
Duplicate
is duplicated by LU-4343 sanity-hsm test_228 failure: FAIL: ta... Resolved
Severity: 3
Rank (Obsolete): 11134

 Description   

Test results at: https://maloo.whamcloud.com/test_sessions/af9dfc14-3834-11e3-8bc4-52540035b04c

From the test_log:

== sanity-hsm test 228: On released file, return extend to FIEMAP. For [cp,tar] --sparse == 08:09:43 (1382108983)
pdsh@c15: c08: ssh exited with exit code 1
Purging archive on c08
Starting copytool agt1 on c08
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.145494 s, 7.2 MB/s
Changed after 0s: from '' to 'STARTED'
Waiting 100 secs for update
Changed after 1s: from 'STARTED' to 'FAILED'
Waiting 90 secs for update
Waiting 80 secs for update
Waiting 70 secs for update
Waiting 60 secs for update
Waiting 50 secs for update
Changed after 60s: from 'FAILED' to ''
Waiting 40 secs for update
Waiting 30 secs for update
Waiting 20 secs for update
Waiting 10 secs for update
Update not seen after 100s: wanted 'SUCCEED' got ''
 sanity-hsm test_228: @@@@@@ FAIL: request on 0x20000040b:0x61:0x0 is not SUCCEED 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:4264:error_noexit()
  = /usr/lib64/lustre/tests/test-framework.sh:4291:error()
  = /usr/lib64/lustre/tests/sanity-hsm.sh:474:wait_request_state()
  = /usr/lib64/lustre/tests/sanity-hsm.sh:3085:test_228()
  = /usr/lib64/lustre/tests/test-framework.sh:4530:run_one()
  = /usr/lib64/lustre/tests/test-framework.sh:4563:run_one_logged()
  = /usr/lib64/lustre/tests/test-framework.sh:4433:run_test()
  = /usr/lib64/lustre/tests/sanity-hsm.sh:3111:main()
Dumping lctl log to /tmp/test_logs/2013-10-18/074316/sanity-hsm.test_228.*.1382109102.log
Copytool is stopped on c08

From the copytool_log on c08:

lhsmtool_posix[26046]: action=0 src=(null) dst=(null) mount_point=/lustre/scratch
lhsmtool_posix[26047]: waiting for message from kernel
lhsmtool_posix[26047]: copytool fs=scratch archive#=2 item_count=1
lhsmtool_posix[26047]: waiting for message from kernel
lhsmtool_posix[26048]: '[0x20000040b:0x61:0x0]' action ARCHIVE reclen 72, cookie=0x52614df7
lhsmtool_posix[26048]: processing file 'f.sanity-hsm.228'
lhsmtool_posix[26048]: archiving '/lustre/scratch/.lustre/fid/0x20000040b:0x61:0x0' to '/lustre/archive/0061/0000/040b/0000/0002/0000/0x20000040b:0x61:0x0_tmp'
lhsmtool_posix[26048]: saving stripe info of '/lustre/scratch/.lustre/fid/0x20000040b:0x61:0x0' in /lustre/archive/0061/0000/040b/0000/0002/0000/0x20000040b:0x61:0x0_tmp.lov
lhsmtool_posix[26048]: going to copy data from '/lustre/scratch/.lustre/fid/0x20000040b:0x61:0x0' to '/lustre/archive/0061/0000/040b/0000/0002/0000/0x20000040b:0x61:0x0_tmp'
lhsmtool_posix[26048]: progress ioctl for copy '/lustre/scratch/.lustre/fid/0x20000040b:0x61:0x0'->'/lustre/archive/0061/0000/040b/0000/0002/0000/0x20000040b:0x61:0x0_tmp' failed: No such file or directory (2)
lhsmtool_posix[26048]: data copy failed from '/lustre/scratch/.lustre/fid/0x20000040b:0x61:0x0' to '/lustre/archive/0061/0000/040b/0000/0002/0000/0x20000040b:0x61:0x0_tmp': No such file or directory (2)
lhsmtool_posix[26048]: Action completed, notifying coordinator cookie=0x52614df7, FID=[0x20000040b:0x61:0x0], hp_flags=0 err=2
lhsmtool_posix[26048]: llapi_hsm_action_end() on '/lustre/scratch/.lustre/fid/0x20000040b:0x61:0x0' failed: No such file or directory (2)
exiting: Interrupt

From dmesg on the MDS (c03):

Lustre: DEBUG MARKER: == sanity-hsm test 228: On released file, return extend to FIEMAP. For [cp,tar] --sparse == 08:09:43 (1382108983)
LustreError: 7622:0:(mdt_coordinator.c:1448:mdt_hsm_update_request_state()) scratch-MDT0000: Cannot find running request for cookie 0x52614df7 on fid=[0x20000040b:0x61:0x0]
LustreError: 7622:0:(mdt_coordinator.c:1448:mdt_hsm_update_request_state()) scratch-MDT0000: Cannot find running request for cookie 0x52614df7 on fid=[0x20000040b:0x61:0x0]
Lustre: DEBUG MARKER: sanity-hsm test_228: @@@@@@ FAIL: request on 0x20000040b:0x61:0x0 is not SUCCEED


 Comments   
Comment by Bruno Faccini (Inactive) [ 25/Oct/13 ]

James, did you encounter the same kind of error multiple times during your testing on the OpenSFS cluster?
BTW, I am asking because multiple sanity-hsm sub-tests failed within the same suite run, which has been prolific in terms of errors/problems and generated tickets! This may come from exposure to a new configuration, or maybe from some external factor?

These particular lines/errors seen by the copytool indicate a real problem during the ARCHIVE from sanity-hsm/test_228:

lhsmtool_posix[26048]: progress ioctl for copy '/lustre/scratch/.lustre/fid/0x20000040b:0x61:0x0'->'/lustre/archive/0061/0000/040b/0000/0002/0000/0x20000040b:0x61:0x0_tmp' failed: No such file or directory (2)
lhsmtool_posix[26048]: data copy failed from '/lustre/scratch/.lustre/fid/0x20000040b:0x61:0x0' to '/lustre/archive/0061/0000/040b/0000/0002/0000/0x20000040b:0x61:0x0_tmp': No such file or directory (2)
lhsmtool_posix[26048]: Action completed, notifying coordinator cookie=0x52614df7, FID=[0x20000040b:0x61:0x0], hp_flags=0 err=2
lhsmtool_posix[26048]: llapi_hsm_action_end() on '/lustre/scratch/.lustre/fid/0x20000040b:0x61:0x0' failed: No such file or directory (2)

And it seems to correspond to the following error in the agent node's dmesg:

Lustre: DEBUG MARKER: == sanity-hsm test 228: On released file, return extend to FIEMAP. For [cp,tar] --sparse == 08:09:43 (1382108983)
LustreError: 11-0: scratch-OST0000-osc-ffff8806309c0400: Communicating with 192.168.2.104@o2ib, operation ost_getattr failed with -12.
Lustre: DEBUG MARKER: sanity-hsm test_228: @@@@@@ FAIL: request on 0x20000040b:0x61:0x0 is not SUCCEED

And also to the corresponding error in the OSS dmesg:

Lustre: DEBUG MARKER: == sanity-hsm test 228: On released file, return extend to FIEMAP. For [cp,tar] --sparse == 08:09:43 (1382108983)
LustreError: 11216:0:(ldlm_resource.c:1188:ldlm_resource_get()) scratch-OST0000: lvbo_init failed for resource 0x18c:0x0: rc = -2
Lustre: DEBUG MARKER: sanity-hsm test_228: @@@@@@ FAIL: request on 0x20000040b:0x61:0x0 is not SUCCEED
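
For reference, a minimal sketch of pulling the matching messages off the involved nodes, assuming pdsh access to c03 (MDS), c04 (OSS) and c08 (agent) as in the Environment above; the cookie and FID are the ones from the logs quoted here:

COOKIE=0x52614df7
FID=0x20000040b:0x61:0x0
# dmesg on each node should show the coordinator/OSS errors for this request
pdsh -w c03,c04,c08 "dmesg | grep -e $COOKIE -e $FID"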
Comment by Bruno Faccini (Inactive) [ 04/Nov/13 ]

According to Maloo/auto-test stats, this problem has not recurred yet. What about on your side, James?

Comment by James Nunez (Inactive) [ 04/Nov/13 ]

I just installed the latest master build and ran sanity-hsm; only tests 90 and 300 failed. I then ran sanity-hsm test 228 alone and it failed with the same error message as above. So, I can get this test to fail, but not consistently.
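
For reference, a rough sketch of the two ways I know of to re-run just this sub-test (exact option handling may vary between Lustre versions; cfg/local.sh is assumed to be set up as usual):

ONLY=228 bash sanity-hsm.sh                    # test-framework.sh honours the ONLY variable
./auster -v -k -f ncli sanity-hsm --only 228   # or via auster's per-suite --only option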

Comment by Bruno Faccini (Inactive) [ 02/Dec/13 ]

James,
I am sorry, but I am currently unable to reproduce this. I ran test_228 in a loop on different local configs with no failure at all.
Do you still see this? If so, can you describe in more detail the platform/configuration you use when you trigger the problem? In the "Environment" section of this ticket you indicate "OpenSFS cluster"; did you only see the problem there?

Comment by James Nunez (Inactive) [ 03/Dec/13 ]

Bruno,

I just ran sanity-hsm on the OpenSFS cluster and was able to reproduce the test 228 sanity-hsm failure. Here's the set up:

Lustre file system with combined MDS/MGS, single OSS with two OSTs
archive Lustre file system with combined MDS/MGS, single OSS with two OSTs
One HSM agent + Lustre client (c12) - mounts both file systems
One Robinhood DB node running Robinhood 2.5.0 + Lustre client(c14) - does not mount archive file system
One additional Lustre client (c15) - this is where I ran the tests, does not mount archive file system

Using Lustre:
lustre: 2.5.51
kernel: patchless_client
build: jenkins-arch=x86_64,build_type=client,distro=el6,ib_stack=inkernel-1790-g645d1e9-PRISTINE-2.6.32-358.23.2.el6.x86_64

First I ran the whole sanity-hsm suite of tests and got one failure:

./auster -v -k -f ncli sanity-hsm
...
== sanity-hsm test complete, duration 1970 sec == 10:47:42 (1386096462)
sanity-hsm: FAIL: test_300 hsm_control state is not 'stopped' on mds1

I then ran sanity-hsm again and got several failures. I did nothing to the nodes in between these two runs of sanity-hsm:

./auster -v -k -f ncli sanity-hsm
...
== sanity-hsm test complete, duration 2117 sec == 11:32:11 (1386099131)
sanity-hsm: FAIL: test_58 truncate 3158 does not trig restore, state = 
sanity-hsm: FAIL: test_228 request on 0x200000406:0x63:0x0 is not SUCCEED
sanity-hsm: FAIL: test_300 hsm_control state is not 'stopped' on mds1

There may be an issue with some file left behind in the file system or archive that is impacting test 228. I say this because, in my limited testing, test 228 doesn't fail on the first run of sanity-hsm, only after a full run of sanity-hsm has completed. Maybe the test 300 failure influences test 228? Does test 300 fail for you?

Keeping everything else the same, I changed to using c12, c13 and c15 as clients because I didn't want to run anything on the DB node. With that change, the failures were the same: I still get test 300 failing on the first pass of sanity-hsm, and tests 228 and 300 failing on the second run. Test 58 did not fail this time. Running test 228 alone after these two sanity-hsm runs fails with the same error message.*

I have not tried to reproduce this error on any other cluster. I'll leave the nodes alone for a while and can provide more configuration information if you tell me what you need.

I'm happy to try some other configuration, try some other tests/test sequence or let you get on the nodes to look around.

I've added the following to cfg/local.sh:

AGTCOUNT=1
AGTDEV1="/lustre/archive"
agt1_HOST="c12"
HSMTOOL_VERBOSE="-vvvvvv"
  • Note: Until LU-2524 lands, the mkdir in test 228 needs a -p when run alone. I'll open a ticket for this.
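
Given those cfg/local.sh settings, a rough sketch of starting the copytool by hand on the agent (the archive index 2 and the /lustre/scratch mount point come from the copytool log above; flag names are from my memory of lhsmtool_posix, so please check --help before relying on them):

# start the POSIX copytool on the agent node, pointing at the archive root
ssh c12 "lhsmtool_posix -vvvvvv --daemon --hsm-root /lustre/archive --archive 2 /lustre/scratch"
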
Comment by Bruno Faccini (Inactive) [ 04/Dec/13 ]

Hello James,
Concerning the test_300 failures for "hsm_control state is not 'stopped' on mds1": I see the tag of the version you ran is 2.5.51, which may not include my patch for LU-4065 that should fix this problem and landed recently. Can you check/test with the latest master?

Also, it is true that when I run the full sanity-hsm suite or sub-tests in a loop, I always clear the Lustre and archive file systems between loops, since I have also found that left-over files/archives can impact subsequent runs/loops.
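
A rough sketch of such a loop, using the archive path from this ticket; the Lustre-side cleanup path is purely illustrative and depends on the test directory your setup uses:

for i in $(seq 1 50); do
    ONLY=228 bash sanity-hsm.sh || break
    rm -rf /lustre/archive/*          # clear archived copies and the shadow tree between loops
    rm -rf /lustre/scratch/d228*      # illustrative: remove test_228 leftovers on the Lustre side
done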

Comment by James Nunez (Inactive) [ 04/Dec/13 ]

The testing reported above was run with the latest master at the time of setting up the nodes; build #1790. The patch for LU-4065 was included in build #1765. So, I was testing with this patch.

Comment by James Nunez (Inactive) [ 05/Dec/13 ]

I just uploaded some results for sanity-hsm test 228 failing on the OpenSFS cluster at https://maloo.whamcloud.com/test_sessions/643460bc-5d4e-11e3-956b-52540035b04c . I don't know if there is any new information here; a quick review of the logs shows the same as what was posted previously.

The node setup is the same as described in my comment above from Dec 03 and these sanity-hsm results were from the second run of sanity-hsm; one run where only test 300 failed and then the second run, the results above, where 300 and 228 failed. I then ran test 228 alone and it failed. I then cleaned out the archive and test 228 passed. Even when it passes, test 228 doesn't clean up after itself. So, I just ran test 228 again, with something in the archive and it failed. Clean out the archive again and test 228 passes.

With some more digging, it looks like the test fails when the shadow directory is present in the archive when test 228 starts. The key is what is in the shadow directory: when test 228 runs with /lustre/archive/shadow/<file name that test 228 is writing to> already present, the test fails. If there is no shadow directory, or if the file in the shadow directory does not have the same name as the one test 228 is trying to write to, test 228 passes.
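
A quick way to check for, and remove, such a stale shadow entry before a run, based on the paths in this ticket and the file name from the copytool log (f.sanity-hsm.228):

ls -lR /lustre/archive/shadow 2>/dev/null
# remove only the entry that collides with the file test 228 is about to create
find /lustre/archive/shadow -name 'f.sanity-hsm.228' -print -delete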

Comment by Bruno Faccini (Inactive) [ 09/Dec/13 ]

James, thanks for your investigation and comments!
Hmmm, the orphan entry in the shadow directory is a good starting point for finding the cause of the failed HSM ARCHIVE op. I will investigate in this direction.

Comment by Bruno Faccini (Inactive) [ 12/Dec/13 ]

Also, I wonder if the patch you pushed for LU-4343 (another issue with test_228!), where you clean up the files left behind by test_228, could also be a good work-around for this ticket?

Comment by Bruno Faccini (Inactive) [ 11/Mar/14 ]

James,

I did some analysis of the 24 sanity-hsm/test_228 failures in auto-tests since mid-December and did not find any that could be linked to the situation described in this ticket. In fact, these auto-test failures occurred after almost all of the preceding sub-tests (starting with test_8!) in sanity-hsm had already failed for the same reason, as if every ARCHIVE request failed.

So again, I still strongly suspect that the cleanup you introduced as part of LU-4343 should have fixed the issue for this ticket.

What do you think ??

Comment by James Nunez (Inactive) [ 12/Mar/14 ]

Bruno,

I'm testing with 2.5.1-RC3, running sanity-hsm on the OpenSFS cluster, and I am not able to recreate this problem. Test 228 succeeds both when running the full sanity-hsm suite and when running just test 228 alone.

James

Comment by Bruno Faccini (Inactive) [ 21/Mar/14 ]

OK, so do you agree if we close this ticket as a duplicate of LU-4343?

Comment by James Nunez (Inactive) [ 21/Mar/14 ]

Since I can't reproduce anymore, I'm fine with closing this ticket.
