[LU-4035] sanity-hsm test_58 failure: 'truncate 3158 does not trig restore, state = '
Created: 01/Oct/13 | Updated: 19/Jan/22 | Resolved: 04/Dec/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.0, Lustre 2.6.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | HSM |
| Environment: | Lustre 2.4.93 build # 1687 |
| Severity: | 3 |
| Rank (Obsolete): | 10839 |
| Description |
|
Test results at: https://maloo.whamcloud.com/test_sessions/d9bd658c-2a2b-11e3-8527-52540035b04c

From the client test_log:

== sanity-hsm test 58: Truncate a released file will trigger restore == 14:49:23 (1380577763)
pdsh@c10: c07: ssh exited with exit code 1
Purging archive on c07
Starting copytool agt1 on c07
truncate up from 3158 to 6316
/lustre/scratch/d0.sanity-hsm/d58/f.sanity-hsm.58: (0x0000000b) exists dirty archived, archive_id:2
truncate down from 3158 to 1579
/lustre/scratch/d0.sanity-hsm/d58/f.sanity-hsm.58: (0x0000000b) exists dirty archived, archive_id:2
truncate to 0
/lustre/scratch/d0.sanity-hsm/d58/f.sanity-hsm.58: (0x0000000b) exists dirty archived, archive_id:2
sanity-hsm test_58: @@@@@@ FAIL: truncate 3158 does not trig restore, state =
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:4264:error_noexit()
= /usr/lib64/lustre/tests/test-framework.sh:4291:error()
= /usr/lib64/lustre/tests/sanity-hsm.sh:2158:truncate_released_file()
= /usr/lib64/lustre/tests/sanity-hsm.sh:2181:test_58()
= /usr/lib64/lustre/tests/test-framework.sh:4530:run_one()
= /usr/lib64/lustre/tests/test-framework.sh:4563:run_one_logged()
= /usr/lib64/lustre/tests/test-framework.sh:4433:run_test()
= /usr/lib64/lustre/tests/sanity-hsm.sh:2185:main()

From the copytool log, it looks like there are three file restores, one for each of the calls to truncate_released_file:

lhsmtool_posix[31629]: '[0x200000401:0x1ba:0x0]' action RESTORE reclen 72, cookie=0x5249f10d
lhsmtool_posix[31629]: processing file 'd0.sanity-hsm/d58/f.sanity-hsm.58'
lhsmtool_posix[31629]: reading stripe rules from '/archive/scratch/01ba/0000/0401/0000/0002/0000/0x200000401:0x1ba:0x0.lov' for '/archive/scratch/01ba/0000/0401/0000/0002/0000/0x200000401:0x1ba:0x0'
lhsmtool_posix[31629]: restoring data from '/archive/scratch/01ba/0000/0401/0000/0002/0000/0x200000401:0x1ba:0x0' to '{VOLATILE}=[0x200000402:0x18:0x0]'
lhsmtool_posix[31629]: going to copy data from '/archive/scratch/01ba/0000/0401/0000/0002/0000/0x200000401:0x1ba:0x0' to '{VOLATILE}=[0x200000402:0x18:0x0]'
lhsmtool_posix[31629]: Going to copy 0 bytes /archive/scratch/01ba/0000/0401/0000/0002/0000/0x200000401:0x1ba:0x0 -> {VOLATILE}=[0x200000402:0x18:0x0]
lhsmtool_posix[31629]: data restore from '/archive/scratch/01ba/0000/0401/0000/0002/0000/0x200000401:0x1ba:0x0' to '{VOLATILE}=[0x200000402:0x18:0x0]' done
lhsmtool_posix[31629]: Action completed, notifying coordinator cookie=0x5249f10d, FID=[0x200000401:0x1ba:0x0], hp_flags=0 err=0
lhsmtool_posix[31629]: llapi_hsm_action_end() on '/lustre/scratch/.lustre/fid/0x200000401:0x1ba:0x0' ok (rc=0)
It's not clear why the status for the file didn't return "SUCCEED". |
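For reference, here is a minimal sketch of the sequence test_58 exercises. It is an illustration only, not the actual sanity-hsm.sh code; the file path, the fsname "scratch", and archive ID 2 are taken from the log above, and the lctl query at the end assumes it is run on (or via ssh to) the MDS node:

```bash
#!/bin/bash
# Sketch of the test_58 scenario: archive, release, then truncate a
# released file, which must implicitly restore it first.
f=/lustre/scratch/d0.sanity-hsm/d58/f.sanity-hsm.58
fid=$(lfs path2fid "$f")

lfs hsm_archive --archive 2 "$f"   # copy the file into the archive
# ... wait until the ARCHIVE request reports status=SUCCEED ...
lfs hsm_release "$f"               # free the OST objects, keep the archived copy

truncate -s 1579 "$f"              # modifying a released file, including
sync                               # truncate, must implicitly restore it

lfs hsm_state "$f"                 # after restore, "released" should be gone
# On the MDS, the RESTORE request should appear with status=SUCCEED:
lctl get_param -n "mdt.scratch-MDT0000.hsm.actions" |
        grep -F "$fid" | grep 'action=RESTORE'
```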
| Comments |
| Comment by Jinshan Xiong (Inactive) [ 02/Oct/13 ] |
|
From the log, the restore process did finish, but for some unknown reason it was delayed for a while, so the test script didn't see the restored state afterwards. From the MDT log, here is the only thing of interest:

LustreError: 0-0: scratch-MDT0000: trigger OI scrub by RPC for [0x200000405:0x9:0x0], rc = 0 [1]
LustreError: 0-0: scratch-MDT0000: trigger OI scrub by RPC for [0x200000405:0x9:0x0], rc = 0 [1]
LustreError: 0-0: scratch-MDT0000: trigger OI scrub by RPC for [0x200000405:0x9:0x0], rc = 0 [1]
LustreError: 0-0: scratch-MDT0000: trigger OI scrub by RPC for [0x200000405:0x9:0x0], rc = 0 [1]

Not sure if this can cause the delay. |
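If the coordinator only records SUCCEED after a delay, reading the action list a single time races with it. A poll-with-timeout along the lines of the sketch below would absorb such a delay; the helper name and the 60-second timeout are hypothetical, loosely modeled on the test framework's wait helpers rather than copied from them:

```bash
# Hypothetical helper: poll the coordinator's action list until the given
# request reaches the expected status, instead of sampling it only once.
# Run on the MDS, or wrap the lctl call in ssh/pdsh to the MDS node.
# $1 = FID, $2 = action (e.g. RESTORE), $3 = expected status (e.g. SUCCEED)
wait_request_status() {
        local fid=$1 action=$2 expect=$3
        local state i
        for i in $(seq 1 60); do
                state=$(lctl get_param -n "mdt.*.hsm.actions" |
                        grep -F "$fid" | grep "action=$action" |
                        sed -n 's/.*status=\([A-Z]*\).*/\1/p' | head -1)
                [ "$state" = "$expect" ] && return 0
                sleep 1
        done
        echo "timeout: $action on $fid is '$state', not $expect" >&2
        return 1
}
```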
| Comment by James Nunez (Inactive) [ 05/Dec/13 ] |
|
I've run some more tests on the OpenSFS cluster and test 58 still fails some of the time. I can't consistently get it to fail and I don't see a pattern of failures when running the test alone nor in the full sanity-hsm test suite. I've uploaded new logs at https://maloo.whamcloud.com/test_sessions/b223c634-5d51-11e3-ad71-52540035b04c . |
| Comment by Jinshan Xiong (Inactive) [ 06/Dec/13 ] |
|
Hi James, do you know what made the test case fail? |
| Comment by James Nunez (Inactive) [ 07/Dec/13 ] |
|
Jinshan, Unfortunately, I don't have any more information on why the test is failing. I'll try to see if I can get it to reproduce more reliably and that might provide more information. |
| Comment by Sarah Liu [ 03/Jan/14 ] |
|
Another instance seen in review-dne: https://maloo.whamcloud.com/test_sets/7e3e01a2-7388-11e3-8412-52540035b04c |
| Comment by Sarah Liu [ 25/Jan/14 ] |
|
Hit this in interop testing between a 2.6 client and a 2.5 server: https://maloo.whamcloud.com/test_sets/64654b22-8333-11e3-a5fa-52540035b04c |
| Comment by Aurelien Degremont (Inactive) [ 12/Mar/14 ] |
|
I looked at it, and this issue seems difficult to track down. Usually this kind of issue appears when the request state list is badly parsed, or there is a timing issue where the request list entry is removed before we try to read it. With only a few occurrences of this issue and no known way to reproduce it, this becomes tricky. The only thing I can suggest is adding a debugging patch to try to gather more information, such as printing the request state content. |
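One low-cost way to gather that information without a kernel patch would be to snapshot the coordinator's state from user space whenever the check fails. The sketch below is only one possible shape for such a hook; the function name, log path, and ssh-based invocation are all made up for illustration:

```bash
# Hypothetical failure hook: snapshot the client-side HSM state of the
# file and the coordinator's request list on the MDS for post-mortem use.
# $1 = file path on the client, $2 = MDS hostname
dump_hsm_debug() {
        local f=$1 mds=$2 log=/tmp/hsm-debug.$(date +%s).log
        {
                echo "== lfs hsm_state $f =="
                lfs hsm_state "$f"
                echo "== HSM coordinator state on $mds =="
                ssh "$mds" "lctl get_param mdt.*.hsm.actions mdt.*.hsm.agents"
        } > "$log" 2>&1
        echo "HSM debug state saved to $log" >&2
}
```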
| Comment by Andreas Dilger [ 04/Dec/14 ] |
|
The test has not failed with this message in the past month; the recent failures show a different message, "request on 0x200000401:0x2d:0x0 is not SUCCEED on mds1", so those should be a different bug. |