[LU-4035] sanity-hsm test_58 failure: 'truncate 3158 does not trig restore, state = ' Created: 01/Oct/13  Updated: 19/Jan/22  Resolved: 04/Dec/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0, Lustre 2.6.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: Jinshan Xiong (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: HSM
Environment:

Lustre 2.4.93 build # 1687
OpenSFS cluster with combined MGS/MDS, single OSS with two OSTs, and four clients: one agent/Lustre client (c07), one Lustre client with robinhood/db running (c08), and two Lustre clients (c09, c10)


Severity: 3
Rank (Obsolete): 10839

 Description   

test results at: https://maloo.whamcloud.com/test_sessions/d9bd658c-2a2b-11e3-8527-52540035b04c

From the client test_log:

== sanity-hsm test 58: Truncate a released file will trigger restore == 14:49:23 (1380577763)
pdsh@c10: c07: ssh exited with exit code 1
Purging archive on c07
Starting copytool agt1 on c07
truncate up from 3158 to 6316
/lustre/scratch/d0.sanity-hsm/d58/f.sanity-hsm.58: (0x0000000b) exists dirty archived, archive_id:2
truncate down from 3158 to 1579
/lustre/scratch/d0.sanity-hsm/d58/f.sanity-hsm.58: (0x0000000b) exists dirty archived, archive_id:2
truncate to 0
/lustre/scratch/d0.sanity-hsm/d58/f.sanity-hsm.58: (0x0000000b) exists dirty archived, archive_id:2
 sanity-hsm test_58: @@@@@@ FAIL: truncate 3158 does not trig restore, state =  
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:4264:error_noexit()
  = /usr/lib64/lustre/tests/test-framework.sh:4291:error()
  = /usr/lib64/lustre/tests/sanity-hsm.sh:2158:truncate_released_file()
  = /usr/lib64/lustre/tests/sanity-hsm.sh:2181:test_58()
  = /usr/lib64/lustre/tests/test-framework.sh:4530:run_one()
  = /usr/lib64/lustre/tests/test-framework.sh:4563:run_one_logged()
  = /usr/lib64/lustre/tests/test-framework.sh:4433:run_test()
  = /usr/lib64/lustre/tests/sanity-hsm.sh:2185:main()

From the copytool log, it looks like there were three file restores, one for each of the calls to truncate_released_file:

lhsmtool_posix[31629]: '[0x200000401:0x1ba:0x0]' action RESTORE reclen 72, cookie=0x5249f10d
lhsmtool_posix[31629]: processing file 'd0.sanity-hsm/d58/f.sanity-hsm.58'
lhsmtool_posix[31629]: reading stripe rules from '/archive/scratch/01ba/0000/0401/0000/0002/0000/0x200000401:0x1ba:0x0.lov' for '/archive/scratch/01ba/0000/0401/0000/0002/0000/0x200000401:0x1ba:0x0'
lhsmtool_posix[31629]: restoring data from '/archive/scratch/01ba/0000/0401/0000/0002/0000/0x200000401:0x1ba:0x0' to '{VOLATILE}=[0x200000402:0x18:0x0]'
lhsmtool_posix[31629]: going to copy data from '/archive/scratch/01ba/0000/0401/0000/0002/0000/0x200000401:0x1ba:0x0' to '{VOLATILE}=[0x200000402:0x18:0x0]'
lhsmtool_posix[31629]: Going to copy 0 bytes /archive/scratch/01ba/0000/0401/0000/0002/0000/0x200000401:0x1ba:0x0 -> {VOLATILE}=[0x200000402:0x18:0x0]

lhsmtool_posix[31629]: data restore from '/archive/scratch/01ba/0000/0401/0000/0002/0000/0x200000401:0x1ba:0x0' to '{VOLATILE}=[0x200000402:0x18:0x0]' done
lhsmtool_posix[31629]: Action completed, notifying coordinator cookie=0x5249f10d, FID=[0x200000401:0x1ba:0x0], hp_flags=0 err=0
lhsmtool_posix[31629]: llapi_hsm_action_end() on '/lustre/scratch/.lustre/fid/0x200000401:0x1ba:0x0' ok (rc=0)

It's not clear why the request status for the file was never reported as "SUCCEED".
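For context, here is a minimal sketch of the sequence test_58 exercises and the check that produces the error above. This is an approximation of the sanity-hsm.sh flow, not the verbatim test code; get_request_state and error are helpers from the test suite, and the sizes are taken from the log:

# Approximate flow of test_58 (sketch, not verbatim sanity-hsm.sh):
f=$DIR/d0.sanity-hsm/d58/f.sanity-hsm.58
dd if=/dev/urandom of=$f bs=1579 count=2    # 3158-byte file, as in the log
lfs hsm_archive --archive 2 $f              # copy to archive_id 2
lfs hsm_release $f                          # drop the OST objects

# Truncating a released file must first trigger an implicit RESTORE.
truncate -s 6316 $f

# The failing check: read the RESTORE request state for this FID and
# expect SUCCEED; an empty result yields "state = " as seen above.
fid=$(lfs path2fid $f)
state=$(get_request_state $fid RESTORE)
[[ "$state" == "SUCCEED" ]] ||
        error "truncate 3158 does not trig restore, state = $state"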



 Comments   
Comment by Jinshan Xiong (Inactive) [ 02/Oct/13 ]

From the log, the restore process had finished, but for an unknown reason it was delayed for a while, so the test script didn't see the restore state afterwards.

From the MDT log, here is the only thing interesting:

LustreError: 0-0: scratch-MDT0000: trigger OI scrub by RPC for [0x200000405:0x9:0x0], rc = 0 [1]
LustreError: 0-0: scratch-MDT0000: trigger OI scrub by RPC for [0x200000405:0x9:0x0], rc = 0 [1]
LustreError: 0-0: scratch-MDT0000: trigger OI scrub by RPC for [0x200000405:0x9:0x0], rc = 0 [1]
LustreError: 0-0: scratch-MDT0000: trigger OI scrub by RPC for [0x200000405:0x9:0x0], rc = 0 [1]

Not sure if this can cause the delay.
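
If that delay theory is right, one mitigation would be to poll the request state for a bounded time instead of sampling it once. A rough sketch, loosely modeled on the suite's wait_request_state helper; the function name and the 10-second default here are assumptions:

# Hypothetical polling helper (sketch; relies on the suite's
# get_request_state and error helpers):
wait_restore_succeed() {
        local fid=$1
        local timeout=${2:-10}      # seconds; assumed default
        local state i
        for ((i = 0; i < timeout; i++)); do
                state=$(get_request_state $fid RESTORE)
                [[ "$state" == "SUCCEED" ]] && return 0
                sleep 1
        done
        error "restore of $fid not SUCCEED after ${timeout}s, state = $state"
}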

Comment by James Nunez (Inactive) [ 05/Dec/13 ]

I've run some more tests on the OpenSFS cluster and test 58 still fails some of the time. I can't get it to fail consistently, and I don't see a pattern of failures when running the test either alone or in the full sanity-hsm test suite. I've uploaded new logs at https://maloo.whamcloud.com/test_sessions/b223c634-5d51-11e3-ad71-52540035b04c .

Comment by Jinshan Xiong (Inactive) [ 06/Dec/13 ]

Hi James, do you know what made the test case fail?

Comment by James Nunez (Inactive) [ 07/Dec/13 ]

Jinshan, unfortunately I don't have any more information on why the test is failing. I'll see if I can get it to reproduce more reliably, which might provide more information.

Comment by Sarah Liu [ 03/Jan/14 ]

Another instance seen in review-dne:

https://maloo.whamcloud.com/test_sets/7e3e01a2-7388-11e3-8412-52540035b04c

Comment by Sarah Liu [ 25/Jan/14 ]

Hit this in interop testing between a 2.6 client and a 2.5 server:

https://maloo.whamcloud.com/test_sets/64654b22-8333-11e3-a5fa-52540035b04c

Comment by Aurelien Degremont (Inactive) [ 12/Mar/14 ]

I looked at it, and this issue seems difficult to track down. Usually this kind of issue appears when the request state list is badly parsed, or when there is a timing issue where the request list entry is removed before we try to read it.
Looking at sanity-hsm.sh, that delay is set to 10 seconds. According to the Maloo logs, only a few seconds elapsed between the truncate and the get_request_state() call.

Since this issue appears only rarely and we do not know how to reproduce it, tracking it down becomes tricky. All I can imagine is adding a debugging patch to try to gather more information (like printing the request state content).
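
A debugging patch along those lines could dump the coordinator's whole action list whenever the per-FID lookup comes back empty. A sketch, assuming the 2.5-era mdt.*.hsm.actions parameter and the suite's do_facet/$LCTL conventions:

# If the per-FID lookup returns nothing, dump the MDT coordinator's full
# action list so we can see whether the RESTORE entry was purged early,
# still WAITING, or never recorded (debugging sketch, not a landed patch):
state=$(get_request_state $fid RESTORE)
if [[ -z "$state" ]]; then
        do_facet $SINGLEMDS "$LCTL get_param -n mdt.*.hsm.actions"
fi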

Comment by Andreas Dilger [ 04/Dec/14 ]

The test has not failed this way in the past month, and the recent failures show a different message, "request on 0x200000401:0x2d:0x0 is not SUCCEED on mds1", so those should be a different bug.
