HSM _not only_ small fixes and to do list goes here (LU-3647)

[LU-3704] sanity-hsm test_21 - test_24 Error: 'could not release file' Created: 05/Aug/13  Updated: 21/Oct/13  Resolved: 02/Sep/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0
Fix Version/s: None

Type: Technical task Priority: Major
Reporter: Keith Mannthey (Inactive) Assignee: Bruno Faccini (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: HSM
Environment: autotest

Issue Links:
Related
is related to LU-3730 sanity-hsm test_3 Error: 'could not c... Resolved
is related to LU-3700 sanity-hsm test_21 Error: 'wrong bloc... Closed
Rank (Obsolete): 9553

 Description   

Sanity-hsm test 21 seems to fail a lot.

An example:
https://maloo.whamcloud.com/test_sets/adac0ef6-fb6b-11e2-8c6e-52540035b04c

'could not release file' seems to be the slightly more common error.

test_21 	

    Error: 'could not release file'
    Failure Rate: 32.00% of last 100 executions [all branches] 

The test logs look like this:

== sanity-hsm test 21: Simple release tests == 09:25:28 (1375374328)
2+0 records in
2+0 records out
2097152 bytes (2.1 MB) copied, 0.387647 s, 5.4 MB/s
Cannot send HSM request (use of /mnt/lustre/d0.sanity-hsm/d21/test_release): Device or resource busy
 sanity-hsm test_21: @@@@@@ FAIL: could not release file 
  Trace dump:

The test 3 and test 22-24 errors all look to be related.



 Comments   
Comment by Aurelien Degremont (Inactive) [ 09/Aug/13 ]

I think that test #3 and tests #21-24 are not related.

  • test #3 is not doing any "hsm_release" commands.
    It is failing to copy /etc/passwd into the Lustre mount point. Maybe this file has some side effect on the system. Maybe we should replace this copy with another way of creating the file.
  • tests #21-24.
    I've looked at test #21. It is failing somewhere at the end of the release process, when closing the file. I lack the debug information needed to tell exactly where it fails. EBUSY is returned when MDS_CLOSE is replied to but the OBD_FL_RELEASE flag was not set for some reason. It would be nice to have more debug output for that.
Comment by Keith Mannthey (Inactive) [ 09/Aug/13 ]

The reason I mention they might be linked is that they seem to fail together a lot.

Another example:
https://maloo.whamcloud.com/test_sets/0c378cde-ff90-11e2-a3fb-52540035b04c

I will open LUs to track the other subtests.

Comment by James Nunez (Inactive) [ 12/Aug/13 ]

Another set of test 3, 21-24 failures at: https://maloo.whamcloud.com/test_sets/aa3d4616-001d-11e3-a856-52540035b04c

Comment by Jinshan Xiong (Inactive) [ 13/Aug/13 ]

I looked at the failure of test_3.

The failure occurred when the client was trying to set system.posix_acl_access, since test_3 uses `cp -p' to copy the file. Before doing that, it tried to update the client xattr cache, so the calling sequence is as follows:

ll_setxattr -> ll_xattr_cache_update -> ll_xattr_cache_refill -> IT_GETXATTR -> mdt_getxattr.

However, in mdt_getxattr(), the code snippet below:

                eadatasize = mo_xattr_list(info->mti_env, next, buf);
                if (eadatasize < 0)
                        GOTO(out, rc = eadatasize);

                v = req_capsule_server_get(info->mti_pill, &RMF_EAVALS);
                sizes = req_capsule_server_get(info->mti_pill,
                                                &RMF_EAVALS_LENS);

                /* Fill out EAVALS and EAVALS_LENS */
                for (b = buf->lb_buf;
                     b < (char *)buf->lb_buf + eadatasize;
                     b += strlen(b) + 1, v += rc) {
                        buf2.lb_buf = v;
                        rc = mdt_getxattr_one(info, b, next, &buf2, med, uc);
                        if (rc < 0)
                                GOTO(out, rc);
                        sizes[eavallens] = rc;
                        buf2.lb_len -= rc;
                        eavallens++;
                        eavallen += rc;
                }

returned ENODATA from mdt_getxattr_one(). I don't know which xattr name was involved, because there are no logs from the MDT side.
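
To illustrate why a single missing value is enough to fail the whole request, here is a minimal user-space sketch of the same loop shape: walk a buffer of NUL-terminated names and abort on the first lookup error. It is not Lustre code. lookup_one() and fill_values() are hypothetical stand-ins (for mdt_getxattr_one() and the loop in mdt_getxattr() respectively), and the name/value table is invented for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for mdt_getxattr_one(): copy the value for `name`
 * into `out` and return its length, or -ENODATA if the name has no value.
 * The two known names below are invented for this sketch. */
static int lookup_one(const char *name, char *out, size_t out_len)
{
	const char *val;
	size_t len;

	if (strcmp(name, "user.a") == 0)
		val = "alpha";
	else if (strcmp(name, "user.b") == 0)
		val = "be";
	else
		return -ENODATA;

	len = strlen(val);
	if (len > out_len)
		return -ERANGE;
	memcpy(out, val, len);
	return (int)len;
}

/* Walk `names` (NUL-terminated strings packed back to back), pack the
 * values into `vals` and record each length in `sizes`.
 * Returns the number of values stored, or a negative errno. */
static int fill_values(const char *names, size_t names_len,
		       char *vals, size_t vals_len,
		       int *sizes, int max_sizes)
{
	const char *b;
	char *v = vals;
	int count = 0;
	int rc = 0;

	/* The last name must be terminated, or strlen() would overrun. */
	assert(names_len == 0 || names[names_len - 1] == '\0');

	for (b = names; b < names + names_len; b += strlen(b) + 1, v += rc) {
		if (count == max_sizes)
			return -ERANGE;
		rc = lookup_one(b, v, vals_len);
		if (rc < 0)
			return rc; /* one bad name fails the whole refill */
		sizes[count++] = rc;
		vals_len -= (size_t)rc;
	}
	return count;
}
```

With the names "user.a" and "user.b" this returns 2. Adding any name that lookup_one() does not know makes the whole call return -ENODATA, even though the other values were retrievable. That matches the symptom described above, where an unrelated setxattr fails during the cache refill.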

Comment by Bruno Faccini (Inactive) [ 20/Aug/13 ]

test_3 failures are due, as Jinshan detailed, to an ENODATA return during the [cp -p /etc/passwd $TESTFILE.append || error "could not create file"] command when it tries to set the "system.posix_acl_access" xattr (but why?). This is now specifically addressed by LU-3730.

test_21-24 failures, as detailed by Aurelien and to be addressed here, look more like a race/timing issue (like the test_21 "wrong block number" errors in LU-3700) caused by the current use of the "lfs hsm_set --archived --exist <file>" command to mimic "lfs hsm_archive <file>". Perhaps "wait_request_state $fid ARCHIVE SUCCEED" also needs to be used here?

I will try to set up a platform to reproduce the problem, with HSM debug traces enabled on the client/MDS VMs, running the sanity-hsm tests in a loop.

Comment by Bruno Faccini (Inactive) [ 30/Aug/13 ]

I am not able to reproduce the problem with current master, even when running sanity-hsm/test_[21-24] in a loop. I only had to ignore the test_24 "atime should be ..." errors, since I think John is still working on them in LU-3814/LU-3832, and test_24 is actually not run by default.

BTW, according to the Maloo reports, the test_[21-24] failures for 'could not release file' stopped around Aug. 14th. This seems to match the landing of the patch for LU-3561, which brings "real" HSM features (copytool, use of lfs hsm commands instead of setting HSM flags) into the tests and the corresponding tools testing.

So my strong assumption is that this ticket can be closed because it is no longer relevant.

Comment by Bruno Faccini (Inactive) [ 02/Sep/13 ]

To be re-opened if the problem recurs.
