HSM _not only_ small fixes and to do list goes here
(LU-3647)
|
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.0 |
| Fix Version/s: | None |
| Type: | Technical task | Priority: | Major |
| Reporter: | Keith Mannthey (Inactive) | Assignee: | Bruno Faccini (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | HSM | ||
| Environment: |
autotest |
||
| Rank (Obsolete): | 9553 | ||||||||||||
| Description |
|
Sanity-hsm test 21 seems to fail a lot. An example: 'could not release file' seems to be the slightly more common error. test_21
Error: 'could not release file'
Failure Rate: 32.00% of last 100 executions [all branches]
The test logs look like:
== sanity-hsm test 21: Simple release tests == 09:25:28 (1375374328)
2+0 records in
2+0 records out
2097152 bytes (2.1 MB) copied, 0.387647 s, 5.4 MB/s
Cannot send HSM request (use of /mnt/lustre/d0.sanity-hsm/d21/test_release): Device or resource busy
sanity-hsm test_21: @@@@@@ FAIL: could not release file
Trace dump:
Test 3 and 22-24 errors all look to be related. |
| Comments |
| Comment by Aurelien Degremont (Inactive) [ 09/Aug/13 ] |
|
I think that test #3 and tests #21-24 are not related.
|
| Comment by Keith Mannthey (Inactive) [ 09/Aug/13 ] |
|
The reason I mention they might be linked is that they seem to fail together a lot. Another example: I will open LUs to track the other subtests. |
| Comment by James Nunez (Inactive) [ 12/Aug/13 ] |
|
Another set of test 3, 21-24 failures at: https://maloo.whamcloud.com/test_sets/aa3d4616-001d-11e3-a856-52540035b04c |
| Comment by Jinshan Xiong (Inactive) [ 13/Aug/13 ] |
|
I looked at the failure of test_3. The failure occurred when the client was trying to set system.posix_acl_access, since test_3 uses `cp -p` to copy the file. Before doing that, it tried to update the client xattr cache, so the calling sequence is as follows: ll_setxattr -> ll_xattr_cache_update -> ll_xattr_cache_refill -> IT_GETXATTR -> mdt_getxattr. However, in mdt_getxattr(), the code snippet below:
        eadatasize = mo_xattr_list(info->mti_env, next, buf);
        if (eadatasize < 0)
                GOTO(out, rc = eadatasize);
        v = req_capsule_server_get(info->mti_pill, &RMF_EAVALS);
        sizes = req_capsule_server_get(info->mti_pill,
                                       &RMF_EAVALS_LENS);
        /* Fill out EAVALS and EAVALS_LENS */
        for (b = buf->lb_buf;
             b < (char *)buf->lb_buf + eadatasize;
             b += strlen(b) + 1, v += rc) {
                buf2.lb_buf = v;
                rc = mdt_getxattr_one(info, b, next, &buf2, med, uc);
                if (rc < 0)
                        GOTO(out, rc);
                sizes[eavallens] = rc;
                buf2.lb_len -= rc;
                eavallens++;
                eavallen += rc;
        }
returned ENODATA from mdt_getxattr_one(). I don't know what the xattr name is, due to the lack of logs on the MDT side. |
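The failure mode described above can be illustrated with a small user-space sketch. It is not Lustre code: fake_getxattr_one() and pack_all() are hypothetical stand-ins, and the xattr names and values are invented for the example. It only mimics the shape of the loop, which walks a NUL-separated name list, packs values back to back, and gives up on the first negative return, so a single name with no value makes the whole request fail with ENODATA.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for mdt_getxattr_one(): copies the value for a
 * known name into out and returns its length, or a negative errno. */
static int fake_getxattr_one(const char *name, char *out, size_t outlen)
{
    const char *val = NULL;

    if (strcmp(name, "user.foo") == 0)
        val = "bar";
    else if (strcmp(name, "trusted.lov") == 0)
        val = "stripe";

    if (val == NULL)
        return -ENODATA;        /* listed name, but no value to return */

    size_t len = strlen(val);
    if (len > outlen)
        return -ERANGE;
    memcpy(out, val, len);
    return (int)len;
}

/* Walk a NUL-separated name list of listsize bytes and pack every value
 * into vals, recording each length in sizes. Returns the number of values
 * packed, or the first negative errno, in which case the walk is aborted. */
static int pack_all(const char *names, size_t listsize,
                    char *vals, size_t valslen, int *sizes)
{
    int count = 0;
    char *v = vals;

    assert(names != NULL && vals != NULL && sizes != NULL);

    for (const char *b = names; b < names + listsize; b += strlen(b) + 1) {
        int rc = fake_getxattr_one(b, v, valslen);

        if (rc < 0)
            return rc;          /* one bad name fails the whole request */
        sizes[count++] = rc;
        v += rc;
        valslen -= (size_t)rc;
    }
    return count;
}
```

With a list such as "user.foo\0trusted.lov" the sketch packs two values; inserting a name that has no value in the middle of the list makes pack_all() return -ENODATA without packing the remaining entries. Whether the real code should skip such a name or fail is not settled in this ticket.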
| Comment by Bruno Faccini (Inactive) [ 20/Aug/13 ] |
|
test_3 failures are due to, as Jinshan detailed, an ENODATA return during the [cp -p /etc/passwd $TESTFILE.append || error "could not create file"] command when trying to set the "system.posix_acl_access" XATTR (but why?). And it is now specifically addressed by test_21-24 failures, as detailed by Aurelien and to be addressed here, look more like a race/timing issue (like test_21 "wrong block number" errors for Will try to set up a platform to reproduce the problem, with HSM debug traces enabled on Client/MDS VMs, and running sanity-hsm tests in a loop. |
| Comment by Bruno Faccini (Inactive) [ 30/Aug/13 ] |
|
I am not able to reproduce the problem with current master, even by running sanity-hsm/test_[21-24] in a loop. I only had to avoid/ignore test_24 "atime should be ..." errors since it is still being worked on by John in BTW, according to Maloo reports, test_[21-24] failures for 'could not release file' stopped around Aug. 14th. And this seems to match the landing of the patch for So my strong assumption is that this ticket can be closed because it is unrelated now. |
| Comment by Bruno Faccini (Inactive) [ 02/Sep/13 ] |
|
To be re-opened in case of recurrence. |