[LU-9306] sanity-hsm test 24d is failing with 'request on 0x200000405:0x24:0x0 is not SUCCEED on mds1' Created: 07/Apr/17 Updated: 29/May/17 Resolved: 05/May/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.0 |
| Fix Version/s: | Lustre 2.10.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | James Nunez (Inactive) | Assignee: | John Hammond |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
sanity_hsm test_24d is failing. From the test log, we wait for an update for over 200 seconds: CMD: onyx-39vm7 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000405:0x24:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: onyx-39vm7 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000405:0x24:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
Update not seen after 200s: wanted 'SUCCEED' got 'STARTED'
sanity-hsm test_24d: @@@@@@ FAIL: request on 0x200000405:0x24:0x0 is not SUCCEED on mds1
There is nothing obviously wrong in the console logs for any of the nodes. The copytool_log for this test is nearly empty and doesn’t provide any information on what is causing this problem. The full copytool_log for this test is: 1491012689.288932 lhsmtool_posix[24069]: action=0 src=(null) dst=(null) mount_point=/mnt/lustre3 1491012689.334255 lhsmtool_posix[24070]: waiting for message from kernel exiting: Terminated This test failure could be leading to a cascade of failures. After test 24d fails, the following tests fail 24e, 24f, 25b, 26, 27b, 28, 29b, 29c, 30b, 30c, 31b, and many more. I don’t know if all the failures are related, but we should clean up the first test that’s failing. So far, I’ve only seen this test fail for review-dne-part-2. So, the issue may be DNE related? This test started to fail on the master branch on 2017-03-25 and has failed about 19 times since then. The patch for Here are links to some of the failed test logs: |
| Comments |
| Comment by James Nunez (Inactive) [ 12/Apr/17 ] |
|
Quentin - Would you please take a look at this issue? There are a couple of problems here; test 24d is failing at a high rate and the test is not cleaning up properly after itself and leaving the file system as read only. Your patch for |
| Comment by Quentin Bouget [ 12/Apr/17 ] |
|
On it |
| Comment by Quentin Bouget [ 12/Apr/17 ] |
|
The cleanup is not done correctly because test_24d calls trap cleanup_test_24d EXIT before copytool_setup $SINGLEAGT "$MOUNT3"
which sets its own trap on EXIT. I have a function to stack traps, I am working on integrating it in sanity-hsm. As to why the test itself fails, I must say I still have no idea. |
| Comment by Quentin Bouget [ 12/Apr/17 ] |
|
James, Could you tell me when was the last time test_9a and test_29d were run please? |
| Comment by James Nunez (Inactive) [ 12/Apr/17 ] |
|
Quentin, Both sanity-hsm test 9a and test 29d are skipped because they require three or more clients. I reviewed results for these tests for the past two years and found the last time test 9a was not skipped was September 9, 2015 and earlier; it actually passed 13 times between February and September, 2015. Test 29d was skipped for this whole period. |
| Comment by Quentin Bouget [ 12/Apr/17 ] |
|
Thank you! |
| Comment by Andreas Dilger [ 17/Apr/17 ] |
|
This test has failed 25x in the past week, as often as 7x in a single day. |
| Comment by Andreas Dilger [ 17/Apr/17 ] |
|
The first recent failure was 2017-03-25 with test https://testing.hpdd.intel.com/test_sets/82d3a2da-115f-11e7-9073-5254006e85c2 which was right after the ZFS Snapshot feature was landed on 2017-03-23. It makes sense to look at the patches that landed on that day to see if any of them could have caused this. |
| Comment by Peter Jones [ 17/Apr/17 ] |
|
Fan Yong Could this have been caused by any of the ZFS Snapshots patches? Peter |
| Comment by Gerrit Updater [ 19/Apr/17 ] |
|
Quentin Bouget (quentin.bouget@cea.fr) uploaded a new patch: https://review.whamcloud.com/26734 |
| Comment by nasf (Inactive) [ 21/Apr/17 ] |
|
Honestly, I cannot establish the relationship between the current sanity-hsm test_24d failure and the snapshot patches. The unique possible relation is the test_24d will try to check HSM actions with client mounted as read-only that seems something affected by the patch "0001- |
| Comment by Gerrit Updater [ 21/Apr/17 ] |
|
Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/26770 |
| Comment by Steve Guminski (Inactive) [ 21/Apr/17 ] |
|
Another on master: https://testing.hpdd.intel.com/test_sessions/991eb176-e246-4d8f-bcca-2a62fc2e179f |
| Comment by Gerrit Updater [ 26/Apr/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/26770/ |
| Comment by Peter Jones [ 26/Apr/17 ] |
|
Landed for 2.10 |
| Comment by James Nunez (Inactive) [ 26/Apr/17 ] |
|
I'm reopening this ticket because it's not clear to me that the root cause of sanity-hsm test 24 failures is known/fixed and I'd still like to see Quentin's patch https://review.whamcloud.com/#/c/26734/ land so this test cleans up properly when it encounters an error. |
| Comment by Gerrit Updater [ 27/Apr/17 ] |
|
Bobi Jam (bobijam@hotmail.com) uploaded a new patch: https://review.whamcloud.com/26850 |
| Comment by Gerrit Updater [ 28/Apr/17 ] |
|
James Nunez (james.a.nunez@intel.com) uploaded a new patch: https://review.whamcloud.com/26877 |
| Comment by John Hammond [ 28/Apr/17 ] |
|
This is another instance of diff --git a/lustre/obdclass/kernelcomm.c b/lustre/obdclass/kernelcomm.c
index 0787f46..cb52660 100644
--- a/lustre/obdclass/kernelcomm.c
+++ b/lustre/obdclass/kernelcomm.c
@@ -194,6 +194,14 @@ int libcfs_kkuc_group_put(int group, void *payload)
ENTRY;
down_write(&kg_sem);
+
+ if (unlikely(kkuc_groups[group].next == NULL) ||
+ unlikely(OBD_FAIL_CHECK(OBD_FAIL_MDS_HSM_CT_REGISTER_NET))) {
+ /* no agent have fully registered, CDT will retry */
+ up_write(&kg_sem);
+ RETURN(-EAGAIN);
+ }
+
list_for_each_entry(reg, &kkuc_groups[group], kr_chain) {
if (reg->kr_fp != NULL) {
rc = libcfs_kkuc_msg_put(reg->kr_fp, payload);
|
| Comment by Gerrit Updater [ 28/Apr/17 ] |
|
John L. Hammond (john.hammond@intel.com) uploaded a new patch: https://review.whamcloud.com/26883 |
| Comment by Gerrit Updater [ 05/May/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/26883/ |
| Comment by Peter Jones [ 05/May/17 ] |
|
Am I right in thinking that the patch just landed to master will address the blocking aspect of this ticket and that any other work discussed will be tracked elsewhere? |
| Comment by Quentin Bouget [ 05/May/17 ] |
|
I will open a new ticket to track the work on trap registration in sanity-hsm. Just from the commit messages of the other patches (disable sanity-hsm test 24d and more debug info for hsm test_24d), I don't think they are needed anymore. On the other hand, the patch that adds debug info (and was merged) should probably be reverted once we are confident enough the issue was solved. I don't know how this revert should be tracked though. |
| Comment by John Hammond [ 05/May/17 ] |
|
Peter, > Am I right in thinking that the patch just landed to master will address the blocking aspect of this ticket and that any other work discussed will be tracked elsewhere? Correct. |