[LU-9306] sanity-hsm test 24d is failing with 'request on 0x200000405:0x24:0x0 is not SUCCEED on mds1' Created: 07/Apr/17  Updated: 29/May/17  Resolved: 05/May/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.0
Fix Version/s: Lustre 2.10.0

Type: Bug Priority: Blocker
Reporter: James Nunez (Inactive) Assignee: John Hammond
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

sanity_hsm test_24d is failing. From the test log, we wait for an update for over 200 seconds:

CMD: onyx-39vm7 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000405:0x24:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: onyx-39vm7 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000405:0x24:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
Update not seen after 200s: wanted 'SUCCEED' got 'STARTED'
 sanity-hsm test_24d: @@@@@@ FAIL: request on 0x200000405:0x24:0x0 is not SUCCEED on mds1 

There is nothing obviously wrong in the console logs for any of the nodes.

The copytool_log for this test is nearly empty and doesn’t provide any information on what is causing this problem. The full copytool_log for this test is:

1491012689.288932 lhsmtool_posix[24069]: action=0 src=(null) dst=(null) mount_point=/mnt/lustre3
1491012689.334255 lhsmtool_posix[24070]: waiting for message from kernel
exiting: Terminated

This test failure could be leading to a cascade of failures. After test 24d fails, the following tests fail 24e, 24f, 25b, 26, 27b, 28, 29b, 29c, 30b, 30c, 31b, and many more. I don’t know if all the failures are related, but we should clean up the first test that’s failing.

So far, I’ve only seen this test fail for review-dne-part-2. So, the issue may be DNE related?

This test started to fail on the master branch on 2017-03-25 and has failed about 19 times since then. The patch for LU-8911, https://review.whamcloud.com/#/c/24185/, is the last patch that made modifications to this test and sanity-hsm.

Here are links to some of the failed test logs:
2017-04-06 - https://testing.hpdd.intel.com/test_sets/81096390-1ae7-11e7-9073-5254006e85c2
2017-04-05 - https://testing.hpdd.intel.com/test_sets/ad0ce212-1a3f-11e7-9de9-5254006e85c2
2017-04-05 - https://testing.hpdd.intel.com/test_sets/28ab074e-19ed-11e7-b742-5254006e85c2
2017-04-05 - https://testing.hpdd.intel.com/test_sets/2bd0287a-19cd-11e7-8920-5254006e85c2
2017-04-04 - https://testing.hpdd.intel.com/test_sets/550c4e1a-1952-11e7-9de9-5254006e85c2
2017-04-03 - https://testing.hpdd.intel.com/test_sets/d986e31e-18c9-11e7-8920-5254006e85c2



 Comments   
Comment by James Nunez (Inactive) [ 12/Apr/17 ]

Quentin - Would you please take a look at this issue? There are a couple of problems here; test 24d is failing at a high rate and the test is not cleaning up properly after itself and leaving the file system as read only.

Your patch for LU-8911, https://review.whamcloud.com/#/c/24185/, was the last to modify this test.

Comment by Quentin Bouget [ 12/Apr/17 ]

On it

Comment by Quentin Bouget [ 12/Apr/17 ]

The cleanup is not done correctly because test_24d calls

trap cleanup_test_24d EXIT

before

copytool_setup $SINGLEAGT "$MOUNT3"

which sets its own trap on EXIT.

I have a function to stack traps, I am working on integrating it in sanity-hsm.

As to why the test itself fails, I must say I still have no idea.

Comment by Quentin Bouget [ 12/Apr/17 ]

James,

Could you tell me when was the last time test_9a and test_29d were run please?

Comment by James Nunez (Inactive) [ 12/Apr/17 ]

Quentin,

Both sanity-hsm test 9a and test 29d are skipped because they require three or more clients. I reviewed results for these tests for the past two years and found the last time test 9a was not skipped was September 9, 2015 and earlier; it actually passed 13 times between February and September, 2015. Test 29d was skipped for this whole period.

Comment by Quentin Bouget [ 12/Apr/17 ]

Thank you!

Comment by Andreas Dilger [ 17/Apr/17 ]

This test has failed 25x in the past week, as often as 7x in a single day.

Comment by Andreas Dilger [ 17/Apr/17 ]

The first recent failure was 2017-03-25 with test https://testing.hpdd.intel.com/test_sets/82d3a2da-115f-11e7-9073-5254006e85c2 which was right after the ZFS Snapshot feature was landed on 2017-03-23. It makes sense to look at the patches that landed on that day to see if any of them could have caused this.

Comment by Peter Jones [ 17/Apr/17 ]

Fan Yong

Could this have been caused by any of the ZFS Snapshots patches?

Peter

Comment by Gerrit Updater [ 19/Apr/17 ]

Quentin Bouget (quentin.bouget@cea.fr) uploaded a new patch: https://review.whamcloud.com/26734
Subject: LU-9306 tests: sanity-hsm, register traps in a better order
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: abfc0251f892862145c7576912640cce841fd7ba

Comment by nasf (Inactive) [ 21/Apr/17 ]

Honestly, I cannot establish the relationship between the current sanity-hsm test_24d failure and the snapshot patches. The unique possible relation is the test_24d will try to check HSM actions with client mounted as read-only that seems something affected by the patch "0001-LU-8900-snapshot-simulate-readonly-device.patch" (https://review.whamcloud.com/24267). But in fact, such patch almost has nothing related with HSM. So I have to make some debug patch to collect more information.

Comment by Gerrit Updater [ 21/Apr/17 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/26770
Subject: LU-9306 tests: more debug info for hsm test_24d
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 8c7c03ff53e5c5f63da6e99f42c3124c4f5c2d29

Comment by Steve Guminski (Inactive) [ 21/Apr/17 ]

Another on master:

https://testing.hpdd.intel.com/test_sessions/991eb176-e246-4d8f-bcca-2a62fc2e179f

Comment by Gerrit Updater [ 26/Apr/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/26770/
Subject: LU-9306 tests: more debug info for hsm test_24d
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: fc7c513b4cbcc8775076f6490f2df03b52cf4051

Comment by Peter Jones [ 26/Apr/17 ]

Landed for 2.10

Comment by James Nunez (Inactive) [ 26/Apr/17 ]

I'm reopening this ticket because it's not clear to me that the root cause of sanity-hsm test 24 failures is known/fixed and I'd still like to see Quentin's patch https://review.whamcloud.com/#/c/26734/ land so this test cleans up properly when it encounters an error.

Comment by Gerrit Updater [ 27/Apr/17 ]

Bobi Jam (bobijam@hotmail.com) uploaded a new patch: https://review.whamcloud.com/26850
Subject: LU-9306 tests: more debug info for hsm test_24d
Project: fs/lustre-release
Branch: pfl
Current Patch Set: 1
Commit: 2a7f9003fcb2002e3d2163e8b8e8f628491b1cec

Comment by Gerrit Updater [ 28/Apr/17 ]

James Nunez (james.a.nunez@intel.com) uploaded a new patch: https://review.whamcloud.com/26877
Subject: LU-9306 test: disable sanity-hsm test 24d
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 135809281665710c5a6c32db45434c0dd8e9ba05

Comment by John Hammond [ 28/Apr/17 ]

This is another instance of LU-9038. The CT has not registered when the request was sent, so it will take some time for the request to be reissued. In the fix from LU-9038 https://review.whamcloud.com/25050 we should also check for an empty list (since otherwise the message will be dropped):

diff --git a/lustre/obdclass/kernelcomm.c b/lustre/obdclass/kernelcomm.c
index 0787f46..cb52660 100644
--- a/lustre/obdclass/kernelcomm.c
+++ b/lustre/obdclass/kernelcomm.c
@@ -194,6 +194,14 @@ int libcfs_kkuc_group_put(int group, void *payload)
        ENTRY;
 
        down_write(&kg_sem);
+
+       if (unlikely(kkuc_groups[group].next == NULL) ||
+           unlikely(OBD_FAIL_CHECK(OBD_FAIL_MDS_HSM_CT_REGISTER_NET))) {
+               /* no agent have fully registered, CDT will retry */
+               up_write(&kg_sem);
+               RETURN(-EAGAIN);
+       }
+
        list_for_each_entry(reg, &kkuc_groups[group], kr_chain) {
                if (reg->kr_fp != NULL) {
                        rc = libcfs_kkuc_msg_put(reg->kr_fp, payload);
Comment by Gerrit Updater [ 28/Apr/17 ]

John L. Hammond (john.hammond@intel.com) uploaded a new patch: https://review.whamcloud.com/26883
Subject: LU-9306 kuc: initialize kkuc_groups and module init time
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 3547e2146e14f7c112dd002694b3df3bc9d95a35

Comment by Gerrit Updater [ 05/May/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/26883/
Subject: LU-9306 kuc: initialize kkuc_groups at module init time
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 37bbe9a1bf33b77d3f239ea69ce052e99f09308d

Comment by Peter Jones [ 05/May/17 ]

Am I right in thinking that the patch just landed to master will address the blocking aspect of this ticket and that any other work discussed will be tracked elsewhere?

Comment by Quentin Bouget [ 05/May/17 ]

I will open a new ticket to track the work on trap registration in sanity-hsm.

Just from the commit messages of the other patches (disable sanity-hsm test 24d and more debug info for hsm test_24d), I don't think they are needed anymore.

On the other hand, the patch that adds debug info (and was merged) should probably be reverted once we are confident enough the issue was solved. I don't know how this revert should be tracked though.

Comment by John Hammond [ 05/May/17 ]

Peter,

> Am I right in thinking that the patch just landed to master will address the blocking aspect of this ticket and that any other work discussed will be tracked elsewhere?

Correct.

Generated at Sat Feb 10 02:25:01 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.