[LU-9306] sanity-hsm test 24d is failing with 'request on 0x200000405:0x24:0x0 is not SUCCEED on mds1' - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Blocker
Fix Version/s: Lustre 2.10.0
Affects Version/s: Lustre 2.10.0
Labels:
None

Severity:
3
Rank (Obsolete):
9223372036854775807

Description

sanity-hsm test_24d is failing. From the test log, we wait 200 seconds for the ARCHIVE request to reach SUCCEED, but it stays in STARTED:

CMD: onyx-39vm7 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000405:0x24:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
CMD: onyx-39vm7 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000405:0x24:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
Update not seen after 200s: wanted 'SUCCEED' got 'STARTED'
 sanity-hsm test_24d: @@@@@@ FAIL: request on 0x200000405:0x24:0x0 is not SUCCEED on mds1
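
For readers unfamiliar with this kind of check, the log above is the output of a poll-with-timeout loop: the same command is re-run until its output matches the wanted value or the time limit expires. Below is a minimal bash sketch of that pattern. `poll_until` is a hypothetical helper written for illustration; it is not the function used by the Lustre test framework, and the real check runs `lctl get_param` on the MDS rather than the placeholder command shown in the usage note.

```shell
#!/bin/bash
# Sketch only: poll_until is a hypothetical helper, not test-framework code.
# Usage: poll_until EXPECTED TIMEOUT_SECONDS COMMAND [ARGS...]
poll_until() {
    local expected="$1" timeout="$2"
    shift 2
    local waited=0 got=""

    while :; do
        # Re-run the command and compare its output to the wanted value.
        got=$("$@")
        [ "$got" = "$expected" ] && return 0

        # Give up once the time limit has been reached.
        if (( waited >= timeout )); then
            echo "Update not seen after ${waited}s: wanted '$expected' got '$got'"
            return 1
        fi

        sleep 1
        waited=$(( waited + 1 ))
    done
}
```

For example, `poll_until SUCCEED 200 some_status_command` (where `some_status_command` is a placeholder) would return non-zero and print a message like the one in the log if the status never leaves STARTED.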

There is nothing obviously wrong in the console logs for any of the nodes.

The copytool_log for this test is nearly empty and doesn’t provide any information on what is causing this problem. The full copytool_log for this test is:

1491012689.288932 lhsmtool_posix[24069]: action=0 src=(null) dst=(null) mount_point=/mnt/lustre3
1491012689.334255 lhsmtool_posix[24070]: waiting for message from kernel
exiting: Terminated

This test failure could be leading to a cascade of failures. After test 24d fails, the following tests also fail: 24e, 24f, 25b, 26, 27b, 28, 29b, 29c, 30b, 30c, 31b, and many more. I don’t know if all the failures are related, but we should clean up the first test that’s failing.

So far, I’ve only seen this test fail for review-dne-part-2. So, the issue may be DNE related?

This test started to fail on the master branch on 2017-03-25 and has failed about 19 times since then. The patch for LU-8911, https://review.whamcloud.com/#/c/24185/, is the last patch that made modifications to this test and sanity-hsm.

Here are links to some of the failed test logs:
2017-04-06 - https://testing.hpdd.intel.com/test_sets/81096390-1ae7-11e7-9073-5254006e85c2
2017-04-05 - https://testing.hpdd.intel.com/test_sets/ad0ce212-1a3f-11e7-9de9-5254006e85c2
2017-04-05 - https://testing.hpdd.intel.com/test_sets/28ab074e-19ed-11e7-b742-5254006e85c2
2017-04-05 - https://testing.hpdd.intel.com/test_sets/2bd0287a-19cd-11e7-8920-5254006e85c2
2017-04-04 - https://testing.hpdd.intel.com/test_sets/550c4e1a-1952-11e7-9de9-5254006e85c2
2017-04-03 - https://testing.hpdd.intel.com/test_sets/d986e31e-18c9-11e7-8920-5254006e85c2


Activity


Steve Guminski (Inactive) added a comment - 21/Apr/17 11:57 AM

Another on master:

https://testing.hpdd.intel.com/test_sessions/991eb176-e246-4d8f-bcca-2a62fc2e179f


Gerrit Updater added a comment - 21/Apr/17 7:23 AM

Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/26770
Subject: LU-9306 tests: more debug info for hsm test_24d
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 8c7c03ff53e5c5f63da6e99f42c3124c4f5c2d29


nasf (Inactive) added a comment - 21/Apr/17 7:22 AM

Honestly, I cannot establish a relationship between the current sanity-hsm test_24d failure and the snapshot patches. The only possible connection is that test_24d checks HSM actions with a client mounted read-only, which might be affected by the patch "0001-LU-8900-snapshot-simulate-readonly-device.patch" (https://review.whamcloud.com/24267). But in fact, that patch has almost nothing to do with HSM. So I have to make a debug patch to collect more information.


Gerrit Updater added a comment - 19/Apr/17 12:04 PM

Quentin Bouget (quentin.bouget@cea.fr) uploaded a new patch: https://review.whamcloud.com/26734
Subject: LU-9306 tests: sanity-hsm, register traps in a better order
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: abfc0251f892862145c7576912640cce841fd7ba


Peter Jones added a comment - 17/Apr/17 5:38 PM

Fan Yong

Could this have been caused by any of the ZFS Snapshots patches?

Peter


Andreas Dilger added a comment - 17/Apr/17 5:03 PM

The first recent failure was on 2017-03-25, in test https://testing.hpdd.intel.com/test_sets/82d3a2da-115f-11e7-9073-5254006e85c2, which was right after the ZFS Snapshot feature landed on 2017-03-23. It makes sense to look at the patches that landed on that day to see if any of them could have caused this.


Andreas Dilger added a comment - 17/Apr/17 4:22 PM

This test has failed 25x in the past week, as often as 7x in a single day.


Quentin Bouget (Inactive) added a comment - 12/Apr/17 5:26 PM

Thank you!


James Nunez (Inactive) added a comment - 12/Apr/17 4:30 PM

Quentin,

Both sanity-hsm test 9a and test 29d are skipped because they require three or more clients. I reviewed results for these tests for the past two years and found that the last time test 9a was not skipped was September 9, 2015; it passed 13 times between February and September 2015. Test 29d was skipped for this whole period.


Quentin Bouget (Inactive) added a comment - 12/Apr/17 4:13 PM

James,

Could you tell me when test_9a and test_29d were last run, please?


Quentin Bouget (Inactive) added a comment - 12/Apr/17 3:58 PM

The cleanup is not done correctly because test_24d calls

trap cleanup_test_24d EXIT

before

copytool_setup $SINGLEAGT "$MOUNT3"

which sets its own trap on EXIT.

I have a function to stack traps, and I am working on integrating it into sanity-hsm.

As to why the test itself fails, I must say I still have no idea.
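
The trap problem described above can be reproduced in a few lines of bash. The sketch below first shows that a second plain `trap ... EXIT` replaces the first handler, then shows one possible way to stack handlers so that both run. The names `push_cleanup`, `run_cleanups`, `plain_traps`, and `stacked_traps` are hypothetical and exist only for this illustration; this is not the trap-stacking function mentioned in the comment, and the `echo` commands stand in for the real cleanup functions.

```shell
#!/bin/bash
# Sketch only: all helper names here are hypothetical, not sanity-hsm code.

CLEANUP_STACK=()

# Run every registered handler, most recently registered first.
run_cleanups() {
    local i
    for (( i = ${#CLEANUP_STACK[@]} - 1; i >= 0; i-- )); do
        eval "${CLEANUP_STACK[i]}"
    done
    CLEANUP_STACK=()
}

# Register a handler without discarding the ones registered earlier.
push_cleanup() {
    CLEANUP_STACK+=("$1")
    trap run_cleanups EXIT
}

# Problem: the second plain "trap ... EXIT" replaces the first handler,
# so the test's own cleanup never runs. The body runs in a subshell so
# that the EXIT trap fires when the function returns.
plain_traps() (
    trap 'echo "test cleanup"' EXIT
    trap 'echo "copytool cleanup"' EXIT
)

# With the stack, both handlers run when the subshell exits.
stacked_traps() (
    push_cleanup 'echo "test cleanup"'
    push_cleanup 'echo "copytool cleanup"'
)

plain_traps
stacked_traps
```

Calling `plain_traps` prints only the copytool line, which mirrors the missing `cleanup_test_24d` call, while `stacked_traps` prints the copytool line followed by the test line.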


People

Assignee: John Hammond

Reporter: James Nunez (Inactive)

Votes: 0

Watchers: 8

Dates

Created: 07/Apr/17 7:30 PM

Updated: 29/May/17 5:52 AM

Resolved: 05/May/17 3:21 AM