[LU-12428] sanity-sec: test_13 nodemap_del failed with 1 Created: 12/Jun/19  Updated: 12/Sep/19  Resolved: 12/Sep/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0
Fix Version/s: Lustre 2.13.0

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Sebastien Buisson
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Li Xi <pkuelelixi@gmail.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/2fe8e80e-8cce-11e9-abe3-52540065bddc

By checking the test logs, we can see that in test_13, after running "lctl nodemap_del 48714_1", the test script immediately checks in delete_nodemaps() of sanity-sec.sh whether the nodemap has been deleted. However, "lctl get_param nodemap.48714_1.id" still prints a result, which delete_nodemaps() does not expect, so it exits with an error and test_13 is reported as failed.

test_14 and test_15 failed too, but that is a consequence of the test_13 failure: in test_13, delete_nodemaps() did not remove the remaining nodemaps after 48714_1, so the subsequent nodemap_add of 48714_2 fails.

I think we need some improvements here. test_13, test_14 and test_15 are unrelated to each other, so before running these test cases, delete_nodemaps() needs to delete any existing nodemaps to avoid failure; see the sketch below.
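A minimal sketch of the delete-then-check pattern and a retry-based alternative, assuming plain lctl calls; the nodemap name and the polling loop are illustrative, not the exact code in sanity-sec.sh:

    NM=48714_1

    # delete the nodemap
    lctl nodemap_del $NM || exit 1

    # checking immediately can race with nodemap synchronization, so
    # "lctl get_param nodemap.$NM.id" may still print a result for a
    # short time; a more robust check polls until the parameter is gone
    deleted=false
    for i in $(seq 1 20); do
        if ! lctl get_param -n nodemap.$NM.id >/dev/null 2>&1; then
            deleted=true
            break
        fi
        sleep 1
    done
    $deleted || { echo "nodemap $NM still visible after nodemap_del"; exit 1; }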



 Comments   
Comment by James Nunez (Inactive) [ 03/Jul/19 ]

We’re seeing a varying number of sanity-sec tests fail with “nodemap_del failed with 1” followed by several tests failing with “nodemap_add failed with 1”. This appears to have started on June 9, 2019, with Lustre version 2.12.54.52, and only for review-dne-zfs-part-2 and review-dne-part-2.

Some examples of failures are:
  • https://testing.whamcloud.com/test_sets/c3c90e88-8ad4-11e9-a77a-52540065bddc and https://testing.whamcloud.com/test_sets/b81989e4-8c52-11e9-9bb5-52540065bddc - sanity-sec test_7 fails with “nodemap_del failed with 1” and then tests 8, 9, 10a, 11, 12, 13, 14, 15 fail with “nodemap_add failed with 1”

  • https://testing.whamcloud.com/test_sets/cdd12596-9c81-11e9-8dbe-52540065bddc - sanity-sec test_8 fails with “nodemap_del failed with 1” and then tests 9, 10a, 11, 12, 13, 14, 15 fail with “nodemap_add failed with 1”

Comment by Peter Jones [ 03/Jul/19 ]

Sebastien can you please investigate?

Comment by Gerrit Updater [ 04/Jul/19 ]

Sebastien Buisson (sbuisson@ddn.com) uploaded a new patch: https://review.whamcloud.com/35418
Subject: LU-12428 tests: add traces to delete_nodemaps
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b5bca15d9ea3c8b8032e9792cf4f668abaa85a16

Comment by Gerrit Updater [ 05/Jul/19 ]

Sebastien Buisson (sbuisson@ddn.com) uploaded a new patch: https://review.whamcloud.com/35421
Subject: LU-12428 tests: wait for nodemaps to be synchronized
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1c65b0984c0cca5d194d33250e75e52e55764e9a

Comment by Gerrit Updater [ 15/Aug/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35421/
Subject: LU-12428 tests: wait for nodemaps to be synchronized
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 654cbf3c9d4daa7def0ba52b45b2b8ead39fcf90

Comment by Peter Jones [ 15/Aug/19 ]

Landed for 2.13

Comment by Jian Yu [ 28/Aug/19 ]

The failure occurred 5 times on master branch last week:
https://testing.whamcloud.com/test_sets/969517b8-c9a9-11e9-9fc9-52540065bddc
https://testing.whamcloud.com/test_sets/bc4a1c04-c843-11e9-a25b-52540065bddc
https://testing.whamcloud.com/test_sets/eeb69820-c811-11e9-a2b6-52540065bddc

Comment by Sebastien Buisson [ 29/Aug/19 ]

Hmm, comparing the test log from one of the recent failures (https://testing.whamcloud.com/test_sets/969517b8-c9a9-11e9-9fc9-52540065bddc) with the test log from patch https://review.whamcloud.com/35421/ when it passed Maloo (https://testing.whamcloud.com/sub_tests/7f7dee14-9f70-11e9-9e3d-52540065bddc), it appears that there is no message such as "On MGS 10.9.4.124, 40996_0.id = nodemap.40996_0.id=1" in the failure case.
This means wait_nm_sync did not do its job, possibly because the empty third parameter is not handled properly. I will push a patch to make that more robust.
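As a hedged illustration of the empty-parameter pitfall (the helper below is hypothetical, not the actual wait_nm_sync from sanity-sec.sh):

    # hypothetical helper: the expected value may legitimately be empty
    # (the 'inactive' case), so it must not be allowed to shift away
    check_sync() {
        local nodemap=$1
        local key=$2
        local expected=$3   # intentionally empty for inactive nodemaps
        local facet=$4

        # if a caller writes: check_sync 48714_1 id $value mgs
        # with $value empty and unquoted, "mgs" becomes $3 and the
        # comparison below tests the wrong argument
        [ "$(lctl get_param -n nodemap.$nodemap.$key 2>/dev/null)" = "$expected" ]
    }

    # robust call sites quote the possibly-empty argument explicitly:
    check_sync 48714_1 id "" mgs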

Comment by Gerrit Updater [ 29/Aug/19 ]

Sebastien Buisson (sbuisson@ddn.com) uploaded a new patch: https://review.whamcloud.com/35990
Subject: LU-12428 tests: robustify 'inactive' option of wait_nm_sync
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c88a2d5a4a6c576ca79d71e5e331811246a94605

Comment by Sebastien Buisson [ 30/Aug/19 ]

Oh, I finally found the reason behind this strange behavior of wait_nm_sync() in sanity-sec.sh.
In fact, after a bad rebase, patch https://review.whamcloud.com/34090 introduced a second, different version of the wait_nm_sync() function into the same script, sanity-sec.sh.
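A minimal illustration of why a duplicate definition matters, assuming two copies of the function end up in the same script:

    # in shell, whichever function definition appears later in the
    # script silently replaces the earlier one, so the tests can end
    # up running the wrong version without any warning
    wait_nm_sync() { echo "original version"; }
    wait_nm_sync() { echo "duplicate introduced by the bad rebase"; }

    wait_nm_sync    # prints: duplicate introduced by the bad rebase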

Comment by Gerrit Updater [ 30/Aug/19 ]

Sebastien Buisson (sbuisson@ddn.com) uploaded a new patch: https://review.whamcloud.com/36009
Subject: LU-12428 tests: fix sanity-sec wait_nm_sync
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 83526c493048e6f2a1ddf0bbf7cdf48d40982d16

Comment by Gerrit Updater [ 12/Sep/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36009/
Subject: LU-12428 tests: fix sanity-sec wait_nm_sync
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: ab398920fc20986a8ec686cad984f0cf0145a8d9

Comment by Peter Jones [ 12/Sep/19 ]

Landed for 2.13
