[LU-10734] sanity test_160g: User cl8 still found in changelog_users Created: 27/Feb/18  Updated: 19/Jul/18  Resolved: 18/Jul/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0
Fix Version/s: Lustre 2.12.0

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Bruno Faccini (Inactive)
Resolution: Fixed Votes: 0
Labels: sles12, suse

Issue Links:
Related
is related to LU-10680 MDT becoming unresponsive in 2.10.3 Resolved
is related to LU-11161 sanity test_160g fails for DNE with '... Resolved
is related to LU-9624 enable sanity.sh test_160a failures f... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

sanity test_160g - User cl8 still found in changelog_users

This issue was created by maloo for Bob Glossman <bob.glossman@intel.com>

This issue relates to the following test suite run:
https://testing.hpdd.intel.com/test_sets/5a8495f4-1bfa-11e8-a6ad-52540065bddc
https://testing.hpdd.intel.com/test_sets/34e243bc-1be3-11e8-a7cd-52540065bddc

test_160g failed with the following error:

User cl8 still found in changelog_users

This may be a dup of LU-9624
I can't tell if it is so I am raising a fresh ticket.
Will let somebody else decide if it's a dup or not.



 Comments   
Comment by Bob Glossman (Inactive) [ 28/Feb/18 ]

more on master:
https://testing.hpdd.intel.com/test_sets/9ac85f8a-1cac-11e8-a7cd-52540065bddc
https://testing.hpdd.intel.com/test_sets/e7203b64-1cac-11e8-a7cd-52540065bddc
https://testing.hpdd.intel.com/test_sets/dc78ca60-1cba-11e8-a6ad-52540065bddc
https://testing.hpdd.intel.com/test_sets/27be5f90-1cc9-11e8-bd00-52540065bddc
https://testing.hpdd.intel.com/test_sets/1645ec0c-1ccd-11e8-a7cd-52540065bddc
https://testing.hpdd.intel.com/test_sets/fc6956e8-1d62-11e8-a10a-52540065bddc

Comment by Bob Glossman (Inactive) [ 28/Feb/18 ]

These failures are only seen on master, probably because sanity test_160g only exists on master.

Comment by Bruno Faccini (Inactive) [ 02/Mar/18 ]

Well, it is strange that it seems to fail only when running on SLES, and it looks like it started failing after my first patch for LU-10680 (https://review.whamcloud.com/31347/) landed. That patch was supposed to fix an issue with my previous patch for LU-7340, which introduced the sanity/test_160[f,g] tests! I think I need to reproduce and debug it along those lines.

Comment by Bob Glossman (Inactive) [ 05/Mar/18 ]

more on master:
https://testing.hpdd.intel.com/test_sets/f9311b56-209a-11e8-a4b1-52540065bddc
https://testing.hpdd.intel.com/test_sets/c68848ae-20b4-11e8-9ec4-52540065bddc
https://testing.hpdd.intel.com/test_sets/50a68744-20d3-11e8-a6ca-52540065bddc
https://testing.hpdd.intel.com/test_sets/9d2125f0-2157-11e8-9ec4-52540065bddc
https://testing.hpdd.intel.com/test_sets/bbef9e12-2184-11e8-9ec4-52540065bddc

Comment by Bob Glossman (Inactive) [ 05/Mar/18 ]

After many similar failures, here is a SLES test run that did NOT hit the failure:
https://testing.hpdd.intel.com/test_sessions/ba2847b6-e445-448d-882d-356fca02b96e

I don't know what differs between the runs that fail and those that don't.
I am noting this passing run in the hope that it may be of some use.

Comment by Bob Glossman (Inactive) [ 06/Mar/18 ]

Here is a similar failure seen on el7, not on SLES at all.
This shows the problem isn't SLES-only.

https://testing.hpdd.intel.com/test_sets/b3cb95da-20dd-11e8-a4b1-52540065bddc

Comment by Mikhail Pershin [ 06/Mar/18 ]

+1 on master, all with DNE
testing.hpdd.intel.com/test_sessions/7a5adbc7-2d4b-425a-9e71-a4674823a0df
testing.hpdd.intel.com/test_sessions/a9ae8e29-d45d-49b6-a639-a6fba84f5dfc

Comment by Bruno Faccini (Inactive) [ 07/Mar/18 ]

Having taken a closer look at the recent changes that may have introduced this regression, I think that "a37134d LU-9624 tests: fix pre-DNE test exceptions/llog usage" is the more likely cause.

I hope to have more on this soon.

Comment by Andreas Dilger [ 07/Mar/18 ]

It looks like this failure relates to the landing of patch https://review.whamcloud.com/27535 "LU-9624 tests: fix pre-DNE test exceptions/llog usage". It didn't fail during normal testing, but I guess SLES is not part of regular testing.

Comment by Andreas Dilger [ 07/Mar/18 ]

Note also that with patch https://review.whamcloud.com/31552 "LU-10680 mdd: disable changelog garbage collection by default" test_160f and test_160g need to be modified to set changelog_gc=1 at the start of each test, and remove the tests from ALWAYS_EXCEPT so that the tests will run properly.

Comment by Peter Jones [ 07/Mar/18 ]

> It didn't fail during normal testing, but I guess SLES is not part of regular testing.

Well, it is tested regularly, but due to the round robin system used for pre-landing review test runs, it is not guaranteed to run before everything lands unless people proactively request this with test parameters.

Comment by Bruno Faccini (Inactive) [ 08/Mar/18 ]

After taking some time to think about it, I wonder whether the only regression/side effect of patch https://review.whamcloud.com/27535 ("a37134d LU-9624 tests: fix pre-DNE test exceptions/llog usage"), which we strongly suspect to be the cause of these failures, is that it slightly reduced the elapsed time of the sanity/test_160g prologue. The garbage-collection thread has just run in sanity/test_160f (when sanity.sh is executed in full during auto-tests), and the prologue may now take less than the configured 2-second minimum interval between two garbage-collection thread runs ("changelog_min_gc_interval=2").
This seems to be confirmed by my reproducer testing.

So a simple "sleep 2" at the beginning of sanity/test_160g should fix this problem.
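
To illustrate the suspected timing issue, here is a minimal toy model in Python. It is not Lustre code: the class ChangelogGC, its maybe_run method, and the helper gc_runs_in_160g are hypothetical names made up for this sketch. It also assumes that a garbage-collection request arriving before the minimum interval has elapsed is simply skipped. The only value taken from the ticket is the 2-second interval ("changelog_min_gc_interval=2").

```python
from dataclasses import dataclass


@dataclass
class ChangelogGC:
    """Toy model of a rate-limited garbage collector (hypothetical, not Lustre code)."""

    min_interval: float               # models changelog_min_gc_interval, in seconds
    last_run: float = float("-inf")   # time of the previous GC run; none yet

    def maybe_run(self, now: float) -> bool:
        """Run GC only if min_interval has elapsed since the previous run."""
        if now - self.last_run < self.min_interval:
            return False              # too soon: this GC request is skipped
        self.last_run = now
        return True


def gc_runs_in_160g(prologue_secs: float, extra_sleep: float = 0.0,
                    min_interval: float = 2.0) -> bool:
    """Return True if the GC requested by the second test actually runs.

    Time 0.0 stands for the GC run at the end of test_160f (an assumption
    of this sketch); the second test requests GC after its prologue plus
    any extra sleep.
    """
    gc = ChangelogGC(min_interval=min_interval)
    gc.maybe_run(now=0.0)                         # GC fires in test_160f
    return gc.maybe_run(now=prologue_secs + extra_sleep)


if __name__ == "__main__":
    print(gc_runs_in_160g(3.0))                   # slow prologue
    print(gc_runs_in_160g(1.5))                   # fast prologue
    print(gc_runs_in_160g(1.5, extra_sleep=2.0))  # fast prologue plus "sleep 2"
```

In this model a prologue shorter than the interval leaves the second GC request unserved, while adding a sleep equal to the interval guarantees it runs however fast the prologue is.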

Comment by Gerrit Updater [ 09/Mar/18 ]

Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: https://review.whamcloud.com/31604
Subject: LU-10734 tests: ensure current GC interval is over
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 72032b016ea8ab62cc681e72b5565ba207a6c316

Comment by Bob Glossman (Inactive) [ 20/Mar/18 ]

This failure is masked for now: test 160g was added to ALWAYS_EXCEPT in a patch that landed on master for LU-10680. We may need to watch for similar failures if and when test 160g is taken back out of ALWAYS_EXCEPT.

Comment by Gerrit Updater [ 18/Jul/18 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/31604/
Subject: LU-10734 tests: ensure current GC interval is over
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 821087e65882a9885964ed07d6f2a630dfb599d5

Comment by Peter Jones [ 18/Jul/18 ]

Landed for 2.12

Generated at Sat Feb 10 02:37:43 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.