[LU-10443] sanity - test_255c: Ladvise test 13, bad lock count, returned 100, actual 0 Created: 28/Dec/17  Updated: 02/Mar/23  Resolved: 22/Feb/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0
Fix Version/s: Lustre 2.11.0

Type: Bug
Priority: Minor
Reporter: Maloo
Assignee: Patrick Farrell (Inactive)
Resolution: Fixed
Votes: 0
Labels: None
Environment:

Rolling Upgrade/Downgrade
Servers/Clients = 2.10.56_62 lustre version
ldiskfs


Issue Links:
Related
is related to LU-10104 sanity test_255c: Ladvise test 15, ba... Resolved
is related to LU-10136 sanity test_255c: Ladvise test11 fail... Resolved
is related to LU-16608 test 255c failing in GK testing Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Saurabh Tandan <saurabh.tandan@intel.com>

This issue relates to the following test suite run:

https://testing.hpdd.intel.com/test_sets/515b6450-ebfb-11e7-8c23-52540065bddc

This issue occurred while performing rolling upgrade/downgrade testing for tag 2.10.56_62.
Both the servers and the clients are running Lustre 2.10.56_62, so both should support lockahead.

test logs:

== sanity test 255c: suite of ladvise lockahead tests ================================================ 07:51:57 (1514361117)
Starting test test10 at 1514361118
Finishing test test10 at 1514361118
Starting test test11 at 1514361118
Finishing test test11 at 1514361118
Starting test test12 at 1514361118
Finishing test test12 at 1514361119
Starting test test13 at 1514361119
Finishing test test13 at 1514361119
 sanity test_255c: @@@@@@ FAIL: Ladvise test 13, bad lock count, returned  100, actual 0 

Might be related to LU-10136 and LU-10104



 Comments   
Comment by Peter Jones [ 03/Jan/18 ]

Patrick

Could you please advise on this one?

Peter

Comment by Patrick Farrell (Inactive) [ 03/Jan/18 ]

Sure, will take a look.

Comment by Saurabh Tandan (Inactive) [ 03/Jan/18 ]

Steps followed for Rolling Upgrade testing:
1. Set up Lustre with both clients and servers at the 2.10.2 GA version.
2. Upgraded the OSS to 2.10.56_62, ran sanity.sh.
3. Upgraded the MDS to 2.10.56_62, ran sanity.sh.
4. Upgraded the clients to 2.10.56_62, ran sanity.sh, and hit this issue.

Comment by Peter Jones [ 06/Feb/18 ]

paf, when do you expect to have a chance to get to this?

Comment by Patrick Farrell (Inactive) [ 08/Feb/18 ]

Tomorrow or early next week. Sorry, I didn't realize this was urgent (it was passed on to me from the LWG today). I can try to reproduce the procedure described.

So to be clear: sanity was run with everything at 2.10.2, then the OSSes were unmounted and upgraded. I assume the file system remained up for this, i.e. this is a failover-type scenario? Then sanity was run again. Then the same for the MDS, followed by sanity.

And then finally, clients were all unmounted and upgraded, then remounted, followed by sanity, which had this failure?

I can replicate most of this, but I strongly suspect I won't hit this issue. Lockahead creates no on-disk state, only LDLM state, which should be destroyed A) when the file is deleted and B) when clients are unmounted and remounted. I think it's far more likely that we hit some other rare issue with the test than that this is upgrade related.

First I'll dig through the logs and see if I can find anything.

Comment by Patrick Farrell (Inactive) [ 09/Feb/18 ]

... woah. Well, this test should never pass and neither should any of the others, really. We unlink the file and check the lock count after that. I guess we're just consistently winning the race.

I'll get a patch generated. This is a bug in the test, if that affects urgency.
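
The ordering problem described in this comment can be illustrated with a toy model. This is only a sketch under stated assumptions, not the actual lockahead test and not the fix in the patch: `ToyLockTable`, `check_after_unlink`, and `check_before_unlink` are hypothetical names, and the model assumes the lock state is torn down immediately when the file is unlinked, whereas in the real test the outcome was a matter of timing.

```python
class ToyLockTable:
    """Toy stand-in for per-file lock state (hypothetical; not Lustre's LDLM)."""

    def __init__(self):
        self._locks = {}  # path -> number of locks held on that file

    def request_locks(self, path, n):
        # Pretend the test requested n locks on the file.
        self._locks[path] = self._locks.get(path, 0) + n

    def unlink(self, path):
        # Assumption: deleting the file destroys its lock state at once,
        # i.e. the case where the count check loses the race.
        self._locks.pop(path, None)

    def lock_count(self):
        return sum(self._locks.values())


def check_after_unlink(table, path, n):
    """Count locks only after the file is gone (the problematic order)."""
    before = table.lock_count()
    table.request_locks(path, n)
    table.unlink(path)
    actual = table.lock_count() - before
    return n, actual  # (returned, actual)


def check_before_unlink(table, path, n):
    """Count locks while the file still exists, then clean up."""
    before = table.lock_count()
    table.request_locks(path, n)
    actual = table.lock_count() - before
    table.unlink(path)
    return n, actual  # (returned, actual)


if __name__ == "__main__":
    print(check_after_unlink(ToyLockTable(), "f", 100))   # (100, 0)
    print(check_before_unlink(ToyLockTable(), "f", 100))  # (100, 100)
```

In this model the first ordering reproduces the "returned 100, actual 0" mismatch from the test log, while the second ordering reports matching counts because the count is taken while the file still exists.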

Comment by Gerrit Updater [ 09/Feb/18 ]

Patrick Farrell (paf@cray.com) uploaded a new patch: https://review.whamcloud.com/31254
Subject: LU-10443 test: Handle file lifecycle correctly
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 286e0f7839267b8bf4119ce084427ffeb20455f6

Comment by Patrick Farrell (Inactive) [ 09/Feb/18 ]

This bug is not specific to interop. Just a question of timing. Patch should resolve.

Comment by Peter Jones [ 09/Feb/18 ]

Thanks Patrick! Good to know that this is not something that would affect those using the feature for real.

Comment by Gerrit Updater [ 22/Feb/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31254/
Subject: LU-10443 test: Handle file lifecycle correctly
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: e528677e1630093362394ae36d725c321d0da4f2

Comment by Peter Jones [ 22/Feb/18 ]

Landed for 2.11
