[LU-10149] sanityn test_23: timeout after test_18 mmap_sanity takes thousands of seconds Created: 23/Oct/17  Updated: 26/Jan/18  Resolved: 26/Jan/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0
Fix Version/s: Lustre 2.11.0

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Bruno Faccini (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
is duplicated by LU-10184 sanityn test_18: hung after mmap test_2 Open
Related
is related to LU-1205 sanityn.sh test_18 mmap_sanity someti... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for jianyu <jian.yu@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/99cbc9cc-b407-11e7-9eee-5254006e85c2.

The sub-test test_23 failed with the following error:

Timeout occurred after 151 mins, last suite running was sanityn, restarting cluster to continue tests


Info required for matching: sanityn 23



 Comments   
Comment by Jian Yu [ 23/Oct/17 ]

Another failure instance on master branch:
https://testing.hpdd.intel.com/test_sets/3c3a6f64-b58f-11e7-8afb-52540065bddc

Comment by Gerrit Updater [ 23/Oct/17 ]

Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: https://review.whamcloud.com/29715
Subject: LU-10149 tests: avoid live-lock with concurrent memsets
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 6679ad023c6c63bb9495eed570e8dd9ad06f3e9a

Comment by Bruno Faccini (Inactive) [ 23/Oct/17 ]

Change #29715 is an attempt to split and delay the concurrent memsets, in order to reduce the execution time of auto-tests that use mmap_sanity, where a live-lock situation is likely to occur.
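The idea of splitting and delaying a memset can be sketched roughly as follows. This is only an illustration and not the code from the Gerrit change: the helper name chunked_memset, its parameters, and the chunk and delay values are all hypothetical.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/*
 * Hypothetical helper (not taken from the actual patch): fill `len` bytes
 * of `buf` with `c` in pieces of at most `chunk` bytes, sleeping `delay_ns`
 * nanoseconds between pieces so that a peer touching the same mapping
 * gets a window in which to make progress.
 * Returns the number of pieces written.
 */
static size_t chunked_memset(void *buf, int c, size_t len,
			     size_t chunk, long delay_ns)
{
	unsigned char *p = buf;
	size_t chunks = 0;

	assert(chunk > 0);
	assert(delay_ns >= 0 && delay_ns < 1000000000L);

	while (len > 0) {
		size_t n = len < chunk ? len : chunk;

		memset(p, c, n);
		p += n;
		len -= n;
		chunks++;

		/* Pause only between pieces, never after the last one. */
		if (len > 0 && delay_ns > 0) {
			struct timespec ts = { .tv_sec = 0, .tv_nsec = delay_ns };

			/* An early wake-up from a signal is ignored here. */
			nanosleep(&ts, NULL);
		}
	}
	return chunks;
}
```

A test would call something like chunked_memset(addr, 0x55, map_len, 4096, 1000) on the mmap()ed region instead of a single memset() over the whole range; those argument values are placeholders, not tuned numbers.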

Comment by Andreas Dilger [ 08/Nov/17 ]

While this is a workaround for the test failures (which should land, don't get me wrong), it would be better to have some way to ensure that the client is at least making some forward progress when faulting in pages, so that applications using mmap don't fall over so badly.

One option would be to use a delay mechanism in the DLM lock cancellation, so that mmap locks cannot be cancelled within, say, 10ms of being granted or last modified. Secondly, in the two-node mmap case (and in other DLM lock ping-pong cases), we should look at reducing extent lock expansion (e.g. to 1MB) so that multiple writers are not causing needless lock contention. That would allow one node to get some work done, and hopefully move out of the IO range of the other node so they can work independently.
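The two suggestions above can be sketched as standalone C. This is a toy model under stated assumptions, not Lustre code: the demo_lock and demo_extent types, both function names, and the reading of "expansion to 1MB" as "grow to the enclosing 1MB-aligned window" are all hypothetical; only the 10ms and 1MB figures come from the comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_MIN_HOLD_NS (10ULL * 1000 * 1000)	/* 10ms, as suggested above */
#define DEMO_MAX_EXPAND  (1ULL << 20)		/* 1MB, as suggested above */

/* Hypothetical lock record: timestamps in nanoseconds. */
struct demo_lock {
	uint64_t granted_ns;
	uint64_t last_used_ns;
};

/* Hypothetical extent: inclusive byte range [start, end]. */
struct demo_extent {
	uint64_t start;
	uint64_t end;
};

/*
 * Refuse cancellation until the lock has been held for the minimum time
 * since it was granted or last used, whichever is more recent.
 */
static bool demo_lock_may_cancel(const struct demo_lock *lk, uint64_t now_ns)
{
	uint64_t recent = lk->granted_ns > lk->last_used_ns ?
			  lk->granted_ns : lk->last_used_ns;

	assert(now_ns >= recent);
	return now_ns - recent >= DEMO_MIN_HOLD_NS;
}

/*
 * Grow a requested extent only to the 1MB-aligned window(s) containing it,
 * instead of expanding it as far as possible.
 */
static struct demo_extent demo_expand_extent(struct demo_extent req)
{
	struct demo_extent out;

	assert(req.start <= req.end);
	out.start = req.start - (req.start % DEMO_MAX_EXPAND);
	out.end = req.end | (DEMO_MAX_EXPAND - 1);
	return out;
}
```

With these definitions, a lock last used 7ms ago is not yet cancellable, and two writers working in different 1MB windows of the same file end up with non-overlapping extents.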

Comment by Bruno Faccini (Inactive) [ 09/Nov/17 ]

Andreas,
According to my heavy testing results, my current fix attempt (splitting and delaying concurrent memsets) in change #29715 is not a 100% workaround.
So I was already looking for a more definitive fix, and delaying DLM lock cancellation had also seemed a good approach to me. I should push a new patch implementing this soon.
I will also investigate reducing extent lock expansion, as you pointed out, thanks.

Comment by Jian Yu [ 21/Nov/17 ]

More failure instances on master branch:
https://testing.hpdd.intel.com/test_sets/b059527a-ce17-11e7-9c63-52540065bddc
https://testing.hpdd.intel.com/test_sets/6ddc3400-ce74-11e7-9c63-52540065bddc
https://testing.hpdd.intel.com/test_sets/cae1c010-cbc3-11e7-a066-52540065bddc

Comment by Gerrit Updater [ 11/Dec/17 ]

Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: https://review.whamcloud.com/30465
Subject: LU-10149 llite: avoid live-lock when concurrent mmap()s
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 783df45c6faf25c51add9d2965604599ead6f0d5

Comment by Bruno Faccini (Inactive) [ 11/Dec/17 ]

Change #30465 is an attempt to fix, in a more generic way, the live-lock that occurs with concurrent mmap()s.

Comment by Gerrit Updater [ 22/Dec/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30465/
Subject: LU-10149 llite: avoid live-lock when concurrent mmap()s
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: cd793b4067b62862185f481cfef7b779927b801f

Comment by Bruno Faccini (Inactive) [ 16/Jan/18 ]

Change #29715 has been abandoned in favor of more generic change #30465.
