[LU-10149] sanityn test_23: timeout after test_18 mmap_sanity takes thousands of seconds Created: 23/Oct/17 Updated: 26/Jan/18 Resolved: 26/Jan/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.11.0 |
| Fix Version/s: | Lustre 2.11.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | Bruno Faccini (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for jianyu <jian.yu@intel.com>
This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/99cbc9cc-b407-11e7-9eee-5254006e85c2.
The sub-test test_23 failed with the following error:
Timeout occurred after 151 mins, last suite running was sanityn, restarting cluster to continue tests
Please provide additional information about the failure here.
Info required for matching: sanityn 23 |
| Comments |
| Comment by Jian Yu [ 23/Oct/17 ] |
|
More failure instances on master branch: |
| Comment by Gerrit Updater [ 23/Oct/17 ] |
|
Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: https://review.whamcloud.com/29715 |
| Comment by Bruno Faccini (Inactive) [ 23/Oct/17 ] |
|
Change #29715 attempts to split and delay the concurrent memsets in order to reduce the execution time of the auto-tests that use mmap_sanity, where a live-lock situation is likely to occur. |
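As a rough illustration of the split-and-delay idea described above, here is a minimal sketch. It is not the content of the Gerrit change: the helper name `chunked_memset`, the chunk size, and the pause length are all assumptions made for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper (not taken from the actual patch): fill `len` bytes
 * at `buf` in pieces of at most `chunk` bytes, sleeping `pause_us`
 * microseconds between pieces so that a competing writer on another
 * client gets a window to make progress. Returns the number of pieces. */
static size_t chunked_memset(void *buf, int c, size_t len,
                             size_t chunk, long pause_us)
{
    unsigned char *p = buf;
    size_t pieces = 0;

    assert(chunk > 0);
    while (len > 0) {
        size_t n = len < chunk ? len : chunk;

        memset(p, c, n);
        p += n;
        len -= n;
        pieces++;

        /* Pause only between pieces, not after the last one. */
        if (len > 0 && pause_us > 0) {
            struct timespec ts = {
                .tv_sec = pause_us / 1000000,
                .tv_nsec = (pause_us % 1000000) * 1000L,
            };
            nanosleep(&ts, NULL);
        }
    }
    return pieces;
}
```

In a test such as mmap_sanity, a helper like this would stand in for a single large memset over the mapped region; the right chunk size and pause would have to be tuned against the observed lock ping-pong.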
| Comment by Andreas Dilger [ 08/Nov/17 ] |
|
While this is a workaround for the test failures (which should land, don't get me wrong), it would be better to have some way to ensure that the client is at least making some forward progress when faulting in pages, so that applications using mmap don't fall over so badly. One option would be to use a delay mechanism in the DLM lock cancellation, so that mmap locks cannot be cancelled within, say, 10ms of being granted or last modified. Secondly, in the two-node mmap case (and in other DLM lock ping-pong cases), we should look at reducing extent lock expansion (e.g. to 1MB) so that multiple writers are not causing needless lock contention. That would allow one node to get some work done, and hopefully move out of the IO range of the other node so they can work independently. |
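The two suggestions above can be sketched as follows. This is only an illustration under stated assumptions: `struct sketch_lock`, `may_cancel`, `expand_extent`, and both constants are hypothetical names invented here and are not Lustre LDLM APIs. The 10ms and 1MB values are the examples given in the comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MIN_HOLD_NS (10ULL * 1000 * 1000) /* 10ms minimum hold time */
#define MAX_EXPAND  (1ULL << 20)          /* 1MB expansion granularity */

/* Hypothetical stand-in for a lock; not the real ldlm_lock structure. */
struct sketch_lock {
    uint64_t last_use_ns; /* time the lock was granted or last modified */
};

/* Refuse cancellation until the holder has had MIN_HOLD_NS to make
 * forward progress. A clock that appears to run backwards is treated
 * conservatively as "too early". */
static bool may_cancel(const struct sketch_lock *lk, uint64_t now_ns)
{
    if (now_ns < lk->last_use_ns)
        return false;
    return now_ns - lk->last_use_ns >= MIN_HOLD_NS;
}

/* Grow a requested inclusive byte extent [*start, *end] only to the
 * enclosing 1MB-aligned range, rather than to the whole file, so that
 * writers working in different 1MB regions do not contend. */
static void expand_extent(uint64_t *start, uint64_t *end)
{
    assert(*start <= *end);
    *start -= *start % MAX_EXPAND;
    *end |= MAX_EXPAND - 1;
}
```

With these limits, a page fault at offset 4096 would be granted the extent [0, 1MB - 1], and a blocking callback arriving less than 10ms after the grant would be deferred instead of cancelling the lock immediately.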
| Comment by Bruno Faccini (Inactive) [ 09/Nov/17 ] |
|
Andreas, |
| Comment by Jian Yu [ 21/Nov/17 ] |
|
More failure instances on master branch: |
| Comment by Gerrit Updater [ 11/Dec/17 ] |
|
Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: https://review.whamcloud.com/30465 |
| Comment by Bruno Faccini (Inactive) [ 11/Dec/17 ] |
|
Change #30465 is an attempt to fix, in a more generic way, the live-lock that occurs with concurrent mmap()s. |
| Comment by Gerrit Updater [ 22/Dec/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30465/ |
| Comment by Bruno Faccini (Inactive) [ 16/Jan/18 ] |
|
Change #29715 has been abandoned in favor of more generic change #30465. |