[LU-1205] sanityn.sh test_18 mmap_sanity sometimes takes thousands of seconds to run Created: 12/Mar/12 Updated: 23/Oct/17 Resolved: 22/May/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.2.0, Lustre 2.3.0, Lustre 2.1.1 |
| Fix Version/s: | Lustre 2.3.0, Lustre 2.1.2 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Andreas Dilger | Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||||||||||
| Story Points: | 2 | ||||||||||||||||
| Severity: | 3 | ||||||||||||||||
| Rank (Obsolete): | 4230 | ||||||||||||||||
| Description |
|
Looking through the test history for sanity.sh subtest_18 it appears that the runtime of this test varies by a huge amount. Sometimes it is completing in as few as 6s or over 3600s, and sometimes failing with a timeout or causing one of the later sanityn.sh tests to time out instead. The timeouts happen most often on the orion branch, but long test runs also happen on master review. Name Status Run at Duration test_18 PASS 2012-03-12 14:22:06 UTC 13 test_18 PASS 2012-03-12 11:28:53 UTC 3800 test_18 PASS 2012-03-12 07:57:47 UTC 9 test_18 PASS 2012-03-12 06:51:38 UTC 15 test_18 PASS 2012-03-12 06:39:20 UTC 436 test_18 PASS 2012-03-12 05:55:11 UTC 90 test_18 PASS 2012-03-12 04:43:36 UTC 8 test_18 PASS 2012-03-12 03:54:46 UTC 52 test_18 PASS 2012-03-12 01:18:41 UTC 85 test_18 PASS 2012-03-12 00:34:43 UTC 1323 test_18 PASS 2012-03-12 00:34:43 UTC 1323 test_18 TIMEOUT 2012-03-12 00:25:14 UTC 3600 test_18 TIMEOUT 2012-03-11 22:52:26 UTC 3600 test_18 PASS 2012-03-11 22:26:00 UTC 49 test_18 TIMEOUT 2012-03-11 20:35:55 UTC 3600 test_18 PASS 2012-03-11 20:30:23 UTC 3778 test_18 PASS 2012-03-11 19:01:24 UTC 24 test_18 PASS 2012-03-11 15:19:46 UTC 169 test_18 PASS 2012-03-11 14:58:48 UTC 6 test_18 PASS 2012-03-11 14:16:18 UTC 10 test_18 PASS 2012-03-11 13:40:19 UTC 2249 test_18 PASS 2012-03-11 12:29:16 UTC 334 test_18 PASS 2012-03-11 12:03:12 UTC 115 test_18 PASS 2012-03-11 11:06:46 UTC 612 test_18 PASS 2012-03-11 10:23:40 UTC 16 test_18 TIMEOUT 2012-03-11 07:18:33 UTC 3600 test_18 PASS 2012-03-11 06:03:44 UTC 26 test_18 PASS 2012-03-11 06:01:00 UTC 123 test_18 PASS 2012-03-11 04:58:09 UTC 105 test_18 PASS 2012-03-11 01:38:42 UTC 30 Info required for matching: sanityn 18 |
| Comments |
| Comment by Brian Murrell (Inactive) [ 12/Mar/12 ] |
|
FWIW, in ltest (a previous incarnation of Lustre testing), a test was marked a failure (even if the test itself was a pass – or maybe it was a "warning") if any single run time to exceed some given threshold beyond the stddev of the past $n successful test runs. This forced the investigation of wildly varying test results such as is documented here. |
| Comment by Andreas Dilger [ 12/Mar/12 ] |
|
For orion: http://review.whamcloud.com/2289 |
| Comment by Andreas Dilger [ 12/Mar/12 ] |
|
The above patches are for debugging the time taken by the sub-parts of this test. I don't yet know why the test takes so long. |
| Comment by Andreas Dilger [ 15/Mar/12 ] |
|
A successful run on master looks like: == sanityn test 18: mmap sanity check == 03:05:41 (1331633141) mmap test1: basic mmap operation (PASS, 0.021068s) mmap test2: MAP_PRIVATE not write back (PASS, 0.012644s) mmap test3: concurrent mmap ops on two nodes (PASS, 3.3153s) mmap test4: c1 write to f1 from mmapped f2, c2 write to f1 from mmapped f1 (PASS, 2.1067s) mmap test5: read/write file to/from the buffer which mmapped to just this file (PASS, 0.039092s) mmap test6: check mmap write/read content on two nodes (PASS, 0.11611s) mmap test7: file i/o with an unmapped buffer (PASS, 0.063589s) mmap test8: SIGBUS for beyond file size (PASS, 0.00761s) Resetting fail_loc on all nodes...done. An unsuccessful run on Orion looks like: == sanityn test 18: mmap sanity check ================================================================ 06:07:36 (1331644056) mmap test1: basic mmap operation (PASS, 0.028616s) mmap test2: MAP_PRIVATE not write back (PASS, 0.01381s) at which point the client is hung and the mmap process does not appear in the stack dumps. |
| Comment by Mikhail Pershin [ 09/Apr/12 ] |
|
bug exists in master for a long time, the first occurrence I've found is November, 2011: https://maloo.whamcloud.com/test_sets/433b98bc-178c-11e1-8f40-52540025f9af it appeared in Orion only in February, 2012: but that can be non-reliable as test 16 failed in Orion quite often at that time and may hide test 18 issue. |
| Comment by Build Master (Inactive) [ 10/Apr/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Build Master (Inactive) [ 10/Apr/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Build Master (Inactive) [ 10/Apr/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Build Master (Inactive) [ 10/Apr/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Build Master (Inactive) [ 10/Apr/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Build Master (Inactive) [ 10/Apr/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Build Master (Inactive) [ 10/Apr/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Build Master (Inactive) [ 10/Apr/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Build Master (Inactive) [ 10/Apr/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Build Master (Inactive) [ 10/Apr/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Build Master (Inactive) [ 10/Apr/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Build Master (Inactive) [ 10/Apr/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Build Master (Inactive) [ 10/Apr/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Build Master (Inactive) [ 10/Apr/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Build Master (Inactive) [ 10/Apr/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Build Master (Inactive) [ 10/Apr/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Build Master (Inactive) [ 10/Apr/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Build Master (Inactive) [ 10/Apr/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Build Master (Inactive) [ 10/Apr/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Build Master (Inactive) [ 10/Apr/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Build Master (Inactive) [ 10/Apr/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Build Master (Inactive) [ 10/Apr/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Build Master (Inactive) [ 10/Apr/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Build Master (Inactive) [ 10/Apr/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Build Master (Inactive) [ 10/Apr/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Build Master (Inactive) [ 10/Apr/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Build Master (Inactive) [ 10/Apr/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Build Master (Inactive) [ 10/Apr/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Build Master (Inactive) [ 10/Apr/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Build Master (Inactive) [ 10/Apr/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Build Master (Inactive) [ 10/Apr/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Build Master (Inactive) [ 10/Apr/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Build Master (Inactive) [ 10/Apr/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Build Master (Inactive) [ 10/Apr/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Andreas Dilger [ 12/Apr/12 ] |
|
Jinshan, I suspect some bad interaction between CLIO and mmap. |
| Comment by Jinshan Xiong (Inactive) [ 13/Apr/12 ] |
|
I took a look at this problem and obviously this is a live-lock problem. In this test(test 3), there are two processes writing the same mmap file via two mount points. So when a process applied for cl_lock and installed page into page table but before it has a chance to do write, the other one cancelled the lock so the page is truncated, back and forth and nobody can go forward. lustre mmap just doesn't support this kind of pathological application, fix is easy - just to disable this test case. |
| Comment by Andreas Dilger [ 13/Apr/12 ] |
|
I think it is reasonable to skip this problem by disabling the test, but only for a short time in order to avoid this test failing so much. Please add a "--except" option to this program, and have the ALWAYS_EXCEPT for 18c cause this subtest to be skipped. It wasn't always failing (only since November according to Mike). Is it not possible to hold the DLM lock during page dirtying (e.g. in page_mkwrite())? |
| Comment by Jinshan Xiong (Inactive) [ 13/Apr/12 ] |
|
Yes, implementation of ll_page_mkwrite() makes it easier to hit this problem because more things have to be done to make the page writable. We could mitigate the problem by doing page_mkwrite() in ll_fault() if this page is to be written, but the problem I described in my previous comment is a fundmental problem and can't be fixed in current linux kernel implementation. |
| Comment by Jinshan Xiong (Inactive) [ 13/Apr/12 ] |
|
patch is at: http://review.whamcloud.com/2544 |
| Comment by nasf (Inactive) [ 27/Apr/12 ] |
|
Another failure instance: https://maloo.whamcloud.com/test_sets/ac77823c-9054-11e1-98a1-525400d2bfa6 |
| Comment by Build Master (Inactive) [ 02/May/12 ] |
|
Integrated in Result = SUCCESS
Oleg Drokin : b91e1648a0bef0dee6ce79a6a9fc5b82efa1e0c0
|
| Comment by Build Master (Inactive) [ 02/May/12 ] |
|
Integrated in Result = SUCCESS
Oleg Drokin : b91e1648a0bef0dee6ce79a6a9fc5b82efa1e0c0
|
| Comment by Build Master (Inactive) [ 02/May/12 ] |
|
Integrated in Result = SUCCESS
Oleg Drokin : b91e1648a0bef0dee6ce79a6a9fc5b82efa1e0c0
|
| Comment by Build Master (Inactive) [ 02/May/12 ] |
|
Integrated in Result = SUCCESS
Oleg Drokin : b91e1648a0bef0dee6ce79a6a9fc5b82efa1e0c0
|
| Comment by Build Master (Inactive) [ 02/May/12 ] |
|
Integrated in Result = SUCCESS
Oleg Drokin : b91e1648a0bef0dee6ce79a6a9fc5b82efa1e0c0
|
| Comment by Build Master (Inactive) [ 02/May/12 ] |
|
Integrated in Result = SUCCESS
Oleg Drokin : b91e1648a0bef0dee6ce79a6a9fc5b82efa1e0c0
|
| Comment by Build Master (Inactive) [ 02/May/12 ] |
|
Integrated in Result = SUCCESS
Oleg Drokin : b91e1648a0bef0dee6ce79a6a9fc5b82efa1e0c0
|
| Comment by Bob Glossman (Inactive) [ 15/May/12 ] |
|
http://review.whamcloud.com/#change,2721 back port to b2_1 |
| Comment by Peter Jones [ 22/May/12 ] |
|
Landed for 2.1.2 and 2.3 |