[LU-10195] sanity test_24v: timeout occurred after 71 mins Created: 02/Nov/17 Updated: 03/Aug/18 Resolved: 03/Aug/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.11.0, Lustre 2.10.2, Lustre 2.12.0, Lustre 2.10.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Bob Glossman (Inactive) | Assignee: | WC Triage |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Bob Glossman <bob.glossman@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/5de11e12-bf6f-11e7-a943-52540065bddc.

The sub-test test_24v failed with the following error:

Timeout occurred after 71 mins, last suite running was sanity, restarting cluster to continue tests

This might be a DCO problem, not a Lustre bug. I can't find evidence of a Lustre failure.

Info required for matching: sanity 24v |
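For anyone trying to reproduce this outside of autotest, a minimal sketch of running only this sub-test from the in-tree test framework is below; the paths and the /mnt/lustre mount point are assumptions based on a default llmount.sh-style setup, not details taken from this report.

  # Hedged sketch: run just sanity test_24v against an already set-up test
  # filesystem. ONLY= restricts sanity.sh to the named sub-test(s).
  cd lustre/tests                # or the installed tests directory
  ONLY=24v sh sanity.sh          # runs the create/ls/cleanup sequence of 24v only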
| Comments |
| Comment by James Nunez (Inactive) [ 19/Dec/17 ] |
|
sanity test 24v started failing on the master branch on 2017-10-09 and on the b2_10 branch on 2017-11-17. Although the number of files created and the create times differ, the last lines in the client test_log for this failure are a variation of the following:

== sanity test 24v: list large directory (test hash collision, b=17560) == 12:27:33 (1510921653)
striped dir -i0 -c2 /mnt/lustre/d24v.sanity
 - create 9317 (time 1510921664.06 total 10.03 last 928.56)
 - create 10000 (time 1510921665.04 total 11.02 last 691.01)
 - create 19063 (time 1510921675.05 total 21.02 last 906.28)
 - create 20000 (time 1510921676.03 total 22.01 last 953.02)
 - create 28885 (time 1510921686.03 total 32.01 last 888.44)
 - create 30000 (time 1510921687.25 total 33.22 last 916.65)
 - create 38856 (time 1510921697.25 total 43.22 last 885.56)
 - create 40000 (time 1510921698.48 total 44.46 last 925.93)
 - create 48554 (time 1510921708.48 total 54.46 last 855.36)
 - create 50000 (time 1510921710.31 total 56.29 last 790.30)
 - create 58537 (time 1510921720.31 total 66.29 last 853.60)
 - create 60000 (time 1510921721.91 total 67.89 last 913.75)
 - create 68831 (time 1510921731.91 total 77.89 last 883.04)
 - create 70000 (time 1510921733.16 total 79.13 last 940.86)
 - create 78502 (time 1510921743.16 total 89.13 last 850.16)
 - create 80000 (time 1510921745.04 total 91.02 last 794.76)
 - create 88521 (time 1510921755.04 total 101.02 last 852.07)
 - create 90000 (time 1510921756.64 total 102.61 last 928.39)
 - create 99080 (time 1510921766.64 total 112.61 last 907.98)
total: 100000 create in 113.62 seconds: 880.16 ops/second
mdc.lustre-MDT0000-mdc-ffff880078d24400.stats=clear
mdc.lustre-MDT0001-mdc-ffff880078d24400.stats=clear
mdc.lustre-MDT0002-mdc-ffff880078d24400.stats=clear
mdc.lustre-MDT0003-mdc-ffff880078d24400.stats=clear
readpages: 6
rpc_max: 7 |
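For context, the readpages and rpc_max figures at the end of the log come from the client's mdc statistics, which the test clears before listing the directory. A hedged sketch of gathering the same kind of numbers by hand is below; the exact counter name and the mount point are assumptions and may vary by Lustre version and setup.

  # Hedged sketch: clear the per-MDT mdc stats, list the large directory,
  # then look at how many readdir (MDS_READPAGE) RPCs were issued.
  lctl set_param mdc.*.stats=clear
  ls -l /mnt/lustre/d24v.sanity > /dev/null
  lctl get_param mdc.*.stats | grep -i readpage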
| Comment by Jian Yu [ 22/Jul/18 ] |
|
More failure instances on Lustre b2_10 branch in DNE test session: |
| Comment by Minh Diep [ 01/Aug/18 ] |
|
+1 on b2_10: https://testing.whamcloud.com/test_sets/d879223a-94fa-11e8-8ee3-52540065bddc |
| Comment by John Hammond [ 01/Aug/18 ] |
|
I ran this locally and it looks like during cleanup each unlink is triggering statahead and reading the whole directory from is_first_dirent(). Here is the stack trace for rm:

o:~# stack1 rm
17437 rm
[<ffffffffc0d16140>] ptlrpc_set_wait+0x4c0/0x920 [ptlrpc]
[<ffffffffc0d1661d>] ptlrpc_queue_wait+0x7d/0x220 [ptlrpc]
[<ffffffffc0f04496>] mdc_getpage+0x1e6/0x6a0 [mdc]
[<ffffffffc0f04a7d>] mdc_read_page_remote+0x12d/0x640 [mdc]
[<ffffffffb8b94c01>] do_read_cache_page+0x81/0x190
[<ffffffffb8b94d4c>] read_cache_page+0x1c/0x30
[<ffffffffc0efec0a>] mdc_read_page+0x1aa/0x970 [mdc]
[<ffffffffc0c8f9f4>] lmv_read_page+0x1c4/0x550 [lmv]
[<ffffffffc156d6de>] ll_get_dir_page+0xce/0x380 [lustre]
[<ffffffffc15c0ca4>] start_statahead_thread+0x94/0xea0 [lustre]
[<ffffffffc15c5a4b>] ll_statahead+0x10b/0x120 [lustre]
[<ffffffffc15af0b4>] ll_lookup_it+0xb54/0xef0 [lustre]
[<ffffffffc15b07eb>] ll_lookup_nd+0xbb/0x190 [lustre]
[<ffffffffb8c25cf3>] lookup_real+0x23/0x60
[<ffffffffb8c26702>] __lookup_hash+0x42/0x60
[<ffffffffb8c2cfec>] do_unlinkat+0x14c/0x2d0
[<ffffffffb8c2e0cb>] SyS_unlinkat+0x1b/0x40
[<ffffffffb912082f>] system_call_fastpath+0x1c/0x21
[<ffffffffffffffff>] 0xffffffffffffffff

Attaching strace to rm shows that it's doing about 1 unlink per second:

o:~# strace -ttt -p 17437
strace: Process 17437 attached
1533157403.718270 unlinkat(4, "f24v.sanity1160", 0) = 0
1533157404.349918 unlinkat(4, "f24v.sanity1161", 0) = 0
1533157404.962529 unlinkat(4, "f24v.sanity1162", 0) = 0
1533157405.580449 unlinkat(4, "f24v.sanity1163", 0) = 0
1533157406.196802 unlinkat(4, "f24v.sanity1164", 0) = 0
1533157406.812760 unlinkat(4, "f24v.sanity1165", 0) = 0
1533157407.681777 unlinkat(4, "f24v.sanity1166", 0) = 0
1533157408.311224 unlinkat(4, "f24v.sanity1167", 0^Cstrace: Process 17437 detached
<detached ...>

Every time I get the stack trace of rm it appears as above. After disabling statahead, unlinks run at about 100 per second. Also, this test is effectively disabled in a non-DNE setup, since it uses $LFS getdirstripe -c $(dirname $fname) to compute the stripe count of the directory, which returns 0 for an unstriped directory. |
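The statahead observation above is straightforward to check by hand. A minimal sketch follows, assuming a client mounted at /mnt/lustre and the standard llite.*.statahead_max tunable; the directory name is only an example, not taken from a specific run.

  # Hedged sketch: compare cleanup speed with statahead enabled vs. disabled.
  lctl get_param llite.*.statahead_max     # non-zero means statahead is active
  lctl set_param llite.*.statahead_max=0   # disable statahead on this client
  time rm -rf /mnt/lustre/d24v.sanity      # cleanup no longer re-reads the directory per unlink

  # The stripe-count check mentioned above; a plain (unstriped) directory
  # reports 0, which is why the test is effectively a no-op without DNE.
  lfs getdirstripe -c /mnt/lustre/d24v.sanity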
| Comment by John Hammond [ 02/Aug/18 ] |
|
Lai, is this |
| Comment by Lai Siyao [ 03/Aug/18 ] |
|
Yes, it looks to be |