[LU-10195] sanity test_24v: timeout occurred after 71 mins Created: 02/Nov/17  Updated: 03/Aug/18  Resolved: 03/Aug/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0, Lustre 2.10.2, Lustre 2.12.0, Lustre 2.10.4
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Bob Glossman (Inactive) Assignee: WC Triage
Resolution: Duplicate Votes: 0
Labels: None

Issue Links:
Duplicate
duplicates LU-10165 sanity test_24v: test took quite long... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Bob Glossman <bob.glossman@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/5de11e12-bf6f-11e7-a943-52540065bddc.

The sub-test test_24v failed with the following error:

Timeout occurred after 71 mins, last suite running was sanity, restarting cluster to continue tests

This might be a DCO problem rather than a Lustre bug; I can't find evidence of a Lustre failure.

Info required for matching: sanity 24v



 Comments   
Comment by James Nunez (Inactive) [ 19/Dec/17 ]

sanity test 24v started failing on the master branch on 2017-10-09 and on the b2_10 branch on 2017-11-17.

Although the number of files created and the create times differ, the last lines in the client test_log for this failure are a variation of the following:

== sanity test 24v: list large directory (test hash collision, b=17560) == 12:27:33 (1510921653)
striped dir -i0 -c2 /mnt/lustre/d24v.sanity
 - create 9317 (time 1510921664.06 total 10.03 last 928.56)
 - create 10000 (time 1510921665.04 total 11.02 last 691.01)
 - create 19063 (time 1510921675.05 total 21.02 last 906.28)
 - create 20000 (time 1510921676.03 total 22.01 last 953.02)
 - create 28885 (time 1510921686.03 total 32.01 last 888.44)
 - create 30000 (time 1510921687.25 total 33.22 last 916.65)
 - create 38856 (time 1510921697.25 total 43.22 last 885.56)
 - create 40000 (time 1510921698.48 total 44.46 last 925.93)
 - create 48554 (time 1510921708.48 total 54.46 last 855.36)
 - create 50000 (time 1510921710.31 total 56.29 last 790.30)
 - create 58537 (time 1510921720.31 total 66.29 last 853.60)
 - create 60000 (time 1510921721.91 total 67.89 last 913.75)
 - create 68831 (time 1510921731.91 total 77.89 last 883.04)
 - create 70000 (time 1510921733.16 total 79.13 last 940.86)
 - create 78502 (time 1510921743.16 total 89.13 last 850.16)
 - create 80000 (time 1510921745.04 total 91.02 last 794.76)
 - create 88521 (time 1510921755.04 total 101.02 last 852.07)
 - create 90000 (time 1510921756.64 total 102.61 last 928.39)
 - create 99080 (time 1510921766.64 total 112.61 last 907.98)
total: 100000 create in 113.62 seconds: 880.16 ops/second
mdc.lustre-MDT0000-mdc-ffff880078d24400.stats=clear
mdc.lustre-MDT0001-mdc-ffff880078d24400.stats=clear
mdc.lustre-MDT0002-mdc-ffff880078d24400.stats=clear
mdc.lustre-MDT0003-mdc-ffff880078d24400.stats=clear
readpages: 6 rpc_max: 7
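
Based on the log above, test 24v appears to do roughly the following (a simplified sketch, not the actual sanity.sh code; $DIR, $LFS and $LCTL are the usual test-framework variables for the client mount point and the lfs/lctl binaries):

# Simplified outline suggested by the log above; the exact options and the
# rpc_max calculation differ in sanity.sh.
$LFS setdirstripe -i 0 -c 2 $DIR/d24v.sanity        # "striped dir -i0 -c2"
createmany -m $DIR/d24v.sanity/f24v.sanity 100000   # "total: 100000 create ..."
$LCTL set_param mdc.*.stats=clear                    # "mdc...stats=clear"
ls -l $DIR/d24v.sanity > /dev/null                   # list the large directory
$LCTL get_param mdc.*.stats                          # compare readpage RPCs to rpc_max
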
Comment by Jian Yu [ 22/Jul/18 ]

More failure instances on Lustre b2_10 branch in DNE test session:
https://testing.whamcloud.com/test_sets/fc89f1a4-88b1-11e8-b376-52540065bddc
https://testing.whamcloud.com/test_sets/1c7e0108-8c90-11e8-a9f7-52540065bddc

Comment by Minh Diep [ 01/Aug/18 ]

+1 on b2_10: https://testing.whamcloud.com/test_sets/d879223a-94fa-11e8-8ee3-52540065bddc

Comment by John Hammond [ 01/Aug/18 ]

I ran this locally, and it looks like each unlink during cleanup triggers statahead, which rereads the whole directory from is_first_dirent(). Here is the stack trace for rm:

o:~# stack1 rm
17437 rm
[<ffffffffc0d16140>] ptlrpc_set_wait+0x4c0/0x920 [ptlrpc]
[<ffffffffc0d1661d>] ptlrpc_queue_wait+0x7d/0x220 [ptlrpc]
[<ffffffffc0f04496>] mdc_getpage+0x1e6/0x6a0 [mdc]
[<ffffffffc0f04a7d>] mdc_read_page_remote+0x12d/0x640 [mdc]
[<ffffffffb8b94c01>] do_read_cache_page+0x81/0x190
[<ffffffffb8b94d4c>] read_cache_page+0x1c/0x30
[<ffffffffc0efec0a>] mdc_read_page+0x1aa/0x970 [mdc]
[<ffffffffc0c8f9f4>] lmv_read_page+0x1c4/0x550 [lmv]
[<ffffffffc156d6de>] ll_get_dir_page+0xce/0x380 [lustre]
[<ffffffffc15c0ca4>] start_statahead_thread+0x94/0xea0 [lustre]
[<ffffffffc15c5a4b>] ll_statahead+0x10b/0x120 [lustre]
[<ffffffffc15af0b4>] ll_lookup_it+0xb54/0xef0 [lustre]
[<ffffffffc15b07eb>] ll_lookup_nd+0xbb/0x190 [lustre]
[<ffffffffb8c25cf3>] lookup_real+0x23/0x60
[<ffffffffb8c26702>] __lookup_hash+0x42/0x60
[<ffffffffb8c2cfec>] do_unlinkat+0x14c/0x2d0
[<ffffffffb8c2e0cb>] SyS_unlinkat+0x1b/0x40
[<ffffffffb912082f>] system_call_fastpath+0x1c/0x21
[<ffffffffffffffff>] 0xffffffffffffffff
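
(stack1 above looks like a local helper script rather than a standard tool, which is an assumption on my part. The same information can be read straight from procfs, for example:)

# Dump the kernel stack of the running rm process (needs root and
# CONFIG_STACKTRACE; equivalent to what the helper above prints).
pid=$(pgrep -x rm | head -n1)
echo "$pid $(ps -o comm= -p "$pid")"
cat /proc/"$pid"/stack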

Attaching strace to rm shows that it's doing about 1 unlink per second:

o:~# strace -ttt -p 17437
strace: Process 17437 attached
1533157403.718270 unlinkat(4, "f24v.sanity1160", 0) = 0
1533157404.349918 unlinkat(4, "f24v.sanity1161", 0) = 0
1533157404.962529 unlinkat(4, "f24v.sanity1162", 0) = 0
1533157405.580449 unlinkat(4, "f24v.sanity1163", 0) = 0
1533157406.196802 unlinkat(4, "f24v.sanity1164", 0) = 0
1533157406.812760 unlinkat(4, "f24v.sanity1165", 0) = 0
1533157407.681777 unlinkat(4, "f24v.sanity1166", 0) = 0
1533157408.311224 unlinkat(4, "f24v.sanity1167", 0^Cstrace: Process 17437 detached
 <detached ...>

Every time I capture the stack trace of rm it looks like the above. With statahead disabled, unlinks go to about 100 per second. At roughly one unlink per second, removing all 100,000 test files would take over a day, far beyond the 71-minute timeout; at ~100 per second the cleanup finishes in under 20 minutes.
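
For reference, statahead can be switched off on the client through the llite statahead_max tunable. A minimal sketch of the before/after timing, assuming the mount point and test directory from the log above:

# Save, disable, and restore statahead around the cleanup to compare unlink rates.
old=$(lctl get_param -n llite.*.statahead_max | head -n1)
lctl set_param llite.*.statahead_max=0
time rm -rf /mnt/lustre/d24v.sanity
lctl set_param llite.*.statahead_max=$old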

Also, this test is effectively disabled in a non-DNE setup, since it uses $LFS getdirstripe -c $(dirname $fname) to compute the stripe count of the directory, and that returns 0 for an unstriped directory.
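
A minimal illustration of that point (hypothetical snippet, not the sanity.sh code):

# On a plain, unstriped directory the reported stripe count is 0, which is
# why the check above is effectively skipped without DNE.
fname=/mnt/lustre/d24v.sanity/f24v.sanity
mdt_count=$($LFS getdirstripe -c $(dirname $fname))
echo "directory stripe count: $mdt_count"    # prints 0 on a plain directory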

Comment by John Hammond [ 02/Aug/18 ]

Lai, is this LU-10165 again?

Comment by Lai Siyao [ 03/Aug/18 ]

Yes, it looks to be LU-10165, which has not landed on b2_10 yet.
