[LU-14809] Intermittent failure of sanity test_398g (we should see 8 RPCs in flight) Created: 02/Jul/21  Updated: 07/Dec/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Patrick Farrell Assignee: Patrick Farrell
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-14310 Interop: sanity test 130g fails with ... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Lustre Gerrit Janitor
4:05 PM

Patch Set 15:

Testing has completed with errors! IMPORTANT: these tests appear to be new failures unique to this patch

  • sanity2@ldiskfs+DNE:test_398g (Seen in reviews: 39447 39443 39440 39438 39436)

> sanity2@ldiskfs+DNE Failure (7638s)

  • 398g (we should see 8 RPCs in flight)

Succeeded:

  • conf-sanity1@ldiskfs+DNE conf-sanity2@ldiskfs+DNE conf-sanity3@ldiskfs+DNE conf-sanity-slow@ldiskfs+DNE conf-sanity-slow-106@ldiskfs+DNE conf-sanity1@zfs conf-sanity2@zfs conf-sanity3@zfs conf-sanity-slow@zfs conf-sanity-slow-106@zfs insanity@ldiskfs+DNE insanity@zfs lnet-selftest@zfs lustre-rsync-test@ldiskfs+DNE lustre-rsync-test@zfs ost-pools@ldiskfs+DNE ost-pools@zfs pjdfstest@ldiskfs+DNE pjdfstest@zfs racer@ldiskfs+DNE racer@zfs recovery-small@ldiskfs+DNE recovery-small@zfs replay-dual@ldiskfs+DNE replay-dual@zfs replay-ost-single@ldiskfs+DNE replay-ost-single@zfs replay-single@ldiskfs+DNE replay-single@zfs replay-vbr@ldiskfs+DNE replay-vbr@zfs sanity1@ldiskfs+DNE sanity-slow@ldiskfs+DNE sanity1@zfs sanity2@zfs sanity-slow@zfs sanity-benchmark@ldiskfs+DNE sanity-benchmark@zfs sanity-dom@ldiskfs+DNE sanity-dom@zfs sanity-flr@ldiskfs+DNE sanity-flr@zfs sanity-hsm@ldiskfs+DNE sanity-hsm@zfs sanity-lfsck@ldiskfs+DNE sanity-lfsck@zfs sanity-lnet@zfs sanity-pcc@ldiskfs+DNE sanity-pcc@zfs sanity-pfl@ldiskfs+DNE sanity-pfl@zfs sanity-quota@ldiskfs+DNE sanity-quota@zfs sanity-scrub@ldiskfs+DNE sanity-scrub@zfs sanity-sec@ldiskfs+DNE sanity-sec@zfs sanityn@ldiskfs+DNE sanityn@zfs

(centos7) All results and logs: https://testing-archive.whamcloud.com/gerrit-janitor/16844/results.html



 Comments   
Comment by Patrick Farrell [ 02/Jul/21 ]

This test was added by the recent parallel DIO work; this specific failure is one where we got one extra write RPC of ~1 page for some reason:

8388608 bytes (8.4 MB) copied, 2.17939 s, 3.8 MB/s
osc.lustre-OST0000-osc-ffff8800d6b67800.rpc_stats=
snapshot_time:         1625256875.834467231 (secs.nsecs)
read RPCs in flight:  0
write RPCs in flight: 1
pending write pages:  0
pending read pages:   0

                        read                  write
pages per rpc         rpcs   % cum % |       rpcs   % cum %
1:                       0   0   0   |          1  11  11
2:                       0   0   0   |          0   0  11
4:                       0   0   0   |          0   0  11
8:                       0   0   0   |          0   0  11
16:                      0   0   0   |          0   0  11
32:                      0   0   0   |          0   0  11
64:                      0   0   0   |          0   0  11
128:                     0   0   0   |          0   0  11
256:                     0   0   0   |          8  88 100

                        read                  write
rpcs in flight        rpcs   % cum % |       rpcs   % cum %
1:                       0   0   0   |          1  11  11
2:                       0   0   0   |          1  11  22
3:                       0   0   0   |          1  11  33
4:                       0   0   0   |          1  11  44
5:                       0   0   0   |          1  11  55
6:                       0   0   0   |          1  11  66
7:                       0   0   0   |          1  11  77
8:                       0   0   0   |          1  11  88
9:                       0   0   0   |          1  11 100

This doesn't indicate the feature is broken, but it's weird and we can't have the test failing.
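The failure is visible directly in the histograms: the "rpcs in flight" histogram has a non-zero bucket at depth 9, one deeper than the expected 8. A minimal sketch of machine-checking that (the parser and its row regex are mine for illustration, not part of the Lustre test suite):

```python
import re

# Illustrative parser for one osc rpc_stats histogram block (not part of
# the Lustre test suite).  Row format, as in the log above:
#   "9:      0   0   0   |   1  11 100"
# i.e. bucket, read rpcs/%/cum%, a '|' separator, then write rpcs/%/cum%.
ROW = re.compile(r"^\s*(\d+):\s+(\d+)\s+\d+\s+\d+\s+\|\s+(\d+)")

def write_rpcs_per_bucket(histogram_text):
    """Return {bucket: write RPC count} for the rows in histogram_text."""
    counts = {}
    for line in histogram_text.splitlines():
        m = ROW.match(line)
        if m:
            bucket, _read, write = (int(g) for g in m.groups())
            counts[bucket] = write
    return counts

# The failing run's "rpcs in flight" histogram: one write RPC was
# observed at each depth from 1 through 9.
in_flight = write_rpcs_per_bucket("\n".join(
    "%d:        0   0   0   |          1  11  %d" % (d, d * 11)
    for d in range(1, 10)))

total_writes = sum(in_flight.values())                   # 9 write RPCs total
deepest = max(b for b, n in in_flight.items() if n > 0)  # depth 9, not the expected 8
```

Summing the write column of the "pages per rpc" histogram gives the same total: eight full 256-page RPCs plus the one stray 1-page RPC.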

Comment by Andreas Dilger [ 26/May/22 ]

+1 on master:
https://testing.whamcloud.com/test_sets/17cec08c-d1ed-48cf-bf54-b0e9b0de387b

Comment by Andreas Dilger [ 29/Jul/22 ]

+1 on master: https://testing.whamcloud.com/test_sets/65370343-bb92-428b-b736-b1641382af47

Comment by Andreas Dilger [ 29/Jul/22 ]

11 failures in the past 4 weeks out of 2000+ runs, so not especially frequent.

Comment by Andreas Dilger [ 16/Dec/22 ]

+1 on master: https://testing.whamcloud.com/test_sets/50aacb7f-d807-42ea-aa36-28cc7e8c9a31

Comment by Andreas Dilger [ 06/Jan/23 ]

+1 on master: https://testing.whamcloud.com/test_sets/d338f55b-171e-4a2d-aca2-1cac8becb602

Comment by Andreas Dilger [ 25/Apr/23 ]

Still seeing this intermittent failure, with 9 RPCs in flight.

Could it just be an extra OBD_PING or similar coincident with the start of the write? There should be enough info to tie the timestamp when the stats are collected to the Lustre debug log to see which RPCs were in flight at that time.
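One way to start that correlation, sketched here with a helper of my own naming (not a Lustre utility): convert the rpc_stats snapshot_time, printed as secs.nsecs, into a wall-clock time that can be matched against debug-log timestamps.

```python
from datetime import datetime, timezone

# Illustrative helper (not part of Lustre): split an rpc_stats
# "snapshot_time" value ("secs.nsecs") into a UTC timestamp plus the
# nanosecond remainder, for lining up against debug-log entries.
def snapshot_to_utc(snapshot):
    secs, nsecs = snapshot.split(".")
    return datetime.fromtimestamp(int(secs), tz=timezone.utc), int(nsecs)

# The snapshot from the original failing run above:
when, nsecs = snapshot_to_utc("1625256875.834467231")
# when is 2021-07-02 20:14:35 UTC, i.e. during the original test_398g failure
```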

Comment by Andreas Dilger [ 11/Sep/23 ]

This failed 10x in the past 4 weeks. Nothing critical, but is it worthwhile to just accept the "9 RPCs in flight" answer and move on?

Generated at Sat Feb 10 03:12:58 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.