[LU-14809] Intermittent failure of sanity test_398g (we should see 8 RPCs in flight) Created: 02/Jul/21 Updated: 07/Dec/23 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Patrick Farrell | Assignee: | Patrick Farrell |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
Lustre Gerrit Janitor Testing has completed with errors! IMPORTANT: these tests appear to be new failures unique to this patch
> sanity2@ldiskfs+DNE Failure(7638s)
Succeeded:
(centos7)All results and logs: https://testing-archive.whamcloud.com/gerrit-janitor/16844/results.html |
| Comments |
| Comment by Patrick Farrell [ 02/Jul/21 ] |
|
This test was added by the recent parallel DIO work; this specific failure is one where we got one extra RPC of ~ 1 page for some reason: 8388608 bytes (8.4 MB) copied, 2.17939 s, 3.8 MB/s
osc.lustre-OST0000-osc-ffff8800d6b67800.rpc_stats=
snapshot_time: 1625256875.834467231 (secs.nsecs)
read RPCs in flight: 0
write RPCs in flight: 1
pending write pages: 0
pending read pages: 0
read write
pages per rpc rpcs % cum % | rpcs % cum %
1: 0 0 0 | 1 11 11
2: 0 0 0 | 0 0 11
4: 0 0 0 | 0 0 11
8: 0 0 0 | 0 0 11
16: 0 0 0 | 0 0 11
32: 0 0 0 | 0 0 11
64: 0 0 0 | 0 0 11
128: 0 0 0 | 0 0 11
256: 0 0 0 | 8 88 100
read write
rpcs in flight rpcs % cum % | rpcs % cum %
1: 0 0 0 | 1 11 11
2: 0 0 0 | 1 11 22
3: 0 0 0 | 1 11 33
4: 0 0 0 | 1 11 44
5: 0 0 0 | 1 11 55
6: 0 0 0 | 1 11 66
7: 0 0 0 | 1 11 77
8: 0 0 0 | 1 11 88
9: 0 0 0 | 1 11 100
This doesn't indicate the feature is broken, but it's weird and we can't have the test failing. |
| Comment by Andreas Dilger [ 26/May/22 ] |
|
+1 on master: |
| Comment by Andreas Dilger [ 29/Jul/22 ] |
|
+1 on master; https://testing.whamcloud.com/test_sets/65370343-bb92-428b-b736-b1641382af47 |
| Comment by Andreas Dilger [ 29/Jul/22 ] |
|
11 failures in the past 4 weeks, of 2000+ runs, so not super frequent |
| Comment by Andreas Dilger [ 16/Dec/22 ] |
|
+1 on master: https://testing.whamcloud.com/test_sets/50aacb7f-d807-42ea-aa36-28cc7e8c9a31 |
| Comment by Andreas Dilger [ 06/Jan/23 ] |
|
+1 on master https://testing.whamcloud.com/test_sets/d338f55b-171e-4a2d-aca2-1cac8becb602 |
| Comment by Andreas Dilger [ 25/Apr/23 ] |
|
Still seeing this intermittent failure, with 9 RPCs in flight. Could it just be an extra OBD_PING or similar coincident with the start of the write? There should be enough info to tie the timestamp when the stats are collected to the Lustre debug log to see which RPCs were in flight at that time. |
| Comment by Andreas Dilger [ 11/Sep/23 ] |
|
This failed 10x in the past 4 weeks. Nothing critical, but is it worthwhile to just accept the "9 RPCs in flight" answer and move on? |