[LU-15941] sanity test_398b: timeouts with ZFS Created: 14/Jun/22 Updated: 16/Aug/23 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Alex Zhuravlev <bzzz@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/097393a8-5380-4d65-af83-5d44c963ce88

test_398b failed with the following error: Timeout occurred after 326 minutes; the last suite running was sanity.

This started on June 10, after a recent landing wave.
| Comments |
| Comment by Andreas Dilger [ 01/Jul/22 ] |
|
+1 on master: https://testing.whamcloud.com/test_sets/34d33d79-d068-4eeb-8990-9c8d06669a01 It is reporting a pathetic 1 IOPS, under 8 KB/s. I guess that is contention on the single HDD on the host, compounded by read-modify-write of the larger ZFS blocks? Alex, do you think your blocksize patch https://review.whamcloud.com/47768 "LU-15963 osd: use contiguous chunk to grow blocksize" might help this? |
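As a back-of-envelope check on the r-m-w theory: a small write landing inside a larger ZFS block forces the whole block to be read, modified, and written back. A minimal sketch of the arithmetic, assuming an illustrative 4 KB write into a 128 KB block (neither value is taken from this test run):

```python
# Hypothetical numbers to illustrate read-modify-write amplification when
# a small write lands inside a larger ZFS block: the whole block must be
# read from disk, modified in memory, and written back.
write_size = 4 * 1024      # assumed application write size (bytes)
zfs_block = 128 * 1024     # assumed ZFS blocksize (bytes); not measured here

disk_bytes_moved = 2 * zfs_block            # read the block, write it back
amplification = disk_bytes_moved / write_size

iops = 1                                    # the ~1 IOPS reported above
useful_throughput = write_size * iops       # application bytes per second

print(f"amplification: {amplification:.0f}x")                    # -> 64x
print(f"useful throughput: {useful_throughput / 1024:.0f} KB/s") # -> 4 KB/s
```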
| Comment by Alex Zhuravlev [ 24/Aug/22 ] |
|
I profiled 398b: dt_trans_stop() in ofd_commitrw_write() takes 50 usec with ldiskfs and 512831 usec (roughly half a second) with ZFS, on average. |
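The comment does not say how the profiling was done; one way to reproduce a per-call average for dt_trans_stop() is the kernel's function_graph tracer with the funcgraph-tail option, which annotates closing braces with the function name. A parsing sketch, assuming the standard tracefs paths and duration format:

```python
import re

# Assumed tracer setup (as root), before running the test:
#   echo dt_trans_stop > /sys/kernel/debug/tracing/set_graph_function
#   echo funcgraph-tail > /sys/kernel/debug/tracing/trace_options
#   echo function_graph > /sys/kernel/debug/tracing/current_tracer
#
# With funcgraph-tail, a traced call that has children closes with
# "} /* dt_trans_stop */" and carries its duration; a call with no traced
# children appears as a leaf line "dt_trans_stop();" with the duration.
DURATION = re.compile(
    r'([\d.]+)\s*us\s*\|\s*(?:\}\s*/\* dt_trans_stop \*/|dt_trans_stop\(\);)')

durations = []
with open('/sys/kernel/debug/tracing/trace') as f:
    for line in f:
        m = DURATION.search(line)
        if m:
            durations.append(float(m.group(1)))

if durations:
    avg = sum(durations) / len(durations)
    print(f"{len(durations)} calls, average {avg:.1f} us")
```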
| Comment by Nikitas Angelinas [ 14/Dec/22 ] |
|
+1 on master: https://testing.whamcloud.com/test_sets/b7ffdf4c-d214-427b-95e0-379d8c837267 |
| Comment by Nikitas Angelinas [ 17/Jan/23 ] |
|
+1 on master: https://testing.whamcloud.com/test_sets/cff8066f-91ae-4805-a4e2-ce35545e5bfe |
| Comment by Patrick Farrell [ 03/Apr/23 ] |
|
Alex, they are missing 'ASYNC' because they should be: this is direct I/O, which expects the server to do a sync for each write. This means DIO performance on ZFS is absolutely terrible, and I think we can't fix it except by fixing our sync behavior on ZFS, which I understand is a huge project. |
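A minimal client-side sketch of why per-write sync latency dominates here: with O_DIRECT, each write is synchronous through to the server, so IOPS is capped by commit latency no matter how fast the device is. The path below is a placeholder, and the mmap is only there because O_DIRECT needs a page-aligned buffer:

```python
import mmap, os, time

PATH = '/mnt/lustre/dio_test'   # hypothetical file on a Lustre client mount
IOSIZE = 4096                   # O_DIRECT needs size/offset/buffer alignment
COUNT = 100

buf = mmap.mmap(-1, IOSIZE)     # anonymous mmap gives a page-aligned buffer
buf.write(b'x' * IOSIZE)

fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)
start = time.monotonic()
for i in range(COUNT):
    # Each direct-I/O write is synchronous end to end: the client waits
    # for the OST, and on ZFS the OST waits for the transaction commit.
    os.pwrite(fd, buf, i * IOSIZE)
elapsed = time.monotonic() - start
os.close(fd)

# With ~0.5 s spent in dt_trans_stop() per write (the figure above),
# this loop cannot do much better than ~2 IOPS, whatever the device.
print(f"{COUNT / elapsed:.1f} IOPS, {COUNT * IOSIZE / elapsed / 1024:.1f} KB/s")
```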
| Comment by Andreas Dilger [ 03/Apr/23 ] |
|
IIRC, there are two significant performance issues with ZFS sync writes.

For flash devices it would still be possible to commit thousands of times per second, and for HDD devices maybe 10/s instead of 1/s. This would of course increase load on the storage and CPUs, but what else are they for, and why should both the clients and servers be waiting idle for the 1 s ZFS transaction commit? |
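The arithmetic behind this suggestion, as a quick model (the 8 KB per sync write comes from the ~8 KB/s at 1 IOPS reported earlier; the commit rates are the ones mentioned above):

```python
# Sync-write throughput when every write waits for a transaction commit:
# throughput ~= io_size * commits_per_second.
io_size_kb = 8   # per-write payload, from the ~8 KB/s at 1 IOPS above

for label, commits_per_sec in [
        ("today, 1 s TXG commit", 1),
        ("HDD, committing more often", 10),
        ("flash, committing more often", 1000)]:
    print(f"{label:28s}: {io_size_kb * commits_per_sec:>5} KB/s")
```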
| Comment by Nikitas Angelinas [ 16/Aug/23 ] |
|
+1 on master: https://testing.whamcloud.com/test_sets/2dda2437-8b99-4e36-b99d-f769947b2f6b |