[LU-7970] intermittent ENOSPC on osd_write_commit Created: 31/Mar/16 Updated: 18/Apr/16 Resolved: 16/Apr/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Brian Johanson | Assignee: | WC Triage |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None |
| Environment: |
CentOS 7.2, 3.10.0-327.3.1.el7_lustre.x86_64, ZFS, mirrored MGT/MDT |
| Attachments: |
|
| Issue Links: |
|
| Severity: | 3 |
| Epic: | server, zfs |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Intermittent ENOSPC error. Reproduced it today and grabbed debug info; server debug and full debug logs are attached.

00080000:00000020:5.0:1459450158.935617:0:6537:0:(osd_handler.c:601:osd_sync()) synced OSD osd-zfs |
| Comments |
| Comment by Andreas Dilger [ 01/Apr/16 ] |
|
This may relate to commit bd1e41672c974b97148b65115185a57ca4b7bbde
Author: Johann Lombardi <johann.lombardi@intel.com>
AuthorDate: Fri Jul 10 17:23:28 2015 -0700
Commit: Oleg Drokin <oleg.drokin@intel.com>
CommitDate: Sat Feb 20 05:39:56 2016 +0000
LU-2049 grant: add support for OBD_CONNECT_GRANT_PARAM
Add support for grant overhead calculation on the client side.
To do so, clients track usage on a per-extent basis. An extent is
composed of contiguous blocks.
The OST now returns to the OSC layer several parameters to consume
grant more accurately:
- the backend filesystem block size which is the minimal grant
allocation unit;
- the maximum extent size;
- the extent insertion cost.
Clients now pack in bulk write how much grant space was consumed for
the RPC. Dirty data accounting also adopts the same scheme.
Moreover, each backend OSD now reports its own set of parameters:
- For ldiskfs, we usually have a 4KB block size with a maximum extent
size of 32MB (theoretical limit of 128MB) and an extent insertion
cost of 6 x 4KB = 24KB
- For ZFS, we report a block size of 128KB, an extent size of 128
blocks (i.e. 16MB with 128KB block size) and a block insertion cost
of 112KB.
Besides, there is now no more generic metadata overhead reservation
done inside each OSD. Instead grant space is inflated for clients
that do not support the new grant parameters. That said, a tiny
percentage (typically 0.76%) of the free space is still reserved
inside each OSD to avoid fragmentation which might hurt performance
and impact our grant calculation (e.g. extents are broken due to
fragmentation).
This patch also fixes several other issues:
- Bulk write resent by ptlrpc after reconnection could trigger
spurious error messages related to broken dirty accounting.
The issue was that oa_dirty is discarded for resent requests
(grant flag cleared in ost_brw_write()), so we can legitimately
have grant > fed_dirty in ofd_grant_check().
This was fixed by resetting fed_dirty on reconnection and skipping
the dirty accounting check in ofd_grant_check() in the case of
ptlrpc resend.
- In obd_connect_data_seqprint(), the connection flags cannot fit
in a 32-bit integer.
- When merging two OSC extents, an extent tax should be released
in both the merged extent and in the grant accounting.
Signed-off-by: Johann Lombardi <johann.lombardi@intel.com>
Signed-off-by: Jinshan Xiong <jinshan.xiong@intel.com>
Reviewed-on: http://review.whamcloud.com/7793
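To make the ZFS parameters above concrete, here is a back-of-the-envelope sketch (my own illustration, not part of the commit) of the grant a client would consume for one small write under this scheme:

  # ZFS values quoted above: 128KB block size, 112KB extent insertion cost
  blksize=$((128 * 1024))
  insert_cost=$((112 * 1024))
  write_bytes=4096                              # a single 4KB file write
  # the payload is rounded up to whole blocks, plus the cost of inserting a new extent
  blocks=$(( (write_bytes + blksize - 1) / blksize ))
  echo "grant consumed: $(( blocks * blksize + insert_cost )) bytes"   # 245760 bytes (240KB)

So a 4KB write can tie up roughly 240KB of grant on a ZFS OST.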
This would be exacerbated by large numbers of small files being written by the client, since that consumes a lot of grant without much data actually being sent to the OSTs. You could check this by looking at:

client$ lctl get_param osc.*.cur_grant_bytes
oss$ lctl get_param obdfilter.*.tot_granted

to see whether the client was running out of grant, whether some client has a huge amount of unused grant, or whether the OST had granted all of the available space to the clients.

It does appear from the log that this is related to grants:

avail 31335002406912 left 0 unstable 0 tot_grant 25092482826240 no space for 1048576

which means that of the 31TB of space available, 25TB has been granted to the clients, and there is "no space left" for the 1MB write from the client. Where the remaining 6TB of space went is not totally clear. Because ZFS is a COW filesystem that allocates space long after the OST has accepted it for write, the free space reported by statfs always lags what is actually available, but probably not by 6TB. Only about 1% of the space is reserved by the obdfilter.*.grant_ratio setting.

If you are able to reproduce this and have a spare client, the http://review.whamcloud.com/7793 patch is worthwhile to test. |
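A quick way to aggregate those numbers on a live system (a sketch only; the awk summation and the kbytesavail parameter are my assumptions about a typical setup, not taken from this ticket):

  # on the OSS: total grant handed out across all OSTs, in bytes
  oss$ lctl get_param -n obdfilter.*.tot_granted | awk '{s += $1} END {print "tot_granted:", s, "bytes"}'
  # free space the same OSTs report, in KB, for comparison
  oss$ lctl get_param obdfilter.*.kbytesavail
  # on a client: grant currently held for each OST
  client$ lctl get_param osc.*.cur_grant_bytes

If tot_granted is close to the available space while the clients' cur_grant_bytes is mostly sitting unused, the ENOSPC is coming from grant exhaustion rather than a real lack of space.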
| Comment by Brian Johanson [ 01/Apr/16 ] |
|
Right on, I grabbed the patch and will test it out. |
| Comment by Brian Johanson [ 15/Apr/16 ] |
|
Resolved by http://review.whamcloud.com/7793/ |
| Comment by Andreas Dilger [ 16/Apr/16 ] |
|
Closing as a duplicate of |
| Comment by Andreas Dilger [ 16/Apr/16 ] |
|
Brian, just out of curiosity, I see that your OST configuration is "raidz2 9+2 ost 16 ost/oss 2 OSS's total", so you have only one RAID-Z2 VDEV per OST. This means that all the metadata ZFS writes (always with 2 or 3 redundant copies) must have its copies written to the same disks. Did you configure ZFS with one OST per VDEV because this is how it is typically done with ldiskfs, or is there another reason (e.g. measured performance improvement, failure isolation, something else)?

With 16 OSTs per OSS, I can't imagine that the IO performance is limited by the disks rather than the network. Have you done any benchmarks of the per-OST and per-OSS performance?

I wouldn't have recommended 16 VDEVs in a single OST either, but perhaps 4 OSTs, each with 4 VDEVs. This would improve the performance of single-stripe files, allow ZFS to write redundant metadata to different VDEVs, reduce space wastage and fragmentation as the smaller OSTs get full, and, coincidentally, would have avoided this grant problem: the clients would get 1/4 of the grant with 1/4 of the OSTs, and each OST would be 4x larger and could afford to grant 4x more space to each client. |
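For illustration only (pool and device names are hypothetical and not from this ticket), the suggested layout would mean one pool per OST built from four 9+2 RAID-Z2 VDEVs instead of one:

  # one of the 4 OST pools on an OSS; each raidz2 group below is one 9+2 VDEV
  # (11 placeholder disks per group, 44 per pool)
  oss$ zpool create ost0pool \
         raidz2 /dev/mapper/d{00..10} \
         raidz2 /dev/mapper/d{11..21} \
         raidz2 /dev/mapper/d{22..32} \
         raidz2 /dev/mapper/d{33..43}

With four VDEVs in the pool, ZFS can place its redundant metadata copies on different VDEVs, which is part of the benefit described above.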
| Comment by Brian Johanson [ 18/Apr/16 ] |
|
Andreas, the original plan was 2 VDEVs per OST. Some misguidance and a little performance variance pushed me away from that. In my haste to get something running, I couldn't find references to the benefits of multiple VDEVs per OST. I should have reached out to a few others in the community that I know are running ZFS backends before making the final decision. |