
intermittent ENOSPC on osd_write_commit

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • Affects Version: Lustre 2.8.0
    • Environment:
      CentOS 7.2, kernel 3.10.0-327.3.1.el7_lustre.x86_64
      Lustre 2.8.0
      ZFS 0.6.5.5
      OPA 10.0.0.0.697

      zfs mirrored mgt/mdt
      zfs raidz2 9+2 ost, 16 ost/oss, 2 OSS's total
    • Severity: 3

    Description

      Intermittent ENOSPC error. Reproduced it today and grabbed debug info.
      All OSTs have ~28TB free.
      It was happening over and over on one client running an rsync of many small files, but was not seen on a second client doing the same rsync to a separate directory. Shortly after I grabbed the debug info, it stopped happening.
      No errors found in system logs.

      Server debug excerpt below; the full debug log is attached.

      00080000:00000020:5.0:1459450158.935617:0:6537:0:(osd_handler.c:601:osd_sync()) synced OSD osd-zfs
      00002000:00000020:5.0:1459450158.935624:0:6537:0:(ofd_grant.c:280:ofd_grant_statfs()) pylon1-OST0017: cli 278722d8-f96f-0b0a-2cb5-fc84038e8f72/ffff880b43f8e000 free: 31335004504064 avail: 31335002406912
      00002000:00000020:5.0:1459450158.935627:0:6537:0:(ofd_grant.c:348:ofd_grant_space_left()) pylon1-OST0017: cli 278722d8-f96f-0b0a-2cb5-fc84038e8f72/ffff880b43f8e000 avail 31335002406912 left 0 unstable 0 tot_grant 25092482826240 pending 0
      00002000:00000020:5.0:1459450158.935631:0:6537:0:(ofd_grant.c:392:ofd_grant_incoming()) pylon1-OST0017: cli 278722d8-f96f-0b0a-2cb5-fc84038e8f72/ffff880b43f8e000 reports grant 49152 dropped 0, local 12582912
      00002000:00000020:5.0:1459450158.935636:0:6537:0:(ofd_grant.c:634:ofd_grant_check()) pylon1-OST0017: cli 278722d8-f96f-0b0a-2cb5-fc84038e8f72/ffff880b43f8e000 idx 0 no space for 1048576
      00002000:00000020:5.0:1459450158.935638:0:6537:0:(ofd_grant.c:655:ofd_grant_check()) pylon1-OST0017: cli 278722d8-f96f-0b0a-2cb5-fc84038e8f72/ffff880b43f8e000 granted: 0 ungranted: 0 grant: 12582912 dirty: 0
      00000100:00000010:5.0:1459450158.935658:0:6537:0:(client.c:134:ptlrpc_new_bulk()) kmalloced 'desc': 240 at ffff88101e6d4f00.
      00000020:00000040:5.0:1459450158.935659:0:6537:0:(genops.c:807:class_export_get()) GETting export ffff880b43f8e000 : new refcount 1260
      00000400:00000010:5.0:1459450158.935665:0:6537:0:(lib-lnet.h:218:lnet_md_alloc()) alloc '(md)': 176 at ffff880f583c58c0 (tot 256070564).
      00000400:00000010:5.0:1459450158.935668:0:6537:0:(lib-lnet.h:263:lnet_msg_alloc()) alloc '(msg)': 352 at ffff880b36f19800 (tot 256070916).
      00000400:00000200:5.0:1459450158.935672:0:6537:0:(lib-move.c:2412:LNetGet()) LNetGet -> 12345-10.4.112.74@o2ib
      00000800:00000200:5.0:1459450158.935676:0:6537:0:(o2iblnd_cb.c:1501:kiblnd_send()) sending 0 bytes in 0 frags to 12345-10.4.112.74@o2ib
      00000800:00000200:5.0:1459450158.935679:0:6537:0:(o2iblnd_cb.c:730:kiblnd_setup_rd_kiov()) niov 4 offset 0 nob 16384
      00000400:00000010:5.0:1459450158.935682:0:6537:0:(lib-lnet.h:263:lnet_msg_alloc()) alloc '(msg)': 352 at ffff880b36f18200 (tot 256071268).
      00000400:00000200:5.0:1459450158.935685:0:6537:0:(lib-move.c:2293:lnet_create_reply_msg()) 10.4.108.51@o2ib: Reply from 12345-10.4.112.74@o2ib md ffff880f583c58c0
      00000800:00000200:5.0:1459450158.935689:0:6537:0:(o2iblnd.c:402:kiblnd_find_peer_locked()) got peer [ffff88068f575580] -> 10.4.112.74@o2ib (2) version: 12
      00000800:00000200:5.0:1459450158.935691:0:6537:0:(o2iblnd_cb.c:1384:kiblnd_launch_tx()) conn[ffff880e476dc000] (260)++
      00000800:00000200:5.0:1459450158.935693:0:6537:0:(o2iblnd_cb.c:1155:kiblnd_queue_tx_locked()) conn[ffff880e476dc000] (261)++
      00000800:00000200:5.0:1459450158.935700:0:6537:0:(o2iblnd_cb.c:1390:kiblnd_launch_tx()) conn[ffff880e476dc000] (262)--
      00000100:00000200:5.0:1459450158.935703:0:6537:0:(niobuf.c:251:ptlrpc_start_bulk_transfer()) Transferring 4 pages 16384 bytes via portal 8 id 12345-10.4.112.74@o2ib mbits 0x56eb8dfb7df38-0x56eb8dfb7df38
      00000800:00000200:12.1:1459450158.935746:0:748:0:(o2iblnd_cb.c:3448:kiblnd_cq_completion()) conn[ffff880e476dc000] (261)++
      00000800:00000200:5.0:1459450158.935754:0:43037:0:(o2iblnd_cb.c:3570:kiblnd_scheduler()) conn[ffff880e476dc000] (262)++
      00000800:00000200:5.0:1459450158.935757:0:43037:0:(o2iblnd_cb.c:1011:kiblnd_tx_complete()) conn[ffff880e476dc000] (263)++
      00000800:00000200:5.0:1459450158.935759:0:43037:0:(o2iblnd_cb.c:1020:kiblnd_tx_complete()) conn[ffff880e476dc000] (264)--
      00000800:00000200:5.0:1459450158.935760:0:43037:0:(o2iblnd_cb.c:3586:kiblnd_scheduler()) conn[ffff880e476dc000] (263)--
      00000800:00000200:5.0:1459450158.935761:0:43037:0:(o2iblnd_cb.c:3586:kiblnd_scheduler()) conn[ffff880e476dc000] (262)--
      00000800:00000200:12.1:1459450158.935811:0:748:0:(o2iblnd_cb.c:3448:kiblnd_cq_completion()) conn[ffff880e476dc000] (261)++
      00000800:00000200:28.0:1459450158.935846:0:43035:0:(o2iblnd_cb.c:3570:kiblnd_scheduler()) conn[ffff880e476dc000] (262)++
      00000800:00000200:28.0:1459450158.935855:0:43035:0:(o2iblnd_cb.c:325:kiblnd_handle_rx()) Received d7[0] from 10.4.112.74@o2ib
      00000800:00000200:5.0:1459450158.935856:0:43037:0:(o2iblnd_cb.c:3586:kiblnd_scheduler()) conn[ffff880e476dc000] (263)--
      00000800:00000200:28.0:1459450158.935858:0:43035:0:(o2iblnd_cb.c:79:kiblnd_tx_done()) conn[ffff880e476dc000] (262)--
      00000100:00000200:28.0:1459450158.935863:0:43035:0:(events.c:447:server_bulk_callback()) event type 5, status 0, desc ffff88101e6d4f00
      00000400:00000010:28.0:1459450158.935867:0:43035:0:(lib-lnet.h:273:lnet_msg_free()) kfreed 'msg': 352 at ffff880b36f19800 (tot 256070916).
      00000100:00000200:28.0:1459450158.935870:0:43035:0:(events.c:447:server_bulk_callback()) event type 3, status 0, desc ffff88101e6d4f00
      00000400:00000200:28.0:1459450158.935874:0:43035:0:(lib-md.c:73:lnet_md_unlink()) Unlinking md ffff880f583c58c0
      00000400:00000010:28.0:1459450158.935876:0:43035:0:(lib-lnet.h:240:lnet_md_free()) kfreed 'md': 176 at ffff880f583c58c0 (tot 256070740).
      00000400:00000010:28.0:1459450158.935878:0:43035:0:(lib-lnet.h:273:lnet_msg_free()) kfreed 'msg': 352 at ffff880b36f18200 (tot 256070388).
      00000800:00000200:28.0:1459450158.935879:0:43035:0:(o2iblnd_cb.c:193:kiblnd_post_rx()) conn[ffff880e476dc000] (261)++
      00000800:00000200:28.0:1459450158.935882:0:43035:0:(o2iblnd_cb.c:222:kiblnd_post_rx()) conn[ffff880e476dc000] (262)--
      00000800:00000200:28.0:1459450158.935883:0:43035:0:(o2iblnd_cb.c:3586:kiblnd_scheduler()) conn[ffff880e476dc000] (261)--
      00080000:00000010:5.0:1459450158.935896:0:6537:0:(osd_handler.c:348:osd_trans_create()) kmalloced 'oh': 384 at ffff880b36f18800.
      00080000:04000000:5.0:1459450158.935908:0:6537:0:(osd_io.c:657:osd_declare_write_commit()) writing 4 pages, reserving 0K of quota space
      00000020:00000010:5.0:1459450158.936001:0:6537:0:(lu_object.c:1714:keys_init()) kmalloced 'ctx->lc_value': 320 at ffff880b36f19600.
      00000020:00000010:5.0:1459450158.936003:0:6537:0:(update_records.c:1217:update_key_init()) kmalloced 'value': 408 at ffff880b36f19000.
      00000004:00000010:5.0:1459450158.936006:0:6537:0:(osp_dev.c:1850:osp_txn_key_init()) kmalloced 'value': 4 at ffff8807b5337b98.
      00080000:00000002:5.0:1459450158.936010:0:6537:0:(osd_io.c:706:osd_write_commit()) write 4096 bytes at 65536
      00080000:00000002:5.0:1459450158.936011:0:6537:0:(osd_io.c:714:osd_write_commit()) obj [0x100170000:0x2035b:0x0]: skipping lnb[0]: rc=-28
      00080000:00000002:5.0:1459450158.936013:0:6537:0:(osd_io.c:706:osd_write_commit()) write 4096 bytes at 69632
      00080000:00000002:5.0:1459450158.936014:0:6537:0:(osd_io.c:714:osd_write_commit()) obj [0x100170000:0x2035b:0x0]: skipping lnb[1]: rc=-28
      00080000:00000002:5.0:1459450158.936015:0:6537:0:(osd_io.c:706:osd_write_commit()) write 4096 bytes at 73728
      00080000:00000002:5.0:1459450158.936016:0:6537:0:(osd_io.c:714:osd_write_commit()) obj [0x100170000:0x2035b:0x0]: skipping lnb[2]: rc=-28
      00080000:00000002:5.0:1459450158.936017:0:6537:0:(osd_io.c:706:osd_write_commit()) write 4096 bytes at 77824
      00080000:00000002:5.0:1459450158.936018:0:6537:0:(osd_io.c:714:osd_write_commit()) obj [0x100170000:0x2035b:0x0]: skipping lnb[3]: rc=-28
      00080000:00000010:5.0:1459450158.936022:0:6537:0:(osd_object.c:993:osd_attr_set()) kmalloced 'bulk': 400 at ffff880b36f19200.
      00080000:00000010:5.0:1459450158.936062:0:6537:0:(osd_object.c:1092:osd_attr_set()) kfreed 'bulk': 400 at ffff880b36f19200.

    Attachments

    Issue Links

    Activity

      [LU-7970] intermittent ENOSPC on osd_write_commit

            bjohanso Brian Johanson added a comment -

            Andreas, the original plan was two VDEVs per OST. Some misguidance, and a little variance compared to consistent performance, pushed me away from that. In my haste to get something running, I couldn't find references to the benefits of multiple VDEVs per OST. I should have reached out to a few others in the community that I know were running ZFS backends before making the final decision.

            adilger Andreas Dilger added a comment - edited

            Brian, just out of curiosity: I see that your OST configuration is "raidz2 9+2 ost 16 ost/oss 2 OSS's total", so you have only one RAID-Z2 VDEV per OST. This means that all of the metadata ZFS writes (always with 2 or 3 redundant copies) must have those copies written to the same disks.

            Did you configure ZFS with one OST per VDEV because this is how it is typically done with ldiskfs, or is there another reason (e.g. measured performance improvement, failure isolation, something else)? With 16 OSTs per OSS, I can't imagine that the IO performance is limited by the disk IO rather than by the network. Have you done any benchmarks of the per-OST and per-OSS performance?

            I wouldn't have recommended 16 VDEVs in a single OST either, but maybe 4 OSTs, each with 4 VDEVs. This would improve the performance of single-stripe files, allow ZFS to write its redundant metadata to different VDEVs, and reduce space wastage and fragmentation as the smaller OSTs fill up. It would also, coincidentally, have avoided this grant problem: with 1/4 as many OSTs each client would receive 1/4 of the grant, while each OST would be 4x larger and could afford to grant 4x more space to each client.
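
            For illustration, a minimal sketch of that kind of layout as one pool with four raidz2 (9+2) VDEVs backing a single OST (the pool, dataset, and device names are placeholders rather than the reporter's actual configuration, and the index and MGS NID would need to be filled in):

            oss$ zpool create ost00 \
                     raidz2 /dev/mapper/d{0..10}  \
                     raidz2 /dev/mapper/d{11..21} \
                     raidz2 /dev/mapper/d{22..32} \
                     raidz2 /dev/mapper/d{33..43}
            oss$ mkfs.lustre --ost --backfstype=zfs --fsname=pylon1 \
                     --index=<n> --mgsnode=<mgs_nid> ost00/ost0

            With four such pools per OSS, each OST spans four VDEVs, so ZFS can spread its redundant metadata copies across VDEVs while the OSS exports a quarter as many, 4x larger OSTs.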


            adilger Andreas Dilger added a comment -

            Closing as a duplicate of LU-2049.

            bjohanso Brian Johanson added a comment -

            Resolved by http://review.whamcloud.com/7793/

            bjohanso Brian Johanson added a comment -

            Right on, I grabbed the patch and will test it out.


            adilger Andreas Dilger added a comment -

            This may relate to LU-2049 and the space grant calculation on clients accessing ZFS OSTs. A patch for that was recently landed to master, but after the 2.8.0 release:

            commit bd1e41672c974b97148b65115185a57ca4b7bbde
            Author:     Johann Lombardi <johann.lombardi@intel.com>
            AuthorDate: Fri Jul 10 17:23:28 2015 -0700
            Commit:     Oleg Drokin <oleg.drokin@intel.com>
            CommitDate: Sat Feb 20 05:39:56 2016 +0000
            
                LU-2049 grant: add support for OBD_CONNECT_GRANT_PARAM
                
                Add support for grant overhead calculation on the client side.
                To do so, clients track usage on a per-extent basis. An extent is
                composed of contiguous blocks.
                The OST now returns to the OSC layer several parameters to consume
                grant more accurately:
                - the backend filesystem block size which is the minimal grant
                  allocation unit;
                - the maximum extent size;
                - the extent insertion cost.
                Clients now pack in bulk write how much grant space was consumed for
                the RPC. Dirty data accounting also adopts the same scheme.
                
                Moreover, each backend OSD now reports its own set of parameters:
                - For ldiskfs, we usually have a 4KB block size with a maximum extent
                  size of 32MB (theoretical limit of 128MB) and an extent insertion
                  cost of 6 x 4KB = 24KB
                - For ZFS, we report a block size of 128KB, an extent size of 128
                  blocks (i.e. 16MB with 128KB block size) and a block insertion cost
                  of 112KB.
                
                Besides, there is now no more generic metadata overhead reservation
                done inside each OSD. Instead grant space is inflated for clients
                that do not support the new grant parameters. That said, a tiny
                percentage (typically 0.76%) of the free space is still reserved
                inside each OSD to avoid fragmentation which might hurt performance
                and impact our grant calculation (e.g. extents are broken due to
                fragmentation).
                
                This patch also fixes several other issues:
                
                - Bulk write resent by ptlrpc after reconnection could trigger
                spurious error messages related to broken dirty accounting.
                The issue was that oa_dirty is discarded for resent requests
                (grant flag cleared in ost_brw_write()), so we can legitimately
                have grant > fed_dirty in ofd_grant_check().
                This was fixed by reseting fed_dirty on reconnection and skipping
                the dirty accounting check in ofd_grant_check() in the case of
                ptlrpc resend.
                
                - In obd_connect_data_seqprint(), the connection flags cannot fit
                  in a 32-bit integer.
                
                - When merging two OSC extents, an extent tax should be released
                  in both the merged extent and in the grant accounting.
                
                Signed-off-by: Johann Lombardi <johann.lombardi@intel.com>
                Signed-off-by: Jinshan Xiong <jinshan.xiong@intel.com>
                Reviewed-on: http://review.whamcloud.com/7793
            

            This would be exacerbated by large numbers of small files being written by the client, since it would consume a lot of grant without much data actually being sent to the OSTs. You could check this by looking at:

            client$ lctl get_param osc.*.cur_grant_bytes
            oss$ lctl get_param obdfilter.*.tot_granted
            

            to see whether the client was running out of grant, whether some client holds a huge amount of unused grant, or whether the OST has granted all of the available space to the clients. It does appear from the log that this is related to grants:

            avail 31335002406912 left 0 unstable 0 tot_grant 25092482826240
            no space for 1048576
            

            which means that of the 31TB of space available, 25TB has been granted to the clients, and there is "no space left" for the 1MB write from the client. Where the remaining 6TB of space went is not totally clear. Because ZFS is a COW filesystem that allocates space long after the OST has accepted it for write, the free space reported by statfs always lags behind what is actually available, but probably not by 6TB. Only about 1% of the space is reserved by the obdfilter.*.grant_ratio setting.
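
            As a rough, illustrative check of that arithmetic (a sketch only; it assumes the obdfilter.*.kbytesavail parameter is available on the OSS alongside the parameters already mentioned):

            oss$ lctl get_param obdfilter.*.tot_granted obdfilter.*.kbytesavail obdfilter.*.grant_ratio

            # From the log excerpt above, for pylon1-OST0017:
            #   avail     = 31335002406912 bytes  (~31.3 TB)
            #   tot_grant = 25092482826240 bytes  (~25.1 TB)
            # i.e. roughly 80% of the reported free space is already promised to
            # clients as grant, and the server computes "left 0" for the new 1MB write.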

            If you are able to reproduce this and have a spare client, the http://review.whamcloud.com/7793 patch is worthwhile to test.
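
            One possible way to pull that landed change onto a 2.8.0 test client (a sketch only: the repository URL and release tag name are assumptions, and the cherry-pick may need conflict resolution against the 2.8.0 tree):

            client$ git clone git://git.whamcloud.com/fs/lustre-release.git
            client$ cd lustre-release
            client$ git checkout -b lu2049-grant-test v2_8_0            # assumed 2.8.0 release tag name
            client$ git cherry-pick bd1e41672c974b97148b65115185a57ca4b7bbde   # commit quoted above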


    People

      Assignee: wc-triage (WC Triage)
      Reporter: bjohanso (Brian Johanson)
      Votes: 0
      Watchers: 3
