Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-16830

mdtest SEL jobs aborted with ENOSPC during automated FOFB testing

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.16.0
    • None
    • 3
    • 9223372036854775807

    Description

      lod_ost_alloc_rr() has 3 speed loops (speed 0, 1, 2). At 0,1 it loops OSTs without waiting on objects. And at speed 2 it waits obd_timeout(60s) for OST object creation.
      Somehow 0 and 1 speed loops did not allocate stripes. And at speed 2, OST loop waited 4 times on OSTs, 60 seconds each, and failed to allocate stripes also. Request took 240+ seconds to complete. And as result mdt_reint_open() finished with ENOSPC.

      00000004:00080000:5.0:1681399745.413238:0:52376:0:(osp_precreate.c:1552:osp_precreate_reserve()) snx11922-OST0001-osc-MDT0000: slow creates, last=[0x100010000:0x856c7ce1:0x0], next=[0x100010000:0x856c7ce1:0x0], reserved=0, sync_changes=0, sync_rpcs_in_progress=0, status=-19
      00020000:00001000:5.0:1681399745.413246:0:52376:0:(lod_qos.c:415:lod_qos_declare_object_on()) can't declare creation on #1: -110
      00020000:00001000:5.0:1681399745.413269:0:52376:0:(lod_qos.c:822:lod_ost_alloc_rr()) #3 strt 4 act 0 strp 0 ary 0 idx 0
      00000004:00000040:5.0:1681399745.413304:0:52376:0:(osp_precreate.c:1538:osp_precreate_reserve()) snx11922-OST0000-osc-MDT0000: Sleeping on objects
      00000004:00080000:5.0:1681399806.853255:0:52376:0:(osp_precreate.c:1552:osp_precreate_reserve()) snx11922-OST0000-osc-MDT0000: slow creates, last=[0x100000000:0x86f5beb8:0x0], next=[0x100000000:0x86f5beb8:0x0], reserved=0, sync_changes=0, sync_rpcs_in_progress=0, status=-19
      00020000:00001000:5.0:1681399806.853278:0:52376:0:(lod_qos.c:415:lod_qos_declare_object_on()) can't declare creation on #0: -110
      00000100:00100000:5.0:1681399806.853525:0:52376:0:(service.c:2368:ptlrpc_server_handle_request()) Handled RPC req@0000000032eefb47 pname:cluuid+ref:pid:xid:nid:opc:job mdt01_013:deadcc71-76dd-42b1-be5d-3b7c0cd701f9+310925:10959:x1762034564784640:12345-104@gni4:101: Request processed in 247686653us (247686679us total) trans 0 rc 301/301
      00000100:00001000:5.0:1681399806.853545:0:52376:0:(service.c:2384:ptlrpc_server_handle_request()) @@@ sent 9 early replies before finishing in 247s  req@0000000032eefb47 x1762034564784640/t0(0) o101->deadcc71-76dd-42b1-be5d-3b7c0cd701f9@104@gni4:316/0 lens 984/600 e 9 to 0 dl 1681399836 ref 1 fl Complete:/0/0 rc 301/301 job:''
      

      Here is enospc logs, without -28, but it is really it.

      00000004:00080000:4.0:1681509372.244328:0:96679:0:(osp_precreate.c:1552:osp_precreate_reserve()) snx11922-OST0000-osc-MDT0001: slow creates, last=[0x280000400:0x9685b1fc:0x0], next=[0x280000400:0x9685b1fc:0x0], reserved=0, sync_changes=83616, sync_rpcs_in_progress=0, status=-19
      00020000:00001000:4.0:1681509372.244340:0:96679:0:(lod_qos.c:415:lod_qos_declare_object_on()) can't declare creation on #0: -110
      00020000:00001000:4.0:1681509372.244345:0:96679:0:(lod_qos.c:693:lod_check_and_reserve_ost()) can't declare new object on #0: -110
      00020000:00001000:4.0:1681509372.244347:0:96679:0:(lod_qos.c:822:lod_ost_alloc_rr()) #1 strt 14 act 0 strp 0 ary 2 idx 1
      00000004:00001000:4.0:1681509372.244351:0:96679:0:(osp_dev.c:806:osp_statfs()) snx11922-OST0001-osc-MDT0001: 30062590209 blocks, 20088647533 free, 19785479608 avail, 4096 bsize, 114679 reserved mb low, 229359 reserved mb high, 118423424 files, 82006030 free files
      00000004:00000040:4.0:1681509372.244362:0:96679:0:(osp_precreate.c:1538:osp_precreate_reserve()) snx11922-OST0001-osc-MDT0001: Sleeping on objects
      00000004:00080000:4.0:1681509433.684392:0:96679:0:(osp_precreate.c:1552:osp_precreate_reserve()) snx11922-OST0001-osc-MDT0001: slow creates, last=[0x300000400:0x953923b1:0x0], next=[0x300000400:0x953923b1:0x0], reserved=0, sync_changes=76494, sync_rpcs_in_progress=0, status=-19
      00020000:00001000:4.0:1681509433.684395:0:96679:0:(lod_qos.c:415:lod_qos_declare_object_on()) can't declare creation on #1: -110
      00020000:00001000:4.0:1681509433.684396:0:96679:0:(lod_qos.c:693:lod_check_and_reserve_ost()) can't declare new object on #1: -110
      00020000:00001000:4.0:1681509433.684397:0:96679:0:(lod_qos.c:822:lod_ost_alloc_rr()) #2 strt 12 act 0 strp 0 ary 0 idx 0
      00000004:00001000:4.0:1681509433.684398:0:96679:0:(osp_dev.c:806:osp_statfs()) snx11922-OST0000-osc-MDT0001: 30062590209 blocks, 20054934974 free, 19751767042 avail, 4096 bsize, 114679 reserved mb low, 229359 reserved mb high, 118423424 files, 81874401 free files
      00000004:00000040:4.0:1681509433.684401:0:96679:0:(osp_precreate.c:1538:osp_precreate_reserve()) snx11922-OST0000-osc-MDT0001: Sleeping on objects
      00000004:00080000:4.0:1681509495.124293:0:96679:0:(osp_precreate.c:1552:osp_precreate_reserve()) snx11922-OST0000-osc-MDT0001: slow creates, last=[0x280000400:0x9685b1fc:0x0], next=[0x280000400:0x9685b1fc:0x0], reserved=0, sync_changes=83616, sync_rpcs_in_progress=0, status=-19
      00020000:00001000:4.0:1681509495.124299:0:96679:0:(lod_qos.c:415:lod_qos_declare_object_on()) can't declare creation on #0: -110
      00020000:00001000:4.0:1681509495.124301:0:96679:0:(lod_qos.c:693:lod_check_and_reserve_ost()) can't declare new object on #0: -110
      00020000:00001000:4.0:1681509495.124302:0:96679:0:(lod_qos.c:822:lod_ost_alloc_rr()) #3 strt 6 act 0 strp 0 ary 2 idx 1
      00000004:00001000:4.0:1681509495.124304:0:96679:0:(osp_dev.c:806:osp_statfs()) snx11922-OST0001-osc-MDT0001: 30062590209 blocks, 20088647533 free, 19785479608 avail, 4096 bsize, 114679 reserved mb low, 229359 reserved mb high, 118423424 files, 82006030 free files
      00000004:00000040:4.0:1681509495.124310:0:96679:0:(osp_precreate.c:1538:osp_precreate_reserve()) snx11922-OST0001-osc-MDT0001: Sleeping on objects
      00000004:00080000:4.0:1681509556.564304:0:96679:0:(osp_precreate.c:1552:osp_precreate_reserve()) snx11922-OST0001-osc-MDT0001: slow creates, last=[0x300000400:0x953923b1:0x0], next=[0x300000400:0x953923b1:0x0], reserved=0, sync_changes=76494, sync_rpcs_in_progress=0, status=-19
      00020000:00001000:4.0:1681509556.564310:0:96679:0:(lod_qos.c:415:lod_qos_declare_object_on()) can't declare creation on #1: -110
      00020000:00001000:4.0:1681509556.564313:0:96679:0:(lod_qos.c:693:lod_check_and_reserve_ost()) can't declare new object on #1: -110
      00000004:00000040:4.0:1681509556.564324:0:96679:0:(mdt_handler.c:6369:mdt_object_free()) object free, fid = [0x2400f42e8:0xc2de:0x0]
      00000004:00000040:4.0:1681509556.564342:0:96679:0:(mdt_lib.c:735:mdt_fix_reply()) Shrink to md_size = 0 cookie/acl_size = 0
      00000100:00001000:4.0:1681509556.564350:0:96679:0:(import.c:1941:at_measured()) add 248 to 00000000f15f2540 time=111 v=253 (253 153 23 25)
      00000100:00000040:4.0:1681509556.564352:0:96679:0:(connection.c:145:ptlrpc_connection_addref()) conn=00000000cd9dc6e2 refcount 3 to 112@gni4
      00000100:00000040:4.0:1681509556.564355:0:96679:0:(niobuf.c:58:ptl_send_buf()) peer_id 12345-112@gni4
      00000100:00000040:4.0:1681509556.564366:0:96679:0:(lustre_net.h:1945:ptlrpc_connection_put()) PUT conn=00000000cd9dc6e2 refcount 2 to 112@gni4
      00000100:00000040:4.0:1681509556.564368:0:96679:0:(lustre_net.h:2404:ptlrpc_rqphase_move()) @@@ move request phase from Interpret to Complete  req@00000000d5ffd1e9 x1761995533613248/t0(0) o101->00e2d691-47e0-4194-a67c-1aa5d855ebc6@112@gni4:456/0 lens 984/600 e 8 to 0 dl 1681509561 ref 1 fl Interpret:/0/0 rc 301/301 job:''
      00000100:00100000:4.0:1681509556.564373:0:96679:0:(service.c:2368:ptlrpc_server_handle_request()) Handled RPC req@00000000d5ffd1e9 pname:cluuid+ref:pid:xid:nid:opc:job mdt01_028:00e2d691-47e0-4194-a67c-1aa5d855ebc6+191645:4033:x1761995533613248:12345-112@gni4:101: Request processed in 247828965us (247829037us total) trans 0 rc 301/301
      

      It looks like lod_ost_alloc_rr() logic generates OST idx for allocation always equal to a failover indexes, 0 and 1.
      I think I'm seeing a problem with object allocation at sources.
      with 2 out of 4 OSTs we have 12 retries, it is not totally right math since Lustre uses atomic_inc() % ost_count, but estimated probability is 0,5^12 = 0,024%

      Attachments

        Issue Links

          Activity

            People

              aboyko Alexander Boyko
              aboyko Alexander Boyko
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: