Lustre / LU-10751

sanity test 27o fails with 'able to create /mnt/lustre/d27o.sanity/f27o.sanity'


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version: None
    • Affects Version: Lustre 2.11.0
    • Severity: 3

    Description

      sanity test_27o is failing because a file can still be created after all precreations on all OSTs have been exhausted. The error message for this failure is

      'able to create /mnt/lustre/d27o.sanity/f27o.sanity'
      

      For each OST, the test's exhaust_all_preallocations() collects prealloc_last_id and prealloc_next_id from the corresponding OSC (e.g. osc.lustre-OST0001-osc-MDT0003.prealloc_last_id) and creates (last_id - next_id + 2) files to exhaust all file precreations. Looking at the suite_log for the failure at https://testing.hpdd.intel.com/test_sets/dabd9962-0d65-11e8-bd00-52540065bddc, we see that this works for each OST. For example, for OST1
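The per-OST file count can be illustrated with the values from the OST1 log below. This is a minimal sketch of the arithmetic only; the variable names are illustrative and not the actual sanity.sh code:

```shell
# Sketch of the exhaustion arithmetic (illustrative names, not sanity.sh).
# Values are taken from the OST1 suite_log excerpt in this ticket.
last_id=97   # osc.lustre-OST0001-osc-MDT0003.prealloc_last_id
next_id=69   # osc.lustre-OST0001-osc-MDT0003.prealloc_next_id

# Files to create so that every precreated object on this OST is consumed.
count=$((last_id - next_id + 2))
echo "$count"   # -> 30
```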

       
      OSTIDX=1 MDTIDX=3
      CMD: trevis-15vm5 lctl get_param -n osc.lustre-OST0001-osc-MDT0003.prealloc_last_id
      CMD: trevis-15vm5 lctl get_param -n osc.lustre-OST0001-osc-MDT0003.prealloc_next_id
      CMD: trevis-15vm5 lctl get_param osc.*OST*-osc-MDT0003.prealloc*
      …
      osc.lustre-OST0001-osc-MDT0003.prealloc_last_id=97
      osc.lustre-OST0001-osc-MDT0003.prealloc_last_seq=0x380000401
      osc.lustre-OST0001-osc-MDT0003.prealloc_next_id=69
      osc.lustre-OST0001-osc-MDT0003.prealloc_next_seq=0x380000401
      osc.lustre-OST0001-osc-MDT0003.prealloc_reserved=0
      osc.lustre-OST0001-osc-MDT0003.prealloc_status=-28
      …
      striped dir -i3 -c2 /mnt/lustre/d27o.sanity/lustre-OST0001
      CMD: trevis-15vm3 lctl set_param fail_val=-1 fail_loc=0x215
      fail_val=-1
      fail_loc=0x215
      Creating to objid 97 on ost lustre-OST0001...
      open(/mnt/lustre/d27o.sanity/lustre-OST0001/f71) error: No space left on device
      total: 2 open/close in 0.01 seconds: 210.13 ops/second
      

      So OST1 is “full”, and we see the same for all OSTs except one:

      OSTIDX=0 MDTIDX=3
      CMD: trevis-15vm5 lctl get_param -n osc.lustre-OST0000-osc-MDT0003.prealloc_last_id
      CMD: trevis-15vm5 lctl get_param -n osc.lustre-OST0000-osc-MDT0003.prealloc_next_id
      CMD: trevis-15vm5 lctl get_param osc.*OST*-osc-MDT0003.prealloc*
      osc.lustre-OST0000-osc-MDT0003.prealloc_last_id=129
      osc.lustre-OST0000-osc-MDT0003.prealloc_last_seq=0x300000401
      osc.lustre-OST0000-osc-MDT0003.prealloc_next_id=85
      osc.lustre-OST0000-osc-MDT0003.prealloc_next_seq=0x300000401
      osc.lustre-OST0000-osc-MDT0003.prealloc_reserved=0
      osc.lustre-OST0000-osc-MDT0003.prealloc_status=0
      …
      striped dir -i3 -c2 /mnt/lustre/d27o.sanity/lustre-OST0000
      CMD: trevis-15vm3 lctl set_param fail_val=-1 fail_loc=0x215
      fail_val=-1
      fail_loc=0x215
      Creating to objid 129 on ost lustre-OST0000...
      total: 46 open/close in 0.07 seconds: 615.37 ops/second
      

      We don’t see OST0 fill up or return “No space left on device”. Unfortunately, we see the same behavior when sanity test 27o passes.

      Although this might be expected due to the fail_loc, in the dmesg log on MDS1/3 we see

      [ 1181.147030] Lustre: DEBUG MARKER: == sanity test 27o: create file with all full OSTs (should error) ==================================== 20:59:07 (1518123547)
      [ 1181.756807] Lustre: DEBUG MARKER: lctl get_param -n lov.*.qos_maxage
      [ 1194.103935] Lustre: DEBUG MARKER: lctl get_param -n lov.*.qos_maxage
      [ 1205.580992] LustreError: 28563:0:(lod_qos.c:1352:lod_alloc_specific()) can't lstripe objid [0x2000013a2:0xf4be:0x0]: have 0 want 1
      [ 1206.377129] Lustre: DEBUG MARKER: lctl get_param -n lov.*.qos_maxage
      [ 1217.863538] LustreError: 30940:0:(lod_qos.c:1352:lod_alloc_specific()) can't lstripe objid [0x2000013a2:0xf4bf:0x0]: have 0 want 1
      [ 1218.669304] Lustre: DEBUG MARKER: lctl get_param -n lov.*.qos_maxage
      [ 1230.138926] LustreError: 28565:0:(lod_qos.c:1352:lod_alloc_specific()) can't lstripe objid [0x2000013a2:0xf4c0:0x0]: have 0 want 1
      [ 1230.931360] Lustre: DEBUG MARKER: lctl get_param -n lov.*.qos_maxage
      [ 1242.447227] LustreError: 32075:0:(lod_qos.c:1352:lod_alloc_specific()) can't lstripe objid [0x2000013a2:0xf4c1:0x0]: have 0 want 1
      [ 1243.258795] Lustre: DEBUG MARKER: lctl get_param -n lov.*.qos_maxage
      [ 1254.768848] LustreError: 32075:0:(lod_qos.c:1352:lod_alloc_specific()) can't lstripe objid [0x2000013a2:0xf4c2:0x0]: have 0 want 1
      [ 1255.579674] Lustre: DEBUG MARKER: lctl get_param -n lov.*.qos_maxage
      [ 1267.067337] LustreError: 30940:0:(lod_qos.c:1352:lod_alloc_specific()) can't lstripe objid [0x2000013a2:0xf4c3:0x0]: have 0 want 1
      [ 1267.873867] Lustre: DEBUG MARKER: lctl get_param -n lov.*.qos_maxage
      [ 1280.214333] Lustre: DEBUG MARKER: lctl get_param -n lov.*.qos_maxage
      [ 1290.618993] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanity test_27o: @@@@@@ FAIL: able to create \/mnt\/lustre\/d27o.sanity\/f27o.sanity 
      

      sanity test 27o started failing with this error message on 2018-01-25 and, so far, fails only in DNE testing.

      Logs for failures are at
      https://testing.hpdd.intel.com/test_sets/959d7148-1c58-11e8-a10a-52540065bddc

            People

              Assignee: WC Triage (wc-triage)
              Reporter: James Nunez (Inactive)