Lustre / LU-10613

replay-single tests 20c, 21, 23, 24, 25, 26, 30, 48, 53f, 53g, 62, 70b, 70c fail on open with 'No space left on device'


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Affects Version/s: Lustre 2.11.0, Lustre 2.10.4, Lustre 2.10.5, Lustre 2.13.0, Lustre 2.10.7, Lustre 2.12.1, Lustre 2.12.3, Lustre 2.12.4, Lustre 2.12.5

    Description

      A variety of replay-single tests are failing because calls like open and touch return "No space left on device" even though the file system is far from full. The set of failing tests differs from session to session. For example, the test session at https://testing.hpdd.intel.com/test_sets/91b58a56-0a6b-11e8-a6ad-52540065bddc has tests 23, 24, 25, 26, 48, 53g, 70b, and 70c failing with "No space left on device", while the test session at https://testing.hpdd.intel.com/test_sets/b226317a-08a2-11e8-a10a-52540065bddc has tests 20c, 21, 30, 31, 32, 33a, 48, 53c, 53f, 53g, 70b, and 70c failing with the same error message.

      For example, looking at a failover test session for lustre-master build #3703 with ldiskfs servers where replay-single fails (https://testing.hpdd.intel.com/test_sets/91b58a56-0a6b-11e8-a6ad-52540065bddc), we see test 26 fail with the error message

      'multiop_bg_pause /mnt/lustre/f26.replay-single-1 failed' 
      

      Looking at the test_log for replay-single test_26 (and for most of the other failed tests), we see that the file system is nowhere near full:

      == replay-single test 26: |X| open(O_CREAT), unlink two, close one, replay, close one (test mds_cleanup_orphans) ====================================================================================================== 08:40:42 (1517820042)
      CMD: onyx-49vm11 sync; sync; sync
      UUID                   1K-blocks        Used   Available Use% Mounted on
      lustre-MDT0000_UUID      5825660       48852     5253976   1% /mnt/lustre[MDT:0]
      lustre-OST0000_UUID      1933276       30916     1781120   2% /mnt/lustre[OST:0]
      lustre-OST0001_UUID      1933276       25792     1786244   1% /mnt/lustre[OST:1]
      lustre-OST0002_UUID      1933276       25788     1786248   1% /mnt/lustre[OST:2]
      lustre-OST0003_UUID      1933276       25788     1786248   1% /mnt/lustre[OST:3]
      lustre-OST0004_UUID      1933276       25788     1786248   1% /mnt/lustre[OST:4]
      lustre-OST0005_UUID      1933276       25788     1786248   1% /mnt/lustre[OST:5]
      lustre-OST0006_UUID      1933276       25840     1769716   1% /mnt/lustre[OST:6]
      
      filesystem_summary:     13532932      185700    12482072   1% /mnt/lustre
      
      CMD: onyx-49vm5.onyx.hpdd.intel.com,onyx-49vm7,onyx-49vm8 mcreate /mnt/lustre/fsa-\$(hostname); rm /mnt/lustre/fsa-\$(hostname)
      CMD: onyx-49vm5.onyx.hpdd.intel.com,onyx-49vm7,onyx-49vm8 if [ -d /mnt/lustre2 ]; then mcreate /mnt/lustre2/fsa-\$(hostname); rm /mnt/lustre2/fsa-\$(hostname); fi
      CMD: onyx-49vm11 /usr/sbin/lctl --device lustre-MDT0000 notransno
      CMD: onyx-49vm11 /usr/sbin/lctl --device lustre-MDT0000 readonly
      CMD: onyx-49vm11 /usr/sbin/lctl mark mds1 REPLAY BARRIER on lustre-MDT0000
      multiop /mnt/lustre/f26.replay-single-1 vO_tSc
      TMPPIPE=/tmp/multiop_open_wait_pipe.976
      open(O_RDWR|O_CREAT): No space left on device
      replay-single test_26: @@@@@@ FAIL: multiop_bg_pause /mnt/lustre/f26.replay-single-1 failed 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:5336:error()
        = /usr/lib64/lustre/tests/replay-single.sh:613:test_26()
      
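      Since block usage in the df output above is only about 1%, the ENOSPC is more likely to come from inode/object exhaustion or from the MDS momentarily having no precreated OST objects to hand out right after the replay barrier. The following is a minimal diagnostic sketch (not part of the test suite; the mount point is taken from the logs above and the commands are standard lfs/lctl calls) of what would be worth capturing the next time one of these tests hits ENOSPC:

      # On a client: compare block usage with inode usage; ENOSPC while
      # plenty of blocks are free often means no free inodes/objects.
      lfs df /mnt/lustre
      lfs df -i /mnt/lustre

      # On the MDS: per-OST precreate state as seen by the MDT. A non-zero
      # prealloc_status (e.g. -28), or prealloc_next_id having caught up to
      # prealloc_last_id, means open(O_CREAT) can fail with ENOSPC even
      # though block usage is low.
      lctl get_param osp.*.prealloc_status osp.*.prealloc_last_id osp.*.prealloc_next_id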

      In the MDS dmesg, the only hint of trouble is

      [  117.289138] Lustre: DEBUG MARKER: mds1 REPLAY BARRIER on lustre-MDT0000
      [  117.297983] LustreError: 1978:0:(lod_qos.c:1352:lod_alloc_specific()) can't lstripe objid [0x200097212:0xe:0x0]: have 0 want 1
      [  117.511816] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  replay-single test_26: @@@@@@ FAIL: multiop_bg_pause \/mnt\/lustre\/f26.replay-single-1 failed 
      

      The "can't lstripe objid" error makes the problem look like LU-10350, but that patch has already landed and sanity-dom is not run in this test group.
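      The "have 0 want 1" from lod_alloc_specific() indicates that the LOD layer on the MDS asked its OSPs for one OST object for the new file and received none, which suggests that no OST objects were available at that moment (for example, OSTs not yet usable for creates during recovery) rather than real space exhaustion. A minimal sketch, assuming the MDS facet from the session above (onyx-49vm11) and standard MDS-side parameters, of how to check whether any OST is excluded from object creation:

      # Run on the MDS. Each osp device represents one OST from the MDT's
      # point of view.
      lctl dl | grep osp                     # OSP devices and their state
      lctl get_param osp.*.active            # 0 = OST deactivated for this MDT
      lctl get_param osp.*.max_create_count  # 0 = object creation disabled on that OST
      lctl get_param osp.*.create_count      # current precreate batch size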

      For each test that fails due to 'No space left on device', the test banner and the error from its test log are shown below. Several lines of test output have been removed to focus on the error/failure:

      == replay-single test 20c: check that client eviction does not affect file content =================== 18:24:30 (1517624670)
      multiop /mnt/lustre/f20c.replay-single vOw_c
      TMPPIPE=/tmp/multiop_open_wait_pipe.31377
      open(O_RDWR|O_CREAT): No space left on device
      replay-single test_20c: @@@@@@ FAIL: multiop_bg_pause /mnt/lustre/f20c.replay-single failed 
      
      == replay-single test 21: |X| open(O_CREAT), unlink touch new, replay, close (test mds_cleanup_orphans) 
      multiop /mnt/lustre/f21.replay-single vO_tSc
      TMPPIPE=/tmp/multiop_open_wait_pipe.31377
      open(O_RDWR|O_CREAT): No space left on device
      replay-single test_21: @@@@@@ FAIL: multiop_bg_pause /mnt/lustre/f21.replay-single failed 
      
      == replay-single test 30: open(O_CREAT) two, unlink two, replay, close two (test mds_cleanup_orphans) ====================================================================================================== 18:34:12 (1517625252)
      multiop /mnt/lustre/f30.replay-single-1 vO_tSc
      TMPPIPE=/tmp/multiop_open_wait_pipe.31377
      open(O_RDWR|O_CREAT): No space left on device
      replay-single test_30: @@@@@@ FAIL: multiop_bg_pause /mnt/lustre/f30.replay-single-1 failed 
      
      == replay-single test 31: open(O_CREAT) two, unlink one, |X| unlink one, close two (test mds_cleanup_orphans) ====================================================================================================== 18:34:16 (1517625256)
      multiop /mnt/lustre/f31.replay-single-1 vO_tSc
      TMPPIPE=/tmp/multiop_open_wait_pipe.31377
      open(O_RDWR|O_CREAT): No space left on device
      replay-single test_31: @@@@@@ FAIL: multiop_bg_pause /mnt/lustre/f31.replay-single-1 failed 
      
      == replay-single test 32: close() notices client eviction; close() after client eviction ============= 18:34:18 (1517625258)
      multiop /mnt/lustre/f32.replay-single vO_c
      TMPPIPE=/tmp/multiop_open_wait_pipe.31377
      open(O_RDWR|O_CREAT): No space left on device
      replay-single test_32: @@@@@@ FAIL: multiop_bg_pause /mnt/lustre/f32.replay-single failed 
      
      == replay-single test 33a: fid seq shouldn't be reused after abort recovery ========================== 18:34:20 (1517625260)
      open(/mnt/lustre/f33a.replay-single-0) error: No space left on device
      total: 0 open/close in 0.00 seconds: 0.00 ops/second
      replay-single test_33a: @@@@@@ FAIL: createmany create /mnt/lustre/f33a.replay-single failed 
      
      == replay-single test 48: MDS->OSC failure during precreate cleanup (2824) =========================== 19:12:20 (1517627540)
      Started lustre-MDT0000
      CMD: onyx-48vm10 lctl set_param fail_loc=0x80000216
      fail_loc=0x80000216
      open(/mnt/lustre/f48.replay-single20) error: No space left on device
      total: 0 open/close in 0.00 seconds: 0.00 ops/second
      replay-single test_48: @@@@@@ FAIL: createmany recraete /mnt/lustre/f48.replay-single failed 
      
      == replay-single test 53c: |X| open request and close request while two MDC requests in flight ======= 19:17:23 (1517627843)
      CMD: onyx-48vm11 lctl set_param fail_loc=0x80000107
      open(O_RDWR|O_CREAT): No space left on device
      fail_loc=0x80000107
      CMD: onyx-48vm11 lctl set_param fail_loc=0x80000115
      fail_loc=0x80000115
      /usr/lib64/lustre/tests/replay-single.sh: line 1293: kill: (3180) - No such process
      replay-single test_53c: @@@@@@ FAIL: close_pid doesn't exist 
      
      == replay-single test 53f: |X| open reply and close reply while two MDC requests in flight =========== 19:20:23 (1517628023)
      CMD: onyx-48vm11 lctl set_param fail_loc=0x119
      open(O_RDWR|O_CREAT): No space left on device
      fail_loc=0x119
      CMD: onyx-48vm11 lctl set_param fail_loc=0x8000013b
      fail_loc=0x8000013b
      /usr/lib64/lustre/tests/replay-single.sh: line 1398: kill: (6748) - No such process
      replay-single test_53f: @@@@@@ FAIL: close_pid doesn't exist 
      
      == replay-single test 53g: |X| drop open reply and close request while close and open are both in flight ====================================================================================================== 19:20:31 (1517628031)
      CMD: onyx-48vm11 lctl set_param fail_loc=0x119
      open(O_RDWR|O_CREAT): No space left on device
      fail_loc=0x119
      CMD: onyx-48vm11 lctl set_param fail_loc=0x80000115
      fail_loc=0x80000115
      /usr/lib64/lustre/tests/replay-single.sh: line 1436: kill: (7302) - No such process
      CMD: onyx-48vm11 lctl set_param fail_loc=0
      fail_loc=0
      replay-single test_53g: @@@@@@ FAIL: close_pid doesn't exist 
      
      == replay-single test 70b: dbench 1mdts recovery; 3 clients ========================================== 20:06:47 (1517630807)
      onyx-48vm1: [6141] open ./clients/client0/~dmtmp/EXCEL/BEED0000 failed for handle 11169 (No space left on device)
      onyx-48vm1: (6142) ERROR: handle 11169 was not found
      onyx-48vm1: Child failed with status 1
      onyx-48vm4: [6141] open ./clients/client0/~dmtmp/EXCEL/BEED0000 failed for handle 11169 (No space left on device)
      onyx-48vm4: (6142) ERROR: handle 11169 was not found
      onyx-48vm4: Child failed with status 1
      onyx-48vm4: dbench: no process found
      onyx-48vm1: dbench: no process found
      onyx-48vm3: [6320] open ./clients/client0/~dmtmp/WORDPRO/LWPSAV0.TMP failed for handle 11188 (No space left on device)
      onyx-48vm3: (6321) ERROR: handle 11188 was not found
      onyx-48vm3: Child failed with status 1
      onyx-48vm3: dbench: no process found
      onyx-48vm3: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 10 sec
      onyx-48vm4: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 10 sec
      onyx-48vm1: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 10 sec
      CMD: onyx-48vm1,onyx-48vm3,onyx-48vm4 killall -0 dbench
      onyx-48vm4: dbench: no process found
      onyx-48vm3: dbench: no process found
      onyx-48vm1: dbench: no process found
      replay-single test_70b: @@@@@@ FAIL: dbench stopped on some of onyx-48vm1,onyx-48vm3,onyx-48vm4! 
      
      CMD: onyx-48vm1,onyx-48vm3,onyx-48vm4 killall  dbench
      onyx-48vm3: dbench: no process found
      onyx-48vm4: dbench: no process found
      onyx-48vm1: dbench: no process found
      replay-single test_70b: @@@@@@ FAIL: rundbench load on onyx-48vm1,onyx-48vm3,onyx-48vm4 failed! 
      
      

      Logs for this failure can be found at:
      lustre-master build # 3703 el7 - https://testing.hpdd.intel.com/test_sets/91b58a56-0a6b-11e8-a6ad-52540065bddc
      lustre-master build # 3703 el7 servers/sles12sp3 clients - https://testing.hpdd.intel.com/test_sets/b226317a-08a2-11e8-a10a-52540065bddc
      lustre-master build # 3703 sles12sp2 - https://testing.hpdd.intel.com/test_sets/4d1259a8-06d6-11e8-a7cd-52540065bddc
      lustre-master build # 3702 el7 servers/sles12sp3 clients - https://testing.hpdd.intel.com/test_sets/cee4eabe-04e1-11e8-bd00-52540065bddc

            People

              Assignee: WC Triage
              Reporter: James Nunez (Inactive)