Lustre / LU-12818

replay-single test_70b and other tests fail with “Numerical result out of range” error

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.13.0, Lustre 2.14.0, Lustre 2.12.4
    • Environment: RHEL8 and Ubuntu 18.04
    • Severity: 3

    Description

      Many replay-single tests fail on RHEL8 and Ubuntu 18.04 with the “Numerical result out of range” error. The failing tests report different failure messages, but all of them hit “Numerical result out of range” during file open.
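      “Numerical result out of range” is the strerror text for ERANGE (errno 34), so these open() calls are failing with -34. A quick check on a Linux/glibc system:

```python
import errno
import os

# ERANGE is errno 34 on Linux; its message text matches the failures above.
print(errno.ERANGE)               # 34
print(os.strerror(errno.ERANGE))  # Numerical result out of range
```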

      Looking at the suite_log for https://testing.whamcloud.com/test_sets/0f9f03d0-c158-11e9-90ad-52540065bddc, we see replay-single tests 13, 14, 15, 18, 21, 22, 23, 31, 32, 33a, 53b, 53c, 53g, 53h, and 70b all fail.

      == replay-single test 13: open chmod 0 |x| write close =============================================== 20:22:03 (1566073323)
      multiop /mnt/lustre/f13.replay-single vO_wc
      TMPPIPE=/tmp/multiop_open_wait_pipe.29190
      open(O_RDWR|O_CREAT): Numerical result out of range
       replay-single test_13: @@@@@@ FAIL: multiop_bg_pause /mnt/lustre/f13.replay-single failed 
      …
      == replay-single test 14: open(O_CREAT), unlink |X| close ============================================ 20:22:07 (1566073327)
      multiop /mnt/lustre/f14.replay-single vO_tSc
      TMPPIPE=/tmp/multiop_open_wait_pipe.29190
      open(O_RDWR|O_CREAT): Numerical result out of range
       replay-single test_14: @@@@@@ FAIL: multiop_bg_pause /mnt/lustre/f14.replay-single failed 
      …
      == replay-single test 15: open(O_CREAT), unlink |X|  touch new, close ================================ 20:22:11 (1566073331)
      multiop /mnt/lustre/f15.replay-single vO_tSc
      TMPPIPE=/tmp/multiop_open_wait_pipe.29190
      open(O_RDWR|O_CREAT): Numerical result out of range
       replay-single test_15: @@@@@@ FAIL: multiop_bg_pause /mnt/lustre/f15.replay-single failed 
      …
      trevis-8vm6: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
      touch: cannot touch '/mnt/lustre/f18.replay-single-3': Numerical result out of range
       replay-single test_18: @@@@@@ FAIL: touch /mnt/lustre/f18.replay-single-3 failed 
      …
      trevis-8vm6: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
      touch: cannot touch '/mnt/lustre/f21.replay-single-2': Numerical result out of range
       replay-single test_21: @@@@@@ FAIL: touch /mnt/lustre/f21.replay-single-2 failed 
      …
      == replay-single test 22: open(O_CREAT), |X| unlink, replay, close (test mds_cleanup_orphans) ======== 20:33:35 (1566074015)
      multiop /mnt/lustre/f22.replay-single vO_tSc
      TMPPIPE=/tmp/multiop_open_wait_pipe.29190
      open(O_RDWR|O_CREAT): Numerical result out of range
      …
      == replay-single test 23: open(O_CREAT), |X| unlink touch new, replay, close (test mds_cleanup_orphans) ====================================================================================================== 20:33:39 (1566074019)
      multiop /mnt/lustre/f23.replay-single vO_tSc
      TMPPIPE=/tmp/multiop_open_wait_pipe.29190
      open(O_RDWR|O_CREAT): Numerical result out of range
       replay-single test_23: @@@@@@ FAIL: multiop_bg_pause /mnt/lustre/f23.replay-single failed 
      …
      == replay-single test 31: open(O_CREAT) two, unlink one, |X| unlink one, close two (test mds_cleanup_orphans) ====================================================================================================== 20:44:50 (1566074690)
      multiop /mnt/lustre/f31.replay-single-1 vO_tSc
      TMPPIPE=/tmp/multiop_open_wait_pipe.29190
      open(O_RDWR|O_CREAT): Numerical result out of range
       replay-single test_31: @@@@@@ FAIL: multiop_bg_pause /mnt/lustre/f31.replay-single-1 failed 
      …
      == replay-single test 32: close() notices client eviction; close() after client eviction ============= 20:44:54 (1566074694)
      multiop /mnt/lustre/f32.replay-single vO_c
      TMPPIPE=/tmp/multiop_open_wait_pipe.29190
      open(O_RDWR|O_CREAT): Numerical result out of range
       replay-single test_32: @@@@@@ FAIL: multiop_bg_pause /mnt/lustre/f32.replay-single failed 
      …
      == replay-single test 33a: fid seq shouldn't be reused after abort recovery ========================== 20:44:58 (1566074698)
      open(/mnt/lustre/f33a.replay-single-0) error: Numerical result out of range
      total: 0 open/close in 0.00 seconds: 0.00 ops/second
       replay-single test_33a: @@@@@@ FAIL: createmany create /mnt/lustre/f33a.replay-single failed 
      …
      == replay-single test 53b: |X| open request while two MDC requests in flight ========================= 21:35:13 (1566077713)
      multiop /mnt/lustre/d53b.replay-single-1/f vO_c
      TMPPIPE=/tmp/multiop_open_wait_pipe.29190
      open(O_RDWR|O_CREAT): Numerical result out of range
       replay-single test_53b: @@@@@@ FAIL: multiop_bg_pause /mnt/lustre/d53b.replay-single-1/f failed 
      …
      == replay-single test 53c: |X| open request and close request while two MDC requests in flight ======= 21:35:17 (1566077717)
      open(O_RDWR|O_CREAT): Numerical result out of range
      CMD: trevis-8vm10 lctl set_param fail_loc=0x80000107
      fail_loc=0x80000107
      CMD: trevis-8vm10 lctl set_param fail_loc=0x80000115
      fail_loc=0x80000115
      /usr/lib64/lustre/tests/replay-single.sh: line 1294: kill: (15973) - No such process
      …
      == replay-single test 53g: |X| drop open reply and close request while close and open are both in flight ====================================================================================================== 21:39:58 (1566077998)
      open(O_RDWR|O_CREAT): Numerical result out of range
      CMD: trevis-8vm11 lctl set_param fail_loc=0x119
      fail_loc=0x119
      CMD: trevis-8vm11 lctl set_param fail_loc=0x80000115
      fail_loc=0x80000115
      /usr/lib64/lustre/tests/replay-single.sh: line 1437: kill: (22835) - No such process
      CMD: trevis-8vm11 lctl set_param fail_loc=0
      fail_loc=0
       replay-single test_53g: @@@@@@ FAIL: close_pid doesn't exist 
      …
      == replay-single test 53h: open request and close reply while two MDC requests in flight ============= 21:40:07 (1566078007)
      open(O_RDWR|O_CREAT): Numerical result out of range
      CMD: trevis-8vm11 lctl set_param fail_loc=0x80000107
      fail_loc=0x80000107
      CMD: trevis-8vm11 lctl set_param fail_loc=0x8000013b
      fail_loc=0x8000013b
      /usr/lib64/lustre/tests/replay-single.sh: line 1475: kill: (23667) - No such process
       replay-single test_53h: @@@@@@ FAIL: close_pid doesn't exist 
      …
      trevis-8vm7: [44191] open ./clients/client0/~dmtmp/PARADOX/__40D6B.DB failed for handle 10888 (Numerical result out of range)
      trevis-8vm7: (44193) ERROR: handle 10888 was not found
      trevis-8vm7: Child failed with status 1
      trevis-8vm7: dbench: no process found
      pdsh@trevis-8vm4: trevis-8vm7: ssh exited with exit code 1
      trevis-8vm4: [44069] open ./clients/client0/~dmtmp/PARADOX/__3F2C4.DB failed for handle 10851 (Numerical result out of range)
      trevis-8vm4: (44071) ERROR: handle 10851 was not found
      trevis-8vm6:    1     44591     0.68 MB/sec  execute 193 sec  latency 72567.258 ms
      trevis-8vm4: Child failed with status 1
      trevis-8vm4: dbench: no process found
      pdsh@trevis-8vm4: trevis-8vm4: ssh exited with exit code 1
      trevis-8vm6: [44660] open ./clients/client0/~dmtmp/PARADOX/__50172.DB failed for handle 11004 (Numerical result out of range)
      trevis-8vm6: (44662) ERROR: handle 11004 was not found
      trevis-8vm6: Child failed with status 1
      trevis-8vm6: dbench: no process found
      …
      trevis-8vm6: dbench: no process found
      pdsh@trevis-8vm4: trevis-8vm6: ssh exited with exit code 1
       replay-single test_70b: @@@@@@ FAIL: dbench stopped on some of trevis-8vm4.trevis.whamcloud.com,trevis-8vm6,trevis-8vm7! 
      

      Different test sessions have different tests that fail. For example, tests 28, 29, 30, 31, 32, 53f, 53g, and 70b fail in the Ubuntu 18.04 test session https://testing.whamcloud.com/test_sets/9742e3aa-cc8f-11e9-98c8-52540065bddc.

      Although the error messages are different, this issue may be related to LU-10613.

      Logs for other failed test sessions are at:
      https://testing.whamcloud.com/test_sets/b80265a8-d8af-11e9-a25b-52540065bddc
      https://testing.whamcloud.com/test_sets/c68821fe-b9fa-11e9-97d5-52540065bddc
      https://testing.whamcloud.com/test_sets/6615df92-d105-11e9-90ad-52540065bddc
      https://testing.whamcloud.com/test_sets/95922fb6-d80b-11e9-98c8-52540065bddc
      https://testing.whamcloud.com/test_sets/211af22a-ddcf-11e9-a197-52540065bddc

          Activity

            adilger Andreas Dilger added a comment - There is a patch: https://review.whamcloud.com/37393 "LU-13184 tests: wait for OST startup in test_112" which may fix this problem.
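            The patch referenced above waits for OST startup before the test proceeds. A minimal sketch of that kind of wait loop (the helper name `wait_import_full` and the example `osp.*.ost_server_uuid` parameter path are assumptions for illustration, not the patch's actual code):

```shell
# Hypothetical helper: poll an import's state until it reports FULL.
# CHECK_CMD is whatever prints the import state, e.g. on the MDS something like:
#   lctl get_param -n osp.lustre-OST0000-osc-MDT0000.ost_server_uuid
wait_import_full() {
    local check_cmd=$1 timeout=${2:-30} elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        if $check_cmd 2>/dev/null | grep -q FULL; then
            echo "import FULL after ${elapsed}s"
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    echo "import not FULL after ${timeout}s" >&2
    return 1
}
```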
            adilger Andreas Dilger added a comment - +1 on master https://testing.whamcloud.com/test_sets/9072748b-8448-46af-a2df-0e006ed552f1
            jhammond John Hammond added a comment - https://testing.whamcloud.com/test_sets/a242a505-11a4-4c88-9317-21d1ce234094
            jhammond John Hammond added a comment -

            ys's comment from LU-13995.

            Looks like we need to ensure ost0 is FULL before creating on it?

            ---client side
            
            00010000:00000001:1.0:1603787411.358695:0:25028:0:(ldlm_lock.c:249:ldlm_lock_put()) Process leaving
            00010000:00000001:1.0:1603787411.358696:0:25028:0:(ldlm_request.c:1136:ldlm_cli_enqueue()) Process leaving (rc=301 : 301 : 12d)
            00000002:00000001:1.0:1603787411.358697:0:25028:0:(mdc_locks.c:686:mdc_finish_enqueue()) Process entered
            00000002:00100000:1.0:1603787411.358698:0:25028:0:(mdc_locks.c:741:mdc_finish_enqueue()) @@@ op=1 disposition=3, status=-34  req@ffff8b1a79d88d80 x1681692846730048/t0(0) o101->lustre-MDT0000-mdc-ffff8b1a7a48f000@10.9.7.66@tcp:12/10 lens 648/600 e 0 to 0 dl 1603787418 ref 1 fl Complete:RQU/0/0 rc 301/301 job:'lfs.0'
            
            ---server side
            
            00020000:00000001:1.0:1603787411.359031:0:23957:0:(lod_qos.c:108:lod_statfs_and_check()) Process entered
            00000004:00000001:1.0:1603787411.359032:0:23957:0:(osp_dev.c:774:osp_statfs()) Process entered
            00000004:00000001:1.0:1603787411.359033:0:23957:0:(osp_dev.c:780:osp_statfs()) Process leaving (rc=18446744073709551509 : -107 : ffffffffffffff95)
            00020000:01000000:1.0:1603787411.359034:0:23957:0:(lod_qos.c:143:lod_statfs_and_check()) lustre-OST0000-osc-MDT0000: turns inactive
            00020000:00000001:1.0:1603787411.359034:0:23957:0:(lod_qos.c:171:lod_statfs_and_check()) Process leaving (rc=18446744073709551495 : -121 : ffffffffffffff87)
            00020000:00000001:1.0:1603787411.359035:0:23957:0:(lod_qos.c:108:lod_statfs_and_check()) Process entered
            00000004:00000001:1.0:1603787411.359036:0:23957:0:(osp_dev.c:774:osp_statfs()) Process entered
            00000004:00001000:1.0:1603787411.359037:0:23957:0:(osp_dev.c:797:osp_statfs()) lustre-OST0001-osc-MDT0000: 95744 blocks, 94976 free, 84480 avail, 4096 bsize, 1 reserved mb low, 3 reserved mb high, 12231 files, 11872 free files
            00000004:00000001:1.0:1603787411.359039:0:23957:0:(osp_dev.c:816:osp_statfs()) Process leaving (rc=0 : 0 : 0)
            00020000:01000000:1.0:1603787411.359040:0:23957:0:(lod_qos.c:143:lod_statfs_and_check()) lustre-OST0001-osc-MDT0000: turns inactive
            00020000:00000001:1.0:1603787411.359040:0:23957:0:(lod_qos.c:171:lod_statfs_and_check()) Process leaving (rc=18446744073709551511 : -105 : ffffffffffffff97)
            00020000:00000001:1.0:1603787411.359041:0:23957:0:(lod_qos.c:232:lod_qos_statfs_update()) Process leaving
            00020000:00000001:1.0:1603787411.359042:0:23957:0:(lod_qos.c:2509:lod_qos_prep_create()) Process leaving via out (rc=18446744073709551582 : -34 : 0xffffffffffffffde)
            00020000:00000001:1.0:1603787411.359043:0:23957:0:(lod_qos.c:2607:lod_qos_prep_create()) Process leaving (rc=18446744073709551582 : -34 : ffffffffffffffde)
            00020000:00000001:1.0:1603787411.359044:0:23957:0:(lod_qos.c:2663:lod_prepare_create()) Process leaving (rc=18446744073709551582 : -34 : ffffffffffffffde)
            00000004:00000001:1.0:1603787411.359045:0:23957:0:(lod_object.c:5474:lod_declare_striped_create()) Process leaving via out (rc=18446744073709551582 : -34 : 0xffffffffffffffde)
            00000004:00000010:1.0:1603787411.359046:0:23957:0:(lod_lov.c:492:lod_free_comp_buffer()) kfreed 'entries': 120 at ffff93c7af951f80.
            00000004:00000001:1.0:1603787411.359047:0:23957:0:(lod_object.c:5509:lod_declare_striped_create()) Process leaving (rc=18446744073709551582 : -34 : ffffffffffffffde)
            00000004:00000001:1.0:1603787411.359048:0:23957:0:(lod_object.c:3491:lod_declare_xattr_set()) Process leaving (rc=18446744073709551582 : -34 : ffffffffffffffde)
            00000004:00000001:1.0:1603787411.359049:0:23957:0:(mdd_dir.c:1923:mdd_create_data()) Process leaving via stop (rc=18446744073709551582 : -34 : 0xffffffffffffffde)
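            The rc values in the server-side trace are negative errnos printed as unsigned 64-bit integers; reinterpreting them as signed makes the failure chain readable (-107 ENOTCONN and -105 ENOBUFS from the inactive OSTs, then -34 ERANGE returned to the client):

```python
import ctypes
import os

def decode_rc(raw):
    """Reinterpret an unsigned 64-bit rc from a Lustre debug log as signed."""
    return ctypes.c_int64(raw).value

# Values taken from the lod_qos.c / osp_dev.c lines above,
# e.g. 18446744073709551582 -> -34 (Numerical result out of range).
for raw in (18446744073709551509, 18446744073709551511, 18446744073709551582):
    rc = decode_rc(raw)
    print(f"{raw} -> {rc} ({os.strerror(-rc)})")
```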
            
            

            adilger Andreas Dilger added a comment - I just saw this failure recently for a patch I was looking at, but I didn't find this bug at the time, so +1, but I can't link to the test results.

            Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36762
            Subject: LU-12818 test: debug patch
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 6ca3d6684d7c6adbc812a1c84c9afa8c8fcc7050

            pjones Peter Jones added a comment -

            Hongchao

            Can you please advise

            Peter


            People

              Assignee: Hongchao Zhang
              Reporter: James Nunez (Inactive)
              Votes: 0
              Watchers: 6