Details
Description
Many replay-single tests fail on RHEL 8 and Ubuntu 18.04 with the "Numerical result out of range" error. The individual failure messages differ from test to test, but every one of them reports "Numerical result out of range" during a file open.
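For reference, "Numerical result out of range" is the strerror() text for ERANGE (errno 34), so in each of these failures the client-side open path is returning -ERANGE. Below is a minimal sketch (not part of the test suite) of the open(O_RDWR|O_CREAT) pattern that multiop issues; the default path is illustrative, the tests use paths such as /mnt/lustre/f13.replay-single:

/* Minimal sketch: issue the same open(O_RDWR|O_CREAT) that multiop does and
 * show that "Numerical result out of range" is just strerror(ERANGE).
 * The default path below is illustrative. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
	const char *path = argc > 1 ? argv[1] : "/mnt/lustre/testfile";
	int fd = open(path, O_RDWR | O_CREAT, 0644);

	if (fd < 0) {
		/* On affected clients errno is ERANGE (34). */
		fprintf(stderr, "open(%s): %s (errno=%d)\n",
			path, strerror(errno), errno);
		return 1;
	}
	close(fd);
	return 0;
}

Run against a client mount that is hitting the bug, this prints the same message that multiop and touch report in the logs below.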
Looking at the suite_log for https://testing.whamcloud.com/test_sets/0f9f03d0-c158-11e9-90ad-52540065bddc, we see that replay-single tests 13, 14, 15, 18, 21, 22, 23, 31, 32, 33a, 53b, 53c, 53g, 53h, and 70b all fail:
== replay-single test 13: open chmod 0 |x| write close =============================================== 20:22:03 (1566073323)
multiop /mnt/lustre/f13.replay-single vO_wc
TMPPIPE=/tmp/multiop_open_wait_pipe.29190
open(O_RDWR|O_CREAT): Numerical result out of range
replay-single test_13: @@@@@@ FAIL: multiop_bg_pause /mnt/lustre/f13.replay-single failed
…
== replay-single test 14: open(O_CREAT), unlink |X| close ============================================ 20:22:07 (1566073327)
multiop /mnt/lustre/f14.replay-single vO_tSc
TMPPIPE=/tmp/multiop_open_wait_pipe.29190
open(O_RDWR|O_CREAT): Numerical result out of range
replay-single test_14: @@@@@@ FAIL: multiop_bg_pause /mnt/lustre/f14.replay-single failed
…
== replay-single test 15: open(O_CREAT), unlink |X| touch new, close ================================ 20:22:11 (1566073331)
multiop /mnt/lustre/f15.replay-single vO_tSc
TMPPIPE=/tmp/multiop_open_wait_pipe.29190
open(O_RDWR|O_CREAT): Numerical result out of range
replay-single test_15: @@@@@@ FAIL: multiop_bg_pause /mnt/lustre/f15.replay-single failed
…
trevis-8vm6: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
touch: cannot touch '/mnt/lustre/f18.replay-single-3': Numerical result out of range
replay-single test_18: @@@@@@ FAIL: touch /mnt/lustre/f18.replay-single-3 failed
…
trevis-8vm6: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
touch: cannot touch '/mnt/lustre/f21.replay-single-2': Numerical result out of range
replay-single test_21: @@@@@@ FAIL: touch /mnt/lustre/f21.replay-single-2 failed
…
== replay-single test 22: open(O_CREAT), |X| unlink, replay, close (test mds_cleanup_orphans) ======== 20:33:35 (1566074015)
multiop /mnt/lustre/f22.replay-single vO_tSc
TMPPIPE=/tmp/multiop_open_wait_pipe.29190
open(O_RDWR|O_CREAT): Numerical result out of range
…
== replay-single test 23: open(O_CREAT), |X| unlink touch new, replay, close (test mds_cleanup_orphans) ============== 20:33:39 (1566074019)
multiop /mnt/lustre/f23.replay-single vO_tSc
TMPPIPE=/tmp/multiop_open_wait_pipe.29190
open(O_RDWR|O_CREAT): Numerical result out of range
replay-single test_23: @@@@@@ FAIL: multiop_bg_pause /mnt/lustre/f23.replay-single failed
…
== replay-single test 31: open(O_CREAT) two, unlink one, |X| unlink one, close two (test mds_cleanup_orphans) ======== 20:44:50 (1566074690)
multiop /mnt/lustre/f31.replay-single-1 vO_tSc
TMPPIPE=/tmp/multiop_open_wait_pipe.29190
open(O_RDWR|O_CREAT): Numerical result out of range
replay-single test_31: @@@@@@ FAIL: multiop_bg_pause /mnt/lustre/f31.replay-single-1 failed
…
== replay-single test 32: close() notices client eviction; close() after client eviction ============= 20:44:54 (1566074694)
multiop /mnt/lustre/f32.replay-single vO_c
TMPPIPE=/tmp/multiop_open_wait_pipe.29190
open(O_RDWR|O_CREAT): Numerical result out of range
replay-single test_32: @@@@@@ FAIL: multiop_bg_pause /mnt/lustre/f32.replay-single failed
…
== replay-single test 33a: fid seq shouldn't be reused after abort recovery ========================== 20:44:58 (1566074698)
open(/mnt/lustre/f33a.replay-single-0) error: Numerical result out of range
total: 0 open/close in 0.00 seconds: 0.00 ops/second
replay-single test_33a: @@@@@@ FAIL: createmany create /mnt/lustre/f33a.replay-single failed
…
== replay-single test 53b: |X| open request while two MDC requests in flight ========================= 21:35:13 (1566077713)
multiop /mnt/lustre/d53b.replay-single-1/f vO_c
TMPPIPE=/tmp/multiop_open_wait_pipe.29190
open(O_RDWR|O_CREAT): Numerical result out of range
replay-single test_53b: @@@@@@ FAIL: multiop_bg_pause /mnt/lustre/d53b.replay-single-1/f failed
…
== replay-single test 53c: |X| open request and close request while two MDC requests in flight ======= 21:35:17 (1566077717)
open(O_RDWR|O_CREAT): Numerical result out of range
CMD: trevis-8vm10 lctl set_param fail_loc=0x80000107
fail_loc=0x80000107
CMD: trevis-8vm10 lctl set_param fail_loc=0x80000115
fail_loc=0x80000115
/usr/lib64/lustre/tests/replay-single.sh: line 1294: kill: (15973) - No such process
…
== replay-single test 53g: |X| drop open reply and close request while close and open are both in flight ============= 21:39:58 (1566077998)
open(O_RDWR|O_CREAT): Numerical result out of range
CMD: trevis-8vm11 lctl set_param fail_loc=0x119
fail_loc=0x119
CMD: trevis-8vm11 lctl set_param fail_loc=0x80000115
fail_loc=0x80000115
/usr/lib64/lustre/tests/replay-single.sh: line 1437: kill: (22835) - No such process
CMD: trevis-8vm11 lctl set_param fail_loc=0
fail_loc=0
replay-single test_53g: @@@@@@ FAIL: close_pid doesn't exist
…
== replay-single test 53h: open request and close reply while two MDC requests in flight ============= 21:40:07 (1566078007)
open(O_RDWR|O_CREAT): Numerical result out of range
CMD: trevis-8vm11 lctl set_param fail_loc=0x80000107
fail_loc=0x80000107
CMD: trevis-8vm11 lctl set_param fail_loc=0x8000013b
fail_loc=0x8000013b
/usr/lib64/lustre/tests/replay-single.sh: line 1475: kill: (23667) - No such process
replay-single test_53h: @@@@@@ FAIL: close_pid doesn't exist
…
trevis-8vm7: [44191] open ./clients/client0/~dmtmp/PARADOX/__40D6B.DB failed for handle 10888 (Numerical result out of range)
trevis-8vm7: (44193) ERROR: handle 10888 was not found
trevis-8vm7: Child failed with status 1
trevis-8vm7: dbench: no process found
pdsh@trevis-8vm4: trevis-8vm7: ssh exited with exit code 1
trevis-8vm4: [44069] open ./clients/client0/~dmtmp/PARADOX/__3F2C4.DB failed for handle 10851 (Numerical result out of range)
trevis-8vm4: (44071) ERROR: handle 10851 was not found
trevis-8vm6: 1 44591 0.68 MB/sec execute 193 sec latency 72567.258 ms
trevis-8vm4: Child failed with status 1
trevis-8vm4: dbench: no process found
pdsh@trevis-8vm4: trevis-8vm4: ssh exited with exit code 1
trevis-8vm6: [44660] open ./clients/client0/~dmtmp/PARADOX/__50172.DB failed for handle 11004 (Numerical result out of range)
trevis-8vm6: (44662) ERROR: handle 11004 was not found
trevis-8vm6: Child failed with status 1
trevis-8vm6: dbench: no process found
…
trevis-8vm6: dbench: no process found
pdsh@trevis-8vm4: trevis-8vm6: ssh exited with exit code 1
replay-single test_70b: @@@@@@ FAIL: dbench stopped on some of trevis-8vm4.trevis.whamcloud.com,trevis-8vm6,trevis-8vm7!
Different test sessions have different sets of failing tests. For example, tests 28, 29, 30, 31, 32, 53f, 53g, and 70b fail for the Ubuntu 18.04 test session https://testing.whamcloud.com/test_sets/9742e3aa-cc8f-11e9-98c8-52540065bddc.
Although the error messages are different, this issue may be related to LU-10613.
Logs for other failed test sessions are at:
https://testing.whamcloud.com/test_sets/b80265a8-d8af-11e9-a25b-52540065bddc
https://testing.whamcloud.com/test_sets/c68821fe-b9fa-11e9-97d5-52540065bddc
https://testing.whamcloud.com/test_sets/6615df92-d105-11e9-90ad-52540065bddc
https://testing.whamcloud.com/test_sets/95922fb6-d80b-11e9-98c8-52540065bddc
https://testing.whamcloud.com/test_sets/211af22a-ddcf-11e9-a197-52540065bddc
Issue Links
- is duplicated by
  - LU-13995 conf-sanity_112: Numerical result out of range (Closed)
- is related to
  - LU-12589 sanity test_102a: setfattr: /mnt/lustre/f102a.sanity: Numerical result out of range (Resolved)
  - LU-13184 conf-sanity test_112: problem creating f112.conf-sanity.0 on OST0000 (Resolved)
  - LU-13221 conf-sanity test_112: FAIL: MDS start failed (Reopened)
  - LU-13813 conf-sanity test_112: can't put import for mdc.lustre-MDT0000-mdc-*.mds_server_uuid into FULL state after 1475 sec, have DISCONN (Resolved)