[LU-12818] replay-single test_70b and other tests fail with “Numerical result out of range” error
| Created: | 27/Sep/19 | Updated: | 31/Mar/22 | Resolved: | 31/Mar/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.13.0, Lustre 2.14.0, Lustre 2.12.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | Hongchao Zhang |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | rhel8, ubuntu18, zfs |
| Environment: | RHEL8 and Ubuntu 18.04 |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Many replay-single tests fail on RHEL8 and Ubuntu 18.04 with the “Numerical result out of range” error. The individual failure messages differ, but each test hits the “Numerical result out of range” error during file open. Looking at the suite_log for https://testing.whamcloud.com/test_sets/0f9f03d0-c158-11e9-90ad-52540065bddc, we see replay-single tests 13, 14, 15, 18, 21, 22, 23, 31, 32, 33a, 53b, 53c, 53g, 53h, and 70b all fail:

== replay-single test 13: open chmod 0 |x| write close =============================================== 20:22:03 (1566073323)
multiop /mnt/lustre/f13.replay-single vO_wc
TMPPIPE=/tmp/multiop_open_wait_pipe.29190
open(O_RDWR|O_CREAT): Numerical result out of range
replay-single test_13: @@@@@@ FAIL: multiop_bg_pause /mnt/lustre/f13.replay-single failed
…
== replay-single test 14: open(O_CREAT), unlink |X| close ============================================ 20:22:07 (1566073327)
multiop /mnt/lustre/f14.replay-single vO_tSc
TMPPIPE=/tmp/multiop_open_wait_pipe.29190
open(O_RDWR|O_CREAT): Numerical result out of range
replay-single test_14: @@@@@@ FAIL: multiop_bg_pause /mnt/lustre/f14.replay-single failed
…
== replay-single test 15: open(O_CREAT), unlink |X| touch new, close ================================ 20:22:11 (1566073331)
multiop /mnt/lustre/f15.replay-single vO_tSc
TMPPIPE=/tmp/multiop_open_wait_pipe.29190
open(O_RDWR|O_CREAT): Numerical result out of range
replay-single test_15: @@@@@@ FAIL: multiop_bg_pause /mnt/lustre/f15.replay-single failed
…
trevis-8vm6: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
touch: cannot touch '/mnt/lustre/f18.replay-single-3': Numerical result out of range
replay-single test_18: @@@@@@ FAIL: touch /mnt/lustre/f18.replay-single-3 failed
…
trevis-8vm6: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
touch: cannot touch '/mnt/lustre/f21.replay-single-2': Numerical result out of range
replay-single test_21: @@@@@@ FAIL: touch /mnt/lustre/f21.replay-single-2 failed
…
== replay-single test 22: open(O_CREAT), |X| unlink, replay, close (test mds_cleanup_orphans) ======== 20:33:35 (1566074015)
multiop /mnt/lustre/f22.replay-single vO_tSc
TMPPIPE=/tmp/multiop_open_wait_pipe.29190
open(O_RDWR|O_CREAT): Numerical result out of range
…
== replay-single test 23: open(O_CREAT), |X| unlink touch new, replay, close (test mds_cleanup_orphans) ====================================================================================================== 20:33:39 (1566074019)
multiop /mnt/lustre/f23.replay-single vO_tSc
TMPPIPE=/tmp/multiop_open_wait_pipe.29190
open(O_RDWR|O_CREAT): Numerical result out of range
replay-single test_23: @@@@@@ FAIL: multiop_bg_pause /mnt/lustre/f23.replay-single failed
…
== replay-single test 31: open(O_CREAT) two, unlink one, |X| unlink one, close two (test mds_cleanup_orphans) ====================================================================================================== 20:44:50 (1566074690)
multiop /mnt/lustre/f31.replay-single-1 vO_tSc
TMPPIPE=/tmp/multiop_open_wait_pipe.29190
open(O_RDWR|O_CREAT): Numerical result out of range
replay-single test_31: @@@@@@ FAIL: multiop_bg_pause /mnt/lustre/f31.replay-single-1 failed
…
== replay-single test 32: close() notices client eviction; close() after client eviction ============= 20:44:54 (1566074694)
multiop /mnt/lustre/f32.replay-single vO_c
TMPPIPE=/tmp/multiop_open_wait_pipe.29190
open(O_RDWR|O_CREAT): Numerical result out of range
replay-single test_32: @@@@@@ FAIL: multiop_bg_pause /mnt/lustre/f32.replay-single failed
…
== replay-single test 33a: fid seq shouldn't be reused after abort recovery ========================== 20:44:58 (1566074698)
open(/mnt/lustre/f33a.replay-single-0) error: Numerical result out of range
total: 0 open/close in 0.00 seconds: 0.00 ops/second
replay-single test_33a: @@@@@@ FAIL: createmany create /mnt/lustre/f33a.replay-single failed
…
== replay-single test 53b: |X| open request while two MDC requests in flight ========================= 21:35:13 (1566077713)
multiop /mnt/lustre/d53b.replay-single-1/f vO_c
TMPPIPE=/tmp/multiop_open_wait_pipe.29190
open(O_RDWR|O_CREAT): Numerical result out of range
replay-single test_53b: @@@@@@ FAIL: multiop_bg_pause /mnt/lustre/d53b.replay-single-1/f failed
…
== replay-single test 53c: |X| open request and close request while two MDC requests in flight ======= 21:35:17 (1566077717)
open(O_RDWR|O_CREAT): Numerical result out of range
CMD: trevis-8vm10 lctl set_param fail_loc=0x80000107
fail_loc=0x80000107
CMD: trevis-8vm10 lctl set_param fail_loc=0x80000115
fail_loc=0x80000115
/usr/lib64/lustre/tests/replay-single.sh: line 1294: kill: (15973) - No such process
…
== replay-single test 53g: |X| drop open reply and close request while close and open are both in flight ====================================================================================================== 21:39:58 (1566077998)
open(O_RDWR|O_CREAT): Numerical result out of range
CMD: trevis-8vm11 lctl set_param fail_loc=0x119
fail_loc=0x119
CMD: trevis-8vm11 lctl set_param fail_loc=0x80000115
fail_loc=0x80000115
/usr/lib64/lustre/tests/replay-single.sh: line 1437: kill: (22835) - No such process
CMD: trevis-8vm11 lctl set_param fail_loc=0
fail_loc=0
replay-single test_53g: @@@@@@ FAIL: close_pid doesn't exist
…
== replay-single test 53h: open request and close reply while two MDC requests in flight ============= 21:40:07 (1566078007)
open(O_RDWR|O_CREAT): Numerical result out of range
CMD: trevis-8vm11 lctl set_param fail_loc=0x80000107
fail_loc=0x80000107
CMD: trevis-8vm11 lctl set_param fail_loc=0x8000013b
fail_loc=0x8000013b
/usr/lib64/lustre/tests/replay-single.sh: line 1475: kill: (23667) - No such process
replay-single test_53h: @@@@@@ FAIL: close_pid doesn't exist
…
trevis-8vm7: [44191] open ./clients/client0/~dmtmp/PARADOX/__40D6B.DB failed for handle 10888 (Numerical result out of range)
trevis-8vm7: (44193) ERROR: handle 10888 was not found
trevis-8vm7: Child failed with status 1
trevis-8vm7: dbench: no process found
pdsh@trevis-8vm4: trevis-8vm7: ssh exited with exit code 1
trevis-8vm4: [44069] open ./clients/client0/~dmtmp/PARADOX/__3F2C4.DB failed for handle 10851 (Numerical result out of range)
trevis-8vm4: (44071) ERROR: handle 10851 was not found
trevis-8vm6: 1 44591 0.68 MB/sec execute 193 sec latency 72567.258 ms
trevis-8vm4: Child failed with status 1
trevis-8vm4: dbench: no process found
pdsh@trevis-8vm4: trevis-8vm4: ssh exited with exit code 1
trevis-8vm6: [44660] open ./clients/client0/~dmtmp/PARADOX/__50172.DB failed for handle 11004 (Numerical result out of range)
trevis-8vm6: (44662) ERROR: handle 11004 was not found
trevis-8vm6: Child failed with status 1
trevis-8vm6: dbench: no process found
…
trevis-8vm6: dbench: no process found
pdsh@trevis-8vm4: trevis-8vm6: ssh exited with exit code 1
replay-single test_70b: @@@@@@ FAIL: dbench stopped on some of trevis-8vm4.trevis.whamcloud.com,trevis-8vm6,trevis-8vm7!

Different test sessions have different tests that fail.
For example, tests 28, 29, 30, 31, 32, 53f, 53g, and 70b fail in the Ubuntu 18.04 test session https://testing.whamcloud.com/test_sets/9742e3aa-cc8f-11e9-98c8-52540065bddc. Although the error messages are different, this issue may be related to LU-10613. Logs for other failed test sessions are at: |
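For reference, “Numerical result out of range” is strerror(ERANGE) (errno 34 on Linux), so every failure above is an open/create returning -ERANGE to the client. Below is a minimal userspace sketch of the failing call pattern; it is illustrative only (not test-suite code), with the path and open flags taken from the test 13 log lines above:

{code:c}
/* Sketch: issue the same open(O_RDWR|O_CREAT) call that the logs show
 * failing with "Numerical result out of range" (ERANGE, errno 34). */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* path taken from the test 13 log; any file on the client mount works */
	int fd = open("/mnt/lustre/f13.replay-single", O_RDWR | O_CREAT, 0644);

	if (fd < 0) {
		/* when errno == ERANGE this prints exactly the message in the logs:
		 * "open(O_RDWR|O_CREAT): Numerical result out of range" */
		fprintf(stderr, "open(O_RDWR|O_CREAT): %s\n", strerror(errno));
		return 1;
	}
	close(fd);
	return 0;
}
{code}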
| Comments |
| Comment by Peter Jones [ 03/Nov/19 ] |
|
Hongchao, can you please advise? Peter |
| Comment by Gerrit Updater [ 15/Nov/19 ] |
|
Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36762 |
| Comment by Andreas Dilger [ 31/Jan/20 ] |
|
I just saw this failure recently for a patch I was looking at, but I didn't find this bug at the time, so +1 but I can't link to the test results. |
| Comment by John Hammond [ 08/Jan/21 ] |
|
Looks like we need to ensure ost0 is FULL before creating on it?

--- client side
00010000:00000001:1.0:1603787411.358695:0:25028:0:(ldlm_lock.c:249:ldlm_lock_put()) Process leaving
00010000:00000001:1.0:1603787411.358696:0:25028:0:(ldlm_request.c:1136:ldlm_cli_enqueue()) Process leaving (rc=301 : 301 : 12d)
00000002:00000001:1.0:1603787411.358697:0:25028:0:(mdc_locks.c:686:mdc_finish_enqueue()) Process entered
00000002:00100000:1.0:1603787411.358698:0:25028:0:(mdc_locks.c:741:mdc_finish_enqueue()) @@@ op=1 disposition=3, status=-34 req@ffff8b1a79d88d80 x1681692846730048/t0(0) o101->lustre-MDT0000-mdc-ffff8b1a7a48f000@10.9.7.66@tcp:12/10 lens 648/600 e 0 to 0 dl 1603787418 ref 1 fl Complete:RQU/0/0 rc 301/301 job:'lfs.0'

--- server side
00020000:00000001:1.0:1603787411.359031:0:23957:0:(lod_qos.c:108:lod_statfs_and_check()) Process entered
00000004:00000001:1.0:1603787411.359032:0:23957:0:(osp_dev.c:774:osp_statfs()) Process entered
00000004:00000001:1.0:1603787411.359033:0:23957:0:(osp_dev.c:780:osp_statfs()) Process leaving (rc=18446744073709551509 : -107 : ffffffffffffff95)
00020000:01000000:1.0:1603787411.359034:0:23957:0:(lod_qos.c:143:lod_statfs_and_check()) lustre-OST0000-osc-MDT0000: turns inactive
00020000:00000001:1.0:1603787411.359034:0:23957:0:(lod_qos.c:171:lod_statfs_and_check()) Process leaving (rc=18446744073709551495 : -121 : ffffffffffffff87)
00020000:00000001:1.0:1603787411.359035:0:23957:0:(lod_qos.c:108:lod_statfs_and_check()) Process entered
00000004:00000001:1.0:1603787411.359036:0:23957:0:(osp_dev.c:774:osp_statfs()) Process entered
00000004:00001000:1.0:1603787411.359037:0:23957:0:(osp_dev.c:797:osp_statfs()) lustre-OST0001-osc-MDT0000: 95744 blocks, 94976 free, 84480 avail, 4096 bsize, 1 reserved mb low, 3 reserved mb high, 12231 files, 11872 free files
00000004:00000001:1.0:1603787411.359039:0:23957:0:(osp_dev.c:816:osp_statfs()) Process leaving (rc=0 : 0 : 0)
00020000:01000000:1.0:1603787411.359040:0:23957:0:(lod_qos.c:143:lod_statfs_and_check()) lustre-OST0001-osc-MDT0000: turns inactive
00020000:00000001:1.0:1603787411.359040:0:23957:0:(lod_qos.c:171:lod_statfs_and_check()) Process leaving (rc=18446744073709551511 : -105 : ffffffffffffff97)
00020000:00000001:1.0:1603787411.359041:0:23957:0:(lod_qos.c:232:lod_qos_statfs_update()) Process leaving
00020000:00000001:1.0:1603787411.359042:0:23957:0:(lod_qos.c:2509:lod_qos_prep_create()) Process leaving via out (rc=18446744073709551582 : -34 : 0xffffffffffffffde)
00020000:00000001:1.0:1603787411.359043:0:23957:0:(lod_qos.c:2607:lod_qos_prep_create()) Process leaving (rc=18446744073709551582 : -34 : ffffffffffffffde)
00020000:00000001:1.0:1603787411.359044:0:23957:0:(lod_qos.c:2663:lod_prepare_create()) Process leaving (rc=18446744073709551582 : -34 : ffffffffffffffde)
00000004:00000001:1.0:1603787411.359045:0:23957:0:(lod_object.c:5474:lod_declare_striped_create()) Process leaving via out (rc=18446744073709551582 : -34 : 0xffffffffffffffde)
00000004:00000010:1.0:1603787411.359046:0:23957:0:(lod_lov.c:492:lod_free_comp_buffer()) kfreed 'entries': 120 at ffff93c7af951f80.
00000004:00000001:1.0:1603787411.359047:0:23957:0:(lod_object.c:5509:lod_declare_striped_create()) Process leaving (rc=18446744073709551582 : -34 : ffffffffffffffde)
00000004:00000001:1.0:1603787411.359048:0:23957:0:(lod_object.c:3491:lod_declare_xattr_set()) Process leaving (rc=18446744073709551582 : -34 : ffffffffffffffde)
00000004:00000001:1.0:1603787411.359049:0:23957:0:(mdd_dir.c:1923:mdd_create_data()) Process leaving via stop (rc=18446744073709551582 : -34 : 0xffffffffffffffde) |
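To make the trace easier to read: the debug log prints each return code three ways, as (rc=<value as unsigned 64-bit> : <as signed> : <as hex>). A minimal sketch (plain userspace C, not Lustre code; errno names assume standard Linux numbering) that decodes the values in the trace above:

{code:c}
/* Decode the "(rc=<u64> : <s64> : <hex>)" triples from the Lustre debug
 * trace above: the same return value printed unsigned, signed, and in hex. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	/* values copied from the lod/osp lines in the server-side trace */
	uint64_t rcs[] = {
		18446744073709551509ULL,  /* -107, ENOTCONN: osp_statfs() for OST0000  */
		18446744073709551495ULL,  /* -121, EREMOTEIO: lod_statfs_and_check()   */
		18446744073709551511ULL,  /* -105, ENOBUFS: lod_statfs_and_check()     */
		18446744073709551582ULL,  /* -34,  ERANGE: lod_qos_prep_create() up    */
	};

	for (size_t i = 0; i < sizeof(rcs) / sizeof(rcs[0]); i++) {
		int64_t rc = (int64_t)rcs[i];
		printf("rc=%llu : %lld : %llx -> %s\n",
		       (unsigned long long)rcs[i], (long long)rc,
		       (unsigned long long)rcs[i], strerror((int)-rc));
	}
	return 0;
}
{code}

So both OSTs turn inactive during the statfs checks, lod_qos_prep_create() is left with no usable targets and returns -34 (ERANGE), and that is the “Numerical result out of range” the client-side open reports.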
| Comment by John Hammond [ 08/Jan/21 ] |
|
https://testing.whamcloud.com/test_sets/a242a505-11a4-4c88-9317-21d1ce234094 |
| Comment by Andreas Dilger [ 21/Jan/21 ] |
|
+1 on master https://testing.whamcloud.com/test_sets/9072748b-8448-46af-a2df-0e006ed552f1 |
| Comment by Andreas Dilger [ 26/Jan/21 ] |
|
There is a patch: https://review.whamcloud.com/37393 |