[LU-12818] replay-single test_70b and other tests fail with “Numerical result out of range” error Created: 27/Sep/19  Updated: 31/Mar/22  Resolved: 31/Mar/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0, Lustre 2.14.0, Lustre 2.12.4
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: Hongchao Zhang
Resolution: Cannot Reproduce Votes: 0
Labels: rhel8, ubuntu18, zfs
Environment:

RHEL8 and Ubuntu 18.04


Issue Links:
Duplicate
is duplicated by LU-13995 conf-sanity_112: Numerical result out... Closed
Related
is related to LU-13221 conf-sanity test_112: FAIL: MDS start... Reopened
is related to LU-13813 conf-sanity test_112: can't put impor... Resolved
is related to LU-12589 sanity test_102a: setfattr: /mnt/lust... Resolved
is related to LU-13184 conf-sanity test_112: problem creatin... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Many replay-single tests fail on RHEL 8 and Ubuntu 18.04 with the “Numerical result out of range” error. The failure messages differ from test to test, but all of them contain the “Numerical result out of range” error during file open.
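For reference, “Numerical result out of range” is the strerror() text for ERANGE (errno 34), which matches the rc = -34 visible in the server-side debug logs quoted in the comments below. A minimal sketch of how to confirm that mapping and re-run one of the failing opens by hand (the multiop invocation and path are copied from the test_13 log below; running it obviously requires a client with /mnt/lustre mounted):

# "Numerical result out of range" is strerror(ERANGE), errno 34 on Linux.
python3 -c 'import errno, os; print(errno.ERANGE, os.strerror(errno.ERANGE))'

# Repeat the open that test_13 performs; the 'O' command is the
# open(O_RDWR|O_CREAT) call that fails in the logs below.
multiop /mnt/lustre/f13.replay-single vO_wc
echo "multiop exit status: $?"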

Looking at the suite_log for https://testing.whamcloud.com/test_sets/0f9f03d0-c158-11e9-90ad-52540065bddc, we see replay-single tests 13, 14, 15, 18, 21, 22, 23, 31, 32, 33a, 53b, 53c, 53g, 53h, and 70b all fail:

== replay-single test 13: open chmod 0 |x| write close =============================================== 20:22:03 (1566073323)
multiop /mnt/lustre/f13.replay-single vO_wc
TMPPIPE=/tmp/multiop_open_wait_pipe.29190
open(O_RDWR|O_CREAT): Numerical result out of range
 replay-single test_13: @@@@@@ FAIL: multiop_bg_pause /mnt/lustre/f13.replay-single failed 
…
== replay-single test 14: open(O_CREAT), unlink |X| close ============================================ 20:22:07 (1566073327)
multiop /mnt/lustre/f14.replay-single vO_tSc
TMPPIPE=/tmp/multiop_open_wait_pipe.29190
open(O_RDWR|O_CREAT): Numerical result out of range
 replay-single test_14: @@@@@@ FAIL: multiop_bg_pause /mnt/lustre/f14.replay-single failed 
…
== replay-single test 15: open(O_CREAT), unlink |X|  touch new, close ================================ 20:22:11 (1566073331)
multiop /mnt/lustre/f15.replay-single vO_tSc
TMPPIPE=/tmp/multiop_open_wait_pipe.29190
open(O_RDWR|O_CREAT): Numerical result out of range
 replay-single test_15: @@@@@@ FAIL: multiop_bg_pause /mnt/lustre/f15.replay-single failed 
…
trevis-8vm6: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
touch: cannot touch '/mnt/lustre/f18.replay-single-3': Numerical result out of range
 replay-single test_18: @@@@@@ FAIL: touch /mnt/lustre/f18.replay-single-3 failed 
…
trevis-8vm6: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
touch: cannot touch '/mnt/lustre/f21.replay-single-2': Numerical result out of range
 replay-single test_21: @@@@@@ FAIL: touch /mnt/lustre/f21.replay-single-2 failed 
…
== replay-single test 22: open(O_CREAT), |X| unlink, replay, close (test mds_cleanup_orphans) ======== 20:33:35 (1566074015)
multiop /mnt/lustre/f22.replay-single vO_tSc
TMPPIPE=/tmp/multiop_open_wait_pipe.29190
open(O_RDWR|O_CREAT): Numerical result out of range
…
== replay-single test 23: open(O_CREAT), |X| unlink touch new, replay, close (test mds_cleanup_orphans) ====================================================================================================== 20:33:39 (1566074019)
multiop /mnt/lustre/f23.replay-single vO_tSc
TMPPIPE=/tmp/multiop_open_wait_pipe.29190
open(O_RDWR|O_CREAT): Numerical result out of range
 replay-single test_23: @@@@@@ FAIL: multiop_bg_pause /mnt/lustre/f23.replay-single failed 
…
== replay-single test 31: open(O_CREAT) two, unlink one, |X| unlink one, close two (test mds_cleanup_orphans) ====================================================================================================== 20:44:50 (1566074690)
multiop /mnt/lustre/f31.replay-single-1 vO_tSc
TMPPIPE=/tmp/multiop_open_wait_pipe.29190
open(O_RDWR|O_CREAT): Numerical result out of range
 replay-single test_31: @@@@@@ FAIL: multiop_bg_pause /mnt/lustre/f31.replay-single-1 failed 
…
== replay-single test 32: close() notices client eviction; close() after client eviction ============= 20:44:54 (1566074694)
multiop /mnt/lustre/f32.replay-single vO_c
TMPPIPE=/tmp/multiop_open_wait_pipe.29190
open(O_RDWR|O_CREAT): Numerical result out of range
 replay-single test_32: @@@@@@ FAIL: multiop_bg_pause /mnt/lustre/f32.replay-single failed 
…
== replay-single test 33a: fid seq shouldn't be reused after abort recovery ========================== 20:44:58 (1566074698)
open(/mnt/lustre/f33a.replay-single-0) error: Numerical result out of range
total: 0 open/close in 0.00 seconds: 0.00 ops/second
 replay-single test_33a: @@@@@@ FAIL: createmany create /mnt/lustre/f33a.replay-single failed 
…
== replay-single test 53b: |X| open request while two MDC requests in flight ========================= 21:35:13 (1566077713)
multiop /mnt/lustre/d53b.replay-single-1/f vO_c
TMPPIPE=/tmp/multiop_open_wait_pipe.29190
open(O_RDWR|O_CREAT): Numerical result out of range
 replay-single test_53b: @@@@@@ FAIL: multiop_bg_pause /mnt/lustre/d53b.replay-single-1/f failed 
…
== replay-single test 53c: |X| open request and close request while two MDC requests in flight ======= 21:35:17 (1566077717)
open(O_RDWR|O_CREAT): Numerical result out of range
CMD: trevis-8vm10 lctl set_param fail_loc=0x80000107
fail_loc=0x80000107
CMD: trevis-8vm10 lctl set_param fail_loc=0x80000115
fail_loc=0x80000115
/usr/lib64/lustre/tests/replay-single.sh: line 1294: kill: (15973) - No such process
…
== replay-single test 53g: |X| drop open reply and close request while close and open are both in flight ====================================================================================================== 21:39:58 (1566077998)
open(O_RDWR|O_CREAT): Numerical result out of range
CMD: trevis-8vm11 lctl set_param fail_loc=0x119
fail_loc=0x119
CMD: trevis-8vm11 lctl set_param fail_loc=0x80000115
fail_loc=0x80000115
/usr/lib64/lustre/tests/replay-single.sh: line 1437: kill: (22835) - No such process
CMD: trevis-8vm11 lctl set_param fail_loc=0
fail_loc=0
 replay-single test_53g: @@@@@@ FAIL: close_pid doesn't exist 
…
== replay-single test 53h: open request and close reply while two MDC requests in flight ============= 21:40:07 (1566078007)
open(O_RDWR|O_CREAT): Numerical result out of range
CMD: trevis-8vm11 lctl set_param fail_loc=0x80000107
fail_loc=0x80000107
CMD: trevis-8vm11 lctl set_param fail_loc=0x8000013b
fail_loc=0x8000013b
/usr/lib64/lustre/tests/replay-single.sh: line 1475: kill: (23667) - No such process
 replay-single test_53h: @@@@@@ FAIL: close_pid doesn't exist 
…
trevis-8vm7: [44191] open ./clients/client0/~dmtmp/PARADOX/__40D6B.DB failed for handle 10888 (Numerical result out of range)
trevis-8vm7: (44193) ERROR: handle 10888 was not found
trevis-8vm7: Child failed with status 1
trevis-8vm7: dbench: no process found
pdsh@trevis-8vm4: trevis-8vm7: ssh exited with exit code 1
trevis-8vm4: [44069] open ./clients/client0/~dmtmp/PARADOX/__3F2C4.DB failed for handle 10851 (Numerical result out of range)
trevis-8vm4: (44071) ERROR: handle 10851 was not found
trevis-8vm6:    1     44591     0.68 MB/sec  execute 193 sec  latency 72567.258 ms
trevis-8vm4: Child failed with status 1
trevis-8vm4: dbench: no process found
pdsh@trevis-8vm4: trevis-8vm4: ssh exited with exit code 1
trevis-8vm6: [44660] open ./clients/client0/~dmtmp/PARADOX/__50172.DB failed for handle 11004 (Numerical result out of range)
trevis-8vm6: (44662) ERROR: handle 11004 was not found
trevis-8vm6: Child failed with status 1
trevis-8vm6: dbench: no process found
…
trevis-8vm6: dbench: no process found
pdsh@trevis-8vm4: trevis-8vm6: ssh exited with exit code 1
 replay-single test_70b: @@@@@@ FAIL: dbench stopped on some of trevis-8vm4.trevis.whamcloud.com,trevis-8vm6,trevis-8vm7! 

Different test sessions have different failing tests; for example, tests 28, 29, 30, 31, 32, 53f, 53g, and 70b fail in the Ubuntu 18.04 test session https://testing.whamcloud.com/test_sets/9742e3aa-cc8f-11e9-98c8-52540065bddc.

Although the error messages are different, this issue may be related to LU-10613.

Logs for other failed test sessions are at:
https://testing.whamcloud.com/test_sets/b80265a8-d8af-11e9-a25b-52540065bddc
https://testing.whamcloud.com/test_sets/c68821fe-b9fa-11e9-97d5-52540065bddc
https://testing.whamcloud.com/test_sets/6615df92-d105-11e9-90ad-52540065bddc
https://testing.whamcloud.com/test_sets/95922fb6-d80b-11e9-98c8-52540065bddc
https://testing.whamcloud.com/test_sets/211af22a-ddcf-11e9-a197-52540065bddc



 Comments   
Comment by Peter Jones [ 03/Nov/19 ]

Hongchao

Can you please advise

Peter

Comment by Gerrit Updater [ 15/Nov/19 ]

Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36762
Subject: LU-12818 test: debug patch
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 6ca3d6684d7c6adbc812a1c84c9afa8c8fcc7050

Comment by Andreas Dilger [ 31/Jan/20 ]

I just saw this failure recently for a patch I was looking at, but I didn't find this bug at the time. So +1, but I can't link to the test results.

Comment by John Hammond [ 08/Jan/21 ]

ys's comment from LU-13995:

Looks like we need to ensure ost0 is FULL before creating on it?

--- client side

00010000:00000001:1.0:1603787411.358695:0:25028:0:(ldlm_lock.c:249:ldlm_lock_put()) Process leaving
00010000:00000001:1.0:1603787411.358696:0:25028:0:(ldlm_request.c:1136:ldlm_cli_enqueue()) Process leaving (rc=301 : 301 : 12d)
00000002:00000001:1.0:1603787411.358697:0:25028:0:(mdc_locks.c:686:mdc_finish_enqueue()) Process entered
00000002:00100000:1.0:1603787411.358698:0:25028:0:(mdc_locks.c:741:mdc_finish_enqueue()) @@@ op=1 disposition=3, status=-34  req@ffff8b1a79d88d80 x1681692846730048/t0(0) o101->lustre-MDT0000-mdc-ffff8b1a7a48f000@10.9.7.66@tcp:12/10 lens 648/600 e 0 to 0 dl 1603787418 ref 1 fl Complete:RQU/0/0 rc 301/301 job:'lfs.0'

--- server side

00020000:00000001:1.0:1603787411.359031:0:23957:0:(lod_qos.c:108:lod_statfs_and_check()) Process entered
00000004:00000001:1.0:1603787411.359032:0:23957:0:(osp_dev.c:774:osp_statfs()) Process entered
00000004:00000001:1.0:1603787411.359033:0:23957:0:(osp_dev.c:780:osp_statfs()) Process leaving (rc=18446744073709551509 : -107 : ffffffffffffff95)
00020000:01000000:1.0:1603787411.359034:0:23957:0:(lod_qos.c:143:lod_statfs_and_check()) lustre-OST0000-osc-MDT0000: turns inactive
00020000:00000001:1.0:1603787411.359034:0:23957:0:(lod_qos.c:171:lod_statfs_and_check()) Process leaving (rc=18446744073709551495 : -121 : ffffffffffffff87)
00020000:00000001:1.0:1603787411.359035:0:23957:0:(lod_qos.c:108:lod_statfs_and_check()) Process entered
00000004:00000001:1.0:1603787411.359036:0:23957:0:(osp_dev.c:774:osp_statfs()) Process entered
00000004:00001000:1.0:1603787411.359037:0:23957:0:(osp_dev.c:797:osp_statfs()) lustre-OST0001-osc-MDT0000: 95744 blocks, 94976 free, 84480 avail, 4096 bsize, 1 reserved mb low, 3 reserved mb high, 12231 files, 11872 free files
00000004:00000001:1.0:1603787411.359039:0:23957:0:(osp_dev.c:816:osp_statfs()) Process leaving (rc=0 : 0 : 0)
00020000:01000000:1.0:1603787411.359040:0:23957:0:(lod_qos.c:143:lod_statfs_and_check()) lustre-OST0001-osc-MDT0000: turns inactive
00020000:00000001:1.0:1603787411.359040:0:23957:0:(lod_qos.c:171:lod_statfs_and_check()) Process leaving (rc=18446744073709551511 : -105 : ffffffffffffff97)
00020000:00000001:1.0:1603787411.359041:0:23957:0:(lod_qos.c:232:lod_qos_statfs_update()) Process leaving
00020000:00000001:1.0:1603787411.359042:0:23957:0:(lod_qos.c:2509:lod_qos_prep_create()) Process leaving via out (rc=18446744073709551582 : -34 : 0xffffffffffffffde)
00020000:00000001:1.0:1603787411.359043:0:23957:0:(lod_qos.c:2607:lod_qos_prep_create()) Process leaving (rc=18446744073709551582 : -34 : ffffffffffffffde)
00020000:00000001:1.0:1603787411.359044:0:23957:0:(lod_qos.c:2663:lod_prepare_create()) Process leaving (rc=18446744073709551582 : -34 : ffffffffffffffde)
00000004:00000001:1.0:1603787411.359045:0:23957:0:(lod_object.c:5474:lod_declare_striped_create()) Process leaving via out (rc=18446744073709551582 : -34 : 0xffffffffffffffde)
00000004:00000010:1.0:1603787411.359046:0:23957:0:(lod_lov.c:492:lod_free_comp_buffer()) kfreed 'entries': 120 at ffff93c7af951f80.
00000004:00000001:1.0:1603787411.359047:0:23957:0:(lod_object.c:5509:lod_declare_striped_create()) Process leaving (rc=18446744073709551582 : -34 : ffffffffffffffde)
00000004:00000001:1.0:1603787411.359048:0:23957:0:(lod_object.c:3491:lod_declare_xattr_set()) Process leaving (rc=18446744073709551582 : -34 : ffffffffffffffde)
00000004:00000001:1.0:1603787411.359049:0:23957:0:(mdd_dir.c:1923:mdd_create_data()) Process leaving via stop (rc=18446744073709551582 : -34 : 0xffffffffffffffde)
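The trace above shows lod_statfs_and_check() marking both OSTs inactive, after which lod_qos_prep_create() has no OST to allocate a stripe on and returns -34 (ERANGE); that is the “Numerical result out of range” the client then reports on open. A minimal sketch of the check suggested above: on the MDS, wait for the OST0000 import to reach FULL before creating files. The osp.*.ost_server_uuid parameter name and the 60-second timeout are assumptions here; the test framework's own wait_osc_import_state()/wait_import_state() helpers are the usual way to express this:

# Hypothetical pre-create wait on the MDS: poll the OSP import state for
# OST0000 until it reports FULL, giving up after ~60 seconds.
for i in $(seq 1 60); do
    lctl get_param -n osp.lustre-OST0000-osc-MDT0000.ost_server_uuid \
        2>/dev/null | grep -q FULL && break
    sleep 1
done
# Show the final state for the log.
lctl get_param osp.lustre-OST0000-osc-MDT0000.ost_server_uuid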

Comment by John Hammond [ 08/Jan/21 ]

https://testing.whamcloud.com/test_sets/a242a505-11a4-4c88-9317-21d1ce234094

Comment by Andreas Dilger [ 21/Jan/21 ]

+1 on master https://testing.whamcloud.com/test_sets/9072748b-8448-46af-a2df-0e006ed552f1

Comment by Andreas Dilger [ 26/Jan/21 ]

There is a patch, https://review.whamcloud.com/37393 "LU-13184 tests: wait for OST startup in test_112", which may fix this problem.
