[LU-12230] parallel-scale-nfsv3 test_connectathon times out with segfault Created: 26/Apr/19 Updated: 14/Jul/23 Resolved: 29/Nov/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.13.0, Lustre 2.12.1, Lustre 2.14.0, Lustre 2.12.4, Lustre 2.12.5, Lustre 2.12.6, Lustre 2.12.8, Lustre 2.15.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | Alex Deiter |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | ubuntu, ubuntu18, ubuntu20 | ||
| Environment: |
Ubuntu clients |
||
| Issue Links: |
|
||||||||||||||||||||
| Severity: | 3 | ||||||||||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||||||||||
| Description |
|
parallel-scale-nfsv3 test_connectathon hangs for Ubuntu 18.04 clients only and hangs 100% of the time for Ubuntu 18.04 clients. Test #6 - Try to lock the MAXEOF byte. Parent: 6.0 - F_TLOCK [7fffffffffffffff, 1] PASSED. Child: 6.1 - F_TEST [7ffffffffffffffe, 1] PASSED. Child: 6.2 - F_TEST [7ffffffffffffffe, 2] PASSED. Child: 6.3 - F_TEST [7ffffffffffffffe, ENDING] PASSED. Child: 6.4 - F_TEST [7fffffffffffffff, 1] PASSED. Child: 6.5 - F_TEST [7fffffffffffffff, 2] PASSED. Child: 6.6 - F_TEST [7fffffffffffffff, ENDING] PASSED. Child: 6.7 - F_TEST [8000000000000000, ENDING] PASSED. Child: 6.8 - F_TEST [8000000000000000, 1] PASSED. Child: 6.9 - F_TEST [8000000000000000,7fffffffffffffff] PASSED. Child: 6.10 - F_TEST [8000000000000000,8000000000000000] PASSED. Parent: 6.11 - F_ULOCK [7fffffffffffffff, 1] PASSED. Test #7 - Test parent/child mutual exclusion. Parent: 7.0 - F_TLOCK [ ffc, 9] PASSED. runtests: line 48: 1175 Segmentation fault (core dumped) $i $TESTARGS $NFSTESTDIR lock tests failed Looking at the logs for a recent failure, with logs at https://testing.whamcloud.com/test_sets/4ba5183a-6661-11e9-aeec-52540065bddc , the console log for client 2 (vm9) shows the seg fault [ 2567.134886] Lustre: DEBUG MARKER: ./runtests -N 10 -l -f /mnt/lustre/d0.parallel-scale-nfs/d0.connectathon [ 2570.861240] tlocklfs[658]: segfault at 3cfa3a80 ip 00007f68b6289800 sp 00007ffea83150c8 error 4 in libc-2.27.so[7f68b61d8000+1e7000] [ 2571.788153] Lustre: DEBUG MARKER: /usr/sbin/lctl mark parallel-scale-nfsv3 test_connectathon: @@@@@@ FAIL: connectathon failed: 1 [ 2572.001779] Lustre: DEBUG MARKER: parallel-scale-nfsv3 test_connectathon: @@@@@@ FAIL: connectathon failed: 1 The console logs for all other nodes do not contain errors. There are several examples of this failure, but here are just a couple of additional links to logs |
| Comments |
| Comment by Alex Zhuravlev [ 26/Apr/19 ] |
[ 738.563521] Lustre: 30890:0:(osd_handler.c:1871:osd_trans_start()) lustre-OST0005: credits 3207 > trans_max 2592 meaning transaction is too big for this journal. |
| Comment by James Nunez (Inactive) [ 30/Apr/19 ] |
|
It looks like Ubuntu and CentOShave the same size servers for parallel-scale-nfsv3. For Ubuntu 18.04, https://testing.whamcloud.com/test_sets/4ba5183a-6661-11e9-aeec-52540065bddc , we see UUID 1K-blocks Used Available Use% Mounted on lustre-MDT0000_UUID 1165900 7940 1054764 1% /mnt/lustre[MDT:0] lustre-OST0000_UUID 1933276 25788 1786248 2% /mnt/lustre[OST:0] lustre-OST0001_UUID 1933276 25788 1786248 2% /mnt/lustre[OST:1] lustre-OST0002_UUID 1933276 25788 1786248 2% /mnt/lustre[OST:2] lustre-OST0003_UUID 1933276 25784 1786252 2% /mnt/lustre[OST:3] lustre-OST0004_UUID 1933276 25784 1786252 2% /mnt/lustre[OST:4] lustre-OST0005_UUID 1933276 25784 1786252 2% /mnt/lustre[OST:5] lustre-OST0006_UUID 1933276 25784 1786252 2% /mnt/lustre[OST:6] filesystem_summary: 13532932 180500 12503752 2% /mnt/lustre For CentOS, https://testing.whamcloud.com/test_sets/c0c429a4-6665-11e9-bd0e-52540065bddc , we have UUID 1K-blocks Used Available Use% Mounted on lustre-MDT0000_UUID 1165900 80736 981968 8% /mnt/lustre[MDT:0] lustre-OST0000_UUID 1933276 32116 1779920 2% /mnt/lustre[OST:0] lustre-OST0001_UUID 1933276 29940 1782096 2% /mnt/lustre[OST:1] lustre-OST0002_UUID 1933276 28676 1783360 2% /mnt/lustre[OST:2] lustre-OST0003_UUID 1933276 30168 1781868 2% /mnt/lustre[OST:3] lustre-OST0004_UUID 1933276 29948 1782088 2% /mnt/lustre[OST:4] lustre-OST0005_UUID 1933276 34856 1777180 2% /mnt/lustre[OST:5] lustre-OST0006_UUID 1933276 29960 1782076 2% /mnt/lustre[OST:6] filesystem_summary: 13532932 215664 12468588 2% /mnt/lustre |
| Comment by Andreas Dilger [ 24/Aug/20 ] |
I also see this error frequently in local-node testing with striped directories. However, I don't think that this would actually cause the test to segfault in glibc. Instead, I suspect that this test is simply triggering some flock mode that we do not support and the error return code is not being handled gracefully for some reason. It would probably be useful to run this test with full debug, so that we can get a log of what the NFS server is requesting the Lustre client to do. Alternately (or also), manually running the test binary under gdb to catch the segfault would also be useful, especially in conjunction with the Lustre client debug logs to get the history of the just-completed request. |
| Comment by Gerrit Updater [ 02/Jul/21 ] |
|
James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/44134 |
| Comment by Alena Nikitenko [ 30/Nov/21 ] |
|
Having a similar issue with https://testing.whamcloud.com/test_sets/daa7a9a4-d213-42fb-bb56-edbceff2c17e 2.12.8, Ubuntu 18.04, parallel-scale-nfsv3, test_connectathon
Test #6 - Try to lock the MAXEOF byte.
Parent: 6.0 - F_TLOCK [7fffffffffffffff, 1] PASSED.
Child: 6.1 - F_TEST [7ffffffffffffffe, 1] PASSED.
Child: 6.2 - F_TEST [7ffffffffffffffe, 2] PASSED.
Child: 6.3 - F_TEST [7ffffffffffffffe, ENDING] PASSED.
Child: 6.4 - F_TEST [7fffffffffffffff, 1] PASSED.
Child: 6.5 - F_TEST [7fffffffffffffff, 2] PASSED.
Child: 6.6 - F_TEST [7fffffffffffffff, ENDING] PASSED.
Child: 6.7 - F_TEST [8000000000000000, ENDING] PASSED.
Child: 6.8 - F_TEST [8000000000000000, 1] PASSED.
Child: 6.9 - F_TEST [8000000000000000,7fffffffffffffff] PASSED.
Child: 6.10 - F_TEST [8000000000000000,8000000000000000] PASSED.
Parent: 6.11 - F_ULOCK [7fffffffffffffff, 1] PASSED.
Test #7 - Test parent/child mutual exclusion.
Parent: 7.0 - F_TLOCK [ ffc, 9] PASSED.
runtests: line 48: 15529 Segmentation fault (core dumped) $i $TESTARGS $NFSTESTDIR
lock tests failed
parallel-scale-nfsv3 test_connectathon: @@@@@@ FAIL: connectathon failed: 1
[ 1351.318605] Lustre: DEBUG MARKER: sh ./runtests -N 10 -l -f /mnt/lustre/d0.parallel-scale-nfs/d0.connectathon [ 1351.658038] tlocklfs[15529]: segfault at 36109a80 ip 00007f3a9645b5e0 sp 00007ffc088aa118 error 4 in libc-2.27.so[7f3a963aa000+1e7000] [ 1352.141971] Lustre: DEBUG MARKER: /usr/sbin/lctl mark parallel-scale-nfsv3 test_connectathon: @@@@@@ FAIL: connectathon failed: 1 [ 1352.360581] Lustre: DEBUG MARKER: parallel-scale-nfsv3 test_connectathon: @@@@@@ FAIL: connectathon failed: 1 |
| Comment by James A Simmons [ 12/Jan/22 ] |
|
Does |
| Comment by Gerrit Updater [ 12/Jan/22 ] |
|
"James Simmons <jsimmons@infradead.org>" uploaded a new patch: https://review.whamcloud.com/46074 |
| Comment by James A Simmons [ 12/Jan/22 ] |
|
I'm seeing in maloo: trevis-53vm4: Redirecting to /bin/systemctl restart nfs.service trevis-53vm4: Failed to restart nfs.service: Unit nfs.service not found. parallel-scale-nfsv3 : @@@@@@ FAIL: setup nfs failed! Trace dump: = /usr/lib64/lustre/tests/test-framework.sh:5919:error() = /usr/lib64/lustre/tests/parallel-scale-nfs.sh:66:main() Looks like a configuration issue. |
| Comment by James Nunez (Inactive) [ 20/Jan/22 ] |
|
simmonsja - I don't see the failure that you posted, but I do see a similar problem but the test recovers. See the suite_log at https://testing.whamcloud.com/test_sets/cd6bfcf2-b2ad-4de0-ae5a-8a62cea37819 CMD: trevis-76vm2 { [[ -e /etc/SuSE-release ]] &&
service nfsserver restart; } ||
service nfs restart ||
service nfs-server restart
trevis-76vm2: Redirecting to /bin/systemctl restart nfs.service
trevis-76vm2: Failed to restart nfs.service: Unit nfs.service not found.
trevis-76vm2: Redirecting to /bin/systemctl restart nfs-server.service
CMD: trevis-34vm5.trevis.whamcloud.com,trevis-34vm8 chkconfig --list rpcidmapd 2>/dev/null |
grep -q rpcidmapd && service rpcidmapd restart ||
true
Mounting NFS clients (version 3)...
Please post links to failed test sessions if you continue to see this issue. |
| Comment by Gerrit Updater [ 18/Nov/22 ] |
|
The issue can be reproduced on Ubuntu distro:
# sh ./runtests -N 2 -l -f /mnt/nfs
# Test #6 - Try to lock the MAXEOF byte.
Parent: 6.0 - F_TLOCK [7fffffffffffffff, 1] PASSED.
Child: 6.1 - F_TEST [7ffffffffffffffe, 1] PASSED.
Child: 6.2 - F_TEST [7ffffffffffffffe, 2] PASSED.
Child: 6.3 - F_TEST [7ffffffffffffffe, ENDING] PASSED.
Child: 6.4 - F_TEST [7fffffffffffffff, 1] PASSED.
Child: 6.5 - F_TEST [7fffffffffffffff, 2] PASSED.
Child: 6.6 - F_TEST [7fffffffffffffff, ENDING] PASSED.
Child: 6.7 - F_TEST [8000000000000000, ENDING] PASSED.
Child: 6.8 - F_TEST [8000000000000000, 1] PASSED.
Child: 6.9 - F_TEST [8000000000000000,7fffffffffffffff] PASSED.
Child: 6.10 - F_TEST [8000000000000000,8000000000000000] PASSED.
Parent: 6.11 - F_ULOCK [7fffffffffffffff, 1] PASSED.
Test #7 - Test parent/child mutual exclusion.
Parent: 7.0 - F_TLOCK [ ffc, 9] PASSED.
Segmentation fault (core dumped)
lock tests failed
|
| Comment by Alex Deiter [ 18/Nov/22 ] |
|
Packaging issue - will be fixed in DCO-9179 |
| Comment by Alex Deiter [ 18/Nov/22 ] |
|
Proposed patch has been successfully tested on Ubuntu 22.04: # dpkg -i connectathon_1.20-1_amd64.deb Selecting previously unselected package connectathon. (Reading database ... 98334 files and directories currently installed.) Preparing to unpack connectathon_1.20-1_amd64.deb ... Unpacking connectathon (1.20-1) ... Setting up connectathon (1.20-1) ... # mount -t nfs 10.128.0.32:/mnt/exacloud on /mnt/exacloud type nfs (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.128.0.32,mountvers=3,mountport=56089,mountproto=udp,local_lock=none,addr=10.128.0.32) # sh ./runtests -N 2 -l -f /mnt/exacloud ... Parent: 6.11 - F_ULOCK [7fffffffffffffff, 1] PASSED. Test #7 - Test parent/child mutual exclusion. Parent: 7.0 - F_TLOCK [ ffc, 9] PASSED. Parent: Wrote 'aaaa eh' to testfile [ 4092, 7 ]. Parent: Now free child to run, should block on lock. Parent: Check data in file to insure child blocked. Parent: Read 'aaaa eh' from testfile [ 4092, 7 ]. Parent: 7.1 - COMPARE [ ffc, 7] PASSED. Parent: Now unlock region so child will unblock. Parent: 7.2 - F_ULOCK [ ffc, 9] PASSED. Child: 7.3 - F_LOCK [ ffc, 9] PASSED. Child: Write child's version of the data and release lock. Parent: Now try to regain lock, parent should block. Child: Wrote 'bebebebeb' to testfile [ 4092, 9 ]. Child: 7.4 - F_ULOCK [ ffc, 9] PASSED. Parent: 7.5 - F_LOCK [ ffc, 9] PASSED. Parent: Check data in file to insure child unblocked. Parent: Read 'bebebebeb' from testfile [ 4092, 9 ]. Parent: 7.6 - COMPARE [ ffc, 9] PASSED. Parent: 7.7 - F_ULOCK [ ffc, 9] PASSED. Test #10 - Make sure a locked region is split properly. |
| Comment by James A Simmons [ 18/Nov/22 ] |
|
Thank you for fixing this |
| Comment by Alex Deiter [ 21/Nov/22 ] |
|
Proposed patch has been successfully tested on Ubuntu 18.04: $ sudo dpkg -i connectathon_1.20-1_amd64.deb Selecting previously unselected package connectathon. (Reading database ... 217066 files and directories currently installed.) Preparing to unpack connectathon_1.20-1_amd64.deb ... Unpacking connectathon (1.20-1) ... Setting up connectathon (1.20-1) ... $ sh /opt/connectathon/runtests -a -f /mnt/test Test #7 - Test parent/child mutual exclusion. Parent: 7.0 - F_TLOCK [ ffc, 9] PASSED. Parent: Wrote 'aaaa eh' to testfile [ 4092, 7 ]. Parent: Now free child to run, should block on lock. Parent: Check data in file to insure child blocked. Parent: Read 'aaaa eh' from testfile [ 4092, 7 ]. Parent: 7.1 - COMPARE [ ffc, 7] PASSED. Parent: Now unlock region so child will unblock. Child: 7.3 - F_LOCK [ ffc, 9] PASSED. Child: Write child's version of the data and release lock. Parent: 7.2 - F_ULOCK [ ffc, 9] PASSED. Parent: Now try to regain lock, parent should block. Child: Wrote 'bebebebeb' to testfile [ 4092, 9 ]. Child: 7.4 - F_ULOCK [ ffc, 9] PASSED. Parent: 7.5 - F_LOCK [ ffc, 9] PASSED. Parent: Check data in file to insure child unblocked. Parent: Read 'bebebebeb' from testfile [ 4092, 9 ]. Parent: 7.6 - COMPARE [ ffc, 9] PASSED. Parent: 7.7 - F_ULOCK [ ffc, 9] PASSED. ** PARENT pass 1 results: 49/49 pass, 1/1 warn, 0/0 fail (pass/total). ** CHILD pass 1 results: 64/64 pass, 0/0 warn, 0/0 fail (pass/total). Congratulations, you passed the locking tests! All tests completed |
| Comment by James A Simmons [ 14/Jul/23 ] |
|
Alex did git commit 1a8fe55b17ac2bc2195aaba446467ccdac67b564 resolve this bug? if so we can close this. |
| Comment by Alex Deiter [ 14/Jul/23 ] |
|
Hello simmonsja, ??Alex did git commit 1a8fe55b17ac2bc2195aaba446467ccdac67b564 resolve this bug? if so we can close this. I think yes. Thank you! |
| Comment by Andreas Dilger [ 14/Jul/23 ] |
|
James, this ticket was already marked resolved on Nov of last year. Did you mean to update a different ticket? |
| Comment by James A Simmons [ 14/Jul/23 ] |
|
I was seeing it as still open. |