[LU-12230] parallel-scale-nfsv3 test_connectathon times out with segfault Created: 26/Apr/19  Updated: 14/Jul/23  Resolved: 29/Nov/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0, Lustre 2.12.1, Lustre 2.14.0, Lustre 2.12.4, Lustre 2.12.5, Lustre 2.12.6, Lustre 2.12.8, Lustre 2.15.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: Alex Deiter
Resolution: Fixed Votes: 0
Labels: ubuntu, ubuntu18, ubuntu20
Environment:

Ubuntu clients


Issue Links:
Duplicate
Related
is related to LU-11948 parallel-scale-nfsv4 test connectatho... In Progress
is related to LU-14294 parallel-scale-nfsv4 fails to start w... Resolved
is related to LU-15781 Ubuntu 22.04 LTS release support Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

parallel-scale-nfsv3 test_connectathon hangs only for Ubuntu 18.04 clients, and it hangs 100% of the time on those clients.

Test #6 - Try to lock the MAXEOF byte.
	Parent: 6.0  - F_TLOCK [7fffffffffffffff,               1] PASSED.
	Child:  6.1  - F_TEST  [7ffffffffffffffe,               1] PASSED.
	Child:  6.2  - F_TEST  [7ffffffffffffffe,               2] PASSED.
	Child:  6.3  - F_TEST  [7ffffffffffffffe,          ENDING] PASSED.
	Child:  6.4  - F_TEST  [7fffffffffffffff,               1] PASSED.
	Child:  6.5  - F_TEST  [7fffffffffffffff,               2] PASSED.
	Child:  6.6  - F_TEST  [7fffffffffffffff,          ENDING] PASSED.
	Child:  6.7  - F_TEST  [8000000000000000,          ENDING] PASSED.
	Child:  6.8  - F_TEST  [8000000000000000,               1] PASSED.
	Child:  6.9  - F_TEST  [8000000000000000,7fffffffffffffff] PASSED.
	Child:  6.10 - F_TEST  [8000000000000000,8000000000000000] PASSED.
	Parent: 6.11 - F_ULOCK [7fffffffffffffff,               1] PASSED.

Test #7 - Test parent/child mutual exclusion.
	Parent: 7.0  - F_TLOCK [             ffc,               9] PASSED.
runtests: line 48:  1175 Segmentation fault      (core dumped) $i $TESTARGS $NFSTESTDIR
lock tests failed

Looking at a recent failure, with logs at https://testing.whamcloud.com/test_sets/4ba5183a-6661-11e9-aeec-52540065bddc , the console log for client 2 (vm9) shows the segfault:

[ 2567.134886] Lustre: DEBUG MARKER: ./runtests -N 10 -l -f /mnt/lustre/d0.parallel-scale-nfs/d0.connectathon
[ 2570.861240] tlocklfs[658]: segfault at 3cfa3a80 ip 00007f68b6289800 sp 00007ffea83150c8 error 4 in libc-2.27.so[7f68b61d8000+1e7000]
[ 2571.788153] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  parallel-scale-nfsv3 test_connectathon: @@@@@@ FAIL: connectathon failed: 1 
[ 2572.001779] Lustre: DEBUG MARKER: parallel-scale-nfsv3 test_connectathon: @@@@@@ FAIL: connectathon failed: 1

The console logs for all other nodes do not contain errors.

There are several examples of this failure; here are a few additional links to logs:
https://testing.whamcloud.com/test_sets/0a36df6c-411a-11e9-8e92-52540065bddc
https://testing.whamcloud.com/test_sets/7891c378-4e62-11e9-b98a-52540065bddc
https://testing.whamcloud.com/test_sets/2d506d0a-560b-11e9-b98a-52540065bddc



 Comments   
Comment by Alex Zhuravlev [ 26/Apr/19 ]
[  738.563521] Lustre: 30890:0:(osd_handler.c:1871:osd_trans_start()) lustre-OST0005: credits 3207 > trans_max 2592

meaning the transaction is too big for this journal.
For that filesystem, the trans_max of 2592 blocks was calculated as 1/4 of the journal's 2592*4=10368 blocks, which in turn is 10368*4/1024=40MB; no real setup has such a tiny journal. It probably makes sense to skip the test if the target OST is too small.

Comment by James Nunez (Inactive) [ 30/Apr/19 ]

It looks like Ubuntu and CentOS have the same size servers for parallel-scale-nfsv3. For Ubuntu 18.04, https://testing.whamcloud.com/test_sets/4ba5183a-6661-11e9-aeec-52540065bddc , we see

UUID                   1K-blocks        Used   Available Use% Mounted on
lustre-MDT0000_UUID      1165900        7940     1054764   1% /mnt/lustre[MDT:0]
lustre-OST0000_UUID      1933276       25788     1786248   2% /mnt/lustre[OST:0]
lustre-OST0001_UUID      1933276       25788     1786248   2% /mnt/lustre[OST:1]
lustre-OST0002_UUID      1933276       25788     1786248   2% /mnt/lustre[OST:2]
lustre-OST0003_UUID      1933276       25784     1786252   2% /mnt/lustre[OST:3]
lustre-OST0004_UUID      1933276       25784     1786252   2% /mnt/lustre[OST:4]
lustre-OST0005_UUID      1933276       25784     1786252   2% /mnt/lustre[OST:5]
lustre-OST0006_UUID      1933276       25784     1786252   2% /mnt/lustre[OST:6]

filesystem_summary:     13532932      180500    12503752   2% /mnt/lustre

For CentOS, https://testing.whamcloud.com/test_sets/c0c429a4-6665-11e9-bd0e-52540065bddc , we have

UUID                   1K-blocks        Used   Available Use% Mounted on
lustre-MDT0000_UUID      1165900       80736      981968   8% /mnt/lustre[MDT:0]
lustre-OST0000_UUID      1933276       32116     1779920   2% /mnt/lustre[OST:0]
lustre-OST0001_UUID      1933276       29940     1782096   2% /mnt/lustre[OST:1]
lustre-OST0002_UUID      1933276       28676     1783360   2% /mnt/lustre[OST:2]
lustre-OST0003_UUID      1933276       30168     1781868   2% /mnt/lustre[OST:3]
lustre-OST0004_UUID      1933276       29948     1782088   2% /mnt/lustre[OST:4]
lustre-OST0005_UUID      1933276       34856     1777180   2% /mnt/lustre[OST:5]
lustre-OST0006_UUID      1933276       29960     1782076   2% /mnt/lustre[OST:6]

filesystem_summary:     13532932      215664    12468588   2% /mnt/lustre
Comment by Andreas Dilger [ 24/Aug/20 ]
Lustre: 30890:0:(osd_handler.c:1871:osd_trans_start()) lustre-OST0005: credits 3207 > trans_max 2592

meaning transaction is too big for this journal.

I also see this error frequently in local-node testing with striped directories. However, I don't think that this would actually cause the test to segfault in glibc. Instead, I suspect that this test is simply triggering some flock mode that we do not support and the error return code is not being handled gracefully for some reason.

It would probably be useful to run this test with full debug, so that we can get a log of what the NFS server is requesting the Lustre client to do. Alternately (or also), manually running the test binary under gdb to catch the segfault would also be useful, especially in conjunction with the Lustre client debug logs to get the history of the just-completed request.

Comment by Gerrit Updater [ 02/Jul/21 ]

James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/44134
Subject: LU-12230 tests: test connectathon full debug
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b42d3ff67c7a5d64ce4b02e34147309f9a8599dc

Comment by Alena Nikitenko [ 30/Nov/21 ]

Having a similar issue with https://testing.whamcloud.com/test_sets/daa7a9a4-d213-42fb-bb56-edbceff2c17e

2.12.8, Ubuntu 18.04, parallel-scale-nfsv3, test_connectathon

Test #6 - Try to lock the MAXEOF byte.
	Parent: 6.0  - F_TLOCK [7fffffffffffffff,               1] PASSED.
	Child:  6.1  - F_TEST  [7ffffffffffffffe,               1] PASSED.
	Child:  6.2  - F_TEST  [7ffffffffffffffe,               2] PASSED.
	Child:  6.3  - F_TEST  [7ffffffffffffffe,          ENDING] PASSED.
	Child:  6.4  - F_TEST  [7fffffffffffffff,               1] PASSED.
	Child:  6.5  - F_TEST  [7fffffffffffffff,               2] PASSED.
	Child:  6.6  - F_TEST  [7fffffffffffffff,          ENDING] PASSED.
	Child:  6.7  - F_TEST  [8000000000000000,          ENDING] PASSED.
	Child:  6.8  - F_TEST  [8000000000000000,               1] PASSED.
	Child:  6.9  - F_TEST  [8000000000000000,7fffffffffffffff] PASSED.
	Child:  6.10 - F_TEST  [8000000000000000,8000000000000000] PASSED.
	Parent: 6.11 - F_ULOCK [7fffffffffffffff,               1] PASSED.

Test #7 - Test parent/child mutual exclusion.
	Parent: 7.0  - F_TLOCK [             ffc,               9] PASSED.
runtests: line 48: 15529 Segmentation fault      (core dumped) $i $TESTARGS $NFSTESTDIR
lock tests failed
 parallel-scale-nfsv3 test_connectathon: @@@@@@ FAIL: connectathon failed: 1  
[ 1351.318605] Lustre: DEBUG MARKER: sh ./runtests -N 10 -l -f /mnt/lustre/d0.parallel-scale-nfs/d0.connectathon
[ 1351.658038] tlocklfs[15529]: segfault at 36109a80 ip 00007f3a9645b5e0 sp 00007ffc088aa118 error 4 in libc-2.27.so[7f3a963aa000+1e7000]
[ 1352.141971] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  parallel-scale-nfsv3 test_connectathon: @@@@@@ FAIL: connectathon failed: 1 
[ 1352.360581] Lustre: DEBUG MARKER: parallel-scale-nfsv3 test_connectathon: @@@@@@ FAIL: connectathon failed: 1 
Comment by James A Simmons [ 12/Jan/22 ]

Does LU-15143 fix this?

Comment by Gerrit Updater [ 12/Jan/22 ]

"James Simmons <jsimmons@infradead.org>" uploaded a new patch: https://review.whamcloud.com/46074
Subject: LU-12230 tests: validate connectathon
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2aa99b4c4042ca088fee98bbedfb44acd768c64c

Comment by James A Simmons [ 12/Jan/22 ]

I'm seeing in maloo:

trevis-53vm4: Redirecting to /bin/systemctl restart nfs.service

trevis-53vm4: Failed to restart nfs.service: Unit nfs.service not found.
parallel-scale-nfsv3 : @@@@@@ FAIL: setup nfs failed!
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:5919:error()
= /usr/lib64/lustre/tests/parallel-scale-nfs.sh:66:main()

Looks like a configuration issue.

Comment by James Nunez (Inactive) [ 20/Jan/22 ]

simmonsja - I don't see the failure that you posted, but I do see a similar problem but the test recovers. See the suite_log at https://testing.whamcloud.com/test_sets/cd6bfcf2-b2ad-4de0-ae5a-8a62cea37819

CMD: trevis-76vm2 { [[ -e /etc/SuSE-release ]] &&
				 service nfsserver restart; } ||
				 service nfs restart ||
				 service nfs-server restart
trevis-76vm2: Redirecting to /bin/systemctl restart nfs.service
trevis-76vm2: Failed to restart nfs.service: Unit nfs.service not found.
trevis-76vm2: Redirecting to /bin/systemctl restart nfs-server.service
CMD: trevis-34vm5.trevis.whamcloud.com,trevis-34vm8 chkconfig --list rpcidmapd 2>/dev/null |
			       grep -q rpcidmapd && service rpcidmapd restart ||
			       true

Mounting NFS clients (version 3)...

Please post links to failed test sessions if you continue to see this issue.

Comment by Gerrit Updater [ 18/Nov/22 ]

The issue can be reproduced on the Ubuntu distro:

# sh ./runtests -N 2 -l -f /mnt/nfs
# Test #6 - Try to lock the MAXEOF byte.
	Parent: 6.0  - F_TLOCK [7fffffffffffffff,               1] PASSED.
	Child:  6.1  - F_TEST  [7ffffffffffffffe,               1] PASSED.
	Child:  6.2  - F_TEST  [7ffffffffffffffe,               2] PASSED.
	Child:  6.3  - F_TEST  [7ffffffffffffffe,          ENDING] PASSED.
	Child:  6.4  - F_TEST  [7fffffffffffffff,               1] PASSED.
	Child:  6.5  - F_TEST  [7fffffffffffffff,               2] PASSED.
	Child:  6.6  - F_TEST  [7fffffffffffffff,          ENDING] PASSED.
	Child:  6.7  - F_TEST  [8000000000000000,          ENDING] PASSED.
	Child:  6.8  - F_TEST  [8000000000000000,               1] PASSED.
	Child:  6.9  - F_TEST  [8000000000000000,7fffffffffffffff] PASSED.
	Child:  6.10 - F_TEST  [8000000000000000,8000000000000000] PASSED.
	Parent: 6.11 - F_ULOCK [7fffffffffffffff,               1] PASSED.

Test #7 - Test parent/child mutual exclusion.
	Parent: 7.0  - F_TLOCK [             ffc,               9] PASSED.
Segmentation fault (core dumped)
lock tests failed
Comment by Alex Deiter [ 18/Nov/22 ]

Packaging issue - will be fixed in DCO-9179

Comment by Alex Deiter [ 18/Nov/22 ]

Proposed patch has been successfully tested on Ubuntu 22.04:

# dpkg -i connectathon_1.20-1_amd64.deb 
Selecting previously unselected package connectathon.
(Reading database ... 98334 files and directories currently installed.)
Preparing to unpack connectathon_1.20-1_amd64.deb ...
Unpacking connectathon (1.20-1) ...
Setting up connectathon (1.20-1) ...

# mount -t nfs
10.128.0.32:/mnt/exacloud on /mnt/exacloud type nfs (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.128.0.32,mountvers=3,mountport=56089,mountproto=udp,local_lock=none,addr=10.128.0.32)

# sh ./runtests -N 2 -l -f /mnt/exacloud
...
        Parent: 6.11 - F_ULOCK [7fffffffffffffff,               1] PASSED.

Test #7 - Test parent/child mutual exclusion.
        Parent: 7.0  - F_TLOCK [             ffc,               9] PASSED.
        Parent: Wrote 'aaaa eh' to testfile [ 4092, 7 ].
        Parent: Now free child to run, should block on lock.
        Parent: Check data in file to insure child blocked.
        Parent: Read 'aaaa eh' from testfile [ 4092, 7 ].
        Parent: 7.1  - COMPARE [             ffc,               7] PASSED.
        Parent: Now unlock region so child will unblock.
        Parent: 7.2  - F_ULOCK [             ffc,               9] PASSED.
        Child:  7.3  - F_LOCK  [             ffc,               9] PASSED.
        Child:  Write child's version of the data and release lock.
        Parent: Now try to regain lock, parent should block.
        Child:  Wrote 'bebebebeb' to testfile [ 4092, 9 ].
        Child:  7.4  - F_ULOCK [             ffc,               9] PASSED.
        Parent: 7.5  - F_LOCK  [             ffc,               9] PASSED.
        Parent: Check data in file to insure child unblocked.
        Parent: Read 'bebebebeb' from testfile [ 4092, 9 ].
        Parent: 7.6  - COMPARE [             ffc,               9] PASSED.
        Parent: 7.7  - F_ULOCK [             ffc,               9] PASSED.

Test #10 - Make sure a locked region is split properly.
Comment by James A Simmons [ 18/Nov/22 ]

Thank you for fixing this

Comment by Alex Deiter [ 21/Nov/22 ]

Proposed patch has been successfully tested on Ubuntu 18.04:

$ sudo dpkg -i connectathon_1.20-1_amd64.deb 
Selecting previously unselected package connectathon.
(Reading database ... 217066 files and directories currently installed.)
Preparing to unpack connectathon_1.20-1_amd64.deb ...
Unpacking connectathon (1.20-1) ...
Setting up connectathon (1.20-1) ...

$ sh /opt/connectathon/runtests -a -f /mnt/test
Test #7 - Test parent/child mutual exclusion.
	Parent: 7.0  - F_TLOCK [             ffc,               9] PASSED.
	Parent: Wrote 'aaaa eh' to testfile [ 4092, 7 ].
	Parent: Now free child to run, should block on lock.
	Parent: Check data in file to insure child blocked.
	Parent: Read 'aaaa eh' from testfile [ 4092, 7 ].
	Parent: 7.1  - COMPARE [             ffc,               7] PASSED.
	Parent: Now unlock region so child will unblock.
	Child:  7.3  - F_LOCK  [             ffc,               9] PASSED.
	Child:  Write child's version of the data and release lock.
	Parent: 7.2  - F_ULOCK [             ffc,               9] PASSED.
	Parent: Now try to regain lock, parent should block.
	Child:  Wrote 'bebebebeb' to testfile [ 4092, 9 ].
	Child:  7.4  - F_ULOCK [             ffc,               9] PASSED.
	Parent: 7.5  - F_LOCK  [             ffc,               9] PASSED.
	Parent: Check data in file to insure child unblocked.
	Parent: Read 'bebebebeb' from testfile [ 4092, 9 ].
	Parent: 7.6  - COMPARE [             ffc,               9] PASSED.
	Parent: 7.7  - F_ULOCK [             ffc,               9] PASSED.

** PARENT pass 1 results: 49/49 pass, 1/1 warn, 0/0 fail (pass/total).

**  CHILD pass 1 results: 64/64 pass, 0/0 warn, 0/0 fail (pass/total).
Congratulations, you passed the locking tests!

All tests completed
Comment by James A Simmons [ 14/Jul/23 ]

Alex, did git commit 1a8fe55b17ac2bc2195aaba446467ccdac67b564 resolve this bug? If so, we can close this.

Comment by Alex Deiter [ 14/Jul/23 ]

Hello simmonsja,

"Alex, did git commit 1a8fe55b17ac2bc2195aaba446467ccdac67b564 resolve this bug? If so, we can close this."

I think yes.

Thank you!

Comment by Andreas Dilger [ 14/Jul/23 ]

James, this ticket was already marked resolved in November of last year. Did you mean to update a different ticket?

Comment by James A Simmons [ 14/Jul/23 ]

I was seeing it as still open. 

Generated at Sat Feb 10 02:50:46 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.