[LU-16652] sanity-lnet test_253: Expect 1 dropped GET but found 2 Created: 21/Mar/23 Updated: 11/Apr/23 Resolved: 11/Apr/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | Chris Horn |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

This issue relates to the following test suite run:

test_253 failed intermittently with the following error:

Expect 1 dropped GET but found 2

Test session details:

CMD: trevis-58vm2 /usr/sbin/lctl list_nids
/usr/sbin/lnetctl discover 10.240.40.249@tcp
discover:
- primary nid: 10.240.40.249@tcp
Multi-Rail: True
peer ni:
- nid: 10.240.40.249@tcp
net:
- net type: lo
local NI(s):
- nid: 0@lo
status: up
- net type: tcp
local NI(s):
- nid: 10.240.40.248@tcp
status: up
interfaces:
0: eth0
- primary nid: 10.240.40.249@tcp
- nid: 10.240.40.249@tcp
health stats:
health value: 1000
debug=+net
/usr/sbin/lnetctl set transaction_timeout 10
Added delay rule 10.240.40.248@tcp->10.240.40.249@tcp (1/1)
Found 8 peer_credits for 10.240.40.249@tcp
Issued 8 pings to 10.240.40.249@tcp
manage:
- ping:
errno: -1
descr: failed to ping 10.240.40.249@tcp: Connection timed out
manage:
- ping:
errno: -1
descr: failed to ping 10.240.40.249@tcp: Connection timed out
Removed 1 delay rules
ping:
- primary nid: 10.240.40.249@tcp
Multi-Rail: True
peer ni:
- nid: 10.240.40.249@tcp
sanity-lnet test_253: @@@@@@ FAIL: Expect 1 dropped GET but found 2
VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV |
| Comments |
| Comment by Andreas Dilger [ 21/Mar/23 ] |
|
Patches landed to master on 2023-02-08 when this subtest first started failing:

$ git log --oneline --after 2023-02-06 --before 2023-02-09
0568f4ca25 LU-16500 utils: set default ost index for lfs migrate
9ce04000fb LU-930 ptlrpc: clarify AT error message
e6b6b7ee25 LU-16367 utils: clean up ldiskfs feature handling
a05d02ea0e LU-16221 kernel: new kernel [RHEL 9.1 5.14.0-162.12.1.el9_1]
919b93b951 LU-16510 build: fortified memcpy from linux 6.1
738e69d4b9 LU-16292 llite: delete_from_page_cache not exported
b13a5b351e LU-16188 mdt: fix incompatible HSM request handling
a3a51806ef LU-16118 build: Workaround __write_overflow_field errors
391392e117 LU-16354 ldiskfs: RHEL9.1 server support
54eb6da1f8 LU-16477 ldiskfs: Add ext4-enc-flag patch for RHEL9
c10c6eeb37 LU-15728 llite: fix relatime support
0c05dc21ab LU-6142 ldlm: minor list_entry improvements in ldlm_request.c
685fb4b17f LU-6142 ldlm: use list_for_each_entry in ldlm_lock.c
93230059ab LU-12275 tests: skip new nodemap params on old MGS |
| Comment by Andreas Dilger [ 21/Mar/23 ] |
|
It looks like this is only failing on review-ldiskfs-ubuntu with Ubuntu 22.04 clients, in about 1 in 10 sanity-lnet runs on that distro. |
| Comment by Andreas Dilger [ 29/Mar/23 ] |
|
Hmm, checking the date, the Note the |
| Comment by Chris Horn [ 29/Mar/23 ] |
|
The issue is that the test is racing with discovery. After the delay rule is added, we receive a discovery PUSH (an incoming PUT), and the ACK for it is delayed by the rule. The delayed ACK consumes a peer credit, so two GETs, rather than the expected one, end up on the peer NI tx queue. Both are dropped after the delay rule is removed, which gives the extra, unexpected dropped GET. We can resolve this by modifying the delay rule so that it applies only to GET messages. |
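
The credit accounting behind this race can be illustrated with a toy model. This is only a sketch and not LNet code: the function `queued_gets` and its parameters are hypothetical, and it assumes that the test issues one more ping than there are peer credits (the log above only shows 8 pings being issued), that a delayed message holds its peer credit for as long as the delay rule is active, and that a message the rule does not match gives its credit back immediately.

```python
def queued_gets(peer_credits, num_gets, delayed_types, push_ack=True):
    """Count GETs left waiting for a peer credit while a delay rule is active.

    Toy model, not LNet code. delayed_types is the set of message types the
    (hypothetical) delay rule matches. push_ack models a discovery PUSH
    arriving right after the rule is added, so its ACK goes out first.
    """
    credits = peer_credits
    waiting = []  # messages queued because no credit was available

    outgoing = (["ACK"] if push_ack else []) + ["GET"] * num_gets
    for msg in outgoing:
        if msg not in delayed_types:
            # Not matched by the rule: assumed to be sent at once, with its
            # credit returned immediately, so it has no net effect here.
            continue
        if credits > 0:
            credits -= 1          # delayed message holds a credit
        else:
            waiting.append(msg)   # no credit left, message is queued

    return sum(1 for m in waiting if m == "GET")


# Rule matches every message type and a PUSH arrives: the ACK takes a credit.
print(queued_gets(8, 9, {"PUT", "ACK", "GET", "REPLY"}))                  # 2
# Same rule, no discovery PUSH during the test window.
print(queued_gets(8, 9, {"PUT", "ACK", "GET", "REPLY"}, push_ack=False))  # 1
# Rule restricted to GET: the ACK no longer holds a credit.
print(queued_gets(8, 9, {"GET"}))                                         # 1
```

Under these assumptions, the model gives two queued GETs when the ACK is delayed and one otherwise, which matches the "Expect 1 dropped GET but found 2" failure and the effect of limiting the rule to GET messages.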
| Comment by Gerrit Updater [ 29/Mar/23 ] |
|
"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50469 |
| Comment by Gerrit Updater [ 11/Apr/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50469/ |
| Comment by Peter Jones [ 11/Apr/23 ] |
|
Landed for 2.16 |