[LU-16652] sanity-lnet test_253: Expect 1 dropped GET but found 2 Created: 21/Mar/23  Updated: 11/Apr/23  Resolved: 11/Apr/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Chris Horn
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-16551 sanity-lnet test_254: Expect 1 droppe... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

This issue relates to the following test suite run:
https://testing.whamcloud.com/test_sets/48cb196d-84d7-47a6-b3b8-da7376ea54c4

test_253 failed intermittently with the following error:

Expect 1 dropped GET but found 2

Test session details:
clients: https://build.whamcloud.com/job/lustre-reviews/92208 - 5.15.0-52-generic
servers: https://build.whamcloud.com/job/lustre-reviews/92208 - 4.18.0-348.23.1.el8_lustre.x86_64

CMD: trevis-58vm2 /usr/sbin/lctl list_nids
/usr/sbin/lnetctl discover 10.240.40.249@tcp
discover:
    - primary nid: 10.240.40.249@tcp
      Multi-Rail: True
      peer ni:
        - nid: 10.240.40.249@tcp
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: tcp
      local NI(s):
        - nid: 10.240.40.248@tcp
          status: up
          interfaces:
              0: eth0
    - primary nid: 10.240.40.249@tcp
        - nid: 10.240.40.249@tcp
          health stats:
              health value: 1000
debug=+net
/usr/sbin/lnetctl set transaction_timeout 10
Added delay rule 10.240.40.248@tcp->10.240.40.249@tcp (1/1)
Found 8 peer_credits for 10.240.40.249@tcp
Issued 8 pings to 10.240.40.249@tcp
manage:
    - ping:
          errno: -1
          descr: failed to ping 10.240.40.249@tcp: Connection timed out
                 
manage:
    - ping:
          errno: -1
          descr: failed to ping 10.240.40.249@tcp: Connection timed out
                 
Removed 1 delay rules
ping:
    - primary nid: 10.240.40.249@tcp
      Multi-Rail: True
      peer ni:
        - nid: 10.240.40.249@tcp
 sanity-lnet test_253: @@@@@@ FAIL: Expect 1 dropped GET but found 2 

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanity-lnet test_253 - Expect 1 dropped GET but found 2



 Comments   
Comment by Andreas Dilger [ 21/Mar/23 ]

Patches landed to master on 2023-02-08 when this subtest first started failing:

$ git log --oneline --after 2023-02-06 --before 2023-02-09
0568f4ca25 LU-16500 utils: set default ost index for lfs migrate
9ce04000fb LU-930 ptlrpc: clarify AT error message
e6b6b7ee25 LU-16367 utils: clean up ldiskfs feature handling
a05d02ea0e LU-16221 kernel: new kernel [RHEL 9.1 5.14.0-162.12.1.el9_1]
919b93b951 LU-16510 build: fortified memcpy from linux 6.1
738e69d4b9 LU-16292 llite: delete_from_page_cache not exported
b13a5b351e LU-16188 mdt: fix incompatible HSM request handling
a3a51806ef LU-16118 build: Workaround __write_overflow_field errors
391392e117 LU-16354 ldiskfs: RHEL9.1 server support
54eb6da1f8 LU-16477 ldiskfs: Add ext4-enc-flag patch for RHEL9
c10c6eeb37 LU-15728 llite: fix relatime support
0c05dc21ab LU-6142 ldlm: minor list_entry improvements in ldlm_request.c
685fb4b17f LU-6142 ldlm: use list_for_each_entry in ldlm_lock.c
93230059ab LU-12275 tests: skip new nodemap params on old MGS
Comment by Andreas Dilger [ 21/Mar/23 ]

It looks like this is only failing on review-ldiskfs-ubuntu with Ubuntu22.04 clients, about 1/10 of sanity-lnet tests on that distro.

https://testing.whamcloud.com/search?client_branch_type_id=24a6947e-04a9-11e1-bb5f-52540025f9af&client_distribution_type_id=546c7bd8-1a0d-46be-afb2-f9e842762553&server_branch_type_id=24a6947e-04a9-11e1-bb5f-52540025f9af&test_set_script_id=a2b1c4b2-b449-11e9-b88c-52540065bddc&sub_test_script_id=5a631bd3-f3e2-48aa-937f-fa3662bd2d47&start_date=2023-03-14&end_date=2023-03-21&source=sub_tests#redirect

Comment by Andreas Dilger [ 29/Mar/23 ]

Chris, presumably this is the same cause as LU-16551 and could be closed as a duplicate now that your patch landed for that ticket already?

Hmm, checking the date, the LU-16551 patch was landed on 2023-02-22, but these failures continue to be hit on master. test_253 failed 80/474 runs on Ubuntu22.04 clients since 2023-02-08, 36/242 runs since 2023-03-01, after that patch landed.

Note the LU-16551 ticket was for sanity-lnet test_254, while this ticket is for test_253, but they both reported the same error message. I only see two such test_254 failures in the logs since the start of the year, so it could be that ticket was mis-attributing the source of the failures? In any case, it looks like the patch did not fix the issue, or there is a similar issue only being hit on Ubuntu.

Comment by Chris Horn [ 29/Mar/23 ]

The issue is we're racing with discovery. After the delay rule is added we receive a discovery PUSH (incoming PUT) and the ACK is delayed. This consumes a peer credit, so we end up with two GETs on the peer NI tx queue. These are then dropped after the delay rule is removed and thus we get an extra, unexpected dropped GET.

We can resolve this by modifying the delay rule to only apply to GET messages.

Comment by Gerrit Updater [ 29/Mar/23 ]

"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50469
Subject: LU-16652 tests: test 253/254 race with discovery
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d5cda7d1090c433852b3eacdd128f143ef0c8cf5

Comment by Gerrit Updater [ 11/Apr/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50469/
Subject: LU-16652 tests: sanity-lnet test 253/254 race with discovery
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 050ce41d7359ebf4e10c5e592b5454b133ed3e64

Comment by Peter Jones [ 11/Apr/23 ]

Landed for 2.16

Generated at Sat Feb 10 03:28:51 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.