Details
- Type: Bug
- Priority: Minor
- Resolution: Duplicate
- Affects Version/s: Lustre 2.16.0
- Severity: 3
Description
This is on Lustre 2.15.62 with LU-17700's lnet ping buffer size change backported, on EL9.3 systems with RoCE on ConnectX-6 (mlx5_0), for Stanford's new cluster scheduled to start production soonish.
When running benchmarks everything seems to start up well, but some lnet errors quickly show up and transfers grind to a halt waiting for nodes to reconnect.
Looking at dk logs, this seems to be the interesting part (I've attached the whole file I'm looking at in case it helps; it has 3 occurrences, all similar to the snippet below):
00000100:00000200:17.0:1713479405.286641:0:7433:0:(events.c:99:reply_in_callback()) @@@ type 6, status 0 req@0000000011322484 x1796700924489856/t0(0) o36->elm-MDT0001-mdc-ffff940378cab000@10.4.0.12@o2ib9:12/10 lens 544/624 e 0 to 0 dl 1713479428 ref 3 fl Rpc:R/200/ffffffff rc 0/-1 job:'' uid:134429 gid:14429
00000100:00000200:17.0:1713479405.286645:0:7433:0:(events.c:121:reply_in_callback()) @@@ unlink req@0000000011322484 x1796700924489856/t0(0) o36->elm-MDT0001-mdc-ffff940378cab000@10.4.0.12@o2ib9:12/10 lens 544/624 e 0 to 0 dl 1713479428 ref 3 fl Rpc:RU/200/ffffffff rc 0/-1 job:'' uid:134429 gid:14429
00000400:00000200:17.0:1713479405.286649:0:7433:0:(lib-md.c:42:lnet_md_unlink()) Queueing unlink of md 00000000ebe9f0a6
00000400:00000200:24.0:1713479405.286662:0:1532:0:(lib-msg.c:958:lnet_msg_detach_md()) md 0x00000000ebe9f0a6 already unlinked
00000400:00000200:24.0:1713479405.286663:0:1532:0:(lib-md.c:46:lnet_md_unlink()) Unlinking md 00000000ebe9f0a6
00000100:00000200:24.0:1713479405.286669:0:1532:0:(events.c:58:request_out_callback()) @@@ type 5, status -110 req@0000000011322484 x1796700924489856/t0(0) o36->elm-MDT0001-mdc-ffff940378cab000@10.4.0.12@o2ib9:12/10 lens 544/624 e 0 to 0 dl 1713479428 ref 3 fl UnregRPC:RU/200/ffffffff rc 0/-1 job:'' uid:134429 gid:14429
00000100:00000400:17.0:1713479405.286680:0:7433:0:(client.c:2334:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1713479405/real 1713479405] req@0000000011322484 x1796700924489856/t0(0) o36->elm-MDT0001-mdc-ffff940378cab000@10.4.0.12@o2ib9:12/10 lens 544/624 e 0 to 1 dl 1713479428 ref 2 fl Rpc:ReXQU/200/ffffffff rc 0/-1 job:'' uid:134429 gid:14429
00000100:02000400:17.0:1713479405.286691:0:7433:0:(import.c:198:ptlrpc_set_import_discon()) elm-MDT0001-mdc-ffff940378cab000: Connection to elm-MDT0001 (at 10.4.0.12@o2ib9) was lost; in progress operations using this service will wait for recovery to complete
We can see two threads unlinking the same md:
- thread 7433 gets an LNET_EVENT_UNLINK in reply_in_callback() and queues md 00000000ebe9f0a6 for unlink
- thread 1532 was doing a health check with the same md, which succeeded, so the md is also slated for unlinking; but lnet_msg_detach_md() notices the queued unlink (LNET_MD_FLAG_ABORTED set) and fails the ping, leading to the disconnection observed (see the toy model after this list)
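To make that sequence concrete, here is a toy C model of the interaction as I read it. This is NOT the Lustre source: the real check lives in lnet_msg_detach_md() in lib-msg.c and uses LNET_MD_FLAG_ABORTED; the structure and function below are invented stand-ins for illustration only.

/* Toy model of the two-thread interaction above -- not the Lustre code. */
#include <errno.h>
#include <stdio.h>

/* simplified stand-in for struct lnet_libmd */
struct toy_md {
	unsigned int flags;
};
#define TOY_MD_FLAG_ABORTED 0x1	/* models LNET_MD_FLAG_ABORTED */

/* Models the LU-10931 check as I read it: if the md was already queued
 * for unlink (aborted) and the message otherwise completed fine, the
 * completion is failed so upper layers don't wait for a reply that will
 * never arrive. */
static int toy_detach_md_status(const struct toy_md *md, int msg_status)
{
	if ((md->flags & TOY_MD_FLAG_ABORTED) && msg_status == 0)
		return -ETIMEDOUT;
	return msg_status;
}

int main(void)
{
	struct toy_md md = { .flags = 0 };

	/* "Thread 7433": reply_in_callback() sees the unlink event and
	 * queues the unlink -> the md is marked aborted. */
	md.flags |= TOY_MD_FLAG_ABORTED;

	/* "Thread 1532": the health check itself succeeded (status 0),
	 * but detach now reports a failure because of the queued unlink,
	 * which is what fails the RPC and drops the import. */
	printf("detach status = %d\n", toy_detach_md_status(&md, 0));
	return 0;
}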
I didn't take the time to try to reproduce this on my test bed yet, but looking at the git log, the check in lnet_msg_detach_md() was added in LU-10931: if the request is still slated for reception at that point it will never arrive, so we need to raise an error to make sure we don't get stuck. In our case, though, the health check is already successful and over, so if I understand this right we shouldn't need it here? An ugly workaround would be to skip the check when we're coming from a health check, e.g. an extra argument we'd set to true when going through the if (lnet_is_health_check()) path (rough sketch below)... But I haven't looked yet at why we get the unlink event earlier in the first place, so this might be hiding a race or something; I'll check a bit more tomorrow.
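For what it's worth, the shape of that workaround on the same toy model would be roughly the following. Again, just an illustration of the idea, not a proposed patch; the real change would be to lnet_msg_detach_md() and its callers, and the extra argument is an assumption.

/* Toy illustration of the "ugly workaround" idea -- not a real patch.
 * Assumption: the detach path would learn, via an extra argument set in
 * the lnet_is_health_check() branch, that the completion comes from an
 * already-successful health check, and would then skip the aborted-md
 * failure. */
#include <errno.h>
#include <stdio.h>

struct toy_md {
	unsigned int flags;
};
#define TOY_MD_FLAG_ABORTED 0x1

static int toy_detach_md_status(const struct toy_md *md, int msg_status,
				int from_health_check)
{
	/* keep the LU-10931 behaviour for normal completions ... */
	if ((md->flags & TOY_MD_FLAG_ABORTED) && msg_status == 0 &&
	    !from_health_check)
		return -ETIMEDOUT;
	/* ... but let a completed health check keep its success status */
	return msg_status;
}

int main(void)
{
	struct toy_md md = { .flags = TOY_MD_FLAG_ABORTED };

	printf("normal completion:       %d\n",
	       toy_detach_md_status(&md, 0, 0));	/* -110 */
	printf("health-check completion: %d\n",
	       toy_detach_md_status(&md, 0, 1));	/* 0 */
	return 0;
}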
I'd appreciate confirmation that I'm reading this right, and if someone has a better idea, it's all welcome.
Thanks!
Attachments
Issue Links
- duplicates LU-17704: network error in sanity-benchmark (Resolved)