Details
Bug
Resolution: Fixed
Major
None
Lustre 2.1.5
Lustre servers running 2.1.5, Lustre clients with 1.8.9.
3
8618
Description
During an IOR-like benchmark doing direct I/O from multiple clients (16 or 64), clients get disconnected and evicted. The MPI job dies in misery and some of its processes aren't even killable.
We've seen that there was a similar bug a while ago that was marked as solved; it was occurring on LNET routers (https://bugzilla.lustre.org/show_bug.cgi?id=13607). This one is on clients.
What can lead to the "RDMA too fragmented" issue? Any hint or suggestion? Client log messages are in the attached file.
Regards,
Erich