Details
Bug - Resolution: Fixed - Blocker - Lustre 2.5.0 - 3 - HSM - 12689
Description
Issuing too many HSM requests at once (120 or more, it seems) leads to LNet errors. The corresponding requests, as well as all subsequent ones, are not delivered to the copytool.
LNetError: 7307:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-0@lo, match 1460297122524484 length 6776 too big: 7600 left, 6144 allowed
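To make the numbers in that message easier to compare, here is a minimal Python sketch that extracts them. The regular expression is an assumption based only on the single log line quoted in this ticket, and parse_too_big is a hypothetical helper name, not part of any Lustre tool.

```python
import re

# Field layout assumed from the one LNetError line quoted in this ticket.
PATTERN = re.compile(
    r"match (?P<match>\d+) length (?P<length>\d+) too big: "
    r"(?P<left>\d+) left, (?P<allowed>\d+) allowed"
)


def parse_too_big(line):
    """Return the numeric fields of a 'too big' message, or None if absent."""
    m = PATTERN.search(line)
    if m is None:
        return None
    return {name: int(value) for name, value in m.groupdict().items()}


line = (
    "LNetError: 7307:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet "
    "from 12345-0@lo, match 1460297122524484 length 6776 too big: "
    "7600 left, 6144 allowed"
)
info = parse_too_big(line)
# Number of bytes by which the packet exceeds the allowed size.
print(info["length"] - info["allowed"])  # prints 632
```

For the line above, the reported length (6776) exceeds the allowed size (6144) by 632 bytes.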
This can be easily reproduced by running lfs hsm_archive * against a hundred-ish files.
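The reproduction can be sketched in Python as follows. This is only an illustration: make_files and archive_command are hypothetical helpers, the file names are made up, and the demo writes into a temporary directory rather than a Lustre mount. The command is built but not executed, since running it requires a Lustre client with HSM configured.

```python
import os
import tempfile


def make_files(directory, count):
    """Create `count` small files in `directory` and return their paths."""
    paths = []
    for i in range(count):
        path = os.path.join(directory, f"file{i:04d}")
        with open(path, "w") as fh:
            fh.write("x")
        paths.append(path)
    return paths


def archive_command(paths):
    """Build the argument list that `lfs hsm_archive *` would expand to."""
    return ["lfs", "hsm_archive"] + list(paths)


# Demo in a temporary directory (assumption: on a real system this would be
# a directory on the Lustre file system).
with tempfile.TemporaryDirectory() as demo_dir:
    demo_cmd = archive_command(make_files(demo_dir, 120))
    # On a Lustre client one could run: subprocess.run(demo_cmd, check=True)
    print(len(demo_cmd) - 2, "files in a single hsm_archive request")
```

Varying the count (for example 111 versus 120) gives a quick way to look for the point at which requests stop reaching the copytool.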
I tried the same thing on the latest build of b2_5, build #33, and got different error messages. The result appears to be the same: I can't archive any files from this client and the bulk archive command never completes, although I can still write from this client to the file system.
There are no errors on the client, the agent, or the MDS. So, with 111 files, archiving all the files worked.
Then I tried 120 files:
So, it looks like the archive of these 120 files did not work.
The difference between the b2_5 results and the master results posted previously in this ticket is in the errors in the logs. After the failed bulk archive attempt, there are no errors on the client or on the agent. The errors on the MDS differ from those on master in that they complain about not finding an agent for archive 3:
I have a single archive with index 1. On the agent: