Details
Bug
Resolution: Fixed
Blocker
Lustre 2.5.0
3
HSM
12689
Description
Issuing too many HSM requests at once (120 or more, it seems) leads to LNet errors. The affected requests, as well as all subsequent ones, are never delivered to the copytool.
LNetError: 7307:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-0@lo, match 1460297122524484 length 6776 too big: 7600 left, 6144 allowed
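To compare the numbers in that message across occurrences, the line can be parsed mechanically. The following is only a minimal sketch: the regular expression is written against the one log line quoted in this report, the field names (length, left, allowed) are taken from the wording of the message itself, and parse_too_big is a hypothetical helper, not part of any Lustre tool.

```python
import re

# Pattern for the "too big" message as it appears in this report. The group
# names simply mirror the words used in the log line.
TOO_BIG = re.compile(
    r"match (?P<match>\d+) length (?P<length>\d+) too big: "
    r"(?P<left>\d+) left, (?P<allowed>\d+) allowed"
)


def parse_too_big(line):
    """Return the numeric fields of a 'too big' LNetError line, or None."""
    m = TOO_BIG.search(line)
    if m is None:
        return None
    return {name: int(value) for name, value in m.groupdict().items()}


line = (
    "LNetError: 7307:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet "
    "from 12345-0@lo, match 1460297122524484 length 6776 too big: "
    "7600 left, 6144 allowed"
)
fields = parse_too_big(line)

# How far the reported length exceeds the reported allowed size.
excess = fields["length"] - fields["allowed"]
```

For the line quoted above, the reported length (6776) exceeds the reported allowed size (6144) by 632 bytes.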
This can be easily reproduced by running lfs hsm_archive * in a directory containing a hundred or so files.
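The reproduction above can be scripted. This is a rough sketch only: the file count of 150, the hsm_test_ name prefix, and both helper functions are assumptions made for illustration. The code builds the file set and the command line but does not invoke lfs itself.

```python
import os
import tempfile


def make_test_files(directory, count=150):
    """Create `count` small files in `directory` and return their paths."""
    paths = []
    for i in range(count):
        # The hsm_test_ prefix is an arbitrary, hypothetical naming choice.
        path = os.path.join(directory, f"hsm_test_{i:04d}")
        with open(path, "w") as f:
            f.write("x\n")
        paths.append(path)
    return paths


def archive_command(paths):
    """Build the argv for one bulk `lfs hsm_archive` call on all paths."""
    return ["lfs", "hsm_archive", *paths]


# Demonstrated in a scratch directory. To reproduce the bug, the directory
# would have to be on an HSM-enabled Lustre mount instead.
with tempfile.TemporaryDirectory() as scratch:
    files = make_test_files(scratch, count=150)
    cmd = archive_command(files)
```

On a client with a Lustre mount and a running copytool, the resulting cmd could be passed to subprocess.run(cmd, check=True) to issue the bulk request in one go.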
I checked whether any of the files from the bulk request had actually been archived, and whether I could archive a new file.
For all files that I tried to archive in the bulk archive attempt above, I see:
I can't archive any new files and there is no error message in the logs or console for this new request to archive. Maybe the request is queued up, but can't be executed because of the previous bulk request:
I rebooted the client node and tried the above again, but nothing changed. It looks like, as Oleg said, the client is stuck.
I still see the following error message printed periodically in the agent logs: