[LU-4639] HSM requests not delivered Created: 17/Feb/14  Updated: 21/Mar/14  Resolved: 04/Mar/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0
Fix Version/s: Lustre 2.6.0, Lustre 2.5.1

Type: Bug Priority: Blocker
Reporter: Henri Doreau (Inactive) Assignee: James Nunez (Inactive)
Resolution: Fixed Votes: 0
Labels: MB

Severity: 3
Project: HSM
Rank (Obsolete): 12689

 Description   

Issuing too many HSM requests (120 or more, it seems) leads to LNet errors. The corresponding requests, as well as subsequent ones, are not delivered to the copytool.

LNetError: 7307:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-0@lo, match 1460297122524484 length 6776 too big: 7600 left, 6144 allowed

This can easily be reproduced by running lfs hsm_archive * against a hundred or so files.



 Comments   
Comment by Peter Jones [ 17/Feb/14 ]

James

Please could you set up this reproducer?

Thanks

Peter

Comment by Amir Shehata (Inactive) [ 17/Feb/14 ]

As far as I understand, this error message comes from lnet_try_match_md(). This happens when a message is received and LNet tries to match it to a memory descriptor (MD) that was set up to receive such messages. The MD has a specific size, and there are flags that define what should be done if the received message is larger than the MD: it can be truncated or dropped. This particular MD seems to be set up to drop oversized messages.

I believe either HSM needs to be aware of this limitation and make sure to send smaller messages, or the size of the MD needs to be increased.
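
For illustration, a toy user-space model of that drop-vs-truncate decision (paraphrasing the behavior of lnet_try_match_md(); the helper name md_would_deliver() and the simplification are mine, not the actual LNet code):

#include <stdbool.h>
#include <stdio.h>

/* Toy model of the decision described above, paraphrased from the behavior
 * of lnet_try_match_md() (lnet/lnet/lib-ptl.c); simplified, not the real code.
 *   rlength     : size of the incoming message
 *   mlength     : space available in the matched memory descriptor (MD)
 *   md_truncate : whether the MD was attached with LNET_MD_TRUNCATE
 * Returns true if the message would be delivered, false if it is dropped. */
static bool md_would_deliver(unsigned int rlength, unsigned int mlength,
                             bool md_truncate)
{
	if (rlength <= mlength)
		return true;    /* fits, delivered in full */
	if (md_truncate)
		return true;    /* delivered, but truncated to mlength */
	return false;           /* "length ... too big" error, message dropped */
}

int main(void)
{
	/* Numbers from the error in the description: a 6776-byte message,
	 * 6144 bytes allowed, no truncate flag -> dropped. */
	printf("delivered: %s\n",
	       md_would_deliver(6776, 6144, false) ? "yes" : "no");
	return 0;
}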

Comment by Aurelien Degremont (Inactive) [ 17/Feb/14 ]

This limitation was known. But the resulting stale state was not.

It is a little bit strange that the upper layer has to know the lower layers' constraints.

Comment by Amir Shehata (Inactive) [ 19/Feb/14 ]

I wasn't involved in the HSM design. But if the limitation was known, what is the expected behavior when this scenario is hit? From the HSM feature perspective, is the message dropped silently at the ptl layer, and that's what's causing the problem?

Comment by Aurelien Degremont (Inactive) [ 19/Feb/14 ]

Actually, it was not known during the HSM design; this was found out later. The LNET message size and the KUC message size were both limitations for big requests. I think KUC was patched to detect this, but nothing was done for LNET.
Anyway, I do not really see why this packet drop would hang everything. We need to work this out.

Comment by Oleg Drokin [ 19/Feb/14 ]

The packet drops hang everything because the client will keep retrying indefinitely, as the server gives no indication that it just dropped the request.
So the client is stuck trying to send this RPC with who knows what locks held, blocking other RPCs from being sent.

Comment by Jinshan Xiong (Inactive) [ 19/Feb/14 ]

It looks like we've already limited the maximum size of HSM requests to MDS_MAXREQSIZE, which is 5K bytes. Why does it add up to 6776 bytes in the end?

Comment by James Nunez (Inactive) [ 20/Feb/14 ]

Yes, I'm able to recreate this problem trying to archive ~111 files at once. There are several of the following errors on the agent:

LNetError: 12269:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-192.168.2.109@o2ib, match 1460420866039584 length 8288 too big: 7896 left, 6144 allowed

On the MDS, I see the following in dmesg:

Lustre: scratch-MDD0000: changelog on
LustreError: 21923:0:(mdt_coordinator.c:950:mdt_hsm_cdt_start()) scratch-MDT0000: Coordinator already started
LustreError: 21923:0:(obd_config.c:1350:class_process_proc_param()) mdt.: error writing proc entry 'hsm_control': rc = -114
Lustre: 1963:0:(client.c:1907:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1392918571/real 1392918571]  req@ffff8806bd2bd400 x1460420865979988/t0(0) o8->scratch-OST0000-osc-MDT0000@192.168.2.111@o2ib:28/4 lens 400/544 e 0 to 1 dl 1392918596 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 1963:0:(client.c:1907:ptlrpc_expire_one_request()) Skipped 41 previous similar messages
LNet: 1959:0:(o2iblnd_cb.c:381:kiblnd_handle_rx()) PUT_NACK from 192.168.2.110@o2ib
Lustre: 21913:0:(client.c:1907:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1392931034/real 1392931034]  req@ffff8806bd181800 x1460420866039584/t0(0) o107->scratch-MDT0000@192.168.2.110@o2ib:15/16 lens 8288/224 e 0 to 1 dl 1392931041 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
Lustre: 21913:0:(client.c:1907:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
LNet: 1958:0:(o2iblnd_cb.c:381:kiblnd_handle_rx()) PUT_NACK from 192.168.2.110@o2ib

Comment by James Nunez (Inactive) [ 20/Feb/14 ]

I tried to see if any of the files I tried to archive were actually archived or if I could archive a new file.

For all files that I tried to archive in the bulk archive attempt above, I see:

# lfs hsm_state u1*
u1: (0x00000001) exists, archive_id:1
u10: (0x00000001) exists, archive_id:1
u100: (0x00000001) exists, archive_id:1
...

I can't archive any new files and there is no error message in the logs or console for this new request to archive. Maybe the request is queued up, but can't be executed because of the previous bulk request:

# lfs hsm_state u2
u2: (0x00000000)
# lfs hsm_archive --archive 1 u2
# lfs hsm_state u2
u2: (0x00000000)

I rebooted the client node and tried the above again, but nothing changed. It looks like, as Oleg said, the client is stuck.

I still see the following error message printed periodically in the agent logs:

LNetError: 12270:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-192.168.2.109@o2ib, match 1460420866039584 length 8288 too big: 7896 left, 6144 allowed
LNetError: 12270:0:(lib-ptl.c:190:lnet_try_match_md()) Skipped 85 previous similar messages

Comment by John Hammond [ 25/Feb/14 ]

Jinshan, we limit this on the client side in ll_dir_ioctl(). There we require that hur_len(hur) < MDS_MAXREQSIZE (5K). Since hsm_user_item is 32 bytes, this means hur_count < 160. Then the MDT tries to send the same number of 72-byte hsm_action_items to the copytool, so with 120 items we have a KUC buffer that is more than 72 * 120 = 8640 bytes. It looks like a cheap fix would be to use MDS_MAXREQSIZE / 4 in ll_dir_ioctl().
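
For anyone following along, a back-of-the-envelope check of those numbers (the 32-byte and 72-byte item sizes and the 5K MDS_MAXREQSIZE are taken from the comment above, not re-derived from the headers):

#include <stdio.h>

/* Sizes quoted in the comment above (assumed, not re-derived from the
 * Lustre headers). */
#define MDS_MAXREQSIZE       (5 * 1024) /* 5K client-side limit            */
#define HSM_USER_ITEM_SIZE   32         /* per item, client -> MDT         */
#define HSM_ACTION_ITEM_SIZE 72         /* per item, MDT -> copytool (KUC) */

int main(void)
{
	/* Current check in ll_dir_ioctl(): hur_len(hur) < MDS_MAXREQSIZE */
	printf("items accepted today:          ~%d\n",
	       MDS_MAXREQSIZE / HSM_USER_ITEM_SIZE);            /* ~160 */
	printf("KUC buffer for 120 items:       %d bytes (> 6144 allowed)\n",
	       120 * HSM_ACTION_ITEM_SIZE);                     /* 8640 */
	/* Proposed cheap fix: limit to MDS_MAXREQSIZE / 4 instead. */
	printf("items accepted with limit / 4: ~%d\n",
	       (MDS_MAXREQSIZE / 4) / HSM_USER_ITEM_SIZE);      /* ~40 */
	return 0;
}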

Comment by Jinshan Xiong (Inactive) [ 25/Feb/14 ]

Good point John.

Comment by James Nunez (Inactive) [ 25/Feb/14 ]

I tried the same thing on the latest build of b2_5, build #33, and got different error messages. The results seem the same; I can't archive any files from this client and the bulk archive command never completes, but I can write from this client to the file system.

# ls u1* | wc
    111     111     543
# lfs hsm_archive u1*
# lfs hsm_state u1*
u1: (0x00000009) exists archived, archive_id:1
u10: (0x00000009) exists archived, archive_id:1
u100: (0x00000009) exists archived, archive_id:1
...

There are no errors on the client, agent, or MDS. So, with 111 files, archiving all the files worked.

Then I tried 120 files:

# ls a* | wc
    125     125     517
# lfs hsm_archive a*
# lfs hsm_state a*
a1: (0x00000000)
a10: (0x00000000)
a100: (0x00000000)
...

So, it looks like the archive of these 120 files did not work.

The difference between the b2_5 results and the master results posted previously in this ticket is in the errors in the logs. After the failed bulk archive attempt, there are no errors on the client or on the agent. The errors on the MDS differ from master in that they complain about not finding an agent for archive 3:

Lustre: 13480:0:(mdd_device.c:1351:mdd_changelog_user_purge()) Could not determine changelog records to purge; rc=0
Lustre: 13480:0:(mdd_device.c:1351:mdd_changelog_user_purge()) Skipped 14 previous similar messages
LustreError: 21605:0:(mdt_hsm_cdt_agent.c:338:mdt_hsm_agent_send()) scratch-MDT0000: Cannot find agent for archive 3: rc = -11
LustreError: 21605:0:(mdt_hsm_cdt_agent.c:338:mdt_hsm_agent_send()) Skipped 2 previous similar messages
LustreError: 21605:0:(mdt_hsm_cdt_agent.c:338:mdt_hsm_agent_send()) scratch-MDT0000: Cannot find agent for archive 3: rc = -11

I have a single archive with index 1. On the agent:

# ps -ef | grep hsm | grep -v grep
root     11891     1  0 13:29 ?        00:00:00 lhsmtool_posix --daemon --hsm_root /lustre/archive --archive 1 /lustre/scratch

Comment by James Nunez (Inactive) [ 27/Feb/14 ]

I tested John's fix on master and, as expected, it reduced the number of files I was allowed to request for archive. If I try to archive, say, 200 files, I get an "Argument list too long" (E2BIG) error as expected. Now the bulk archive request with the reduced number of files succeeds and does not hang the client.

Patch for master at http://review.whamcloud.com/#/c/9393/
Patch for b2_5 at http://review.whamcloud.com/#/c/9422/

Comment by Aurelien Degremont (Inactive) [ 28/Feb/14 ]

"/ 4" seems a little bit magic to me. It is possible to add a comment to both patches to explain why we are doing this? If my maths are correct, divide it by 3 should be enough to fit in MDS_MAXREQSIZE. This should give us 33% increase of file in batched hsm requests. Ideally, we should add a test for that also .

Comment by Robert Read (Inactive) [ 21/Mar/14 ]

I'm not sure I'd call this patch a fix - it's really just a workaround. A fix would have allowed arbitrarily sized requests, or would at least have changed the lfs command to submit a large request in multiple batches instead of failing.

In particular, since there is no way for a user (or a tool like lfs) to determine what the maximum request size is, it is not possible to know that a request is too large until it is too late. This makes it difficult to build usable tools on top of this without resorting to arbitrarily small batch sizes or hard-coded magic values.
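
For illustration only, a sketch of the kind of client-side batching described here, built on the liblustreapi HSM helpers (llapi_hsm_user_request_alloc(), llapi_hsm_request(), llapi_path2fid()). The helper archive_in_batches() is hypothetical, the struct field usage is from my reading of lustre_user.h of that era, and BATCH_SIZE = 40 mirrors the "/ 4" limit discussed above; treat all of it as assumptions, not as the shipped lfs behavior.

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <lustre/lustreapi.h>

#define BATCH_SIZE 40   /* assumed per-request item limit, see "/ 4" above */

/* Archive a list of files in fixed-size batches instead of one oversized
 * request. "mnt" is any path inside the Lustre file system. */
static int archive_in_batches(const char *mnt, char *const paths[], int count,
                              int archive_id)
{
	int i, rc = 0;

	for (i = 0; i < count; i += BATCH_SIZE) {
		int n = count - i < BATCH_SIZE ? count - i : BATCH_SIZE;
		struct hsm_user_request *hur;
		int j;

		hur = llapi_hsm_user_request_alloc(n, 0);
		if (hur == NULL)
			return -ENOMEM;

		hur->hur_request.hr_action = HUA_ARCHIVE;
		hur->hur_request.hr_archive_id = archive_id;
		hur->hur_request.hr_flags = 0;
		hur->hur_request.hr_itemcount = n;
		hur->hur_request.hr_data_len = 0;

		for (j = 0; j < n; j++) {
			rc = llapi_path2fid(paths[i + j],
					    &hur->hur_user_item[j].hui_fid);
			if (rc < 0)
				break;
			hur->hur_user_item[j].hui_extent.offset = 0;
			hur->hur_user_item[j].hui_extent.length = -1ULL;
		}
		if (rc == 0)
			rc = llapi_hsm_request(mnt, hur);
		free(hur);
		if (rc < 0) {
			fprintf(stderr, "batch starting at %s failed: %s\n",
				paths[i], strerror(-rc));
			return rc;
		}
	}
	return 0;
}

int main(int argc, char **argv)
{
	if (argc < 3) {
		fprintf(stderr, "usage: %s <lustre_path> <file>...\n", argv[0]);
		return 1;
	}
	/* archive index 1 matches the single copytool used in this ticket */
	return archive_in_batches(argv[1], argv + 2, argc - 2, 1) < 0 ? 1 : 0;
}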
