[LU-4215] Some expected improvements for OUT - Whamcloud Community JIRA

Details

Type: Improvement
Resolution: Unresolved
Priority: Minor
Fix Version/s: None
Affects Version/s: Lustre 2.6.0
Labels:
- LMR

Severity: 3
Rank (Obsolete): 11467

Description

1. The OUT RPC service threads on the MDT and the OST use different reply portals, which confuses OUT RPC users.

On the MDT side, the configuration is:

                .psc_buf                = {
                        .bc_nbufs               = MDS_NBUFS,
                        .bc_buf_size            = OUT_BUFSIZE,
                        .bc_req_max_size        = OUT_MAXREQSIZE,
                        .bc_rep_max_size        = OUT_MAXREPSIZE,
                        .bc_req_portal          = OUT_PORTAL,
                        .bc_rep_portal          = MDC_REPLY_PORTAL,
                },

On the OST side, the configuration is:

                .psc_buf                = {
                        .bc_nbufs               = OST_NBUFS,
                        .bc_buf_size            = OUT_BUFSIZE,
                        .bc_req_max_size        = OUT_MAXREQSIZE,
                        .bc_rep_max_size        = OUT_MAXREPSIZE,
                        .bc_req_portal          = OUT_PORTAL,
                        .bc_rep_portal          = OSC_REPLY_PORTAL,
                },

When both the MDT and the OST run on the same physical server node (especially in VM test environments) and an OSP wants to talk to the OST via OUT_PORTAL, the OUT RPC may unexpectedly be handled by an MDT-side OUT RPC service thread and replied to via MDC_REPLY_PORTAL instead of OSC_REPLY_PORTAL, on which the OSP is waiting for the reply. The OSP-side OUT RPC then times out and is resent again and again.

The same failure can happen when an OSP wants to talk to the MDT via OUT_PORTAL.

DNE I already uses the OUT RPC for communication among MDTs, so to stay compatible with the old version we cannot change the MDT-side OUT RPC reply portal. We therefore have to change the OST-side OUT RPC reply portal to "MDC_REPLY_PORTAL", although it is strange for the OST side to use the MDT-side reply portal.
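The mismatch in item 1 can be illustrated with a small standalone model. This is only a sketch, not Lustre code: the `DEMO_*` portal values, `struct demo_out_service`, and both helper functions are hypothetical names introduced for illustration. Only the pairing of request and reply portals is taken from the two configurations above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the portal numbers; the values are arbitrary. */
enum demo_portal {
        DEMO_OUT_PORTAL = 1,
        DEMO_MDC_REPLY_PORTAL,
        DEMO_OSC_REPLY_PORTAL,
};

/* Simplified model of the two .psc_buf fields that matter here. */
struct demo_out_service {
        enum demo_portal req_portal;    /* models .bc_req_portal */
        enum demo_portal rep_portal;    /* models .bc_rep_portal */
};

/* MDT-side and OST-side OUT services as configured in the description. */
const struct demo_out_service demo_mdt_out = {
        DEMO_OUT_PORTAL, DEMO_MDC_REPLY_PORTAL
};
const struct demo_out_service demo_ost_out = {
        DEMO_OUT_PORTAL, DEMO_OSC_REPLY_PORTAL
};

/* A service can pick up any request that arrives on its request portal. */
bool demo_can_receive(const struct demo_out_service *svc,
                      enum demo_portal req)
{
        assert(svc != NULL);
        return svc->req_portal == req;
}

/* The sender sees the reply only if it arrives on the portal it waits on. */
bool demo_reply_reaches_waiter(const struct demo_out_service *handler,
                               enum demo_portal waiting_on)
{
        assert(handler != NULL);
        return handler->rep_portal == waiting_on;
}
```

In this model both services accept a request sent to `DEMO_OUT_PORTAL`, but a reply produced by `demo_mdt_out` never reaches a sender waiting on `DEMO_OSC_REPLY_PORTAL`, which corresponds to the timeout-and-resend loop described above.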

2. The OUT RPC version is fixed at "LUSTRE_MDS_VERSION", regardless of whether the RPC goes to an MDT or to an OST. This also confuses others. We could redefine "tgt_out_handlers", but that may break the Unified Target policy.

3. Packing multiple idempotent sub-requests into a single OUT RPC. In general, the OUT RPC should not assume that the sub-requests are related to each other, so even if one sub-request fails, the others should not be ignored. The current implementation does not behave this way. If the other sub-requests are unrelated to the failed one, that behavior is unexpected. Unfortunately, it is not easy to judge whether one sub-request is related to the others within the current OUT request format, especially while staying compatible with DNE I.
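The difference between the two behaviors in item 3 can be sketched as a processing loop. This is an illustrative model under stated assumptions, not the OUT handler: `struct demo_sub_request`, `demo_process_batch()`, the `stop_on_error` switch, and the error value `-5` are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sub-request: a handler plus a slot for its own result. */
struct demo_sub_request {
        int (*handler)(void);
        int rc;         /* result of this sub-request */
        int executed;   /* 1 once the handler has run */
};

/*
 * Run the sub-requests in order and return the first error seen (0 if none).
 * stop_on_error != 0: the first failure ends processing and the remaining
 *                     sub-requests are skipped.
 * stop_on_error == 0: every sub-request runs regardless of earlier failures.
 */
int demo_process_batch(struct demo_sub_request *reqs, size_t count,
                       int stop_on_error)
{
        int first_err = 0;

        assert(reqs != NULL || count == 0);
        for (size_t i = 0; i < count; i++) {
                reqs[i].rc = reqs[i].handler();
                reqs[i].executed = 1;
                if (reqs[i].rc == 0)
                        continue;
                if (first_err == 0)
                        first_err = reqs[i].rc;
                if (stop_on_error)
                        break;
        }
        return first_err;
}

int demo_ok(void)   { return 0; }
int demo_fail(void) { return -5; }      /* arbitrary error code */

/* Build a three-entry batch whose second entry fails; count how many ran. */
int demo_count_executed(int stop_on_error)
{
        struct demo_sub_request batch[3] = {
                { demo_ok, 0, 0 },
                { demo_fail, 0, 0 },
                { demo_ok, 0, 0 },
        };
        int ran = 0;

        (void)demo_process_batch(batch, 3, stop_on_error);
        for (size_t i = 0; i < 3; i++)
                ran += batch[i].executed;
        return ran;
}
```

With this batch, `demo_count_executed(1)` returns 2 because the third, unrelated sub-request is skipped, while `demo_count_executed(0)` returns 3. Recording a result per sub-request lets the caller see which entries failed in either mode.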

4. Iteration via OUT. I found a client-side iteration framework in osp_md_object.c, but there seems to be no server-side handler. Is there any plan to support that?

Issue Links

is blocked by

LU-7318 OUT: dynamic reply buffer (Resolved)
LU-7319 OUT: continue updates processing upon an error (Open)

is blocking

LU-4009 Add ZIL support to osd-zfs (Open)

is related to

LU-17818 LMR: Lustre Metadata Redundancy (Open)
LU-4690 sanity test_4: Expect error removing in-use dir /mnt/lustre/remote_dir (Resolved)
LU-12310 MDT Device-level Replication/Mirroring (Open)
LU-7426 DNE3: improve llog format for remote update llog (Open)
LU-7427 DNE3: multiple entries for BATCHID (Open)

is related to

LU-3467 Unified request handler on OST (Resolved)
LU-3539 Change update RPC format (Resolved)
LU-7426 DNE3: improve llog format for remote update llog (Open)

Activity

Alex Zhuravlev added a comment - 30/Sep/15 2:12 PM

http://review.whamcloud.com/#/c/15336/

Alex Zhuravlev added a comment - 30/Sep/15 2:10 PM

This improvement is needed to shrink the records going to ZIL. The patch mentioned in the bug shrinks the average record on the MDT from 1541 to 407 bytes.

nasf (Inactive) added a comment - 13/Jan/15 2:36 PM

The remaining issue is #3, which is a performance improvement. It is essential for neither LFSCK nor DNE. I am not sure whether Alex or Di has made any patches for it. (I have NOT yet, because of other LFSCK tickets.) LFSCK itself changed nothing about the OUT protocol. Even if someone changes the OUT protocol for #3 in the future, it will cause no special trouble for LFSCK.

Andreas Dilger added a comment - 12/Jan/15 6:45 PM

This bug has been dropped from 2.7.0 because there hasn't been any progress on it in several months. Is this going to cause major protocol incompatibility if this is fixed in 2.8.0? If yes, is anyone able to fix the problems in the current code in the next week or so?

nasf (Inactive) added a comment - 08/Oct/14 9:26 AM

Because the original master did not support executing other batchids after an earlier one failed, the OSP (for LFSCK) only aggregates sub-requests that operate on the same object into the same OUT RPC. So even without resolving the batchid issues, LFSCK still works, although it may be inefficient.

Alex Zhuravlev added a comment - 07/Oct/14 6:22 PM

The ability to proceed is important for batched destroys.

Di Wang (Inactive) added a comment - 07/Oct/14 6:03 PM - edited

I just checked the current master code, and this does not seem to be resolved yet; I am not sure about Nasf's patches. DNE always fails immediately, which is good enough even for DNE2. For LFSCK, is this only for read-only updates like getattr? Hmm, there is padding in the OSP update request:

/* Hold object_updates sending to the remote OUT in single RPC */
struct object_update_request {
        __u32                   ourq_magic;
        __u16                   ourq_count;     /* number of ourq_updates[] */
        __u16                   ourq_padding;
        struct object_update    ourq_updates[0];
};

We can add the flag there.
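One way to picture this suggestion is to reinterpret the 16-bit padding as a flags field without changing the layout. The sketch below is hypothetical throughout: the `demo_*` structs, the `ourq_flags` field name, and `DEMO_OURQ_FL_CONTINUE_ON_ERROR` are invented for illustration, `uint32_t`/`uint16_t` stand in for `__u32`/`__u16`, and `struct demo_object_update` is a placeholder because the real `struct object_update` is not shown here. It also assumes that existing senders zero the padding, which this ticket does not confirm.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder for the real struct object_update (layout not shown here). */
struct demo_object_update {
        uint32_t placeholder;
};

/* Layout as quoted above, with fixed-width types in place of __u32/__u16. */
struct demo_request_old {
        uint32_t ourq_magic;
        uint16_t ourq_count;    /* number of ourq_updates[] */
        uint16_t ourq_padding;
        struct demo_object_update ourq_updates[];
};

/* Same layout, with the padding reinterpreted as flags (hypothetical). */
struct demo_request_new {
        uint32_t ourq_magic;
        uint16_t ourq_count;    /* number of ourq_updates[] */
        uint16_t ourq_flags;
        struct demo_object_update ourq_updates[];
};

/* Hypothetical flag bit: keep executing later updates after a failure. */
#define DEMO_OURQ_FL_CONTINUE_ON_ERROR 0x0001

/* Nonzero if the sender asked for processing to continue after an error. */
int demo_should_continue(const struct demo_request_new *req)
{
        assert(req != NULL);
        return (req->ourq_flags & DEMO_OURQ_FL_CONTINUE_ON_ERROR) != 0;
}

/* Nonzero if renaming the padding left sizes and offsets unchanged. */
int demo_layout_unchanged(void)
{
        return sizeof(struct demo_request_old) ==
                       sizeof(struct demo_request_new) &&
               offsetof(struct demo_request_old, ourq_padding) ==
                       offsetof(struct demo_request_new, ourq_flags) &&
               offsetof(struct demo_request_old, ourq_updates) ==
                       offsetof(struct demo_request_new, ourq_updates);
}
```

Under the zeroed-padding assumption, a request from an older sender reads as "flag not set", so the fail-immediately behavior stays the default and only senders that set the bit opt in to continued processing.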

Andreas Dilger added a comment - 07/Oct/14 5:29 PM

Di, Nasf, what is the status on fixing this last issue? What is the proposed solution? Should the server mark all later batchids as failed, or should it try to execute them? What if they are dependent on each other? Is there a flag that could be set on the batch that indicates if it should be executed even if the previous batch failed?

nasf (Inactive) added a comment - 30/May/14 11:33 PM

The code for batched requests has worked since DNE 1. The trouble is that handling of the batched requests within a single OUT RPC stops when a sub-request fails, and the remaining sub-requests are ignored even if they are unrelated to the failed one (that is #3).

Andreas Dilger added a comment - 30/May/14 5:46 PM

It seems #3 is the only item still outstanding. Is the code to handle batched requests working?

People

Assignee: Alex Zhuravlev

Reporter: nasf (Inactive)

Votes: 0

Watchers: 6

Dates

Created: 06/Nov/13 2:59 AM

Updated: 06/May/24 8:45 PM