[LU-3399] MDT don't update client last commited correctly so produce OOM on client Created: 26/May/13 Updated: 02/Dec/14 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.0.0, Lustre 2.1.0, Lustre 2.2.0, Lustre 2.3.0, Lustre 2.4.0, Lustre 1.8.x (1.8.0 - 1.8.5) |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Alexey Lyashkov | Assignee: | Bob Glossman (Inactive) |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | patch | ||
| Environment: |
lustre mounted with noatime or relatime mount option. |
||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 8414 | ||||||||
| Description |
|
Cray customer hit a OOM with using a lustre and Robin-Hood application. (patch to move ptlrpc request in slab - wait a WC/Intel review). so after simple ls -lRa we have 2500 active requests... to flush these requests need touch $some_file_on_lustre. per additional investigation - I found MDT send a zero as last committed to the affected client, where client2 have a correct last committed updates. As i see, target_to_commited_req send a just per export committed info to the client, instead of use a global data. In that case buggy client don't able to commit a requests in base of different client activity. Can someone explain why it's used instead of global obd_last_commited? diff --git a/lustre/ldlm/ldlm_lib.c b/lustre/ldlm/ldlm_lib.c index 31ffbc9..99f2f26 100644 --- a/lustre/ldlm/ldlm_lib.c +++ b/lustre/ldlm/ldlm_lib.c @@ -2529,7 +2529,8 @@ int target_committed_to_req(struct ptlrpc_request *req) if (likely(!exp->exp_obd->obd_no_transno && req->rq_repmsg != NULL)) { lustre_msg_set_last_committed(req->rq_repmsg, - exp->exp_last_committed); + exp->exp_obd->obd_last_committed); +// exp->exp_last_committed); } else { DEBUG_REQ(D_IOCTL, req, "not sending last_committed update (%d/" "%d)", exp->exp_obd->obd_no_transno, @@ -2537,8 +2538,8 @@ int target_committed_to_req(struct ptlrpc_request *req) ret = 0; } - CDEBUG(D_INFO, "last_committed "LPU64", transno "LPU64", xid "LPU64"\n", - exp->exp_last_committed, req->rq_transno, req->rq_xid); + CDEBUG(D_INFO, "last_committed "LPU64"/"LPU64", transno "LPU64", xid "LPU64"\n", + exp->exp_last_committed, exp->exp_obd->obd_last_committed, req->rq_transno, req->rq_xid); return ret; } EXPORT_SYMBOL(target_committed_to_req); cache have reduced after next ping. |
| Comments |
| Comment by Alex Zhuravlev [ 27/May/13 ] |
|
I think this is a dup of |
| Comment by Alexander Boyko [ 27/May/13 ] |
| Comment by Alexey Lyashkov [ 27/May/13 ] |
|
well, 1.90 (nathan 10-Feb-07): void target_committed_to_req(struct ptlrpc_request *req)
1.90 (nathan 10-Feb-07): {
1.96.6.38.2.3 (johann 18-Dec-08): struct obd_export *exp = req->rq_export;
1.96.6.38.2.3 (johann 18-Dec-08): if (!exp->exp_obd->obd_no_transno && req->rq_repmsg != NULL) {
1.90 (nathan 10-Feb-07): lustre_msg_set_last_committed(req->rq_repmsg,
1.96.6.38.2.3 (johann 18-Dec-08): exp->exp_last_committed);
before it change we always use a obd_last_commited to send to client. void target_committed_to_req(struct ptlrpc_request *req)
{
struct obd_device *obd = req->rq_export->exp_obd;
if (!obd->obd_no_transno && req->rq_repmsg != NULL)
req->rq_repmsg->last_committed = obd->obd_last_committed
...
that bug may produce OOM on client, as may block a release a BRW rpc from replay queue also. |
| Comment by Alexey Lyashkov [ 27/May/13 ] |
|
finally identified as bug in VBR patch. https://projectlava.xyratex.com/show_bug.cgi?id=16353 |
| Comment by Mikhail Pershin [ 27/May/13 ] |
|
That was done to reduce contention on last_rcvd header update, per-export last_committed track also is needed for VBR and delayed recovery specifically, each export may recover without affecting whole server data but using only own client data. In our case the simplest way to fix that would be using max(exp_last_committed, obd_last_committed) in target_committed_to_req() function. This will give us the highest value. Strictly speaking this breaks 'delayed recovery' feature but it was never finished to be real product feature so we may don't care about it. Meanwhile this is also not full solution, consider the case when many clients do only opens and there is no disk update on server at all. I don't even think it is right to solve that on server. I wrote some thoughts in |
| Comment by Alexey Lyashkov [ 27/May/13 ] |
|
I don't ask about "why per-export last_commited info is need", i understand it - question about client side updates. as about recovery - we need correctly track a higher commits number in last_rcvd anyway, if i correctly understand - you talk about VBR with clients connected after recovery window closed? but it's should be replayed with using a lock/inode version as base and don't rely to transaction info in recovery. I agree about complex solution, but it fix created just to fix for 2.4 and continue to create a final solution in next release. May be we need both patches from |
| Comment by Alex Zhuravlev [ 27/May/13 ] |
|
"per export committed info always less or same for a cluster wide info" - this is not true as we do NOT want to maintain global last_committed to improve SMP scalability. the original point was that "per-MDT" last committed is not enough to fix the issue if another clients aren't generating new transno. |
| Comment by Alexey Lyashkov [ 27/May/13 ] |
|
> "per export committed info always less or same for a cluster wide info" - this is not true as we do NOT want to maintain global last_committed to improve SMP scalability. Single thread have updates - many threads have reads - where you find a scalability problem in that case? > the original point was that "per-MDT" last committed is not enough to fix the issue if another clients aren't generating new transno. so another solution (already implemented by Niu) is needed. first of all, it's add's additional disk load - so it's too bad by definition - because disk is most slow part and mostly contended. I may be agree with fake commit callback assigned without disk updates but just update a per export value. |
| Comment by Mikhail Pershin [ 27/May/13 ] |
|
obd_last_committed is not updated per each transaction, it is used just as 'reserve' storage for cases when we cannot write transno at client data, e.g. client umount. That way we have no contention on last_rcvd header, each client updates only own slot in most cases. Therefore per-client last_committed usually bigger then obd one. That is why you cannot use only obd_last_committed, it is NOT the latest value. The patch in |
| Comment by Alexey Lyashkov [ 27/May/13 ] |
|
> obd_last_committed is not updated per each transaction.. static inline void obd_transno_commit_cb(struct obd_device *obd, __u64 transno, struct obd_export *exp, int error) { if (error) { CERROR("%s: transno "LPU64" commit error: %d\n", obd->obd_name, transno, error); return; } if (exp && transno > exp->exp_last_committed) { CDEBUG(D_HA, "%s: transno "LPU64" committed\n", obd->obd_name, transno); exp->exp_last_committed = transno; ptlrpc_commit_replies(exp); } else { CDEBUG(D_INFO, "%s: transno "LPU64" committed\n", obd->obd_name, transno); } if (transno > obd->obd_last_committed) obd->obd_last_committed = transno; } or for master void tgt_cb_last_committed(struct lu_env *env, struct thandle *th,
struct dt_txn_commit_cb *cb, int err)
{
struct tgt_last_committed_callback *ccb;
ccb = container_of0(cb, struct tgt_last_committed_callback, llcc_cb);
LASSERT(ccb->llcc_tgt != NULL);
LASSERT(ccb->llcc_exp->exp_obd == ccb->llcc_tgt->lut_obd);
spin_lock(&ccb->llcc_tgt->lut_translock);
if (ccb->llcc_transno > ccb->llcc_tgt->lut_obd->obd_last_committed)
ccb->llcc_tgt->lut_obd->obd_last_committed = ccb->llcc_transno;
LASSERT(ccb->llcc_exp);
if (ccb->llcc_transno > ccb->llcc_exp->exp_last_committed) {
ccb->llcc_exp->exp_last_committed = ccb->llcc_transno;
spin_unlock(&ccb->llcc_tgt->lut_translock);
ptlrpc_commit_replies(ccb->llcc_exp);
} else {
spin_unlock(&ccb->llcc_tgt->lut_translock);
}
class_export_cb_put(ccb->llcc_exp);
if (ccb->llcc_transno)
CDEBUG(D_HA, "%s: transno "LPD64" is committed\n",
ccb->llcc_tgt->lut_obd->obd_name, ccb->llcc_transno);
OBD_FREE_PTR(ccb);
}
so it's updated in same time as exp_last_commited does. I don't see why it's don't have lasts data. > That way we have no contention on last_rcvd header, each client updates only own slot in most cases. > I don't think that extra disk load is a problem, it happens rarely. |
| Comment by Mikhail Pershin [ 28/May/13 ] |
|
yes, you are right about memory value, I had impression we are talking about disk. 'Happens rarely' was about extra disk load happens with Niu patch, it happens once per 1000 opens and if there are no more other updates, this is rare case. Speaking about using obd_last_committed, it breaks VBR features because whole idea was about each client may recover independently from others, so each client maintain own state of committed transaction, it is not cluster-wide since VBR was introduced. Independent client recovery opens way to better server recovery mechanisms like delayed recovery and parallel recovery, that is why I don't want to break this now if there is another way, just because we will return to this problem again in future and will solve it again. I'd prefer to solve it properly now. As I can see Niu patch in |
| Comment by Alexey Lyashkov [ 28/May/13 ] |
|
Tappro, Niu patch eats space in journal, that is one more objection and we may don't have 1000 commits to trigger a Niu exp_last_commited updates. I will be happy if patch will add's a just commit callback to update a per export value and don't touch disk for each fake commit, may be we need additional API or just move some logic to MDD layer. as about using obd_last_committed, from my point view - it's not a differences between have a snapshot of committed ID via introducing a fake commits, and take same info from a global variable - anyway it's point to same data, but i think it's easy because client will be get a last committed updates early, and it's may affect a BRW requests also - as they may don't have any real transaction but need to be know about last committed to release own resources (it's my idea, but i really don't check in code - just reading long long time Adilger commit - why obd_last_committed used). |
| Comment by Alex Zhuravlev [ 28/May/13 ] |
|
Niu's patch will "eat" one block every 5s per client at most. with 1GB journal, a transaction limit is 65536 blocks. |
| Comment by Alexey Lyashkov [ 28/May/13 ] |
|
with 10k clients - it's eats 10k blocks each 5s - so it's will eat 1/6 of journal size will used to store a fake commits. |
| Comment by Alex Zhuravlev [ 28/May/13 ] |
|
(gdb) p sizeof(struct lsd_client_data) so, it should be nearly 320000 clients very evenly distributed over last_rcvd to consume 10K blocks. which is very unlikely. |