[LU-5939] Error: trying to overwrite bigger transno Created: 20/Nov/14 Updated: 01/Nov/15 Resolved: 24/May/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.7.0 |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | James Nunez (Inactive) | Assignee: | Mikhail Pershin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | HB |
| Environment: |
OpenSFS cluster running lustre-master tag 2.6.90 build #2745 with one MDS/MDT, three OSSs with two OSTs each and three clients. |
| Attachments: | hsm_log_1.txt |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 16583 |
| Description |
|
I've been running sanity-hsm test 90 several times on this cluster, and nearly every time I run the test I see the following in dmesg on the MDS:

Lustre: DEBUG MARKER: == sanity-hsm test 90: Archive/restore a file list == 15:39:24 (1416440364)
Lustre: HSM agent bb8c2497-7403-4909-0e46-6614668e8ed7 already registered
LustreError: 26047:0:(mdt_coordinator.c:957:mdt_hsm_cdt_start()) scratch-MDT0000: Coordinator already started
LustreError: 19956:0:(tgt_lastrcvd.c:806:tgt_last_rcvd_update()) scratch-MDT0000: trying to overwrite bigger transno:on-disk: 25769818612, new: 25769818611 replay: 0. see LU-617.
LustreError: 19956:0:(tgt_lastrcvd.c:806:tgt_last_rcvd_update()) Skipped 5 previous similar messages
Lustre: DEBUG MARKER: == sanity-hsm test complete, duration 37 sec == 15:39:50 (1416440390)

From the kernel logs, I see:

...
00000001:00020000:9.0:1416440377.839622:0:19956:0:(tgt_lastrcvd.c:806:tgt_last_rcvd_update()) scratch-MDT0000: trying to overwrite bigger transno:on-disk: 25769818612, new: 25769818611 replay: 0. see LU-617.
...
00000001:00080000:8.0:1416440377.869378:0:30331:0:(tgt_lastrcvd.c:1231:tgt_txn_stop_cb()) More than one transaction 25769818612
...
00000001:00080000:8.0:1416440377.869423:0:30331:0:(tgt_lastrcvd.c:1231:tgt_txn_stop_cb()) More than one transaction 25769818612
...
00000001:00080000:8.0:1416440377.869508:0:30331:0:(tgt_lastrcvd.c:1231:tgt_txn_stop_cb()) More than one transaction 25769818612
...
00000100:00100000:8.0:1416440377.869685:0:30331:0:(service.c:2116:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_002:bb8c2497-7403-4909-0e46-6614668e8ed7+713:21533:x1485210712561904:12345-192.168.2.111@o2ib:57 Request processed in 30116us (30167us total) trans 25769818612 rc 0/0

Similarly for other transaction numbers:

00000001:00020000:0.0:1416440378.133498:0:19955:0:(tgt_lastrcvd.c:806:tgt_last_rcvd_update()) scratch-MDT0000: trying to overwrite bigger transno:on-disk: 25769818617, new: 25769818614 replay: 0. see LU-617.

and

00000001:00020000:1.0F:1416440378.133518:0:31313:0:(tgt_lastrcvd.c:806:tgt_last_rcvd_update()) scratch-MDT0000: trying to overwrite bigger transno:on-disk: 25769818619, new: 25769818618 replay: 0. see LU-617.

Before running sanity-hsm test 90, the copytool was started on the agent, c11. |
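To make the error concrete, here is a minimal, self-contained model of the kind of check behind the "trying to overwrite bigger transno" message: the per-client slot in last_rcvd is only supposed to move forward, so a smaller transno arriving after a bigger one has already been written is treated as an ordering error. The struct and function names below are hypothetical; this is not the real tgt_last_rcvd_update().

```c
/*
 * Simplified, self-contained model of the check behind the
 * "trying to overwrite bigger transno" message.  This is NOT the real
 * tgt_last_rcvd_update(); the struct and function names are hypothetical.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct lsd_client_slot {
	uint64_t lcs_last_transno;	/* last transno recorded on disk for this client */
};

static void slot_update_transno(struct lsd_client_slot *slot,
				uint64_t new_transno, bool replay)
{
	if (slot->lcs_last_transno > new_transno) {
		/* A smaller transno arrived after a bigger one was already
		 * written: the requests were not serialized. */
		fprintf(stderr,
			"trying to overwrite bigger transno: on-disk: %llu, new: %llu replay: %d\n",
			(unsigned long long)slot->lcs_last_transno,
			(unsigned long long)new_transno, replay);
		return;	/* keep the bigger on-disk value */
	}
	slot->lcs_last_transno = new_transno;
}

int main(void)
{
	struct lsd_client_slot slot = { .lcs_last_transno = 0 };

	/* Two unserialized HSM RPCs complete out of order, as in the report. */
	slot_update_transno(&slot, 25769818612ULL, false);
	slot_update_transno(&slot, 25769818611ULL, false);	/* triggers the message */
	return 0;
}
```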
| Comments |
| Comment by James Nunez (Inactive) [ 20/Nov/14 ] |
|
I've attached the kernel debug log from the MDS: hsm_log_1.txt. |
| Comment by Jodi Levi (Inactive) [ 21/Nov/14 ] |
|
Mike, |
| Comment by Jodi Levi (Inactive) [ 02/Dec/14 ] |
|
Mike, |
| Comment by Mikhail Pershin [ 19/Jan/15 ] |
|
I am not familiar with the HSM coordinator design, but it seems to do many transactions per single RPC, which causes the 'More than one transaction' message. It also seems that HSM RPCs from the client are sent without the mdc RPC lock and are not serialized, which causes the 'trying to overwrite bigger transno' message. This may not be a problem if HSM requests don't need recovery; in that case the coordinator doesn't need transaction numbers for its requests or last_rcvd file updates. |
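For reference, a minimal model of the "More than one transaction" diagnostic mentioned here (hypothetical names; this is not the real tgt_txn_stop_cb()): at transaction-stop time the target notices that the current RPC already has a transno assigned, which is exactly what happens when one HSM RPC runs several MDD transactions.

```c
/*
 * Minimal model (hypothetical names, NOT the real tgt_txn_stop_cb()) of the
 * "More than one transaction" diagnostic: at transaction-stop time the
 * target notices that the current RPC already has a transno assigned.
 */
#include <stdint.h>
#include <stdio.h>

struct req_ctx {
	uint64_t rc_transno;	/* transno already assigned to this RPC, 0 if none */
};

static void txn_stop_cb(struct req_ctx *req, uint64_t new_transno)
{
	if (req->rc_transno != 0)
		fprintf(stderr, "More than one transaction %llu\n",
			(unsigned long long)req->rc_transno);
	else
		req->rc_transno = new_transno;
}

int main(void)
{
	struct req_ctx req = { 0 };

	/* One HSM RPC ends up running three separate MDD transactions. */
	txn_stop_cb(&req, 25769818612ULL);
	txn_stop_cb(&req, 25769818613ULL);	/* warns */
	txn_stop_cb(&req, 25769818614ULL);	/* warns again */
	return 0;
}
```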
| Comment by Mikhail Pershin [ 21/Jan/15 ] |
|
Considering that HSM requests are filtered out during recovery, they are not supposed to be replayed during Lustre recovery, and the HSM copytool will just send them again. In that case such updates shouldn't get a transaction number assigned or update the last_rcvd slot. |
| Comment by Mikhail Pershin [ 27/Jan/15 ] |
|
Meanwhile, I am not sure how to fix this properly: HSM uses the common API to access related objects, and all transactions are started by MDD. There is no easy way to suppress transaction number generation for some of them. |
| Comment by Andreas Dilger [ 28/Jan/15 ] |
|
Any comments from HSM folks on this issue? |
| Comment by Aurelien Degremont (Inactive) [ 29/Jan/15 ] |
|
Sorry, I did not see this issue. HSM code does 2 kinds of disk access:
Nothing in the HSM code will resend the HSM request registrations. HSM (wrongly?) relies on the Lustre recovery mechanism. HSM actions (archive this file, restore that one, etc.) are supposed to be kept across MDT restarts. Another thing that could lead to a very big transaction is sending an archive request for a big list of files: the MDT will try to add them (one llog record per file) in one transaction. So far this file list has been limited by other Lustre limitations (LNET message size, KUC, ...). I don't know if this limitation has changed, but the test still uses only 51 files.
AFAIK, replay and recovery are tightly coupled. I don't know how you could manage replay without recovery, other than doing it by hand. Is there code somewhere in Lustre which does that? |
| Comment by Mikhail Pershin [ 29/Jan/15 ] |
|
Aurelien, I decided that HSM requests don't rely on Lustre recovery (replays) because HSM requests are filtered out during recovery; check tgt_filter_recovery_requests(), where no MDS_HSM_* opcodes are listed. This must be fixed if they are supposed to be replayed, and that also means we should serialize HSM requests from the mdc with mdc_rpc_lock(). This should also solve the 'overwrite bigger transno' issue. |
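As a rough illustration of the idea being discussed (let HSM modifying RPCs participate in recovery like other metadata updates), here is a hedged sketch of an opcode-based recovery filter. The enum and function are illustrative stand-ins, not the actual tgt_filter_recovery_requests() code, and exactly which opcodes belong in the list is what this thread is deciding.

```c
/*
 * Illustrative only: a simplified opcode filter in the spirit of
 * tgt_filter_recovery_requests().  The opcode names are local stand-ins for
 * the real Lustre constants; the point is that HSM modifying opcodes would
 * be added to the set of requests handled during recovery so they can be
 * replayed.
 */
enum sample_opc {
	OPC_MDS_CLOSE,
	OPC_MDS_DISCONNECT,
	OPC_MDS_HSM_STATE_SET,
	OPC_MDS_HSM_PROGRESS,
	OPC_MDS_HSM_REQUEST,
	OPC_MDS_OTHER,
};

/* Return 1 if this request type participates in recovery/replay. */
int opc_allowed_in_recovery(enum sample_opc opc)
{
	switch (opc) {
	case OPC_MDS_CLOSE:
	case OPC_MDS_DISCONNECT:
	/* Proposed addition: HSM RPCs that modify server state. */
	case OPC_MDS_HSM_STATE_SET:
	case OPC_MDS_HSM_PROGRESS:
	case OPC_MDS_HSM_REQUEST:
		return 1;
	default:
		return 0;
	}
}
```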
| Comment by Aurelien Degremont (Inactive) [ 04/Feb/15 ] |
|
I think this was forgotten. If somebody thinks this is a bad idea, say so, but IMO we should replay them, and therefore serialize them. |
| Comment by Andreas Dilger [ 04/Feb/15 ] |
|
If the MDC is serializing change requests, is this going to hurt HSM performance, or is there only a single HSM request active on any client/agent node at once? Adding the HSM opcodes to the recovery list seems like a simple and low-risk change, and serializing the HSM requests on the MDC is also limited to HSM usage, so it doesn't seem too high risk. |
| Comment by Mikhail Pershin [ 06/Feb/15 ] |
|
Yes, I am already working on a patch. As for HSM agent performance, I am trying to serialize only the requests that might update data on the server. |
| Comment by Aurelien Degremont (Inactive) [ 06/Feb/15 ] |
|
On my side, I've identified 3 RPCs which can update data on the server.
Serializing those requests can slow the request ingestion rate, but I think it is acceptable. If you're working on a patch, please also add the MUTABOR flag to the MDS_HSM_REQUEST RPC. |
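A hedged sketch of the client-side serialization idea from the last few comments: take an RPC lock (standing in for the MDC's mdc_rpc_lock()/mdc_rpc_unlock() mentioned above) only for HSM RPCs that modify server state, so read-only HSM traffic is unaffected. The set of modifying opcodes shown is an assumption for illustration; MDS_HSM_REQUEST is the only one named explicitly above.

```c
/*
 * Sketch of the client-side serialization idea.  The mutex here stands in
 * for the MDC RPC lock (mdc_rpc_lock()/mdc_rpc_unlock() are mentioned in
 * this thread); the set of "modifying" opcodes is an assumption for
 * illustration, except MDS_HSM_REQUEST which is named explicitly above.
 */
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t hsm_mod_rpc_lock = PTHREAD_MUTEX_INITIALIZER;

enum hsm_opc { HSM_STATE_GET, HSM_STATE_SET, HSM_PROGRESS, HSM_REQUEST };

static bool hsm_opc_is_modifying(enum hsm_opc opc)
{
	return opc == HSM_STATE_SET || opc == HSM_PROGRESS ||
	       opc == HSM_REQUEST;
}

void hsm_send_rpc(enum hsm_opc opc)
{
	bool serialize = hsm_opc_is_modifying(opc);

	/* Only requests that update server state are serialized, so
	 * read-only HSM traffic is not slowed down. */
	if (serialize)
		pthread_mutex_lock(&hsm_mod_rpc_lock);

	/* ... build and send the RPC, wait for the reply ... */

	if (serialize)
		pthread_mutex_unlock(&hsm_mod_rpc_lock);
}
```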
| Comment by Mikhail Pershin [ 07/Feb/15 ] |
|
Thanks for the help. I identified the first two as well, but not MDS_HSM_REQUEST; is there a specific action that updates data on disk, or does it do that upon any request? |
| Comment by Gerrit Updater [ 09/Feb/15 ] |
|
Mike Pershin (mike.pershin@intel.com) uploaded a new patch: http://review.whamcloud.com/13684 |
| Comment by Mikhail Pershin [ 09/Feb/15 ] |
|
Note, the patch above just makes HSM requests replayable; it contains no tests because I found that HSM actions can't recover for reasons not related to this particular patch. I've created LU-6223 for HSM recovery testing. So this particular patch should solve just the problem with the "overwrite bigger transno" message. |
| Comment by Jinshan Xiong (Inactive) [ 12/Feb/15 ] |
|
Hmm, I'm not comfortable with multiple transactions being made by HSM requests because I'm afraid it may cause problems down the road. OUT can have multiple transactions for one RPC because it was carefully designed for this, but that can't be applied to HSM. I'd like an alternative way to fix this problem, by limiting HSM requests to only one transaction per RPC. |
| Comment by Mikhail Pershin [ 12/Feb/15 ] |
|
Yes, I agree, that would be better |
| Comment by Aurelien Degremont (Inactive) [ 12/Feb/15 ] |
|
IIRC, HSM_REQUEST stores a list of requests to be done in a llog. One RPC can send a request for the same action (archive, restore, ...) for a list of files. One llog record will be added for each file (with the same compound_id, to be able to rebuild the request later). Records are added using llog_cat_add(). If we want to have only one transaction, we need a special version which can add several records in one call, and we need to update mdt_hsm_add_actions() accordingly. |
| Comment by Jinshan Xiong (Inactive) [ 12/Feb/15 ] |
|
Exactly. llog_cat_add() can be revised to carry a transaction handle parameter, so we can start a transaction in mdt_hsm_add_actions() and use it for all llog operations later. The only concern is the size of the transaction. I remember that there is a limit on it, but I'm not an OSD expert. If that is the case, we also need to take log file creation into account for the transaction size. |
| Comment by Jinshan Xiong (Inactive) [ 12/Feb/15 ] |
|
On second thought, we don't even need to add a parameter to llog_cat_add(). We just need to call the llog_add() series of interfaces instead, just as we do for the changelog. |
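For concreteness, a hedged sketch of that single-transaction pattern: declare credits for every record up front, then add all llog records under one transaction handle, as the changelog path does. The function below is hypothetical and the dt_trans_* / llog_declare_add() / llog_add() signatures are approximations from memory, not verified against a particular Lustre version; mdt_hsm_add_actions() would be reworked along these lines.

```c
/*
 * Sketch of the single-transaction pattern: declare credits for every llog
 * record up front, then add all records under one transaction handle, as
 * the changelog path does.  The dt_trans_* / llog_declare_add() / llog_add()
 * signatures are written from memory and should be checked against the
 * tree; this function itself is hypothetical.
 */
#include <dt_object.h>
#include <lustre_log.h>

static int hsm_add_actions_one_txn(const struct lu_env *env,
				   struct dt_device *dev,
				   struct llog_handle *lgh,
				   struct llog_rec_hdr **recs, int nr)
{
	struct thandle *th;
	int i, rc;

	th = dt_trans_create(env, dev);
	if (IS_ERR(th))
		return PTR_ERR(th);

	/* Credits for all records must be declared before the start. */
	for (i = 0; i < nr; i++) {
		rc = llog_declare_add(env, lgh, recs[i], th);
		if (rc)
			goto out_stop;
	}

	rc = dt_trans_start_local(env, dev, th);
	if (rc)
		goto out_stop;

	/* Every record lands in the same transaction, so the RPC is
	 * assigned exactly one transno. */
	for (i = 0; i < nr && rc == 0; i++)
		rc = llog_add(env, lgh, recs[i], NULL, th);

out_stop:
	dt_trans_stop(env, dev, th);
	return rc;
}
```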
| Comment by Aurelien Degremont (Inactive) [ 12/Feb/15 ] |
|
The problem is that we have to declare the number of credits we need for the transaction in advance, so we also need to update the credit declaration. |
| Comment by Mikhail Pershin [ 13/Feb/15 ] |
|
I am not sure it is about llog records only; llog_cat_add() causes a local transaction which produces no transaction number, so there must be another update, maybe file attributes or something like that. I can give more details about the HSM request types and the operations behind the multiple transnos later today. Meanwhile, llog_cat_add() should be replaced with llog_add() in any case. As for putting everything into a single transaction, we still have another way to go: use the same mechanism as OUT uses to control a batch of updates. This will cause a compatibility problem, but maybe it is not so difficult to solve. I mean we shouldn't rule this case out completely; we should review it too. This is in the context of LU-6223 though. |
| Comment by Jinshan Xiong (Inactive) [ 13/Feb/15 ] |
|
Hi Mike, you have the expertise on recovery - so if you think it's better to go for multiple transactions, I'm good. Sorry for noise. |
| Comment by Mikhail Pershin [ 13/Feb/15 ] |
|
I've got a trace of the last occurrence of 'More than one transaction'. It is not about llog; it is HSM_PROGRESS, and the MDT does several disk changes:
This is the hsm_archive operation from sanity-hsm.sh. Each call to the lower level (MDD) causes a separate transaction. I am not sure how to solve that right now; ideally the transaction should start in the MDT, but the mo_... interface cannot pass transaction details to the MDD. I think we might allow multiple transactions for this case and for restore, with additional checks. This is to be continued in LU-6223. |
| Comment by Gerrit Updater [ 08/May/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13684/ |
| Comment by Peter Jones [ 24/May/15 ] |
|
Landed for 2.8 |