Details
Bug
Resolution: Fixed
Critical
Lustre 2.7.0
OpenSFS cluster running lustre-master tag 2.6.90 build #2745 with one MDS/MDT, three OSSs with two OSTs each and three clients.
3
16583
Description
I've been running sanity-hsm test 90 several times on this cluster, and nearly every time I run the test I see the following in dmesg on the MDS:
Lustre: DEBUG MARKER: == sanity-hsm test 90: Archive/restore a file list == 15:39:24 (1416440364)
Lustre: HSM agent bb8c2497-7403-4909-0e46-6614668e8ed7 already registered
LustreError: 26047:0:(mdt_coordinator.c:957:mdt_hsm_cdt_start()) scratch-MDT0000: Coordinator already started
LustreError: 19956:0:(tgt_lastrcvd.c:806:tgt_last_rcvd_update()) scratch-MDT0000: trying to overwrite bigger transno:on-disk: 25769818612, new: 25769818611 replay: 0. see LU-617.
LustreError: 19956:0:(tgt_lastrcvd.c:806:tgt_last_rcvd_update()) Skipped 5 previous similar messages
Lustre: DEBUG MARKER: == sanity-hsm test complete, duration 37 sec == 15:39:50 (1416440390)
From the kernel logs, I see:
...
00000001:00020000:9.0:1416440377.839622:0:19956:0:(tgt_lastrcvd.c:806:tgt_last_rcvd_update()) scratch-MDT0000: trying to overwrite bigger transno:on-disk: 25769818612, new: 25769818611 replay: 0. see LU-617.
...
00000001:00080000:8.0:1416440377.869378:0:30331:0:(tgt_lastrcvd.c:1231:tgt_txn_stop_cb()) More than one transaction 25769818612
...
00000001:00080000:8.0:1416440377.869423:0:30331:0:(tgt_lastrcvd.c:1231:tgt_txn_stop_cb()) More than one transaction 25769818612
...
00000001:00080000:8.0:1416440377.869508:0:30331:0:(tgt_lastrcvd.c:1231:tgt_txn_stop_cb()) More than one transaction 25769818612
...
00000100:00100000:8.0:1416440377.869685:0:30331:0:(service.c:2116:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_002:bb8c2497-7403-4909-0e46-6614668e8ed7+713:21533:x1485210712561904:12345-192.168.2.111@o2ib:57 Request procesed in 30116us (30167us total) trans 25769818612 rc 0/0
Similarly for other transaction numbers:
00000001:00020000:0.0:1416440378.133498:0:19955:0:(tgt_lastrcvd.c:806:tgt_last_rcvd_update()) scratch-MDT0000: trying to overwrite bigger transno:on-disk: 25769818617, new: 25769818614 replay: 0. see LU-617.
and
00000001:00020000:1.0F:1416440378.133518:0:31313:0:(tgt_lastrcvd.c:806:tgt_last_rcvd_update()) scratch-MDT0000: trying to overwrite bigger transno:on-disk: 25769818619, new: 25769818618 replay: 0. see LU-617.
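To compare the "trying to overwrite bigger transno" messages across runs, a small script along the following lines can pull out the on-disk and new transaction numbers and show how far behind each new value is. This is only a sketch: the regular expression is written against the message text exactly as it appears in the logs above, and the function name transno_regressions is made up for illustration.

```python
import re

# Matches the message text as it appears in the logs quoted in this ticket.
TRANSNO_RE = re.compile(
    r"trying to overwrite bigger transno:on-disk: (\d+), new: (\d+) replay: (\d+)"
)


def transno_regressions(log_text):
    """Return a list of (on_disk, new, gap) tuples, one per matching message."""
    results = []
    for match in TRANSNO_RE.finditer(log_text):
        on_disk = int(match.group(1))
        new = int(match.group(2))
        results.append((on_disk, new, on_disk - new))
    return results


# The three messages quoted in the description, with the log prefixes trimmed.
sample = """
scratch-MDT0000: trying to overwrite bigger transno:on-disk: 25769818612, new: 25769818611 replay: 0. see LU-617.
scratch-MDT0000: trying to overwrite bigger transno:on-disk: 25769818617, new: 25769818614 replay: 0. see LU-617.
scratch-MDT0000: trying to overwrite bigger transno:on-disk: 25769818619, new: 25769818618 replay: 0. see LU-617.
"""

for on_disk, new, gap in transno_regressions(sample):
    print(f"on-disk {on_disk}  new {new}  behind by {gap}")
```

For the three messages quoted here, the new transno is behind the on-disk value by 1, 3 and 1 respectively.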
Before running sanity-hsm test 90, the copytool was started on the agent, c11.
Hmm, I'm not comfortable with letting HSM requests create multiple transactions, because I'm afraid it may cause problems down the road. OUT can have multiple transactions for one RPC because it was carefully designed for this, but that can't be applied to HSM. I'd like an alternative way to fix this problem: limit HSM requests to only one transaction per RPC.
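As a rough illustration of the constraint proposed in this comment, the toy model below rejects a second transaction within the same RPC unless the request type explicitly allows it. It is not Lustre code: RpcContext, TooManyTransactions, start_transaction and max_transactions are hypothetical names, and the second transno used in the example is invented purely to trigger the check.

```python
class TooManyTransactions(Exception):
    """Raised when an RPC tries to start more transactions than allowed."""


class RpcContext:
    """Hypothetical per-RPC state; illustrative only, not a Lustre structure."""

    def __init__(self, xid, max_transactions=1):
        self.xid = xid
        # None means "no limit", modelling a request type designed for
        # multiple transactions per RPC (as the comment says OUT is).
        self.max_transactions = max_transactions
        self.transnos = []

    def start_transaction(self, transno):
        """Record a transaction for this RPC, enforcing the per-RPC limit."""
        limit = self.max_transactions
        if limit is not None and len(self.transnos) >= limit:
            raise TooManyTransactions(
                f"RPC {self.xid}: already has {len(self.transnos)} "
                f"transaction(s), limit is {limit}"
            )
        self.transnos.append(transno)
        return transno


# An HSM-style request limited to one transaction per RPC.
hsm_rpc = RpcContext(xid="x1485210712561904")
hsm_rpc.start_transaction(25769818612)
try:
    hsm_rpc.start_transaction(25769818613)  # invented value, for illustration
except TooManyTransactions as exc:
    print(exc)
```

With a limit of one, the second call fails loudly instead of producing another "More than one transaction" entry for the same RPC.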