[LU-2086] Server went back in time Created: 18/May/12 Updated: 19/Apr/13 Resolved: 07/Oct/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Brian Behlendorf | Assignee: | Mikhail Pershin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | topsequoia | ||
| Attachments: |
|
| Severity: | 3 |
| Epic: | server |
| Project: | Orion |
| Rank (Obsolete): | 3019 |
| Description |
|
Observed when manually restarting zfs servers, specifically:

1) umount /mnt/lustre/server

There is some suspicion that we may be rolling back to a previous uberblock due to how things were shut down, but that's just a suspicion at the moment until we have more data.

2012-05-17 11:24:31 LustreError: 11-0: lstest-OST02e6-osc-ffff88063d2a2000: Communicating with 172.20.4.142@o2ib500, operation obd_ping failed with -107
2012-05-17 11:24:31 Lustre: lstest-OST02ef-osc-ffff88063d2a2000: Connection to lstest-OST02ef (at 172.20.4.151@o2ib500) was lost; in progress operations using this service will wait for recovery to complete
2012-05-17 11:24:31 Lustre: Skipped 19 previous similar messages
2012-05-17 11:24:31 LustreError: Skipped 27 previous similar messages
2012-05-17 11:24:53 Lustre: lstest-OST02b8-osc-ffff88063d2a2000: Connection to lstest-OST02b8 (at 172.20.4.96@o2ib500) was lost; in progress operations using this service will wait for recovery to complete
2012-05-17 11:24:53 Lustre: Skipped 30 previous similar messages
2012-05-17 11:24:53 LustreError: 23000:0:(import.c:930:ptlrpc_connect_interpret()) lstest-OST029a_UUID went back in time (transno 4315649123 was previously committed, server now claims 4295433349)! See https://bugzilla.lustre.org/show_bug.cgi?id=9646
2012-05-17 11:24:53 LustreError: 23000:0:(import.c:930:ptlrpc_connect_interpret()) Skipped 11 previous similar messages
2012-05-17 11:24:57 Lustre: lstest-OST0286-osc-ffff88063d2a2000: Connection restored to lstest-OST0286 (at 172.20.4.46@o2ib500)
2012-05-17 11:24:57 Lustre: Skipped 12 previous similar messages
2012-05-17 11:25:02 Lustre: lstest-OST02e9-osc-ffff88063d2a2000: Connection restored to lstest-OST02e9 (at 172.20.4.145@o2ib500)
2012-05-17 11:25:47 Lustre: lstest-OST02e3-osc-ffff88063d2a2000: Connection restored to lstest-OST02e3 (at 172.20.4.139@o2ib500)
2012-05-17 11:25:47 Lustre: Skipped 25 previous similar messages
2012-05-17 11:25:57 Lustre: lstest-OST028d-osc-ffff88063d2a2000: Connection restored to lstest-OST028d (at 172.20.4.53@o2ib500)
2012-05-17 11:25:57 Lustre: Skipped 3 previous similar messages
2012-05-17 11:26:11 LustreError: 23000:0:(import.c:930:ptlrpc_connect_interpret()) lstest-OST02e4_UUID went back in time (transno 4320876109 was previously committed, server now claims 4296801416)! See https://bugzilla.lustre.org/show_bug.cgi?id=9646
2012-05-17 11:26:11 LustreError: 23000:0:(import.c:930:ptlrpc_connect_interpret()) Skipped 67 previous similar messages |
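For reference, the "went back in time" message comes from a client-side check in ptlrpc_connect_interpret() (lustre/ptlrpc/import.c), quoted later in this ticket, which compares the last_committed value the server reports on reconnect against the highest value the client has already seen committed. A minimal standalone sketch of that comparison (names and flow simplified; the transno values are taken from the console log above):

#include <stdio.h>
#include <stdint.h>

/* Illustrative only: the real check lives in ptlrpc_connect_interpret();
 * names and flow are simplified here. */
static void check_reconnect(uint64_t peer_committed, uint64_t srv_last_committed)
{
        /* server reports a last_committed lower than what the client
         * already saw committed -> "went back in time" */
        if (srv_last_committed > 0 && srv_last_committed < peer_committed)
                printf("went back in time (transno %llu was previously "
                       "committed, server now claims %llu)!\n",
                       (unsigned long long)peer_committed,
                       (unsigned long long)srv_last_committed);
}

int main(void)
{
        /* transno values from the console log above */
        check_reconnect(4315649123ULL, 4295433349ULL);
        return 0;
}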
| Comments |
| Comment by Mikhail Pershin [ 18/May/12 ] |
|
Note that this can happen if the client was evicted and then reconnected. Today the last_committed value is per-export; the server data also carries such a value, but it is not updated on every transaction. So if the client's export data was deleted, the largest last-committed value known to the server is reported back, and that can be a lower value. So please check whether the client was evicted or not. |
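As a toy illustration of the fallback described above (the structures and numbers here are invented, not the actual Lustre code): while the client's export record exists the server can report that export's own last_committed, but once the export is removed on eviction only the less frequently updated server-wide value remains, which may be lower.

#include <stdbool.h>
#include <stdio.h>
#include <stdint.h>

/* Hypothetical sketch of the per-export vs. server-wide last_committed
 * fallback; not the real Lustre data structures. */
struct fake_server {
        uint64_t server_wide_last_committed; /* updated only occasionally */
        bool     export_present;             /* false once the client is evicted */
        uint64_t export_last_committed;      /* updated per transaction */
};

static uint64_t reported_last_committed(const struct fake_server *s)
{
        /* with the per-export record gone, only the (possibly stale, lower)
         * server-wide value can be reported back on reconnect */
        return s->export_present ? s->export_last_committed
                                 : s->server_wide_last_committed;
}

int main(void)
{
        struct fake_server s = { 4295000000ULL, true, 4295846814ULL };

        printf("reported while export exists: %llu\n",
               (unsigned long long)reported_last_committed(&s));
        s.export_present = false; /* eviction deletes the export data */
        printf("reported after eviction:      %llu\n",
               (unsigned long long)reported_last_committed(&s));
        return 0;
}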
| Comment by Brian Behlendorf [ 18/May/12 ] |
|
Actually, looking more closely at all the logs I don't see any evidence that a client was evicted. But I am able to easily reproduce the failure simply by restarting an OSS with the above procedure. Here are the console logs from the restarted server and from a client which reported the issue. Note that not all clients encountered this.

OSS grove-641 (172.20.4.41@o2ib500)

<ConMan> Console [grove641] log at 2012-05-18 12:00:00 PDT.
<ConMan> Console [grove641] joined by <root@localhost> on pts/16 at 05-18 12:16.
2012-05-18 12:16:46 Lustre: Failing over lstest-OST0281
2012-05-18 12:16:46 LustreError: 137-5: lstest-OST0281: Not available for connect from 172.20.3.192@o2ib500 (stopping)
2012-05-18 12:16:46 LustreError: 137-5: lstest-OST0281: Not available for connect from 172.20.3.200@o2ib500 (stopping)
2012-05-18 12:16:48 LustreError: 137-5: lstest-OST0281: Not available for connect from 172.20.3.155@o2ib500 (stopping)
2012-05-18 12:16:48 LustreError: Skipped 1 previous similar message
2012-05-18 12:16:49 Lustre: lstest-OST0281: Unmounted
2012-05-18 12:17:11 LustreError: 11-0: MGC172.20.5.2@o2ib500: Communicating with 172.20.5.2@o2ib500, operation llog_origin_handle_create failed with -2
2012-05-18 12:17:11 LustreError: 64142:0:(mgc_request.c:250:do_config_log_add()) failed processing sptlrpc log: -2
2012-05-18 12:17:12 Lustre: lstest-OST0281: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-450
2012-05-18 12:17:12 Lustre: lstest-OST0281: Mounted grove641/ost0
2012-05-18 12:17:17 Lustre: lstest-OST0281: Will be in recovery for at least 2:30, or until 257 clients reconnect.
2012-05-18 12:17:33 Lustre: lstest-OST0281: Recovery over after 0:16, of 257 clients 257 recovered and 0 were evicted.

Client grove592 (172.20.3.192@o2ib500) - No issue

2012-05-18 12:16:46 LustreError: 11-0: lstest-OST0281-osc-ffff88080c6cc800: Communicating with 172.20.4.41@o2ib500, operation obd_ping failed with -107
2012-05-18 12:16:46 Lustre: lstest-OST0281-osc-ffff88080c6cc800: Connection to lstest-OST0281 (at 172.20.4.41@o2ib500) was lost; in progress operations using this service will wait for recovery to complete
2012-05-18 12:17:33 Lustre: lstest-OST0281-osc-ffff88080c6cc800: Connection restored to lstest-OST0281 (at 172.20.4.41@o2ib500)

Client grove443 (172.20.3.43@o2ib500) - Went back in time

2012-05-18 12:17:22 Lustre: lstest-OST0281-osc-ffff8807fe7abc00: Connection to lstest-OST0281 (at 172.20.4.41@o2ib500) was lost; in progress operations using this service will wait for recovery to complete
2012-05-18 12:17:22 Lustre: Skipped 127 previous similar messages
2012-05-18 12:17:22 LustreError: 13990:0:(import.c:930:ptlrpc_connect_interpret()) lstest-OST0281_UUID went back in time (transno 4295846814 was previously committed, server now claims 4295242246)! See https://bugzilla.lustre.org/show_bug.cgi?id=9646
2012-05-18 12:17:22 LustreError: 13990:0:(import.c:930:ptlrpc_connect_interpret()) Skipped 71 previous similar messages
2012-05-18 12:17:33 Lustre: lstest-OST0281-osc-ffff8807fe7abc00: Connection restored to lstest-OST0281 (at 172.20.4.41@o2ib500)
2012-05-18 12:17:33 Lustre: Skipped 96 previous similar messages |
| Comment by Mikhail Pershin [ 20/May/12 ] |
|
Brian, can you try with the lines below added into ofd_obd_disconnect():

+ if (exp->exp_obd->obd_replayable &&
+     (!exp->exp_obd->obd_fail || exp->exp_failed))
          lut_client_del(&env, exp);
|
| Comment by Brian Behlendorf [ 21/May/12 ] |
|
Mike, I'll roll that tweak into our next tag and let you know. |
| Comment by Brian Behlendorf [ 31/May/12 ] |
|
So far I haven't reproduced this issue with the patch applied, but we haven't tried very hard yet either. We should get serious failover testing going soon. |
| Comment by Brian Behlendorf [ 01/Jun/12 ] |
|
No more actual data to go on here, but I was able to reproduce this easily during the grove update. The update did contain your suggested change. We'll try to get detailed logs under more controlled circumstances. |
| Comment by Prakash Surya (Inactive) [ 13/Jun/12 ] |
|
Mike, I'm going to try and dig into this issue. It's easily reproducible when failing over the servers under load. I'll try to collect and post some Lustre debug logs. |
| Comment by Prakash Surya (Inactive) [ 13/Jun/12 ] |
|
Looks like I can hit the problem on my VM running the replay-single test:

LustreError: 996:0:(import.c:928:ptlrpc_connect_interpret()) lustre-OST0000_UUID went back in time (transno 4294969953 was previously committed, server now claims 4294967730)! See https://bugzilla.lustre.org/show_bug.cgi?id=9646 |
| Comment by Ian Colle (Inactive) [ 14/Jun/12 ] |
|
Prakash - that's weird - can you tell us during what subtest you hit that? We've passed replay-single routinely and recovery-small test_54 should find that also. |
| Comment by Prakash Surya (Inactive) [ 14/Jun/12 ] |
|
Looking back at the logs from my acceptance-small and replay-single runs over the past few days, tests 59, 47, 61c, and 90 of replay-single appear to be the ones hitting this issue. However, none of these tests are reported as having failed. |
| Comment by Prakash Surya (Inactive) [ 18/Jun/12 ] |
|
Also, running recovery-small in my VM doesn't appear to trigger the message. |
| Comment by Prakash Surya (Inactive) [ 26/Jun/12 ] |
|
Applied the following patch and ran the replay-single reproducers to generate the log file.

diff --git i/lustre/ptlrpc/import.c w/lustre/ptlrpc/import.c
index 7e68efb..7bf2088 100644
--- i/lustre/ptlrpc/import.c
+++ w/lustre/ptlrpc/import.c
@@ -920,6 +920,9 @@ static int ptlrpc_connect_interpret(const struct lu_env *env,
if (lustre_msg_get_last_committed(request->rq_repmsg) > 0 &&
lustre_msg_get_last_committed(request->rq_repmsg) <
aa->pcaa_peer_committed) {
+#ifdef __KERNEL__
+ libcfs_debug_dumplog();
+#endif
CERROR("%s went back in time (transno "LPD64
" was previously committed, server now claims "LPD64
")! See https://bugzilla.lustre.org/show_bug.cgi?"
diff --git i/lustre/tests/cfg/local.sh w/lustre/tests/cfg/local.sh
index d537924..6cc0f61 100644
--- i/lustre/tests/cfg/local.sh
+++ w/lustre/tests/cfg/local.sh
@@ -48,12 +48,12 @@ STRIPE_BYTES=${STRIPE_BYTES:-1048576}
STRIPES_PER_OBJ=${STRIPES_PER_OBJ:-0}
SINGLEMDS=${SINGLEMDS:-"mds1"}
TIMEOUT=${TIMEOUT:-20}
-PTLDEBUG=${PTLDEBUG:-"vfstrace rpctrace dlmtrace neterror ha config ioctl super"}
+PTLDEBUG=${PTLDEBUG:-"-1"}
SUBSYSTEM=${SUBSYSTEM:-"all -lnet -lnd -pinger"}
# promise 2MB for every cpu
if [ -f /sys/devices/system/cpu/possible ]; then
- _debug_mb=$((($(cut -d "-" -f 2 /sys/devices/system/cpu/possible)+1)*2))
+ _debug_mb=$((($(cut -d "-" -f 2 /sys/devices/system/cpu/possible)+1)*128))
else
_debug_mb=$(($(getconf _NPROCESSORS_CONF)*2))
fi
|
| Comment by Ned Bass [ 01/Aug/12 ] |
|
It looks like ofd->ofd_last_transno gets reset to 4294967296 (1 << 32) when a target is restarted on the server. That is, lut_boot_epoch_update() is called with lut->lus_last_transno == 0. Then in ofd_txn_stop_cb() new transaction numbers are assigned starting from there and saved in the last_rcvd file:

} else if (info->fti_transno == 0) {
        info->fti_transno = ++ofd->ofd_last_transno;
...
        return ofd_last_rcvd_update(info, txn);
}

However, exp->exp_last_committed keeps the value originally read from last_rcvd until the new transaction numbers overtake it. So the server keeps sending a larger last_committed transaction number than what is actually being used by the client, but it writes the newer, lower numbers to last_rcvd. Now if the server restarts, it sends the lcd_last_transno from last_rcvd to the client. But if the saved transaction numbers never grew larger than the old last_committed value, the client thinks the server went back in time.

To clarify, here's some debug data I captured with systemtap. I unmounted the OST at timestamp 1343860077 and the client connection was restored at 1343860152. It prints ofd->ofd_lut->lut_last_transno in ofd_txn_stop_cb() and lut->lus_last_transno in lut_boot_epoch_update(). There was only one active client.

1343860077 ll_ost_io_114 (24102) ofd_txn_stop_cb lut_last_transno 4295182382
1343860152 tgt_recov (24731) lut_boot_epoch_update() lut_last_transno 0
1343860152 ll_ost_io_127 24728 ofd_txn_stop_cb lut_last_transno 4294967296
1343860152 ll_ost_io_125 24726 ofd_txn_stop_cb lut_last_transno 4294967297
1343860152 ll_ost_io_120 24721 ofd_txn_stop_cb lut_last_transno 4294967298 |
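To make the arithmetic above concrete, here is a small illustrative program; the variable names are invented, and only the numbers mirror the systemtap trace.

#include <stdio.h>
#include <stdint.h>

/* Toy illustration of the restart sequence described above; not Lustre code. */
int main(void)
{
        uint64_t client_last_committed = 4295182382ULL; /* before the OST unmount */
        uint64_t new_transno = (1ULL << 32);            /* transnos restart here after
                                                         * lut_boot_epoch_update() saw 0 */

        /* a handful of post-restart transactions */
        for (int i = 0; i < 3; i++, new_transno++)
                printf("transno %llu written to last_rcvd (%s client's %llu)\n",
                       (unsigned long long)new_transno,
                       new_transno < client_last_committed ? "below" : "at or above",
                       (unsigned long long)client_last_committed);

        /* Until new_transno overtakes 4295182382, a second restart hands the
         * client an lcd_last_transno lower than what it already saw committed,
         * triggering the "went back in time" error. */
        return 0;
}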
| Comment by Ned Bass [ 02/Aug/12 ] |
|
Patch for orion: http://review.whamcloud.com/3518 |
| Comment by Ned Bass [ 02/Aug/12 ] |
|
Patch for master: http://review.whamcloud.com/3519 |
| Comment by Ian Colle (Inactive) [ 07/Oct/12 ] |
|
The fix from patch 3519 was already landed to master as part of another patch. |