Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • None
    • None
    • 3
    • Orion
    • 3019

    Description

      Observed when manually restarting ZFS-backed servers, specifically:

      1) umount /mnt/lustre/server
      2) export zpool
      3) import zpool
      4) mount zpool/server /mnt/lustre/server

      There is some suspicion that we may be rolling back to a previous uberblock because of how things were shut down, but that is only a suspicion until we have more data. The client console log below shows the symptom; a small model of the check that prints the "went back in time" message follows the excerpt.

      2012-05-17 11:24:31 LustreError: 11-0: lstest-OST02e6-osc-ffff88063d2a2000: Communicating with 172.20.4.142@o2ib500, operation obd_ping failed with -107
      2012-05-17 11:24:31 Lustre: lstest-OST02ef-osc-ffff88063d2a2000: Connection to lstest-OST02ef (at 172.20.4.151@o2ib500) was lost; in progress operations using this service will wait for recovery to complete
      2012-05-17 11:24:31 Lustre: Skipped 19 previous similar messages
      2012-05-17 11:24:31 LustreError: Skipped 27 previous similar messages
      2012-05-17 11:24:53 Lustre: lstest-OST02b8-osc-ffff88063d2a2000: Connection to lstest-OST02b8 (at 172.20.4.96@o2ib500) was lost; in progress operations using this service will wait for recovery to complete
      2012-05-17 11:24:53 Lustre: Skipped 30 previous similar messages
      2012-05-17 11:24:53 LustreError: 23000:0:(import.c:930:ptlrpc_connect_interpret()) lstest-OST029a_UUID went back in time (transno 4315649123 was previously committed, server now claims 4295433349)!  See https://bugzilla.lustre.org/show_bug.cgi?id=9646
      2012-05-17 11:24:53 LustreError: 23000:0:(import.c:930:ptlrpc_connect_interpret()) Skipped 11 previous similar messages
      2012-05-17 11:24:57 Lustre: lstest-OST0286-osc-ffff88063d2a2000: Connection restored to lstest-OST0286 (at 172.20.4.46@o2ib500)
      2012-05-17 11:24:57 Lustre: Skipped 12 previous similar messages
      2012-05-17 11:25:02 Lustre: lstest-OST02e9-osc-ffff88063d2a2000: Connection restored to lstest-OST02e9 (at 172.20.4.145@o2ib500)
      2012-05-17 11:25:47 Lustre: lstest-OST02e3-osc-ffff88063d2a2000: Connection restored to lstest-OST02e3 (at 172.20.4.139@o2ib500)
      2012-05-17 11:25:47 Lustre: Skipped 25 previous similar messages
      2012-05-17 11:25:57 Lustre: lstest-OST028d-osc-ffff88063d2a2000: Connection restored to lstest-OST028d (at 172.20.4.53@o2ib500)
      2012-05-17 11:25:57 Lustre: Skipped 3 previous similar messages
      2012-05-17 11:26:11 LustreError: 23000:0:(import.c:930:ptlrpc_connect_interpret()) lstest-OST02e4_UUID went back in time (transno 4320876109 was previously committed, server now claims 4296801416)!  See https://bugzilla.lustre.org/show_bug.cgi?id=9646
      2012-05-17 11:26:11 LustreError: 23000:0:(import.c:930:ptlrpc_connect_interpret()) Skipped 67 previous similar messages
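
      The message above comes from the ptlrpc_connect_interpret() check in import.c cited in the log lines. As a rough, self-contained model of what that check compares - the names and surrounding code below are illustrative stand-ins, not the actual Lustre symbols - the client remembers the highest transno the server has acknowledged as committed and complains when a reconnect reply claims a smaller value:

      #include <stdint.h>
      #include <stdio.h>

      /* Per-import state the client keeps for one server connection. */
      struct import_state {
              uint64_t peer_committed;   /* highest transno the server confirmed committed */
      };

      /* Model of the reconnect-reply check; transno values taken from the log above. */
      static void check_reconnect(struct import_state *imp, const char *target,
                                  uint64_t server_last_committed)
      {
              if (server_last_committed < imp->peer_committed) {
                      /* The server "forgot" transactions it already reported committed:
                       * either this client's per-export record was dropped, or the
                       * target's on-disk state rolled back. */
                      printf("%s went back in time (transno %llu was previously "
                             "committed, server now claims %llu)!\n", target,
                             (unsigned long long)imp->peer_committed,
                             (unsigned long long)server_last_committed);
              } else {
                      imp->peer_committed = server_last_committed;
              }
      }

      int main(void)
      {
              struct import_state imp = { .peer_committed = 4315649123ULL };

              check_reconnect(&imp, "lstest-OST029a_UUID", 4295433349ULL);
              return 0;
      }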
      

      Attachments

        Activity

          [LU-2086] Server went back in time

          Prakash - that's weird - can you tell us in which subtest you hit that? We've passed replay-single routinely, and recovery-small test_54 should find that as well.

          ian Ian Colle (Inactive) added a comment

          Looks like I can hit the problem on my VM running the replay-single test:

          LustreError: 996:0:(import.c:928:ptlrpc_connect_interpret()) lustre-OST0000_UUID went back in time (transno 4294969953 was previously committed, server now claims 4294967730)!  See https://bugzilla.lustre.org/show_bug.cgi?id=9646
          
          prakash Prakash Surya (Inactive) added a comment

          Mike, I'm going to try to dig into this issue. It's easily reproducible when failing over the servers under load. I'll try to collect and post some Lustre debug logs.

          prakash Prakash Surya (Inactive) added a comment

          No more actual data to go on here, but I was able to reproduce this easily during the grove update. The update did contain your suggested change. We'll try to get detailed logs under more controlled circumstances.

          behlendorf Brian Behlendorf added a comment

          So far I haven't reproduced this issue with the patch applied, but we haven't tried very hard yet either. We should get serious failover testing going soon.

          behlendorf Brian Behlendorf added a comment

          Mike, I'll roll that tweak into our next tag and let you know.

          behlendorf Brian Behlendorf added a comment

          Brian, can you try with the lines below added to ofd_obd_disconnect:

          +	if (exp->exp_obd->obd_replayable &&
          +	    (!exp->exp_obd->obd_fail || exp->exp_failed))
          		lut_client_del(&env, exp);
          
          tappro Mikhail Pershin added a comment
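
          For context, here is a self-contained sketch of how that guard might sit on the disconnect path. The structs and the disconnect function below are stubbed-down stand-ins rather than the real ofd_obd_disconnect; only the condition itself comes from the fragment above:

          #include <stdbool.h>
          #include <stdio.h>

          struct obd_device {
                  bool obd_replayable;   /* target keeps per-client recovery state */
                  bool obd_fail;         /* target is being shut down for failover */
          };

          struct obd_export {
                  struct obd_device *exp_obd;
                  bool exp_failed;       /* export was evicted, not cleanly disconnected */
          };

          /* Stub: in Lustre this drops the client's slot from the last_rcvd file. */
          static void lut_client_del(struct obd_export *exp)
          {
                  printf("dropping per-client last_rcvd record\n");
          }

          /* Drop the client's recovery record only when the target is replayable
           * and this is a real disconnect (or the export itself already failed),
           * not a failover-style shutdown; keeping the record is what lets a
           * reconnecting client see a last_committed that has not moved back. */
          static void disconnect_export(struct obd_export *exp)
          {
                  if (exp->exp_obd->obd_replayable &&
                      (!exp->exp_obd->obd_fail || exp->exp_failed))
                          lut_client_del(exp);
          }

          int main(void)
          {
                  struct obd_device obd = { .obd_replayable = true, .obd_fail = true };
                  struct obd_export exp = { .exp_obd = &obd, .exp_failed = false };

                  disconnect_export(&exp);   /* failover shutdown: record is kept */
                  return 0;
          }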

          Actually, looking more closely at all the logs I don't see any evidence that a client was evicted. But I am able to reproduce the failure easily just by restarting an OSS with the above procedure. Here are the console logs from the restarted server and a client that reported the issue. Note that not all clients hit this.

          OSS grove-641 (172.20.4.41@o2ib500)

          <ConMan> Console [grove641] log at 2012-05-18 12:00:00 PDT.
          
          <ConMan> Console [grove641] joined by <root@localhost> on pts/16 at 05-18 12:16.
          2012-05-18 12:16:46 Lustre: Failing over lstest-OST0281
          2012-05-18 12:16:46 LustreError: 137-5: lstest-OST0281: Not available for connect from 172.20.3.192@o2ib500 (stopping)
          2012-05-18 12:16:46 LustreError: 137-5: lstest-OST0281: Not available for connect from 172.20.3.200@o2ib500 (stopping)
          2012-05-18 12:16:48 LustreError: 137-5: lstest-OST0281: Not available for connect from 172.20.3.155@o2ib500 (stopping)
          2012-05-18 12:16:48 LustreError: Skipped 1 previous similar message
          2012-05-18 12:16:49 Lustre: lstest-OST0281: Unmounted
          2012-05-18 12:17:11 LustreError: 11-0: MGC172.20.5.2@o2ib500: Communicating with 172.20.5.2@o2ib500, operation llog_origin_handle_create failed with -2
          2012-05-18 12:17:11 LustreError: 64142:0:(mgc_request.c:250:do_config_log_add()) failed processing sptlrpc log: -2
          2012-05-18 12:17:12 Lustre: lstest-OST0281: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-450
          2012-05-18 12:17:12 Lustre: lstest-OST0281: Mounted grove641/ost0
          2012-05-18 12:17:17 Lustre: lstest-OST0281: Will be in recovery for at least 2:30, or until 257 clients reconnect.
          2012-05-18 12:17:33 Lustre: lstest-OST0281: Recovery over after 0:16, of 257 clients 257 recovered and 0 were evicted.
          

          Client grove592 (172.20.3.192@o2ib500) - No issue

          2012-05-18 12:16:46 LustreError: 11-0: lstest-OST0281-osc-ffff88080c6cc800: Communicating with 172.20.4.41@o2ib500, operation obd_ping failed with -107
          2012-05-18 12:16:46 Lustre: lstest-OST0281-osc-ffff88080c6cc800: Connection to lstest-OST0281 (at 172.20.4.41@o2ib500) was lost; in progress operations using this service will wait for recovery to complete
          2012-05-18 12:17:33 Lustre: lstest-OST0281-osc-ffff88080c6cc800: Connection restored to lstest-OST0281 (at 172.20.4.41@o2ib500)
          

          Client grove443 (172.20.3.43@o2ib500) - Went back in time

          2012-05-18 12:17:22 Lustre: lstest-OST0281-osc-ffff8807fe7abc00: Connection to lstest-OST0281 (at 172.20.4.41@o2ib500) was lost; in progress operations using this service will wait for recovery to complete
          2012-05-18 12:17:22 Lustre: Skipped 127 previous similar messages
          2012-05-18 12:17:22 LustreError: 13990:0:(import.c:930:ptlrpc_connect_interpret()) lstest-OST0281_UUID went back in time (transno 4295846814 was previously committed, server now claims 4295242246)!  See https://bugzilla.lustre.org/show_bug.cgi?id=9646
          2012-05-18 12:17:22 LustreError: 13990:0:(import.c:930:ptlrpc_connect_interpret()) Skipped 71 previous similar messages
          2012-05-18 12:17:33 Lustre: lstest-OST0281-osc-ffff8807fe7abc00: Connection restored to lstest-OST0281 (at 172.20.4.41@o2ib500)
          2012-05-18 12:17:33 Lustre: Skipped 96 previous similar messages
          
          behlendorf Brian Behlendorf added a comment

          Note that this can happen if the client was evicted and reconnected. Today the last_committed value is per-export; the server data also carries such a value, but it is not updated on every transaction. So if the client's export data was deleted, the largest last_committed value still on record is reported back, and it can be lower. So please check whether the client was evicted or not.

          tappro Mikhail Pershin added a comment
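
          A toy model of the situation described above - the names and layout below are illustrative, not the Lustre last_rcvd format - showing how dropping an evicted client's per-export record can make the value reported on reconnect move backwards:

          #include <stdint.h>
          #include <stdio.h>

          #define MAX_CLIENTS 4

          struct server_state {
                  uint64_t per_export[MAX_CLIENTS];  /* last_committed per client, 0 = deleted slot */
                  uint64_t server_wide;              /* updated only occasionally, not per transaction */
          };

          /* What the server can claim to a reconnecting client whose own export
           * record is gone: the largest value it still has on record. */
          static uint64_t reported_last_committed(const struct server_state *s)
          {
                  uint64_t best = s->server_wide;

                  for (int i = 0; i < MAX_CLIENTS; i++) {
                          if (s->per_export[i] > best)
                                  best = s->per_export[i];
                  }
                  return best;
          }

          int main(void)
          {
                  /* Toy values: client 0 had seen 4295846814 committed. */
                  struct server_state s = {
                          .per_export = { 4295846814ULL, 4295242246ULL, 0, 0 },
                          .server_wide = 4295000000ULL,
                  };

                  /* Evict client 0 and drop its slot. */
                  s.per_export[0] = 0;

                  /* The server can now only claim 4295242246, so client 0 reports
                   * "went back in time" even though nothing was rolled back. */
                  printf("server now claims %llu\n",
                         (unsigned long long)reported_last_committed(&s));
                  return 0;
          }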

          People

            Assignee: tappro Mikhail Pershin
            Reporter: behlendorf Brian Behlendorf
            Votes: 0
            Watchers: 4

            Dates

              Created:
              Updated:
              Resolved: