Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • None
    • None
    • 3
    • Orion
    • 3019

    Description

      Observed when manually restarting ZFS servers, specifically:

      1) umount /mnt/lustre/server
      2) export the zpool
      3) import the zpool
      4) mount zpool/server /mnt/lustre/server

      There is some suspicion that we may be rolling back to a previous uberblock due to how things were shut down, but that is only a suspicion until we have more data.
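
      For reference, a rough sketch of the restart sequence above as shell commands; the pool and dataset names here are placeholders, not the actual configuration:

      umount /mnt/lustre/server                               # 1) unmount the server
      zpool export lustre-pool                                # 2) export the pool
      zpool import lustre-pool                                # 3) re-import the pool
      mount -t lustre lustre-pool/server /mnt/lustre/server   # 4) remount the target
      # Optionally, 'zdb -u lustre-pool' before and after could help confirm or rule
      # out the uberblock rollback suspicion.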

      2012-05-17 11:24:31 LustreError: 11-0: lstest-OST02e6-osc-ffff88063d2a2000: Communicating with 172.20.4.142@o2ib500, operation obd_ping failed with -107
      2012-05-17 11:24:31 Lustre: lstest-OST02ef-osc-ffff88063d2a2000: Connection to lstest-OST02ef (at 172.20.4.151@o2ib500) was lost; in progress operations using this service will wait for recovery to complete
      2012-05-17 11:24:31 Lustre: Skipped 19 previous similar messages
      2012-05-17 11:24:31 LustreError: Skipped 27 previous similar messages
      2012-05-17 11:24:53 Lustre: lstest-OST02b8-osc-ffff88063d2a2000: Connection to lstest-OST02b8 (at 172.20.4.96@o2ib500) was lost; in progress operations using this service will wait for recovery to complete
      2012-05-17 11:24:53 Lustre: Skipped 30 previous similar messages
      2012-05-17 11:24:53 LustreError: 23000:0:(import.c:930:ptlrpc_connect_interpret()) lstest-OST029a_UUID went back in time (transno 4315649123 was previously committed, server now claims 4295433349)!  See https://bugzilla.lustre.org/show_bug.cgi?id=9646
      2012-05-17 11:24:53 LustreError: 23000:0:(import.c:930:ptlrpc_connect_interpret()) Skipped 11 previous similar messages
      2012-05-17 11:24:57 Lustre: lstest-OST0286-osc-ffff88063d2a2000: Connection restored to lstest-OST0286 (at 172.20.4.46@o2ib500)
      2012-05-17 11:24:57 Lustre: Skipped 12 previous similar messages
      2012-05-17 11:25:02 Lustre: lstest-OST02e9-osc-ffff88063d2a2000: Connection restored to lstest-OST02e9 (at 172.20.4.145@o2ib500)
      2012-05-17 11:25:47 Lustre: lstest-OST02e3-osc-ffff88063d2a2000: Connection restored to lstest-OST02e3 (at 172.20.4.139@o2ib500)
      2012-05-17 11:25:47 Lustre: Skipped 25 previous similar messages
      2012-05-17 11:25:57 Lustre: lstest-OST028d-osc-ffff88063d2a2000: Connection restored to lstest-OST028d (at 172.20.4.53@o2ib500)
      2012-05-17 11:25:57 Lustre: Skipped 3 previous similar messages
      2012-05-17 11:26:11 LustreError: 23000:0:(import.c:930:ptlrpc_connect_interpret()) lstest-OST02e4_UUID went back in time (transno 4320876109 was previously committed, server now claims 4296801416)!  See https://bugzilla.lustre.org/show_bug.cgi?id=9646
      2012-05-17 11:26:11 LustreError: 23000:0:(import.c:930:ptlrpc_connect_interpret()) Skipped 67 previous similar messages
      

        Activity

          [LU-2086] Server went back in time
          ian Ian Colle (Inactive) added a comment - - edited

          The fix from patch 3519 was already landed to master as part of another patch.

          nedbass Ned Bass (Inactive) added a comment - Patch for master: http://review.whamcloud.com/3519
          nedbass Ned Bass (Inactive) added a comment - Patch for orion: http://review.whamcloud.com/3518

          nedbass Ned Bass (Inactive) added a comment -

          It looks like ofd->ofd_last_transno gets reset to 4294967296 (1 << 32) when a target is restarted on the server. That is, lut_boot_epoch_update() is called with lut->lus_last_transno == 0. Then in ofd_txn_stop_cb() new transaction numbers are assigned starting from there and saved in the last_rcvd file.

          } else if (info->fti_transno == 0) {
                  info->fti_transno = ++ofd->ofd_last_transno;
          ...
                  return ofd_last_rcvd_update(info, txn);
          }
          

          However, exp->exp_last_committed keeps the value originally read from last_rcvd until the new transaction numbers overtake it. So the server keeps sending the client a last_committed value that is larger than the transaction numbers actually being assigned, while writing the newer, lower numbers to last_rcvd. Now if the server restarts, it sends the lcd_last_transno from last_rcvd to the client; if the saved transaction numbers never grew larger than the old last_committed value, the client thinks the server went back in time.

          To clarify, here's some debug data I captured with systemtap. I unmounted the OST at timestamp 1343860077 and the client connection was restored at 1343860152. It prints ofd->ofd_lut->lut_last_transno in ofd_txn_stop_cb() and lut->lus_last_transno in lut_boot_epoch_update(). There was only one active client.

          1343860077 ll_ost_io_114 (24102) ofd_txn_stop_cb lut_last_transno 4295182382
          1343860152 tgt_recov (24731) lut_boot_epoch_update() lut_last_transno 0
          1343860152 ll_ost_io_127 24728 ofd_txn_stop_cb lut_last_transno 4294967296
          1343860152 ll_ost_io_125 24726 ofd_txn_stop_cb lut_last_transno 4294967297
          1343860152 ll_ost_io_120 24721 ofd_txn_stop_cb lut_last_transno 4294967298
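
          To put numbers on that, here is an illustrative shell calculation, assuming the boot epoch occupies the high 32 bits of the transno (consistent with the 1 << 32 starting value above) and using the values from the systemtap output:

          old=4295182382                  # lut_last_transno just before the unmount
          new=$(( 1 << 32 ))              # 4294967296, where the restarted target resumes
          echo "old transno: epoch $(( old >> 32 )), sequence $(( old & 0xffffffff ))"
          echo "transactions needed to catch up: $(( old - new ))"

          That is, after this restart the target would have needed 215,086 new transactions before its last_committed caught up with the value the client had already seen.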
          

          prakash Prakash Surya (Inactive) added a comment -

          Applied the following patch and ran the replay-single reproducers to generate the log file.

          diff --git i/lustre/ptlrpc/import.c w/lustre/ptlrpc/import.c
          index 7e68efb..7bf2088 100644
          --- i/lustre/ptlrpc/import.c
          +++ w/lustre/ptlrpc/import.c
          @@ -920,6 +920,9 @@ static int ptlrpc_connect_interpret(const struct lu_env *env,
                   if (lustre_msg_get_last_committed(request->rq_repmsg) > 0 &&
                       lustre_msg_get_last_committed(request->rq_repmsg) <
                       aa->pcaa_peer_committed) {
          +#ifdef __KERNEL__
          +               libcfs_debug_dumplog();
          +#endif
                           CERROR("%s went back in time (transno "LPD64
                                  " was previously committed, server now claims "LPD64
                                  ")!  See https://bugzilla.lustre.org/show_bug.cgi?"
          diff --git i/lustre/tests/cfg/local.sh w/lustre/tests/cfg/local.sh
          index d537924..6cc0f61 100644
          --- i/lustre/tests/cfg/local.sh
          +++ w/lustre/tests/cfg/local.sh
          @@ -48,12 +48,12 @@ STRIPE_BYTES=${STRIPE_BYTES:-1048576}
           STRIPES_PER_OBJ=${STRIPES_PER_OBJ:-0}
           SINGLEMDS=${SINGLEMDS:-"mds1"}
           TIMEOUT=${TIMEOUT:-20}
          -PTLDEBUG=${PTLDEBUG:-"vfstrace rpctrace dlmtrace neterror ha config ioctl super"}
          +PTLDEBUG=${PTLDEBUG:-"-1"}
           SUBSYSTEM=${SUBSYSTEM:-"all -lnet -lnd -pinger"}
           
           # promise 2MB for every cpu
           if [ -f /sys/devices/system/cpu/possible ]; then
          -    _debug_mb=$((($(cut -d "-" -f 2 /sys/devices/system/cpu/possible)+1)*2))
          +    _debug_mb=$((($(cut -d "-" -f 2 /sys/devices/system/cpu/possible)+1)*128))
           else
               _debug_mb=$(($(getconf _NPROCESSORS_CONF)*2))
           fi
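
          For anyone reproducing this without editing local.sh, a roughly equivalent runtime setup is sketched below; the 512 MB buffer size is an example value, not what was used here:

          lctl set_param debug=-1        # same effect as PTLDEBUG="-1"
          lctl set_param debug_mb=512    # enlarge the kernel debug buffer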
          

          prakash Prakash Surya (Inactive) added a comment -

          Also, running recovery-small in my VM doesn't appear to trigger the message.


          prakash Prakash Surya (Inactive) added a comment -

          Looking back on the logs from my acceptance-small and replay-single runs over the past few days, the common theme is tests 59, 47, 61c, and 90 of replay-single hitting this issue. However, none of those tests are reported as having failed.
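
          If it helps narrow this down, those subtests can be rerun individually with something like the following, assuming the test framework's usual ONLY variable:

          cd lustre/tests
          ONLY="47 59 61c 90" sh replay-single.sh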


          ian Ian Colle (Inactive) added a comment -

          Prakash - that's weird - can you tell us during which subtest you hit that? We've passed replay-single routinely, and recovery-small test_54 should find that also.


          prakash Prakash Surya (Inactive) added a comment -

          Looks like I can hit the problem on my VM running the replay-single test:

          LustreError: 996:0:(import.c:928:ptlrpc_connect_interpret()) lustre-OST0000_UUID went back in time (transno 4294969953 was previously committed, server now claims 4294967730)!  See https://bugzilla.lustre.org/show_bug.cgi?id=9646
          

          prakash Prakash Surya (Inactive) added a comment -

          Mike, I'm going to try and dig into this issue. It's easily reproducible when failing over the servers under load. I'll try to collect and post some Lustre debug logs.
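
          A minimal collection sequence for that, assuming the standard lctl debug interface (the output path is only an example):

          lctl set_param debug=-1                        # capture all debug message types
          lctl clear                                     # empty the kernel debug buffer
          # ... fail over the servers under load ...
          lctl dk > /tmp/lustre-debug.$(hostname).log    # dump the debug log for posting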


          People

            Assignee: tappro Mikhail Pershin
            Reporter: behlendorf Brian Behlendorf
            Votes: 0
            Watchers: 4
