Details

    • Bug
    • Resolution: Fixed
    • Major
    • None
    • None
    • 3
    • Orion
    • 3019

    Description

      Observed when manually restarting zfs servers, specifically:

      1) umount /mnt/lustre/server
      2) export zpool
      3) import zpool
      4) mount zpool/server /mnt/lustre/server

      There is some suspicion that we may be rolling back to a previous uberblock due to how things were shutdown but that's just a suspicion at the moment until we have more data.

      2012-05-17 11:24:31 LustreError: 11-0: lstest-OST02e6-osc-ffff88063d2a2000: Communicating with 172.20.4.142@o2ib500, operation obd_ping failed with -107
      2012-05-17 11:24:31 Lustre: lstest-OST02ef-osc-ffff88063d2a2000: Connection to lstest-OST02ef (at 172.20.4.151@o2ib500) was lost; in progress operations using this service will wait for recovery to complete
      2012-05-17 11:24:31 Lustre: Skipped 19 previous similar messages
      2012-05-17 11:24:31 LustreError: Skipped 27 previous similar messages
      2012-05-17 11:24:53 Lustre: lstest-OST02b8-osc-ffff88063d2a2000: Connection to lstest-OST02b8 (at 172.20.4.96@o2ib500) was lost; in progress operations using this service will wait for recovery to complete
      2012-05-17 11:24:53 Lustre: Skipped 30 previous similar messages
      2012-05-17 11:24:53 LustreError: 23000:0:(import.c:930:ptlrpc_connect_interpret()) lstest-OST029a_UUID went back in time (transno 4315649123 was previously committed, server now claims 4295433349)!  See https://bugzilla.lustre.org/show_bug.cgi?id=9646
      2012-05-17 11:24:53 LustreError: 23000:0:(import.c:930:ptlrpc_connect_interpret()) Skipped 11 previous similar messages
      2012-05-17 11:24:57 Lustre: lstest-OST0286-osc-ffff88063d2a2000: Connection restored to lstest-OST0286 (at 172.20.4.46@o2ib500)
      2012-05-17 11:24:57 Lustre: Skipped 12 previous similar messages
      2012-05-17 11:25:02 Lustre: lstest-OST02e9-osc-ffff88063d2a2000: Connection restored to lstest-OST02e9 (at 172.20.4.145@o2ib500)
      2012-05-17 11:25:47 Lustre: lstest-OST02e3-osc-ffff88063d2a2000: Connection restored to lstest-OST02e3 (at 172.20.4.139@o2ib500)
      2012-05-17 11:25:47 Lustre: Skipped 25 previous similar messages
      2012-05-17 11:25:57 Lustre: lstest-OST028d-osc-ffff88063d2a2000: Connection restored to lstest-OST028d (at 172.20.4.53@o2ib500)
      2012-05-17 11:25:57 Lustre: Skipped 3 previous similar messages
      2012-05-17 11:26:11 LustreError: 23000:0:(import.c:930:ptlrpc_connect_interpret()) lstest-OST02e4_UUID went back in time (transno 4320876109 was previously committed, server now claims 4296801416)!  See https://bugzilla.lustre.org/show_bug.cgi?id=9646
      2012-05-17 11:26:11 LustreError: 23000:0:(import.c:930:ptlrpc_connect_interpret()) Skipped 67 previous similar messages
      

      Attachments

        Activity

          [LU-2086] Server went back in time
          ian Ian Colle (Inactive) added a comment - - edited

          The fix from patch 3519 was already landed to master as part of another patch.

          ian Ian Colle (Inactive) added a comment - - edited The fix from patch 3519 was already landed to master as part of another patch.
          nedbass Ned Bass (Inactive) added a comment - Patch for master: http://review.whamcloud.com/3519
          nedbass Ned Bass (Inactive) added a comment - Patch for orion: http://review.whamcloud.com/3518

          It looks like ofd->ofd_last_transno gets reset to 4294967296 (1 << 32) when a target is restarted on the server. That is, lut_boot_epoch_update() is called with lut->lus_last_transno == 0. Then in ofd_txn_stop_cb() new transaction numbers are assigned starting from there and saved in the last_rcvd file.

          199         } else if (info->fti_transno == 0) {
          200                 info->fti_transno = ++ ofd->ofd_last_transno;
          ...
          224         return ofd_last_rcvd_update(info, txn);
          225 }
          

          However exp->exp_last_committed keeps the value originally read from last_rcvd until the new transaction numbers overtake it. So the server keeps sending a larger last_committed transaction number than what is actually being used by the client, but it writes the newer, lower numbers to last_rcvd. Now if the server restarts it sends the lcd_last_transno from last_rcvd to the client. But if the saved transaction numbers never grew larger than the old last_committed value, the client thinks the server went back in time.

          To clarify, here's some debug data I captured with systemtap. I unmounted the OST at timestamp 1343860077 and the client connection was restored at 1343860152. It prints ofd->ofd_lut->lut_last_transno in ofd_txn_stop_cb() and lut->lus_last_transno in lut_boot_epoch_update(). There was only one active client.

          1343860077 ll_ost_io_114 (24102) ofd_txn_stop_cb lut_last_transno 4295182382
          1343860152 tgt_recov (24731) lut_boot_epoch_update() lut_last_transno 0
          1343860152 ll_ost_io_127 24728 ofd_txn_stop_cb lut_last_transno 4294967296
          1343860152 ll_ost_io_125 24726 ofd_txn_stop_cb lut_last_transno 4294967297
          1343860152 ll_ost_io_120 24721 ofd_txn_stop_cb lut_last_transno 4294967298
          
          nedbass Ned Bass (Inactive) added a comment - It looks like ofd->ofd_last_transno gets reset to 4294967296 (1 << 32) when a target is restarted on the server. That is, lut_boot_epoch_update() is called with lut->lus_last_transno == 0. Then in ofd_txn_stop_cb() new transaction numbers are assigned starting from there and saved in the last_rcvd file. 199 } else if (info->fti_transno == 0) { 200 info->fti_transno = ++ ofd->ofd_last_transno; ... 224 return ofd_last_rcvd_update(info, txn); 225 } However exp->exp_last_committed keeps the value originally read from last_rcvd until the new transaction numbers overtake it. So the server keeps sending a larger last_committed transaction number than what is actually being used by the client, but it writes the newer, lower numbers to last_rcvd. Now if the server restarts it sends the lcd_last_transno from last_rcvd to the client. But if the saved transaction numbers never grew larger than the old last_committed value, the client thinks the server went back in time. To clarify, here's some debug data I captured with systemtap. I unmounted the OST at timestamp 1343860077 and the client connection was restored at 1343860152. It prints ofd->ofd_lut->lut_last_transno in ofd_txn_stop_cb() and lut->lus_last_transno in lut_boot_epoch_update(). There was only one active client. 1343860077 ll_ost_io_114 (24102) ofd_txn_stop_cb lut_last_transno 4295182382 1343860152 tgt_recov (24731) lut_boot_epoch_update() lut_last_transno 0 1343860152 ll_ost_io_127 24728 ofd_txn_stop_cb lut_last_transno 4294967296 1343860152 ll_ost_io_125 24726 ofd_txn_stop_cb lut_last_transno 4294967297 1343860152 ll_ost_io_120 24721 ofd_txn_stop_cb lut_last_transno 4294967298

          Applied the following patch and ran the replay-single reproducers to generate the log file.

          diff --git i/lustre/ptlrpc/import.c w/lustre/ptlrpc/import.c
          index 7e68efb..7bf2088 100644
          --- i/lustre/ptlrpc/import.c
          +++ w/lustre/ptlrpc/import.c
          @@ -920,6 +920,9 @@ static int ptlrpc_connect_interpret(const struct lu_env *env,
                   if (lustre_msg_get_last_committed(request->rq_repmsg) > 0 &&
                       lustre_msg_get_last_committed(request->rq_repmsg) <
                       aa->pcaa_peer_committed) {
          +#ifdef __KERNEL__
          +               libcfs_debug_dumplog();
          +#endif
                           CERROR("%s went back in time (transno "LPD64
                                  " was previously committed, server now claims "LPD64
                                  ")!  See https://bugzilla.lustre.org/show_bug.cgi?"
          diff --git i/lustre/tests/cfg/local.sh w/lustre/tests/cfg/local.sh
          index d537924..6cc0f61 100644
          --- i/lustre/tests/cfg/local.sh
          +++ w/lustre/tests/cfg/local.sh
          @@ -48,12 +48,12 @@ STRIPE_BYTES=${STRIPE_BYTES:-1048576}
           STRIPES_PER_OBJ=${STRIPES_PER_OBJ:-0}
           SINGLEMDS=${SINGLEMDS:-"mds1"}
           TIMEOUT=${TIMEOUT:-20}
          -PTLDEBUG=${PTLDEBUG:-"vfstrace rpctrace dlmtrace neterror ha config ioctl super"}
          +PTLDEBUG=${PTLDEBUG:-"-1"}
           SUBSYSTEM=${SUBSYSTEM:-"all -lnet -lnd -pinger"}
           
           # promise 2MB for every cpu
           if [ -f /sys/devices/system/cpu/possible ]; then
          -    _debug_mb=$((($(cut -d "-" -f 2 /sys/devices/system/cpu/possible)+1)*2))
          +    _debug_mb=$((($(cut -d "-" -f 2 /sys/devices/system/cpu/possible)+1)*128))
           else
               _debug_mb=$(($(getconf _NPROCESSORS_CONF)*2))
           fi
          
          prakash Prakash Surya (Inactive) added a comment - Applied the following patch and ran the replay-single reproducers to generate the log file. diff --git i/lustre/ptlrpc/import.c w/lustre/ptlrpc/import.c index 7e68efb..7bf2088 100644 --- i/lustre/ptlrpc/import.c +++ w/lustre/ptlrpc/import.c @@ -920,6 +920,9 @@ static int ptlrpc_connect_interpret(const struct lu_env *env, if (lustre_msg_get_last_committed(request->rq_repmsg) > 0 && lustre_msg_get_last_committed(request->rq_repmsg) < aa->pcaa_peer_committed) { +#ifdef __KERNEL__ + libcfs_debug_dumplog(); +#endif CERROR("%s went back in time (transno "LPD64 " was previously committed, server now claims "LPD64 ")! See https://bugzilla.lustre.org/show_bug.cgi?" diff --git i/lustre/tests/cfg/local.sh w/lustre/tests/cfg/local.sh index d537924..6cc0f61 100644 --- i/lustre/tests/cfg/local.sh +++ w/lustre/tests/cfg/local.sh @@ -48,12 +48,12 @@ STRIPE_BYTES=${STRIPE_BYTES:-1048576} STRIPES_PER_OBJ=${STRIPES_PER_OBJ:-0} SINGLEMDS=${SINGLEMDS:-"mds1"} TIMEOUT=${TIMEOUT:-20} -PTLDEBUG=${PTLDEBUG:-"vfstrace rpctrace dlmtrace neterror ha config ioctl super"} +PTLDEBUG=${PTLDEBUG:-"-1"} SUBSYSTEM=${SUBSYSTEM:-"all -lnet -lnd -pinger"} # promise 2MB for every cpu if [ -f /sys/devices/system/cpu/possible ]; then - _debug_mb=$((($(cut -d "-" -f 2 /sys/devices/system/cpu/possible)+1)*2)) + _debug_mb=$((($(cut -d "-" -f 2 /sys/devices/system/cpu/possible)+1)*128)) else _debug_mb=$(($(getconf _NPROCESSORS_CONF)*2)) fi

          Also, running recovery-small in my VM doesn't appear to trigger the message.

          prakash Prakash Surya (Inactive) added a comment - Also, running recovery-small in my VM doesn't appear to trigger the message.

          People

            tappro Mikhail Pershin
            behlendorf Brian Behlendorf
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: