  Lustre / LU-369

ASSERTION(oti && oti->oti_thread && oti->oti_thread->t_watchdog) failed in quota_chk_acq_common()

Details

    • Bug
    • Resolution: Fixed
    • Major
    • None
    • Lustre 2.0.0
    • None
    • 3
    • 4960

    Description

      ASSERTION(oti && oti->oti_thread && oti->oti_thread->t_watchdog) failed in quota_chk_acq_common()
      -------------------------------------------------------------------------------------------------

      The following system crash was reported by a TGCC Bull customer after a controller problem occurred on a RAID array.

      On April 2, at 11:00:35, the syslog recorded a "link down" lasting about 6 minutes, during which a lot of I/O errors and Lustre errors on the OSTs were logged.
      Error examples:
      end_request: I/O error, dev sdbb, sector 8074493953
      Lustre: 22430:0:(filter_io_26.c:747:filter_commitrw_write()) scratch2-OST000b: slow direct_io 30s
      LustreError: 16365:0:(obd.h:1581:obd_transno_commit_cb()) scratch2-OST000b: transno 14007590 commit error: 2
      Lustre: 26055:0:(filter_io_26.c:684:filter_commitrw_write()) scratch2-OST000b: slow i_mutex 30s
      LustreError: 26055:0:(fsfilt-ldiskfs.c:483:fsfilt_ldiskfs_brw_start()) can't get handle for 582 credits: rc = -30
      LustreError: 22410:0:(filter_io_26.c:691:filter_commitrw_write()) error starting transaction: rc = -30
      lustre-ioerror Check device : /dev/sdbb

      When the link came back up, the 15 LUNs were no longer available; the system continued to record the same errors and finally crashed at 11:25:42 (more than 40000 I/O errors). We do not have a dump for this first crash.

      The system then restarted, and the syslog shows that one of the LUNs was not detected (only 14 LUNs available). Lustre recorded the following errors (19 = ENODEV) for the LUN that was not detected:
      LustreError: 13493:0:(obd_mount.c:1343:server_kernel_mount()) premount /dev/sdb:0x0 ldiskfs failed: -19 Is the ldiskfs module available?
      LustreError: 13493:0:(obd_mount.c:1665:server_fill_super()) Unable to mount device /dev/sdb: -19
      LustreError: 13493:0:(obd_mount.c:2136:lustre_fill_super()) Unable to mount (-19)

      Lustre then started recovery on all the OSTs (tgt_recov thread) and logged the following error several times for the LUN that was not detected:
      LustreError: 137-5: UUID 'scratch2-OST0000_UUID' is not available for connect (no target)
      LustreError: 13946:0:(ldlm_lib.c:2123:target_send_reply_msg()) @@@ processing error (19) req@ffff880431930400 x1364628274475046/t0(0) o8><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1301737426 ref 1 fl Interpret:/ffffffff/ffffffff rc -19/-1

      After approximately 1 minute the system crashed on the following assertion in quota_interface.c:451:quota_chk_acq_common():
      ASSERTION(oti && oti->oti_thread && oti->oti_thread->t_watchdog)
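
      For context, the code around this assertion in quota_chk_acq_common() (lustre/quota/quota_interface.c) looks roughly as follows in Lustre 2.0.0 (approximate excerpt, the exact layout may differ between versions). The service thread's watchdog is disabled while the thread sleeps waiting for the quota master, which is why the code expects oti->oti_thread->t_watchdog to be valid:

      cfs_spin_lock(&qctxt->lqc_lock);
      if (!qctxt->lqc_import && oti) {
              cfs_spin_unlock(&qctxt->lqc_lock);

              /* the crash reported here: t_watchdog is NULL */
              LASSERT(oti && oti->oti_thread &&
                      oti->oti_thread->t_watchdog);
              lc_watchdog_disable(oti->oti_thread->t_watchdog);

              /* sleep until the quota master is reachable again */
              l_wait_event(qctxt->lqc_wait_for_qmaster,
                           check_qm(qctxt), &lwi);

              lc_watchdog_touch(oti->oti_thread->t_watchdog,
                                CFS_GET_TIMEOUT(oti->oti_thread->t_svc));
      } else {
              cfs_spin_unlock(&qctxt->lqc_lock);
      }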

      This crash occurred 6 times, as long as one of the OSTs was not detected. We have 3 dumps available, all of which show the same stack:

      PID: 19634 TASK: ffff88045728d5a0 CPU: 2 COMMAND: "tgt_recov"
      #0 [ffff8804493112d8] machine_kexec at ffffffff8102e67b
      #1 [ffff880449311338] crash_kexec at ffffffff810a9af8
      #2 [ffff8804493113b8] show_trace at ffffffff810102d5
      #3 [ffff880449311408] panic at ffffffff81452147
      #4 [ffff8804493114d8] libcfs_assertion_failed at ffffffffa01157d6
      #5 [ffff880449311528] quota_chk_acq_common at ffffffffa071aa32
      #6 [ffff8804493116a8] filter_commitrw_write at ffffffffa090a488
      #7 [ffff880449311898] filter_commitrw at ffffffffa08fd535
      #8 [ffff880449311958] obd_commitrw at ffffffffa08b4ffa
      #9 [ffff8804493119d8] ost_brw_write at ffffffffa08bd644
      #10 [ffff880449311bb8] ost_handle at ffffffffa08c237a

      In these 3 dumps, "oti" and "oti->oti_thread" are valid pointers, but "oti->oti_thread->t_watchdog" is a null pointer.

      The only place where this pointer is reset is in "ptlrpc_main()", when the service is stopped.
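
      For reference, the shutdown path in ptlrpc_main() (lustre/ptlrpc/service.c) tears the watchdog down roughly as follows before the thread exits (approximate excerpt):

      /* service thread is stopping: destroy its watchdog */
      lc_watchdog_delete(thread->t_watchdog);
      thread->t_watchdog = NULL;

      So a request still being processed while its service is shutting down could see oti->oti_thread->t_watchdog already set to NULL, which would match what the dumps show.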

      A possible fix could be to remove the LASSERT and modify the "if" statement, as is done 50 lines above, by replacing these lines:
      if (!qctxt->lqc_import && oti) {
              cfs_spin_unlock(&qctxt->lqc_lock);

              LASSERT(oti && oti->oti_thread &&
                      oti->oti_thread->t_watchdog);

      with:

      if (!qctxt->lqc_import && oti && oti->oti_thread &&
          oti->oti_thread->t_watchdog) {
              cfs_spin_unlock(&qctxt->lqc_lock);

      But I'm wondering whether this is enough, or whether an additional test must be added to exit the while loop; see the sketch below.
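
      To illustrate that concern, here is a rough, untested sketch of what such an exit test could look like inside the existing retry loop (names and surrounding calls taken from the excerpt above; a hypothesis for discussion, not a proposed patch):

      cfs_spin_lock(&qctxt->lqc_lock);
      if (!qctxt->lqc_import && oti && oti->oti_thread &&
          oti->oti_thread->t_watchdog) {
              cfs_spin_unlock(&qctxt->lqc_lock);

              /* safe: the watchdog is known to exist here */
              lc_watchdog_disable(oti->oti_thread->t_watchdog);
              l_wait_event(qctxt->lqc_wait_for_qmaster,
                           check_qm(qctxt), &lwi);
              lc_watchdog_touch(oti->oti_thread->t_watchdog,
                                CFS_GET_TIMEOUT(oti->oti_thread->t_svc));
      } else {
              cfs_spin_unlock(&qctxt->lqc_lock);
              /* no import and no watchdog: the service thread is most
               * likely stopping, so leave the loop instead of sleeping
               * and retrying forever */
              if (!qctxt->lqc_import)
                      break;
      }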

          People

            Assignee: niu Niu Yawei (Inactive)
            Reporter: patrick.valentin Patrick Valentin (Inactive)