Lustre / LU-17406

sanity-flr test_50A: watchdog: BUG: soft lockup - CPU#0 stuck for 22s


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.16.0
    • Labels: None
    • Severity: 3

    Description

      This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

      This issue relates to the following test suite run:
      https://testing.whamcloud.com/test_sets/112570ae-2e64-4c60-bd13-b1447c7934fa

      test_50A failed with the following error after both CPUs were locked up:

      onyx-99vm1 crash during sanity-flr test_50A
      

      Test session details:
      clients: https://build.whamcloud.com/job/lustre-reviews/101181 - 4.18.0-477.27.1.el8_8.x86_64
      servers: https://build.whamcloud.com/job/lustre-reviews/101181 - 4.18.0-477.27.1.el8_lustre.x86_64

       Lustre: DEBUG MARKER: == sanity-flr test 50A: mirror split update layout generation ===== 19:25:25 (1704741925)
       Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds1' ' /proc/mounts || true
       Lustre: DEBUG MARKER: umount -d /mnt/lustre-mds1
       Lustre: Failing over lustre-MDT0000
       watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [ldlm_bl_02:77462]
       watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [ldlm_bl_03:80014]
       CPU: 1 PID: 80014 Comm: ldlm_bl_03 4.18.0-477.27.1.el8_lustre.x86_64 #1
       CPU: 0 PID: 77462 Comm: ldlm_bl_02 4.18.0-477.27.1.el8_lustre.x86_64 #1
       RIP: 0010:cfs_hash_for_each_relax+0x17b/0x480 [libcfs]
       Call Trace:
        kvm_wait+0x58/0x60
        __pv_queued_spin_lock_slowpath+0x268/0x2a0
        cfs_hash_for_each_nolock+0x126/0x1f0 [libcfs]
        ldlm_reprocess_recovery_done+0x8b/0x100 [ptlrpc]
        _raw_spin_lock+0x1e/0x30
        cfs_hash_for_each_relax+0x14a/0x480 [libcfs]
        cfs_hash_for_each_nolock+0x126/0x1f0 [libcfs]
        ldlm_reprocess_recovery_done+0x8b/0x100 [ptlrpc]
        ldlm_export_cancel_locks+0x172/0x180 [ptlrpc]
        ldlm_export_cancel_locks+0x172/0x180 [ptlrpc]
        ldlm_bl_thread_main+0x6df/0x940 [ptlrpc]
        ldlm_bl_thread_main+0x6df/0x940 [ptlrpc]
        kthread+0x134/0x150
        kthread+0x134/0x150
        ret_from_fork+0x35/0x40
        ret_from_fork+0x35/0x40
      

      The duplicated lines in the stack trace appear to be the result of both CPUs printing to the console at the same time; both threads are in ldlm_export_cancel_locks() and contending on the same spinlock.
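      For illustration only, a minimal kernel-module sketch of that failure mode (a hypothetical module, not the Lustre code; names such as bl_thread and shared_lock are invented): two kthreads contend on one spinlock, the holder runs a long non-sleeping critical section (standing in for a long lock-hash walk under cfs_hash_for_each_nolock()), and the waiter spins in _raw_spin_lock() past the ~22s soft-lockup threshold:

       #include <linux/module.h>
       #include <linux/kthread.h>
       #include <linux/spinlock.h>
       #include <linux/delay.h>

       static DEFINE_SPINLOCK(shared_lock);   /* the single contended lock */
       static struct task_struct *t0, *t1;

       /* Each thread takes the lock and busy-waits without scheduling,
        * mimicking a long hash walk done entirely under a spinlock. */
       static int bl_thread(void *data)
       {
               while (!kthread_should_stop()) {
                       spin_lock(&shared_lock);  /* second CPU spins here,
                                                  * like _raw_spin_lock above */
                       mdelay(30 * 1000);        /* 30s busy loop, past the
                                                  * ~22s watchdog threshold */
                       spin_unlock(&shared_lock);
               }
               return 0;
       }

       static int __init lockup_sim_init(void)
       {
               t0 = kthread_run(bl_thread, NULL, "bl_sim_0");
               t1 = kthread_run(bl_thread, NULL, "bl_sim_1");
               return 0;
       }

       static void __exit lockup_sim_exit(void)
       {
               kthread_stop(t0);
               kthread_stop(t1);
       }

       module_init(lockup_sim_init);
       module_exit(lockup_sim_exit);
       MODULE_LICENSE("GPL");

      Note that in the trace above the waiting side shows kvm_wait()/__pv_queued_spin_lock_slowpath() rather than a raw spin, because these test nodes are KVM guests using paravirtualized queued spinlocks.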

      A similar stack also appeared in LU-17349.

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      sanity-flr test_50A - onyx-99vm1 crashed during sanity-flr test_50A

People

    Assignee: WC Triage
    Reporter: Maloo
