Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-3870

deadlock with journal

    XMLWordPrintable

Details

    • Bug
    • Resolution: Duplicate
    • Major
    • None
    • Lustre 2.5.0
    • None
    • RHEL6 + master branch (two days ago)
    • 3
    • 10042

    Description

      while testing a own patch i hit a deadlock with journal

      Lustre: DEBUG MARKER: == sanity test 156: Verification of tunables ============================== 21:09:39 (1377968979)
      Lustre: DEBUG MARKER: cancel_lru_locks osc start
      Lustre: DEBUG MARKER: cancel_lru_locks osc stop
      Lustre: DEBUG MARKER: cancel_lru_locks osc start
      Lustre: DEBUG MARKER: cancel_lru_locks osc stop
      INFO: task jbd2/loop0-8:11882 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      jbd2/loop0-8  D ffff8801225702b0     0 11882      2 0x00000080
       ffff88010811bcf0 0000000000000046 0000000000000002 ffff8801225702b0
       ffff88010811bc90 0000000000000046 ffff88010811bdd8 0000000000000282
       ffff88010b0845f8 ffff88010811bfd8 000000000000fc88 ffff88010b0845f8
      Call Trace:
       [<ffffffffa0041aed>] jbd2_journal_commit_transaction+0x19d/0x16e0 [jbd2]
       [<ffffffff81081d5c>] ? lock_timer_base+0x3c/0x70
       [<ffffffff815323f1>] ? _spin_lock_irqsave+0x81/0xa0
       [<ffffffff8108381b>] ? try_to_del_timer_sync+0x7b/0xe0
       [<ffffffff81099690>] ? autoremove_wake_function+0x0/0x40
       [<ffffffff8108390a>] ? del_timer_sync+0x8a/0xc0
       [<ffffffff81083880>] ? del_timer_sync+0x0/0xc0
       [<ffffffffa00481e1>] kjournald2+0xc1/0x220 [jbd2]
       [<ffffffff81099690>] ? autoremove_wake_function+0x0/0x40
       [<ffffffffa0048120>] ? kjournald2+0x0/0x220 [jbd2]
       [<ffffffff81099346>] kthread+0x96/0xa0
       [<ffffffff8100c1ca>] child_rip+0xa/0x20
       [<ffffffff81531fc0>] ? _spin_unlock_irq+0x30/0x40
       [<ffffffff8100bb10>] ? restore_args+0x0/0x30
       [<ffffffff810992b0>] ? kthread+0x0/0xa0
       [<ffffffff8100c1c0>] ? child_rip+0x0/0x20
      INFO: lockdep is turned off.
      INFO: task mdt01_003:12211 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      mdt01_003     D ffff88013fc26d00     0 12211      2 0x00000080
       ffff88010aa11870 0000000000000046 0000000000000000 ffffffff810b262d
       ffff88010aa117f0 0000000000000286 ffff88002cbd6cd8 0000000000000286
       ffff880106ac8cf8 ffff88010aa11fd8 000000000000fc88 ffff880106ac8cf8
      Call Trace:
       [<ffffffff810b262d>] ? trace_hardirqs_on+0xd/0x10
       [<ffffffffa0f456be>] cfs_waitq_wait+0xe/0x10 [libcfs]
       [<ffffffffa04ef057>] lu_object_find_at+0xb7/0x370 [obdclass]
       [<ffffffffa014d7e2>] ? fld_server_lookup+0x72/0x440 [fld]
       [<ffffffff810642f0>] ? default_wake_function+0x0/0x20
       [<ffffffffa0bdc6f4>] osd_object_find+0x34/0x270 [osd_ldiskfs]
       [<ffffffffa0be168e>] osd_index_ea_insert+0x4ae/0xdf0 [osd_ldiskfs]
       [<ffffffffa0be8f46>] ? osd_index_ea_delete+0x906/0xe60 [osd_ldiskfs]
       [<ffffffffa0b8a818>] ? __ldiskfs_journal_stop+0x68/0xa0 [ldiskfs]
       [<ffffffffa0449b84>] lod_index_insert+0x44/0xe0 [lod]
       [<ffffffffa04499d8>] ? lod_index_delete+0x38/0xd0 [lod]
       [<ffffffffa0b04d5e>] __mdd_orphan_add+0x33e/0x3b0 [mdd]
       [<ffffffffa0bdf43d>] ? osd_attr_get+0x13d/0x180 [osd_ldiskfs]
       [<ffffffffa0b0cb51>] mdd_finish_unlink+0xe1/0x380 [mdd]
       [<ffffffffa0b0f7c8>] mdd_unlink+0x9d8/0xe30 [mdd]
       [<ffffffffa0c7eb98>] mdo_unlink+0x18/0x50 [mdt]
       [<ffffffffa0c81f80>] mdt_reint_unlink+0x820/0x1010 [mdt]
       [<ffffffffa0c7e891>] mdt_reint_rec+0x41/0xe0 [mdt]
       [<ffffffffa0c66573>] mdt_reint_internal+0x4c3/0x780 [mdt]
       [<ffffffffa0c66874>] mdt_reint+0x44/0xe0 [mdt]
       [<ffffffffa0c693da>] mdt_handle_common+0x53a/0x1480 [mdt]
       [<ffffffffa0ca2b95>] mds_regular_handle+0x15/0x20 [mdt]
       [<ffffffffa072b6e1>] ptlrpc_server_handle_request+0x3b1/0xc80 [ptlrpc]
       [<ffffffffa0f4557e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
       [<ffffffffa0f568af>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
       [<ffffffffa0722fe7>] ? ptlrpc_wait_event+0xb7/0x2c0 [ptlrpc]
       [<ffffffff810b262d>] ? trace_hardirqs_on+0xd/0x10
       [<ffffffff81055d73>] ? __wake_up+0x53/0x70
       [<ffffffffa072c9f2>] ptlrpc_main+0xa42/0x1670 [ptlrpc]
       [<ffffffffa072bfb0>] ? ptlrpc_main+0x0/0x1670 [ptlrpc]
       [<ffffffff81099346>] kthread+0x96/0xa0
       [<ffffffff8100c1ca>] child_rip+0xa/0x20
       [<ffffffff81531fc0>] ? _spin_unlock_irq+0x30/0x40
       [<ffffffff8100bb10>] ? restore_args+0x0/0x30
       [<ffffffff810992b0>] ? kthread+0x0/0xa0
       [<ffffffff8100c1c0>] ? child_rip+0x0/0x20
      INFO: lockdep is turned off.
      INFO: task jbd2/loop0-8:11882 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      jbd2/loop0-8  D ffff8801225702b0     0 11882      2 0x00000080
       ffff88010811bcf0 0000000000000046 0000000000000002 ffff8801225702b0
       ffff88010811bc90 0000000000000046 ffff88010811bdd8 0000000000000282
       ffff88010b0845f8 ffff88010811bfd8 000000000000fc88 ffff88010b0845f8
      Call Trace:
       [<ffffffffa0041aed>] jbd2_journal_commit_transaction+0x19d/0x16e0 [jbd2]
       [<ffffffff81081d5c>] ? lock_timer_base+0x3c/0x70
       [<ffffffff815323f1>] ? _spin_lock_irqsave+0x81/0xa0
       [<ffffffff8108381b>] ? try_to_del_timer_sync+0x7b/0xe0
       [<ffffffff81099690>] ? autoremove_wake_function+0x0/0x40
       [<ffffffff8108390a>] ? del_timer_sync+0x8a/0xc0
       [<ffffffff81083880>] ? del_timer_sync+0x0/0xc0
       [<ffffffffa00481e1>] kjournald2+0xc1/0x220 [jbd2]
       [<ffffffff81099690>] ? autoremove_wake_function+0x0/0x40
       [<ffffffffa0048120>] ? kjournald2+0x0/0x220 [jbd2]
       [<ffffffff81099346>] kthread+0x96/0xa0
       [<ffffffff8100c1ca>] child_rip+0xa/0x20
       [<ffffffff81531fc0>] ? _spin_unlock_irq+0x30/0x40
       [<ffffffff8100bb10>] ? restore_args+0x0/0x30
       [<ffffffff810992b0>] ? kthread+0x0/0xa0
       [<ffffffff8100c1c0>] ? child_rip+0x0/0x20
      INFO: lockdep is turned off.
      INFO: task mdt01_003:12211 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      mdt01_003     D ffff88013fc26d00     0 12211      2 0x00000080
       ffff88010aa11870 0000000000000046 0000000000000000 ffffffff810b262d
       ffff88010aa117f0 0000000000000286 ffff88002cbd6cd8 0000000000000286
       ffff880106ac8cf8 ffff88010aa11fd8 000000000000fc88 ffff880106ac8cf8
      Call Trace:
       [<ffffffff810b262d>] ? trace_hardirqs_on+0xd/0x10
       [<ffffffffa0f456be>] cfs_waitq_wait+0xe/0x10 [libcfs]
       [<ffffffffa04ef057>] lu_object_find_at+0xb7/0x370 [obdclass]
       [<ffffffffa014d7e2>] ? fld_server_lookup+0x72/0x440 [fld]
       [<ffffffff810642f0>] ? default_wake_function+0x0/0x20
       [<ffffffffa0bdc6f4>] osd_object_find+0x34/0x270 [osd_ldiskfs]
       [<ffffffffa0be168e>] osd_index_ea_insert+0x4ae/0xdf0 [osd_ldiskfs]
       [<ffffffffa0be8f46>] ? osd_index_ea_delete+0x906/0xe60 [osd_ldiskfs]
       [<ffffffffa0b8a818>] ? __ldiskfs_journal_stop+0x68/0xa0 [ldiskfs]
       [<ffffffffa0449b84>] lod_index_insert+0x44/0xe0 [lod]
       [<ffffffffa04499d8>] ? lod_index_delete+0x38/0xd0 [lod]
       [<ffffffffa0b04d5e>] __mdd_orphan_add+0x33e/0x3b0 [mdd]
       [<ffffffffa0bdf43d>] ? osd_attr_get+0x13d/0x180 [osd_ldiskfs]
       [<ffffffffa0b0cb51>] mdd_finish_unlink+0xe1/0x380 [mdd]
       [<ffffffffa0b0f7c8>] mdd_unlink+0x9d8/0xe30 [mdd]
       [<ffffffffa0c7eb98>] mdo_unlink+0x18/0x50 [mdt]
       [<ffffffffa0c81f80>] mdt_reint_unlink+0x820/0x1010 [mdt]
       [<ffffffffa0c7e891>] mdt_reint_rec+0x41/0xe0 [mdt]
       [<ffffffffa0c66573>] mdt_reint_internal+0x4c3/0x780 [mdt]
       [<ffffffffa0c66874>] mdt_reint+0x44/0xe0 [mdt]
       [<ffffffffa0c693da>] mdt_handle_common+0x53a/0x1480 [mdt]
       [<ffffffffa0ca2b95>] mds_regular_handle+0x15/0x20 [mdt]
       [<ffffffffa072b6e1>] ptlrpc_server_handle_request+0x3b1/0xc80 [ptlrpc]
       [<ffffffffa0f4557e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
       [<ffffffffa0f568af>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
       [<ffffffffa0722fe7>] ? ptlrpc_wait_event+0xb7/0x2c0 [ptlrpc]
       [<ffffffff810b262d>] ? trace_hardirqs_on+0xd/0x10
       [<ffffffff81055d73>] ? __wake_up+0x53/0x70
       [<ffffffffa072c9f2>] ptlrpc_main+0xa42/0x1670 [ptlrpc]
       [<ffffffffa072bfb0>] ? ptlrpc_main+0x0/0x1670 [ptlrpc]
       [<ffffffff81099346>] kthread+0x96/0xa0
       [<ffffffff8100c1ca>] child_rip+0xa/0x20
       [<ffffffff81531fc0>] ? _spin_unlock_irq+0x30/0x40
       [<ffffffff8100bb10>] ? restore_args+0x0/0x30
       [<ffffffff810992b0>] ? kthread+0x0/0xa0
       [<ffffffff8100c1c0>] ? child_rip+0x0/0x20
      INFO: lockdep is turned off.
      ...
      

      i have a crashdump - but it's triggered too later to get a lustre log.

      Attachments

        1. bt.log
          178 kB
          Alexey Lyashkov
        2. bt-full.log
          1.15 MB
          Alexey Lyashkov
        3. lustre.txt.gz
          836 kB
          Alexey Lyashkov

        Issue Links

          Activity

            People

              bzzz Alex Zhuravlev
              shadow Alexey Lyashkov
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: