Details

    • Bug
    • Resolution: Duplicate
    • Major
    • None
    • Lustre 2.5.0
    • None
    • RHEL6 + master branch (two days ago)
    • 3
    • 10042

    Description

      while testing a own patch i hit a deadlock with journal

      Lustre: DEBUG MARKER: == sanity test 156: Verification of tunables ============================== 21:09:39 (1377968979)
      Lustre: DEBUG MARKER: cancel_lru_locks osc start
      Lustre: DEBUG MARKER: cancel_lru_locks osc stop
      Lustre: DEBUG MARKER: cancel_lru_locks osc start
      Lustre: DEBUG MARKER: cancel_lru_locks osc stop
      INFO: task jbd2/loop0-8:11882 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      jbd2/loop0-8  D ffff8801225702b0     0 11882      2 0x00000080
       ffff88010811bcf0 0000000000000046 0000000000000002 ffff8801225702b0
       ffff88010811bc90 0000000000000046 ffff88010811bdd8 0000000000000282
       ffff88010b0845f8 ffff88010811bfd8 000000000000fc88 ffff88010b0845f8
      Call Trace:
       [<ffffffffa0041aed>] jbd2_journal_commit_transaction+0x19d/0x16e0 [jbd2]
       [<ffffffff81081d5c>] ? lock_timer_base+0x3c/0x70
       [<ffffffff815323f1>] ? _spin_lock_irqsave+0x81/0xa0
       [<ffffffff8108381b>] ? try_to_del_timer_sync+0x7b/0xe0
       [<ffffffff81099690>] ? autoremove_wake_function+0x0/0x40
       [<ffffffff8108390a>] ? del_timer_sync+0x8a/0xc0
       [<ffffffff81083880>] ? del_timer_sync+0x0/0xc0
       [<ffffffffa00481e1>] kjournald2+0xc1/0x220 [jbd2]
       [<ffffffff81099690>] ? autoremove_wake_function+0x0/0x40
       [<ffffffffa0048120>] ? kjournald2+0x0/0x220 [jbd2]
       [<ffffffff81099346>] kthread+0x96/0xa0
       [<ffffffff8100c1ca>] child_rip+0xa/0x20
       [<ffffffff81531fc0>] ? _spin_unlock_irq+0x30/0x40
       [<ffffffff8100bb10>] ? restore_args+0x0/0x30
       [<ffffffff810992b0>] ? kthread+0x0/0xa0
       [<ffffffff8100c1c0>] ? child_rip+0x0/0x20
      INFO: lockdep is turned off.
      INFO: task mdt01_003:12211 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      mdt01_003     D ffff88013fc26d00     0 12211      2 0x00000080
       ffff88010aa11870 0000000000000046 0000000000000000 ffffffff810b262d
       ffff88010aa117f0 0000000000000286 ffff88002cbd6cd8 0000000000000286
       ffff880106ac8cf8 ffff88010aa11fd8 000000000000fc88 ffff880106ac8cf8
      Call Trace:
       [<ffffffff810b262d>] ? trace_hardirqs_on+0xd/0x10
       [<ffffffffa0f456be>] cfs_waitq_wait+0xe/0x10 [libcfs]
       [<ffffffffa04ef057>] lu_object_find_at+0xb7/0x370 [obdclass]
       [<ffffffffa014d7e2>] ? fld_server_lookup+0x72/0x440 [fld]
       [<ffffffff810642f0>] ? default_wake_function+0x0/0x20
       [<ffffffffa0bdc6f4>] osd_object_find+0x34/0x270 [osd_ldiskfs]
       [<ffffffffa0be168e>] osd_index_ea_insert+0x4ae/0xdf0 [osd_ldiskfs]
       [<ffffffffa0be8f46>] ? osd_index_ea_delete+0x906/0xe60 [osd_ldiskfs]
       [<ffffffffa0b8a818>] ? __ldiskfs_journal_stop+0x68/0xa0 [ldiskfs]
       [<ffffffffa0449b84>] lod_index_insert+0x44/0xe0 [lod]
       [<ffffffffa04499d8>] ? lod_index_delete+0x38/0xd0 [lod]
       [<ffffffffa0b04d5e>] __mdd_orphan_add+0x33e/0x3b0 [mdd]
       [<ffffffffa0bdf43d>] ? osd_attr_get+0x13d/0x180 [osd_ldiskfs]
       [<ffffffffa0b0cb51>] mdd_finish_unlink+0xe1/0x380 [mdd]
       [<ffffffffa0b0f7c8>] mdd_unlink+0x9d8/0xe30 [mdd]
       [<ffffffffa0c7eb98>] mdo_unlink+0x18/0x50 [mdt]
       [<ffffffffa0c81f80>] mdt_reint_unlink+0x820/0x1010 [mdt]
       [<ffffffffa0c7e891>] mdt_reint_rec+0x41/0xe0 [mdt]
       [<ffffffffa0c66573>] mdt_reint_internal+0x4c3/0x780 [mdt]
       [<ffffffffa0c66874>] mdt_reint+0x44/0xe0 [mdt]
       [<ffffffffa0c693da>] mdt_handle_common+0x53a/0x1480 [mdt]
       [<ffffffffa0ca2b95>] mds_regular_handle+0x15/0x20 [mdt]
       [<ffffffffa072b6e1>] ptlrpc_server_handle_request+0x3b1/0xc80 [ptlrpc]
       [<ffffffffa0f4557e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
       [<ffffffffa0f568af>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
       [<ffffffffa0722fe7>] ? ptlrpc_wait_event+0xb7/0x2c0 [ptlrpc]
       [<ffffffff810b262d>] ? trace_hardirqs_on+0xd/0x10
       [<ffffffff81055d73>] ? __wake_up+0x53/0x70
       [<ffffffffa072c9f2>] ptlrpc_main+0xa42/0x1670 [ptlrpc]
       [<ffffffffa072bfb0>] ? ptlrpc_main+0x0/0x1670 [ptlrpc]
       [<ffffffff81099346>] kthread+0x96/0xa0
       [<ffffffff8100c1ca>] child_rip+0xa/0x20
       [<ffffffff81531fc0>] ? _spin_unlock_irq+0x30/0x40
       [<ffffffff8100bb10>] ? restore_args+0x0/0x30
       [<ffffffff810992b0>] ? kthread+0x0/0xa0
       [<ffffffff8100c1c0>] ? child_rip+0x0/0x20
      INFO: lockdep is turned off.
      INFO: task jbd2/loop0-8:11882 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      jbd2/loop0-8  D ffff8801225702b0     0 11882      2 0x00000080
       ffff88010811bcf0 0000000000000046 0000000000000002 ffff8801225702b0
       ffff88010811bc90 0000000000000046 ffff88010811bdd8 0000000000000282
       ffff88010b0845f8 ffff88010811bfd8 000000000000fc88 ffff88010b0845f8
      Call Trace:
       [<ffffffffa0041aed>] jbd2_journal_commit_transaction+0x19d/0x16e0 [jbd2]
       [<ffffffff81081d5c>] ? lock_timer_base+0x3c/0x70
       [<ffffffff815323f1>] ? _spin_lock_irqsave+0x81/0xa0
       [<ffffffff8108381b>] ? try_to_del_timer_sync+0x7b/0xe0
       [<ffffffff81099690>] ? autoremove_wake_function+0x0/0x40
       [<ffffffff8108390a>] ? del_timer_sync+0x8a/0xc0
       [<ffffffff81083880>] ? del_timer_sync+0x0/0xc0
       [<ffffffffa00481e1>] kjournald2+0xc1/0x220 [jbd2]
       [<ffffffff81099690>] ? autoremove_wake_function+0x0/0x40
       [<ffffffffa0048120>] ? kjournald2+0x0/0x220 [jbd2]
       [<ffffffff81099346>] kthread+0x96/0xa0
       [<ffffffff8100c1ca>] child_rip+0xa/0x20
       [<ffffffff81531fc0>] ? _spin_unlock_irq+0x30/0x40
       [<ffffffff8100bb10>] ? restore_args+0x0/0x30
       [<ffffffff810992b0>] ? kthread+0x0/0xa0
       [<ffffffff8100c1c0>] ? child_rip+0x0/0x20
      INFO: lockdep is turned off.
      INFO: task mdt01_003:12211 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      mdt01_003     D ffff88013fc26d00     0 12211      2 0x00000080
       ffff88010aa11870 0000000000000046 0000000000000000 ffffffff810b262d
       ffff88010aa117f0 0000000000000286 ffff88002cbd6cd8 0000000000000286
       ffff880106ac8cf8 ffff88010aa11fd8 000000000000fc88 ffff880106ac8cf8
      Call Trace:
       [<ffffffff810b262d>] ? trace_hardirqs_on+0xd/0x10
       [<ffffffffa0f456be>] cfs_waitq_wait+0xe/0x10 [libcfs]
       [<ffffffffa04ef057>] lu_object_find_at+0xb7/0x370 [obdclass]
       [<ffffffffa014d7e2>] ? fld_server_lookup+0x72/0x440 [fld]
       [<ffffffff810642f0>] ? default_wake_function+0x0/0x20
       [<ffffffffa0bdc6f4>] osd_object_find+0x34/0x270 [osd_ldiskfs]
       [<ffffffffa0be168e>] osd_index_ea_insert+0x4ae/0xdf0 [osd_ldiskfs]
       [<ffffffffa0be8f46>] ? osd_index_ea_delete+0x906/0xe60 [osd_ldiskfs]
       [<ffffffffa0b8a818>] ? __ldiskfs_journal_stop+0x68/0xa0 [ldiskfs]
       [<ffffffffa0449b84>] lod_index_insert+0x44/0xe0 [lod]
       [<ffffffffa04499d8>] ? lod_index_delete+0x38/0xd0 [lod]
       [<ffffffffa0b04d5e>] __mdd_orphan_add+0x33e/0x3b0 [mdd]
       [<ffffffffa0bdf43d>] ? osd_attr_get+0x13d/0x180 [osd_ldiskfs]
       [<ffffffffa0b0cb51>] mdd_finish_unlink+0xe1/0x380 [mdd]
       [<ffffffffa0b0f7c8>] mdd_unlink+0x9d8/0xe30 [mdd]
       [<ffffffffa0c7eb98>] mdo_unlink+0x18/0x50 [mdt]
       [<ffffffffa0c81f80>] mdt_reint_unlink+0x820/0x1010 [mdt]
       [<ffffffffa0c7e891>] mdt_reint_rec+0x41/0xe0 [mdt]
       [<ffffffffa0c66573>] mdt_reint_internal+0x4c3/0x780 [mdt]
       [<ffffffffa0c66874>] mdt_reint+0x44/0xe0 [mdt]
       [<ffffffffa0c693da>] mdt_handle_common+0x53a/0x1480 [mdt]
       [<ffffffffa0ca2b95>] mds_regular_handle+0x15/0x20 [mdt]
       [<ffffffffa072b6e1>] ptlrpc_server_handle_request+0x3b1/0xc80 [ptlrpc]
       [<ffffffffa0f4557e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
       [<ffffffffa0f568af>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
       [<ffffffffa0722fe7>] ? ptlrpc_wait_event+0xb7/0x2c0 [ptlrpc]
       [<ffffffff810b262d>] ? trace_hardirqs_on+0xd/0x10
       [<ffffffff81055d73>] ? __wake_up+0x53/0x70
       [<ffffffffa072c9f2>] ptlrpc_main+0xa42/0x1670 [ptlrpc]
       [<ffffffffa072bfb0>] ? ptlrpc_main+0x0/0x1670 [ptlrpc]
       [<ffffffff81099346>] kthread+0x96/0xa0
       [<ffffffff8100c1ca>] child_rip+0xa/0x20
       [<ffffffff81531fc0>] ? _spin_unlock_irq+0x30/0x40
       [<ffffffff8100bb10>] ? restore_args+0x0/0x30
       [<ffffffff810992b0>] ? kthread+0x0/0xa0
       [<ffffffff8100c1c0>] ? child_rip+0x0/0x20
      INFO: lockdep is turned off.
      ...
      

      i have a crashdump - but it's triggered too later to get a lustre log.

      Attachments

        1. bt.log
          178 kB
        2. bt-full.log
          1.15 MB
        3. lustre.txt.gz
          836 kB

        Issue Links

          Activity

            [LU-3870] deadlock with journal

            Reopening to remove fix version since this is a duplicate.

            jlevi Jodi Levi (Inactive) added a comment - Reopening to remove fix version since this is a duplicate.

            Alex,

            Lustre: DEBUG MARKER: == sanity test 156: Verification of tunables ============================== 21:09:39 (1377968979)
            Lustre: DEBUG MARKER: cancel_lru_locks osc start
            Lustre: DEBUG MARKER: cancel_lru_locks osc stop
            Lustre: DEBUG MARKER: cancel_lru_locks osc start
            Lustre: DEBUG MARKER: cancel_lru_locks osc stop
            INFO: task jbd2/loop0-8:11882 blocked for more than 120 seconds.
            
            shadow Alexey Lyashkov added a comment - Alex, Lustre: DEBUG MARKER: == sanity test 156: Verification of tunables ============================== 21:09:39 (1377968979) Lustre: DEBUG MARKER: cancel_lru_locks osc start Lustre: DEBUG MARKER: cancel_lru_locks osc stop Lustre: DEBUG MARKER: cancel_lru_locks osc start Lustre: DEBUG MARKER: cancel_lru_locks osc stop INFO: task jbd2/loop0-8:11882 blocked for more than 120 seconds.

            wait in lu_object_find_at() means that the object was marked as dying which is set in ->do_destroy(). if I checked correctly, all calls to ->destroy() are done with the lock held. can you tell what test was running?

            bzzz Alex Zhuravlev added a comment - wait in lu_object_find_at() means that the object was marked as dying which is set in ->do_destroy(). if I checked correctly, all calls to ->destroy() are done with the lock held. can you tell what test was running?

            Bobijam is away on holiday this week. Alex, could you please take a look?

            adilger Andreas Dilger added a comment - Bobijam is away on holiday this week. Alex, could you please take a look?

            from my point view - it's bug similar to the LU-4019.
            mdt thread stick in

            PID: 12211  TASK: ffff880106ac8740  CPU: 6   COMMAND: "mdt01_003"
             #0 [ffff88010aa117b0] schedule at ffffffff8152eebd
             #1 [ffff88010aa11878] cfs_waitq_wait at ffffffffa0f456be [libcfs]
             #2 [ffff88010aa11888] lu_object_find_at at ffffffffa04ef057 [obdclass]
             #3 [ffff88010aa11948] osd_object_find at ffffffffa0bdc6f4 [osd_ldiskfs]
             #4 [ffff88010aa11978] osd_index_ea_insert at ffffffffa0be168e [osd_ldiskfs]
             #5 [ffff88010aa11a18] lod_index_insert at ffffffffa0449b84 [lod]
             #6 [ffff88010aa11a38] __mdd_orphan_add at ffffffffa0b04d5e [mdd]
             #7 [ffff88010aa11aa8] mdd_finish_unlink at ffffffffa0b0cb51 [mdd]
            

            and transaction open. that transaction block any other IO on MDT.

            Bobijam - did you test a status of the object? may it's race between create and unlink ? or other possibilities to be similar for LU-4019 ?

            shadow Alexey Lyashkov added a comment - from my point view - it's bug similar to the LU-4019 . mdt thread stick in PID: 12211 TASK: ffff880106ac8740 CPU: 6 COMMAND: "mdt01_003" #0 [ffff88010aa117b0] schedule at ffffffff8152eebd #1 [ffff88010aa11878] cfs_waitq_wait at ffffffffa0f456be [libcfs] #2 [ffff88010aa11888] lu_object_find_at at ffffffffa04ef057 [obdclass] #3 [ffff88010aa11948] osd_object_find at ffffffffa0bdc6f4 [osd_ldiskfs] #4 [ffff88010aa11978] osd_index_ea_insert at ffffffffa0be168e [osd_ldiskfs] #5 [ffff88010aa11a18] lod_index_insert at ffffffffa0449b84 [lod] #6 [ffff88010aa11a38] __mdd_orphan_add at ffffffffa0b04d5e [mdd] #7 [ffff88010aa11aa8] mdd_finish_unlink at ffffffffa0b0cb51 [mdd] and transaction open. that transaction block any other IO on MDT. Bobijam - did you test a status of the object? may it's race between create and unlink ? or other possibilities to be similar for LU-4019 ?

            Moved over to 2.6.0 since we haven't been able to make any progress with understanding, reproducing, or fixing this problem. Can increase priority again if this is hit again.

            adilger Andreas Dilger added a comment - Moved over to 2.6.0 since we haven't been able to make any progress with understanding, reproducing, or fixing this problem. Can increase priority again if this is hit again.
            bobijam Zhenyu Xu added a comment -

            I didn't find anything suspicious in the dump so far.

            bobijam Zhenyu Xu added a comment - I didn't find anything suspicious in the dump so far.

            Hi Bobijam.

            LU-3870.tar.bz2 finished to upload in next 10-15 minutes. it's have vmlinux, system.map, crashdump with all logs and copy of my lustre tree.
            but crash triggered too later :/ let me know if you need something else.

            shadow Alexey Lyashkov added a comment - Hi Bobijam. LU-3870 .tar.bz2 finished to upload in next 10-15 minutes. it's have vmlinux, system.map, crashdump with all logs and copy of my lustre tree. but crash triggered too later :/ let me know if you need something else.
            bobijam Zhenyu Xu added a comment -

            Shadow,

            Would you mind uploading the crashdump packages please? Thanks.

            bobijam Zhenyu Xu added a comment - Shadow, Would you mind uploading the crashdump packages please? Thanks.
            pjones Peter Jones added a comment -

            Bobijam will look into this one

            pjones Peter Jones added a comment - Bobijam will look into this one

            People

              bzzz Alex Zhuravlev
              shadow Alexey Lyashkov
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: