[LU-818] possible circular locking dependency detected on MDS Created: 02/Nov/11  Updated: 04/Jun/12  Resolved: 04/Jun/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Ned Bass Assignee: Oleg Drokin
Resolution: Won't Fix Votes: 0
Labels: None
Environment:

RHEL 6.2
debug kernel 2.6.32-207.1chaos.ch5.x86_64.debug


Severity: 3
Rank (Obsolete): 6529

 Description   

The kernel lock validator produced the following warning on a Lustre 2.1 MDS.

=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.32-207.1chaos.ch5.x86_64.debug #1
-------------------------------------------------------
mdt_01/24742 is trying to acquire lock:
 (&sb->s_type->i_mutex_key#14){+.+.+.}, at: [<ffffffff811a2bf0>] do_filp_open+0x2c0/0xd90

but task is already holding lock:
 (&loghandle->lgh_lock){+++++.}, at: [<ffffffffa04aaef2>] llog_cat_current_log.clone.0+0xa2/0x1000 [obdclass]

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #3 (&loghandle->lgh_lock){+++++.}:
       [<ffffffff810af29e>] __lock_acquire+0xeee/0x1570
       [<ffffffff810af9c4>] lock_acquire+0xa4/0x120
       [<ffffffff8109b74e>] down_read_nested+0x5e/0xb0
       [<ffffffffa04aae97>] llog_cat_current_log.clone.0+0x47/0x1000 [obdclass]
       [<ffffffffa04ad68a>] llog_cat_add_rec+0x5a/0x840 [obdclass]
       [<ffffffffa04b3fd6>] llog_obd_origin_add+0x56/0x190 [obdclass]
       [<ffffffffa04b41b1>] llog_add+0xa1/0x3c0 [obdclass]
       [<ffffffffa07660cc>] lov_llog_origin_add+0xcc/0x5f0 [lov]
       [<ffffffffa04b41b1>] llog_add+0xa1/0x3c0 [obdclass]
       [<ffffffffa09dfc39>] mds_llog_origin_add+0x109/0x310 [mds]
       [<ffffffffa04b41b1>] llog_add+0xa1/0x3c0 [obdclass]
       [<ffffffffa09e02c6>] mds_llog_add_unlink+0x176/0x520 [mds]
       [<ffffffffa09e09a6>] mds_log_op_unlink+0x196/0x9a0 [mds]
       [<ffffffffa0a012b1>] mdd_unlink_log+0x51/0x110 [mdd]
       [<ffffffffa09f594e>] mdd_object_kill+0x14e/0x1b0 [mdd]
       [<ffffffffa0a0eece>] mdd_finish_unlink+0x20e/0x2c0 [mdd]
       [<ffffffffa0a15683>] mdd_unlink+0x853/0xb00 [mdd]
       [<ffffffffa0abcb46>] cml_unlink+0xb6/0x260 [cmm]
       [<ffffffffa0a5b799>] mdt_reint_unlink+0x689/0xa30 [mdt]
       [<ffffffffa0a5966f>] mdt_reint_rec+0x3f/0x100 [mdt]
       [<ffffffffa0a51b94>] mdt_reint_internal+0x6d4/0x9f0 [mdt]
       [<ffffffffa0a51efc>] mdt_reint+0x4c/0x120 [mdt]
       [<ffffffffa0a459f5>] mdt_handle_common+0x8e5/0x1820 [mdt]
       [<ffffffffa0a46a05>] mdt_regular_handle+0x15/0x20 [mdt]
       [<ffffffffa05f3d9a>] ptlrpc_main+0xbea/0x1930 [ptlrpc]
       [<ffffffff8100c20a>] child_rip+0xa/0x20

-> #2 (&mo->oo_sem#2/3){+++++.}:
       [<ffffffff810af29e>] __lock_acquire+0xeee/0x1570
       [<ffffffff810af9c4>] lock_acquire+0xa4/0x120
       [<ffffffff8109b6a5>] down_write_nested+0x65/0xb0
       [<ffffffffa0ad512b>] osd_object_write_lock+0x5b/0xd0 [osd_ldiskfs]
       [<ffffffffa0a23fb6>] mdd_write_lock+0x26/0x30 [mdd]
       [<ffffffffa0a16127>] mdd_create+0x7f7/0x1db0 [mdd]
       [<ffffffffa0510880>] llo_store_create_index+0x180/0x220 [obdclass]
       [<ffffffffa0addf8f>] osd_oi_init+0x1bf/0x2d0 [osd_ldiskfs]
       [<ffffffffa0ad779e>] osd_prepare+0x7e/0x310 [osd_ldiskfs]
       [<ffffffffa0a1da2a>] mdd_prepare+0x8a/0x790 [mdd]
       [<ffffffffa0ab607d>] cmm_prepare+0x4d/0x110 [cmm]
       [<ffffffffa0a4f0e0>] mdt_device_alloc+0xf40/0x2510 [mdt]
       [<ffffffffa04d599f>] obd_setup+0x1ff/0x330 [obdclass]
       [<ffffffffa04d5cd8>] class_setup+0x208/0xa50 [obdclass]
       [<ffffffffa04dd80c>] class_process_config+0xd6c/0x1fd0 [obdclass]
       [<ffffffffa04dfb28>] class_config_llog_handler+0x938/0x1680 [obdclass]
       [<ffffffffa04a939b>] llog_process_thread+0x94b/0xdf0 [obdclass]
       [<ffffffff8100c20a>] child_rip+0xa/0x20

-> #1 (jbd2_handle){+.+.+.}:
       [<ffffffff810af29e>] __lock_acquire+0xeee/0x1570
       [<ffffffff810af9c4>] lock_acquire+0xa4/0x120
       [<ffffffffa009737c>] start_this_handle+0x45c/0x580 [jbd2]
       [<ffffffffa00976bf>] jbd2_journal_start+0xdf/0x120 [jbd2]
       [<ffffffffa0994318>] ldiskfs_journal_start_sb+0x58/0x90 [ldiskfs]
       [<ffffffffa0985122>] ldiskfs_create+0x62/0x180 [ldiskfs]
       [<ffffffff8119fce4>] vfs_create+0xb4/0xe0
       [<ffffffff811a345f>] do_filp_open+0xb2f/0xd90
       [<ffffffff8118f569>] do_sys_open+0x69/0x140
       [<ffffffff8118f680>] sys_open+0x20/0x30
       [<ffffffff8100b0b2>] system_call_fastpath+0x16/0x1b

-> #0 (&sb->s_type->i_mutex_key#14){+.+.+.}:
       [<ffffffff810af89e>] __lock_acquire+0x14ee/0x1570
       [<ffffffff810af9c4>] lock_acquire+0xa4/0x120
       [<ffffffff8151dbe3>] __mutex_lock_common+0x43/0x400
       [<ffffffff8151e0a8>] mutex_lock_nested+0x48/0x60
       [<ffffffff811a2bf0>] do_filp_open+0x2c0/0xd90
       [<ffffffff811a36dd>] filp_open+0x1d/0x20
       [<ffffffffa09dd420>] mds_obd_create+0x1e0/0xd80 [mds]
       [<ffffffffa04b1a64>] llog_lvfs_create+0x6d4/0x1310 [obdclass]
       [<ffffffffa04ab16a>] llog_cat_current_log.clone.0+0x31a/0x1000 [obdclass]
       [<ffffffffa04ad68a>] llog_cat_add_rec+0x5a/0x840 [obdclass]
       [<ffffffffa04b3fd6>] llog_obd_origin_add+0x56/0x190 [obdclass]
       [<ffffffffa04b41b1>] llog_add+0xa1/0x3c0 [obdclass]
       [<ffffffffa07660cc>] lov_llog_origin_add+0xcc/0x5f0 [lov]
       [<ffffffffa04b41b1>] llog_add+0xa1/0x3c0 [obdclass]
       [<ffffffffa09dfc39>] mds_llog_origin_add+0x109/0x310 [mds]
       [<ffffffffa04b41b1>] llog_add+0xa1/0x3c0 [obdclass]
       [<ffffffffa09e02c6>] mds_llog_add_unlink+0x176/0x520 [mds]
       [<ffffffffa09e09a6>] mds_log_op_unlink+0x196/0x9a0 [mds]
       [<ffffffffa0a012b1>] mdd_unlink_log+0x51/0x110 [mdd]
       [<ffffffffa09f594e>] mdd_object_kill+0x14e/0x1b0 [mdd]
       [<ffffffffa0a0eece>] mdd_finish_unlink+0x20e/0x2c0 [mdd]
       [<ffffffffa0a15683>] mdd_unlink+0x853/0xb00 [mdd]
       [<ffffffffa0abcb46>] cml_unlink+0xb6/0x260 [cmm]
       [<ffffffffa0a5b799>] mdt_reint_unlink+0x689/0xa30 [mdt]
       [<ffffffffa0a5966f>] mdt_reint_rec+0x3f/0x100 [mdt]
       [<ffffffffa0a51b94>] mdt_reint_internal+0x6d4/0x9f0 [mdt]
       [<ffffffffa0a51efc>] mdt_reint+0x4c/0x120 [mdt]
       [<ffffffffa0a459f5>] mdt_handle_common+0x8e5/0x1820 [mdt]
       [<ffffffffa0a46a05>] mdt_regular_handle+0x15/0x20 [mdt]
       [<ffffffffa05f3d9a>] ptlrpc_main+0xbea/0x1930 [ptlrpc]
       [<ffffffff8100c20a>] child_rip+0xa/0x20

other info that might help us debug this:

3 locks held by mdt_01/24742:
 #0:  (jbd2_handle){+.+.+.}, at: [<ffffffffa0097325>] start_this_handle+0x405/0x580 [jbd2]
 #1:  (&mo->oo_sem#2/3){+++++.}, at: [<ffffffffa0ad512b>] osd_object_write_lock+0x5b/0xd0 [osd_ldiskfs]
 #2:  (&loghandle->lgh_lock){+++++.}, at: [<ffffffffa04aaef2>] llog_cat_current_log.clone.0+0xa2/0x1000 [obdclass]

stack backtrace:
Pid: 24742, comm: mdt_01 Tainted: G        W  ----------------   2.6.32-207.1chaos.ch5.x86_64.debug #1
Call Trace:
 [<ffffffff810aca93>] ? print_circular_bug+0xf3/0x100
 [<ffffffff810af89e>] ? __lock_acquire+0x14ee/0x1570
 [<ffffffff811aa760>] ? __d_lookup+0x0/0x180
 [<ffffffff810af9c4>] ? lock_acquire+0xa4/0x120
 [<ffffffff811a2bf0>] ? do_filp_open+0x2c0/0xd90
 [<ffffffff8151dbe3>] ? __mutex_lock_common+0x43/0x400
 [<ffffffff811a2bf0>] ? do_filp_open+0x2c0/0xd90
 [<ffffffff810adc9d>] ? trace_hardirqs_on_caller+0x14d/0x190
 [<ffffffff811a2bf0>] ? do_filp_open+0x2c0/0xd90
 [<ffffffff810ab351>] ? lockdep_init_map+0x51/0x130
 [<ffffffff8151e0a8>] ? mutex_lock_nested+0x48/0x60
 [<ffffffff811a2bf0>] ? do_filp_open+0x2c0/0xd90
 [<ffffffff8103758c>] ? kvm_clock_read+0x1c/0x20
 [<ffffffff81012c29>] ? sched_clock+0x9/0x10
 [<ffffffff811b1bc9>] ? mntput_no_expire+0x29/0x110
 [<ffffffff81297fa0>] ? sprintf+0x40/0x50
 [<ffffffff811a36dd>] ? filp_open+0x1d/0x20
 [<ffffffffa09dd420>] ? mds_obd_create+0x1e0/0xd80 [mds]
 [<ffffffff810aa27d>] ? trace_hardirqs_off+0xd/0x10
 [<ffffffff8109d4af>] ? cpu_clock+0x6f/0x80
 [<ffffffff810ad39d>] ? lock_release_holdtime+0x3d/0x190
 [<ffffffff81175df3>] ? cache_alloc_debugcheck_after+0xf3/0x230
 [<ffffffffa04b1a64>] ? llog_lvfs_create+0x6d4/0x1310 [obdclass]
 [<ffffffffa04aaef2>] ? llog_cat_current_log.clone.0+0xa2/0x1000 [obdclass]
 [<ffffffffa04aaef2>] ? llog_cat_current_log.clone.0+0xa2/0x1000 [obdclass]
 [<ffffffffa04ab16a>] ? llog_cat_current_log.clone.0+0x31a/0x1000 [obdclass]
 [<ffffffff8103758c>] ? kvm_clock_read+0x1c/0x20
 [<ffffffff81012c29>] ? sched_clock+0x9/0x10
 [<ffffffff8109d245>] ? sched_clock_local+0x25/0x90
 [<ffffffffa04ad68a>] ? llog_cat_add_rec+0x5a/0x840 [obdclass]
 [<ffffffffa076625a>] ? lov_llog_origin_add+0x25a/0x5f0 [lov]
 [<ffffffffa04b3fd6>] ? llog_obd_origin_add+0x56/0x190 [obdclass]
 [<ffffffff8109d368>] ? sched_clock_cpu+0xb8/0x110
 [<ffffffff810aa27d>] ? trace_hardirqs_off+0xd/0x10
 [<ffffffff8109d4af>] ? cpu_clock+0x6f/0x80
 [<ffffffffa04b41b1>] ? llog_add+0xa1/0x3c0 [obdclass]
 [<ffffffffa07660cc>] ? lov_llog_origin_add+0xcc/0x5f0 [lov]
 [<ffffffff8103758c>] ? kvm_clock_read+0x1c/0x20
 [<ffffffff81012c29>] ? sched_clock+0x9/0x10
 [<ffffffff8109d245>] ? sched_clock_local+0x25/0x90
 [<ffffffffa04b41b1>] ? llog_add+0xa1/0x3c0 [obdclass]
 [<ffffffffa09dfc39>] ? mds_llog_origin_add+0x109/0x310 [mds]
 [<ffffffff8109d368>] ? sched_clock_cpu+0xb8/0x110
 [<ffffffffa04b41b1>] ? llog_add+0xa1/0x3c0 [obdclass]
 [<ffffffffa09e02c6>] ? mds_llog_add_unlink+0x176/0x520 [mds]
 [<ffffffffa09e09a6>] ? mds_log_op_unlink+0x196/0x9a0 [mds]
 [<ffffffffa0a012b1>] ? mdd_unlink_log+0x51/0x110 [mdd]
 [<ffffffffa09fadfb>] ? mdd_attr_get_internal+0x7ab/0xb10 [mdd]
 [<ffffffffa09f594e>] ? mdd_object_kill+0x14e/0x1b0 [mdd]
 [<ffffffffa0a0eece>] ? mdd_finish_unlink+0x20e/0x2c0 [mdd]
 [<ffffffffa09f7259>] ? mdd_attr_check_set_internal_locked+0x69/0x180 [mdd]
 [<ffffffffa0a15683>] ? mdd_unlink+0x853/0xb00 [mdd]
 [<ffffffffa0a12196>] ? mdd_lookup+0x56/0x120 [mdd]
 [<ffffffffa0ad26c6>] ? osd_object_version_get+0x16/0xc0 [osd_ldiskfs]
 [<ffffffffa05bce20>] ? ldlm_completion_ast+0x0/0x740 [ptlrpc]
 [<ffffffffa0a3ccc0>] ? mdt_blocking_ast+0x0/0x320 [mdt]
 [<ffffffffa0abcb46>] ? cml_unlink+0xb6/0x260 [cmm]
 [<ffffffffa0a5a406>] ? mdt_version_save+0x96/0x170 [mdt]
 [<ffffffffa0a5b799>] ? mdt_reint_unlink+0x689/0xa30 [mdt]
 [<ffffffffa0a5882f>] ? mdt_unlink_unpack+0x41f/0x5a0 [mdt]
 [<ffffffffa0a5966f>] ? mdt_reint_rec+0x3f/0x100 [mdt]
 [<ffffffffa05e5b64>] ? lustre_msg_get_flags+0x34/0xa0 [ptlrpc]
 [<ffffffffa0a51b94>] ? mdt_reint_internal+0x6d4/0x9f0 [mdt]
 [<ffffffffa0a47806>] ? mdt_reint_opcode+0x96/0x160 [mdt]
 [<ffffffffa0a51efc>] ? mdt_reint+0x4c/0x120 [mdt]
 [<ffffffffa0a459f5>] ? mdt_handle_common+0x8e5/0x1820 [mdt]
 [<ffffffffa05e32c4>] ? lustre_msg_get_opc+0x94/0x100 [ptlrpc]
 [<ffffffffa0a46a05>] ? mdt_regular_handle+0x15/0x20 [mdt]
 [<ffffffffa05f3d9a>] ? ptlrpc_main+0xbea/0x1930 [ptlrpc]
 [<ffffffff810adc9d>] ? trace_hardirqs_on_caller+0x14d/0x190
 [<ffffffffa05f31b0>] ? ptlrpc_main+0x0/0x1930 [ptlrpc]
 [<ffffffff8100c20a>] ? child_rip+0xa/0x20
 [<ffffffff8100bb50>] ? restore_args+0x0/0x30
 [<ffffffffa05f31b0>] ? ptlrpc_main+0x0/0x1930 [ptlrpc]
 [<ffffffff8100c200>] ? child_rip+0x0/0x20


 Comments   
Comment by Peter Jones [ 02/Nov/11 ]

Oleg

Could you please look into this one?

Thanks

Peter

Comment by Ned Bass [ 03/Nov/11 ]

How interested are you in getting bug reports based on kernel lock validator warnings? I've seen recursive locking/circular dependency notices for several different chains, but haven't hit any actual deadlocks. Would you like me to open an issue for each unique occurrence?

Comment by Oleg Drokin [ 03/Nov/11 ]

Hm, I don't think I have seen this one.
llogs are undergoing significant rewrk now as part of orion project and this was not a real problem before, so I tend to ignore this warning.

There is a different chain I have seen in ldlm that we have discussed internally and decided that the reversion could never happen.

Comment by Peter Jones [ 04/Jun/12 ]

Close ticket. We can reopen if this ever causes a problem in production

Generated at Sat Feb 10 01:10:42 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.