[LU-7619] USE_LU_REF/lu_ref feature broken after some REFASSERT()s have been added without lu_ref::lf_guard protection Created: 30/Dec/15  Updated: 13/Jul/16  Resolved: 13/Jul/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Minor
Reporter: Bruno Faccini (Inactive) Assignee: Bruno Faccini (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Some fairly old patches introduced several REFASSERT()s to strengthen the lu_ref feature's checks, but without the required lu_ref::lf_guard spin-lock protection.
As a result, when lu_ref detects a failure, the thread can unexpectedly "self" dead-lock instead of hitting the intended LBUG/ASSERT.
I found this while using the feature myself to track down and debug an unreferencing bug (a missing unreference).
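
To illustrate the locking contract described above, here is a minimal user-space sketch in C. It is not the Lustre code: toy_lock, toy_ref, toy_refassert() and the two toy_fini_*() helpers are hypothetical names, and the toy lock only counts holders so that a misuse is recorded rather than hanging the program. It also assumes that the check drops the guard around its diagnostics and retakes it afterwards, which is why the caller must already hold the guard.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a spinlock: it only counts holders, so an unbalanced
 * release is recorded instead of hanging the program. */
struct toy_lock {
	int  depth;       /* 1 while held, 0 while free */
	bool unbalanced;  /* set if released while not held */
};

static void toy_lock_acquire(struct toy_lock *l)
{
	l->depth++;
}

static void toy_lock_release(struct toy_lock *l)
{
	if (l->depth == 0)
		l->unbalanced = true;  /* a real spinlock could be left inconsistent */
	else
		l->depth--;
}

struct toy_ref {
	struct toy_lock guard;  /* plays the role of lu_ref::lf_guard */
	int refs;               /* outstanding references */
	int failures;           /* failed checks seen so far */
};

/* ASSUMPTION (hypothetical model): the check expects the caller to hold
 * the guard; on failure it drops the guard to run diagnostics, then
 * retakes it before returning. */
static void toy_refassert(struct toy_ref *ref, bool ok)
{
	if (ok)
		return;
	toy_lock_release(&ref->guard);
	ref->failures++;                /* diagnostics would run here */
	toy_lock_acquire(&ref->guard);
}

/* Buggy pattern: the check runs without the guard held. */
void toy_fini_unprotected(struct toy_ref *ref)
{
	toy_refassert(ref, ref->refs == 0);
}

/* Fixed pattern: the check runs under the guard. */
void toy_fini_protected(struct toy_ref *ref)
{
	toy_lock_acquire(&ref->guard);
	toy_refassert(ref, ref->refs == 0);
	toy_lock_release(&ref->guard);
}

/* Small demonstration with one leaked reference in each case. */
void toy_demo(void)
{
	struct toy_ref bad  = { .refs = 1 };
	struct toy_ref good = { .refs = 1 };

	toy_fini_unprotected(&bad);
	assert(bad.guard.unbalanced);   /* guard released while not held */
	assert(bad.guard.depth == 1);   /* and left held on return */

	toy_fini_protected(&good);
	assert(!good.guard.unbalanced); /* acquire/release stay paired */
	assert(good.guard.depth == 0);
}
```

In this model the unprotected path releases a guard it never took and then returns with the guard held, so any later attempt to take it would spin. The protected path reports the same failure while keeping acquire and release paired.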



 Comments   
Comment by Bruno Faccini (Inactive) [ 30/Dec/15 ]

For info, the specific dead-lock that this bug caused when I hit an lu_ref failure had the following signature/stack:

<1>LustreError: dumping log to /tmp/lustre-log.1451225788.113113
<0>BUG: soft lockup - CPU#1 stuck for 67s! [ldlm_bl_03:113432]
<4>Modules linked in: lustre(U) ofd(U) osp(U) lod(U) ost(U) mdt(U) mdd(U) mgs(U) osd_ldiskfs(U) ldiskfs(U) lquota(U) lfsck(U) jbd2 obdecho(U) mgc(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) ptlrpc(U) obdclass(U) ksocklnd(U) lnet(U) sha512_generic crc32c_intel libcfs(U) nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm sg joydev microcode iTCO_wdt iTCO_vendor_support igb i2c_algo_bit sb_edac edac_core i2c_i801 i2c_core lpc_ich mfd_core ioatdma dca shpchp ext3 jbd mbcache sd_mod crc_t10dif isci libsas scsi_transport_sas mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_en ptp pps_core mlx4_core ahci wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
<4>CPU 1
<4>Modules linked in: lustre(U) ofd(U) osp(U) lod(U) ost(U) mdt(U) mdd(U) mgs(U) osd_ldiskfs(U) ldiskfs(U) lquota(U) lfsck(U) jbd2 obdecho(U) mgc(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) ptlrpc(U) obdclass(U) ksocklnd(U) lnet(U) sha512_generic crc32c_intel libcfs(U) nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm sg joydev microcode iTCO_wdt iTCO_vendor_support igb i2c_algo_bit sb_edac edac_core i2c_i801 i2c_core lpc_ich mfd_core ioatdma dca shpchp ext3 jbd mbcache sd_mod crc_t10dif isci libsas scsi_transport_sas mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_en ptp pps_core mlx4_core ahci wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
<4>
<4>Pid: 113432, comm: ldlm_bl_03 Not tainted 2.6.32.573.8.1.el6_lustre #1 Intel Corporation S2600GZ/S2600GZ
<4>RIP: 0010:[<ffffffff8153cdfc>]  [<ffffffff8153cdfc>] _spin_lock+0x1c/0x30
<4>RSP: 0018:ffff88030307ba30  EFLAGS: 00000297
<4>RAX: 000000000000059c RBX: ffff88030307ba30 RCX: 0000000000000000
<4>RDX: 000000000000059b RSI: 0000000000000000 RDI: ffffffffa071db68
<4>RBP: ffffffff8100bc0e R08: 00000000fffffffe R09: 0000000000000000
<4>R10: 000000000000000f R11: 000000000000000f R12: ffff8802fac8f970
<4>R13: ffff8803030df060 R14: 0000000000000097 R15: 00000000fffffffc
<4>FS:  0000000000000000(0000) GS:ffff880038620000(0000) knlGS:0000000000000000
<4>CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
<4>CR2: 0000000002527c1c CR3: 0000000001a8d000 CR4: 00000000000407e0
<4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>Process ldlm_bl_03 (pid: 113432, threadinfo ffff880303078000, task ffff8803f1eb4040)
<4>Stack:
<4> ffff88030307ba50 ffffffffa05fb78a ffff8803030df060 ffff8803030df060
<4><d> ffff88030307ba70 ffffffffa05fb9f2 ffff88030307ba70 ffff8803030df000
<4><d> ffff88030307bab0 ffffffffa05f3955 ffff8802fae12888 ffff8803030df000
<4>Call Trace:
<4> [<ffffffffa05fb78a>] ? lu_ref_print_all+0x1a/0x80 [obdclass]
<4> [<ffffffffa05fb9f2>] ? lu_ref_fini+0x82/0x170 [obdclass]
<4> [<ffffffffa05f3955>] ? cl_page_free+0xe5/0x540 [obdclass]
<4> [<ffffffffa05f3f5c>] ? cl_page_put+0x1ac/0x3e0 [obdclass]
<4> [<ffffffffa05fc179>] ? lu_ref_del+0x109/0x2c0 [obdclass]
<4> [<ffffffffa0aa709c>] ? osc_page_gang_lookup+0x1dc/0x380 [osc]
<4> [<ffffffffa0aa6b80>] ? discard_cb+0x0/0x190 [osc]
<4> [<ffffffffa0aa7384>] ? osc_lock_discard_pages+0x144/0x240 [osc]
<4> [<ffffffffa0a9d8b5>] ? osc_lock_flush+0x55/0x260 [osc]
<4> [<ffffffffa0aa6b80>] ? discard_cb+0x0/0x190 [osc]
<4> [<ffffffffa0a9d8eb>] ? osc_lock_flush+0x8b/0x260 [osc]
<4> [<ffffffffa0a9dd68>] ? osc_ldlm_blocking_ast+0x2a8/0x3c0 [osc]
<4> [<ffffffffa07c34bc>] ? ldlm_cancel_callback+0x6c/0x1a0 [ptlrpc]
<4> [<ffffffffa07de40a>] ? ldlm_cli_cancel_local+0x8a/0x470 [ptlrpc]
<4> [<ffffffffa07e2fac>] ? ldlm_cli_cancel+0x7c/0x380 [ptlrpc]
<4> [<ffffffffa0a9db9b>] ? osc_ldlm_blocking_ast+0xdb/0x3c0 [osc]
<4> [<ffffffffa04889f1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
<4> [<ffffffffa07e6ea0>] ? ldlm_handle_bl_callback+0x130/0x400 [ptlrpc]
<4> [<ffffffffa07e81a1>] ? ldlm_bl_thread_main+0x271/0x3f0 [ptlrpc]
<4> [<ffffffff810672b0>] ? default_wake_function+0x0/0x20
<4> [<ffffffffa07e7f30>] ? ldlm_bl_thread_main+0x0/0x3f0 [ptlrpc]
<4> [<ffffffff810a0fce>] ? kthread+0x9e/0xc0
<4> [<ffffffff8100c28a>] ? child_rip+0xa/0x20
<4> [<ffffffff810a0f30>] ? kthread+0x0/0xc0
<4> [<ffffffff8100c280>] ? child_rip+0x0/0x20
Comment by Gerrit Updater [ 30/Dec/15 ]

Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: http://review.whamcloud.com/17756
Subject: LU-7619 obdclass: protect REFASSERT() with lu_ref::lf_guard
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 71ab9f599d2035aac6d906be3402e7b0f8aa9044

Comment by Gerrit Updater [ 11/Jul/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17756/
Subject: LU-7619 obdclass: protect REFASSERT() with lu_ref::lf_guard
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 838514369ac245d2dcfdcda7715a2798fe9ee755

Comment by Joseph Gmitter (Inactive) [ 13/Jul/16 ]

Patch landed to master for 2.9.0

Generated at Sat Feb 10 02:10:28 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.