[LU-409] Oops: RIP: _spin_lock_irq+0x15/0x40 Created: 13/Jun/11 Updated: 04/Feb/13 Resolved: 25/Oct/11 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.0, Lustre 1.8.6 |
| Fix Version/s: | Lustre 2.1.0, Lustre 1.8.6 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Jian Yu | Assignee: | Yang Sheng |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Lustre Branch: v1_8_6_RC2 |
||
| Issue Links: |
|
||||||||||||
| Severity: | 3 | ||||||||||||
| Rank (Obsolete): | 4271 | ||||||||||||
| Description |
|
After mounting and unmounting a Lustre filesystem, running lustre_rmmod crashed the Lustre client node as follows:

BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffff814dcf35>] _spin_lock_irq+0x15/0x40
PGD 31ae08067 PUD 312eae067 PMD 0
Oops: 0002 [#1] SMP
last sysfs file: /sys/devices/system/cpu/cpu3/cache/index2/shared_cpu_map
CPU 2
Modules linked in: llite_lloop(-)(U) lustre(U) mgc(U) lov(U) osc(U) mdc(U) lquota(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) lnet(U) libcfs(U) nfs lockd fscache(T) nfs_acl auth_rpcgss autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa dm_mirror dm_region_hash dm_log mlx4_ib ib_mad ib_core mlx4_en mlx4_core igb serio_raw ghes hed i2c_i801 i2c_core sg iTCO_wdt iTCO_vendor_support ioatdma dca i7core_edac edac_core shpchp ext3 jbd mbcache sd_mod crc_t10dif ahci dm_mod [last unloaded: microcode]
Pid: 4826, comm: rmmod Tainted: G ---------------- T 2.6.32-131.2.1.el6.x86_64 #1 X8DTT
RIP: 0010:[<ffffffff814dcf35>] [<ffffffff814dcf35>] _spin_lock_irq+0x15/0x40
RSP: 0018:ffff880318cd9da8 EFLAGS: 00010092
RAX: 0000000000010000 RBX: ffff880328bda000 RCX: 000000000000b1a0
RDX: 0000000000000000 RSI: ffff88031ce09a90 RDI: 0000000000000000
RBP: ffff880318cd9da8 R08: 0000000000000001 R09: ffffffff817c3f86
R10: 0000000000000001 R11: 0000000000000000 R12: ffff88031ce09800
R13: ffff880328bda000 R14: ffff88031ce0b560 R15: 0000000000000001
FS: 00007fb1de18d700(0000) GS:ffff880032e40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 000000031ae78000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process rmmod (pid: 4826, threadinfo ffff880318cd8000, task ffff88032123ca80)
Stack:
 ffff880318cd9dd8 ffffffff8125689c ffff880328bda000 ffff880328bda328
<0> ffff880328bda328 ffff88031ce0b560 ffff880318cd9df8 ffffffff8124ba66
<0> ffffffff81a8a820 ffff880328bda360 ffff880318cd9e28 ffffffff81264a2d
Call Trace:
 [<ffffffff8125689c>] blk_throtl_exit+0x3c/0xd0
 [<ffffffff8124ba66>] blk_release_queue+0x26/0x80
 [<ffffffff81264a2d>] kobject_release+0x8d/0x240
 [<ffffffff812649a0>] ? kobject_release+0x0/0x240
 [<ffffffff81265fd7>] kref_put+0x37/0x70
 [<ffffffff812648a7>] kobject_put+0x27/0x60
 [<ffffffff81247687>] blk_cleanup_queue+0x57/0x70
 [<ffffffffa08070b1>] lloop_exit+0x61/0x300 [llite_lloop]
 [<ffffffff81069012>] ? put_online_cpus+0x52/0x70
 [<ffffffff810a8ef8>] ? module_refcount+0x58/0x70
 [<ffffffff810a9a74>] sys_delete_module+0x194/0x260
 [<ffffffff8100b172>] system_call_fastpath+0x16/0x1b
Code: c1 74 0e f3 90 0f b7 0f eb f5 83 3f 00 75 f4 eb df 48 89 d0 c9 c3 55 48 89 e5 0f 1f 44 00 00 fa 66 0f 1f 44 00 00 b8 00 00 01 00 <f0> 0f c1 07 0f b7 d0 c1 e8 10 39 c2 74 0e f3 90 0f b7 17 eb f5
RIP [<ffffffff814dcf35>] _spin_lock_irq+0x15/0x40
RSP <ffff880318cd9da8>
CR2: 0000000000000000

This failure can easily be reproduced by running llmount.sh and then llmountcleanup.sh. |
| Comments |
| Comment by Peter Jones [ 13/Jun/11 ] |
|
Yang Sheng, can you please look into this failure as your top priority? Thanks. Peter |
| Comment by Andreas Dilger [ 13/Jun/11 ] |
|
This looks at first glance to be related to the lloop virtual block device. This is an unsupported feature, and if this is causing problems then I would |
| Comment by Jian Yu [ 14/Jun/11 ] |
|
After removing the "load_module llite/llite_lloop" line from load_modules_local() in test-framework.sh, the auster testing was able to proceed. The testing is ongoing now. |
| Comment by Yang Sheng [ 14/Jun/11 ] |
|
This is a known issue that has been discussed on the linux-kernel mailing list:

Date: Wed, 16 Feb 2011 18:31:14 +1100
From: NeilBrown <neilb@suse.de>
To: Vivek Goyal <vgoyal@redhat.com>
Cc: Jens Axboe <jaxboe@fusionio.com>, linux-kernel@vger.kernel.org
Subject: blk_throtl_exit taking q->queue_lock is problematic
Message-ID: <20110216183114.26a3613b@notabene.brown>

Hi,
I recently discovered that blk_throtl_exit takes ->queue_lock when a blockdev is finally released.
This is a problem because by that time the queue_lock doesn't exist any more. It is in a separate data structure controlled by the RAID personality, and by the time that the block device is being destroyed the raid personality has shutdown and the data structure containing the lock has been freed.
This has not been a problem before. Nothing else takes queue_lock after blk_cleanup_queue.
I could of course set queue_lock to point to __queue_lock and initialise that, but it seems untidy and probably violates some locking requirements.
Is there some way you could use some other lock - maybe a global lock, or maybe use __queue_lock directly???
Thanks,
NeilBrown

Some fix patches have already been provided and landed upstream. Do we carry them in our own patch series, or just report the issue to Red Hat and wait for the next RHEL6 update release? This only seems to affect the llite_lloop module which, as Andreas pointed out, is not used at present. |
| Comment by Richard Henwood (Inactive) [ 15/Jun/11 ] |
|
I believe I'm seeing this with 2.1 on RHEL6. To reproduce:
1. uname -a = 2.6.32-131.0.15.el6_lustre.x86_64
This is performed in a VM, watching dmesg over netcat. |
| Comment by Yang Sheng [ 15/Jun/11 ] |
|
Hi Andreas, I think we need a decision on this issue. |
| Comment by Andreas Dilger [ 16/Jun/11 ] |
|
Yang Sheng, I think for the current time we should just disable the llite_loop module for 2.6.32 kernels. |
| Comment by Yang Sheng [ 16/Jun/11 ] |
|
Hi Yujian, could you please push your working patch to Gerrit, so we can save time testing it and ensure it works well? |
| Comment by Jian Yu [ 16/Jun/11 ] |
Sure. Patch for b1_8: http://review.whamcloud.com/954. |
| Comment by Richard Henwood (Inactive) [ 16/Jun/11 ] |
|
Change 954, ported to 2.1, works for me. I'll await review of the 1.8 version before I submit a change. |
| Comment by Peter Jones [ 16/Jun/11 ] |
|
Richard, this is likely to land sooner on master due to 1.8.x release testing, so if you can submit a patch for master then it could even get into the next tag. Peter |
| Comment by Richard Henwood (Inactive) [ 16/Jun/11 ] |
|
Change set for 2.1 is here: |
| Comment by Build Master (Inactive) [ 17/Jun/11 ] |
|
Integrated in Johann Lombardi : 2ed811cb0149c805a19a278a8350202e47724d46
|
| Comment by Build Master (Inactive) [ 22/Jun/11 ] |
|
Integrated in Oleg Drokin : ff9f95abb13642fd2a1a183e2f92f390ffdbb1ae
|
| Comment by Peter Jones [ 22/Jun/11 ] |
|
Landed for 2.1 |
| Comment by Jinshan Xiong (Inactive) [ 23/Jun/11 ] |
|
I tend to think the queue has already been cleaned up in del_gendisk(). That means we don't need to do it in lloop_exit(), and this issue will be fixed:

diff --git a/lustre/llite/lloop.c b/lustre/llite/lloop.c
index 6975c85..c3b8fb0 100644
--- a/lustre/llite/lloop.c
+++ b/lustre/llite/lloop.c
@@ -878,7 +878,7 @@ static void lloop_exit(void)
         ll_iocontrol_unregister(ll_iocontrol_magic);
         for (i = 0; i < max_loop; i++) {
                 del_gendisk(disks[i]);
-                blk_cleanup_queue(loop_dev[i].lo_queue);
+//              blk_cleanup_queue(loop_dev[i].lo_queue);
                 put_disk(disks[i]);
         }
         if (ll_unregister_blkdev(lloop_major, "lloop")) |
| Comment by Jinshan Xiong (Inactive) [ 23/Jun/11 ] |
|
I pushed a patch at http://review.whamcloud.com/1011; please take a look. The patch itself needs polishing by adding a macro to check whether it is running on a 2.6.32+ kernel. |
| Comment by Build Master (Inactive) [ 23/Jun/11 ] |
|
Integrated in Oleg Drokin : 5b840606641c3b227c451056c37a941cc13696c9
|
| Comment by Peter Jones [ 28/Jun/11 ] |
|
The workaround is in place, so landing. A more correct fix is lower priority. |
| Comment by Sarah Liu [ 28/Jun/11 ] |
|
Got this problem again when running sanity test_68a with the latest master build (RHEL6/x86_64/#190):
Lustre: DEBUG MARKER: == sanity test 68a: lloop driver - basic test ========================== 14:48:58 (1309297738)
Modules linked in: llite_lloop |
| Comment by Sarah Liu [ 13/Jul/11 ] |
|
Reproduced on the latest rhel6-x86_64 build #201. |
| Comment by Yang Sheng [ 28/Jul/11 ] |
|
Patch uploaded to: http://review.whamcloud.com/#change,1150 |
| Comment by Build Master (Inactive) [ 02/Aug/11 ] |
|
Integrated in Oleg Drokin : 167f2a4ec9c577fcafa07ac5356708c3cc09bdea
|
| Comment by Yang Sheng [ 25/Oct/11 ] |
|
Closing as fixed. |