[LU-815] "BUG: unable to handle kernel NULL pointer dereference" in lprocfs_rd_import()

Details

Type: Bug
Resolution: Fixed
Priority: Major
Fix Version/s: Lustre 2.9.0
Affects Version/s: None
Labels:
None
Environment:
Lustre-2.0, RHEL6.0

Severity:
3
Bugzilla ID:
24449
Rank (Obsolete):
6530

Description

We've been hitting this problem for several months when reading "/proc/fs/lustre/osc/<OST>/import".

I saw there may be a related patch (BZ#22032 - WC's git: 839280926956f16552194fe803ba21096770ebc4) which was integrated for Lustre-2.1. What do you think of this? If 22032's patch is not related, does this sound like a known problem to you?

==============================================================================
BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
IP: [<ffffffffa0482d3d>] lprocfs_rd_import+0x32d/0x6b0 [obdclass]
PGD c7cf9f067 PUD ae9bcc067 PMD 0
Oops: 0000 1 SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:05.0/0000:05:00.0/infiniband/mlx4_0/ports/1/rate
CPU 5
Modules linked in: sit(U) tunnel4(U) lmv(U) mgc(U) lustre(U) lov(U) osc(U) mdc(U) lquota(U) fid(U) fld(U) ko2iblnd(U)
ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ib_cm(U)
ib_sa(U) ib_uverbs(U) ib_umad(U) mlx4_ib(U) mlx4_core(U) ib_mthca(U) ib_mad(U) ib_core(U) ipmi_devintf(U) ipmi_si(U)
ipmi_msghandler(U) iptable_filter(U) ip_tables(U) x_tables(U) nfs(U) lockd(U) fscache(U) nfs_acl(U) auth_rpcgss(U) sunrpc(U)
acpi_cpufreq(U) freq_table(U) vtune_drv(U) autofs4(U) ipv6(U) sg(U) i7core_edac(U) edac_core(U) i2c_i801(U) i2c_core(U)
igb(U) ioatdma(U) dca(U) iTCO_wdt(U) iTCO_vendor_support(U) ext3(U) jbd(U) mbcache(U) sd_mod(U) crc_t10dif(U) usbhid(U)
hid(U) ehci_hcd(U) ahci(U) uhci_hcd(U) dm_mod(U) [last unloaded: libcfs]

Pid: 29413, comm: grep Not tainted 2.6.32-30.el6.Bull.14.x86_64 #1 bullx super-node
RIP: 0010:[<ffffffffa0482d3d>] [<ffffffffa0482d3d>] lprocfs_rd_import+0x32d/0x6b0 [obdclass]
RSP: 0018:ffff8806e57ffd78 EFLAGS: 00010206
RAX: 0000000000000000 RBX: ffff880c7db5a000 RCX: 0000000000000038
RDX: ffff880c6fd42105 RSI: 00000000fffffffe RDI: 0000000000000013
RBP: ffff8806e57ffe38 R08: 0000000000000000 R09: 00000000fffffffe
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000105 R14: 0000000000000000 R15: 0000000000001000
FS: 00002b8d09d85f60(0000) GS:ffff88088e440000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000018 CR3: 00000009da84e000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process grep (pid: 29413, threadinfo ffff8806e57fe000, task ffff8807bf266c50)
Stack:
0000000000000000 ffffea0029a6ddf8 00000200011cce48 00000010e68896a0
<0> ffff880a4323e948 ffff880c7db5a000 ffff880a4323e438 ffff880c6fd42000
<0> ffff8806e57ffde8 ffff880880001d80 ffff880c7d7c2300 00000000000000d0
Call Trace:
[<ffffffff8113e377>] ? alloc_pages_current+0x87/0xd0
[<ffffffffa0480651>] lprocfs_fops_read+0xd1/0x1e0 [obdclass]
[<ffffffff811b6a36>] proc_reg_read+0x76/0xb0
[<ffffffff81157f55>] vfs_read+0xb5/0x1a0
[<ffffffff810c5282>] ? audit_syscall_entry+0x252/0x280
[<ffffffff81158091>] sys_read+0x51/0x90
[<ffffffff8100c172>] system_call_fastpath+0x16/0x1b
Code: 18 08 75 a2 48 8b 9d 68 ff ff ff 66 ff 83 78 02 00 00 48 8b 43 60 44 8b 83 28 02 00 00 44 8b b3 14 01 00 00 44
8b a3 24 02 00 00 <48> 8b 78 18 44 89 85 58 ff ff ff e8 d3 5e dc ff 49 63 fd 48 03
RIP [<ffffffffa0482d3d>] lprocfs_rd_import+0x32d/0x6b0 [obdclass]
RSP <ffff8806e57ffd78>
==============================================================================

Further in-depth analysis clearly indicates that this problem comes from a race between a process reading the
"/proc/fs/lustre/osc/<OST>/import" special file via the lprocfs layer and other Lustre layers dealing with
imports.
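
As a reading aid for the trace above: RAX is 0, CR2 is 0x18, and the faulting bytes `<48> 8b 78 18` decode to a 64-bit load from 0x18(%rax) into %rdi, that is, a read of a member located 0x18 bytes into a structure reached through a NULL pointer. The sketch below only reproduces that arithmetic in user-space C. `struct demo_connection`, its fields, and both helper functions are hypothetical names invented for illustration; they are not the actual Lustre structure layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the structure reached through the NULL
 * pointer. The layout is an assumption; the only property that matters
 * here is a member located 0x18 bytes from the start. */
struct demo_connection {
    uint64_t pad[3];    /* 24 bytes of preceding members (assumed) */
    uint64_t peer_nid;  /* member at offset 0x18 */
};

/* Address the CPU touches when loading a member through `base`. */
uintptr_t faulting_address(uintptr_t base, size_t member_offset)
{
    return base + member_offset;
}

/* With a NULL base pointer, the faulting address equals the member
 * offset, which is what CR2 reports in the oops. */
uintptr_t demo_fault_address_for_null(void)
{
    assert(offsetof(struct demo_connection, peer_nid) == 0x18);
    return faulting_address(0, offsetof(struct demo_connection, peer_nid));
}
```

Calling `demo_fault_address_for_null()` yields 0x18, matching the "NULL pointer dereference at 0000000000000018" line. A small non-zero fault address like this usually points to a NULL structure pointer rather than a wild one.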

Thanks,

Activity

Bruno Faccini (Inactive) added a comment - 05/Oct/12 2:18 AM

Oops, thanks to Andreas for asking me to review the code and patches top-down, starting with the master branch! And bingo, a similar patch has already been applied starting with b2_3. It comes from JIRA LU-1448, where the same issue was found for disabled OSCs, whereas in our case it also happens during OSC mount!

Patch on master is at http://review.whamcloud.com/2995, so it needs to be cherry-picked from there to be applied to b2_1/b2_2 branches.

In the meantime, should I "Abandon" my change on Gerrit and point to the master change?

Bruno Faccini (Inactive) added a comment - 04/Oct/12 2:07 PM

Hmm, even when running Lustre 2.1.1 (including the fix for BZ#22032) we can still reproduce the same crash/Oops! So I would like to re-open this JIRA ...

Again, the crash is due to imp->imp_connection being NULL and being dereferenced in lprocfs_rd_import().

So I am back to my earlier fix idea, which Bull R&D did not choose at that time in favor of BZ#22032: imp->imp_connection must also be accessed under imp->imp_lock protection, and a NULL value must be detected.
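
The idea described above can be sketched in user-space C as follows. This is a minimal illustration only, not the patch under review: a pthread mutex stands in for the kernel lock, and `struct demo_import`, `struct demo_connection`, `demo_import_init()`, `demo_set_connection()` and `demo_rd_import()` are hypothetical names. Only `imp_lock` and `imp_connection` are taken from the comment.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for the real structures. */
struct demo_connection {
    char remote_uuid[40];
};

struct demo_import {
    pthread_mutex_t imp_lock;                /* stands in for imp->imp_lock */
    struct demo_connection *imp_connection;  /* may be NULL at any time */
};

void demo_import_init(struct demo_import *imp)
{
    assert(imp != NULL);
    memset(imp, 0, sizeof(*imp));            /* imp_connection starts NULL */
    pthread_mutex_init(&imp->imp_lock, NULL);
}

/* Writer side: connection set or cleared by other layers, under the lock. */
void demo_set_connection(struct demo_import *imp, struct demo_connection *conn)
{
    pthread_mutex_lock(&imp->imp_lock);
    imp->imp_connection = conn;
    pthread_mutex_unlock(&imp->imp_lock);
}

/* Reader side: take the lock, test for NULL, and only then dereference.
 * Returns the snprintf() result for the formatted line. */
int demo_rd_import(struct demo_import *imp, char *buf, size_t len)
{
    int n;

    assert(imp != NULL && buf != NULL);
    pthread_mutex_lock(&imp->imp_lock);
    if (imp->imp_connection == NULL)
        n = snprintf(buf, len, "current_connection: <none>\n");
    else
        n = snprintf(buf, len, "current_connection: %s\n",
                     imp->imp_connection->remote_uuid);
    pthread_mutex_unlock(&imp->imp_lock);
    return n;
}
```

In this sketch both the NULL test and the dereference happen while the lock is held, so a writer cannot clear the pointer between the two steps. Checking for NULL without the lock would leave the same window open.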

Patch against b2_1 is at http://review.whamcloud.com/4187

Andreas Dilger added a comment - 15/Dec/11 8:00 PM

I'm going to mark this fixed in 2.1.0. Please reopen if the customer hits this problem again.

Sebastien Buisson (Inactive) added a comment - 15/Dec/11 11:16 AM

Yes, we integrated the proposed patch, and delivered it to the customer. But we do not have any feedback yet.

Peter Jones added a comment - 15/Dec/11 9:58 AM

Any feedback on this ticket? Have you been able to try the suggested fix yet? If not, when do you expect to be able to do so?

Diego Moreno (Inactive) added a comment - 03/Nov/11 7:43 AM

Actually I don't think the patch in 839280926956f16552194fe803ba21096770ebc4 is in the official 2.0.0 release. In git we can see this patch was introduced between 2.0.52.0 and 2.0.53.0 tags so the result shown by git describe is very strange.

We're going to integrate this patch in our 2.0.0 (which is the official release), and if we still have the problem we'll try with LU-615.

Thanks Andreas

Andreas Dilger added a comment - 02/Nov/11 3:58 PM

Looking at git for 839280926956f16552194fe803ba21096770ebc4, it definitely seems related, but "git describe" shows that this should be included into v2_0_0-rc1a, which means it should be in the Lustre 2.0.0 release already. Are you running the official 2.0.0 release, or some earlier build?

The other possibility is that this is related to the patch in http://review.whamcloud.com/1544 (LU-615), which is fixing the reads from .../import to avoid problems overflowing the page buffer. This patch has not yet been landed to the master (2.2) release branch, so I would recommend testing it first if you plan to apply it before testing and landing has completed.

People

Assignee: Andreas Dilger

Reporter: Lustre Bull (Inactive)

Votes: 0

Watchers: 3

Dates

Created: 02/Nov/11 12:54 PM

Updated: 08/Sep/16 4:16 AM

Resolved: 15/Dec/11 8:00 PM