Details
-
Bug
-
Resolution: Fixed
-
Major
-
None
-
None
-
Lustre-2.0, RHEL6.0
-
3
-
24,449
-
6530
Description
We've been hitting this problem for several months when reading "/proc/fs/lustre/osc/<OST>/import".
I saw there may be a related patch (BZ#22032, WC's git: 839280926956f16552194fe803ba21096770ebc4), which was integrated for Lustre-2.1. What do you think of it? If the patch for 22032 is not related, does this sound like a known problem to you?
==============================================================================
BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
IP: [<ffffffffa0482d3d>] lprocfs_rd_import+0x32d/0x6b0 [obdclass]
PGD c7cf9f067 PUD ae9bcc067 PMD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:05.0/0000:05:00.0/infiniband/mlx4_0/ports/1/rate
CPU 5
Modules linked in: sit(U) tunnel4(U) lmv(U) mgc(U) lustre(U) lov(U) osc(U) mdc(U) lquota(U) fid(U) fld(U) ko2iblnd(U)
ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ib_cm(U)
ib_sa(U) ib_uverbs(U) ib_umad(U) mlx4_ib(U) mlx4_core(U) ib_mthca(U) ib_mad(U) ib_core(U) ipmi_devintf(U) ipmi_si(U)
ipmi_msghandler(U) iptable_filter(U) ip_tables(U) x_tables(U) nfs(U) lockd(U) fscache(U) nfs_acl(U) auth_rpcgss(U) sunrpc(U)
acpi_cpufreq(U) freq_table(U) vtune_drv(U) autofs4(U) ipv6(U) sg(U) i7core_edac(U) edac_core(U) i2c_i801(U) i2c_core(U)
igb(U) ioatdma(U) dca(U) iTCO_wdt(U) iTCO_vendor_support(U) ext3(U) jbd(U) mbcache(U) sd_mod(U) crc_t10dif(U) usbhid(U)
hid(U) ehci_hcd(U) ahci(U) uhci_hcd(U) dm_mod(U) [last unloaded: libcfs]
Pid: 29413, comm: grep Not tainted 2.6.32-30.el6.Bull.14.x86_64 #1 bullx super-node
RIP: 0010:[<ffffffffa0482d3d>] [<ffffffffa0482d3d>] lprocfs_rd_import+0x32d/0x6b0 [obdclass]
RSP: 0018:ffff8806e57ffd78 EFLAGS: 00010206
RAX: 0000000000000000 RBX: ffff880c7db5a000 RCX: 0000000000000038
RDX: ffff880c6fd42105 RSI: 00000000fffffffe RDI: 0000000000000013
RBP: ffff8806e57ffe38 R08: 0000000000000000 R09: 00000000fffffffe
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000105 R14: 0000000000000000 R15: 0000000000001000
FS: 00002b8d09d85f60(0000) GS:ffff88088e440000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000018 CR3: 00000009da84e000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process grep (pid: 29413, threadinfo ffff8806e57fe000, task ffff8807bf266c50)
Stack:
0000000000000000 ffffea0029a6ddf8 00000200011cce48 00000010e68896a0
<0> ffff880a4323e948 ffff880c7db5a000 ffff880a4323e438 ffff880c6fd42000
<0> ffff8806e57ffde8 ffff880880001d80 ffff880c7d7c2300 00000000000000d0
Call Trace:
[<ffffffff8113e377>] ? alloc_pages_current+0x87/0xd0
[<ffffffffa0480651>] lprocfs_fops_read+0xd1/0x1e0 [obdclass]
[<ffffffff811b6a36>] proc_reg_read+0x76/0xb0
[<ffffffff81157f55>] vfs_read+0xb5/0x1a0
[<ffffffff810c5282>] ? audit_syscall_entry+0x252/0x280
[<ffffffff81158091>] sys_read+0x51/0x90
[<ffffffff8100c172>] system_call_fastpath+0x16/0x1b
Code: 18 08 75 a2 48 8b 9d 68 ff ff ff 66 ff 83 78 02 00 00 48 8b 43 60 44 8b 83 28 02 00 00 44 8b b3 14 01 00 00 44
8b a3 24 02 00 00 <48> 8b 78 18 44 89 85 58 ff ff ff e8 d3 5e dc ff 49 63 fd 48 03
RIP [<ffffffffa0482d3d>] lprocfs_rd_import+0x32d/0x6b0 [obdclass]
RSP <ffff8806e57ffd78>
==============================================================================
Further in-depth analysis clearly indicates that this problem comes from a race between a process reading the
"/proc/fs/lustre/osc/<OST>/import" special file via the lprocfs layer and other Lustre layers dealing with
imports.
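
For readers following the oops: the bytes just before the faulting instruction, "48 8b 43 60", decode to mov 0x60(%rbx),%rax, and the faulting bytes "<48> 8b 78 18" decode to mov 0x18(%rax),%rdi. With RAX = 0 this gives the fault address 0x18 reported in CR2, i.e. a pointer field loaded from the structure in RBX was NULL when the reader dereferenced it. The sketch below illustrates, in user-space C, the general shape of a guard against such a race. It is only an illustration under assumptions: demo_import, demo_conn and demo_read_import are hypothetical names, not the real Lustre structures or the actual patch.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins, NOT the real Lustre structures. */
struct demo_conn {
    char name[32];
};

struct demo_import {
    pthread_mutex_t lock;    /* protects 'conn' against concurrent updates */
    struct demo_conn *conn;  /* NULL while the import has no connection */
};

/*
 * Format the import state into buf.
 *
 * The racy pattern would read imp->conn->name directly, which faults
 * if another thread has not yet set (or has just cleared) imp->conn.
 * Here the pointer is only inspected and dereferenced while the lock
 * is held, and a NULL connection is reported instead of dereferenced.
 *
 * Returns the number of characters snprintf() would have written.
 */
int demo_read_import(struct demo_import *imp, char *buf, size_t len)
{
    int rc;

    assert(imp != NULL && buf != NULL && len > 0);

    pthread_mutex_lock(&imp->lock);
    if (imp->conn == NULL)
        rc = snprintf(buf, len, "connection: <none>\n");
    else
        rc = snprintf(buf, len, "connection: %s\n", imp->conn->name);
    pthread_mutex_unlock(&imp->lock);

    return rc;
}
```

The writer side (whatever sets or clears the pointer) would have to take the same lock for this to be safe; a kernel implementation would more likely use a spinlock or a reference count, which is an assumption here rather than something stated in this ticket.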
Thanks,
Oops, thanks to Andreas for asking me to review the code and patches top-down, starting with the master branch! And bingo, a similar patch has already been applied starting with b2_3. It comes from JIRA LU-1448, where the same issue was found for disabled OSCs, whereas in our case it also happens during OSC mount. The patch on master is at http://review.whamcloud.com/2995, so it needs to be cherry-picked from there to be applied to the b2_1/b2_2 branches.
In the meantime, should I "Abandon" my change on Gerrit and point to the master change?