<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:11:38 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary, append 'field=key&field=summary' to the URL of your request (see the example request below).
-->
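<!--
For illustration only: assuming JIRA's standard issue-xml view path (which is
not part of this export and may differ per instance), a field-restricted
request for this issue might look like:

curl "https://jira.whamcloud.com/si/jira.issueviews:issue-xml/LU-912/LU-912.xml?field=key&field=summary"
-->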
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-912] OSS node(s) crash with Kernel oops</title>
                <link>https://jira.whamcloud.com/browse/LU-912</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Sorry if this is a duplicate, but I couldn&apos;t find a similar bug.&lt;/p&gt;

&lt;p&gt;Failure is restricted to OSS nodes and occurs as follows:&lt;/p&gt;

&lt;p&gt;1) One OSS node crashes. Heartbeat manages to move the resources over to the standby node smoothly.&lt;/p&gt;

&lt;p&gt;There&apos;s no indication of any IB errors in opensm.log, and no errors in /var/log/messages or /var/log/warn. No resource (CPU, memory, network, disk) is exhausted (I can provide the collectl files if needed). One thing worth noting is that the &apos;ldiskfs_inode_cache&apos; slab grows steadily to over 1 GB (numslabs, objects, size) until the node crashes. See the attached collectl excerpt output for the slabs.&lt;/p&gt;

&lt;p&gt;Anyway, we found the following message in the console log file (conman):&lt;/p&gt;

&lt;p&gt;jf92o05 login: BUG: unable to handle kernel NULL pointer dereference at 00000000000000c8&lt;br/&gt;
IP: [&amp;lt;ffffffffa09bbdbd&amp;gt;] ost_rw_prolong_locks+0x18d/0x460 [ost]&lt;br/&gt;
PGD 0 &lt;br/&gt;
Oops: 0000 [1] SMP &lt;br/&gt;
last sysfs file: /sys/kernel/uevent_seqnum&lt;br/&gt;
CPU 0 &lt;br/&gt;
Modules linked in: obdfilter(N) fsfilt_ldiskfs(N) ost(N) mgc(N) ldiskfs(N) lustre(N) lov(N) mdc(N) lquota(N) osc(N) ko2iblnd(N) ptlrpc(N) obdclass(N) lnet(N) lvfs(N) libcfs(N) quota_v2(N) quota_tree(N) jbd2(N) crc16(N) edd(N) nfs(N) lockd(N) nfs_acl(N) sunrpc(N) rdma_ucm(N) ib_sdp(N) rdma_cm(N) iw_cm(N) ib_addr(N) ib_ipoib(N) ib_cm(N) ib_sa(N) ipv6(N) ib_uverbs(N) ib_umad(N) iw_nes(N) libcrc32c(N) iw_cxgb3(N) cxgb3(N) ib_ipath(N) cpufreq_conservative(N) cpufreq_userspace(N) cpufreq_powersave(N) acpi_cpufreq(N) mlx4_ib(N) ib_mthca(N) ib_mad(N) ib_core(N) fuse(N) dm_crypt(N) crypto_blkcipher(N) loop(N) dm_round_robin(N) dm_multipath(N) scsi_dh(N) sr_mod(N) cdrom(N) ide_pci_generic(N) jmicron(N) ide_core(N) ata_generic(N) snd_hda_intel(N) thermal(N) snd_pcm(N) snd_timer(N) rtc_cmos(N) snd_page_alloc(N) ahci(N) processor(N) pata_jmicron(N) snd_hwdep(N) rtc_core(N) lpfc(N) libata(N) ses(N) thermal_sys(N) snd(N) rtc_lib(N) mlx4_core(N) pcspkr(N) i2c_i801(N) ohci1394(N) e1000e(N) serio_raw(N) enclosure(N) igb(N) soundcore(N) joydev(N) scsi_transport_fc(N) button(N) ieee1394(N) i2c_core(N) scsi_tgt(N) hwmon(N) dock(N) sg(N) linear(N) usbhid(N) hid(N) ff_memless(N) uhci_hcd(N) ehci_hcd(N) sd_mod(N) crc_t10dif(N) usbcore(N) dm_snapshot(N) dm_mod(N) ext3(N) jbd(N) mbcache(N) aacraid(N) scsi_mod(N) [last unloaded: libcfs]&lt;br/&gt;
Supported: No&lt;br/&gt;
Pid: 24183, comm: ll_ost_io_71 Tainted: G          2.6.27.39-0.1_lustre.1.8.4-default #1&lt;br/&gt;
RIP: 0010:[&amp;lt;ffffffffa09bbdbd&amp;gt;]  [&amp;lt;ffffffffa09bbdbd&amp;gt;] ost_rw_prolong_locks+0x18d/0x460 [ost]&lt;br/&gt;
RSP: 0018:ffff8805bbd3bd00  EFLAGS: 00010246&lt;br/&gt;
RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffff8805bbd3bd40&lt;br/&gt;
RDX: ffffffffa09bb480 RSI: ffff8805bbd3bd80 RDI: 0000000000000258&lt;br/&gt;
RBP: ffff8801d97c41b0 R08: 0000000000000006 R09: 0000000000000000&lt;br/&gt;
R10: ffff8805d0548c00 R11: ffff8805d9b5eb80 R12: 0000000000000006&lt;br/&gt;
R13: ffff8801d97c40c8 R14: ffff8802ba95dc00 R15: ffff8805bbd3bd40&lt;br/&gt;
FS:  00007fefa37f96f0(0000) GS:ffffffff80a33080(0000) knlGS:0000000000000000&lt;br/&gt;
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b&lt;br/&gt;
CR2: 00000000000000c8 CR3: 0000000000201000 CR4: 00000000000006e0&lt;br/&gt;
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000&lt;br/&gt;
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400&lt;br/&gt;
Process ll_ost_io_71 (pid: 24183, threadinfo ffff8805bbd3a000, task ffff8805bbd38100)&lt;br/&gt;
Stack:  ffffffff80a23680 0000000000000000 ffff88062a43e7c0 ffffffffa07fd790&lt;br/&gt;
 ffff8805bbd3be40 ffffffff80498e16 0000000000000000 ffffffffffffffff&lt;br/&gt;
 ffff880815a27e00 00000000138da000 00000000138dafff 0000000000000000&lt;br/&gt;
Call Trace:&lt;br/&gt;
 [&amp;lt;ffffffffa09bc1bb&amp;gt;] ost_rw_hpreq_check+0x12b/0x2b0 [ost]&lt;br/&gt;
 [&amp;lt;ffffffffa076c9c3&amp;gt;] ptlrpc_main+0xef3/0x15f0 [ptlrpc]&lt;br/&gt;
 [&amp;lt;ffffffff8020cf49&amp;gt;] child_rip+0xa/0x11&lt;/p&gt;


&lt;p&gt;2) Some time later, the node that took over the resources of the crashed node hangs, too.&lt;/p&gt;

&lt;p&gt;The situation in the log files and resource usage is the same (no resource is exhausted); the &apos;ldiskfs_inode_cache&apos; slabs increase continuously before the server crashes (hangs), but the allocation is not very high (~200 MB).&lt;/p&gt;

&lt;p&gt;The same message appears in the node&apos;s console log file, too:&lt;/p&gt;

&lt;p&gt;-Separator ---- Sun Dec 11 20:10:01 CET 2011 ----&lt;br/&gt;
general protection fault: 0000 [1] SMP &lt;br/&gt;
last sysfs file: /sys/kernel/uevent_seqnum&lt;br/&gt;
CPU 0 &lt;br/&gt;
Modules linked in: obdfilter(N) fsfilt_ldiskfs(N) ost(N) mgc(N) ldiskfs(N) lustre(N) lov(N) mdc(N) lquota(N) osc(N) ko2iblnd(N) ptlrpc(N) obdclass(N) lnet(N) lvfs(N) libcfs(N) quota_v2(N) quota_tree(N) jbd2(N) crc16(N) edd(N) nfs(N) lockd(N) nfs_acl(N) sunrpc(N) rdma_ucm(N) ib_sdp(N) rdma_cm(N) iw_cm(N) ib_addr(N) ib_ipoib(N) ib_cm(N) ib_sa(N) ipv6(N) ib_uverbs(N) ib_umad(N) iw_nes(N) libcrc32c(N) iw_cxgb3(N) cxgb3(N) ib_ipath(N) cpufreq_conservative(N) cpufreq_userspace(N) cpufreq_powersave(N) acpi_cpufreq(N) mlx4_ib(N) ib_mthca(N) ib_mad(N) ib_core(N) fuse(N) dm_crypt(N) crypto_blkcipher(N) loop(N) dm_round_robin(N) dm_multipath(N) scsi_dh(N) sr_mod(N) cdrom(N) ide_pci_generic(N) jmicron(N) ide_core(N) ata_generic(N) thermal(N) snd_hda_intel(N) snd_pcm(N) processor(N) snd_timer(N) ahci(N) pata_jmicron(N) rtc_cmos(N) snd_page_alloc(N) ses(N) lpfc(N) thermal_sys(N) ohci1394(N) libata(N) rtc_core(N) snd_hwdep(N) scsi_transport_fc(N) mlx4_core(N) enclosure(N) hwmon(N) i2c_i801(N) dock(N) joydev(N) rtc_lib(N) button(N) pcspkr(N) ieee1394(N) snd(N) serio_raw(N) igb(N) scsi_tgt(N) e1000e(N) soundcore(N) i2c_core(N) sg(N) linear(N) usbhid(N) hid(N) ff_memless(N) uhci_hcd(N) ehci_hcd(N) sd_mod(N) crc_t10dif(N) usbcore(N) dm_snapshot(N) dm_mod(N) ext3(N) jbd(N) mbcache(N) aacraid(N) scsi_mod(N) [last unloaded: libcfs]&lt;br/&gt;
Supported: No&lt;br/&gt;
Pid: 20502, comm: ll_ost_io_80 Tainted: G          2.6.27.39-0.1_lustre.1.8.4-default #1&lt;br/&gt;
RIP: 0010:[&amp;lt;ffffffffa075ce94&amp;gt;]  [&amp;lt;ffffffffa075ce94&amp;gt;] lustre_msg_buf+0x4/0x90 [ptlrpc]&lt;br/&gt;
RSP: 0000:ffff8805cf82bdb0  EFLAGS: 00010282&lt;br/&gt;
RAX: 0000000000000008 RBX: ffff88026b76a808 RCX: aaaaaaaaaaaaaaab&lt;br/&gt;
RDX: 0000000000000018 RSI: 0000000000000002 RDI: 5a5a5a5a5a5a5a5a&lt;br/&gt;
RBP: 0000000000000001 R08: ffff8805f0dae900 R09: 0000000000000000&lt;br/&gt;
R10: 000000004ee5023d R11: ffff880c2d53edc0 R12: ffff88026b76a800&lt;br/&gt;
R13: 0000000000000001 R14: ffff88026b76a800 R15: ffff8803067bc608&lt;br/&gt;
FS:  00007f03bd6456f0(0000) GS:ffffffff80a33080(0000) knlGS:0000000000000000&lt;br/&gt;
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b&lt;br/&gt;
CR2: 0000000001ab9348 CR3: 0000000000201000 CR4: 00000000000006e0&lt;br/&gt;
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000&lt;br/&gt;
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400&lt;br/&gt;
Process ll_ost_io_80 (pid: 20502, threadinfo ffff8805cf82a000, task ffff8805cf828800)&lt;br/&gt;
Stack:  ffff88026b76a800 ffff88026b76a808 ffff8805f5c6c800 ffff88026b76a808&lt;br/&gt;
 ffff8805f5c6c800 ffffffffa09b913b ffff88026b76a800 ffffffffa09bab0c&lt;br/&gt;
 0000000000000000 ffff8803067bc540 ffff8805f5c6c800 ffff88026b76a800&lt;br/&gt;
Call Trace:&lt;br/&gt;
 [&amp;lt;ffffffffa09b913b&amp;gt;] ost_rw_hpreq_check+0xab/0x2b0 [ost]&lt;br/&gt;
 [&amp;lt;ffffffffa07699c3&amp;gt;] ptlrpc_main+0xef3/0x15f0 [ptlrpc]&lt;br/&gt;
 [&amp;lt;ffffffff8020cf49&amp;gt;] child_rip+0xa/0x11&lt;/p&gt;

&lt;p&gt;This time the system was broken. After manually rebooting the second node, the system was operational again.&lt;/p&gt;

&lt;p&gt;The incident is &apos;restricted&apos; to two server node pairs and has been happening periodically for the last three weeks, approximately every 7 days (every weekend, but that might be by chance).&lt;/p&gt;</description>
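                <!--
                The slab growth described above was captured with collectl. As a sketch of an
                equivalent manual check (illustrative only, assuming a shell on the OSS; this
                command is not part of the original report), the cache can be polled directly
                from /proc/slabinfo:

                watch -n 60 "grep ldiskfs_inode_cache /proc/slabinfo"
                -->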
                <environment>MDS, MGS, OSS : SLES 11, Lustre 1.8.4 (oracle), kernel 2.6.27.39-0.1_lustre.1.8.4-default, OFED 1.4.2&lt;br/&gt;
INTERCONNECT  : Infiniband&lt;br/&gt;
&lt;br/&gt;
Server nodes are &amp;#39;coupled&amp;#39; in pairs of high-availability nodes with the help of Linux-HA (heartbeat-2.1.4-4.1)&lt;br/&gt;
The configuration is the same (apart from node names and UDP port) for all nodes:&lt;br/&gt;
&lt;br/&gt;
debugfile /var/log/ha-debug&lt;br/&gt;
logfile /var/log/ha-log&lt;br/&gt;
logfacility local0&lt;br/&gt;
keepalive 2&lt;br/&gt;
deadtime 90&lt;br/&gt;
warntime 30&lt;br/&gt;
initdead 180&lt;br/&gt;
udpport 10119&lt;br/&gt;
bcast eth0 ib0&lt;br/&gt;
auto_failback off&lt;br/&gt;
stonith_host jf92o05 external/ipmi jf92o06 jf92o06s ADMIN jadminsb lanplus&lt;br/&gt;
stonith_host jf92o06 external/ipmi jf92o05 jf92o05s ADMIN jadminsb lanplus&lt;br/&gt;
node jf92o05&lt;br/&gt;
node jf92o06&lt;br/&gt;
&lt;br/&gt;
CLIENTS              : SLES 11 SP1, Lustre 1.8.4 (oracle) patchless, kernel 2.6.32.23-0.3-default, OFED 1.4.2</environment>
        <key id="12645">LU-912</key>
            <summary>OSS node(s) crash with Kernel oops</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="2" iconUrl="https://jira.whamcloud.com/images/icons/priorities/critical.svg">Critical</priority>
                        <status id="6" iconUrl="https://jira.whamcloud.com/images/icons/statuses/closed.png" description="The issue is considered finished, the resolution is correct. Issues which are closed can be reopened.">Closed</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="3">Duplicate</resolution>
                                        <assignee username="wc-triage">WC Triage</assignee>
                                    <reporter username="heckes">Frank Heckes</reporter>
                        <labels>
                    </labels>
                <created>Mon, 12 Dec 2011 04:41:01 +0000</created>
                <updated>Thu, 29 Dec 2011 14:34:04 +0000</updated>
                            <resolved>Thu, 29 Dec 2011 14:33:55 +0000</resolved>
                                    <version>Lustre 1.8.x (1.8.0 - 1.8.5)</version>
                                    <fixVersion>Lustre 1.8.6</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>0</watches>
                                                                            <comments>
                            <comment id="24073" author="heckes" created="Mon, 12 Dec 2011 04:46:13 +0000"  >&lt;p&gt;SLAB info of first crashed node.&lt;/p&gt;</comment>
                            <comment id="24074" author="heckes" created="Mon, 12 Dec 2011 04:46:53 +0000"  >&lt;p&gt;SLAB info of second crashed node.&lt;/p&gt;</comment>
                            <comment id="24075" author="johann" created="Mon, 12 Dec 2011 05:22:23 +0000"  >&lt;p&gt;This looks like bugzilla 21804.&lt;/p&gt;</comment>
                            <comment id="24076" author="heckes" created="Mon, 12 Dec 2011 05:48:00 +0000"  >&lt;p&gt;Ok, we need to update to 1.8.6. Many thanks for pointer to the bugzilla. I&apos;m sorry for creating, yet another ticket.  Our problem is that we don&apos;t want to change the OS distribution (SLES 11), but that&apos;s a different story. &lt;/p&gt;

&lt;p&gt;You can close the ticket&lt;/p&gt;</comment>
                            <comment id="25277" author="brian" created="Thu, 29 Dec 2011 14:33:55 +0000"  >&lt;p&gt;Duplicate of Lustre Bugzilla bug 21804.&lt;/p&gt;</comment>
                </comments>
                <attachments>
                    <attachment id="10670" name="f92o06-slabs-sorted-by-totsize.dat" size="1197070" author="heckes" created="Mon, 12 Dec 2011 04:46:53 +0000"/>
                    <attachment id="10669" name="jf92o05-slabs-sorted-by-totsize.dat" size="459089" author="heckes" created="Mon, 12 Dec 2011 04:46:13 +0000"/>
                </attachments>
                <subtasks>
                </subtasks>
                <customfields>
                    <customfield id="customfield_10020" key="com.atlassian.jira.plugin.system.customfieldtypes:float">
                        <customfieldname>Bugzilla ID</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>21804.0</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzvhn3:</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>6507</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>