<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:35:06 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-3577] BUG: soft lockup - CPU#25 stuck for 67s! [jbd2/dm-8-8:8966]; Kernel panic - not syncing: softlockup: hung tasks</title>
                <link>https://jira.whamcloud.com/browse/LU-3577</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;We have a Lustre 2.1.5 system with two MDSes (active / standby), and two OSSes (active / active).  Each OSS has 6 OSTs.&lt;/p&gt;

&lt;p&gt;We filled the file system to 100%.  To remove the files, one Lustre client ran the following script:&lt;/p&gt;

&lt;p&gt;rm -rf /mnt/hss45/ost/ost-00/* &amp;amp;&lt;br/&gt;
rm -rf /mnt/hss45/ost/ost-01/* &amp;amp;&lt;br/&gt;
rm -rf /mnt/hss45/ost/ost-02/* &amp;amp;&lt;br/&gt;
rm -rf /mnt/hss45/ost/ost-03/* &amp;amp;&lt;br/&gt;
rm -rf /mnt/hss45/ost/ost-04/* &amp;amp;&lt;br/&gt;
rm -rf /mnt/hss45/ost/ost-05/* &amp;amp;&lt;br/&gt;
rm -rf /mnt/hss45/ost/ost-06/* &amp;amp;&lt;br/&gt;
rm -rf /mnt/hss45/ost/ost-07/* &amp;amp;&lt;br/&gt;
rm -rf /mnt/hss45/ost/ost-08/* &amp;amp;&lt;br/&gt;
rm -rf /mnt/hss45/ost/ost-09/* &amp;amp;&lt;br/&gt;
rm -rf /mnt/hss45/ost/ost-10/* &amp;amp;&lt;br/&gt;
rm -rf /mnt/hss45/ost/ost-11/* &amp;amp;&lt;/p&gt;


&lt;p&gt;One OSS crashed with this error:  &lt;br/&gt;
BUG: soft lockup - CPU#25 stuck for 67s! [jbd2/dm-8-8:8966]&lt;br/&gt;
. . .&lt;br/&gt;
Kernel panic - not syncing: softlockup: hung tasks&lt;/p&gt;


&lt;p&gt;The OSS was STONITH&apos;ed.&lt;/p&gt;

&lt;p&gt;Shortly thereafter, the second OSS got the same error:&lt;/p&gt;

&lt;p&gt;BUG: soft lockup - CPU#17 stuck for 67s! [jbd2/dm-6-8:21440]&lt;br/&gt;
Kernel panic - not syncing: softlockup: hung tasks&lt;/p&gt;


&lt;p&gt;I have attached the full console output. There was nothing in /var/log/messages.&lt;/p&gt;

</description>
                <environment>Kernel: 2.6.32-279.19.1.el6_lustre.2.1.5_1.0.3&lt;br/&gt;
Distro: CentOS release 6.4 (Final)&lt;br/&gt;
CPUs: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz&lt;br/&gt;
32 Cores&lt;br/&gt;
128GB RAM on MDSes; 64GB RAM on OSSes&lt;br/&gt;
Active/Standby MDS&lt;br/&gt;
Two OSSes&lt;br/&gt;
12 OSTs&lt;br/&gt;
Each OST is 22T</environment>
        <key id="19780">LU-3577</key>
            <summary>BUG: soft lockup - CPU#25 stuck for 67s! [jbd2/dm-8-8:8966]; Kernel panic - not syncing: softlockup: hung tasks</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="2">Won&apos;t Fix</resolution>
                                        <assignee username="pjones">Peter Jones</assignee>
                                    <reporter username="rspellman">Roger Spellman</reporter>
                        <labels>
                    </labels>
                <created>Thu, 11 Jul 2013 17:40:45 +0000</created>
                <updated>Wed, 21 Mar 2018 15:07:15 +0000</updated>
                            <resolved>Wed, 21 Mar 2018 15:07:15 +0000</resolved>
                                    <version>Lustre 2.1.5</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>2</watches>
                                                                            <comments>
                            <comment id="62144" author="pjones" created="Thu, 11 Jul 2013 18:31:27 +0000"  >&lt;p&gt;Thanks for the report Roger. Given that you are running RHEL 6.4 is there any reason you chose 2.1.5 over 2.1.6? Other than rebuilding for RHEL6.4, are there any other changes you made from a standard 2.1.5?&lt;/p&gt;</comment>
                            <comment id="62153" author="rspellman" created="Thu, 11 Jul 2013 20:02:38 +0000"  >&lt;p&gt;Peter,&lt;/p&gt;

&lt;p&gt;We are using 2.1.5 because we started this project a couple of months ago (before 2.1.6 was released), and we have promised a 2.1.x release to a customer pretty soon.  We are pretty far into our QA cycle, so switching Lustre versions right now would set us back a bit.&lt;/p&gt;

&lt;p&gt;We will go to 2.1.6 very soon.  But, if you say that this is a known bug in 2.1.5 that is fixed in 2.1.6, that will push us to 2.1.6 even sooner.&lt;/p&gt;

&lt;p&gt;We make changes to configure scripts and Makefiles, so that we can build on our build machine.  &lt;/p&gt;

&lt;p&gt;We make some minor functional changes to the code (we made them some time ago, in earlier releases).  Here are the patches that are functional changes.&lt;/p&gt;

&lt;p&gt;diff -rcN -x &apos;*~&apos; -x &apos;*.orig&apos; /build/lustre/lustre-2.1.5/lustre/ldlm/ldlm_pool.c 2.1.5/trunk/lustre-working_lustre.patch/lustre/ldlm/ldlm_pool.c&lt;/p&gt;

&lt;pre&gt;*** /build/lustre/lustre-2.1.5/lustre/ldlm/ldlm_pool.c  Tue Apr 30 10:34:06 2013
--- 2.1.5/trunk/lustre-working_lustre.patch/lustre/ldlm/ldlm_pool.c     Tue Jun 18 15:23:14 2013
***************
*** 143,149 ****
  /*
   * Max age for locks on clients.
   */
! #define LDLM_POOL_MAX_AGE (36000)

  /*
   * The granularity of SLV calculation.
--- 143,157 ----
  /*
   * Max age for locks on clients.
   */
! //#define LDLM_POOL_MAX_AGE (36000)
! /*
!  * Max age for locks on clients.
!  * Terascala: Set to default 2 minute max age
!  *      Units are seconds.
!  *      This actually kicks in lru eviction after 7 minutes at this setting.
!  */
! static u_int32_t ldlm_pool_max_age=120;
!

  /*
   * The granularity of SLV calculation.
***************
*** 162,171 ****
  static inline __u64 ldlm_pool_slv_max(__u32 L)
  {
          /*
!          * Allow to have all locks for 1 client for 10 hrs.
!          * Formula is the following: limit * 10h / 1 client.
           */
!         __u64 lim = (__u64)L *  LDLM_POOL_MAX_AGE / 1;
          return lim;
  }

--- 170,179 ----
  static inline __u64 ldlm_pool_slv_max(__u32 L)
  {
          /*
!          * Allow to have all locks for 1 client for 10 minutes.
!          * Formula is the following: limit * 2 min / 1 client.
           */
!         __u64 lim = (__u64)L *  ldlm_pool_max_age / 1; /* Terascala */
          return lim;
  }

***************
*** 805,810 ****
--- 813,825 ----
          pool_vars[0].write_fptr = lprocfs_wr_atomic;
          lprocfs_add_vars(pl-&amp;gt;pl_proc_dir, pool_vars, 0);

+         /* Terascala */
+         snprintf(var_name, MAX_STRING_SIZE, &quot;ldlm_pool_max_age&quot;);
+         pool_vars[0].data = &amp;amp;ldlm_pool_max_age;
+         pool_vars[0].read_fptr = lprocfs_rd_uint;
+         pool_vars[0].write_fptr = lprocfs_wr_uint;
+         lprocfs_add_vars(pl-&amp;gt;pl_proc_dir, pool_vars, 0);
+
          snprintf(var_name, MAX_STRING_SIZE, &quot;state&quot;);
          pool_vars[0].data = pl;
          pool_vars[0].read_fptr = lprocfs_rd_pool_state;
&lt;/pre&gt;

&lt;p&gt;diff -rcN -x &apos;*~&apos; -x &apos;*.orig&apos; /build/lustre/lustre-2.1.5/lustre/liblustre/super.c 2.1.5/trunk/lustre-working_lustre.patch/lustre/liblustre/super.c&lt;/p&gt;
&lt;pre&gt;*** /build/lustre/lustre-2.1.5/lustre/liblustre/super.c Tue Apr 30 10:34:12 2013
--- 2.1.5/trunk/lustre-working_lustre.patch/lustre/liblustre/super.c    Tue Jun 18 15:23:14 2013
***************
*** 141,146 ****
--- 141,147 ----
          struct mdt_body *body = md-&amp;gt;body;
          struct lov_stripe_md *lsm = md-&amp;gt;lsm;
          struct intnl_stat *st = llu_i2stat(inode);
+         struct ll_sb_info *sbi = ll_i2sbi(inode);

          LASSERT ((lsm != NULL) == ((body-&amp;gt;valid &amp;amp; OBD_MD_FLEASIZE) != 0));

***************
*** 181,187 ****
                  lli-&amp;gt;lli_lvb.lvb_ctime = body-&amp;gt;ctime;
          }
          if (S_ISREG(st-&amp;gt;st_mode))
!                 st-&amp;gt;st_blksize = min(2UL * PTLRPC_MAX_BRW_SIZE, LL_MAX_BLKSIZE);
          else
                  st-&amp;gt;st_blksize = 4096;
          if (body-&amp;gt;valid &amp;amp; OBD_MD_FLUID)
--- 182,188 ----
                  lli-&amp;gt;lli_lvb.lvb_ctime = body-&amp;gt;ctime;
          }
          if (S_ISREG(st-&amp;gt;st_mode))
!                 st-&amp;gt;st_blksize = min(2UL * PTLRPC_MAX_BRW_SIZE, 1UL&amp;lt;&amp;lt; sbi-&amp;gt;ll_max_blksize_bits);
          else
                  st-&amp;gt;st_blksize = 4096;
          if (body-&amp;gt;valid &amp;amp; OBD_MD_FLUID)
&lt;/pre&gt;


&lt;p&gt;Hope this helps.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="13144" name="bug.3" size="106639" author="rspellman" created="Thu, 11 Jul 2013 17:40:45 +0000"/>
                            <attachment id="13145" name="bug.4" size="79069" author="rspellman" created="Thu, 11 Jul 2013 17:40:45 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10490" key="com.atlassian.jira.plugin.system.customfieldtypes:datepicker">
                        <customfieldname>End date</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>Thu, 11 Jul 2013 17:40:45 +0000</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                            <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzvv6f:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9055</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10021"><![CDATA[2]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                        <customfield id="customfield_10493" key="com.atlassian.jira.plugin.system.customfieldtypes:datepicker">
                        <customfieldname>Start date</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>Thu, 11 Jul 2013 17:40:45 +0000</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                    </customfields>
    </item>
</channel>
</rss>