<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:17:43 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-15368] Monitoring changelog_size is slowing down changelogs operations</title>
                <link>https://jira.whamcloud.com/browse/LU-15368</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;We have noticed that monitoring changelog_size on MDS can significantly slow down changelogs operations and eventually even the MDS in general.&lt;/p&gt;

&lt;p&gt;We have a script that monitors changelog_size every minute to display the size per MDT with Grafana (we also have an alert via Grafana when a changelog_size grows too much). The script basically does the following:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@fir-md1-s3 ~]# lctl get_param mdd.*-MDT*.changelog_size
mdd.fir-MDT0002.changelog_size=305538455808
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Recently, a user launched massive jobs writing many small files at 30K+ creates/s for hours, resulting in a backlog of changelogs on MDT0002: changelog_size increased dramatically and the monitoring script took longer and longer to get the value of changelog_size (more than 1 minute). This resulted in a slowdown of changelog commits as seen from Robinhood and eventually also a general high load on the MDS and slow operations (up to 5s for a chownat).&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@fir-md1-s3 ~]# uptime
 09:35:21 up 25 days, 23:30,  2 users,  load average: 101.05, 97.57, 100.55
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;I even noticed one watchdog backtrace at some point:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Dec 13 08:49:02 fir-md1-s3 kernel: Pid: 24707, comm: mdt00_062 3.10.0-1160.45.1.el7_lustre.pl1.x86_64 #1 SMP Wed Nov 10 23:41:33 PST 2021
Dec 13 08:49:02 fir-md1-s3 kernel: Call Trace:
Dec 13 08:49:02 fir-md1-s3 kernel:  [&amp;lt;ffffffffa7798387&amp;gt;] call_rwsem_down_write_failed+0x17/0x30
Dec 13 08:49:02 fir-md1-s3 kernel:  [&amp;lt;ffffffffc0ecf68f&amp;gt;] llog_cat_id2handle+0x7f/0x620 [obdclass]
Dec 13 08:49:02 fir-md1-s3 kernel:  [&amp;lt;ffffffffc0ed0778&amp;gt;] llog_cat_cancel_records+0x128/0x3d0 [obdclass]
Dec 13 08:49:02 fir-md1-s3 kernel:  [&amp;lt;ffffffffc1b01328&amp;gt;] llog_changelog_cancel_cb+0x1d8/0x5b0 [mdd]
Dec 13 08:49:02 fir-md1-s3 kernel:  [&amp;lt;ffffffffc0eca5af&amp;gt;] llog_process_thread+0x85f/0x1a70 [obdclass]
Dec 13 08:49:02 fir-md1-s3 kernel:  [&amp;lt;ffffffffc0ecb87c&amp;gt;] llog_process_or_fork+0xbc/0x450 [obdclass]
Dec 13 08:49:02 fir-md1-s3 kernel:  [&amp;lt;ffffffffc0ed0c59&amp;gt;] llog_cat_process_cb+0x239/0x250 [obdclass]
Dec 13 08:49:02 fir-md1-s3 kernel:  [&amp;lt;ffffffffc0eca5af&amp;gt;] llog_process_thread+0x85f/0x1a70 [obdclass]
Dec 13 08:49:02 fir-md1-s3 kernel:  [&amp;lt;ffffffffc0ecb87c&amp;gt;] llog_process_or_fork+0xbc/0x450 [obdclass]
Dec 13 08:49:02 fir-md1-s3 kernel:  [&amp;lt;ffffffffc0ecd5e1&amp;gt;] llog_cat_process_or_fork+0x1e1/0x360 [obdclass]
Dec 13 08:49:02 fir-md1-s3 kernel:  [&amp;lt;ffffffffc0ecd78e&amp;gt;] llog_cat_process+0x2e/0x30 [obdclass]
Dec 13 08:49:02 fir-md1-s3 kernel:  [&amp;lt;ffffffffc1affa34&amp;gt;] llog_changelog_cancel.isra.16+0x54/0x1c0 [mdd]
Dec 13 08:49:02 fir-md1-s3 kernel:  [&amp;lt;ffffffffc1b02110&amp;gt;] mdd_changelog_llog_cancel+0xd0/0x270 [mdd]
Dec 13 08:49:03 fir-md1-s3 kernel:  [&amp;lt;ffffffffc1b05f33&amp;gt;] mdd_changelog_clear+0x653/0x7d0 [mdd]
Dec 13 08:49:03 fir-md1-s3 kernel:  [&amp;lt;ffffffffc1b08153&amp;gt;] mdd_iocontrol+0x163/0x540 [mdd]
Dec 13 08:49:03 fir-md1-s3 kernel:  [&amp;lt;ffffffffc198684c&amp;gt;] mdt_iocontrol+0x5ec/0xb00 [mdt]
Dec 13 08:49:03 fir-md1-s3 kernel:  [&amp;lt;ffffffffc19871e4&amp;gt;] mdt_set_info+0x484/0x490 [mdt]
Dec 13 08:49:03 fir-md1-s3 kernel:  [&amp;lt;ffffffffc125b89a&amp;gt;] tgt_request_handle+0xada/0x1570 [ptlrpc]
Dec 13 08:49:03 fir-md1-s3 kernel:  [&amp;lt;ffffffffc120073b&amp;gt;] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Dec 13 08:49:03 fir-md1-s3 kernel:  [&amp;lt;ffffffffc12040a4&amp;gt;] ptlrpc_main+0xb34/0x1470 [ptlrpc]
Dec 13 08:49:03 fir-md1-s3 kernel:  [&amp;lt;ffffffffa74c5e61&amp;gt;] kthread+0xd1/0xe0
Dec 13 08:49:03 fir-md1-s3 kernel:  [&amp;lt;ffffffffa7b95ddd&amp;gt;] ret_from_fork_nospec_begin+0x7/0x21
Dec 13 08:49:03 fir-md1-s3 kernel:  [&amp;lt;ffffffffffffffff&amp;gt;] 0xffffffffffffffff
Dec 13 08:49:03 fir-md1-s3 kernel: LustreError: dumping log to /tmp/lustre-log.1639414143.24707
Dec 13 08:49:11 fir-md1-s3 kernel: LNet: Service thread pid 24707 completed after 369.43s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Just stopping the monitoring script (thus no longer checking changelog_size on the MDS) resolved the issue: changelog commit rates immediately went up and the server load dropped back to a reasonable level.&lt;/p&gt;

&lt;p&gt;I wanted to raise this issue so it can be improved in the future. It&apos;s important for us that we can monitor the size of the changelogs stored on each MDT. It would be nice to have a more efficient way of doing so. Thanks!&lt;/p&gt;</description>
                <environment>CentOS 7.9</environment>
        <key id="67599">LU-15368</key>
            <summary>Monitoring changelog_size is slowing down changelogs operations</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="pjones">Peter Jones</assignee>
                                    <reporter username="sthiell">Stephane Thiell</reporter>
                        <labels>
                    </labels>
                <created>Mon, 13 Dec 2021 20:18:15 +0000</created>
                <updated>Thu, 16 Dec 2021 15:30:35 +0000</updated>
                                            <version>Lustre 2.12.7</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>7</watches>
                                                                            <comments>
                            <comment id="320840" author="adilger" created="Tue, 14 Dec 2021 10:40:51 +0000"  >&lt;p&gt;Taking a quick look at the patch &lt;a href=&quot;http://review.whamcloud.com/16416&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/16416&lt;/a&gt; &quot;&lt;tt&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7156&quot; title=&quot;Provide size of changelogs&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7156&quot;&gt;&lt;del&gt;LU-7156&lt;/del&gt;&lt;/a&gt; mdd: add changelog_size to procfs&lt;/tt&gt;&quot; that added the &lt;tt&gt;changelog_size&lt;/tt&gt; parameter, it looks like this is reprocessing each logfile to see if it can be cleaned up.  However, if there are a large number of log records, and the &lt;tt&gt;changelog_size&lt;/tt&gt; file is accessed frequently, this extra scanning overhead may be substantial.  &lt;/p&gt;

&lt;p&gt;It would make sense to put some limit on the amount of re-scanning done (e.g. do it once when the parameter is first accessed after a remount, or at most once per hour, or similar), so that repeated access to this parameter does not hurt MDS performance.  Without that, the checking of the sizes of the llog files should be very fast (stat of a few thousand llog files should be milliseconds on a flash MDT, maybe 10s on an HDD MDT).&lt;/p&gt;</comment>
                            <comment id="320864" author="lixi_wc" created="Tue, 14 Dec 2021 15:40:00 +0000"  >&lt;blockquote&gt;&lt;p&gt;do it once when the parameter is first accessed after a remount, or at most once per hour, or similar&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;So that means the changelog size read back might be a stale, asynchronously gathered value from an hour ago or so?&lt;/p&gt;

&lt;p&gt;I think there might be two different cases when reading the changelog size. One is estimation, which does not need a precise number and indeed fits well with this &quot;async&quot; mechanism.&lt;/p&gt;

&lt;p&gt;The other is when a precise number is wanted, for example to check whether the changelog is shrinking.&lt;/p&gt;

&lt;p&gt;To support the second use case properly, I think it makes sense to add a write operation to mdd_changelog_size, i.e.&lt;/p&gt;

&lt;p&gt;1) If reading from this proc entry, it will return the cached/async value of the changelog size.&lt;br/&gt;
2) If writing to this entry, it will rescan the changelog and refresh the value.&lt;/p&gt;</comment>
                            <comment id="320920" author="adilger" created="Wed, 15 Dec 2021 09:51:20 +0000"  >&lt;p&gt;I&apos;m not suggesting to cache the Changelog &lt;b&gt;size&lt;/b&gt; for such a long time, only to prevent processing and trying to clean up the Changelog files every time.  Even with an out-of-control Changelog producer, there should be at most a few thousand llog files to stat, so this shouldn&apos;t take minutes to finish.&lt;/p&gt;

&lt;p&gt;What &lt;em&gt;might&lt;/em&gt; be useful is to prevent multiple readers of &lt;tt&gt;changelog_size&lt;/tt&gt; from processing the catalogs/llogs at the same time (e.g. if polling happens every 60s, but it takes 90s to process all of the logs for some reason).  There should be a high-level mutex taken once llog traversal starts, and any reader arriving afterward should either block on the mutex, and then return the same size calculated by the first thread, or potentially return the &lt;em&gt;previous&lt;/em&gt; size that was computed.&lt;/p&gt;

&lt;p&gt;A further possible optimization would be to incrementally update a cached size (after the first read from the file traverses all of the logs).  This could be a percpu variable that is updated by the llog process for each record, or only incremented on a whole-file basis (e.g. add the size of whole llogs when they are finished, subtract the size of whole llogs when they are destroyed).&lt;/p&gt;</comment>
                            <comment id="320925" author="eaujames" created="Wed, 15 Dec 2021 12:05:15 +0000"  >&lt;p&gt;Hello,&lt;/p&gt;

&lt;p&gt;At the CEA, changelog_size is used to monitor how well we keep up with de-queuing changelogs with Robinhood. I don&apos;t know if this is the same for Stephane Thiell.&lt;br/&gt;
For this purpose the CEA also uses the difference between the current changelog id and the first changelog id per user (changelog_users) to monitor the delay between Robinhood and the MDT and to trigger an alert.&lt;/p&gt;

&lt;p&gt;This is necessary for the following reasons:&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;Robinhood stops processing records: Robinhood bug or client lag&lt;/li&gt;
	&lt;li&gt;Robinhood stops processing records because of a Lustre bug: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14158&quot; title=&quot;lfs changelog do not display old changelog after changelog_catalog  wrapped arround&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14158&quot;&gt;&lt;del&gt;LU-14158&lt;/del&gt;&lt;/a&gt;, &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-15280&quot; title=&quot;&amp;quot;lfs changelog --follow&amp;quot; does not support wrapped catalog&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-15280&quot;&gt;&lt;del&gt;LU-15280&lt;/del&gt;&lt;/a&gt;&lt;/li&gt;
	&lt;li&gt;Unable to clear changelogs efficiently (too many changelogs): &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14688&quot; title=&quot;Changelog cancel improvement&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14688&quot;&gt;&lt;del&gt;LU-14688&lt;/del&gt;&lt;/a&gt;, &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14606&quot; title=&quot;llog_changelog_cancel_cb returns ENOENT(-2)&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14606&quot;&gt;&lt;del&gt;LU-14606&lt;/del&gt;&lt;/a&gt;&lt;/li&gt;
	&lt;li&gt;Jobs create too many changelogs and the consumers can&apos;t keep up.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;So we need to monitor changelog usage before reaching the maximum number of changelog_catalog indexes (64767).&lt;br/&gt;
Also, when the changelog &quot;follow&quot; option is not used with several changelog users, the larger the gap between cl_users, the more changelog processing slows down. But we can&apos;t use the &quot;follow&quot; option because of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-15280&quot; title=&quot;&amp;quot;lfs changelog --follow&amp;quot; does not support wrapped catalog&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-15280&quot;&gt;&lt;del&gt;LU-15280&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;So for this purpose we don&apos;t really need the exact &quot;changelog_size&quot; but rather a kind of &quot;changelog_usage&quot;.&lt;br/&gt;
This usage can easily be found by retrieving the &quot;llh_count&quot; inside the changelog_catalog header (percent_usage = 100 * llh_count / LLOG_HDR_BITMAP_SIZE(llh)). An upper approximation of changelog_size can be computed as 11M * llh_count (the plain llog size can actually be limited to 2.1M if MDT space is low).&lt;br/&gt;
For now I use debugfs to dump the changelog_catalog and llog_reader to retrieve the llh_count.&lt;/p&gt;
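As a quick sketch of that arithmetic (the llh_count value below is a made-up example; it would normally be parsed from the llog_reader output):

```shell
# Estimate changelog catalog usage from llh_count (hypothetical example value).
llh_count=12000        # example; normally parsed from llog_reader output
bitmap_size=64767      # maximum number of changelog_catalog indexes
percent_usage=$(( 100 * llh_count / bitmap_size ))
approx_size_mb=$(( 11 * llh_count ))   # upper bound: ~11M per plain llog file
echo "usage=${percent_usage}% approx_size=${approx_size_mb}MB"
```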

&lt;p&gt;What do you think about adding a &quot;changelog_usage&quot; procfs?&lt;/p&gt;</comment>
                            <comment id="320933" author="eaujames" created="Wed, 15 Dec 2021 14:15:14 +0000"  >&lt;p&gt;For the changelog_size performance issue, I think the &lt;a href=&quot;https://review.whamcloud.com/#/c/43264/5/lustre/mdd/mdd_device.c&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/43264/5/lustre/mdd/mdd_device.c&lt;/a&gt; (&quot;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14606&quot; title=&quot;llog_changelog_cancel_cb returns ENOENT(-2)&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14606&quot;&gt;&lt;del&gt;LU-14606&lt;/del&gt;&lt;/a&gt; llog: hide ENOENT for cancelling record&quot;) might help.&lt;br/&gt;
This patch replaces &quot;llog_cat_cancel_records&quot; with &quot;RETURN(LLOG_DEL_RECORD);&quot; in llog_changelog_cancel_cb.&lt;br/&gt;
The problem with llog_cat_cancel_records is that it calls llog_cat_id2handle for each llog plain record.&lt;br/&gt;
llog_cat_id2handle has to read a lot of cathandle-&amp;gt;u.chd.chd_head entries (the llog_handle cache) under the llog_handle lock when there are many changelog entries. This consumes a lot of CPU time and conflicts directly with llog_cat_size_cb.&lt;/p&gt;</comment>
                            <comment id="320946" author="sthiell" created="Wed, 15 Dec 2021 17:11:06 +0000"  >&lt;p&gt;Thanks for the feedback! We are indeed using changelog_size for monitoring the de-queue process of changelogs per MDT with Robinhood (and also lauditd in some cases; we have multiple readers).&lt;/p&gt;

&lt;p&gt;We use SSDs on MDTs and they are not IO bound when this happens. Actually, when this happens, sar shows a drop in IOPS. It looks more like lock contention to me. At some point a single core is at 100%, so I believe Etienne&apos;s last comment is spot on. I will add the patch from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14606&quot; title=&quot;llog_changelog_cancel_cb returns ENOENT(-2)&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14606&quot;&gt;&lt;del&gt;LU-14606&lt;/del&gt;&lt;/a&gt; to our 2.12.7 when I get a chance (thanks Etienne for backporting it to b2_12). Note that this probably won&apos;t be before early January 2022 as we&apos;re trying to minimize changes on our systems before winter closure.&lt;/p&gt;

&lt;p&gt;I&apos;m not a big fan of a separate changelog_usage proc entry, which could add some confusion. Best would be to make the current proc entry efficient enough, and from what I read in this ticket, that looks doable.&lt;/p&gt;</comment>
                            <comment id="320969" author="adilger" created="Wed, 15 Dec 2021 20:54:50 +0000"  >&lt;p&gt;In theory it should also be possible to monitor the changelog usage just by the difference between the current and last-used changelog IDs?  That should be a good proxy for the changelog size (about 160 bytes per record), and would not slow down because of scanning for the size.&lt;/p&gt;</comment>
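As a sketch of that proxy calculation (the ID values below are made-up examples; in practice the indexes would come from the per-MDT changelog_users entries):

```shell
# Approximate changelog backlog from the distance between record IDs
# (hypothetical values; roughly 160 bytes per changelog record on average).
current_id=500000000   # current changelog index on the MDT (example)
reader_id=480000000    # index of the slowest registered changelog user (example)
bytes_per_record=160
backlog_bytes=$(( (current_id - reader_id) * bytes_per_record ))
echo "approx_backlog_bytes=${backlog_bytes}"
```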
                    </comments>
                    <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i02cev:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>