<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:46:40 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-4881] Lustre client: eviction on open/truncate</title>
                <link>https://jira.whamcloud.com/browse/LU-4881</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Lustre client: eviction on open/truncate&lt;br/&gt;
========================================&lt;/p&gt;

&lt;p&gt;lustre 2.4.2 and kernel 2.6.32_431.1.2&lt;/p&gt;

&lt;p&gt;At the CEA site, some applications opening the same file with O_TRUNC cause client evictions and EIO errors (Input/output error).&lt;br/&gt;
This happens with Lustre 2.4.2 but works correctly with Lustre 2.1.6.&lt;/p&gt;

&lt;p&gt;The support team wrote a reproducer (attached file open_truncate_evicted.c) that reproduces the same error with 2 Bull Mesca nodes (32 cores each).&lt;/p&gt;
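Since the attached open_truncate_evicted.c is not included in this export, below is a rough Python sketch (not the actual reproducer; the path, task count, and iteration count are invented for illustration) of the access pattern being described: several tasks concurrently re-opening one shared file with O_TRUNC, so each open issues a truncate (setattr) against the same object.

```python
# Hypothetical sketch of the reported access pattern; NOT the attached
# open_truncate_evicted.c reproducer. All names and counts are assumptions.
import multiprocessing
import os

PATH = "/tmp/shared_truncate_target"  # assumption: any path on the shared FS
ITERATIONS = 100                      # assumption: arbitrary loop count

def open_truncate_worker(rank):
    """Repeatedly open the shared file with O_TRUNC, forcing concurrent
    truncate (setattr) requests on the same file from every worker."""
    for _ in range(ITERATIONS):
        fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        os.write(fd, b"x" * 4096)  # small write after the truncate
        os.close(fd)

if __name__ == "__main__":
    # On the real clusters this role is played by 64 MPI tasks on 2 nodes;
    # here a handful of local processes stand in for them.
    procs = [multiprocessing.Process(target=open_truncate_worker, args=(r,))
             for r in range(8)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

On a local file system this simply churns the file; on the affected Lustre 2.4.2 clients, the equivalent C pattern leads to the evictions and EIO errors shown below.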

&lt;p&gt;Below are the customer backtraces and the corresponding client and server log messages.&lt;/p&gt;

&lt;p&gt;Next is a copy of the reproducer output, along with the associated console output.&lt;/p&gt;

&lt;p&gt;Customer backtraces and corresponding log messages&lt;br/&gt;
==================================================&lt;/p&gt;

&lt;p&gt;Backtrace of tasks trying to open the file:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;PID: 22464 TASK: ffff880b13ad3500 CPU: 25 COMMAND: &quot;XYZ_mpi&quot;
 #0 [ffff880b13ad5938] schedule at ffffffff81528762
 #1 [ffff880b13ad5a00] cfs_waitq_wait at ffffffffa03686fe [libcfs]
 #2 [ffff880b13ad5a10] cl_lock_state_wait at ffffffffa04d94fa [obdclass]
 #3 [ffff880b13ad5a90] cl_enqueue_locked at ffffffffa04d9ceb [obdclass]
 #4 [ffff880b13ad5ad0] cl_lock_request at ffffffffa04da86e [obdclass]
 #5 [ffff880b13ad5b30] cl_io_lock at ffffffffa04dfb0c [obdclass]
 #6 [ffff880b13ad5b90] cl_io_loop at ffffffffa04dfd42 [obdclass]
 #7 [ffff880b13ad5bc0] cl_setattr_ost at ffffffffa0a830d8 [lustre]
 #8 [ffff880b13ad5c20] ll_setattr_raw at ffffffffa0a511fe [lustre]
 #9 [ffff880b13ad5cc0] ll_setattr at ffffffffa0a5188b [lustre]
#10 [ffff880b13ad5cd0] notify_change at ffffffff811a79f8
#11 [ffff880b13ad5d40] do_truncate at ffffffff811876c4
#12 [ffff880b13ad5db0] do_filp_open at ffffffff8119c371
#13 [ffff880b13ad5f20] do_sys_open at ffffffff81186389
#14 [ffff880b13ad5f70] sys_open at ffffffff811864a0
#15 [ffff880b13ad5f80] system_call_fastpath at ffffffff8100b072
    RIP: 00007f959ee2e55d RSP: 00007fff6f9929e8 RFLAGS: 00010206
    RAX: 0000000000000002 RBX: ffffffff8100b072 RCX: 0000000000000006
    RDX: 00000000000001b6 RSI: 0000000000000241 RDI: 00007f95a32a173f
    RBP: 00007fff6f992940 R8: 00007f959f3c0489 R9: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000293 R12: ffffffff811864a0
    R13: ffff880b13ad5f78 R14: 00007f959f5eab00 R15: 0000000000000004
    ORIG_RAX: 0000000000000002 CS: 0033 SS: 002b
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The 11 hanging tasks waiting for the lock:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;PID: 22467 TASK: ffff880b756d3540 CPU: 6 COMMAND: &quot;XYZ_mpi&quot;
 #0 [ffff880b756d5a58] schedule at ffffffff81528762
 #1 [ffff880b756d5b20] rwsem_down_failed_common at ffffffff8152ae25
 #2 [ffff880b756d5b80] rwsem_down_write_failed at ffffffff8152af83
 #3 [ffff880b756d5bc0] call_rwsem_down_write_failed at ffffffff8128ef23
 #4 [ffff880b756d5c20] ll_setattr_raw at ffffffffa0a509cc [lustre]
 #5 [ffff880b756d5cc0] ll_setattr at ffffffffa0a5188b [lustre]
 #6 [ffff880b756d5cd0] notify_change at ffffffff811a79f8
 #7 [ffff880b756d5d40] do_truncate at ffffffff811876c4
 #8 [ffff880b756d5db0] do_filp_open at ffffffff8119c371
 #9 [ffff880b756d5f20] do_sys_open at ffffffff81186389
#10 [ffff880b756d5f70] sys_open at ffffffff811864a0
#11 [ffff880b756d5f80] system_call_fastpath at ffffffff8100b072
    RIP: 00007f16e686655d RSP: 00007fff8ef09148 RFLAGS: 00010206
    RAX: 0000000000000002 RBX: ffffffff8100b072 RCX: 0000000000000006
    RDX: 00000000000001b6 RSI: 0000000000000241 RDI: 00007f16eacd973f
    RBP: 00007fff8ef090a0 R8: 00007f16e6df8489 R9: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000293 R12: ffffffff811864a0
    R13: ffff880b756d5f78 R14: 00007f16e7022b00 R15: 0000000000000004
    ORIG_RAX: 0000000000000002 CS: 0033 SS: 002b
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Client log messages:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;LustreError: 11-0: fs3-OST00e4-osc-ffff88045c004000: Communicating with A.B.C.D@o2ib10, operation ost_punch failed with -107.
Lustre: fs3-OST00e4-osc-ffff88045c004000: Connection to scratch3-OST00e4 (at A.B.C.D@o2ib10) was lost; in progress operations using this service will wait for recovery to complete
Lustre: Skipped 2 previous similar messages
LustreError: 167-0: fs3-OST00e4-osc-ffff88045c004000: This client was evicted by fs3-OST00e4; in progress operations using this service will fail.
Lustre: fs3-OST00e4-osc-ffff88045c004000: Connection restored to fs3-OST00e4 (at A.B.C.D@o2ib10)
Lustre: Skipped 2 previous similar messages
Lustre: DEBUG MARKER: Wed Mar 5 15:15:01 2014

LustreError: 11-0: fs3-OST00e8-osc-ffff88045c004000: Communicating with A.B.C.D@o2ib10, operation obd_ping failed with -107.
LustreError: Skipped 1 previous similar message
LustreError: 167-0: fs3-OST00e8-osc-ffff88045c004000: This client was evicted by fs3-OST00e8; in progress operations using this service will fail.
LustreError: Skipped 1 previous similar message
LustreError: 3905:0:(ldlm_resource.c:804:ldlm_resource_complain()) fs3-OST00e8-osc-ffff88045c004000: namespace resource [0xe1b:0x0:0x0].0 (ffff8802cf99ce80) refcount nonzero (1) after lock cleanup; forcing cleanup.
LustreError: 3905:0:(ldlm_resource.c:1415:ldlm_resource_dump()) --- Resource: [0xe1b:0x0:0x0].0 (ffff8802cf99ce80) refcount = 2
Lustre: DEBUG MARKER: Wed Mar 5 15:20:01 2014
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Server log messages:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 202s: evicting client at A.B.C.E@o2ib10 ns: filter-fs3-OST00e4_UUID lock: ffff8810233b9b40/0xb9a496999f417c9b lrc: 3/0,0 mode:
PW/PW res: [0xe18:0x0:0x0].0 rrc: 4 type: EXT [0-&amp;gt;18446744073709551615] (req 0-&amp;gt;18446744073709551615) flags: 0x80010020 nid: A.B.C.E@o2ib10 remote: 0xf075be8a293a25dc expref: 4 pid: 24731 timeout: 4809968903 lvb_type: 0
LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) Skipped 11 previous similar messages
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;



&lt;p&gt;Reproducer output and associated console messages&lt;br/&gt;
=================================================&lt;/p&gt;

&lt;p&gt;Reproducer first run:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;clascar54% ls -rtl
total 85424
-rwxr-x---. 1 percher s8 85353840 Mar 6 15:34 initfile
-rw-r-----. 1 percher s8 2097152 Mar 6 15:41 test0000.txt

clascar54% srun --resv-ports -n 64 -N 2-2 ~/_cc/opentruncate/e 3
ERREUR dans open fd2: 5: Input/output error
ERREUR dans write fd2: 5: Input/output error
ERREUR dans open fd2: 5: Input/output error
srun: error: clascar4174: task 3: Exited with exit code 1
srun: Terminating job step 108683.0
slurmd[clascar4175]: *** STEP 108683.0 KILLED AT 2014-03-06T15:41:51 WITH SIGNAL 9 ***
slurmd[clascar4174]: *** STEP 108683.0 KILLED AT 2014-03-06T15:41:51 WITH SIGNAL 9 ***
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
slurmd[clascar4175]: *** STEP 108683.0 KILLED AT 2014-03-06T15:41:51 WITH SIGNAL 9 ***
slurmd[clascar4174]: *** STEP 108683.0 KILLED AT 2014-03-06T15:41:51 WITH SIGNAL 9 ***
srun: error: clascar4175: tasks 35,44: Exited with exit code 1
srun: error: clascar4174: tasks 0-2,4-31: Killed
srun: error: clascar4175: tasks 32-34,36-43,45-63: Killed
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Reproducer second run:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;clascar54% srun --resv-ports -n 64 -N 2-2 ~/_cc/opentruncate/e 3
ERREUR dans open fd2: 5: Input/output error
ERREUR dans open fd2: 5: Input/output error
ERREUR dans open fd2: 108: Cannot send after transport endpoint shutdown
srun: error: clascar4024: task 42: Exited with exit code 1
srun: Terminating job step 108699.0
slurmd[clascar4024]: *** STEP 108699.0 KILLED AT 2014-03-06T16:13:55 WITH SIGNAL 9 ***
slurmd[clascar4023]: *** STEP 108699.0 KILLED AT 2014-03-06T16:13:55 WITH SIGNAL 9 ***
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
srun: error: clascar4023: tasks 20-21: Exited with exit code 1
slurmd[clascar4023]: *** STEP 108699.0 KILLED AT 2014-03-06T16:13:55 WITH SIGNAL 9 ***
slurmd[clascar4024]: *** STEP 108699.0 KILLED AT 2014-03-06T16:13:55 WITH SIGNAL 9 ***
srun: error: clascar4023: tasks 0-19,22-31: Killed
srun: error: clascar4024: tasks 32-41,43-63: Killed
clascar54%
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Reproducer running on Lustre 2.1.6 (no error):&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;clascar53% srun --resv-ports -n 64 -N 2-2 ~/_cc/opentruncate/eae2 3
srun: job 262910 queued and waiting for resources
srun: job 262910 has been allocated resources
clascar53% srun --resv-ports -n 64 -N 2-2 ~/_cc/opentruncate/eae2 3
srun: job 262924 queued and waiting for resources
srun: job 262924 has been allocated resources
clascar53%
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Console output:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;clascar4174:
clascar4174: Lustre: DEBUG MARKER: Thu Mar 6 15:40:01 2014
clascar4174:
clascar4174: LustreError: 11-0: scratch3-OST00e1-osc-ffff8804658abc00: Communicating with JO.BOO.WL.LOT@o2ib10, operation ost_punch failed with -107.
clascar4174: Lustre: scratch3-OST00e1-osc-ffff8804658abc00: Connection to scratch3-OST00e1 (at JO.BOO.WL.LOT@o2ib10) was lost; in progress operations using this service will wait for recovery to complete
clascar4174: LustreError: 167-0: scratch3-OST00e1-osc-ffff8804658abc00: This client was evicted by scratch3-OST00e1; in progress operations using this service will fail.
clascar4174: Lustre: scratch3-OST00e1-osc-ffff8804658abc00: Connection restored to scratch3-OST00e1 (at JO.BOO.WL.LOT@o2ib10)
clascar4174: Lustre: DEBUG MARKER: Thu Mar 6 15:45:01 2014
clascar4174:


clascar4175: Lustre: DEBUG MARKER: Thu Mar 6 15:40:01 2014
clascar4175:
clascar4175: LustreError: 11-0: scratch3-OST00e2-osc-ffff880466736c00: Communicating with JO.BOO.WL.LOT@o2ib10, operation ldlm_enqueue failed with -107.
clascar4175: Lustre: scratch3-OST00e2-osc-ffff880466736c00: Connection to scratch3-OST00e2 (at JO.BOO.WL.LOT@o2ib10) was lost; in progress operations using this service will wait for recovery to complete
clascar4175: LustreError: 167-0: scratch3-OST00e2-osc-ffff880466736c00: This client was evicted by scratch3-OST00e2; in progress operations using this service will fail.
clascar4175: Lustre: scratch3-OST00e2-osc-ffff880466736c00: Connection restored to scratch3-OST00e2 (at JO.BOO.WL.LOT@o2ib10)
clascar4175: LustreError: 11-0: scratch3-OST00e1-osc-ffff880466736c00: Communicating with JO.BOO.WL.LOT@o2ib10, operation obd_ping failed with -107.
clascar4175: Lustre: scratch3-OST00e1-osc-ffff880466736c00: Connection to scratch3-OST00e1 (at JO.BOO.WL.LOT@o2ib10) was lost; in progress operations using this service will wait for recovery to complete
clascar4175: LustreError: 167-0: scratch3-OST00e1-osc-ffff880466736c00: This client was evicted by scratch3-OST00e1; in progress operations using this service will fail.
clascar4175: Lustre: scratch3-OST00e1-osc-ffff880466736c00: Connection restored to scratch3-OST00e1 (at JO.BOO.WL.LOT@o2ib10)
clascar4175: Lustre: DEBUG MARKER: Thu Mar 6 15:45:01 2014
clascar4175:
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</description>
                <environment>kernel 2.6.32_431.1.2</environment>
        <key id="24162">LU-4881</key>
            <summary>Lustre client: eviction on open/truncate</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="2" iconUrl="https://jira.whamcloud.com/images/icons/priorities/critical.svg">Critical</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="3">Duplicate</resolution>
                                        <assignee username="bfaccini">Bruno Faccini</assignee>
                                    <reporter username="patrick.valentin">Patrick Valentin</reporter>
                        <labels>
                    </labels>
                <created>Fri, 11 Apr 2014 13:50:57 +0000</created>
                <updated>Wed, 13 Oct 2021 03:01:10 +0000</updated>
                            <resolved>Wed, 13 Oct 2021 03:01:10 +0000</resolved>
                                    <version>Lustre 2.4.2</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>7</watches>
                                                                            <comments>
                            <comment id="81420" author="bfaccini" created="Fri, 11 Apr 2014 14:05:03 +0000"  >&lt;p&gt;Hello Patrick,&lt;br/&gt;
Hmm, this reminds me of an old problem ...&lt;br/&gt;
Just to fully qualify this problem, does the reproducer you attached always reproduce the issue in the CEA environment? If so, and if a Lustre debug log can be exported (??), could you run it again with full debug (or at least dlmtrace+rpctrace) enabled, a big debug log buffer configured, and dump_on_eviction set (or a manual debug log dump)?&lt;br/&gt;
In the meantime I will try to reproduce in-house.&lt;/p&gt;</comment>
                            <comment id="81424" author="lustre-bull" created="Fri, 11 Apr 2014 15:05:20 +0000"  >&lt;p&gt;Hi Bruno,&lt;/p&gt;

&lt;p&gt;&amp;gt;&amp;gt;  Humm, it seems it reminds me of an old problem ...&lt;/p&gt;

&lt;p&gt;Yes, &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-2380&quot; title=&quot;Hang and eviction scenario when multiple tasks/nodes do ftruncate() on the same file in parallel&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-2380&quot;&gt;&lt;del&gt;LU-2380&lt;/del&gt;&lt;/a&gt; was opened in November 2012 to report evictions when many tasks were issuing ftruncate() on the same file.&lt;br/&gt;
But that was with Lustre 2.1, and on-site support reports that this new issue only occurs with Lustre 2.4, with some customer applications as well as with the reproducer.&lt;/p&gt;

&lt;p&gt;&amp;gt;&amp;gt;  Just to fully qualify this problem, does the reproducer you attached always reproduce within CEA environment ?&lt;/p&gt;

&lt;p&gt;Yes, they wrote in the MANTIS report that it&apos;s systematic and easy to reproduce with 2 nodes on Lustre 2.4, and that they do not have the issue on 2.1 with the same reproducer (and application ???).&lt;br/&gt;
But it seems to be related to the CEA environment, because we are not able to reproduce this on our development cluster with the same 32-CPU Mesca nodes.&lt;/p&gt;

&lt;p&gt;&amp;gt;&amp;gt;  And if yes and a lustre debug-log can be exported (??), could it be possible to run it again with full debug (or at least dlmtrace+rpctrace) enabled, a big debug log buffer configured, and also with dump_on_eviction set (or manual debug log dump) ?&lt;br/&gt;
In the mean time I will try to reproduce in-house.&lt;/p&gt;

&lt;p&gt;I will ask them whether they think it is possible to export the logs, and will transmit your requirements (trace level and log buffer size).&lt;/p&gt;

&lt;p&gt;Patrick.&lt;/p&gt;</comment>
                            <comment id="81428" author="paf" created="Fri, 11 Apr 2014 15:31:09 +0000"  >&lt;p&gt;Patrick, Bruno - I haven&apos;t investigated this deeply, but from reading the description, this may be a duplicate of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4381&quot; title=&quot;clio deadlock from truncate&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4381&quot;&gt;&lt;del&gt;LU-4381&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I don&apos;t know if it&apos;s mentioned in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4381&quot; title=&quot;clio deadlock from truncate&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4381&quot;&gt;&lt;del&gt;LU-4381&lt;/del&gt;&lt;/a&gt;, but of course, the deadlock can lead to evictions.  We reported that (without truncate) in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4495&quot; title=&quot;client evicted on parallel append write to the shared file.&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4495&quot;&gt;&lt;del&gt;LU-4495&lt;/del&gt;&lt;/a&gt;, which was eventually marked as a duplicate to &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4381&quot; title=&quot;clio deadlock from truncate&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4381&quot;&gt;&lt;del&gt;LU-4381&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Not sure if the &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4381&quot; title=&quot;clio deadlock from truncate&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4381&quot;&gt;&lt;del&gt;LU-4381&lt;/del&gt;&lt;/a&gt; patch is in the code you&apos;re running or not.&lt;/p&gt;</comment>
                            <comment id="81634" author="jlevi" created="Tue, 15 Apr 2014 17:17:52 +0000"  >&lt;p&gt;Bruno,&lt;br/&gt;
Are you also investigating if this affects Master as well?&lt;/p&gt;</comment>
                            <comment id="81733" author="bfaccini" created="Wed, 16 Apr 2014 14:20:27 +0000"  >&lt;p&gt;Patrick F.: thanks for pointing to &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4381&quot; title=&quot;clio deadlock from truncate&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4381&quot;&gt;&lt;del&gt;LU-4381&lt;/del&gt;&lt;/a&gt;. This may be the same problem, but I would like to confirm it with more info/traces from the site, or by reproducing it in-house.&lt;/p&gt;

&lt;p&gt;Jodi: I will double-check if Master is affected.&lt;/p&gt;

&lt;p&gt;Patrick V./Lustre Bull: any news about the availability of full debug log from the site ?&lt;/p&gt;

&lt;p&gt;Also, I have just been able to reproduce in-house by running the reproducer with a default striping of 12, and only 16 tasks running on 2 nodes with only 8 cores each ...&lt;br/&gt;
I will update ASAP to detail what I find in the full debug traces I now have ...&lt;/p&gt;</comment>
                            <comment id="81736" author="patrick.valentin" created="Wed, 16 Apr 2014 14:43:10 +0000"  >&lt;p&gt;Hi Bruno,&lt;br/&gt;
We haven&apos;t yet received full debug logs from CEA.&lt;br/&gt;
We are preparing a new Lustre 2.4.3 for CEA with some fixes, and we are waiting for Peter Jones&apos; answer, in order to know whether we can also integrate &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4381&quot; title=&quot;clio deadlock from truncate&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4381&quot;&gt;&lt;del&gt;LU-4381&lt;/del&gt;&lt;/a&gt; patch #9152 (landed in master) into this 2.4 delivery.&lt;/p&gt;</comment>
                            <comment id="81847" author="bfaccini" created="Thu, 17 Apr 2014 16:28:50 +0000"  >&lt;p&gt;Patrick V.: did you try to reproduce on your development cluster with a default striping != 1, or one wide enough? And on the other hand, can you ask the site people whether their default striping is != 1 when they reproduce, and if so, have them run their reproducer with a default striping of 1?&lt;/p&gt;

&lt;p&gt;BTW, I am asking this because, even though I have not yet completed the analysis of the full log from my in-house occurrence (but I will!), I strongly suspect, as Patrick F. suggested, that this ticket is a dup of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4381&quot; title=&quot;clio deadlock from truncate&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4381&quot;&gt;&lt;del&gt;LU-4381&lt;/del&gt;&lt;/a&gt;/&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4495&quot; title=&quot;client evicted on parallel append write to the shared file.&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4495&quot;&gt;&lt;del&gt;LU-4495&lt;/del&gt;&lt;/a&gt;. And the associated patch may also fix the concurrent write-append issue seen at CEA (like the problem reported in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4495&quot; title=&quot;client evicted on parallel append write to the shared file.&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4495&quot;&gt;&lt;del&gt;LU-4495&lt;/del&gt;&lt;/a&gt; ?!).&lt;/p&gt;</comment>
                            <comment id="81972" author="patrick.valentin" created="Fri, 18 Apr 2014 18:18:16 +0000"  >&lt;p&gt;Bruno,&lt;br/&gt;
I built a new Lustre FS with 160 OSTs, as the CEA file systems have a lot of OSTs, and I tried with a stripe count of 16 and 160. I was able to reproduce the issue with the attached reproducer on 2 nodes with 16 cores each.&lt;/p&gt;

&lt;p&gt;We built a Lustre 2.4.3 with patch #9152 of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4381&quot; title=&quot;clio deadlock from truncate&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4381&quot;&gt;&lt;del&gt;LU-4381&lt;/del&gt;&lt;/a&gt;, as mentioned in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4881&quot; title=&quot;Lustre client: eviction on open/truncate&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4881&quot;&gt;&lt;del&gt;LU-4881&lt;/del&gt;&lt;/a&gt;, and it seems to fix the issue.&lt;br/&gt;
I have restarted the test for a longer period and will provide the results after the weekend.&lt;/p&gt;</comment>
                            <comment id="82097" author="bfaccini" created="Mon, 21 Apr 2014 21:56:16 +0000"  >&lt;p&gt;I have been able to complete the debug-log analysis of my own in-house reproducer run, and it definitely shows the same scenario as the one described in both &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4381&quot; title=&quot;clio deadlock from truncate&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4381&quot;&gt;&lt;del&gt;LU-4381&lt;/del&gt;&lt;/a&gt; and &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4495&quot; title=&quot;client evicted on parallel append write to the shared file.&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4495&quot;&gt;&lt;del&gt;LU-4495&lt;/del&gt;&lt;/a&gt;. So it is very likely that Gerrit change #9152, and the &quot;sub lock creation from lov_lock_sub_init&quot; it removes, will fix the problem in this ticket, and probably also the concurrent write-append issue seen at CEA (like in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4495&quot; title=&quot;client evicted on parallel append write to the shared file.&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4495&quot;&gt;&lt;del&gt;LU-4495&lt;/del&gt;&lt;/a&gt;?!).&lt;/p&gt;</comment>
                            <comment id="82131" author="patrick.valentin" created="Tue, 22 Apr 2014 12:19:45 +0000"  >&lt;p&gt;Bruno,&lt;br/&gt;
The test as described in &quot;comment on 18/Apr/14 8:18 PM&quot; has been running for 3 days without any error with a stripe count of 16.&lt;br/&gt;
We are going to build a new lustre 2.4.3 to be installed on a test cluster at CEA, containing &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4381&quot; title=&quot;clio deadlock from truncate&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4381&quot;&gt;&lt;del&gt;LU-4381&lt;/del&gt;&lt;/a&gt; (Gerrit change #9152) and a few other fixes. I&apos;ll keep you informed of the results with the real application causing these evictions.&lt;/p&gt;</comment>
                            <comment id="85579" author="bfaccini" created="Tue, 3 Jun 2014 14:02:52 +0000"  >&lt;p&gt;Patrick,&lt;br/&gt;
Any (good?) news after you included the patch from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4381&quot; title=&quot;clio deadlock from truncate&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4381&quot;&gt;&lt;del&gt;LU-4381&lt;/del&gt;&lt;/a&gt; in Bull&apos;s 2.4.3 distro and exposed it to CEA production, to see if it fixes the evictions??&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="16757">LU-2380</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="14700" name="open_truncate_evicted.c" size="1885" author="patrick.valentin" created="Fri, 11 Apr 2014 13:50:57 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10490" key="com.atlassian.jira.plugin.system.customfieldtypes:datepicker">
                        <customfieldname>End date</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>Tue, 3 Jun 2014 13:50:57 +0000</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                            <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzwjtj:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>13503</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                        <customfield id="customfield_10493" key="com.atlassian.jira.plugin.system.customfieldtypes:datepicker">
                        <customfieldname>Start date</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>Fri, 11 Apr 2014 13:50:57 +0000</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                    </customfields>
    </item>
</channel>
</rss>