[LU-4881] Lustre client: eviction on open/truncate Created: 11/Apr/14  Updated: 13/Oct/21  Resolved: 13/Oct/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.2
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Patrick Valentin (Inactive) Assignee: Bruno Faccini (Inactive)
Resolution: Duplicate Votes: 0
Labels: None
Environment:

kernel 2.6.32_431.1.2


Attachments: File open_truncate_evicted.c    
Issue Links:
Related
is related to LU-2380 Hang and eviction scenario when multi... Resolved
Severity: 3
Rank (Obsolete): 13503

 Description   

Lustre client: eviction on open/truncate
========================================

lustre 2.4.2 and kernel 2.6.32_431.1.2

At the CEA site, some applications that open the same file with O_TRUNC cause client evictions and EIO errors (Input/output error).
This happens with Lustre 2.4.2 but works correctly with Lustre 2.1.6.

The support team wrote a reproducer (attached file open_truncate_evicted.c) that reproduces the same error with 2 Bull Mesca nodes (32 cores each).
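
For reference, the pattern being exercised is roughly the following: many tasks on both nodes concurrently open the same shared file with O_TRUNC and write to it. The sketch below is only an illustration of that pattern, not the content of the attached reproducer; the file name, loop count, and buffer size are assumptions. In the runs shown further down it is launched with srun across the 2 nodes.

/* Minimal sketch of the reproducer pattern (hypothetical, NOT the
 * attached open_truncate_evicted.c): every task repeatedly opens the
 * same shared file with O_TRUNC and writes a small buffer to it.
 * Opening with O_TRUNC triggers a truncate (setattr) on the file's OST
 * objects, so all tasks contend on the same extent locks. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "test0000.txt"; /* shared file, name assumed */
    char buf[4096];
    int i, fd;

    memset(buf, 'x', sizeof(buf));
    for (i = 0; i < 1000; i++) {
        fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0640);
        if (fd < 0) {
            fprintf(stderr, "ERROR in open: %d: %s\n", errno, strerror(errno));
            exit(1);
        }
        if (write(fd, buf, sizeof(buf)) < 0) {
            fprintf(stderr, "ERROR in write: %d: %s\n", errno, strerror(errno));
            exit(1);
        }
        close(fd);
    }
    return 0;
}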

Below are the customer backtraces and the corresponding client and server log messages.

Next is a copy of the reproducer output and the associated console messages.

Customer backtraces and corresponding log messages
==================================================

Backtrace of tasks trying to open the file:

PID: 22464 TASK: ffff880b13ad3500 CPU: 25 COMMAND: "XYZ_mpi"
 #0 [ffff880b13ad5938] schedule at ffffffff81528762
 #1 [ffff880b13ad5a00] cfs_waitq_wait at ffffffffa03686fe [libcfs]
 #2 [ffff880b13ad5a10] cl_lock_state_wait at ffffffffa04d94fa [obdclass]
 #3 [ffff880b13ad5a90] cl_enqueue_locked at ffffffffa04d9ceb [obdclass]
 #4 [ffff880b13ad5ad0] cl_lock_request at ffffffffa04da86e [obdclass]
 #5 [ffff880b13ad5b30] cl_io_lock at ffffffffa04dfb0c [obdclass]
 #6 [ffff880b13ad5b90] cl_io_loop at ffffffffa04dfd42 [obdclass]
 #7 [ffff880b13ad5bc0] cl_setattr_ost at ffffffffa0a830d8 [lustre]
 #8 [ffff880b13ad5c20] ll_setattr_raw at ffffffffa0a511fe [lustre]
 #9 [ffff880b13ad5cc0] ll_setattr at ffffffffa0a5188b [lustre]
#10 [ffff880b13ad5cd0] notify_change at ffffffff811a79f8
#11 [ffff880b13ad5d40] do_truncate at ffffffff811876c4
#12 [ffff880b13ad5db0] do_filp_open at ffffffff8119c371
#13 [ffff880b13ad5f20] do_sys_open at ffffffff81186389
#14 [ffff880b13ad5f70] sys_open at ffffffff811864a0
#15 [ffff880b13ad5f80] system_call_fastpath at ffffffff8100b072
    RIP: 00007f959ee2e55d RSP: 00007fff6f9929e8 RFLAGS: 00010206
    RAX: 0000000000000002 RBX: ffffffff8100b072 RCX: 0000000000000006
    RDX: 00000000000001b6 RSI: 0000000000000241 RDI: 00007f95a32a173f
    RBP: 00007fff6f992940 R8: 00007f959f3c0489 R9: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000293 R12: ffffffff811864a0
    R13: ffff880b13ad5f78 R14: 00007f959f5eab00 R15: 0000000000000004
    ORIG_RAX: 0000000000000002 CS: 0033 SS: 002b

Backtrace of the 11 hanging tasks waiting for the lock:

PID: 22467 TASK: ffff880b756d3540 CPU: 6 COMMAND: "XYZ_mpi"
 #0 [ffff880b756d5a58] schedule at ffffffff81528762
 #1 [ffff880b756d5b20] rwsem_down_failed_common at ffffffff8152ae25
 #2 [ffff880b756d5b80] rwsem_down_write_failed at ffffffff8152af83
 #3 [ffff880b756d5bc0] call_rwsem_down_write_failed at ffffffff8128ef23
 #4 [ffff880b756d5c20] ll_setattr_raw at ffffffffa0a509cc [lustre]
 #5 [ffff880b756d5cc0] ll_setattr at ffffffffa0a5188b [lustre]
 #6 [ffff880b756d5cd0] notify_change at ffffffff811a79f8
 #7 [ffff880b756d5d40] do_truncate at ffffffff811876c4
 #8 [ffff880b756d5db0] do_filp_open at ffffffff8119c371
 #9 [ffff880b756d5f20] do_sys_open at ffffffff81186389
#10 [ffff880b756d5f70] sys_open at ffffffff811864a0
#11 [ffff880b756d5f80] system_call_fastpath at ffffffff8100b072
    RIP: 00007f16e686655d RSP: 00007fff8ef09148 RFLAGS: 00010206
    RAX: 0000000000000002 RBX: ffffffff8100b072 RCX: 0000000000000006
    RDX: 00000000000001b6 RSI: 0000000000000241 RDI: 00007f16eacd973f
    RBP: 00007fff8ef090a0 R8: 00007f16e6df8489 R9: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000293 R12: ffffffff811864a0
    R13: ffff880b756d5f78 R14: 00007f16e7022b00 R15: 0000000000000004
    ORIG_RAX: 0000000000000002 CS: 0033 SS: 002b

Client log messages:

LustreError: 11-0: fs3-OST00e4-osc-ffff88045c004000: Communicating with A.B.C.D@o2ib10, operation ost_punch failed with -107.
Lustre: fs3-OST00e4-osc-ffff88045c004000: Connection to scratch3-OST00e4 (at A.B.C.D@o2ib10) was lost; in progress operations using this service will wait for recovery to complete
Lustre: Skipped 2 previous similar messages
LustreError: 167-0: fs3-OST00e4-osc-ffff88045c004000: This client was evicted by fs3-OST00e4; in progress operations using this service will fail.
Lustre: fs3-OST00e4-osc-ffff88045c004000: Connection restored to fs3-OST00e4 (at A.B.C.D@o2ib10)
Lustre: Skipped 2 previous similar messages
Lustre: DEBUG MARKER: Wed Mar 5 15:15:01 2014

LustreError: 11-0: fs3-OST00e8-osc-ffff88045c004000: Communicating with A.B.C.D@o2ib10, operation obd_ping failed with -107.
LustreError: Skipped 1 previous similar message
LustreError: 167-0: fs3-OST00e8-osc-ffff88045c004000: This client was evicted by fs3-OST00e8; in progress operations using this service will fail.
LustreError: Skipped 1 previous similar message
LustreError: 3905:0:(ldlm_resource.c:804:ldlm_resource_complain()) fs3-OST00e8-osc-ffff88045c004000: namespace resource [0xe1b:0x0:0x0].0 (ffff8802cf99ce80) refcount nonzero (1) after lock cleanup; forcing cleanup.
LustreError: 3905:0:(ldlm_resource.c:1415:ldlm_resource_dump()) --- Resource: [0xe1b:0x0:0x0].0 (ffff8802cf99ce80) refcount = 2
Lustre: DEBUG MARKER: Wed Mar 5 15:20:01 2014

Server log messages:

LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 202s: evicting client at A.B.C.E@o2ib10 ns: filter-fs3-OST00e4_UUID lock: ffff8810233b9b40/0xb9a496999f417c9b lrc: 3/0,0 mode:
PW/PW res: [0xe18:0x0:0x0].0 rrc: 4 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x80010020 nid: A.B.C.E@o2ib10 remote: 0xf075be8a293a25dc expref: 4 pid: 24731 timeout: 4809968903 lvb_type: 0
LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) Skipped 11 previous similar messages

Reproducer output and associated console messages
==================================================

Reproducer first run:

clascar54% ls -rtl
total 85424
-rwxr-x---. 1 percher s8 85353840 Mar 6 15:34 initfile
-rw-r-----. 1 percher s8 2097152 Mar 6 15:41 test0000.txt

clascar54% srun --resv-ports -n 64 -N 2-2 ~/_cc/opentruncate/e 3
ERREUR dans open fd2: 5: Input/output error
ERREUR dans write fd2: 5: Input/output error
ERREUR dans open fd2: 5: Input/output error
srun: error: clascar4174: task 3: Exited with exit code 1
srun: Terminating job step 108683.0
slurmd[clascar4175]: *** STEP 108683.0 KILLED AT 2014-03-06T15:41:51 WITH SIGNAL 9 ***
slurmd[clascar4174]: *** STEP 108683.0 KILLED AT 2014-03-06T15:41:51 WITH SIGNAL 9 ***
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
slurmd[clascar4175]: *** STEP 108683.0 KILLED AT 2014-03-06T15:41:51 WITH SIGNAL 9 ***
slurmd[clascar4174]: *** STEP 108683.0 KILLED AT 2014-03-06T15:41:51 WITH SIGNAL 9 ***
srun: error: clascar4175: tasks 35,44: Exited with exit code 1
srun: error: clascar4174: tasks 0-2,4-31: Killed
srun: error: clascar4175: tasks 32-34,36-43,45-63: Killed

Reproducer second run:

clascar54% srun --resv-ports -n 64 -N 2-2 ~/_cc/opentruncate/e 3
ERREUR dans open fd2: 5: Input/output error
ERREUR dans open fd2: 5: Input/output error
ERREUR dans open fd2: 108: Cannot send after transport endpoint shutdown
srun: error: clascar4024: task 42: Exited with exit code 1
srun: Terminating job step 108699.0
slurmd[clascar4024]: *** STEP 108699.0 KILLED AT 2014-03-06T16:13:55 WITH SIGNAL 9 ***
slurmd[clascar4023]: *** STEP 108699.0 KILLED AT 2014-03-06T16:13:55 WITH SIGNAL 9 ***
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
srun: error: clascar4023: tasks 20-21: Exited with exit code 1
slurmd[clascar4023]: *** STEP 108699.0 KILLED AT 2014-03-06T16:13:55 WITH SIGNAL 9 ***
slurmd[clascar4024]: *** STEP 108699.0 KILLED AT 2014-03-06T16:13:55 WITH SIGNAL 9 ***
srun: error: clascar4023: tasks 0-19,22-31: Killed
srun: error: clascar4024: tasks 32-41,43-63: Killed
clascar54%

Reproducer running on Lustre 2.1.6 (no error):

clascar53% srun --resv-ports -n 64 -N 2-2 ~/_cc/opentruncate/eae2 3
srun: job 262910 queued and waiting for resources
srun: job 262910 has been allocated resources
clascar53% srun --resv-ports -n 64 -N 2-2 ~/_cc/opentruncate/eae2 3
srun: job 262924 queued and waiting for resources
srun: job 262924 has been allocated resources
clascar53%

Console messages:

clascar4174:
clascar4174: Lustre: DEBUG MARKER: Thu Mar 6 15:40:01 2014
clascar4174:
clascar4174: LustreError: 11-0: scratch3-OST00e1-osc-ffff8804658abc00: Communicating with JO.BOO.WL.LOT@o2ib10, operation ost_punch failed with -107.
clascar4174: Lustre: scratch3-OST00e1-osc-ffff8804658abc00: Connection to scratch3-OST00e1 (at JO.BOO.WL.LOT@o2ib10) was lost; in progress operations using this service will wait for recovery to complete
clascar4174: LustreError: 167-0: scratch3-OST00e1-osc-ffff8804658abc00: This client was evicted by scratch3-OST00e1; in progress operations using this service will fail.
clascar4174: Lustre: scratch3-OST00e1-osc-ffff8804658abc00: Connection restored to scratch3-OST00e1 (at JO.BOO.WL.LOT@o2ib10)
clascar4174: Lustre: DEBUG MARKER: Thu Mar 6 15:45:01 2014
clascar4174:


clascar4175: Lustre: DEBUG MARKER: Thu Mar 6 15:40:01 2014
clascar4175:
clascar4175: LustreError: 11-0: scratch3-OST00e2-osc-ffff880466736c00: Communicating with JO.BOO.WL.LOT@o2ib10, operation ldlm_enqueue failed with -107.
clascar4175: Lustre: scratch3-OST00e2-osc-ffff880466736c00: Connection to scratch3-OST00e2 (at JO.BOO.WL.LOT@o2ib10) was lost; in progress operations using this service will wait for recovery to complete
clascar4175: LustreError: 167-0: scratch3-OST00e2-osc-ffff880466736c00: This client was evicted by scratch3-OST00e2; in progress operations using this service will fail.
clascar4175: Lustre: scratch3-OST00e2-osc-ffff880466736c00: Connection restored to scratch3-OST00e2 (at JO.BOO.WL.LOT@o2ib10)
clascar4175: LustreError: 11-0: scratch3-OST00e1-osc-ffff880466736c00: Communicating with JO.BOO.WL.LOT@o2ib10, operation obd_ping failed with -107.
clascar4175: Lustre: scratch3-OST00e1-osc-ffff880466736c00: Connection to scratch3-OST00e1 (at JO.BOO.WL.LOT@o2ib10) was lost; in progress operations using this service will wait for recovery to complete
clascar4175: LustreError: 167-0: scratch3-OST00e1-osc-ffff880466736c00: This client was evicted by scratch3-OST00e1; in progress operations using this service will fail.
clascar4175: Lustre: scratch3-OST00e1-osc-ffff880466736c00: Connection restored to scratch3-OST00e1 (at JO.BOO.WL.LOT@o2ib10)
clascar4175: Lustre: DEBUG MARKER: Thu Mar 6 15:45:01 2014
clascar4175:


 Comments   
Comment by Bruno Faccini (Inactive) [ 11/Apr/14 ]

Hello Patrick,
Hmm, it seems to remind me of an old problem...
Just to fully qualify this problem, does the reproducer you attached always reproduce the issue within the CEA environment? And if yes, and if a Lustre debug log can be exported, could you run it again with full debug (or at least dlmtrace+rpctrace) enabled, a big debug log buffer configured, and with dump_on_eviction set (or a manual debug log dump)?
In the meantime I will try to reproduce in-house.

Comment by Lustre Bull [ 11/Apr/14 ]

Hi Bruno,

>> Hmm, it seems to remind me of an old problem...

Yes, LU-2380 was opened in November 2012 to report evictions when a lot of tasks were issuing ftruncate() on the same file.
But that was with Lustre 2.1, and on-site support reports that this new issue only occurs with Lustre 2.4, both with some customer applications and with the reproducer.

>> Just to fully qualify this problem, does the reproducer you attached always reproduce the issue within the CEA environment?

Yes, they wrote in the MANTIS report that it is systematic and easy to reproduce with 2 nodes on Lustre 2.4, and that they do not have the issue on 2.1 with the same reproducer (and application?).
But it seems to be related to the CEA environment, because we are not able to reproduce this on our development cluster with the same 32-CPU Mesca nodes.

>> And if yes, and if a Lustre debug log can be exported, could you run it again with full debug (or at least dlmtrace+rpctrace) enabled, a big debug log buffer configured, and with dump_on_eviction set (or a manual debug log dump)?
>> In the meantime I will try to reproduce in-house.

I will ask them whether they think it is possible to export the logs, and will pass on your requirements (trace level and log buffer size).

Patrick.

Comment by Patrick Farrell (Inactive) [ 11/Apr/14 ]

Patrick, Bruno - I haven't investigated this deeply, but from reading the description, this may be a duplicate of LU-4381.

I don't know if it's mentioned in LU-4381, but of course the deadlock can lead to evictions. We reported that (without truncate) in LU-4495, which was eventually marked as a duplicate of LU-4381.

Not sure if the LU-4381 patch is in the code you're running or not.

Comment by Jodi Levi (Inactive) [ 15/Apr/14 ]

Bruno,
Are you also investigating whether this affects Master?

Comment by Bruno Faccini (Inactive) [ 16/Apr/14 ]

Patrick F.: thanks for pointing to LU-4381. This may be the same problem, but I would like to confirm it with more info/traces from the site, or by reproducing it in-house.

Jodi: I will double-check if Master is affected.

Patrick V./Lustre Bull: any news about the availability of the full debug logs from the site?

Also, I have just been able to reproduce in-house by running the reproducer with a default striping of 12, and only 16 tasks running on 2 nodes with only 8 cores each...
I will update ASAP with the details of what I find in the full debug traces I now have.

Comment by Patrick Valentin (Inactive) [ 16/Apr/14 ]

Hi Bruno,
We haven't yet received the full debug logs from CEA.
We are preparing a new Lustre 2.4.3 for CEA with some fixes, and we are waiting for Peter Jones' answer to know whether we can also integrate the LU-4381 patch #9152 (landed in master) into this 2.4 delivery.

Comment by Bruno Faccini (Inactive) [ 17/Apr/14 ]

Patrick V.: did you try to reproduce on your development cluster with a default striping != 1, or wide enough? And on the other hand, can you ask the site people whether their default striping is != 1 when they reproduce, and if yes, ask them to run their reproducer with a default striping = 1?

BTW, I am asking this because, even though I have not yet completed the analysis of the full log from my in-house occurrence (but I will!), I strongly suspect, as Patrick F. suggested, that this ticket is a dup of LU-4381/LU-4495. The associated patch may also fix the concurrent write-append issue seen at CEA (like the problem reported in LU-4495?!).

Comment by Patrick Valentin (Inactive) [ 18/Apr/14 ]

Bruno,
I built a new Lustre FS with 160 OSTs, as the CEA file systems have a lot of OSTs, and I tried with stripe counts of 16 and 160; I was able to reproduce the issue with the attached reproducer on 2 nodes with 16 cores each.

We built a Lustre 2.4.3 with patch #9152 of LU-4381, as mentioned in LU-4881, and it seems to fix the issue.
I will restart the test for a longer period and will provide the results after the weekend.

Comment by Bruno Faccini (Inactive) [ 21/Apr/14 ]

I have been able to complete the debug-log analysis of my own in-house reproducer run, and it definitely shows the same scenario as the one described in both LU-4381 and LU-4495. So it is very likely that Gerrit change #9152, and the "sub lock creation from lov_lock_sub_init" it removes, will fix the problem in this ticket, and probably also the concurrent write-append issue seen at CEA (like in LU-4495?!).

Comment by Patrick Valentin (Inactive) [ 22/Apr/14 ]

Bruno,
The test described in the comment of 18/Apr/14 8:18 PM has been running for 3 days without any error, with a stripe count of 16.
We are going to build a new Lustre 2.4.3 to be installed on a test cluster at CEA, containing LU-4381 (Gerrit change #9152) and a few other fixes. I'll keep you informed of the results with the real application causing these evictions.

Comment by Bruno Faccini (Inactive) [ 03/Jun/14 ]

Patrick,
Any (good?) news after you included the patch from LU-4381 in Bull's 2.4.3 distro and exposed it to CEA production, to see whether it fixes the evictions?
