
[LU-4797] ASSERTION( cl_lock_is_mutexed(slice->cls_lock) ) failed

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • Labels: None
    • Affects Version/s: Lustre 2.4.2
    • Fix Version/s: None
    • Severity: 3
    • Rank (Obsolete): 13202

    Description

      Hi,

      After 3 days in production with Lustre 2.4.2, CEA has been hitting the following "assertion failed" issue about 5 times a day:

      LustreError: 4089:0:(lovsub_lock.c:103:lovsub_lock_state()) ASSERTION( cl_lock_is_mutexed(slice->cls_lock) ) failed:
      LustreError: 4089:0:(lovsub_lock.c:103:lovsub_lock_state()) LBUG
      Pid: 4089, comm: %%AQC.P.I.O
      
      Call Trace:
       [<ffffffffa0af4895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
       [<ffffffffa0af4e97>] lbug_with_loc+0x47/0xb0 [libcfs]
       [<ffffffffa1065d51>] lovsub_lock_state+0x1a1/0x1b0 [lov]
       [<ffffffffa0bd7a88>] cl_lock_state_signal+0x68/0x160 [obdclass]
       [<ffffffffa0bd7bd5>] cl_lock_state_set+0x55/0x190 [obdclass]
       [<ffffffffa0bdb8d9>] cl_enqueue_try+0x149/0x300 [obdclass]
       [<ffffffffa105e0da>] lov_lock_enqueue+0x22a/0x850 [lov]
       [<ffffffffa0bdb88c>] cl_enqueue_try+0xfc/0x300 [obdclass]
       [<ffffffffa0bdcc7f>] cl_enqueue_locked+0x6f/0x1f0 [obdclass]
       [<ffffffffa0bdd8ee>] cl_lock_request+0x7e/0x270 [obdclass]
       [<ffffffffa0be2b8c>] cl_io_lock+0x3cc/0x560 [obdclass]
       [<ffffffffa0be2dc2>] cl_io_loop+0xa2/0x1b0 [obdclass]
       [<ffffffffa10dba90>] ll_file_io_generic+0x450/0x600 [lustre]
       [<ffffffffa10dc9d2>] ll_file_aio_write+0x142/0x2c0 [lustre]
       [<ffffffffa10dccbc>] ll_file_write+0x16c/0x2a0 [lustre]
       [<ffffffff811895d8>] vfs_write+0xb8/0x1a0
       [<ffffffff81189ed1>] sys_write+0x51/0x90
       [<ffffffff81091039>] ? sys_times+0x29/0x70
       [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      

      This issue is very similar to LU-4693, which is itself a duplicate of LU-4692, for which there is unfortunately no fix yet.

      Please ask if you need any additional information that could help with the diagnosis and resolution of this problem.

      Sebastien.

    Activity

            bfaccini Bruno Faccini (Inactive) added a comment -

            Seb, sorry to only reply now to your comment of "21/Mar/14 3:53 PM". Yes, as a first piece of debugging information it would be useful to get the debug-log content extracted from the crash-dump. By the way, I hope that you run with a default debug mask sufficient to gather accurate traces for this problem.

            adegremont Aurelien Degremont (Inactive) added a comment -

            Please note that Sebastien's script is not a reproducer of this crash, but something similar to the workload that leads to it. The script merely triggers a lot of evictions very easily.

            jay Jinshan Xiong (Inactive) added a comment -

            Is it possible to collect a crash dump for this issue?

            The only difference between 2.4.1 and 2.4.2 is LU-3027, for which two patches landed. Can anyone please revert them and try again? That way we might get some clues about this issue.
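
            For anyone attempting the revert, a minimal sketch, assuming a git checkout of the Lustre source tree (the commit hashes below are placeholders, not the actual LU-3027 patches):

            # Find the two LU-3027 patches in the tree (hashes are placeholders;
            # substitute the ones reported by git log).
            git log --oneline --grep='LU-3027'

            # Revert both commits, then rebuild and reinstall the client packages.
            git revert <hash1> <hash2>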
            bobijam Zhenyu Xu added a comment -

            I've been trying the reproducer script since yesterday and haven't gotten a hit yet. I made a small change for my VM test environment.

            $ cat ~/tmp/reproducer.sh
            #!/bin/bash

            # Use the PID of a short-lived background job as a per-instance seed,
            # so that concurrent writers sleep for different intervals.
            ls > /dev/null &
            PID=$!

            for i in $(seq 1 10000)
            do
                echo $PID - $i
                # Sleep 1.5s plus a PID-dependent offset (usleep takes microseconds).
                usleep $((($PID * 100) + 1500000))
                echo "$(date) $(date '+%N') $PID-$i ***** testing write in append mode" >> /mnt/lustre/file
            done

            and on 2 nodes, run "~/tmp/reproducer.sh &" five times (see the sketch after this comment); I think the basic idea is the same.

            Sebastien, how long does it usually take to hit the issue again in your case?

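            For concreteness, a minimal sketch of launching the five background instances mentioned above (path as in the comment; the same loop would be run on each of the 2 nodes):

            # Start five concurrent instances of the reproducer, then wait for all.
            for n in $(seq 1 5)
            do
                ~/tmp/reproducer.sh &
            done
            wait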

            morrone Christopher Morrone (Inactive) added a comment -

            LLNL also hit this in testing Lustre version 2.4.2-6chaos on a Lustre client.

            sebastien.buisson Sebastien Buisson (Inactive) added a comment -

            The workload can be reproduced by launching the script run_reproducer_2.sh with 4 processes on 2 nodes.

            ::::::::::::::
            run_reproducer_2.sh
            ::::::::::::::
            #!/bin/bash
            sleeptime=$(( ( ${SLURM_PROCID} * 10000 ) + 1000000 ))
            reproducer2.sh 10 /<path>/mylog ${sleeptime} ${SLURM_JOBID}_${SLURM_PROCID}
            ::::::::::::::
            reproducer2.sh
            ::::::::::::::
            #!/bin/bash
            #
            for i in $(seq 1 $1)
            do
              usleep $3
              echo $(date) $(date '+%N') $4 $3 testing write in append mode >> $2
            done
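
            A hedged sketch of how such a launch could look under SLURM, assuming 4 tasks spread across 2 nodes (partition and account options omitted); srun exports SLURM_JOBID and SLURM_PROCID, which the scripts above rely on:

            # Launch 4 tasks across 2 nodes.
            srun -N 2 -n 4 ./run_reproducer_2.sh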

            sebastien.buisson Sebastien Buisson (Inactive) added a comment -

            Hi Bobijam,

            All I know is that the impacted file is a log file to which several processes write.

            I have forwarded your request to our on-site Support team.

            Cheers,
            Sebastien.
            bobijam Zhenyu Xu added a comment -

            Besides a crash-dump, is it possible to find a procedure to re-hit the issue?
            pjones Peter Jones added a comment -

            Bobijam

            Does this appear to be a duplicate of LU-4692? Is there anything additional that would assist with debugging this issue?

            Peter


            sebastien.buisson Sebastien Buisson (Inactive) added a comment -

            Hi Bruno,

            I have forwarded your request to the on-site Support team. Do you want us to attach the requested debug-log content to this ticket? Or could we have a look ourselves and search for something specific?

            Cheers,
            Sebastien.

            bfaccini Bruno Faccini (Inactive) added a comment -

            Hello Sebastien, is there any crash-dump available? If yes, could you extract the debug-log content with the crash-tool extension described in CFS BZ #13155 (the sources to be re-compiled are available, and I know you may need them to install and use on-site)? BTW, what is the default debug mask you run with?
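
            For reference, a hedged sketch of inspecting and widening the Lustre debug mask on a client, and of converting a binary debug dump to text, using standard lctl commands (the dump file name below is a placeholder):

            # Show the current debug mask.
            lctl get_param debug

            # Optionally add DLM lock tracing to the mask.
            lctl set_param debug=+dlmtrace

            # Convert a binary debug dump (e.g. the file written on LBUG) to text.
            lctl debug_file /tmp/lustre-log.<timestamp> lustre-log.txt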

            People

              Assignee: jay Jinshan Xiong (Inactive)
              Reporter: sebastien.buisson Sebastien Buisson (Inactive)
              Votes: 0
              Watchers: 9