
[LU-4797] ASSERTION( cl_lock_is_mutexed(slice->cls_lock) ) failed

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • Labels: None
    • Affects Version/s: Lustre 2.4.2
    • Fix Version/s: None
    • Severity: 3
    • Rank (Obsolete): 13202

    Description

      Hi,

      After 3 days in production with Lustre 2.4.2, CEA has been hitting the following "assertion failed" issue about 5 times a day:

      LustreError: 4089:0:(lovsub_lock.c:103:lovsub_lock_state()) ASSERTION( cl_lock_is_mutexed(slice->cls_lock) ) failed:
      LustreError: 4089:0:(lovsub_lock.c:103:lovsub_lock_state()) LBUG
      Pid: 4089, comm: %%AQC.P.I.O
      
      Call Trace:
       [<ffffffffa0af4895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
       [<ffffffffa0af4e97>] lbug_with_loc+0x47/0xb0 [libcfs]
       [<ffffffffa1065d51>] lovsub_lock_state+0x1a1/0x1b0 [lov]
       [<ffffffffa0bd7a88>] cl_lock_state_signal+0x68/0x160 [obdclass]
       [<ffffffffa0bd7bd5>] cl_lock_state_set+0x55/0x190 [obdclass]
       [<ffffffffa0bdb8d9>] cl_enqueue_try+0x149/0x300 [obdclass]
       [<ffffffffa105e0da>] lov_lock_enqueue+0x22a/0x850 [lov]
       [<ffffffffa0bdb88c>] cl_enqueue_try+0xfc/0x300 [obdclass]
       [<ffffffffa0bdcc7f>] cl_enqueue_locked+0x6f/0x1f0 [obdclass]
       [<ffffffffa0bdd8ee>] cl_lock_request+0x7e/0x270 [obdclass]
       [<ffffffffa0be2b8c>] cl_io_lock+0x3cc/0x560 [obdclass]
       [<ffffffffa0be2dc2>] cl_io_loop+0xa2/0x1b0 [obdclass]
       [<ffffffffa10dba90>] ll_file_io_generic+0x450/0x600 [lustre]
       [<ffffffffa10dc9d2>] ll_file_aio_write+0x142/0x2c0 [lustre]
       [<ffffffffa10dccbc>] ll_file_write+0x16c/0x2a0 [lustre]
       [<ffffffff811895d8>] vfs_write+0xb8/0x1a0
       [<ffffffff81189ed1>] sys_write+0x51/0x90
       [<ffffffff81091039>] ? sys_times+0x29/0x70
       [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      

      This issue is very similar to LU-4693, which is itself a duplicate of LU-4692, for which there is unfortunately no fix yet.

      Please ask if you need any additional information that could help with the diagnosis and resolution of this problem.

      Sebastien.

    Activity

            bfaccini Bruno Faccini (Inactive) added a comment -

            Seb, sorry to only reply now to your comment of "21/Mar/14 3:53 PM". Yes, as a first piece of debugging information it would be useful to get the debug-log content extracted from the crash-dump. By the way, I hope that you run with a default debug mask sufficient to gather accurate traces for this problem.

            adegremont Aurelien Degremont (Inactive) added a comment -

            Please note that Sebastien's script is not a reproducer of this crash, but something similar to the workload that leads to it. The script merely triggers a lot of evictions very easily.

            jay Jinshan Xiong (Inactive) added a comment -

            Is it possible to collect a crash dump for this issue?

            The only difference between 2.4.1 and 2.4.2 is LU-3027, for which two patches landed. Can anyone please revert them and try again? That way we might get some clues about this issue.
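
            For anyone attempting the revert, a minimal sketch, assuming a git checkout of the Lustre source tree (the commit hashes below are placeholders, not the actual LU-3027 patches):

            # Find the two LU-3027 patches in the tree (hashes are placeholders;
            # substitute the ones reported by git log).
            git log --oneline --grep='LU-3027'

            # Revert both commits, then rebuild and reinstall the client packages.
            git revert <hash1> <hash2>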
            bobijam Zhenyu Xu added a comment -

            I've been trying the reproducer script since yesterday and haven't gotten a hit yet. I made a small change for my VM test environment.

            $ cat ~/tmp/reproducer.sh
            #!/bin/bash

            # Use the PID of a short-lived background job as a per-instance seed,
            # so that concurrent writers sleep for different intervals.
            ls > /dev/null &
            PID=$!

            for i in $(seq 1 10000)
            do
                echo $PID - $i
                # Sleep 1.5s plus a PID-dependent offset (usleep takes microseconds).
                usleep $((($PID * 100) + 1500000))
                echo "$(date) $(date '+%N') $PID-$i ***** testing write in append mode" >> /mnt/lustre/file
            done

            and on 2 nodes, run "~/tmp/reproducer.sh &" five times (see the sketch after this comment); I think the basic idea is the same.

            Sebastien, how long does it usually take to hit the issue again in your case?

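            For concreteness, a minimal sketch of launching the five background instances mentioned above (path as in the comment; the same loop would be run on each of the 2 nodes):

            # Start five concurrent instances of the reproducer, then wait for all.
            for n in $(seq 1 5)
            do
                ~/tmp/reproducer.sh &
            done
            wait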

            morrone Christopher Morrone (Inactive) added a comment -

            LLNL also hit this in testing Lustre version 2.4.2-6chaos on a Lustre client.

            sebastien.buisson Sebastien Buisson (Inactive) added a comment -

            The workload can be reproduced by launching the script run_reproducer_2.sh with 4 processes on 2 nodes.

            ::::::::::::::
            run_reproducer_2.sh
            ::::::::::::::
            #!/bin/bash
            sleeptime=$(( ( ${SLURM_PROCID} * 10000 ) + 1000000 ))
            reproducer2.sh 10 /<path>/mylog ${sleeptime} ${SLURM_JOBID}_${SLURM_PROCID}
            ::::::::::::::
            reproducer2.sh
            ::::::::::::::
            #!/bin/bash
            #
            for i in $(seq 1 $1)
            do
              usleep $3
              echo $(date) $(date '+%N') $4 $3 testing write in append mode >> $2
            done
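
            A hedged sketch of how such a launch could look under SLURM, assuming 4 tasks spread across 2 nodes (partition and account options omitted); srun exports SLURM_JOBID and SLURM_PROCID, which the scripts above rely on:

            # Launch 4 tasks across 2 nodes.
            srun -N 2 -n 4 ./run_reproducer_2.sh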

            sebastien.buisson Sebastien Buisson (Inactive) added a comment -

            Hi Bobijam,

            All I know is that the impacted file is a log file to which several processes write.

            I have forwarded your request to our on-site Support team.

            Cheers,
            Sebastien.
            bobijam Zhenyu Xu added a comment -

            Besides a crash-dump, is it possible to find a procedure to re-hit the issue?
            pjones Peter Jones added a comment -

            Bobijam

            Does this appear to be a duplicate of LU-4692? Is there anything additional that would assist with debugging this issue?

            Peter


            sebastien.buisson Sebastien Buisson (Inactive) added a comment -

            Hi Bruno,

            I have forwarded your request to the on-site Support team. Do you want us to attach the requested debug-log content to this ticket? Or could we have a look ourselves and search for something specific?

            Cheers,
            Sebastien.

            bfaccini Bruno Faccini (Inactive) added a comment -

            Hello Sebastien, is there any crash-dump available? If yes, could you extract the debug-log content with the crash-tool extension described in CFS BZ #13155 (the sources to be re-compiled are available, and I know you may need them to install and use on-site)? BTW, what is the default debug mask you run with?
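
            For reference, a hedged sketch of inspecting and widening the Lustre debug mask on a client, and of converting a binary debug dump to text, using standard lctl commands (the dump file name below is a placeholder):

            # Show the current debug mask.
            lctl get_param debug

            # Optionally add DLM lock tracing to the mask.
            lctl set_param debug=+dlmtrace

            # Convert a binary debug dump (e.g. the file written on LBUG) to text.
            lctl debug_file /tmp/lustre-log.<timestamp> lustre-log.txt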

            People

              Assignee: jay Jinshan Xiong (Inactive)
              Reporter: sebastien.buisson Sebastien Buisson (Inactive)
              Votes: 0
              Watchers: 9