[LU-4797] ASSERTION( cl_lock_is_mutexed(slice->cls_lock) ) failed

Details

    • Bug
    • Resolution: Duplicate
    • Major
    • None
    • Lustre 2.4.2
    • None
    • 3
    • 13202

    Description

      Hi,

      After 3 days in production with Lustre 2.4.2, CEA is suffering from the following "assertion failed" issue about 5 times a day:

      LustreError: 4089:0:(lovsub_lock.c:103:lovsub_lock_state()) ASSERTION( cl_lock_is_mutexed(slice->cls_lock) ) failed:
      LustreError: 4089:0:(lovsub_lock.c:103:lovsub_lock_state()) LBUG
      Pid: 4089, comm: %%AQC.P.I.O
      
      Call Trace:
       [<ffffffffa0af4895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
       [<ffffffffa0af4e97>] lbug_with_loc+0x47/0xb0 [libcfs]
       [<ffffffffa1065d51>] lovsub_lock_state+0x1a1/0x1b0 [lov]
       [<ffffffffa0bd7a88>] cl_lock_state_signal+0x68/0x160 [obdclass]
       [<ffffffffa0bd7bd5>] cl_lock_state_set+0x55/0x190 [obdclass]
       [<ffffffffa0bdb8d9>] cl_enqueue_try+0x149/0x300 [obdclass]
       [<ffffffffa105e0da>] lov_lock_enqueue+0x22a/0x850 [lov]
       [<ffffffffa0bdb88c>] cl_enqueue_try+0xfc/0x300 [obdclass]
       [<ffffffffa0bdcc7f>] cl_enqueue_locked+0x6f/0x1f0 [obdclass]
       [<ffffffffa0bdd8ee>] cl_lock_request+0x7e/0x270 [obdclass]
       [<ffffffffa0be2b8c>] cl_io_lock+0x3cc/0x560 [obdclass]
       [<ffffffffa0be2dc2>] cl_io_loop+0xa2/0x1b0 [obdclass]
       [<ffffffffa10dba90>] ll_file_io_generic+0x450/0x600 [lustre]
       [<ffffffffa10dc9d2>] ll_file_aio_write+0x142/0x2c0 [lustre]
       [<ffffffffa10dccbc>] ll_file_write+0x16c/0x2a0 [lustre]
       [<ffffffff811895d8>] vfs_write+0xb8/0x1a0
       [<ffffffff81189ed1>] sys_write+0x51/0x90
       [<ffffffff81091039>] ? sys_times+0x29/0x70
       [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      

      This issue is very similar to LU-4693, which is itself a duplicate of LU-4692, for which there is unfortunately no fix yet.

      Please ask if you need any additional information that could help with the diagnosis and resolution of the problem.

      Sebastien.

          Activity

            sebastien.buisson Sebastien Buisson (Inactive) added a comment -

            The workload should be reproducible by launching the script run_reproducer_2.sh with 4 processes on 2 nodes.

            ::::::::::::::
            run_reproducer_2.sh
            ::::::::::::::
            #!/bin/bash
            sleeptime=$(( ( ${SLURM_PROCID} * 10000 ) + 1000000 ))
            reproducer2.sh 10 /<path>/mylog ${sleeptime} ${SLURM_JOBID}_${SLURM_PROCID}
            ::::::::::::::
            reproducer2.sh
            ::::::::::::::
            #!/bin/bash
            #
            for i in $(seq 1 $1)
            do
              usleep $3
              echo $(date) $(date '+%N') $4 $3 testing write in append mode >> $2
            done
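
            For experimenting without SLURM, the same access pattern (several processes appending timestamped lines to one shared file) can be approximated on a single node. The following is only a sketch under stated assumptions, not the reporter's script: the names append_writer, LOGFILE, NPROCS, ITERATIONS and BASE_DELAY_US are hypothetical, the default delay is much shorter than the roughly one second used above so that the run finishes quickly, and GNU date and a sleep that accepts fractional seconds are assumed. Since the original ran on 2 nodes, a single-client run may well not trigger the assertion.

```shell
#!/bin/bash
# Single-node, SLURM-free approximation of the reproducer (illustrative sketch).
# All variable and function names below are hypothetical, not from the ticket.

LOGFILE=${LOGFILE:-$(mktemp /tmp/append_test.XXXXXX)}  # point this at a Lustre path to test
NPROCS=${NPROCS:-4}                 # number of concurrent writers
ITERATIONS=${ITERATIONS:-10}        # lines written by each writer
BASE_DELAY_US=${BASE_DELAY_US:-10000}  # original script uses 1000000 (1 second)

# Append $3 timestamped lines to file $2, sleeping between writes.
append_writer() {
  local rank=$1 logfile=$2 iterations=$3
  # Stagger writers by rank, mirroring the SLURM_PROCID offset in the original.
  local delay_us=$(( BASE_DELAY_US + rank * 1000 ))
  local delay i
  printf -v delay '%d.%06d' $(( delay_us / 1000000 )) $(( delay_us % 1000000 ))
  for i in $(seq 1 "$iterations"); do
    sleep "$delay"
    echo "$(date) $(date '+%N') writer_${rank} ${delay_us} testing write in append mode" >> "$logfile"
  done
}

# Start all writers in the background, then wait for every one to finish.
for rank in $(seq 0 $(( NPROCS - 1 ))); do
  append_writer "$rank" "$LOGFILE" "$ITERATIONS" &
done
wait

echo "wrote $(wc -l < "$LOGFILE" | tr -d ' ') lines to $LOGFILE"
```

            Setting LOGFILE to a file on the Lustre mount and BASE_DELAY_US=1000000 brings the timing closer to the original; running one copy per client node against the same file would be needed to reproduce the two-node layout.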

            sebastien.buisson Sebastien Buisson (Inactive) added a comment -

            Hi Bobijam,

            All I know is that the impacted file is a log file to which several processes write.

            I have forwarded your request to our on-site Support team.

            Cheers,
            Sebastien.

            bobijam Zhenyu Xu added a comment -

            Besides the crash dump, is it possible to find a procedure that reproduces (re-hits) the issue?

            pjones Peter Jones added a comment -

            Bobijam

            Does this appear to be a duplicate of LU-4692? Is there anything additional that would assist with debugging this issue?

            Peter


            sebastien.buisson Sebastien Buisson (Inactive) added a comment -

            Hi Bruno,

            I have forwarded your request to our on-site Support team. Do you want us to attach the requested debug-log content to this ticket? Or could we have a look ourselves and search for something specific?

            Cheers,
            Sebastien.

            bfaccini Bruno Faccini (Inactive) added a comment -

            Hello Sebastien, is there a crash dump available? If so, would it be possible to extract the debug-log content with the crash-tool extension described in CFS BZ #13155 (the sources to be recompiled are available, and I know you may need them to install and use it on-site)? By the way, what is the default debug mask you run with?

            People

              jay Jinshan Xiong (Inactive)
              sebastien.buisson Sebastien Buisson (Inactive)