Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-4797

ASSERTION( cl_lock_is_mutexed(slice->cls_lock) ) failed

Details

    • Bug
    • Resolution: Duplicate
    • Major
    • None
    • Lustre 2.4.2
    • None
    • 3
    • 13202

    Description

      Hi,

      After 3 days in production with Lustre 2.4.2, CEA is suffering from the following "assertion failed" issue about 5 times a day:

      LustreError: 4089:0:(lovsub_lock.c:103:lovsub_lock_state()) ASSERTION( cl_lock_is_mutexed(slice->cls_lock) ) failed:
      LustreError: 4089:0:(lovsub_lock.c:103:lovsub_lock_state()) LBUG
      Pid: 4089, comm: %%AQC.P.I.O
      
      Call Trace:
       [<ffffffffa0af4895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
       [<ffffffffa0af4e97>] lbug_with_loc+0x47/0xb0 [libcfs]
       [<ffffffffa1065d51>] lovsub_lock_state+0x1a1/0x1b0 [lov]
       [<ffffffffa0bd7a88>] cl_lock_state_signal+0x68/0x160 [obdclass]
       [<ffffffffa0bd7bd5>] cl_lock_state_set+0x55/0x190 [obdclass]
       [<ffffffffa0bdb8d9>] cl_enqueue_try+0x149/0x300 [obdclass]
       [<ffffffffa105e0da>] lov_lock_enqueue+0x22a/0x850 [lov]
       [<ffffffffa0bdb88c>] cl_enqueue_try+0xfc/0x300 [obdclass]
       [<ffffffffa0bdcc7f>] cl_enqueue_locked+0x6f/0x1f0 [obdclass]
       [<ffffffffa0bdd8ee>] cl_lock_request+0x7e/0x270 [obdclass]
       [<ffffffffa0be2b8c>] cl_io_lock+0x3cc/0x560 [obdclass]
       [<ffffffffa0be2dc2>] cl_io_loop+0xa2/0x1b0 [obdclass]
       [<ffffffffa10dba90>] ll_file_io_generic+0x450/0x600 [lustre]
       [<ffffffffa10dc9d2>] ll_file_aio_write+0x142/0x2c0 [lustre]
       [<ffffffffa10dccbc>] ll_file_write+0x16c/0x2a0 [lustre]
       [<ffffffff811895d8>] vfs_write+0xb8/0x1a0
       [<ffffffff81189ed1>] sys_write+0x51/0x90
       [<ffffffff81091039>] ? sys_times+0x29/0x70
       [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      

      This issue is very similar to LU-4693, which is itself a duplicate of LU-4692, for which there is unfortunately no fix yet.

      Please ask if you need additional information that could help the diagnostic and resolution of the problem.

      Sebastien.

      Attachments

        Issue Links

          Activity

            [LU-4797] ASSERTION( cl_lock_is_mutexed(slice->cls_lock) ) failed

            duplicate of LU-4558

            jay Jinshan Xiong (Inactive) added a comment - duplicate of LU-4558

            Please try patch: http://review.whamcloud.com/9881

            I believe this is the same issue in LU-4591.

            jay Jinshan Xiong (Inactive) added a comment - Please try patch: http://review.whamcloud.com/9881 I believe this is the same issue in LU-4591 .
            bobijam Zhenyu Xu added a comment -

            still checking the eviction logs, there are lock enqueue and blocking ast call trace intertwined, but that's normal, haven't found where the race happens causing this mutex assertion violation.

            bobijam Zhenyu Xu added a comment - still checking the eviction logs, there are lock enqueue and blocking ast call trace intertwined, but that's normal, haven't found where the race happens causing this mutex assertion violation.

            Bobi, I attach the Client-side debug logs that have been taken during an on-site run of the reproducer on 2x nodes (lascaux[2890-2891]) with evictions :

            _ file "eviction" is a log showing the reproducer run+errors and the actions taken.
            _ files lascaux289[0,1]dump_lustre_pb_eviction[1,2].gz are Lustre debug logs taken on both nodes and as shown in "eviction" file/trace.

            Hope this will allow you to get more infos to qualify.

            bfaccini Bruno Faccini (Inactive) added a comment - Bobi, I attach the Client-side debug logs that have been taken during an on-site run of the reproducer on 2x nodes (lascaux [2890-2891] ) with evictions : _ file "eviction" is a log showing the reproducer run+errors and the actions taken. _ files lascaux289 [0,1] dump_lustre_pb_eviction [1,2] .gz are Lustre debug logs taken on both nodes and as shown in "eviction" file/trace. Hope this will allow you to get more infos to qualify.
            bobijam Zhenyu Xu added a comment -

            update: tried taskset, usleep 1000000, each cpu took 250 echo taskes, 2 VM nodes, each node has 2 cpus, haven't got the hit yet.

            bobijam Zhenyu Xu added a comment - update: tried taskset, usleep 1000000, each cpu took 250 echo taskes, 2 VM nodes, each node has 2 cpus, haven't got the hit yet.
            bobijam Zhenyu Xu added a comment -

            my wait time is 1.5x of the initial reproducer, I'll try the taskset.

            bobijam Zhenyu Xu added a comment - my wait time is 1.5x of the initial reproducer, I'll try the taskset.

            About the testcase :

            • the waiting time is too large than the initial reproducer by x15
            • fix the run a processor/core with taskset for example when
              you run /tmp/reproducer.sh on the 2 nodes
              for cpu in $(seq 0 5) # nb cpus nodes
              taskset -c $cpu /tmp/reproducer.sh & ; done
            • My first testcase was without usleep and 32 task/process by nodes
              and 1000 write/echo by task
            • no lustre trace enable
            apercher Antoine Percher added a comment - About the testcase : the waiting time is too large than the initial reproducer by x15 fix the run a processor/core with taskset for example when you run /tmp/reproducer.sh on the 2 nodes for cpu in $(seq 0 5) # nb cpus nodes taskset -c $cpu /tmp/reproducer.sh & ; done My first testcase was without usleep and 32 task/process by nodes and 1000 write/echo by task no lustre trace enable

            Bobijam,

            To make it clear, 2.4.2 is the first b2_4 release used at CEA, so we are not comparing with 2.4.1, but with 2.1.6.

            Sebastien.

            sebastien.buisson Sebastien Buisson (Inactive) added a comment - Bobijam, To make it clear, 2.4.2 is the first b2_4 release used at CEA, so we are not comparing with 2.4.1, but with 2.1.6. Sebastien.

            Aurelien, concerning the evictions likely to be reproduced on site with this script, is it also possible to get a Lustre debug-log, at least from evicted Client side and with the full debug mask/traces enabled ?

            bfaccini Bruno Faccini (Inactive) added a comment - Aurelien, concerning the evictions likely to be reproduced on site with this script, is it also possible to get a Lustre debug-log, at least from evicted Client side and with the full debug mask/traces enabled ?

            Seb, sorry to only answer to your own comment/reply on "21/Mar/14 3:53 PM", so yes it could be useful as a 1st debugging info to get the debug-log content extracted from the crash-dump. BTW, I hope that you run with a default debug-levels mask that will be enough to gather accurate traces for this problem ??...

            bfaccini Bruno Faccini (Inactive) added a comment - Seb, sorry to only answer to your own comment/reply on "21/Mar/14 3:53 PM", so yes it could be useful as a 1st debugging info to get the debug-log content extracted from the crash-dump. BTW, I hope that you run with a default debug-levels mask that will be enough to gather accurate traces for this problem ??...

            Please note that Sebastien's script is not a reproducer of this crash, but something similar to the workload that leads to this crash. This code only easily triggers a lot of evictions.

            adegremont Aurelien Degremont (Inactive) added a comment - Please note that Sebastien's script is not a reproducer of this crash, but something similar to the workload that leads to this crash. This code only easily triggers a lot of evictions.

            People

              jay Jinshan Xiong (Inactive)
              sebastien.buisson Sebastien Buisson (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: