Lustre / LU-4797

ASSERTION( cl_lock_is_mutexed(slice->cls_lock) ) failed

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • None
    • Affects Version/s: Lustre 2.4.2
    • None
    • 3
    • 13202

    Description

      Hi,

      After 3 days in production with Lustre 2.4.2, CEA has been hitting the following assertion failure about 5 times a day:

      LustreError: 4089:0:(lovsub_lock.c:103:lovsub_lock_state()) ASSERTION( cl_lock_is_mutexed(slice->cls_lock) ) failed:
      LustreError: 4089:0:(lovsub_lock.c:103:lovsub_lock_state()) LBUG
      Pid: 4089, comm: %%AQC.P.I.O
      
      Call Trace:
       [<ffffffffa0af4895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
       [<ffffffffa0af4e97>] lbug_with_loc+0x47/0xb0 [libcfs]
       [<ffffffffa1065d51>] lovsub_lock_state+0x1a1/0x1b0 [lov]
       [<ffffffffa0bd7a88>] cl_lock_state_signal+0x68/0x160 [obdclass]
       [<ffffffffa0bd7bd5>] cl_lock_state_set+0x55/0x190 [obdclass]
       [<ffffffffa0bdb8d9>] cl_enqueue_try+0x149/0x300 [obdclass]
       [<ffffffffa105e0da>] lov_lock_enqueue+0x22a/0x850 [lov]
       [<ffffffffa0bdb88c>] cl_enqueue_try+0xfc/0x300 [obdclass]
       [<ffffffffa0bdcc7f>] cl_enqueue_locked+0x6f/0x1f0 [obdclass]
       [<ffffffffa0bdd8ee>] cl_lock_request+0x7e/0x270 [obdclass]
       [<ffffffffa0be2b8c>] cl_io_lock+0x3cc/0x560 [obdclass]
       [<ffffffffa0be2dc2>] cl_io_loop+0xa2/0x1b0 [obdclass]
       [<ffffffffa10dba90>] ll_file_io_generic+0x450/0x600 [lustre]
       [<ffffffffa10dc9d2>] ll_file_aio_write+0x142/0x2c0 [lustre]
       [<ffffffffa10dccbc>] ll_file_write+0x16c/0x2a0 [lustre]
       [<ffffffff811895d8>] vfs_write+0xb8/0x1a0
       [<ffffffff81189ed1>] sys_write+0x51/0x90
       [<ffffffff81091039>] ? sys_times+0x29/0x70
       [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
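
      For readers not familiar with the CLIO locking code: the LASSERT above fires in lovsub_lock_state() because the cl_lock whose state is being changed is expected to be "mutexed", i.e. the calling thread must hold that lock's guard mutex before the state transition is propagated. The sketch below is a minimal userspace illustration of that invariant only, not the actual Lustre source; demo_lock, demo_lock_is_mutexed() and demo_lock_state_set() are hypothetical stand-ins for cl_lock, cl_lock_is_mutexed() and cl_lock_state_set().

      #include <assert.h>
      #include <pthread.h>
      #include <stdio.h>

      /* Hypothetical stand-in for a cl_lock-like object. */
      struct demo_lock {
          pthread_mutex_t dl_guard;  /* guards state transitions */
          pthread_t       dl_owner;  /* thread currently holding dl_guard */
          int             dl_owned;  /* dl_owner is valid */
          int             dl_state;  /* current state value */
      };

      /* Analogue of cl_lock_is_mutexed(): true only when the calling
       * thread itself holds the guard mutex. */
      static int demo_lock_is_mutexed(struct demo_lock *lk)
      {
          return lk->dl_owned && pthread_equal(lk->dl_owner, pthread_self());
      }

      static void demo_lock_mutex_get(struct demo_lock *lk)
      {
          pthread_mutex_lock(&lk->dl_guard);
          lk->dl_owner = pthread_self();
          lk->dl_owned = 1;
      }

      static void demo_lock_mutex_put(struct demo_lock *lk)
      {
          lk->dl_owned = 0;
          pthread_mutex_unlock(&lk->dl_guard);
      }

      /* Analogue of the state-change path: the caller must have mutexed
       * the lock first, and the function asserts that invariant, just as
       * the LASSERT at lovsub_lock.c:103 does. */
      static void demo_lock_state_set(struct demo_lock *lk, int state)
      {
          assert(demo_lock_is_mutexed(lk));
          lk->dl_state = state;
      }

      int main(void)
      {
          struct demo_lock lk = { .dl_guard = PTHREAD_MUTEX_INITIALIZER };

          demo_lock_mutex_get(&lk);      /* correct usage: mutex first */
          demo_lock_state_set(&lk, 1);
          demo_lock_mutex_put(&lk);

          /* Calling demo_lock_state_set() here, without the mutex held,
           * would trip the assertion -- the userspace analogue of the
           * LBUG reported in this ticket. */
          printf("state=%d\n", lk.dl_state);
          return 0;
      }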
      

      This issue is very similar to LU-4693, which is itself a duplicate of LU-4692, for which there is unfortunately no fix yet.

      Please ask if you need any additional information that could help with the diagnosis and resolution of this problem.

      Sebastien.

        Activity
            bobijam Zhenyu Xu added a comment -

            Every write needs an exclusive lock, so a write from another node causes the current lock holder to relinquish the lock. Multiple writes to the same file from different nodes will therefore produce intertwined lock enqueues and lock blocking ASTs; that is what I meant by "normal".

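            To illustrate what "lock enqueue and blocking AST intertwined" means here, below is a rough userspace sketch, not Lustre code and a deliberate simplification (a plain mutex stands in for the DLM extent lock): two "client" threads append to one shared file size under a single exclusive lock, and the lock keeps changing owner, which on a real filesystem corresponds to repeated enqueue / blocking-callback / cancel cycles between the two nodes. All names and counts are made up for illustration.

            #include <pthread.h>
            #include <sched.h>
            #include <stdio.h>

            #define APPENDS_PER_CLIENT 1000   /* "1000 write/echo by task" */

            static pthread_mutex_t extent_lock = PTHREAD_MUTEX_INITIALIZER;
            static long file_size;        /* stands in for the shared file */
            static long lock_bounces;     /* how often the lock changed hands */
            static int  last_owner = -1;

            static void *client(void *arg)
            {
                int id = (int)(long)arg;

                for (int i = 0; i < APPENDS_PER_CLIENT; i++) {
                    /* "Enqueue" the exclusive lock; if the other client holds
                     * it, this blocks until the holder gives it up, which is
                     * what a blocking AST forces on a real client. */
                    pthread_mutex_lock(&extent_lock);
                    if (last_owner != id) {
                        last_owner = id;
                        lock_bounces++;   /* lock migrated between clients */
                    }
                    file_size += 4;       /* append a few bytes */
                    pthread_mutex_unlock(&extent_lock);
                    sched_yield();        /* let the other client contend */
                }
                return NULL;
            }

            int main(void)
            {
                pthread_t t1, t2;

                pthread_create(&t1, NULL, client, (void *)1L);
                pthread_create(&t2, NULL, client, (void *)2L);
                pthread_join(t1, NULL);
                pthread_join(t2, NULL);

                printf("final size %ld bytes, lock changed owner %ld times\n",
                       file_size, lock_bounces);
                return 0;
            }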

            apercher Antoine Percher added a comment -

            Hi Zhenyu Xu,
            Could you explain why you think "there are lock enqueue and blocking ast call trace intertwined" and why the evictions are "normal"? I do not understand why there is so much lock contention here, since we just append a few bytes to the end of one file with only 4 processes on 2 nodes; on a 2.1.x Lustre distribution we do not see this contention. Could that also explain why we sometimes hit the race fixed by LU-4558? And could LU-3027 and LU-4495 explain this contention?
            Thanks

            jay Jinshan Xiong (Inactive) added a comment -

            Duplicate of LU-4558.

            jay Jinshan Xiong (Inactive) added a comment -

            Please try patch: http://review.whamcloud.com/9881

            I believe this is the same issue as in LU-4591.
            bobijam Zhenyu Xu added a comment -

            I am still checking the eviction logs. There are lock enqueue and blocking AST call traces intertwined, but that is normal; I have not yet found where the race happens that causes this mutex assertion violation.


            bfaccini Bruno Faccini (Inactive) added a comment -

            Bobi, I have attached the client-side debug logs taken during an on-site run of the reproducer on 2 nodes (lascaux[2890-2891]) with evictions:

            - the file "eviction" is a log showing the reproducer run + errors and the actions taken;
            - the files lascaux289[0,1]dump_lustre_pb_eviction[1,2].gz are Lustre debug logs taken on both nodes, as referenced in the "eviction" file/trace.

            I hope this will give you enough information to qualify the problem.
            bobijam Zhenyu Xu added a comment -

            Update: I tried taskset with usleep 1000000; each CPU took 250 echo tasks, on 2 VM nodes with 2 CPUs each. I have not hit the assertion yet.

            bobijam Zhenyu Xu added a comment -

            My wait time is 1.5x that of the initial reproducer; I will try the taskset approach.


            apercher Antoine Percher added a comment -

            About the test case:

            • the waiting time is 15x larger than in the initial reproducer
            • pin each run to a processor/core with taskset, for example when you run /tmp/reproducer.sh on the 2 nodes:

              for cpu in $(seq 0 5); do    # number of CPUs on the node
                  taskset -c $cpu /tmp/reproducer.sh &
              done

            • my first test case had no usleep, with 32 tasks/processes per node and 1000 writes/echos per task
            • no Lustre tracing enabled
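
            The /tmp/reproducer.sh script itself is not attached to this ticket. Based only on the description in this thread (each task appends a few bytes to the end of one shared file, 1000 writes per task), a hypothetical C equivalent of a single task might look like the sketch below; the file path is invented for illustration.

            #include <fcntl.h>
            #include <stdio.h>
            #include <string.h>
            #include <unistd.h>

            int main(void)
            {
                /* Hypothetical shared target on the Lustre mount. */
                const char *path = "/mnt/lustre/shared_append_target";
                const char *msg  = "x\n";
                int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);

                if (fd < 0) {
                    perror("open");
                    return 1;
                }
                for (int i = 0; i < 1000; i++) {
                    /* Each O_APPEND write lands at the current end of file,
                     * so every client needs the extent lock covering EOF. */
                    if (write(fd, msg, strlen(msg)) < 0) {
                        perror("write");
                        break;
                    }
                }
                close(fd);
                return 0;
            }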

            sebastien.buisson Sebastien Buisson (Inactive) added a comment -

            Bobijam,

            To make it clear, 2.4.2 is the first b2_4 release used at CEA, so we are not comparing with 2.4.1, but with 2.1.6.

            Sebastien.

            People

              Assignee: jay Jinshan Xiong (Inactive)
              Reporter: sebastien.buisson Sebastien Buisson (Inactive)
              Votes: 0
              Watchers: 9
