Lustre / LU-12377

stuck in ldlm thread on client when a large number of locks were canceled

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Minor
    • Affects Version: master

    Description

      When the client unlinked files that had a huge number of extent locks (e.g. strided files), one of the client's kworker threads went to 100% CPU and never came back. It looks like a deadlock in an ldlm thread on the client caused by many cancel calls.

      Tasks: 380 total,   2 running, 378 sleeping,   0 stopped,   0 zombie
      %Cpu(s):  0.0 us,  6.6 sy,  0.0 ni, 87.1 id,  6.2 wa,  0.0 hi,  0.0 si,  0.0 st
      KiB Mem : 13182929+total, 12711252+free,  2287060 used,  2429716 buff/cache
      KiB Swap:  1048572 total,  1048572 free,        0 used. 12699912+avail Mem 
      
        PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                 
        377 root      20   0       0      0      0 R 100.0  0.0  32:53.50 kworker/14:1    
      
      Jun  2 18:00:54 c210 kernel: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [ldlm_bl_43:4442]
      Jun  2 18:00:54 c210 kernel: Modules linked in: mgc(OE) lustre(OE) lmv(OE) mdc(OE) fid(OE) osc(OE) lov(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_fpga_tools(OE) mlx4_en(OE) mlx4_ib(OE) mlx4_core(OE) sunrpc sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper iTCO_wdt cryptd joydev iTCO_vendor_support sg ipmi_si shpchp ipmi_devintf wmi mei_me mei ioatdma ipmi_msghandler i2c_i801 pcspkr lpc_ich binfmt_misc knem(OE) ip_tables ext4 mbcache jbd2 mlx5_ib(OE) ib_uverbs(OE) ib_core(OE) sd_mod crc_t10dif crct10dif_generic mgag200 i2c_algo_bit
      Jun  2 18:00:54 c210 kernel: drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ahci ixgbe drm libahci libata mlx5_core(OE) crct10dif_pclmul crct10dif_common crc32c_intel megaraid_sas mlxfw(OE) devlink mdio i2c_core mlx_compat(OE) ptp pps_core dca dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs]
      Jun  2 18:00:54 c210 kernel: CPU: 0 PID: 4442 Comm: ldlm_bl_43 Kdump: loaded Tainted: G           OEL ------------   3.10.0-862.el7.x86_64 #1
      Jun  2 18:00:54 c210 kernel: Hardware name: Supermicro SYS-1027R-WC1RT/X9DRW-CF/CTF, BIOS 3.0b 04/09/2014
      Jun  2 18:00:54 c210 kernel: task: ffffa04c3bb3eeb0 ti: ffffa031b58a0000 task.ti: ffffa031b58a0000
      Jun  2 18:00:54 c210 kernel: RIP: 0010:[<ffffffff86b08802>]  [<ffffffff86b08802>] native_queued_spin_lock_slowpath+0x122/0x200
      Jun  2 18:00:54 c210 kernel: RSP: 0018:ffffa031b58a3b48  EFLAGS: 00000246
      Jun  2 18:00:54 c210 kernel: RAX: 0000000000000000 RBX: ffffffffc0f84b7f RCX: 0000000000010000
      Jun  2 18:00:54 c210 kernel: RDX: ffffa04c7fcd9700 RSI: 0000000000590001 RDI: ffffa045b18e4948
      Jun  2 18:00:54 c210 kernel: RBP: ffffa031b58a3b48 R08: ffffa03c3fc19700 R09: 0000000000000000
      Jun  2 18:00:54 c210 kernel: R10: 0000000000000000 R11: fffff018b1d64480 R12: ffffa040a5c9b7c8
      Jun  2 18:00:54 c210 kernel: R13: ffffa03f8d5a13f8 R14: ffffa040a5c9b8b0 R15: 00000000000a5aff
      Jun  2 18:00:54 c210 kernel: FS:  0000000000000000(0000) GS:ffffa03c3fc00000(0000) knlGS:0000000000000000
      Jun  2 18:00:54 c210 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      Jun  2 18:00:54 c210 kernel: CR2: 00007fd700006bc9 CR3: 000000203be54000 CR4: 00000000001607f0
      Jun  2 18:00:54 c210 kernel: Call Trace:
      Jun  2 18:00:54 c210 kernel: [<ffffffff8710842a>] queued_spin_lock_slowpath+0xb/0xf
      Jun  2 18:00:54 c210 kernel: [<ffffffff87115680>] _raw_spin_lock+0x20/0x30
      Jun  2 18:00:54 c210 kernel: [<ffffffffc0c6914a>] cl_object_attr_lock+0x1a/0x20 [obdclass]
      Jun  2 18:00:54 c210 kernel: [<ffffffffc0f6cf86>] osc_ldlm_blocking_ast+0x2f6/0x3a0 [osc]
      Jun  2 18:00:54 c210 kernel: [<ffffffffc0dc730a>] ldlm_cancel_callback+0x8a/0x330 [ptlrpc]
      Jun  2 18:00:54 c210 kernel: [<ffffffffc0c43892>] ? class_handle_unhash+0x32/0x50 [obdclass]
      Jun  2 18:00:54 c210 kernel: [<ffffffffc0dd21d0>] ldlm_cli_cancel_local+0xa0/0x3f0 [ptlrpc]
      Jun  2 18:00:54 c210 kernel: [<ffffffffc0dd676a>] ldlm_cli_cancel_list_local+0xea/0x280 [ptlrpc]
      Jun  2 18:00:54 c210 kernel: [<ffffffffc0dd6e8b>] ldlm_cancel_lru_local+0x2b/0x30 [ptlrpc]
      Jun  2 18:00:54 c210 kernel: [<ffffffffc0dd8076>] ldlm_cli_cancel+0x216/0x650 [ptlrpc]
      Jun  2 18:00:54 c210 kernel: [<ffffffff86bf62b1>] ? __slab_free+0x81/0x2f0
      Jun  2 18:00:54 c210 kernel: [<ffffffffc0f6ce0a>] osc_ldlm_blocking_ast+0x17a/0x3a0 [osc]
      Jun  2 18:00:54 c210 kernel: [<ffffffffc0ddc7dd>] ldlm_handle_bl_callback+0xed/0x4e0 [ptlrpc]
      Jun  2 18:00:54 c210 kernel: [<ffffffffc0ddd101>] ldlm_bl_thread_main+0x531/0x700 [ptlrpc]
      Jun  2 18:00:54 c210 kernel: [<ffffffffc0ddcbd0>] ? ldlm_handle_bl_callback+0x4e0/0x4e0 [ptlrpc]
      Jun  2 18:00:54 c210 kernel: [<ffffffff86abae31>] kthread+0xd1/0xe0
      Jun  2 18:00:54 c210 kernel: [<ffffffff86ac9667>] ? finish_task_switch+0x57/0x170
      Jun  2 18:00:54 c210 kernel: [<ffffffff86abad60>] ? insert_kthread_work+0x40/0x40
      Jun  2 18:00:54 c210 kernel: [<ffffffff8711f61d>] ret_from_fork_nospec_begin+0x7/0x21
      Jun  2 18:00:54 c210 kernel: [<ffffffff86abad60>] ? insert_kthread_work+0x40/0x40
      Jun  2 18:00:54 c210 kernel: Code: 13 48 c1 ea 0d 48 98 83 e2 30 48 81 c2 00 97 01 00 48 03 14 c5 a0 53 73 87 4c 89 02 41 8b 40 08 85 c0 75 0f 0f 1f 44 00 00 f3 90 <41> 8b 40 08 85 c0 74 f6 4d 8b 08 4d 85 c9 74 04 41 0f 18 09 8b 
      

          Activity


            In the LU-12377 case, lru_resize was disabled and all locks were kept in the client caches, and we didn't hit the same problem during the io500 run. But once io500 finished and removed all the files created by the IORs, a huge number of locks were canceled, and then we hit this issue.
            In LU-12832, lru_resize was enabled and many locks were being canceled in LRU fashion during the io500 run. In addition to that, a huge number of cancels happened at ior_hard_read.
            So the only difference is the trigger point of the cancels (after removing files vs. LRU cancels at read), but both eventually hit the same issue. I agree LU-12377 is the same as LU-12832.

            sihara Shuichi Ihara added a comment
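
            For reference, a minimal sketch of how lru_resize is typically toggled on a client (the parameter is the one named in this ticket; the numeric value is only illustrative, not taken from these runs):

              # A fixed, non-zero lru_size disables lru_resize and pins the LRU
              # at that many cached locks per namespace (as in the io500 runs above):
              lctl set_param ldlm.namespaces.*.lru_size=400

              # Setting lru_size back to 0 re-enables dynamic LRU resizing (lru_resize):
              lctl set_param ldlm.namespaces.*.lru_size=0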

            Yup, this looks like a duplicate.

            wshilong Wang Shilong (Inactive) added a comment

            This seems similar to LU-12832?

            adilger Andreas Dilger added a comment

            Lock conflicts.  Even if write/read is on the same clients (no shuffle, I think it's called?), because the job is unaligned writes from many processes to the same stripes, locks are constantly getting cancelled during the job, and replaced by other different locks.  The result is that at the end of the write, the locks that exist don't match up (like, at all) with what's needed to read the data back.  Often, the lock required for any particular read belongs to another client, or doesn't quite cover the whole region (or it does cover the whole region, but it conflicts with a read that is going to happen on another client... etc etc.).

            So the read has to cancel most of the write locks to complete.  Probably almost all of them.

            pfarrell Patrick Farrell (Inactive) added a comment

            Extent locks are held in the LRU until either they age out, the lru size is exceeded (or the server says the client has too many, if lru resize is enabled), or there is a conflict. In this case, the conflict is from the read.

            Yeah, reducing ldlm.namespaces.*.lru_max_age also helped; it canceled extent locks aggressively in the background.
            I didn't reproduce the same call trace after that change. BTW, why were other locks canceled even before they reached lru_max_age?

            sihara Shuichi Ihara added a comment
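
            A sketch of the lru_max_age tuning described above; note that the unit differs by release (seconds on older clients, milliseconds on newer ones), so the value here is only illustrative and should be checked against your version:

              # Show the current age limit and per-namespace lock counts:
              lctl get_param ldlm.namespaces.*.lru_max_age ldlm.namespaces.*.lock_count

              # Lower the age limit so idle extent locks are canceled in the
              # background before a conflict-driven cancel storm can build up:
              lctl set_param ldlm.namespaces.*.lru_max_age=600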

            My understanding is that it happens after the write, during the read.  The reads cancel the write locks because the write locks are all messy and don't tend to match up with what's required for reading the file.  (This is because the unaligned i/o from multiple clients ends up creating many locks of various unusual sizes & many are also cancelled, so the mapping of write locks across the file is quite irregular.)

            Extent locks are held in the LRU until either they age out, the lru size is exceeded (or the server says the client has too many, if lru resize is enabled), or there is a conflict.  In this case, the conflict is from the read.

            Extent locks are being cancelled as needed, but the problem is that when we try to cancel many, many extent locks on the same resource (in this case, 1 at a time from many threads) we see heavy contention on the lock protecting the resource.  This causes BL callbacks to take an extremely long time, resulting in evictions.

            pfarrell Patrick Farrell (Inactive) added a comment

            Why does this happen after the read, and why can't extent locks be canceled from time to time? As far as I observed with ldlm.namespaces.*.lock_count, locks of other types (open locks?) are canceled automatically, but extent locks are not?

            sihara Shuichi Ihara added a comment
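
            A sketch of the observation described above, using the lock_count parameter named in this ticket (the *osc*/*mdc* globs assume the usual client namespace naming):

              # Extent (data) locks live in the OSC namespaces, metadata locks
              # (open, layout, ...) in the MDC namespaces:
              lctl get_param ldlm.namespaces.*osc*.lock_count
              lctl get_param ldlm.namespaces.*mdc*.lock_count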

            No, it looks like the batch size is already pretty reasonable - Likely less than 64.

            Basically, it looks like our performance when cancelling a huge number of locks at once on the same resource & in the same namespace isn't very good.  This has been improved previously, but it's not really solved.

            This is related to LU-9213...

            I'm now thinking the issue might actually be that we're cancelling one lock at a time, from an absurd number of threads...  Because the existing write locks are not going to be cancelled until they are explicitly called back - We're just cancelling one lock at a time, each time a bl callback comes in.

            Some sort of mechanism to recognize this case might be in order...  If we've got a bunch of locks cancelled in a row, over and over, on a particular resource, then we should probably start cancelling all unused locks on that resource...

            Will think on this a little.

            pfarrell Patrick Farrell (Inactive) added a comment
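
            As a point of comparison to the per-resource idea above, the whole client LRU can already be flushed by hand with the standard lru_size=clear control; a sketch:

              # Cancel all unused (LRU) locks in every namespace on this client:
              lctl set_param ldlm.namespaces.*.lru_size=clear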

            So this is happening because we're trying to cancel too many locks on the hard write/hard read file.  It's taking too long to do that, and so some requests are timing out, leading to evictions, etc, etc.

            The way to fix this in general is probably to reduce the number of locks that are cancelled each time we start doing a cancellation.  This is just a guess, but if the problem is latency of lock cancellations, reducing the count should help with that.

            In the meantime, setting a low-ish lru_size limit should take care of the problem.  I'm trying 10000 currently, but we may be able to safely go much higher (previously I believe we had hundreds of thousands of locks on each client).  I'm not sure if it will help performance or not (it may hurt it slightly), but it should prevent this problem from occurring.

            pfarrell Patrick Farrell (Inactive) added a comment
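
            A sketch of the workaround described above, restricted to the OSC (extent lock) namespaces; 10000 is the value being tried here, not a general recommendation, and a non-zero lru_size also disables lru_resize:

              # Cap the cached-lock LRU so no OSC namespace accumulates hundreds
              # of thousands of extent locks:
              lctl set_param ldlm.namespaces.*osc*.lru_size=10000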

            I see this issue quite often at ior_hard_read in the io500 test, when ldlm cancels a huge number of extent locks for a striped file that was created with small chunks (47 KB) and strided writes by ior_hard_write.

            sihara Shuichi Ihara added a comment (edited)
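
            For context, the io500 ior-hard phase does strided, unaligned 47008-byte transfers from every rank into one shared file, roughly of the form below (the rank count, segment count, and file path are placeholders, and the exact options come from the io500 configuration, so treat this as an approximation):

              # Shared-file, segmented 47008-byte writes (ior_hard_write), later
              # read back by ior_hard_read:
              mpirun -np <ranks> ior -w -a POSIX -b 47008 -t 47008 -s <segments> -o /lustre/io500/ior_hard/file
              mpirun -np <ranks> ior -r -a POSIX -b 47008 -t 47008 -s <segments> -o /lustre/io500/ior_hard/file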

            People

              Assignee: wshilong Wang Shilong (Inactive)
              Reporter: sihara Shuichi Ihara
              Votes: 0
              Watchers: 7
