LU-13413: Lustre soft lockups with peer credit exhaustion

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Affects Version: Lustre 2.12.4
    • Environment: RHEL7.7 running Lustre 2.12 LTS

    Description

      Currently our production file system is experiencing peer credit exhaustion, which is leading to soft lockups. The back traces are attached. Instead of soft lockups we should be seeing client evictions.

      [2642432.333239]  [<ffffffff9477544a>] queued_spin_lock_slowpath+0xb/0xf
      [2642432.333241]  [<ffffffff94783330>] _raw_spin_lock+0x20/0x30
      [2642432.333258]  [<ffffffffc146202c>] lock_res_and_lock+0x2c/0x50 [ptlrpc]
      [2642432.333273]  [<ffffffffc1469c61>] ldlm_lock_enqueue+0x1b1/0xa20 [ptlrpc]
      [2642432.333294]  [<ffffffffc14b9891>] ? lustre_pack_reply+0x11/0x20 [ptlrpc]
      [2642432.333311]  [<ffffffffc1492506>] ldlm_handle_enqueue0+0xa56/0x15f0 [ptlrpc]
      [2642432.333332]  [<ffffffffc14bb300>] ? lustre_swab_ldlm_lock_desc+0x30/0x30 [ptlrpc]
      [2642432.333357]  [<ffffffffc151acf2>] tgt_enqueue+0x62/0x210 [ptlrpc]
      [2642432.333381]  [<ffffffffc1521b0a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
      [2642432.333404]  [<ffffffffc14fb021>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
      [2642432.333408]  [<ffffffffc103fbde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
      [2642432.333428]  [<ffffffffc14c646b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [2642432.333449]  [<ffffffffc14c3285>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
      [2642432.333451]  [<ffffffff940d3903>] ? __wake_up+0x13/0x20
      [2642432.333471]  [<ffffffffc14c9dd4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
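
      (Not from the original report: to confirm that peer credits are actually being exhausted on the servers, one option is to inspect the per-peer LNet credit counters, assuming the lnetctl utility shipped with the 2.12 tools is installed on the OSS/MDS nodes:)

      # Hedged sketch only: dump verbose per-peer state and look for tx credit
      # counters that stay pinned at or below zero while the lockups occur.
      lnetctl peer show -v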
      

       

          Activity


            adilger Andreas Dilger added a comment -

            Dustin, James, it looks like this problem has not been hit again since the lru_size=200 change was made back in April. Were you ever able to track down whether a specific workload was causing the high lock traffic?
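
            (Added for reference, not part of the original exchange: OST job stats are one way to tie lock and I/O traffic like this back to a specific workload, assuming jobid_var is configured on the clients, e.g. for Slurm:)

            # Hypothetical example: tag RPCs with the job ID on the clients,
            # then read per-job I/O stats on the OSS for the busy OSTs.
            lctl set_param jobid_var=SLURM_JOB_ID      # on the clients
            lctl get_param obdfilter.*.job_stats       # on the OSS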


            dustb100 Dustin Leverman added a comment -

            Andreas,

            I wanted to follow up with you on this since it has been a while. Generally, the workload that triggers this happens during the first week of the month (a specific code is run by a subset of users on a set schedule). Based on your input about conflicting locks, we changed the lru_size configuration on the compute nodes to a fixed limit of 200 instead of the dynamic setting, thinking that this would limit the number of metadata transactions a single compute node could issue concurrently to the MDS. After this change we did not see any more crashes, but it is possible this will crop up again in the next week.

             

            Thanks,

            Dustin
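
            (For reference only; the exact commands used at the site are not in the ticket. A change like the lru_size=200 one described above would typically be applied live on the clients along these lines:)

            # Sketch: setting a fixed, non-zero lru_size disables dynamic LRU
            # resizing for those namespaces on the client.
            lctl set_param ldlm.namespaces.*.lru_size=200
            lctl get_param ldlm.namespaces.*.lru_size    # verify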


            adilger Andreas Dilger added a comment -

            f2-oss1f1: ldlm.namespaces.filter-f2-OST0029_UUID.lock_count=9207
            f2-oss1b5: ldlm.namespaces.filter-f2-OST0019_UUID.lock_count=12895
            f2-oss1b5: ldlm.namespaces.filter-f2-OST003d_UUID.lock_count=25789

            OK, these counts are totally reasonable, even low by my expectations, so it doesn't seem like the spinlock hold times are driven by having too many DLM locks to process, unless there is a sudden spike of locks at the time the problem occurs?

            According to the stack traces, there are only really a few "interesting" processes; the rest are just blocked on (likely) the same spinlock in either ldlm_lock_enqueue() or ofd_lvbo_fill(). In both cases, it looks like the ldlm_cn threads are processing a potentially long list of locks on a single resource:

            [2642432.309284] CPU: 8 PID: 57948 Comm: ldlm_cn00_044  3.10.0-1062.9.1.el7.x86_64
            [2642432.309314] Call Trace:
            [2642432.309331]  [<ffffffffc14808e3>] ldlm_process_extent_lock+0x213/0x490 [ptlrpc]
            [2642432.309362]  [<ffffffffc146959e>] ldlm_reprocess_queue+0x1be/0x3f0 [ptlrpc]
            [2642432.309383]  [<ffffffffc146a5d2>] __ldlm_reprocess_all+0x102/0x360 [ptlrpc]
            [2642432.309397]  [<ffffffffc146a843>] ldlm_reprocess_all+0x13/0x20 [ptlrpc]
            [2642432.309414]  [<ffffffffc148f4df>] ldlm_request_cancel+0x42f/0x780 [ptlrpc]
            [2642432.309431]  [<ffffffffc14937a2>] ldlm_handle_cancel+0x232/0x2b0 [ptlrpc]
            [2642432.309447]  [<ffffffffc1493978>] ldlm_cancel_handler+0x158/0x590 [ptlrpc]
            [2642432.309467]  [<ffffffffc14c646b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
            [2642432.309509]  [<ffffffffc14c9dd4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
            [2642432.309530]  [<ffffffff940c61f1>] kthread+0xd1/0xe0
            
            [2644718.775432] CPU: 15 PID: 52157 Comm: ldlm_cn01_021  3.10.0-1062.9.1.el7.x86_64
            [2644718.775459] Call Trace:
            [2644718.775477]  [<ffffffffc14808e3>] ldlm_process_extent_lock+0x213/0x490 [ptlrpc]
            [2644718.775508]  [<ffffffffc146959e>] ldlm_reprocess_queue+0x1be/0x3f0 [ptlrpc]
            [2644718.775528]  [<ffffffffc146a5d2>] __ldlm_reprocess_all+0x102/0x360 [ptlrpc]
            [2644718.775543]  [<ffffffffc146a843>] ldlm_reprocess_all+0x13/0x20 [ptlrpc]
            [2644718.775559]  [<ffffffffc148f4df>] ldlm_request_cancel+0x42f/0x780 [ptlrpc]
            [2644718.775576]  [<ffffffffc14937a2>] ldlm_handle_cancel+0x232/0x2b0 [ptlrpc]
            [2644718.775593]  [<ffffffffc1493978>] ldlm_cancel_handler+0x158/0x590 [ptlrpc]
            [2644718.775613]  [<ffffffffc14c646b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
            [2644718.775656]  [<ffffffffc14c9dd4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
            [2644718.775677]  [<ffffffff940c61f1>] kthread+0xd1/0xe0
            

            I would guess there is some application that is "piling on" a single object (e.g. high-IOPS small/unaligned writes on an HDD OST) and generating a lot of conflicting locks, so the resource is repeatedly reprocessed each time a lock is granted or released? I was wondering whether patch https://review.whamcloud.com/33221 "LU-11085 ldlm: simplify use of interval-tree" had already landed and might be causing this, but I see it has not. It seems like the interval tree should protect us against scanning a lot of non-overlapping locks, so possibly there are a lot of overlapping lock requests from the application?
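
            (Illustration only, not from the thread: the access pattern described above can be approximated by many clients rewriting small random ranges of one shared file, so that each write needs its own conflicting extent lock on the same OST object:)

            # Hypothetical reproducer sketch; /lustre/f2/shared is an assumed path.
            # Run concurrently from many clients (bash, uses $RANDOM).
            for i in $(seq 1 1000); do
                dd if=/dev/zero of=/lustre/f2/shared bs=4k count=1 \
                   seek=$((RANDOM % 100000)) conv=notrunc 2>/dev/null
            done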


            simmonsja James A Simmons added a comment -

            The only patch we are running outside of 2.12 is a port of LU-13101.

            f2-oss1c1: ldlm.namespaces.filter-f2-OST0002_UUID.lock_count=13071
            f2-oss1c1: ldlm.namespaces.filter-f2-OST0026_UUID.lock_count=15371
            f2-oss1f5: ldlm.namespaces.filter-f2-OST001d_UUID.lock_count=9772
            f2-oss1f5: ldlm.namespaces.filter-f2-OST0041_UUID.lock_count=23522
            f2-oss1f4: ldlm.namespaces.filter-f2-OST0017_UUID.lock_count=6410
            f2-oss1f4: ldlm.namespaces.filter-f2-OST003b_UUID.lock_count=10967
            f2-oss1e2: ldlm.namespaces.filter-f2-OST000a_UUID.lock_count=9424
            f2-oss1e2: ldlm.namespaces.filter-f2-OST002e_UUID.lock_count=14095
            f2-oss1e6: ldlm.namespaces.filter-f2-OST0022_UUID.lock_count=9126
            f2-oss1e6: ldlm.namespaces.filter-f2-OST0046_UUID.lock_count=14720
            f2-oss1e1: ldlm.namespaces.filter-f2-OST0004_UUID.lock_count=5522
            f2-oss1e1: ldlm.namespaces.filter-f2-OST0028_UUID.lock_count=13113
            f2-oss1c4: ldlm.namespaces.filter-f2-OST0014_UUID.lock_count=31000
            f2-oss1c4: ldlm.namespaces.filter-f2-OST0038_UUID.lock_count=8096
            f2-oss1a1: ldlm.namespaces.filter-f2-OST0000_UUID.lock_count=9730
            f2-oss1a1: ldlm.namespaces.filter-f2-OST0024_UUID.lock_count=10386
            f2-oss1a2: ldlm.namespaces.filter-f2-OST0006_UUID.lock_count=6744
            f2-oss1a2: ldlm.namespaces.filter-f2-OST002a_UUID.lock_count=17169
            f2-oss1b2: ldlm.namespaces.filter-f2-OST0007_UUID.lock_count=7458
            f2-oss1b2: ldlm.namespaces.filter-f2-OST002b_UUID.lock_count=12837
            f2-oss1e3: ldlm.namespaces.filter-f2-OST0010_UUID.lock_count=17431
            f2-oss1e3: ldlm.namespaces.filter-f2-OST0034_UUID.lock_count=11464
            f2-oss1d4: ldlm.namespaces.filter-f2-OST0015_UUID.lock_count=9836
            f2-oss1d4: ldlm.namespaces.filter-f2-OST0039_UUID.lock_count=14168
            f2-oss1d3: ldlm.namespaces.filter-f2-OST000f_UUID.lock_count=11082
            f2-oss1d3: ldlm.namespaces.filter-f2-OST0033_UUID.lock_count=9801
            f2-oss1f6: ldlm.namespaces.filter-f2-OST0023_UUID.lock_count=7558
            f2-oss1f6: ldlm.namespaces.filter-f2-OST0047_UUID.lock_count=10884
            f2-oss1c6: ldlm.namespaces.filter-f2-OST0020_UUID.lock_count=11754
            f2-oss1c6: ldlm.namespaces.filter-f2-OST0044_UUID.lock_count=21609
            f2-oss1f2: ldlm.namespaces.filter-f2-OST000b_UUID.lock_count=7273
            f2-oss1f2: ldlm.namespaces.filter-f2-OST002f_UUID.lock_count=16546
            f2-oss1d2: ldlm.namespaces.filter-f2-OST0009_UUID.lock_count=5351
            f2-oss1d2: ldlm.namespaces.filter-f2-OST002d_UUID.lock_count=11648
            f2-oss1b4: ldlm.namespaces.filter-f2-OST0013_UUID.lock_count=7091
            f2-oss1b4: ldlm.namespaces.filter-f2-OST0037_UUID.lock_count=17672
            f2-oss1d1: ldlm.namespaces.filter-f2-OST0003_UUID.lock_count=10786
            f2-oss1d1: ldlm.namespaces.filter-f2-OST0027_UUID.lock_count=12821
            f2-oss1a3: ldlm.namespaces.filter-f2-OST000c_UUID.lock_count=8583
            f2-oss1a3: ldlm.namespaces.filter-f2-OST0030_UUID.lock_count=10223
            f2-oss1b1: ldlm.namespaces.filter-f2-OST0001_UUID.lock_count=19223
            f2-oss1b1: ldlm.namespaces.filter-f2-OST0025_UUID.lock_count=12262
            f2-oss1c3: ldlm.namespaces.filter-f2-OST000e_UUID.lock_count=19796
            f2-oss1c3: ldlm.namespaces.filter-f2-OST0032_UUID.lock_count=19187
            f2-oss1c2: ldlm.namespaces.filter-f2-OST0008_UUID.lock_count=22208
            f2-oss1c2: ldlm.namespaces.filter-f2-OST002c_UUID.lock_count=17602
            f2-oss1e5: ldlm.namespaces.filter-f2-OST001c_UUID.lock_count=10177
            f2-oss1e5: ldlm.namespaces.filter-f2-OST0040_UUID.lock_count=18898
            f2-oss1d6: ldlm.namespaces.filter-f2-OST0021_UUID.lock_count=9237
            f2-oss1d6: ldlm.namespaces.filter-f2-OST0045_UUID.lock_count=23232
            f2-oss1b6: ldlm.namespaces.filter-f2-OST001f_UUID.lock_count=8040
            f2-oss1b6: ldlm.namespaces.filter-f2-OST0043_UUID.lock_count=22877
            f2-oss1f1: ldlm.namespaces.filter-f2-OST0005_UUID.lock_count=8117
            f2-oss1f1: ldlm.namespaces.filter-f2-OST0029_UUID.lock_count=9207
            f2-oss1b5: ldlm.namespaces.filter-f2-OST0019_UUID.lock_count=12895
            f2-oss1b5: ldlm.namespaces.filter-f2-OST003d_UUID.lock_count=25789


            adilger Andreas Dilger added a comment -

            James, how many locks are actually being held by the clients and on the servers? Running "lctl get_param ldlm.namespaces.*.lock_count" on some clients and servers would give a good idea of where this stands. There were a couple of bugs related to not canceling the DLM locks on the clients, even above the LRU size, so it would be good to know if this is the problem you are seeing.

            Also, are there any patches above 2.12.4?
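
            (A sketch of how lock counts like the ones James posted could be gathered across the cluster; pdsh and the "oss" host group are assumptions, not something stated in the ticket:)

            # Collect per-namespace lock counts from every OSS and sort by value;
            # the largest counts point at the busiest OSTs.
            pdsh -g oss 'lctl get_param ldlm.namespaces.*.lock_count' | sort -t= -k2 -n | tail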

            simmonsja James A Simmons added a comment (edited) -

            I talked to our admin team and they stated it is the same problem we have seen since October, with the upgrade to 2.12 LTS. LU-12600 did reduce the problem, but it still exists. It shows up when clients hit the servers with large numbers of small I/Os.

            Do you think tuning the lru_resize and lru_size might help? The current values are

            lru_size=2000 and lru_resize=3900000
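
            (For reference, a sketch of inspecting the client-side LRU tunables discussed here; the 3900000 value quoted above appears to be lru_max_age, which is expressed in milliseconds (3900000 ms = 65 minutes, the 2.12 default):)

            # Sketch: read the current per-namespace LRU settings on a client.
            lctl get_param ldlm.namespaces.*.lru_size ldlm.namespaces.*.lru_max_age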


            adilger Andreas Dilger added a comment -

            Hi James, any changes to the system that would have triggered this to start happening (upgraded Lustre, new patch, new clients, new workload, etc.)?


            People

              Assignee: adilger Andreas Dilger
              Reporter: simmonsja James A Simmons
              Votes: 0
              Watchers: 9
