[LU-4053] client leaking objects/locks during IO Created: 02/Oct/13  Updated: 05/Aug/20  Resolved: 05/Aug/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Andreas Dilger Assignee: Niu Yawei (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: mq115
Environment:

Config: Single-node client+MDS+OSS with 1 MDT, 3 OSTs
Node: x86_64 w/ dual-core CPU, 2GB RAM
Kernel: 2.6.32-279.5.1.el6_lustre.g7f15218.x86_64
Lustre build: 72afa19c19d5ac


Issue Links:
Duplicate
duplicates LU-3771 stuck 56G of SUnreclaim memory Resolved
is duplicated by LU-4754 MDS large amount of slab usage Resolved
Related
is related to LU-4357 page allocation failure. mode:0x40 ca... Resolved
is related to LU-2487 2.2 Client deadlock between ll_md_blo... Resolved
is related to LU-4033 Failure on test suite parallel-scale-... Resolved
is related to LU-4740 MDS - buffer cache not freed Resolved
is related to LU-3997 Excessive slab usage causes large mem... Resolved
is related to LU-4429 clients leaking open handles/bad lock... Resolved
is related to LU-4754 MDS large amount of slab usage Resolved
is related to LU-4002 HSM restore vs unlink deadlock Resolved
Severity: 3
Rank (Obsolete): 10870

 Description   

I'm trying to determine if there is a "memory leak" in the current Lustre code that can affect long-running clients or servers. While this memory may be cleaned up when the filesystem is unmounted, it does not appear to be cleaned up under steady-state usage.

I started "rundbench 10 -t 3600" and am watching the memory usage in several forms (slabtop, vmstat, "lfs df", "lfs df -i"). It does indeed appear that there are a number of statistics that show what looks to be a memory leak. These statistics are gathered at about the same time, but not exactly at the same time. The general trend is fairly clear, however:

The "lfs df -i" output shows only around 1000 in-use files during the whole run:

UUID                      Inodes       IUsed       IFree IUse% Mounted on
testfs-MDT0000_UUID       524288        1024      523264   0% /mnt/testfs[MDT:0]
testfs-OST0000_UUID       131072         571      130501   0% /mnt/testfs[OST:0]
testfs-OST0001_UUID       131072         562      130510   0% /mnt/testfs[OST:1]
testfs-OST0002_UUID       131072         576      130496   0% /mnt/testfs[OST:2]

filesystem summary:       524288        1024      523264   0% /mnt/testfs

The LDLM resource_count shows the number of lock resources, slightly less than 50k, far more than the number of actual objects in the filesystem:

# lctl get_param ldlm.namespaces.*.resource_count
ldlm.namespaces.filter-testfs-OST0000_UUID.resource_count=238
ldlm.namespaces.filter-testfs-OST0001_UUID.resource_count=226
ldlm.namespaces.filter-testfs-OST0002_UUID.resource_count=237
ldlm.namespaces.mdt-testfs-MDT0000_UUID.resource_count=49161
ldlm.namespaces.testfs-MDT0000-mdc-ffff8800a66c1c00.resource_count=49160
ldlm.namespaces.testfs-OST0000-osc-ffff8800a66c1c00.resource_count=237
ldlm.namespaces.testfs-OST0001-osc-ffff8800a66c1c00.resource_count=226
ldlm.namespaces.testfs-OST0002-osc-ffff8800a66c1c00.resource_count=236

Total memory used (as shown by "vmstat") also shows a steady increase over time, originally 914116kB of free memory, down to 202036kB after about 3000s of the run so far (about 700MB of memory used), and eventually ends up at 86724kB at the end of the run (830MB used). While that would be normal with a workload that is accessing a large number of files that are kept in cache, the total amount of used space in the filesystem is steadily about 240MB during the entire run.
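
These statistics were collected with a loop along the following lines (the interval, mount point, and log file here are illustrative rather than the exact commands used):

while sleep 60; do
    date
    vmstat 1 2 | tail -1                               # one fresh memory/CPU sample
    lfs df /mnt/testfs                                 # space usage
    lfs df -i /mnt/testfs                              # inode usage
    lctl get_param ldlm.namespaces.*.resource_count    # lock resources per namespace
    slabtop -o | head -30                              # one-shot slab summary
done >> /tmp/lu4053-stats.log 2>&1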

The "slabtop" output (edited to remove uninteresting slabs) shows over 150k and steadily growing number of allocated structures for CLIO, far more than could actually be in use at any given time. All of the CLIO slabs are 100% used, so it isn't just a matter of alloc/free causing partially-used slabs.

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME                   
242660 242660 100%    0.19K  12133       20     48532K size-192
217260 217260 100%    0.19K  10863       20     43452K dentry
203463 178864  87%    0.10K   5499       37     21996K buffer_head
182000 181972  99%    0.03K   1625      112      6500K size-32
181530 181530 100%    0.12K   6051       30     24204K size-128
156918 156918 100%    1.25K  52306        3    209224K lustre_inode_cache
156840 156840 100%    0.12K   5228       30     20912K lov_oinfo
156825 156825 100%    0.22K   9225       17     36900K lov_object_kmem
156825 156825 100%    0.22K   9225       17     36900K lovsub_object_kmem
156816 156816 100%    0.24K   9801       16     39204K ccc_object_kmem
156814 156814 100%    0.27K  11201       14     44804K osc_object_kmem
123832 121832  98%    0.50K  15479        8     61916K size-512
 98210  92250  93%    0.50K  14030        7     56120K ldlm_locks
 97460  91009  93%    0.38K   9746       10     38984K ldlm_resources
 76320  76320 100%    0.08K   1590       48      6360K mdd_obj
 76262  76262 100%    0.11K   2243       34      8972K lod_obj
 76245  76245 100%    0.28K   5865       13     23460K mdt_obj
  2865   2764  96%    1.03K    955        3      3820K ldiskfs_inode_cache
  1746   1546  88%    0.21K     97       18       388K cl_lock_kmem 
  1396   1396 100%    1.00K    349        4      1396K ptlrpc_cache
  1345   1008  74%    0.78K    269        5      1076K shmem_inode_cache
  1298    847  65%    0.06K     22       59        88K lovsub_lock_kmem
  1224    898  73%    0.16K     51       24       204K ofd_obj
  1008    794  78%    0.18K     48       21       192K osc_lock_kmem
  1008    783  77%    0.03K      9      112        36K lov_lock_link_kmem
   925    782  84%    0.10K     25       37       100K lov_lock_kmem
   920    785  85%    0.04K     10       92        40K ccc_lock_kmem

The ldiskfs_inode_cache shows a reasonable number of objects in use, one for each MDT and OST inode actually in use. It might be that this is a leak of unlinked inodes/dentries on the client?

Now, after 3600s of running, the dbench has finished and deleted all of the files:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    1229310     5.896  1056.405
 Close         903051     2.960  1499.813
 Rename         52083     8.024   827.129
 Unlink        248209     3.694   789.403
 Deltree           20   119.498   421.063
 Mkdir             10     0.050     0.155
 Qpathinfo    1114775     2.129   953.086
 Qfileinfo     195028     0.114    25.925
 Qfsinfo       204279     0.574    32.902
 Sfileinfo     100238    27.316  1442.888
 Find          430819     6.750  1369.539
 WriteX        611079     0.833   857.679
 ReadX        1927390     0.107  1171.947
 LockX           4004     0.005     1.899
 UnlockX         4004     0.003     3.345
 Flush          86164   183.254  2577.019

Throughput 10.6947 MB/sec  10 clients  10 procs  max_latency=2577.028 ms

The slabs still show a large number of allocations, even though no files exist in the filesystem anymore:

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME                   
289880 133498  46%    0.19K  14494       20     57976K size-192
278768 274718  98%    0.03K   2489      112      9956K size-32
274410 259726  94%    0.12K   9147       30     36588K size-128
253590 250634  98%    0.12K   8453       30     33812K lov_oinfo
253555 250634  98%    0.22K  14915       17     59660K lovsub_object_kmem
253552 250634  98%    0.24K  15847       16     63388K ccc_object_kmem
253540 250634  98%    0.27K  18110       14     72440K osc_object_kmem
253538 250634  98%    0.22K  14914       17     59656K lov_object_kmem
252330 250638  99%    1.25K  84110        3    336440K lustre_inode_cache
203463 179392  88%    0.10K   5499       37     21996K buffer_head
128894 128446  99%    0.11K   3791       34     15164K lod_obj
128880 128446  99%    0.08K   2685       48     10740K mdd_obj
128869 128446  99%    0.28K   9913       13     39652K mdt_obj
 84574  79368  93%    0.50K  12082        7     48328K ldlm_locks
 82660  79314  95%    0.38K   8266       10     33064K ldlm_resources
 71780  50308  70%    0.19K   3589       20     14356K dentry

There are also still about 40k MDT locks, though all of the OST locks are gone (which is expected if these files are unlinked).

# lctl get_param ldlm.namespaces.*.resource_count
ldlm.namespaces.filter-testfs-OST0000_UUID.resource_count=0
ldlm.namespaces.filter-testfs-OST0001_UUID.resource_count=0
ldlm.namespaces.filter-testfs-OST0002_UUID.resource_count=0
ldlm.namespaces.mdt-testfs-MDT0000_UUID.resource_count=39654
ldlm.namespaces.testfs-MDT0000-mdc-ffff8800a66c1c00.resource_count=39654
ldlm.namespaces.testfs-OST0000-osc-ffff8800a66c1c00.resource_count=0
ldlm.namespaces.testfs-OST0001-osc-ffff8800a66c1c00.resource_count=0
ldlm.namespaces.testfs-OST0002-osc-ffff8800a66c1c00.resource_count=0


 Comments   
Comment by Andreas Dilger [ 03/Oct/13 ]

After unmounting the client, a large number of slabs have been cleaned up, but not all of them:

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME                   
289460 132395  45%    0.19K  14473       20     57892K size-192
203463 179354  88%    0.10K   5499       37     21996K buffer_head
128894 127934  99%    0.11K   3791       34     15164K lod_obj
128880 127934  99%    0.08K   2685       48     10740K mdd_obj
128869 127934  99%    0.28K   9913       13     39652K mdt_obj
 71760  50205  69%    0.19K   3588       20     14352K dentry
  1491   1176  78%    1.03K    497        3      1988K ldiskfs_inode_cache

I ran with +malloc debug during the cleanup, and processed the debug log through leak_finder.pl. A sample of the logs:

*** Free without malloc (8 bytes at ffff880049c5b7e0, lov_object.c:lov_fini_raid0:377)
*** Free without malloc (112 bytes at ffff88003fcfe100, lov_ea.c:lsm_free_plain:130)
*** Free without malloc (80 bytes at ffff8800c3b630c0, lov_ea.c:lsm_free_plain:132)
*** Free without malloc (224 bytes at ffff88004d28b3f8, lov_object.c:lov_object_free:821)
*** Free without malloc (248 bytes at ffff88004d79c070, lcommon_cl.c:ccc_object_free:404)
*** Free without malloc (1256 bytes at ffff88002c195580, super25.c:ll_destroy_inode:80)
*** Free without malloc (272 bytes at ffff8800250ba9f8, osc_object.c:osc_object_free:128)
*** Free without malloc (224 bytes at ffff880036c25bd8, lovsub_object.c:lovsub_object_free:96)
*** Free without malloc (8 bytes at ffff880068e18bc0, lov_object.c:lov_fini_raid0:377)
*** Free without malloc (112 bytes at ffff88001d6efe00, lov_ea.c:lsm_free_plain:130)
*** Free without malloc (80 bytes at ffff8800b89ff940, lov_ea.c:lsm_free_plain:132)
*** Free without malloc (224 bytes at ffff88006e2b6e78, lov_object.c:lov_object_free:821)
*** Free without malloc (248 bytes at ffff88004d609830, lcommon_cl.c:ccc_object_free:404)
*** Free without malloc (1256 bytes at ffff8800baa22580, super25.c:ll_destroy_inode:80)
*** Free without malloc (272 bytes at ffff88004a7c4398, osc_object.c:osc_object_free:128)
*** Free without malloc (224 bytes at ffff88002515dbd8, lovsub_object.c:lovsub_object_free:96)
*** Free without malloc (8 bytes at ffff88006556d820, lov_object.c:lov_fini_raid0:377)
*** Free without malloc (112 bytes at ffff88003fcfe300, lov_ea.c:lsm_free_plain:130)
*** Free without malloc (80 bytes at ffff8800d3f239c0, lov_ea.c:lsm_free_plain:132)
*** Free without malloc (224 bytes at ffff88004d28b4d8, lov_object.c:lov_object_free:821)
*** Free without malloc (248 bytes at ffff88004d79c640, lcommon_cl.c:ccc_object_free:404)
*** Free without malloc (1256 bytes at ffff88006672d080, super25.c:ll_destroy_inode:80)
*** Free without malloc (272 bytes at ffff88004a7c44a8, osc_object.c:osc_object_free:128)
*** Free without malloc (224 bytes at ffff88002515dcb8, lovsub_object.c:lovsub_object_free:96)
*** Free without malloc (8 bytes at ffff880049c5b2a0, lov_object.c:lov_fini_raid0:377)
:
:
*** Free without malloc (320 bytes at ffff880081a96b00, ldlm_resource.c:ldlm_resource_putref_locked:1300)
*** Free without malloc (320 bytes at ffff88001ae75200, ldlm_resource.c:ldlm_resource_putref_locked:1300)
*** Free without malloc (320 bytes at ffff880024e4e980, ldlm_resource.c:ldlm_resource_putref_locked:1300)
*** Free without malloc (320 bytes at ffff880031a0c980, ldlm_resource.c:ldlm_resource_putref_locked:1300)
*** Free without malloc (320 bytes at ffff88002e66ec80, ldlm_resource.c:ldlm_resource_putref_locked:1300)
*** Free without malloc (320 bytes at ffff880023b68680, ldlm_resource.c:ldlm_resource_putref_locked:1300)
:
:
*** Free without malloc (504 bytes at ffff8800a67faa80, ldlm_lock.c:lock_handle_free:456)
*** Free without malloc (504 bytes at ffff880077984580, ldlm_lock.c:lock_handle_free:456)
*** Free without malloc (504 bytes at ffff88006e240380, ldlm_lock.c:lock_handle_free:456)
*** Free without malloc (504 bytes at ffff8800670bd900, ldlm_lock.c:lock_handle_free:456)
*** Free without malloc (504 bytes at ffff8800a3fb4880, ldlm_lock.c:lock_handle_free:456)
*** Free without malloc (504 bytes at ffff880058eaaa80, ldlm_lock.c:lock_handle_free:456)
*** Free without malloc (504 bytes at ffff8800c35aad80, ldlm_lock.c:lock_handle_free:456)
:
:

These are allocations that are being freed that were allocated before logging was enabled. Many thousands of these lines...

Comment by Andreas Dilger [ 03/Oct/13 ]

The dcache shrinking patch was disabled in http://review.whamcloud.com/1874 in v2_2_59_0-3-g9f3469f, but needs to be fixed somehow (e.g. have a zombie list that is cleaned up outside of the dcache lock).

Comment by Jinshan Xiong (Inactive) [ 04/Oct/13 ]

Is this because the debug buffer overflowed, so that it couldn't catch the allocation info?

Comment by Andreas Dilger [ 04/Oct/13 ]

No, just because I didn't have +malloc debugging enabled while the test was running, and because there is a good chance that the allocation is not very close to the free in the first place, so it would be mismatched without a huge debug buffer.

Since this test is so easy to run (sh llmount.sh; sh rundbench -t 3600 10) it is easy for anyone to get whatever information they need to debug it.
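
For example, something like the following would repeat the run with allocation logging enabled from the start (the debug buffer size, mount point, and paths are only examples):

sh llmount.sh                                   # single-node setup from lustre/tests
lctl set_param debug_mb=512                     # enlarge the debug buffer so allocs aren't lost
lctl set_param debug=+malloc                    # log OBD allocations and frees
sh rundbench -t 3600 10
umount /mnt/testfs                              # trigger the cleanup-time frees
lctl dk /tmp/lu4053-debug.log                   # dump the kernel debug buffer
perl lustre/tests/leak_finder.pl /tmp/lu4053-debug.log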

Comment by Bruno Faccini (Inactive) [ 07/Oct/13 ]

Andreas, thanks for your hints. I have started working on this.
On the other hand, I asked some of my contacts at customer sites running 2.1.6, and they don't see this on idle nodes after heavy production workloads.

Comment by Bruno Faccini (Inactive) [ 07/Oct/13 ]

This behavior does not show up with 1.8.9-wc1, but it still does with recent master builds.
It seems that "echo 3 > /proc/sys/vm/drop_caches" (without an unmount) clears both the client-side allocations AND the MDS ones.

Comment by Niu Yawei (Inactive) [ 09/Oct/13 ]

The LDLM resource_count shows the number of lock resources, slightly less than 50k, far more than the number of actual objects in the filesystem:

That is because the layout lock wasn't canceled on unlink & rename.

Total memory used (as shown by "vmstat") also shows a steady increase over time, originally 914116kB of free memory, down to 202036kB after about 3000s of the run so far (about 700MB of memory used), and eventually ends up at 86724kB at the end of the run (830MB used). While that would be normal with a workload that is accessing a large number of files that are kept in cache, the total amount of used space in the filesystem is steadily about 240MB during the entire run.

I think the memory was consumed by the slab caches.

The "slabtop" output (edited to remove uninteresting slabs) shows over 150k and steadily growing number of allocated structures for CLIO, far more than could actually be in use at any given time. All of the CLIO slabs are 100% used, so it isn't just a matter of alloc/free causing partially-used slabs.

I guess it is because of the unlink/rename, so a huge number of CLIO slab objects were created (and I'm not quite sure what ACTIVE / USE in slabtop mean).

The slabs still show a large number of allocations, even though no files exist in the filesystem anymore:
  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
289880 133498  46%    0.19K  14494       20     57976K size-192
278768 274718  98%    0.03K   2489      112      9956K size-32
274410 259726  94%    0.12K   9147       30     36588K size-128
253590 250634  98%    0.12K   8453       30     33812K lov_oinfo
253555 250634  98%    0.22K  14915       17     59660K lovsub_object_kmem
253552 250634  98%    0.24K  15847       16     63388K ccc_object_kmem
253540 250634  98%    0.27K  18110       14     72440K osc_object_kmem
253538 250634  98%    0.22K  14914       17     59656K lov_object_kmem
252330 250638  99%    1.25K  84110        3    336440K lustre_inode_cache
203463 179392  88%    0.10K   5499       37     21996K buffer_head
128894 128446  99%    0.11K   3791       34     15164K lod_obj
128880 128446  99%    0.08K   2685       48     10740K mdd_obj
128869 128446  99%    0.28K   9913       13     39652K mdt_obj
 84574  79368  93%    0.50K  12082        7     48328K ldlm_locks
 82660  79314  95%    0.38K   8266       10     33064K ldlm_resources
 71780  50308  70%    0.19K   3589       20     14356K dentry

I think Lustre has released all of the objects; it is the slab cache which is holding them, and it is up to the kernel to decide when to free them to reclaim memory.

After unmounting the client, a large number of slabs have been cleaned up, but not all of them:
  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
289460 132395  45%    0.19K  14473       20     57892K size-192
203463 179354  88%    0.10K   5499       37     21996K buffer_head
128894 127934  99%    0.11K   3791       34     15164K lod_obj
128880 127934  99%    0.08K   2685       48     10740K mdd_obj
128869 127934  99%    0.28K   9913       13     39652K mdt_obj
 71760  50205  69%    0.19K   3588       20     14352K dentry
  1491   1176  78%    1.03K    497        3      1988K ldiskfs_inode_cache

We can see that all of the client's slabs have been freed; the remaining ones are all server-side.

One thing that confuses me is that after dbench finished, there is still a huge number of layout locks cached:

00010000:00010000:1.0:1381289786.413195:0:11603:0:(ldlm_resource.c:1448:ldlm_resource_dump()) --- Resource: [0x200000400:0x1cd83:0x0].0 (ffff880019c2d340) refcount = 2
00010000:00010000:1.0:1381289786.413196:0:11603:0:(ldlm_resource.c:1451:ldlm_resource_dump()) Granted locks (in reverse order):
00010000:00010000:1.0:1381289786.413196:0:11603:0:(ldlm_resource.c:1454:ldlm_resource_dump()) ### ### ns: lustre-MDT0000-mdc-ffff8800265a9800 lock: ffff88004fdb0b80/0x68a3d67bc8651524 lrc: 1/0,0 mode: CR/CR res: [0x200000400:0x1cd83:0x0].0 bits 0x8 rrc: 2 type: IBT flags: 0x0 nid: local remote: 0x68a3d67bc865152b expref: -99 pid: 10969 timeout: 0 lvb_type: 3
00010000:00010000:1.0:1381289786.413198:0:11603:0:(ldlm_resource.c:1448:ldlm_resource_dump()) --- Resource: [0x200000400:0x11d83:0x0].0 (ffff880070efba80) refcount = 2
00010000:00010000:1.0:1381289786.413198:0:11603:0:(ldlm_resource.c:1451:ldlm_resource_dump()) Granted locks (in reverse order):
00010000:00010000:1.0:1381289786.413199:0:11603:0:(ldlm_resource.c:1454:ldlm_resource_dump()) ### ### ns: lustre-MDT0000-mdc-ffff8800265a9800 lock: ffff8800681ac380/0x68a3d67bc691ef1e lrc: 1/0,0 mode: CR/CR res: [0x200000400:0x11d83:0x0].0 bits 0x8 rrc: 2 type: IBT flags: 0x0 nid: local remote: 0x68a3d67bc691ef3a expref: -99 pid: 10969 timeout: 0 lvb_type: 3
00010000:00010000:1.0:1381289786.413200:0:11603:0:(ldlm_resource.c:1448:ldlm_resource_dump()) --- Resource: [0x200000400:0x15143:0x0].0 (ffff880046767e80) refcount = 2
00010000:00010000:1.0:1381289786.413201:0:11603:0:(ldlm_resource.c:1451:ldlm_resource_dump()) Granted locks (in reverse order):
00010000:00010000:1.0:1381289786.413201:0:11603:0:(ldlm_resource.c:1454:ldlm_resource_dump()) ### ### ns: lustre-MDT0000-mdc-ffff8800265a9800 lock: ffff8800115848c0/0x68a3d67bc71e9503 lrc: 1/0,0 mode: CR/CR res: [0x200000400:0x15143:0x0].0 bits 0x8 rrc: 2 type: IBT flags: 0x0 nid: local remote: 0x68a3d67bc71e951f expref: -99 pid: 10969 timeout: 0 lvb_type: 3

I checked the server code and see that the layout lock isn't revoked on unlink/rename (see LU-4002 "hsm: avoid layout lock on unlink and rename onto"), so the layout locks stay cached on the client even after all files have been removed.
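
(For what it's worth, a quick way to count these is to grep the lock lines of the same dump; the log path below is just an example:)

grep 'ldlm_resource_dump' /tmp/lustre-debug.log | grep -c 'bits 0x8'    # LAYOUT-only ibits locks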

Comment by Niu Yawei (Inactive) [ 09/Oct/13 ]

What surprised me is that even if I revert the LU-4002 change, the layout locks are still in the client cache after the dbench run... Xiong, do you have any idea? Thanks.

Comment by Bruno Faccini (Inactive) [ 09/Oct/13 ]

Niu, thanks for all these details! My understanding of the ACTIVE/USE meaning is that they indicate the number/% of non-freed objects per slab type.

Andreas, given that it is now confirmed that all of these allocations are only kept due to caching, and can be reclaimed either on demand or unconditionally (again, "echo 3 > /proc/sys/vm/drop_caches" frees almost everything), could we at least downgrade this ticket's priority, or do you still consider it critical?
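
(Regarding ACTIVE/USE: slabtop's OBJS and ACTIVE columns are the num_objs and active_objs fields of /proc/slabinfo, so the same numbers can be checked directly, for example:)

grep -E '^(lustre_inode_cache|ldlm_locks|ldlm_resources) ' /proc/slabinfo |
    awk '{printf "%-22s active=%s total=%s\n", $1, $2, $3}'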

Comment by Jinshan Xiong (Inactive) [ 09/Oct/13 ]

I think we need to figure out why the layout lock is still cached even after the LU-4002 patch is reverted.

Comment by Jinshan Xiong (Inactive) [ 09/Oct/13 ]

I'm totally fine with the slab objects remaining in cache, because that is normal Linux behavior and does no harm.

Comment by Andreas Dilger [ 09/Oct/13 ]

I don't think that the slab objects are just "in the cache"; I think they are actively being referenced by some part of Lustre (e.g. the lu cache, the dentry cache, or similar). If slabs were just being kept around by the kernel, the OBJS number would be high but the ACTIVE number would be low (excluding some very small number of objects in a per-CPU cache). As was reported in LU-3771, there can be a very large amount of memory (56GB on a 64GB client) that is not released even under memory pressure, and it causes real problems for applications.

It doesn't make any sense to cache locks or objects for files that have been deleted or dentries that are no longer in memory. Even if that memory is eventually freed, there is a real impact to applications and Lustre itself because there can be large amounts of memory wasted that could be better used by something else.

Comment by Andreas Dilger [ 09/Oct/13 ]

Regarding the MDS_INODEBITS_LAYOUT locks, the first question to ask is why there are separate LAYOUT locks in the first place? Unless there are HSM/migration operations on the file, the LAYOUT bit should always be granted along with LOOKUP/UPDATE to avoid extra RPCs being sent. That would also ensure that the LAYOUT lock would be cancelled along with the file being deleted, unless there was still IO happening on the file.

Secondly, the dcache cleanup for deleted files needs to be fixed again, since deleting the dentries will also delete the locks. This was disabled in http://review.whamcloud.com/1874 but is directly contributing to this problem. If the dcache locking prevents us from doing the right thing anymore, there is code that was removed in commit 3698e90b9b8 (deathrow for dentries) that could be revived.

Comment by Jinshan Xiong (Inactive) [ 09/Oct/13 ]

Yes, you're right. The root cause of this issue is that inodes were not removed from the cache when the files were being deleted. This is why lustre_inode_cache has a high active percentage, and the various *_object_kmem caches are just fallout from that.

Regarding the MDS_INODEBITS_LAYOUT locks, the first question to ask is why there are separate LAYOUT locks in the first place? Unless there are HSM/migration operations on the file, the LAYOUT bit should always be granted along with LOOKUP/UPDATE to avoid extra RPCs being sent. That would also ensure that the LAYOUT lock would be cancelled along with the file being deleted, unless there was still IO happening on the file.

Yes, the LAYOUT lock should be granted with the UPDATE/LOOKUP lock; but since that DLM lock may be revoked when permissions and timestamps change, the process doing a glimpse then has to enqueue a standalone LAYOUT lock before using the layout. This is why there are so many standalone LAYOUT locks.

The change in ll_ddelete() is connected to LU-2487, so re-enabling it is not an option. Deathrow would be a good way to go; I need to check it out.

Actually, in another ticket I suggested adding a hint in the blocking AST which tells the client why the lock is being canceled. My original intention was to drop the page cache only if the layout has changed (not due to false sharing of the DLM lock). I realize we can use it here as well - we can drop i_nlink to zero in ll_md_blocking_ast() if we know the file is being unlinked.

Comment by Niu Yawei (Inactive) [ 10/Oct/13 ]

The change in ll_ddelete() is connected to LU-2487, so re-enabling it is not an option. Deathrow would be a good way to go; I need to check it out.
Actually, in another ticket I suggested adding a hint in the blocking AST which tells the client why the lock is being canceled. My original intention was to drop the page cache only if the layout has changed (not due to false sharing of the DLM lock). I realize we can use it here as well - we can drop i_nlink to zero in ll_md_blocking_ast() if we know the file is being unlinked.

ll_d_iput() will check and clear nlink as well, so I think disabling it in ll_ddelete() isn't a problem. Probably the root cause is that the layout lock wasn't canceled, so the inode's nlink can't be cleared.

Comment by Andreas Dilger [ 10/Oct/13 ]

Also, when an object is being destroyed on the OST it should be sending blocking callbacks to the clients with LDLM_FL_DISCARD_DATA, so that should be a sign that the client can immediately drop its cache without any writes. Could someone please confirm that this is actually happening?

Comment by Jinshan Xiong (Inactive) [ 10/Oct/13 ]

Yes, it's happening on the client.

On the server side, the LDLM_FL_DISCARD_DATA flag is set at revocation time:

void ldlm_add_bl_work_item(struct ldlm_lock *lock, struct ldlm_lock *new,
                           cfs_list_t *work_list)
{
        if ((lock->l_flags & LDLM_FL_AST_SENT) == 0) {
                LDLM_DEBUG(lock, "lock incompatible; sending blocking AST.");
                lock->l_flags |= LDLM_FL_AST_SENT;
                /* If the enqueuing client said so, tell the AST recipient to
                 * discard dirty data, rather than writing back. */
                if (new->l_flags & LDLM_FL_AST_DISCARD_DATA)
                        lock->l_flags |= LDLM_FL_DISCARD_DATA;
                LASSERT(cfs_list_empty(&lock->l_bl_ast));
                cfs_list_add(&lock->l_bl_ast, work_list);
                LDLM_LOCK_GET(lock);
                LASSERT(lock->l_blocking_lock == NULL);
                lock->l_blocking_lock = LDLM_LOCK_GET(new);
        }
}

On the client side, this flag is transferred to the ldlm_lock in ldlm_callback_handler(), here:

        /* Copy hints/flags (e.g. LDLM_FL_DISCARD_DATA) from AST. */
        lock_res_and_lock(lock);
        lock->l_flags |= ldlm_flags_from_wire(dlm_req->lock_flags &
                                              LDLM_AST_FLAGS);

When the lock is canceled on the client side, it takes this into account in osc_lock_cancel():

        if (dlmlock != NULL) {
                int do_cancel;
                        
                discard = !!(dlmlock->l_flags & LDLM_FL_DISCARD_DATA);
                if (olck->ols_state >= OLS_GRANTED)
                        result = osc_lock_flush(olck, discard);

And then osc_cache_writeback_range() will discard all of the pages.

Did you find any problems in your experiment?

Comment by Jinshan Xiong (Inactive) [ 11/Oct/13 ]

Hi Niu, are you working on this? If not, I can start working on it.

Comment by Niu Yawei (Inactive) [ 11/Oct/13 ]

Xiong, yes, I was trying to find out why the layout lock wasn't canceled, but haven't found the root cause so far. I'm glad you can take this over if you have time. Thank you.

Comment by Andreas Dilger [ 11/Oct/13 ]

Jinshan, I haven't checked whether LDLM_FL_DISCARD_DATA is actually working. Even though it works in theory, it hasn't been tested in a long time because sanity.sh test_42b has been disabled forever because of race conditions in the test (sometimes page writeback can happen during the test). It would be nice if we had a more robust test case for this, or maybe just change it to be error_ignore?

Comment by Jinshan Xiong (Inactive) [ 11/Oct/13 ]

Actually, 42b is easy to fix now that osc_extent is implemented, because osc extents won't be allowed to flush until one system-call write ends or enough pages have been collected. We can create a fail_loc to delay the write path and unlink the file in the meantime.

Niu, can you please work this out while I'm looking at the layout lock issue?

Comment by Jinshan Xiong (Inactive) [ 11/Oct/13 ]

From these two lines:

ldlm.namespaces.mdt-testfs-MDT0000_UUID.resource_count=39654
ldlm.namespaces.testfs-MDT0000-mdc-ffff8800a66c1c00.resource_count=39654

The locks are actually not canceled from the MDT side. I will drill down to find the source.

Comment by Jinshan Xiong (Inactive) [ 11/Oct/13 ]

These are open-unlinked files. After a file is unlinked, file IO will cause a layout lock to be created, and those locks will never be destroyed. We can fix this issue by acquiring FULL ibits locks at the last close for unlinked files.

Comment by Niu Yawei (Inactive) [ 14/Oct/13 ]

These are open-unlinked files. After a file is unlinked, file IO will cause a layout lock to be created, and those locks will never be destroyed. We can fix this issue by acquiring FULL ibits locks at the last close for unlinked files.

An open-unlinked file is a problem, but I don't think there are that many open-unlinked files in the dbench test. The major problem looks like this: the client issues an unlink, the object on the MDT is unlinked, but the objects on the OSTs are kept around for a short time (until the OSP syncs the unlink record), so dirty data is still cached on the client and nlink isn't cleared (the extent lock is still cached). If a data flush (ll_writepages()) is triggered before the extent lock is revoked, the client will acquire the layout lock again, and that layout lock will never be canceled.

Comment by Jinshan Xiong (Inactive) [ 14/Oct/13 ]

Here is the patch: http://review.whamcloud.com/7942; however, the fix is not complete.

Niu, yes, that is correct. I discovered that problem also and worked out a patch to fix it, but it still has a few problems - if a file is cached on one client which is then deleted from another client, there is no way of taking this file out of cache. We need to start a dedicated kernel thread for this purpose. If you're still working on this, please take it over to avoid duplicate work.

Comment by Niu Yawei (Inactive) [ 15/Oct/13 ]

Niu, yes, that is correct. I discovered that problem also and worked out a patch to fix it, but it still has a few problems - if a file is cached on one client which is then deleted from another client, there is no way of taking this file out of cache. We need to start a dedicated kernel thread for this purpose. If you're still working on this, please take it over to avoid duplicate work.

Actually, I found the problem when I was trying to restore the sanity test_42b. I'm fine with taking this over, but I need some time to think about how to fix it.

Comment by Niu Yawei (Inactive) [ 16/Oct/13 ]

Setting aside the problem of how to clean up the inode cache, the extra layout lock on dirty flush reminds me that the LU-3160 fix isn't right; I think we should fix LU-3160 another way instead of acquiring the layout lock on dirty flush: http://review.whamcloud.com/7957

Comment by Jinshan Xiong (Inactive) [ 16/Oct/13 ]

Hi Niu, with my patch at 7942, which does not grant the layout lock if the file no longer exists on the MDT, the layout lock problem goes away. Let me know if I missed something, thanks.

Comment by Niu Yawei (Inactive) [ 17/Oct/13 ]

Hi Niu, with my patch at 7942, which does not grant the layout lock if the file no longer exists on the MDT, the layout lock problem goes away. Let me know if I missed something, thanks.

I think that is another problem that should be fixed (it's actually a regression; we did check whether the object exists back when we used the getattr function to handle the layout intent). The reasons I want to revert the LU-3160 fix are:

  • For a dirty flush, the client should not try to acquire the layout lock at all, rather than acquiring it and then failing with -ENOENT.
  • The LU-3160 fix added extra code on the client (ll_umounting and the -ENOENT check), which can be removed in the new fix.

Comment by Andreas Dilger [ 26/Feb/14 ]

The patch http://review.whamcloud.com/9223 for LU-4357 to add __GFP_WAIT might also help this situation, because the client and server will begin generating memory pressure during allocation, and hopefully flush some of the stale objects from the slabs. I'm not sure that in itself is enough.

I think this needs to be retested once 9223 lands to see what the current state of affairs is.
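
A simple retest could be along these lines, applying memory pressure after the dbench run and checking whether the Lustre slabs actually shrink (the tmpfs file size is just an example and needs tuning to the node's RAM):

grep -E '^(lustre_inode_cache|osc_object_kmem|ldlm_locks)' /proc/slabinfo
dd if=/dev/zero of=/dev/shm/pressure bs=1M count=1024    # ~1GB of tmpfs pages on the 2GB node
grep -E '^(lustre_inode_cache|osc_object_kmem|ldlm_locks)' /proc/slabinfo
rm /dev/shm/pressure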

Comment by Cory Spitz [ 27/Mar/14 ]

Any update since change #9223 landed?

Comment by Niu Yawei (Inactive) [ 28/Jul/14 ]

I split Jinshan's patch: http://review.whamcloud.com/11243. This one should solve the problem described in this ticket (the layout lock was fetched by the client after the file was unlinked, with the result that the inode cache for the unlinked file couldn't be purged).

Comment by parinay v kondekar (Inactive) [ 28/Apr/15 ]

Hello Niu, Jinshan,

I ported the patch http://review.whamcloud.com/#/c/7942/ to 2.5.1, since I thought it could help fix the problem reported in LU-2857, and it did fix the reported issue.

I have read the comments here and the review comments on the patch, and they seem to suggest the patch is incomplete, especially the following:

>> " I discovered that problem also and worked out a patch to fix it, but it still has a few problems - if a file is cached on one client which is then
>> deleted from another client, there is no way of taking this file out of cache. We need to start a dedicated kernel thread for this purpose."

  • Is it possible to split the patch to isolate the issue reported in LU-2857? I am not sure if it would be the right approach.
  • Has the dedicated kernel thread you mentioned been implemented?

Any help/guidance ?

Thanks
