[LU-4053] client leaking objects/locks during IO  Created: 02/Oct/13  Updated: 05/Aug/20  Resolved: 05/Aug/20

| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Andreas Dilger | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | mq115 |
| Environment: | Config: Single-node client+MDS+OSS with 1 MDT, 3 OSTs |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 10870 |
| Description |
|
I'm trying to determine if there is a "memory leak" in the current Lustre code that can affect long-running clients or servers. While this memory may be cleaned up when the filesystem is unmounted, it does not appear to be cleaned up under steady-state usage.

I started "rundbench 10 -t 3600" and am watching the memory usage in several forms (slabtop, vmstat, "lfs df", "lfs df -i"). It does indeed appear that there are a number of statistics that show what looks to be a memory leak. These statistics are gathered at about the same time, but not exactly at the same time. The general trend is fairly clear, however.

The "lfs df -i" output shows only around 1000 in-use files during the whole run:

UUID                 Inodes  IUsed   IFree IUse% Mounted on
testfs-MDT0000_UUID  524288   1024  523264    0% /mnt/testfs[MDT:0]
testfs-OST0000_UUID  131072    571  130501    0% /mnt/testfs[OST:0]
testfs-OST0001_UUID  131072    562  130510    0% /mnt/testfs[OST:1]
testfs-OST0002_UUID  131072    576  130496    0% /mnt/testfs[OST:2]
filesystem summary:  524288   1024  523264    0% /mnt/testfs

The LDLM resource_count shows the number of locks, slightly less than 50k, but a lot more than the number of actual objects in the filesystem:

# lctl get_param ldlm.namespaces.*.resource_count
ldlm.namespaces.filter-testfs-OST0000_UUID.resource_count=238
ldlm.namespaces.filter-testfs-OST0001_UUID.resource_count=226
ldlm.namespaces.filter-testfs-OST0002_UUID.resource_count=237
ldlm.namespaces.mdt-testfs-MDT0000_UUID.resource_count=49161
ldlm.namespaces.testfs-MDT0000-mdc-ffff8800a66c1c00.resource_count=49160
ldlm.namespaces.testfs-OST0000-osc-ffff8800a66c1c00.resource_count=237
ldlm.namespaces.testfs-OST0001-osc-ffff8800a66c1c00.resource_count=226
ldlm.namespaces.testfs-OST0002-osc-ffff8800a66c1c00.resource_count=236

Total memory used (as shown by "vmstat") also shows a steady increase over time: originally 914116kB of free memory, down to 202036kB after about 3000s of the run so far (about 700MB of memory used), and eventually ending up at 86724kB at the end of the run (830MB used). While that would be normal with a workload that is accessing a large number of files that are kept in cache, the total amount of used space in the filesystem stays at about 240MB during the entire run.

The "slabtop" output (edited to remove uninteresting slabs) shows over 150k and a steadily growing number of allocated structures for CLIO, far more than could actually be in use at any given time. All of the CLIO slabs are 100% used, so it isn't just a matter of alloc/free causing partially-used slabs:

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
242660 242660 100%    0.19K  12133       20     48532K size-192
217260 217260 100%    0.19K  10863       20     43452K dentry
203463 178864  87%    0.10K   5499       37     21996K buffer_head
182000 181972  99%    0.03K   1625      112      6500K size-32
181530 181530 100%    0.12K   6051       30     24204K size-128
156918 156918 100%    1.25K  52306        3    209224K lustre_inode_cache
156840 156840 100%    0.12K   5228       30     20912K lov_oinfo
156825 156825 100%    0.22K   9225       17     36900K lov_object_kmem
156825 156825 100%    0.22K   9225       17     36900K lovsub_object_kmem
156816 156816 100%    0.24K   9801       16     39204K ccc_object_kmem
156814 156814 100%    0.27K  11201       14     44804K osc_object_kmem
123832 121832  98%    0.50K  15479        8     61916K size-512
 98210  92250  93%    0.50K  14030        7     56120K ldlm_locks
 97460  91009  93%    0.38K   9746       10     38984K ldlm_resources
 76320  76320 100%    0.08K   1590       48      6360K mdd_obj
 76262  76262 100%    0.11K   2243       34      8972K lod_obj
 76245  76245 100%    0.28K   5865       13     23460K mdt_obj
  2865   2764  96%    1.03K    955        3      3820K ldiskfs_inode_cache
  1746   1546  88%    0.21K     97       18       388K cl_lock_kmem
  1396   1396 100%    1.00K    349        4      1396K ptlrpc_cache
  1345   1008  74%    0.78K    269        5      1076K shmem_inode_cache
  1298    847  65%    0.06K     22       59        88K lovsub_lock_kmem
  1224    898  73%    0.16K     51       24       204K ofd_obj
  1008    794  78%    0.18K     48       21       192K osc_lock_kmem
  1008    783  77%    0.03K      9      112        36K lov_lock_link_kmem
   925    782  84%    0.10K     25       37       100K lov_lock_kmem
   920    785  85%    0.04K     10       92        40K ccc_lock_kmem

The ldiskfs_inode_cache shows a reasonable number of objects in use, one for each MDT and OST inode actually in use. It might be that this is a leak of unlinked inodes/dentries on the client?

Now, after 3600s of running, the dbench has finished and deleted all of the files:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    1229310     5.896  1056.405
 Close         903051     2.960  1499.813
 Rename         52083     8.024   827.129
 Unlink        248209     3.694   789.403
 Deltree           20   119.498   421.063
 Mkdir             10     0.050     0.155
 Qpathinfo    1114775     2.129   953.086
 Qfileinfo     195028     0.114    25.925
 Qfsinfo       204279     0.574    32.902
 Sfileinfo     100238    27.316  1442.888
 Find          430819     6.750  1369.539
 WriteX        611079     0.833   857.679
 ReadX        1927390     0.107  1171.947
 LockX           4004     0.005     1.899
 UnlockX         4004     0.003     3.345
 Flush          86164   183.254  2577.019

Throughput 10.6947 MB/sec  10 clients  10 procs  max_latency=2577.028 ms

The slabs still show a large number of allocations, even though no files exist in the filesystem anymore:

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
289880 133498  46%    0.19K  14494       20     57976K size-192
278768 274718  98%    0.03K   2489      112      9956K size-32
274410 259726  94%    0.12K   9147       30     36588K size-128
253590 250634  98%    0.12K   8453       30     33812K lov_oinfo
253555 250634  98%    0.22K  14915       17     59660K lovsub_object_kmem
253552 250634  98%    0.24K  15847       16     63388K ccc_object_kmem
253540 250634  98%    0.27K  18110       14     72440K osc_object_kmem
253538 250634  98%    0.22K  14914       17     59656K lov_object_kmem
252330 250638  99%    1.25K  84110        3    336440K lustre_inode_cache
203463 179392  88%    0.10K   5499       37     21996K buffer_head
128894 128446  99%    0.11K   3791       34     15164K lod_obj
128880 128446  99%    0.08K   2685       48     10740K mdd_obj
128869 128446  99%    0.28K   9913       13     39652K mdt_obj
 84574  79368  93%    0.50K  12082        7     48328K ldlm_locks
 82660  79314  95%    0.38K   8266       10     33064K ldlm_resources
 71780  50308  70%    0.19K   3589       20     14356K dentry

There are also still about 40k MDT locks, though all of the OST locks are gone (which is expected if these files are unlinked):

# lctl get_param ldlm.namespaces.*.resource_count
ldlm.namespaces.filter-testfs-OST0000_UUID.resource_count=0
ldlm.namespaces.filter-testfs-OST0001_UUID.resource_count=0
ldlm.namespaces.filter-testfs-OST0002_UUID.resource_count=0
ldlm.namespaces.mdt-testfs-MDT0000_UUID.resource_count=39654
ldlm.namespaces.testfs-MDT0000-mdc-ffff8800a66c1c00.resource_count=39654
ldlm.namespaces.testfs-OST0000-osc-ffff8800a66c1c00.resource_count=0
ldlm.namespaces.testfs-OST0001-osc-ffff8800a66c1c00.resource_count=0
ldlm.namespaces.testfs-OST0002-osc-ffff8800a66c1c00.resource_count=0 |
| Comments |
| Comment by Andreas Dilger [ 03/Oct/13 ] |
|
After unmounting the client, a large number of slabs have been cleaned up, but not all of them:

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
289460 132395  45%    0.19K  14473       20     57892K size-192
203463 179354  88%    0.10K   5499       37     21996K buffer_head
128894 127934  99%    0.11K   3791       34     15164K lod_obj
128880 127934  99%    0.08K   2685       48     10740K mdd_obj
128869 127934  99%    0.28K   9913       13     39652K mdt_obj
 71760  50205  69%    0.19K   3588       20     14352K dentry
  1491   1176  78%    1.03K    497        3      1988K ldiskfs_inode_cache

I ran with +malloc debug during the cleanup, and processed the debug log through leak_finder.pl. A sample of the logs:

*** Free without malloc (8 bytes at ffff880049c5b7e0, lov_object.c:lov_fini_raid0:377)
*** Free without malloc (112 bytes at ffff88003fcfe100, lov_ea.c:lsm_free_plain:130)
*** Free without malloc (80 bytes at ffff8800c3b630c0, lov_ea.c:lsm_free_plain:132)
*** Free without malloc (224 bytes at ffff88004d28b3f8, lov_object.c:lov_object_free:821)
*** Free without malloc (248 bytes at ffff88004d79c070, lcommon_cl.c:ccc_object_free:404)
*** Free without malloc (1256 bytes at ffff88002c195580, super25.c:ll_destroy_inode:80)
*** Free without malloc (272 bytes at ffff8800250ba9f8, osc_object.c:osc_object_free:128)
*** Free without malloc (224 bytes at ffff880036c25bd8, lovsub_object.c:lovsub_object_free:96)
*** Free without malloc (8 bytes at ffff880068e18bc0, lov_object.c:lov_fini_raid0:377)
*** Free without malloc (112 bytes at ffff88001d6efe00, lov_ea.c:lsm_free_plain:130)
*** Free without malloc (80 bytes at ffff8800b89ff940, lov_ea.c:lsm_free_plain:132)
*** Free without malloc (224 bytes at ffff88006e2b6e78, lov_object.c:lov_object_free:821)
*** Free without malloc (248 bytes at ffff88004d609830, lcommon_cl.c:ccc_object_free:404)
*** Free without malloc (1256 bytes at ffff8800baa22580, super25.c:ll_destroy_inode:80)
*** Free without malloc (272 bytes at ffff88004a7c4398, osc_object.c:osc_object_free:128)
*** Free without malloc (224 bytes at ffff88002515dbd8, lovsub_object.c:lovsub_object_free:96)
*** Free without malloc (8 bytes at ffff88006556d820, lov_object.c:lov_fini_raid0:377)
*** Free without malloc (112 bytes at ffff88003fcfe300, lov_ea.c:lsm_free_plain:130)
*** Free without malloc (80 bytes at ffff8800d3f239c0, lov_ea.c:lsm_free_plain:132)
*** Free without malloc (224 bytes at ffff88004d28b4d8, lov_object.c:lov_object_free:821)
*** Free without malloc (248 bytes at ffff88004d79c640, lcommon_cl.c:ccc_object_free:404)
*** Free without malloc (1256 bytes at ffff88006672d080, super25.c:ll_destroy_inode:80)
*** Free without malloc (272 bytes at ffff88004a7c44a8, osc_object.c:osc_object_free:128)
*** Free without malloc (224 bytes at ffff88002515dcb8, lovsub_object.c:lovsub_object_free:96)
*** Free without malloc (8 bytes at ffff880049c5b2a0, lov_object.c:lov_fini_raid0:377)
:       :
*** Free without malloc (320 bytes at ffff880081a96b00, ldlm_resource.c:ldlm_resource_putref_locked:1300)
*** Free without malloc (320 bytes at ffff88001ae75200, ldlm_resource.c:ldlm_resource_putref_locked:1300)
*** Free without malloc (320 bytes at ffff880024e4e980, ldlm_resource.c:ldlm_resource_putref_locked:1300)
*** Free without malloc (320 bytes at ffff880031a0c980, ldlm_resource.c:ldlm_resource_putref_locked:1300)
*** Free without malloc (320 bytes at ffff88002e66ec80, ldlm_resource.c:ldlm_resource_putref_locked:1300)
*** Free without malloc (320 bytes at ffff880023b68680, ldlm_resource.c:ldlm_resource_putref_locked:1300)
:       :
*** Free without malloc (504 bytes at ffff8800a67faa80, ldlm_lock.c:lock_handle_free:456)
*** Free without malloc (504 bytes at ffff880077984580, ldlm_lock.c:lock_handle_free:456)
*** Free without malloc (504 bytes at ffff88006e240380, ldlm_lock.c:lock_handle_free:456)
*** Free without malloc (504 bytes at ffff8800670bd900, ldlm_lock.c:lock_handle_free:456)
*** Free without malloc (504 bytes at ffff8800a3fb4880, ldlm_lock.c:lock_handle_free:456)
*** Free without malloc (504 bytes at ffff880058eaaa80, ldlm_lock.c:lock_handle_free:456)
*** Free without malloc (504 bytes at ffff8800c35aad80, ldlm_lock.c:lock_handle_free:456)
:       :

These are allocations that are being freed that were allocated before logging was enabled. Many thousands of these lines... |
| Comment by Andreas Dilger [ 03/Oct/13 ] |
|
The dcache shrinking patch was disabled in http://review.whamcloud.com/1874 in v2_2_59_0-3-g9f3469f, but needs to be fixed somehow (e.g. have a zombie list that is cleaned up outside of the dcache lock). |
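A minimal sketch of what such a zombie list could look like: dentries are parked on a list while the dcache lock is held, and only released later from a context that does not hold it. All of the names below (ll_zombie_*, ll_kill_zombie_dentries) are illustrative assumptions, not the actual Lustre code:

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/slab.h>
#include <linux/dcache.h>

struct ll_zombie {
        struct list_head  lz_list;
        struct dentry    *lz_dentry;
};

static LIST_HEAD(ll_zombie_list);
static DEFINE_SPINLOCK(ll_zombie_lock);

/* Called while the dcache lock is held: just park the dentry on the
 * zombie list instead of trying to free it in place. */
static int ll_dentry_make_zombie(struct dentry *dentry)
{
        struct ll_zombie *lz = kmalloc(sizeof(*lz), GFP_ATOMIC);

        if (lz == NULL)
                return -ENOMEM;

        lz->lz_dentry = dget(dentry);
        spin_lock(&ll_zombie_lock);
        list_add_tail(&lz->lz_list, &ll_zombie_list);
        spin_unlock(&ll_zombie_lock);
        return 0;
}

/* Called later (e.g. from a workqueue) with no dcache lock held, so the
 * final dput() can take whatever locks it needs. */
static void ll_kill_zombie_dentries(void)
{
        LIST_HEAD(kill);
        struct ll_zombie *lz, *next;

        spin_lock(&ll_zombie_lock);
        list_splice_init(&ll_zombie_list, &kill);
        spin_unlock(&ll_zombie_lock);

        list_for_each_entry_safe(lz, next, &kill, lz_list) {
                list_del(&lz->lz_list);
                dput(lz->lz_dentry);
                kfree(lz);
        }
}

The deferred-cleanup shape is the same idea as the old "deathrow" code mentioned later in this ticket; the point is only that the expensive dentry release happens without the dcache lock held.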
| Comment by Jinshan Xiong (Inactive) [ 04/Oct/13 ] |
|
Is this because the debug buffer overflowed, so that it couldn't catch the allocation info? |
| Comment by Andreas Dilger [ 04/Oct/13 ] |
|
No, just because I didn't have +malloc debugging enabled while the test was running, and because there is a good chance that the allocation is not very close to the free in the first place, so it would be mismatched without a huge debug buffer. Since this test is so easy to run (sh llmount.sh; sh rundbench -t 3600 10) it is easy for anyone to get whatever information they need to debug it. |
| Comment by Bruno Faccini (Inactive) [ 07/Oct/13 ] |
|
Andreas, thanks for your hints. I started to work on this. |
| Comment by Bruno Faccini (Inactive) [ 07/Oct/13 ] |
|
This behavior does not show up with 1.8.9-wc1, but it still does when running the latest master builds. |
| Comment by Niu Yawei (Inactive) [ 09/Oct/13 ] |
It's because the layout lock wasn't canceled on unlink & rename.

I think the memory was consumed by the slab cache. I guess it's because of the unlink/rename that such a huge number of slab objects for CLIO objects were created (and I'm not quite sure what ACTIVE / USE in slabtop mean). I think Lustre has released all of the objects; it's the slab cache which is holding them, and it depends on the kernel to decide when to free them to reclaim memory.

We can see all the client's slabs have been freed; the remaining ones are all for the servers. One thing that confused me is that after dbench finished, there was still a huge number of layout locks cached:

00010000:00010000:1.0:1381289786.413195:0:11603:0:(ldlm_resource.c:1448:ldlm_resource_dump()) --- Resource: [0x200000400:0x1cd83:0x0].0 (ffff880019c2d340) refcount = 2
00010000:00010000:1.0:1381289786.413196:0:11603:0:(ldlm_resource.c:1451:ldlm_resource_dump()) Granted locks (in reverse order):
00010000:00010000:1.0:1381289786.413196:0:11603:0:(ldlm_resource.c:1454:ldlm_resource_dump()) ### ### ns: lustre-MDT0000-mdc-ffff8800265a9800 lock: ffff88004fdb0b80/0x68a3d67bc8651524 lrc: 1/0,0 mode: CR/CR res: [0x200000400:0x1cd83:0x0].0 bits 0x8 rrc: 2 type: IBT flags: 0x0 nid: local remote: 0x68a3d67bc865152b expref: -99 pid: 10969 timeout: 0 lvb_type: 3
00010000:00010000:1.0:1381289786.413198:0:11603:0:(ldlm_resource.c:1448:ldlm_resource_dump()) --- Resource: [0x200000400:0x11d83:0x0].0 (ffff880070efba80) refcount = 2
00010000:00010000:1.0:1381289786.413198:0:11603:0:(ldlm_resource.c:1451:ldlm_resource_dump()) Granted locks (in reverse order):
00010000:00010000:1.0:1381289786.413199:0:11603:0:(ldlm_resource.c:1454:ldlm_resource_dump()) ### ### ns: lustre-MDT0000-mdc-ffff8800265a9800 lock: ffff8800681ac380/0x68a3d67bc691ef1e lrc: 1/0,0 mode: CR/CR res: [0x200000400:0x11d83:0x0].0 bits 0x8 rrc: 2 type: IBT flags: 0x0 nid: local remote: 0x68a3d67bc691ef3a expref: -99 pid: 10969 timeout: 0 lvb_type: 3
00010000:00010000:1.0:1381289786.413200:0:11603:0:(ldlm_resource.c:1448:ldlm_resource_dump()) --- Resource: [0x200000400:0x15143:0x0].0 (ffff880046767e80) refcount = 2
00010000:00010000:1.0:1381289786.413201:0:11603:0:(ldlm_resource.c:1451:ldlm_resource_dump()) Granted locks (in reverse order):
00010000:00010000:1.0:1381289786.413201:0:11603:0:(ldlm_resource.c:1454:ldlm_resource_dump()) ### ### ns: lustre-MDT0000-mdc-ffff8800265a9800 lock: ffff8800115848c0/0x68a3d67bc71e9503 lrc: 1/0,0 mode: CR/CR res: [0x200000400:0x15143:0x0].0 bits 0x8 rrc: 2 type: IBT flags: 0x0 nid: local remote: 0x68a3d67bc71e951f expref: -99 pid: 10969 timeout: 0 lvb_type: 3

I checked the server code and see that the layout lock isn't revoked on unlink/rename (see |
| Comment by Niu Yawei (Inactive) [ 09/Oct/13 ] |
|
What surprised me is that even if I revert the change of |
| Comment by Bruno Faccini (Inactive) [ 09/Oct/13 ] |
|
Niu, thanks for all these details! My understanding of the ACTIVE/USE meaning is that they indicate the number/% of non-freed objects per slab type. Andreas, given that it is now confirmed that all of these allocations are only kept due to caching, and can be reclaimed either upon need or unconditionally (again, "echo 3 > /proc/sys/vm/drop_caches" allows freeing almost everything), could we at least downgrade this ticket's priority, or do you still consider it critical? |
| Comment by Jinshan Xiong (Inactive) [ 09/Oct/13 ] |
|
I think we need to figure out why the layout lock is still in cache even after the patch |
| Comment by Jinshan Xiong (Inactive) [ 09/Oct/13 ] |
|
I'm totally fine with the slab objects remaining in cache, because that is Linux system behavior and does no harm. |
| Comment by Andreas Dilger [ 09/Oct/13 ] |
|
I don't think that the slab objects are just "in the cache"; I think they are actively being referenced by some part of Lustre (e.g. the lu cache or dentry cache or similar). If the slabs were just being kept around by the kernel, the OBJS number would be high, but the ACTIVE number would be low (excluding some very small number of objects in a per-CPU cache). As was reported in

It doesn't make any sense to cache locks or objects for files that have been deleted or dentries that are no longer in memory. Even if that memory is eventually freed, there is a real impact to applications and Lustre itself, because large amounts of memory can be wasted that could be better used by something else. |
| Comment by Andreas Dilger [ 09/Oct/13 ] |
|
Regarding the MDS_INODEBITS_LAYOUT locks, the first question to ask is why there are separate LAYOUT locks in the first place? Unless there are HSM/migration operations on the file, the LAYOUT bit should always be granted along with LOOKUP/UPDATE to avoid extra RPCs being sent. That would also ensure that the LAYOUT lock would be cancelled along with the file being deleted, unless there was still IO happening on the file.

Secondly, the dcache cleanup for deleted files needs to be fixed again, since deleting the dentries will also delete the locks. This was disabled in http://review.whamcloud.com/1874 but is directly contributing to this problem. If the dcache locking prevents us from doing the right thing anymore, there is code that was removed in commit 3698e90b9b8 (deathrow for dentries) that could be revived. |
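As a rough illustration of the first point, the ibits granted for a regular file could simply include the LAYOUT bit together with LOOKUP/UPDATE. The helper below is only a sketch (the function name and the place it would be called from are assumptions); the MDS_INODELOCK_* bit definitions are the real Lustre ones:

#include <linux/stat.h>

/* Sketch only: compute the inodebits to grant on an intent reply so that
 * regular files get the LAYOUT bit together with LOOKUP/UPDATE, avoiding
 * a separate LAYOUT enqueue RPC later and letting the layout lock be
 * cancelled together with the others when the file is deleted. */
static __u64 mdt_intent_grant_bits(umode_t mode)
{
        __u64 bits = MDS_INODELOCK_LOOKUP | MDS_INODELOCK_UPDATE;

        if (S_ISREG(mode))              /* only regular files have a layout */
                bits |= MDS_INODELOCK_LAYOUT;

        return bits;
}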
| Comment by Jinshan Xiong (Inactive) [ 09/Oct/13 ] |
|
Yes, you're right. The root cause of this issue is that inodes were not removed from cache when the files were being deleted. This is why there is such a high active percentage in lustre_inode_cache, and the XYZ_object_kmem caches are just fallout of this.
Yes, the LAYOUT lock should be granted with the UPDATE/LOOKUP lock; but if that DLM lock may be revoked because permissions and timestamps change, then the process doing the glimpse will have to enqueue a standalone LAYOUT lock before using the layout. This is why there are so many standalone LAYOUT locks. The change in ll_ddelete() is connected to

Actually, in another ticket I made a suggestion to add a hint in the blocking AST which tells the client why the lock is being canceled. My original intention was to drop the page cache only if the layout has changed (not due to false sharing of the DLM lock). I realize we can use it here also - we can drop i_nlink to zero in the ll_md_blocking_ast() function if we know the file is being unlinked. |
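A small sketch of that idea, assuming a hypothetical "unlinked" hint flag carried by the blocking AST; the flag name and the exact spot inside ll_md_blocking_ast() are assumptions, not existing code:

/* Sketch only: if the blocking AST told us the lock is being cancelled
 * because the file was unlinked (LDLM_FL_UNLINKED is a hypothetical hint
 * flag), clear i_nlink so the VFS treats the inode as deleted and frees
 * it, together with its cl_object/lov/osc objects, on the final iput(). */
static void ll_md_blocking_ast_unlink_hint(struct ldlm_lock *lock,
                                           struct inode *inode)
{
        if (!(lock->l_flags & LDLM_FL_UNLINKED))
                return;

        clear_nlink(inode);     /* i_nlink = 0: don't keep this inode cached */
}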
| Comment by Niu Yawei (Inactive) [ 10/Oct/13 ] |
ll_d_iput() will check and clear nlink as well, so I think disabling it in ll_ddelete() isn't a problem. Probably the root cause is that the layout lock wasn't canceled, so the inode nlink can't be cleared. |
| Comment by Andreas Dilger [ 10/Oct/13 ] |
|
Also, when an object is being destroyed on the OST it should be sending blocking callbacks to the clients with LDLM_FL_DISCARD_DATA, so that should be a sign that the client can immediately drop its cache without any writes. Could someone please confirm that this is actually happening? |
| Comment by Jinshan Xiong (Inactive) [ 10/Oct/13 ] |
|
Yes, it's happening on the client. On the server side, the flag LDLM_FL_DISCARD_DATA is set at revoking time:

void ldlm_add_bl_work_item(struct ldlm_lock *lock, struct ldlm_lock *new,
                           cfs_list_t *work_list)
{
        if ((lock->l_flags & LDLM_FL_AST_SENT) == 0) {
                LDLM_DEBUG(lock, "lock incompatible; sending blocking AST.");
                lock->l_flags |= LDLM_FL_AST_SENT;
                /* If the enqueuing client said so, tell the AST recipient to
                 * discard dirty data, rather than writing back. */
                if (new->l_flags & LDLM_FL_AST_DISCARD_DATA)
                        lock->l_flags |= LDLM_FL_DISCARD_DATA;

                LASSERT(cfs_list_empty(&lock->l_bl_ast));
                cfs_list_add(&lock->l_bl_ast, work_list);
                LDLM_LOCK_GET(lock);
                LASSERT(lock->l_blocking_lock == NULL);
                lock->l_blocking_lock = LDLM_LOCK_GET(new);
        }
}

On the client side, this flag will be transferred to the ldlm_lock in ldlm_callback_handler(), here it is:

        /* Copy hints/flags (e.g. LDLM_FL_DISCARD_DATA) from AST. */
        lock_res_and_lock(lock);
        lock->l_flags |= ldlm_flags_from_wire(dlm_req->lock_flags &
                                              LDLM_AST_FLAGS);

When the lock is canceled on the client side, it takes this into account in osc_lock_cancel():

        if (dlmlock != NULL) {
                int do_cancel;

                discard = !!(dlmlock->l_flags & LDLM_FL_DISCARD_DATA);
                if (olck->ols_state >= OLS_GRANTED)
                        result = osc_lock_flush(olck, discard);

And then in osc_cache_writeback_range() it will discard all of the pages. Did you find any problems in your experiment? |
| Comment by Jinshan Xiong (Inactive) [ 11/Oct/13 ] |
|
Hi Niu, are you working on this? If not, I can start to work on it. |
| Comment by Niu Yawei (Inactive) [ 11/Oct/13 ] |
|
Xiong, yes, I was trying to find out why the layout lock wasn't canceled, but I haven't found the root cause so far. I'd be glad if you could take this over if you have time. |
| Comment by Andreas Dilger [ 11/Oct/13 ] |
|
Jinshan, I haven't checked whether LDLM_FL_DISCARD_DATA is actually working. Even though it works in theory, it hasn't been tested in a long time because sanity.sh test_42b has been disabled forever because of race conditions in the test (sometimes page writeback can happen during the test). It would be nice if we had a more robust test case for this, or maybe just change it to be error_ignore? |
| Comment by Jinshan Xiong (Inactive) [ 11/Oct/13 ] |
|
Actually 42b is easy to fix now that osc_extent is implemented, because osc extents won't be allowed to flush unless one system-call write ends or the extent has collected enough pages. We can create a fail_loc to delay the write path and unlink the file in the meantime. Niu, can you please work this out while I'm looking at the layout lock issue? |
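For illustration, the kernel side of such a fail_loc could look roughly like the sketch below. OBD_FAIL_TIMEOUT() is the existing Lustre fault-injection macro, but the fail_loc name/value and its placement in the OSC write path are assumptions, not part of any landed patch:

/* Hypothetical fail_loc value - a free slot in obd_support.h would be
 * needed for a real patch. */
#define OBD_FAIL_OSC_DELAY_IO           0x414

static void osc_maybe_delay_io(void)
{
        /* Sleep a few seconds when the fail_loc is armed, giving the test
         * time to unlink the file while dirty pages are still queued, so
         * the LDLM_FL_DISCARD_DATA path is exercised deterministically. */
        OBD_FAIL_TIMEOUT(OBD_FAIL_OSC_DELAY_IO, 5);
}

The test script would then arm it (e.g. "lctl set_param fail_loc=0x414"), start the write, unlink the file, and clear the fail_loc afterwards.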
| Comment by Jinshan Xiong (Inactive) [ 11/Oct/13 ] |
|
From these two lines:
The locks are actually not canceled from the MDT side. I will drill down to find the source. |
| Comment by Jinshan Xiong (Inactive) [ 11/Oct/13 ] |
|
These are open-unlinked files. After a file is unlinked, file IO will cause a layout lock to be created, and those locks will never be destroyed. We can fix this issue by acquiring FULL ibits locks at the last close of unlinked files. |
| Comment by Niu Yawei (Inactive) [ 14/Oct/13 ] |
An open-unlinked file is a problem, but I don't think there are that many open-unlinked files in the dbench test. The major problem looks like this: the client issues an unlink, the object on the MDT is unlinked, but the objects on the OSTs are kept around for a short time (until the OSP syncs the unlink record), so dirty data is still cached on the client and nlink isn't cleared (the extent lock is cached). If a data flush (ll_writepages()) is triggered before the extent lock is revoked, the client will acquire the layout lock again, and this layout lock will never be canceled. |
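A minimal sketch of the client-side guard this reasoning points at: skip re-fetching the layout for an inode that has already been unlinked. The helper name, its call site, and the use of i_nlink as the indicator are assumptions; the fix actually pursued in the following comments took a different shape:

/* Sketch only: before enqueuing a new LAYOUT lock for a flush, bail out
 * if the inode has already been unlinked - acquiring a fresh layout lock
 * for a deleted file just pins the inode (and its CLIO objects) in cache
 * with a lock that nothing will ever cancel. */
static bool ll_layout_refresh_allowed(struct inode *inode)
{
        if (S_ISREG(inode->i_mode) && inode->i_nlink == 0)
                return false;

        return true;
}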
| Comment by Jinshan Xiong (Inactive) [ 14/Oct/13 ] |
|
Here is the patch: http://review.whamcloud.com/7942; however, the fix is not complete. Niu, yes, that is correct. I discovered that problem also and worked out a patch to fix it, but it still has a few problems - if a file is cached on one client which is then deleted from another client, there is no way of taking this file out of cache. We need to start a dedicated kernel thread for this purpose. If you're still working on this, please take it over to avoid duplicate work. |
| Comment by Niu Yawei (Inactive) [ 15/Oct/13 ] |
Actually, I found the problem when I was trying to restore sanity test_42b. I'm fine with taking this over, but I need some time to think about how to fix it. |
| Comment by Niu Yawei (Inactive) [ 16/Oct/13 ] |
|
Apart from the problem of how to clean up the inode cache, the extra layout lock on dirty flush reminds me that the fix of |
| Comment by Jinshan Xiong (Inactive) [ 16/Oct/13 ] |
|
Hi Niu, with my patch at 7942, which does not grant the layout lock if the file no longer exists on the MDT, it doesn't have the layout lock problem any more. Let me know if I missed something, thanks. |
| Comment by Niu Yawei (Inactive) [ 17/Oct/13 ] |
I think that's another problem that should be fixed (it's actually a regression; we did check whether the object exists when we were using the getattr function to handle the layout intent). The reason I want to revert the fix of
|
| Comment by Andreas Dilger [ 26/Feb/14 ] |
|
The patch http://review.whamcloud.com/9223 for

I think this needs to be retested once 9223 lands to see what the current state of affairs is. |
| Comment by Cory Spitz [ 27/Mar/14 ] |
|
Any update since change #9223 landed? |
| Comment by Niu Yawei (Inactive) [ 28/Jul/14 ] |
|
I split Jinshan's patch: http://review.whamcloud.com/11243; this one could solve the problem described in this ticket (the layout lock was fetched by the client after the file was unlinked, which results in the inode cache for the unlinked file not being purged). |
| Comment by parinay v kondekar (Inactive) [ 28/Apr/15 ] |
|
Hello Niu, Jinshan, I ported the patch http://review.whamcloud.com/#/c/7942/ to 2.5.1. I thought it could help fix the problem reported in

I read the comments here and the review comments on the patch, and they seem to suggest the patch is incomplete, especially the following:

>> "I discovered that problem also and worked out a patch to fix it, but it still has a few problems - if a file is cached on one client which is then deleted from another client, there is no way of taking this file out of cache. We need to start a dedicated kernel thread for this purpose."

Any help/guidance? Thanks |