[LU-8355] VFS: Busy inodes after unmount of md0 ... causes kernel panic or at least memory leak Created: 30/Jun/16  Updated: 01/Jul/16

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Sergey Cheremencev Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

If we try to start an MDT on a read-only device, or on a device with the journal disabled, we get the following message:

VFS: Busy inodes after unmount of md0. Self-destruct in 5 seconds.  Have a nice day... 

In this case one inode's reference counter has an extra increment, which can later cause a kernel panic or at least a memory leak.
According to my investigation this inode is s_buddy_cache (created in ldiskfs_mb_init_backend).
I used a kprobe (adding handler_pre to __iget) and found that the extra increment comes from fsnotify_unmount_inodes:

kprobe handler_pre __iget: inode ffff880079a09528 i_ino 21328 i_count 2 i_state 8
Pid: 12211, comm: mount.lustre Tainted: G        W  ---------------    2.6.32-431.17.1.x1.6.39.x86_64 #1
Call Trace:
 <#DB>  [<ffffffffa0552071>] ? handler_pre+0x41/0x44 [kprobe_example]
 [<ffffffff8152b455>] ? kprobe_exceptions_notify+0x3d5/0x430
 [<ffffffff8152b6c5>] ? notifier_call_chain+0x55/0x80
 [<ffffffff8152b72a>] ? atomic_notifier_call_chain+0x1a/0x20
 [<ffffffff810a12be>] ? notify_die+0x2e/0x30
 [<ffffffff81528ff5>] ? do_int3+0x35/0xb0
 [<ffffffff815288c3>] ? int3+0x33/0x40
 [<ffffffff811a4f01>] ? __iget+0x1/0x70
 <<EOE>>  [<ffffffff811ccc7f>] ? fsnotify_unmount_inodes+0x10f/0x120
 [<ffffffff811a620b>] ? invalidate_inodes+0x5b/0x190
 [<ffffffff811c5a14>] ? __sync_blockdev+0x24/0x50
 [<ffffffff8118b34c>] ? generic_shutdown_super+0x4c/0xe0
 [<ffffffff8118b411>] ? kill_block_super+0x31/0x50
 [<ffffffffa058f686>] ? ldiskfs_kill_block_super+0x16/0x60 [ldiskfs]
 [<ffffffff8118bbe7>] ? deactivate_super+0x57/0x80
 [<ffffffff811aabef>] ? mntput_no_expire+0xbf/0x110
 [<ffffffffa09f6515>] ? osd_mount+0x6f5/0xcb0 [osd_ldiskfs]
 [<ffffffffa09f93ff>] ? osd_device_alloc+0x4cf/0x970 [osd_ldiskfs]
 [<ffffffffa11fb33f>] ? obd_setup+0x1bf/0x290 [obdclass]
 [<ffffffffa11fb618>] ? class_setup+0x208/0x870 [obdclass]
 [<ffffffffa1203edc>] ? class_process_config+0xc6c/0x1ad0 [obdclass]
 [<ffffffffa10d54d8>] ? libcfs_log_return+0x28/0x40 [libcfs]
 [<ffffffffa1208dc2>] ? lustre_cfg_new+0x312/0x690 [obdclass]
 [<ffffffffa1209298>] ? do_lcfg+0x158/0x440 [obdclass]
 [<ffffffffa1209614>] ? lustre_start_simple+0x94/0x200 [obdclass]
 [<ffffffffa124503d>] ? server_fill_super+0x97d/0x1a7c [obdclass]
 [<ffffffffa10d54d8>] ? libcfs_log_return+0x28/0x40 [libcfs]
 [<ffffffffa120f0b0>] ? lustre_fill_super+0x1d0/0x5b0 [obdclass]
 [<ffffffffa120eee0>] ? lustre_fill_super+0x0/0x5b0 [obdclass]
 [<ffffffff8118c2af>] ? get_sb_nodev+0x5f/0xa0
 [<ffffffffa1206dd5>] ? lustre_get_sb+0x25/0x30 [obdclass]
 [<ffffffff8118b90b>] ? vfs_kern_mount+0x7b/0x1b0
 [<ffffffff8118bab2>] ? do_kern_mount+0x52/0x130
 [<ffffffff811aca8b>] ? do_mount+0x2fb/0x930
 [<ffffffff811ad150>] ? sys_mount+0x90/0xe0
 [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
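The probe that produced the trace above might look roughly like the following. This is a hypothetical reconstruction modeled on the kernel's samples/kprobes/kprobe_example.c, not the actual module used; reading the inode argument out of regs->di is an x86_64 calling-convention assumption:

```c
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/kprobes.h>
#include <linux/fs.h>

static struct kprobe kp = {
	.symbol_name = "__iget",
};

/* Runs just before __iget(); on x86_64 the first argument (the
 * inode) arrives in %rdi, i.e. regs->di. */
static int handler_pre(struct kprobe *p, struct pt_regs *regs)
{
	struct inode *inode = (struct inode *)regs->di;

	pr_info("kprobe handler_pre __iget: inode %p i_ino %lu "
		"i_count %d i_state %lu\n",
		inode, inode->i_ino,
		atomic_read(&inode->i_count), inode->i_state);
	dump_stack();	/* emits a call trace like the one above */
	return 0;
}

static int __init kprobe_init(void)
{
	kp.pre_handler = handler_pre;
	return register_kprobe(&kp);
}

static void __exit kprobe_exit(void)
{
	unregister_kprobe(&kp);
}

module_init(kprobe_init);
module_exit(kprobe_exit);
MODULE_LICENSE("GPL");
```

Filtering the output on a specific i_ino (here 21328) makes it easy to see which caller takes the reference that is never dropped.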

Reading /proc/slabinfo after unloading ldiskfs may cause a kernel panic because ldiskfs_inode_cache was not cleaned up correctly (it still holds busy inodes).
Both fsnotify_unmount_inodes and inotify_unmount_inodes mishandle inodes in the I_NEW state:

		/* In case the dropping of a reference would nuke next_i. */
		if ((&next_i->i_sb_list != list) &&
		    atomic_read(&next_i->i_count) &&
		    !(next_i->i_state & (I_CLEAR | I_FREEING | I_WILL_FREE))) {
			__iget(next_i);
			need_iput = next_i;
		}

We do __iget() on next_i, but on the next loop iteration we skip the iput() because the inode is in the I_NEW state:

		/*
		 * We cannot __iget() an inode in state I_CLEAR, I_FREEING,
		 * I_WILL_FREE, or I_NEW which is fine because by that point
		 * the inode cannot have any associated watches.
		 */
		if (inode->i_state & (I_CLEAR|I_FREEING|I_WILL_FREE|I_NEW))
			continue;

The problem was observed on 2.6.32-431.17.1, but as far as I can see other kernels have this bug as well.



 Comments   
Comment by Sergey Cheremencev [ 30/Jun/16 ]
delete extra iget in fsnotify_unmount_inodes 

Don't increment next_i's counter in fsnotify_unmount_inodes 
if the inode has the I_NEW state. The next loop iteration is 
skipped because of I_NEW, so iput(need_iput_tmp) is not called. 
And if the I_NEW inode is last in the s_inodes list, the extra 
increment will never be decremented. 
The problem occurred when mounting a Lustre FS (server side) 
on an RO device (in this case the mount should fail). 
The mount fails after creating 4 inodes. One of them is 
s_buddy_cache, which has the I_NEW state. This inode is not 
freed after generic_shutdown_super, which causes the message: 
VFS: Busy inodes after unmount of md0. Self-destruct ... 
In this case ldiskfs_inode_cache cannot be cleared, and 
"cat /proc/slabinfo" after unloading the ldiskfs module may 
cause a kernel panic.

--- fs/notify/inotify/inotify.c	2013-07-30 04:16:44.000000000 +0400
+++ inotify.c	2014-12-16 17:36:23.000000000 +0400
@@ -404,7 +404,7 @@
 		if ((&next_i->i_sb_list != list) &&
 				atomic_read(&next_i->i_count) &&
 				!(next_i->i_state & (I_CLEAR | I_FREEING |
-					I_WILL_FREE))) {
+					I_WILL_FREE | I_NEW))) {
 			__iget(next_i);
 			need_iput = next_i;
 		}
--- fs/notify/inode_mark.c	2013-07-30 04:16:42.000000000 +0400
+++ inode_mark.c	2014-12-16 17:36:40.000000000 +0400
@@ -398,7 +398,7 @@
 		/* In case the dropping of a reference would nuke next_i. */
 		if ((&next_i->i_sb_list != list) &&
 		    atomic_read(&next_i->i_count) &&
-		    !(next_i->i_state & (I_CLEAR | I_FREEING | I_WILL_FREE))) {
+		    !(next_i->i_state & (I_CLEAR | I_FREEING | I_WILL_FREE | I_NEW))) {
 			__iget(next_i);
 			need_iput = next_i;
 		}
Comment by Evan D. Chen (Inactive) [ 01/Jul/16 ]

Per triage call, this seems to be a corner case; decreasing priority from "Major" to "Minor".

Generated at Sat Feb 10 02:16:49 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.