[LU-11582] LBUG: ASSERTION( inode->i_data.nrpages == 0 ) failed Created: 29/Oct/18  Updated: 05/Nov/19  Resolved: 27/Nov/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.5
Fix Version/s: Lustre 2.12.0, Lustre 2.10.7

Type: Bug Priority: Critical
Reporter: Stephane Thiell Assignee: Zhenyu Xu
Resolution: Fixed Votes: 0
Labels: None
Environment:

Client: CentOS 7.5 Lustre 2.10.5
Server (Oak) CentOS 7.4 Lustre 2.10.4


Issue Links:
Duplicate
Related
is related to LU-118 clear_inode: BUG_ON(inode->i_data.nrp... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Hi,
We have a type of job that keeps crashing Lustre client version 2.10.5 with the following trace. It is very likely that this job has files open on Oak (Lustre 2.10.4). This looks like old tickets LU-1414 and LU-118... fixed in Lustre 1.8! The issue happens on a bigmem node (1.5TB of RAM) and doesn't seem to happen on nodes with less memory. I'll try to upload a crash dump file to your ftp.

[11497.465606] LustreError: 132407:0:(llite_lib.c:2047:ll_delete_inode()) ASSERTION( inode->i_data.nrpages == 0 ) failed: inode=[0x200018e83:0x1ba2c:0x0](ffff8aa85a298510) nrpages=1, see LU-118
[11497.487939] LustreError: 132407:0:(llite_lib.c:2047:ll_delete_inode()) LBUG
[11497.495730] Pid: 132407, comm: spades 3.10.0-862.14.4.el7.x86_64 #1 SMP Wed Sep 26 15:12:11 UTC 2018
[11497.505939] Call Trace:
[11497.508685]  [<ffffffffc09947cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[11497.516009]  [<ffffffffc099487c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[11497.522955]  [<ffffffffc0f25c87>] ll_delete_inode+0x1b7/0x1c0 [lustre]
[11497.530291]  [<ffffffff8d43c504>] evict+0xb4/0x180
[11497.535663]  [<ffffffff8d43ce0c>] iput+0xfc/0x190
[11497.540940]  [<ffffffff8d43126e>] do_unlinkat+0x1ae/0x2d0
[11497.546990]  [<ffffffff8d432326>] SyS_unlink+0x16/0x20
[11497.552753]  [<ffffffff8d92579b>] system_call_fastpath+0x22/0x27
[11497.559484]  [<ffffffffffffffff>] 0xffffffffffffffff
[11497.565069] Kernel panic - not syncing: LBUG
[11497.569837] CPU: 7 PID: 132407 Comm: spades Kdump: loaded Tainted: G           OE  ------------   3.10.0-862.14.4.el7.x86_64 #1
[11497.582928] Hardware name: Dell Inc. PowerEdge R630/02C2CP, BIOS 2.8.0 005/17/2018
[11497.591379] Call Trace:
[11497.594105]  [<ffffffff8d913754>] dump_stack+0x19/0x1b
[11497.599845]  [<ffffffff8d90d29f>] panic+0xe8/0x21f
[11497.605211]  [<ffffffffc09948cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
[11497.612131]  [<ffffffffc0f25c87>] ll_delete_inode+0x1b7/0x1c0 [lustre]
[11497.619421]  [<ffffffff8d43c504>] evict+0xb4/0x180
[11497.624775]  [<ffffffff8d43ce0c>] iput+0xfc/0x190
[11497.630033]  [<ffffffff8d43126e>] do_unlinkat+0x1ae/0x2d0
[11497.636064]  [<ffffffff8d42175e>] ? ____fput+0xe/0x10
[11497.641709]  [<ffffffff8d2bab90>] ? task_work_run+0xc0/0xe0
[11497.647935]  [<ffffffff8d432326>] SyS_unlink+0x16/0x20
[11497.653679]  [<ffffffff8d92579b>] system_call_fastpath+0x22/0x27

Thanks,
Stephane



 Comments   
Comment by Stephane Thiell [ 29/Oct/18 ]

vmcore uploaded to your ftp server, the file is vmcore-sh-112-03-2018-10-26-22-17-26_LU-11582

kernel used is CentOS 7 3.10.0-862.14.4.el7.x86_64 ( http://debuginfo.centos.org/7/x86_64/ )

Thanks!

Comment by Peter Jones [ 30/Oct/18 ]

Bobijam

Could you please advise?

Thanks

Peter

Comment by Zhenyu Xu [ 11/Nov/18 ]

I think the assertion reads nrpages without the mapping->tree_lock protection it is supposed to have. truncate_inode_pages() also traverses the mapping's radix tree without tree_lock, so it can miss a page that __remove_mapping() is in the middle of removing from the radix tree.

truncate_inode_pages_final()
        nrpages = mapping->nrpages;
        smp_rmb();
        nrexceptional = mapping->nrexceptional;

        if (nrpages || nrexceptional) {
                /*
                 * As truncation uses a lockless tree lookup, cycle
                 * the tree lock to make sure any ongoing tree
                 * modification that does not see AS_EXITING is
                 * completed before starting the final truncate.
                 */
                spin_lock_irq(&mapping->tree_lock);
                spin_unlock_irq(&mapping->tree_lock);
                // race window: __remove_mapping() has removed the page from the radix tree,
                // but nrpages has not been decremented yet.
                truncate_inode_pages(mapping, 0);
        }       

And I think our truncate_inode_pages_final() in lustre/include/lustre_compat.h makes the calls in the right order:

#ifndef HAVE_TRUNCATE_INODE_PAGES_FINAL
static inline void truncate_inode_pages_final(struct address_space *map)
{
        truncate_inode_pages(map, 0);
        /* Workaround for LU-118 */
        if (map->nrpages) {
                /* cycling tree_lock waits for any in-flight removal, which avoids the race */
                spin_lock_irq(&map->tree_lock);
                spin_unlock_irq(&map->tree_lock);
        }       /* Workaround end */
}
#endif

I think the fix could be to take tree_lock when checking nrpages in ll_delete_inode(), or simply to delete the assertion.
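
The first option above (re-reading nrpages under tree_lock) can be illustrated with a small userspace analogy. This is only a sketch under stated assumptions, not the patch that was merged: a pthread mutex stands in for mapping->tree_lock, and every mock_* name is hypothetical and does not exist in Lustre or the kernel. It shows that a reader holding the lock can never observe the intermediate state in which the page has already left the tree but nrpages has not yet been decremented.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical userspace stand-in for struct address_space. */
struct mock_mapping {
        pthread_mutex_t tree_lock;      /* plays the role of mapping->tree_lock */
        unsigned long nrpages;          /* plays the role of mapping->nrpages */
        int page_in_tree;               /* 1 while the page is still in the "radix tree" */
};

/*
 * Mimics the page removal path: the page leaves the tree first and
 * nrpages is decremented afterwards, both while holding tree_lock.
 */
static void *mock_remove_mapping(void *arg)
{
        struct mock_mapping *m = arg;

        pthread_mutex_lock(&m->tree_lock);
        assert(m->nrpages > 0);
        m->page_in_tree = 0;    /* a lockless tree lookup would now miss the page */
        m->nrpages--;           /* the counter only catches up here */
        pthread_mutex_unlock(&m->tree_lock);
        return NULL;
}

/*
 * Reads nrpages under tree_lock. Any removal in flight has either not
 * started or has fully completed by the time the lock is held, so the
 * counter and the tree state always agree.
 */
static unsigned long mock_nrpages_locked(struct mock_mapping *m, int *in_tree)
{
        unsigned long n;

        pthread_mutex_lock(&m->tree_lock);
        n = m->nrpages;
        *in_tree = m->page_in_tree;
        pthread_mutex_unlock(&m->tree_lock);
        return n;
}
```

An assertion placed after such a locked read would only fire if a page were genuinely left behind, rather than on the transient window described above.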

Comment by Gerrit Updater [ 11/Nov/18 ]

Bobi Jam (bobijam@hotmail.com) uploaded a new patch: https://review.whamcloud.com/33639
Subject: LU-11582 llite: protect reading inode->i_data.nrpages
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c9fb1cae878d4c3df24173e8b3c6b436fe983533

Comment by Gerrit Updater [ 18/Nov/18 ]

Bobi Jam (bobijam@hotmail.com) uploaded a new patch: https://review.whamcloud.com/33681
Subject: LU-11582 llite: protect reading inode->i_data.nrpages
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: 35396125e071ad1457257348be38721bc9ffdad5

Comment by Gerrit Updater [ 27/Nov/18 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33639/
Subject: LU-11582 llite: protect reading inode->i_data.nrpages
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 04c172b686763be0d42eb4c36532d5795166eb7c

Comment by Peter Jones [ 27/Nov/18 ]

Landed for 2.12

Comment by Gerrit Updater [ 05/Jan/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33681/
Subject: LU-11582 llite: protect reading inode->i_data.nrpages
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: d936fdca9e00e05cecd4b21c6e4bbbf7107dc9b4
