[LU-620] "Bad page state" reported after unlink Created: 23/Aug/11 Updated: 09/May/12 Resolved: 16/Mar/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.0, Lustre 2.1.2, Lustre 1.8.6 |
| Fix Version/s: | Lustre 2.2.0, Lustre 2.1.2, Lustre 1.8.8 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Mark Hills | Assignee: | Zhenyu Xu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Client: Lustre b1_8 Git 999530e, Linux 2.6.32.8 |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 4860 |
| Description |
|
I have a reproducible test case of a page bug, which is clearly reported on a kernel with additional debugging enabled.
$ dd if=/dev/zero of=/net/lustre/file bs=4096 count=1
The bug occurs on unlink() of a file shortly after it was written to. If there is a delay of a few seconds before the rm, all is okay. Truncate works, but a subsequent unlink (rm) can still fail if it is quick enough.
It appears that this bug could be the cause of some kind of misaccounting in the kernel's page cache, which causes lockups when the task is running in a cgroup. Originally I brought this up in a mailing list thread: http://lists.lustre.org/pipermail/lustre-devel/2011-July/003865.html
Here's a full example, taken today on the attached kernel config. The process is not running in a cgroup, although the kernel is built with cgroup support.
BUG: Bad page state in process rm pfn:77813 |
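The write-then-immediate-unlink sequence described above can be sketched as a small script. This is only an illustration of the timing pattern, not the reporter's actual test: the function name `write_then_unlink`, the `REPRO_DIR` environment variable and the file name are hypothetical, and the script defaults to a scratch directory, so it only exercises the Lustre client if pointed at a Lustre mount.

```python
import os
import tempfile


def write_then_unlink(path, size=4096):
    """Write `size` zero bytes to `path`, then unlink it with no delay.

    Roughly mirrors `dd if=/dev/zero of=<path> bs=4096 count=1`
    followed by a quick rm. Returns the number of bytes written.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        written = os.write(fd, b"\0" * size)
    finally:
        os.close(fd)
    # No sleep here: the report says a pause of a few seconds hides the bug.
    os.unlink(path)
    return written


if __name__ == "__main__":
    # REPRO_DIR is a hypothetical variable; set it to a directory on a
    # Lustre mount to run the sequence against the client under test.
    target_dir = os.environ.get("REPRO_DIR", tempfile.gettempdir())
    target = os.path.join(target_dir, "lu620-repro-file")
    print(write_then_unlink(target), "bytes written and unlinked")
```

Any "Bad page state" message would show up in the kernel log rather than in the script's output, so the log has to be checked separately after each run.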
| Comments |
| Comment by Mark Hills [ 23/Aug/11 ] |
|
I found the source of this problem: an out-of-date copied function in lustre_patchless_compat.h. truncate_complete_page needs to handle cgroup accounting appropriately, and the copy, with its own ll_remove_from_page_cache, does not. A call to mem_cgroup_uncharge_cache_page is needed, but it is not exported, nor does it seem easy or sensible to copy it into the Lustre tree. Looks like this has broken the back of the compatibility layer for truncate_complete_page? For now I have exported truncate_complete_page from the kernel; in an initial test this seemed to fix the problem, and cgroup began working reliably. |
| Comment by Peter Jones [ 20/Sep/11 ] |
|
Bobijam, could you please look into this report? Thanks, Peter |
| Comment by Zhenyu Xu [ 21/Sep/11 ] |
|
Mark, it looks like the 2.6.32.8 kernel exports delete_from_page_cache in mm/filemap.c (can you confirm that?), which can do what ll_remove_from_page_cache and page_cache_release do. For a new patchless client, we cannot patch the client kernel code to export truncate_complete_page, but we can leverage the already exported delete_from_page_cache to do the same job (uncharge the cgroup accounting for the page). However, 2.6.38.8 exports another cgroup-aware function, remove_from_page_cache, which does what ll_remove_from_page_cache does, while the rhel6 kernel does not export this function. It seems the kernel has not settled down in this area, which makes patchless client support difficult. |
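Since the discussion turns on which page cache helpers a given kernel exports, one way to check a particular build is to look for the symbol names in its Module.symvers file. The following is a minimal sketch, assuming the whitespace-separated layout of CRC, symbol, module, export type; the function name `exported_symbols` is hypothetical and the sample lines are made up, not taken from any real kernel.

```python
# Helpers named in this ticket that the compat layer might rely on.
CANDIDATES = (
    "truncate_complete_page",
    "delete_from_page_cache",
    "remove_from_page_cache",
)


def exported_symbols(symvers_text, candidates=CANDIDATES):
    """Return the candidates listed as symbol names in Module.symvers text."""
    found = set()
    for line in symvers_text.splitlines():
        fields = line.split()
        # Assumed layout: CRC, symbol, module, export type.
        if len(fields) >= 2 and fields[1] in candidates:
            found.add(fields[1])
    # Preserve the order of `candidates` in the result.
    return [name for name in candidates if name in found]


# Made-up sample lines for illustration only.
SAMPLE = (
    "0x00000000\tremove_from_page_cache\tvmlinux\tEXPORT_SYMBOL\n"
    "0x00000000\tprintk\tvmlinux\tEXPORT_SYMBOL\n"
)

if __name__ == "__main__":
    print(exported_symbols(SAMPLE))
```

In practice the text would be read from the Module.symvers of the kernel build the client is compiled against; an empty result would mean none of the three helpers is available to a patchless client on that kernel.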
| Comment by Zhenyu Xu [ 21/Sep/11 ] |
|
patch tracking at http://review.whamcloud.com/1399 |
| Comment by Build Master (Inactive) [ 03/Nov/11 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Zhenyu Xu [ 04/Nov/11 ] |
|
b1_8 patch tracking at http://review.whamcloud.com/1649 |
| Comment by Build Master (Inactive) [ 04/Jan/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Peter Jones [ 16/Jan/12 ] |
|
Landed for 2.2 |
| Comment by Peter Jones [ 29/Feb/12 ] |
|
Bobi, could you please port this patch to b2_1? Thanks, Peter |
| Comment by Zhenyu Xu [ 29/Feb/12 ] |
|
b2_1 patch tracking at http://review.whamcloud.com/2230 |
| Comment by Mark Hills [ 11/Apr/12 ] |
|
We are testing this with kernel 2.6.32-220.4.1.el6.x86_64 and Whamcloud b1_8 HEAD (Git 18aafe97). The patch does not fix the bug, seemingly because this kernel does not export any of truncate_complete_page. For now, I need to continue to use my initial patch, which exports truncate_complete_page from the kernel. |