Details
- Type: Bug
- Resolution: Fixed
- Priority: Blocker
- Affects Version/s: Lustre 2.1.0, Lustre 2.1.2, Lustre 1.8.6
- Fix Version/s: None
- Environment: Client: Lustre b1_8 Git 999530e, Linux 2.6.32.8
- Severity: 3
- Rank: 4860
Description
I have a reproducible test case of a page bug, which is clearly reported on a kernel with additional debugging enabled.
$ dd if=/dev/zero of=/net/lustre/file bs=4096 count=1
$ rm /net/lustre/file
BUG: Bad page state in process rm pfn:21fe6a
page:ffffea00076fa730 flags:800000000000000c count:0 mapcount:0 mapping:(null) index:1
The bug occurs on unlink() of a file shortly after it was written to.
If there is a delay of a few seconds before the rm, all is okay. Truncate works, but a subsequent rm (unlink) can fail if it is issued quickly enough.
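The write-then-immediate-unlink sequence above can be looped to make the race easier to hit. A minimal sketch, assuming a Lustre client mount whose path you pass in (the path and the iteration count are illustrative; on a non-Lustre filesystem the loop runs harmlessly and will not trigger the bug):

```shell
# Loop the dd-then-immediate-rm sequence from the report. The key point
# is that the unlink follows the write with no intervening delay.
repro() {
    dir=$1                      # target directory (e.g. a Lustre mount)
    n=${2:-100}                 # number of iterations, default 100
    i=0
    while [ "$i" -lt "$n" ]; do
        # Write one 4 KiB page, then unlink immediately.
        dd if=/dev/zero of="$dir/file" bs=4096 count=1 2>/dev/null
        rm "$dir/file"
        i=$((i + 1))
    done
    echo "done: $n iterations"
}

# Example: repro /net/lustre 100
# On a debug kernel, check dmesg for "Bad page state" afterwards.
```

Inserting even a short sleep between the dd and the rm should make the loop pass cleanly, matching the few-seconds-of-delay observation above.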
It appears that this bug could cause misaccounting of the kernel's page cache, which in turn causes lockups when the task is running in a cgroup. I originally brought this up in a mailing list thread:
http://lists.lustre.org/pipermail/lustre-devel/2011-July/003865.html
http://lists.lustre.org/pipermail/lustre-devel/2011-August/003876.html
Here's a full example, captured today with the attached kernel config. The process is not running in a cgroup, although the kernel is built with cgroup support.
BUG: Bad page state in process rm pfn:77813
page:ffffea0002914688 flags:400000000000000c count:0 mapcount:0 mapping:(null) index:0
Pid: 1173, comm: rm Not tainted 2.6.32.28-ml #8
Call Trace:
[<ffffffff81094ab2>] bad_page+0xd2/0x130
[<ffffffff810c4c39>] ? lookup_page_cgroup_used+0x9/0x20
[<ffffffff810978ea>] free_hot_cold_page+0x6a/0x2d0
[<ffffffff81097bab>] free_hot_page+0xb/0x10
[<ffffffff8109a65a>] put_page+0xea/0x140
[<ffffffffa04fc5c7>] ll_page_removal_cb+0x207/0x510 [lustre]
[<ffffffffa041207b>] cache_remove_lock+0x1ab/0x29c [osc]
[<ffffffffa03fafad>] osc_extent_blocking_cb+0x25d/0x2e0 [osc]
[<ffffffff8137bcf6>] ? _spin_unlock+0x26/0x30
[<ffffffffa02db058>] ? unlock_res_and_lock+0x58/0x100 [ptlrpc]
[<ffffffffa02df630>] ldlm_cancel_callback+0x60/0xf0 [ptlrpc]
[<ffffffffa02f877c>] ldlm_cli_cancel_local+0x6c/0x350 [ptlrpc]
[<ffffffffa02fa960>] ldlm_cancel_list+0xf0/0x240 [ptlrpc]
[<ffffffffa02fac67>] ldlm_cancel_resource_local+0x1b7/0x2d0 [ptlrpc]
[<ffffffff81070f99>] ? is_module_address+0x9/0x20
[<ffffffffa03fcb57>] osc_destroy+0x107/0x730 [osc]
[<ffffffffa04b6a65>] ? lov_prep_destroy_set+0x285/0x970 [lov]
[<ffffffffa04a07c8>] lov_destroy+0x568/0xf20 [lov]
[<ffffffffa05355e3>] ll_objects_destroy+0x4e3/0x18c0 [lustre]
[<ffffffffa046d099>] ? mdc_reint+0xd9/0x270 [mdc]
[<ffffffffa0537098>] ll_unlink_generic+0x298/0x360 [lustre]
[<ffffffff8137a65f>] ? __mutex_lock_common+0x27f/0x3b0
[<ffffffff810d3c7e>] ? vfs_unlink+0x5e/0xd0
[<ffffffffa01a65c9>] ? cfs_free+0x9/0x10 [libcfs]
[<ffffffffa053716d>] ll_unlink+0xd/0x10 [lustre]
[<ffffffff810d3cad>] vfs_unlink+0x8d/0xd0
[<ffffffff810d6245>] ? lookup_hash+0x35/0x50
[<ffffffff810d7613>] do_unlinkat+0x183/0x1c0
[<ffffffff8137b828>] ? lockdep_sys_exit_thunk+0x35/0x67
[<ffffffff8137b7b2>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[<ffffffff810d77ad>] sys_unlinkat+0x1d/0x40
[<ffffffff8100b3c2>] system_call_fastpath+0x16/0x1b
Attachments
Issue Links
- Trackbacks:
  - Lustre 1.8.x known issues tracker
  - Changelog 1.8 (changes from version 1.8.7-wc1 to 1.8.8-wc1)
  - Changelog 2.1 (changes from version 2.1.1 to 2.1.2)
  - Changelog 2.2 (version 2.2.0)