Type: Bug
Resolution: Fixed
Priority: Blocker
Fix Version/s: Lustre 2.2.0, Lustre 2.1.2, Lustre 1.8.8
Affects Version/s: Lustre 2.1.0, Lustre 2.1.2, Lustre 1.8.6
Labels:
None
Environment:
Client: Lustre b1_8 Git 999530e, Linux 2.6.32.8

Severity:
3
Rank (Obsolete):
4860

I have a reproducible test case for a bad page state bug, which is reported clearly on a kernel with additional debugging enabled.

$ dd if=/dev/zero of=/net/lustre/file bs=4096 count=1
$ rm /net/lustre/file
BUG: Bad page state in process rm pfn:21fe6a
page:ffffea00076fa730 flags:800000000000000c count:0 mapcount:0 mapping:(null) index:1

The bug occurs on unlink() of a file shortly after it was written to.

If there is a delay of a few seconds before the rm, all is okay. Truncating the file works, but a subsequent unlink (rm) can still fail if it follows quickly enough.
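The timing dependence described above can be exercised with a small script. This is only a sketch of the same write-then-unlink sequence as the dd/rm commands, not a verified reproducer: the function name write_then_unlink and its delay_seconds parameter are hypothetical, and the path /net/lustre/file is taken from the commands in this report.

```python
import os
import time


def write_then_unlink(path, delay_seconds=0.0, size=4096):
    """Write `size` zero bytes to `path`, optionally wait, then unlink it.

    Returns the number of bytes written. Hypothetical helper for
    illustration; it mirrors `dd ... bs=4096 count=1` followed by `rm`.
    """
    # Roughly: dd if=/dev/zero of=path bs=4096 count=1
    with open(path, "wb") as f:
        written = f.write(b"\0" * size)

    # Per the report, a delay of a few seconds here avoids the bug.
    if delay_seconds > 0:
        time.sleep(delay_seconds)

    # Roughly: rm path (the unlink() that triggers the report).
    os.unlink(path)
    return written
```

Calling write_then_unlink("/net/lustre/file") corresponds to the failing case, while write_then_unlink("/net/lustre/file", delay_seconds=5) corresponds to the case reported as okay. The kernel log still has to be checked separately for the "Bad page state" message.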

It appears that this bug could be the cause of some misaccounting in the kernel's page cache, which causes lockups when the task is running in a cgroup. I originally raised this in a mailing list thread:

http://lists.lustre.org/pipermail/lustre-devel/2011-July/003865.html
http://lists.lustre.org/pipermail/lustre-devel/2011-August/003876.html

Here's a full example, taken today with the attached kernel config. The process is not running in a cgroup, although the kernel is built with cgroup support.

BUG: Bad page state in process rm pfn:77813
page:ffffea0002914688 flags:400000000000000c count:0 mapcount:0 mapping:(null) index:0
Pid: 1173, comm: rm Not tainted 2.6.32.28-ml #8
Call Trace:
[<ffffffff81094ab2>] bad_page+0xd2/0x130
[<ffffffff810c4c39>] ? lookup_page_cgroup_used+0x9/0x20
[<ffffffff810978ea>] free_hot_cold_page+0x6a/0x2d0
[<ffffffff81097bab>] free_hot_page+0xb/0x10
[<ffffffff8109a65a>] put_page+0xea/0x140
[<ffffffffa04fc5c7>] ll_page_removal_cb+0x207/0x510 [lustre]
[<ffffffffa041207b>] cache_remove_lock+0x1ab/0x29c [osc]
[<ffffffffa03fafad>] osc_extent_blocking_cb+0x25d/0x2e0 [osc]
[<ffffffff8137bcf6>] ? _spin_unlock+0x26/0x30
[<ffffffffa02db058>] ? unlock_res_and_lock+0x58/0x100 [ptlrpc]
[<ffffffffa02df630>] ldlm_cancel_callback+0x60/0xf0 [ptlrpc]
[<ffffffffa02f877c>] ldlm_cli_cancel_local+0x6c/0x350 [ptlrpc]
[<ffffffffa02fa960>] ldlm_cancel_list+0xf0/0x240 [ptlrpc]
[<ffffffffa02fac67>] ldlm_cancel_resource_local+0x1b7/0x2d0 [ptlrpc]
[<ffffffff81070f99>] ? is_module_address+0x9/0x20
[<ffffffffa03fcb57>] osc_destroy+0x107/0x730 [osc]
[<ffffffffa04b6a65>] ? lov_prep_destroy_set+0x285/0x970 [lov]
[<ffffffffa04a07c8>] lov_destroy+0x568/0xf20 [lov]
[<ffffffffa05355e3>] ll_objects_destroy+0x4e3/0x18c0 [lustre]
[<ffffffffa046d099>] ? mdc_reint+0xd9/0x270 [mdc]
[<ffffffffa0537098>] ll_unlink_generic+0x298/0x360 [lustre]
[<ffffffff8137a65f>] ? __mutex_lock_common+0x27f/0x3b0
[<ffffffff810d3c7e>] ? vfs_unlink+0x5e/0xd0
[<ffffffffa01a65c9>] ? cfs_free+0x9/0x10 [libcfs]
[<ffffffffa053716d>] ll_unlink+0xd/0x10 [lustre]
[<ffffffff810d3cad>] vfs_unlink+0x8d/0xd0
[<ffffffff810d6245>] ? lookup_hash+0x35/0x50
[<ffffffff810d7613>] do_unlinkat+0x183/0x1c0
[<ffffffff8137b828>] ? lockdep_sys_exit_thunk+0x35/0x67
[<ffffffff8137b7b2>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[<ffffffff810d77ad>] sys_unlinkat+0x1d/0x40
[<ffffffff8100b3c2>] system_call_fastpath+0x16/0x1b

config.gz
11 kB
23/Aug/11 7:57 AM

Assignee: Zhenyu Xu
Reporter: Mark Hills
Votes: 0
Watchers: 3
Created: 23/Aug/11 7:57 AM
Updated: 09/May/12 12:21 PM
Resolved: 16/Mar/12 8:42 AM