Lustre / LU-620

"Bad page state" reported after unlink

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Lustre 2.1.0, Lustre 2.1.2, Lustre 1.8.6
    • None
    • Client: Lustre b1_8 Git 999530e, Linux 2.6.32.8
    • 3
    • 4860

      I have a reproducible test case for a page bug; a kernel with additional debugging enabled reports it clearly.

      $ dd if=/dev/zero of=/net/lustre/file bs=4096 count=1
      $ rm /net/lustre/file
      BUG: Bad page state in process rm pfn:21fe6a
      page:ffffea00076fa730 flags:800000000000000c count:0 mapcount:0 mapping:(null) index:1

      The bug occurs on unlink() of a file shortly after it was written to.

      If there is a delay of a few seconds before the rm, all is okay. Truncate works, but a subsequent unlink (rm) can still fail if it follows quickly enough.
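
The timing dependence described above can be exercised with a small helper. This is a minimal sketch, not part of the original report: the function name write_then_unlink and the 5-second delay in the example are assumptions chosen for illustration, and the target path is the one from the commands above.

```shell
#!/bin/sh
# Sketch only: write_then_unlink is a hypothetical helper name.
# Writes a single 4096-byte block to $1, waits $2 seconds (default 0),
# then unlinks the file, mirroring the dd/rm sequence in the report.
write_then_unlink() {
    target=$1
    delay=${2:-0}
    dd if=/dev/zero of="$target" bs=4096 count=1 2>/dev/null || return 1
    if [ "$delay" -gt 0 ]; then
        sleep "$delay"
    fi
    rm "$target"
}

# Example (requires a mounted Lustre client; 5 seconds is an assumed delay):
#   write_then_unlink /net/lustre/file 0   # immediate unlink, the failing case
#   write_then_unlink /net/lustre/file 5   # delayed unlink, reported as okay
```

After each run, the kernel log (for example via dmesg) can be checked for a new "Bad page state" line.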

      This bug may be the cause of some misaccounting in the kernel's page cache, which causes lockups when the task is running in a cgroup. I originally raised this in a mailing list thread:

      http://lists.lustre.org/pipermail/lustre-devel/2011-July/003865.html
      http://lists.lustre.org/pipermail/lustre-devel/2011-August/003876.html

      Here's a full example, taken today with the attached kernel config. The process is not running in a cgroup, although the kernel is built with cgroup support.

      BUG: Bad page state in process rm pfn:77813
      page:ffffea0002914688 flags:400000000000000c count:0 mapcount:0 mapping:(null) index:0
      Pid: 1173, comm: rm Not tainted 2.6.32.28-ml #8
      Call Trace:
      [<ffffffff81094ab2>] bad_page+0xd2/0x130
      [<ffffffff810c4c39>] ? lookup_page_cgroup_used+0x9/0x20
      [<ffffffff810978ea>] free_hot_cold_page+0x6a/0x2d0
      [<ffffffff81097bab>] free_hot_page+0xb/0x10
      [<ffffffff8109a65a>] put_page+0xea/0x140
      [<ffffffffa04fc5c7>] ll_page_removal_cb+0x207/0x510 [lustre]
      [<ffffffffa041207b>] cache_remove_lock+0x1ab/0x29c [osc]
      [<ffffffffa03fafad>] osc_extent_blocking_cb+0x25d/0x2e0 [osc]
      [<ffffffff8137bcf6>] ? _spin_unlock+0x26/0x30
      [<ffffffffa02db058>] ? unlock_res_and_lock+0x58/0x100 [ptlrpc]
      [<ffffffffa02df630>] ldlm_cancel_callback+0x60/0xf0 [ptlrpc]
      [<ffffffffa02f877c>] ldlm_cli_cancel_local+0x6c/0x350 [ptlrpc]
      [<ffffffffa02fa960>] ldlm_cancel_list+0xf0/0x240 [ptlrpc]
      [<ffffffffa02fac67>] ldlm_cancel_resource_local+0x1b7/0x2d0 [ptlrpc]
      [<ffffffff81070f99>] ? is_module_address+0x9/0x20
      [<ffffffffa03fcb57>] osc_destroy+0x107/0x730 [osc]
      [<ffffffffa04b6a65>] ? lov_prep_destroy_set+0x285/0x970 [lov]
      [<ffffffffa04a07c8>] lov_destroy+0x568/0xf20 [lov]
      [<ffffffffa05355e3>] ll_objects_destroy+0x4e3/0x18c0 [lustre]
      [<ffffffffa046d099>] ? mdc_reint+0xd9/0x270 [mdc]
      [<ffffffffa0537098>] ll_unlink_generic+0x298/0x360 [lustre]
      [<ffffffff8137a65f>] ? __mutex_lock_common+0x27f/0x3b0
      [<ffffffff810d3c7e>] ? vfs_unlink+0x5e/0xd0
      [<ffffffffa01a65c9>] ? cfs_free+0x9/0x10 [libcfs]
      [<ffffffffa053716d>] ll_unlink+0xd/0x10 [lustre]
      [<ffffffff810d3cad>] vfs_unlink+0x8d/0xd0
      [<ffffffff810d6245>] ? lookup_hash+0x35/0x50
      [<ffffffff810d7613>] do_unlinkat+0x183/0x1c0
      [<ffffffff8137b828>] ? lockdep_sys_exit_thunk+0x35/0x67
      [<ffffffff8137b7b2>] ? trace_hardirqs_on_thunk+0x3a/0x3f
      [<ffffffff810d77ad>] sys_unlinkat+0x1d/0x40
      [<ffffffff8100b3c2>] system_call_fastpath+0x16/0x1b
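
When repeating the test many times, it can help to pull just the reporting lines out of a kernel log. The following is a rough sketch that assumes the "BUG: Bad page state in process NAME pfn:HEX" line format shown in the output above; the function name bad_page_reports is hypothetical, and process names containing spaces are not handled.

```shell
#!/bin/sh
# Sketch only: bad_page_reports is a hypothetical helper name.
# Reads kernel log text on stdin and prints "process pfn" for each
# "BUG: Bad page state" line; all other lines are ignored.
bad_page_reports() {
    sed -n 's/.*BUG: Bad page state in process \([^ ]*\) *pfn:\([0-9a-f]*\).*/\1 \2/p'
}

# Example:
#   dmesg | bad_page_reports
```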

            bobijam Zhenyu Xu
            mark Mark Hills