LU-620: "Bad page state" reported after unlink


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.1.0, Lustre 2.1.2, Lustre 1.8.6
    • Labels: None
    • Environment: Client: Lustre b1_8 Git 999530e, Linux 2.6.32.8
    • Severity: 3
    • Rank: 4860

    Description

      I have a reproducible test case for a page bug, which is clearly reported on a kernel with additional debugging enabled.

      $ dd if=/dev/zero of=/net/lustre/file bs=4096 count=1
      $ rm /net/lustre/file
      BUG: Bad page state in process rm pfn:21fe6a
      page:ffffea00076fa730 flags:800000000000000c count:0 mapcount:0 mapping:(null) index:1

      The bug occurs on unlink() of a file shortly after it was written to.

      If there is a delay of a few seconds before the rm, all is okay. Truncate works, but a subsequent rm can still hit the bug if it follows quickly enough.
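
      As a quick way to compare the two cases, they can be wrapped in a small script. This is only a sketch of the commands already shown above, with the /net/lustre mount point taken from the example (adjust for the local setup).

      #!/bin/sh
      # Reproducer sketch for the "Bad page state" report on a quick unlink.
      MNT=/net/lustre

      # Case 1: write, then unlink immediately -- this is the sequence that
      # produces the "Bad page state" report here.
      dd if=/dev/zero of=$MNT/file bs=4096 count=1
      rm $MNT/file

      # Case 2: write, wait a few seconds, then unlink -- no report is seen.
      dd if=/dev/zero of=$MNT/file bs=4096 count=1
      sleep 5
      rm $MNT/file

      # Check the kernel log for the report.
      dmesg | grep -A1 "Bad page state"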

      It appears that this bug could be the cause of some mis-accounting of the kernel's page cache, which causes lockups when the task is running in a cgroup. I originally brought this up in a mailing list thread:

      http://lists.lustre.org/pipermail/lustre-devel/2011-July/003865.html
      http://lists.lustre.org/pipermail/lustre-devel/2011-August/003876.html
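
      One way to watch the page-cache accounting mentioned above is to run the reproducer from inside a memory cgroup and compare memory.stat before and after the unlink. This is only an illustrative sketch using the cgroup v1 memory controller as found on 2.6.32; the mount point and group name are arbitrary.

      # Sketch: run the reproducer inside a memory cgroup (cgroup v1).
      mkdir -p /cgroup
      mount -t cgroup -o memory none /cgroup
      mkdir /cgroup/lu620
      echo $$ > /cgroup/lu620/tasks          # move this shell into the group

      dd if=/dev/zero of=/net/lustre/file bs=4096 count=1
      grep cache /cgroup/lu620/memory.stat   # page cache charged to the group
      rm /net/lustre/file
      grep cache /cgroup/lu620/memory.stat   # cache count should drop back after the unlink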

      Here is a full example, captured today with the attached kernel config. The process is not running in a cgroup, although the kernel is built with cgroup support.

      BUG: Bad page state in process rm pfn:77813
      page:ffffea0002914688 flags:400000000000000c count:0 mapcount:0 mapping:(null) index:0
      Pid: 1173, comm: rm Not tainted 2.6.32.28-ml #8
      Call Trace:
      [<ffffffff81094ab2>] bad_page+0xd2/0x130
      [<ffffffff810c4c39>] ? lookup_page_cgroup_used+0x9/0x20
      [<ffffffff810978ea>] free_hot_cold_page+0x6a/0x2d0
      [<ffffffff81097bab>] free_hot_page+0xb/0x10
      [<ffffffff8109a65a>] put_page+0xea/0x140
      [<ffffffffa04fc5c7>] ll_page_removal_cb+0x207/0x510 [lustre]
      [<ffffffffa041207b>] cache_remove_lock+0x1ab/0x29c [osc]
      [<ffffffffa03fafad>] osc_extent_blocking_cb+0x25d/0x2e0 [osc]
      [<ffffffff8137bcf6>] ? _spin_unlock+0x26/0x30
      [<ffffffffa02db058>] ? unlock_res_and_lock+0x58/0x100 [ptlrpc]
      [<ffffffffa02df630>] ldlm_cancel_callback+0x60/0xf0 [ptlrpc]
      [<ffffffffa02f877c>] ldlm_cli_cancel_local+0x6c/0x350 [ptlrpc]
      [<ffffffffa02fa960>] ldlm_cancel_list+0xf0/0x240 [ptlrpc]
      [<ffffffffa02fac67>] ldlm_cancel_resource_local+0x1b7/0x2d0 [ptlrpc]
      [<ffffffff81070f99>] ? is_module_address+0x9/0x20
      [<ffffffffa03fcb57>] osc_destroy+0x107/0x730 [osc]
      [<ffffffffa04b6a65>] ? lov_prep_destroy_set+0x285/0x970 [lov]
      [<ffffffffa04a07c8>] lov_destroy+0x568/0xf20 [lov]
      [<ffffffffa05355e3>] ll_objects_destroy+0x4e3/0x18c0 [lustre]
      [<ffffffffa046d099>] ? mdc_reint+0xd9/0x270 [mdc]
      [<ffffffffa0537098>] ll_unlink_generic+0x298/0x360 [lustre]
      [<ffffffff8137a65f>] ? __mutex_lock_common+0x27f/0x3b0
      [<ffffffff810d3c7e>] ? vfs_unlink+0x5e/0xd0
      [<ffffffffa01a65c9>] ? cfs_free+0x9/0x10 [libcfs]
      [<ffffffffa053716d>] ll_unlink+0xd/0x10 [lustre]
      [<ffffffff810d3cad>] vfs_unlink+0x8d/0xd0
      [<ffffffff810d6245>] ? lookup_hash+0x35/0x50
      [<ffffffff810d7613>] do_unlinkat+0x183/0x1c0
      [<ffffffff8137b828>] ? lockdep_sys_exit_thunk+0x35/0x67
      [<ffffffff8137b7b2>] ? trace_hardirqs_on_thunk+0x3a/0x3f
      [<ffffffff810d77ad>] sys_unlinkat+0x1d/0x40
      [<ffffffff8100b3c2>] system_call_fastpath+0x16/0x1b

People

    Assignee: bobijam (Zhenyu Xu)
    Reporter: mark (Mark Hills)
    Votes: 0
    Watchers: 3
