Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor

    Description

      > PID: 63193 TASK: ffff880902f7e040 CPU: 35 COMMAND: "python"
      > #0 [ffff880b44ea1108] schedule at ffffffff8141c637
      > #1 [ffff880b44ea1270] io_schedule at ffffffff8141cd01
      > #2 [ffff880b44ea12a0] sleep_on_page at ffffffff8110111e
      > #3 [ffff880b44ea12b0] __wait_on_bit at ffffffff8141d4b2
      > #4 [ffff880b44ea12f0] wait_on_page_bit at ffffffff81101454
      > #5 [ffff880b44ea1350] shrink_inactive_list at ffffffff81113c69
      > #6 [ffff880b44ea1530] shrink_list at ffffffff8111445e
      > #7 [ffff880b44ea1560] shrink_zone at ffffffff8111496a
      > #8 [ffff880b44ea16b0] do_try_to_free_pages at ffffffff81114d8b
      > #9 [ffff880b44ea1750] try_to_free_mem_cgroup_pages at ffffffff811153dd
      > #10 [ffff880b44ea17f0] mem_cgroup_hierarchical_reclaim at ffffffff81151d6d
      > #11 [ffff880b44ea18a0] __mem_cgroup_try_charge at ffffffff811539da
      > #12 [ffff880b44ea1990] mem_cgroup_cache_charge at ffffffff811557f4
      > #13 [ffff880b44ea19c0] add_to_page_cache_locked at ffffffff8110167e
      > #14 [ffff880b44ea1a00] add_to_page_cache at ffffffff811017cb
      > #15 [ffff880b44ea1a30] add_to_page_cache_lru at ffffffff8110182e
      > #16 [ffff880b44ea1a50] grab_cache_page_nowait at ffffffff81101f5b
      > #17 [ffff880b44ea1a80] ll_write_begin at ffffffffa086163b [lustre]
      > #18 [ffff880b44ea1b10] generic_file_buffered_write at ffffffff811003ce
      > #19 [ffff880b44ea1bd0] __generic_file_aio_write at ffffffff81103179
      > #20 [ffff880b44ea1c80] generic_file_aio_write at ffffffff811033c9
      > #21 [ffff880b44ea1cc0] vvp_io_write_start at ffffffffa087493f [lustre]
      > #22 [ffff880b44ea1d10] cl_io_start at ffffffffa037f502 [obdclass]
      > #23 [ffff880b44ea1d40] cl_io_loop at ffffffffa0383084 [obdclass]
      > #24 [ffff880b44ea1d70] ll_file_io_generic at ffffffffa0816237 [lustre]
      > #25 [ffff880b44ea1e40] ll_file_aio_write at ffffffffa0826409 [lustre]
      > #26 [ffff880b44ea1ea0] ll_file_write at ffffffffa08269ed [lustre]
      > #27 [ffff880b44ea1f10] vfs_write at ffffffff8115bd9b
      > #28 [ffff880b44ea1f40] sys_write at ffffffff8115bf45
      > #29 [ffff880b44ea1f80] system_call_fastpath at ffffffff81426aab
      > RIP: 00002aaaabaa7f40 RSP: 00007ffffffefa10 RFLAGS: 00000282
      > RAX: 0000000000000001 RBX: ffffffff81426aab RCX: 00000000000000a4
      > RDX: 0000000000400000 RSI: 00002aabd2b6e000 RDI: 0000000000000004
      > RBP: 00002aabd2b6e000 R8: 0000000000000000 R9: 0000000000000000
      > R10: 000000007a1e3f60 R11: 0000000000000246 R12: 0000000000000000
      > R13: 0000000000400000 R14: 000000007a1e3e80 R15: 0000000000400000
      > ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b

      > PID: 65447 TASK: ffff88032d84e040 CPU: 15 COMMAND: "slurmstepd"
      > #0 [ffff880f939b1388] schedule at ffffffff8141c637
      > #1 [ffff880f939b14f0] io_schedule at ffffffff8141cd01
      > #2 [ffff880f939b1520] sleep_on_page at ffffffff8110111e
      > #3 [ffff880f939b1530] __wait_on_bit at ffffffff8141d4b2
      > #4 [ffff880f939b1570] wait_on_page_bit at ffffffff81101454
      > #5 [ffff880f939b15d0] shrink_inactive_list at ffffffff81113c69
      > #6 [ffff880f939b17b0] shrink_list at ffffffff8111445e
      > #7 [ffff880f939b17e0] shrink_zone at ffffffff8111496a
      > #8 [ffff880f939b1930] do_try_to_free_pages at ffffffff81114d8b
      > #9 [ffff880f939b19d0] try_to_free_mem_cgroup_pages at ffffffff811153dd
      > #10 [ffff880f939b1a70] mem_cgroup_hierarchical_reclaim at ffffffff81151d6d
      > #11 [ffff880f939b1b20] __mem_cgroup_try_charge at ffffffff811539da
      > #12 [ffff880f939b1c10] mem_cgroup_prepare_migration at ffffffff81155b92
      > #13 [ffff880f939b1c50] migrate_pages at ffffffff8114e0d1
      > #14 [ffff880f939b1ce0] compact_zone at ffffffff81146aac
      > #15 [ffff880f939b1de0] __compact_pgdat at ffffffff8114725b
      > #16 [ffff880f939b1e20] compact_node at ffffffff811472df
      > #17 [ffff880f939b1e90] sysfs_compact_node at ffffffff81147341
      > #18 [ffff880f939b1eb0] sysdev_store at ffffffff812b7bf0
      > #19 [ffff880f939b1ec0] sysfs_write_file at ffffffff811c4d27
      > #20 [ffff880f939b1f10] vfs_write at ffffffff8115bd9b
      > #21 [ffff880f939b1f40] sys_write at ffffffff8115bf45
      > #22 [ffff880f939b1f80] system_call_fastpath at ffffffff81426aab

          Activity

            [LU-8464] Lustre I/O hung waiting for page

            adilger Andreas Dilger added a comment -

            Oleg, was this patch ever landed in newer el7 releases? I'm wondering if this should be closed as "Won't Fix", since the patch is not really needed anymore, AFAICS.

            green Oleg Drokin added a comment -

            I opened a Red Hat Bugzilla ticket about this to backport the patch into some next rhel7.x kernel (you probably cannot see it, since by default all such tickets are private):
            https://bugzilla.redhat.com/show_bug.cgi?id=1410571


            amk Ann Koehler (Inactive) added a comment -

            The upstream commit that removed the mem_cgroup_prepare_migration() call from __unmap_and_move() is:

            http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=0a31bc97c80c3fa87b32c091d9a930ac19cd0c40

            jay Jinshan Xiong (Inactive) added a comment -

            It seems the fix has been present since the 3.18 kernels.

            amk Ann Koehler (Inactive) added a comment -

            Thanks Jinshan. I'll pass this bug on to our kernel engineers. If you can identify the kernel where it's fixed, I'm sure that would be a big help.

            jay Jinshan Xiong (Inactive) added a comment - edited

            I spent some time on this issue and found something new (which you didn't mention in the ticket).

            I think this issue is due to the implementation of the memory cgroup. As you can see from this code in __unmap_and_move():

                            lock_page(page);
                    }
            
                    /* charge against new page */
                    mem_cgroup_prepare_migration(page, newpage, &mem);
            

            It locks a page and charges the mem cgroup, which in turn tries to free a page from the cgroup. In the process of freeing that page, it waits for that page's writeback to complete. This is what causes the deadlock.
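
            To make that chain concrete, the reclaim path visible in the slurmstepd backtrace above is roughly (function names are taken from the stack trace; the comments are only annotations):

            migrate_pages() / __unmap_and_move()    /* page being migrated is locked here */
              mem_cgroup_prepare_migration()
                __mem_cgroup_try_charge()           /* cgroup is at its limit */
                  mem_cgroup_hierarchical_reclaim()
                    try_to_free_mem_cgroup_pages()
                      do_try_to_free_pages()
                        shrink_zone() -> shrink_list() -> shrink_inactive_list()
                          wait_on_page_bit()        /* sleeps until some other page's
                                                       writeback completes */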

            Let me put things together.

            Ptlrpc thread:

            lock page A;
            set writeback to page A;
            unlock page A;
            lock page B     <- blocked
            

            and migrating thread:

            /* try to migrate page B */
            lock page B;
            /* since there is no free slot of this process' memory control group */
            try to free page A;
            wait for A's writeback to complete;  <- blocked
            free page A;
            wait for B's writeback to complete;
            

            It's a really bad choice for migrate_pages() to lock a page and wait for writeback on another one to complete.

            This problem is hard to fix in Lustre but much easier to fix in the kernel; in fact, it turns out that the linux-4.x kernels no longer have this problem.
            We could simply move 'wait for B's writeback to complete' to before trying to free page A, and this problem should be fixed.
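
            Sketching that suggestion against the pseudocode above (this is only an illustration of the proposed reordering, not actual kernel code), the migrating thread would become:

            /* try to migrate page B */
            lock page B;
            wait for B's writeback to complete;   <- moved up, before any cgroup reclaim
            /* since there is no free slot of this process' memory control group */
            try to free page A;
            wait for A's writeback to complete;
            free page A;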

            I will take a further look to see in which kernel this problem was fixed.


            jay Jinshan Xiong (Inactive) added a comment -

            Does migrate_pages() lock one page and then wait for another page to complete writeback?

            gerrit Gerrit Updater added a comment -

            Andriy Skulysh (andriy.skulysh@seagate.com) uploaded a new patch: http://review.whamcloud.com/21652
            Subject: LU-8464 llite: Lustre I/O hung waiting for page
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 379e1b2fe3d5fe98972e72887eed60801fbc6828

            askulysh Andriy Skulysh added a comment -

            Thread PID 65447 tries to migrate the 2nd page from the extent and waits for PID 14502 to complete writeback.

            But these two pages are going to fit in one RPC, so PID 14502 can't complete the I/O because the 1st page was locked by PID 65447.
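
            Putting this together with the slurmstepd backtrace in the description and the ptlrpcd_11 backtrace below (an illustrative summary only, not taken from the crash dump):

            PID 65447 (compaction/migration): holds the lock on one page of the extent and,
                                              inside the cgroup charge, waits for another
                                              page's writeback to complete
            PID 14502 (ptlrpcd_11):           osc_extent_make_ready -> vvp_page_make_ready ->
                                              __lock_page needs the page locked by 65447
                                              before it can send the single RPC covering
                                              both pages, so that writeback never completes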

            askulysh Andriy Skulysh added a comment -

            > PID: 14502 TASK: ffff881fedf78040 CPU: 13 COMMAND: "ptlrpcd_11"
            > #0 [ffff881fee223660] schedule at ffffffff8141c637
            > #1 [ffff881fee2237c8] io_schedule at ffffffff8141cd01
            > #2 [ffff881fee2237f8] sleep_on_page at ffffffff8110111e
            > #3 [ffff881fee223808] __wait_on_bit_lock at ffffffff8141d28a
            > #4 [ffff881fee223848] __lock_page at ffffffff81101109
            > #5 [ffff881fee2238a8] vvp_page_make_ready at ffffffffa08716ed [lustre]
            > #6 [ffff881fee2238d8] cl_page_make_ready at ffffffffa03751c5 [obdclass]
            > #7 [ffff881fee223928] osc_extent_make_ready at ffffffffa070f9ac [osc]
            > #8 [ffff881fee223a68] osc_io_unplug0 at ffffffffa0713e5e [osc]
            > #9 [ffff881fee223c98] osc_io_unplug at ffffffffa07156c1 [osc]
            > #10 [ffff881fee223ca8] brw_queue_work at ffffffffa06e6426 [osc]
            > #11 [ffff881fee223cc8] work_interpreter at ffffffffa04955ae [ptlrpc]
            > #12 [ffff881fee223ce8] ptlrpc_check_set at ffffffffa049e85c [ptlrpc]
            > #13 [ffff881fee223d78] ptlrpcd_check at ffffffffa04caaab [ptlrpc]
            > #14 [ffff881fee223dd8] ptlrpcd at ffffffffa04cb15b [ptlrpc]
            > #15 [ffff881fee223ee8] kthread at ffffffff8107374e
            > #16 [ffff881fee223f48] kernel_thread_helper at ffffffff81427bb4


            People

              Assignee: askulysh Andriy Skulysh
              Reporter: askulysh Andriy Skulysh
              Votes: 0
              Watchers: 9

              Dates

                Created:
                Updated: