Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor

    Description

      > PID: 63193 TASK: ffff880902f7e040 CPU: 35 COMMAND: "python"
      > #0 [ffff880b44ea1108] schedule at ffffffff8141c637
      > #1 [ffff880b44ea1270] io_schedule at ffffffff8141cd01
      > #2 [ffff880b44ea12a0] sleep_on_page at ffffffff8110111e
      > #3 [ffff880b44ea12b0] __wait_on_bit at ffffffff8141d4b2
      > #4 [ffff880b44ea12f0] wait_on_page_bit at ffffffff81101454
      > #5 [ffff880b44ea1350] shrink_inactive_list at ffffffff81113c69
      > #6 [ffff880b44ea1530] shrink_list at ffffffff8111445e
      > #7 [ffff880b44ea1560] shrink_zone at ffffffff8111496a
      > #8 [ffff880b44ea16b0] do_try_to_free_pages at ffffffff81114d8b
      > #9 [ffff880b44ea1750] try_to_free_mem_cgroup_pages at ffffffff811153dd
      > #10 [ffff880b44ea17f0] mem_cgroup_hierarchical_reclaim at ffffffff81151d6d
      > #11 [ffff880b44ea18a0] __mem_cgroup_try_charge at ffffffff811539da
      > #12 [ffff880b44ea1990] mem_cgroup_cache_charge at ffffffff811557f4
      > #13 [ffff880b44ea19c0] add_to_page_cache_locked at ffffffff8110167e
      > #14 [ffff880b44ea1a00] add_to_page_cache at ffffffff811017cb
      > #15 [ffff880b44ea1a30] add_to_page_cache_lru at ffffffff8110182e
      > #16 [ffff880b44ea1a50] grab_cache_page_nowait at ffffffff81101f5b
      > #17 [ffff880b44ea1a80] ll_write_begin at ffffffffa086163b [lustre]
      > #18 [ffff880b44ea1b10] generic_file_buffered_write at ffffffff811003ce
      > #19 [ffff880b44ea1bd0] __generic_file_aio_write at ffffffff81103179
      > #20 [ffff880b44ea1c80] generic_file_aio_write at ffffffff811033c9
      > #21 [ffff880b44ea1cc0] vvp_io_write_start at ffffffffa087493f [lustre]
      > #22 [ffff880b44ea1d10] cl_io_start at ffffffffa037f502 [obdclass]
      > #23 [ffff880b44ea1d40] cl_io_loop at ffffffffa0383084 [obdclass]
      > #24 [ffff880b44ea1d70] ll_file_io_generic at ffffffffa0816237 [lustre]
      > #25 [ffff880b44ea1e40] ll_file_aio_write at ffffffffa0826409 [lustre]
      > #26 [ffff880b44ea1ea0] ll_file_write at ffffffffa08269ed [lustre]
      > #27 [ffff880b44ea1f10] vfs_write at ffffffff8115bd9b
      > #28 [ffff880b44ea1f40] sys_write at ffffffff8115bf45
      > #29 [ffff880b44ea1f80] system_call_fastpath at ffffffff81426aab
      > RIP: 00002aaaabaa7f40 RSP: 00007ffffffefa10 RFLAGS: 00000282
      > RAX: 0000000000000001 RBX: ffffffff81426aab RCX: 00000000000000a4
      > RDX: 0000000000400000 RSI: 00002aabd2b6e000 RDI: 0000000000000004
      > RBP: 00002aabd2b6e000 R8: 0000000000000000 R9: 0000000000000000
      > R10: 000000007a1e3f60 R11: 0000000000000246 R12: 0000000000000000
      > R13: 0000000000400000 R14: 000000007a1e3e80 R15: 0000000000400000
      > ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b

      > PID: 65447 TASK: ffff88032d84e040 CPU: 15 COMMAND: "slurmstepd"
      > #0 [ffff880f939b1388] schedule at ffffffff8141c637
      > #1 [ffff880f939b14f0] io_schedule at ffffffff8141cd01
      > #2 [ffff880f939b1520] sleep_on_page at ffffffff8110111e
      > #3 [ffff880f939b1530] __wait_on_bit at ffffffff8141d4b2
      > #4 [ffff880f939b1570] wait_on_page_bit at ffffffff81101454
      > #5 [ffff880f939b15d0] shrink_inactive_list at ffffffff81113c69
      > #6 [ffff880f939b17b0] shrink_list at ffffffff8111445e
      > #7 [ffff880f939b17e0] shrink_zone at ffffffff8111496a
      > #8 [ffff880f939b1930] do_try_to_free_pages at ffffffff81114d8b
      > #9 [ffff880f939b19d0] try_to_free_mem_cgroup_pages at ffffffff811153dd
      > #10 [ffff880f939b1a70] mem_cgroup_hierarchical_reclaim at ffffffff81151d6d
      > #11 [ffff880f939b1b20] __mem_cgroup_try_charge at ffffffff811539da
      > #12 [ffff880f939b1c10] mem_cgroup_prepare_migration at ffffffff81155b92
      > #13 [ffff880f939b1c50] migrate_pages at ffffffff8114e0d1
      > #14 [ffff880f939b1ce0] compact_zone at ffffffff81146aac
      > #15 [ffff880f939b1de0] __compact_pgdat at ffffffff8114725b
      > #16 [ffff880f939b1e20] compact_node at ffffffff811472df
      > #17 [ffff880f939b1e90] sysfs_compact_node at ffffffff81147341
      > #18 [ffff880f939b1eb0] sysdev_store at ffffffff812b7bf0
      > #19 [ffff880f939b1ec0] sysfs_write_file at ffffffff811c4d27
      > #20 [ffff880f939b1f10] vfs_write at ffffffff8115bd9b
      > #21 [ffff880f939b1f40] sys_write at ffffffff8115bf45
      > #22 [ffff880f939b1f80] system_call_fastpath at ffffffff81426aab

          Activity

            [LU-8464] Lustre I/O hung waiting for page

            adilger Andreas Dilger added a comment -

            Oleg, was this patch ever landed in newer el7 releases? I'm wondering if this should be closed as "Won't Fix" since the patch is not really needed anymore, AFAICS.
            green Oleg Drokin added a comment -

            I opened a Red Hat Bugzilla ticket about this to backport the patch into some future RHEL 7.x kernel (you probably cannot see it, since by default all such tickets are private):
            https://bugzilla.redhat.com/show_bug.cgi?id=1410571

            amk Ann Koehler (Inactive) added a comment -

            The upstream commit that removed the mem_cgroup_prepare_migration() call from __unmap_and_move() is:

            http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=0a31bc97c80c3fa87b32c091d9a930ac19cd0c40

            jay Jinshan Xiong (Inactive) added a comment -

            It seems the fix first appeared in the 3.18 kernel.

            amk Ann Koehler (Inactive) added a comment -

            Thanks Jinshan. I'll pass this bug on to our kernel engineers. If you can identify the kernel where it's fixed, I'm sure that would be a big help.

            jay Jinshan Xiong (Inactive) added a comment - edited

            I spent some time on this issue and found something new (something you didn't mention in the ticket).

            I think this issue is due to the implementation of the memory cgroup. As you can see from this code in __unmap_and_move():

                            lock_page(page);
                    }
            
                    /* charge against new page */
                    mem_cgroup_prepare_migration(page, newpage, &mem);
            

            It locks a page and then charges the mem cgroup, which in turn tries to free a page from the cgroup. In the process of freeing that page, it waits for the page's writeback to complete. This causes a deadlock.

            Let me put things together.

            Ptlrpc thread:

            lock page A;
            set writeback to page A;
            unlock page A;
            lock page B     <- blocked
            

            and the migrating thread:

            /* try to migrate page B */
            lock page B;
            /* since there is no free slot in this process's memory control group */
            try to free page A;
            wait for A's writeback to complete;  <- blocked
            free page A;
            wait for B's writeback to complete;
            

            It's a really bad choice for migrate_pages() to lock a page and wait for writeback on another one to complete.
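
            For illustration, here is a minimal userspace model of that ordering (plain pthreads with hypothetical names; a sketch of the lock/flag dependency, not kernel code). Run it and both threads hang just like the stacks above: "ptlrpc" blocks taking page B's lock while "migrate" holds B and sleeps on A's writeback, which only the blocked "ptlrpc" thread can clear.

            /* deadlock_sketch.c -- build with: cc -pthread deadlock_sketch.c */
            #include <pthread.h>
            #include <stdio.h>
            #include <unistd.h>

            struct page {
                    pthread_mutex_t lock;           /* models the page lock */
                    int             writeback;      /* models PG_writeback  */
            };

            static struct page A = { PTHREAD_MUTEX_INITIALIZER, 0 };
            static struct page B = { PTHREAD_MUTEX_INITIALIZER, 0 };

            /* one mutex/condvar pair guards both writeback flags */
            static pthread_mutex_t wb_lock = PTHREAD_MUTEX_INITIALIZER;
            static pthread_cond_t  wb_done = PTHREAD_COND_INITIALIZER;

            static void *ptlrpc_thread(void *arg)
            {
                    (void)arg;
                    pthread_mutex_lock(&A.lock);       /* lock page A */
                    pthread_mutex_lock(&wb_lock);
                    A.writeback = 1;                   /* set writeback on A */
                    pthread_mutex_unlock(&wb_lock);
                    pthread_mutex_unlock(&A.lock);     /* unlock page A */

                    sleep(2);                          /* let migrate take B first */
                    puts("ptlrpc: taking page B's lock ...");
                    pthread_mutex_lock(&B.lock);       /* <- blocks: migrate holds B */

                    /* never reached: the I/O completion that clears A's writeback */
                    pthread_mutex_lock(&wb_lock);
                    A.writeback = 0;
                    pthread_cond_broadcast(&wb_done);
                    pthread_mutex_unlock(&wb_lock);
                    pthread_mutex_unlock(&B.lock);
                    return NULL;
            }

            static void *migrate_thread(void *arg)
            {
                    (void)arg;
                    sleep(1);                          /* run after ptlrpc's setup */
                    pthread_mutex_lock(&B.lock);       /* lock page B */

                    /* cgroup is full: "reclaim" page A by waiting for its
                     * writeback, while still holding page B's lock */
                    puts("migrate: waiting for page A's writeback ...");
                    pthread_mutex_lock(&wb_lock);
                    while (A.writeback)
                            pthread_cond_wait(&wb_done, &wb_lock);  /* <- blocks forever */
                    pthread_mutex_unlock(&wb_lock);

                    pthread_mutex_unlock(&B.lock);
                    return NULL;
            }

            int main(void)
            {
                    pthread_t t1, t2;
                    pthread_create(&t1, NULL, ptlrpc_thread, NULL);
                    pthread_create(&t2, NULL, migrate_thread, NULL);
                    pthread_join(t1, NULL);            /* never returns */
                    pthread_join(t2, NULL);
                    return 0;
            }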

            This problem is hard to fix in Lustre but much easier to fix in the kernel; in fact, it turns out that linux-4.x kernels no longer have this problem.
            Moving 'wait for B's writeback to complete' to before the attempt to free page A should fix the problem (see the sketch below).
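
            In terms of the sketch above, that reordering means the migrating thread never sleeps on another page's writeback while holding page B's lock. A hypothetical replacement for migrate_thread() (illustrative only; the actual upstream change, commit 0a31bc97c80c, removed the charge from this path entirely):

            /* fixed ordering: do the blocking wait BEFORE taking page B's lock */
            static void *migrate_thread_fixed(void *arg)
            {
                    (void)arg;
                    sleep(1);

                    /* "reclaim" page A (wait for its writeback) while page B is
                     * still unlocked, so ptlrpc can take B and finish the I/O
                     * that clears A's writeback */
                    pthread_mutex_lock(&wb_lock);
                    while (A.writeback)
                            pthread_cond_wait(&wb_done, &wb_lock);
                    pthread_mutex_unlock(&wb_lock);

                    pthread_mutex_lock(&B.lock);       /* now safe to migrate B */
                    /* ... migrate page B ... */
                    pthread_mutex_unlock(&B.lock);
                    return NULL;
            }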

            I will take a further look to see in which kernel this problem was fixed.


            People

              Assignee: askulysh Andriy Skulysh
              Reporter: askulysh Andriy Skulysh
              Votes: 0
              Watchers: 9
