Details
Bug
Resolution: Fixed
Major
Lustre 2.4.1, Lustre 2.5.0
3
11592
Description
Cray has seen this deadlock with 2.4, 2.4.1, and 2.5 clients. It occurs while running the memfill2 test, which creates memory pressure on the node. The test hangs, with all memfill2 tasks showing the following stack trace:
> crash> bt 19721
> PID: 19721 TASK: ffff88083930f040 CPU: 29 COMMAND: "memfill2"
> #0 [ffff8807f1a21158] schedule at ffffffff81381947
> #1 [ffff8807f1a212b0] io_schedule at ffffffff81381f41
> #2 [ffff8807f1a212e0] sleep_on_page at ffffffff810f00ce
> #3 [ffff8807f1a212f0] __wait_on_bit at ffffffff813825d2
> #4 [ffff8807f1a21330] wait_on_page_bit at ffffffff810f0824
> #5 [ffff8807f1a21390] shrink_page_list at ffffffff81102a25
> #6 [ffff8807f1a214b0] shrink_inactive_list at ffffffff81103067
> #7 [ffff8807f1a21590] shrink_list at ffffffff8110355c
> #8 [ffff8807f1a21660] shrink_zone at ffffffff81103a8d
> #9 [ffff8807f1a21710] do_try_to_free_pages at ffffffff81103cde
> #10 [ffff8807f1a217c0] try_to_free_mem_cgroup_pages at ffffffff8110432d
> #11 [ffff8807f1a21860] mem_cgroup_hierarchical_reclaim at ffffffff8113ab7d
> #12 [ffff8807f1a21910] __mem_cgroup_try_charge at ffffffff8113b5c7
> #13 [ffff8807f1a21a00] mem_cgroup_cache_charge at ffffffff8113d947
> #14 [ffff8807f1a21a40] add_to_page_cache_locked at ffffffff810f091f
> #15 [ffff8807f1a21a80] add_to_page_cache at ffffffff810f0a8b
> #16 [ffff8807f1a21ab0] add_to_page_cache_lru at ffffffff810f0afd
> #17 [ffff8807f1a21ad0] grab_cache_page_write_begin at ffffffff810f0bc3
> #18 [ffff8807f1a21b20] ll_write_begin at ffffffffa086b92b [lustre]
> #19 [ffff8807f1a21b60] generic_file_buffered_write at ffffffff810ef37e
> #20 [ffff8807f1a21c20] __generic_file_aio_write at ffffffff810f2461
> #21 [ffff8807f1a21cd0] generic_file_aio_write at ffffffff810f26b6
> #22 [ffff8807f1a21d50] vvp_io_write_start at ffffffffa0880d64 [lustre]
> #23 [ffff8807f1a21d90] cl_io_start at ffffffffa036d622 [obdclass]
> #24 [ffff8807f1a21dc0] cl_io_loop at ffffffffa03718d4 [obdclass]
> #25 [ffff8807f1a21df0] ll_file_io_generic at ffffffffa08268b8 [lustre]
> #26 [ffff8807f1a21e60] ll_file_aio_write at ffffffffa0826c8e [lustre]
> #27 [ffff8807f1a21eb0] ll_file_write at ffffffffa0827e5a [lustre]
> #28 [ffff8807f1a21f10] vfs_write at ffffffff8114257b
> #29 [ffff8807f1a21f40] sys_write at ffffffff81142725
> #30 [ffff8807f1a21f80] system_call_fastpath at ffffffff8138baeb
> RIP: 00002aaaaec3b6f0 RSP: 00007fffffffa7e8 RFLAGS: 00000206
> RAX: 0000000000000001 RBX: ffffffff8138baeb RCX: 0000000072650000
> RDX: 00000000000f4240 RSI: 0000000072840680 RDI: 0000000000000004
> RBP: 0000000000000000 R8: 0000000000000001 R9: 0000000000000096
> R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000031c
> R13: 0000000000000032 R14: 000000000000031b R15: 0000000000000000
> ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b
The task is waiting on a page bit. Given the call chain (shrink_page_list calling wait_on_page_bit) and the code location, that bit is PG_writeback. The full stack trace gives us the page:
> crash> bt -f 19721
> <snip>
> #4 [ffff8807f1a21330] wait_on_page_bit at ffffffff810f0824
> ffff8807f1a21338: ffffea001db19af0 000000000000000e
> ffff8807f1a21348: 0000000000000000 ffff88083930f040
> ffff8807f1a21358: ffffffff81068e50 ffff88087ffb6ab0
> ffff8807f1a21368: ffff88087ffb6ab0 0000000000000001
> ffff8807f1a21378: ffff8807f1a21388 ffffea001db19af0
> ffff8807f1a21388: ffff8807f1a214a8 ffffffff81102a25
> <snip>
> page is 0xffffea001db19af0
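As a rough illustration of how the page pointer stands out in the raw frame, the sketch below scans the stack words of frame #4 for values that fall inside the vmemmap region, where struct page descriptors live. The range constants are an assumption (x86_64 layout without address randomization), and `candidate_pages` and `FRAME_4` are hypothetical names used only for this example; this is not part of crash or of the sial script.

```python
import re

# Assumption: x86_64 vmemmap region (no KASLR), where struct page
# descriptors are mapped.
VMEMMAP_START = 0xFFFFEA0000000000
VMEMMAP_END = 0xFFFFEAFFFFFFFFFF


def candidate_pages(frame_dump):
    """Return the unique stack words that fall inside the vmemmap range."""
    found = []
    for line in frame_dump.splitlines():
        # Raw stack lines look like "<stack addr>: <word> <word>".
        addr, sep, rest = line.strip().partition(":")
        if not sep or not re.fullmatch(r"[0-9a-f]{16}", addr.strip()):
            continue  # frame header or <snip> marker, not stack data
        for word in re.findall(r"\b[0-9a-f]{16}\b", rest):
            value = int(word, 16)
            if VMEMMAP_START <= value <= VMEMMAP_END and value not in found:
                found.append(value)
    return found


# Frame #4 copied from the "bt -f" output above.
FRAME_4 = """\
#4 [ffff8807f1a21330] wait_on_page_bit at ffffffff810f0824
ffff8807f1a21338: ffffea001db19af0 000000000000000e
ffff8807f1a21348: 0000000000000000 ffff88083930f040
ffff8807f1a21358: ffffffff81068e50 ffff88087ffb6ab0
ffff8807f1a21368: ffff88087ffb6ab0 0000000000000001
ffff8807f1a21378: ffff8807f1a21388 ffffea001db19af0
ffff8807f1a21388: ffff8807f1a214a8 ffffffff81102a25
"""

print([hex(p) for p in candidate_pages(FRAME_4)])
```

Only one word in the frame lands in that range, which matches the page address quoted above.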
A sial script gives us the Lustre structures linked to the page:
> crash> llwp 0xffffea001db19af0
> page 0xffffea001db19af0
> _count 3 private 0xffff880282daf200
> mapping 0xffff8806f944bd00 index 0x2f643
> flags: locked uptodate private writeback reclaim
> addr-space 0xffff8806f944bd00
> host 0xffff8806f944bbb8 nrpages 0x20de
> writeback_index 0x2f6df a_ops 0xffffffffa08ae8e0 [ll_aops]
> inode 0xffff8806f944bbb8
> i_mode 0100015 i_uid 1356 i_gid 11121
> i_op 0xffffffffa089d9c0 [ll_file_inode_operations]
> i_private (nil)
> lli 0xffff8806f944bac0
> clob 0xffff880838b6bf78 lock 0x4949
> cl-obj 0xffff880838b6bf78
> ops 0xffffffffa0884f20 [vvp_ops] slice_off 0xc0
> lu-obj 0xffff880838b6bf78
> header 0xffff880838b6bee0 dev 0xffff880835ece4c0
> ops 0xffffffffa0884f60 [vvp_lu_obj_ops]
> linkage 0xffff8807fb81ded8:0xffff880838b6bf28 depth 0 flags 0x1
> dev_ref (nil)
> lu-obj-hdr 0xffff880838b6bee0
> flags 0 ref 8416 fid 0x2e3545187:0x17ec4(97988):0
> layers 0xffff880838b6bf90:0xffff8807fb81ded8
> cl-obj-hdr 0xffff880838b6bee0
> page_guard 0xdede lock_guard 0x1e1e tree 0xffff880838b6bf40
> pages 0x20de locks 0xffff8806ec81bd10:0xffff8806ec81bd10
> parent (nil) attr_guard 0x1a1a page_bufsize 320 nesting 0
> cl-lock 0xffff8806ec81bcf8
> ref 2 layers 0xffff8806eac91410:0xffff8808360ad6f8
> cl-lock-slice 0xffff8806eac913f8
> lock 0xffff8806ec81bcf8 obj 0xffff880838b6bf78
> ops 0xffffffffa0884ac0 [vvp_lock_ops]
> linkage 0xffff8808360ad6f8:0xffff8806ec81bd00
> cl-lock-slice 0xffff8808360ad6e0
> lock 0xffff8806ec81bcf8 obj 0xffff8807fb81dec0
> ops 0xffffffffa07b3b00 [lov_lock_ops]
> linkage 0xffff8806ec81bd00:0xffff8806eac91410
> cl-page 0xffff880282daf200
> ref 2 obj 0xffff880838b6bf78 index 0x2f643
> layers 0xffff880282daf2d8:0xffff880282daf328
> parent (nil) child 0xffff8802b7306600 state 2
> flight 0xffff880282daf270:0xffff880282daf270
> type 1 owner (nil) task (nil)
> req (nil)
> page-slice 0xffff880282daf2c0
> page 0xffff880282daf200 obj 0xffff880838b6bf78
> ops 0xffffffffa0884780 [vvp_page_ops]
> cl-obj 0xffff880838b6bf78
> ops 0xffffffffa0884f20 [vvp_ops] slice_off 0xc0
> lu-obj 0xffff880838b6bf78
> header 0xffff880838b6bee0 dev 0xffff880835ece4c0
> ops 0xffffffffa0884f60 [vvp_lu_obj_ops]
> linkage 0xffff8807fb81ded8:0xffff880838b6bf28 depth 0 flags 0x1
> dev_ref (nil)
> lu-obj-hdr 0xffff880838b6bee0
> flags 0 ref 8416 fid 0x2e3545187:0x17ec4(97988):0
> layers 0xffff880838b6bf90:0xffff8807fb81ded8
> cl-obj-hdr 0xffff880838b6bee0
> page_guard 0xdede lock_guard 0x1e1e tree 0xffff880838b6bf40
> pages 0x20de locks 0xffff8806ec81bd10:0xffff8806ec81bd10
> parent (nil) attr_guard 0x1a1a page_bufsize 320 nesting 0
> cl-lock 0xffff8806ec81bcf8
> ref 2 layers 0xffff8806eac91410:0xffff8808360ad6f8
> cl-lock-slice 0xffff8806eac913f8
> lock 0xffff8806ec81bcf8 obj 0xffff880838b6bf78
> ops 0xffffffffa0884ac0 [vvp_lock_ops]
> linkage 0xffff8808360ad6f8:0xffff8806ec81bd00
> cl-lock-slice 0xffff8808360ad6e0
> lock 0xffff8806ec81bcf8 obj 0xffff8807fb81dec0
> ops 0xffffffffa07b3b00 [lov_lock_ops]
> linkage 0xffff8806ec81bd00:0xffff8806eac91410
> ccc-page 0xffff880282daf2c0
> defer_uptodate 0 ra_used 0 write_queued 0
> pending_linkage 0xffff8802840ecaf8:0xffff8802a864aaf8
> page 0xffffea001db19af0
> page-slice 0xffff880282daf310
> page 0xffff880282daf200 obj 0xffff8807fb81dec0
> ops 0xffffffffa07b38e0 [lov_page_ops]
> cl-obj 0xffff8807fb81dec0
> ops 0xffffffffa07b34a0 [lov_ops] slice_off 0x110
> lu-obj 0xffff8807fb81dec0
> header 0xffff880838b6bee0 dev 0xffff881037502440
> ops 0xffffffffa07b34e0 [lov_lu_obj_ops]
> linkage 0xffff880838b6bf28:0xffff880838b6bf90 depth 0 flags 0x1
> dev_ref (nil)
> lu-obj-hdr 0xffff880838b6bee0
> flags 0 ref 8416 fid 0x2e3545187:0x17ec4(97988):0
> layers 0xffff880838b6bf90:0xffff8807fb81ded8
> cl-obj-hdr 0xffff880838b6bee0
> page_guard 0xdede lock_guard 0x1e1e tree 0xffff880838b6bf40
> pages 0x20de locks 0xffff8806ec81bd10:0xffff8806ec81bd10
> parent (nil) attr_guard 0x1a1a page_bufsize 320 nesting 0
> cl-lock 0xffff8806ec81bcf8
> ref 2 layers 0xffff8806eac91410:0xffff8808360ad6f8
> cl-lock-slice 0xffff8806eac913f8
> lock 0xffff8806ec81bcf8 obj 0xffff880838b6bf78
> ops 0xffffffffa0884ac0 [vvp_lock_ops]
> linkage 0xffff8808360ad6f8:0xffff8806ec81bd00
> cl-lock-slice 0xffff8808360ad6e0
> lock 0xffff8806ec81bcf8 obj 0xffff8807fb81dec0
> ops 0xffffffffa07b3b00 [lov_lock_ops]
> linkage 0xffff8806ec81bd00:0xffff8806eac91410
> cl-page 0xffff8802b7306600
> ref 2 obj 0xffff8807f3804f18 index 0x2f643
> layers 0xffff8802b73066d8:0xffff8802b7306700
> parent 0xffff880282daf200 child (nil) state 2
> flight 0xffff8802b7306670:0xffff8802b7306670
> type 1 owner (nil) task (nil)
> req (nil)
> page-slice 0xffff8802b73066c0
> page 0xffff8802b7306600 obj 0xffff8807f3804f18
> ops 0xffffffffa07b4820 [lovsub_page_ops]
> cl-obj 0xffff8807f3804f18
> ops 0xffffffffa07b46e0 [lovsub_ops] slice_off 0xc0
> lu-obj 0xffff8807f3804f18
> header 0xffff8807f3804e80 dev 0xffff8807ed564cc0
> ops 0xffffffffa07b4720 [lovsub_lu_obj_ops]
> linkage 0xffff8807f6cf6e40:0xffff8807f3804ec8 depth 0 flags 0x1
> dev_ref (nil)
> lu-obj-hdr 0xffff8807f3804e80
> flags 0 ref 8416 fid 0x100090000:0xcf08c8(13568200):0
> layers 0xffff8807f3804f30:0xffff8807f6cf6e40
> cl-obj-hdr 0xffff8807f3804e80
> page_guard 0xdede lock_guard 0x1d1d tree 0xffff8807f3804ee0
> pages 0x20de locks 0xffff8806ec81bdf0:0xffff8806ec81bdf0
> parent 0xffff880838b6bee0 attr_guard 0000 page_bufsize 448 nesting 1
> cl-lock 0xffff8806ec81bdd8
> ref 2 layers 0xffff8807f791c4f8:0xffff8806f2bf9bd8
> cl-lock-slice 0xffff8807f791c4e0
> lock 0xffff8806ec81bdd8 obj 0xffff8807f3804f18
> ops 0xffffffffa07b4940 [lovsub_lock_ops]
> linkage 0xffff8806f2bf9bd8:0xffff8806ec81bde0
> cl-lock-slice 0xffff8806f2bf9bc0
> lock 0xffff8806ec81bdd8 obj 0xffff8807f6cf6e28
> ops 0xffffffffa072c5e0 [osc_lock_ops]
> linkage 0xffff8806ec81bde0:0xffff8807f791c4f8
> osc-lock 0xffff8806f2bf9bc0
> lock 0xffff880835afc200 lvb 0xffff8806f2bf9bf0 flags 0x20040000001
> handle 0xffff8806f2bf9c30 einfo 0xffff8806f2bf9c38 state 3
> owner 0xffff8807f07ccbf0
> osc-io 0xffff8807f07ccbf0
> lockless 0 active 0xffff8807fc9be918 info 0xffff8807f07ccc30
> oa 0xffff8807f07ccca0 rpc_sent 0 rc 0
> page-slice 0xffff8802b73066e8
> page 0xffff8802b7306600 obj 0xffff8807f6cf6e28
> ops 0xffffffffa072c3a0 [osc_page_ops]
> cl-obj 0xffff8807f6cf6e28
> ops 0xffffffffa072c240 [osc_ops] slice_off 0xe8
> lu-obj 0xffff8807f6cf6e28
> header 0xffff8807f3804e80 dev 0xffff88103a7c5840
> ops 0xffffffffa072c280 [osc_lu_obj_ops]
> linkage 0xffff8807f3804ec8:0xffff8807f3804f30 depth 0 flags 0x1
> dev_ref (nil)
> lu-obj-hdr 0xffff8807f3804e80
> flags 0 ref 8416 fid 0x100090000:0xcf08c8(13568200):0
> layers 0xffff8807f3804f30:0xffff8807f6cf6e40
> cl-obj-hdr 0xffff8807f3804e80
> page_guard 0xdede lock_guard 0x1d1d tree 0xffff8807f3804ee0
> pages 0x20de locks 0xffff8806ec81bdf0:0xffff8806ec81bdf0
> parent 0xffff880838b6bee0 attr_guard 0000 page_bufsize 448 nesting 1
> cl-lock 0xffff8806ec81bdd8
> ref 2 layers 0xffff8807f791c4f8:0xffff8806f2bf9bd8
> cl-lock-slice 0xffff8807f791c4e0
> lock 0xffff8806ec81bdd8 obj 0xffff8807f3804f18
> ops 0xffffffffa07b4940 [lovsub_lock_ops]
> linkage 0xffff8806f2bf9bd8:0xffff8806ec81bde0
> cl-lock-slice 0xffff8806f2bf9bc0
> lock 0xffff8806ec81bdd8 obj 0xffff8807f6cf6e28
> ops 0xffffffffa072c5e0 [osc_lock_ops]
> linkage 0xffff8806ec81bde0:0xffff8807f791c4f8
> osc-lock 0xffff8806f2bf9bc0
> lock 0xffff880835afc200 lvb 0xffff8806f2bf9bf0 flags 0x20040000001
> handle 0xffff8806f2bf9c30 einfo 0xffff8806f2bf9c38 state 3
> owner 0xffff8807f07ccbf0
> osc-io 0xffff8807f07ccbf0
> lockless 0 active 0xffff8807fc9be918 info 0xffff8807f07ccc30
> oa 0xffff8807f07ccca0 rpc_sent 0 rc 0
> osc-obj 0xffff8807f6cf6e28
> oinfo 0xffff8802a6780f40 root.rb_node 0xffff8807fc9be918
> osc-ext 0xffff8807fc9be918
> obj 0xffff8807f6cf6e28 refc 2 users 1
> link 0xffff8807fc9be940:0xffff8807fc9be940 state 1 intree 1
> rw 0 srvlock 0 memalloc 0 trunc_pending 0 fsync_wait 1 hp 0 urgent 1
> grants 0xb3000 nr_pages 0xb3 pages 0xffff880226e83b18:0xffff88031d475118
> start 0x2f62b end 0x2f6dd max_end 0x2f6ff
> osc-page 0xffff8802b73066e8
> from 0 to 0x1000 pinned 1 in_lru 1
> lru 0xffff8806e6ae9398:0xffff880319e22198
> submitter 0xffff88083930f040 submit_time 0
> osc-async-page 0xffff8802b7306710
> magic 0x845fed cmd 2 interrupted 0
> pending_item 0xffff880319e22118:0xffff8806e6ae9318
> rpc_item 0xffff8802b7306728:0xffff8802b7306728
> obj_off 0x2f643000 page_off 0 async_flags 0x3 request (nil)
> cli 0xffff88103a89a7a8 obj 0xffff8807f6cf6e28 ldlm_lock (nil)
> lock 0x101
> brw-page 0xffff8802b7306748
> off 0 pg 0xffffea001db19af0 count 0x1000 flag 0x420
osc-io->active is 0xffff8807fc9be918, which is the same extent (osc-ext) that contains the page the task is waiting on. An active extent is not written out, so the task is deadlocked: it is waiting for writeback of a page that belongs to an extent the task itself holds active, and that extent therefore won't be written out.
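The condition described above can be written down as a small check. This is only a toy model using the values from the llwp output; `Extent`, `WaitingTask`, and `is_self_deadlock` are hypothetical names, not Lustre structures, and the extent range is assumed to be inclusive at both ends.

```python
from dataclasses import dataclass


@dataclass
class Extent:
    addr: int    # address of the osc-ext
    start: int   # first page index covered
    end: int     # last page index covered (assumed inclusive)


@dataclass
class WaitingTask:
    io_active: int       # osc-io "active" pointer of the task's I/O
    waits_on_index: int  # index of the page whose writeback it waits for


def is_self_deadlock(task, extent):
    """True if the task waits on a page inside the extent it holds active."""
    covers_page = extent.start <= task.waits_on_index <= extent.end
    return covers_page and task.io_active == extent.addr


# Values taken from the llwp output above.
ext = Extent(addr=0xffff8807fc9be918, start=0x2f62b, end=0x2f6dd)
task = WaitingTask(io_active=0xffff8807fc9be918, waits_on_index=0x2f643)

print(is_self_deadlock(task, ext))
```

With a different active extent, or a page outside the 0x2f62b to 0x2f6dd range, the check is false and the wait could eventually complete.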
The solution is to stop marking pages that are part of active extents for writeback (i.e. flushing them) and instead leave them dirty so that they get written later.