[LU-4157] Removing files hangs with 100%CPU on 3.12-rc7 client Created: 28/Oct/13  Updated: 27/Dec/13  Resolved: 27/Nov/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.1
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Roland Fehrenbacher Assignee: WC Triage
Resolution: Fixed Votes: 0
Labels: None
Environment:

Vanilla kernel 3.12-rc7 client mounting a 2.4.1 ZFS server that works fine with a 2.4.1 client.


Issue Links:
Related
is related to LU-4416 support for 3.12 linux kernel Resolved
Severity: 4
Rank (Obsolete): 11279

 Description   

Deleting files with the in-kernel client (3.12-rc7) is impossible. The rm command gets stuck at 100% CPU and cannot be killed. An example call trace follows:

Oct 23 15:49:05 beo-05 kernel: [ 1361.539903] [<ffffffff815020ff>] ? __schedule+0x2ff/0x8d0
Oct 23 15:49:05 beo-05 kernel: [ 1361.539907] [<ffffffff812567c4>] ? radix_tree_next_chunk+0x1a4/0x210
Oct 23 15:49:05 beo-05 kernel: [ 1361.539911] [<ffffffff810d665a>] ? find_get_pages+0xca/0x150
Oct 23 15:49:05 beo-05 kernel: [ 1361.539914] [<ffffffff810d666a>] ? find_get_pages+0xda/0x150
Oct 23 15:49:05 beo-05 kernel: [ 1361.539917] [<ffffffff810e042d>] ? pagevec_lookup+0x1d/0x30
Oct 23 15:49:05 beo-05 kernel: [ 1361.539921] [<ffffffff810e20f1>] ? truncate_inode_pages_range.part.11+0xa1/0x630
Oct 23 15:49:05 beo-05 kernel: [ 1361.539925] [<ffffffffa09376e5>] ? lmv_lock_match+0xf5/0x2d0 [lmv]
Oct 23 15:49:05 beo-05 kernel: [ 1361.539933] [<ffffffffa087dffb>] ? ll_have_md_lock+0x14b/0x3e0 [lustre]
Oct 23 15:49:05 beo-05 kernel: [ 1361.539936] [<ffffffff810e26c1>] ? truncate_inode_pages_range+0x41/0x50
Oct 23 15:49:05 beo-05 kernel: [ 1361.539939] [<ffffffff810e2750>] ? truncate_inode_pages+0x10/0x20
Oct 23 15:49:05 beo-05 kernel: [ 1361.539947] [<ffffffffa089d4c7>] ? ll_md_blocking_ast+0x447/0x650 [lustre]
Oct 23 15:49:05 beo-05 kernel: [ 1361.539951] [<ffffffff8120314c>] ? fuse_request_send_nowait_locked+0x6c/0xd0
Oct 23 15:49:05 beo-05 kernel: [ 1361.539960] [<ffffffffa05dd177>] ? ldlm_cancel_callback+0x67/0x190 [ptlrpc]
Oct 23 15:49:05 beo-05 kernel: [ 1361.539969] [<ffffffffa05e68aa>] ? ldlm_cli_cancel_local+0x7a/0x3c0 [ptlrpc]
Oct 23 15:49:05 beo-05 kernel: [ 1361.539979] [<ffffffffa05e8fbd>] ? ldlm_cli_cancel_list_local+0xdd/0x240 [ptlrpc]
Oct 23 15:49:05 beo-05 kernel: [ 1361.539989] [<ffffffffa05e9295>] ? ldlm_cancel_resource_local+0x175/0x1e0 [ptlrpc]
Oct 23 15:49:05 beo-05 kernel: [ 1361.539995] [<ffffffffa07806d7>] ? mdc_resource_get_unused+0xd7/0x170 [mdc]
Oct 23 15:49:05 beo-05 kernel: [ 1361.539998] [<ffffffff810d60c8>] ? filemap_fault+0x88/0x550
Oct 23 15:49:05 beo-05 kernel: [ 1361.540003] [<ffffffffa078170d>] ? mdc_unlink+0x9d/0x4d0 [mdc]
Oct 23 15:49:05 beo-05 kernel: [ 1361.540007] [<ffffffffa0944fba>] ? lmv_unlink+0x1ba/0x500 [lmv]
Oct 23 15:49:05 beo-05 kernel: [ 1361.540015] [<ffffffffa08a2ff4>] ? ll_unlink+0x164/0x410 [lustre]
Oct 23 15:49:05 beo-05 kernel: [ 1361.540018] [<ffffffff81144f1d>] ? vfs_unlink+0x8d/0x100
Oct 23 15:49:05 beo-05 kernel: [ 1361.540022] [<ffffffff81145123>] ? do_unlinkat+0x193/0x230
Oct 23 15:49:05 beo-05 kernel: [ 1361.540025] [<ffffffff81137ce4>] ? vfs_read+0x124/0x170
Oct 23 15:49:05 beo-05 kernel: [ 1361.540029] [<ffffffff8114777d>] ? SyS_unlinkat+0x1d/0x40
Oct 23 15:49:05 beo-05 kernel: [ 1361.540032] [<ffffffff8150a226>] ? system_call_fastpath+0x1a/0x1f



 Comments   
Comment by Peng Tao [ 30/Oct/13 ]

It is the same hang I fixed before. The patch is queued by Greg KH in the staging tree but is not yet in Linus's tree.

The root cause is a generic-layer change that makes truncate_inode_pages_range stop truncating the page at index ~0UL, while Lustre always puts the first page of a directory inode's mapping at index ~0UL.

See the patch for details https://git.kernel.org/cgit/linux/kernel/git/gregkh/staging.git/commit/?h=staging-next&id=363090e74f3865c589f4026b40865596b0212f90
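
To illustrate the root cause described above, here is a minimal user-space sketch in C. It is not kernel code: the toy_mapping structure and both toy_truncate_* functions are hypothetical names invented for this example. It assumes the generic change amounts to treating the upper bound of the truncated range as exclusive, so a page cached at index ~0UL (ULONG_MAX) can never fall inside the range.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Toy model (hypothetical): a "mapping" is a small array of cached page indices. */
#define MAX_PAGES 8

struct toy_mapping {
    unsigned long index[MAX_PAGES];
    size_t count;
};

/*
 * Drop every page whose index lies in [start, end), with end EXCLUSIVE.
 * Assumption: this mirrors the behaviour after the generic-layer change.
 * Returns the number of pages still cached.
 */
size_t toy_truncate_exclusive(struct toy_mapping *m,
                              unsigned long start, unsigned long end)
{
    size_t kept = 0;

    assert(m->count <= MAX_PAGES);
    for (size_t i = 0; i < m->count; i++) {
        unsigned long idx = m->index[i];

        if (idx >= start && idx < end)
            continue;               /* page is truncated */
        m->index[kept++] = idx;     /* page survives */
    }
    m->count = kept;
    return kept;
}

/*
 * Drop every page whose index lies in [start, last], with last INCLUSIVE.
 * Returns the number of pages still cached.
 */
size_t toy_truncate_inclusive(struct toy_mapping *m,
                              unsigned long start, unsigned long last)
{
    size_t kept = 0;

    assert(m->count <= MAX_PAGES);
    for (size_t i = 0; i < m->count; i++) {
        unsigned long idx = m->index[i];

        if (idx >= start && idx <= last)
            continue;               /* page is truncated */
        m->index[kept++] = idx;     /* page survives */
    }
    m->count = kept;
    return kept;
}
```

With a mapping holding pages at indices ULONG_MAX, 1 and 2, calling toy_truncate_exclusive(&m, 0, ULONG_MAX) leaves the page at ULONG_MAX behind, while toy_truncate_inclusive(&m, 0, ULONG_MAX) removes all three. In this model, a caller that retries until the mapping is empty would never finish in the exclusive case.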

Comment by Roland Fehrenbacher [ 01/Nov/13 ]

Thanks Peng. I applied the patch and it fixes the problem indeed. Would be great if it could move to 3.12 final (together with the patch in https://jira.hpdd.intel.com/browse/LU-4127 maybe?).

Comment by Peng Tao [ 01/Nov/13 ]

LU-4127 was also fixed in upstream. See https://git.kernel.org/cgit/linux/kernel/git/gregkh/staging.git/commit/?h=staging-next&id=86bac591def1a5b7060c0834828b1eaabfe7f0a7

The two patches are both queued by Greg in his staging-next branch, but I do not know when he will push them to Linus.

Comment by Roland Fehrenbacher [ 02/Nov/13 ]

OK, thanks for letting me know. I assume this will go into a 3.12.x then. Without these patches, the lustre client is absolutely useless.

Comment by Dmitry Eremin (Inactive) [ 27/Nov/13 ]

It's fixed in upstream.
