[LU-337] Processes stuck in sync_page on lustre client Created: 17/May/11  Updated: 28/Jun/11  Resolved: 06/Jun/11

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.6
Fix Version/s: Lustre 2.1.0, Lustre 1.8.6

Type: Bug Priority: Minor
Reporter: Christopher Morrone Assignee: Zhenyu Xu
Resolution: Fixed Votes: 0
Labels: None
Environment:

lustre-1.8.5.0-3chaos, RHEL5.5ish (CHAOS4.4-2)


Severity: 3
Rank (Obsolete): 4997

 Description   

In production we are fairly often in the client console logs seeing task pdflush "blocked for more than 120 seconds". Often these are followed by console messages timeouts and evictions. One some nodes, this appears to be non-fatal; recovery takes place and all is well. On others, the node gets into a state where many threads appear to be stuck in sync_page(), apparently in a deadlocked state.

pdflush usually has this backtrace regardless of whether the hang is fatal:

2011-05-13 14:52:42 INFO: task pdflush:590 blocked for more than 120 seconds.
2011-05-13 14:52:42 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2011-05-13 14:52:42 pdflush D ffff81063d9a37f0 0 590 251 591 589 (L-TLB)
2011-05-13 14:52:42 ffff81063e481aa0 0000000000000046 0000000000000000 ffff81034005bef8
2011-05-13 14:52:42 ffff8103535be050 000000000000000a ffff81033e834080 ffff81063d9a37f0
2011-05-13 14:52:42 0004831b3f0bd508 00000000000b4913 ffff81033e834268 0000000b4005bee8
2011-05-13 14:52:42 Call Trace:
2011-05-13 14:52:42 [<ffffffff8005cf72>] getnstimeofday+0x15/0x2f
2011-05-13 14:52:42 [<ffffffff8002960b>] sync_page+0x0/0x42
2011-05-13 14:52:42 [<ffffffff80066812>] io_schedule+0x3f/0x63
2011-05-13 14:52:42 [<ffffffff80029649>] sync_page+0x3e/0x42
2011-05-13 14:52:42 [<ffffffff80066975>] __wait_on_bit_lock+0x42/0x78
2011-05-13 14:52:42 [<ffffffff80041222>] __lock_page+0x64/0x6b
2011-05-13 14:52:42 [<ffffffff800a822d>] wake_bit_function+0x0/0x2a
2011-05-13 14:52:42 [<ffffffff8001d7a4>] mpage_writepages+0x16b/0x3ad
2011-05-13 14:52:42 [<ffffffff889b5490>] :lustre:ll_writepage_26+0x0/0x10
2011-05-13 14:52:42 [<ffffffff889b548b>] :lustre:generic_writepages+0xb/0x10
2011-05-13 14:52:42 [<ffffffff8005d431>] do_writepages+0x28/0x39
2011-05-13 14:52:42 [<ffffffff80030a9d>] __writeback_single_inode+0x1a3/0x32f
2011-05-13 14:52:42 [<ffffffff80163a26>] list_add+0xc/0xe
2011-05-13 14:52:42 [<ffffffff8003ada0>] generic_drop_inode+0x54/0x153
2011-05-13 14:52:42 [<ffffffff800214e1>] sync_sb_inodes+0x1c0/0x27a
2011-05-13 14:52:42 [<ffffffff80053245>] writeback_inodes+0x87/0xd7
2011-05-13 14:52:42 [<ffffffff800d26e4>] wb_kupdate+0xd4/0x14d
2011-05-13 14:52:42 [<ffffffff80058c34>] pdflush+0x0/0x1e0
2011-05-13 14:52:42 [<ffffffff80058d6f>] pdflush+0x13b/0x1e0
2011-05-13 14:52:42 [<ffffffff800d2610>] wb_kupdate+0x0/0x14d
2011-05-13 14:52:42 [<ffffffff80033905>] kthread+0x100/0x136
2011-05-13 14:52:42 [<ffffffff80028196>] schedule_tail+0x44/0xbe
2011-05-13 14:52:42 [<ffffffff8006101d>] child_rip+0xa/0x11
2011-05-13 14:52:42 [<ffffffff80033805>] kthread+0x0/0x136
2011-05-13 14:52:42 [<ffffffff80061013>] child_rip+0x0/0x11



 Comments   
Comment by Christopher Morrone [ 17/May/11 ]

When processes hang, they all appear to be stuck in sync_page. Here's a list of processes and their general state when they appear to be deadlocked:

pdflush: same as trace in bug description
snmp, ptlrpcd, slurmd: all stuck in similar code paths below shrink_slab(), here's an example:

sync_page
io_schedule
sync_page
__wait_on_bit_lock
__lock_page
wake_bit_function
:lustre:ll_page_removal_cb
:ptlrpc:__ldlm_handle2lock
:osc:cache_removal_lock
"osc"osc_extent_blocking_cb
ldlm_cancel_callback
ldlm_cli_cancel_lock
ldlm_bl_to_thread
ldlm_cancel_list
ldlm_cancel_lru_local
ldlm_cancel_passed_policy
ldlm_cancel_lru
ldlm_cli_pool_shrink
ldlm_namespace_move_locked
ldlm_pools_shrink
ldlm_pools_cli_shrink
shrink_slab
zone_reclaim
(different from here up)

Then there is the ll_imp_inval thread:

sync_page
:ptlrpc:lock_res_and_lock
ll_page_removal_cb
__ldlm_handle2lock
cache_remove_lock
osc_extent_blocking_cb
lock_handle_addref
class_handle2object
ldlm_cancel_callback
ldlm_cli_cancel_local
ptlrpc_lprocfs_brw
ldlm_cli_cancel
ll_ocd_update
cleanup_resource
ldlm_namespace_cleanup
osc_check_rpcs
osc_import_event
ptlrpc_abort_inflight
switch_uid
osc_import_event
ptlrpc_invalidate_inport
ptlrpc_invalidate_import_thread

And then the "sync" process:

sync_page
mpage_writepages
:lustre:ll_writepages
:lustre:generic_writepages
do_writepages
__writeback_single_inode
thread_return
sync_sb_inodes
sync_inodes_sb
__sync_inodes
sync_inodes
do_sync
sys_sync

Comment by Peter Jones [ 18/May/11 ]

Bobijam

Could you please look into this one?

Thanks

Peter

Comment by Zhenyu Xu [ 18/May/11 ]

Christopher Morrone,

Just want to be sure whether your lustre-1.8.5.0-3chaos codebase contains the byg 21873 patch, they looks similar, memory collector's waiting for some pages which is under writeback.

Comment by Christopher Morrone [ 19/May/11 ]

Bug 21873 is "Have clusterDB generated pxeconfig files for each node". I think you meant something else.

If the patch went into 1.8.5, we have it.

We keep all of our branches publicly available here:

https://github.com/chaos/lustre

And that tag in particular is here:

https://github.com/chaos/lustre/commits/1.8.5.0-3chaos

Comment by Zhenyu Xu [ 19/May/11 ]

sorry, i meant 21983, and I've checked in the source you provided https://github.com/chaos/lustre/tree/1.8.5.0-3chaos, the patch of 21983 is there.

Comment by Christopher Morrone [ 19/May/11 ]

Ah, bug 21983 was a bug that dealt with readahead, not writeback.

Also, in that bug a single thread was getting stuck on a lock that it held itself. I suspect (but don't know for certain) that is not the case with this one.

Comment by Christopher Morrone [ 19/May/11 ]

The admins alerted me to another stuck node. This time only three processes are stuck: pdflush, sync, ptlrpcd

Note that the sync command was issued as part of the job cleanup script (slurm's job "epilog" script). Our "lflush" script echoes "clean" into all of the lru_size files in /proc, and then calls /bin/sync once.

Ah, in both of the cases I have seen thus far, ptlrpcd is stuck under ldlm_pools_cli_shrink(), originating from an osc_quota_setdq() call that calls cfs_mem_cache_alloc(), which kicks off the cache_alloc_refill, etc.

Comment by Christopher Morrone [ 19/May/11 ]

So perhaps this is indeed another case of using the wrong alloc mask. I see that alloc_qinfo() is doing:

OBD_SLAB_ALLOC(oqi, qinfo_cachep, CFS_ALLOC_STD, sizeof(*oqi));

I am guessing that is needs to be CFS_ALLOC_IO, just like I needed to do in bug 21983 for llap_from_page_with_lockh().

Comment by Christopher Morrone [ 19/May/11 ]

I pushed http://review.whamcloud.com/582 with that one line change, in case you think that is the correct fix.

Comment by Build Master (Inactive) [ 20/May/11 ]

Integrated in lustre-b1_8 » x86_64,client,el5,ofa #58
LU-337 Fix alloc flags in alloc_qinfo()

Johann Lombardi : 74ec6f5c8d1d73108c6de24a82f6384c98f2bac1
Files :

  • lustre/quota/quota_interface.c
Comment by Build Master (Inactive) [ 20/May/11 ]

Integrated in lustre-b1_8 » i686,client,el5,inkernel #58
LU-337 Fix alloc flags in alloc_qinfo()

Johann Lombardi : 74ec6f5c8d1d73108c6de24a82f6384c98f2bac1
Files :

  • lustre/quota/quota_interface.c
Comment by Build Master (Inactive) [ 20/May/11 ]

Integrated in lustre-b1_8 » x86_64,client,ubuntu1004,inkernel #58
LU-337 Fix alloc flags in alloc_qinfo()

Johann Lombardi : 74ec6f5c8d1d73108c6de24a82f6384c98f2bac1
Files :

  • lustre/quota/quota_interface.c
Comment by Build Master (Inactive) [ 20/May/11 ]

Integrated in lustre-b1_8 » i686,client,el5,ofa #58
LU-337 Fix alloc flags in alloc_qinfo()

Johann Lombardi : 74ec6f5c8d1d73108c6de24a82f6384c98f2bac1
Files :

  • lustre/quota/quota_interface.c
Comment by Build Master (Inactive) [ 20/May/11 ]

Integrated in lustre-b1_8 » x86_64,server,el5,inkernel #58
LU-337 Fix alloc flags in alloc_qinfo()

Johann Lombardi : 74ec6f5c8d1d73108c6de24a82f6384c98f2bac1
Files :

  • lustre/quota/quota_interface.c
Comment by Build Master (Inactive) [ 20/May/11 ]

Integrated in lustre-b1_8 » i686,server,el5,inkernel #58
LU-337 Fix alloc flags in alloc_qinfo()

Johann Lombardi : 74ec6f5c8d1d73108c6de24a82f6384c98f2bac1
Files :

  • lustre/quota/quota_interface.c
Comment by Build Master (Inactive) [ 20/May/11 ]

Integrated in lustre-b1_8 » x86_64,client,el6,inkernel #58
LU-337 Fix alloc flags in alloc_qinfo()

Johann Lombardi : 74ec6f5c8d1d73108c6de24a82f6384c98f2bac1
Files :

  • lustre/quota/quota_interface.c
Comment by Build Master (Inactive) [ 20/May/11 ]

Integrated in lustre-b1_8 » x86_64,client,el5,inkernel #58
LU-337 Fix alloc flags in alloc_qinfo()

Johann Lombardi : 74ec6f5c8d1d73108c6de24a82f6384c98f2bac1
Files :

  • lustre/quota/quota_interface.c
Comment by Build Master (Inactive) [ 20/May/11 ]

Integrated in lustre-b1_8 » i686,server,el5,ofa #58
LU-337 Fix alloc flags in alloc_qinfo()

Johann Lombardi : 74ec6f5c8d1d73108c6de24a82f6384c98f2bac1
Files :

  • lustre/quota/quota_interface.c
Comment by Build Master (Inactive) [ 20/May/11 ]

Integrated in lustre-b1_8 » x86_64,server,el5,ofa #58
LU-337 Fix alloc flags in alloc_qinfo()

Johann Lombardi : 74ec6f5c8d1d73108c6de24a82f6384c98f2bac1
Files :

  • lustre/quota/quota_interface.c
Comment by Build Master (Inactive) [ 20/May/11 ]

Integrated in lustre-b1_8 » i686,client,el6,inkernel #58
LU-337 Fix alloc flags in alloc_qinfo()

Johann Lombardi : 74ec6f5c8d1d73108c6de24a82f6384c98f2bac1
Files :

  • lustre/quota/quota_interface.c
Comment by Peter Jones [ 20/May/11 ]

does this patch need porting to master?

Comment by Zhenyu Xu [ 20/May/11 ]

i think so, Christopher Morrone, would you mind pushing it in master branch for review? thanks.

Comment by Christopher Morrone [ 31/May/11 ]

Pushed for master here:

http://review.whamcloud.com/880

Comment by Zhenyu Xu [ 06/Jun/11 ]

patch landed on b1_8 for 1.8.6 and on master for 2.1.0

Comment by Build Master (Inactive) [ 06/Jun/11 ]

Integrated in lustre-master » x86_64,client,el5,inkernel #156
LU-337 Fix alloc flags in alloc_qinfo()

Oleg Drokin : d8506f4b3a03b5605fc927409ce16f55ad5bffd5
Files :

  • lustre/quota/quota_interface.c
Comment by Build Master (Inactive) [ 06/Jun/11 ]

Integrated in lustre-master » x86_64,client,sles11,inkernel #156
LU-337 Fix alloc flags in alloc_qinfo()

Oleg Drokin : d8506f4b3a03b5605fc927409ce16f55ad5bffd5
Files :

  • lustre/quota/quota_interface.c
Comment by Build Master (Inactive) [ 06/Jun/11 ]

Integrated in lustre-master » i686,client,el5,inkernel #156
LU-337 Fix alloc flags in alloc_qinfo()

Oleg Drokin : d8506f4b3a03b5605fc927409ce16f55ad5bffd5
Files :

  • lustre/quota/quota_interface.c
Comment by Build Master (Inactive) [ 06/Jun/11 ]

Integrated in lustre-master » i686,client,el5,ofa #156
LU-337 Fix alloc flags in alloc_qinfo()

Oleg Drokin : d8506f4b3a03b5605fc927409ce16f55ad5bffd5
Files :

  • lustre/quota/quota_interface.c
Comment by Build Master (Inactive) [ 06/Jun/11 ]

Integrated in lustre-master » x86_64,client,ubuntu1004,inkernel #156
LU-337 Fix alloc flags in alloc_qinfo()

Oleg Drokin : d8506f4b3a03b5605fc927409ce16f55ad5bffd5
Files :

  • lustre/quota/quota_interface.c
Comment by Build Master (Inactive) [ 06/Jun/11 ]

Integrated in lustre-master » x86_64,client,el6,inkernel #156
LU-337 Fix alloc flags in alloc_qinfo()

Oleg Drokin : d8506f4b3a03b5605fc927409ce16f55ad5bffd5
Files :

  • lustre/quota/quota_interface.c
Comment by Build Master (Inactive) [ 06/Jun/11 ]

Integrated in lustre-master » x86_64,server,el5,inkernel #156
LU-337 Fix alloc flags in alloc_qinfo()

Oleg Drokin : d8506f4b3a03b5605fc927409ce16f55ad5bffd5
Files :

  • lustre/quota/quota_interface.c
Comment by Build Master (Inactive) [ 06/Jun/11 ]

Integrated in lustre-master » i686,client,el6,inkernel #156
LU-337 Fix alloc flags in alloc_qinfo()

Oleg Drokin : d8506f4b3a03b5605fc927409ce16f55ad5bffd5
Files :

  • lustre/quota/quota_interface.c
Comment by Build Master (Inactive) [ 06/Jun/11 ]

Integrated in lustre-master » x86_64,client,ubuntu1004,ofa #156
LU-337 Fix alloc flags in alloc_qinfo()

Oleg Drokin : d8506f4b3a03b5605fc927409ce16f55ad5bffd5
Files :

  • lustre/quota/quota_interface.c
Comment by Build Master (Inactive) [ 06/Jun/11 ]

Integrated in lustre-master » i686,server,el5,inkernel #156
LU-337 Fix alloc flags in alloc_qinfo()

Oleg Drokin : d8506f4b3a03b5605fc927409ce16f55ad5bffd5
Files :

  • lustre/quota/quota_interface.c
Comment by Build Master (Inactive) [ 06/Jun/11 ]

Integrated in lustre-master » i686,server,el5,ofa #156
LU-337 Fix alloc flags in alloc_qinfo()

Oleg Drokin : d8506f4b3a03b5605fc927409ce16f55ad5bffd5
Files :

  • lustre/quota/quota_interface.c
Comment by Build Master (Inactive) [ 06/Jun/11 ]

Integrated in lustre-master » x86_64,server,el6,inkernel #156
LU-337 Fix alloc flags in alloc_qinfo()

Oleg Drokin : d8506f4b3a03b5605fc927409ce16f55ad5bffd5
Files :

  • lustre/quota/quota_interface.c
Comment by Build Master (Inactive) [ 07/Jun/11 ]

Integrated in lustre-master » x86_64,client,el5,ofa #156
LU-337 Fix alloc flags in alloc_qinfo()

Oleg Drokin : d8506f4b3a03b5605fc927409ce16f55ad5bffd5
Files :

  • lustre/quota/quota_interface.c
Comment by Build Master (Inactive) [ 07/Jun/11 ]

Integrated in lustre-master » i686,server,el6,inkernel #156
LU-337 Fix alloc flags in alloc_qinfo()

Oleg Drokin : d8506f4b3a03b5605fc927409ce16f55ad5bffd5
Files :

  • lustre/quota/quota_interface.c
Comment by Build Master (Inactive) [ 07/Jun/11 ]

Integrated in lustre-master » x86_64,server,el5,ofa #156
LU-337 Fix alloc flags in alloc_qinfo()

Oleg Drokin : d8506f4b3a03b5605fc927409ce16f55ad5bffd5
Files :

  • lustre/quota/quota_interface.c
Generated at Sat Feb 10 01:06:02 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.