Details
- Type: Bug
- Resolution: Fixed
- Priority: Minor
- Fix Version/s: Lustre 1.8.6
- Affects Version/s: None
- Environment: lustre-1.8.5.0-3chaos, RHEL5.5ish (CHAOS4.4-2)
- Severity: 3
- Rank (Obsolete): 4997
Description
In production we fairly often see the message "task pdflush ... blocked for more than 120 seconds" in client console logs. These messages are often followed by console messages reporting timeouts and evictions. On some nodes this appears to be non-fatal; recovery takes place and all is well. On others, the node gets into a state where many threads appear to be stuck in sync_page(), apparently deadlocked.
pdflush usually has this backtrace regardless of whether the hang is fatal:
2011-05-13 14:52:42 INFO: task pdflush:590 blocked for more than 120 seconds.
2011-05-13 14:52:42 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2011-05-13 14:52:42 pdflush D ffff81063d9a37f0 0 590 251 591 589 (L-TLB)
2011-05-13 14:52:42 ffff81063e481aa0 0000000000000046 0000000000000000 ffff81034005bef8
2011-05-13 14:52:42 ffff8103535be050 000000000000000a ffff81033e834080 ffff81063d9a37f0
2011-05-13 14:52:42 0004831b3f0bd508 00000000000b4913 ffff81033e834268 0000000b4005bee8
2011-05-13 14:52:42 Call Trace:
2011-05-13 14:52:42 [<ffffffff8005cf72>] getnstimeofday+0x15/0x2f
2011-05-13 14:52:42 [<ffffffff8002960b>] sync_page+0x0/0x42
2011-05-13 14:52:42 [<ffffffff80066812>] io_schedule+0x3f/0x63
2011-05-13 14:52:42 [<ffffffff80029649>] sync_page+0x3e/0x42
2011-05-13 14:52:42 [<ffffffff80066975>] __wait_on_bit_lock+0x42/0x78
2011-05-13 14:52:42 [<ffffffff80041222>] __lock_page+0x64/0x6b
2011-05-13 14:52:42 [<ffffffff800a822d>] wake_bit_function+0x0/0x2a
2011-05-13 14:52:42 [<ffffffff8001d7a4>] mpage_writepages+0x16b/0x3ad
2011-05-13 14:52:42 [<ffffffff889b5490>] :lustre:ll_writepage_26+0x0/0x10
2011-05-13 14:52:42 [<ffffffff889b548b>] :lustre:generic_writepages+0xb/0x10
2011-05-13 14:52:42 [<ffffffff8005d431>] do_writepages+0x28/0x39
2011-05-13 14:52:42 [<ffffffff80030a9d>] __writeback_single_inode+0x1a3/0x32f
2011-05-13 14:52:42 [<ffffffff80163a26>] list_add+0xc/0xe
2011-05-13 14:52:42 [<ffffffff8003ada0>] generic_drop_inode+0x54/0x153
2011-05-13 14:52:42 [<ffffffff800214e1>] sync_sb_inodes+0x1c0/0x27a
2011-05-13 14:52:42 [<ffffffff80053245>] writeback_inodes+0x87/0xd7
2011-05-13 14:52:42 [<ffffffff800d26e4>] wb_kupdate+0xd4/0x14d
2011-05-13 14:52:42 [<ffffffff80058c34>] pdflush+0x0/0x1e0
2011-05-13 14:52:42 [<ffffffff80058d6f>] pdflush+0x13b/0x1e0
2011-05-13 14:52:42 [<ffffffff800d2610>] wb_kupdate+0x0/0x14d
2011-05-13 14:52:42 [<ffffffff80033905>] kthread+0x100/0x136
2011-05-13 14:52:42 [<ffffffff80028196>] schedule_tail+0x44/0xbe
2011-05-13 14:52:42 [<ffffffff8006101d>] child_rip+0xa/0x11
2011-05-13 14:52:42 [<ffffffff80033805>] kthread+0x0/0x136
2011-05-13 14:52:42 [<ffffffff80061013>] child_rip+0x0/0x11
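For reference, a minimal sketch of how the state of other blocked threads could be captured on an affected client while it is in this condition (this assumes magic-sysrq is available on the node and is only an illustration, not the exact procedure used to collect the trace above):

# check the hung-task watchdog timeout referenced in the console message
cat /proc/sys/kernel/hung_task_timeout_secs
# enable magic-sysrq, then dump backtraces of all tasks in
# uninterruptible (D) state to the kernel ring buffer
echo 1 > /proc/sys/kernel/sysrq
echo w > /proc/sysrq-trigger
# save the resulting backtraces (output file name is arbitrary)
dmesg > /tmp/blocked-tasks.txt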
Attachments
Issue Links
- Trackbacks
  - Lustre 1.8.x known issues tracker: "While testing against Lustre b18 branch, we would hit known bugs which were already reported in Lustre Bugzilla (https://bugzilla.lustre.org/). In order to move away from relying on Bugzilla, we would create a JIRA"