Lustre / LU-337

Processes stuck in sync_page on lustre client

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.1.0, Lustre 1.8.6
    • Affects Version/s: Lustre 1.8.6
    • Labels: None
    • Environment: lustre-1.8.5.0-3chaos, RHEL5.5ish (CHAOS4.4-2)
    • Severity: 3
    • Rank: 4997

    Description

      In production, the client console logs fairly often show task pdflush "blocked for more than 120 seconds". These messages are often followed by console messages about timeouts and evictions. On some nodes, this appears to be non-fatal; recovery takes place and all is well. On others, the node gets into a state where many threads appear to be stuck in sync_page(), apparently deadlocked.

      pdflush usually has this backtrace regardless of whether the hang is fatal:

      2011-05-13 14:52:42 INFO: task pdflush:590 blocked for more than 120 seconds.
      2011-05-13 14:52:42 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      2011-05-13 14:52:42 pdflush D ffff81063d9a37f0 0 590 251 591 589 (L-TLB)
      2011-05-13 14:52:42 ffff81063e481aa0 0000000000000046 0000000000000000 ffff81034005bef8
      2011-05-13 14:52:42 ffff8103535be050 000000000000000a ffff81033e834080 ffff81063d9a37f0
      2011-05-13 14:52:42 0004831b3f0bd508 00000000000b4913 ffff81033e834268 0000000b4005bee8
      2011-05-13 14:52:42 Call Trace:
      2011-05-13 14:52:42 [<ffffffff8005cf72>] getnstimeofday+0x15/0x2f
      2011-05-13 14:52:42 [<ffffffff8002960b>] sync_page+0x0/0x42
      2011-05-13 14:52:42 [<ffffffff80066812>] io_schedule+0x3f/0x63
      2011-05-13 14:52:42 [<ffffffff80029649>] sync_page+0x3e/0x42
      2011-05-13 14:52:42 [<ffffffff80066975>] __wait_on_bit_lock+0x42/0x78
      2011-05-13 14:52:42 [<ffffffff80041222>] __lock_page+0x64/0x6b
      2011-05-13 14:52:42 [<ffffffff800a822d>] wake_bit_function+0x0/0x2a
      2011-05-13 14:52:42 [<ffffffff8001d7a4>] mpage_writepages+0x16b/0x3ad
      2011-05-13 14:52:42 [<ffffffff889b5490>] :lustre:ll_writepage_26+0x0/0x10
      2011-05-13 14:52:42 [<ffffffff889b548b>] :lustre:generic_writepages+0xb/0x10
      2011-05-13 14:52:42 [<ffffffff8005d431>] do_writepages+0x28/0x39
      2011-05-13 14:52:42 [<ffffffff80030a9d>] __writeback_single_inode+0x1a3/0x32f
      2011-05-13 14:52:42 [<ffffffff80163a26>] list_add+0xc/0xe
      2011-05-13 14:52:42 [<ffffffff8003ada0>] generic_drop_inode+0x54/0x153
      2011-05-13 14:52:42 [<ffffffff800214e1>] sync_sb_inodes+0x1c0/0x27a
      2011-05-13 14:52:42 [<ffffffff80053245>] writeback_inodes+0x87/0xd7
      2011-05-13 14:52:42 [<ffffffff800d26e4>] wb_kupdate+0xd4/0x14d
      2011-05-13 14:52:42 [<ffffffff80058c34>] pdflush+0x0/0x1e0
      2011-05-13 14:52:42 [<ffffffff80058d6f>] pdflush+0x13b/0x1e0
      2011-05-13 14:52:42 [<ffffffff800d2610>] wb_kupdate+0x0/0x14d
      2011-05-13 14:52:42 [<ffffffff80033905>] kthread+0x100/0x136
      2011-05-13 14:52:42 [<ffffffff80028196>] schedule_tail+0x44/0xbe
      2011-05-13 14:52:42 [<ffffffff8006101d>] child_rip+0xa/0x11
      2011-05-13 14:52:42 [<ffffffff80033805>] kthread+0x0/0x136
      2011-05-13 14:52:42 [<ffffffff80061013>] child_rip+0x0/0x11
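
      To gauge how many tasks on a node are stuck this way, a console log can be scanned for hung-task reports like the one above. The following is only a rough sketch, assuming the log uses the same line format as this excerpt; the function names parse_hung_tasks and stuck_in are hypothetical helpers for illustration, not part of Lustre or the kernel.

```python
import re
from collections import Counter

# Header line of a hung-task report, as seen in the console log excerpt.
HUNG_RE = re.compile(
    r"INFO: task (?P<name>\S+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)
# One stack frame, e.g. "[<ffffffff80029649>] sync_page+0x3e/0x42" or
# "[<ffffffff889b5490>] :lustre:ll_writepage_26+0x0/0x10" (module prefix optional).
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?::(?P<module>\w+):)?(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def parse_hung_tasks(lines):
    """Collect one dict per hung-task report: task name, pid, timeout, frame names."""
    reports, current, in_trace = [], None, False
    for line in lines:
        m = HUNG_RE.search(line)
        if m:
            # A new report starts; frames are gathered once "Call Trace:" appears.
            current = {
                "task": m.group("name"),
                "pid": int(m.group("pid")),
                "secs": int(m.group("secs")),
                "frames": [],
            }
            reports.append(current)
            in_trace = False
            continue
        if current is None:
            continue
        if "Call Trace:" in line:
            in_trace = True
            continue
        if in_trace:
            f = FRAME_RE.search(line)
            if f:
                current["frames"].append(f.group("func"))
            else:
                # First non-frame line after the trace ends this report.
                current, in_trace = None, False
    return reports


def stuck_in(reports, func="sync_page"):
    """Count reports per task name whose trace contains the given function."""
    return Counter(r["task"] for r in reports if func in r["frames"])
```

      Passing the lines of a saved console log to parse_hung_tasks and the result to stuck_in gives a per-task count of reports whose trace passes through sync_page. It does not distinguish the fatal hangs from the recoverable ones, since the description notes that pdflush shows this backtrace in both cases.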

            People

              bobijam Zhenyu Xu
              morrone Christopher Morrone (Inactive)