Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-2948

Client read hang in osc_page_init()

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.4.0
    • Lustre 2.4.0
    • Lustre 2.3.58-14chaos (github.com/chaos/lustre)
    • 3
    • 7073

    Description

      We are seeing threads that get indefinitely stuck (I've personally seen it stuck there for over an hour) sleeping under osc_page_init() while doing a read(). The backtrace looks like this:

      2013-03-11 18:11:27.964356 {DefaultControlEventListener} [mmcs]{360}.0.0: sysiod        S 00000fff92016460     0 41292   3080 0x00000000
      2013-03-11 18:11:27.964386 {DefaultControlEventListener} [mmcs]{360}.0.0: Call Trace:
      2013-03-11 18:11:27.964416 {DefaultControlEventListener} [mmcs]{360}.0.0: [c0000003c878e7c0] [c0000003c878e850] 0xc0000003c878e850 (unreliable)
      2013-03-11 18:11:27.964446 {DefaultControlEventListener} [mmcs]{360}.0.0: [c0000003c878e990] [c000000000009b2c] .__switch_to+0xc4/0x100
      2013-03-11 18:11:27.964476 {DefaultControlEventListener} [mmcs]{360}.0.0: [c0000003c878ea20] [c00000000044c660] .schedule+0x7d4/0x944
      2013-03-11 18:11:27.964506 {DefaultControlEventListener} [mmcs]{360}.0.0: [c0000003c878ecd0] [8000000000af2090] .cfs_waitq_wait+0x10/0x30 [libcfs]
      2013-03-11 18:11:27.964536 {DefaultControlEventListener} [mmcs]{360}.0.0: [c0000003c878ed40] [80000000046f8c58] .osc_page_init+0xb18/0x1130 [osc]
      2013-03-11 18:11:27.964567 {DefaultControlEventListener} [mmcs]{360}.0.0: [c0000003c878eea0] [800000000252c078] .cl_page_find0+0x378/0xb70 [obdclass]
      2013-03-11 18:11:27.964597 {DefaultControlEventListener} [mmcs]{360}.0.0: [c0000003c878efc0] [800000000519bf80] .lov_page_init_raid0+0x230/0xa20 [lov]
      2013-03-11 18:11:27.964627 {DefaultControlEventListener} [mmcs]{360}.0.0: [c0000003c878f0e0] [8000000005196160] .lov_page_init+0x50/0xa0 [lov]
      2013-03-11 18:11:27.964657 {DefaultControlEventListener} [mmcs]{360}.0.0: [c0000003c878f170] [800000000252c078] .cl_page_find0+0x378/0xb70 [obdclass]
      2013-03-11 18:11:27.964686 {DefaultControlEventListener} [mmcs]{360}.0.0: [c0000003c878f290] [800000000699e098] .ll_readahead+0xdb8/0x1670 [lustre]
      2013-03-11 18:11:27.964717 {DefaultControlEventListener} [mmcs]{360}.0.0: [c0000003c878f400] [80000000069dd38c] .vvp_io_read_page+0x3bc/0x4d0 [lustre]
      2013-03-11 18:11:27.964747 {DefaultControlEventListener} [mmcs]{360}.0.0: [c0000003c878f510] [800000000253d8a4] .cl_io_read_page+0xf4/0x280 [obdclass]
      2013-03-11 18:11:27.964777 {DefaultControlEventListener} [mmcs]{360}.0.0: [c0000003c878f5d0] [800000000699c5dc] .ll_readpage+0xdc/0x2c0 [lustre]
      2013-03-11 18:11:27.964807 {DefaultControlEventListener} [mmcs]{360}.0.0: [c0000003c878f680] [c000000000099144] .generic_file_aio_read+0x500/0x728
      2013-03-11 18:11:27.964837 {DefaultControlEventListener} [mmcs]{360}.0.0: [c0000003c878f7c0] [80000000069dfe24] .vvp_io_read_start+0x274/0x640 [lustre]
      2013-03-11 18:11:27.964867 {DefaultControlEventListener} [mmcs]{360}.0.0: [c0000003c878f8e0] [80000000025398cc] .cl_io_start+0xcc/0x220 [obdclass]
      2013-03-11 18:11:27.964897 {DefaultControlEventListener} [mmcs]{360}.0.0: [c0000003c878f980] [8000000002541724] .cl_io_loop+0x194/0x2c0 [obdclass]
      2013-03-11 18:11:27.964928 {DefaultControlEventListener} [mmcs]{360}.0.0: [c0000003c878fa30] [800000000695a160] .ll_file_io_generic+0x410/0x670 [lustre]
      2013-03-11 18:11:27.964958 {DefaultControlEventListener} [mmcs]{360}.0.0: [c0000003c878fb30] [800000000695af04] .ll_file_aio_read+0x1d4/0x3a0 [lustre]
      2013-03-11 18:11:27.964988 {DefaultControlEventListener} [mmcs]{360}.0.0: [c0000003c878fc00] [800000000695b220] .ll_file_read+0x150/0x320 [lustre]
      2013-03-11 18:11:27.965018 {DefaultControlEventListener} [mmcs]{360}.0.0: [c0000003c878fce0] [c0000000000d429c] .vfs_read+0xd0/0x1c4
      2013-03-11 18:11:27.965048 {DefaultControlEventListener} [mmcs]{360}.0.0: [c0000003c878fd80] [c0000000000d448c] .SyS_read+0x54/0x98
      2013-03-11 18:11:27.965078 {DefaultControlEventListener} [mmcs]{360}.0.0: [c0000003c878fe30] [c000000000000580] syscall_exit+0x0/0x2c
      

      Prakash pointed out that used_mb is maxed out when we found the hung process:

      seqio345-ib0:/proc/fs/lustre/llite/ls1-c0000003cc505000$ cat max_cached_mb
      users: 384
      max_cached_mb: 4096
      used_mb: 4096
      unused_mb: 0
      reclaim_count: 11201
      

      When I tried to cat a file in the same ls1 filesystem, it got stuck in the same call path as the sysiod shown above.

      Issuing an "echo 3 > /proc/sys/vm/drop_caches" gets things moving again.

      This on a Sequoia ION, ppc64 Lustre client. Lustre version 2.3.58-14chaos.

      Attachments

        Activity

          People

            bfaccini Bruno Faccini (Inactive)
            morrone Christopher Morrone (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: