Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Affects Version/s: Lustre 2.4.0
    • Fix Version/s: Lustre 2.4.0
    • Environment: Lustre 2.3.58-14chaos (github.com/chaos/lustre)
    • 3
    • 7073

    Description

      We are seeing threads that get stuck indefinitely (I've personally seen one stuck for over an hour) sleeping under osc_page_init() while doing a read(). The backtrace looks like this:

      2013-03-11 18:11:27.964356 {DefaultControlEventListener} [mmcs]{360}.0.0: sysiod        S 00000fff92016460     0 41292   3080 0x00000000
      2013-03-11 18:11:27.964386 {DefaultControlEventListener} [mmcs]{360}.0.0: Call Trace:
      2013-03-11 18:11:27.964416 {DefaultControlEventListener} [mmcs]{360}.0.0: [c0000003c878e7c0] [c0000003c878e850] 0xc0000003c878e850 (unreliable)
      2013-03-11 18:11:27.964446 {DefaultControlEventListener} [mmcs]{360}.0.0: [c0000003c878e990] [c000000000009b2c] .__switch_to+0xc4/0x100
      2013-03-11 18:11:27.964476 {DefaultControlEventListener} [mmcs]{360}.0.0: [c0000003c878ea20] [c00000000044c660] .schedule+0x7d4/0x944
      2013-03-11 18:11:27.964506 {DefaultControlEventListener} [mmcs]{360}.0.0: [c0000003c878ecd0] [8000000000af2090] .cfs_waitq_wait+0x10/0x30 [libcfs]
      2013-03-11 18:11:27.964536 {DefaultControlEventListener} [mmcs]{360}.0.0: [c0000003c878ed40] [80000000046f8c58] .osc_page_init+0xb18/0x1130 [osc]
      2013-03-11 18:11:27.964567 {DefaultControlEventListener} [mmcs]{360}.0.0: [c0000003c878eea0] [800000000252c078] .cl_page_find0+0x378/0xb70 [obdclass]
      2013-03-11 18:11:27.964597 {DefaultControlEventListener} [mmcs]{360}.0.0: [c0000003c878efc0] [800000000519bf80] .lov_page_init_raid0+0x230/0xa20 [lov]
      2013-03-11 18:11:27.964627 {DefaultControlEventListener} [mmcs]{360}.0.0: [c0000003c878f0e0] [8000000005196160] .lov_page_init+0x50/0xa0 [lov]
      2013-03-11 18:11:27.964657 {DefaultControlEventListener} [mmcs]{360}.0.0: [c0000003c878f170] [800000000252c078] .cl_page_find0+0x378/0xb70 [obdclass]
      2013-03-11 18:11:27.964686 {DefaultControlEventListener} [mmcs]{360}.0.0: [c0000003c878f290] [800000000699e098] .ll_readahead+0xdb8/0x1670 [lustre]
      2013-03-11 18:11:27.964717 {DefaultControlEventListener} [mmcs]{360}.0.0: [c0000003c878f400] [80000000069dd38c] .vvp_io_read_page+0x3bc/0x4d0 [lustre]
      2013-03-11 18:11:27.964747 {DefaultControlEventListener} [mmcs]{360}.0.0: [c0000003c878f510] [800000000253d8a4] .cl_io_read_page+0xf4/0x280 [obdclass]
      2013-03-11 18:11:27.964777 {DefaultControlEventListener} [mmcs]{360}.0.0: [c0000003c878f5d0] [800000000699c5dc] .ll_readpage+0xdc/0x2c0 [lustre]
      2013-03-11 18:11:27.964807 {DefaultControlEventListener} [mmcs]{360}.0.0: [c0000003c878f680] [c000000000099144] .generic_file_aio_read+0x500/0x728
      2013-03-11 18:11:27.964837 {DefaultControlEventListener} [mmcs]{360}.0.0: [c0000003c878f7c0] [80000000069dfe24] .vvp_io_read_start+0x274/0x640 [lustre]
      2013-03-11 18:11:27.964867 {DefaultControlEventListener} [mmcs]{360}.0.0: [c0000003c878f8e0] [80000000025398cc] .cl_io_start+0xcc/0x220 [obdclass]
      2013-03-11 18:11:27.964897 {DefaultControlEventListener} [mmcs]{360}.0.0: [c0000003c878f980] [8000000002541724] .cl_io_loop+0x194/0x2c0 [obdclass]
      2013-03-11 18:11:27.964928 {DefaultControlEventListener} [mmcs]{360}.0.0: [c0000003c878fa30] [800000000695a160] .ll_file_io_generic+0x410/0x670 [lustre]
      2013-03-11 18:11:27.964958 {DefaultControlEventListener} [mmcs]{360}.0.0: [c0000003c878fb30] [800000000695af04] .ll_file_aio_read+0x1d4/0x3a0 [lustre]
      2013-03-11 18:11:27.964988 {DefaultControlEventListener} [mmcs]{360}.0.0: [c0000003c878fc00] [800000000695b220] .ll_file_read+0x150/0x320 [lustre]
      2013-03-11 18:11:27.965018 {DefaultControlEventListener} [mmcs]{360}.0.0: [c0000003c878fce0] [c0000000000d429c] .vfs_read+0xd0/0x1c4
      2013-03-11 18:11:27.965048 {DefaultControlEventListener} [mmcs]{360}.0.0: [c0000003c878fd80] [c0000000000d448c] .SyS_read+0x54/0x98
      2013-03-11 18:11:27.965078 {DefaultControlEventListener} [mmcs]{360}.0.0: [c0000003c878fe30] [c000000000000580] syscall_exit+0x0/0x2c
      

      Prakash pointed out that used_mb was maxed out when we found the hung process:

      seqio345-ib0:/proc/fs/lustre/llite/ls1-c0000003cc505000$ cat max_cached_mb
      users: 384
      max_cached_mb: 4096
      used_mb: 4096
      unused_mb: 0
      reclaim_count: 11201
      

      When I tried to cat a file in the same ls1 filesystem, it got stuck in the same call path as the sysiod shown above.

      Issuing an "echo 3 > /proc/sys/vm/drop_caches" gets things moving again.

      This is on a Sequoia ION, a ppc64 Lustre client, running Lustre version 2.3.58-14chaos.

      Attachments

        Activity

          [LU-2948] Client read hang in osc_page_init()

          I haven't been able to test this on Sequoia the past few weeks, but I'm sure it hasn't gone away on its own. That said, without a reproducer, I'm OK with dropping the priority until I can get you some more information.

          prakash Prakash Surya (Inactive) added a comment

          Please reopen this issue if this problem is seen again.

          jay Jinshan Xiong (Inactive) added a comment

          Reducing priority until this is hit again.

          jlevi Jodi Levi (Inactive) added a comment

          Jinshan, unfortunately I'm not able to easily reproduce LU-2576.

          prakash Prakash Surya (Inactive) added a comment

          Sorry for the delayed response.

          Hi Bruno, yes, that sounds pretty reasonable. It doesn't make any sense for the process to try again if someone is already doing the work. Let's see if the patch works; I will make that change if I get a chance to work out a new patch.

          Hi Prakash, there are probably fewer LRU pages than lru_shrink_min in that cli. The pages can be in use because they are being rewritten, because readahead pages are still being read, or because of a write followed by a read. It's quite common for cached pages to be reused; otherwise we wouldn't need to cache them at all. BTW, were you able to reproduce LU-2576 easily before? We recently found that the patch for LU-2576 causes another problem, so we want to come up with another fix for that issue.

          jay Jinshan Xiong (Inactive) added a comment

          Hmm.. I'm not fully convinced this patch will help in that case. Even if osc_lru_reclaim returns > 0, there is still a chance another thread comes in and "steals" those slots before we can decrement cl_lru_left. That hasn't changed as far as I can tell.

          This patch does appear to help the case where an OSC has cl_lru_in_list > 0 but returns zero from osc_lru_reclaim. With the patch, it will move on to the next OSC, where it would just return zero previously. It's worth a try, but without evidence pointing to exactly why it failed in the first place it's hard to say whether this will work or not.

          One thing I noticed when I was reading the code is that we skip any pages that are "in use":

          osc_lru_shrink:
                  if (cl_page_in_use_noref(page)) {
                          cfs_list_move_tail(&opg->ops_lru, &cli->cl_lru_list);
                          continue;
                  }
          

          What are the chances that the first lru_shrink_min pages will all be in use? What circumstances mark a page as in use?

          prakash Prakash Surya (Inactive) added a comment
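
          To make the race described in the comment above concrete, here is a minimal, hypothetical C sketch of a reserve path with that shape. The names lru_left, my_reclaim(), my_try_grab() and my_try_reserve() are illustrative only and not the real Lustre symbols; this is a sketch of the pattern, not the actual osc_lru_reserve() code.

          /*
           * Hypothetical sketch only: lru_left, my_reclaim(), my_try_grab()
           * and my_try_reserve() are illustrative names, not Lustre symbols.
           */
          #include <stdatomic.h>
          #include <stdbool.h>
          #include <stdio.h>

          static atomic_long lru_left;    /* free LRU slots (the budget) */

          /* Pretend to free some cached pages and publish the released slots. */
          static long my_reclaim(void)
          {
                  long freed = 16;

                  atomic_fetch_add(&lru_left, freed);
                  return freed;
          }

          /* One attempt to take a slot; give it back if we raced below zero. */
          static bool my_try_grab(void)
          {
                  if (atomic_fetch_sub(&lru_left, 1) > 0)
                          return true;
                  atomic_fetch_add(&lru_left, 1);
                  return false;
          }

          static bool my_try_reserve(void)
          {
                  if (my_try_grab())
                          return true;

                  /*
                   * Out of budget, so reclaim.  Even when my_reclaim() returns
                   * a value > 0, other threads running this same path can use
                   * up every freed slot before the retry below, so a single
                   * retry followed by an unconditional sleep can leave this
                   * thread stuck, which is the shape of the hang seen in
                   * osc_page_init().
                   */
                  if (my_reclaim() > 0 && my_try_grab())
                          return true;

                  return false;           /* caller would go to sleep here */
          }

          int main(void)
          {
                  printf("reserved: %s\n", my_try_reserve() ? "yes" : "no");
                  return 0;
          }

          In this single-threaded form the reserve always succeeds; the point is only to show the window in which a concurrent caller could consume the freed slots before the retry.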

          Prakash, thanks for the full stack log. But it does not show any other issue/info beyond the threads waiting in osc_page_init().

          Jinshan, I agree with you and your patch: scanning all of the OSCs' LRUs in osc_lru_reclaim() will do a much better garbage-collection job and will ensure that the maximum number of pages becomes available. But I think we then need a way to do this full scan only once for a bunch of concurrent threads/requesters. This could be achieved simply in osc_lru_reclaim() by using spin_trylock() to test whether cache->ccc_lru_lock is already held by somebody and returning 0 if it is (to wait for LRU frees in osc_lru_reserve()), and by not releasing cache->ccc_lru_lock inside the loop.

          What do you think?

          bfaccini Bruno Faccini (Inactive) added a comment
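
          A minimal sketch of the single-scan guard suggested in the comment above, assuming a pthread mutex stands in for cache->ccc_lru_lock. The names lru_scan_lock, do_full_lru_scan() and my_lru_reclaim() are illustrative only; this is not the actual Lustre code and only shows the control flow of the trylock idea.

          #include <pthread.h>
          #include <stdio.h>

          static pthread_mutex_t lru_scan_lock = PTHREAD_MUTEX_INITIALIZER;

          /* Stand-in for walking every OSC LRU and freeing pages. */
          static long do_full_lru_scan(void)
          {
                  return 0;       /* slots freed; stub for illustration */
          }

          static long my_lru_reclaim(void)
          {
                  long freed;

                  /*
                   * If another thread already holds the lock, it is doing the
                   * full scan on everybody's behalf; return 0 and let the
                   * caller wait for the slots that scan will free instead of
                   * scanning again.
                   */
                  if (pthread_mutex_trylock(&lru_scan_lock) != 0)
                          return 0;

                  freed = do_full_lru_scan();
                  pthread_mutex_unlock(&lru_scan_lock);
                  return freed;
          }

          int main(void)
          {
                  printf("freed: %ld\n", my_lru_reclaim());
                  return 0;
          }

          Whether returning 0 here is acceptable depends on how the caller waits for freed slots in the real osc_lru_reserve() path; the sketch only illustrates serializing the scan, not that waiting logic.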

          In the original implementation, it only tried to shrink the other client_obds once if there was no available LRU budget. This is racy because the just-freed slots may be used immediately by other threads, which will put the current thread to sleep. And nobody is likely to wake it up, since we only do that when a page is being released.

          jay Jinshan Xiong (Inactive) added a comment

          Out of curiosity, what makes you think that will fix the issue? What do you think the underlying issue is?

          prakash Prakash Surya (Inactive) added a comment

          Please try this patch: http://review.whamcloud.com/5760

          This patch has not been verified yet, so use it at your own risk.

          jay Jinshan Xiong (Inactive) added a comment

          People

            Assignee: bfaccini Bruno Faccini (Inactive)
            Reporter: morrone Christopher Morrone (Inactive)
            Votes: 0
            Watchers: 6

            Dates

              Created:
              Updated:
              Resolved: