Details


    Description

      It looks like LU-2576 is back again. The problem went away for a while, seemingly thanks to the patch from LU-2576. I note that a later fix for a stack overflow in LU-2859 changed the same lines where the LU-2576 fix was applied, so perhaps that reintroduced the problem?

      We are seeing hangs during writes on BG/Q hardware. We find tasks that appear to be stuck sleeping indefinitely here:

      2013-05-29 16:15:36.363547  sysiod        S 00000fffa85363bc     0  4111   3070 0x00000000
      2013-05-29 16:15:36.363582  Call Trace:
      2013-05-29 16:15:36.363617  [c0000002e1e12440] [c000000302e95cc8] 0xc000000302e95cc8 (unreliable)
      2013-05-29 16:15:36.363653  [c0000002e1e12610] [c000000000008de0] .__switch_to+0xc4/0x100
      2013-05-29 16:15:36.363688  [c0000002e1e126a0] [c00000000044dc68] .schedule+0x858/0x9c0
      2013-05-29 16:15:36.363723  [c0000002e1e12950] [80000000004820a0] .cfs_waitq_wait+0x10/0x30 [libcfs]
      2013-05-29 16:15:36.363758  [c0000002e1e129c0] [80000000015b3ccc] .osc_enter_cache+0xb6c/0x1410 [osc]
      2013-05-29 16:15:36.363793  [c0000002e1e12ba0] [80000000015bbf30] .osc_queue_async_io+0xcd0/0x2690 [osc]
      2013-05-29 16:15:36.363828  [c0000002e1e12db0] [8000000001598598] .osc_page_cache_add+0xf8/0x2a0 [osc]
      2013-05-29 16:15:36.363863  [c0000002e1e12e70] [8000000000a04248] .cl_page_cache_add+0xf8/0x420 [obdclass]
      2013-05-29 16:15:36.363898  [c0000002e1e12fa0] [800000000179ed28] .lov_page_cache_add+0xc8/0x340 [lov]
      2013-05-29 16:15:36.363934  [c0000002e1e13070] [8000000000a04248] .cl_page_cache_add+0xf8/0x420 [obdclass]
      2013-05-29 16:15:36.363968  [c0000002e1e131a0] [8000000001d2ac74] .vvp_io_commit_write+0x464/0x910 [lustre]
      2013-05-29 16:15:36.364003  [c0000002e1e132c0] [8000000000a1df6c] .cl_io_commit_write+0x11c/0x2d0 [obdclass]
      2013-05-29 16:15:36.364038  [c0000002e1e13380] [8000000001cebc00] .ll_commit_write+0x120/0x3e0 [lustre]
      2013-05-29 16:15:36.364074  [c0000002e1e13450] [8000000001d0f134] .ll_write_end+0x34/0x80 [lustre]
      2013-05-29 16:15:36.364109  [c0000002e1e134e0] [c000000000097238] .generic_file_buffered_write+0x1f4/0x388
      2013-05-29 16:15:36.364143  [c0000002e1e13620] [c000000000097928] .__generic_file_aio_write+0x374/0x3d8
      2013-05-29 16:15:36.364178  [c0000002e1e13720] [c000000000097a04] .generic_file_aio_write+0x78/0xe8
      2013-05-29 16:15:36.364213  [c0000002e1e137d0] [8000000001d2df00] .vvp_io_write_start+0x170/0x3b0 [lustre]
      2013-05-29 16:15:36.364248  [c0000002e1e138a0] [8000000000a1849c] .cl_io_start+0xcc/0x220 [obdclass]
      2013-05-29 16:15:36.364283  [c0000002e1e13940] [8000000000a202a4] .cl_io_loop+0x194/0x2c0 [obdclass]
      2013-05-29 16:15:36.364317  [c0000002e1e139f0] [8000000001ca0780] .ll_file_io_generic+0x4f0/0x850 [lustre]
      2013-05-29 16:15:36.364352  [c0000002e1e13b30] [8000000001ca0f64] .ll_file_aio_write+0x1d4/0x3a0 [lustre]
      2013-05-29 16:15:36.364387  [c0000002e1e13c00] [8000000001ca1280] .ll_file_write+0x150/0x320 [lustre]
      2013-05-29 16:15:36.364422  [c0000002e1e13ce0] [c0000000000d4328] .vfs_write+0xd0/0x1c4
      2013-05-29 16:15:36.364458  [c0000002e1e13d80] [c0000000000d4518] .SyS_write+0x54/0x98
      2013-05-29 16:15:36.364492  [c0000002e1e13e30] [c000000000000580] syscall_exit+0x0/0x2c
      

      This was with Lustre 2.4.0-RC1_3chaos.

      Attachments

        1. rzuseqio13_drop_caches.txt.bz2
          0.2 kB
          Christopher Morrone
        2. rzuseqio14_drop_caches.txt.bz2
          4.62 MB
          Christopher Morrone
        3. rzuseqio14_lustre_log.txt
          0.2 kB
          Christopher Morrone
        4. rzuseqio15_console.txt.bz2
          79 kB
          Christopher Morrone
        5. rzuseqio15_drop_caches.txt.bz2
          0.2 kB
          Christopher Morrone
        6. rzuseqio16_drop_caches.txt.bz2
          3.46 MB
          Christopher Morrone
        7. rzuseqio16_dump_page_cache.txt.bz2
          783 kB
          Christopher Morrone


          Activity

            [LU-3416] Hangs on write in osc_enter_cache()

            patch landed for 2.5

            niu Niu Yawei (Inactive) added a comment

            Jinshan, testing of change 6554 patch set 7 went well. No failures, and no new performance problems noted.

            The commit message will need to be rewritten before landing.

            morrone Christopher Morrone (Inactive) added a comment

            It takes quite a while to run drop_caches. I fear that getting the log of the stuck processes will be quite difficult with the other threads swamping the logs. But I'll keep this as a background task when I have time.

            In the meantime, I'll update to the latest patch.

            morrone Christopher Morrone (Inactive) added a comment

            I took a look at the log but unfortunately I didn't find anything interesting.

            For example, in the log rzuseqio15_drop_caches.txt, process 15336 is obviously the drop_caches process and 15302 is the flush process. So I tried to dig out activity from processes other than these two with:

            [jinxiong@intel tmp]$ cat rzuseqio15_drop_caches.txt |grep -v '15336:' |grep -v ':15302:' |grep cl_io
            [jinxiong@intel tmp]$ 
            

            but nothing was printed. I don't understand how this can happen: the pages already exist, yet no process is associated with them. It would be helpful to figure out which process was stuck before the caches were dropped.

            The purpose of doing this is to fully understand why dropping caches can make the hung process go forward. Though the latest patch may fix the problem, it's really bad if we don't understand the code path clearly.
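            The two chained `grep -v` calls above can be collapsed into a single extended-regexp filter. A minimal self-contained sketch on a fabricated log snippet (the `:PID:` field layout is an assumption for illustration; the real Lustre debug-log format has more fields, and the original filter matched `15336:` without a leading colon):

```shell
# Fabricated miniature of a debug log, one "subsys:pid:message" line each
cat > /tmp/sample_drop_caches.txt <<'EOF'
00000080:15336:cl_io_loop drop_caches work
00000080:15302:cl_io_loop flush work
00000080:14321:cl_io_commit_write stuck writer?
00000080:14321:osc_enter_cache waiting
EOF

# One grep -Ev excludes both known PIDs; the second grep keeps cl_io activity.
# Only the third line above should survive.
grep -Ev ':(15336|15302):' /tmp/sample_drop_caches.txt | grep cl_io
```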

            jay Jinshan Xiong (Inactive) added a comment

            Hi Chris, can you please try the last patch set? If this works, I will land it into master.

            jay Jinshan Xiong (Inactive) added a comment - Hi Chris, can you please try the last patch set? If this works, I will land it into master.

            See the attached rzuseq*_drop_caches.bz2 files. When I got a write hang, I enabled full Lustre debugging, issued an 'echo 2 > drop_caches', waited a few seconds, and then dumped the logs on all four clients.

            Hopefully there is something useful in there.
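            The collection procedure described in this comment might look roughly like the following root session on each affected client. This is an illustrative reconstruction, not Chris's actual commands: the `lctl` parameters (`debug`, `debug_mb`) and the `dk` subcommand are standard Lustre ones, the buffer size and output path are hypothetical, and nothing here is testable outside a live Lustre client:

```shell
# Assumed commands, run as root on each Lustre client (illustrative only)
lctl set_param debug=-1                       # enable full Lustre debugging
lctl set_param debug_mb=1024                  # enlarge debug buffer so traces aren't overwritten
echo 2 > /proc/sys/vm/drop_caches             # free reclaimable slab objects, as in the comment
sleep 5                                       # give the stuck threads a few seconds to react
lctl dk > /tmp/lustre-debug.$(hostname).txt   # dump this client's kernel debug log
```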

            morrone Christopher Morrone (Inactive) added a comment

            Yes, I haven't been able to reproduce the problem again. Performance is in the acceptable range.

            I should be able to get that log tomorrow.

            morrone Christopher Morrone (Inactive) added a comment

            No news is good news, no?

            If this patch survives, can you please do me a favor: roll back the patch, make the hang happen, and collect logs while doing `echo 3 > drop_caches'? I don't understand why this can make it go forward.

            jay Jinshan Xiong (Inactive) added a comment

            I have been running the reproducer, and haven't had a single hang yet with patch set 5 applied. I will leave it running overnight.

            morrone Christopher Morrone (Inactive) added a comment

            People

              niu Niu Yawei (Inactive)
              morrone Christopher Morrone (Inactive)
              Votes: 0
              Watchers: 12
