    Description

      It looks like LU-2576 is back again. The problem went away for a while, seemingly thanks to the patch from LU-2576. I note that a later fix for a stack overflow in LU-2859 changed the same lines where that patch was applied, so perhaps it reintroduced the problem?

      We are seeing hangs during writes on BG/Q hardware. We find tasks that appear to be stuck sleeping indefinitely here:

      2013-05-29 16:15:36.363547  sysiod        S 00000fffa85363bc     0  4111   3070 0x00000000
      2013-05-29 16:15:36.363582  Call Trace:
      2013-05-29 16:15:36.363617  [c0000002e1e12440] [c000000302e95cc8] 0xc000000302e95cc8 (unreliable)
      2013-05-29 16:15:36.363653  [c0000002e1e12610] [c000000000008de0] .__switch_to+0xc4/0x100
      2013-05-29 16:15:36.363688  [c0000002e1e126a0] [c00000000044dc68] .schedule+0x858/0x9c0
      2013-05-29 16:15:36.363723  [c0000002e1e12950] [80000000004820a0] .cfs_waitq_wait+0x10/0x30 [libcfs]
      2013-05-29 16:15:36.363758  [c0000002e1e129c0] [80000000015b3ccc] .osc_enter_cache+0xb6c/0x1410 [osc]
      2013-05-29 16:15:36.363793  [c0000002e1e12ba0] [80000000015bbf30] .osc_queue_async_io+0xcd0/0x2690 [osc]
      2013-05-29 16:15:36.363828  [c0000002e1e12db0] [8000000001598598] .osc_page_cache_add+0xf8/0x2a0 [osc]
      2013-05-29 16:15:36.363863  [c0000002e1e12e70] [8000000000a04248] .cl_page_cache_add+0xf8/0x420 [obdclass]
      2013-05-29 16:15:36.363898  [c0000002e1e12fa0] [800000000179ed28] .lov_page_cache_add+0xc8/0x340 [lov]
      2013-05-29 16:15:36.363934  [c0000002e1e13070] [8000000000a04248] .cl_page_cache_add+0xf8/0x420 [obdclass]
      2013-05-29 16:15:36.363968  [c0000002e1e131a0] [8000000001d2ac74] .vvp_io_commit_write+0x464/0x910 [lustre]
      2013-05-29 16:15:36.364003  [c0000002e1e132c0] [8000000000a1df6c] .cl_io_commit_write+0x11c/0x2d0 [obdclass]
      2013-05-29 16:15:36.364038  [c0000002e1e13380] [8000000001cebc00] .ll_commit_write+0x120/0x3e0 [lustre]
      2013-05-29 16:15:36.364074  [c0000002e1e13450] [8000000001d0f134] .ll_write_end+0x34/0x80 [lustre]
      2013-05-29 16:15:36.364109  [c0000002e1e134e0] [c000000000097238] .generic_file_buffered_write+0x1f4/0x388
      2013-05-29 16:15:36.364143  [c0000002e1e13620] [c000000000097928] .__generic_file_aio_write+0x374/0x3d8
      2013-05-29 16:15:36.364178  [c0000002e1e13720] [c000000000097a04] .generic_file_aio_write+0x78/0xe8
      2013-05-29 16:15:36.364213  [c0000002e1e137d0] [8000000001d2df00] .vvp_io_write_start+0x170/0x3b0 [lustre]
      2013-05-29 16:15:36.364248  [c0000002e1e138a0] [8000000000a1849c] .cl_io_start+0xcc/0x220 [obdclass]
      2013-05-29 16:15:36.364283  [c0000002e1e13940] [8000000000a202a4] .cl_io_loop+0x194/0x2c0 [obdclass]
      2013-05-29 16:15:36.364317  [c0000002e1e139f0] [8000000001ca0780] .ll_file_io_generic+0x4f0/0x850 [lustre]
      2013-05-29 16:15:36.364352  [c0000002e1e13b30] [8000000001ca0f64] .ll_file_aio_write+0x1d4/0x3a0 [lustre]
      2013-05-29 16:15:36.364387  [c0000002e1e13c00] [8000000001ca1280] .ll_file_write+0x150/0x320 [lustre]
      2013-05-29 16:15:36.364422  [c0000002e1e13ce0] [c0000000000d4328] .vfs_write+0xd0/0x1c4
      2013-05-29 16:15:36.364458  [c0000002e1e13d80] [c0000000000d4518] .SyS_write+0x54/0x98
      2013-05-29 16:15:36.364492  [c0000002e1e13e30] [c000000000000580] syscall_exit+0x0/0x2c
      

      This was with Lustre 2.4.0-RC1_3chaos.

      Attachments

        1. rzuseqio13_drop_caches.txt.bz2
          0.2 kB
        2. rzuseqio14_drop_caches.txt.bz2
          4.62 MB
        3. rzuseqio14_lustre_log.txt
          0.2 kB
        4. rzuseqio15_console.txt.bz2
          79 kB
        5. rzuseqio15_drop_caches.txt.bz2
          0.2 kB
        6. rzuseqio16_drop_caches.txt.bz2
          3.46 MB
        7. rzuseqio16_dump_page_cache.txt.bz2
          783 kB


          Activity

            [LU-3416] Hangs on write in osc_enter_cache()

            See new attachment rzuseqio15_console.txt.bz2. It shows backtraces from all processes after a hang began, and then the messages that were printed by patch set 4.

            morrone Christopher Morrone (Inactive) added a comment

            This is awkward - I thought it must be sleeping for LRU slots while holding some dirty pages. The problem is obvious: cl_dirty couldn't be deducted for some reason. One possible reason would be that the ptlrpc threads were blocked somewhere so brw_interpret couldn't be called, or that the processes are waiting for something while holding active osc_extents. Can you please take a look at the backtraces of the running processes and tell me whether any other process is blocked on something?

            Also, I worked out patch set 4, which dumps the extent tree when this problem happens. This can help us isolate the root cause.
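
            To make the wait condition concrete, here is a minimal sketch (not the actual osc_cache.c code; the variable names and the 64 KiB reserve size are assumptions based on the "try to reserve 65536" message below): a writer can only add dirty data while the accounted dirty bytes stay under the per-OSC limit, so if cl_dirty is never deducted the check never succeeds and the writer sleeps indefinitely.

            # Simplified sketch in Python, not the real osc_enter_cache() logic.
            def can_enter_cache(cl_dirty, cl_dirty_max, reserve=65536):
                # The writer may proceed only if the new dirty data still fits
                # under the per-OSC dirty limit.
                return cl_dirty + reserve <= cl_dirty_max

            # With dirty pinned at the limit (33554432/33554432 in the debug
            # output below), this stays False and the writer keeps waiting.
            print(can_enter_cache(33554432, 33554432))  # False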

            jay Jinshan Xiong (Inactive) added a comment

            Here's output from one of the debug messages before I leave for the day:

            LustreError: 3709:0:(osc_cache.c:1563:osc_enter_cache()) lsrzb-OST0004-osc-c0000003c70ca400: grant { dirty: 33554432/33554432 dirty_pages: 6934/131071 unstable_pages: 0/131071 dropped: 0 avail: 45613056, reserved: 0, flight: 0 } lru {in list: 2407, left: 515, waiters: 0 }try to reserve 65536.
            LustreError: 3736:0:(osc_cache.c:1563:osc_enter_cache()) lsrzb-OST0004-osc-c0000003c70ca400: grant { dirty: 33554432/33554432 dirty_pages: 6934/131071 unstable_pages: 0/131071 dropped: 0 avail: 45613056, reserved: 0, flight: 0 } lru {in list: 2407, left: 515, waiters: 0 }try to reserve 65536.
            
            morrone Christopher Morrone (Inactive) added a comment

            The performance problem (and presumably the negative-number problem) is gone, but it is still hanging for ten minutes like with patch one. I'll post some info tomorrow.

            morrone Christopher Morrone (Inactive) added a comment

            Will you please try patch set 3 and see if it helps?

            jay Jinshan Xiong (Inactive) added a comment

            I found the problem. Working on a patch.

            jay Jinshan Xiong (Inactive) added a comment

            Ah yes, indeed. The dirty bytes are actually negative numbers. I'll check it.

            jay Jinshan Xiong (Inactive) added a comment

            I double checked, and I am certain that the patch is the only difference. FYI, I've moved to Lustre version 2.4.0-RC2_2chaos on both clients and servers.

            Without the patch, I see write throughput burst to 4.4 GB/s initially (as seen in ltop), and then taper off to a fairly constant rate in the low 3 GB/s range.

            With the patch, I see 720 MB/s. (These are aggregate throughput for 4 BG/Q clients.)

            Keep in mind that our tree has the unstable page tracking changes. I don't know if there is a relation there or not, but you should be aware.

            Without your patch, at the beginning of an IOR write phase, you might see something like this:

            rzuseqio13-ib0@morrone:lctl get_param "osc.*.cur_dirty_bytes"
            osc.lsrzb-OST0001-osc-c0000003e18f6400.cur_dirty_bytes=33554432
            osc.lsrzb-OST0002-osc-c0000003e18f6400.cur_dirty_bytes=33554432
            osc.lsrzb-OST0003-osc-c0000003e18f6400.cur_dirty_bytes=33554432
            osc.lsrzb-OST0004-osc-c0000003e18f6400.cur_dirty_bytes=33554432
            osc.lsrzb-OST0005-osc-c0000003e18f6400.cur_dirty_bytes=33554432
            osc.lsrzb-OST0006-osc-c0000003e18f6400.cur_dirty_bytes=33161216
            osc.lsrzb-OST0007-osc-c0000003e18f6400.cur_dirty_bytes=33554432
            osc.lsrzb-OST0008-osc-c0000003e18f6400.cur_dirty_bytes=32899072
            osc.lsrzb-OST0009-osc-c0000003e18f6400.cur_dirty_bytes=33554432
            osc.lsrzb-OST000a-osc-c0000003e18f6400.cur_dirty_bytes=33554432
            osc.lsrzb-OST000b-osc-c0000003e18f6400.cur_dirty_bytes=33554432
            osc.lsrzb-OST000c-osc-c0000003e18f6400.cur_dirty_bytes=33030144
            osc.lsrzb-OST000d-osc-c0000003e18f6400.cur_dirty_bytes=33554432
            osc.lsrzb-OST000e-osc-c0000003e18f6400.cur_dirty_bytes=12582912
            osc.lsrzb-OST000f-osc-c0000003e18f6400.cur_dirty_bytes=33554432
            osc.lsrzb-OST0010-osc-c0000003e18f6400.cur_dirty_bytes=33554432
            

            And that seems fairly normal.

            But with patch set 2 of 6554 applied I see these bogus numbers:

            rzuseqlac2:~/BG-191$ rsh rzuseqio13 '/usr/sbin/lctl get_param "osc.*.cur_dirty_bytes"'
            osc.lsrzb-OST0001-osc-c0000003ec957c00.cur_dirty_bytes=18446744073707454464
            osc.lsrzb-OST0002-osc-c0000003ec957c00.cur_dirty_bytes=18446744073707454464
            osc.lsrzb-OST0003-osc-c0000003ec957c00.cur_dirty_bytes=18446744073707454464
            osc.lsrzb-OST0004-osc-c0000003ec957c00.cur_dirty_bytes=18446744073703260160
            osc.lsrzb-OST0005-osc-c0000003ec957c00.cur_dirty_bytes=18446744073707454464
            osc.lsrzb-OST0006-osc-c0000003ec957c00.cur_dirty_bytes=18446744073707454464
            osc.lsrzb-OST0007-osc-c0000003ec957c00.cur_dirty_bytes=18446744073706930176
            osc.lsrzb-OST0008-osc-c0000003ec957c00.cur_dirty_bytes=18446744073707454464
            osc.lsrzb-OST0009-osc-c0000003ec957c00.cur_dirty_bytes=18446744073707454464
            osc.lsrzb-OST000a-osc-c0000003ec957c00.cur_dirty_bytes=18446744073707454464
            osc.lsrzb-OST000b-osc-c0000003ec957c00.cur_dirty_bytes=18446744073706536960
            osc.lsrzb-OST000c-osc-c0000003ec957c00.cur_dirty_bytes=18446744073707454464
            osc.lsrzb-OST000d-osc-c0000003ec957c00.cur_dirty_bytes=18446744073707454464
            osc.lsrzb-OST000e-osc-c0000003ec957c00.cur_dirty_bytes=18446744073707454464
            osc.lsrzb-OST000f-osc-c0000003ec957c00.cur_dirty_bytes=18446744073707454464
            osc.lsrzb-OST0010-osc-c0000003ec957c00.cur_dirty_bytes=18446744073707454464
            

            I'm guessing that those large numbers are simply an arithmetic problem, where we're allowing the counter to go negative. That somehow results in very slow write throughput, perhaps because all operations are now completely synchronous?
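
            As a sanity check (a quick calculation under the assumption that cur_dirty_bytes is an unsigned 64-bit counter that underflowed), reinterpreting the values above as signed 64-bit integers gives small negative byte counts:

            # Reinterpret the u64 cur_dirty_bytes values above as signed 64-bit
            # integers to show they are small negative counts (assumed underflow).
            def as_signed64(u):
                return u - (1 << 64) if u >= (1 << 63) else u

            for v in (18446744073707454464, 18446744073703260160):
                print(v, "->", as_signed64(v))
            # 18446744073707454464 -> -2097152  (-2 MiB)
            # 18446744073703260160 -> -6291456  (-6 MiB)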

            morrone Christopher Morrone (Inactive) added a comment - edited

            This patch should have nothing to do with the performance. Maybe you can try rolling back to the unpatched version and see if it's due to env settings. Of course, we can simplify the problem by using dd to check performance.

            If you can still see the performance drop, please try to collect a log with D_CACHE enabled (lctl set_param debug=cache), and I will take a look to make sure it's in the correct path.

            jay Jinshan Xiong (Inactive) added a comment

            The problem seems to have disappeared with Patch Set 2 in place.

            But now write performance is unacceptably bad. Four BG/Q clients are only pushing a total of around 720 MB/s (180 MB/s each). That number should be closer to 1 GB/s per client.

            morrone Christopher Morrone (Inactive) added a comment

            Yes, I'll try it out. Thanks, Jinshan.

            morrone Christopher Morrone (Inactive) added a comment

            People

              niu Niu Yawei (Inactive)
              morrone Christopher Morrone (Inactive)