    Description

      It looks like LU-2576 is back again. The problem went away for a while, seemingly thanks to the patch from LU-2576. I note that a later fix for a stack overflow in LU-2859 changed the same lines where that patch was applied, so perhaps it reintroduced the problem?

      We are seeing hangs during writes on BG/Q hardware. We find tasks that appear to be stuck sleeping indefinitely here:

      2013-05-29 16:15:36.363547  sysiod        S 00000fffa85363bc     0  4111   3070 0x00000000
      2013-05-29 16:15:36.363582  Call Trace:
      2013-05-29 16:15:36.363617  [c0000002e1e12440] [c000000302e95cc8] 0xc000000302e95cc8 (unreliable)
      2013-05-29 16:15:36.363653  [c0000002e1e12610] [c000000000008de0] .__switch_to+0xc4/0x100
      2013-05-29 16:15:36.363688  [c0000002e1e126a0] [c00000000044dc68] .schedule+0x858/0x9c0
      2013-05-29 16:15:36.363723  [c0000002e1e12950] [80000000004820a0] .cfs_waitq_wait+0x10/0x30 [libcfs]
      2013-05-29 16:15:36.363758  [c0000002e1e129c0] [80000000015b3ccc] .osc_enter_cache+0xb6c/0x1410 [osc]
      2013-05-29 16:15:36.363793  [c0000002e1e12ba0] [80000000015bbf30] .osc_queue_async_io+0xcd0/0x2690 [osc]
      2013-05-29 16:15:36.363828  [c0000002e1e12db0] [8000000001598598] .osc_page_cache_add+0xf8/0x2a0 [osc]
      2013-05-29 16:15:36.363863  [c0000002e1e12e70] [8000000000a04248] .cl_page_cache_add+0xf8/0x420 [obdclass]
      2013-05-29 16:15:36.363898  [c0000002e1e12fa0] [800000000179ed28] .lov_page_cache_add+0xc8/0x340 [lov]
      2013-05-29 16:15:36.363934  [c0000002e1e13070] [8000000000a04248] .cl_page_cache_add+0xf8/0x420 [obdclass]
      2013-05-29 16:15:36.363968  [c0000002e1e131a0] [8000000001d2ac74] .vvp_io_commit_write+0x464/0x910 [lustre]
      2013-05-29 16:15:36.364003  [c0000002e1e132c0] [8000000000a1df6c] .cl_io_commit_write+0x11c/0x2d0 [obdclass]
      2013-05-29 16:15:36.364038  [c0000002e1e13380] [8000000001cebc00] .ll_commit_write+0x120/0x3e0 [lustre]
      2013-05-29 16:15:36.364074  [c0000002e1e13450] [8000000001d0f134] .ll_write_end+0x34/0x80 [lustre]
      2013-05-29 16:15:36.364109  [c0000002e1e134e0] [c000000000097238] .generic_file_buffered_write+0x1f4/0x388
      2013-05-29 16:15:36.364143  [c0000002e1e13620] [c000000000097928] .__generic_file_aio_write+0x374/0x3d8
      2013-05-29 16:15:36.364178  [c0000002e1e13720] [c000000000097a04] .generic_file_aio_write+0x78/0xe8
      2013-05-29 16:15:36.364213  [c0000002e1e137d0] [8000000001d2df00] .vvp_io_write_start+0x170/0x3b0 [lustre]
      2013-05-29 16:15:36.364248  [c0000002e1e138a0] [8000000000a1849c] .cl_io_start+0xcc/0x220 [obdclass]
      2013-05-29 16:15:36.364283  [c0000002e1e13940] [8000000000a202a4] .cl_io_loop+0x194/0x2c0 [obdclass]
      2013-05-29 16:15:36.364317  [c0000002e1e139f0] [8000000001ca0780] .ll_file_io_generic+0x4f0/0x850 [lustre]
      2013-05-29 16:15:36.364352  [c0000002e1e13b30] [8000000001ca0f64] .ll_file_aio_write+0x1d4/0x3a0 [lustre]
      2013-05-29 16:15:36.364387  [c0000002e1e13c00] [8000000001ca1280] .ll_file_write+0x150/0x320 [lustre]
      2013-05-29 16:15:36.364422  [c0000002e1e13ce0] [c0000000000d4328] .vfs_write+0xd0/0x1c4
      2013-05-29 16:15:36.364458  [c0000002e1e13d80] [c0000000000d4518] .SyS_write+0x54/0x98
      2013-05-29 16:15:36.364492  [c0000002e1e13e30] [c000000000000580] syscall_exit+0x0/0x2c
      

      This was with Lustre 2.4.0-RC1_3chaos.

      Attachments

        1. rzuseqio13_drop_caches.txt.bz2
          0.2 kB
        2. rzuseqio14_drop_caches.txt.bz2
          4.62 MB
        3. rzuseqio14_lustre_log.txt
          0.2 kB
        4. rzuseqio15_console.txt.bz2
          79 kB
        5. rzuseqio15_drop_caches.txt.bz2
          0.2 kB
        6. rzuseqio16_drop_caches.txt.bz2
          3.46 MB
        7. rzuseqio16_dump_page_cache.txt.bz2
          783 kB


          Activity

            [LU-3416] Hangs on write in osc_enter_cache()

            See new attachment rzuseqio15_console.txt.bz2. It shows backtraces from all processes after a hang began, and then the messages that were printed by patch set 4.

            morrone Christopher Morrone (Inactive) added a comment

            This is awkward - I thought it must be sleeping for LRU slots while holding some dirty pages. The problem is obvious: cl_dirty couldn't be deducted for some reason. One possible reason would be that the ptlrpc threads were blocked somewhere so brw_interpret couldn't be called, or that the processes are waiting for something while holding active osc_extents. Can you please take a look at the backtraces of the running processes and tell me whether any other process is blocked on something?

            Also, I worked out patch set 4, which dumps the extent tree when this problem happens. This can help us isolate the root cause.
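
            To make the wait condition concrete, here is a minimal sketch (not the actual osc_cache.c code; the variable names and the 64 KiB reserve size are assumptions based on the "try to reserve 65536" message below): a writer can only add dirty data while the accounted dirty bytes stay under the per-OSC limit, so if cl_dirty is never deducted the check never succeeds and the writer sleeps indefinitely.

            # Simplified sketch in Python, not the real osc_enter_cache() logic.
            def can_enter_cache(cl_dirty, cl_dirty_max, reserve=65536):
                # The writer may proceed only if the new dirty data still fits
                # under the per-OSC dirty limit.
                return cl_dirty + reserve <= cl_dirty_max

            # With dirty pinned at the limit (33554432/33554432 in the debug
            # output below), this stays False and the writer keeps waiting.
            print(can_enter_cache(33554432, 33554432))  # False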

            jay Jinshan Xiong (Inactive) added a comment

            Here's output from one of the debug messages before I leave for the day:

            LustreError: 3709:0:(osc_cache.c:1563:osc_enter_cache()) lsrzb-OST0004-osc-c0000003c70ca400: grant { dirty: 33554432/33554432 dirty_pages: 6934/131071 unstable_pages: 0/131071 dropped: 0 avail: 45613056, reserved: 0, flight: 0 } lru {in list: 2407, left: 515, waiters: 0 }try to reserve 65536.
            LustreError: 3736:0:(osc_cache.c:1563:osc_enter_cache()) lsrzb-OST0004-osc-c0000003c70ca400: grant { dirty: 33554432/33554432 dirty_pages: 6934/131071 unstable_pages: 0/131071 dropped: 0 avail: 45613056, reserved: 0, flight: 0 } lru {in list: 2407, left: 515, waiters: 0 }try to reserve 65536.
            
            morrone Christopher Morrone (Inactive) added a comment

            The performance problem (and presumably the negative-number problem) is gone, but it is still hanging for ten minutes like with patch one. I'll post some info tomorrow.

            morrone Christopher Morrone (Inactive) added a comment

            Will you please try patch set 3 and see if it helps?

            jay Jinshan Xiong (Inactive) added a comment

            I found the problem. Working on a patch.

            jay Jinshan Xiong (Inactive) added a comment

            Ah yes, indeed. The dirty bytes are actually negative numbers. I'll check it.

            jay Jinshan Xiong (Inactive) added a comment

            I double checked, and I am certain that the patch is the only difference. FYI, I've moved to Lustre version 2.4.0-RC2_2chaos on both clients and servers.

            Without the patch, I see write throughput burst to 4.4 GB/s initially (as seen in ltop), and then taper off to a fairly constant rate in the low 3 GB/s range.

            With the patch, I see 720 MB/s. (These are aggregate throughput for 4 BG/Q clients.)

            Keep in mind that our tree has the unstable page tracking changes. I don't know if there is a relation there or not, but you should be aware.

            Without your patch, at the beginning of an IOR write phase, you might see something like this:

            rzuseqio13-ib0@morrone:lctl get_param "osc.*.cur_dirty_bytes"
            osc.lsrzb-OST0001-osc-c0000003e18f6400.cur_dirty_bytes=33554432
            osc.lsrzb-OST0002-osc-c0000003e18f6400.cur_dirty_bytes=33554432
            osc.lsrzb-OST0003-osc-c0000003e18f6400.cur_dirty_bytes=33554432
            osc.lsrzb-OST0004-osc-c0000003e18f6400.cur_dirty_bytes=33554432
            osc.lsrzb-OST0005-osc-c0000003e18f6400.cur_dirty_bytes=33554432
            osc.lsrzb-OST0006-osc-c0000003e18f6400.cur_dirty_bytes=33161216
            osc.lsrzb-OST0007-osc-c0000003e18f6400.cur_dirty_bytes=33554432
            osc.lsrzb-OST0008-osc-c0000003e18f6400.cur_dirty_bytes=32899072
            osc.lsrzb-OST0009-osc-c0000003e18f6400.cur_dirty_bytes=33554432
            osc.lsrzb-OST000a-osc-c0000003e18f6400.cur_dirty_bytes=33554432
            osc.lsrzb-OST000b-osc-c0000003e18f6400.cur_dirty_bytes=33554432
            osc.lsrzb-OST000c-osc-c0000003e18f6400.cur_dirty_bytes=33030144
            osc.lsrzb-OST000d-osc-c0000003e18f6400.cur_dirty_bytes=33554432
            osc.lsrzb-OST000e-osc-c0000003e18f6400.cur_dirty_bytes=12582912
            osc.lsrzb-OST000f-osc-c0000003e18f6400.cur_dirty_bytes=33554432
            osc.lsrzb-OST0010-osc-c0000003e18f6400.cur_dirty_bytes=33554432
            

            And that seems fairly normal.

            But with patch set 2 of 6554 applied I see these bogus numbers:

            rzuseqlac2:~/BG-191$ rsh rzuseqio13 '/usr/sbin/lctl get_param "osc.*.cur_dirty_bytes"'
            osc.lsrzb-OST0001-osc-c0000003ec957c00.cur_dirty_bytes=18446744073707454464
            osc.lsrzb-OST0002-osc-c0000003ec957c00.cur_dirty_bytes=18446744073707454464
            osc.lsrzb-OST0003-osc-c0000003ec957c00.cur_dirty_bytes=18446744073707454464
            osc.lsrzb-OST0004-osc-c0000003ec957c00.cur_dirty_bytes=18446744073703260160
            osc.lsrzb-OST0005-osc-c0000003ec957c00.cur_dirty_bytes=18446744073707454464
            osc.lsrzb-OST0006-osc-c0000003ec957c00.cur_dirty_bytes=18446744073707454464
            osc.lsrzb-OST0007-osc-c0000003ec957c00.cur_dirty_bytes=18446744073706930176
            osc.lsrzb-OST0008-osc-c0000003ec957c00.cur_dirty_bytes=18446744073707454464
            osc.lsrzb-OST0009-osc-c0000003ec957c00.cur_dirty_bytes=18446744073707454464
            osc.lsrzb-OST000a-osc-c0000003ec957c00.cur_dirty_bytes=18446744073707454464
            osc.lsrzb-OST000b-osc-c0000003ec957c00.cur_dirty_bytes=18446744073706536960
            osc.lsrzb-OST000c-osc-c0000003ec957c00.cur_dirty_bytes=18446744073707454464
            osc.lsrzb-OST000d-osc-c0000003ec957c00.cur_dirty_bytes=18446744073707454464
            osc.lsrzb-OST000e-osc-c0000003ec957c00.cur_dirty_bytes=18446744073707454464
            osc.lsrzb-OST000f-osc-c0000003ec957c00.cur_dirty_bytes=18446744073707454464
            osc.lsrzb-OST0010-osc-c0000003ec957c00.cur_dirty_bytes=18446744073707454464
            

            I'm guessing that those large numbers are simply an arithmetic problem, where we're allowing the counter to go negative. That somehow results in very slow write throughput, perhaps because all operations are now completely synchronous?
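
            As a sanity check (a quick calculation under the assumption that cur_dirty_bytes is an unsigned 64-bit counter that underflowed), reinterpreting the values above as signed 64-bit integers gives small negative byte counts:

            # Reinterpret the u64 cur_dirty_bytes values above as signed 64-bit
            # integers to show they are small negative counts (assumed underflow).
            def as_signed64(u):
                return u - (1 << 64) if u >= (1 << 63) else u

            for v in (18446744073707454464, 18446744073703260160):
                print(v, "->", as_signed64(v))
            # 18446744073707454464 -> -2097152  (-2 MiB)
            # 18446744073703260160 -> -6291456  (-6 MiB)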

            morrone Christopher Morrone (Inactive) added a comment - edited

            This patch should have nothing to do with the performance. Maybe you can try rolling back to the unpatched version and see if it's due to env settings. Of course, we can simplify the problem by using dd to check performance.

            If you can still see the performance drop, please try to collect a log with D_CACHE enabled (lctl set_param debug=cache), and I will take a look to make sure it's in the correct path.

            jay Jinshan Xiong (Inactive) added a comment

            The problem seems to have disappeared with Patch Set 2 in place.

            But now write performance is unacceptably bad. Four BG/Q clients are only pushing a total of around 720 MB/s (180 MB/s each). That number should be closer to 1 GB/s per client.

            morrone Christopher Morrone (Inactive) added a comment

            Yes, I'll try it out. Thanks, Jinshan.

            morrone Christopher Morrone (Inactive) added a comment

            People

              niu Niu Yawei (Inactive)
              morrone Christopher Morrone (Inactive)