Details


    Description

      It looks like LU-2576 is back again. The problem went away for a while, seemingly thanks to the patch from LU-2576. I note that a later fix for a stack overflow in LU-2859 changed the same lines where that patch was applied, so perhaps it reintroduced the problem?

      We are seeing hangs during writes on BG/Q hardware. We find tasks that appear to be stuck sleeping indefinitely here:

      2013-05-29 16:15:36.363547  sysiod        S 00000fffa85363bc     0  4111   3070 0x00000000
      2013-05-29 16:15:36.363582  Call Trace:
      2013-05-29 16:15:36.363617  [c0000002e1e12440] [c000000302e95cc8] 0xc000000302e95cc8 (unreliable)
      2013-05-29 16:15:36.363653  [c0000002e1e12610] [c000000000008de0] .__switch_to+0xc4/0x100
      2013-05-29 16:15:36.363688  [c0000002e1e126a0] [c00000000044dc68] .schedule+0x858/0x9c0
      2013-05-29 16:15:36.363723  [c0000002e1e12950] [80000000004820a0] .cfs_waitq_wait+0x10/0x30 [libcfs]
      2013-05-29 16:15:36.363758  [c0000002e1e129c0] [80000000015b3ccc] .osc_enter_cache+0xb6c/0x1410 [osc]
      2013-05-29 16:15:36.363793  [c0000002e1e12ba0] [80000000015bbf30] .osc_queue_async_io+0xcd0/0x2690 [osc]
      2013-05-29 16:15:36.363828  [c0000002e1e12db0] [8000000001598598] .osc_page_cache_add+0xf8/0x2a0 [osc]
      2013-05-29 16:15:36.363863  [c0000002e1e12e70] [8000000000a04248] .cl_page_cache_add+0xf8/0x420 [obdclass]
      2013-05-29 16:15:36.363898  [c0000002e1e12fa0] [800000000179ed28] .lov_page_cache_add+0xc8/0x340 [lov]
      2013-05-29 16:15:36.363934  [c0000002e1e13070] [8000000000a04248] .cl_page_cache_add+0xf8/0x420 [obdclass]
      2013-05-29 16:15:36.363968  [c0000002e1e131a0] [8000000001d2ac74] .vvp_io_commit_write+0x464/0x910 [lustre]
      2013-05-29 16:15:36.364003  [c0000002e1e132c0] [8000000000a1df6c] .cl_io_commit_write+0x11c/0x2d0 [obdclass]
      2013-05-29 16:15:36.364038  [c0000002e1e13380] [8000000001cebc00] .ll_commit_write+0x120/0x3e0 [lustre]
      2013-05-29 16:15:36.364074  [c0000002e1e13450] [8000000001d0f134] .ll_write_end+0x34/0x80 [lustre]
      2013-05-29 16:15:36.364109  [c0000002e1e134e0] [c000000000097238] .generic_file_buffered_write+0x1f4/0x388
      2013-05-29 16:15:36.364143  [c0000002e1e13620] [c000000000097928] .__generic_file_aio_write+0x374/0x3d8
      2013-05-29 16:15:36.364178  [c0000002e1e13720] [c000000000097a04] .generic_file_aio_write+0x78/0xe8
      2013-05-29 16:15:36.364213  [c0000002e1e137d0] [8000000001d2df00] .vvp_io_write_start+0x170/0x3b0 [lustre]
      2013-05-29 16:15:36.364248  [c0000002e1e138a0] [8000000000a1849c] .cl_io_start+0xcc/0x220 [obdclass]
      2013-05-29 16:15:36.364283  [c0000002e1e13940] [8000000000a202a4] .cl_io_loop+0x194/0x2c0 [obdclass]
      2013-05-29 16:15:36.364317  [c0000002e1e139f0] [8000000001ca0780] .ll_file_io_generic+0x4f0/0x850 [lustre]
      2013-05-29 16:15:36.364352  [c0000002e1e13b30] [8000000001ca0f64] .ll_file_aio_write+0x1d4/0x3a0 [lustre]
      2013-05-29 16:15:36.364387  [c0000002e1e13c00] [8000000001ca1280] .ll_file_write+0x150/0x320 [lustre]
      2013-05-29 16:15:36.364422  [c0000002e1e13ce0] [c0000000000d4328] .vfs_write+0xd0/0x1c4
      2013-05-29 16:15:36.364458  [c0000002e1e13d80] [c0000000000d4518] .SyS_write+0x54/0x98
      2013-05-29 16:15:36.364492  [c0000002e1e13e30] [c000000000000580] syscall_exit+0x0/0x2c
      

      This was with Lustre 2.4.0-RC1_3chaos.
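
      The pattern implicated by this backtrace is a writer that must reserve dirty-cache grant before copying data, and that sleeps until the write-completion path gives grant back. The userspace C sketch below models only that pattern; it is not the Lustre implementation, and every name in it (cache_state, grant_reserve, grant_release, DIRTY_MAX, CHUNK) is hypothetical. Its point is that the writer's sleep has no timeout, so if the completion side never runs, the writer sleeps forever, which is what the stuck sysiod task above looks like.

      /*
       * Illustrative sketch only -- NOT the Lustre implementation.  All names
       * here (cache_state, grant_reserve, grant_release, DIRTY_MAX, CHUNK) are
       * hypothetical.  It models the shape of the hang in the backtrace above:
       * a writer must reserve dirty-cache grant before copying data and sleeps,
       * with no timeout, until the write-completion path (brw_interpret() in
       * the real client) gives grant back and wakes it.
       */
      #include <pthread.h>
      #include <stdio.h>
      #include <unistd.h>

      #define DIRTY_MAX   (32UL << 20)        /* hypothetical per-target dirty cap */
      #define CHUNK       (64UL << 10)        /* one write's reservation */

      struct cache_state {
              pthread_mutex_t lock;
              pthread_cond_t  has_grant;      /* signalled when grant is returned */
              pthread_cond_t  has_data;       /* signalled when dirty data exists */
              unsigned long   dirty;          /* bytes currently held dirty */
      };

      static struct cache_state cs = {
              .lock      = PTHREAD_MUTEX_INITIALIZER,
              .has_grant = PTHREAD_COND_INITIALIZER,
              .has_data  = PTHREAD_COND_INITIALIZER,
      };

      /* Writer side: analogous to the wait seen in osc_enter_cache(). */
      static void grant_reserve(unsigned long bytes)
      {
              pthread_mutex_lock(&cs.lock);
              while (cs.dirty + bytes > DIRTY_MAX)
                      /* No timeout: if nothing ever returns grant, this sleep
                       * is indefinite -- the state the stuck task is in. */
                      pthread_cond_wait(&cs.has_grant, &cs.lock);
              cs.dirty += bytes;
              pthread_cond_broadcast(&cs.has_data);
              pthread_mutex_unlock(&cs.lock);
      }

      /* Completion side: the analogue of a write RPC completing and crediting
       * grant back to waiting writers. */
      static void grant_release(unsigned long bytes)
      {
              pthread_mutex_lock(&cs.lock);
              while (cs.dirty < bytes)        /* wait for something to complete */
                      pthread_cond_wait(&cs.has_data, &cs.lock);
              cs.dirty -= bytes;
              pthread_cond_broadcast(&cs.has_grant);
              pthread_mutex_unlock(&cs.lock);
      }

      static void *writer(void *arg)
      {
              (void)arg;
              for (int i = 0; i < 1024; i++) {
                      grant_reserve(CHUNK);
                      /* ... copy user data into the page cache here ... */
              }
              return NULL;
      }

      static void *completer(void *arg)
      {
              (void)arg;
              for (int i = 0; i < 1024; i++) {
                      usleep(100);            /* pretend a write RPC completed */
                      grant_release(CHUNK);
              }
              return NULL;
      }

      int main(void)
      {
              pthread_t w, c;

              pthread_create(&w, NULL, writer, NULL);
              pthread_create(&c, NULL, completer, NULL);
              pthread_join(w, NULL);
              pthread_join(c, NULL);
              printf("dirty bytes at exit: %lu\n", cs.dirty);
              return 0;
      }

      Build with 'cc -pthread sketch.c'. If the completer thread is never started (or, in the real system, the completion path stops running), the writer blocks forever once DIRTY_MAX bytes are outstanding.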

      Attachments

        1. rzuseqio13_drop_caches.txt.bz2
          0.2 kB
        2. rzuseqio14_drop_caches.txt.bz2
          4.62 MB
        3. rzuseqio14_lustre_log.txt
          0.2 kB
        4. rzuseqio15_console.txt.bz2
          79 kB
        5. rzuseqio15_drop_caches.txt.bz2
          0.2 kB
        6. rzuseqio16_drop_caches.txt.bz2
          3.46 MB
        7. rzuseqio16_dump_page_cache.txt.bz2
          783 kB


          Activity

            [LU-3416] Hangs on write in osc_enter_cache()

            See the attached rzuseq*_drop_caches.bz2 files. When I got a write hang, I enabled full lustre debugging, issued 'echo 2 > drop_caches', waited a few seconds, and then dumped the logs on all four clients.

            Hopefully there is something useful in there.

            morrone Christopher Morrone (Inactive) added a comment

            Yes, I haven't been able to reproduce the problem again. Performance is in the acceptable range.

            I should be able to get that log tomorrow.

            morrone Christopher Morrone (Inactive) added a comment

            No news means good news, I assume?

            If this patch holds up, can you please do me a favor: roll back the patch, make the hang happen again, and collect logs while doing 'echo 3 > drop_caches'? I don't understand why that makes it go forward.

            jay Jinshan Xiong (Inactive) added a comment

            I have been running the reproducer, and haven't had a single hang yet with patch set 5 applied. I will leave it running overnight.

            morrone Christopher Morrone (Inactive) added a comment

            Can you please try patch set 5?

            jay Jinshan Xiong (Inactive) added a comment

            See new attachment rzuseqio15_console.txt.bz2. It shows backtraces from all processes after a hang began, and then the messages that were printed by patch set 4.

            morrone Christopher Morrone (Inactive) added a comment

            This is awkward - I thought it must be sleeping for LRU slots while holding some dirty pages. The problem is obvious: cl_dirty couldn't be deducted for some reason. One possible reason would be that the ptlrpc threads were blocked somewhere so brw_interpret couldn't be called, or that the processes are waiting for something while holding active osc_extents. Can you please take a look at the backtraces of the running processes and tell me whether any other process is blocked on something?

            Also, I worked out patch set 4 which dumps the extent tree when this problem happens. This can help us isolate the root cause.

            jay Jinshan Xiong (Inactive) added a comment

            Here's output from one of the debug messages before I leave for the day:

            LustreError: 3709:0:(osc_cache.c:1563:osc_enter_cache()) lsrzb-OST0004-osc-c0000003c70ca400: grant { dirty: 33554432/33554432 dirty_pages: 6934/131071 unstable_pages: 0/131071 dropped: 0 avail: 45613056, reserved: 0, flight: 0 } lru {in list: 2407, left: 515, waiters: 0 }try to reserve 65536.
            LustreError: 3736:0:(osc_cache.c:1563:osc_enter_cache()) lsrzb-OST0004-osc-c0000003c70ca400: grant { dirty: 33554432/33554432 dirty_pages: 6934/131071 unstable_pages: 0/131071 dropped: 0 avail: 45613056, reserved: 0, flight: 0 } lru {in list: 2407, left: 515, waiters: 0 }try to reserve 65536.
            
            morrone Christopher Morrone (Inactive) added a comment
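
            To make the grant line above easier to read, here is a small standalone C snippet; the parsing code and its names are mine, not part of Lustre. It only restates what the message already shows: dirty is at its cap (33554432/33554432) while "avail" grant is still large, so new writes must wait in osc_enter_cache() for dirty bytes to be written back.

            /*
             * Standalone snippet for reading the LustreError grant message
             * above; not part of Lustre.
             */
            #include <stdio.h>
            #include <string.h>

            int main(void)
            {
                    /* Fields copied verbatim from the LustreError line above. */
                    const char *msg =
                        "grant { dirty: 33554432/33554432 dirty_pages: 6934/131071 "
                        "unstable_pages: 0/131071 dropped: 0 avail: 45613056, "
                        "reserved: 0, flight: 0 }";
                    unsigned long dirty = 0, dirty_max = 0, avail = 0;
                    const char *p;

                    if ((p = strstr(msg, "dirty:")) != NULL)
                            sscanf(p, "dirty: %lu/%lu", &dirty, &dirty_max);
                    if ((p = strstr(msg, "avail:")) != NULL)
                            sscanf(p, "avail: %lu", &avail);

                    printf("dirty %lu of %lu, avail grant %lu\n", dirty, dirty_max, avail);
                    if (dirty >= dirty_max)
                            printf("dirty cap exhausted: writers must sleep until "
                                   "dirty bytes are written back\n");
                    return 0;
            }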

            The performance problem (and presumably the negative number) is gone, but it is still hanging for ten minutes like with patch one. I'll post some info tomorrow.

            morrone Christopher Morrone (Inactive) added a comment

            Will you please try patch set 3 and see if it helps?

            jay Jinshan Xiong (Inactive) added a comment

            People

              niu Niu Yawei (Inactive)
              morrone Christopher Morrone (Inactive)