[LU-2367] Hang in osc_extent_wait under fsync Created: 20/Nov/12 Updated: 19/Apr/13 Resolved: 14/Jan/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | Lustre 2.4.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Christopher Morrone | Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | HB, sequoia | ||
| Environment: |
2.3.56-2chaos (github.com/chaos/lustre) |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 5630 |
| Description |
|
We seem to be getting hangs in osc_extent_wait() under the fsync at the end of ior's write phase. 2012-11-20 16:15:08.093629 {DefaultControlEventListener} [mmcs]{657}.0.0: sysiod S 00000fff8c266bd0 0 4037 3078 0x00000000
2012-11-20 16:15:08.093693 {DefaultControlEventListener} [mmcs]{657}.0.0: Call Trace:
2012-11-20 16:15:08.093744 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc2b10] [c0000003e1fc2bc0] 0xc0000003e1fc2bc0 (unreliable)
2012-11-20 16:15:08.093812 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc2ce0] [c000000000008de0] .__switch_to+0xc4/0x100
2012-11-20 16:15:08.093885 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc2d70] [c00000000042b0e0] .schedule+0x858/0x9c0
2012-11-20 16:15:08.093956 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc3020] [8000000000aa20a0] .cfs_waitq_wait+0x10/0x30 [libcfs]
2012-11-20 16:15:08.094010 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc3090] [80000000046d93f0] .osc_extent_wait+0x620/0x7d0 [osc]
2012-11-20 16:15:08.094062 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc3290] [80000000046d9c80] .osc_cache_wait_range+0x6e0/0xcd0 [osc]
2012-11-20 16:15:08.094127 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc34d0] [80000000046c4b00] .osc_io_fsync_end+0xf0/0x100 [osc]
2012-11-20 16:15:08.094192 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc3570] [800000000249962c] .cl_io_end+0xcc/0x2b0 [obdclass]
2012-11-20 16:15:08.094257 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc3620] [8000000005234d00] .lov_io_end_wrapper+0x180/0x1f0 [lov]
2012-11-20 16:15:08.094325 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc36c0] [8000000005234e3c] .lov_io_fsync_end+0xcc/0x250 [lov]
2012-11-20 16:15:08.094389 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc3760] [800000000249962c] .cl_io_end+0xcc/0x2b0 [obdclass]
2012-11-20 16:15:08.094446 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc3810] [80000000024a2e14] .cl_io_loop+0x1a4/0x2c0 [obdclass]
2012-11-20 16:15:08.094498 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc38c0] [8000000006a1332c] .cl_sync_file_range+0x3ac/0x660 [lustre]
2012-11-20 16:15:08.094550 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc39b0] [8000000006a4d4b4] .ll_writepages+0xb4/0x280 [lustre]
2012-11-20 16:15:08.094602 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc3a60] [c00000000009d534] .do_writepages+0x44/0x74
2012-11-20 16:15:08.094653 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc3ad0] [c000000000095710] .__filemap_fdatawrite_range+0x80/0xa8
2012-11-20 16:15:08.094705 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc3be0] [c000000000095778] .filemap_write_and_wait_range+0x40/0x8c
2012-11-20 16:15:08.094757 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc3c80] [c0000000000fc054] .vfs_fsync_range+0x70/0x104
2012-11-20 16:15:08.096720 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc3d30] [c0000000000fc18c] .do_fsync+0x3c/0x6c
2012-11-20 16:15:08.096787 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc3dc0] [c0000000000fc1fc] .SyS_fsync+0x18/0x28
2012-11-20 16:15:08.096846 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc3e30] [c000000000000580] syscall_exit+0x0/0x2c
The console message that seems to go with this event is: LustreError: 4037:0:(osc_cache.c:893:osc_extent_wait()) extent c0000002d415ade8@{[8176 -> 8191/8191], [2|0|+|cache|wiuY|c000000383eb0480], [1048576|16|+|-|c0000002d05f01e0|16|(null)]} lsfull-OST00d1-osc-c0000003cbcda000: wait ext to 0 timedout, recovery in progress?
Attached will be a log file,RA0-ID-J01.log, showing all processes on the node. Note that I logged into the node and ran "sync", which also hung, so you will likely see a second backtrace from that in there. That was not part of the original problem. |
| Comments |
| Comment by Jinshan Xiong (Inactive) [ 20/Nov/12 ] |
|
duplicate of |
| Comment by Christopher Morrone [ 21/Nov/12 ] |
|
We already have that patch since we're based on 2.3.56. |
| Comment by Jinshan Xiong (Inactive) [ 22/Nov/12 ] |
|
patch is at: http://review.whamcloud.com/4659 |
| Comment by Christopher Morrone [ 26/Nov/12 ] |
|
Ok, I've pulled 4659 into our 2.3.56-llnl branch. |
| Comment by Prakash Surya (Inactive) [ 29/Nov/12 ] |
|
To update, we've pull in 4659, did some testing on Sequoia, and have reason to believe that patch fixes this issue. |
| Comment by Jodi Levi (Inactive) [ 11/Jan/13 ] |
|
Based on the results, can this ticket be closed? |
| Comment by Jinshan Xiong (Inactive) [ 11/Jan/13 ] |
|
Hi Jodi, the patch has to be landed before we can close this ticket. |
| Comment by Prakash Surya (Inactive) [ 14/Jan/13 ] |
|
The fix has been landed, so it can be closed. Thanks! |