Details
-
Bug
-
Resolution: Fixed
-
Blocker
-
Lustre 2.4.0
-
2.3.56-2chaos (github.com/chaos/lustre)
-
3
-
5630
Description
We seem to be getting hangs in osc_extent_wait() under the fsync at the end of ior's write phase.
2012-11-20 16:15:08.093629 {DefaultControlEventListener} [mmcs]{657}.0.0: sysiod S 00000fff8c266bd0 0 4037 3078 0x00000000 2012-11-20 16:15:08.093693 {DefaultControlEventListener} [mmcs]{657}.0.0: Call Trace: 2012-11-20 16:15:08.093744 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc2b10] [c0000003e1fc2bc0] 0xc0000003e1fc2bc0 (unreliable) 2012-11-20 16:15:08.093812 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc2ce0] [c000000000008de0] .__switch_to+0xc4/0x100 2012-11-20 16:15:08.093885 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc2d70] [c00000000042b0e0] .schedule+0x858/0x9c0 2012-11-20 16:15:08.093956 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc3020] [8000000000aa20a0] .cfs_waitq_wait+0x10/0x30 [libcfs] 2012-11-20 16:15:08.094010 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc3090] [80000000046d93f0] .osc_extent_wait+0x620/0x7d0 [osc] 2012-11-20 16:15:08.094062 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc3290] [80000000046d9c80] .osc_cache_wait_range+0x6e0/0xcd0 [osc] 2012-11-20 16:15:08.094127 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc34d0] [80000000046c4b00] .osc_io_fsync_end+0xf0/0x100 [osc] 2012-11-20 16:15:08.094192 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc3570] [800000000249962c] .cl_io_end+0xcc/0x2b0 [obdclass] 2012-11-20 16:15:08.094257 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc3620] [8000000005234d00] .lov_io_end_wrapper+0x180/0x1f0 [lov] 2012-11-20 16:15:08.094325 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc36c0] [8000000005234e3c] .lov_io_fsync_end+0xcc/0x250 [lov] 2012-11-20 16:15:08.094389 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc3760] [800000000249962c] .cl_io_end+0xcc/0x2b0 [obdclass] 2012-11-20 16:15:08.094446 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc3810] [80000000024a2e14] .cl_io_loop+0x1a4/0x2c0 [obdclass] 2012-11-20 16:15:08.094498 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc38c0] [8000000006a1332c] .cl_sync_file_range+0x3ac/0x660 [lustre] 2012-11-20 16:15:08.094550 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc39b0] [8000000006a4d4b4] .ll_writepages+0xb4/0x280 [lustre] 2012-11-20 16:15:08.094602 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc3a60] [c00000000009d534] .do_writepages+0x44/0x74 2012-11-20 16:15:08.094653 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc3ad0] [c000000000095710] .__filemap_fdatawrite_range+0x80/0xa8 2012-11-20 16:15:08.094705 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc3be0] [c000000000095778] .filemap_write_and_wait_range+0x40/0x8c 2012-11-20 16:15:08.094757 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc3c80] [c0000000000fc054] .vfs_fsync_range+0x70/0x104 2012-11-20 16:15:08.096720 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc3d30] [c0000000000fc18c] .do_fsync+0x3c/0x6c 2012-11-20 16:15:08.096787 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc3dc0] [c0000000000fc1fc] .SyS_fsync+0x18/0x28 2012-11-20 16:15:08.096846 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc3e30] [c000000000000580] syscall_exit+0x0/0x2c
The console message that seems to go with this event is:
LustreError: 4037:0:(osc_cache.c:893:osc_extent_wait()) extent c0000002d415ade8@{[8176 -> 8191/8191], [2|0|+|cache|wiuY|c000000383eb0480], [1048576|16|+|-|c0000002d05f01e0|16|(null)]} lsfull-OST00d1-osc-c0000003cbcda000: wait ext to 0 timedout, recovery in progress?
Attached will be a log file,RA0-ID-J01.log, showing all processes on the node. Note that I logged into the node and ran "sync", which also hung, so you will likely see a second backtrace from that in there. That was not part of the original problem.