[LU-437] Client hang with spinning ldlm_bl_* and ll_imp_inval threads Created: 20/Jun/11 Updated: 26/Jul/11 Resolved: 26/Jul/11 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.1.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Christopher Morrone | Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 8546 |
| Description |
|
We have found that it is fairly easy to get a client to hang when running a shared-file IOR such as this:

srun -l -D /p/lcrater2/surya1/tmp -N 56 -n 112 -- /g/g0/surya1/dev/IOR/src/C/IOR -a POSIX -C -e -i 1 -o /p/lcrater2/surya1/tmp/IOR.data -v -w -W -r -R -t 1m -b 1024m

A client node will experience trouble communicating and get evicted:

2011-06-17 13:14:16 LustreError: 11-0: an error occurred while communicating with 10.1.1.48@o2ib9. The obd_ping operation failed with -107
2011-06-17 13:14:16 Lustre: lc2-OST0005-osc-ffff880200164400: Connection to service lc2-OST0005 via nid 10.1.1.48@o2ib9 was lost; in progress operations using this service will wait for recovery to complete.
2011-06-17 13:14:17 LustreError: 167-0: This client was evicted by lc2-OST0005; in progress operations using this service will fail.
2011-06-17 13:14:17 LustreError: 8675:0:(ldlm_request.c:1169:ldlm_cli_cancel_req()) Got rc -11 from cancel RPC: canceling anyway
2011-06-17 13:14:17 LustreError: 8675:0:(ldlm_request.c:1796:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11
2011-06-17 13:18:06 INFO: task ll_imp_inval:8757 blocked for more than 120 seconds.
2011-06-17 13:18:06 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

During this time, two IOR processes are stuck here:

#0 [ffff880105521398] schedule at ffffffff814dabb9
#1 [ffff880105521460] cfs_waitq_wait at ffffffffa034e75e [libcfs]
#2 [ffff880105521470] osc_enter_cache at ffffffffa06c6eda [osc]
#3 [ffff880105521580] osc_queue_async_io at ffffffffa06c7439 [osc]
#4 [ffff880105521690] osc_page_cache_add at ffffffffa06d549a [osc]
#5 [ffff880105521700] cl_page_invoke at ffffffffa0481008 [obdclass]
#6 [ffff880105521770] cl_page_cache_add at ffffffffa0481d88 [obdclass]
#7 [ffff8801055217f0] vvp_io_commit_write at ffffffffa07d2a43 [lustre]
#8 [ffff8801055218a0] cl_io_commit_write at ffffffffa048fb8f [obdclass]
#9 [ffff880105521920] ll_commit_write at ffffffffa07aa12d [lustre]
#10 [ffff8801055219b0] ll_write_end at ffffffffa07c17a0 [lustre]
#11 [ffff8801055219e0] generic_file_buffered_write at ffffffff8110dc74
#12 [ffff880105521ab0] __generic_file_aio_write at ffffffff8110f560
#13 [ffff880105521b70] generic_file_aio_write at ffffffff8110f7ff
#14 [ffff880105521bc0] vvp_io_write_start at ffffffffa07d32d1 [lustre]
#15 [ffff880105521c40] cl_io_start at ffffffffa048cb38 [obdclass]
#16 [ffff880105521ca0] cl_io_loop at ffffffffa0490f40 [obdclass]
#17 [ffff880105521d10] ll_file_io_generic at ffffffffa077b8e3 [lustre]
#18 [ffff880105521dd0] ll_file_aio_write at ffffffffa0780aaf [lustre]
#19 [ffff880105521e60] ll_file_write at ffffffffa0780df1 [lustre]
#20 [ffff880105521ef0] vfs_write at ffffffff811725b8
#21 [ffff880105521f30] sys_write at ffffffff81172ff1
#22 [ffff880105521f80] system_call_fastpath at ffffffff8100b172

The ldlm_bl_* thread backtraces look like this when they are spinning:

2011-06-17 14:27:59 <NMI> [<ffffffff81009d45>] ? show_regs+0x45/0x50
2011-06-17 14:27:59 [<ffffffff814decd9>] ? arch_trigger_all_cpu_backtrace_handler+0x79/0xa0
2011-06-17 14:27:59 [<ffffffff814e05c5>] ? notifier_call_chain+0x55/0x80
2011-06-17 14:27:59 [<ffffffff814e062a>] ? atomic_notifier_call_chain+0x1a/0x20
2011-06-17 14:27:59 [<ffffffff8109409e>] ? notify_die+0x2e/0x30
2011-06-17 14:27:59 [<ffffffff814de233>] ? do_nmi+0x173/0x2b0
2011-06-17 14:27:59 [<ffffffff814ddb40>] ? nmi+0x20/0x30
2011-06-17 14:27:59 [<ffffffff814dd3a8>] ? _spin_lock+0x18/0x30
2011-06-17 14:27:59 <<EOE>> [<ffffffffa0484b99>] ? cl_page_gang_lookup+0x219/0x380 [obdclass]
2011-06-17 14:28:00 [<ffffffffa0483860>] ? cl_page_own+0x0/0x20 [obdclass]
2011-06-17 14:28:00 [<ffffffffa048aed7>] ? cl_lock_page_out+0xb7/0x320 [obdclass]
2011-06-17 14:28:00 [<ffffffffa047fae6>] ? cl_env_peek+0x36/0x110 [obdclass]
2011-06-17 14:28:00 [<ffffffffa06d6d8f>] ? osc_lock_flush+0x4f/0x90 [osc]
2011-06-17 14:28:00 [<ffffffffa06d6e29>] ? osc_lock_cancel+0x59/0x170 [osc]
2011-06-17 14:28:00 [<ffffffffa047fae6>] ? cl_env_peek+0x36/0x110 [obdclass]
2011-06-17 14:28:00 [<ffffffffa0485e9d>] ? cl_lock_cancel0+0x6d/0x130 [obdclass]
2011-06-17 14:28:00 [<ffffffffa04869fb>] ? cl_lock_cancel+0x11b/0x120 [obdclass]
2011-06-17 14:28:00 [<ffffffffa06d7bf8>] ? osc_ldlm_blocking_ast+0x148/0x300 [osc]
2011-06-17 14:28:00 [<ffffffff810dbe15>] ? call_rcu_sched+0x15/0x20
2011-06-17 14:28:00 [<ffffffff810dbe2e>] ? call_rcu+0xe/0x10
2011-06-17 14:28:00 [<ffffffffa054287d>] ? ldlm_handle_bl_callback+0x10d/0x2a0 [ptlrpc]
2011-06-17 14:28:00 [<ffffffffa0542c51>] ? ldlm_bl_thread_main+0x241/0x3d0 [ptlrpc]
2011-06-17 14:28:00 [<ffffffff8105dc20>] ? default_wake_function+0x0/0x20
2011-06-17 14:28:00 [<ffffffff8100c1ca>] ? child_rip+0xa/0x20
2011-06-17 14:28:00 [<ffffffffa0542a10>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
2011-06-17 14:28:00 [<ffffffff8100c1c0>] ? child_rip+0x0/0x20

With one of the runs, the node became "unstuck" after a vvp_io_commit_write() call failed:

2011-06-17 14:36:55 LustreError: 8747:0:(vvp_io.c:987:vvp_io_commit_write()) Write page 28712193 of inode ffff8801d20666b8 failed -4
2011-06-17 14:59:59 mstflint:4325 freeing invalid memtype fe900000-fe910000
2011-06-17 14:59:59 mstflint:4325 freeing invalid memtype fe910000-fea00000
2011-06-17 15:00:02 mstflint:4333 freeing invalid memtype fe900000-fe910000
2011-06-17 15:00:02 Jun 17 15:00:02 zwicky161 kernel: sfit43 ren nai etp e000f900
2011-06-17 15:00:02 mstflint:4333 freeing invalid memtype fe910000-fea00000
2011-06-17 15:00:07 LustreError: 8757:0:(ldlm_resource.c:748:ldlm_resource_complain()) Namespace lc2-OST0005-osc-ffff880200164400 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
2011-06-17 15:00:07 LustreError: 8757:0:(ldlm_resource.c:754:ldlm_resource_complain()) Resource: ffff8801ff756700 (61363/0/0/0) (rc: 1)
2011-06-17 15:00:07 Lustre: lc2-OST0005-osc-ffff880200164400: Connection restored to service lc2-OST0005 using nid 10.1.1.48@o2ib9.
2011-06-17 15:00:07 CPUFREQ: Per core ondemand sysfs interface is deprecated - up_threshold
2011-06-17 15:00:31 LustreError: 11-0: an error occurred while communicating with 10.1.1.48@o2ib9. The obd_ping operation failed with -107
2011-06-17 15:00:31 Lustre: lc2-OST0005-osc-ffff880200164400: Connection to service lc2-OST0005 via nid 10.1.1.48@o2ib9 was lost; in progress operations using this service will wait for recovery to complete.
2011-06-17 15:00:31 LustreError: 167-0: This client was evicted by lc2-OST0005; in progress operations using this service will fail.
2011-06-17 15:00:31 Lustre: lc2-OST0005-osc-ffff880200164400: Connection restored to service lc2-OST0005 using nid 10.1.1.48@o2ib9.

However, I have seen another node that seemed stuck in the same way and was still stuck long after the IOR processes had been killed. |
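For anyone trying to capture the same evidence on a live node, a minimal sketch (assumes a RHEL6-era kernel where /proc/<pid>/stack is available; the thread names are the ones seen above):

# Sample the kernel stacks of the suspect threads a few times; a spinning
# thread shows the same frames each pass while its CPU time keeps growing.
for pass in 1 2 3; do
    for pid in $(pgrep 'ldlm_bl_|ll_imp_inval'); do
        echo "=== $(ps -p $pid -o comm=) (pid $pid) pass $pass"
        cat /proc/$pid/stack
    done
    sleep 5
done
# Alternatively, "echo l > /proc/sysrq-trigger" produces the NMI all-CPU
# backtraces quoted above.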
| Comments |
| Comment by Christopher Morrone [ 20/Jun/11 ] |
|
Representative backtraces for the ldlm_bl_* threads and ll_imp_inval on a node where those threads are spinning and IOR is no longer running:

crash> bt 7322
PID: 7322 TASK: ffff880214c96a80 CPU: 0 COMMAND: "ldlm_bl_01"
#0 [ffff8801ff7ada80] schedule at ffffffff814dabb9
#1 [ffff8801ff7adb48] __cond_resched at ffffffff8105fa7a
#2 [ffff8801ff7adb68] _cond_resched at ffffffff814db480
#3 [ffff8801ff7adb78] cfs_cond_resched at ffffffffa034e6be [libcfs]
#4 [ffff8801ff7adb88] cl_lock_page_out at ffffffffa048af5d [obdclass]
#5 [ffff8801ff7adc38] osc_lock_flush at ffffffffa06d6d8f [osc]
#6 [ffff8801ff7adc88] osc_lock_cancel at ffffffffa06d6e29 [osc]
#7 [ffff8801ff7add08] cl_lock_cancel0 at ffffffffa0485e9d [obdclass]
#8 [ffff8801ff7add58] cl_lock_cancel at ffffffffa04869fb [obdclass]
#9 [ffff8801ff7add98] osc_ldlm_blocking_ast at ffffffffa06d7bf8 [osc]
#10 [ffff8801ff7ade28] ldlm_handle_bl_callback at ffffffffa054287d [ptlrpc]
#11 [ffff8801ff7ade88] ldlm_bl_thread_main at ffffffffa0542c51 [ptlrpc]
#12 [ffff8801ff7adf48] kernel_thread at ffffffff8100c1ca

crash> bt 8727
PID: 8727 TASK: ffff880213586b40 CPU: 0 COMMAND: "ll_imp_inval"
#0 [ffff880189e7b6f0] schedule at ffffffff814dabb9
#1 [ffff880189e7b7b8] __cond_resched at ffffffff8105fa7a
#2 [ffff880189e7b7d8] _cond_resched at ffffffff814db480
#3 [ffff880189e7b7e8] cfs_cond_resched at ffffffffa034e6be [libcfs]
#4 [ffff880189e7b7f8] cl_lock_page_out at ffffffffa048af5d [obdclass]
#5 [ffff880189e7b8a8] osc_lock_flush at ffffffffa06d6d8f [osc]
#6 [ffff880189e7b8f8] osc_lock_cancel at ffffffffa06d6e29 [osc]
#7 [ffff880189e7b978] cl_lock_cancel0 at ffffffffa0485e9d [obdclass]
#8 [ffff880189e7b9c8] cl_lock_cancel at ffffffffa04869fb [obdclass]
#9 [ffff880189e7ba08] osc_ldlm_blocking_ast at ffffffffa06d7bf8 [osc]
#10 [ffff880189e7ba98] ldlm_cancel_callback at ffffffffa0521ba3 [ptlrpc]
#11 [ffff880189e7bab8] ldlm_cli_cancel_local at ffffffffa053b201 [ptlrpc]
#12 [ffff880189e7bb18] ldlm_cli_cancel at ffffffffa053f618 [ptlrpc]
#13 [ffff880189e7bb98] cleanup_resource at ffffffffa052974c [ptlrpc]
#14 [ffff880189e7bc18] ldlm_resource_clean at ffffffffa05298ba [ptlrpc]
#15 [ffff880189e7bc38] cfs_hash_for_each_relax at ffffffffa035dbc3 [libcfs]
#16 [ffff880189e7bcd8] cfs_hash_for_each_nolock at ffffffffa035f2ff [libcfs]
#17 [ffff880189e7bd38] ldlm_namespace_cleanup at ffffffffa05266a9 [ptlrpc]
#18 [ffff880189e7bd78] osc_import_event at ffffffffa06ce0c6 [osc]
#19 [ffff880189e7bdf8] ptlrpc_invalidate_import at ffffffffa058f1d4 [ptlrpc]
#20 [ffff880189e7bed8] ptlrpc_invalidate_import_thread at ffffffffa058f7b5 [ptlrpc]
#21 [ffff880189e7bf48] kernel_thread at ffffffff8100c1ca |
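A way to pull the same thing out of a dump non-interactively, in case it helps collecting from several nodes (a sketch; assumes the crash utility with the matching debuginfo vmlinux, and the paths below are placeholders):

# Batch-mode crash session: backtrace every thread whose command name
# matches the spinning services, then exit.
cat > /tmp/crash.cmds <<'EOF'
foreach ll_imp_inval bt
foreach ldlm_bl_00 bt
foreach ldlm_bl_01 bt
quit
EOF
crash -i /tmp/crash.cmds \
    /usr/lib/debug/lib/modules/$(uname -r)/vmlinux /var/crash/vmcore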
| Comment by Peter Jones [ 20/Jun/11 ] |
|
Oleg Could you please have a look at this one? Thanks Peter |
| Comment by Jinshan Xiong (Inactive) [ 20/Jun/11 ] |
|
From the stack trace, it looks like I/O is being done against the page while the covering lock is being canceled (I'm not sure if they are the same page). Could you please provide me with the crash dump file, if there is nothing classified in it? I'd like to see a crash dump taken while the IOR processes are still running with the two stuck processes. |
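If it helps, a sketch of how to force a dump the next time a node wedges in this state (assumes kdump is configured with crash-kernel memory reserved, and sysrq enabled):

# Panic the hung node on purpose; kdump then writes a vmcore containing
# the full state of the spinning ldlm_bl_*/ll_imp_inval threads.
echo 1 > /proc/sys/kernel/sysrq
echo c > /proc/sysrq-trigger   # node panics; vmcore lands under /var/crash/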
| Comment by Niu Yawei (Inactive) [ 21/Jun/11 ] |
|
Xiong, as Chris described, on some nodes the ldlm_bl_*/ll_imp_inval threads spin because the IOR was stuck in commit_write while the lock was being canceled, but on other nodes the ldlm_bl_*/ll_imp_inval threads were still spinning even after all IOR threads had been killed; it seems those nodes were running into a situation similar to |
| Comment by Jinshan Xiong (Inactive) [ 21/Jun/11 ] |
|
Indeed; actually, I tend to think there are two issues in this bug. This is a bit different from |
| Comment by Prakash Surya (Inactive) [ 21/Jun/11 ] |
|
Xiong, the patch number appears to have been lost. Could you please repost the link or patch number? |
| Comment by Jinshan Xiong (Inactive) [ 21/Jun/11 ] |
|
Hi Chris, |
| Comment by Jinshan Xiong (Inactive) [ 21/Jun/11 ] |
|
Prakash, I realized that patch was a bit old, so I removed the comment while I was rebasing it so that you guys will get the right one. Sorry for the confusion. |
| Comment by Peter Jones [ 22/Jun/11 ] |
|
Jinshan is working on this |
| Comment by Prakash Surya (Inactive) [ 22/Jun/11 ] |
|
Xiong, thank you! Can this patch be applied only to the clients, or do the servers need it as well? |
| Comment by Jinshan Xiong (Inactive) [ 22/Jun/11 ] |
|
Client only, please. |
| Comment by Jinshan Xiong (Inactive) [ 22/Jun/11 ] |
|
BTW, how often can you guys reproduce this problem? |
| Comment by Christopher Morrone [ 23/Jun/11 ] |
|
It is pretty easy to reproduce. |
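A sketch of the kind of loop we use to hit it (assumes a SLURM allocation like the one in the description; the iteration count is arbitrary):

#!/bin/bash
# Rerun the shared-file IOR from the description until a client wedges;
# watch the consoles for hung-task warnings between runs.
for run in $(seq 1 10); do
    echo "== run $run =="
    srun -l -D /p/lcrater2/surya1/tmp -N 56 -n 112 -- \
        /g/g0/surya1/dev/IOR/src/C/IOR -a POSIX -C -e -i 1 \
        -o /p/lcrater2/surya1/tmp/IOR.data -v -w -W -r -R -t 1m -b 1024m \
        || break
done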
| Comment by Jinshan Xiong (Inactive) [ 24/Jun/11 ] |
|
How is the patch? Have you guys had a chance to run it? |
| Comment by Prakash Surya (Inactive) [ 24/Jun/11 ] |
|
I am currently running a 54-node, 108-task IOR with the patch applied. So far so good: 29G of the 108G have been written without any lockup issues yet. I will report back when the IOR completes (or fails). The IOR is identical to the one in the description, except for the number of nodes and tasks being run:

srun -l -D /p/lcrater2/surya1/tmp -N 54 -n 108 -- /g/g0/surya1/dev/IOR/src/C/IOR -a POSIX -C -e -i 1 -o /p/lcrater2/surya1/tmp/IOR.data -v -w -W -r -R -t 1m -b 1024m |
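For reference, the same command with the IOR options annotated (meanings taken from the IOR usage text):

# -a POSIX : use the POSIX read()/write() API
# -C       : reorder task numbering so data is verified by a different node
# -e       : fsync after each write phase
# -i 1     : one repetition
# -w / -W  : write the file, then read it back and verify (write check)
# -r / -R  : read the file, then re-read and verify (read check)
# -t 1m    : 1 MiB transfer size
# -b 1024m : 1 GiB block per task in the shared file
srun -l -D /p/lcrater2/surya1/tmp -N 54 -n 108 -- \
    /g/g0/surya1/dev/IOR/src/C/IOR -a POSIX -C -e -i 1 \
    -o /p/lcrater2/surya1/tmp/IOR.data -v -w -W -r -R -t 1m -b 1024m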
| Comment by Prakash Surya (Inactive) [ 24/Jun/11 ] |
|
It looks like an LBUG was hit on one of the nodes while running the above IOR with the patch applied. A snippet of the console output is below:
2011-06-24 11:38:35 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) lock@ffff8801f8dcdaf8[0 5 0 0 0 00000000] W(2):[13893632, 13901823]@[0x100270000:0x3fc82:0x0] {
2011-06-24 11:38:35 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) lovsub@ffff8801ff7914a0: [0 ffff8801f8dcf678 W(2):[27787520, 27803647]@[0x20000337d:0x1:0x0]]
2011-06-24 11:38:35 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) osc@ffff8801ff792a98: ffff8802153e0d80 40120002 0xf0f29676e28223c6 4 (null) size: 49393172480 mtime: 1308938241 atime: 1308938241 ctime: 1308938241 blocks: 2048
2011-06-24 11:38:35 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) } lock@ffff8801f8dcdaf8
2011-06-24 11:38:35 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) no cover page!
2011-06-24 11:38:35 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) lock@ffff8801f8dcd938[0 5 0 0 0 00000000] W(2):[14024704, 14032895]@[0x100270000:0x3fc82:0x0] {
2011-06-24 11:38:35 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) lovsub@ffff8801ff791460: [0 ffff8801f8dcf610 W(2):[28049664, 28065791]@[0x20000337d:0x1:0x0]]
2011-06-24 11:38:35 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) osc@ffff8801ff7929e0: ffff8801ff6d5000 40100001 0xf0f29676e28223cd 4 (null) size: 49393172480 mtime: 1308938241 atime: 1308938241 ctime: 1308938241 blocks: 2048
2011-06-24 11:38:35 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) } lock@ffff8801f8dcd938
2011-06-24 11:38:35 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) no cover page!
2011-06-24 11:38:35 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) lock@ffff880212888c18[0 5 0 0 0 00000000] W(2):[14032896, 14041087]@[0x100270000:0x3fc82:0x0] {
2011-06-24 11:38:35 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) lovsub@ffff8801ff791260: [0 ffff8801f8dcf2d0 W(2):[28066048, 28082175]@[0x20000337d:0x1:0x0]]
2011-06-24 11:38:35 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) osc@ffff8801ff792088: ffff8802153e0240 40100001 0xf0f29676e28223db 4 (null) size: 56909365248 mtime: 1308938247 atime: 1308938241 ctime: 1308938247 blocks: 231432
2011-06-24 11:38:35 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) } lock@ffff880212888c18
2011-06-24 11:38:35 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) no cover page!
2011-06-24 11:38:35 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) lock@ffff8802128880b8[0 5 0 0 0 00000000] W(2):[13901824, 13910015]@[0x100270000:0x3fc82:0x0] {
2011-06-24 11:38:35 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) lovsub@ffff8801ff7912a0: [0 ffff8801f8dcf338 W(2):[27803904, 27820031]@[0x20000337d:0x1:0x0]]
2011-06-24 11:38:35 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) osc@ffff8801ff792140: ffff8802153e0900 40100001 0xf0f29676e28223e9 4 (null) size: 56909365248 mtime: 1308938247 atime: 1308938241 ctime: 1308938247 blocks: 231432
2011-06-24 11:38:35 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) } lock@ffff8802128880b8
2011-06-24 11:38:35 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) no cover page!
2011-06-24 11:38:35 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) lock@ffff880212888358[0 5 0 0 0 00000000] W(2):[14041088, 14049279]@[0x100270000:0x3fc82:0x0] {
2011-06-24 11:38:35 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) lovsub@ffff8801ff791ba0: [0 ffff8801f8dcfb58 W(2):[28082432, 28098559]@[0x20000337d:0x1:0x0]]
2011-06-24 11:38:35 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) osc@ffff8801ff792928: ffff880203e32b40 40100001 0xf0f29676e28223f7 4 (null) size: 56909365248 mtime: 1308938247 atime: 1308938241 ctime: 1308938247 blocks: 231432
2011-06-24 11:38:35 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) } lock@ffff880212888358
2011-06-24 11:38:35 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) no cover page!
2011-06-24 11:38:35 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) lock@ffff880212888198[0 5 0 0 0 00000000] W(2):[13910016, 13918207]@[0x100270000:0x3fc82:0x0] {
2011-06-24 11:38:35 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) lovsub@ffff8801ff791be0: [0 ffff8801f8dcfaf0 W(2):[27820288, 27836415]@[0x20000337d:0x1:0x0]]
2011-06-24 11:38:35 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) osc@ffff8801ff792870: ffff880203e32900 40100001 0xf0f29676e2822405 4 (null) size: 56909365248 mtime: 1308938247 atime: 1308938241 ctime: 1308938247 blocks: 231432
2011-06-24 11:38:35 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) } lock@ffff880212888198
2011-06-24 11:38:35 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) no cover page!
2011-06-24 11:38:36 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) lock@ffff8801e4f90858[0 5 0 0 0 00000000] W(2):[14049280, 14057471]@[0x100270000:0x3fc82:0x0] {
2011-06-24 11:38:36 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) lovsub@ffff8801ff791a20: [0 ffff8801f8dcf3a0 W(2):[28098816, 28114943]@[0x20000337d:0x1:0x0]]
2011-06-24 11:38:36 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) osc@ffff8801e4f91648: ffff8801e4fd0480 40100001 0xf0f29676e2822413 4 (null) size: 56909365248 mtime: 1308938247 atime: 1308938241 ctime: 1308938247 blocks: 231432
2011-06-24 11:38:36 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) } lock@ffff8801e4f90858
2011-06-24 11:38:36 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) no cover page!
2011-06-24 11:38:36 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) lock@ffff8801e4f90078[0 5 0 0 0 00000000] W(2):[13918208, 13926399]@[0x100270000:0x3fc82:0x0] {
2011-06-24 11:38:36 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) lovsub@ffff8801ff791720: [0 ffff8801f8dcff00 W(2):[27836672, 27852799]@[0x20000337d:0x1:0x0]]
2011-06-24 11:38:36 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) osc@ffff8801e4f91ee8: ffff8801e4fd0000 40100001 0xf0f29676e2822421 4 (null) size: 56909365248 mtime: 1308938247 atime: 1308938241 ctime: 1308938247 blocks: 231432
2011-06-24 11:38:36 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) } lock@ffff8801e4f90078
2011-06-24 11:38:36 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) no cover page!
2011-06-24 11:38:36 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) lock@ffff8801e4f905b8[1 3 0 1 1 00000000] W(2):[13926400, 13934591]@[0x100270000:0x3fc82:0x0] {
2011-06-24 11:38:36 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) lovsub@ffff8801ff7917e0: [0 ffff8801f8dcf470 W(2):[27853056, 27869183]@[0x20000337d:0x1:0x0]]
2011-06-24 11:38:36 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) osc@ffff8801e4f91590: ffff8801d2222d80 40100001 0xf0f29676e282243d 3 ffff880203f71e88 size: 57578356736 mtime: 1308940280 atime: 1308938241 ctime: 1308940280 blocks: 26296408
2011-06-24 11:38:36 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) } lock@ffff8801e4f905b8
2011-06-24 11:38:36 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) no cover page!
2011-06-24 11:38:36 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) lock@ffff8801e4f90238[0 5 0 0 0 00000000] W(2):[14062592, 14070783]@[0x100270000:0x3fc82:0x0] {
2011-06-24 11:38:36 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) lovsub@ffff8801ff791f60: [0 ffff8801f8dcf4d8 W(2):[28125440, 28141567]@[0x20000337d:0x1:0x0]]
2011-06-24 11:38:36 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) osc@ffff8801e4f919e0: ffff8801d22226c0 40100001 0xf0f29676e2822444 4 (null) size: 57600376832 mtime: 1308940695 atime: 1308938241 ctime: 1308940695 blocks: 31336536
2011-06-24 11:38:36 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) } lock@ffff8801e4f90238
2011-06-24 11:38:36 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) no cover page!
2011-06-24 11:38:36 LustreError: 6148:0:(osc_io.c:682:osc_req_attr_set()) page@ffff8801d22e2bc0[1 ffff8801f8d75f08:28119808 ^(null)_ffff8801d22e2b00 2 0 1 (null) ffff880203c7a880 0x0]
2011-06-24 11:38:36 LustreError: 6148:0:(osc_io.c:682:osc_req_attr_set()) page@ffff8801d22e2b00[1 ffff880206f9fe28:14059776 ^ffff8801d22e2bc0_(null) 2 0 1 (null) (null) 0x0]
2011-06-24 11:38:36 LustreError: 6148:0:(osc_io.c:682:osc_req_attr_set()) vvp-page@ffff8801d096f5a0(0:0:0) vm@ffffea000658dcb8 4000000000282d 2:0 ffff8801d22e2bc0 28119808 lru
2011-06-24 11:38:36 LustreError: 6148:0:(osc_io.c:682:osc_req_attr_set()) lov-page@ffff8801d0ac5f78
2011-06-24 11:38:36 LustreError: 6148:0:(osc_io.c:682:osc_req_attr_set()) osc-page@ffff8801d0970ee0: 1< 0x845fed 2 0 - - + > 2< 57588842496 0 0x3 0x420 | (null) ffff8801ff8ea608 ffff8801f8d74d00 ffffffffa06dc2e0 ffff8801d0970ee0 > 3< + ffff880212826a80 1 245863 0 > 4< 0 8 8 32505856 + | + - + - > 5< + - + - | 0 - - | 5888 + +>
2011-06-24 11:38:36 LustreError: 6148:0:(osc_io.c:682:osc_req_attr_set()) end page@ffff8801d22e2b00
2011-06-24 11:38:36 LustreError: 6148:0:(osc_io.c:682:osc_req_attr_set()) dump uncover page!
2011-06-24 11:38:36 Pid: 6148, comm: ptlrpcd-brw
2011-06-24 11:38:36
2011-06-24 11:38:36 Call Trace:
2011-06-24 11:38:36 [<ffffffffa034d855>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
2011-06-24 11:38:36 [<ffffffffa06d9b43>] osc_req_attr_set+0x283/0x2a0 [osc]
2011-06-24 11:38:36 [<ffffffffa07cbbd8>] ? ccc_req_attr_set+0x78/0x150 [lustre]
2011-06-24 11:38:36 [<ffffffffa048c5c1>] cl_req_attr_set+0xd1/0x200 [obdclass]
2011-06-24 11:38:36 [<ffffffffa04913f8>] ? cl_req_alloc+0x188/0x5f0 [obdclass]
2011-06-24 11:38:36 [<ffffffffa06c57ab>] osc_send_oap_rpc+0xc9b/0x1ca0 [osc]
2011-06-24 11:38:36 [<ffffffff81050ba3>] ? enqueue_task_fair+0x43/0x90
2011-06-24 11:38:36 [<ffffffff81057349>] ? enqueue_task+0x79/0x90
2011-06-24 11:38:36 [<ffffffffa06c6a8e>] osc_check_rpcs+0x2de/0x470 [osc]
2011-06-24 11:38:36 [<ffffffffa034e75a>] ? cfs_waitq_signal+0x1a/0x20 [libcfs]
2011-06-24 11:38:36 [<ffffffffa06ba329>] ? osc_wake_cache_waiters+0xf9/0x330 [osc]
2011-06-24 11:38:36 [<ffffffffa06ca886>] brw_interpret+0x2f6/0x710 [osc]
2011-06-24 11:38:36 [<ffffffffa0558346>] ? after_reply+0x7b6/0xe30 [ptlrpc]
2011-06-24 11:38:36 [<ffffffffa055ce7a>] ptlrpc_check_set+0x47a/0x1b20 [ptlrpc]
2011-06-24 11:38:36 [<ffffffff8107a11b>] ? try_to_del_timer_sync+0x7b/0xe0
2011-06-24 11:38:36 [<ffffffffa0590de8>] ptlrpcd_check+0x1c8/0x250 [ptlrpc]
2011-06-24 11:38:36 [<ffffffffa059118b>] ptlrpcd+0x31b/0x390 [ptlrpc]
2011-06-24 11:38:36 [<ffffffff8105dc20>] ? default_wake_function+0x0/0x20
2011-06-24 11:38:36 [<ffffffff8100c1ca>] child_rip+0xa/0x20
2011-06-24 11:38:36 [<ffffffffa0590e70>] ? ptlrpcd+0x0/0x390 [ptlrpc]
2011-06-24 11:38:36 [<ffffffff8100c1c0>] ? child_rip+0x0/0x20
2011-06-24 11:38:36
2011-06-24 11:38:36 LustreError: 6148:0:(osc_io.c:684:osc_req_attr_set()) LBUG
2011-06-24 11:38:36 Pid: 6148, comm: ptlrpcd-brw
2011-06-24 11:38:36 Jun 24 11:38:36 zwicky163 kernel: LustreError: 6148:0:(osc_io.c:684:osc_req_attr_set()) LBUG
2011-06-24 11:38:36 Call Trace:
2011-06-24 11:38:36 [<ffffffffa034d855>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
2011-06-24 11:38:36 [<ffffffffa034de95>] lbug_with_loc+0x75/0xe0 [libcfs]
2011-06-24 11:38:36 [<ffffffffa06d9b5b>] osc_req_attr_set+0x29b/0x2a0 [osc]
2011-06-24 11:38:36 [<ffffffffa07cbbd8>] ? ccc_req_attr_set+0x78/0x150 [lustre]
2011-06-24 11:38:36 [<ffffffffa048c5c1>] cl_req_attr_set+0xd1/0x200 [obdclass]
2011-06-24 11:38:36 [<ffffffffa04913f8>] ? cl_req_alloc+0x188/0x5f0 [obdclass]
2011-06-24 11:38:36 [<ffffffffa06c57ab>] osc_send_oap_rpc+0xc9b/0x1ca0 [osc]
2011-06-24 11:38:36 [<ffffffff81050ba3>] ? enqueue_task_fair+0x43/0x90
2011-06-24 11:38:36 [<ffffffff81057349>] ? enqueue_task+0x79/0x90
2011-06-24 11:38:36 [<ffffffffa06c6a8e>] osc_check_rpcs+0x2de/0x470 [osc]
2011-06-24 11:38:36 [<ffffffffa034e75a>] ? cfs_waitq_signal+0x1a/0x20 [libcfs]
2011-06-24 11:38:36 [<ffffffffa06ba329>] ? osc_wake_cache_waiters+0xf9/0x330 [osc]
2011-06-24 11:38:36 [<ffffffffa06ca886>] brw_interpret+0x2f6/0x710 [osc]
2011-06-24 11:38:36 [<ffffffffa0558346>] ? after_reply+0x7b6/0xe30 [ptlrpc]
2011-06-24 11:38:36 [<ffffffffa055ce7a>] ptlrpc_check_set+0x47a/0x1b20 [ptlrpc]
2011-06-24 11:38:36 [<ffffffff8107a11b>] ? try_to_del_timer_sync+0x7b/0xe0
2011-06-24 11:38:36 [<ffffffffa0590de8>] ptlrpcd_check+0x1c8/0x250 [ptlrpc]
2011-06-24 11:38:36 [<ffffffffa059118b>] ptlrpcd+0x31b/0x390 [ptlrpc]
2011-06-24 11:38:36 [<ffffffff8105dc20>] ? default_wake_function+0x0/0x20
2011-06-24 11:38:36 [<ffffffff8100c1ca>] child_rip+0xa/0x20
2011-06-24 11:38:36 [<ffffffffa0590e70>] ? ptlrpcd+0x0/0x390 [ptlrpc]
2011-06-24 11:38:36 [<ffffffff8100c1c0>] ? child_rip+0x0/0x20
2011-06-24 11:38:36
2011-06-24 11:38:36 Kernel panic - not syncing: LBUG
2011-06-24 11:38:36 Pid: 6148, comm: ptlrpcd-brw Tainted: G ---------------- T 2.6.32-131.2.1.1chaos.ch5.x86_64 #1
2011-06-24 11:38:36 Call Trace:
2011-06-24 11:38:36 Jun 24 11:38:36 zwicky163 kernel: Kernel panic - not syncing: LBUG
2011-06-24 11:38:36 [<ffffffff814da4fe>] ? panic+0x78/0x143
2011-06-24 11:38:36 [<ffffffffa034deeb>] ? lbug_with_loc+0xcb/0xe0 [libcfs]
2011-06-24 11:38:36 [<ffffffffa06d9b5b>] ? osc_req_attr_set+0x29b/0x2a0 [osc]
2011-06-24 11:38:36 [<ffffffffa07cbbd8>] ? ccc_req_attr_set+0x78/0x150 [lustre]
2011-06-24 11:38:36 [<ffffffffa048c5c1>] ? cl_req_attr_set+0xd1/0x200 [obdclass]
2011-06-24 11:38:36 [<ffffffffa04913f8>] ? cl_req_alloc+0x188/0x5f0 [obdclass]
2011-06-24 11:38:36 [<ffffffffa06c57ab>] ? osc_send_oap_rpc+0xc9b/0x1ca0 [osc]
2011-06-24 11:38:36 [<ffffffff81050ba3>] ? enqueue_task_fair+0x43/0x90
2011-06-24 11:38:36 [<ffffffff81057349>] ? enqueue_task+0x79/0x90
2011-06-24 11:38:36 [<ffffffffa06c6a8e>] ? osc_check_rpcs+0x2de/0x470 [osc]
2011-06-24 11:38:36 [<ffffffffa034e75a>] ? cfs_waitq_signal+0x1a/0x20 [libcfs]
2011-06-24 11:38:36 [<ffffffffa06ba329>] ? osc_wake_cache_waiters+0xf9/0x330 [osc]
2011-06-24 11:38:36 [<ffffffffa06ca886>] ? brw_interpret+0x2f6/0x710 [osc]
2011-06-24 11:38:36 [<ffffffffa0558346>] ? after_reply+0x7b6/0xe30 [ptlrpc]
2011-06-24 11:38:36 [<ffffffffa055ce7a>] ? ptlrpc_check_set+0x47a/0x1b20 [ptlrpc]
2011-06-24 11:38:36 [<ffffffff8107a11b>] ? try_to_del_timer_sync+0x7b/0xe0
2011-06-24 11:38:36 [<ffffffffa0590de8>] ? ptlrpcd_check+0x1c8/0x250 [ptlrpc]
2011-06-24 11:38:36 [<ffffffffa059118b>] ? ptlrpcd+0x31b/0x390 [ptlrpc]
2011-06-24 11:38:36 [<ffffffff8105dc20>] ? default_wake_function+0x0/0x20
2011-06-24 11:38:36 [<ffffffff8100c1ca>] ? child_rip+0xa/0x20
2011-06-24 11:38:36 [<ffffffffa0590e70>] ? ptlrpcd+0x0/0x390 [ptlrpc]
2011-06-24 11:38:36 [<ffffffff8100c1c0>] ? child_rip+0x0/0x20
|
| Comment by Jinshan Xiong (Inactive) [ 27/Jun/11 ] |
|
Can you please try patch set 8? Thanks. |
| Comment by Prakash Surya (Inactive) [ 27/Jun/11 ] |
|
Xiong, I will work on testing out patch set 8 today. Thanks! |
| Comment by Jinshan Xiong (Inactive) [ 28/Jun/11 ] |
|
Hi Prakash, I hope it is a good sign that I have not heard about problems with patch set 8. |
| Comment by Prakash Surya (Inactive) [ 28/Jun/11 ] |
|
Hi Xiong, I ran the following IOR (the same as before, but with 57 nodes and 114 tasks) with patch set 8 applied:

srun -l -D /p/lcrater2/surya1/tmp -N 57 -n 114 -- /g/g0/surya1/dev/IOR/src/C/IOR -a POSIX -C -e -i 1 -o /p/lcrater2/surya1/tmp/IOR.data -v -w -W -r -R -t 1m -b 1024m

This time the IOR completed (which is a good sign), but the application reported some errors in the process:

000: Commencing write performance test.
000: Mon Jun 27 16:38:08 2011
000:
000: Verifying contents of the file(s) just written.
000: Mon Jun 27 17:19:58 2011
000:
003: [3] FAILED comparison of buffer containing 8-byte ints:
063: [63] FAILED comparison of buffer containing 8-byte ints:
067: [67] FAILED comparison of buffer containing 8-byte ints:
003: [3] File name = /p/lcrater2/surya1/tmp/IOR.data
063: [63] File name = /p/lcrater2/surya1/tmp/IOR.data
067: [67] File name = /p/lcrater2/surya1/tmp/IOR.data
003: [3] In transfer 1, 131072 errors between buffer indices 0 and 131071.
063: [63] In transfer 1, 131072 errors between buffer indices 0 and 131071.
067: [67] In transfer 1, 131072 errors between buffer indices 0 and 131071.
003: [3] File byte offset = 7516192768:
063: [63] File byte offset = 71940702208:
067: [67] File byte offset = 76235669504:
003: [3] Expected: 0x000000074e091460 0000000000000008 000000074e091460 0000000000000018
063: [63] Expected: 0x000000434e091460 0000000000000008 000000434e091460 0000000000000018
067: [67] Expected: 0x000000474e091460 0000000000000008 000000474e091460 0000000000000018
003: [3] Actual: 0x0000000000000000 0000000000000000 0000000000000000 0000000000000000
063: [63] Actual: 0x0000000000000000 0000000000000000 0000000000000000 0000000000000000
067: [67] Actual: 0x0000000000000000 0000000000000000 0000000000000000 0000000000000000
041: [41] FAILED comparison of buffer containing 8-byte ints:
041: [41] File name = /p/lcrater2/surya1/tmp/IOR.data
041: [41] In transfer 2, 131072 errors between buffer indices 0 and 131071.
041: [41] File byte offset = 48319430656:
041: [41] Expected: 0x0000002d4e091460 0000000000000008 0000002d4e091460 0000000000000018
041: [41] Actual: 0x0000000000000000 0000000000000000 0000000000000000 0000000000000000
051: [51] FAILED comparison of buffer containing 8-byte ints:
051: [51] File name = /p/lcrater2/surya1/tmp/IOR.data
051: [51] In transfer 2, 131072 errors between buffer indices 0 and 131071.
051: [51] File byte offset = 59056848896:
051: [51] Expected: 0x000000374e091460 0000000000000008 000000374e091460 0000000000000018
051: [51] Actual: 0x0000000000000000 0000000000000000 0000000000000000 0000000000000000
075: [75] FAILED comparison of buffer containing 8-byte ints:
075: [75] File name = /p/lcrater2/surya1/tmp/IOR.data
075: [75] In transfer 2, 131072 errors between buffer indices 0 and 131071.
075: [75] File byte offset = 84826652672:
075: [75] Expected: 0x0000004f4e091460 0000000000000008 0000004f4e091460 0000000000000018
075: [75] Actual: 0x0000000000000000 0000000000000000 0000000000000000 0000000000000000
009: [9] FAILED comparison of buffer containing 8-byte ints:
009: [9] File name = /p/lcrater2/surya1/tmp/IOR.data
009: [9] In transfer 2, 131072 errors between buffer indices 0 and 131071.
009: [9] File byte offset = 13959692288:
009: [9] Expected: 0x0000000d4e091460 0000000000000008 0000000d4e091460 0000000000000018
009: [9] Actual: 0x0000000000000000 0000000000000000 0000000000000000 0000000000000000
057: [57] FAILED comparison of buffer containing 8-byte ints:
057: [57] File name = /p/lcrater2/surya1/tmp/IOR.data
057: [57] In transfer 2, 131072 errors between buffer indices 0 and 131071.
057: [57] File byte offset = 65499299840:
057: [57] Expected: 0x0000003d4e091460 0000000000000008 0000003d4e091460 0000000000000018
057: [57] Actual: 0x0000000000000000 0000000000000000 0000000000000000 0000000000000000
000: WARNING: incorrect data on write.
000: 1048576 errors found on write check.
000: Used Time Stamp 1309217888 (0x4e091460) for Data Signature
000: Commencing read performance test.
000: Mon Jun 27 17:28:19 2011
000:
000: Re-reading the file(s) twice to verify that reads are consistent.
000: Mon Jun 27 17:38:14 2011

From the looks of it, it appears that corrupt data was written to the file. The only errors that I see reported by dmesg on the clients are of the form:

LustreError: 11-0: an error occurred while communicating with 10.1.1.212@o2ib9. The mds_connect operation failed with -16

I don't know if that is related to the corrupted data. I am going to rerun the IOR and see if the same errors occur again. |
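As a sanity check on the file itself, the bytes at one of the failing offsets can be dumped directly; a sketch (rank 3's offset 7516192768 from above is exactly 7168 MiB, so a 1 MiB blocksize divides it evenly):

# Read the 1 MiB transfer at rank 3's failing offset and show the first
# few 8-byte words; all-zero output matches the "Actual" values above.
dd if=/p/lcrater2/surya1/tmp/IOR.data bs=1M skip=7168 count=1 2>/dev/null | od -A d -t x8 | head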
| Comment by Prakash Surya (Inactive) [ 28/Jun/11 ] |
|
Similar errors the second time around as well.

$ srun -l -D /p/lcrater2/surya1/tmp -N 57 -n 114 -- /g/g0/surya1/dev/IOR/src/C/IOR -a POSIX -C -e -i 1 -o /p/lcrater2/surya1/tmp/IOR.data -v -w -W -r -R -t 1m -b 1024m
000: IOR-2.10.3: MPI Coordinated Test of Parallel I/O
000:
000: Run began: Tue Jun 28 10:29:13 2011
000: Command line used: /g/g0/surya1/dev/IOR/src/C/IOR -a POSIX -C -e -i 1 -o /p/lcrater2/surya1/tmp/IOR.data -v -w -W -r -R -t 1m -b 1024m
000: Machine: Linux zwicky101
000: Start time skew across all tasks: 0.01 sec
000: Path: /p/lcrater2/surya1/tmp
000: FS: 85.9 TiB Used FS: 2.7% Inodes: 675.2 Mi Used Inodes: 0.0%
000: Participating tasks: 114
000: Using reorderTasks '-C' (expecting block, not cyclic, task assignment)
000:
000: Summary:
000: api = POSIX
000: test filename = /p/lcrater2/surya1/tmp/IOR.data
000: access = single-shared-file
000: pattern = segmented (1 segment)
000: ordering in a file = sequential offsets
000: ordering inter file=constant task offsets = 1
000: clients = 114 (2 per node)
000: repetitions = 1
000: xfersize = 1 MiB
000: blocksize = 1 GiB
000: aggregate filesize = 114 GiB
000:
000: Commencing write performance test.
000: Tue Jun 28 10:29:13 2011
000:
000: Verifying contents of the file(s) just written.
000: Tue Jun 28 11:10:52 2011
000:
083: [83] FAILED comparison of buffer containing 8-byte ints:
083: [83] File name = /p/lcrater2/surya1/tmp/IOR.data
083: [83] In transfer 1, 131072 errors between buffer indices 0 and 131071.
083: [83] File byte offset = 93415538688:
083: [83] Expected: 0x000000574e0a0f69 0000000000000008 000000574e0a0f69 0000000000000018
083: [83] Actual: 0x0000000000000000 0000000000000000 0000000000000000 0000000000000000
063: [63] FAILED comparison of buffer containing 8-byte ints:
063: [63] File name = /p/lcrater2/surya1/tmp/IOR.data
063: [63] In transfer 1, 131072 errors between buffer indices 0 and 131071.
063: [63] File byte offset = 71940702208:
063: [63] Expected: 0x000000434e0a0f69 0000000000000008 000000434e0a0f69 0000000000000018
063: [63] Actual: 0x0000000000000000 0000000000000000 0000000000000000 0000000000000000
057: [57] FAILED comparison of buffer containing 8-byte ints:
057: [57] File name = /p/lcrater2/surya1/tmp/IOR.data
057: [57] In transfer 1, 131072 errors between buffer indices 0 and 131071.
057: [57] File byte offset = 65498251264:
057: [57] Expected: 0x0000003d4e0a0f69 0000000000000008 0000003d4e0a0f69 0000000000000018
057: [57] Actual: 0x0000000000000000 0000000000000000 0000000000000000 0000000000000000
035: [35] FAILED comparison of buffer containing 8-byte ints:
035: [35] File name = /p/lcrater2/surya1/tmp/IOR.data
035: [35] In transfer 2, 131072 errors between buffer indices 0 and 131071.
035: [35] File byte offset = 41876979712:
035: [35] Expected: 0x000000274e0a0f69 0000000000000008 000000274e0a0f69 0000000000000018
035: [35] Actual: 0x0000000000000000 0000000000000000 0000000000000000 0000000000000000
047: [47] FAILED comparison of buffer containing 8-byte ints:
047: [47] File name = /p/lcrater2/surya1/tmp/IOR.data
047: [47] In transfer 1, 131072 errors between buffer indices 0 and 131071.
047: [47] File byte offset = 54760833024:
047: [47] Expected: 0x000000334e0a0f69 0000000000000008 000000334e0a0f69 0000000000000018
047: [47] Actual: 0x0000000000000000 0000000000000000 0000000000000000 0000000000000000
005: [5] FAILED comparison of buffer containing 8-byte ints:
005: [5] File name = /p/lcrater2/surya1/tmp/IOR.data
005: [5] In transfer 2, 131072 errors between buffer indices 0 and 131071.
005: [5] File byte offset = 9664724992:
005: [5] Expected: 0x000000094e0a0f69 0000000000000008 000000094e0a0f69 0000000000000018
005: [5] Actual: 0x0000000000000000 0000000000000000 0000000000000000 0000000000000000
077: [77] FAILED comparison of buffer containing 8-byte ints:
077: [77] File name = /p/lcrater2/surya1/tmp/IOR.data
077: [77] In transfer 2, 131072 errors between buffer indices 0 and 131071.
077: [77] File byte offset = 86974136320:
077: [77] Expected: 0x000000514e0a0f69 0000000000000008 000000514e0a0f69 0000000000000018
077: [77] Actual: 0x0000000000000000 0000000000000000 0000000000000000 0000000000000000
099: [99] FAILED comparison of buffer containing 8-byte ints:
099: [99] File name = /p/lcrater2/surya1/tmp/IOR.data
099: [99] In transfer 2, 131072 errors between buffer indices 0 and 131071.
099: [99] File byte offset = 110596456448:
099: [99] Expected: 0x000000674e0a0f69 0000000000000008 000000674e0a0f69 0000000000000018
099: [99] Actual: 0x0000000000000000 0000000000000000 0000000000000000 0000000000000000
000: WARNING: incorrect data on write.
000: 1048576 errors found on write check.
000: Used Time Stamp 1309282153 (0x4e0a0f69) for Data Signature
000: Commencing read performance test.
000: Tue Jun 28 11:19:21 2011
000:
000: Re-reading the file(s) twice to verify that reads are consistent.
000: Tue Jun 28 11:29:08 2011
000:
000: Operation Max (MiB) Min (MiB) Mean (MiB) Std Dev Max (OPs) Min (OPs) Mean (OPs) Std Dev Mean (s) Op grep #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize
000:
000: --------- --------- --------- ---------- ------- --------- --------- ---------- ------- --------
000: write 46.73 46.73 46.73 0.00 46.73 46.73 46.73 0.00 2498.34136 114 2 1 0 1 1 0 0 1 1073741824 1048576 122406567936 -1 POSIX EXCEL
000: read 198.94 198.94 198.94 0.00 198.94 198.94 198.94 0.00 586.78222 114 2 1 0 1 1 0 0 1 1073741824 1048576 122406567936 -1 POSIX EXCEL
000:
000: Max Write: 46.73 MiB/sec (49.00 MB/sec)
000: Max Read: 198.94 MiB/sec (208.61 MB/sec)
000:
000: Run finished: Tue Jun 28 11:48:17 2011 |
| Comment by Jinshan Xiong (Inactive) [ 28/Jun/11 ] |
|
Can you please show me the output of:

lctl get_param osc.*.stats

on a client node after the IOR is finished? |
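If it is easier to grab these from every client at once, a sketch (assumes pdsh/dshbak are installed; the zwicky node range is a placeholder for the actual allocation, and I believe writing to a stats file resets its counters, so clearing before the run is optional):

# (optional) reset the counters before the run; writing to a stats file
# is assumed to clear it.
pdsh -w 'zwicky[101-157]' 'lctl set_param osc.*.stats=0'
# Snapshot per-OSC RPC stats from all clients after the run.
pdsh -w 'zwicky[101-157]' 'lctl get_param osc.*.stats' | dshbak -c > osc-stats.$(date +%s).log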
| Comment by Prakash Surya (Inactive) [ 28/Jun/11 ] |
zwicky101 surya1 Tue Jun 28 15:23:18 ~/lcrater2 $ lctl get_param osc.*.stats
osc.lc2-OST0000-osc-ffff8802137f2400.stats= snapshot_time 1309299882.657956 secs.usecs req_waittime 3316 samples [usec] 232 46310 1152289 2530547829 req_active 3316 samples [reqs] 1 1 3316 3316 ost_connect 1 samples [usec] 46310 46310 46310 2144616100 ost_statfs 3 samples [usec] 385 442 1245 518313 obd_ping 3312 samples [usec] 232 3453 1104734 385413416
osc.lc2-OST0001-osc-ffff8802137f2400.stats= snapshot_time 1309299882.658011 secs.usecs req_waittime 3316 samples [usec] 193 45735 1113345 2535429681 req_active 3316 samples [reqs] 1 1 3316 3316 ost_connect 1 samples [usec] 45735 45735 45735 2091690225 ost_statfs 3 samples [usec] 419 493 1354 613974 obd_ping 3312 samples [usec] 193 10018 1066256 443125482
osc.lc2-OST0002-osc-ffff8802137f2400.stats= snapshot_time 1309299882.658054 secs.usecs req_waittime 3316 samples [usec] 177 45746 1038581 2415257793 req_active 3316 samples [reqs] 1 1 3316 3316 ost_connect 1 samples [usec] 45746 45746 45746 2092696516 ost_statfs 3 samples [usec] 384 439 1232 507458 obd_ping 3312 samples [usec] 177 4774 991603 322053819
osc.lc2-OST0003-osc-ffff8802137f2400.stats= snapshot_time 1309299882.658092 secs.usecs req_waittime 3316 samples [usec] 200 46818 1059725 2670511047 req_active 3316 samples [reqs] 1 1 3316 3316 ost_connect 1 samples [usec] 46818 46818 46818 2191925124 ost_statfs 3 samples [usec] 373 427 1201 482259 obd_ping 3312 samples [usec] 200 13074 1011706 478103664
osc.lc2-OST0004-osc-ffff8802137f2400.stats= snapshot_time 1309299882.658125 secs.usecs req_waittime 3316 samples [usec] 206 3435 1086167 372512999 req_active 3316 samples [reqs] 1 1 3316 3316 ost_connect 1 samples [usec] 265 265 265 70225 ost_statfs 3 samples [usec] 368 425 1192 475250 obd_ping 3312 samples [usec] 206 3435 1084710 371967524
osc.lc2-OST0005-osc-ffff8802137f2400.stats= snapshot_time 1309299882.658155 secs.usecs req_waittime 8476 samples [usec] 97 39670377 15056301629 176430385177943145 req_active 8476 samples [reqs] 1 10 24957 143493 read_bytes 4203 samples [bytes] 4096 1048576 4306735104 4487419898363904 write_bytes 1041 samples [bytes] 61440 1048576 1073741824 1119937888256000 ost_read 4203 samples [usec] 8908 5630433 5585590934 20111322940376164 ost_write 1041 samples [usec] 38605 39670377 9465777986 156305138947095590 ost_destroy 1 samples [usec] 3727231 3727231 3727231 13892250927361 ost_connect 1 samples [usec] 370 370 370 136900 ost_statfs 3 samples [usec] 362 416 1165 453869 ldlm_cancel 33 samples [usec] 277 1198 16588 8963234 obd_ping 3127 samples [usec] 198 498 987448 316860328
osc.lc2-OST0006-osc-ffff8802137f2400.stats= snapshot_time 1309299882.658199 secs.usecs req_waittime 3316 samples [usec] 163 6424 968734 326049426 req_active 3316 samples [reqs] 1 1 3316 3316 ost_connect 1 samples [usec] 300 300 300 90000 ost_statfs 3 samples [usec] 356 408 1148 440656 obd_ping 3312 samples [usec] 163 6424 967286 325518770
osc.lc2-OST0007-osc-ffff8802137f2400.stats= snapshot_time 1309299882.658233 secs.usecs req_waittime 3316 samples [usec] 177 17188 989042 586525450 req_active 3316 samples [reqs] 1 1 3316 3316 ost_connect 1 samples [usec] 425 425 425 180625 ost_statfs 3 samples [usec] 354 406 1136 431528 obd_ping 3312 samples [usec] 177 17188 987481 585913297
osc.lc2-OST0008-osc-ffff8802137f2400.stats= snapshot_time 1309299882.658267 secs.usecs req_waittime 3316 samples [usec] 189 6429 1045459 375009395 req_active 3316 samples [reqs] 1 1 3316 3316 ost_connect 1 samples [usec] 269 269 269 72361 ost_statfs 3 samples [usec] 349 397 1121 420035 obd_ping 3312 samples [usec] 189 6429 1044069 374516999
osc.lc2-OST0009-osc-ffff8802137f2400.stats= snapshot_time 1309299882.658300 secs.usecs req_waittime 3316 samples [usec] 150 6826 1008007 355073493 req_active 3316 samples [reqs] 1 1 3316 3316 ost_connect 1 samples [usec] 310 310 310 96100 ost_statfs 3 samples [usec] 341 395 1102 406262 obd_ping 3312 samples [usec] 150 6826 1006595 354571131
osc.lc2-OST000a-osc-ffff8802137f2400.stats= snapshot_time 1309299882.658335 secs.usecs req_waittime 3316 samples [usec] 160 4352 939053 288343793 req_active 3316 samples [reqs] 1 1 3316 3316 ost_connect 1 samples [usec] 326 326 326 106276 ost_statfs 3 samples [usec] 340 388 1094 400100 obd_ping 3312 samples [usec] 160 4352 937633 287837417
osc.lc2-OST000b-osc-ffff8802137f2400.stats= snapshot_time 1309299882.658369 secs.usecs req_waittime 3316 samples [usec] 162 17348 960848 576029112 req_active 3316 samples [reqs] 1 1 3316 3316 ost_connect 1 samples [usec] 347 347 347 120409 ost_statfs 3 samples [usec] 332 385 1082 391674 obd_ping 3312 samples [usec] 162 17348 959419 575517029
osc.lc2-OST000c-osc-ffff8802137f2400.stats= snapshot_time 1309299882.658402 secs.usecs req_waittime 3316 samples [usec] 167 6438 1038151 371325901 req_active 3316 samples [reqs] 1 1 3316 3316 ost_connect 1 samples [usec] 295 295 295 87025 ost_statfs 3 samples [usec] 330 378 1065 379233 obd_ping 3312 samples [usec] 167 6438 1036791 370859643
osc.lc2-OST000d-osc-ffff8802137f2400.stats= snapshot_time 1309299882.658435 secs.usecs req_waittime 3316 samples [usec] 130 8380 994059 370428241 req_active 3316 samples [reqs] 1 1 3316 3316 ost_connect 1 samples [usec] 299 299 299 89401 ost_statfs 3 samples [usec] 355 414 1145 438797 obd_ping 3312 samples [usec] 130 8380 992615 369900043
osc.lc2-OST000e-osc-ffff8802137f2400.stats= snapshot_time 1309299882.658468 secs.usecs req_waittime 3316 samples [usec] 146 4535 917308 279019898 req_active 3316 samples [reqs] 1 1 3316 3316 ost_connect 1 samples [usec] 273 273 273 74529 ost_statfs 3 samples [usec] 319 374 1039 361353 obd_ping 3312 samples [usec] 146 4535 915996 278584016
osc.lc2-OST000f-osc-ffff8802137f2400.stats= snapshot_time 1309299882.659207 secs.usecs req_waittime 3316 samples [usec] 165 17499 937685 569375985 req_active 3316 samples [reqs] 1 1 3316 3316 ost_connect 1 samples [usec] 339 339 339 114921 ost_statfs 3 samples [usec] 309 366 1020 348462 obd_ping 3312 samples [usec] 165 17499 936326 568912602
osc.lc2-OST0010-osc-ffff8802137f2400.stats= snapshot_time 1309299882.659255 secs.usecs req_waittime 3316 samples [usec] 166 7813 984872 359184956 req_active 3316 samples [reqs] 1 1 3316 3316 ost_connect 1 samples [usec] 227 227 227 51529 ost_statfs 3 samples [usec] 308 359 996 331986 obd_ping 3312 samples [usec] 166 7813 983649 358801441
osc.lc2-OST0011-osc-ffff8802137f2400.stats= snapshot_time 1309299882.659288 secs.usecs req_waittime 3316 samples [usec] 137 9719 938332 361675802 req_active 3316 samples [reqs] 1 1 3316 3316 ost_connect 1 samples [usec] 275 275 275 75625 ost_statfs 3 samples [usec] 315 338 969 313325 obd_ping 3312 samples [usec] 137 9719 937088 361286852
osc.lc2-OST0012-osc-ffff8802137f2400.stats= snapshot_time 1309299882.659321 secs.usecs req_waittime 3316 samples [usec] 132 4578 859697 248602493 req_active 3316 samples [reqs] 1 1 3316 3316 ost_connect 1 samples [usec] 210 210 210 44100 ost_statfs 3 samples [usec] 306 336 956 305128 obd_ping 3312 samples [usec] 132 4578 858531 248253265
osc.lc2-OST0013-osc-ffff8802137f2400.stats= snapshot_time 1309299882.659354 secs.usecs req_waittime 3316 samples [usec] 145 17520 863264 529479340 req_active 3316 samples [reqs] 1 1 3316 3316 ost_connect 1 samples [usec] 215 215 215 46225 ost_statfs 3 samples [usec] 299 332 944 297594 obd_ping 3312 samples [usec] 145 17520 862105 529135521
osc.lc2-OST0014-osc-ffff8802137f2400.stats= snapshot_time 1309299882.659388 secs.usecs req_waittime 3316 samples [usec] 149 8167 911945 319393295 req_active 3316 samples [reqs] 1 1 3316 3316 ost_connect 1 samples [usec] 289 289 289 83521 ost_statfs 3 samples [usec] 292 326 922 283956 obd_ping 3312 samples [usec] 149 8167 910734 319025818
osc.lc2-OST0015-osc-ffff8802137f2400.stats= snapshot_time 1309299882.659421 secs.usecs req_waittime 3316 samples [usec] 142 9721 876334 327596336 req_active 3316 samples [reqs] 1 1 3316 3316 ost_connect 1 samples [usec] 180 180 180 32400 ost_statfs 3 samples [usec] 285 319 900 270602 obd_ping 3312 samples [usec] 142 9721 875254 327293334
osc.lc2-OST0016-osc-ffff8802137f2400.stats= snapshot_time 1309299882.659454 secs.usecs req_waittime 3316 samples [usec] 129 4572 819079 227868867 req_active 3316 samples [reqs] 1 1 3316 3316 ost_connect 1 samples [usec] 129 129 129 16641 ost_statfs 3 samples [usec] 284 313 892 265650 obd_ping 3312 samples [usec] 142 4572 818058 227586576
osc.lc2-OST0017-osc-ffff8802137f2400.stats= snapshot_time 1309299882.659487 secs.usecs req_waittime 3316 samples [usec] 132 17465 820610 506319946 req_active 3316 samples [reqs] 1 1 3316 3316 ost_connect 1 samples [usec] 233 233 233 54289 ost_statfs 3 samples [usec] 275 308 871 253433 obd_ping 3312 samples [usec] 132 17465 819506 506012224
osc.lc2-OST0018-osc-ffff8802137f2400.stats= snapshot_time 1309299882.659520 secs.usecs req_waittime 3316 samples [usec] 133 7420 871673 288373391 req_active 3316 samples [reqs] 1 1 3316 3316 ost_connect 1 samples [usec] 254 254 254 64516 ost_statfs 3 samples [usec] 274 302 862 248076 obd_ping 3312 samples [usec] 133 7420 870557 288060799
osc.lc2-OST0019-osc-ffff8802137f2400.stats= snapshot_time 1309299882.659553 secs.usecs req_waittime 3316 samples [usec] 137 9752 840189 309446239 req_active 3316 samples [reqs] 1 1 3316 3316 ost_connect 1 samples [usec] 150 150 150 22500 ost_statfs 3 samples [usec] 272 298 849 240629 obd_ping 3312 samples [usec] 137 9752 839190 309183110
osc.lc2-OST001a-osc-ffff8802137f2400.stats= snapshot_time 1309299882.659585 secs.usecs req_waittime 3316 samples [usec] 125 4534 785335 211163457 req_active 3316 samples [reqs] 1 1 3316 3316 ost_connect 1 samples [usec] 148 148 148 21904 ost_statfs 3 samples [usec] 264 290 831 230525 obd_ping 3312 samples [usec] 125 4534 784356 210911028
osc.lc2-OST001b-osc-ffff8802137f2400.stats= snapshot_time 1309299882.659618 secs.usecs req_waittime 3316 samples [usec] 122 17444 786734 489367106 req_active 3316 samples [reqs] 1 1 3316 3316 ost_connect 1 samples [usec] 231 231 231 53361 ost_statfs 3 samples [usec] 260 288 818 223444 obd_ping 3312 samples [usec] 122 17444 785685 489090301
osc.lc2-OST001c-osc-ffff8802137f2400.stats= snapshot_time 1309299882.659652 secs.usecs req_waittime 3316 samples [usec] 141 8770 843230 295343834 req_active 3316 samples [reqs] 1 1 3316 3316 ost_connect 1 samples [usec] 197 197 197 38809 ost_statfs 3 samples [usec] 254 285 808 218102 obd_ping 3312 samples [usec] 141 8770 842225 295086923
osc.lc2-OST001d-osc-ffff8802137f2400.stats= snapshot_time 1309299882.659685 secs.usecs req_waittime 3316 samples [usec] 138 9756 820282 300124948 req_active 3316 samples [reqs] 1 1 3316 3316 ost_connect 1 samples [usec] 153 153 153 23409 ost_statfs 3 samples [usec] 250 278 789 207905 obd_ping 3312 samples [usec] 138 9756 819340 299893634
osc.lc2-OST001e-osc-ffff8802137f2400.stats= snapshot_time 1309299882.659717 secs.usecs req_waittime 3316 samples [usec] 125 4630 770561 205056819 req_active 3316 samples [reqs] 1 1 3316 3316 ost_connect 1 samples [usec] 162 162 162 26244 ost_statfs 3 samples [usec] 241 277 777 201891 obd_ping 3312 samples [usec] 125 4630 769622 204828684
osc.lc2-OST001f-osc-ffff8802137f2400.stats= snapshot_time 1309299882.661220 secs.usecs req_waittime 3316 samples [usec] 118 17421 763153 477120307 req_active 3316 samples [reqs] 1 1 3316 3316 ost_connect 1 samples [usec] 213 213 213 45369 ost_statfs 3 samples [usec] 238 268 759 192477 obd_ping 3312 samples [usec] 118 17421 762181 476882461
osc.lc2-OST0020-osc-ffff8802137f2400.stats= snapshot_time 1309299882.661269 secs.usecs req_waittime 3316 samples [usec] 131 8115 901493 316119947 req_active 3316 samples [reqs] 1 1 3316 3316 ost_connect 1 samples [usec] 203 203 203 41209 ost_statfs 3 samples [usec] 235 262 747 186369 obd_ping 3312 samples [usec] 131 8115 900543 315892369
osc.lc2-OST0021-osc-ffff8802137f2400.stats= snapshot_time 1309299882.661301 secs.usecs req_waittime 3316 samples [usec] 128 9787 875772 329087318 req_active 3316 samples [reqs] 1 1 3316 3316 ost_connect 1 samples [usec] 236 236 236 55696 ost_statfs 3 samples [usec] 237 269 766 196130 obd_ping 3312 samples [usec] 128 9787 874770 328835492
osc.lc2-OST0022-osc-ffff8802137f2400.stats= snapshot_time 1309299882.661331 secs.usecs req_waittime 3316 samples [usec] 121 4611 804496 220759670 req_active 3316 samples [reqs] 1 1 3316 3316 ost_connect 1 samples [usec] 214 214 214 45796 ost_statfs 3 samples [usec] 212 251 691 159929 obd_ping 3312 samples [usec] 121 4611 803591 220553945
osc.lc2-OST0023-osc-ffff8802137f2400.stats= snapshot_time 1309299882.661361 secs.usecs req_waittime 3316 samples [usec] 119 17402 801300 496177606 req_active 3316 samples [reqs] 1 1 3316 3316 ost_connect 1 samples [usec] 186 186 186 34596 ost_statfs 3 samples [usec] 208 239 657 144485 obd_ping 3312 samples [usec] 119 17402 800457 495998525
osc.lc2-OST0024-osc-ffff8802137f2400.stats= snapshot_time 1309299882.661393 secs.usecs req_waittime 3317 samples [usec] 137 2381350 3245697 5671117236501 req_active 3317 samples [reqs] 1 1 3317 3317 ost_destroy 1 samples [usec] 2381350 2381350 2381350 5670827822500 ost_connect 1 samples [usec] 229 229 229 52441 ost_statfs 3 samples [usec] 197 230 636 135390 obd_ping 3312 samples [usec] 137 7748 863482 289226170
osc.lc2-OST0025-osc-ffff8802137f2400.stats= snapshot_time 1309299882.661426 secs.usecs req_waittime 3316 samples [usec] 135 9818 842294 310865850 req_active 3316 samples [reqs] 1 1 3316 3316 ost_connect 1 samples [usec] 150 150 150 22500 ost_statfs 3 samples [usec] 177 228 607 124117 obd_ping 3312 samples [usec] 135 9818 841537 310719233
osc.lc2-OST0026-osc-ffff8802137f2400.stats= snapshot_time 1309299882.661456 secs.usecs req_waittime 3316 samples [usec] 119 4541 777610 206535650 req_active 3316 samples [reqs] 1 1 3316 3316 ost_connect 1 samples [usec] 134 134 134 17956 ost_statfs 3 samples [usec] 200 224 629 132201 obd_ping 3312 samples [usec] 119 4541 776847 206385493
osc.lc2-OST0027-osc-ffff8802137f2400.stats= snapshot_time 1309299882.661487 secs.usecs req_waittime 3317 samples [usec] 121 2518610 3296666 6343880625598 req_active 3317 samples [reqs] 1 1 3317 3317 ost_destroy 1 samples [usec] 2518610 2518610 2518610 6343396332100 ost_connect 1 samples [usec] 217 217 217 47089 ost_statfs 3 samples [usec] 178 199 570 108534 obd_ping 3312 samples [usec] 121 17419 777269 484137875
osc.lc2-OST0028-osc-ffff8802137f2400.stats= snapshot_time 1309299882.661520 secs.usecs req_waittime 3316 samples [usec] 120 7729 818831 265014381 req_active 3316 samples [reqs] 1 1 3316 3316 ost_connect 1 samples [usec] 131 131 131 17161 ost_statfs 3 samples [usec] 163 198 552 102254 obd_ping 3312 samples [usec] 120 7729 818148 264894966
osc.lc2-OST0029-osc-ffff8802137f2400.stats= snapshot_time 1309299882.661550 secs.usecs req_waittime 3316 samples [usec] 113 9753 793692 284883676 req_active 3316 samples [reqs] 1 1 3316 3316 ost_connect 1 samples [usec] 121 121 121 14641 ost_statfs 3 samples [usec] 160 213 564 107450 obd_ping 3312 samples [usec] 113 9753 793007 284761585
osc.lc2-OST002a-osc-ffff8802137f2400.stats= snapshot_time 1309299882.661581 secs.usecs req_waittime 3326 samples [usec] 116 6803600 20867138 90963833301586 req_active 3326 samples [reqs] 1 3 3332 3348 write_bytes 6 samples [bytes] 1048576 1048576 6291456 6597069766656 ost_write 6 samples [usec] 101750 6803600 20111207 90963298587213 ost_destroy 1 samples [usec] 18648 18648 18648 347747904 ost_connect 1 samples [usec] 133 133 133 17689 ost_statfs 3 samples [usec] 151 182 502 84486 obd_ping 3313 samples [usec] 116 4554 735954 186524908
osc.lc2-OST002b-osc-ffff8802137f2400.stats= snapshot_time 1309299882.661617 secs.usecs req_waittime 3316 samples [usec] 109 17414 741926 466887986 req_active 3316 samples [reqs] 1 1 3316 3316 ost_connect 1 samples [usec] 128 128 128 16384 ost_statfs 3 samples [usec] 145 184 484 78906 obd_ping 3312 samples [usec] 109 17414 741314 466792696
osc.lc2-OST002c-osc-ffff8802137f2400.stats= snapshot_time 1309299882.661648 secs.usecs req_waittime 8487 samples [usec] 110 36299959 30049160298 472323496437767736 req_active 8487 samples [reqs] 1 10 33250 232008 read_bytes 4243 samples [bytes] 4096 1048576 4315938816 4482920299364352 write_bytes 1024 samples [bytes] 1048576 1048576 1073741824 1125899906842624 ost_read 4243 samples [usec] 6293 5012439 9474972749 35922925662095665 ost_write 1024 samples [usec] 1098441 36299959 20568875607 436380202545289431 ost_destroy 1 samples [usec] 4512911 4512911 4512911 20366365693921 ost_connect 1 samples [usec] 226 226 226 51076 ost_statfs 3 samples [usec] 150 190 508 86824 ldlm_cancel 32 samples [usec] 129 801 15198 7596796 obd_ping 3117 samples [usec] 126 377 710710 165138154
osc.lc2-OST002d-osc-ffff8802137f2400.stats= snapshot_time 1309299882.662283 secs.usecs req_waittime 3325 samples [usec] 121 3255936 11058585 27761035609091 req_active 3325 samples [reqs] 1 3 3328 3336 write_bytes 5 samples [bytes] 4096 1048576 4198400 4398063288320 ost_write 5 samples [usec] 166817 3255936 10303044 27760458982266 ost_destroy 1 samples [usec] 17814 17814 17814 317338596 ost_connect 1 samples [usec] 121 121 121 14641 ost_statfs 3 samples [usec] 129 187 467 74411 obd_ping 3313 samples [usec] 131 9789 736636 259046452
osc.lc2-OST002e-osc-ffff8802137f2400.stats= snapshot_time 1309299882.662332 secs.usecs req_waittime 8446 samples [usec] 84 37963003 30066919244 474388438312497346 req_active 8446 samples [reqs] 1 10 33078 231330 read_bytes 4192 samples [bytes] 12288 1048576 4307550208 4492471870423040 write_bytes 1024 samples [bytes] 1048576 1048576 1073741824 1125899906842624 ost_read 4192 samples [usec] 20373 5049723 9373760081 36615725280852947 ost_write 1024 samples [usec] 47464 37963003 20687893862 437751873380011142 ost_destroy 1 samples [usec] 4565022 4565022 4565022 20839425860484 ost_connect 1 samples [usec] 124 124 124 15376 ost_statfs 3 samples [usec] 128 158 428 61512 ldlm_cancel 33 samples [usec] 150 1353 16591 9419327 obd_ping 3124 samples [usec] 111 365 647157 137061821
osc.lc2-OST002f-osc-ffff8802137f2400.stats= snapshot_time 1309299882.662379 secs.usecs req_waittime 8482 samples [usec] 111 36672657 28422159725 431738138874474129 req_active 8482 samples [reqs] 1 10 31859 213413 read_bytes 4237 samples [bytes] 4096 1048576 4314890240 4481455950397440 write_bytes 1024 samples [bytes] 1048576 1048576 1073741824 1125899906842624 ost_read 4237 samples [usec] 4287 4905514 8637724795 31806703311501945 ost_write 1024 samples [usec] 1485223 36672657 19780314514 399919619442695064 ost_destroy 1 samples [usec] 3437435 3437435 3437435 11815959379225 ost_connect 1 samples [usec] 114 114 114 12996 ost_statfs 3 samples [usec] 153 183 496 82498 ldlm_cancel 32 samples [usec] 264 613 15063 7237275 obd_ping 3118 samples [usec] 111 3039 642264 143155342 |
| Comment by Jinshan Xiong (Inactive) [ 28/Jun/11 ] |
|
And lctl get_param osc.*.osc_stats as well, thanks. |
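What I am looking for in osc_stats is whether any lockless I/O happened; a quick filter (a sketch; assumes the usual multi-line lctl output, one counter per line):

# Show only OSC devices that recorded nonzero lockless activity.
lctl get_param osc.*.osc_stats | awk '/osc_stats=/ {dev=$1} /^lockless/ && $2+0 != 0 {print dev": "$1" = "$2}'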
| Comment by Prakash Surya (Inactive) [ 28/Jun/11 ] |
zwicky101 surya1 Tue Jun 28 15:37:12 ~/lcrater2 $ lctl get_param osc.*.osc_stats
osc.lc2-OST0000-osc-ffff8802137f2400.osc_stats= snapshot_time: 1309300659.875348 (secs.usecs) lockless_write_bytes 0 lockless_read_bytes 0 lockless_truncate 0
osc.lc2-OST0001-osc-ffff8802137f2400.osc_stats= snapshot_time: 1309300659.875393 (secs.usecs) lockless_write_bytes 0 lockless_read_bytes 0 lockless_truncate 0
osc.lc2-OST0002-osc-ffff8802137f2400.osc_stats= snapshot_time: 1309300659.875415 (secs.usecs) lockless_write_bytes 0 lockless_read_bytes 0 lockless_truncate 0
osc.lc2-OST0003-osc-ffff8802137f2400.osc_stats= snapshot_time: 1309300659.875437 (secs.usecs) lockless_write_bytes 0 lockless_read_bytes 0 lockless_truncate 0
osc.lc2-OST0004-osc-ffff8802137f2400.osc_stats= snapshot_time: 1309300659.875459 (secs.usecs) lockless_write_bytes 0 lockless_read_bytes 0 lockless_truncate 0
osc.lc2-OST0005-osc-ffff8802137f2400.osc_stats= snapshot_time: 1309300659.875480 (secs.usecs) lockless_write_bytes 0 lockless_read_bytes 0 lockless_truncate 0
osc.lc2-OST0006-osc-ffff8802137f2400.osc_stats= snapshot_time: 1309300659.875502 (secs.usecs) lockless_write_bytes 0 lockless_read_bytes 0 lockless_truncate 0
osc.lc2-OST0007-osc-ffff8802137f2400.osc_stats= snapshot_time: 1309300659.875523 (secs.usecs) lockless_write_bytes 0 lockless_read_bytes 0 lockless_truncate 0
osc.lc2-OST0008-osc-ffff8802137f2400.osc_stats= snapshot_time: 1309300659.875545 (secs.usecs) lockless_write_bytes 0 lockless_read_bytes 0 lockless_truncate 0
osc.lc2-OST0009-osc-ffff8802137f2400.osc_stats= snapshot_time: 1309300659.875568 (secs.usecs) lockless_write_bytes 0 lockless_read_bytes 0 lockless_truncate 0
osc.lc2-OST000a-osc-ffff8802137f2400.osc_stats= snapshot_time: 1309300659.875590 (secs.usecs) lockless_write_bytes 0 lockless_read_bytes 0 lockless_truncate 0
osc.lc2-OST000b-osc-ffff8802137f2400.osc_stats= snapshot_time: 1309300659.875611 (secs.usecs) lockless_write_bytes 0 lockless_read_bytes 0 lockless_truncate 0
osc.lc2-OST000c-osc-ffff8802137f2400.osc_stats= snapshot_time: 1309300659.875633 (secs.usecs) lockless_write_bytes 0 lockless_read_bytes 0 lockless_truncate 0
osc.lc2-OST000d-osc-ffff8802137f2400.osc_stats= snapshot_time: 1309300659.875655 (secs.usecs) lockless_write_bytes 0 lockless_read_bytes 0 lockless_truncate 0
osc.lc2-OST000e-osc-ffff8802137f2400.osc_stats= snapshot_time: 1309300659.875677 (secs.usecs) lockless_write_bytes 0 lockless_read_bytes 0 lockless_truncate 0
osc.lc2-OST000f-osc-ffff8802137f2400.osc_stats= snapshot_time: 1309300659.875698 (secs.usecs) lockless_write_bytes 0 lockless_read_bytes 0 lockless_truncate 0
osc.lc2-OST0010-osc-ffff8802137f2400.osc_stats= snapshot_time: 1309300659.875719 (secs.usecs) lockless_write_bytes 0 lockless_read_bytes 0 lockless_truncate 0
osc.lc2-OST0011-osc-ffff8802137f2400.osc_stats= snapshot_time: 1309300659.875741 (secs.usecs) lockless_write_bytes 0 lockless_read_bytes 0 lockless_truncate 0
osc.lc2-OST0012-osc-ffff8802137f2400.osc_stats= snapshot_time: 1309300659.875763 (secs.usecs) lockless_write_bytes 0 lockless_read_bytes 0 lockless_truncate 0
osc.lc2-OST0013-osc-ffff8802137f2400.osc_stats= snapshot_time: 1309300659.875783 (secs.usecs) lockless_write_bytes 0 lockless_read_bytes 0 lockless_truncate 0
osc.lc2-OST0014-osc-ffff8802137f2400.osc_stats= snapshot_time: 1309300659.875805 (secs.usecs) lockless_write_bytes 0 lockless_read_bytes 0 lockless_truncate 0
osc.lc2-OST0015-osc-ffff8802137f2400.osc_stats= snapshot_time: 1309300659.875827 (secs.usecs) lockless_write_bytes 0 lockless_read_bytes 0 lockless_truncate 0
osc.lc2-OST0016-osc-ffff8802137f2400.osc_stats= snapshot_time: 1309300659.875848 (secs.usecs) lockless_write_bytes 0 lockless_read_bytes 0 lockless_truncate 0
osc.lc2-OST0017-osc-ffff8802137f2400.osc_stats= snapshot_time: 1309300659.875868 (secs.usecs) lockless_write_bytes 0 lockless_read_bytes 0 lockless_truncate 0
osc.lc2-OST0018-osc-ffff8802137f2400.osc_stats= snapshot_time: 1309300659.875890 (secs.usecs) lockless_write_bytes 0 lockless_read_bytes 0 lockless_truncate 0
osc.lc2-OST0019-osc-ffff8802137f2400.osc_stats= snapshot_time: 1309300659.875912 (secs.usecs) lockless_write_bytes 0 lockless_read_bytes 0 lockless_truncate 0
osc.lc2-OST001a-osc-ffff8802137f2400.osc_stats= snapshot_time: 1309300659.875932 (secs.usecs) lockless_write_bytes 0 lockless_read_bytes 0 lockless_truncate 0
osc.lc2-OST001b-osc-ffff8802137f2400.osc_stats= snapshot_time: 1309300659.875953 (secs.usecs) lockless_write_bytes 0 lockless_read_bytes 0 lockless_truncate 0
osc.lc2-OST001c-osc-ffff8802137f2400.osc_stats= snapshot_time: 1309300659.875975 (secs.usecs) lockless_write_bytes 0 lockless_read_bytes 0 lockless_truncate 0
osc.lc2-OST001d-osc-ffff8802137f2400.osc_stats= snapshot_time: 1309300659.875997 (secs.usecs) lockless_write_bytes 0 lockless_read_bytes 0 lockless_truncate 0
osc.lc2-OST001e-osc-ffff8802137f2400.osc_stats= snapshot_time: 1309300659.876018 (secs.usecs) lockless_write_bytes 0 lockless_read_bytes 0 lockless_truncate 0
osc.lc2-OST001f-osc-ffff8802137f2400.osc_stats= snapshot_time: 1309300659.876058 (secs.usecs) lockless_write_bytes 0 lockless_read_bytes 0 lockless_truncate 0
osc.lc2-OST0020-osc-ffff8802137f2400.osc_stats= snapshot_time: 1309300659.876081 (secs.usecs) lockless_write_bytes 0 lockless_read_bytes 0 lockless_truncate 0
osc.lc2-OST0021-osc-ffff8802137f2400.osc_stats= snapshot_time: 1309300659.876104 (secs.usecs) lockless_write_bytes 0 lockless_read_bytes 0 lockless_truncate 0
osc.lc2-OST0022-osc-ffff8802137f2400.osc_stats= snapshot_time: 1309300659.876125 (secs.usecs) lockless_write_bytes 0 lockless_read_bytes 0 lockless_truncate 0
osc.lc2-OST0023-osc-ffff8802137f2400.osc_stats= snapshot_time: 1309300659.876145 (secs.usecs) lockless_write_bytes 0 lockless_read_bytes 0 lockless_truncate 0
osc.lc2-OST0024-osc-ffff8802137f2400.osc_stats= snapshot_time: 1309300659.876168 (secs.usecs) lockless_write_bytes 0 lockless_read_bytes 0 lockless_truncate 0
osc.lc2-OST0025-osc-ffff8802137f2400.osc_stats= snapshot_time: 1309300659.876190 (secs.usecs) lockless_write_bytes 0 lockless_read_bytes 0 lockless_truncate 0
osc.lc2-OST0026-osc-ffff8802137f2400.osc_stats= snapshot_time: 1309300659.876210 (secs.usecs) lockless_write_bytes 0 lockless_read_bytes 0 lockless_truncate 0
osc.lc2-OST0027-osc-ffff8802137f2400.osc_stats= snapshot_time: 1309300659.876230 (secs.usecs) lockless_write_bytes 0 lockless_read_bytes 0 lockless_truncate 0
osc.lc2-OST0028-osc-ffff8802137f2400.osc_stats= snapshot_time: 1309300659.876252 (secs.usecs) lockless_write_bytes 0 lockless_read_bytes 0 lockless_truncate 0
osc.lc2-OST0029-osc-ffff8802137f2400.osc_stats= snapshot_time: 1309300659.876273 (secs.usecs) lockless_write_bytes 0 lockless_read_bytes 0 lockless_truncate 0
osc.lc2-OST002a-osc-ffff8802137f2400.osc_stats= snapshot_time: 1309300659.876293 (secs.usecs) lockless_write_bytes 0 lockless_read_bytes 0 lockless_truncate 0
osc.lc2-OST002b-osc-ffff8802137f2400.osc_stats= snapshot_time: 1309300659.876313 (secs.usecs) lockless_write_bytes 0 lockless_read_bytes 0 lockless_truncate 0
osc.lc2-OST002c-osc-ffff8802137f2400.osc_stats= snapshot_time: 1309300659.876335 (secs.usecs) lockless_write_bytes 0 lockless_read_bytes 0 lockless_truncate 0
osc.lc2-OST002d-osc-ffff8802137f2400.osc_stats= snapshot_time: 1309300659.876355 (secs.usecs) lockless_write_bytes 0 lockless_read_bytes 0 lockless_truncate 0
osc.lc2-OST002e-osc-ffff8802137f2400.osc_stats= snapshot_time: 1309300659.876375 (secs.usecs) lockless_write_bytes 0 lockless_read_bytes 0 lockless_truncate 0
osc.lc2-OST002f-osc-ffff8802137f2400.osc_stats= snapshot_time: 1309300659.876395 (secs.usecs) lockless_write_bytes 0 lockless_read_bytes 0 lockless_truncate 0 |
| Comment by Jinshan Xiong (Inactive) [ 29/Jun/11 ] |
|
Hi Prakash, I can't reproduce the data corruption issue in our lab with 10 threads on 5 clients. I still need your help. Can you please try patch set 9 and tell me the results? Thanks. |
| Comment by Prakash Surya (Inactive) [ 29/Jun/11 ] |
|
Hi Xiong, not a problem. I will work on getting that tested. |
| Comment by Prakash Surya (Inactive) [ 30/Jun/11 ] |
|
Xiong, just to give an update: I compiled and installed patchset 9 on the clients and have just kicked off a 57-node IOR. I'll let you know how it goes. |
| Comment by Prakash Surya (Inactive) [ 30/Jun/11 ] |
|
The IOR is still processing the read portion of the test, but the write portion produced similar output as before:

zwicky101 surya1 Thu Jun 30 08:29:10 ~/lcrater2 $ srun -l -D /p/lcrater2/surya1/tmp -N 57 -n 114 -- /g/g0/surya1/dev/IOR/src/C/IOR -a POSIX -C -e -i 1 -o /p/lcrater2/surya1/tmp/IOR.data -v -w -W -r -R -t 1m -b 1024m
[snip..]
000: Commencing write performance test.
000: Thu Jun 30 08:30:54 2011
000:
000: Verifying contents of the file(s) just written.
000: Thu Jun 30 09:12:44 2011
000:
009: [9] FAILED comparison of buffer containing 8-byte ints:
009: [9] File name = /p/lcrater2/surya1/tmp/IOR.data
009: [9] In transfer 1, 131072 errors between buffer indices 0 and 131071.
009: [9] File byte offset = 13958643712:
009: [9] Expected: 0x0000000d4e0c96ae 0000000000000008 0000000d4e0c96ae 0000000000000018
009: [9] Actual: 0x0000000000000000 0000000000000000 0000000000000000 0000000000000000
079: [79] FAILED comparison of buffer containing 8-byte ints:
079: [79] File name = /p/lcrater2/surya1/tmp/IOR.data
079: [79] In transfer 1, 131072 errors between buffer indices 0 and 131071.
079: [79] File byte offset = 89120571392:
079: [79] Expected: 0x000000534e0c96ae 0000000000000008 000000534e0c96ae 0000000000000018
079: [79] Actual: 0x0000000000000000 0000000000000000 0000000000000000 0000000000000000
025: [25] FAILED comparison of buffer containing 8-byte ints:
025: [25] File name = /p/lcrater2/surya1/tmp/IOR.data
025: [25] In transfer 1, 131072 errors between buffer indices 0 and 131071.
025: [25] File byte offset = 31138512896:
025: [25] Expected: 0x0000001d4e0c96ae 0000000000000008 0000001d4e0c96ae 0000000000000018
025: [25] Actual: 0x0000000000000000 0000000000000000 0000000000000000 0000000000000000
047: [47] FAILED comparison of buffer containing 8-byte ints:
047: [47] File name = /p/lcrater2/surya1/tmp/IOR.data
047: [47] In transfer 2, 131072 errors between buffer indices 0 and 131071.
047: [47] File byte offset = 54761881600:
047: [47] Expected: 0x000000334e0c96ae 0000000000000008 000000334e0c96ae 0000000000000018
047: [47] Actual: 0x0000000000000000 0000000000000000 0000000000000000 0000000000000000
021: [21] FAILED comparison of buffer containing 8-byte ints:
021: [21] File name = /p/lcrater2/surya1/tmp/IOR.data
021: [21] In transfer 2, 131072 errors between buffer indices 0 and 131071.
021: [21] File byte offset = 26844594176:
021: [21] Expected: 0x000000194e0c96ae 0000000000000008 000000194e0c96ae 0000000000000018
021: [21] Actual: 0x0000000000000000 0000000000000000 0000000000000000 0000000000000000
033: [33] FAILED comparison of buffer containing 8-byte ints:
033: [33] File name = /p/lcrater2/surya1/tmp/IOR.data
033: [33] In transfer 2, 131072 errors between buffer indices 0 and 131071.
033: [33] File byte offset = 39729496064:
033: [33] Expected: 0x000000254e0c96ae 0000000000000008 000000254e0c96ae 0000000000000018
033: [33] Actual: 0x0000000000000000 0000000000000000 0000000000000000 0000000000000000
085: [85] FAILED comparison of buffer containing 8-byte ints:
085: [85] File name = /p/lcrater2/surya1/tmp/IOR.data
085: [85] In transfer 2, 131072 errors between buffer indices 0 and 131071.
085: [85] File byte offset = 95564070912:
085: [85] Expected: 0x000000594e0c96ae 0000000000000008 000000594e0c96ae 0000000000000018
085: [85] Actual: 0x0000000000000000 0000000000000000 0000000000000000 0000000000000000
095: [95] FAILED comparison of buffer containing 8-byte ints:
095: [95] File name = /p/lcrater2/surya1/tmp/IOR.data
095: [95] In transfer 2, 131072 errors between buffer indices 0 and 131071.
095: [95] File byte offset = 106301489152:
095: [95] Expected: 0x000000634e0c96ae 0000000000000008 000000634e0c96ae 0000000000000018
095: [95] Actual: 0x0000000000000000 0000000000000000 0000000000000000 0000000000000000
000: WARNING: incorrect data on write.
000: 1048576 errors found on write check.
000: Used Time Stamp 1309447854 (0x4e0c96ae) for Data Signature
[snip..] |
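All of the mismatches above show zero-filled blocks where the IOR data signature was expected. A quick sketch for tallying the damage from a captured run, assuming its stdout was saved to ior.out (an assumed filename):

# count failed transfers, then list the affected ranks and file offsets
grep -c 'FAILED comparison' ior.out
grep 'File byte offset' ior.out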
| Comment by Jinshan Xiong (Inactive) [ 01/Jul/11 ] |
|
I've found and fixed this issue; please try patch set 11. Thanks. |
| Comment by Prakash Surya (Inactive) [ 01/Jul/11 ] |
|
Jinshan, great news! I will build and test this patch today. |
| Comment by Ned Bass [ 01/Jul/11 ] |
|
We hit the LBUG in osc_req_attr_set() again this morning. Looks like the client was running patch set 9 based on the version number of the installed RPM. |
| Comment by Prakash Surya (Inactive) [ 05/Jul/11 ] |
|
Hi Jinshan, I have run the IOR with patchset 11 installed three times and have not seen the failed-comparison error or the original hanging issue. However, this error has appeared on each run with ps11 installed:

000: Commencing read performance test.
000: Tue Jul 5 10:35:29 2011
000:
000: Re-reading the file(s) twice to verify that reads are consistent.
000: Tue Jul 5 10:51:32 2011
000:
068: ** error **
068: ERROR in aiori-POSIX.c (line 256): transfer failed.
068: ERROR: Input/output error
068: ** exiting **

The client console logs show many messages of the form:

LustreError: 11-0: an error occurred while communicating with 10.1.1.49@o2ib9. The ost_connect operation failed with -19

and

LustreError: 11-0: an error occurred while communicating with 10.1.1.47@o2ib9. The obd_ping operation failed with -107 |
| Comment by Christopher Morrone [ 05/Jul/11 ] |
|
Prakash, please also update the jira ticket where patchset 11 came from, so they know the state of testing on that patchset.

You should look for eviction messages as well. A user space app should not see an EIO from just ost_connect or obd_ping errors; it should simply hang on whatever operation it is doing until reconnect is successful. Only an eviction should result in the operation being aborted and the app seeing an error.

Try to find the first error message in the grouping. Once something goes wrong, you will certainly see a flood of many different kinds of error messages, and often the first one is the one of interest. |
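A sketch of that triage step, assuming a node's console output was captured to console.log (a hypothetical path):

# surface the earliest Lustre errors and any evictions, in order
grep -n 'LustreError\|evict' console.log | head -20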
| Comment by Prakash Surya (Inactive) [ 05/Jul/11 ] |
|
Sorry, it turns out the previous ost_connect and obd_ping messages were old. Looking at the console output for just today, there were no Lustre or LustreError messages in the logs, and nothing matching the string "evict". However, there are a lot of messages resembling this:

2011-07-05 14:18:30 mstflint:14869 freeing invalid memtype fe900000-fe910000
2011-07-05 14:18:30 mstflint:14869 freeing invalid memtype fe910000-fea00000
2011-07-05 14:18:31 mstflint:14877 freeing invalid memtype fe900000-fe910000
2011-07-05 14:18:31 mstflint:14877 freeing invalid memtype fe910000-fea00000
2011-07-05 14:18:32 CPUFREQ: Per core ondemand sysfs interface is deprecated - up_threshold

but no surrounding Lustre messages like what was seen previously. |
| Comment by Prakash Surya (Inactive) [ 05/Jul/11 ] |
|
I just ran the same IOR but using a file per process, rather than a single shared file, and that IOR completed without reporting any errors. |
| Comment by Jinshan Xiong (Inactive) [ 05/Jul/11 ] |
|
Can you please use `lctl dk > log' to dump the Lustre log when you see the IOR error? |
| Comment by Prakash Surya (Inactive) [ 06/Jul/11 ] |
|
Strange, lctl dk on the client which aborted the application reported:

# zwicky155 /root > lctl dk > log
# zwicky155 /root > cat log
Debug log: 0 lines, 0 kept, 0 dropped, 0 bad.
# zwicky155 /root > cat /proc/sys/lnet/debug
ioctl neterror warning error emerg config console

Perhaps I need to enable more debugging flags? Are there any specific flags that I should enable or disable? |
| Comment by Prakash Surya (Inactive) [ 06/Jul/11 ] |
|
So, with our default debug flags of:

ioctl neterror warning error emerg config console

running:

$ lctl dk > /dev/null
$ $IOR
$ lctl dk > log

none of the clients produce useful output in the log files (i.e. each log only contains "0 lines, 0 kept", etc.). |
| Comment by Jinshan Xiong (Inactive) [ 06/Jul/11 ] |
|
That's strange; Lustre used to moan a lot... Can you please show me the output of `lctl get_param subsystem_debug', and make sure client modules such as mdc and osc are in it? Or you can do `lctl set_param subsystem_debug=-1' to enable all of them. |
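The two commands suggested above, collected as a runnable sequence (these are exactly the calls named in the comment):

# inspect which subsystems may log, then enable all of them
lctl get_param subsystem_debug
lctl set_param subsystem_debug=-1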
| Comment by Prakash Surya (Inactive) [ 06/Jul/11 ] |
|
The clients all report the following for subsystem_debug:

subsystem_debug=
undefined mdc mds osc ost class log llite rpc lnet lnd pinger filter echo ldlm lov lquota lmv sec gss mgc mgs fid fld |
| Comment by Jinshan Xiong (Inactive) [ 06/Jul/11 ] |
|
Do you have daemon_file set? Try `lctl get_param daemon_file'; if it is set, the log is actually dumped to that file. You can use `lctl set_param daemon_file=stop' to disable it so that `lctl dk' works again. |
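The same check-and-disable sequence, spelled out (again, only the commands named above):

# a configured daemon_file diverts debug logs away from `lctl dk`
lctl get_param daemon_file
# disable daemon logging so `lctl dk` works again
lctl set_param daemon_file=stop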
| Comment by Prakash Surya (Inactive) [ 06/Jul/11 ] |
|
No, 'lctl get_param daemon_file' does not produce any output on any of the clients. |
| Comment by Jinshan Xiong (Inactive) [ 06/Jul/11 ] |
|
Then I have no idea; you may want to ask Chris to figure it out. |
| Comment by Christopher Morrone [ 06/Jul/11 ] |
|
If there is nothing on the console, then it's pretty likely there is simply no message in the log at those log levels either; those log levels are pretty conservative.

I would start by adding "net" to the debug level, simply as a sanity check. You WILL see log messages for any network traffic in the lustre kernel log when you run IOR. If the sanity test gives you logs, then set debug to -1 (all logging on) and see if the problem is still reproducible.

I would also recommend getting into the habit of using the form "lctl dk log" instead of "lctl dk > log". The former makes sure the file "log" can actually be created before dumping and clearing the lustre log in memory. If you slip up and try to write to a directory where you don't have permission (or, more likely, it is a read-only filesystem), with the latter command your log is lost forever. The redirection "lctl dk > /dev/null" is fine for dumping to /dev/null to clear the log, though. |
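The suggested procedure, pulled together as a sketch; the debug=+net syntax for adding a single flag is an assumption, while the other commands are as written in the comment:

# sanity check: enable net tracing, clear the stale in-memory log
lctl set_param debug=+net    # assumed syntax for adding one debug flag
lctl dk > /dev/null
# ... run the IOR reproducer ...
lctl dk log                  # fails early if the file "log" cannot be created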
| Comment by Prakash Surya (Inactive) [ 07/Jul/11 ] |
|
Chris, thanks for the tip! I believe Lustre is simply not reporting any information to the logs with our default log level. If I change the log level to something more liberal (i.e. -1), I do end up getting log information from lctl dk.

Jinshan, are there any specific values you would like me to enable in /proc/sys/lnet/debug? I assume enabling all flags would generate too much noise. Once we loosen up the log levels we should start to get some info. |
| Comment by Prakash Surya (Inactive) [ 13/Jul/11 ] |
|
OK, so I have not been able to reproduce this issue as of late. I have run a few (around 6) IORs with the following debug flags:

super warning dlmtrace error emerg ha rpctrace vfstrace config console cache info vfstrace

and also with these flags:

ioctl neterror warning error emerg config console

and have not been able to get the EIO to crop up. The client nodes were all running Lustre 2.0.63 with patchset 11 applied (2.0.63-1chaos-LU437ps11) from the above mentioned gerrit patch (http://review.whamcloud.com/#change,911), and the servers were all running a pristine Lustre 2.0.63-1chaos.

For the time being, I think we are going to let this issue settle and move on to other testing (at least until we run into this again). |
| Comment by Jinshan Xiong (Inactive) [ 13/Jul/11 ] |
|
Hi Prakash,

In that case, let's land this patch and see if others can see this issue. Thanks for your effort.

Jinshan |
| Comment by Prakash Surya (Inactive) [ 13/Jul/11 ] |
|
Ok, I have no objections to that. |
| Comment by Peter Jones [ 26/Jul/11 ] |
|
Should be fixed by |