[LU-437] Client hang with spinning ldlm_bl_* and ll_imp_inval threads Created: 20/Jun/11  Updated: 26/Jul/11  Resolved: 26/Jul/11

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.1.0

Type: Bug Priority: Blocker
Reporter: Christopher Morrone Assignee: Jinshan Xiong (Inactive)
Resolution: Duplicate Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 8546

 Description   

We have found that it is fairly easy to get a client to hang when running a shared-file IOR such as this:

srun -l -D /p/lcrater2/surya1/tmp -N 56 -n 112 -- /g/g0/surya1/dev/IOR/src/C/IOR -a POSIX -C -e -i 1 -o /p/lcrater2/surya1/tmp/IOR.data -v  -w -W -r -R -t 1m -b 1024m
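
Roughly, those IOR options mean the following (standard IOR semantics; the single-shared-file layout and 1 MiB transfer size are confirmed by the IOR summary output later in this ticket):

# -a POSIX : use the POSIX I/O backend
# -C       : reorder tasks so each rank verifies data written from another node
# -e       : fsync after the write phase
# -i 1     : run a single iteration
# -w / -W  : write the file, then immediately read it back to verify (write check)
# -r / -R  : read the file, then verify the read (read check)
# -t 1m    : 1 MiB transfer size
# -b 1024m : 1 GiB block per task, all tasks sharing a single file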

A client node will experience trouble communicating and get evicted:

2011-06-17 13:14:16 LustreError: 11-0: an error occurred while communicating with 10.1.1.48@o2ib9. The obd_ping operation failed with -107
2011-06-17 13:14:16 Lustre: lc2-OST0005-osc-ffff880200164400: Connection to service lc2-OST0005 via nid 10.1.1.48@o2ib9 was lost; in progress operations using this service will wait for recovery to complete.
2011-06-17 13:14:17 LustreError: 167-0: This client was evicted by lc2-OST0005; in progress operations using this service will fail.
2011-06-17 13:14:17 LustreError: 8675:0:(ldlm_request.c:1169:ldlm_cli_cancel_req()) Got rc -11 from cancel RPC: canceling anyway
2011-06-17 13:14:17 LustreError: 8675:0:(ldlm_request.c:1796:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11
2011-06-17 13:18:06 INFO: task ll_imp_inval:8757 blocked for more than 120 seconds.
2011-06-17 13:18:06 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

During this time, two IOR processes are stuck here:

 #0 [ffff880105521398] schedule at ffffffff814dabb9
 #1 [ffff880105521460] cfs_waitq_wait at ffffffffa034e75e [libcfs]
 #2 [ffff880105521470] osc_enter_cache at ffffffffa06c6eda [osc]
 #3 [ffff880105521580] osc_queue_async_io at ffffffffa06c7439 [osc]
 #4 [ffff880105521690] osc_page_cache_add at ffffffffa06d549a [osc]
 #5 [ffff880105521700] cl_page_invoke at ffffffffa0481008 [obdclass]
 #6 [ffff880105521770] cl_page_cache_add at ffffffffa0481d88 [obdclass]
 #7 [ffff8801055217f0] vvp_io_commit_write at ffffffffa07d2a43 [lustre]
 #8 [ffff8801055218a0] cl_io_commit_write at ffffffffa048fb8f [obdclass]
 #9 [ffff880105521920] ll_commit_write at ffffffffa07aa12d [lustre]
#10 [ffff8801055219b0] ll_write_end at ffffffffa07c17a0 [lustre]
#11 [ffff8801055219e0] generic_file_buffered_write at ffffffff8110dc74
#12 [ffff880105521ab0] __generic_file_aio_write at ffffffff8110f560
#13 [ffff880105521b70] generic_file_aio_write at ffffffff8110f7ff
#14 [ffff880105521bc0] vvp_io_write_start at ffffffffa07d32d1 [lustre]
#15 [ffff880105521c40] cl_io_start at ffffffffa048cb38 [obdclass]
#16 [ffff880105521ca0] cl_io_loop at ffffffffa0490f40 [obdclass]
#17 [ffff880105521d10] ll_file_io_generic at ffffffffa077b8e3 [lustre]
#18 [ffff880105521dd0] ll_file_aio_write at ffffffffa0780aaf [lustre]
#19 [ffff880105521e60] ll_file_write at ffffffffa0780df1 [lustre]
#20 [ffff880105521ef0] vfs_write at ffffffff811725b8
#21 [ffff880105521f30] sys_write at ffffffff81172ff1
#22 [ffff880105521f80] system_call_fastpath at ffffffff8100b172

The ldlm_bl_* thread backtraces look like this when they are spinning:

2011-06-17 14:27:59  <NMI>  [<ffffffff81009d45>] ? show_regs+0x45/0x50
2011-06-17 14:27:59  [<ffffffff814decd9>] ? arch_trigger_all_cpu_backtrace_handler+0x79/0xa0
2011-06-17 14:27:59  [<ffffffff814e05c5>] ? notifier_call_chain+0x55/0x80
2011-06-17 14:27:59  [<ffffffff814e062a>] ? atomic_notifier_call_chain+0x1a/0x20
2011-06-17 14:27:59  [<ffffffff8109409e>] ? notify_die+0x2e/0x30
2011-06-17 14:27:59  [<ffffffff814de233>] ? do_nmi+0x173/0x2b0
2011-06-17 14:27:59  [<ffffffff814ddb40>] ? nmi+0x20/0x30
2011-06-17 14:27:59  [<ffffffff814dd3a8>] ? _spin_lock+0x18/0x30
2011-06-17 14:27:59  <<EOE>>  [<ffffffffa0484b99>] ? cl_page_gang_lookup+0x219/0x380 [obdclass]
2011-06-17 14:28:00  [<ffffffffa0483860>] ? cl_page_own+0x0/0x20 [obdclass]
2011-06-17 14:28:00  [<ffffffffa048aed7>] ? cl_lock_page_out+0xb7/0x320 [obdclass]
2011-06-17 14:28:00  [<ffffffffa047fae6>] ? cl_env_peek+0x36/0x110 [obdclass]
2011-06-17 14:28:00  [<ffffffffa06d6d8f>] ? osc_lock_flush+0x4f/0x90 [osc]
2011-06-17 14:28:00  [<ffffffffa06d6e29>] ? osc_lock_cancel+0x59/0x170 [osc]
2011-06-17 14:28:00  [<ffffffffa047fae6>] ? cl_env_peek+0x36/0x110 [obdclass]
2011-06-17 14:28:00  [<ffffffffa0485e9d>] ? cl_lock_cancel0+0x6d/0x130 [obdclass]
2011-06-17 14:28:00  [<ffffffffa04869fb>] ? cl_lock_cancel+0x11b/0x120 [obdclass]
2011-06-17 14:28:00  [<ffffffffa06d7bf8>] ? osc_ldlm_blocking_ast+0x148/0x300 [osc]
2011-06-17 14:28:00  [<ffffffff810dbe15>] ? call_rcu_sched+0x15/0x20
2011-06-17 14:28:00  [<ffffffff810dbe2e>] ? call_rcu+0xe/0x10
2011-06-17 14:28:00  [<ffffffffa054287d>] ? ldlm_handle_bl_callback+0x10d/0x2a0 [ptlrpc]
2011-06-17 14:28:00  [<ffffffffa0542c51>] ? ldlm_bl_thread_main+0x241/0x3d0 [ptlrpc]
2011-06-17 14:28:00  [<ffffffff8105dc20>] ? default_wake_function+0x0/0x20
2011-06-17 14:28:00  [<ffffffff8100c1ca>] ? child_rip+0xa/0x20
2011-06-17 14:28:00  [<ffffffffa0542a10>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
2011-06-17 14:28:00  [<ffffffff8100c1c0>] ? child_rip+0x0/0x20

With one of the runs, the node became "unstuck" after a vvp_io_commit_write() call failed:

2011-06-17 14:36:55 LustreError: 8747:0:(vvp_io.c:987:vvp_io_commit_write()) Write page 28712193 of inode ffff8801d20666b8 failed -4
2011-06-17 14:59:59 mstflint:4325 freeing invalid memtype fe900000-fe910000
2011-06-17 14:59:59 mstflint:4325 freeing invalid memtype fe910000-fea00000
2011-06-17 15:00:02 mstflint:4333 freeing invalid memtype fe900000-fe910000
2011-06-17 15:00:02 mstflint:4333 freeing invalid memtype fe910000-fea00000
2011-06-17 15:00:07 LustreError: 8757:0:(ldlm_resource.c:748:ldlm_resource_complain()) Namespace lc2-OST0005-osc-ffff880200164400 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
2011-06-17 15:00:07 LustreError: 8757:0:(ldlm_resource.c:754:ldlm_resource_complain()) Resource: ffff8801ff756700 (61363/0/0/0) (rc: 1)
2011-06-17 15:00:07 Lustre: lc2-OST0005-osc-ffff880200164400: Connection restored to service lc2-OST0005 using nid 10.1.1.48@o2ib9.
2011-06-17 15:00:07 CPUFREQ: Per core ondemand sysfs interface is deprecated - up_threshold
2011-06-17 15:00:31 LustreError: 11-0: an error occurred while communicating with 10.1.1.48@o2ib9. The obd_ping operation failed with -107
2011-06-17 15:00:31 Lustre: lc2-OST0005-osc-ffff880200164400: Connection to service lc2-OST0005 via nid 10.1.1.48@o2ib9 was lost; in progress operations using this service will wait for recovery to complete.
2011-06-17 15:00:31 LustreError: 167-0: This client was evicted by lc2-OST0005; in progress operations using this service will fail.
2011-06-17 15:00:31 Lustre: lc2-OST0005-osc-ffff880200164400: Connection restored to service lc2-OST0005 using nid 10.1.1.48@o2ib9.

I have also seen another node that seemed stuck in the same way, and it remained stuck long after the IOR processes had been killed.



 Comments   
Comment by Christopher Morrone [ 20/Jun/11 ]

Representative backtraces for the ldlm_bl_* threads and ll_imp_inval on a node where those threads are spinning and IOR is no longer running:

crash> bt 7322
PID: 7322   TASK: ffff880214c96a80  CPU: 0   COMMAND: "ldlm_bl_01"
 #0 [ffff8801ff7ada80] schedule at ffffffff814dabb9
 #1 [ffff8801ff7adb48] __cond_resched at ffffffff8105fa7a
 #2 [ffff8801ff7adb68] _cond_resched at ffffffff814db480
 #3 [ffff8801ff7adb78] cfs_cond_resched at ffffffffa034e6be [libcfs]
 #4 [ffff8801ff7adb88] cl_lock_page_out at ffffffffa048af5d [obdclass]
 #5 [ffff8801ff7adc38] osc_lock_flush at ffffffffa06d6d8f [osc]
 #6 [ffff8801ff7adc88] osc_lock_cancel at ffffffffa06d6e29 [osc]
 #7 [ffff8801ff7add08] cl_lock_cancel0 at ffffffffa0485e9d [obdclass]
 #8 [ffff8801ff7add58] cl_lock_cancel at ffffffffa04869fb [obdclass]
 #9 [ffff8801ff7add98] osc_ldlm_blocking_ast at ffffffffa06d7bf8 [osc]
#10 [ffff8801ff7ade28] ldlm_handle_bl_callback at ffffffffa054287d [ptlrpc]
#11 [ffff8801ff7ade88] ldlm_bl_thread_main at ffffffffa0542c51 [ptlrpc]
#12 [ffff8801ff7adf48] kernel_thread at ffffffff8100c1ca
crash> bt 8727
PID: 8727   TASK: ffff880213586b40  CPU: 0   COMMAND: "ll_imp_inval"
 #0 [ffff880189e7b6f0] schedule at ffffffff814dabb9
 #1 [ffff880189e7b7b8] __cond_resched at ffffffff8105fa7a
 #2 [ffff880189e7b7d8] _cond_resched at ffffffff814db480
 #3 [ffff880189e7b7e8] cfs_cond_resched at ffffffffa034e6be [libcfs]
 #4 [ffff880189e7b7f8] cl_lock_page_out at ffffffffa048af5d [obdclass]
 #5 [ffff880189e7b8a8] osc_lock_flush at ffffffffa06d6d8f [osc]
 #6 [ffff880189e7b8f8] osc_lock_cancel at ffffffffa06d6e29 [osc]
 #7 [ffff880189e7b978] cl_lock_cancel0 at ffffffffa0485e9d [obdclass]
 #8 [ffff880189e7b9c8] cl_lock_cancel at ffffffffa04869fb [obdclass]
 #9 [ffff880189e7ba08] osc_ldlm_blocking_ast at ffffffffa06d7bf8 [osc]
#10 [ffff880189e7ba98] ldlm_cancel_callback at ffffffffa0521ba3 [ptlrpc]
#11 [ffff880189e7bab8] ldlm_cli_cancel_local at ffffffffa053b201 [ptlrpc]
#12 [ffff880189e7bb18] ldlm_cli_cancel at ffffffffa053f618 [ptlrpc]
#13 [ffff880189e7bb98] cleanup_resource at ffffffffa052974c [ptlrpc]
#14 [ffff880189e7bc18] ldlm_resource_clean at ffffffffa05298ba [ptlrpc]
#15 [ffff880189e7bc38] cfs_hash_for_each_relax at ffffffffa035dbc3 [libcfs]
#16 [ffff880189e7bcd8] cfs_hash_for_each_nolock at ffffffffa035f2ff [libcfs]
#17 [ffff880189e7bd38] ldlm_namespace_cleanup at ffffffffa05266a9 [ptlrpc]
#18 [ffff880189e7bd78] osc_import_event at ffffffffa06ce0c6 [osc]
#19 [ffff880189e7bdf8] ptlrpc_invalidate_import at ffffffffa058f1d4 [ptlrpc]
#20 [ffff880189e7bed8] ptlrpc_invalidate_import_thread at ffffffffa058f7b5 [ptlrpc]
#21 [ffff880189e7bf48] kernel_thread at ffffffff8100c1ca
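
If it is useful for triage, the same information can be captured for every task in one pass from the dump; in crash, something like the following (the output file name is arbitrary):

crash> foreach bt > bt.all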
Comment by Peter Jones [ 20/Jun/11 ]

Oleg

Could you please have a look at this one?

Thanks

Peter

Comment by Jinshan Xiong (Inactive) [ 20/Jun/11 ]

From the stack traces, it looks like I/O is being done against a page while the covering lock is being canceled (I'm not sure whether they are the same page). Can you please provide me the crash dump file, if there is nothing classified in it? I'd like to see a crash dump taken while the IOR processes are still running with the two stuck processes.

Comment by Niu Yawei (Inactive) [ 21/Jun/11 ]

Xiong, as Chris described, on some nodes the ldlm_bl_*/ll_imp_inval threads spin because the IOR was stuck in commit_write while the lock was being canceled; but on other nodes, the ldlm_bl_*/ll_imp_inval threads were still spinning even after all IOR threads had been killed. It seems those nodes ran into a situation similar to LU-416?

Comment by Jinshan Xiong (Inactive) [ 21/Jun/11 ]

Indeed, I tend to think there are two issues in this bug:
1. A busy lock was deemed free; this is why an I/O can use the lock while it is being freed.
2. As for the first comment, it is clear that those two processes entered an infinite loop, but it is not clear whether they are working on the same lock.

This is a bit different from LU-416, but we may be able to fix both bugs in one shot.

Comment by Prakash Surya (Inactive) [ 21/Jun/11 ]

Xiong, the number of the patch appears to have been lost. Could you please repost the link or patch number?

Comment by Jinshan Xiong (Inactive) [ 21/Jun/11 ]

Hi Chris,
Can you please try the patch at http://review.whamcloud.com/#change,911 and tell me the result? That patch is supposed to help reduce contention on ->coh_page_guard.
Thanks,
Jinshan

Comment by Jinshan Xiong (Inactive) [ 21/Jun/11 ]

Prakash,

I realized that patch was a bit old, so I removed the comment while I was rebasing it so that you guys would get the right one.

Sorry for the confusion.

Comment by Peter Jones [ 22/Jun/11 ]

Jinshan is working on this

Comment by Prakash Surya (Inactive) [ 22/Jun/11 ]

Xiong, thank you! Can this patch be applied only to the clients, or do the servers need it as well?

Comment by Jinshan Xiong (Inactive) [ 22/Jun/11 ]

Client only, please.

Comment by Jinshan Xiong (Inactive) [ 22/Jun/11 ]

BTW, how often can you guys reproduce this problem?

Comment by Christopher Morrone [ 23/Jun/11 ]

It is pretty easy to reproduce.

Comment by Jinshan Xiong (Inactive) [ 24/Jun/11 ]

How is the patch doing? Have you guys had a chance to run it?

Comment by Prakash Surya (Inactive) [ 24/Jun/11 ]

I am currently running a 54-node, 108-task IOR with the patch applied. So far so good: 29G of the 108G have been written without any lockup issues yet. I will report back when the IOR completes (or fails).

IOR is identical to the one in the description, except for the number of nodes and tasks being run.

srun -l -D /p/lcrater2/surya1/tmp -N 54 -n 108 -- /g/g0/surya1/dev/IOR/src/C/IOR -a POSIX -C -e -i 1 -o /p/lcrater2/surya1/tmp/IOR.data -v  -w -W -r -R -t 1m -b 1024m
Comment by Prakash Surya (Inactive) [ 24/Jun/11 ]

It looks like an LBUG was hit on one of the nodes while running the above IOR with the patch applied. A snippet of the console output is below:

2011-06-24 11:38:35 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) lock@ffff8801f8dcdaf8[0 5 0 0 0 00000000] W(2):[13893632, 13901823]@[0x100270000:0x3fc82:0x0] {
2011-06-24 11:38:35 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set())     lovsub@ffff8801ff7914a0: [0 ffff8801f8dcf678 W(2):[27787520, 27803647]@[0x20000337d:0x1:0x0]] 
2011-06-24 11:38:35 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set())     osc@ffff8801ff792a98: ffff8802153e0d80 40120002 0xf0f29676e28223c6 4 (null) size: 49393172480 mtime: 1308938241 atime: 1308938241 ctime: 1308938241 blocks: 2048
2011-06-24 11:38:35 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) } lock@ffff8801f8dcdaf8
2011-06-24 11:38:35 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) no cover page!
2011-06-24 11:38:35 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) lock@ffff8801f8dcd938[0 5 0 0 0 00000000] W(2):[14024704, 14032895]@[0x100270000:0x3fc82:0x0] {
2011-06-24 11:38:35 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set())     lovsub@ffff8801ff791460: [0 ffff8801f8dcf610 W(2):[28049664, 28065791]@[0x20000337d:0x1:0x0]] 
2011-06-24 11:38:35 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set())     osc@ffff8801ff7929e0: ffff8801ff6d5000 40100001 0xf0f29676e28223cd 4 (null) size: 49393172480 mtime: 1308938241 atime: 1308938241 ctime: 1308938241 blocks: 2048
2011-06-24 11:38:35 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) } lock@ffff8801f8dcd938
2011-06-24 11:38:35 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) no cover page!
2011-06-24 11:38:35 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) lock@ffff880212888c18[0 5 0 0 0 00000000] W(2):[14032896, 14041087]@[0x100270000:0x3fc82:0x0] {
2011-06-24 11:38:35 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set())     lovsub@ffff8801ff791260: [0 ffff8801f8dcf2d0 W(2):[28066048, 28082175]@[0x20000337d:0x1:0x0]] 
2011-06-24 11:38:35 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set())     osc@ffff8801ff792088: ffff8802153e0240 40100001 0xf0f29676e28223db 4 (null) size: 56909365248 mtime: 1308938247 atime: 1308938241 ctime: 1308938247 blocks: 231432
2011-06-24 11:38:35 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) } lock@ffff880212888c18
2011-06-24 11:38:35 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) no cover page!
2011-06-24 11:38:35 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) lock@ffff8802128880b8[0 5 0 0 0 00000000] W(2):[13901824, 13910015]@[0x100270000:0x3fc82:0x0] {
2011-06-24 11:38:35 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set())     lovsub@ffff8801ff7912a0: [0 ffff8801f8dcf338 W(2):[27803904, 27820031]@[0x20000337d:0x1:0x0]] 
2011-06-24 11:38:35 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set())     osc@ffff8801ff792140: ffff8802153e0900 40100001 0xf0f29676e28223e9 4 (null) size: 56909365248 mtime: 1308938247 atime: 1308938241 ctime: 1308938247 blocks: 231432
2011-06-24 11:38:35 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) } lock@ffff8802128880b8
2011-06-24 11:38:35 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) no cover page!
2011-06-24 11:38:35 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) lock@ffff880212888358[0 5 0 0 0 00000000] W(2):[14041088, 14049279]@[0x100270000:0x3fc82:0x0] {
2011-06-24 11:38:35 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set())     lovsub@ffff8801ff791ba0: [0 ffff8801f8dcfb58 W(2):[28082432, 28098559]@[0x20000337d:0x1:0x0]] 
2011-06-24 11:38:35 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set())     osc@ffff8801ff792928: ffff880203e32b40 40100001 0xf0f29676e28223f7 4 (null) size: 56909365248 mtime: 1308938247 atime: 1308938241 ctime: 1308938247 blocks: 231432
2011-06-24 11:38:35 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) } lock@ffff880212888358
2011-06-24 11:38:35 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) no cover page!
2011-06-24 11:38:35 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) lock@ffff880212888198[0 5 0 0 0 00000000] W(2):[13910016, 13918207]@[0x100270000:0x3fc82:0x0] {
2011-06-24 11:38:35 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set())     lovsub@ffff8801ff791be0: [0 ffff8801f8dcfaf0 W(2):[27820288, 27836415]@[0x20000337d:0x1:0x0]] 
2011-06-24 11:38:35 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set())     osc@ffff8801ff792870: ffff880203e32900 40100001 0xf0f29676e2822405 4 (null) size: 56909365248 mtime: 1308938247 atime: 1308938241 ctime: 1308938247 blocks: 231432
2011-06-24 11:38:35 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) } lock@ffff880212888198
2011-06-24 11:38:35 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) no cover page!
2011-06-24 11:38:36 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) lock@ffff8801e4f90858[0 5 0 0 0 00000000] W(2):[14049280, 14057471]@[0x100270000:0x3fc82:0x0] {
2011-06-24 11:38:36 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set())     lovsub@ffff8801ff791a20: [0 ffff8801f8dcf3a0 W(2):[28098816, 28114943]@[0x20000337d:0x1:0x0]] 
2011-06-24 11:38:36 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set())     osc@ffff8801e4f91648: ffff8801e4fd0480 40100001 0xf0f29676e2822413 4 (null) size: 56909365248 mtime: 1308938247 atime: 1308938241 ctime: 1308938247 blocks: 231432
2011-06-24 11:38:36 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) } lock@ffff8801e4f90858
2011-06-24 11:38:36 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) no cover page!
2011-06-24 11:38:36 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) lock@ffff8801e4f90078[0 5 0 0 0 00000000] W(2):[13918208, 13926399]@[0x100270000:0x3fc82:0x0] {
2011-06-24 11:38:36 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set())     lovsub@ffff8801ff791720: [0 ffff8801f8dcff00 W(2):[27836672, 27852799]@[0x20000337d:0x1:0x0]] 
2011-06-24 11:38:36 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set())     osc@ffff8801e4f91ee8: ffff8801e4fd0000 40100001 0xf0f29676e2822421 4 (null) size: 56909365248 mtime: 1308938247 atime: 1308938241 ctime: 1308938247 blocks: 231432
2011-06-24 11:38:36 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) } lock@ffff8801e4f90078
2011-06-24 11:38:36 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) no cover page!
2011-06-24 11:38:36 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) lock@ffff8801e4f905b8[1 3 0 1 1 00000000] W(2):[13926400, 13934591]@[0x100270000:0x3fc82:0x0] {
2011-06-24 11:38:36 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set())     lovsub@ffff8801ff7917e0: [0 ffff8801f8dcf470 W(2):[27853056, 27869183]@[0x20000337d:0x1:0x0]] 
2011-06-24 11:38:36 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set())     osc@ffff8801e4f91590: ffff8801d2222d80 40100001 0xf0f29676e282243d 3 ffff880203f71e88 size: 57578356736 mtime: 1308940280 atime: 1308938241 ctime: 1308940280 blocks: 26296408
2011-06-24 11:38:36 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) } lock@ffff8801e4f905b8
2011-06-24 11:38:36 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) no cover page!
2011-06-24 11:38:36 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) lock@ffff8801e4f90238[0 5 0 0 0 00000000] W(2):[14062592, 14070783]@[0x100270000:0x3fc82:0x0] {
2011-06-24 11:38:36 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set())     lovsub@ffff8801ff791f60: [0 ffff8801f8dcf4d8 W(2):[28125440, 28141567]@[0x20000337d:0x1:0x0]] 
2011-06-24 11:38:36 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set())     osc@ffff8801e4f919e0: ffff8801d22226c0 40100001 0xf0f29676e2822444 4 (null) size: 57600376832 mtime: 1308940695 atime: 1308938241 ctime: 1308940695 blocks: 31336536
2011-06-24 11:38:36 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) } lock@ffff8801e4f90238
2011-06-24 11:38:36 LustreError: 6148:0:(osc_io.c:680:osc_req_attr_set()) no cover page!
2011-06-24 11:38:36 LustreError: 6148:0:(osc_io.c:682:osc_req_attr_set()) page@ffff8801d22e2bc0[1 ffff8801f8d75f08:28119808 ^(null)_ffff8801d22e2b00 2 0 1 (null) ffff880203c7a880 0x0]
2011-06-24 11:38:36 LustreError: 6148:0:(osc_io.c:682:osc_req_attr_set()) page@ffff8801d22e2b00[1 ffff880206f9fe28:14059776 ^ffff8801d22e2bc0_(null) 2 0 1 (null) (null) 0x0]
2011-06-24 11:38:36 LustreError: 6148:0:(osc_io.c:682:osc_req_attr_set()) vvp-page@ffff8801d096f5a0(0:0:0) vm@ffffea000658dcb8 4000000000282d 2:0 ffff8801d22e2bc0 28119808 lru
2011-06-24 11:38:36 LustreError: 6148:0:(osc_io.c:682:osc_req_attr_set()) lov-page@ffff8801d0ac5f78
2011-06-24 11:38:36 LustreError: 6148:0:(osc_io.c:682:osc_req_attr_set()) osc-page@ffff8801d0970ee0: 1< 0x845fed 2 0 - - + > 2< 57588842496 0 0x3 0x420 | (null) ffff8801ff8ea608 ffff8801f8d74d00 ffffffffa06dc2e0 ffff8801d0970ee0 > 3< + ffff880212826a80 1 245863 0 > 4< 0 8 8 32505856 + | + - + - > 5< + - + - | 0 - - | 5888 + +>
2011-06-24 11:38:36 LustreError: 6148:0:(osc_io.c:682:osc_req_attr_set()) end page@ffff8801d22e2b00
2011-06-24 11:38:36 LustreError: 6148:0:(osc_io.c:682:osc_req_attr_set()) dump uncover page!
2011-06-24 11:38:36 Pid: 6148, comm: ptlrpcd-brw
2011-06-24 11:38:36 
2011-06-24 11:38:36 Call Trace:
2011-06-24 11:38:36  [<ffffffffa034d855>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
2011-06-24 11:38:36  [<ffffffffa06d9b43>] osc_req_attr_set+0x283/0x2a0 [osc]
2011-06-24 11:38:36  [<ffffffffa07cbbd8>] ? ccc_req_attr_set+0x78/0x150 [lustre]
2011-06-24 11:38:36  [<ffffffffa048c5c1>] cl_req_attr_set+0xd1/0x200 [obdclass]
2011-06-24 11:38:36  [<ffffffffa04913f8>] ? cl_req_alloc+0x188/0x5f0 [obdclass]
2011-06-24 11:38:36  [<ffffffffa06c57ab>] osc_send_oap_rpc+0xc9b/0x1ca0 [osc]
2011-06-24 11:38:36  [<ffffffff81050ba3>] ? enqueue_task_fair+0x43/0x90
2011-06-24 11:38:36  [<ffffffff81057349>] ? enqueue_task+0x79/0x90
2011-06-24 11:38:36  [<ffffffffa06c6a8e>] osc_check_rpcs+0x2de/0x470 [osc]
2011-06-24 11:38:36  [<ffffffffa034e75a>] ? cfs_waitq_signal+0x1a/0x20 [libcfs]
2011-06-24 11:38:36  [<ffffffffa06ba329>] ? osc_wake_cache_waiters+0xf9/0x330 [osc]
2011-06-24 11:38:36  [<ffffffffa06ca886>] brw_interpret+0x2f6/0x710 [osc]
2011-06-24 11:38:36  [<ffffffffa0558346>] ? after_reply+0x7b6/0xe30 [ptlrpc]
2011-06-24 11:38:36  [<ffffffffa055ce7a>] ptlrpc_check_set+0x47a/0x1b20 [ptlrpc]
2011-06-24 11:38:36  [<ffffffff8107a11b>] ? try_to_del_timer_sync+0x7b/0xe0
2011-06-24 11:38:36  [<ffffffffa0590de8>] ptlrpcd_check+0x1c8/0x250 [ptlrpc]
2011-06-24 11:38:36  [<ffffffffa059118b>] ptlrpcd+0x31b/0x390 [ptlrpc]
2011-06-24 11:38:36  [<ffffffff8105dc20>] ? default_wake_function+0x0/0x20
2011-06-24 11:38:36  [<ffffffff8100c1ca>] child_rip+0xa/0x20
2011-06-24 11:38:36  [<ffffffffa0590e70>] ? ptlrpcd+0x0/0x390 [ptlrpc]
2011-06-24 11:38:36  [<ffffffff8100c1c0>] ? child_rip+0x0/0x20
2011-06-24 11:38:36 
2011-06-24 11:38:36 LustreError: 6148:0:(osc_io.c:684:osc_req_attr_set()) LBUG
2011-06-24 11:38:36 Pid: 6148, comm: ptlrpcd-brw
2011-06-24 11:38:36 Call Trace:
2011-06-24 11:38:36  [<ffffffffa034d855>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
2011-06-24 11:38:36  [<ffffffffa034de95>] lbug_with_loc+0x75/0xe0 [libcfs]
2011-06-24 11:38:36  [<ffffffffa06d9b5b>] osc_req_attr_set+0x29b/0x2a0 [osc]
2011-06-24 11:38:36  [<ffffffffa07cbbd8>] ? ccc_req_attr_set+0x78/0x150 [lustre]
2011-06-24 11:38:36  [<ffffffffa048c5c1>] cl_req_attr_set+0xd1/0x200 [obdclass]
2011-06-24 11:38:36  [<ffffffffa04913f8>] ? cl_req_alloc+0x188/0x5f0 [obdclass]
2011-06-24 11:38:36  [<ffffffffa06c57ab>] osc_send_oap_rpc+0xc9b/0x1ca0 [osc]
2011-06-24 11:38:36  [<ffffffff81050ba3>] ? enqueue_task_fair+0x43/0x90
2011-06-24 11:38:36  [<ffffffff81057349>] ? enqueue_task+0x79/0x90
2011-06-24 11:38:36  [<ffffffffa06c6a8e>] osc_check_rpcs+0x2de/0x470 [osc]
2011-06-24 11:38:36  [<ffffffffa034e75a>] ? cfs_waitq_signal+0x1a/0x20 [libcfs]
2011-06-24 11:38:36  [<ffffffffa06ba329>] ? osc_wake_cache_waiters+0xf9/0x330 [osc]
2011-06-24 11:38:36  [<ffffffffa06ca886>] brw_interpret+0x2f6/0x710 [osc]
2011-06-24 11:38:36  [<ffffffffa0558346>] ? after_reply+0x7b6/0xe30 [ptlrpc]
2011-06-24 11:38:36  [<ffffffffa055ce7a>] ptlrpc_check_set+0x47a/0x1b20 [ptlrpc]
2011-06-24 11:38:36  [<ffffffff8107a11b>] ? try_to_del_timer_sync+0x7b/0xe0
2011-06-24 11:38:36  [<ffffffffa0590de8>] ptlrpcd_check+0x1c8/0x250 [ptlrpc]
2011-06-24 11:38:36  [<ffffffffa059118b>] ptlrpcd+0x31b/0x390 [ptlrpc]
2011-06-24 11:38:36  [<ffffffff8105dc20>] ? default_wake_function+0x0/0x20
2011-06-24 11:38:36  [<ffffffff8100c1ca>] child_rip+0xa/0x20
2011-06-24 11:38:36  [<ffffffffa0590e70>] ? ptlrpcd+0x0/0x390 [ptlrpc]
2011-06-24 11:38:36  [<ffffffff8100c1c0>] ? child_rip+0x0/0x20
2011-06-24 11:38:36 
2011-06-24 11:38:36 Kernel panic - not syncing: LBUG
2011-06-24 11:38:36 Pid: 6148, comm: ptlrpcd-brw Tainted: G           ---------------- T 2.6.32-131.2.1.1chaos.ch5.x86_64 #1
2011-06-24 11:38:36 Call Trace:
2011-06-24 11:38:36  [<ffffffff814da4fe>] ? panic+0x78/0x143
2011-06-24 11:38:36  [<ffffffffa034deeb>] ? lbug_with_loc+0xcb/0xe0 [libcfs]
2011-06-24 11:38:36  [<ffffffffa06d9b5b>] ? osc_req_attr_set+0x29b/0x2a0 [osc]
2011-06-24 11:38:36  [<ffffffffa07cbbd8>] ? ccc_req_attr_set+0x78/0x150 [lustre]
2011-06-24 11:38:36  [<ffffffffa048c5c1>] ? cl_req_attr_set+0xd1/0x200 [obdclass]
2011-06-24 11:38:36  [<ffffffffa04913f8>] ? cl_req_alloc+0x188/0x5f0 [obdclass]
2011-06-24 11:38:36  [<ffffffffa06c57ab>] ? osc_send_oap_rpc+0xc9b/0x1ca0 [osc]
2011-06-24 11:38:36  [<ffffffff81050ba3>] ? enqueue_task_fair+0x43/0x90
2011-06-24 11:38:36  [<ffffffff81057349>] ? enqueue_task+0x79/0x90
2011-06-24 11:38:36  [<ffffffffa06c6a8e>] ? osc_check_rpcs+0x2de/0x470 [osc]
2011-06-24 11:38:36  [<ffffffffa034e75a>] ? cfs_waitq_signal+0x1a/0x20 [libcfs]
2011-06-24 11:38:36  [<ffffffffa06ba329>] ? osc_wake_cache_waiters+0xf9/0x330 [osc]
2011-06-24 11:38:36  [<ffffffffa06ca886>] ? brw_interpret+0x2f6/0x710 [osc]
2011-06-24 11:38:36  [<ffffffffa0558346>] ? after_reply+0x7b6/0xe30 [ptlrpc]
2011-06-24 11:38:36  [<ffffffffa055ce7a>] ? ptlrpc_check_set+0x47a/0x1b20 [ptlrpc]
2011-06-24 11:38:36  [<ffffffff8107a11b>] ? try_to_del_timer_sync+0x7b/0xe0
2011-06-24 11:38:36  [<ffffffffa0590de8>] ? ptlrpcd_check+0x1c8/0x250 [ptlrpc]
2011-06-24 11:38:36  [<ffffffffa059118b>] ? ptlrpcd+0x31b/0x390 [ptlrpc]
2011-06-24 11:38:36  [<ffffffff8105dc20>] ? default_wake_function+0x0/0x20
2011-06-24 11:38:36  [<ffffffff8100c1ca>] ? child_rip+0xa/0x20
2011-06-24 11:38:36  [<ffffffffa0590e70>] ? ptlrpcd+0x0/0x390 [ptlrpc]
2011-06-24 11:38:36  [<ffffffff8100c1c0>] ? child_rip+0x0/0x20
Comment by Jinshan Xiong (Inactive) [ 27/Jun/11 ]

Can you please try patch set 8? Thanks.

Comment by Prakash Surya (Inactive) [ 27/Jun/11 ]

Xiong, I will work on testing out patch set 8 today. Thanks!

Comment by Jinshan Xiong (Inactive) [ 28/Jun/11 ]

Hi Prakash, I hope that not hearing about problems with patch set 8 is a good sign.

Comment by Prakash Surya (Inactive) [ 28/Jun/11 ]

Hi Xiong,

I ran the following IOR (same as before, but with 57 nodes and 114 tasks) with patch set 8 applied:

srun -l -D /p/lcrater2/surya1/tmp -N 57 -n 114 -- /g/g0/surya1/dev/IOR/src/C/IOR -a POSIX -C -e -i 1 -o /p/lcrater2/surya1/tmp/IOR.data -v  -w -W -r -R -t 1m -b 1024m

This time the IOR completed (which is a good sign), but some errors were reported by the application in the process:

000: Commencing write performance test.
000: Mon Jun 27 16:38:08 2011
000:
000: Verifying contents of the file(s) just written.
000: Mon Jun 27 17:19:58 2011
000:
003: [3] FAILED comparison of buffer containing 8-byte ints:
063: [63] FAILED comparison of buffer containing 8-byte ints:
067: [67] FAILED comparison of buffer containing 8-byte ints:
003: [3]   File name = /p/lcrater2/surya1/tmp/IOR.data
063: [63]   File name = /p/lcrater2/surya1/tmp/IOR.data
067: [67]   File name = /p/lcrater2/surya1/tmp/IOR.data
003: [3]   In transfer 1, 131072 errors between buffer indices 0 and 131071.
063: [63]   In transfer 1, 131072 errors between buffer indices 0 and 131071.
067: [67]   In transfer 1, 131072 errors between buffer indices 0 and 131071.
003: [3]   File byte offset = 7516192768:
063: [63]   File byte offset = 71940702208:
067: [67]   File byte offset = 76235669504:
003: [3]     Expected: 0x000000074e091460 0000000000000008 000000074e091460 0000000000000018
063: [63]     Expected: 0x000000434e091460 0000000000000008 000000434e091460 0000000000000018
067: [67]     Expected: 0x000000474e091460 0000000000000008 000000474e091460 0000000000000018
003: [3]     Actual:   0x0000000000000000 0000000000000000 0000000000000000 0000000000000000
063: [63]     Actual:   0x0000000000000000 0000000000000000 0000000000000000 0000000000000000
067: [67]     Actual:   0x0000000000000000 0000000000000000 0000000000000000 0000000000000000
041: [41] FAILED comparison of buffer containing 8-byte ints:
041: [41]   File name = /p/lcrater2/surya1/tmp/IOR.data
041: [41]   In transfer 2, 131072 errors between buffer indices 0 and 131071.
041: [41]   File byte offset = 48319430656:
041: [41]     Expected: 0x0000002d4e091460 0000000000000008 0000002d4e091460 0000000000000018
041: [41]     Actual:   0x0000000000000000 0000000000000000 0000000000000000 0000000000000000
051: [51] FAILED comparison of buffer containing 8-byte ints:
051: [51]   File name = /p/lcrater2/surya1/tmp/IOR.data
051: [51]   In transfer 2, 131072 errors between buffer indices 0 and 131071.
051: [51]   File byte offset = 59056848896:
051: [51]     Expected: 0x000000374e091460 0000000000000008 000000374e091460 0000000000000018
051: [51]     Actual:   0x0000000000000000 0000000000000000 0000000000000000 0000000000000000
075: [75] FAILED comparison of buffer containing 8-byte ints:
075: [75]   File name = /p/lcrater2/surya1/tmp/IOR.data
075: [75]   In transfer 2, 131072 errors between buffer indices 0 and 131071.
075: [75]   File byte offset = 84826652672:
075: [75]     Expected: 0x0000004f4e091460 0000000000000008 0000004f4e091460 0000000000000018
075: [75]     Actual:   0x0000000000000000 0000000000000000 0000000000000000 0000000000000000
009: [9] FAILED comparison of buffer containing 8-byte ints:
009: [9]   File name = /p/lcrater2/surya1/tmp/IOR.data
009: [9]   In transfer 2, 131072 errors between buffer indices 0 and 131071.
009: [9]   File byte offset = 13959692288:
009: [9]     Expected: 0x0000000d4e091460 0000000000000008 0000000d4e091460 0000000000000018
009: [9]     Actual:   0x0000000000000000 0000000000000000 0000000000000000 0000000000000000
057: [57] FAILED comparison of buffer containing 8-byte ints:
057: [57]   File name = /p/lcrater2/surya1/tmp/IOR.data
057: [57]   In transfer 2, 131072 errors between buffer indices 0 and 131071.
057: [57]   File byte offset = 65499299840:
057: [57]     Expected: 0x0000003d4e091460 0000000000000008 0000003d4e091460 0000000000000018
057: [57]     Actual:   0x0000000000000000 0000000000000000 0000000000000000 0000000000000000
000: WARNING: incorrect data on write.
000:          1048576 errors found on write check.
000: Used Time Stamp 1309217888 (0x4e091460) for Data Signature
000: Commencing read performance test.
000: Mon Jun 27 17:28:19 2011
000: 
000: Re-reading the file(s) twice to verify that reads are consistent.
000: Mon Jun 27 17:38:14 2011

From the looks of it, corrupt data was written to the file. The only errors I see reported by dmesg on the clients are of the form:

LustreError: 11-0: an error occurred while communicating with 10.1.1.212@o2ib9. The mds_connect operation failed with -16

I don't know if that is related to the corrupted data.

I am going to rerun the IOR and see if the same errors occur again.

Comment by Prakash Surya (Inactive) [ 28/Jun/11 ]

Similar errors the second time around as well.

$ srun -l -D /p/lcrater2/surya1/tmp -N 57 -n 114 -- /g/g0/surya1/dev/IOR/src/C/IOR -a POSIX -C -e -i 1 -o /p/lcrater2/surya1/tmp/IOR.data -v  -w -W -r -R -t 1m -b 1024m

000: IOR-2.10.3: MPI Coordinated Test of Parallel I/O
000: 
000: Run began: Tue Jun 28 10:29:13 2011
000: Command line used: /g/g0/surya1/dev/IOR/src/C/IOR -a POSIX -C -e -i 1 -o /p/lcrater2/surya1/tmp/IOR.data -v -w -W -r -R -t 1m -b 1024m
000: Machine: Linux zwicky101
000: Start time skew across all tasks: 0.01 sec
000: Path: /p/lcrater2/surya1/tmp
000: FS: 85.9 TiB   Used FS: 2.7%   Inodes: 675.2 Mi   Used Inodes: 0.0%
000: Participating tasks: 114
000: Using reorderTasks '-C' (expecting block, not cyclic, task assignment)
000: 
000: Summary:
000:    api                = POSIX
000:    test filename      = /p/lcrater2/surya1/tmp/IOR.data
000:    access             = single-shared-file
000:    pattern            = segmented (1 segment)
000:    ordering in a file = sequential offsets
000:    ordering inter file=constant task offsets = 1
000:    clients            = 114 (2 per node)
000:    repetitions        = 1
000:    xfersize           = 1 MiB
000:    blocksize          = 1 GiB
000:    aggregate filesize = 114 GiB
000: 
000: Commencing write performance test.
000: Tue Jun 28 10:29:13 2011
000: 
000: Verifying contents of the file(s) just written.
000: Tue Jun 28 11:10:52 2011
000: 
083: [83] FAILED comparison of buffer containing 8-byte ints:
083: [83]   File name = /p/lcrater2/surya1/tmp/IOR.data
083: [83]   In transfer 1, 131072 errors between buffer indices 0 and 131071.
083: [83]   File byte offset = 93415538688:
083: [83]     Expected: 0x000000574e0a0f69 0000000000000008 000000574e0a0f69 0000000000000018 
083: [83]     Actual:   0x0000000000000000 0000000000000000 0000000000000000 0000000000000000 
063: [63] FAILED comparison of buffer containing 8-byte ints:
063: [63]   File name = /p/lcrater2/surya1/tmp/IOR.data
063: [63]   In transfer 1, 131072 errors between buffer indices 0 and 131071.
063: [63]   File byte offset = 71940702208:
063: [63]     Expected: 0x000000434e0a0f69 0000000000000008 000000434e0a0f69 0000000000000018 
063: [63]     Actual:   0x0000000000000000 0000000000000000 0000000000000000 0000000000000000 
057: [57] FAILED comparison of buffer containing 8-byte ints:
057: [57]   File name = /p/lcrater2/surya1/tmp/IOR.data
057: [57]   In transfer 1, 131072 errors between buffer indices 0 and 131071.
057: [57]   File byte offset = 65498251264:
057: [57]     Expected: 0x0000003d4e0a0f69 0000000000000008 0000003d4e0a0f69 0000000000000018 
057: [57]     Actual:   0x0000000000000000 0000000000000000 0000000000000000 0000000000000000 
035: [35] FAILED comparison of buffer containing 8-byte ints:
035: [35]   File name = /p/lcrater2/surya1/tmp/IOR.data
035: [35]   In transfer 2, 131072 errors between buffer indices 0 and 131071.
035: [35]   File byte offset = 41876979712:
035: [35]     Expected: 0x000000274e0a0f69 0000000000000008 000000274e0a0f69 0000000000000018 
035: [35]     Actual:   0x0000000000000000 0000000000000000 0000000000000000 0000000000000000 
047: [47] FAILED comparison of buffer containing 8-byte ints:
047: [47]   File name = /p/lcrater2/surya1/tmp/IOR.data
047: [47]   In transfer 1, 131072 errors between buffer indices 0 and 131071.
047: [47]   File byte offset = 54760833024:
047: [47]     Expected: 0x000000334e0a0f69 0000000000000008 000000334e0a0f69 0000000000000018 
047: [47]     Actual:   0x0000000000000000 0000000000000000 0000000000000000 0000000000000000 
005: [5] FAILED comparison of buffer containing 8-byte ints:
005: [5]   File name = /p/lcrater2/surya1/tmp/IOR.data
005: [5]   In transfer 2, 131072 errors between buffer indices 0 and 131071.
005: [5]   File byte offset = 9664724992:
005: [5]     Expected: 0x000000094e0a0f69 0000000000000008 000000094e0a0f69 0000000000000018 
005: [5]     Actual:   0x0000000000000000 0000000000000000 0000000000000000 0000000000000000 
077: [77] FAILED comparison of buffer containing 8-byte ints:
077: [77]   File name = /p/lcrater2/surya1/tmp/IOR.data
077: [77]   In transfer 2, 131072 errors between buffer indices 0 and 131071.
077: [77]   File byte offset = 86974136320:
077: [77]     Expected: 0x000000514e0a0f69 0000000000000008 000000514e0a0f69 0000000000000018 
077: [77]     Actual:   0x0000000000000000 0000000000000000 0000000000000000 0000000000000000 
099: [99] FAILED comparison of buffer containing 8-byte ints:
099: [99]   File name = /p/lcrater2/surya1/tmp/IOR.data
099: [99]   In transfer 2, 131072 errors between buffer indices 0 and 131071.
099: [99]   File byte offset = 110596456448:
099: [99]     Expected: 0x000000674e0a0f69 0000000000000008 000000674e0a0f69 0000000000000018 
099: [99]     Actual:   0x0000000000000000 0000000000000000 0000000000000000 0000000000000000 
000: WARNING: incorrect data on write.
000:          1048576 errors found on write check.
000: Used Time Stamp 1309282153 (0x4e0a0f69) for Data Signature
000: Commencing read performance test.
000: Tue Jun 28 11:19:21 2011
000: 
000: Re-reading the file(s) twice to verify that reads are consistent.
000: Tue Jun 28 11:29:08 2011
000: 
000: Operation  Max (MiB)  Min (MiB)  Mean (MiB)   Std Dev  Max (OPs)  Min (OPs)  Mean (OPs)   Std Dev  Mean (s)  Op grep #Tasks tPN reps  fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize
000: 
000: ---------  ---------  ---------  ----------   -------  ---------  ---------  ----------   -------  --------
000: write          46.73      46.73       46.73      0.00      46.73      46.73       46.73      0.00 2498.34136   114 2 1 0 1 1 0 0 1 1073741824 1048576 122406567936 -1 POSIX EXCEL
000: read          198.94     198.94      198.94      0.00     198.94     198.94      198.94      0.00 586.78222   114 2 1 0 1 1 0 0 1 1073741824 1048576 122406567936 -1 POSIX EXCEL
000: 
000: Max Write: 46.73 MiB/sec (49.00 MB/sec)
000: Max Read:  198.94 MiB/sec (208.61 MB/sec)
000: 
000: Run finished: Tue Jun 28 11:48:17 2011
Comment by Jinshan Xiong (Inactive) [ 28/Jun/11 ]

Can you please show me the output of:

lctl get_param osc.*.stats

on a client node after the IOR has finished.
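
For a quick look, the per-OST I/O counters can be filtered out of that output with standard tools, e.g. something like:

lctl get_param osc.*.stats | egrep '^osc\.|ost_read|ost_write|read_bytes|write_bytes'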

Comment by Prakash Surya (Inactive) [ 28/Jun/11 ]
zwicky101 surya1 Tue Jun 28 15:23:18 ~/lcrater2
$ lctl get_param osc.*.stats
osc.lc2-OST0000-osc-ffff8802137f2400.stats=
snapshot_time             1309299882.657956 secs.usecs
req_waittime              3316 samples [usec] 232 46310 1152289 2530547829
req_active                3316 samples [reqs] 1 1 3316 3316
ost_connect               1 samples [usec] 46310 46310 46310 2144616100
ost_statfs                3 samples [usec] 385 442 1245 518313
obd_ping                  3312 samples [usec] 232 3453 1104734 385413416
osc.lc2-OST0001-osc-ffff8802137f2400.stats=
snapshot_time             1309299882.658011 secs.usecs
req_waittime              3316 samples [usec] 193 45735 1113345 2535429681
req_active                3316 samples [reqs] 1 1 3316 3316
ost_connect               1 samples [usec] 45735 45735 45735 2091690225
ost_statfs                3 samples [usec] 419 493 1354 613974
obd_ping                  3312 samples [usec] 193 10018 1066256 443125482
osc.lc2-OST0002-osc-ffff8802137f2400.stats=
snapshot_time             1309299882.658054 secs.usecs
req_waittime              3316 samples [usec] 177 45746 1038581 2415257793
req_active                3316 samples [reqs] 1 1 3316 3316
ost_connect               1 samples [usec] 45746 45746 45746 2092696516
ost_statfs                3 samples [usec] 384 439 1232 507458
obd_ping                  3312 samples [usec] 177 4774 991603 322053819
osc.lc2-OST0003-osc-ffff8802137f2400.stats=
snapshot_time             1309299882.658092 secs.usecs
req_waittime              3316 samples [usec] 200 46818 1059725 2670511047
req_active                3316 samples [reqs] 1 1 3316 3316
ost_connect               1 samples [usec] 46818 46818 46818 2191925124
ost_statfs                3 samples [usec] 373 427 1201 482259
obd_ping                  3312 samples [usec] 200 13074 1011706 478103664
osc.lc2-OST0004-osc-ffff8802137f2400.stats=
snapshot_time             1309299882.658125 secs.usecs
req_waittime              3316 samples [usec] 206 3435 1086167 372512999
req_active                3316 samples [reqs] 1 1 3316 3316
ost_connect               1 samples [usec] 265 265 265 70225
ost_statfs                3 samples [usec] 368 425 1192 475250
obd_ping                  3312 samples [usec] 206 3435 1084710 371967524
osc.lc2-OST0005-osc-ffff8802137f2400.stats=
snapshot_time             1309299882.658155 secs.usecs
req_waittime              8476 samples [usec] 97 39670377 15056301629 176430385177943145
req_active                8476 samples [reqs] 1 10 24957 143493
read_bytes                4203 samples [bytes] 4096 1048576 4306735104 4487419898363904
write_bytes               1041 samples [bytes] 61440 1048576 1073741824 1119937888256000
ost_read                  4203 samples [usec] 8908 5630433 5585590934 20111322940376164
ost_write                 1041 samples [usec] 38605 39670377 9465777986 156305138947095590
ost_destroy               1 samples [usec] 3727231 3727231 3727231 13892250927361
ost_connect               1 samples [usec] 370 370 370 136900
ost_statfs                3 samples [usec] 362 416 1165 453869
ldlm_cancel               33 samples [usec] 277 1198 16588 8963234
obd_ping                  3127 samples [usec] 198 498 987448 316860328
osc.lc2-OST0006-osc-ffff8802137f2400.stats=
snapshot_time             1309299882.658199 secs.usecs
req_waittime              3316 samples [usec] 163 6424 968734 326049426
req_active                3316 samples [reqs] 1 1 3316 3316
ost_connect               1 samples [usec] 300 300 300 90000
ost_statfs                3 samples [usec] 356 408 1148 440656
obd_ping                  3312 samples [usec] 163 6424 967286 325518770
osc.lc2-OST0007-osc-ffff8802137f2400.stats=
snapshot_time             1309299882.658233 secs.usecs
req_waittime              3316 samples [usec] 177 17188 989042 586525450
req_active                3316 samples [reqs] 1 1 3316 3316
ost_connect               1 samples [usec] 425 425 425 180625
ost_statfs                3 samples [usec] 354 406 1136 431528
obd_ping                  3312 samples [usec] 177 17188 987481 585913297
osc.lc2-OST0008-osc-ffff8802137f2400.stats=
snapshot_time             1309299882.658267 secs.usecs
req_waittime              3316 samples [usec] 189 6429 1045459 375009395
req_active                3316 samples [reqs] 1 1 3316 3316
ost_connect               1 samples [usec] 269 269 269 72361
ost_statfs                3 samples [usec] 349 397 1121 420035
obd_ping                  3312 samples [usec] 189 6429 1044069 374516999
osc.lc2-OST0009-osc-ffff8802137f2400.stats=
snapshot_time             1309299882.658300 secs.usecs
req_waittime              3316 samples [usec] 150 6826 1008007 355073493
req_active                3316 samples [reqs] 1 1 3316 3316
ost_connect               1 samples [usec] 310 310 310 96100
ost_statfs                3 samples [usec] 341 395 1102 406262
obd_ping                  3312 samples [usec] 150 6826 1006595 354571131
osc.lc2-OST000a-osc-ffff8802137f2400.stats=
snapshot_time             1309299882.658335 secs.usecs
req_waittime              3316 samples [usec] 160 4352 939053 288343793
req_active                3316 samples [reqs] 1 1 3316 3316
ost_connect               1 samples [usec] 326 326 326 106276
ost_statfs                3 samples [usec] 340 388 1094 400100
obd_ping                  3312 samples [usec] 160 4352 937633 287837417
osc.lc2-OST000b-osc-ffff8802137f2400.stats=
snapshot_time             1309299882.658369 secs.usecs
req_waittime              3316 samples [usec] 162 17348 960848 576029112
req_active                3316 samples [reqs] 1 1 3316 3316
ost_connect               1 samples [usec] 347 347 347 120409
ost_statfs                3 samples [usec] 332 385 1082 391674
obd_ping                  3312 samples [usec] 162 17348 959419 575517029
osc.lc2-OST000c-osc-ffff8802137f2400.stats=
snapshot_time             1309299882.658402 secs.usecs
req_waittime              3316 samples [usec] 167 6438 1038151 371325901
req_active                3316 samples [reqs] 1 1 3316 3316
ost_connect               1 samples [usec] 295 295 295 87025
ost_statfs                3 samples [usec] 330 378 1065 379233
obd_ping                  3312 samples [usec] 167 6438 1036791 370859643
osc.lc2-OST000d-osc-ffff8802137f2400.stats=
snapshot_time             1309299882.658435 secs.usecs
req_waittime              3316 samples [usec] 130 8380 994059 370428241
req_active                3316 samples [reqs] 1 1 3316 3316
ost_connect               1 samples [usec] 299 299 299 89401
ost_statfs                3 samples [usec] 355 414 1145 438797
obd_ping                  3312 samples [usec] 130 8380 992615 369900043
osc.lc2-OST000e-osc-ffff8802137f2400.stats=
snapshot_time             1309299882.658468 secs.usecs
req_waittime              3316 samples [usec] 146 4535 917308 279019898
req_active                3316 samples [reqs] 1 1 3316 3316
ost_connect               1 samples [usec] 273 273 273 74529
ost_statfs                3 samples [usec] 319 374 1039 361353
obd_ping                  3312 samples [usec] 146 4535 915996 278584016
osc.lc2-OST000f-osc-ffff8802137f2400.stats=
snapshot_time             1309299882.659207 secs.usecs
req_waittime              3316 samples [usec] 165 17499 937685 569375985
req_active                3316 samples [reqs] 1 1 3316 3316
ost_connect               1 samples [usec] 339 339 339 114921
ost_statfs                3 samples [usec] 309 366 1020 348462
obd_ping                  3312 samples [usec] 165 17499 936326 568912602
osc.lc2-OST0010-osc-ffff8802137f2400.stats=
snapshot_time             1309299882.659255 secs.usecs
req_waittime              3316 samples [usec] 166 7813 984872 359184956
req_active                3316 samples [reqs] 1 1 3316 3316
ost_connect               1 samples [usec] 227 227 227 51529
ost_statfs                3 samples [usec] 308 359 996 331986
obd_ping                  3312 samples [usec] 166 7813 983649 358801441
osc.lc2-OST0011-osc-ffff8802137f2400.stats=
snapshot_time             1309299882.659288 secs.usecs
req_waittime              3316 samples [usec] 137 9719 938332 361675802
req_active                3316 samples [reqs] 1 1 3316 3316
ost_connect               1 samples [usec] 275 275 275 75625
ost_statfs                3 samples [usec] 315 338 969 313325
obd_ping                  3312 samples [usec] 137 9719 937088 361286852
osc.lc2-OST0012-osc-ffff8802137f2400.stats=
snapshot_time             1309299882.659321 secs.usecs
req_waittime              3316 samples [usec] 132 4578 859697 248602493
req_active                3316 samples [reqs] 1 1 3316 3316
ost_connect               1 samples [usec] 210 210 210 44100
ost_statfs                3 samples [usec] 306 336 956 305128
obd_ping                  3312 samples [usec] 132 4578 858531 248253265
osc.lc2-OST0013-osc-ffff8802137f2400.stats=
snapshot_time             1309299882.659354 secs.usecs
req_waittime              3316 samples [usec] 145 17520 863264 529479340
req_active                3316 samples [reqs] 1 1 3316 3316
ost_connect               1 samples [usec] 215 215 215 46225
ost_statfs                3 samples [usec] 299 332 944 297594
obd_ping                  3312 samples [usec] 145 17520 862105 529135521
osc.lc2-OST0014-osc-ffff8802137f2400.stats=
snapshot_time             1309299882.659388 secs.usecs
req_waittime              3316 samples [usec] 149 8167 911945 319393295
req_active                3316 samples [reqs] 1 1 3316 3316
ost_connect               1 samples [usec] 289 289 289 83521
ost_statfs                3 samples [usec] 292 326 922 283956
obd_ping                  3312 samples [usec] 149 8167 910734 319025818
osc.lc2-OST0015-osc-ffff8802137f2400.stats=
snapshot_time             1309299882.659421 secs.usecs
req_waittime              3316 samples [usec] 142 9721 876334 327596336
req_active                3316 samples [reqs] 1 1 3316 3316
ost_connect               1 samples [usec] 180 180 180 32400
ost_statfs                3 samples [usec] 285 319 900 270602
obd_ping                  3312 samples [usec] 142 9721 875254 327293334
osc.lc2-OST0016-osc-ffff8802137f2400.stats=
snapshot_time             1309299882.659454 secs.usecs
req_waittime              3316 samples [usec] 129 4572 819079 227868867
req_active                3316 samples [reqs] 1 1 3316 3316
ost_connect               1 samples [usec] 129 129 129 16641
ost_statfs                3 samples [usec] 284 313 892 265650
obd_ping                  3312 samples [usec] 142 4572 818058 227586576
osc.lc2-OST0017-osc-ffff8802137f2400.stats=
snapshot_time             1309299882.659487 secs.usecs
req_waittime              3316 samples [usec] 132 17465 820610 506319946
req_active                3316 samples [reqs] 1 1 3316 3316
ost_connect               1 samples [usec] 233 233 233 54289
ost_statfs                3 samples [usec] 275 308 871 253433
obd_ping                  3312 samples [usec] 132 17465 819506 506012224
osc.lc2-OST0018-osc-ffff8802137f2400.stats=
snapshot_time             1309299882.659520 secs.usecs
req_waittime              3316 samples [usec] 133 7420 871673 288373391
req_active                3316 samples [reqs] 1 1 3316 3316
ost_connect               1 samples [usec] 254 254 254 64516
ost_statfs                3 samples [usec] 274 302 862 248076
obd_ping                  3312 samples [usec] 133 7420 870557 288060799
osc.lc2-OST0019-osc-ffff8802137f2400.stats=
snapshot_time             1309299882.659553 secs.usecs
req_waittime              3316 samples [usec] 137 9752 840189 309446239
req_active                3316 samples [reqs] 1 1 3316 3316
ost_connect               1 samples [usec] 150 150 150 22500
ost_statfs                3 samples [usec] 272 298 849 240629
obd_ping                  3312 samples [usec] 137 9752 839190 309183110
osc.lc2-OST001a-osc-ffff8802137f2400.stats=
snapshot_time             1309299882.659585 secs.usecs
req_waittime              3316 samples [usec] 125 4534 785335 211163457
req_active                3316 samples [reqs] 1 1 3316 3316
ost_connect               1 samples [usec] 148 148 148 21904
ost_statfs                3 samples [usec] 264 290 831 230525
obd_ping                  3312 samples [usec] 125 4534 784356 210911028
osc.lc2-OST001b-osc-ffff8802137f2400.stats=
snapshot_time             1309299882.659618 secs.usecs
req_waittime              3316 samples [usec] 122 17444 786734 489367106
req_active                3316 samples [reqs] 1 1 3316 3316
ost_connect               1 samples [usec] 231 231 231 53361
ost_statfs                3 samples [usec] 260 288 818 223444
obd_ping                  3312 samples [usec] 122 17444 785685 489090301
osc.lc2-OST001c-osc-ffff8802137f2400.stats=
snapshot_time             1309299882.659652 secs.usecs
req_waittime              3316 samples [usec] 141 8770 843230 295343834
req_active                3316 samples [reqs] 1 1 3316 3316
ost_connect               1 samples [usec] 197 197 197 38809
ost_statfs                3 samples [usec] 254 285 808 218102
obd_ping                  3312 samples [usec] 141 8770 842225 295086923
osc.lc2-OST001d-osc-ffff8802137f2400.stats=
snapshot_time             1309299882.659685 secs.usecs
req_waittime              3316 samples [usec] 138 9756 820282 300124948
req_active                3316 samples [reqs] 1 1 3316 3316
ost_connect               1 samples [usec] 153 153 153 23409
ost_statfs                3 samples [usec] 250 278 789 207905
obd_ping                  3312 samples [usec] 138 9756 819340 299893634
osc.lc2-OST001e-osc-ffff8802137f2400.stats=
snapshot_time             1309299882.659717 secs.usecs
req_waittime              3316 samples [usec] 125 4630 770561 205056819
req_active                3316 samples [reqs] 1 1 3316 3316
ost_connect               1 samples [usec] 162 162 162 26244
ost_statfs                3 samples [usec] 241 277 777 201891
obd_ping                  3312 samples [usec] 125 4630 769622 204828684
osc.lc2-OST001f-osc-ffff8802137f2400.stats=
snapshot_time             1309299882.661220 secs.usecs
req_waittime              3316 samples [usec] 118 17421 763153 477120307
req_active                3316 samples [reqs] 1 1 3316 3316
ost_connect               1 samples [usec] 213 213 213 45369
ost_statfs                3 samples [usec] 238 268 759 192477
obd_ping                  3312 samples [usec] 118 17421 762181 476882461
osc.lc2-OST0020-osc-ffff8802137f2400.stats=
snapshot_time             1309299882.661269 secs.usecs
req_waittime              3316 samples [usec] 131 8115 901493 316119947
req_active                3316 samples [reqs] 1 1 3316 3316
ost_connect               1 samples [usec] 203 203 203 41209
ost_statfs                3 samples [usec] 235 262 747 186369
obd_ping                  3312 samples [usec] 131 8115 900543 315892369
osc.lc2-OST0021-osc-ffff8802137f2400.stats=
snapshot_time             1309299882.661301 secs.usecs
req_waittime              3316 samples [usec] 128 9787 875772 329087318
req_active                3316 samples [reqs] 1 1 3316 3316
ost_connect               1 samples [usec] 236 236 236 55696
ost_statfs                3 samples [usec] 237 269 766 196130
obd_ping                  3312 samples [usec] 128 9787 874770 328835492
osc.lc2-OST0022-osc-ffff8802137f2400.stats=
snapshot_time             1309299882.661331 secs.usecs
req_waittime              3316 samples [usec] 121 4611 804496 220759670
req_active                3316 samples [reqs] 1 1 3316 3316
ost_connect               1 samples [usec] 214 214 214 45796
ost_statfs                3 samples [usec] 212 251 691 159929
obd_ping                  3312 samples [usec] 121 4611 803591 220553945
osc.lc2-OST0023-osc-ffff8802137f2400.stats=
snapshot_time             1309299882.661361 secs.usecs
req_waittime              3316 samples [usec] 119 17402 801300 496177606
req_active                3316 samples [reqs] 1 1 3316 3316
ost_connect               1 samples [usec] 186 186 186 34596
ost_statfs                3 samples [usec] 208 239 657 144485
obd_ping                  3312 samples [usec] 119 17402 800457 495998525
osc.lc2-OST0024-osc-ffff8802137f2400.stats=
snapshot_time             1309299882.661393 secs.usecs
req_waittime              3317 samples [usec] 137 2381350 3245697 5671117236501
req_active                3317 samples [reqs] 1 1 3317 3317
ost_destroy               1 samples [usec] 2381350 2381350 2381350 5670827822500
ost_connect               1 samples [usec] 229 229 229 52441
ost_statfs                3 samples [usec] 197 230 636 135390
obd_ping                  3312 samples [usec] 137 7748 863482 289226170
osc.lc2-OST0025-osc-ffff8802137f2400.stats=
snapshot_time             1309299882.661426 secs.usecs
req_waittime              3316 samples [usec] 135 9818 842294 310865850
req_active                3316 samples [reqs] 1 1 3316 3316
ost_connect               1 samples [usec] 150 150 150 22500
ost_statfs                3 samples [usec] 177 228 607 124117
obd_ping                  3312 samples [usec] 135 9818 841537 310719233
osc.lc2-OST0026-osc-ffff8802137f2400.stats=
snapshot_time             1309299882.661456 secs.usecs
req_waittime              3316 samples [usec] 119 4541 777610 206535650
req_active                3316 samples [reqs] 1 1 3316 3316
ost_connect               1 samples [usec] 134 134 134 17956
ost_statfs                3 samples [usec] 200 224 629 132201
obd_ping                  3312 samples [usec] 119 4541 776847 206385493
osc.lc2-OST0027-osc-ffff8802137f2400.stats=
snapshot_time             1309299882.661487 secs.usecs
req_waittime              3317 samples [usec] 121 2518610 3296666 6343880625598
req_active                3317 samples [reqs] 1 1 3317 3317
ost_destroy               1 samples [usec] 2518610 2518610 2518610 6343396332100
ost_connect               1 samples [usec] 217 217 217 47089
ost_statfs                3 samples [usec] 178 199 570 108534
obd_ping                  3312 samples [usec] 121 17419 777269 484137875
osc.lc2-OST0028-osc-ffff8802137f2400.stats=
snapshot_time             1309299882.661520 secs.usecs
req_waittime              3316 samples [usec] 120 7729 818831 265014381
req_active                3316 samples [reqs] 1 1 3316 3316
ost_connect               1 samples [usec] 131 131 131 17161
ost_statfs                3 samples [usec] 163 198 552 102254
obd_ping                  3312 samples [usec] 120 7729 818148 264894966
osc.lc2-OST0029-osc-ffff8802137f2400.stats=
snapshot_time             1309299882.661550 secs.usecs
req_waittime              3316 samples [usec] 113 9753 793692 284883676
req_active                3316 samples [reqs] 1 1 3316 3316
ost_connect               1 samples [usec] 121 121 121 14641
ost_statfs                3 samples [usec] 160 213 564 107450
obd_ping                  3312 samples [usec] 113 9753 793007 284761585
osc.lc2-OST002a-osc-ffff8802137f2400.stats=
snapshot_time             1309299882.661581 secs.usecs
req_waittime              3326 samples [usec] 116 6803600 20867138 90963833301586
req_active                3326 samples [reqs] 1 3 3332 3348
write_bytes               6 samples [bytes] 1048576 1048576 6291456 6597069766656
ost_write                 6 samples [usec] 101750 6803600 20111207 90963298587213
ost_destroy               1 samples [usec] 18648 18648 18648 347747904
ost_connect               1 samples [usec] 133 133 133 17689
ost_statfs                3 samples [usec] 151 182 502 84486
obd_ping                  3313 samples [usec] 116 4554 735954 186524908
osc.lc2-OST002b-osc-ffff8802137f2400.stats=
snapshot_time             1309299882.661617 secs.usecs
req_waittime              3316 samples [usec] 109 17414 741926 466887986
req_active                3316 samples [reqs] 1 1 3316 3316
ost_connect               1 samples [usec] 128 128 128 16384
ost_statfs                3 samples [usec] 145 184 484 78906
obd_ping                  3312 samples [usec] 109 17414 741314 466792696
osc.lc2-OST002c-osc-ffff8802137f2400.stats=
snapshot_time             1309299882.661648 secs.usecs
req_waittime              8487 samples [usec] 110 36299959 30049160298 472323496437767736
req_active                8487 samples [reqs] 1 10 33250 232008
read_bytes                4243 samples [bytes] 4096 1048576 4315938816 4482920299364352
write_bytes               1024 samples [bytes] 1048576 1048576 1073741824 1125899906842624
ost_read                  4243 samples [usec] 6293 5012439 9474972749 35922925662095665
ost_write                 1024 samples [usec] 1098441 36299959 20568875607 436380202545289431
ost_destroy               1 samples [usec] 4512911 4512911 4512911 20366365693921
ost_connect               1 samples [usec] 226 226 226 51076
ost_statfs                3 samples [usec] 150 190 508 86824
ldlm_cancel               32 samples [usec] 129 801 15198 7596796
obd_ping                  3117 samples [usec] 126 377 710710 165138154
osc.lc2-OST002d-osc-ffff8802137f2400.stats=
snapshot_time             1309299882.662283 secs.usecs
req_waittime              3325 samples [usec] 121 3255936 11058585 27761035609091
req_active                3325 samples [reqs] 1 3 3328 3336
write_bytes               5 samples [bytes] 4096 1048576 4198400 4398063288320
ost_write                 5 samples [usec] 166817 3255936 10303044 27760458982266
ost_destroy               1 samples [usec] 17814 17814 17814 317338596
ost_connect               1 samples [usec] 121 121 121 14641
ost_statfs                3 samples [usec] 129 187 467 74411
obd_ping                  3313 samples [usec] 131 9789 736636 259046452
osc.lc2-OST002e-osc-ffff8802137f2400.stats=
snapshot_time             1309299882.662332 secs.usecs
req_waittime              8446 samples [usec] 84 37963003 30066919244 474388438312497346
req_active                8446 samples [reqs] 1 10 33078 231330
read_bytes                4192 samples [bytes] 12288 1048576 4307550208 4492471870423040
write_bytes               1024 samples [bytes] 1048576 1048576 1073741824 1125899906842624
ost_read                  4192 samples [usec] 20373 5049723 9373760081 36615725280852947
ost_write                 1024 samples [usec] 47464 37963003 20687893862 437751873380011142
ost_destroy               1 samples [usec] 4565022 4565022 4565022 20839425860484
ost_connect               1 samples [usec] 124 124 124 15376
ost_statfs                3 samples [usec] 128 158 428 61512
ldlm_cancel               33 samples [usec] 150 1353 16591 9419327
obd_ping                  3124 samples [usec] 111 365 647157 137061821
osc.lc2-OST002f-osc-ffff8802137f2400.stats=
snapshot_time             1309299882.662379 secs.usecs
req_waittime              8482 samples [usec] 111 36672657 28422159725 431738138874474129
req_active                8482 samples [reqs] 1 10 31859 213413
read_bytes                4237 samples [bytes] 4096 1048576 4314890240 4481455950397440
write_bytes               1024 samples [bytes] 1048576 1048576 1073741824 1125899906842624
ost_read                  4237 samples [usec] 4287 4905514 8637724795 31806703311501945
ost_write                 1024 samples [usec] 1485223 36672657 19780314514 399919619442695064
ost_destroy               1 samples [usec] 3437435 3437435 3437435 11815959379225
ost_connect               1 samples [usec] 114 114 114 12996
ost_statfs                3 samples [usec] 153 183 496 82498
ldlm_cancel               32 samples [usec] 264 613 15063 7237275
obd_ping                  3118 samples [usec] 111 3039 642264 143155342
Comment by Jinshan Xiong (Inactive) [ 28/Jun/11 ]

Can you also collect the output of `lctl get_param osc.*.osc_stats'? Thanks.

Comment by Prakash Surya (Inactive) [ 28/Jun/11 ]
zwicky101 surya1 Tue Jun 28 15:37:12 ~/lcrater2
$ lctl get_param osc.*.osc_stats
osc.lc2-OST0000-osc-ffff8802137f2400.osc_stats=
snapshot_time:         1309300659.875348 (secs.usecs)
lockless_write_bytes            0
lockless_read_bytes             0
lockless_truncate               0
osc.lc2-OST0001-osc-ffff8802137f2400.osc_stats=
snapshot_time:         1309300659.875393 (secs.usecs)
lockless_write_bytes            0
lockless_read_bytes             0
lockless_truncate               0
osc.lc2-OST0002-osc-ffff8802137f2400.osc_stats=
snapshot_time:         1309300659.875415 (secs.usecs)
lockless_write_bytes            0
lockless_read_bytes             0
lockless_truncate               0
osc.lc2-OST0003-osc-ffff8802137f2400.osc_stats=
snapshot_time:         1309300659.875437 (secs.usecs)
lockless_write_bytes            0
lockless_read_bytes             0
lockless_truncate               0
osc.lc2-OST0004-osc-ffff8802137f2400.osc_stats=
snapshot_time:         1309300659.875459 (secs.usecs)
lockless_write_bytes            0
lockless_read_bytes             0
lockless_truncate               0
osc.lc2-OST0005-osc-ffff8802137f2400.osc_stats=
snapshot_time:         1309300659.875480 (secs.usecs)
lockless_write_bytes            0
lockless_read_bytes             0
lockless_truncate               0
osc.lc2-OST0006-osc-ffff8802137f2400.osc_stats=
snapshot_time:         1309300659.875502 (secs.usecs)
lockless_write_bytes            0
lockless_read_bytes             0
lockless_truncate               0
osc.lc2-OST0007-osc-ffff8802137f2400.osc_stats=
snapshot_time:         1309300659.875523 (secs.usecs)
lockless_write_bytes            0
lockless_read_bytes             0
lockless_truncate               0
osc.lc2-OST0008-osc-ffff8802137f2400.osc_stats=
snapshot_time:         1309300659.875545 (secs.usecs)
lockless_write_bytes            0
lockless_read_bytes             0
lockless_truncate               0
osc.lc2-OST0009-osc-ffff8802137f2400.osc_stats=
snapshot_time:         1309300659.875568 (secs.usecs)
lockless_write_bytes            0
lockless_read_bytes             0
lockless_truncate               0
osc.lc2-OST000a-osc-ffff8802137f2400.osc_stats=
snapshot_time:         1309300659.875590 (secs.usecs)
lockless_write_bytes            0
lockless_read_bytes             0
lockless_truncate               0
osc.lc2-OST000b-osc-ffff8802137f2400.osc_stats=
snapshot_time:         1309300659.875611 (secs.usecs)
lockless_write_bytes            0
lockless_read_bytes             0
lockless_truncate               0
osc.lc2-OST000c-osc-ffff8802137f2400.osc_stats=
snapshot_time:         1309300659.875633 (secs.usecs)
lockless_write_bytes            0
lockless_read_bytes             0
lockless_truncate               0
osc.lc2-OST000d-osc-ffff8802137f2400.osc_stats=
snapshot_time:         1309300659.875655 (secs.usecs)
lockless_write_bytes            0
lockless_read_bytes             0
lockless_truncate               0
osc.lc2-OST000e-osc-ffff8802137f2400.osc_stats=
snapshot_time:         1309300659.875677 (secs.usecs)
lockless_write_bytes            0
lockless_read_bytes             0
lockless_truncate               0
osc.lc2-OST000f-osc-ffff8802137f2400.osc_stats=
snapshot_time:         1309300659.875698 (secs.usecs)
lockless_write_bytes            0
lockless_read_bytes             0
lockless_truncate               0
osc.lc2-OST0010-osc-ffff8802137f2400.osc_stats=
snapshot_time:         1309300659.875719 (secs.usecs)
lockless_write_bytes            0
lockless_read_bytes             0
lockless_truncate               0
osc.lc2-OST0011-osc-ffff8802137f2400.osc_stats=
snapshot_time:         1309300659.875741 (secs.usecs)
lockless_write_bytes            0
lockless_read_bytes             0
lockless_truncate               0
osc.lc2-OST0012-osc-ffff8802137f2400.osc_stats=
snapshot_time:         1309300659.875763 (secs.usecs)
lockless_write_bytes            0
lockless_read_bytes             0
lockless_truncate               0
osc.lc2-OST0013-osc-ffff8802137f2400.osc_stats=
snapshot_time:         1309300659.875783 (secs.usecs)
lockless_write_bytes            0
lockless_read_bytes             0
lockless_truncate               0
osc.lc2-OST0014-osc-ffff8802137f2400.osc_stats=
snapshot_time:         1309300659.875805 (secs.usecs)
lockless_write_bytes            0
lockless_read_bytes             0
lockless_truncate               0
osc.lc2-OST0015-osc-ffff8802137f2400.osc_stats=
snapshot_time:         1309300659.875827 (secs.usecs)
lockless_write_bytes            0
lockless_read_bytes             0
lockless_truncate               0
osc.lc2-OST0016-osc-ffff8802137f2400.osc_stats=
snapshot_time:         1309300659.875848 (secs.usecs)
lockless_write_bytes            0
lockless_read_bytes             0
lockless_truncate               0
osc.lc2-OST0017-osc-ffff8802137f2400.osc_stats=
snapshot_time:         1309300659.875868 (secs.usecs)
lockless_write_bytes            0
lockless_read_bytes             0
lockless_truncate               0
osc.lc2-OST0018-osc-ffff8802137f2400.osc_stats=
snapshot_time:         1309300659.875890 (secs.usecs)
lockless_write_bytes            0
lockless_read_bytes             0
lockless_truncate               0
osc.lc2-OST0019-osc-ffff8802137f2400.osc_stats=
snapshot_time:         1309300659.875912 (secs.usecs)
lockless_write_bytes            0
lockless_read_bytes             0
lockless_truncate               0
osc.lc2-OST001a-osc-ffff8802137f2400.osc_stats=
snapshot_time:         1309300659.875932 (secs.usecs)
lockless_write_bytes            0
lockless_read_bytes             0
lockless_truncate               0
osc.lc2-OST001b-osc-ffff8802137f2400.osc_stats=
snapshot_time:         1309300659.875953 (secs.usecs)
lockless_write_bytes            0
lockless_read_bytes             0
lockless_truncate               0
osc.lc2-OST001c-osc-ffff8802137f2400.osc_stats=
snapshot_time:         1309300659.875975 (secs.usecs)
lockless_write_bytes            0
lockless_read_bytes             0
lockless_truncate               0
osc.lc2-OST001d-osc-ffff8802137f2400.osc_stats=
snapshot_time:         1309300659.875997 (secs.usecs)
lockless_write_bytes            0
lockless_read_bytes             0
lockless_truncate               0
osc.lc2-OST001e-osc-ffff8802137f2400.osc_stats=
snapshot_time:         1309300659.876018 (secs.usecs)
lockless_write_bytes            0
lockless_read_bytes             0
lockless_truncate               0
osc.lc2-OST001f-osc-ffff8802137f2400.osc_stats=
snapshot_time:         1309300659.876058 (secs.usecs)
lockless_write_bytes            0
lockless_read_bytes             0
lockless_truncate               0
osc.lc2-OST0020-osc-ffff8802137f2400.osc_stats=
snapshot_time:         1309300659.876081 (secs.usecs)
lockless_write_bytes            0
lockless_read_bytes             0
lockless_truncate               0
osc.lc2-OST0021-osc-ffff8802137f2400.osc_stats=
snapshot_time:         1309300659.876104 (secs.usecs)
lockless_write_bytes            0
lockless_read_bytes             0
lockless_truncate               0
osc.lc2-OST0022-osc-ffff8802137f2400.osc_stats=
snapshot_time:         1309300659.876125 (secs.usecs)
lockless_write_bytes            0
lockless_read_bytes             0
lockless_truncate               0
osc.lc2-OST0023-osc-ffff8802137f2400.osc_stats=
snapshot_time:         1309300659.876145 (secs.usecs)
lockless_write_bytes            0
lockless_read_bytes             0
lockless_truncate               0
osc.lc2-OST0024-osc-ffff8802137f2400.osc_stats=
snapshot_time:         1309300659.876168 (secs.usecs)
lockless_write_bytes            0
lockless_read_bytes             0
lockless_truncate               0
osc.lc2-OST0025-osc-ffff8802137f2400.osc_stats=
snapshot_time:         1309300659.876190 (secs.usecs)
lockless_write_bytes            0
lockless_read_bytes             0
lockless_truncate               0
osc.lc2-OST0026-osc-ffff8802137f2400.osc_stats=
snapshot_time:         1309300659.876210 (secs.usecs)
lockless_write_bytes            0
lockless_read_bytes             0
lockless_truncate               0
osc.lc2-OST0027-osc-ffff8802137f2400.osc_stats=
snapshot_time:         1309300659.876230 (secs.usecs)
lockless_write_bytes            0
lockless_read_bytes             0
lockless_truncate               0
osc.lc2-OST0028-osc-ffff8802137f2400.osc_stats=
snapshot_time:         1309300659.876252 (secs.usecs)
lockless_write_bytes            0
lockless_read_bytes             0
lockless_truncate               0
osc.lc2-OST0029-osc-ffff8802137f2400.osc_stats=
snapshot_time:         1309300659.876273 (secs.usecs)
lockless_write_bytes            0
lockless_read_bytes             0
lockless_truncate               0
osc.lc2-OST002a-osc-ffff8802137f2400.osc_stats=
snapshot_time:         1309300659.876293 (secs.usecs)
lockless_write_bytes            0
lockless_read_bytes             0
lockless_truncate               0
osc.lc2-OST002b-osc-ffff8802137f2400.osc_stats=
snapshot_time:         1309300659.876313 (secs.usecs)
lockless_write_bytes            0
lockless_read_bytes             0
lockless_truncate               0
osc.lc2-OST002c-osc-ffff8802137f2400.osc_stats=
snapshot_time:         1309300659.876335 (secs.usecs)
lockless_write_bytes            0
lockless_read_bytes             0
lockless_truncate               0
osc.lc2-OST002d-osc-ffff8802137f2400.osc_stats=
snapshot_time:         1309300659.876355 (secs.usecs)
lockless_write_bytes            0
lockless_read_bytes             0
lockless_truncate               0
osc.lc2-OST002e-osc-ffff8802137f2400.osc_stats=
snapshot_time:         1309300659.876375 (secs.usecs)
lockless_write_bytes            0
lockless_read_bytes             0
lockless_truncate               0
osc.lc2-OST002f-osc-ffff8802137f2400.osc_stats=
snapshot_time:         1309300659.876395 (secs.usecs)
lockless_write_bytes            0
lockless_read_bytes             0
lockless_truncate               0
Comment by Jinshan Xiong (Inactive) [ 29/Jun/11 ]

Hi Prakash, I can't reproduce the data corruption issue in our lab with 10 threads on 5 clients. I still need your help. Can you please try patch set 9 and tell me the results? Thanks.

Comment by Prakash Surya (Inactive) [ 29/Jun/11 ]

Hi Xiong, not a problem. I will work on getting that tested.

Comment by Prakash Surya (Inactive) [ 30/Jun/11 ]

Xiong, just to give an update: I compiled and installed patchset 9 on the clients and have just kicked off a 57-node IOR. I'll let you know how it goes.

Comment by Prakash Surya (Inactive) [ 30/Jun/11 ]

The IOR is still processing the read portion of the test, but the write portion produced output similar to what was seen before:

zwicky101 surya1 Thu Jun 30 08:29:10 ~/lcrater2
$ srun -l -D /p/lcrater2/surya1/tmp -N 57 -n 114 -- /g/g0/surya1/dev/IOR/src/C/IOR -a POSIX -C -e -i 1 -o /p/lcrater2/surya1/tmp/IOR.data -v  -w -W -r -R -t 1m -b 1024m

[snip..]

000: Commencing write performance test.
000: Thu Jun 30 08:30:54 2011
000:
000: Verifying contents of the file(s) just written.
000: Thu Jun 30 09:12:44 2011
000: 
009: [9] FAILED comparison of buffer containing 8-byte ints:
009: [9]   File name = /p/lcrater2/surya1/tmp/IOR.data
009: [9]   In transfer 1, 131072 errors between buffer indices 0 and 131071.
009: [9]   File byte offset = 13958643712:
009: [9]     Expected: 0x0000000d4e0c96ae 0000000000000008 0000000d4e0c96ae 0000000000000018 
009: [9]     Actual:   0x0000000000000000 0000000000000000 0000000000000000 0000000000000000 
079: [79] FAILED comparison of buffer containing 8-byte ints:
079: [79]   File name = /p/lcrater2/surya1/tmp/IOR.data
079: [79]   In transfer 1, 131072 errors between buffer indices 0 and 131071.
079: [79]   File byte offset = 89120571392:
079: [79]     Expected: 0x000000534e0c96ae 0000000000000008 000000534e0c96ae 0000000000000018 
079: [79]     Actual:   0x0000000000000000 0000000000000000 0000000000000000 0000000000000000 
025: [25] FAILED comparison of buffer containing 8-byte ints:
025: [25]   File name = /p/lcrater2/surya1/tmp/IOR.data
025: [25]   In transfer 1, 131072 errors between buffer indices 0 and 131071.
025: [25]   File byte offset = 31138512896:
025: [25]     Expected: 0x0000001d4e0c96ae 0000000000000008 0000001d4e0c96ae 0000000000000018 
025: [25]     Actual:   0x0000000000000000 0000000000000000 0000000000000000 0000000000000000 
047: [47] FAILED comparison of buffer containing 8-byte ints:
047: [47]   File name = /p/lcrater2/surya1/tmp/IOR.data
047: [47]   In transfer 2, 131072 errors between buffer indices 0 and 131071.
047: [47]   File byte offset = 54761881600:
047: [47]     Expected: 0x000000334e0c96ae 0000000000000008 000000334e0c96ae 0000000000000018 
047: [47]     Actual:   0x0000000000000000 0000000000000000 0000000000000000 0000000000000000 
021: [21] FAILED comparison of buffer containing 8-byte ints:
021: [21]   File name = /p/lcrater2/surya1/tmp/IOR.data
021: [21]   In transfer 2, 131072 errors between buffer indices 0 and 131071.
021: [21]   File byte offset = 26844594176:
021: [21]     Expected: 0x000000194e0c96ae 0000000000000008 000000194e0c96ae 0000000000000018 
021: [21]     Actual:   0x0000000000000000 0000000000000000 0000000000000000 0000000000000000 
033: [33] FAILED comparison of buffer containing 8-byte ints:
033: [33]   File name = /p/lcrater2/surya1/tmp/IOR.data
033: [33]   In transfer 2, 131072 errors between buffer indices 0 and 131071.
033: [33]   File byte offset = 39729496064:
033: [33]     Expected: 0x000000254e0c96ae 0000000000000008 000000254e0c96ae 0000000000000018 
033: [33]     Actual:   0x0000000000000000 0000000000000000 0000000000000000 0000000000000000 
085: [85] FAILED comparison of buffer containing 8-byte ints:
085: [85]   File name = /p/lcrater2/surya1/tmp/IOR.data
085: [85]   In transfer 2, 131072 errors between buffer indices 0 and 131071.
085: [85]   File byte offset = 95564070912:
085: [85]     Expected: 0x000000594e0c96ae 0000000000000008 000000594e0c96ae 0000000000000018 
085: [85]     Actual:   0x0000000000000000 0000000000000000 0000000000000000 0000000000000000 
095: [95] FAILED comparison of buffer containing 8-byte ints:
095: [95]   File name = /p/lcrater2/surya1/tmp/IOR.data
095: [95]   In transfer 2, 131072 errors between buffer indices 0 and 131071.
095: [95]   File byte offset = 106301489152:
095: [95]     Expected: 0x000000634e0c96ae 0000000000000008 000000634e0c96ae 0000000000000018 
095: [95]     Actual:   0x0000000000000000 0000000000000000 0000000000000000 0000000000000000 
000: WARNING: incorrect data on write.
000:          1048576 errors found on write check.
000: Used Time Stamp 1309447854 (0x4e0c96ae) for Data Signature

[snip..]

Comment by Jinshan Xiong (Inactive) [ 01/Jul/11 ]

I've found this issue and fixed it; please try patch set 11. Thanks.

Comment by Prakash Surya (Inactive) [ 01/Jul/11 ]

Jinshan, great news! I will build and test this patch today.

Comment by Ned Bass [ 01/Jul/11 ]

We hit the LBUG in osc_req_attr_set() again this morning. Looks like the client was running patch set 9 based on the version number of the installed RPM.

Comment by Prakash Surya (Inactive) [ 05/Jul/11 ]

Hi Jinshan, I have run the IOR three times with patchset 11 installed and have not seen the failed comparison error or the original hang. However, this error has appeared on each run with ps11 installed:

 
000: Commencing read performance test.
000: Tue Jul  5 10:35:29 2011
000: 
000: Re-reading the file(s) twice to verify that reads are consistent.
000: Tue Jul  5 10:51:32 2011
000: 
068: ** error **
068: ERROR in aiori-POSIX.c (line 256): transfer failed.
068: ERROR: Input/output error
068: ** exiting **

The client console logs show many messages of the form:

LustreError: 11-0: an error occurred while communicating with 10.1.1.49@o2ib9. The ost_connect operation failed with -19

and

LustreError: 11-0: an error occurred while communicating with 10.1.1.47@o2ib9. The obd_ping operation failed with -107
Comment by Christopher Morrone [ 05/Jul/11 ]

Prakash, please also update the Jira ticket that patchset 11 came from, so they know the state of testing on that patchset.

You should look for eviction messages as well.

A user-space app should not see an EIO from just ost_connect or obd_ping errors. The app should simply hang on whatever operation it is doing until reconnect is successful. Only an eviction should result in the operation being aborted and the app seeing an error message.

Try to look for the first error message in the grouping... once something goes wrong, you will certainly see a flood of many different kinds of error messages. Often the first one is the one of interest.
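
For example, a quick way to find the eviction and the earliest error in the flood (a sketch; the console log path below is hypothetical and depends on how console output is captured at the site):

$ grep -n -i 'evict' /var/log/consoles/zwicky155
$ grep -n 'LustreError\|Lustre:' /var/log/consoles/zwicky155 | head -5

The first command confirms whether an eviction happened at all; the second shows the first few Lustre messages so the initial failure is not lost in the noise.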

Comment by Prakash Surya (Inactive) [ 05/Jul/11 ]

Sorry, it turns out the previous ost_connect and obd_ping messages were old. Looking at the console output for just today, there were no Lustre or LustreError messages in the logs, and nothing matching the string "evict". However, there are a lot of messages resembling this:

2011-07-05 14:18:30 mstflint:14869 freeing invalid memtype fe900000-fe910000
2011-07-05 14:18:30 mstflint:14869 freeing invalid memtype fe910000-fea00000
2011-07-05 14:18:31 mstflint:14877 freeing invalid memtype fe900000-fe910000
2011-07-05 14:18:31 mstflint:14877 freeing invalid memtype fe910000-fea00000
2011-07-05 14:18:32 CPUFREQ: Per core ondemand sysfs interface is deprecated - up_threshold

but no surrounding Lustre messages like what was seen previously.

Comment by Prakash Surya (Inactive) [ 05/Jul/11 ]

I just ran the same IOR but using a file per process, rather than a single shared file, and that IOR completed without reporting any errors.

Comment by Jinshan Xiong (Inactive) [ 05/Jul/11 ]

Can you please use `lctl dk > log' to dump the Lustre log when you see the IOR error?

Comment by Prakash Surya (Inactive) [ 06/Jul/11 ]

Strange, lctl dk on the client that aborted the application reported:

# zwicky155 /root > lctl dk > log
# zwicky155 /root > cat log
Debug log: 0 lines, 0 kept, 0 dropped, 0 bad.
# zwicky155 /root > cat /proc/sys/lnet/debug
ioctl neterror warning error emerg config console

Perhaps I need to enable more debugging flags? Are there any specific flags that I should enable or disable?

Comment by Prakash Surya (Inactive) [ 06/Jul/11 ]

So, with our default debug flags of:

ioctl neterror warning error emerg config console

Running:

$ lctl dk > /dev/null
$ $IOR
$ lctl dk > log

None of the clients produce useful output in the log files (i.e., they only contain "0 lines, 0 kept", etc.).

Comment by Jinshan Xiong (Inactive) [ 06/Jul/11 ]

That's strange, Lustre used to moan a lot...

Can you please show me the output of `lctl get_param subsystem_debug'? Please make sure client modules such as mdc, osc, etc. are in it, or you can run `lctl set_param subsystem_debug=-1' to enable all of them.
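
Concretely, the check-then-enable sequence on a client would look like this (a minimal sketch of the commands above):

$ lctl get_param subsystem_debug        # verify mdc, osc, etc. are listed
$ lctl set_param subsystem_debug=-1     # or simply enable every subsystem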

Comment by Prakash Surya (Inactive) [ 06/Jul/11 ]

The clients all report the following for subsystem_debug:

subsystem_debug=
undefined mdc mds osc ost class log llite rpc lnet lnd pinger filter echo ldlm lov lquota lmv sec gss mgc mgs fid fld
Comment by Jinshan Xiong (Inactive) [ 06/Jul/11 ]

Do you have daemon_file set? Try `lctl get_param daemon_file', and if you have it set, the log is actually dumped to that file. You can use `lctl set_param daemon_file=stop' to disable it so that `lctl dk' will work again.
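
That is, the same check-and-disable steps as they would be run:

$ lctl get_param daemon_file            # if set, logs are dumped to that file instead
$ lctl set_param daemon_file=stop       # disable it so `lctl dk' works again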

Comment by Prakash Surya (Inactive) [ 06/Jul/11 ]

No, 'lctl get_param daemon_file' does not produce any output on any of the clients.

Comment by Jinshan Xiong (Inactive) [ 06/Jul/11 ]

Then I have no idea; you may want to ask Chris to help figure it out.

Comment by Christopher Morrone [ 06/Jul/11 ]

If there is nothing on the console, then it's pretty likely there is simply no message in the log at those log levels either. Those log levels are pretty conservative. I would start by adding "net" to the debug level, simply as a sanity check. You WILL see log messages for any network traffic in the Lustre kernel log when you run IOR.

If the sanity test gives you logs, then set debug to -1 (all logging on) and see if the problem is still reproducible.

I would also recommend getting into the habit of using the form "lctl dk log" instead of "lctl dk > log". The former makes sure the file "log" can actually be created before dumping and clearing the Lustre log in memory. If you slip up and try to write to a directory where you don't have permission (or, more likely, it is a read-only filesystem), the latter command loses your log forever.

The redirection "lctl dk > /dev/null" is fine for clearing the log, though.
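
Putting the whole capture procedure together, run as root on a client (a sketch; the additive "+net" mask syntax is assumed to be accepted by this version's /proc interface):

$ echo +net > /proc/sys/lnet/debug   # add "net" to the existing mask (sanity check)
$ lctl dk > /dev/null                # clear the in-memory log
$ $IOR                               # reproduce the error
$ lctl dk log                        # safe form: fails up front if "log" cannot be created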

Comment by Prakash Surya (Inactive) [ 07/Jul/11 ]

Chris, thanks for the tip!

I believe Lustre is simply not reporting any information to the logs at our default log level. If I change the log level to something more liberal (i.e., -1), I do end up getting log information from lctl dk.

Jinshan, are there any specific values you would like me to enable in /proc/sys/lnet/debug? I assume enabling all flags would generate too much noise. Once we loosen up the log levels we should start to get some info.

Comment by Prakash Surya (Inactive) [ 13/Jul/11 ]

Ok, so:

I have not been able to reproduce this issue as of late. I have run a few (around 6) IORs with the following debug flags:

super warning dlmtrace error emerg ha rpctrace vfstrace config console cache info vfstrace

And also with these flags:

ioctl neterror warning error emerg config console

In neither case have I been able to get the EIO to crop up. The client nodes were all running Lustre 2.0.63 with patchset 11 applied (2.0.63-1chaos-LU437ps11) from the above-mentioned gerrit patch (http://review.whamcloud.com/#change,911), and the servers were all running a pristine Lustre 2.0.63-1chaos.

For the time being, I think we are going to let this issue settle and move on to other testing (at least until we run into this again).

Comment by Jinshan Xiong (Inactive) [ 13/Jul/11 ]

Hi Prakash,

In that case, let's land this patch and see whether anyone else runs into this issue. Thanks for your effort.

Jinshan

Comment by Prakash Surya (Inactive) [ 13/Jul/11 ]

Ok, I have no objections to that.

Comment by Peter Jones [ 26/Jul/11 ]

Should be fixed by LU-394 landing
