[LU-5154] Failure on test suite recovery-double-scale test_pairwise_fail Created: 06/Jun/14 Updated: 08/Aug/16 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.6.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Environment: | client and server: lustre-master build # 2052 |
| Severity: | 3 |
| Rank (Obsolete): | 14222 |
| Description |
|
This issue was created by maloo for sarah <sarah@whamcloud.com>
This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/f5b45d64-e1d6-11e3-8cc0-52540035b04c.
The sub-test test_pairwise_fail failed with the following error:
client 3 dmesg:
Lustre: 1763:0:(llite_lib.c:2738:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.1.4.10@tcp:10.1.4.6@tcp:/lustre/fid: [0x200024621:0x2f14:0x0]/ may get corrupted (rc -108)
Lustre: 1763:0:(llite_lib.c:2738:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.1.4.10@tcp:10.1.4.6@tcp:/lustre/fid: [0x200024621:0x2eec:0x0]/ may get corrupted (rc -108)
Lustre: 1763:0:(llite_lib.c:2738:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.1.4.10@tcp:10.1.4.6@tcp:/lustre/fid: [0x200024621:0x2d28:0x0]/ may get corrupted (rc -108)
Lustre: 1764:0:(llite_lib.c:2738:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.1.4.10@tcp:10.1.4.6@tcp:/lustre/fid: [0x200024621:0x2d86:0x0]/ may get corrupted (rc -108)
Lustre: 1764:0:(llite_lib.c:2738:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.1.4.10@tcp:10.1.4.6@tcp:/lustre/fid: [0x200024621:0x2d94:0x0]/ may get corrupted (rc -108)
Lustre: lustre-OST0004-osc-ffff88007d400c00: Connection restored to lustre-OST0004 (at 10.1.4.9@tcp)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Failing type2=OST item2=ost5 ...
Lustre: DEBUG MARKER: Failing type2=OST item2=ost5 ...
INFO: task tar:3715 blocked for more than 120 seconds.
Not tainted 2.6.32-431.17.1.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
tar D 0000000000000000 0 3715 3713 0x00000080
ffff8800704d39e8 0000000000000082 0000000000000000 ffffffff810097cc
ffff88007b8c4ad8 0000000000000000 00000000004d39a8 ffff8800022143c0
ffff88007133daf8 ffff8800704d3fd8 000000000000fbc8 ffff88007133daf8
Call Trace:
[<ffffffff810097cc>] ? __switch_to+0x1ac/0x320
[<ffffffff81528a95>] schedule_timeout+0x215/0x2e0
[<ffffffffa0747470>] ? lustre_swab_ost_body+0x0/0x10 [ptlrpc]
[<ffffffff81528713>] wait_for_common+0x123/0x180
[<ffffffff81061d00>] ? default_wake_function+0x0/0x20
[<ffffffff8152882d>] wait_for_completion+0x1d/0x20
[<ffffffffa0b66c5c>] osc_io_setattr_end+0xbc/0x190 [osc]
[<ffffffffa0998370>] ? lov_io_end_wrapper+0x0/0x100 [lov]
[<ffffffffa055e0e0>] cl_io_end+0x60/0x150 [obdclass]
[<ffffffffa055ec60>] ? cl_io_start+0x0/0x140 [obdclass]
[<ffffffffa0998461>] lov_io_end_wrapper+0xf1/0x100 [lov]
[<ffffffffa09981ae>] lov_io_call+0x8e/0x130 [lov]
[<ffffffffa0999f3c>] lov_io_end+0x4c/0xf0 [lov]
[<ffffffffa055e0e0>] cl_io_end+0x60/0x150 [obdclass]
[<ffffffffa0562e52>] cl_io_loop+0xc2/0x1b0 [obdclass]
[<ffffffffa0a7c4c8>] cl_setattr_ost+0x218/0x2f0 [lustre]
[<ffffffffa0a4692c>] ll_setattr_raw+0xa2c/0x10d0 [lustre]
[<ffffffffa0a47035>] ll_setattr+0x65/0xd0 [lustre]
[<ffffffff811a6fc8>] notify_change+0x168/0x340
[<ffffffff811bb4ac>] utimes_common+0xdc/0x1b0
[<ffffffff810ec3fe>] ? call_rcu+0xe/0x10
[<ffffffff811aa6b0>] ? mntput_no_expire+0x30/0x110
[<ffffffff811bb650>] do_utimes+0xd0/0x170
[<ffffffff811bb7f2>] sys_utimensat+0x32/0x90
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Lustre: 1762:0:(client.c:1914:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1400706578/real 1400706578] req@ffff8800721e6800 x1468746594693168/t0(0) o8->lustre-OST0006-osc-ffff88007d400c00@10.1.4.5@tcp:28/4 lens 400/544 e 0 to 1 dl 1400706606 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 1762:0:(client.c:1914:ptlrpc_expire_one_request()) Skipped 29 previous similar messages
Lustre: lustre-OST0000-osc-ffff88007d400c00: Connection to lustre-OST0000 (at 10.1.4.9@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: Skipped 6 previous similar messages
Lustre: lustre-OST0001-osc-ffff88007d400c00: Connection to lustre-OST0001 (at 10.1.4.9@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: lustre-OST0003-osc-ffff88007d400c00: Connection to lustre-OST0003 (at 10.1.4.9@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: lustre-OST0000-osc-ffff88007d400c00: Connection restored to lustre-OST0000 (at 10.1.4.5@tcp)
Lustre: lustre-OST0001-osc-ffff88007d400c00: Connection restored to lustre-OST0001 (at 10.1.4.5@tcp)
INFO: task tar:3715 blocked for more than 120 seconds.
|
| Comments |
| Comment by Andreas Dilger [ 06/Jun/14 ] |
|
Jinshan or Bobijam, could you please take a look at this bug to see whether it needs to be a blocker for 2.6, and/or what the priority should be? |
| Comment by Jinshan Xiong (Inactive) [ 09/Jun/14 ] |
|
From the log, it looks like the client was struggling to reconnect to the OSTs but kept failing. The client thread was waiting for the SETATTR request to finish, which is where the stack trace came from. Eventually the client was evicted, and only then could the request wind up with an error. This is certainly not a problem on the client side. From what I have seen on the OSS, the recovery is really complex; for some reason some of the OSTs could not be recovered, which caused a lot of evictions. |
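For reference, the hang in the stack trace (tar blocked in osc_io_setattr_end -> wait_for_completion) follows the usual completion-wait pattern in the osc setattr path: the io-end callback blocks uninterruptibly until the ptlrpc interpret callback signals a completion, so while the import keeps failing to reconnect the thread sits in D state, and only once the client is evicted is the request aborted and the waiter woken with rc -108 (-ESHUTDOWN). Below is a minimal kernel-style sketch of that pattern, using simplified hypothetical names rather than the exact Lustre structures:

#include <linux/completion.h>
#include <linux/types.h>

/* Hypothetical, simplified version of the setattr wait/upcall pair;
 * the real Lustre field and function names differ between versions. */
struct setattr_cbargs {
        struct completion sc_sync;     /* init_completion() when the RPC is queued */
        int               sc_rc;       /* RPC result */
        bool              sc_rpc_sent; /* true once the request was handed to ptlrpc */
};

/* Called from the ptlrpc interpret callback when the request finishes,
 * either with a server reply or with an error once the request is
 * aborted, e.g. on client eviction. */
static void setattr_upcall(struct setattr_cbargs *args, int rc)
{
        args->sc_rc = rc;
        complete(&args->sc_sync);
}

/* The io-end side, where tar sat in the stack trace above.  The wait
 * is uninterruptible, hence the "blocked for more than 120 seconds"
 * hung-task warning while the OSTs fail to recover. */
static int setattr_end(struct setattr_cbargs *args)
{
        int rc = 0;

        if (args->sc_rpc_sent) {
                wait_for_completion(&args->sc_sync);
                rc = args->sc_rc; /* -108 (-ESHUTDOWN) after eviction */
        }
        return rc;
}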