[LU-5154] Failure on test suite recovery-double-scale test_pairwise_fail Created: 06/Jun/14  Updated: 08/Aug/16

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None
Environment:

client and server: lustre-master build # 2052


Severity: 3
Rank (Obsolete): 14222

Description

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/f5b45d64-e1d6-11e3-8cc0-52540035b04c.

The sub-test test_pairwise_fail failed with the following error:

test failed to respond and timed out

client 3 dmesg

Lustre: 1763:0:(llite_lib.c:2738:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.1.4.10@tcp:10.1.4.6@tcp:/lustre/fid: [0x200024621:0x2f14:0x0]/ may get corrupted (rc -108)
Lustre: 1763:0:(llite_lib.c:2738:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.1.4.10@tcp:10.1.4.6@tcp:/lustre/fid: [0x200024621:0x2eec:0x0]/ may get corrupted (rc -108)
Lustre: 1763:0:(llite_lib.c:2738:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.1.4.10@tcp:10.1.4.6@tcp:/lustre/fid: [0x200024621:0x2d28:0x0]/ may get corrupted (rc -108)
Lustre: 1764:0:(llite_lib.c:2738:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.1.4.10@tcp:10.1.4.6@tcp:/lustre/fid: [0x200024621:0x2d86:0x0]/ may get corrupted (rc -108)
Lustre: 1764:0:(llite_lib.c:2738:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.1.4.10@tcp:10.1.4.6@tcp:/lustre/fid: [0x200024621:0x2d94:0x0]/ may get corrupted (rc -108)
Lustre: lustre-OST0004-osc-ffff88007d400c00: Connection restored to lustre-OST0004 (at 10.1.4.9@tcp)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark                             Failing type2=OST item2=ost5 ... 
Lustre: DEBUG MARKER: Failing type2=OST item2=ost5 ...
INFO: task tar:3715 blocked for more than 120 seconds.
      Not tainted 2.6.32-431.17.1.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
tar           D 0000000000000000     0  3715   3713 0x00000080
 ffff8800704d39e8 0000000000000082 0000000000000000 ffffffff810097cc
 ffff88007b8c4ad8 0000000000000000 00000000004d39a8 ffff8800022143c0
 ffff88007133daf8 ffff8800704d3fd8 000000000000fbc8 ffff88007133daf8
Call Trace:
 [<ffffffff810097cc>] ? __switch_to+0x1ac/0x320
 [<ffffffff81528a95>] schedule_timeout+0x215/0x2e0
 [<ffffffffa0747470>] ? lustre_swab_ost_body+0x0/0x10 [ptlrpc]
 [<ffffffff81528713>] wait_for_common+0x123/0x180
 [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
 [<ffffffff8152882d>] wait_for_completion+0x1d/0x20
 [<ffffffffa0b66c5c>] osc_io_setattr_end+0xbc/0x190 [osc]
 [<ffffffffa0998370>] ? lov_io_end_wrapper+0x0/0x100 [lov]
 [<ffffffffa055e0e0>] cl_io_end+0x60/0x150 [obdclass]
 [<ffffffffa055ec60>] ? cl_io_start+0x0/0x140 [obdclass]
 [<ffffffffa0998461>] lov_io_end_wrapper+0xf1/0x100 [lov]
 [<ffffffffa09981ae>] lov_io_call+0x8e/0x130 [lov]
 [<ffffffffa0999f3c>] lov_io_end+0x4c/0xf0 [lov]
 [<ffffffffa055e0e0>] cl_io_end+0x60/0x150 [obdclass]
 [<ffffffffa0562e52>] cl_io_loop+0xc2/0x1b0 [obdclass]
 [<ffffffffa0a7c4c8>] cl_setattr_ost+0x218/0x2f0 [lustre]
 [<ffffffffa0a4692c>] ll_setattr_raw+0xa2c/0x10d0 [lustre]
 [<ffffffffa0a47035>] ll_setattr+0x65/0xd0 [lustre]
 [<ffffffff811a6fc8>] notify_change+0x168/0x340
 [<ffffffff811bb4ac>] utimes_common+0xdc/0x1b0
 [<ffffffff810ec3fe>] ? call_rcu+0xe/0x10
 [<ffffffff811aa6b0>] ? mntput_no_expire+0x30/0x110
 [<ffffffff811bb650>] do_utimes+0xd0/0x170
 [<ffffffff811bb7f2>] sys_utimensat+0x32/0x90
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Lustre: 1762:0:(client.c:1914:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1400706578/real 1400706578]  req@ffff8800721e6800 x1468746594693168/t0(0) o8->lustre-OST0006-osc-ffff88007d400c00@10.1.4.5@tcp:28/4 lens 400/544 e 0 to 1 dl 1400706606 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 1762:0:(client.c:1914:ptlrpc_expire_one_request()) Skipped 29 previous similar messages
Lustre: lustre-OST0000-osc-ffff88007d400c00: Connection to lustre-OST0000 (at 10.1.4.9@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: Skipped 6 previous similar messages
Lustre: lustre-OST0001-osc-ffff88007d400c00: Connection to lustre-OST0001 (at 10.1.4.9@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: lustre-OST0003-osc-ffff88007d400c00: Connection to lustre-OST0003 (at 10.1.4.9@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: lustre-OST0000-osc-ffff88007d400c00: Connection restored to lustre-OST0000 (at 10.1.4.5@tcp)
Lustre: lustre-OST0001-osc-ffff88007d400c00: Connection restored to lustre-OST0001 (at 10.1.4.5@tcp)
INFO: task tar:3715 blocked for more than 120 seconds.
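As a side note on reading the log above: the "rc -108" in the dirty-page-discard messages is a negated errno value, which on Linux corresponds to ESHUTDOWN ("Cannot send after transport endpoint shutdown"), i.e. the import/connection was already torn down when the pages were discarded. A quick way to decode such codes (errno numbering is platform-dependent; 108 is the Linux value):

```python
import errno
import os

# Kernel/Lustre logs report negative return codes; they are negated errnos.
rc = -108
code = -rc
name = errno.errorcode.get(code, "unknown")  # 'ESHUTDOWN' on Linux
print(f"rc {rc} -> {name}: {os.strerror(code)}")
```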


Comments
Comment by Andreas Dilger [ 06/Jun/14 ]

Jinshan or Bobijam, could you please take a look at this bug to see whether it needs to be a blocker for 2.6, and/or what the priority should be?

Comment by Jinshan Xiong (Inactive) [ 09/Jun/14 ]

From the log, it looks like the client was struggling to reconnect to the OSTs but kept failing. The client thread was waiting for its SETATTR request to complete, which is where the stack trace came from. Eventually the client was evicted, at which point the request could finally wind up with an error.

This is almost certainly not a problem on the client side.

The recovery is really complex from what I have seen on the OSS. For some reason, some of the OSTs could not be recovered, which caused a lot of evictions.

Generated at Sat Feb 10 01:48:59 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.