[LU-6780] bulk recovery is not stable when 2 MDTs fail at the same time Created: 30/Jun/15  Updated: 04/Jul/15  Resolved: 04/Jul/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Blocker
Reporter: Di Wang Assignee: Di Wang
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

I saw a few bulk timeout errors while testing with patch
http://review.whamcloud.com/#/c/13786/

11:44:13:Lustre: DEBUG MARKER: == replay-single test 110f: DNE: create striped dir, fail MDT1/MDT2 == 11:32:16 (1435663936)
11:44:13:Lustre: DEBUG MARKER: sync; sync; sync
11:44:13:Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0000 notransno
11:44:13:Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0000 readonly
11:44:13:Turning device dm-0 (0xfd00000) read-only
11:44:13:Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 REPLAY BARRIER on lustre-MDT0000
11:44:13:Lustre: DEBUG MARKER: mds1 REPLAY BARRIER on lustre-MDT0000
11:44:13:Lustre: DEBUG MARKER: grep -c /mnt/mds1' ' /proc/mounts
11:44:13:Lustre: DEBUG MARKER: umount -d /mnt/mds1
11:44:13:Removing read-only on unknown block (0xfd00000)
11:44:13:Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
11:44:13:Lustre: DEBUG MARKER: hostname
11:44:13:Lustre: DEBUG MARKER: test -b /dev/lvm-Role_MDS/P1
11:44:13:Lustre: DEBUG MARKER: mkdir -p /mnt/mds1; mount -t lustre                                 /dev/lvm-Role_MDS/P1 /mnt/mds1
11:44:13:LDISKFS-fs (dm-0): recovery complete
11:44:13:LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. quota=on. Opts:
11:44:13:Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u
11:44:13:Lustre: DEBUG MARKER: lctl set_param -n mdt.lustre*.enable_remote_dir=1
11:44:13:Lustre: DEBUG MARKER: e2label /dev/lvm-Role_MDS/P1 2>/dev/null
11:44:13:Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 4 sec
11:44:13:Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 4 sec
11:44:13:Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 4 sec
11:44:13:Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 4 sec
11:44:13:Lustre: 2930:0:(client.c:2018:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1435664041/real 1435664041]  req@ffff88006043d9c0 x1505397080987136/t0(0) o400->lustre-MDT0001-osp-MDT0000@10.1.4.127@tcp:24/4 lens 224/224 e 1 to 1 dl 1435664046 ref 1 fl Rpc:X/c0/ffffffff rc 0/-1
11:44:13:Lustre: 2930:0:(client.c:2018:ptlrpc_expire_one_request()) Skipped 35 previous similar messages
11:44:13:LustreError: 4290:0:(ldlm_lib.c:3030:target_bulk_io()) @@@ timeout on bulk WRITE after 100+0s  req@ffff88006a3c3050 x1505397094695176/t0(0) o1000->lustre-MDT0001-mdtlov_UUID@10.1.4.127@tcp:638/0 lens 248/16608 e 4 to 0 dl 1435664093 ref 1 fl Interpret:/0/0 rc 0/0
11:44:13:LNet: Service thread pid 4290 completed after 100.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
11:44:13:Lustre: 2930:0:(client.c:2018:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1435664645/real 1435664645]  req@ffff88006043d9c0 x1505397080993520/t0(0) o400->lustre-MDT0001-osp-MDT0000@10.1.4.127@tcp:24/4 lens 224/224 e 1 to 1 dl 1435664647 ref 1 fl Rpc:X/c0/ffffffff rc 0/-1
12:24:19:Lustre: 2930:0:(client.c:2018:ptlrpc_expire_one_request()) Skipped 73 previous similar messages
12:24:19:Lustre: 2930:0:(client.c:2018:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1435665245/real 1435665245]  req@ffff88006043d9c0 x1505397081000080/t0(0) o400->lustre-MDT0001-osp-MDT0000@10.1.4.127@tcp:24/4 lens 224/224 e 1 to 1 dl 1435665247 ref 1 fl Rpc:X/c0/ffffffff rc 0/-1
12:24:19:Lustre: 2930:0:(client.c:2018:ptlrpc_expire_one_request()) Skipped 99 previous similar messages
12:24:19:Lustre: DEBUG MARKER: /usr/sbin/lctl mark  rpc : @@@@@@ FAIL: can\'t put import for mdc.lustre-MDT0001-mdc-*.mds_server_uuid into FULL state after 1475 sec, have REPLAY
12:24:19:Lustre: DEBUG MARKER: /usr/sbin/lctl mark  rpc : @@@@@@ FAIL: can\'t put import for mdc.lustre-MDT0001-mdc-*.mds_server_uuid into FULL state after 1475 sec, have REPLAY_WAIT
12:24:19:Lustre: DEBUG MARKER: rpc : @@@@@@ FAIL: can't put import for mdc.lustre-MDT0001-mdc-*.mds_server_uuid into FULL state after 1475 sec, have REPLAY_WAIT
12:24:19:Lustre: DEBUG MARKER: rpc : @@@@@@ FAIL: can't put import for mdc.lustre-MDT0001-mdc-*.mds_server_uuid into FULL state after 1475 sec, have REPLAY


 Comments   
Comment by Gerrit Updater [ 01/Jul/15 ]

wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/15458
Subject: LU-6780 ptlrpc: Do not resend req with allow_replay
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e2e25cb1b74cc1616c2df3d7dce9d4f5e78437d6
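
The patch subject points at the ptlrpc resend path: a timed-out request that is marked as replayable should not go through the normal resend, since recovery/replay is expected to handle it, and resending it here appears (per the log above) to leave the MDT0000/MDT0001 recovery stuck in REPLAY behind bulk WRITE timeouts. Below is a minimal, self-contained C sketch of that guard, written under the assumption that the actual change adds a check of this kind; the names replay_req, allow_replay and maybe_resend are illustrative stand-ins and are not taken from fs/lustre-release.

/*
 * Illustrative sketch only: models the decision suggested by the patch
 * subject "Do not resend req with allow_replay". All names below are
 * hypothetical, not copied from the Lustre source tree.
 */
#include <stdbool.h>
#include <stdio.h>

struct replay_req {
	unsigned int allow_replay:1;	/* request may be replayed during recovery */
	unsigned int timed_out:1;	/* expired waiting for a reply */
};

/* Resend a timed-out request only if it is not marked for replay;
 * a replay-capable request is left to the recovery machinery instead
 * of being resent from the timeout path. */
static bool maybe_resend(const struct replay_req *req)
{
	if (!req->timed_out)
		return false;
	if (req->allow_replay)
		return false;	/* skip resend; replay will handle it */
	return true;		/* normal resend path */
}

int main(void)
{
	struct replay_req normal     = { .allow_replay = 0, .timed_out = 1 };
	struct replay_req replayable = { .allow_replay = 1, .timed_out = 1 };

	printf("normal request: resend=%d\n", maybe_resend(&normal));
	printf("allow_replay request: resend=%d\n", maybe_resend(&replayable));
	return 0;
}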

Comment by Gerrit Updater [ 04/Jul/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15458/
Subject: LU-6780 ptlrpc: Do not resend req with allow_replay
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 0ee3487737bd876e233213ccec4e6fca4690093e

Comment by Peter Jones [ 04/Jul/15 ]

Landed for 2.8
