[LU-6880] recovery timeout during 24 hours failover test Created: 19/Jul/15  Updated: 28/Aug/15  Resolved: 28/Aug/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Blocker
Reporter: Di Wang Assignee: Di Wang
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-6831 The ticket for tracking all DNE2 bugs Reopened
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Recovery cannot finish in time during the 24-hour failover test. The test exited after 23 failovers:

Server failover period: 600 seconds
Exited after:           13229 seconds
Number of failovers before exit:
mds1: 2 times
mds2: 7 times
mds3: 1 times
mds4: 1 times
mds5: 3 times
mds6: 3 times
mds7: 3 times
mds8: 3 times
ost1: 0 times
ost2: 0 times
ost3: 0 times
ost4: 0 times
Status: FAIL: rc=7
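
As a quick sanity check on the summary above, the per-target counts can be summed to confirm the figure of 23 failovers and to see how far into the run the test exited. This is only a sketch: the helper names `failover_counts` and `seconds_field` are made up for illustration and are not part of the Lustre test suite, and the text is pasted in by hand from the output above.

```python
import re

# Summary text copied by hand from the test output above.
SUMMARY = """\
Server failover period: 600 seconds
Exited after:           13229 seconds
Number of failovers before exit:
mds1: 2 times
mds2: 7 times
mds3: 1 times
mds4: 1 times
mds5: 3 times
mds6: 3 times
mds7: 3 times
mds8: 3 times
ost1: 0 times
ost2: 0 times
ost3: 0 times
ost4: 0 times
"""


def failover_counts(text):
    """Return {target: count} for lines shaped like 'mds1: 2 times'."""
    pairs = re.findall(r"^(\w+):\s+(\d+) times$", text, re.M)
    return {name: int(count) for name, count in pairs}


def seconds_field(text, label):
    """Return the integer value of a '<label>: N seconds' line."""
    match = re.search(rf"^{re.escape(label)}:\s+(\d+) seconds$", text, re.M)
    return int(match.group(1))


counts = failover_counts(SUMMARY)
total = sum(counts.values())
period = seconds_field(SUMMARY, "Server failover period")
elapsed = seconds_field(SUMMARY, "Exited after")

print("total failovers:", total)
print("full failover periods elapsed:", elapsed // period)
print("hours elapsed: %.2f" % (elapsed / 3600))
```

All 23 failovers are on MDS targets and none on OSTs, and the run stopped well short of the intended 24 hours.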


 Comments   
Comment by Di Wang [ 19/Jul/15 ]
tdtd-1        S 000000000000000a     0 22764      2 0x00000080
 ffff8807ef81fd80 0000000000000046 0000000000000000 0000000000000000
 000000060011bf89 00000000fffffff4 ffff8807ef81fd40 ffff8808125f7b30
 ffff8808309a5af8 ffff8807ef81ffd8 000000000000fbc8 ffff8808309a5af8
Call Trace:
 [<ffffffffa08fc32d>] distribute_txn_commit_thread+0xfed/0x1750 [ptlrpc]
 [<ffffffff81061d12>] ? default_wake_function+0x12/0x20
 [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
 [<ffffffffa08fb340>] ? distribute_txn_commit_thread+0x0/0x1750 [ptlrpc]
 [<ffffffff8109abf6>] kthread+0x96/0xa0
 [<ffffffff8100c20a>] child_rip+0xa/0x20
 [<ffffffff8109ab60>] ? kthread+0x0/0xa0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20
lod0001_rec00 S 0000000000000007     0 22766      2 0x00000080
 ffff8807ef825910 0000000000000046 0000000000000000 ffff8807ef8258d4
 000021d557def61b 0000000000000286 ffff8807ef8258b0 ffffffff81083e1c
 ffff8807ef823ab8 ffff8807ef825fd8 000000000000fbc8 ffff8807ef823ab8
Call Trace:
 [<ffffffff81083e1c>] ? lock_timer_base+0x3c/0x70
 [<ffffffff8152a512>] schedule_timeout+0x192/0x2e0
 [<ffffffff81083f30>] ? process_timeout+0x0/0x10
 [<ffffffffa0874f99>] ptlrpc_set_wait+0x319/0xa30 [ptlrpc]
 [<ffffffffa086a510>] ? ptlrpc_interrupted_set+0x0/0x110 [ptlrpc]
 [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
 [<ffffffffa08811a5>] ? lustre_msg_set_jobid+0xf5/0x130 [ptlrpc]
 [<ffffffffa0875731>] ptlrpc_queue_wait+0x81/0x220 [ptlrpc]
 [<ffffffffa10de271>] osp_remote_sync+0x121/0x190 [osp]
 [<ffffffffa10c289d>] osp_attr_get+0x40d/0x6c0 [osp]
 [<ffffffffa10c42a4>] osp_object_init+0x1b4/0x320 [osp]
 [<ffffffffa0657db8>] lu_object_alloc+0xd8/0x320 [obdclass]
 [<ffffffffa0659161>] lu_object_find_try+0x151/0x260 [obdclass]
 [<ffffffffa0659321>] lu_object_find_at+0xb1/0xe0 [obdclass]
 [<ffffffff8116ef30>] ? cache_alloc_refill+0x1c0/0x240
 [<ffffffffa065a1bc>] dt_locate_at+0x1c/0xa0 [obdclass]
 [<ffffffffa061934e>] llog_osd_get_cat_list+0x8e/0xcd0 [obdclass]
 [<ffffffffa0ff4bc0>] lod_sub_prep_llog+0x110/0x7b0 [lod]
 [<ffffffff81058bd3>] ? __wake_up+0x53/0x70
 [<ffffffffa0fc97f6>] lod_sub_recovery_thread+0x196/0xbc0 [lod]
 [<ffffffff81061d12>] ? default_wake_function+0x12/0x20
 [<ffffffffa0fc9660>] ? lod_sub_recovery_thread+0x0/0xbc0 [lod]
 [<ffffffff8109abf6>] kthread+0x96/0xa0
 [<ffffffff8100c20a>] child_rip+0xa/0x20
 [<ffffffff8109ab60>] ? kthread+0x0/0xa0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20

It looks like the log retrieval process (lod_sub_recovery_thread) is blocked by import recovery.

Comment by Gerrit Updater [ 22/Jul/15 ]

wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/15682
Subject: LU-6880 update: after reply move dtrq to finish list
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 400355f4bc9e353d20638f3264ef3c80b799a5cb

Comment by Gerrit Updater [ 28/Aug/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15682/
Subject: LU-6880 update: after reply move dtrq to finish list
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 2a874ec011e680f49405a7e901d8d0d35dcb4f1a

Comment by Joseph Gmitter (Inactive) [ 28/Aug/15 ]

Landed for 2.8.
