[LU-2130] recovery-small test 50 osp_sync_thread()) ASSERTION( d->opd_syn_rpc_in_progress == 0 ) Created: 09/Oct/12  Updated: 02/Jan/13  Resolved: 19/Nov/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Blocker
Reporter: Oleg Drokin Assignee: Alex Zhuravlev
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 5132

 Description   

Happened after about 10 hours of running recovery-small in a loop (with reformat):

[32368.985230] Lustre: Failing over lustre-MDT0000
[32369.001102] LustreError: 31038:0:(mdt_handler.c:3208:mdt_recovery()) operation 35 on unconnected MDS from 12345-0@lo
[32369.249079] LustreError: 22775:0:(client.c:1114:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff880055a5ebf0 x1415365096908205/t0(0) o6->lustre-OST0000-osc-MDT0000@0@lo:28/4 lens 664/432 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
[32369.264751] LustreError: 31061:0:(osp_sync.c:889:osp_sync_thread()) ASSERTION( d->opd_syn_rpc_in_progress == 0 ) failed: lustre-OST0001-osc-MDT0000: 1 0 empty
[32369.265363] LustreError: 31061:0:(osp_sync.c:889:osp_sync_thread()) LBUG
[32369.265582] Pid: 31061, comm: osp-syn-1
[32369.265582] 
[32369.265908] 
[32369.265908] Call Trace:
[32369.266207]  [<ffffffffa0400915>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[32369.266447]  [<ffffffffa0400f27>] lbug_with_loc+0x47/0xb0 [libcfs]
[32369.266666]  [<ffffffffa0d220f5>] osp_sync_thread+0x665/0x710 [osp]
[32369.266910]  [<ffffffff814faeee>] ? _spin_unlock_irq+0xe/0x20
[32369.267111]  [<ffffffffa0d21a90>] ? osp_sync_thread+0x0/0x710 [osp]
[32369.267337]  [<ffffffff8100c14a>] child_rip+0xa/0x20
[32369.267535]  [<ffffffffa0d21a90>] ? osp_sync_thread+0x0/0x710 [osp]
[32369.267752]  [<ffffffffa0d21a90>] ? osp_sync_thread+0x0/0x710 [osp]
[32369.267970]  [<ffffffff8100c140>] ? child_rip+0x0/0x20
[32369.268176] 
[32369.269041] Kernel panic - not syncing: LBUG
[32369.269277] Pid: 31061, comm: osp-syn-1
[32369.269277]  Not tainted 2.6.32-debug #6
[32369.269760] Call Trace:
[32369.269970]  [<ffffffff814f75e4>] ? panic+0xa0/0x168
[32369.270173]  [<ffffffffa0400f7b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
[32369.270398]  [<ffffffffa0d220f5>] ? osp_sync_thread+0x665/0x710 [osp]
[32369.270629]  [<ffffffff814faeee>] ? _spin_unlock_irq+0xe/0x20
[32369.270844]  [<ffffffffa0d21a90>] ? osp_sync_thread+0x0/0x710 [osp]
[32369.272979]  [<ffffffff8100c14a>] ? child_rip+0xa/0x20
[32369.273194]  [<ffffffffa0d21a90>] ? osp_sync_thread+0x0/0x710 [osp]
[32369.273417]  [<ffffffffa0d21a90>] ? osp_sync_thread+0x0/0x710 [osp]
[32369.273628]  [<ffffffff8100c140>] ? child_rip+0x0/0x20

vmcore is available.



 Comments   
Comment by Alex Zhuravlev [ 14/Nov/12 ]

this seem to be a race between osp sync thread and ptlrpcd: interpret callback drops opd_syn_rpc_in_flight, but later osp_sync_request_commit_cb() increases opd_syn_rpc_in_flight. given osp_sync_request_commit_cb() pins requests, I think we should be waiting till all the requests are processed. need to think how to do this properly..

Comment by Alex Zhuravlev [ 14/Nov/12 ]

http://review.whamcloud.com/4581

Comment by Alex Zhuravlev [ 19/Nov/12 ]

landed

Generated at Sat Feb 10 01:22:37 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.