[LU-2556] osp_sync_interpret()) ASSERTION( d->opd_syn_rpc_in_progress > 0 ) failed Created: 31/Dec/12  Updated: 26/Mar/13  Resolved: 26/Mar/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Blocker
Reporter: Oleg Drokin Assignee: Alex Zhuravlev
Resolution: Fixed Votes: 0
Labels: MB

Severity: 3
Rank (Obsolete): 5980

 Description   

Hit this while running recovery-small (test 29b) in a loop:

[386997.184707] Lustre: Failing over lustre-MDT0000
[386997.191570] LustreError: 11-0: an error occurred while communicating with 0@lo. The mds_close operation failed with -19
[386997.192109] LustreError: Skipped 15 previous similar messages
[386997.502568] LustreError: 31761:0:(osp_sync.c:393:osp_sync_interpret()) ASSERTION( d->opd_syn_rpc_in_progress > 0 ) failed: 
[386997.503167] LustreError: 31761:0:(osp_sync.c:393:osp_sync_interpret()) LBUG
[386997.503450] Pid: 31761, comm: ptlrpcd_0
[386997.503663] 
[386997.503664] Call Trace:
[386997.504041]  [<ffffffffa0aea915>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[386997.504325]  [<ffffffffa0aeaf27>] lbug_with_loc+0x47/0xb0 [libcfs]
[386997.504595]  [<ffffffffa09870f8>] osp_sync_interpret+0x4a8/0x560 [osp]
[386997.504915]  [<ffffffffa11ffc16>] ptlrpc_check_set+0x2b6/0x1db0 [ptlrpc]
[386997.505238]  [<ffffffffa1231b6b>] ptlrpcd_check+0x55b/0x590 [ptlrpc]
[386997.505538]  [<ffffffffa12320bb>] ptlrpcd+0x22b/0x3a0 [ptlrpc]
[386997.505809]  [<ffffffff81057d60>] ? default_wake_function+0x0/0x20
[386997.506099]  [<ffffffffa1231e90>] ? ptlrpcd+0x0/0x3a0 [ptlrpc]
[386997.506364]  [<ffffffff8100c14a>] child_rip+0xa/0x20
[386997.506635]  [<ffffffffa1231e90>] ? ptlrpcd+0x0/0x3a0 [ptlrpc]
[386997.506921]  [<ffffffffa1231e90>] ? ptlrpcd+0x0/0x3a0 [ptlrpc]
[386997.507185]  [<ffffffff8100c140>] ? child_rip+0x0/0x20
[386997.507433] 
[386997.511844] Kernel panic - not syncing: LBUG

Crashdump is in /exports/crashdumps/192.168.10.217-2012-12-31-11\:44\:36/



 Comments   
Comment by Oleg Drokin [ 06/Feb/13 ]

Just hit it again, this time in replay-dual test 23d
Crashdump in /exports/crashdumps/192.168.10.221-2013-02-05-04\:17\:50/

Comment by Jodi Levi (Inactive) [ 06/Feb/13 ]

Alex, Oleg indicated you are looking into this one, so assigning to you.

Comment by Alex Zhuravlev [ 06/Feb/13 ]

the request had:

rq_status = -5,
rq_transno = 4294967360,

and such a case seems to be handled improperly in the OSP code now.
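
For readers decoding the two values quoted above, here is a minimal sketch. It assumes rq_status follows the usual kernel convention of a negated errno, and it simply splits rq_transno into 32-bit halves. The helper name describe_reply is hypothetical and is not part of Lustre. Whether the upper half of the transno carries any particular meaning is not stated in this ticket.

```python
import errno


def describe_reply(rq_status: int, rq_transno: int) -> dict:
    """Decode the reply fields quoted in the comment (illustrative helper only)."""
    # Assumption: the status is a negated errno, as is conventional in kernel code.
    code = -rq_status
    return {
        "errno_name": errno.errorcode.get(code, "UNKNOWN"),
        "transno_hex": hex(rq_transno),
        # Plain arithmetic split of the 64-bit value into two 32-bit halves.
        "transno_high32": rq_transno >> 32,
        "transno_low32": rq_transno & 0xFFFFFFFF,
    }


info = describe_reply(-5, 4294967360)
print(info)
```

Under those assumptions, the reply carried an EIO error together with a non-zero transno (0x100000040), which is the combination the comment says is handled improperly.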

Comment by Alex Zhuravlev [ 18/Feb/13 ]

http://review.whamcloud.com/#change,5453

Comment by Peter Jones [ 26/Mar/13 ]

Landed for 2.4
