[LU-2397] Assertion triggered in check_for_next_transno Created: 27/Nov/12  Updated: 19/Apr/13  Resolved: 18/Jan/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Blocker
Reporter: Prakash Surya (Inactive) Assignee: Mikhail Pershin
Resolution: Fixed Votes: 0
Labels: LB

Severity: 3
Rank (Obsolete): 5688

 Description   

I did a quick search of the existing bugs, but didn't find this one.

Running the replay-single test in a single-node VM setup, I triggered the following:

LustreError: 14716:0:(ldlm_lib.c:1749:check_for_next_transno()) ASSERTION( req_transno >= next_transno ) failed: req_transno: 0, next_transno: 51539608157
LustreError: 14716:0:(ldlm_lib.c:1749:check_for_next_transno()) LBUG
Kernel panic - not syncing: LBUG
Pid: 14716, comm: tgt_recov Tainted: P           ---------------    2.6.32-279.9.1.1chaos.ch5.1.x86_64 #1
Call Trace:
 [<ffffffff814fdceb>] ? panic+0xa0/0x168
 [<ffffffffa0705f6b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
 [<ffffffffa0991e75>] ? check_for_next_transno+0x585/0x590 [ptlrpc]
 [<ffffffffa09918f0>] ? check_for_next_transno+0x0/0x590 [ptlrpc]
 [<ffffffffa098be0e>] ? target_recovery_overseer+0x5e/0x250 [ptlrpc]
 [<ffffffffa098a270>] ? exp_req_replay_healthy+0x0/0x30 [ptlrpc]
 [<ffffffffa0716591>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffffa0992f55>] ? target_recovery_thread+0x7b5/0x19d0 [ptlrpc]
 [<ffffffffa09927a0>] ? target_recovery_thread+0x0/0x19d0 [ptlrpc]
 [<ffffffff8100c14a>] ? child_rip+0xa/0x20
 [<ffffffffa09927a0>] ? target_recovery_thread+0x0/0x19d0 [ptlrpc]
 [<ffffffffa09927a0>] ? target_recovery_thread+0x0/0x19d0 [ptlrpc]
 [<ffffffff8100c140>] ? child_rip+0x0/0x20

The test case was simply:

$ sudo ONLY="59 60" FSTYPE=zfs ./lustre/tests/replay-single.sh


 Comments   
Comment by Peter Jones [ 27/Nov/12 ]

Could you please comment on this one? Thanks, Peter

Comment by Mikhail Pershin [ 29/Nov/12 ]

A replay with a transno of 0? That looks wrong. Is it easily reproduced with the command you mentioned?

Comment by Prakash Surya (Inactive) [ 29/Nov/12 ]

I only hit it once while running that test in a loop for about 6 hours, so I wouldn't call it easy to reproduce. Honestly, there are probably more pressing issues to spend time on; I just wanted to document it here since I did hit it once.

Comment by Oleg Drokin [ 31/Dec/12 ]

I hit this too, in recovery-ost-single, test 2:

[107343.839932] Lustre: 26202:0:(ldlm_lib.c:2184:target_recovery_init()) RECOVERY: service lustre-OST0000, 2 recoverable clients, last_transno 12884901891
[107348.833481] Lustre: lustre-OST0000: Will be in recovery for at least 1:00, or until 2 clients reconnect
[107348.836301] LustreError: 26224:0:(ldlm_lib.c:1748:check_for_next_transno()) ASSERTION( req_transno >= next_transno ) failed: req_transno: 0, next_transno: 12884901892
[107348.839179] LustreError: 26224:0:(ldlm_lib.c:1748:check_for_next_transno()) LBUG
[107348.840790] Pid: 26224, comm: tgt_recov
[107348.841717] 
[107348.841720] Call Trace:
[107348.842588]  [<ffffffffa0ae8915>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[107348.842866]  [<ffffffffa0ae8f27>] lbug_with_loc+0x47/0xb0 [libcfs]
[107348.843157]  [<ffffffffa1331b36>] check_for_next_transno+0x596/0x5a0 [ptlrpc]
[107348.843598]  [<ffffffffa13315a0>] ? check_for_next_transno+0x0/0x5a0 [ptlrpc]
[107348.844039]  [<ffffffffa132bb46>] target_recovery_overseer+0x66/0x230 [ptlrpc]
[107348.844532]  [<ffffffffa1329fb0>] ? exp_req_replay_healthy+0x0/0x30 [ptlrpc]
[107348.844827]  [<ffffffffa0af9641>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
[107348.845169]  [<ffffffffa1332ae3>] target_recovery_thread+0x683/0x1660 [ptlrpc]
[107348.845686]  [<ffffffff814faef5>] ? _spin_unlock_irq+0x15/0x20
[107348.846050]  [<ffffffffa1332460>] ? target_recovery_thread+0x0/0x1660 [ptlrpc]
[107348.846522]  [<ffffffff8100c14a>] child_rip+0xa/0x20
[107348.846795]  [<ffffffffa1332460>] ? target_recovery_thread+0x0/0x1660 [ptlrpc]
[107348.847313]  [<ffffffffa1332460>] ? target_recovery_thread+0x0/0x1660 [ptlrpc]
[107348.847798]  [<ffffffff8100c140>] ? child_rip+0x0/0x20
[107348.848056] 
[107348.848285] BUG: spinlock cpu recursion on CPU#3, ll_ost01_003/25839 (Not tainted)
[107348.848750]  lock: ffff88004d00cc38, .magic: dead4ead, .owner: tgt_recov/26224, .owner_cpu: 3
[107348.849257] Pid: 25839, comm: ll_ost01_003 Not tainted 2.6.32-debug #6
[107348.849558] Call Trace:
[107348.849764]  [<ffffffff8128098a>] ? spin_bug+0xaa/0x100
[107348.850006]  [<ffffffff81280ba1>] ? _raw_spin_lock+0x121/0x180
[107348.850290]  [<ffffffff81051f73>] ? __wake_up+0x53/0x70
[107348.850588]  [<ffffffff814fafde>] ? _spin_lock+0xe/0x10
[107348.850872]  [<ffffffffa1330d6c>] ? target_queue_recovery_request+0x41c/0xc50 [ptlrpc]
[107348.851438]  [<ffffffff814faf3e>] ? _spin_unlock+0xe/0x10
[107348.851617] Kernel panic - not syncing: LBUG
[107348.851621] Pid: 26224, comm: tgt_recov Not tainted 2.6.32-debug #6
[107348.851622] Call Trace:
[107348.851629]  [<ffffffff814f75e4>] ? panic+0xa0/0x168
[107348.851642]  [<ffffffffa0ae8f7b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
[107348.851673]  [<ffffffffa1331b36>] ? check_for_next_transno+0x596/0x5a0 [ptlrpc]
[107348.851701]  [<ffffffffa13315a0>] ? check_for_next_transno+0x0/0x5a0 [ptlrpc]
[107348.851728]  [<ffffffffa132bb46>] ? target_recovery_overseer+0x66/0x230 [ptlrpc]
[107348.851755]  [<ffffffffa1329fb0>] ? exp_req_replay_healthy+0x0/0x30 [ptlrpc]
[107348.851768]  [<ffffffffa0af9641>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
[107348.851796]  [<ffffffffa1332ae3>] ? target_recovery_thread+0x683/0x1660 [ptlrpc]
[107348.851799]  [<ffffffff814faef5>] ? _spin_unlock_irq+0x15/0x20
[107348.851826]  [<ffffffffa1332460>] ? target_recovery_thread+0x0/0x1660 [ptlrpc]
[107348.851829]  [<ffffffff8100c14a>] ? child_rip+0xa/0x20
[107348.851856]  [<ffffffffa1332460>] ? target_recovery_thread+0x0/0x1660 [ptlrpc]
[107348.851883]  [<ffffffffa1332460>] ? target_recovery_thread+0x0/0x1660 [ptlrpc]
[107348.851886]  [<ffffffff8100c140>] ? child_rip+0x0/0x20

The crashdump and modules are in /exports/crashdumps/192.168.10.219-2012-12-30-10\:28\:47

Comment by Mikhail Pershin [ 10/Jan/13 ]

The code below is not safe:

        } else if (cfs_atomic_read(&obd->obd_req_replay_clients) == 0) {
                /* first read of obd_req_replay_clients */
                CDEBUG(D_HA, "waking for completed recovery\n");
                wake_up = 1;
        } else if (req_transno == next_transno) {
                CDEBUG(D_HA, "waking for next ("LPD64")\n", next_transno);
                wake_up = 1;
        } else if (queue_len == cfs_atomic_read(&obd->obd_req_replay_clients)) {
                /* second read: the counter may have dropped to 0 since the
                 * check above, in which case queue_len == 0 also matches */
                int d_lvl = D_HA;
                /** handle gaps occured due to lost reply or VBR */
                LASSERTF(req_transno >= next_transno,
                         "req_transno: "LPU64", next_transno: "LPU64"\n",
                         req_transno, next_transno);

The obd_req_replay_clients counter is checked against 0, and the later comparison with queue_len assumes it is still non-zero. I believe that used to hold when all of these counters were modified under the same lock, but some time ago they became atomic values and many code paths went lockless as part of the SMP improvements work, so we can no longer rely on the counter not changing between the two checks. The recent crashes described in this ticket are exactly that case: queue_len is 0 and the counter drops to 0 just before the second check, so we enter a path that is not supposed to run with a zero queue_len.

The fix would be to check queue_len > 0 in addition to the existing check; also, the obd_req_replay_clients check against 0 should be done after the comparison with queue_len.
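For clarity, here is a minimal sketch of that reordering, reusing the identifiers from the excerpt above; it only illustrates the idea and is not necessarily the exact change that landed (see the review linked below):

        } else if (req_transno == next_transno) {
                CDEBUG(D_HA, "waking for next ("LPD64")\n", next_transno);
                wake_up = 1;
        } else if (queue_len > 0 &&
                   queue_len == cfs_atomic_read(&obd->obd_req_replay_clients)) {
                /* every remaining replay client has a request queued, so a
                 * gap can only come from a lost reply or VBR and req_transno
                 * must not be behind next_transno */
                LASSERTF(req_transno >= next_transno,
                         "req_transno: "LPU64", next_transno: "LPU64"\n",
                         req_transno, next_transno);
        } else if (cfs_atomic_read(&obd->obd_req_replay_clients) == 0) {
                /* checked after the queue_len comparison: the counter can
                 * drop to 0 at any moment, so recovery is only treated as
                 * complete once nothing is left in the queue */
                CDEBUG(D_HA, "waking for completed recovery\n");
                wake_up = 1;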

Comment by Mikhail Pershin [ 11/Jan/13 ]

http://review.whamcloud.com/#change,4998

Comment by Mikhail Pershin [ 18/Jan/13 ]

Landed.
