[LU-2109] __llog_process_thread() GPF Created: 01/Jun/12  Updated: 19/Apr/13  Resolved: 19/Apr/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Blocker
Reporter: Brian Behlendorf Assignee: Mikhail Pershin
Resolution: Fixed Votes: 0
Labels: HB, sequoia

Issue Links:
Related
is related to LU-2362 MDS crash in llog_process_thread Closed
Severity: 3
Epic: server
Project: Orion
Rank (Obsolete): 3037

 Description   

Observed after rebooting grove-mds2 onto our latest kernel and lustre-orion tag. The server was rebooted while under a fairly heavy test load. No detailed investigation has been done yet; the MDS was able to restart successfully after rebooting the node again and claims to have successfully completed recovery this time.

— First boot —

Mounting grove-mds2/mgs on /mnt/lustre/local/lstest-MGS0000
Lustre: Lustre: Build Version: orion-2_2_49_57_3-49chaos-49chaos--PRISTINE-2.6.32-220.17.1.3chaos.ch5.x86_64
Lustre: MGS: Mounted grove-mds2/mgs
Mounting grove-mds2/mdt0 on /mnt/lustre/local/lstest-MDT0000
LustreError: 11-0: MGC172.20.5.2@o2ib500: Communicating with 0@lo, operation llog_origin_handle_create failed with -2
LustreError: 4503:0:(mgc_request.c:250:do_config_log_add()) failed processing sptlrpc log: -2
Lustre: 4508:0:(fld_index.c:356:fld_index_init()) srv-lstest-MDT0000: File "fld" doesn't support range lookup, using stub. DNE and FIDs on OST will not work with this backend
Lustre: lstest-MDT0000: Temporarily refusing client connection from 172.20.3.48@o2ib500
Lustre: lstest-MDT0000: Temporarily refusing client connection from 172.20.2.199@o2ib500
LNet: 26244:0:(o2iblnd_cb.c:2340:kiblnd_passive_connect()) Conn race 172.20.4.51@o2ib500
LNet: 26244:0:(o2iblnd_cb.c:2340:kiblnd_passive_connect()) Conn race 172.20.4.44@o2ib500
Lustre: lstest-MDT0000: Temporarily refusing client connection from 172.20.3.53@o2ib500
Lustre: Skipped 186 previous similar messages
LustreError: 4567:0:(llog_cat.c:186:llog_cat_id2handle()) error opening log id 0x5a5a5a5a5a5a5a5a:5a5a5a5a: rc -2
LustreError: 4567:0:(llog_cat.c:503:llog_cat_cancel_records()) Cannot find log 0x5a5a5a5a5a5a5a5a
LustreError: 4567:0:(llog_cat.c:533:llog_cat_cancel_records()) Cancel 0 of 1 llog-records failed: -2
LustreError: 4567:0:(osp_sync.c:705:osp_sync_process_committed()) lstest-OST0281-osc-MDT0000: can't cancel record: -2
LustreError: 4567:0:(llog_cat.c:186:llog_cat_id2handle()) error opening log id 0x5a5a5a5a5a5a5a5a:5a5a5a5a: rc -2
LustreError: 4567:0:(llog_cat.c:503:llog_cat_cancel_records()) Cannot find log 0x5a5a5a5a5a5a5a5a
LustreError: 4567:0:(llog_cat.c:533:llog_cat_cancel_records()) Cancel 0 of 1 llog-records failed: -2
LustreError: 4567:0:(osp_sync.c:705:osp_sync_process_committed()) lstest-OST0281-osc-MDT0000: can't cancel record: -2
general protection fault: 0000 [#1] SMP
last sysfs file: /sys/module/sg/initstate
CPU 7

Pid: 4567, comm: osp-syn-641
Tainted: P W ---------------- 2.6.32-220.17.1.3chaos.ch5.x86_64 #1 Supermicro X8DTH-i/6/iF/6F/X8DTH
RIP: 0010:[<ffffffffa0694252>] [<ffffffffa0694252>] __llog_process_thread+0x2a2/0xc80 [obdclass]
RSP: 0018:ffff882f66e9db60 EFLAGS: 00010206
RAX: 5a5a5a5a5a5a5a5a RBX: 0000000000008701 RCX: 0000000000000000
RDX: 000000000021e000 RSI: 0000000000000001 RDI: ffff882f63cfe000
RBP: ffff882f66e9dc10 R08: ffff88179d8f1900 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff88179caf4058
R13: 000000000000fcff R14: ffff882f63cfc000 R15: ffff882f63cfe000
FS: 00007ffff7fdc700(0000) GS:ffff881894820000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007ffff7ff9000 CR3: 0000000001a85000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process osp-syn-641
(pid: 4567, threadinfo ffff882f66e9c000, task ffff882ff9988ae0)
Stack:
0000000000002000 0000000000000050 0000000000000010 0000000000000000
<0> ffff882f63cfc001 000086fe00000000 0000000000000000 ffff882f640124c0
<0> ffff882f66e9de80 0000000000000000 000000000021e000 0000fd009aee4b80
Call Trace:
[<ffffffffa0dd4570>] ? osp_sync_process_queues+0x0/0xf60 [osp]
[<ffffffffa0694d33>] __llog_process+0x103/0x4d0 [obdclass]
[<ffffffffa06961bb>] llog_cat_process_cb+0x21b/0x290 [obdclass]
[<ffffffffa06946fe>] __llog_process_thread+0x74e/0xc80 [obdclass]
[<ffffffff810618d4>] ? enqueue_task_fair+0x64/0x100
[<ffffffffa0695fa0>] ? llog_cat_process_cb+0x0/0x290 [obdclass]
[<ffffffffa0694d33>] __llog_process+0x103/0x4d0 [obdclass]
[<ffffffffa06953d8>] __llog_cat_process+0x98/0x260 [obdclass]
[<ffffffffa0dd4570>] ? osp_sync_process_queues+0x0/0xf60 [osp]
[<ffffffff81051ba3>] ? __wake_up+0x53/0x70
[<ffffffffa0dd64f2>] osp_sync_thread+0x1c2/0x620 [osp]
[<ffffffffa0dd6330>] ? osp_sync_thread+0x0/0x620 [osp]
[<ffffffff8100c14a>] child_rip+0xa/0x20
[<ffffffffa0dd6330>] ? osp_sync_thread+0x0/0x620 [osp]
[<ffffffffa0dd6330>] ? osp_sync_thread+0x0/0x620 [osp]
[<ffffffff8100c140>] ? child_rip+0x0/0x20

— Second boot —

Mounting grove-mds2/mgs on /mnt/lustre/local/lstest-MGS0000
Lustre: Lustre: Build Version: orion-2_2_49_57_3-49chaos-49chaos--PRISTINE-2.6.32-220.17.1.3chaos.ch5.x86_64
Lustre: MGS: Mounted grove-mds2/mgs
Mounting grove-mds2/mdt0 on /mnt/lustre/local/lstest-MDT0000
LustreError: 11-0: MGC172.20.5.2@o2ib500: Communicating with 0@lo, operation llog_origin_handle_create failed with -2
LustreError: 4525:0:(mgc_request.c:250:do_config_log_add()) failed processing sptlrpc log: -2
Lustre: 4530:0:(fld_index.c:356:fld_index_init()) srv-lstest-MDT0000: File "fld" doesn't support range lookup, using stub. DNE and FIDs on OST will not work with this backend
Lustre: lstest-MDT0000: Temporarily refusing client connection from 172.20.3.183@o2ib500
Lustre: lstest-MDT0000: Temporarily refusing client connection from 172.20.3.22@o2ib500
Lustre: lstest-MDT0000: Temporarily refusing client connection from 172.20.3.24@o2ib500
Lustre: Skipped 3 previous similar messages
Lustre: lstest-MDT0000: Temporarily refusing client connection from 172.20.3.36@o2ib500
Lustre: Skipped 2 previous similar messages
Lustre: lstest-MDT0000: Mounted grove-mds2/mdt0
Lustre: lstest-MDT0000: Will be in recovery for at least 5:00, or until 255 clients reconnect.
LustreError: 11-0: lstest-OST0282-osc-MDT0000: Communicating with 172.20.4.42@o2ib500, operation ost_connect failed with -16
Lustre: lstest-MDT0000: Recovery over after 1:10, of 255 clients 255 recovered and 0 were evicted.
LustreError: 11-0: lstest-OST0282-osc-MDT0000: Communicating with 172.20.4.42@o2ib500, operation ost_connect failed with -16



 Comments   
Comment by Li Wei (Inactive) [ 23/Jul/12 ]

Brian,

Given that the request buffers would not be retrievable after being freed, I suspect that the plain log handles were freed somehow while osp_sync_thread() was processing them. If this sounds possible, I wonder if you could apply http://review.whamcloud.com/3443 and try reproducing the failure? The libcfs module parameter "libcfs_panic_on_lbug" should be set to "0" on the MDS in order to get the debug log. I was not able to reproduce it in the lab with a "mdsrate --unlink" workload.

Comment by Brian Behlendorf [ 30/Jul/12 ]

We'll pull the debug patch into our branch tag and see if we're able to reproduce the issue. We still appear to hit it sporadically when restarting the servers.

Comment by Brian Behlendorf [ 30/Jul/12 ]

We haven't gotten the debug patch in place just yet, although we did see a slight variation on this. This time we didn't get the poison pattern.

2012-07-30 09:44:03 grove-mds2 login: Lustre: 21215:0:(fld_index.c:354:fld_index_init()) srv-lstest-MDT0000: File "fld" doesn't support range lookup, using stub. DNE and FIDs on OST will not work with this backend
2012-07-30 09:44:40 LustreError: 21529:0:(llog_cat.c:184:llog_cat_id2handle()) error opening log id 0xffff8817ff250000:fbf1f800: rc -2
2012-07-30 09:44:40 LustreError: 21529:0:(llog_cat.c:501:llog_cat_cancel_records()) Cannot find log 0xffff8817ff250000
2012-07-30 09:44:40 LustreError: 21529:0:(llog_cat.c:531:llog_cat_cancel_records()) Cancel 0 of 1 llog-records failed: -2
2012-07-30 09:44:40 LustreError: 21529:0:(osp_sync.c:705:osp_sync_process_committed()) lstest-OST02d9-osc-MDT0000: can't cancel record: -2
2012-07-30 09:44:40 LustreError: 21529:0:(llog_cat.c:184:llog_cat_id2handle()) error opening log id 0xffff8817ff250000:fbf1f800: rc -2
2012-07-30 09:44:40 LustreError: 21529:0:(llog_cat.c:501:llog_cat_cancel_records()) Cannot find log 0xffff8817ff250000
2012-07-30 09:44:40 LustreError: 21529:0:(llog_cat.c:531:llog_cat_cancel_records()) Cancel 0 of 1 llog-records failed: -2
2012-07-30 09:44:40 LustreError: 21529:0:(osp_sync.c:705:osp_sync_process_committed()) lstest-OST02d9-osc-MDT0000: can't cancel record: -2
2012-07-30 09:44:40 general protection fault: 0000 [#1] SMP
2012-07-30 09:44:40 last sysfs file: /sys/module/ptlrpc/initstate

Also note that the crash isn't being caused by an LBUG, it's a GPF, so setting "libcfs_panic_on_lbug" to 0 isn't going to help. Resolving the address in the earlier posted stack suggests that 'lop->lop_next_block' is probably a garbage address we're attempting to jump to.

(gdb) list *( __llog_process_thread+0x2a2)
0x60c2 is in __llog_process_thread (/home/behlendo/src/git/lustre/lustre/include/lustre_log.h:648).
warning: Source file is more recent than executable.
643             ENTRY;
644
645             rc = llog_handle2ops(loghandle, &lop);
646             if (rc)
647                     RETURN(rc);
648             if (lop->lop_next_block == NULL)
649                     RETURN(-EOPNOTSUPP);
650
651             rc = lop->lop_next_block(env, loghandle, cur_idx, next_idx,
652                                      cur_offset, buf, len);
Comment by Li Wei (Inactive) [ 30/Jul/12 ]

The value of "lop" was 0x5a5a5a5a5a5a5a5a (i.e., RAX), according to

llog_next_block():
/root/lustre-dev/lustre/include/lustre_log.h:648
    60c2:       48 8b 40 08             mov    0x8(%rax),%rax
    60c6:       48 85 c0                test   %rax,%rax
    60c9:       0f 84 81 01 00 00       je     6250 <__llog_process_thread+0x430>

The log handles do not have their own slab cache. In the latest "variation", the handle in question might have been allocated to some other structure, which could explain why neither a valid log ID nor the poison was seen.

I'll think harder about how the handles could be freed.

Comment by Li Wei (Inactive) [ 03/Aug/12 ]

I have updated the diagnostic patch to narrow down the problematic zone.

Comment by Brian Behlendorf [ 06/Aug/12 ]

Observed again, but unfortunately without the updated diagnostic patch. I'll pull the updated version into our next tag.

Comment by Brian Behlendorf [ 06/Aug/12 ]

Actually, I misspoke. Revision 2 of the diagnostic patch was applied, but not revision 3. I also tweaked revision 2 to change the CDEBUG() in osp_sync_new_job() to a CERROR() to ensure we saw the error. There was a good chance we wouldn't get a debug log, and in fact we didn't, but we do have the following console output.

2012-08-06 11:35:50 LNet: 13171:0:(o2iblnd_cb.c:2337:kiblnd_passive_connect()) Conn race 172.20.4.42@o2ib500
2012-08-06 11:35:50 LNet: 13171:0:(o2iblnd_cb.c:2337:kiblnd_passive_connect()) Skipped 654 previous similar messages
2012-08-06 11:35:50 LustreError: 21152:0:(llog_cat.c:184:llog_cat_id2handle()) error opening log id 0xffff881801d0e000:be556800: rc -2
2012-08-06 11:35:50 LustreError: 21152:0:(llog_cat.c:501:llog_cat_cancel_records()) Cannot find log 0xffff881801d0e000
2012-08-06 11:35:50 LustreError: 21152:0:(llog_cat.c:531:llog_cat_cancel_records()) Cancel 0 of 1 llog-records failed: -2
2012-08-06 11:35:50 LustreError: 21152:0:(osp_sync.c:709:osp_sync_process_committed()) lstest-OST0293-osc-MDT0000: can't cancel record: -2
2012-08-06 11:35:50 LustreError: 11-0: lstest-OST0282-osc-MDT0000: Communicating with 172.20.4.42@o2ib500, operation ost_connect failed with -19
2012-08-06 11:35:50 LustreError: Skipped 4 previous similar messages
2012-08-06 11:35:50 LustreError: 21152:0:(osp_sync.c:465:osp_sync_new_job()) Poisoned log ID from handle ffff882ffb055900
2012-08-06 11:35:50 LustreError: 21152:0:(osp_sync.c:465:osp_sync_new_job()) Poisoned log ID from handle ffff882ffb055900
2012-08-06 11:35:50 LustreError: 21152:0:(llog_cat.c:184:llog_cat_id2handle()) error opening log id 0x5a5a5a5a5a5a5a5a:5a5a5a5a: rc -2
2012-08-06 11:35:50 LustreError: 21152:0:(llog_cat.c:501:llog_cat_cancel_records()) Cannot find log 0x5a5a5a5a5a5a5a5a
2012-08-06 11:35:50 LustreError: 21152:0:(llog_cat.c:531:llog_cat_cancel_records()) Cancel 0 of 1 llog-records failed: -2
2012-08-06 11:35:50 LustreError: 21152:0:(osp_sync.c:709:osp_sync_process_committed()) lstest-OST0293-osc-MDT0000: can't cancel record: -2
2012-08-06 11:35:50 LustreError: 21152:0:(osp_sync.c:859:osp_sync_thread()) ASSERTION( rc == 0 || rc == LLOG_PROC_BREAK ) failed: 0 changes, 7 in progress, 7 in flight: -22
2012-08-06 11:35:50 LustreError: 21152:0:(osp_sync.c:859:osp_sync_thread()) LBUG
Comment by Li Wei (Inactive) [ 07/Aug/12 ]

Brian,

Thanks. The console log further proves that the junk log ID came from freed or reallocated log handles. I'm looking forward to seeing how revision 3 of the diagnostic patch can help us determine why the handle was freed while being used.

Comment by Ian Colle (Inactive) [ 08/Oct/12 ]

Brian - we haven't seen an update in this ticket for a couple of months; is it still an issue?

Comment by Prakash Surya (Inactive) [ 08/Oct/12 ]

IIRC, this would only occur sporadically after upgrading, and then subside for an unknown reason. So we can't really say for sure whether it's still an issue, because we never really understood the root cause to begin with and didn't have a solid reproducer. Personally, I'd be OK with moving this to an LU ticket and then resolving it as "won't fix" or something similar. That way it's out in public and can be re-opened if it is seen again with the 2.3 code.

Comment by Ian Colle (Inactive) [ 08/Oct/12 ]

Moved to Lustre project as this code has landed to master. Marking as Won't Fix at Prakash's recommendation until/unless it reoccurs.

Comment by Andreas Dilger [ 16/Oct/12 ]

I hit this in my single-node test setup:

The failure was hit when unmounting after mmp.sh finished and the filesystem was remounting:

Lustre: DEBUG MARKER: == mmp test complete, duration 394 sec 07:09:31 (1350392971)
Lustre: MGC192.168.20.154@tcp: Reactivating import
LustreError: 12112:0:(sec_config.c:1024:sptlrpc_target_local_copy_conf()) missing llog context
Lustre: testfs-MDT0000: Temporarily refusing client connection from 0@lo
LustreError: 11-0: an error occurred while communicating with 0@lo. The mds_connect operation failed with -11
Lustre: Found index 2 for testfs-OST0002, updating log
Lustre: 9041:0:(client.c:1909:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1350392977/real 1350392977]  req@ffff8800b1b6bc00 x1415989288763649/t0(0) o8->testfs-OST0002-osc-MDT0000@0@lo:28/4 lens 400/544 e 0 to 1 dl 1350392982 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
LustreError: 11-0: an error occurred while communicating with 0@lo. The mds_connect operation failed with -11
Lustre: 12244:0:(ofd_obd.c:1069:ofd_orphans_destroy()) testfs-OST0002: deleting orphan objects from 80455 to 85737
Lustre: 12248:0:(ofd_obd.c:1069:ofd_orphans_destroy()) testfs-OST0001: deleting orphan objects from 80346 to 80495
Lustre: Mounted testfs-client
LustreError: 12145:0:(llog_cat.c:187:llog_cat_id2handle()) testfs-OST0001-osc-MDT0000: error opening log id 0x5a5a5a5a5a5a5a5a:5a5a5a5a: rc = -5
LustreError: 12145:0:(llog_cat.c:513:llog_cat_cancel_records()) Cannot find log 0x5a5a5a5a5a5a5a5a
LustreError: 12145:0:(llog_cat.c:544:llog_cat_cancel_records()) testfs-OST0001-osc-MDT0000: fail to cancel 0 of 1 llog-records: rc = -5
LustreError: 12145:0:(osp_sync.c:708:osp_sync_process_committed()) testfs-OST0001-osc-MDT0000: can't cancel record: -5
LustreError: 12145:0:(llog_cat.c:187:llog_cat_id2handle()) testfs-OST0001-osc-MDT0000: error opening log id 0x5a5a5a5a5a5a5a5a:5a5a5a5a: rc = -5
LustreError: 12145:0:(llog_cat.c:513:llog_cat_cancel_records()) Cannot find log 0x5a5a5a5a5a5a5a5a
LustreError: 12145:0:(llog_cat.c:544:llog_cat_cancel_records()) testfs-OST0001-osc-MDT0000: fail to cancel 0 of 1 llog-records: rc = -5
LustreError: 12145:0:(osp_sync.c:708:osp_sync_process_committed()) testfs-OST0001-osc-MDT0000: can't cancel record: -5
general protection fault: 0000 [#1] SMP
 Tainted: P    B      ---------------    2.6.32-279.5.1.el6_lustre.g7f15218.x86_64 #1 Dell Inc.
RIP: 0010:[<ffffffffa131856c>]  [<ffffffffa131856c>] llog_process_thread+0x2cc/0xe10 [obdclass]
Process osp-syn-1
Call Trace:
 [<ffffffffa0e4fe00>] ? osp_sync_process_queues+0x0/0x11c0 [osp]
 [<ffffffffa131a8dd>] llog_process_or_fork+0x12d/0x660 [obdclass]
 [<ffffffffa131c8d3>] llog_cat_process_cb+0x2c3/0x370 [obdclass]
 [<ffffffffa1318b9b>] llog_process_thread+0x8fb/0xe10 [obdclass]
 [<ffffffffa131a8dd>] llog_process_or_fork+0x12d/0x660 [obdclass]
 [<ffffffffa131b419>] llog_cat_process_or_fork+0x89/0x290 [obdclass]
 [<ffffffffa0e4fe00>] ? osp_sync_process_queues+0x0/0x11c0 [osp]
 [<ffffffffa131b639>] llog_cat_process+0x19/0x20 [obdclass]
 [<ffffffffa0e51f20>] osp_sync_thread+0x1d0/0x700 [osp]
Comment by Ian Colle (Inactive) [ 25/Oct/12 ]

Mike - is this related to LU-2129 and http://review.whamcloud.com/4303 ?

Comment by Li Wei (Inactive) [ 02/Nov/12 ]

Abandoned the diagnostic patch for Orion; new ones are

http://review.whamcloud.com/4445 (Not for landing.)
http://review.whamcloud.com/4433 (Aimed for landing.)

Comment by Ian Colle (Inactive) [ 02/Nov/12 ]

Brian, can you guys please pull in Li Wei's new patches above for testing?

Comment by Prakash Surya (Inactive) [ 06/Nov/12 ]

I just pulled those patches into our branch.

Comment by Li Wei (Inactive) [ 06/Nov/12 ]

Prakash, thanks.

Comment by Prakash Surya (Inactive) [ 07/Nov/12 ]

Just installed those patches on the MDS and got a hit:

2012-11-07 09:53:31 Lustre: Lustre: Build Version: 2.3.54-3chaos-3chaos--PRISTINE-2.6.32-220.23.1.2chaos.ch5.x86_64
2012-11-07 09:53:32 Mounting grove-mds2/mdt0 on /mnt/lustre/local/lstest-MDT0000
2012-11-07 09:53:33 Lustre: Found index 0 for lstest-MDT0000, updating log
2012-11-07 09:53:33 LustreError: 32702:0:(mgc_request.c:248:do_config_log_add()) failed processing sptlrpc log: -2
2012-11-07 09:53:33 LustreError: 32705:0:(sec_config.c:1024:sptlrpc_target_local_copy_conf()) missing llog context
2012-11-07 09:53:33 LustreError: 137-5: lstest-MDT0000: Not available for connect from 172.20.3.185@o2ib500 (not set up)
2012-11-07 09:53:33 LustreError: 137-5: lstest-MDT0000: Not available for connect from 172.20.17.95@o2ib500 (not set up)
2012-11-07 09:53:34 LustreError: 137-5: lstest-MDT0000: Not available for connect from 172.20.4.157@o2ib500 (not set up)
2012-11-07 09:53:34 LustreError: Skipped 6 previous similar messages
2012-11-07 09:53:35 LustreError: 137-5: lstest-MDT0000: Not available for connect from 172.20.3.104@o2ib500 (not set up)
2012-11-07 09:53:35 LustreError: Skipped 15 previous similar messages
2012-11-07 09:53:35 Lustre: lstest-MDT0000: Temporarily refusing client connection from 172.20.3.172@o2ib500
2012-11-07 09:53:35 Lustre: lstest-MDT0000: Temporarily refusing client connection from 172.20.4.107@o2ib500
2012-11-07 09:53:36 LustreError: 32788:0:(osp_sync.c:584:osp_sync_process_record()) processed all old entries: 0x4378:1
2012-11-07 09:53:36 LustreError: 32799:0:(osp_sync.c:584:osp_sync_process_record()) processed all old entries: 0x437b:1
2012-11-07 09:53:36 Lustre: lstest-MDT0000: Temporarily refusing client connection from 172.20.4.128@o2ib500
2012-11-07 09:53:36 Lustre: Skipped 39 previous similar messages
2012-11-07 09:53:36 LustreError: 32845:0:(osp_sync.c:584:osp_sync_process_record()) processed all old entries: 0x438a:1
2012-11-07 09:53:36 LustreError: 32845:0:(osp_sync.c:584:osp_sync_process_record()) Skipped 17 previous similar messages
2012-11-07 09:53:37 LustreError: 11-0: lstest-OST01a4-osc-MDT0000: Communicating with 172.20.3.20@o2ib500, operation ost_connect failed with -16
2012-11-07 09:53:37 LNet: 21212:0:(o2iblnd_cb.c:2357:kiblnd_passive_connect()) Conn race 172.20.3.24@o2ib500
2012-11-07 09:53:37 Lustre: lstest-MDT0000: Temporarily refusing client connection from 172.20.3.165@o2ib500
2012-11-07 09:53:37 Lustre: Skipped 15 previous similar messages
2012-11-07 09:53:37 LustreError: 32923:0:(osp_sync.c:584:osp_sync_process_record()) processed all old entries: 0x43a4:1
2012-11-07 09:53:37 LustreError: 32923:0:(osp_sync.c:584:osp_sync_process_record()) Skipped 24 previous similar messages
2012-11-07 09:53:38 Lustre: 33016:0:(llog.c:92:llog_free_handle()) Still busy: 2: 0x3efe:0x1:0: 64767 36702 2111296 1
2012-11-07 09:53:38 Pid: 33016, comm: osp-syn-460
2012-11-07 09:53:38 
2012-11-07 09:53:38 
2012-11-07 09:53:38 Call Trace:
2012-11-07 09:53:38  [<ffffffffa05bc965>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
2012-11-07 09:53:38  [<ffffffffa075ae81>] llog_free_handle+0xb1/0x430 [obdclass]
2012-11-07 09:53:38  [<ffffffffa075b25d>] llog_close+0x5d/0x190 [obdclass]
2012-11-07 09:53:38  [<ffffffffa0760ea9>] llog_cat_cancel_records+0x179/0x490 [obdclass]
2012-11-07 09:53:38  [<ffffffffa0946f10>] ? lustre_swab_ost_body+0x0/0x10 [ptlrpc]
2012-11-07 09:53:38  [<ffffffffa0ff59b8>] osp_sync_process_committed+0x238/0x760 [osp]
2012-11-07 09:53:38  [<ffffffffa09699a7>] ? ptlrpcd_add_req+0x187/0x2e0 [ptlrpc]
2012-11-07 09:53:38  [<ffffffffa0ff5f74>] osp_sync_process_queues+0x94/0x11c0 [osp]
2012-11-07 09:53:38  [<ffffffff8105ea30>] ? default_wake_function+0x0/0x20
2012-11-07 09:53:38  [<ffffffffa075cf3b>] llog_process_thread+0x8fb/0xe10 [obdclass]
2012-11-07 09:53:38  [<ffffffffa0ff5ee0>] ? osp_sync_process_queues+0x0/0x11c0 [osp]
2012-11-07 09:53:38  [<ffffffffa075ec7d>] llog_process_or_fork+0x12d/0x660 [obdclass]
2012-11-07 09:53:38  [<ffffffffa0760c83>] llog_cat_process_cb+0x2d3/0x380 [obdclass]
2012-11-07 09:53:38  [<ffffffffa075cf3b>] llog_process_thread+0x8fb/0xe10 [obdclass]
2012-11-07 09:53:38  [<ffffffffa07609b0>] ? llog_cat_process_cb+0x0/0x380 [obdclass]
2012-11-07 09:53:38  [<ffffffffa075ec7d>] llog_process_or_fork+0x12d/0x660 [obdclass]
2012-11-07 09:53:38  [<ffffffffa075f7b9>] llog_cat_process_or_fork+0x89/0x290 [obdclass]
2012-11-07 09:53:38  [<ffffffff8104cab9>] ? __wake_up_common+0x59/0x90
2012-11-07 09:53:38  [<ffffffffa0ff5ee0>] ? osp_sync_process_queues+0x0/0x11c0 [osp]
2012-11-07 09:53:38  [<ffffffffa075f9d9>] llog_cat_process+0x19/0x20 [obdclass]
2012-11-07 09:53:38  [<ffffffffa05bd83a>] ? cfs_waitq_signal+0x1a/0x20 [libcfs]
2012-11-07 09:53:38  [<ffffffffa0ff8000>] osp_sync_thread+0x1d0/0x700 [osp]
2012-11-07 09:53:38  [<ffffffffa0ff7e30>] ? osp_sync_thread+0x0/0x700 [osp]
2012-11-07 09:53:38  [<ffffffff8100c14a>] child_rip+0xa/0x20
2012-11-07 09:53:38  [<ffffffffa0ff7e30>] ? osp_sync_thread+0x0/0x700 [osp]
2012-11-07 09:53:38  [<ffffffffa0ff7e30>] ? osp_sync_thread+0x0/0x700 [osp]
2012-11-07 09:53:38  [<ffffffff8100c140>] ? child_rip+0x0/0x20
2012-11-07 09:53:38 
2012-11-07 09:53:38 LustreError: 33016:0:(llog_cat.c:187:llog_cat_id2handle()) lstest-OST01cc-osc-MDT0000: error opening log id 0x5a5a5a5a5a5a5a5a:5a5a5a5a: rc = -2
2012-11-07 09:53:38 LustreError: 33016:0:(llog_cat.c:513:llog_cat_cancel_records()) Cannot find log 0x5a5a5a5a5a5a5a5a
2012-11-07 09:53:38 LustreError: 33016:0:(llog_cat.c:552:llog_cat_cancel_records()) lstest-OST01cc-osc-MDT0000: fail to cancel 0 of 1 llog-records: rc = -2
2012-11-07 09:53:38 LustreError: 33016:0:(osp_sync.c:721:osp_sync_process_committed()) @@@ lstest-OST01cc-osc-MDT0000: can't cancel record 0x5a5a5a5a5a5a5a5a:0x5a5a5a5a5a5a5a5a:1515870810:2:36705: -2
2012-11-07 09:53:38   req@ffff880f75ddf400 x1418000660995471/t0(0) o6->lstest-OST01cc-osc-MDT0000@172.20.3.60@o2ib500:28/4 lens 664/400 e 0 to 0 dl 1352310924 ref 1 fl Complete:R/0/0 rc 0/-2
2012-11-07 09:53:38 LustreError: 33016:0:(llog_cat.c:187:llog_cat_id2handle()) lstest-OST01cc-osc-MDT0000: error opening log id 0x5a5a5a5a5a5a5a5a:5a5a5a5a: rc = -2
2012-11-07 09:53:38 LustreError: 33016:0:(llog_cat.c:513:llog_cat_cancel_records()) Cannot find log 0x5a5a5a5a5a5a5a5a
2012-11-07 09:53:38 LustreError: 33016:0:(llog_cat.c:552:llog_cat_cancel_records()) lstest-OST01cc-osc-MDT0000: fail to cancel 0 of 1 llog-records: rc = -2
2012-11-07 09:53:38 LustreError: 33016:0:(osp_sync.c:721:osp_sync_process_committed()) @@@ lstest-OST01cc-osc-MDT0000: can't cancel record 0x5a5a5a5a5a5a5a5a:0x5a5a5a5a5a5a5a5a:1515870810:2:36707: -2
2012-11-07 09:53:38   req@ffff880f73c0bc00 x1418000660995472/t0(0) o6->lstest-OST01cc-osc-MDT0000@172.20.3.60@o2ib500:28/4 lens 664/400 e 0 to 0 dl 1352310924 ref 1 fl Complete:R/0/0 rc 0/-2
2012-11-07 09:53:38 general protection fault: 0000 [#1] SMP 
2012-11-07 09:53:38 last sysfs file: /sys/devices/system/cpu/cpu31/cache/index2/shared_cpu_map
2012-11-07 09:53:38 CPU 0 
2012-11-07 09:53:38 Modules linked in: osp(U) mdt(U) mdd(U) lod(U) mgs(U) mgc(U) osd_zfs(U) lquota(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) acpi_cpufreq freq_table mperf ksocklnd(U) ko2iblnd(U) lnet(U) sha512_generic sha256_generic libcfs(U) ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ib_sa mlx4_ib ib_mad ib_core dm_mirror dm_region_hash dm_log dm_round_robin dm_multipath dm_mod vhost_net macvtap macvlan tun kvm zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate sg ses enclosure sd_mod crc_t10dif isci libsas wmi mpt2sas scsi_transport_sas raid_class sb_edac edac_core ahci i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support ioatdma shpchp ipv6 nfs lockd fscache nfs_acl auth_rpcgss sunrpc mlx4_en mlx4_core igb dca [last unloaded: cpufreq_ondemand]
2012-11-07 09:53:39 
2012-11-07 09:53:39 Pid: 33016, comm: osp-syn-460
2012-11-07 09:53:39  Tainted: P        W  ----------------   2.6.32-220.23.1.2chaos.ch5.x86_64 #1 appro 2620x-in/S2600GZ
2012-11-07 09:53:39 RIP: 0010:[<ffffffffa075c90c>]  [<ffffffffa075c90c>] llog_process_thread+0x2cc/0xe10 [obdclass]
2012-11-07 09:53:39 RSP: 0018:ffff880f720bdb60  EFLAGS: 00010206
2012-11-07 09:53:39 RAX: 5a5a5a5a5a5a5a5a RBX: 0000000000008f81 RCX: 0000000000000000
2012-11-07 09:53:39 RDX: ffff880f763761c0 RSI: ffff880f79788000 RDI: ffff880f72d3e000
2012-11-07 09:53:39 RBP: ffff880f720bdc00 R08: ffff881f9b6b0500 R09: 0000000000000000
2012-11-07 09:53:39 R10: 0000000000000000 R11: 0000000000000000 R12: ffff881f9a728058
2012-11-07 09:53:39 R13: 000000000000fcff R14: ffff880f72d3c000 R15: ffff880f720bde80
2012-11-07 09:53:39 FS:  00007ffff7fdc700(0000) GS:ffff880060600000(0000) knlGS:0000000000000000
2012-11-07 09:53:39 CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
2012-11-07 09:53:39 CR2: 00007ffff7ff9000 CR3: 000000200eb4b000 CR4: 00000000000406f0
2012-11-07 09:53:39 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
2012-11-07 09:53:39 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
2012-11-07 09:53:39 Process osp-syn-460
2012-11-07 09:53:39  (pid: 33016, threadinfo ffff880f720bc000, task ffff880fd518cae0)
2012-11-07 09:53:39 Stack:
2012-11-07 09:53:39  ffff881f00002000 0000000100000000 ffff881f00000000 ffff880f72d3c001
2012-11-07 09:53:39 <0> 00008f8000000000 0000000000000000 0000000000000000 0000000000240000
2012-11-07 09:53:39 <0> 0000fd00720bdbe0 ffff880f72d3e000 ffff880f763761c0 ffff881f9b6b0500
2012-11-07 09:53:39 Call Trace:
2012-11-07 09:53:39  [<ffffffffa0ff5ee0>] ? osp_sync_process_queues+0x0/0x11c0 [osp]
2012-11-07 09:53:39  [<ffffffffa075ec7d>] llog_process_or_fork+0x12d/0x660 [obdclass]
2012-11-07 09:53:39  [<ffffffffa0760c83>] llog_cat_process_cb+0x2d3/0x380 [obdclass]
2012-11-07 09:53:39  [<ffffffffa075cf3b>] llog_process_thread+0x8fb/0xe10 [obdclass]
2012-11-07 09:53:39  [<ffffffffa07609b0>] ? llog_cat_process_cb+0x0/0x380 [obdclass]
2012-11-07 09:53:39  [<ffffffffa075ec7d>] llog_process_or_fork+0x12d/0x660 [obdclass]
2012-11-07 09:53:39  [<ffffffffa075f7b9>] llog_cat_process_or_fork+0x89/0x290 [obdclass]
2012-11-07 09:53:39  [<ffffffff8104cab9>] ? __wake_up_common+0x59/0x90
2012-11-07 09:53:39  [<ffffffffa0ff5ee0>] ? osp_sync_process_queues+0x0/0x11c0 [osp]
2012-11-07 09:53:39  [<ffffffffa075f9d9>] llog_cat_process+0x19/0x20 [obdclass]
2012-11-07 09:53:39  [<ffffffffa05bd83a>] ? cfs_waitq_signal+0x1a/0x20 [libcfs]
2012-11-07 09:53:39  [<ffffffffa0ff8000>] osp_sync_thread+0x1d0/0x700 [osp]
2012-11-07 09:53:39  [<ffffffffa0ff7e30>] ? osp_sync_thread+0x0/0x700 [osp]
2012-11-07 09:53:39  [<ffffffff8100c14a>] child_rip+0xa/0x20
2012-11-07 09:53:39  [<ffffffffa0ff7e30>] ? osp_sync_thread+0x0/0x700 [osp]
2012-11-07 09:53:39  [<ffffffffa0ff7e30>] ? osp_sync_thread+0x0/0x700 [osp]
2012-11-07 09:53:39  [<ffffffff8100c140>] ? child_rip+0x0/0x20
2012-11-07 09:53:39 Code: 74 0c 00 01 00 00 00 e8 c3 0c e7 ff 48 83 7d b8 00 0f 84 f8 03 00 00 4c 8b 45 b8 49 8b 80 b0 00 00 00 48 85 c0 0f 84 e4 03 00 00 <48> 8b 40 08 48 85 c0 0f 84 67 02 00 00 4d 89 f1 c7 04 24 00 20 
2012-11-07 09:53:39 RIP  [<ffffffffa075c90c>] llog_process_thread+0x2cc/0xe10 [obdclass]
2012-11-07 09:53:39  RSP <ffff880f720bdb60>
Comment by Li Wei (Inactive) [ 08/Nov/12 ]

The latest report is very helpful. Here's my reconstruction:

2012-11-07 09:53:38 Lustre: 33016:0:(llog.c:92:llog_free_handle()) Still busy: 2: 0x3efe:0x1:0: 64767 36702 2111296 1

An OSP sync thread canceled the last record in a log and was freeing the log handle. However, the thread was still processing this log and was only at index 36702. Although both the handle and header structures were poisoned, the thread could continue processing the log because a) index 36702 was less than (LLOG_BITMAP_BYTES * 8 - 1), b) llh_bitmap actually contained non-zero bits after being poisoned, and c) the on-disk canceled records were intact.

2012-11-07 09:53:38 LustreError: 33016:0:(osp_sync.c:721:osp_sync_process_committed()) @@@ lstest-OST01cc-osc-MDT0000: can't cancel record 0x5a5a5a5a5a5a5a5a:0x5a5a5a5a5a5a5a5a:1515870810:2:36705: -2
2012-11-07 09:53:38   req@ffff880f75ddf400 x1418000660995471/t0(0) o6->lstest-OST01cc-osc-MDT0000@172.20.3.60@o2ib500:28/4 lens 664/400 e 0 to 0 dl 1352310924 ref 1 fl Complete:R/0/0 rc 0/-2
[...]
2012-11-07 09:53:38 LustreError: 33016:0:(osp_sync.c:721:osp_sync_process_committed()) @@@ lstest-OST01cc-osc-MDT0000: can't cancel record 0x5a5a5a5a5a5a5a5a:0x5a5a5a5a5a5a5a5a:1515870810:2:36707: -2
2012-11-07 09:53:38   req@ffff880f73c0bc00 x1418000660995472/t0(0) o6->lstest-OST01cc-osc-MDT0000@172.20.3.60@o2ib500:28/4 lens 664/400 e 0 to 0 dl 1352310924 ref 1 fl Complete:R/0/0 rc 0/-2

These were from two records processed after freeing the log handle. The indices were 36705 and 36707. If I calculated correctly, 36702, 36705, and 36707 correspond to bits 30, 33, and 35 of bitmap word 573:

 0 1 0 1  1 0 1 0  0 1 0 1  1 0 1 0  0 1 0 1  1 0 1 0  0 1 0 1  1 0 1 0
63                                  47                               32

 0 1 0 1  1 0 1 0  0 1 0 1  1 0 1 0  0 1 0 1  1 0 1 0  0 1 0 1  1 0 1 0
31                                  15                                0

This was why 36703, 36704, and 36706 did not appear in the log.
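
As a quick sanity check of that arithmetic, a few lines of C reproduce it; this assumes a plain index/64, index%64 layout for a 64-bit bitmap word, which is an illustration rather than a quote of the llog bitmap code:

#include <stdio.h>
#include <stdint.h>

/* Check which record indices land on set bits of a 64-bit bitmap word that
 * has been poisoned with 0x5a bytes (illustrative layout assumption only). */
int main(void)
{
        const uint64_t poison = 0x5a5a5a5a5a5a5a5aULL;
        const unsigned int idx[] = { 36702, 36703, 36704, 36705, 36706, 36707 };
        unsigned int i;

        for (i = 0; i < sizeof(idx) / sizeof(idx[0]); i++) {
                unsigned int word = idx[i] / 64;        /* 573 for all of these */
                unsigned int bit  = idx[i] % 64;        /* 30..35 */

                printf("index %u -> word %u, bit %u: %s\n", idx[i], word, bit,
                       (poison >> bit) & 1 ? "set (looks uncancelled)"
                                           : "clear (skipped)");
        }
        return 0;
}

It reports 36702, 36705, and 36707 as set and 36703, 36704, and 36706 as clear, matching the console output above.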

Comment by Prakash Surya (Inactive) [ 14/Nov/12 ]

Thanks for the detailed explanation! I'd need a better understanding of llogs to follow it fully, but it sounds plausible at a high level. Do you have an idea of how to fix it?

I've also seen the "Still busy" message crop up without the other "can't cancel record" messages:

<ConMan> Console [grove-mds2] log at 2012-11-13 07:00:00 PST.
2012-11-13 07:56:45 Lustre: 32930:0:(llog.c:92:llog_free_handle()) Still busy: 2: 0x6541:0x1:0: 64767 64767 4161152 1
2012-11-13 07:56:45 Pid: 32930, comm: osp-syn-431
2012-11-13 07:56:45 
2012-11-13 07:56:45 
2012-11-13 07:56:45 Call Trace:
2012-11-13 07:56:45  [<ffffffffa05aa965>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
2012-11-13 07:56:45  [<ffffffffa0703e81>] llog_free_handle+0xb1/0x430 [obdclass]
2012-11-13 07:56:45  [<ffffffffa070425d>] llog_close+0x5d/0x190 [obdclass]
2012-11-13 07:56:45  [<ffffffffa0709ea9>] llog_cat_cancel_records+0x179/0x490 [obdclass]
2012-11-13 07:56:45  [<ffffffffa08eff20>] ? lustre_swab_ost_body+0x0/0x10 [ptlrpc]
2012-11-13 07:56:45  [<ffffffffa0fcd9b8>] osp_sync_process_committed+0x238/0x760 [osp]
2012-11-13 07:56:45  [<ffffffffa09129b7>] ? ptlrpcd_add_req+0x187/0x2e0 [ptlrpc]
2012-11-13 07:56:45  [<ffffffffa0fcdf74>] osp_sync_process_queues+0x94/0x11c0 [osp]
2012-11-13 07:56:45  [<ffffffff8105ea30>] ? default_wake_function+0x0/0x20
2012-11-13 07:56:45  [<ffffffffa0705f3b>] llog_process_thread+0x8fb/0xe10 [obdclass]
2012-11-13 07:56:45  [<ffffffffa0fcdee0>] ? osp_sync_process_queues+0x0/0x11c0 [osp]
2012-11-13 07:56:45  [<ffffffffa0707c7d>] llog_process_or_fork+0x12d/0x660 [obdclass]
2012-11-13 07:56:45  [<ffffffffa0709c83>] llog_cat_process_cb+0x2d3/0x380 [obdclass]
2012-11-13 07:56:45  [<ffffffffa0705f3b>] llog_process_thread+0x8fb/0xe10 [obdclass]
2012-11-13 07:56:45  [<ffffffffa07099b0>] ? llog_cat_process_cb+0x0/0x380 [obdclass]
2012-11-13 07:56:45  [<ffffffffa0707c7d>] llog_process_or_fork+0x12d/0x660 [obdclass]
2012-11-13 07:56:45  [<ffffffffa07087b9>] llog_cat_process_or_fork+0x89/0x290 [obdclass]
2012-11-13 07:56:45  [<ffffffff8104cab9>] ? __wake_up_common+0x59/0x90
2012-11-13 07:56:45  [<ffffffffa0fcdee0>] ? osp_sync_process_queues+0x0/0x11c0 [osp]
2012-11-13 07:56:45  [<ffffffffa07089d9>] llog_cat_process+0x19/0x20 [obdclass]
2012-11-13 07:56:45  [<ffffffffa05ab83a>] ? cfs_waitq_signal+0x1a/0x20 [libcfs]
2012-11-13 07:56:45  [<ffffffffa0fd0000>] osp_sync_thread+0x1d0/0x700 [osp]
2012-11-13 07:56:45  [<ffffffffa0fcfe30>] ? osp_sync_thread+0x0/0x700 [osp]
2012-11-13 07:56:45  [<ffffffff8100c14a>] child_rip+0xa/0x20
2012-11-13 07:56:45  [<ffffffffa0fcfe30>] ? osp_sync_thread+0x0/0x700 [osp]
2012-11-13 07:56:45  [<ffffffffa0fcfe30>] ? osp_sync_thread+0x0/0x700 [osp]
2012-11-13 07:56:45  [<ffffffff8100c140>] ? child_rip+0x0/0x20
2012-11-13 07:56:45 

<ConMan> Console [grove-mds2] log at 2012-11-13 08:00:00 PST.
Comment by Li Wei (Inactive) [ 15/Nov/12 ]

Prakash,

In your last comment, according to

2012-11-13 07:56:45 Lustre: 32930:0:(llog.c:92:llog_free_handle()) Still busy: 2: 0x6541:0x1:0: 64767 64767 4161152 1

both lgh_cur_idx and lgh_last_idx were equal to the maximum index allowed in a log, meaning that the thread had processed all the records and would not use the handle anymore. I think that is why the "can't cancel record" message and the GPF did not occur.

In the case described by my last comment, lgh_cur_idx was far smaller than lgh_last_idx. What puzzles me is how llh_count could become one before the records from lgh_cur_idx to lgh_last_idx were processed. Following Andreas and Alex's advice, I applied http://review.whamcloud.com/4508, which verifies the consistency between llh_count and llh_bitmap at log open time, and failed an OST under a racer workload every ten seconds for a few hours, but no inconsistency was found.

I'll think again...

Comment by Alex Zhuravlev [ 16/Nov/12 ]

Notice that during cancellation the llog cookie is taken from a request, not directly from the llog. So if the request is freed, the corresponding memory can be re-used (or just filled with 0x5a).
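
A minimal, self-contained sketch of that hazard, using made-up types rather than the real Lustre structures (the field names below are purely illustrative): the log ID lives inside the request buffer, so a cookie pointer kept after the request is finished points at freed, possibly 0x5a-poisoned memory.

#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for the real structures. */
struct fake_logid  { unsigned long long lgl_oid; unsigned int lgl_ogen; };
struct fake_cookie { struct fake_logid lgc_lgl; unsigned int lgc_index; };
struct fake_req    { struct fake_cookie rq_cookie; /* ... other buffers ... */ };

static struct fake_cookie *cookie_from_req(struct fake_req *req)
{
        return &req->rq_cookie;          /* points INTO the request buffer */
}

static void fake_req_finished(struct fake_req *req)
{
        memset(req, 0x5a, sizeof(*req)); /* mimic debug poisoning on free */
        free(req);
}

int main(void)
{
        struct fake_req *req = calloc(1, sizeof(*req));
        struct fake_cookie *cookie = cookie_from_req(req);

        fake_req_finished(req);
        /* Any later use of 'cookie' is a use-after-free; with poisoning it
         * would yield exactly the 0x5a5a5a5a5a5a5a5a "log id" seen above. */
        (void)cookie;
        return 0;
}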

Comment by Mikhail Pershin [ 22/Nov/12 ]

I agree with Alex; it is not about a freed loghandle but about a freed logid taken from a request:

2012-11-07 09:53:38 LustreError: 33016:0:(llog_cat.c:187:llog_cat_id2handle()) lstest-OST01cc-osc-MDT0000: error opening log id 0x5a5a5a5a5a5a5a5a:5a5a5a5a: rc = -2

That means llog_cat_id2handle() tried to use a logid that was being freed. And the protection fault in osp_sync_process_queues() is caused simply by calling ptlrpc_req_finished() on a freed request.

Comment by Li Wei (Inactive) [ 23/Nov/12 ]

There was evidence early on of poisoned log handles when filling cookies into request buffers. E.g., from Brian's comment on Aug 7:

2012-08-06 11:35:50 LustreError: 21152:0:(osp_sync.c:465:osp_sync_new_job()) Poisoned log ID from handle ffff882ffb055900
Comment by Mikhail Pershin [ 23/Nov/12 ]

So this debug is not in osp_sync_new_job() now? Can we add it again?

Comment by Alex Zhuravlev [ 23/Nov/12 ]

Yes. Li Wei, could you combine all the debug you've developed (checks for poisoned IDs, llh_count vs. bitmap verification, etc.) into a single patch and land it on master, please?

Comment by Mikhail Pershin [ 23/Nov/12 ]

> In the case described by my last comment, lgh_cur_idx was far smaller than lgh_last_idx. What puzzles me is how llh_count could become one before the records from lgh_cur_idx to lgh_last_idx were processed.

But you've already answered that. There were no more records in the llog and the bitmap was empty, but after poisoning it became filled with the pattern and llog_process_thread() found a set bit, which was 36702.

I think we have to add a refcount to llog_handle and free it only when the last reference is dropped.
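
A rough sketch of that idea; the names and layout here are illustrative only, not the code of any actual patch:

#include <stdlib.h>
#include <stdatomic.h>

/* Illustrative refcounted handle, not the real struct llog_handle: the last
 * put frees it, so a thread still processing the llog cannot have the handle
 * freed (and poisoned) underneath it. */
struct demo_llog_handle {
        atomic_int lgh_refcount;
        /* ... log id, header, bitmap, ... */
};

static struct demo_llog_handle *demo_handle_alloc(void)
{
        struct demo_llog_handle *h = calloc(1, sizeof(*h));

        if (h)
                atomic_init(&h->lgh_refcount, 1);
        return h;
}

static void demo_handle_get(struct demo_llog_handle *h)
{
        atomic_fetch_add(&h->lgh_refcount, 1);
}

static void demo_handle_put(struct demo_llog_handle *h)
{
        if (atomic_fetch_sub(&h->lgh_refcount, 1) == 1)
                free(h);                 /* only the final reference frees it */
}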

Comment by Li Wei (Inactive) [ 23/Nov/12 ]

Mike, I had that thought too. But considering that

  • a log is destroyed only if all its records have been cancelled, and
  • a record is cancelled after it has been processed,

a log should be destroyed only after its OSP has processed all of its records and has likely moved on to another log. In addition, because

  • a log is destroyed only if lgh_last_idx equals 64767, and
  • an OSP processes records sequentially towards the end,

lgh_cur_idx should be 64767 (or 64766, if a padding record exists at the end) when the OSP has processed all the records in the log. Thus, the log handle should not reach a state like

2012-11-07 09:53:38 Lustre: 33016:0:(llog.c:92:llog_free_handle()) Still busy: 2: 0x3efe:0x1:0: 64767 36702 2111296 1

where lgh_last_idx was 64767, lgh_cur_idx was 36702, and llh_count was 1.

Comment by Alex Zhuravlev [ 26/Nov/12 ]

> where lgh_last_idx was 64767, lgh_cur_idx was 36702, and llh_count was 1.

IIRC, in all the reported cases the problem happened right after startup, so 64767 means this is an llog left over from the previous boot and we're reprocessing it. Now, given there is no strong ordering among the RPCs going to the OST, it's possible that some RPC did not reach the OST (or its reply was lost), so most of the bits could be cleared (llog records cancelled) with a few exceptions.

I think it should be possible to reproduce this locally by setting the timeout pretty big (to work around slow performance compared to the customer's hardware) and dropping one RPC.

Another debug patch could be to print a warning when the number of processed records > llh_count (if we started with a full llog).

Comment by Mikhail Pershin [ 26/Nov/12 ]

Li Wei, about "Still busy: 2: 0x3efe:0x1:0: 64767 36702 2111296 1". Why can't that be that 36702 was just the latest cancelled record? I noticed that requests are added to the head of opd_syn_committed_there list, so later requests will be processed first, so technically it is possible that records with lower index can be cancelled later then with larger index.

Comment by Mikhail Pershin [ 26/Nov/12 ]

Looking at osp_sync_process_queues() I've also noticed how we can get into this situation. First, osp_sync_process_committed() runs and cancels all the records in the current plain log because they are, say, committed. Then osp_sync_can_process_new() checks whether there is more work to do; it can return a positive answer, and osp_sync_process_record() is called, adding the freed logid to the request. That request is processed and gets -ENOENT from the OST, in which case it is also added to the commit list and we end up in our situation.

It is not quite so, though: the llh is set to NULL when the request is sent. In fact, llog_process_thread() is not safe if a callback deletes the plain llog. We need a mechanism to recognize that the llog was deleted and stop processing.

So I think that first of all we must exit from osp_sync_process_queues() if the llog was deleted by the osp_sync_process_committed() call. Better yet, fix llog_process_thread() as mentioned above.

Comment by Alex Zhuravlev [ 26/Nov/12 ]

I think it makes sense to introduce a local copy of llh_count just before we start processing the llog, decrement it on every live record, and break once it reaches 1. But this logic should be used only when the llog is not getting new records.
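
A sketch of that loop, again with illustrative names rather than the real llog_process_thread() internals:

#include <stdio.h>

/* Illustrative types only; not the real llog structures. */
struct demo_llog {
        unsigned int  llh_count;        /* live records, including the header */
        unsigned int  lgh_last_idx;     /* highest index ever written */
        unsigned char llh_bitmap[8192];
};

static int demo_bit_is_set(const unsigned char *bitmap, unsigned int idx)
{
        return (bitmap[idx / 8] >> (idx % 8)) & 1;
}

static void demo_process_record(struct demo_llog *llog, unsigned int idx)
{
        printf("processing record %u\n", idx);
}

/* Snapshot llh_count before processing and stop once only the header record
 * remains, so we never keep walking a bitmap that may have been freed and
 * poisoned behind us.  Valid only while the llog is not receiving new records. */
static void demo_process_plain_llog(struct demo_llog *llog)
{
        unsigned int remaining = llog->llh_count;
        unsigned int idx;

        for (idx = 1; idx <= llog->lgh_last_idx; idx++) {
                if (!demo_bit_is_set(llog->llh_bitmap, idx))
                        continue;               /* record already cancelled */

                demo_process_record(llog, idx);
                if (--remaining == 1)           /* only the header is left */
                        break;
        }
}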

Comment by Prakash Surya (Inactive) [ 26/Nov/12 ]

FWIW, I hit this again. This time, I had all of the OSTs powered down, then rebooted the MDS (which came up fine), then powered up the OSTs. The MDS appeared to be OK when it came up, but once the OSTs were brought back online, the MDS crashed.

And now as the MDS tries to come back online, it repeatedly hits this (twice so far):

2012-11-26 12:40:40 LustreError: 33226:0:(llog_cat.c:187:llog_cat_id2handle()) lstest-OST01fc-osc-MDT0000: error opening log id 0xaf75:0: rc = -2
2012-11-26 12:40:40 BUG: unable to handle kernel NULL pointer dereference at 0000000000000060
2012-11-26 12:40:40 IP: [<ffffffffa071d3e4>] cat_cancel_cb+0x2e4/0x5e0 [obdclass]
Comment by Li Wei (Inactive) [ 26/Nov/12 ]

Ah, right, it must be records that were processed but not cancelled before MDTs were brought down.

Comment by Alex Zhuravlev [ 26/Nov/12 ]

As for cat_cancel_cb(), I found that llog_cat_init_and_process() forks, so it then runs in parallel with the main OSP llog_process(). I think llog_cat_init_and_process() should not fork.

Comment by Mikhail Pershin [ 27/Nov/12 ]

Prakash, that is a different issue; could you open a new ticket with the data you have?

Comment by Prakash Surya (Inactive) [ 27/Nov/12 ]

I opened LU-2394 for the cat_cancel_cb() issue, with a link to the patch that addresses it.

Comment by Li Wei (Inactive) [ 28/Nov/12 ]

http://review.whamcloud.com/4696

A fix. For some reason, I wasn't able to get my new regression test working correctly today. Instead, I tested this patch with this hack:

--- a/lustre/obdclass/llog.c
+++ b/lustre/obdclass/llog.c
@@ -137,7 +137,8 @@ int llog_cancel_rec(const struct lu_env *env, struct llog_ha
 
         if ((llh->llh_flags & LLOG_F_ZAP_WHEN_EMPTY) &&
             (llh->llh_count == 1) &&
-            (loghandle->lgh_last_idx == (LLOG_BITMAP_BYTES * 8) - 1)) {
+            (loghandle->lgh_last_idx >= 1)) {
+               loghandle->lgh_last_idx == (LLOG_BITMAP_BYTES * 8) - 1;
                cfs_spin_unlock(&loghandle->lgh_hdr_lock);
                rc = llog_destroy(env, loghandle);
                if (rc < 0) {
Comment by Andreas Dilger [ 03/Dec/12 ]

Li Wei, this is exactly what OBD_FAIL_CHECK() is for: the ability to trigger specific errors in the code at runtime instead of having to change it at compile time.
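
For context, the usual pattern looks roughly like the sketch below; the fail_loc value and all demo_* names are hypothetical, and the switch would be flipped at runtime with something like "lctl set_param fail_loc=0x1234" rather than by recompiling:

/* Hypothetical illustration of runtime fault injection in the spirit of
 * OBD_FAIL_CHECK(); the 0x1234 value and the demo_* names are made up. */
#define DEMO_FAIL_LLOG_FORCE_DESTROY 0x1234

static unsigned long demo_fail_loc;             /* set from user space */

static int demo_fail_check(unsigned long id)
{
        return demo_fail_loc == id;
}

/* Same shape as the condition patched in the hack above: destroy the plain
 * llog when it is empty and full, or when the failure is injected for testing. */
static int demo_should_destroy_llog(int zap_when_empty, unsigned int count,
                                    unsigned int last_idx, unsigned int max_idx)
{
        if (zap_when_empty && count == 1 &&
            (last_idx == max_idx ||
             demo_fail_check(DEMO_FAIL_LLOG_FORCE_DESTROY)))
                return 1;
        return 0;
}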

Comment by Mikhail Pershin [ 05/Dec/12 ]

http://review.whamcloud.com/#change,4745

An attempt to introduce a refcounter for the llog handle. Please verify that it is correct and works.

Comment by Prakash Surya (Inactive) [ 13/Dec/12 ]

Mike, I pulled in that patch. I've also done some initial testing and it looks good so far. I've rebooted the MDS a few times while under light load, and haven't hit the issue yet (with the fix). Without the fix, I was able to hit it on the first try under the same load.

Comment by Mikhail Pershin [ 11/Jan/13 ]

The patch has landed. Prakash, are you OK with closing this ticket?

Comment by Mikhail Pershin [ 13/Jan/13 ]

Closing for now; reopen if it appears again.

Comment by Prakash Surya (Inactive) [ 14/Jan/13 ]

Mike, I can't recall hitting this since pulling in the patch. Thanks!
