[LU-2109] __llog_process_thread() GPF Created: 01/Jun/12 Updated: 19/Apr/13 Resolved: 19/Apr/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | Lustre 2.4.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Brian Behlendorf | Assignee: | Mikhail Pershin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | HB, sequoia |
| Issue Links: | |
| Severity: | 3 |
| Epic: | server |
| Project: | Orion |
| Rank (Obsolete): | 3037 |
| Description |
|
Observed after rebooting grove-mds2 onto our latest kernel and lustre-orion tag. The server was rebooted while under a fairly heavy test load. No detailed investigation has been done yet; the MDS was able to restart successfully after rebooting the node again and claims to have completed recovery this time.

— First boot —
Mounting grove-mds2/mgs on /mnt/lustre/local/lstest-MGS0000
Pid: 4567, comm: osp-syn-641

— Second boot —
Mounting grove-mds2/mgs on /mnt/lustre/local/lstest-MGS0000 |
| Comments |
| Comment by Li Wei (Inactive) [ 23/Jul/12 ] |
|
Brian,

Given that the request buffers would not be retrievable after being freed, I suspect that the plain log handles were freed somehow while osp_sync_thread() was processing them. If this sounds possible, I wonder if you could apply http://review.whamcloud.com/3443 and try reproducing the failure? The libcfs module parameter "libcfs_panic_on_lbug" should be set to "0" on the MDS in order to get the debug log. I was not able to reproduce it in the lab with a "mdsrate --unlink" workload. |
| Comment by Brian Behlendorf [ 30/Jul/12 ] |
|
We'll pull the debug patch into our branch tag and see if we're able to reproduce the issue. We still appear to hit it sporadically when restarting the servers. |
| Comment by Brian Behlendorf [ 30/Jul/12 ] |
|
Haven't gotten the debug patch in place just yet, although we did just see a slight variation on this. This time we didn't get the poison pattern.

2012-07-30 09:44:03 grove-mds2 login: Lustre: 21215:0:(fld_index.c:354:fld_index_init()) srv-lstest-MDT0000: File "fld" doesn't support range lookup, using stub. DNE and FIDs on OST will not work with this backend
2012-07-30 09:44:40 LustreError: 21529:0:(llog_cat.c:184:llog_cat_id2handle()) error opening log id 0xffff8817ff250000:fbf1f800: rc -2
2012-07-30 09:44:40 LustreError: 21529:0:(llog_cat.c:501:llog_cat_cancel_records()) Cannot find log 0xffff8817ff250000
2012-07-30 09:44:40 LustreError: 21529:0:(llog_cat.c:531:llog_cat_cancel_records()) Cancel 0 of 1 llog-records failed: -2
2012-07-30 09:44:40 LustreError: 21529:0:(osp_sync.c:705:osp_sync_process_committed()) lstest-OST02d9-osc-MDT0000: can't cancel record: -2
2012-07-30 09:44:40 LustreError: 21529:0:(llog_cat.c:184:llog_cat_id2handle()) error opening log id 0xffff8817ff250000:fbf1f800: rc -2
2012-07-30 09:44:40 LustreError: 21529:0:(llog_cat.c:501:llog_cat_cancel_records()) Cannot find log 0xffff8817ff250000
2012-07-30 09:44:40 LustreError: 21529:0:(llog_cat.c:531:llog_cat_cancel_records()) Cancel 0 of 1 llog-records failed: -2
2012-07-30 09:44:40 LustreError: 21529:0:(osp_sync.c:705:osp_sync_process_committed()) lstest-OST02d9-osc-MDT0000: can't cancel record: -2
2012-07-30 09:44:40 general protection fault: 0000 [#1] SMP
2012-07-30 09:44:40 last sysfs file: /sys/module/ptlrpc/initstate

Also note that the crash isn't being caused by an LBUG; it's a GPF, so setting "libcfs_panic_on_lbug" to 0 isn't going to help. Resolving the address in the earlier posted stack suggests that 'lop->lop_next_block' is probably a garbage address we're attempting to jump to.

(gdb) list *(__llog_process_thread+0x2a2)
0x60c2 is in __llog_process_thread (/home/behlendo/src/git/lustre/lustre/include/lustre_log.h:648).
warning: Source file is more recent than executable.
643             ENTRY;
644
645             rc = llog_handle2ops(loghandle, &lop);
646             if (rc)
647                     RETURN(rc);
648             if (lop->lop_next_block == NULL)
649                     RETURN(-EOPNOTSUPP);
650
651             rc = lop->lop_next_block(env, loghandle, cur_idx, next_idx,
652                                      cur_offset, buf, len); |
| Comment by Li Wei (Inactive) [ 30/Jul/12 ] |
|
The value of "lop" was 0x5a5a5a5a5a5a5a5a (i.e., RAX), according to the disassembly of llog_next_block():
/root/lustre-dev/lustre/include/lustre_log.h:648
60c2: 48 8b 40 08 mov 0x8(%rax),%rax
60c6: 48 85 c0 test %rax,%rax
60c9: 0f 84 81 01 00 00 je 6250 <__llog_process_thread+0x430>
The log handles do not have their own slab cache. In the latest "variation", the handle in question might have been reallocated to some other structure, which could explain why neither a valid log ID nor the poison was seen. I'll think harder about how the handles could be freed. |
| Comment by Li Wei (Inactive) [ 03/Aug/12 ] |
|
I have updated the diagnostic patch to narrow down the problematic zone. |
| Comment by Brian Behlendorf [ 06/Aug/12 ] |
|
Observed again, but unfortunately without the updated diagnostic patch. I'll pull the updated version into our next tag. |
| Comment by Brian Behlendorf [ 06/Aug/12 ] |
|
Actually, I misspoke. Revision 2 of the diagnostic patch was applied, but not revision 3. I also tweaked revision 2 to change the CDEBUG() in osp_sync_new_job() to a CERROR() to ensure we saw the error. There was a good chance we wouldn't get a debug log, and in fact we didn't, but we do have the following console output.

2012-08-06 11:35:50 LNet: 13171:0:(o2iblnd_cb.c:2337:kiblnd_passive_connect()) Conn race 172.20.4.42@o2ib500
2012-08-06 11:35:50 LNet: 13171:0:(o2iblnd_cb.c:2337:kiblnd_passive_connect()) Skipped 654 previous similar messages
2012-08-06 11:35:50 LustreError: 21152:0:(llog_cat.c:184:llog_cat_id2handle()) error opening log id 0xffff881801d0e000:be556800: rc -2
2012-08-06 11:35:50 LustreError: 21152:0:(llog_cat.c:501:llog_cat_cancel_records()) Cannot find log 0xffff881801d0e000
2012-08-06 11:35:50 LustreError: 21152:0:(llog_cat.c:531:llog_cat_cancel_records()) Cancel 0 of 1 llog-records failed: -2
2012-08-06 11:35:50 LustreError: 21152:0:(osp_sync.c:709:osp_sync_process_committed()) lstest-OST0293-osc-MDT0000: can't cancel record: -2
2012-08-06 11:35:50 LustreError: 11-0: lstest-OST0282-osc-MDT0000: Communicating with 172.20.4.42@o2ib500, operation ost_connect failed with -19
2012-08-06 11:35:50 LustreError: Skipped 4 previous similar messages
2012-08-06 11:35:50 LustreError: 21152:0:(osp_sync.c:465:osp_sync_new_job()) Poisoned log ID from handle ffff882ffb055900
2012-08-06 11:35:50 LustreError: 21152:0:(osp_sync.c:465:osp_sync_new_job()) Poisoned log ID from handle ffff882ffb055900
2012-08-06 11:35:50 LustreError: 21152:0:(llog_cat.c:184:llog_cat_id2handle()) error opening log id 0x5a5a5a5a5a5a5a5a:5a5a5a5a: rc -2
2012-08-06 11:35:50 LustreError: 21152:0:(llog_cat.c:501:llog_cat_cancel_records()) Cannot find log 0x5a5a5a5a5a5a5a5a
2012-08-06 11:35:50 LustreError: 21152:0:(llog_cat.c:531:llog_cat_cancel_records()) Cancel 0 of 1 llog-records failed: -2
2012-08-06 11:35:50 LustreError: 21152:0:(osp_sync.c:709:osp_sync_process_committed()) lstest-OST0293-osc-MDT0000: can't cancel record: -2
2012-08-06 11:35:50 LustreError: 21152:0:(osp_sync.c:859:osp_sync_thread()) ASSERTION( rc == 0 || rc == LLOG_PROC_BREAK ) failed: 0 changes, 7 in progress, 7 in flight: -22
2012-08-06 11:35:50 LustreError: 21152:0:(osp_sync.c:859:osp_sync_thread()) LBUG |
| Comment by Li Wei (Inactive) [ 07/Aug/12 ] |
|
Brian, thanks. The console log further proves that the junk log ID came from freed or reallocated log handles. I'm looking forward to seeing how revision 3 of the diagnostic patch can help us determine why the handle was freed while still in use. |
| Comment by Ian Colle (Inactive) [ 08/Oct/12 ] |
|
Brian - we haven't seen an update on this ticket for a couple of months; is it still an issue? |
| Comment by Prakash Surya (Inactive) [ 08/Oct/12 ] |
|
IIRC, this would only occur sporadically after upgrading, and then subside for an unknown reason. So we can't really say for sure whether it's still an issue, because we never really understood the root cause to begin with and didn't have a solid reproducer. Personally, I'd be OK with moving this to an LU ticket and then resolving it as "won't fix" or something similar. That way it's out in public and can be reopened if it is seen again with the 2.3 code. |
| Comment by Ian Colle (Inactive) [ 08/Oct/12 ] |
|
Moved to Lustre project as this code has landed to master. Marking as Won't Fix at Prakash's recommendation until/unless it reoccurs. |
| Comment by Andreas Dilger [ 16/Oct/12 ] |
|
I hit this in my single-node test setup:
Failure hit when unmounting after mmp.sh was finished and filesystem was remounting: Lustre: DEBUG MARKER: == mmp test complete, duration 394 sec 07:09:31 (1350392971) Lustre: MGC192.168.20.154@tcp: Reactivating import LustreError: 12112:0:(sec_config.c:1024:sptlrpc_target_local_copy_conf()) missing llog context Lustre: testfs-MDT0000: Temporarily refusing client connection from 0@lo LustreError: 11-0: an error occurred while communicating with 0@lo. The mds_connect operation failed with -11 Lustre: Found index 2 for testfs-OST0002, updating log Lustre: 9041:0:(client.c:1909:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1350392977/real 1350392977] req@ffff8800b1b6bc00 x1415989288763649/t0(0) o8->testfs-OST0002-osc-MDT0000@0@lo:28/4 lens 400/544 e 0 to 1 dl 1350392982 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 LustreError: 11-0: an error occurred while communicating with 0@lo. The mds_connect operation failed with -11 Lustre: 12244:0:(ofd_obd.c:1069:ofd_orphans_destroy()) testfs-OST0002: deleting orphan objects from 80455 to 85737 Lustre: 12248:0:(ofd_obd.c:1069:ofd_orphans_destroy()) testfs-OST0001: deleting orphan objects from 80346 to 80495 Lustre: Mounted testfs-client LustreError: 12145:0:(llog_cat.c:187:llog_cat_id2handle()) testfs-OST0001-osc-MDT0000: error opening log id 0x5a5a5a5a5a5a5a5a:5a5a5a5a: rc = -5 LustreError: 12145:0:(llog_cat.c:513:llog_cat_cancel_records()) Cannot find log 0x5a5a5a5a5a5a5a5a LustreError: 12145:0:(llog_cat.c:544:llog_cat_cancel_records()) testfs-OST0001-osc-MDT0000: fail to cancel 0 of 1 llog-records: rc = -5 LustreError: 12145:0:(osp_sync.c:708:osp_sync_process_committed()) testfs-OST0001-osc-MDT0000: can't cancel record: -5 LustreError: 12145:0:(llog_cat.c:187:llog_cat_id2handle()) testfs-OST0001-osc-MDT0000: error opening log id 0x5a5a5a5a5a5a5a5a:5a5a5a5a: rc = -5 LustreError: 12145:0:(llog_cat.c:513:llog_cat_cancel_records()) Cannot find log 0x5a5a5a5a5a5a5a5a LustreError: 12145:0:(llog_cat.c:544:llog_cat_cancel_records()) testfs-OST0001-osc-MDT0000: fail to cancel 0 of 1 llog-records: rc = -5 LustreError: 12145:0:(osp_sync.c:708:osp_sync_process_committed()) testfs-OST0001-osc-MDT0000: can't cancel record: -5 general protection fault: 0000 [#1] SMP Tainted: P B --------------- 2.6.32-279.5.1.el6_lustre.g7f15218.x86_64 #1 Dell Inc. RIP: 0010:[<ffffffffa131856c>] [<ffffffffa131856c>] llog_process_thread+0x2cc/0xe10 [obdclass] Process osp-syn-1 Call Trace: [<ffffffffa0e4fe00>] ? osp_sync_process_queues+0x0/0x11c0 [osp] [<ffffffffa131a8dd>] llog_process_or_fork+0x12d/0x660 [obdclass] [<ffffffffa131c8d3>] llog_cat_process_cb+0x2c3/0x370 [obdclass] [<ffffffffa1318b9b>] llog_process_thread+0x8fb/0xe10 [obdclass] [<ffffffffa131a8dd>] llog_process_or_fork+0x12d/0x660 [obdclass] [<ffffffffa131b419>] llog_cat_process_or_fork+0x89/0x290 [obdclass] [<ffffffffa0e4fe00>] ? osp_sync_process_queues+0x0/0x11c0 [osp] [<ffffffffa131b639>] llog_cat_process+0x19/0x20 [obdclass] [<ffffffffa0e51f20>] osp_sync_thread+0x1d0/0x700 [osp] |
| Comment by Ian Colle (Inactive) [ 25/Oct/12 ] |
|
Mike - is this related to |
| Comment by Li Wei (Inactive) [ 02/Nov/12 ] |
|
Abandoned the diagnostic patch for Orion; new ones are http://review.whamcloud.com/4445 (Not for landing.) |
| Comment by Ian Colle (Inactive) [ 02/Nov/12 ] |
|
Brian, can you guys please pull in Li Wei's new patches above for testing? |
| Comment by Prakash Surya (Inactive) [ 06/Nov/12 ] |
|
I just pulled those patches into our branch. |
| Comment by Li Wei (Inactive) [ 06/Nov/12 ] |
|
Prakash, thanks. |
| Comment by Prakash Surya (Inactive) [ 07/Nov/12 ] |
|
Just installed those patches on the MDS and got a hit: 2012-11-07 09:53:31 Lustre: Lustre: Build Version: 2.3.54-3chaos-3chaos--PRISTINE-2.6.32-220.23.1.2chaos.ch5.x86_64 2012-11-07 09:53:32 Mounting grove-mds2/mdt0 on /mnt/lustre/local/lstest-MDT0000 2012-11-07 09:53:33 Lustre: Found index 0 for lstest-MDT0000, updating log 2012-11-07 09:53:33 LustreError: 32702:0:(mgc_request.c:248:do_config_log_add()) failed processing sptlrpc log: -2 2012-11-07 09:53:33 LustreError: 32705:0:(sec_config.c:1024:sptlrpc_target_local_copy_conf()) missing llog context 2012-11-07 09:53:33 LustreError: 137-5: lstest-MDT0000: Not available for connect from 172.20.3.185@o2ib500 (not set up) 2012-11-07 09:53:33 LustreError: 137-5: lstest-MDT0000: Not available for connect from 172.20.17.95@o2ib500 (not set up) 2012-11-07 09:53:34 LustreError: 137-5: lstest-MDT0000: Not available for connect from 172.20.4.157@o2ib500 (not set up) 2012-11-07 09:53:34 LustreError: Skipped 6 previous similar messages 2012-11-07 09:53:35 LustreError: 137-5: lstest-MDT0000: Not available for connect from 172.20.3.104@o2ib500 (not set up) 2012-11-07 09:53:35 LustreError: Skipped 15 previous similar messages 2012-11-07 09:53:35 Lustre: lstest-MDT0000: Temporarily refusing client connection from 172.20.3.172@o2ib500 2012-11-07 09:53:35 Lustre: lstest-MDT0000: Temporarily refusing client connection from 172.20.4.107@o2ib500 2012-11-07 09:53:36 LustreError: 32788:0:(osp_sync.c:584:osp_sync_process_record()) processed all old entries: 0x4378:1 2012-11-07 09:53:36 LustreError: 32799:0:(osp_sync.c:584:osp_sync_process_record()) processed all old entries: 0x437b:1 2012-11-07 09:53:36 Lustre: lstest-MDT0000: Temporarily refusing client connection from 172.20.4.128@o2ib500 2012-11-07 09:53:36 Lustre: Skipped 39 previous similar messages 2012-11-07 09:53:36 LustreError: 32845:0:(osp_sync.c:584:osp_sync_process_record()) processed all old entries: 0x438a:1 2012-11-07 09:53:36 LustreError: 32845:0:(osp_sync.c:584:osp_sync_process_record()) Skipped 17 previous similar messages 2012-11-07 09:53:37 LustreError: 11-0: lstest-OST01a4-osc-MDT0000: Communicating with 172.20.3.20@o2ib500, operation ost_connect failed with -16 2012-11-07 09:53:37 LNet: 21212:0:(o2iblnd_cb.c:2357:kiblnd_passive_connect()) Conn race 172.20.3.24@o2ib500 2012-11-07 09:53:37 Lustre: lstest-MDT0000: Temporarily refusing client connection from 172.20.3.165@o2ib500 2012-11-07 09:53:37 Lustre: Skipped 15 previous similar messages 2012-11-07 09:53:37 LustreError: 32923:0:(osp_sync.c:584:osp_sync_process_record()) processed all old entries: 0x43a4:1 2012-11-07 09:53:37 LustreError: 32923:0:(osp_sync.c:584:osp_sync_process_record()) Skipped 24 previous similar messages 2012-11-07 09:53:38 Lustre: 33016:0:(llog.c:92:llog_free_handle()) Still busy: 2: 0x3efe:0x1:0: 64767 36702 2111296 1 2012-11-07 09:53:38 Pid: 33016, comm: osp-syn-460 2012-11-07 09:53:38 2012-11-07 09:53:38 2012-11-07 09:53:38 Call Trace: 2012-11-07 09:53:38 [<ffffffffa05bc965>] libcfs_debug_dumpstack+0x55/0x80 [libcfs] 2012-11-07 09:53:38 [<ffffffffa075ae81>] llog_free_handle+0xb1/0x430 [obdclass] 2012-11-07 09:53:38 [<ffffffffa075b25d>] llog_close+0x5d/0x190 [obdclass] 2012-11-07 09:53:38 [<ffffffffa0760ea9>] llog_cat_cancel_records+0x179/0x490 [obdclass] 2012-11-07 09:53:38 [<ffffffffa0946f10>] ? lustre_swab_ost_body+0x0/0x10 [ptlrpc] 2012-11-07 09:53:38 [<ffffffffa0ff59b8>] osp_sync_process_committed+0x238/0x760 [osp] 2012-11-07 09:53:38 [<ffffffffa09699a7>] ? 
ptlrpcd_add_req+0x187/0x2e0 [ptlrpc] 2012-11-07 09:53:38 [<ffffffffa0ff5f74>] osp_sync_process_queues+0x94/0x11c0 [osp] 2012-11-07 09:53:38 [<ffffffff8105ea30>] ? default_wake_function+0x0/0x20 2012-11-07 09:53:38 [<ffffffffa075cf3b>] llog_process_thread+0x8fb/0xe10 [obdclass] 2012-11-07 09:53:38 [<ffffffffa0ff5ee0>] ? osp_sync_process_queues+0x0/0x11c0 [osp] 2012-11-07 09:53:38 [<ffffffffa075ec7d>] llog_process_or_fork+0x12d/0x660 [obdclass] 2012-11-07 09:53:38 [<ffffffffa0760c83>] llog_cat_process_cb+0x2d3/0x380 [obdclass] 2012-11-07 09:53:38 [<ffffffffa075cf3b>] llog_process_thread+0x8fb/0xe10 [obdclass] 2012-11-07 09:53:38 [<ffffffffa07609b0>] ? llog_cat_process_cb+0x0/0x380 [obdclass] 2012-11-07 09:53:38 [<ffffffffa075ec7d>] llog_process_or_fork+0x12d/0x660 [obdclass] 2012-11-07 09:53:38 [<ffffffffa075f7b9>] llog_cat_process_or_fork+0x89/0x290 [obdclass] 2012-11-07 09:53:38 [<ffffffff8104cab9>] ? __wake_up_common+0x59/0x90 2012-11-07 09:53:38 [<ffffffffa0ff5ee0>] ? osp_sync_process_queues+0x0/0x11c0 [osp] 2012-11-07 09:53:38 [<ffffffffa075f9d9>] llog_cat_process+0x19/0x20 [obdclass] 2012-11-07 09:53:38 [<ffffffffa05bd83a>] ? cfs_waitq_signal+0x1a/0x20 [libcfs] 2012-11-07 09:53:38 [<ffffffffa0ff8000>] osp_sync_thread+0x1d0/0x700 [osp] 2012-11-07 09:53:38 [<ffffffffa0ff7e30>] ? osp_sync_thread+0x0/0x700 [osp] 2012-11-07 09:53:38 [<ffffffff8100c14a>] child_rip+0xa/0x20 2012-11-07 09:53:38 [<ffffffffa0ff7e30>] ? osp_sync_thread+0x0/0x700 [osp] 2012-11-07 09:53:38 [<ffffffffa0ff7e30>] ? osp_sync_thread+0x0/0x700 [osp] 2012-11-07 09:53:38 [<ffffffff8100c140>] ? child_rip+0x0/0x20 2012-11-07 09:53:38 2012-11-07 09:53:38 LustreError: 33016:0:(llog_cat.c:187:llog_cat_id2handle()) lstest-OST01cc-osc-MDT0000: error opening log id 0x5a5a5a5a5a5a5a5a:5a5a5a5a: rc = -2 2012-11-07 09:53:38 LustreError: 33016:0:(llog_cat.c:513:llog_cat_cancel_records()) Cannot find log 0x5a5a5a5a5a5a5a5a 2012-11-07 09:53:38 LustreError: 33016:0:(llog_cat.c:552:llog_cat_cancel_records()) lstest-OST01cc-osc-MDT0000: fail to cancel 0 of 1 llog-records: rc = -2 2012-11-07 09:53:38 LustreError: 33016:0:(osp_sync.c:721:osp_sync_process_committed()) @@@ lstest-OST01cc-osc-MDT0000: can't cancel record 0x5a5a5a5a5a5a5a5a:0x5a5a5a5a5a5a5a5a:1515870810:2:36705: -2 2012-11-07 09:53:38 req@ffff880f75ddf400 x1418000660995471/t0(0) o6->lstest-OST01cc-osc-MDT0000@172.20.3.60@o2ib500:28/4 lens 664/400 e 0 to 0 dl 1352310924 ref 1 fl Complete:R/0/0 rc 0/-2 2012-11-07 09:53:38 LustreError: 33016:0:(llog_cat.c:187:llog_cat_id2handle()) lstest-OST01cc-osc-MDT0000: error opening log id 0x5a5a5a5a5a5a5a5a:5a5a5a5a: rc = -2 2012-11-07 09:53:38 LustreError: 33016:0:(llog_cat.c:513:llog_cat_cancel_records()) Cannot find log 0x5a5a5a5a5a5a5a5a 2012-11-07 09:53:38 LustreError: 33016:0:(llog_cat.c:552:llog_cat_cancel_records()) lstest-OST01cc-osc-MDT0000: fail to cancel 0 of 1 llog-records: rc = -2 2012-11-07 09:53:38 LustreError: 33016:0:(osp_sync.c:721:osp_sync_process_committed()) @@@ lstest-OST01cc-osc-MDT0000: can't cancel record 0x5a5a5a5a5a5a5a5a:0x5a5a5a5a5a5a5a5a:1515870810:2:36707: -2 2012-11-07 09:53:38 req@ffff880f73c0bc00 x1418000660995472/t0(0) o6->lstest-OST01cc-osc-MDT0000@172.20.3.60@o2ib500:28/4 lens 664/400 e 0 to 0 dl 1352310924 ref 1 fl Complete:R/0/0 rc 0/-2 2012-11-07 09:53:38 general protection fault: 0000 [#1] SMP 2012-11-07 09:53:38 last sysfs file: /sys/devices/system/cpu/cpu31/cache/index2/shared_cpu_map 2012-11-07 09:53:38 CPU 0 2012-11-07 09:53:38 Modules linked in: osp(U) mdt(U) mdd(U) lod(U) mgs(U) mgc(U) 
osd_zfs(U) lquota(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) acpi_cpufreq freq_table mperf ksocklnd(U) ko2iblnd(U) lnet(U) sha512_generic sha256_generic libcfs(U) ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ib_sa mlx4_ib ib_mad ib_core dm_mirror dm_region_hash dm_log dm_round_robin dm_multipath dm_mod vhost_net macvtap macvlan tun kvm zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate sg ses enclosure sd_mod crc_t10dif isci libsas wmi mpt2sas scsi_transport_sas raid_class sb_edac edac_core ahci i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support ioatdma shpchp ipv6 nfs lockd fscache nfs_acl auth_rpcgss sunrpc mlx4_en mlx4_core igb dca [last unloaded: cpufreq_ondemand] 2012-11-07 09:53:39 2012-11-07 09:53:39 Pid: 33016, comm: osp-syn-460 2012-11-07 09:53:39 Tainted: P W ---------------- 2.6.32-220.23.1.2chaos.ch5.x86_64 #1 appro 2620x-in/S2600GZ 2012-11-07 09:53:39 RIP: 0010:[<ffffffffa075c90c>] [<ffffffffa075c90c>] llog_process_thread+0x2cc/0xe10 [obdclass] 2012-11-07 09:53:39 RSP: 0018:ffff880f720bdb60 EFLAGS: 00010206 2012-11-07 09:53:39 RAX: 5a5a5a5a5a5a5a5a RBX: 0000000000008f81 RCX: 0000000000000000 2012-11-07 09:53:39 RDX: ffff880f763761c0 RSI: ffff880f79788000 RDI: ffff880f72d3e000 2012-11-07 09:53:39 RBP: ffff880f720bdc00 R08: ffff881f9b6b0500 R09: 0000000000000000 2012-11-07 09:53:39 R10: 0000000000000000 R11: 0000000000000000 R12: ffff881f9a728058 2012-11-07 09:53:39 R13: 000000000000fcff R14: ffff880f72d3c000 R15: ffff880f720bde80 2012-11-07 09:53:39 FS: 00007ffff7fdc700(0000) GS:ffff880060600000(0000) knlGS:0000000000000000 2012-11-07 09:53:39 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b 2012-11-07 09:53:39 CR2: 00007ffff7ff9000 CR3: 000000200eb4b000 CR4: 00000000000406f0 2012-11-07 09:53:39 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 2012-11-07 09:53:39 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 2012-11-07 09:53:39 Process osp-syn-460 2012-11-07 09:53:39 (pid: 33016, threadinfo ffff880f720bc000, task ffff880fd518cae0) 2012-11-07 09:53:39 Stack: 2012-11-07 09:53:39 ffff881f00002000 0000000100000000 ffff881f00000000 ffff880f72d3c001 2012-11-07 09:53:39 <0> 00008f8000000000 0000000000000000 0000000000000000 0000000000240000 2012-11-07 09:53:39 <0> 0000fd00720bdbe0 ffff880f72d3e000 ffff880f763761c0 ffff881f9b6b0500 2012-11-07 09:53:39 Call Trace: 2012-11-07 09:53:39 [<ffffffffa0ff5ee0>] ? osp_sync_process_queues+0x0/0x11c0 [osp] 2012-11-07 09:53:39 [<ffffffffa075ec7d>] llog_process_or_fork+0x12d/0x660 [obdclass] 2012-11-07 09:53:39 [<ffffffffa0760c83>] llog_cat_process_cb+0x2d3/0x380 [obdclass] 2012-11-07 09:53:39 [<ffffffffa075cf3b>] llog_process_thread+0x8fb/0xe10 [obdclass] 2012-11-07 09:53:39 [<ffffffffa07609b0>] ? llog_cat_process_cb+0x0/0x380 [obdclass] 2012-11-07 09:53:39 [<ffffffffa075ec7d>] llog_process_or_fork+0x12d/0x660 [obdclass] 2012-11-07 09:53:39 [<ffffffffa075f7b9>] llog_cat_process_or_fork+0x89/0x290 [obdclass] 2012-11-07 09:53:39 [<ffffffff8104cab9>] ? __wake_up_common+0x59/0x90 2012-11-07 09:53:39 [<ffffffffa0ff5ee0>] ? osp_sync_process_queues+0x0/0x11c0 [osp] 2012-11-07 09:53:39 [<ffffffffa075f9d9>] llog_cat_process+0x19/0x20 [obdclass] 2012-11-07 09:53:39 [<ffffffffa05bd83a>] ? cfs_waitq_signal+0x1a/0x20 [libcfs] 2012-11-07 09:53:39 [<ffffffffa0ff8000>] osp_sync_thread+0x1d0/0x700 [osp] 2012-11-07 09:53:39 [<ffffffffa0ff7e30>] ? 
osp_sync_thread+0x0/0x700 [osp] 2012-11-07 09:53:39 [<ffffffff8100c14a>] child_rip+0xa/0x20 2012-11-07 09:53:39 [<ffffffffa0ff7e30>] ? osp_sync_thread+0x0/0x700 [osp] 2012-11-07 09:53:39 [<ffffffffa0ff7e30>] ? osp_sync_thread+0x0/0x700 [osp] 2012-11-07 09:53:39 [<ffffffff8100c140>] ? child_rip+0x0/0x20 2012-11-07 09:53:39 Code: 74 0c 00 01 00 00 00 e8 c3 0c e7 ff 48 83 7d b8 00 0f 84 f8 03 00 00 4c 8b 45 b8 49 8b 80 b0 00 00 00 48 85 c0 0f 84 e4 03 00 00 <48> 8b 40 08 48 85 c0 0f 84 67 02 00 00 4d 89 f1 c7 04 24 00 20 2012-11-07 09:53:39 RIP [<ffffffffa075c90c>] llog_process_thread+0x2cc/0xe10 [obdclass] 2012-11-07 09:53:39 RSP <ffff880f720bdb60> |
| Comment by Li Wei (Inactive) [ 08/Nov/12 ] |
|
The latest report is very helpful. Here's my reconstruction:

2012-11-07 09:53:38 Lustre: 33016:0:(llog.c:92:llog_free_handle()) Still busy: 2: 0x3efe:0x1:0: 64767 36702 2111296 1

An OSP sync thread canceled the last record in a log and was freeing the log handle. However, the thread was still processing this log and was at index 36702. Although both the handle and header structures were poisoned, the thread could continue processing the log because a) index 36702 was less than (LLOG_BITMAP_BYTES * 8 - 1), b) llh_bitmap actually contained non-zero bits after being poisoned, and c) the on-disk canceled records were intact.

2012-11-07 09:53:38 LustreError: 33016:0:(osp_sync.c:721:osp_sync_process_committed()) @@@ lstest-OST01cc-osc-MDT0000: can't cancel record 0x5a5a5a5a5a5a5a5a:0x5a5a5a5a5a5a5a5a:1515870810:2:36705: -2
2012-11-07 09:53:38 req@ffff880f75ddf400 x1418000660995471/t0(0) o6->lstest-OST01cc-osc-MDT0000@172.20.3.60@o2ib500:28/4 lens 664/400 e 0 to 0 dl 1352310924 ref 1 fl Complete:R/0/0 rc 0/-2
[...]
2012-11-07 09:53:38 LustreError: 33016:0:(osp_sync.c:721:osp_sync_process_committed()) @@@ lstest-OST01cc-osc-MDT0000: can't cancel record 0x5a5a5a5a5a5a5a5a:0x5a5a5a5a5a5a5a5a:1515870810:2:36707: -2
2012-11-07 09:53:38 req@ffff880f73c0bc00 x1418000660995472/t0(0) o6->lstest-OST01cc-osc-MDT0000@172.20.3.60@o2ib500:28/4 lens 664/400 e 0 to 0 dl 1352310924 ref 1 fl Complete:R/0/0 rc 0/-2

These were from two records processed after the log handle had been freed; their indices were 36705 and 36707. If I calculated correctly, 36702, 36705, and 36707 correspond to bits 30, 33, and 35 of the 573rd 64-bit bitmap word, which after poisoning holds 0x5a5a5a5a5a5a5a5a:

bits 63..32:  0 1 0 1 1 0 1 0   0 1 0 1 1 0 1 0   0 1 0 1 1 0 1 0   0 1 0 1 1 0 1 0
bits 31..0:   0 1 0 1 1 0 1 0   0 1 0 1 1 0 1 0   0 1 0 1 1 0 1 0   0 1 0 1 1 0 1 0

Bits 30, 33, and 35 of the poison pattern are set, while bits 31, 32, and 34 are clear; this is why 36703, 36704, and 36706 did not appear in the log.
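A quick, standalone way to double-check that arithmetic (plain C, assuming 64-bit bitmap words as above; not Lustre code):

#include <stdio.h>

int main(void)
{
        unsigned long long poison = 0x5a5a5a5a5a5a5a5aULL;
        unsigned int indices[] = { 36702, 36703, 36704, 36705, 36706, 36707 };

        for (int i = 0; i < 6; i++) {
                unsigned int idx  = indices[i];
                unsigned int word = idx / 64;        /* which bitmap word */
                unsigned int bit  = idx % 64;        /* which bit inside it */

                printf("index %u -> word %u, bit %u, set in poison: %llu\n",
                       idx, word, bit, (poison >> bit) & 1ULL);
        }
        return 0;
}

This prints word 573 for all six indices, with bits 30, 33, and 35 (indices 36702, 36705, 36707) set in the poison pattern and bits 31, 32, and 34 (indices 36703, 36704, 36706) clear, matching the observation above. |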
| Comment by Prakash Surya (Inactive) [ 14/Nov/12 ] |
|
Thanks for the detailed explanation! I need a better understanding of llogs to fully understand, but it sounds plausible at a high level. Do you have an idea as to how to fix it? I've also seen the "Still busy" message crop up without the other "can't cancel record" messages: <ConMan> Console [grove-mds2] log at 2012-11-13 07:00:00 PST. 2012-11-13 07:56:45 Lustre: 32930:0:(llog.c:92:llog_free_handle()) Still busy: 2: 0x6541:0x1:0: 64767 64767 4161152 1 2012-11-13 07:56:45 Pid: 32930, comm: osp-syn-431 2012-11-13 07:56:45 2012-11-13 07:56:45 2012-11-13 07:56:45 Call Trace: 2012-11-13 07:56:45 [<ffffffffa05aa965>] libcfs_debug_dumpstack+0x55/0x80 [libcfs] 2012-11-13 07:56:45 [<ffffffffa0703e81>] llog_free_handle+0xb1/0x430 [obdclass] 2012-11-13 07:56:45 [<ffffffffa070425d>] llog_close+0x5d/0x190 [obdclass] 2012-11-13 07:56:45 [<ffffffffa0709ea9>] llog_cat_cancel_records+0x179/0x490 [obdclass] 2012-11-13 07:56:45 [<ffffffffa08eff20>] ? lustre_swab_ost_body+0x0/0x10 [ptlrpc] 2012-11-13 07:56:45 [<ffffffffa0fcd9b8>] osp_sync_process_committed+0x238/0x760 [osp] 2012-11-13 07:56:45 [<ffffffffa09129b7>] ? ptlrpcd_add_req+0x187/0x2e0 [ptlrpc] 2012-11-13 07:56:45 [<ffffffffa0fcdf74>] osp_sync_process_queues+0x94/0x11c0 [osp] 2012-11-13 07:56:45 [<ffffffff8105ea30>] ? default_wake_function+0x0/0x20 2012-11-13 07:56:45 [<ffffffffa0705f3b>] llog_process_thread+0x8fb/0xe10 [obdclass] 2012-11-13 07:56:45 [<ffffffffa0fcdee0>] ? osp_sync_process_queues+0x0/0x11c0 [osp] 2012-11-13 07:56:45 [<ffffffffa0707c7d>] llog_process_or_fork+0x12d/0x660 [obdclass] 2012-11-13 07:56:45 [<ffffffffa0709c83>] llog_cat_process_cb+0x2d3/0x380 [obdclass] 2012-11-13 07:56:45 [<ffffffffa0705f3b>] llog_process_thread+0x8fb/0xe10 [obdclass] 2012-11-13 07:56:45 [<ffffffffa07099b0>] ? llog_cat_process_cb+0x0/0x380 [obdclass] 2012-11-13 07:56:45 [<ffffffffa0707c7d>] llog_process_or_fork+0x12d/0x660 [obdclass] 2012-11-13 07:56:45 [<ffffffffa07087b9>] llog_cat_process_or_fork+0x89/0x290 [obdclass] 2012-11-13 07:56:45 [<ffffffff8104cab9>] ? __wake_up_common+0x59/0x90 2012-11-13 07:56:45 [<ffffffffa0fcdee0>] ? osp_sync_process_queues+0x0/0x11c0 [osp] 2012-11-13 07:56:45 [<ffffffffa07089d9>] llog_cat_process+0x19/0x20 [obdclass] 2012-11-13 07:56:45 [<ffffffffa05ab83a>] ? cfs_waitq_signal+0x1a/0x20 [libcfs] 2012-11-13 07:56:45 [<ffffffffa0fd0000>] osp_sync_thread+0x1d0/0x700 [osp] 2012-11-13 07:56:45 [<ffffffffa0fcfe30>] ? osp_sync_thread+0x0/0x700 [osp] 2012-11-13 07:56:45 [<ffffffff8100c14a>] child_rip+0xa/0x20 2012-11-13 07:56:45 [<ffffffffa0fcfe30>] ? osp_sync_thread+0x0/0x700 [osp] 2012-11-13 07:56:45 [<ffffffffa0fcfe30>] ? osp_sync_thread+0x0/0x700 [osp] 2012-11-13 07:56:45 [<ffffffff8100c140>] ? child_rip+0x0/0x20 2012-11-13 07:56:45 <ConMan> Console [grove-mds2] log at 2012-11-13 08:00:00 PST. |
| Comment by Li Wei (Inactive) [ 15/Nov/12 ] |
|
Prakash, in your last comment, according to

2012-11-13 07:56:45 Lustre: 32930:0:(llog.c:92:llog_free_handle()) Still busy: 2: 0x6541:0x1:0: 64767 64767 4161152 1

both lgh_cur_idx and lgh_last_idx were equal to the maximum index allowed in a log, meaning that the thread had processed all the records and would not use the handle anymore. I think that is why the "can't cancel record" message and the GPF did not occur. In the case described by my last comment, lgh_cur_idx was far smaller than lgh_last_idx. What puzzles me is how llh_count could become one before the records from lgh_cur_idx to lgh_last_idx were processed. Following Andreas and Alex's advice, I applied http://review.whamcloud.com/4508, which verifies the consistency between llh_count and llh_bitmap at log open time, and failed an OST under a racer workload every ten seconds for a few hours. But no inconsistency was found. I'll think again...
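For reference, the sort of open-time consistency check being described could look like the following standalone sketch (illustrative names only, not the code in http://review.whamcloud.com/4508):

#include <stddef.h>

/* Count the live records marked in the bitmap and compare with the header's
 * record counter; the header itself occupies bit 0, so the two should match
 * exactly for a healthy log. Returns 1 if consistent, 0 otherwise. */
static int header_is_consistent(const unsigned long *bitmap, size_t nwords,
                                int llh_count)
{
        int set = 0;

        for (size_t i = 0; i < nwords; i++)
                for (unsigned long w = bitmap[i]; w != 0; w &= w - 1)
                        set++;                /* clears the lowest set bit */

        return set == llh_count;
} |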
| Comment by Alex Zhuravlev [ 16/Nov/12 ] |
|
Notice that during cancellation the llog cookie is taken from the request, not directly from the llog. So if the request is freed, the corresponding memory can be reused (or just filled with 0x5a).
| Comment by Mikhail Pershin [ 22/Nov/12 ] |
|
I agree with Alex; it is not about a freed loghandle but about a freed logid from the request:

2012-11-07 09:53:38 LustreError: 33016:0:(llog_cat.c:187:llog_cat_id2handle()) lstest-OST01cc-osc-MDT0000: error opening log id 0x5a5a5a5a5a5a5a5a:5a5a5a5a: rc = -2

That means llog_cat_id2handle() tried to use a logid that was being freed. And the protection fault in osp_sync_process_queues() is caused by calling ptlrpc_req_finished() on a freed request. |
| Comment by Li Wei (Inactive) [ 23/Nov/12 ] |
|
There was evidence early on of poisoned log handles when filling cookies into request buffers. E.g., from Brian's comment on Aug 7:

2012-08-06 11:35:50 LustreError: 21152:0:(osp_sync.c:465:osp_sync_new_job()) Poisoned log ID from handle ffff882ffb055900 |
| Comment by Mikhail Pershin [ 23/Nov/12 ] |
|
so, this debug is not in osp_sync_new_job() now? Can we add it again? |
| Comment by Alex Zhuravlev [ 23/Nov/12 ] |
|
Yes. Li Wei, could you combine all the debug you've developed (checks for poisoned IDs, llh_count vs. bitmap verification, etc.) into a single patch and land it on master, please? |
| Comment by Mikhail Pershin [ 23/Nov/12 ] |
|
But you've already answered that: there were no more records in the llog and the bitmap was empty, but after poisoning it became filled with the pattern, and llog_process_thread() found a set bit, which was 36702. I think we have to add a refcount to llog_handle and free it only when the last reference is dropped; a rough sketch of the idea is below.
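A minimal, self-contained sketch of that refcounting idea (illustrative names, not the eventual patch):

#include <stdatomic.h>
#include <stdlib.h>

struct log_handle {
        atomic_int refcount;            /* one reference per current user */
        /* ... bitmap, indices, flags ... */
};

static struct log_handle *handle_get(struct log_handle *h)
{
        atomic_fetch_add(&h->refcount, 1);
        return h;
}

static void handle_put(struct log_handle *h)
{
        /* Only the thread dropping the last reference frees the handle, so a
         * processing thread that holds its own reference can never see the
         * memory poisoned underneath it. */
        if (atomic_fetch_sub(&h->refcount, 1) == 1)
                free(h);
} |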
| Comment by Li Wei (Inactive) [ 23/Nov/12 ] |
|
Mike, I had that thought too. But considering that
a log should be destroyed at least after its OSP has processed all its records and likely moved to another log. In addition, because
lgh_cur_idx should be 64767 (or 64766, if a padding record exists at the end) when the OSP has processed all the records in the log. Thus, the log handle should not reach a state like

2012-11-07 09:53:38 Lustre: 33016:0:(llog.c:92:llog_free_handle()) Still busy: 2: 0x3efe:0x1:0: 64767 36702 2111296 1

where lgh_last_idx was 64767, lgh_cur_idx was 36702, and llh_count was 1. |
| Comment by Alex Zhuravlev [ 26/Nov/12 ] |
|
> where lgh_last_idx was 64767, lgh_cur_idx was 36702, and llh_count was 1.

IIRC, in all the reported cases the problem happened right after startup, so 64767 means this is a llog left over from the previous boot and we are reprocessing it. I think it should be possible to reproduce this locally by setting the timeout pretty high (to work around the slow performance compared to the customer's hardware). Another debug patch could print a warning when the number of processed records exceeds llh_count (if we started with a full llog). |
| Comment by Mikhail Pershin [ 26/Nov/12 ] |
|
Li Wei, about "Still busy: 2: 0x3efe:0x1:0: 64767 36702 2111296 1": why couldn't 36702 just be the latest cancelled record? I noticed that requests are added to the head of the opd_syn_committed_there list, so later requests are processed first; technically it is possible for records with a lower index to be cancelled later than ones with a larger index. |
| Comment by Mikhail Pershin [ 26/Nov/12 ] |
|
It is not quite so; the llh is set to NULL when the request is sent. In fact, llog_process_thread() is not safe if a callback deletes the plain llog. We need a mechanism to detect that the llog was deleted and to stop processing.
|
| Comment by Alex Zhuravlev [ 26/Nov/12 ] |
|
I think it makes sense to take a local copy of llh_count just before we start processing the llog, decrement it for every live record, and break once it reaches 1 (a rough sketch is below). But this logic should only be used when the llog is not getting new records.
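A standalone sketch of that idea (illustrative names and sizes, not llog_process_thread() itself):

#include <limits.h>

#define MAX_RECORDS   64768                        /* stands in for LLOG_BITMAP_BYTES * 8 */
#define BITS_PER_WORD (CHAR_BIT * sizeof(unsigned long))

struct log_header {
        int           count;                       /* header + live records */
        unsigned long bitmap[MAX_RECORDS / BITS_PER_WORD];
};

/* Walk a log that can no longer grow: keep a private copy of the record
 * count, decrement it for every live record visited, and stop once only the
 * header "record" remains, instead of trusting a header that may have been
 * freed and poisoned while we were processing. */
static int process_static_log(const struct log_header *llh,
                              int (*cb)(int idx, void *arg), void *arg)
{
        int remaining = llh->count;

        for (int idx = 1; idx < MAX_RECORDS && remaining > 1; idx++) {
                if (!((llh->bitmap[idx / BITS_PER_WORD] >> (idx % BITS_PER_WORD)) & 1UL))
                        continue;                  /* cancelled record */
                remaining--;
                int rc = cb(idx, arg);
                if (rc)
                        return rc;                 /* stop requested or error */
        }
        return 0;
} |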
| Comment by Prakash Surya (Inactive) [ 26/Nov/12 ] |
|
FWIW, I hit this again. This time, I had all of the OSTs powered down, then rebooted the MDS (which came up fine), then powered up the OSTs. The MDS appeared to be OK when it came up, but once the OSTs were brought back online, the MDS crashed. And now as the MDS tries to come back online, it repeatedly hits this (twice so far):

2012-11-26 12:40:40 LustreError: 33226:0:(llog_cat.c:187:llog_cat_id2handle()) lstest-OST01fc-osc-MDT0000: error opening log id 0xaf75:0: rc = -2
2012-11-26 12:40:40 BUG: unable to handle kernel NULL pointer dereference at 0000000000000060
2012-11-26 12:40:40 IP: [<ffffffffa071d3e4>] cat_cancel_cb+0x2e4/0x5e0 [obdclass] |
| Comment by Li Wei (Inactive) [ 26/Nov/12 ] |
|
Ah, right, it must be records that were processed but not cancelled before MDTs were brought down. |
| Comment by Alex Zhuravlev [ 26/Nov/12 ] |
|
As for cat_cancel_cb(), I found that llog_cat_init_and_process() forks, so it then runs in parallel with the main OSP llog_process(). I think llog_cat_init_and_process() should not fork. |
| Comment by Mikhail Pershin [ 27/Nov/12 ] |
|
Prakash, that is a different issue; could you open a new ticket with the data you have? |
| Comment by Prakash Surya (Inactive) [ 27/Nov/12 ] |
|
I opened LU-2394 for the cat_cancel_cb issue, with a link to the patch that addresses it. |
| Comment by Li Wei (Inactive) [ 28/Nov/12 ] |
|
http://review.whamcloud.com/4696 is a fix. For some reason, I wasn't able to get my new regression test working correctly today. Instead, I tested this patch with this hack:

--- a/lustre/obdclass/llog.c
+++ b/lustre/obdclass/llog.c
@@ -137,7 +137,8 @@ int llog_cancel_rec(const struct lu_env *env, struct llog_ha
         if ((llh->llh_flags & LLOG_F_ZAP_WHEN_EMPTY) &&
             (llh->llh_count == 1) &&
-            (loghandle->lgh_last_idx == (LLOG_BITMAP_BYTES * 8) - 1)) {
+            (loghandle->lgh_last_idx >= 1)) {
+                loghandle->lgh_last_idx = (LLOG_BITMAP_BYTES * 8) - 1;
                 cfs_spin_unlock(&loghandle->lgh_hdr_lock);
                 rc = llog_destroy(env, loghandle);
                 if (rc < 0) { |
| Comment by Andreas Dilger [ 03/Dec/12 ] |
|
Li Wei, this is exactly what OBD_FAIL_CHECK() is for: the ability to trigger specific errors in the code at runtime instead of having to change it at compile time. |
| Comment by Mikhail Pershin [ 05/Dec/12 ] |
|
http://review.whamcloud.com/#change,4745 is an attempt to introduce a refcounter for the llog handle. Please verify that it is correct and works. |
| Comment by Prakash Surya (Inactive) [ 13/Dec/12 ] |
|
Mike, I pulled in that patch. I've also done some initial testing and it looks good so far. I've rebooted the MDS a few times while under light load, and haven't hit the issue yet (with the fix). Without the fix, I was able to hit it on the first try under the same load. |
| Comment by Mikhail Pershin [ 11/Jan/13 ] |
|
The patch has landed. Prakash, are you OK with closing this ticket? |
| Comment by Mikhail Pershin [ 13/Jan/13 ] |
|
Closing for now; reopen if it appears again. |
| Comment by Prakash Surya (Inactive) [ 14/Jan/13 ] |
|
Mike, I can't recall hitting this since pulling in the patch. Thanks! |