[LU-6584] OSS hit LBUG and crash Created: 08/May/15 Updated: 21/Dec/15 Resolved: 07/Oct/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.7.0 |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Haisong Cai (Inactive) | Assignee: | Mikhail Pershin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | sdsc | ||
| Environment: |
[root@panda-oss-25-4 ~]# uname -a |
|
| Attachments: |
|
| Issue Links: |
|
| Severity: | 4 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
No sign or indication, i.e. no lustre-log or error messages; the OSS unexpectedly crashed (please see the console image). /var/log/messages is attached. |
| Comments |
| Comment by Haisong Cai (Inactive) [ 08/May/15 ] |
|
I'd like to add an observation: in the 2 cases that have happened to 2 separate OSSes so far, an OSS crashed and was brought back, the OSTs mounted, then almost immediately it crashed again. The console image was taken from the second crash. Haisong |
| Comment by Bruno Faccini (Inactive) [ 08/May/15 ] |
|
Hello Haisong, the only interesting info found in the messages file is:
May 7 15:18:13 panda-oss-23-6 kernel: LustreError: 26222:0:(client.c:173:__ptlrpc_prep_bulk_page()) ASSERTION( pageoffset + len <= ((1UL) << 12) ) failed:
May 7 15:18:13 panda-oss-23-6 kernel: LustreError: 26222:0:(client.c:173:__ptlrpc_prep_bulk_page()) LBUG
May 7 15:18:13 panda-oss-23-6 kernel: Pid: 26222, comm: ll_ost_io00_013
May 7 15:18:13 panda-oss-23-6 kernel:
May 7 15:18:13 panda-oss-23-6 kernel: Call Trace:
May 7 15:18:13 panda-oss-23-6 kernel: [<ffffffffa094b857>] libcfs_debug_dumpstack+0x57/0x80 [libcfs]
May 7 15:18:13 panda-oss-23-6 kernel: [<ffffffffa094bdd7>] lbug_with_loc+0x47/0xc0 [libcfs]
May 7 15:18:13 panda-oss-23-6 kernel: [<ffffffffa0fe921b>] __ptlrpc_prep_bulk_page+0xcb/0x190 [ptlrpc]
May 7 15:18:13 panda-oss-23-6 kernel: [<ffffffffa105f590>] tgt_brw_read+0xab0/0x11d0 [ptlrpc]
May 7 15:18:13 panda-oss-23-6 kernel: [<ffffffffa0ffa486>] ? lustre_pack_reply_flags+0xa6/0x1e0 [ptlrpc]
May 7 15:18:13 panda-oss-23-6 kernel: [<ffffffff8109518d>] ? sched_clock_cpu+0xcd/0x110
May 7 15:18:13 panda-oss-23-6 kernel: [<ffffffffa105c7ae>] tgt_handle_request0+0x9e/0x3f0 [ptlrpc]
May 7 15:18:13 panda-oss-23-6 kernel: [<ffffffffa10576c0>] ? tgt_handle_recovery+0x30/0x360 [ptlrpc]
May 7 15:18:13 panda-oss-23-6 kernel: [<ffffffffa105d371>] tgt_request_handle+0x1c1/0x770 [ptlrpc]
May 7 15:18:13 panda-oss-23-6 kernel: [<ffffffffa100a5e3>] ptlrpc_server_handle_request+0x2e3/0xbc0 [ptlrpc]
May 7 15:18:13 panda-oss-23-6 kernel: [<ffffffffa094c3de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
May 7 15:18:13 panda-oss-23-6 kernel: [<ffffffffa095ba0a>] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
May 7 15:18:13 panda-oss-23-6 kernel: [<ffffffffa1003209>] ? ptlrpc_wait_event+0xa9/0x2f0 [ptlrpc]
May 7 15:18:13 panda-oss-23-6 kernel: [<ffffffff8108d273>] ? __wake_up+0x53/0x70
May 7 15:18:13 panda-oss-23-6 kernel: [<ffffffffa100cd3c>] ptlrpc_main+0x9dc/0xd90 [ptlrpc]
May 7 15:18:13 panda-oss-23-6 kernel: [<ffffffffa100c360>] ? ptlrpc_main+0x0/0xd90 [ptlrpc]
May 7 15:18:13 panda-oss-23-6 kernel: [<ffffffff810821be>] kthread+0xce/0xe0
May 7 15:18:13 panda-oss-23-6 kernel: [<ffffffff810820f0>] ? kthread+0x0/0xe0
May 7 15:18:13 panda-oss-23-6 kernel: [<ffffffff815f93c8>] ret_from_fork+0x58/0x90
May 7 15:18:13 panda-oss-23-6 kernel: [<ffffffff810820f0>] ? kthread+0x0/0xe0
So, did something go wrong during remote <-> local buffer mapping? |
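For context, here is a minimal sketch of the invariant that is tripping (illustrative only, not the actual Lustre source; the *_sketch names are hypothetical stand-ins): __ptlrpc_prep_bulk_page() adds one page fragment to a bulk RPC descriptor, and a fragment must fit inside a single page, i.e. pageoffset + len <= PAGE_SIZE (1UL << 12 = 4096). Hitting the LBUG means a fragment with an impossible offset/length combination reached that point.

/* Minimal model of the client.c:173 invariant; not the real Lustre code. */
#include <assert.h>

#define PAGE_SIZE_SKETCH (1UL << 12)        /* 4096, matches the assertion */

struct bulk_frag_sketch {                   /* hypothetical stand-in for a kiov entry */
    void         *bf_page;                  /* the page backing this fragment */
    unsigned int  bf_offset;                /* offset of the fragment within the page */
    unsigned int  bf_len;                   /* length of the fragment in bytes */
};

static void prep_bulk_page_sketch(struct bulk_frag_sketch *frag, void *page,
                                  unsigned int pageoffset, unsigned int len)
{
    /* A fragment must never span pages; a corrupted length trips this. */
    assert(pageoffset + len <= PAGE_SIZE_SKETCH);

    frag->bf_page   = page;
    frag->bf_offset = pageoffset;
    frag->bf_len    = len;
}

int main(void)
{
    static char page[PAGE_SIZE_SKETCH];
    struct bulk_frag_sketch frag;

    prep_bulk_page_sketch(&frag, page, 0, 4096);    /* valid: exactly one page */
    /* A length far larger than a page, as in the LBUG above, would abort here. */
    return 0;
}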
| Comment by Haisong Cai (Inactive) [ 08/May/15 ] |
|
One of the OSSes that crashed 2 days ago crashed again this morning. I am attaching dmesg here. Just to list all the LBUGs we have encountered on this filesystem so far (this morning it was panda-oss-25-4):
May 6 09:06:46 panda-oss-25-4 kernel: LustreError: 30076:0:(client.c:173:__ptlrpc_prep_bulk_page()) LBUG
To answer your question about kdump, no, we don't have kdump enabled on our OSS/MDS. thanks, |
| Comment by Haisong Cai (Inactive) [ 08/May/15 ] |
|
Adding 2 lustre-logs for this morning's crash, on panda-oss-25-4 |
| Comment by Haisong Cai (Inactive) [ 08/May/15 ] |
|
Another OSS crashed, the third OSS. Same pattern: after the first crash we brought the OSS/OSTs back, and in less than a minute it hit the second crash. It then generally stays up and functional for hours until the next crash. |
| Comment by Peter Jones [ 08/May/15 ] |
|
Mike, could you please advise on this one? Thanks, Peter |
| Comment by Rick Wagner (Inactive) [ 09/May/15 ] |
|
Haisong, This is hitting wombat (our test partition) when I'm just running IOR. Both partitions have the same stack now, so we may be able to reproduce it quickly. This also eliminates it being caused by pathological client IO on the production side. --Rick |
| Comment by Rick Wagner (Inactive) [ 09/May/15 ] |
|
Mike, If it's any help, we were running v2.6.92 with a handful of patches related to large block support with ZFS ( --Rick |
| Comment by Mikhail Pershin [ 12/May/15 ] |
|
Thanks for the info, I am investigating this. |
| Comment by Gerrit Updater [ 25/May/15 ] |
|
Mike Pershin (mike.pershin@intel.com) uploaded a new patch: http://review.whamcloud.com/14926 |
| Comment by Mikhail Pershin [ 25/May/15 ] |
|
I wasn't able to find the exact place of the problem by inspecting the related code. There is a patch to get a little bit more info when this bug happens. |
| Comment by Andreas Dilger [ 01/Jun/15 ] |
|
Mike, could you please take a look through the patches between 2.6.92 and 2.7.52 to see if there are any likely candidates? |
| Comment by Haisong Cai (Inactive) [ 12/Jun/15 ] |
|
We hit another LBUG and it looks like the same kind:
Jun 12 01:01:45 panda-oss-25-2 kernel: LustreError: 29442:0:(client.c:173:__ptlrpc_prep_bulk_page()) ASSERTION( pageoffset + len <= ((1UL) << 12) ) failed: |
| Comment by Andreas Dilger [ 15/Jun/15 ] |
|
Haisong, it looks like you are not yet running the debug patch from Mike (http://review.whamcloud.com/14926) on your systems. It would be useful if you applied that patch (only needed on the servers) so that we can capture more information about this failure. |
| Comment by Mikhail Pershin [ 29/Jun/15 ] |
|
Rick, what version of ZFS are you using, 0.6.4? |
| Comment by Rick Wagner (Inactive) [ 29/Jun/15 ] |
|
Hi Mikhail, Yes, 0.6.4, but with the first large block support pull request, 2865. Here's the SPL and ZFS build process I used.

git clone https://github.com/zfsonlinux/spl.git
cd spl
git checkout spl-0.6.4
./autogen.sh
./configure --disable-debug
make pkg
rpm -ivh *x86_64.rpm

git clone https://github.com/zfsonlinux/zfs.git
cd zfs
git fetch -t https://github.com/zfsonlinux/zfs.git refs/pull/2865/head:lgblock
git checkout zfs-0.6.4
git merge lgblock
./autogen.sh
./configure --disable-debug
make pkg
rpm -ivh *x86_64.rpm |
| Comment by Mikhail Pershin [ 02/Jul/15 ] |
|
Rick, thank you for that info, I am trying to reproduce the situation. Meanwhile, was there any other specific tuning in ZFS, e.g. the maximum block size or anything related to the block size? Also, what Lustre patches are you using on top of stock Lustre 2.7.52? I am asking because 2.7.52 can't be built with the ZFS version you are using, because SPA_MAXBLOCKSHIFT is bigger now |
| Comment by Rick Wagner (Inactive) [ 02/Jul/15 ] |
|
Mikhail, For ZFS with large block support, our record size is set to 1024k. On the Lustre side we've got a few patches applied, one or more of which has not landed:
At our site, I've handed over the build process to Dima Mishin for maintenance. Dima, can you pass on your current build (base commit, cherry picks, and any manual patches) to Mikhail? |
| Comment by Dmitry Mishin (Inactive) [ 02/Jul/15 ] |
|
I have a build from commit 8a11cb62 in master, with patch from a44175d ( |
| Comment by Minh Diep [ 10/Jul/15 ] |
|
Haisong, How often do you see this in production? Could you apply the debug patch (http://review.whamcloud.com/14926)? Thanks |
| Comment by Haisong Cai (Inactive) [ 10/Jul/15 ] |
|
Hi Minh, Depending on how the file-system is being used, we hit the bug anywhere from every day (as I described when I created this ticket) to several weeks apart. We are in the process of applying the patch to our production file-system. Haisong |
| Comment by Haisong Cai (Inactive) [ 14/Jul/15 ] |
|
We are very close to getting the debug patch deployed onto our production systems. This is the number of times we have hit the LBUG today; each hit is an OSS downtime.
[root@oasis-panda cai]# grep -i lbug /var/log/messages |
| Comment by Haisong Cai (Inactive) [ 14/Jul/15 ] |
|
In all crash cases these 2 lines are always coupled:
Jul 13 23:21:26 panda-oss-25-4 kernel: LustreError: 10265:0:(client.c:173:__ptlrpc_prep_bulk_page()) ASSERTION( pageoffset + len <= ((1UL) << 12) ) failed:
and most of the time followed by this line:
I am attaching /var/log/messages here, which includes all incidents from yesterday. |
| Comment by Haisong Cai (Inactive) [ 14/Jul/15 ] |
|
collected with the command line:
cat /var/log/messages | egrep -iv "sshd|cron|run-parts|postfix|rsyslog|audispd|named|rockscommand|channel|alert-handler|411-alert" > /tmp/log.$$
thanks, |
| Comment by Mikhail Pershin [ 18/Jul/15 ] |
|
Haisong, what do you mean by 'OSS downtime'? |
| Comment by Haisong Cai (Inactive) [ 18/Jul/15 ] |
|
Hi Mikhail, As I described at the very beginning of this ticket, when the LBUG is hit the OSS crashes. Haisong |
| Comment by Zhenyu Xu [ 23/Jul/15 ] |
|
Hi, what is the progress on applying the debug patch and collecting the logs again? The latest log is still without the debug patch. |
| Comment by Rick Wagner (Inactive) [ 23/Jul/15 ] |
|
Zhenyu, we're rolling out an update to our production file system today. After discussion with Minh we've decided to rebase our code on later releases of ZFS and Lustre that have our necessary patches. The only additional patch we've added is the debugging one for this ticket. After we're done, I'll ask Dima Mishin to post the exact releases that we're working from. |
| Comment by Dmitry Mishin (Inactive) [ 23/Jul/15 ] |
|
We used: |
| Comment by Haisong Cai (Inactive) [ 30/Jul/15 ] |
|
We hit another LBUG today. The OSS is running the debug patch.
Jul 30 15:08:25 panda-oss-25-4 kernel: LustreError: 11719:0:(client.c:211:__ptlrpc_prep_bulk_page()) ASSERTION( pageoffset + len <= PAGE_CACHE_SIZE ) failed: offset 0, len 1913970688 |
| Comment by Haisong Cai (Inactive) [ 30/Jul/15 ] |
|
In a little more than an hour:
Jul 30 15:08:25 panda-oss-25-4 kernel: LustreError: 11719:0:(client.c:211:__ptlrpc_prep_bulk_page()) LBUG |
| Comment by Andreas Dilger [ 07/Aug/15 ] |
|
Is it worthwhile to print out the whole lnb struct (maybe with neighboring values also) to see if there is memory corruption that can be identified? Is the oops always in the same place? If so, it seems likely that there is some systematic memory corruption (stack overflow, out-of-bounds array access, etc.) rather than random corruption from another thread, which would cause crashes in other parts of the code. |
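As a sketch of the kind of dump suggested above (purely illustrative; the real debug patch would use Lustre's CERROR() logging against the real struct niobuf_local, and the *_sketch names here are hypothetical), printing the failing lnb entry together with its neighbours makes a systematic pattern, such as a single stomped field, much easier to spot than the assertion message alone.

/* Illustrative neighbourhood dump for local niobufs; not the actual patch. */
#include <stdio.h>

struct niobuf_local_sketch {               /* hypothetical, trimmed-down lnb */
    long long    lnb_file_offset;          /* file offset this buffer maps */
    unsigned int lnb_page_offset;          /* offset within the page */
    unsigned int lnb_len;                  /* length of the fragment */
    int          lnb_rc;                   /* bytes valid / return code */
};

static void dump_lnb_neighbourhood(const struct niobuf_local_sketch *lnb,
                                   int count, int bad)
{
    int lo = bad >= 2 ? bad - 2 : 0;
    int hi = bad + 2 <= count - 1 ? bad + 2 : count - 1;
    int i;

    for (i = lo; i <= hi; i++)
        fprintf(stderr, "lnb[%d]%s file_offset=%lld page_offset=%u len=%u rc=%d\n",
                i, i == bad ? " <== failing entry" : "",
                lnb[i].lnb_file_offset, lnb[i].lnb_page_offset,
                lnb[i].lnb_len, lnb[i].lnb_rc);
}

int main(void)
{
    struct niobuf_local_sketch lnb[4] = {
        {     0, 0, 4096, 4096 },
        {  4096, 0, 4096, 4096 },
        {  8192, 0, 1913970688U, 0 },      /* a bogus length like the one reported */
        { 12288, 0, 4096, 4096 },
    };

    dump_lnb_neighbourhood(lnb, 4, 2);
    return 0;
}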
| Comment by Zhenyu Xu [ 12/Aug/15 ] |
|
Hi Haisong, http://review.whamcloud.com/#/c/14926/ has been updated to include more debug information to catch the LBUG; would you please apply it, re-hit the issue, and collect a core dump? Please add the "ha" debug flag before re-running the test: lctl set_param debug="+ha"
|
| Comment by Haisong Cai (Inactive) [ 12/Aug/15 ] |
|
Hi Zhenyu, We will try to apply the new debug patch. But since this is our production file-system, we will have to plan it. So I have 2 questions: 1) do you by any chance know a way to reproduce the LBUG? The reason I ask is that we have a non-production file-system that
I'd like to remind everyone that this LBUG event usually occurs 2 or 3 times consecutively, meaning the OSS crashes, we stand the OSS back up,
[root@oasis-panda log]# grep -i lbug messages
thanks, |
| Comment by Zhenyu Xu [ 13/Aug/15 ] |
|
1. Unfortunately we don't know the reason for the LBUG; we just saw that some IO has an invalid page length, and it seems there is some specific memory corruption there, so the debug patch tries to print out IO page information when something abnormal happens. 2. You can set up kdump to capture the kernel crash dump as the guide at http://fedoraproject.org/wiki/How_to_use_kdump_to_debug_kernel_crashes describes. |
| Comment by Andreas Dilger [ 13/Aug/15 ] |
|
If the OSS is crashing repeatedly after startup, could it mean that the bad data is arriving from the client during replay and is not being verified properly? Are there checks up in the ost layer to verify the niobuf_remote contains valid data before it is used in the OSD? It may be that the corruption is happening on the network or on the client. |
| Comment by Andreas Dilger [ 20/Aug/15 ] |
|
In the boot log I see you are using ZFS 0.6.4. It looks like there may be fixes in 0.6.4.1 and 0.6.4.2 that may be helpful in this case and
https://github.com/zfsonlinux/zfs/releases/tag/zfs-0.6.4.2
Also, just to confirm - are these OSS nodes running with 1MB ZFS blocksize? |
| Comment by Dmitry Mishin (Inactive) [ 20/Aug/15 ] |
|
We're using the f1512ee61e commit from the master ZFS branch (large block support). It's later than 0.6.4.1, and I had problems running with the latest master version. |
| Comment by Rick Wagner (Inactive) [ 24/Aug/15 ] |
|
Andreas, yes, we're using 1MB block sizes on the ZFS datasets that handle the OSTs. |
| Comment by Andreas Dilger [ 25/Aug/15 ] |
|
Bobijam, This niobuf verification should be in a helper function that can also be called before the currently-failing LASSERT() checks are hit (and elsewhere in the code if you think it is helpful), and those functions can return an error instead.
While I don't think this is a proper solution, it will at least tell us if the corruption is happening on the client and/or on the network, or in memory on the OSS, and it will potentially allow debugging to continue without the high frequency of OSS failures. |
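A minimal sketch of what such a helper might look like (assumptions: the function name, the *_sketch types, and the 1MB bulk limit are illustrative, not the actual Lustre code, and niobuf_remote field names differ between Lustre versions): reject a corrupted descriptor with -EPROTO so only the offending RPC fails rather than the whole OSS.

/* Hedged sketch, not the actual Lustre patch: validate a remote niobuf
 * before it reaches the OSD, returning -EPROTO instead of LASSERT()ing. */
#include <errno.h>
#include <stdio.h>

struct niobuf_remote_sketch {              /* hypothetical stand-in for niobuf_remote */
    unsigned long long rnb_offset;         /* file offset of the I/O fragment */
    unsigned int       rnb_len;            /* length of the fragment in bytes */
};

#define MAX_BRW_SIZE_SKETCH (1U << 20)     /* 1MB bulk I/O limit, illustrative */

static int verify_niobuf_remote(const struct niobuf_remote_sketch *rnb)
{
    /* Reject obviously corrupted descriptors: zero or oversized length,
     * or an offset + length that wraps around. */
    if (rnb->rnb_len == 0 || rnb->rnb_len > MAX_BRW_SIZE_SKETCH)
        return -EPROTO;
    if (rnb->rnb_offset + rnb->rnb_len < rnb->rnb_offset)
        return -EPROTO;
    return 0;
}

int main(void)
{
    /* The length seen in the July 30 LBUG report, clearly bogus. */
    struct niobuf_remote_sketch bad = { .rnb_offset = 0, .rnb_len = 1913970688U };

    printf("rc = %d\n", verify_niobuf_remote(&bad));   /* negative EPROTO: RPC rejected */
    return 0;
}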
| Comment by Andreas Dilger [ 25/Aug/15 ] |
|
Rick, the other possible avenue for debugging is to disable the 1MB blocksize tunable on one or more of your OST datasets, and see if this correlates to a reduction or elimination of the occurrence of this failure. This is one of the main deltas between your ZFS environment and other ZFS users, so this would allow us to isolate the memory corruption to the code handling 1MB blocksize. |
| Comment by Zhenyu Xu [ 26/Aug/15 ] |
|
http://review.whamcloud.com/#/c/14926/ has been updated to add more remote/local buffer check. |
| Comment by Rick Wagner (Inactive) [ 27/Aug/15 ] |
|
We've scheduled a maintenance window for Sep. 8 to roll out this latest patch after testing. Andreas, I'll consider changing the recordsize on some of the OSTs. The most likely scenario where we get solid information from this is if the LBUG is still hit on one of the OSSes with the changed setting. I am being a little cautious considering this since it will mean having a ZFS dataset with varying recordsizes. I don't believe the ZFS layer will care, but it's not something I've dealt with before. |
| Comment by Andreas Dilger [ 17/Sep/15 ] |
|
Hi Rick, any news on this front? Have you looked into upgrading to ZFS 0.6.5 to get the native large block support? The patch http://review.whamcloud.com/15127 " |
| Comment by Rick Wagner (Inactive) [ 18/Sep/15 ] |
|
Hi Andreas, since our last update to the code tree based on http://review.whamcloud.com/#/c/14926/ we've been stable. It's possible that we've pulled in a bugfix along with the debugging patch, although I couldn't point to a specific one. We are looking at ZFS 0.6.5 to get away from the unreleased version of ZFS we've had to run. I would probably do that along with another rebase to a later unpatched tag of Lustre, maybe once
On a related note, I think this issue could be removed from the 2.8 blocker list, since we started with patched versions of Lustre and ZFS. |
| Comment by Gerrit Updater [ 30/Sep/15 ] |
|
Mike Pershin (mike.pershin@intel.com) uploaded a new patch: http://review.whamcloud.com/16685 |
| Comment by Mikhail Pershin [ 30/Sep/15 ] |
|
It seems the reason for this issue is an int type overflow in lnb_rc. Instead of writing (eof - file_offset) directly into lnb_rc, we have to check first that it is not negative. |
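A user-space sketch of the failure mode and fix described above (the struct and function names are illustrative, not the actual patch): lnb_rc is an int while file offsets are 64-bit, so storing eof - file_offset without a range check lets a read at or past EOF produce a negative or truncated value that later reaches __ptlrpc_prep_bulk_page() as a bogus length.

/* Illustrative sketch of the lnb_rc overflow and the clamping fix. */
#include <stdio.h>

typedef long long loff_t_sketch;           /* stand-in for the kernel's loff_t */

struct niobuf_local_sketch {               /* hypothetical, trimmed-down lnb */
    loff_t_sketch lnb_file_offset;         /* file offset this buffer maps */
    int           lnb_len;                 /* buffer length in bytes */
    int           lnb_rc;                  /* bytes valid in this buffer */
};

static void fill_read_rc(struct niobuf_local_sketch *lnb, loff_t_sketch eof)
{
    loff_t_sketch left = eof - lnb->lnb_file_offset;

    /* Buggy pattern: lnb->lnb_rc = eof - lnb->lnb_file_offset;
     * Fixed pattern: clamp to [0, lnb_len] before the narrowing store. */
    if (left <= 0)
        lnb->lnb_rc = 0;                   /* read beyond EOF: nothing valid */
    else if (left < lnb->lnb_len)
        lnb->lnb_rc = (int)left;           /* short read at EOF */
    else
        lnb->lnb_rc = lnb->lnb_len;        /* full buffer is valid */
}

int main(void)
{
    struct niobuf_local_sketch lnb = { .lnb_file_offset = 1 << 20, .lnb_len = 4096 };

    fill_read_rc(&lnb, 4096);              /* EOF is well before the read offset */
    printf("lnb_rc = %d\n", lnb.lnb_rc);   /* 0, instead of a huge negative value */
    return 0;
}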
| Comment by Peter Jones [ 03/Oct/15 ] |
|
Will SDSC be able to try this patch out to confirm whether it fixes the issues that they have been experiencing? |
| Comment by Rick Wagner (Inactive) [ 03/Oct/15 ] |
|
Yes, we're scheduling a PM to push this out. Could this patch be related to |
| Comment by Mikhail Pershin [ 03/Oct/15 ] |
|
Rick, this particular issue existed in the IO READ code path and isn't related to |
| Comment by Gerrit Updater [ 07/Oct/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16685/ |
| Comment by Peter Jones [ 07/Oct/15 ] |
|
Fix landed for 2.8. We'll reopen if this issue is still hit on Hyperion. If there is still an issue at SDSC and it is not, as hoped, a duplicate of this issue, then please open a new ticket to track it. |