[LU-6584] OSS hit LBUG and crash Created: 08/May/15  Updated: 21/Dec/15  Resolved: 07/Oct/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Critical
Reporter: Haisong Cai (Inactive) Assignee: Mikhail Pershin
Resolution: Fixed Votes: 0
Labels: sdsc
Environment:

[root@panda-oss-25-4 ~]# uname -a
Linux panda-oss-25-4.sdsc.edu 3.10.73-1.el6.elrepo.x86_64 #1 SMP Thu Mar 26 16:28:30 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux
[root@panda-oss-25-4 ~]# rpm -aq | grep lustre
lustre-2.7.51-3.10.73_1.el6.elrepo.x86_64_gb019b03.x86_64
lustre-osd-zfs-mount-2.7.51-3.10.73_1.el6.elrepo.x86_64_gb019b03.x86_64
lustre-iokit-2.7.51-3.10.73_1.el6.elrepo.x86_64_gb019b03.x86_64
lustre-source-2.7.51-3.10.73_1.el6.elrepo.x86_64_gb019b03.x86_64
lustre-osd-zfs-2.7.51-3.10.73_1.el6.elrepo.x86_64_gb019b03.x86_64
lustre-modules-2.7.51-3.10.73_1.el6.elrepo.x86_64_gb019b03.x86_64
lustre-tests-2.7.51-3.10.73_1.el6.elrepo.x86_64_gb019b03.x86_64


Attachments: PNG File 23-6.png     File log.28119.gz     Text File lustre-logs.tgz     HTML File messages13     HTML File panda-oss-23-6_messages    
Issue Links:
Duplicate
Related
Severity: 4
Rank (Obsolete): 9223372036854775807

 Description   

With no sign or indication, i.e. no lustre-log or error messages, the OSS unexpectedly crashed (please see the console image).

/var/log/messages is attached



 Comments   
Comment by Haisong Cai (Inactive) [ 08/May/15 ]

I'd like to add an observation:

In the 2 cases so far, which happened on 2 separate OSSes, the OSS crashed and was brought back, the OSTs were mounted, and then almost immediately it crashed again.

The console image was taken from second crash.

Haisong

Comment by Bruno Faccini (Inactive) [ 08/May/15 ]

Hello Haisong,
According to the messages/syslog file you attached, it seems that there is no crash-dump tool (kdump?) enabled on the OSS, right? If that is the case, would it be possible to set this up in order to take a vmcore image for post-mortem analysis upon a new occurrence?

The only interesting info found in the messages file is:

May  7 15:18:13 panda-oss-23-6 kernel: LustreError: 26222:0:(client.c:173:__ptlrpc_prep_bulk_page()) ASSERTION( pageoffset + len <= ((1UL) << 12) ) failed:
May  7 15:18:13 panda-oss-23-6 kernel: LustreError: 26222:0:(client.c:173:__ptlrpc_prep_bulk_page()) LBUG
May  7 15:18:13 panda-oss-23-6 kernel: Pid: 26222, comm: ll_ost_io00_013
May  7 15:18:13 panda-oss-23-6 kernel:
May  7 15:18:13 panda-oss-23-6 kernel: Call Trace:
May  7 15:18:13 panda-oss-23-6 kernel: [<ffffffffa094b857>] libcfs_debug_dumpstack+0x57/0x80 [libcfs]
May  7 15:18:13 panda-oss-23-6 kernel: [<ffffffffa094bdd7>] lbug_with_loc+0x47/0xc0 [libcfs]
May  7 15:18:13 panda-oss-23-6 kernel: [<ffffffffa0fe921b>] __ptlrpc_prep_bulk_page+0xcb/0x190 [ptlrpc]
May  7 15:18:13 panda-oss-23-6 kernel: [<ffffffffa105f590>] tgt_brw_read+0xab0/0x11d0 [ptlrpc]
May  7 15:18:13 panda-oss-23-6 kernel: [<ffffffffa0ffa486>] ? lustre_pack_reply_flags+0xa6/0x1e0 [ptlrpc]
May  7 15:18:13 panda-oss-23-6 kernel: [<ffffffff8109518d>] ? sched_clock_cpu+0xcd/0x110
May  7 15:18:13 panda-oss-23-6 kernel: [<ffffffffa105c7ae>] tgt_handle_request0+0x9e/0x3f0 [ptlrpc]
May  7 15:18:13 panda-oss-23-6 kernel: [<ffffffffa10576c0>] ? tgt_handle_recovery+0x30/0x360 [ptlrpc]
May  7 15:18:13 panda-oss-23-6 kernel: [<ffffffffa105d371>] tgt_request_handle+0x1c1/0x770 [ptlrpc]
May  7 15:18:13 panda-oss-23-6 kernel: [<ffffffffa100a5e3>] ptlrpc_server_handle_request+0x2e3/0xbc0 [ptlrpc]
May  7 15:18:13 panda-oss-23-6 kernel: [<ffffffffa094c3de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
May  7 15:18:13 panda-oss-23-6 kernel: [<ffffffffa095ba0a>] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
May  7 15:18:13 panda-oss-23-6 kernel: [<ffffffffa1003209>] ? ptlrpc_wait_event+0xa9/0x2f0 [ptlrpc]
May  7 15:18:13 panda-oss-23-6 kernel: [<ffffffff8108d273>] ? __wake_up+0x53/0x70
May  7 15:18:13 panda-oss-23-6 kernel: [<ffffffffa100cd3c>] ptlrpc_main+0x9dc/0xd90 [ptlrpc]
May  7 15:18:13 panda-oss-23-6 kernel: [<ffffffffa100c360>] ? ptlrpc_main+0x0/0xd90 [ptlrpc]
May  7 15:18:13 panda-oss-23-6 kernel: [<ffffffff810821be>] kthread+0xce/0xe0
May  7 15:18:13 panda-oss-23-6 kernel: [<ffffffff810820f0>] ? kthread+0x0/0xe0
May  7 15:18:13 panda-oss-23-6 kernel: [<ffffffff815f93c8>] ret_from_fork+0x58/0x90
May  7 15:18:13 panda-oss-23-6 kernel: [<ffffffff810820f0>] ? kthread+0x0/0xe0
May  7 15:18:13 panda-oss-23-6 kernel:

So, did something go wrong during the remote <-> local buffer mapping?
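
For reference, the check that fires is the page-boundary assertion in ptlrpc's bulk descriptor setup. A minimal sketch of that code path (illustrative only, not verbatim from the 2.7 source) looks like:

/* Every bulk fragment handed to the descriptor must fit inside a single
 * 4KB page (PAGE_CACHE_SIZE == (1UL) << 12 on this kernel); here
 * tgt_brw_read() passed an offset/len pair that does not. */
void __ptlrpc_prep_bulk_page(struct ptlrpc_bulk_desc *desc, struct page *page,
                             int pageoffset, int len, int pin)
{
        LASSERT(pageoffset >= 0 && len > 0);
        LASSERT(pageoffset + len <= PAGE_CACHE_SIZE);  /* <-- the LBUG above */

        /* ... append (page, pageoffset, len) to the descriptor's iov ... */
}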

Comment by Haisong Cai (Inactive) [ 08/May/15 ]

One of the OSSes that crashed 2 days ago crashed again this morning. I am attaching the dmesg here.
This dmesg was taken between the first and second crash; the same scenario now appears every time.

Just to list all LBUG we have encounter on this filesystem so far (this morning was panda-oss-25-4)

May 6 09:06:46 panda-oss-25-4 kernel: LustreError: 30076:0:(client.c:173:__ptlrpc_prep_bulk_page()) LBUG
May 6 10:18:54 panda-oss-25-4 kernel: LustreError: 11585:0:(client.c:173:__ptlrpc_prep_bulk_page()) LBUG
May 7 15:18:13 panda-oss-23-6 kernel: LustreError: 26222:0:(client.c:173:__ptlrpc_prep_bulk_page()) LBUG
May 7 16:47:23 panda-oss-23-6 kernel: LustreError: 10154:0:(client.c:173:__ptlrpc_prep_bulk_page()) LBUG
May 8 07:45:59 panda-oss-25-4 kernel: LustreError: 12448:0:(client.c:173:__ptlrpc_prep_bulk_page()) LBUG
May 8 09:12:59 panda-oss-25-4 kernel: LustreError: 10032:0:(client.c:173:__ptlrpc_prep_bulk_page()) LBUG

To answer your question about kdump, no, we don't have kdump enabled on our OSSes/MDS.
We will have to examine our environment before we can enable it. In the meantime, please suggest an alternative way to collect debug information, if there is one.

thanks,
Haisong

Comment by Haisong Cai (Inactive) [ 08/May/15 ]

Adding 2 lustre-logs for this morning's crash, on panda-oss-25-4

Comment by Haisong Cai (Inactive) [ 08/May/15 ]

Another OSS crashed, the third OSS so far. Same pattern: after the first crash the OSS/OSTs were brought back, and in less than a minute it hit a second crash. It then generally stays up and functional for hours until the next crash.

Comment by Peter Jones [ 08/May/15 ]

Mike

Could you please advise on this one?

Thanks

Peter

Comment by Rick Wagner (Inactive) [ 09/May/15 ]

Haisong,

This is hitting wombat (our test partition) when I'm just running IOR. Both partitions have the same stack now, so we may be able to reproduce it quickly. This also rules out pathological client IO on the production side as the cause.

--Rick

Comment by Rick Wagner (Inactive) [ 09/May/15 ]

Mike,

If it's any help, we were running v2.6.92 with a handful of patches related to large block support with ZFS (LU-4820, LU-5278, LU-6038, LU-6038, LU-6152, LU-6155) and did not see this. After 2.7 was released, we moved to v2.7.52 with the patches that hadn't landed yet. So, whatever's going on crept in between v2.6.92 and v2.7.52.

--Rick

Comment by Mikhail Pershin [ 12/May/15 ]

Thanks for the info, I am investigating this.

Comment by Gerrit Updater [ 25/May/15 ]

Mike Pershin (mike.pershin@intel.com) uploaded a new patch: http://review.whamcloud.com/14926
Subject: LU-6584 osd: add more debug to check buffer size
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 9a7be00ed34a2c7f44a42f2603200977b47118d4

Comment by Mikhail Pershin [ 25/May/15 ]

I wasn't able to find the exact location of the problem by inspecting the related code. This patch gathers a little more info when the bug happens.

Comment by Andreas Dilger [ 01/Jun/15 ]

Mike, could you please take a look through the patches between 2.6.92 and 2.7.52 to see if there are any likely candidates?

Comment by Haisong Cai (Inactive) [ 12/Jun/15 ]

We hit another LBUG and it looks like the same kind.
====

Jun 12 01:01:45 panda-oss-25-2 kernel: LustreError: 29442:0:(client.c:173:__ptlrpc_prep_bulk_page()) ASSERTION( pageoffset + len <= ((1UL) << 12) ) failed:
Jun 12 01:01:45 panda-oss-25-2 kernel: LustreError: 29442:0:(client.c:173:__ptlrpc_prep_bulk_page()) LBUG
Jun 12 01:01:45 panda-oss-25-2 kernel: Pid: 29442, comm: ll_ost_io01_105
Jun 12 01:01:45 panda-oss-25-2 kernel:
Jun 12 01:01:45 panda-oss-25-2 kernel: Call Trace:
Jun 12 01:01:45 panda-oss-25-2 kernel: [<ffffffffa0940857>] libcfs_debug_dumpstack+0x57/0x80 [libcfs]
Jun 12 01:01:45 panda-oss-25-2 kernel: [<ffffffffa0940dd7>] lbug_with_loc+0x47/0xc0 [libcfs]
Jun 12 01:01:45 panda-oss-25-2 kernel: [<ffffffffa0fde21b>] __ptlrpc_prep_bulk_page+0xcb/0x190 [ptlrpc]
Jun 12 01:01:45 panda-oss-25-2 kernel: [<ffffffffa1054590>] tgt_brw_read+0xab0/0x11d0 [ptlrpc]
Jun 12 01:01:45 panda-oss-25-2 kernel: [<ffffffffa0fef486>] ? lustre_pack_reply_flags+0xa6/0x1e0 [ptlrpc]
Jun 12 01:01:45 panda-oss-25-2 kernel: [<ffffffff8109518d>] ? sched_clock_cpu+0xcd/0x110
Jun 12 01:01:45 panda-oss-25-2 kernel: [<ffffffffa10517ae>] tgt_handle_request0+0x9e/0x3f0 [ptlrpc]
Jun 12 01:01:46 panda-oss-25-2 kernel: [<ffffffffa104c6c0>] ? tgt_handle_recovery+0x30/0x360 [ptlrpc]
Jun 12 01:01:46 panda-oss-25-2 kernel: [<ffffffffa1052371>] tgt_request_handle+0x1c1/0x770 [ptlrpc]
Jun 12 01:01:46 panda-oss-25-2 kernel: [<ffffffffa0fff5e3>] ptlrpc_server_handle_request+0x2e3/0xbc0 [ptlrpc]
Jun 12 01:01:46 panda-oss-25-2 kernel: [<ffffffffa09413de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Jun 12 01:01:46 panda-oss-25-2 kernel: [<ffffffffa0950a0a>] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
Jun 12 01:01:46 panda-oss-25-2 kernel: [<ffffffffa0ff8209>] ? ptlrpc_wait_event+0xa9/0x2f0 [ptlrpc]
Jun 12 01:01:46 panda-oss-25-2 kernel: [<ffffffff8108d273>] ? __wake_up+0x53/0x70
Jun 12 01:01:46 panda-oss-25-2 kernel: [<ffffffffa1001d3c>] ptlrpc_main+0x9dc/0xd90 [ptlrpc]
Jun 12 01:01:46 panda-oss-25-2 kernel: [<ffffffffa1001360>] ? ptlrpc_main+0x0/0xd90 [ptlrpc]
Jun 12 01:01:46 panda-oss-25-2 kernel: [<ffffffff810821be>] kthread+0xce/0xe0
Jun 12 01:01:46 panda-oss-25-2 kernel: [<ffffffff810820f0>] ? kthread+0x0/0xe0
Jun 12 01:01:46 panda-oss-25-2 kernel: [<ffffffff815f93c8>] ret_from_fork+0x58/0x90
Jun 12 01:01:46 panda-oss-25-2 kernel: [<ffffffff810820f0>] ? kthread+0x0/0xe0
Jun 12 01:01:46 panda-oss-25-2 kernel:
Jun 12 01:01:46 panda-oss-25-2 kernel: Kernel panic - not syncing: LBUG
Jun 12 01:01:46 panda-oss-25-2 kernel: CPU: 11 PID: 29442 Comm: ll_ost_io01_105 Tainted: P O 3.10.73-1.el6.elrepo.x86_64 #1
Jun 12 01:01:46 panda-oss-25-2 kernel: Hardware name: Intel Corporation S2600CP/S2600CP, BIOS SE5C600.86B.02.04.0003.102320141138 10/23/2014
Jun 12 01:01:46 panda-oss-25-2 kernel: 000000006fe000c0 ffff881e2f76fa88 ffffffff815ece4a ffff881e2f76fb08
Jun 12 01:01:46 panda-oss-25-2 kernel: ffffffff815ecbeb ffffffff00000008 ffff881e2f76fb18 ffff881e2f76fab8
Jun 12 01:01:46 panda-oss-25-2 kernel: ffffffffa106cb20 ffff881e2f76fac8 0000000000000000 ffffffffa095f5b1
Jun 12 01:01:46 panda-oss-25-2 kernel: Call Trace:
Jun 12 01:01:46 panda-oss-25-2 kernel: [<ffffffff815ece4a>] dump_stack+0x19/0x1f
Jun 12 01:01:46 panda-oss-25-2 kernel: [<ffffffff815ecbeb>] panic+0xc4/0x1e4
Jun 12 01:01:46 panda-oss-25-2 kernel: [<ffffffffa0940e3b>] lbug_with_loc+0xab/0xc0 [libcfs]
Jun 12 01:01:46 panda-oss-25-2 kernel: [<ffffffffa0fde21b>] __ptlrpc_prep_bulk_page+0xcb/0x190 [ptlrpc]
Jun 12 01:01:46 panda-oss-25-2 kernel: [<ffffffffa1054590>] tgt_brw_read+0xab0/0x11d0 [ptlrpc]
Jun 12 01:01:46 panda-oss-25-2 kernel: [<ffffffffa0fef486>] ? lustre_pack_reply_flags+0xa6/0x1e0 [ptlrpc]
Jun 12 01:01:46 panda-oss-25-2 kernel: [<ffffffff8109518d>] ? sched_clock_cpu+0xcd/0x110
Jun 12 01:01:46 panda-oss-25-2 kernel: [<ffffffffa10517ae>] tgt_handle_request0+0x9e/0x3f0 [ptlrpc]
Jun 12 01:01:46 panda-oss-25-2 kernel: [<ffffffffa104c6c0>] ? tgt_handle_recovery+0x30/0x360 [ptlrpc]

Comment by Andreas Dilger [ 15/Jun/15 ]

Haisong, it looks like you are not yet running the debug patch from Mike (http://review.whamcloud.com/14926) on your systems. It would be useful if you applied that patch (only needed on the servers) so that we can capture more information about this failure.

Comment by Mikhail Pershin [ 29/Jun/15 ]

Rick, what version of ZFS are you using, 0.6.4?

Comment by Rick Wagner (Inactive) [ 29/Jun/15 ]

Hi Mikhail,

Yes, 0.6.4, but with the first large block support pull request, 2865.

Here's the SPL and ZFS build process I used.

git clone https://github.com/zfsonlinux/spl.git
cd spl
git checkout spl-0.6.4
./autogen.sh
./configure --disable-debug
make pkg
rpm -ivh *x86_64.rpm


git clone https://github.com/zfsonlinux/zfs.git
cd zfs
git fetch -t https://github.com/zfsonlinux/zfs.git refs/pull/2865/head:lgblock
git checkout zfs-0.6.4
git merge lgblock
./autogen.sh
./configure --disable-debug
make pkg
rpm -ivh *x86_64.rpm

Comment by Mikhail Pershin [ 02/Jul/15 ]

Rick, thank you for that info, I am trying to reproduce that situation. Meanwhile, was there any other specific tuning in ZFS, e.g. the maximum block size or anything else related to the block size?

Also, what Lustre patches are you using on top of the stock Lustre 2.7.52? I am asking because 2.7.52 can't be built with the ZFS version you are using, since SPA_MAXBLOCKSHIFT is bigger now.

Comment by Rick Wagner (Inactive) [ 02/Jul/15 ]

Mikhail,

For ZFS with large block support, our record size is set to 1024k. On the Lustre side we've got a few patches applied, one or more of which has not landed:

  • LU-4820 osd: drop memcpy in zfs osd
  • LU-6038 osd-zfs: Avoid redefining KM_SLEEP
  • LU-6038 osd-zfs: sa_spill_alloc()/sa_spill_free() compat
  • LU-6152 osd-zfs: ZFS large block compat
  • LU-6155 osd-zfs: dbuf_hold_impl() called without the lock

LU-6152 in particular is the one that deals with SPA_MAXBLOCKSIZE and SPA_MAXBLOCKSHIFT.

At our site, I've handed over the build process to Dima Mishin for maintenance. Dima, can you pass on your current build (base commit, cherry picks, and any manual patches) to Mikhail?

Comment by Dmitry Mishin (Inactive) [ 02/Jul/15 ]

I have a build from commit 8a11cb62 in master, with the patch from a44175d (LU-6155) and cherry-picked c39e02e8 (LU-6584 - debug), plus the latest master commits of zfs and spl (a7b10a931, 77ab5dd33a).

Comment by Minh Diep [ 10/Jul/15 ]

Haisong,

How often do you see this in production? Could you apply the debug patch (http://review.whamcloud.com/14926)?

Thanks
-Minh

Comment by Haisong Cai (Inactive) [ 10/Jul/15 ]

Hi Minh,

Depending on how the file-system is being used, we hit the bug anywhere from every day (as I described when I created this ticket) to several weeks apart.

We are in the process of applying the patch onto our production file-system.

Haisong

Comment by Haisong Cai (Inactive) [ 14/Jul/15 ]

We are very close to getting the debug patch deployed onto our production systems.
But as an FYI, the following is a list of all the times we have hit the LBUG today; each hit is an OSS downtime:

[root@oasis-panda cai]# grep -i lbug /var/log/messages
Jul 13 12:27:36 panda-oss-24-4 kernel: LustreError: 29775:0:(client.c:173:__ptlrpc_prep_bulk_page()) LBUG
Jul 13 12:27:37 panda-oss-24-4 kernel: [<ffffffffa0951dd7>] lbug_with_loc+0x47/0xc0 [libcfs]
Jul 13 12:27:37 panda-oss-24-4 kernel: Kernel panic - not syncing: LBUG
Jul 13 12:48:52 panda-oss-24-4 kernel: LustreError: 10351:0:(client.c:173:__ptlrpc_prep_bulk_page()) LBUG
Jul 13 13:12:49 panda-oss-22-4 kernel: LustreError: 30300:0:(client.c:173:__ptlrpc_prep_bulk_page()) LBUG
Jul 13 13:12:49 panda-oss-22-4 kernel: [<ffffffffa094edd7>] lbug_with_loc+0x47/0xc0 [libcfs]
Jul 13 13:46:07 panda-oss-22-4 kernel: LustreError: 10383:0:(client.c:173:__ptlrpc_prep_bulk_page()) LBUG
Jul 13 13:46:07 panda-oss-22-4 kernel: [<ffffffffa091bdd7>] lbug_with_loc+0x47/0xc0 [libcfs]
Jul 13 17:00:40 panda-oss-25-4 kernel: LustreError: 10468:0:(client.c:173:__ptlrpc_prep_bulk_page()) LBUG
Jul 13 17:00:40 panda-oss-25-4 kernel: [<ffffffffa090fdd7>] lbug_with_loc+0x47/0xc0 [libcfs]
Jul 13 17:35:02 panda-oss-25-4 kernel: LustreError: 10310:0:(client.c:173:__ptlrpc_prep_bulk_page()) LBUG
Jul 13 17:47:21 panda-oss-24-1 kernel: LustreError: 29652:0:(client.c:173:__ptlrpc_prep_bulk_page()) LBUG
Jul 13 17:47:21 panda-oss-24-1 kernel: [<ffffffffa0944dd7>] lbug_with_loc+0x47/0xc0 [libcfs]
Jul 13 19:14:13 panda-oss-24-1 kernel: LustreError: 10101:0:(client.c:173:__ptlrpc_prep_bulk_page()) LBUG
Jul 13 19:14:13 panda-oss-24-1 kernel: [<ffffffffa091edd7>] lbug_with_loc+0x47/0xc0 [libcfs]
Jul 13 19:14:14 panda-oss-24-1 kernel: Kernel panic - not syncing: LBUG
Jul 13 19:14:14 panda-oss-24-1 kernel: [<ffffffffa091ee3b>] lbug_with_loc+0xab/0xc0 [libcfs]
Jul 13 22:48:29 panda-oss-25-4 kernel: LustreError: 10161:0:(client.c:173:__ptlrpc_prep_bulk_page()) LBUG
Jul 13 22:48:29 panda-oss-25-4 kernel: [<ffffffffa0914dd7>] lbug_with_loc+0x47/0xc0 [libcfs]
Jul 13 23:15:00 panda-oss-25-4 kernel: LustreError: 10185:0:(client.c:173:__ptlrpc_prep_bulk_page()) LBUG
Jul 13 23:15:01 panda-oss-25-4 kernel: [<ffffffffa090edd7>] lbug_with_loc+0x47/0xc0 [libcfs]
Jul 13 23:15:01 panda-oss-25-4 kernel: Kernel panic - not syncing: LBUG
Jul 13 23:15:01 panda-oss-25-4 kernel: [<ffffffffa090ee3b>] lbug_with_loc+0xab/0xc0 [libcfs]
Jul 13 23:21:26 panda-oss-25-4 kernel: LustreError: 10265:0:(client.c:173:__ptlrpc_prep_bulk_page()) LBUG
Jul 13 23:21:26 panda-oss-25-4 kernel: [<ffffffffa0916dd7>] lbug_with_loc+0x47/0xc0 [libcfs]

Comment by Haisong Cai (Inactive) [ 14/Jul/15 ]

In all crash cases these 2 lines are always coupled:

Jul 13 23:21:26 panda-oss-25-4 kernel: LustreError: 10265:0:(client.c:173:__ptlrpc_prep_bulk_page()) ASSERTION( pageoffset + len <= ((1UL) << 12) ) failed:
Jul 13 23:21:26 panda-oss-25-4 kernel: LustreError: 10265:0:(client.c:173:__ptlrpc_prep_bulk_page()) LBUG

and most of the time followed by this line:
Jul 13 23:21:26 panda-oss-25-4 kernel: Pid: 10265, comm: ll_ost_io00_031

I am attaching /var/log/messages here that include all incidences from yesterday.

Comment by Haisong Cai (Inactive) [ 14/Jul/15 ]

collected with command line:

cat /var/log/messages | egrep -iv "sshd|cron|run-parts|postfix|rsyslog|audispd|named|rockscommand|channel|alert-handler|411-alert" > /tmp/log.$$

thanks,
Haisong

Comment by Mikhail Pershin [ 18/Jul/15 ]

Haisong, what do you mean by 'OSS downtime'?

Comment by Haisong Cai (Inactive) [ 18/Jul/15 ]

Hi Mikhail,

As I described at the very beginning of this ticket, when the LBUG is hit the OSS crashes.
The machine then has to be power reset, and during that time the OSS is unavailable.

Haisong

Comment by Zhenyu Xu [ 23/Jul/15 ]

Hi, what is the progress on applying the debug patch and collecting the logs again? The latest log is still without the debug patch.

Comment by Rick Wagner (Inactive) [ 23/Jul/15 ]

Zhenyu, we're rolling out an update to our production file system today. After discussion with Minh we've decided to rebase our code on later releases of ZFS and Lustre that have our necessary patches. The only additional patch we've added is the debugging one for this ticket. After we're done, I'll ask Dima Mishin to post the exact releases that we're working from.

Comment by Dmitry Mishin (Inactive) [ 23/Jul/15 ]

We used:
kernel-lt-3.10.73-1.el6.elrepo from kernel-lt
spl 77ab5dd33a99
zfs f1512ee61e2f22
lustre 1ef0185e8c12 with applied dba1c12e debug patch

Comment by Haisong Cai (Inactive) [ 30/Jul/15 ]

We hit another LBUG today. The OSS is running the debug patch.
Here is the /var/log/messages; let me know what else you want me to collect.

Jul 30 15:08:25 panda-oss-25-4 kernel: LustreError: 11719:0:(client.c:211:__ptlrpc_prep_bulk_page()) ASSERTION( pageoffset + len <= PAGE_CACHE_SIZE ) failed: offset 0, len 1913970688
Jul 30 15:08:25 panda-oss-25-4 kernel: LustreError: 11719:0:(client.c:211:__ptlrpc_prep_bulk_page()) LBUG
Jul 30 15:08:26 panda-oss-25-4 kernel: Pid: 11719, comm: ll_ost_io00_045
Jul 30 15:08:26 panda-oss-25-4 kernel:
Jul 30 15:08:26 panda-oss-25-4 kernel: Call Trace:
Jul 30 15:08:26 panda-oss-25-4 kernel: [<ffffffffa0919857>] libcfs_debug_dumpstack+0x57/0x80 [libcfs]
Jul 30 15:08:26 panda-oss-25-4 kernel: [<ffffffffa0919dd7>] lbug_with_loc+0x47/0xc0 [libcfs]
Jul 30 15:08:26 panda-oss-25-4 kernel: [<ffffffffa0ca02fc>] __ptlrpc_prep_bulk_page+0x12c/0x1f0 [ptlrpc]
Jul 30 15:08:26 panda-oss-25-4 kernel: [<ffffffffa0ca03cc>] ptlrpc_prep_bulk_page_nopin+0xc/0x10 [ptlrpc]
Jul 30 15:08:26 panda-oss-25-4 kernel: [<ffffffffa0d1d432>] tgt_brw_read+0xab2/0x1210 [ptlrpc]
Jul 30 15:08:26 panda-oss-25-4 kernel: [<ffffffffa0cb06b6>] ? lustre_pack_reply_flags+0xa6/0x1e0 [ptlrpc]
Jul 30 15:08:26 panda-oss-25-4 kernel: [<ffffffff8109518d>] ? sched_clock_cpu+0xcd/0x110
Jul 30 15:08:26 panda-oss-25-4 kernel: [<ffffffffa0d1aca6>] tgt_handle_request0+0x2c6/0x430 [ptlrpc]
Jul 30 15:08:26 panda-oss-25-4 kernel: [<ffffffffa0d1567a>] ? tgt_handle_recovery+0x14a/0x310 [ptlrpc]
Jul 30 15:08:26 panda-oss-25-4 kernel: [<ffffffffa0d1b025>] tgt_request_handle+0x215/0x990 [ptlrpc]
Jul 30 15:08:26 panda-oss-25-4 kernel: [<ffffffffa0cc07c3>] ptlrpc_server_handle_request+0x2e3/0xbc0 [ptlrpc]
Jul 30 15:08:26 panda-oss-25-4 kernel: [<ffffffffa092760a>] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
Jul 30 15:08:26 panda-oss-25-4 kernel: [<ffffffffa0cb9ba9>] ? ptlrpc_wait_event+0xa9/0x300 [ptlrpc]
Jul 30 15:08:26 panda-oss-25-4 kernel: [<ffffffff8108d273>] ? __wake_up+0x53/0x70
Jul 30 15:08:26 panda-oss-25-4 kernel: [<ffffffffa0cc22cc>] ptlrpc_main+0x9ec/0xd30 [ptlrpc]
Jul 30 15:08:26 panda-oss-25-4 kernel: [<ffffffffa0cc18e0>] ? ptlrpc_main+0x0/0xd30 [ptlrpc]
Jul 30 15:08:26 panda-oss-25-4 kernel: [<ffffffff810821be>] kthread+0xce/0xe0
Jul 30 15:08:26 panda-oss-25-4 kernel: [<ffffffff810820f0>] ? kthread+0x0/0xe0
Jul 30 15:08:26 panda-oss-25-4 kernel: [<ffffffff815f93c8>] ret_from_fork+0x58/0x90
Jul 30 15:08:26 panda-oss-25-4 kernel: [<ffffffff810820f0>] ? kthread+0x0/0xe0
Jul 30 15:08:26 panda-oss-25-4 kernel:

Comment by Haisong Cai (Inactive) [ 30/Jul/15 ]

In a little more than an hour:

Jul 30 15:08:25 panda-oss-25-4 kernel: LustreError: 11719:0:(client.c:211:__ptlrpc_prep_bulk_page()) LBUG
Jul 30 15:25:13 panda-oss-25-4 kernel: LustreError: 9907:0:(client.c:211:__ptlrpc_prep_bulk_page()) LBUG
Jul 30 15:25:13 panda-oss-25-4 kernel: Kernel panic - not syncing: LBUG
Jul 30 16:16:39 panda-oss-25-4 kernel: LustreError: 10230:0:(client.c:211:__ptlrpc_prep_bulk_page()) LBUG

Comment by Andreas Dilger [ 07/Aug/15 ]

Is it worthwhile to print out the whole lnb struct (maybe with neighboring values also) to see if there is memory corruption that can be identified? Is the oops always in the same place? If so, it seems likely that there is some systematic memory corruption (stack overflow, out of bounds array access, etc.) rather than random corruption from another thread, which would cause crashes in other parts of the code.
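
As a purely hypothetical illustration (not the actual 14926 debug patch; field names follow the 2.7-era struct niobuf_local and should be treated as assumptions), dumping the entry that failed plus its neighbors could look like:

/* Hypothetical debug helper: print the offending niobuf_local entry and its
 * neighbors to see whether the corruption is isolated or spans the array. */
static void dump_lnb_around(struct niobuf_local *lnb, int npages, int bad)
{
        int i;

        for (i = max(bad - 1, 0); i <= min(bad + 1, npages - 1); i++)
                CERROR("lnb[%d]: file_offset %llu page_offset %u len %u rc %d flags %x page %p\n",
                       i, (unsigned long long)lnb[i].lnb_file_offset,
                       lnb[i].lnb_page_offset, lnb[i].lnb_len, lnb[i].lnb_rc,
                       lnb[i].lnb_flags, lnb[i].lnb_page);
}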

Comment by Zhenyu Xu [ 12/Aug/15 ]

Hi Haisong,

http://review.whamcloud.com/#/c/14926/ has been updated to include more debug information to catch the LBUG. Would you please apply it, re-hit the issue, and collect a core dump? Please also add the "ha" debug flag before re-running the test:

lctl set_param debug="+ha"

Comment by Haisong Cai (Inactive) [ 12/Aug/15 ]

Hi Zhenyu,

We will try to apply the new debug patch, but since this is our production file-system we will have to plan for it.
We hit this LBUG randomly, depending on how the file-system is used. For example, we just had 7 hits during the last 24hrs,
after a quiet period of about 12 days.

So I have 2 questions:

1) Do you by any chance know a way to reproduce the LBUG? The reason I ask is that we have a non-production file-system that
we may use for this debugging.
2) About collecting a core dump, do you mean a core dump of specific Lustre processes or of the entire system? When the LBUG is hit,
the OSS server panics. I want to make sure we are set up so that the core dump can be saved properly.

I'd like to remind everyone that this LBUG event usually occurs 2 or 3 times consecutively before things stabilize, meaning: the OSS crashes, we stand the OSS up,
it crashes again, we stand the OSS up again, and sometimes a third time. Here are the examples we had over the last 24hrs.
Please note the server names and time stamps:

[root@oasis-panda log]# grep -i lbug messages
Aug 11 12:32:47 panda-oss-25-4 kernel: LustreError: 10267:0:(client.c:211:__ptlrpc_prep_bulk_page()) LBUG
Aug 11 12:32:47 panda-oss-25-4 kernel: [<ffffffffa0920dd7>] lbug_with_loc+0x47/0xc0 [libcfs]
Aug 11 12:32:47 panda-oss-25-4 kernel: Kernel panic - not syncing: LBUG
Aug 11 12:32:47 panda-oss-25-4 kernel: [<ffffffffa0920e3b>] lbug_with_loc+0xab/0xc0 [libcfs]
Aug 11 13:29:54 panda-oss-25-4 kernel: LustreError: 10206:0:(client.c:211:__ptlrpc_prep_bulk_page()) LBUG
Aug 11 13:29:54 panda-oss-25-4 kernel: [<ffffffffa0927dd7>] lbug_with_loc+0x47/0xc0 [libcfs]
Aug 11 13:29:55 panda-oss-25-4 kernel: Kernel panic - not syncing: LBUG
Aug 11 13:29:55 panda-oss-25-4 kernel: [<ffffffffa0927e3b>] lbug_with_loc+0xab/0xc0 [libcfs]
Aug 11 13:36:50 panda-oss-24-4 kernel: LustreError: 11848:0:(client.c:211:__ptlrpc_prep_bulk_page()) LBUG
Aug 11 13:45:55 panda-oss-24-4 kernel: LustreError: 10229:0:(client.c:211:__ptlrpc_prep_bulk_page()) LBUG
Aug 11 13:45:55 panda-oss-24-4 kernel: [<ffffffffa0923dd7>] lbug_with_loc+0x47/0xc0 [libcfs]
Aug 12 08:09:43 panda-oss-25-3 kernel: LustreError: 11728:0:(client.c:211:__ptlrpc_prep_bulk_page()) LBUG
Aug 12 09:48:28 panda-oss-25-3 kernel: LustreError: 11232:0:(client.c:211:__ptlrpc_prep_bulk_page()) LBUG
Aug 12 09:48:29 panda-oss-25-3 kernel: [<ffffffffa0919dd7>] lbug_with_loc+0x47/0xc0 [libcfs]
Aug 12 09:48:29 panda-oss-25-3 kernel: Kernel panic - not syncing: LBUG
Aug 12 09:48:29 panda-oss-25-3 kernel: [<ffffffffa0919e3b>] lbug_with_loc+0xab/0xc0 [libcfs]
Aug 12 09:52:56 panda-oss-25-3 kernel: LustreError: 10548:0:(client.c:211:__ptlrpc_prep_bulk_page()) LBUG
Aug 12 09:52:56 panda-oss-25-3 kernel: [<ffffffffa0916dd7>] lbug_with_loc+0x47/0xc0 [libcfs]
Aug 12 09:52:58 panda-oss-25-3 kernel: Kernel panic - not syncing: LBUG
Aug 12 09:52:58 panda-oss-25-3 kernel: [<ffffffffa0916e3b>] lbug_with_loc+0xab/0xc0 [libcfs]
[root@oasis-panda log]#

thank,
Haisong

Comment by Zhenyu Xu [ 13/Aug/15 ]

1. Unfortunately we don't know the cause of the LBUG; we just saw that some IO has an invalid page length, and it seems there is some specific memory corruption there, so the debug patch tries to print out the IO page information when something abnormal happens.

2. You can set up kdump to capture the kernel crash dump, as the guide at http://fedoraproject.org/wiki/How_to_use_kdump_to_debug_kernel_crashes describes.

Comment by Andreas Dilger [ 13/Aug/15 ]

If the OSS is crashing repeatedly after startup, it may mean that the bad data is arriving from the client during replay and is not being verified properly. Are there checks up in the ost layer to verify that the niobuf_remote contains valid data before it is used in the OSD? It may be that the corruption is happening on the network or on the client.

Comment by Andreas Dilger [ 20/Aug/15 ]

In the boot log I see you are using ZFS 0.6.4. It looks like there are fixes in 0.6.4.1 and 0.6.4.2 that may be helpful in this case and for LU-6607 (reducing CPU usage on metadata intensive workloads):
https://github.com/zfsonlinux/zfs/releases/tag/zfs-0.6.4.1

Fixed io-spare.sh script for ZED.
Fixed multiple deadlocks which might occur when reclaiming memory.
Fixed excessive CPU usage for meta data heavy workloads when reclaiming the ARC.

https://github.com/zfsonlinux/zfs/releases/tag/zfs-0.6.4.2

Fix panic due to corrupt nvlist when running utilities zfsonlinux/zfs#3335
Fix hard lockup due to infinite loop in zfs_zget() zfsonlinux/zfs#3349
Fix panic on unmount due to iput taskq zfsonlinux/zfs#3281
Improve metadata shrinker performance on pre-3.1 kernels zfsonlinux/zfs#3501
Linux 4.1 compat: use read_iter() / write_iter()
Linux 3.12 compat: NUMA-aware per-superblock shrinker
Fix spurious hung task watchdog stack traces zfsonlinux/zfs#3402
Fix module loading in zfs import systemd service zfsonlinux/zfs#3440
Fix intermittent libzfs_init() failure to open /dev/zfs zfsonlinux/zfs#2556

Also, just to confirm - are these OSS nodes running with 1MB ZFS blocksize?

Comment by Dmitry Mishin (Inactive) [ 20/Aug/15 ]

We're using the f1512ee61e commit from the master ZFS branch (large block support). It's later than 0.6.4.1, and I had problems running with the latest master version.

Comment by Rick Wagner (Inactive) [ 24/Aug/15 ]

Andreas, yes, we're using 1MB block sizes on the ZFS datasets that handle the OSTs.

Comment by Andreas Dilger [ 25/Aug/15 ]

Bobijam,
can you please make a new patch that checks the contents of niobuf_remote when it is first accessed by the OST (tgt_brw_read() and tgt_brw_write()) to verify that the contents are sane, and print out all the values under D_BUFFS debugging. If the values are incorrect a CERROR() should be printed and an -EPROTO error returned to the client, and we can debug this problem as a network corruption.

This niobuf verification should be in a helper function that can also be called before the currently-failing LASSERT() checks are being handled (and elsewhere in the code if you think it is helpful), and those functions can return an EIO error to the caller rather than triggering the LASSERT. At that point the client should resend the BRW RPC due to brw_interpret()->osc_recoverable_error() and hopefully it will succeed on the second try.

While I don't think this is a proper solution, it will at least tell us if the corruption is happening on the client and/or on the network, or in memory on the OSS, and it will potentially allow debugging to continue without the high frequency of OSS failures.
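
A rough sketch of the kind of helper being described (illustrative only; the name, limits, and exact checks are assumptions, not the eventual patch):

/* Illustrative sketch: sanity-check the niobuf_remote array from the wire
 * before it reaches the OSD, so a corrupt request is rejected with -EPROTO
 * instead of tripping an LASSERT deep in the IO path. */
static int tgt_check_niobuf(struct obd_export *exp,
                            struct niobuf_remote *rnb, int niocount)
{
        int i;

        for (i = 0; i < niocount; i++) {
                CDEBUG(D_BUFFS, "rnb[%d]: offset %llu len %u flags %x\n", i,
                       (unsigned long long)rnb[i].rnb_offset,
                       rnb[i].rnb_len, rnb[i].rnb_flags);
                if (rnb[i].rnb_len == 0 ||
                    rnb[i].rnb_len > PTLRPC_MAX_BRW_SIZE) {
                        CERROR("%s: invalid rnb[%d]: offset %llu len %u\n",
                               exp->exp_obd->obd_name, i,
                               (unsigned long long)rnb[i].rnb_offset,
                               rnb[i].rnb_len);
                        return -EPROTO; /* client resends instead of OSS LBUG */
                }
        }
        return 0;
}

Such a helper would be called from tgt_brw_read()/tgt_brw_write() right after the niobuf_remote array is unpacked, before any pages are prepared.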

Comment by Andreas Dilger [ 25/Aug/15 ]

Rick, the other possible avenue for debugging is to disable the 1MB blocksize tunable on one or more of your OST datasets, and see if this correlates to a reduction or elimination of the occurrence of this failure. This is one of the main deltas between your ZFS environment and other ZFS users, so this would allow us to isolate the memory corruption to the code handling 1MB blocksize.

Comment by Zhenyu Xu [ 26/Aug/15 ]

http://review.whamcloud.com/#/c/14926/ has been updated to add more remote/local buffer check.

Comment by Rick Wagner (Inactive) [ 27/Aug/15 ]

We've scheduled a maintenance window for Sep. 8 to roll out this latest patch after testing.

Andreas, I'll consider changing the recordsize on some of the OSTs. The most likely scenario where we get solid information from this is if the LBUG is still hit on one of the OSSes with the changed setting. I am being a little cautious about this since it will mean having ZFS datasets with varying recordsizes. I don't believe the ZFS layer will care, but it's not something I've dealt with before.

Comment by Andreas Dilger [ 17/Sep/15 ]

Hi Rick, any news on this front? Have you looked into upgrading to ZFS 0.6.5 to get the native large block support? The patch http://review.whamcloud.com/15127 "LU-4865 zfs: grow block size by write pattern" should also help performance when dealing with files under 1MB in size.

Comment by Rick Wagner (Inactive) [ 18/Sep/15 ]

Hi Andreas, since our last update to the code tree based on http://review.whamcloud.com/#/c/14926/ we've been stable. It's possible that we've pulled in a bugfix along with the debugging patch although I couldn't point to a specific one.

We are looking at ZFS 0.6.5 to get away from the unreleased version for ZFS we've had to run. I would probably do that along with another rebase to a later unpatched tag of Lustre, maybe once LU-4865 is included.

On a related note, I think this issue could be removed from the 2.8 blocker list, since we started with patched versions of Lustre and ZFS.

Comment by Gerrit Updater [ 30/Sep/15 ]

Mike Pershin (mike.pershin@intel.com) uploaded a new patch: http://review.whamcloud.com/16685
Subject: LU-6584 osd: prevent int type overflow in osd_read_prep()
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 687338302147dad5b09b964b8615a3b3adb78a7d

Comment by Mikhail Pershin [ 30/Sep/15 ]

It seems the reason for this issue is an int type overflow in lnb_rc. Instead of writing (eof - file_offset) directly into lnb_rc, we have to first check that it is not negative.
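
An illustrative before/after sketch of what that means (based on this description and on the bogus "len 1913970688" seen in the logs, not the literal patch text):

/* lnb_rc is an int, while eof and lnb_file_offset are 64-bit values.  If the
 * offset is already past eof, the unsigned 64-bit difference wraps and, once
 * truncated into the int, can land as a huge positive garbage length rather
 * than a negative one, so the old check never fires. */

/* before: */
lnb[i].lnb_rc = eof - lnb[i].lnb_file_offset;   /* wraps, then truncates */
if (lnb[i].lnb_rc < 0)                          /* may never trigger */
        lnb[i].lnb_rc = 0;

/* after: compare against eof first, in 64-bit arithmetic */
if (lnb[i].lnb_file_offset >= eof)
        lnb[i].lnb_rc = 0;
else if (lnb[i].lnb_file_offset + lnb[i].lnb_len > eof)
        lnb[i].lnb_rc = eof - lnb[i].lnb_file_offset;
else
        lnb[i].lnb_rc = lnb[i].lnb_len;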

Comment by Peter Jones [ 03/Oct/15 ]

Will SDSC be able to try this patch out to confirm whether it fixes the issues that they have been experiencing?

Comment by Rick Wagner (Inactive) [ 03/Oct/15 ]

Yes, we're scheduling a PM and will push this out. Could this patch be related to LU-7106? In other words, could the current code create an error that propagates back to the client as ENOSPC even when there's capacity on the OST?

Comment by Mikhail Pershin [ 03/Oct/15 ]

Rick, this particular issue exists in the IO READ code path and is not related to LU-7106. I checked the OSD code quickly and didn't notice other similar issues at first glance.

Comment by Gerrit Updater [ 07/Oct/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16685/
Subject: LU-6584 osd: prevent int type overflow in osd_read_prep()
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: efe3842c76b8041a048457779554ffa5ba76567d

Comment by Peter Jones [ 07/Oct/15 ]

The fix has landed for 2.8. We'll reopen if this issue is still hit on Hyperion. If there is still an issue at SDSC and it is not, as hoped, a duplicate of this issue, then please open a new ticket to track it.
