[LU-15] strange slow IO messages and bad performance Created: 18/Nov/10  Updated: 30/Apr/18  Resolved: 26/Oct/11

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.6
Fix Version/s: Lustre 1.8.6

Type: Bug Priority: Critical
Reporter: Kit Westneat (Inactive) Assignee: Yang Sheng
Resolution: Fixed Votes: 1
Labels: None
Environment:

1.8.4-ddn2 system with a 9900 and async journaling


Attachments: File LU-15.tgz     File messages     File patch.aligned_1M     File patch.aligned_1M     File patch.diff     File patch.ext4     Text File readahead_align_1M.patch    
Issue Links:
Related
is related to LU-918 ensure that BRW requests prevent lock... Closed
is related to LU-10967 MDT page cache management improvements Open
Severity: 2
Rank (Obsolete): 4874

 Description   

We're running into a problem at SLAC that we could use some help on, and I'm not quite sure where to put the ticket. It's a 1.8.4-ddn2 system with a 9900 and async journaling, and we've been seeing a lot of slow IO messages from Lustre. The strange thing is that the actual IO wait times on the sd devices are pretty low, so the problem seems to occur before the IO reaches the disks. We started getting reports of this as the OSTs approached 7-8TB used, out of 16TB total.

There have been some watchdogs which point to the mballoc functions as taking a very long time. As these are 16TB LUNs, they have by default a very large number of inodes. In new installations we have begun tuning this down, but this system was sort of the impetus for that. Mounting the OSTs used to take 30 minutes when we were on 1.8.3, but the mballoc optimizations that landed seem to have helped that quite a bit. Still, the OSTs are only 50% full while just 1% of the inodes are used - definitely way too many inodes.

We are starting to get a lot of heat from the customer as their users are unable to do their full workload due to the slow IO going on. Are there any stats or anything we can get to help? Would more of the call traces help?

Here are some example traces:
Nov 17 20:32:44 psanaoss104 kernel: [16961.336408] [<ffffffff80066e15>] getnstimeofday+0x15/0x40
Nov 17 20:32:44 psanaoss104 kernel: [16961.336424] [<ffffffff80070c29>] io_schedule+0x49/0x80
Nov 17 20:32:44 psanaoss104 kernel: [16961.336432] [<ffffffff800171eb>] sync_buffer+0x3b/0x50
Nov 17 20:32:44 psanaoss104 kernel: [16961.336438] [<ffffffff80070ee5>] __wait_on_bit+0x45/0x80
Nov 17 20:32:44 psanaoss104 kernel: [16961.336442] [<ffffffff800171b0>] sync_buffer+0x0/0x50
Nov 17 20:32:44 psanaoss104 kernel: [16961.336449] [<ffffffff800171b0>] sync_buffer+0x0/0x50
Nov 17 20:32:44 psanaoss104 kernel: [16961.336455] [<ffffffff80070f98>] out_of_line_wait_on_bit+0x78/0x90
Nov 17 20:32:44 psanaoss104 kernel: [16961.336465] [<ffffffff800b95a0>] wake_bit_function+0x0/0x40
Nov 17 20:32:44 psanaoss104 kernel: [16961.336478] [<ffffffff80053b90>] __wait_on_buffer+0x20/0x30
Nov 17 20:32:44 psanaoss104 kernel: [16961.336500] [<ffffffff88b430fa>] ldiskfs_mb_init_cache+0x32a/0x930 [ldiskfs]
Nov 17 20:32:44 psanaoss104 kernel: [16961.336539] [<ffffffff88b437e0>] ldiskfs_mb_init_group+0xe0/0x2a0 [ldiskfs]
Nov 17 20:32:44 psanaoss104 kernel: [16961.336548] [<ffffffff8002569b>] __up_read+0x8b/0xa0
Nov 17 20:32:44 psanaoss104 kernel: [16961.336572] [<ffffffff88b43a8b>] ldiskfs_mb_load_buddy+0xeb/0x400 [ldiskfs]
Nov 17 20:32:44 psanaoss104 kernel: [16961.336599] [<ffffffff88b44b5a>] ldiskfs_mb_regular_allocator+0x1aa/0x570 [ldiskfs]
Nov 17 20:32:44 psanaoss104 kernel: [16961.336638] [<ffffffff88b47689>] ldiskfs_mb_new_blocks+0x219/0x4d0 [ldiskfs]
Nov 17 20:32:44 psanaoss104 kernel: [16961.336649] [<ffffffff8018c31a>] __next_cpu+0x1a/0x30
Nov 17 20:32:44 psanaoss104 kernel: [16961.336654] [<ffffffff80071cb7>] __down_write_nested+0x17/0xb0
Nov 17 20:32:44 psanaoss104 kernel: [16961.336677] [<ffffffff88bc50c7>] ldiskfs_ext_new_extent_cb+0x367/0x680 [fsfilt_ldiskfs]
Nov 17 20:32:44 psanaoss104 kernel: [16961.336714] [<ffffffff88b3183e>] ldiskfs_ext_find_extent+0x28e/0x2d0 [ldiskfs]
Nov 17 20:32:44 psanaoss104 kernel: [16961.336739] [<ffffffff88b31a69>] ldiskfs_ext_walk_space+0x1e9/0x270 [ldiskfs]
Nov 17 20:32:44 psanaoss104 kernel: [16961.336749] [<ffffffff88bc4d60>] ldiskfs_ext_new_extent_cb+0x0/0x680 [fsfilt_ldiskfs]
Nov 17 20:32:44 psanaoss104 kernel: [16961.336771] [<ffffffff88bc1449>] fsfilt_map_nblocks+0xe9/0x120 [fsfilt_ldiskfs]
Nov 17 20:32:44 psanaoss104 kernel: [16961.336777] [<ffffffff800e7dc2>] __rmqueue+0x72/0x120
Nov 17 20:32:44 psanaoss104 kernel: [16961.336786] [<ffffffff800ed2e3>] zone_statistics+0x83/0xa0
Nov 17 20:32:44 psanaoss104 kernel: [16961.336808] [<ffffffff88bc167a>] fsfilt_ldiskfs_map_ext_inode_pages+0x1fa/0x220 [fsfilt_ldiskfs]
Nov 17 20:32:44 psanaoss104 kernel: [16961.336815] [<ffffffff8002569b>] __up_read+0x8b/0xa0
Nov 17 20:32:44 psanaoss104 kernel: [16961.336830] [<ffffffff88b13fb9>] start_this_handle+0x389/0x450 [jbd2]
Nov 17 20:32:44 psanaoss104 kernel: [16961.336853] [<ffffffff88bc16e1>] fsfilt_ldiskfs_map_inode_pages+0x41/0xb0 [fsfilt_ldiskfs]
Nov 17 20:32:44 psanaoss104 kernel: [16961.336875] [<ffffffff88bfe78a>] filter_direct_io+0x46a/0xd50 [obdfilter]
Nov 17 20:32:44 psanaoss104 kernel: [16961.336892] [<ffffffff889d7b10>] filter_quota_acquire+0x0/0x120 [lquota]
Nov 17 20:32:44 psanaoss104 kernel: [16961.336917] [<ffffffff88c014e2>] filter_commitrw_write+0x17a2/0x2b30 [obdfilter]
Nov 17 20:32:44 psanaoss104 kernel: [16961.336926] [<ffffffff800a0348>] find_busiest_group+0x278/0x700

Nov 17 20:32:44 psanaoss104 kernel: [16961.343328] [<ffffffff80071dec>] __down_read+0x8c/0xa9
Nov 17 20:32:44 psanaoss104 kernel: [16961.343340] [<ffffffff800bbd59>] down_read+0x19/0x20
Nov 17 20:32:44 psanaoss104 kernel: [16961.343363] [<ffffffff88b3a606>] ldiskfs_get_blocks+0x56/0x2b0 [ldiskfs]
Nov 17 20:32:44 psanaoss104 kernel: [16961.343393] [<ffffffff88b3b408>] ldiskfs_get_block+0xb8/0x100 [ldiskfs]
Nov 17 20:32:44 psanaoss104 kernel: [16961.343407] [<ffffffff8003969a>] generic_block_bmap+0x3a/0x40
Nov 17 20:32:44 psanaoss104 kernel: [16961.343419] [<ffffffff88976c35>] ksocknal_alloc_tx+0x1f5/0x290 [ksocklnd]
Nov 17 20:32:44 psanaoss104 kernel: [16961.343429] [<ffffffff8003a3dc>] mapping_tagged+0x4c/0x70
Nov 17 20:32:44 psanaoss104 kernel: [16961.343451] [<ffffffff88b38142>] ldiskfs_bmap+0xc2/0xe0 [ldiskfs]
Nov 17 20:32:44 psanaoss104 kernel: [16961.343470] [<ffffffff88b38080>] ldiskfs_bmap+0x0/0xe0 [ldiskfs]
Nov 17 20:32:44 psanaoss104 kernel: [16961.343484] [<ffffffff88c000c9>] filter_commitrw_write+0x389/0x2b30 [obdfilter]
Nov 17 20:32:44 psanaoss104 kernel: [16961.343511] [<ffffffff80068b65>] cache_alloc_refill+0x95/0x1f0
Nov 17 20:32:44 psanaoss104 kernel: [16961.343520] [<ffffffff800a1243>] dequeue_task+0x23/0x50
Nov 17 20:32:44 psanaoss104 kernel: [16961.343557] [<ffffffff88bf90d8>] filter_commitrw+0x58/0x2a0 [obdfilter]

Nov 17 20:32:44 psanaoss104 kernel: [16961.360298] Lustre: Service thread pid 16192 was inactive for 226.00s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
Nov 17 20:32:44 psanaoss104 kernel: [16961.360306] Lustre: Skipped 12 previous similar messages

Thanks,
Kit

FYI, here is the ddn2 patch list:

async_journals_by_default.patch
disable_lru_resize.patch
disable statahead
updated qla2xxx driver to latest
bug23549: allow crc32c hardware support
bug21456: lctl-ping-fix
bug21581: tarball name length fix
bug22729: use %z for size_t
bug22882: mmp might sleep negative time
bug23074: improve autogen
bug23157: change LBUG assertion to error for network problem
bug22117: limit mmp check interval to something rational
bug23064: umount without sync may cause data lost on 2.6.32 kernel
bug21745: fix obdfilter-survey LBUG
bug22820: too many default ACLs break directory access
bug22906: mkfs.lustre failed due to invalid blocks count on a 16TB target
bug20433: mdsrate fails to write after 1.3+M files opened
bug21760: application hung in direct IO
bug22891: objects not getting deleted for files which have been removed
bug23216: memory leak in obdecho
bug23335: limit echo object IDs to fit in f_oid
bug23781: memory leak with odbo
bug20394: fix bad assertion
bug21587: evict instead of LBUG
bug24003: fix cosmetic bug for lfs getstripe
revert bug22656



 Comments   
Comment by Dan Ferber (Inactive) [ 18/Nov/10 ]

Thanks Kit. Engineering is doing initial analysis, and will talk about any additional info needed to help.

Comment by Di Wang [ 18/Nov/10 ]

According to the trace, it seems mballoc is spending an enormous amount of time searching for an extent for the new allocation.

"Mounting the OSTs used to take 30 minutes when we were on 1.8.3, but the mballoc optimizations that landed seem to have helped that quite a bit." Do you mean the fix on 18456?

If you see slow IO again, could you please provide mb_history/mb_groups for the slow OST device (to see how mballoc behaves during the slow IO; usually mballoc is fine at 50% full, so I am not sure what is happening there) and brw_stats (to see what the IO looks like during the slow IO). All of them are under /proc.
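For example, these could be gathered on the affected OSS with something like the following (the ldiskfs proc paths and wildcards are from memory and may differ slightly in this release):

grep . /proc/fs/ldiskfs/*/mb_history > /tmp/mb_history.$(hostname)
grep . /proc/fs/ldiskfs/*/mb_groups  > /tmp/mb_groups.$(hostname)
lctl get_param obdfilter.*.brw_stats > /tmp/brw_stats.$(hostname)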

And please also tell me what is the git url/tag for 1.8.4-ddn2.

Comment by Di Wang [ 18/Nov/10 ]

Btw: are there any other slow stack traces different from the one you posted here? And if there are other suspicious console messages on the slow OST, please post them here as well.

Comment by Kit Westneat (Inactive) [ 18/Nov/10 ]

We've been using Mercurial MQ for versioning, the tarball with .hg tree is available here:
http://eu.ddn.com:8080/lustre/lustre/src/lustre/lustre-1.8.4-ddn2.tar.bz2

Comment by Kit Westneat (Inactive) [ 18/Nov/10 ]

This has the mb_history for all devices and mb_groups for one device. I also ran oprofile during the slow IO. I've additionally included the output of our showall command (similar to lustre-diag).

Comment by Kit Westneat (Inactive) [ 18/Nov/10 ]

While testing iozone, a client hit this LBUG:

Nov 18 15:43:53 psana0107 kernel: LustreError: 28278:0:(socklnd_cb.c:550:ksocknal_process_transmit()) ASSERTION(rc < 0) failed
Nov 18 15:43:53 psana0107 kernel: LustreError: 28278:0:(socklnd_cb.c:550:ksocknal_process_transmit()) LBUG
Nov 18 15:43:53 psana0107 kernel: Pid: 28278, comm: socknal_sd01
Nov 18 15:43:53 psana0107 kernel:
Nov 18 15:43:53 psana0107 kernel: Call Trace:
Nov 18 15:43:53 psana0107 kernel: [<ffffffff884b56a1>] libcfs_debug_dumpstack+0x51/0x60 [libcfs]
Nov 18 15:43:53 psana0107 kernel: [<ffffffff884b5bda>] lbug_with_loc+0x7a/0xd0 [libcfs]
Nov 18 15:43:53 psana0107 kernel: [<ffffffff884bdf40>] tracefile_init+0x0/0x110 [libcfs]
Nov 18 15:43:53 psana0107 kernel: [<ffffffff886a582a>] ksocknal_process_transmit+0x33a/0x640 [ksocklnd]
Nov 18 15:43:53 psana0107 kernel: [<ffffffff886a75cb>] ksocknal_scheduler+0x38b/0x640 [ksocklnd]
Nov 18 15:43:53 psana0107 kernel: [<ffffffff800a0abe>] autoremove_wake_function+0x0/0x2e
Nov 18 15:43:53 psana0107 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Nov 18 15:43:53 psana0107 kernel: [<ffffffff886a7240>] ksocknal_scheduler+0x0/0x640 [ksocklnd]
Nov 18 15:43:53 psana0107 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11

I didn't see it in bz at all. Would the lustre-log be useful? Should I open another ticket for this?

Comment by Liang Zhen (Inactive) [ 18/Nov/10 ]

Kit, I've found a nasty bug in socklnd and will post a fix very soon.

Regards
Liang

Comment by Liang Zhen (Inactive) [ 18/Nov/10 ]

I've posted patch on http://review.whamcloud.com/#change,127
(the description of the patch should be "fix contention on ksock_tx_t")

Description of the problem:

  • If the connection is closed before ksocknal_transmit() returns to ksocknal_process_transmit(), then nobody holds a refcount on conn::ksnc_sock and all pending ZC requests will be finalized by ksocknal_connsock_decref->ksocknal_finalize_zcreq.
  • ksocknal_finalize_zcreq will mark a not-yet-acked ZC request as an error by setting tx::tx_resid = -1.
  • This is a race because ksocknal_process_transmit() checks tx::tx_resid right after calling ksocknal_transmit(), so it can see both tx->tx_resid != 0 and rc == 0 and then hit the later LASSERT(rc < 0).

I've added Jay and Lai as reviewers; I will also file a bug on bugzilla and try to push it to the mainline.

But this bug is not the reason for the bad performance, so we still need to investigate the performance issue.

Regards
Liang

Comment by Ariel Martinez (Inactive) [ 19/Nov/10 ]

Thanks, Liang.

Are there any updates on the performance issue? We're hoping to find a way around this without reformatting the 16TB OSTs with fewer inodes. Or at the very least, a way to get better performance if we attempt to transfer the data to another FS. We can also work on getting you remote screen access to our login sessions on the customer's system (if that'll help).

This should be in the logs that Kit uploaded, but the OST density is a little higher than usual at 15 OSTs per OSS. Here's an overall view.

There are 2 FSes.
ana01: currently in production: psanaoss101-104
ana02: current 'test' production: psanaoss111-114
*Both FSes share the same MDS pair - psanamdt101-102

=================================
[root@psanamdt101 ~]# pdsh -a "mount -t lustre | sort" | dshbak -c
----------------
psanamdt102
----------------
/dev/vg_ana1/mdt-200 on /lustre/ana2/mdt-200 type lustre (0)
/dev/vg_ana1/mdt on /lustre/ana1/mdt type lustre (0)
/dev/vg_ana1/mgs on /lustre/mgs type lustre (0)
----------------
psanaoss101
----------------
/dev/mapper/ost_ana1_0 on /lustre/ana1/ost_0 type lustre (0)
/dev/mapper/ost_ana1_10 on /lustre/ana1/ost_10 type lustre (0)
/dev/mapper/ost_ana1_11 on /lustre/ana1/ost_11 type lustre (0)
/dev/mapper/ost_ana1_12 on /lustre/ana1/ost_12 type lustre (0)
/dev/mapper/ost_ana1_13 on /lustre/ana1/ost_13 type lustre (0)
/dev/mapper/ost_ana1_14 on /lustre/ana1/ost_14 type lustre (0)
/dev/mapper/ost_ana1_1 on /lustre/ana1/ost_1 type lustre (0)
/dev/mapper/ost_ana1_2 on /lustre/ana1/ost_2 type lustre (0)
/dev/mapper/ost_ana1_3 on /lustre/ana1/ost_3 type lustre (0)
/dev/mapper/ost_ana1_4 on /lustre/ana1/ost_4 type lustre (0)
/dev/mapper/ost_ana1_5 on /lustre/ana1/ost_5 type lustre (0)
/dev/mapper/ost_ana1_6 on /lustre/ana1/ost_6 type lustre (0)
/dev/mapper/ost_ana1_7 on /lustre/ana1/ost_7 type lustre (0)
/dev/mapper/ost_ana1_8 on /lustre/ana1/ost_8 type lustre (0)
/dev/mapper/ost_ana1_9 on /lustre/ana1/ost_9 type lustre (0)
----------------
psanaoss102
----------------
/dev/mapper/ost_ana1_15 on /lustre/ana1/ost_15 type lustre (0)
/dev/mapper/ost_ana1_16 on /lustre/ana1/ost_16 type lustre (0)
/dev/mapper/ost_ana1_17 on /lustre/ana1/ost_17 type lustre (0)
/dev/mapper/ost_ana1_18 on /lustre/ana1/ost_18 type lustre (0)
/dev/mapper/ost_ana1_19 on /lustre/ana1/ost_19 type lustre (0)
/dev/mapper/ost_ana1_20 on /lustre/ana1/ost_20 type lustre (0)
/dev/mapper/ost_ana1_21 on /lustre/ana1/ost_21 type lustre (0)
/dev/mapper/ost_ana1_22 on /lustre/ana1/ost_22 type lustre (0)
/dev/mapper/ost_ana1_23 on /lustre/ana1/ost_23 type lustre (0)
/dev/mapper/ost_ana1_24 on /lustre/ana1/ost_24 type lustre (0)
/dev/mapper/ost_ana1_25 on /lustre/ana1/ost_25 type lustre (0)
/dev/mapper/ost_ana1_26 on /lustre/ana1/ost_26 type lustre (0)
/dev/mapper/ost_ana1_27 on /lustre/ana1/ost_27 type lustre (0)
/dev/mapper/ost_ana1_28 on /lustre/ana1/ost_28 type lustre (0)
/dev/mapper/ost_ana1_29 on /lustre/ana1/ost_29 type lustre (0)
----------------
psanaoss103
----------------
/dev/mapper/ost_ana1_30 on /lustre/ana1/ost_30 type lustre (0)
/dev/mapper/ost_ana1_31 on /lustre/ana1/ost_31 type lustre (0)
/dev/mapper/ost_ana1_32 on /lustre/ana1/ost_32 type lustre (0)
/dev/mapper/ost_ana1_33 on /lustre/ana1/ost_33 type lustre (0)
/dev/mapper/ost_ana1_34 on /lustre/ana1/ost_34 type lustre (0)
/dev/mapper/ost_ana1_35 on /lustre/ana1/ost_35 type lustre (0)
/dev/mapper/ost_ana1_36 on /lustre/ana1/ost_36 type lustre (0)
/dev/mapper/ost_ana1_37 on /lustre/ana1/ost_37 type lustre (0)
/dev/mapper/ost_ana1_38 on /lustre/ana1/ost_38 type lustre (0)
/dev/mapper/ost_ana1_39 on /lustre/ana1/ost_39 type lustre (0)
/dev/mapper/ost_ana1_40 on /lustre/ana1/ost_40 type lustre (0)
/dev/mapper/ost_ana1_41 on /lustre/ana1/ost_41 type lustre (0)
/dev/mapper/ost_ana1_42 on /lustre/ana1/ost_42 type lustre (0)
/dev/mapper/ost_ana1_43 on /lustre/ana1/ost_43 type lustre (0)
/dev/mapper/ost_ana1_44 on /lustre/ana1/ost_44 type lustre (0)
----------------
psanaoss104
----------------
/dev/mapper/ost_ana1_45 on /lustre/ana1/ost_45 type lustre (0)
/dev/mapper/ost_ana1_46 on /lustre/ana1/ost_46 type lustre (0)
/dev/mapper/ost_ana1_47 on /lustre/ana1/ost_47 type lustre (0)
/dev/mapper/ost_ana1_48 on /lustre/ana1/ost_48 type lustre (0)
/dev/mapper/ost_ana1_49 on /lustre/ana1/ost_49 type lustre (0)
/dev/mapper/ost_ana1_50 on /lustre/ana1/ost_50 type lustre (0)
/dev/mapper/ost_ana1_51 on /lustre/ana1/ost_51 type lustre (0)
/dev/mapper/ost_ana1_52 on /lustre/ana1/ost_52 type lustre (0)
/dev/mapper/ost_ana1_53 on /lustre/ana1/ost_53 type lustre (0)
/dev/mapper/ost_ana1_54 on /lustre/ana1/ost_54 type lustre (0)
/dev/mapper/ost_ana1_55 on /lustre/ana1/ost_55 type lustre (0)
/dev/mapper/ost_ana1_56 on /lustre/ana1/ost_56 type lustre (0)
/dev/mapper/ost_ana1_57 on /lustre/ana1/ost_57 type lustre (0)
/dev/mapper/ost_ana1_58 on /lustre/ana1/ost_58 type lustre (0)
/dev/mapper/ost_ana1_59 on /lustre/ana1/ost_59 type lustre (0)
==========================

ana02 has the same OST distribution, but we tried to add no_mbcache

/dev/mapper/ost_ana2_0 on /lustre/ana2/ost_0 type lustre (ro,errors=remount-ro,acl,barrier=1,nodelalloc,data=ordered,jqfmt=vfsv0,usrjquota=lquota.user.reserved,grpjquota=lquota.group.reserved)

==========================
[root@psana0107 ~]# lfs df -h /reg/d/ana01
UUID bytes Used Available Use% Mounted on
ana1-MDT0000_UUID 175.0G 966.6M 164.0G 0% /reg/d/ana01[MDT:0]
ana1-OST0000_UUID 14.3T 6.1T 7.5T 42% /reg/d/ana01[OST:0]
ana1-OST0001_UUID 14.3T 7.5T 6.1T 52% /reg/d/ana01[OST:1]
ana1-OST0002_UUID 14.3T 7.6T 6.0T 53% /reg/d/ana01[OST:2]
ana1-OST0003_UUID 14.3T 7.9T 5.7T 55% /reg/d/ana01[OST:3]
ana1-OST0004_UUID 14.3T 7.7T 5.9T 54% /reg/d/ana01[OST:4]
ana1-OST0005_UUID 14.3T 7.5T 6.1T 52% /reg/d/ana01[OST:5]
ana1-OST0006_UUID 14.3T 7.8T 5.8T 54% /reg/d/ana01[OST:6]
ana1-OST0007_UUID 14.3T 7.8T 5.8T 54% /reg/d/ana01[OST:7]
ana1-OST0008_UUID 14.3T 7.1T 6.5T 49% /reg/d/ana01[OST:8]
ana1-OST0009_UUID 14.3T 7.9T 5.7T 55% /reg/d/ana01[OST:9]
ana1-OST000a_UUID 14.3T 6.8T 6.8T 47% /reg/d/ana01[OST:10]
ana1-OST000b_UUID 14.3T 7.0T 6.6T 48% /reg/d/ana01[OST:11]
ana1-OST000c_UUID 14.3T 7.3T 6.3T 50% /reg/d/ana01[OST:12]
ana1-OST000d_UUID 14.3T 6.6T 7.0T 46% /reg/d/ana01[OST:13]
ana1-OST000e_UUID 14.3T 7.6T 6.0T 53% /reg/d/ana01[OST:14]
ana1-OST000f_UUID 14.3T 9.1T 4.5T 63% /reg/d/ana01[OST:15]
ana1-OST0010_UUID 14.3T 8.4T 5.2T 58% /reg/d/ana01[OST:16]
ana1-OST0011_UUID 14.3T 6.8T 6.8T 47% /reg/d/ana01[OST:17]
ana1-OST0012_UUID 14.3T 7.6T 6.0T 53% /reg/d/ana01[OST:18]
ana1-OST0013_UUID 14.3T 7.9T 5.6T 55% /reg/d/ana01[OST:19]
ana1-OST0014_UUID 14.3T 6.3T 7.2T 44% /reg/d/ana01[OST:20]
ana1-OST0015_UUID 14.3T 7.2T 6.4T 50% /reg/d/ana01[OST:21]
ana1-OST0016_UUID 14.3T 7.6T 6.0T 52% /reg/d/ana01[OST:22]
ana1-OST0017_UUID 14.3T 7.3T 6.3T 50% /reg/d/ana01[OST:23]
ana1-OST0018_UUID 14.3T 7.8T 5.8T 54% /reg/d/ana01[OST:24]
ana1-OST0019_UUID 14.3T 7.6T 6.0T 53% /reg/d/ana01[OST:25]
ana1-OST001a_UUID 14.3T 6.8T 6.8T 47% /reg/d/ana01[OST:26]
ana1-OST001b_UUID 14.3T 6.8T 6.8T 47% /reg/d/ana01[OST:27]
ana1-OST001c_UUID 14.3T 7.2T 6.4T 50% /reg/d/ana01[OST:28]
ana1-OST001d_UUID 14.3T 7.3T 6.3T 51% /reg/d/ana01[OST:29]
ana1-OST001e_UUID 14.3T 8.2T 5.4T 57% /reg/d/ana01[OST:30]
ana1-OST001f_UUID 14.3T 8.1T 5.5T 56% /reg/d/ana01[OST:31]
ana1-OST0020_UUID 14.3T 7.7T 5.9T 53% /reg/d/ana01[OST:32]
ana1-OST0021_UUID 14.3T 7.3T 6.3T 50% /reg/d/ana01[OST:33]
ana1-OST0022_UUID 14.3T 7.0T 6.6T 48% /reg/d/ana01[OST:34]
ana1-OST0023_UUID 14.3T 7.7T 5.9T 53% /reg/d/ana01[OST:35]
ana1-OST0024_UUID 14.3T 7.3T 6.3T 51% /reg/d/ana01[OST:36]
ana1-OST0025_UUID 14.3T 6.7T 6.9T 46% /reg/d/ana01[OST:37]
ana1-OST0026_UUID 14.3T 7.4T 6.2T 51% /reg/d/ana01[OST:38]
ana1-OST0027_UUID 14.3T 7.4T 6.2T 51% /reg/d/ana01[OST:39]
ana1-OST0028_UUID 14.3T 7.1T 6.5T 49% /reg/d/ana01[OST:40]
ana1-OST0029_UUID 14.3T 8.3T 5.3T 58% /reg/d/ana01[OST:41]
ana1-OST002a_UUID 14.3T 7.8T 5.8T 54% /reg/d/ana01[OST:42]
ana1-OST002b_UUID 14.3T 6.9T 6.7T 48% /reg/d/ana01[OST:43]
ana1-OST002c_UUID 14.3T 6.5T 7.1T 45% /reg/d/ana01[OST:44]
ana1-OST002d_UUID 14.3T 7.6T 6.0T 52% /reg/d/ana01[OST:45]
ana1-OST002e_UUID 14.3T 8.4T 5.2T 58% /reg/d/ana01[OST:46]
ana1-OST002f_UUID 14.3T 6.9T 6.7T 48% /reg/d/ana01[OST:47]
ana1-OST0030_UUID 14.3T 7.2T 6.4T 50% /reg/d/ana01[OST:48]
ana1-OST0031_UUID 14.3T 7.3T 6.3T 51% /reg/d/ana01[OST:49]
ana1-OST0032_UUID 14.3T 7.0T 6.6T 49% /reg/d/ana01[OST:50]
ana1-OST0033_UUID 14.3T 6.8T 6.8T 47% /reg/d/ana01[OST:51]
ana1-OST0034_UUID 14.3T 7.5T 6.1T 52% /reg/d/ana01[OST:52]
ana1-OST0035_UUID 14.3T 7.4T 6.2T 51% /reg/d/ana01[OST:53]
ana1-OST0036_UUID 14.3T 7.4T 6.2T 51% /reg/d/ana01[OST:54]
ana1-OST0037_UUID 14.3T 6.8T 6.8T 47% /reg/d/ana01[OST:55]
ana1-OST0038_UUID 14.3T 7.5T 6.1T 52% /reg/d/ana01[OST:56]
ana1-OST0039_UUID 14.3T 6.8T 6.8T 47% /reg/d/ana01[OST:57]
ana1-OST003a_UUID 14.3T 7.2T 6.4T 50% /reg/d/ana01[OST:58]
ana1-OST003b_UUID 14.3T 6.7T 6.9T 46% /reg/d/ana01[OST:59]
filesystem summary: 859.6T 442.2T 373.7T 51% /reg/d/ana01

[root@psana0107 ~]# lfs df -h /tmp/lustre/ana02/
UUID bytes Used Available Use% Mounted on
ana2-MDT0000_UUID 175.0G 1.1G 163.9G 0% /tmp/lustre/ana02[MDT:0]
ana2-OST0000_UUID 14.3T 13.3T 315.8G 92% /tmp/lustre/ana02[OST:0]
ana2-OST0001_UUID 14.3T 13.3T 284.1G 92% /tmp/lustre/ana02[OST:1]
ana2-OST0002_UUID 14.3T 12.5T 1.1T 87% /tmp/lustre/ana02[OST:2]
ana2-OST0003_UUID 14.3T 12.3T 1.3T 85% /tmp/lustre/ana02[OST:3]
ana2-OST0004_UUID 14.3T 12.2T 1.4T 85% /tmp/lustre/ana02[OST:4]
ana2-OST0005_UUID 14.3T 12.2T 1.4T 85% /tmp/lustre/ana02[OST:5]
ana2-OST0006_UUID 14.3T 12.6T 1.0T 87% /tmp/lustre/ana02[OST:6]
ana2-OST0007_UUID 14.3T 12.6T 999.1G 88% /tmp/lustre/ana02[OST:7]
ana2-OST0008_UUID 14.3T 12.0T 1.6T 83% /tmp/lustre/ana02[OST:8]
ana2-OST0009_UUID 14.3T 12.0T 1.6T 83% /tmp/lustre/ana02[OST:9]
ana2-OST000a_UUID 14.3T 12.3T 1.2T 86% /tmp/lustre/ana02[OST:10]
ana2-OST000b_UUID 14.3T 12.3T 1.3T 85% /tmp/lustre/ana02[OST:11]
ana2-OST000c_UUID 14.3T 12.4T 1.2T 86% /tmp/lustre/ana02[OST:12]
ana2-OST000d_UUID 14.3T 12.2T 1.4T 84% /tmp/lustre/ana02[OST:13]
ana2-OST000e_UUID 14.3T 12.1T 1.5T 84% /tmp/lustre/ana02[OST:14]
ana2-OST000f_UUID 14.3T 12.3T 1.3T 85% /tmp/lustre/ana02[OST:15]
ana2-OST0010_UUID 14.3T 12.5T 1.1T 87% /tmp/lustre/ana02[OST:16]
ana2-OST0011_UUID 14.3T 11.9T 1.7T 83% /tmp/lustre/ana02[OST:17]
ana2-OST0012_UUID 14.3T 12.5T 1.1T 87% /tmp/lustre/ana02[OST:18]
ana2-OST0013_UUID 14.3T 12.6T 1012.3G 88% /tmp/lustre/ana02[OST:19]
ana2-OST0014_UUID 14.3T 12.2T 1.4T 85% /tmp/lustre/ana02[OST:20]
ana2-OST0015_UUID 14.3T 12.2T 1.4T 84% /tmp/lustre/ana02[OST:21]
ana2-OST0016_UUID 14.3T 12.2T 1.4T 84% /tmp/lustre/ana02[OST:22]
ana2-OST0017_UUID 14.3T 12.4T 1.2T 86% /tmp/lustre/ana02[OST:23]
ana2-OST0018_UUID 14.3T 12.3T 1.3T 86% /tmp/lustre/ana02[OST:24]
ana2-OST0019_UUID 14.3T 12.1T 1.5T 84% /tmp/lustre/ana02[OST:25]
ana2-OST001a_UUID 14.3T 11.9T 1.7T 82% /tmp/lustre/ana02[OST:26]
ana2-OST001b_UUID 14.3T 12.3T 1.3T 86% /tmp/lustre/ana02[OST:27]
ana2-OST001c_UUID 14.3T 12.3T 1.3T 86% /tmp/lustre/ana02[OST:28]
ana2-OST001d_UUID 14.3T 12.0T 1.6T 83% /tmp/lustre/ana02[OST:29]
ana2-OST001e_UUID 14.3T 12.7T 915.5G 88% /tmp/lustre/ana02[OST:30]
ana2-OST001f_UUID 14.3T 12.6T 1012.7G 88% /tmp/lustre/ana02[OST:31]
ana2-OST0020_UUID 14.3T 12.4T 1.2T 86% /tmp/lustre/ana02[OST:32]
ana2-OST0021_UUID 14.3T 12.2T 1.4T 85% /tmp/lustre/ana02[OST:33]
ana2-OST0022_UUID 14.3T 12.3T 1.3T 85% /tmp/lustre/ana02[OST:34]
ana2-OST0023_UUID 14.3T 12.2T 1.4T 85% /tmp/lustre/ana02[OST:35]
ana2-OST0024_UUID 14.3T 12.3T 1.3T 86% /tmp/lustre/ana02[OST:36]
ana2-OST0025_UUID 14.3T 12.3T 1.3T 86% /tmp/lustre/ana02[OST:37]
ana2-OST0026_UUID 14.3T 12.3T 1.3T 85% /tmp/lustre/ana02[OST:38]
ana2-OST0027_UUID 14.3T 12.3T 1.3T 85% /tmp/lustre/ana02[OST:39]
ana2-OST0028_UUID 14.3T 12.1T 1.5T 84% /tmp/lustre/ana02[OST:40]
ana2-OST0029_UUID 14.3T 12.1T 1.5T 84% /tmp/lustre/ana02[OST:41]
ana2-OST002a_UUID 14.3T 12.2T 1.4T 84% /tmp/lustre/ana02[OST:42]
ana2-OST002b_UUID 14.3T 12.4T 1.2T 86% /tmp/lustre/ana02[OST:43]
ana2-OST002c_UUID 14.3T 12.2T 1.4T 84% /tmp/lustre/ana02[OST:44]
ana2-OST002d_UUID 14.3T 12.6T 1.0T 87% /tmp/lustre/ana02[OST:45]
ana2-OST002e_UUID 14.3T 12.9T 681.0G 90% /tmp/lustre/ana02[OST:46]
ana2-OST002f_UUID 14.3T 12.5T 1.1T 87% /tmp/lustre/ana02[OST:47]
ana2-OST0030_UUID 14.3T 12.4T 1.2T 86% /tmp/lustre/ana02[OST:48]
ana2-OST0031_UUID 14.3T 13.1T 550.9G 91% /tmp/lustre/ana02[OST:49]
ana2-OST0032_UUID 14.3T 12.8T 802.7G 89% /tmp/lustre/ana02[OST:50]
ana2-OST0033_UUID 14.3T 12.9T 693.6G 90% /tmp/lustre/ana02[OST:51]
ana2-OST0034_UUID 14.3T 12.1T 1.5T 84% /tmp/lustre/ana02[OST:52]
ana2-OST0035_UUID 14.3T 12.6T 982.2G 88% /tmp/lustre/ana02[OST:53]
ana2-OST0036_UUID : inactive device
ana2-OST0037_UUID 14.3T 12.5T 1.1T 87% /tmp/lustre/ana02[OST:55]
ana2-OST0038_UUID 14.3T 12.6T 975.6G 88% /tmp/lustre/ana02[OST:56]
ana2-OST0039_UUID 14.3T 12.3T 1.3T 86% /tmp/lustre/ana02[OST:57]
ana2-OST003a_UUID 14.3T 12.3T 1.3T 85% /tmp/lustre/ana02[OST:58]
ana2-OST003b_UUID 14.3T 13.3T 338.1G 92% /tmp/lustre/ana02[OST:59]
filesystem summary: 845.2T 731.6T 70.7T 86% /tmp/lustre/ana02

*This OST took a few minutes to become active after mounting ana02*
ana2-OST0036_UUID 14.3T 12.8T 868.3G 89% /tmp/lustre/ana02[OST:54]

Comment by Ariel Martinez (Inactive) [ 19/Nov/10 ]

Here's some info from dumpe2fs from an OST from each FS:

----------------------
ana01 (production):
----------------------
[root@psanaoss101 ~]# tunefs.lustre --dryrun /dev/sdc
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata
Read previous values:
Target: ana1-OST0000
Index: 0
Lustre FS: ana1
Mount type: ldiskfs
Flags: 0x2
(OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=172.21.48.22@tcp mgsnode=172.21.48.21@tcp failover.node=172.21.48.23@tcp failover.node=172.21.48.24@tcp ost.quota_type=ug
Permanent disk data:
Target: ana1-OST0000
Index: 0
Lustre FS: ana1
Mount type: ldiskfs
Flags: 0x2
(OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=172.21.48.22@tcp mgsnode=172.21.48.21@tcp failover.node=172.21.48.23@tcp failover.node=172.21.48.24@tcp ost.quota_type=ug
exiting before disk write.

[root@psanaoss101 ~]# dumpe2fs -h /dev/sdc
dumpe2fs 1.41.10.sun2.ddn5 (24-Feb-2010)
Filesystem volume name: ana1-OST0000
Last mounted on: /
Filesystem UUID: ee987eee-d432-4c80-a88a-b3a2bc4ceebc
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent mmp sparse_super large_file uninit_bg
Filesystem flags: signed_directory_hash
Default mount options: (none)
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 976748544
Block count: 3906963456
Reserved block count: 195348172
Free blocks: 2213861446
Free inodes: 976634987
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 92
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 8192
Inode blocks per group: 512
Filesystem created: Wed May 19 23:16:16 2010
Last mount time: Wed Nov 17 16:21:24 2010
Last write time: Wed Nov 17 16:21:24 2010
Mount count: 10
Maximum mount count: 37
Last checked: Mon Sep 20 14:22:57 2010
Check interval: 15552000 (6 months)
Next check after: Sat Mar 19 14:22:57 2011
Lifetime writes: 233 GB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Required extra isize: 28
Desired extra isize: 28
Journal inode: 8
Default directory hash: half_md4
Directory Hash Seed: b8500284-5da2-43a6-baba-4896d4164263
Journal backup: inode blocks
MMP block number: 1545
MMP update interval: 5
Journal features: journal_incompat_revoke
Journal size: 400M
Journal length: 102400
Journal sequence: 0x0031b101
Journal start: 73111

-------------------
ana02
-------------------
[root@psanaoss111 ~]# tunefs.lustre --dryrun /dev/sdc
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata
Read previous values:
Target: ana2-OST0000
Index: 0
Lustre FS: ana2
Mount type: ldiskfs
Flags: 0x2
(OST )
Persistent mount opts: errors=remount-ro,extents,mballoc,acl,no_mbcache
Parameters: mgsnode=172.21.48.22@tcp mgsnode=172.21.48.21@tcp failover.node=172.21.48.27@tcp failover.node=172.21.48.28@tcp ost.quota_type=ug
Permanent disk data:
Target: ana2-OST0000
Index: 0
Lustre FS: ana2
Mount type: ldiskfs
Flags: 0x2
(OST )
Persistent mount opts: errors=remount-ro,extents,mballoc,acl,no_mbcache
Parameters: mgsnode=172.21.48.22@tcp mgsnode=172.21.48.21@tcp failover.node=172.21.48.27@tcp failover.node=172.21.48.28@tcp ost.quota_type=ug
exiting before disk write.

[root@psanaoss111 ~]# dumpe2fs -h /dev/sdc
dumpe2fs 1.41.10.sun2.ddn5 (24-Feb-2010)
Filesystem volume name: ana2-OST0000
Last mounted on: /
Filesystem UUID: 9f7e5464-13a1-423d-b83f-0cea949ec6eb
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent mmp sparse_super large_file uninit_bg
Filesystem flags: signed_directory_hash
Default mount options: (none)
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 976748544
Block count: 3906963456
Reserved block count: 195348172
Free blocks: 280489453
Free inodes: 976677769
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 92
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 8192
Inode blocks per group: 512
Filesystem created: Fri Aug 6 17:08:14 2010
Last mount time: Thu Nov 18 14:37:47 2010
Last write time: Thu Nov 18 14:37:47 2010
Mount count: 17
Maximum mount count: 28
Last checked: Tue Oct 26 13:07:06 2010
Check interval: 15552000 (6 months)
Next check after: Sun Apr 24 13:07:06 2011
Lifetime writes: 233 GB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Required extra isize: 28
Desired extra isize: 28
Journal inode: 8
Default directory hash: half_md4
Directory Hash Seed: 903db1d2-e38e-4763-a0c1-09ca9b7f65f3
Journal backup: inode blocks
MMP block number: 1545
MMP update interval: 5
Journal features: journal_incompat_revoke
Journal size: 400M
Journal length: 102400
Journal sequence: 0x001756a1
Journal start: 1

Comment by Di Wang [ 19/Nov/10 ]

Thanks for your log. I am investigating it right now. I will get back to you as soon as I can. Btw: you saw this issue on both ana01 and ana02? or just ana01?

Since the reason why mballoc cost so much time is still mysterious to me. So I am not sure whether transferring data to another FS will help or not. But mballoc is very sensitive to the disk layout and fragment, so it might help.

Some customers have been reported disable readonly cache on OST might help here. But not sure whether it can help you, can you try

lctl set_param obdfilter.lustre-OSTXXXX.readcache_max_filesize 0
lctl set_param obdfilter.lustre-OSTXXXX.read_cache_enable 0

and see?

Comment by Ariel Martinez (Inactive) [ 19/Nov/10 ]

Yes, both FSes see the same performance behavior. The fallback is to reformat the FS OSTs with 1/2-3/4 of the default number of inodes, in the hope that the reduced inode count will help the mballoc cache inits, and perhaps to enable flex_bg.

The transfer of the data to another FS is only being entertained because this data is in production and not backed up. The problem is that if we have to reformat the OSTs, it will take multiple weeks to back up 500TB of data at the current degraded performance of 5-20MB/s per OST.
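Rough numbers: at 5-20MB/s per OST across 60 OSTs, that is roughly 0.3-1.2GB/s aggregate, so copying ~500TB would take on the order of 5-19 days of continuous transfer, i.e. multiple weeks at the slower end of that range.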

Comment by Kit Westneat (Inactive) [ 19/Nov/10 ]

Another thing to note is that we have quota_type=ug set on the OSTs and MDTs, but lfs quotacheck has never been run. However, we see messages like:

Nov 19 15:14:46 psanaoss112 kernel: [90298.991874] Lustre: ana2-OST000f: slow quota init 38s due to heavy IO load

I'm not sure if this means anything. Should we just remove the quota_type parameter for now or does it not matter?

Comment by Di Wang [ 19/Nov/10 ]

Could you help me get the fragmentation status of the OST devices on psanaoss104?
e2fsck -E fragcheck -n /dev/mapper/ost_ana1_xx
Thanks.
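A possible way to run that across all the OSTs on that OSS (untested sketch; -n keeps e2fsck read-only, though the MMP feature may still complain while the devices are mounted):

for d in /dev/mapper/ost_ana1_*; do
    e2fsck -fn -E fragcheck "$d" > /tmp/fragcheck.$(basename "$d").log 2>&1
done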

Comment by bschubert [ 19/Nov/10 ]

About the "slow quota init" messages. It came up on lustre-discuss some time ago, which made me to check the code and while I still did not have the time to verify, here is my initial analysis. Probably worth another ticket.

<quote>
I just poked around a bit in the code and above the fsfilt_check_slow() check,
there is also a loop that calls filter_range_is_mapped(). Now this function
calls fs_bmap() and when that eventually goes down to ext3, it might get a
bit slow if another thread should modify that file (check out
linux/fs/inode.c):

/*
 * bmap() is special. It gets used by applications such as lilo and by
 * the swapper to find the on-disk block of a specific piece of data.
 *
 * Naturally, this is dangerous if the block concerned is still in the
 * journal. If somebody makes a swapfile on an ext3 data-journaling
 * filesystem and enables swap, then they may get a nasty shock when the
 * data getting swapped to that swapfile suddenly gets overwritten by
 * the original zero's written out previously to the journal and
 * awaiting writeback in the kernel's buffer cache.
 *
 * So, if we see any bmap calls here on a modified, data-journaled file,
 * take extra steps to flush any blocks which might be in the cache.
 */

I don't know, though, if it can happen that several threads write to the same
file. But if it happens, it gets slow. I wonder if a possible swap file is
worth the effort here... In fact, the reason filter_range_is_mapped() is
called certainly does not require a journal flush in that loop. I will check
myself next week whether journal flushes are ever made due to that, and open
a Lustre bugzilla then. Avoiding all of that should not be difficult.

Cheers,
Bernd
</quote>

If at all, I will have to investigate next week, and that will also be my last chance. If I don't have time for that, @ddn-lustre-team, please open another ticket. Using SystemTap it should be very easy to verify, and the patch should also be simple.

Thanks,
Bernd

Comment by Kit Westneat (Inactive) [ 19/Nov/10 ]

Here's one that was getting slow io messages:
ana1-OST0016: clean, 106714/976748544 files, 2093145661/3906963456 blocks

Comment by Di Wang [ 19/Nov/10 ]

Yes, the quota init message is a little misleading; as Bernd said, it also includes some other time. If you did not set quotas (lfs setquota xxxx) for any users, it should not hurt performance.

Comment by Kit Westneat (Inactive) [ 19/Nov/10 ]

Ok, I think we're starting to narrow in on a network issue as well. I did lnet_selftest on the osses to a client that was seeing bad performance and oss1/2 get very good performance, while oss3/4 get very bad performance. I've asked their network engineer to examine the network configuration for those nodes. Additionally they have some SMC 10Gb cards that are showing a large error count.
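For anyone reproducing this, a minimal lnet_selftest read run looks roughly like the following (the NIDs are placeholders, not this site's addresses; syntax per the standard lst utility):

modprobe lnet_selftest            # on the console node, the client and the OSS
export LST_SESSION=$$
lst new_session read_test
lst add_group clients <client_nid>@tcp
lst add_group servers <oss_nid>@tcp
lst add_batch bulk_read
lst add_test --batch bulk_read --from clients --to servers brw read size=1M
lst run bulk_read
lst stat clients servers & sleep 30; kill $!
lst end_session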

I turned off read and write caching on the OSSes and I haven't seen as many slow IO messages, though I'm not sure if the file system is getting hit fully. I would still like to know why the IO is causing call traces when the sd stats show low latency to the storage.

Comment by bschubert [ 19/Nov/10 ]

Maybe a sysrq-t output helps a bit. Not captured at SLAC, but recently at Diamond, who have the same problem. It clearly shows that ldiskfs_mb_init_cache() was still running hours after mounting the OST. That was also with 1.8.4-ddn2 and they had to go back to 1.8.3-ddn and accept the slow mount time, as 1.8.4 decreased the performance even more.
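For reference, a sysrq-t dump can be captured on a busy OSS with something like (assuming sysrq is enabled on the node):

echo 1 > /proc/sys/kernel/sysrq                      # make sure sysrq is enabled
echo t > /proc/sysrq-trigger                         # dump all task stacks to the kernel log
dmesg > /tmp/sysrq-t.$(hostname).$(date +%s).log     # the dump also lands in /var/log/messages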

Comment by Dan Ferber (Inactive) [ 20/Nov/10 ]

Kit, to your earlier question on whether you should open a separate bug for the client that hit the LBUG while testing iozone - yes, please do open a separate bug for it so Liang can track his work on it there. Thanks.

Comment by Di Wang [ 21/Nov/10 ]

According to the trace, it seems it took a very long time to read the block bitmaps while searching for a new buddy for the allocation. I do not know why yet. But the solution might be to avoid such reads as much as possible (see bug 24183; someone already gave such a suggestion and a patch, and it makes sense to me). They are testing it now, and I would expect the patch to be included in the ldiskfs patch series soon if the test results are good, since some customers have already hit this problem on 1.8.4.

And as I understand it, disabling the read-only cache will actually hit the disk even more. But the read cache can also cause memory pressure, so the bitmaps may be evicted more often and cause more reading; that might explain why you did not see many slow IO messages after you disabled the read-only cache.

Anyway we will follow 24183 here.

Comment by Ashley Pittman (Inactive) [ 22/Nov/10 ]

Just coming up to speed here (and testing my new JIRA account), we have two customers who are experiencing similar but potentially different issues:

Diamond:
8TB LUNs with varying used percentages, but the ones over 70% full are seeing huge mount times on the order of ten minutes. With a patch applied from https://bugzilla.lustre.org/show_bug.cgi?id=18456 the mount time comes down from ~40 minutes per node to ~30 seconds, but the initial write to affected OSTs will hang for approximately ten minutes. Eventually we decided to stick with our release without the patch, as to us it's better for the slowdown to happen synchronously at boot rather than asynchronously at some point in the future. Diamond are seeing significantly lower performance for "full" OSTs than the rest, ~50Mb/s for the 92% OSTs and ~300Mb/s for the 60% OSTs.

Comment by Ariel Martinez (Inactive) [ 22/Nov/10 ]

We are preparing to reformat their 2nd filesystem (ana02) and we need to ensure that the fs doesn't encounter this problem. Can you recommend format options that will prevent this mballoc behavior?
1) Should we attempt to simply format with fewer inodes, if so is there a recommended ratio, i.e. 1/3 less?
2) Is it safe to use flex_bg?

Thanks

Comment by Dan Ferber (Inactive) [ 22/Nov/10 ]

Ashley, welcome to Jira. If you believe we have a separate issue at Diamond than the issue here, go ahead and create a separate bug.

Ariel, I'll see if I can get you an answer on your two questions.

Comment by Di Wang [ 22/Nov/10 ]

1. Actually, I do not think changing the inode/block ratio will fix this problem (I mean it will not help decrease the fragmentation). As for the OST inode count, there is a rule of thumb (a rough worked example follows below):
num_ost_inodes = 4 * <num_mds_inodes> * <default_stripe_count> / <number_osts>
2. Yes, flex_bg is a mainline ext4 feature now. It should be safe to use, but as far as I know nobody has tested it with Lustre yet; at least I have not seen any numbers for it on Lustre. There are indeed some people who suggest putting the block group information on an SSD together with flex_bg. Using flex_bg might help in your case.
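As a rough worked example (the MDS inode count and stripe count below are hypothetical, not measured on this system): with 150M MDS inodes, a default stripe count of 1 and 60 OSTs, the rule gives 4 * 150M * 1 / 60 = 10M inodes per OST, far below the ~977M inodes these OSTs were formatted with. The inode count can be set at format time through the mke2fs options passed by mkfs.lustre, e.g.:

mkfs.lustre --ost --fsname=<fsname> --mgsnode=<mgs_nid> --mkfsoptions="-N 10000000" <device>   # or -i <bytes-per-inode>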

Comment by Ashley Pittman (Inactive) [ 22/Nov/10 ]

I believe it could be summarised as Diamond hitting two bugs, the long mount time and the low performance. They both only happen as the OST reaches capacity and are both mballoc related but can be viewed as separate bugs.

Comment by Dan Ferber (Inactive) [ 22/Nov/10 ]

Got it. Thanks Ashley. I'll do some investigating on these topics, for some general 1.6 to 1.8 migration discussion outside of this bug.

Comment by Di Wang [ 23/Nov/10 ]

After some investigation, I think there are several ways to alleviate the problem:

1. Shrink the read-only cache size on the OSTs; this will cause fewer bitmaps to be evicted under memory pressure. You can shrink the read-only cache size with
lctl set_param obdfilter.*.readcache_max_filesize=$xxxM   # only files whose size is < xxxM will be cached
Also, if you can lower the OST count per OSS, it might help too.

2. Shrink the preallocation table (remove 2048 from the prealloc_table):
echo "4 8 16 32 64 128 256 512 1024" > /proc/fs/ldiskfs/xxxxx/prealloc_table
This makes mballoc search for at most 4M of blocks (1024 blocks) instead of 8M (2048 blocks) when the file size is big, so it can find good blocks quickly instead of trying to load more block bitmaps.

3. Using flex_bg will help too, since it decreases spindle movement when loading the bitmaps.

As for mballoc itself, we might:

1. Stop preallocation when the disk is quite full.
2. Pin the bitmaps in memory, but that does not seem very realistic here: for a 15T OST it needs about 480M to load the bitmaps, and for an OSS with 15 OSTs that is 480 * 15 = 7200M (rough numbers below).
3. Maintain some global 2^N free-block counters across all the groups.
4. I really hope the patch mentioned in bug 24183 can help here:
http://www.spinics.net/lists/linux-ext4/msg18816.html
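For reference, the 480M figure is roughly consistent with the dumpe2fs output above: 3906963456 blocks / 32768 blocks per group ≈ 119232 groups, each with a 4KB block bitmap, i.e. about 466M of bitmaps per 16T OST before any buddy data is counted.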

Thanks
Wangdi

Comment by Dan Ferber (Inactive) [ 23/Nov/10 ]

Thanks Wangdi. Ariel, let us know if Wangdi's suggestions above alleviate this.

Comment by Di Wang [ 24/Nov/10 ]

Stop preallocating new blocks when the disk is nearly full. The patch for suggestion 1.

Comment by Ariel Martinez (Inactive) [ 24/Nov/10 ]

Still trying to test this out and resolving possible network issues that exacerbated the issue. We've actually done the following so far:

lctl set_param obdfilter.*.readcache_max_filesize=0
lctl set_param obdfilter.*.read_cache_enable=0
for ent in $(find /proc/fs/ -name "prealloc"); do echo "4 8 16 32 64 128 256 512 1024" > $ent; cat $ent; done

Comment by Ariel Martinez (Inactive) [ 30/Nov/10 ]

I've been thinking about the mballoc solutions you presented, specifically item 2.
2) Pin the bitmap in memory: currently we have only 24 GB of RAM for 15x16TB OSTs, which I think is just barely enough memory for each OSS. I'm looking into how much memory would be good for this scenario. For 15 OSTs on 1.8.x, I think 32GB is the absolute lower bound that I'd recommend (I'd prefer 48GB). Now consider a failover configuration where a single OSS can take over 30 OSTs. Do you think that going to 48-64 GB of memory is the best way to help, or perhaps solve, this problem?

Thanks

Comment by Di Wang [ 30/Nov/10 ]

So with a 16T OST it will require about 512M to load all the group bitmaps; for 30 OSTs that is 30 * 512M = 15G. If you have at least 48GB of memory on this OSS, we can definitely do that. It will require a small patch.

As you know, the problem is that mballoc always tries to preallocate blocks with the maximum size in the prealloc_table (the default is 2048 blocks, which is 8M), even when the disk is quite fragmented. So if you shrink the prealloc_table manually as I suggested above (btw: you can shrink it further to 512 (2M), or even 256 (1M)), you might not see slow IO messages any more. (Btw: do you still see the slow IO messages, even after you shrank the prealloc_table?) But this shrinking might impact performance, because it will preallocate fewer blocks each time.
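For example, to cap preallocation at 1M (256 blocks) on every OST of an OSS, the same style of loop used earlier would be (untested sketch, using the ldiskfs proc path from the earlier comment):

for ent in /proc/fs/ldiskfs/*/prealloc_table; do
    echo "4 8 16 32 64 128 256" > "$ent"
done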

Btw: is it easy to try a new ldiskfs patch at SLAC? I would suggest you try this patch first:

http://www.spinics.net/lists/linux-ext4/msg18816.html

This patch only remembers the largest available block range (8M) in memory, but it could help in your case, since all the slow IO messages I saw come from locating an 8M block range by walking through all the groups.

Comment by Di Wang [ 01/Dec/10 ]

Ariel, please let me know if you can try this patch.

http://www.spinics.net/lists/linux-ext4/msg18816.html

Also I just made a small patch for pining the group bitmap in memory. If you want try, please let me know.

And could you please run

debugfs -R stats /dev/mapper/ost_ana1_14 (the output is around 18M)

on psanaoss and post the result here. Thanks.

Comment by Ariel Martinez (Inactive) [ 01/Dec/10 ]

It's getting really late and can't keep my eyes open for much longer.
It now seems that there may be a couple of issues at play here. We are able to reproduce the slowness issue and lengthy long ls delay on ana02, but this time omitting all write IO. I'm not so sure that the mballoc issue is playing a factor or if it is no longer manifesting itself when we trigger the thread dump in sysrq_trigger. Since ana02 is only doing read IO, then my best current guess is that there is enough sequential IO spread across all OSTs that the IO is becoming random and adding to the mballoc issue when we resume writes, but I'm trying to avoid dumping everything all at once since these parallel reads are what i causing the significant performance degradation. I'm currently trying to copy the scripts, binary, and sample input files over to a test lab where Shuichi has access. (still have another half hour for 2 compressed data files to transfer).

What triggers the degradation is submitting 100+ jobs over 50 clients where are each thread (job) is processing large 5-20GB data files, some are stripe count=3 while others count=1. The nodes access these files at random but apparently process them via sequential IO. I'm still trying to understand and gather info. There appear to be lots of 4K reads coming through. Let me know what info you need to see and I'll reproduce the issue and collect the stats for you. Let me know if this read-only type use and problem reproduction still indicate the allocator only issue. Also, I'm not sure that we can successfully pin the bitmap to memory until we add more ram (which is currently at 24GB). I'm trying to see how soon we can get that mem upgrade.

Alternatively, if you like we can setup a shared console session where you can remotely login and observe this in real-time with me. This combined with a chat session may move things along very quickly. You can contact me directly at amartinez@ddn.com if you'd like to take advantage of this option.

I'm sorry if some of this doesn't make complete sense, but I wanted to respond to you while I'm still awake.

Thanks

Comment by Di Wang [ 01/Dec/10 ]

If there are only reads, the slow IO messages should not be related to mballoc, which only plays a role on writes. But if mballoc did not work well when the files were written, the files might indeed be fragmented, so subsequent reads will be impacted as well.

As you said you see a lot of 4K reads, something seems not right here. If each job is doing sequential IO, the server should see 1M reads most of the time. Could you please help me get

lctl get_param obdfilter.lustre-OSTxxxx.brw_stats

so we can see what the IO for the job looks like. Maybe we can do this together when I am on site.

Thanks

Comment by Di Wang [ 05/Dec/10 ]

According to the debug log Ariel collected for me, it seems this job (the xtc reader) does sequential IO (mostly reads). But the RPC quality does not look that good. Here is the log I extracted from the server; all the entries come from the same client for the same object (each line is one RPC, and "map xxxx pages from XXXX" gives the IO size and offset for that RPC).

00002000:00001000:9:1291350787.301104:0:25057:0:(fsfilt-ldiskfs.c:1168:fsfilt_ldiskfs_map_ext_inode_pages()) inode 958394151: map 256 pages from 62976
00002000:00001000:9:1291350789.513965:0:25101:0:(fsfilt-ldiskfs.c:1168:fsfilt_ldiskfs_map_ext_inode_pages()) inode 958394151: map 12 pages from 64000
00002000:00001000:13:1291350789.849398:0:25099:0:(fsfilt-ldiskfs.c:1168:fsfilt_ldiskfs_map_ext_inode_pages()) inode 958394151: map 244 pages from 64012
00002000:00001000:1:1291350789.936820:0:25146:0:(fsfilt-ldiskfs.c:1168:fsfilt_ldiskfs_map_ext_inode_pages()) inode 958394151: map 20 pages from 64256
00002000:00001000:13:1291350791.034838:0:25135:0:(fsfilt-ldiskfs.c:1168:fsfilt_ldiskfs_map_ext_inode_pages()) inode 958394151: map 87 pages from 65024
00002000:00001000:7:1291350791.149614:0:25131:0:(fsfilt-ldiskfs.c:1168:fsfilt_ldiskfs_map_ext_inode_pages()) inode 958394151: map 169 pages from 65111
00002000:00001000:11:1291350791.968959:0:25067:0:(fsfilt-ldiskfs.c:1168:fsfilt_ldiskfs_map_ext_inode_pages()) inode 958394151: map 201 pages from 65280
00002000:00001000:11:1291350792.520693:0:25067:0:(fsfilt-ldiskfs.c:1168:fsfilt_ldiskfs_map_ext_inode_pages()) inode 958394151: map 256 pages from 65792
00002000:00001000:7:1291350793.206577:0:25023:0:(fsfilt-ldiskfs.c:1168:fsfilt_ldiskfs_map_ext_inode_pages()) inode 958394151: map 176 pages from 66304
00002000:00001000:9:1291350793.560761:0:25107:0:(fsfilt-ldiskfs.c:1168:fsfilt_ldiskfs_map_ext_inode_pages()) inode 958394151: map 53 pages from 66560
00002000:00001000:7:1291350793.711102:0:25091:0:(fsfilt-ldiskfs.c:1168:fsfilt_ldiskfs_map_ext_inode_pages()) inode 958394151: map 203 pages from 66613
00002000:00001000:10:1291350794.971409:0:25046:0:(fsfilt-ldiskfs.c:1168:fsfilt_ldiskfs_map_ext_inode_pages()) inode 958394151: map 98 pages from 67486
00002000:00001000:13:1291350794.993297:0:25152:0:(fsfilt-ldiskfs.c:1168:fsfilt_ldiskfs_map_ext_inode_pages()) inode 958394151: map 45 pages from 67840
00002000:00001000:9:1291350795.659748:0:25105:0:(fsfilt-ldiskfs.c:1168:fsfilt_ldiskfs_map_ext_inode_pages()) inode 958394151: map 211 pages from 67885
00002000:00001000:9:1291350796.309517:0:25103:0:(fsfilt-ldiskfs.c:1168:fsfilt_ldiskfs_map_ext_inode_pages()) inode 958394151: map 154 pages from 68352
00002000:00001000:6:1291350796.322665:0:25048:0:(fsfilt-ldiskfs.c:1168:fsfilt_ldiskfs_map_ext_inode_pages()) inode 958394151: map 102 pages from 68506
00002000:00001000:15:1291350796.712931:0:25148:0:(fsfilt-ldiskfs.c:1168:fsfilt_ldiskfs_map_ext_inode_pages()) inode 958394151: map 256 pages from 68608
00002000:00001000:11:1291350796.712941:0:25133:0:(fsfilt-ldiskfs.c:1168:fsfilt_ldiskfs_map_ext_inode_pages()) inode 958394151: map 86 pages from 68864
00002000:00001000:6:1291350796.732409:0:25060:0:(fsfilt-ldiskfs.c:1168:fsfilt_ldiskfs_map_ext_inode_pages()) inode 958394151: map 170 pages from 68950
00002000:00001000:6:1291350796.775945:0:25081:0:(fsfilt-ldiskfs.c:1168:fsfilt_ldiskfs_map_ext_inode_pages()) inode 958394151: map 214 pages from 69120
00002000:00001000:5:1291350797.413062:0:25093:0:(fsfilt-ldiskfs.c:1168:fsfilt_ldiskfs_map_ext_inode_pages()) inode 958394151: map 42 pages from 69334
00002000:00001000:15:1291350797.417232:0:25087:0:(fsfilt-ldiskfs.c:1168:fsfilt_ldiskfs_map_ext_inode_pages()) inode 958394151: map 92 pages from 69376
00002000:00001000:6:1291350797.617095:0:25066:0:(fsfilt-ldiskfs.c:1168:fsfilt_ldiskfs_map_ext_inode_pages()) inode 958394151: map 164 pages from 69468
00002000:00001000:14:1291350798.514616:0:25036:0:(fsfilt-ldiskfs.c:1168:fsfilt_ldiskfs_map_ext_inode_pages()) inode 958394151: map 223 pages from 69632
00002000:00001000:3:1291350798.826330:0:25065:0:(fsfilt-ldiskfs.c:1168:fsfilt_ldiskfs_map_ext_inode_pages()) inode 958394151: map 33 pages from 69855
00002000:00001000:6:1291350798.831356:0:25036:0:(fsfilt-ldiskfs.c:1168:fsfilt_ldiskfs_map_ext_inode_pages()) inode 958394151: map 230 pages from 69888

So you can see the client splits 1M (256 pages) into 2 RPCs (sometimes even 3) when several reading threads are running on the same client, so the problem may be in either the read-ahead or the RPC-forming engine. I am also quite interested in the IO pattern of the XTC reader; Ariel will help me get the source code.
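One way to check whether the split goes away on a patched client (assuming the standard osc rpc_stats proc file is available in this release) is to compare the "pages per rpc" histogram before and after a run:

lctl set_param osc.*.rpc_stats=0      # clear the counters
# ... run the XTC reader job ...
lctl get_param osc.*.rpc_stats | grep -A 12 "pages per rpc"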

Comment by Di Wang [ 06/Dec/10 ]

It seems the current read-ahead code (client side) is not aligned to 1M, so it causes 1M IOs to be split into 2 smaller RPCs. I will post a patch to fix that.

Also I got the xtc reader IO pattern. Here is the debug log I extracted from client side.

...................
00000080:00200000:1:1291659080.082367:0:5272:0:(file.c:1426:ll_file_readv()) VFS Op:inode=45035998/3609107566(ffff810217accd60),size=40,offset=3454661206
00000080:00200000:1:1291659080.082425:0:5272:0:(file.c:1426:ll_file_readv()) VFS Op:inode=45035998/3609107566(ffff810217accd60),size=2098520,offset=3454661246
00000080:00200000:9:1291659098.451755:0:5272:0:(file.c:1426:ll_file_readv()) VFS Op:inode=45035998/3609107566(ffff810217accd60),size=40,offset=3469344934
00000080:00200000:9:1291659098.451811:0:5272:0:(file.c:1426:ll_file_readv()) VFS Op:inode=45035998/3609107566(ffff810217accd60),size=2097488,offset=3469344974
00000080:00200000:9:1291659110.445261:0:5272:0:(file.c:1426:ll_file_readv()) VFS Op:inode=45035998/3609107566(ffff810217accd60),size=40,offset=3471442462
00000080:00200000:9:1291659110.445350:0:5272:0:(file.c:1426:ll_file_readv()) VFS Op:inode=45035998/3609107566(ffff810217accd60),size=2097488,offset=3471442502
00000080:00200000:4:1291659112.496273:0:5272:0:(file.c:1426:ll_file_readv()) VFS Op:inode=45035998/3609107566(ffff810217accd60),size=40,offset=3475637518
00000080:00200000:4:1291659112.496327:0:5272:0:(file.c:1426:ll_file_readv()) VFS Op:inode=45035998/3609107566(ffff810217accd60),size=2097488,offset=3475637558
.................

So it indeed does sequential IO (40, 2098520, 40, 2098520, ...); unfortunately, it is not 1M aligned either.

Comment by Di Wang [ 07/Dec/10 ]

patch.aligned_1M is the patch to fix this unaligned read-ahead problem.

Btw: there are probably some other reasons why the OSTs are so busy during XTC reads (slow response to df and ls -l). I will investigate the debug log further.

Comment by Ariel Martinez (Inactive) [ 07/Dec/10 ]

Thanks, Di.

The patched client RPMs have been built and I've asked SLAC about installing them on a couple of nodes. The problem is that they would like to use nodes that are not in the batch queue where I've been submitting 200+ xtcreader jobs. I suppose I could run several jobs manually on the upgraded nodes as I submit the 200 jobs on the other clients.

Comment by Di Wang [ 07/Dec/10 ]

Ariel, could you please hold on a bit? I want to test the patch more locally first, before you test it at SLAC. Thanks.

Comment by Di Wang [ 07/Dec/10 ]

update patch

Comment by Di Wang [ 09/Dec/10 ]

update patch

Comment by Di Wang [ 14/Dec/10 ]

update patch for 1M align read_ahead

Comment by Di Wang [ 17/Dec/10 ]

According to the test results, the patch posted on Dec 14th (read_ahead_align_1M.patch) did improve RPC efficiency: after this patch, 99% of RPCs are 1M now. I will post the patch on bugzilla and try to push it to the mainline.

Actually, the problem we have now is already different from the original description.

There is an application at SLAC called the XTC reader; when it is running, other users cannot use the file system, for example ls -l or df might become very slow, i.e. the OSS is very busy at that time. XTC is basically a read-intensive application: it usually creates 200 or 300 threads, and each thread does sequential reads from a 40G-50G file.

There might be several reasons:

1. All the XTC files are single stripe, and we definitely see an unbalanced OSS load while XTC is running, i.e. those files might not be spread equally among the OSTs. Ariel has already asked them to re-stripe those files (to multi-stripe files) and will try again. Unfortunately, re-striping them might take a long time. (Ariel, please correct me if I say something wrong.)

2. Lustre does not align read-ahead to 1M in some cases, which might cause extra RPCs and therefore extra load on the OSS; with the patch I posted here it should be fixed. (Unfortunately, we could only try this patch on 1 client.)

3. We see a lot of random IO on the server side while XTC is running, and Ariel also told me the DDN 9900 does not perform well on random IO. This might also be a reason why the OSS is so busy here. (NRS might actually help here.)

Comment by Ariel Martinez (Inactive) [ 06/Jan/11 ]

I've asked the customer to apply the patched RPMs to as many clients as possible. They said that they can start today, and I'm waiting on their response and progress.

Thanks

Comment by Ariel Martinez (Inactive) [ 06/Jan/11 ]

Ok, we've upgraded 90% of the clients to include the latest RPC patch. I'm now available to resume testing.

Comment by Di Wang [ 06/Jan/11 ]

Ok, thanks. Can we run the test and get the debug logs as we discussed? On the (patched) client side, get the debug log with the -1 debug level; on the server side, get the debug log with the RPC debug level. Thanks.
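A possible way to collect those (debug mask names per the standard lctl interface; the debug buffer size may need enlarging for -1 logging):

# patched client, full debug
lctl set_param debug=-1
lctl clear
# ... run the workload ...
lctl dk > /tmp/client-debug.$(hostname).log

# server, RPC-level debug
lctl set_param debug=+rpctrace
lctl clear
# ... run the workload ...
lctl dk > /tmp/server-debug.$(hostname).log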

Comment by Ariel Martinez (Inactive) [ 04/Apr/11 ]

Please close this bug. There were multiple factors at work here (OST density, a high number of client read processes, and 16TB LUNs), and we have been testing for the past few months. This past weekend we went ahead and finalized the config using the 1M align patch. It helped relieve some of the random IO on the 9900, and read-ahead caching in 1.8 was also disabled.

Comment by Di Wang [ 04/Apr/11 ]

Closing the bug per Ariel's request.

Comment by Di Wang [ 13/Apr/11 ]

Reopening the issue to push the patch from bug 24183 and the 1M-aligned read-ahead patch.

Comment by Ariel Martinez (Inactive) [ 13/Apr/11 ]

Thanks
We have another customer who has an FS that's 82% full and is starting to get similar stack trace dumps under fairly low IO load:

Wed Apr 13 16:42:52 MDT 2011
*Space usage*
UUID bytes Used Available Use% Mounted on
lustre-MDT0000_UUID 566.2G 2.1G 557.6G 0% /lustre[MDT:0]
lustre-OST0000_UUID 7.2T 5.8T 1.3T 81% /lustre[OST:0]
lustre-OST0001_UUID 7.2T 5.9T 1.2T 82% /lustre[OST:1]
lustre-OST0002_UUID 7.2T 5.8T 1.2T 81% /lustre[OST:2]
lustre-OST0003_UUID 7.2T 5.9T 1.2T 81% /lustre[OST:3]
lustre-OST0004_UUID 7.2T 6.0T 1.1T 83% /lustre[OST:4]
lustre-OST0005_UUID 7.2T 5.9T 1.2T 82% /lustre[OST:5]
lustre-OST0006_UUID 7.2T 5.9T 1.2T 82% /lustre[OST:6]
lustre-OST0007_UUID 7.2T 5.9T 1.1T 83% /lustre[OST:7]
lustre-OST0008_UUID 7.2T 5.9T 1.2T 82% /lustre[OST:8]
lustre-OST0009_UUID 7.2T 5.9T 1.2T 81% /lustre[OST:9]
lustre-OST000a_UUID 7.2T 5.9T 1.2T 82% /lustre[OST:10]
lustre-OST000b_UUID 7.2T 5.8T 1.2T 81% /lustre[OST:11]
filesystem summary: 86.0T 70.7T 14.4T 82% /lustre

*inode usage*
UUID Inodes IUsed IFree IUse% Mounted on
lustre-MDT0000_UUID 148.5M 7.5M 141.0M 5% /lustre[MDT:0]
lustre-OST0000_UUID 345.8M 783.9K 345.0M 0% /lustre[OST:0]
lustre-OST0001_UUID 314.7M 787.1K 313.9M 0% /lustre[OST:1]
lustre-OST0002_UUID 338.4M 788.7K 337.6M 0% /lustre[OST:2]
lustre-OST0003_UUID 333.7M 780.8K 333.0M 0% /lustre[OST:3]
lustre-OST0004_UUID 306.0M 774.9K 305.2M 0% /lustre[OST:4]
lustre-OST0005_UUID 320.7M 784.2K 320.0M 0% /lustre[OST:5]
lustre-OST0006_UUID 328.1M 781.8K 327.3M 0% /lustre[OST:6]
lustre-OST0007_UUID 312.0M 783.5K 311.2M 0% /lustre[OST:7]
lustre-OST0008_UUID 315.1M 782.4K 314.4M 0% /lustre[OST:8]
lustre-OST0009_UUID 333.8M 780.8K 333.1M 0% /lustre[OST:9]
lustre-OST000a_UUID 328.3M 787.3K 327.5M 0% /lustre[OST:10]
lustre-OST000b_UUID 337.0M 783.8K 336.2M 0% /lustre[OST:11]
filesystem summary: 148.5M 7.5M 141.0M 5% /lustre

Apr 13 13:04:18 nmxtp-n-s00008-ge kernel: [166511.428293] Lustre: Service thread pid 10524 completed after 653.64s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Apr 13 13:04:18 nmxtp-n-s00008-ge kernel: [166511.428599] Lustre: Skipped 11 previous similar messages
Apr 13 13:04:18 nmxtp-n-s00008-ge kernel: [166511.517026] Lustre: lustre-OST000a: slow direct_io 678s due to heavy IO load
Apr 13 13:04:55 nmxtp-n-s00006-ge kernel: [166565.359624] Lustre: Service thread pid 8924 was inactive for 514.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Apr 13 13:04:55 nmxtp-n-s00006-ge kernel: [166565.359879] Pid: 8924, comm: ll_ost_io_174
Apr 13 13:04:55 nmxtp-n-s00006-ge kernel: [166565.359955]
Apr 13 13:04:55 nmxtp-n-s00006-ge kernel: [166565.359956] Call Trace:
Apr 13 13:04:55 nmxtp-n-s00006-ge kernel: [166565.360146] [<ffffffff80070c29>] io_schedule+0x49/0x80
Apr 13 13:04:55 nmxtp-n-s00006-ge kernel: [166565.360230] [<ffffffff800171eb>] sync_buffer+0x3b/0x50
Apr 13 13:04:55 nmxtp-n-s00006-ge kernel: [166565.360311] [<ffffffff80070ee5>] __wait_on_bit+0x45/0x80
Apr 13 13:04:55 nmxtp-n-s00006-ge kernel: [166565.360392] [<ffffffff800171b0>] sync_buffer+0x0/0x50
Apr 13 13:04:55 nmxtp-n-s00006-ge kernel: [166565.360543] [<ffffffff800171b0>] sync_buffer+0x0/0x50
Apr 13 13:04:55 nmxtp-n-s00006-ge kernel: [166565.360651] [<ffffffff80070f98>] out_of_line_wait_on_bit+0x78/0x90
Apr 13 13:04:55 nmxtp-n-s00006-ge kernel: [166565.360737] [<ffffffff800b95a0>] wake_bit_function+0x0/0x40
Apr 13 13:04:55 nmxtp-n-s00006-ge kernel: [166565.360824] [<ffffffff80053b90>] __wait_on_buffer+0x20/0x30
Apr 13 13:04:55 nmxtp-n-s00006-ge kernel: [166565.360918] [<ffffffff88a2a0fa>] ldiskfs_mb_init_cache+0x32a/0x930 [ldiskfs]
Apr 13 13:04:55 nmxtp-n-s00006-ge kernel: [166565.361091] [<ffffffff88a2ab30>] ldiskfs_mb_load_buddy+0x190/0x400 [ldiskfs]
Apr 13 13:04:56 nmxtp-n-s00006-ge kernel: [166565.361254] [<ffffffff88a2bb5a>] ldiskfs_mb_regular_allocator+0x1aa/0x570 [ldiskfs]
Apr 13 13:04:56 nmxtp-n-s00006-ge kernel: [166565.361428] [<ffffffff88a2e689>] ldiskfs_mb_new_blocks+0x219/0x4d0 [ldiskfs]
Apr 13 13:04:56 nmxtp-n-s00006-ge kernel: [166565.361646] [<ffffffff80012063>] __find_get_block_slow+0x103/0x120
Apr 13 13:04:56 nmxtp-n-s00006-ge kernel: [166565.361739] [<ffffffff80071cb7>] __down_write_nested+0x17/0xb0
Apr 13 13:04:56 nmxtp-n-s00006-ge kernel: [166565.361832] [<ffffffff88aac0c7>] ldiskfs_ext_new_extent_cb+0x367/0x680 [fsfilt_ldiskfs]
Apr 13 13:04:56 nmxtp-n-s00006-ge kernel: [166565.361983] [<ffffffff8001c2e8>] __getblk+0x28/0x270
Apr 13 13:04:56 nmxtp-n-s00006-ge kernel: [166565.362080] [<ffffffff88a1883e>] ldiskfs_ext_find_extent+0x28e/0x2d0 [ldiskfs]
Apr 13 13:04:56 nmxtp-n-s00006-ge kernel: [166565.362242] [<ffffffff88a18a69>] ldiskfs_ext_walk_space+0x1e9/0x270 [ldiskfs]
Apr 13 13:04:56 nmxtp-n-s00006-ge kernel: [166565.362396] [<ffffffff88aabd60>] ldiskfs_ext_new_extent_cb+0x0/0x680 [fsfilt_ldiskfs]
Apr 13 13:04:56 nmxtp-n-s00006-ge kernel: [166565.362645] [<ffffffff88aa8449>] fsfilt_map_nblocks+0xe9/0x120 [fsfilt_ldiskfs]
Apr 13 13:04:56 nmxtp-n-s00006-ge kernel: [166565.362811] [<ffffffff8885c7bf>] kiblnd_check_sends+0x38f/0x4a0 [ko2iblnd]
Apr 13 13:04:56 nmxtp-n-s00006-ge kernel: [166565.362908] [<ffffffff88aa867a>] fsfilt_ldiskfs_map_ext_inode_pages+0x1fa/0x220 [fsfilt_ldiskfs]
Apr 13 13:04:56 nmxtp-n-s00006-ge kernel: [166565.363070] [<ffffffff889fafb9>] start_this_handle+0x389/0x450 [jbd2]
Apr 13 13:04:56 nmxtp-n-s00006-ge kernel: [166565.363165] [<ffffffff88aa86e1>] fsfilt_ldiskfs_map_inode_pages+0x41/0xb0 [fsfilt_ldiskfs]
Apr 13 13:04:56 nmxtp-n-s00006-ge kernel: [166565.363329] [<ffffffff88ae576a>] filter_direct_io+0x46a/0xd50 [obdfilter]
Apr 13 13:04:56 nmxtp-n-s00006-ge kernel: [166565.363420] [<ffffffff888bfb50>] filter_quota_acquire+0x0/0x120 [lquota]
Apr 13 13:04:56 nmxtp-n-s00006-ge kernel: [166565.363563] [<ffffffff88ae84c2>] filter_commitrw_write+0x17a2/0x2b30 [obdfilter]
Apr 13 13:04:56 nmxtp-n-s00006-ge kernel: [166565.363733] [<ffffffff800a0348>] find_busiest_group+0x278/0x700
Apr 13 13:04:56 nmxtp-n-s00006-ge kernel: [166565.363851] [<ffffffff88ae00b8>] filter_commitrw+0x58/0x2a0 [obdfilter]
Apr 13 13:04:56 nmxtp-n-s00006-ge kernel: [166565.363949] [<ffffffff88a8ac1f>] ost_brw_write+0x1c3f/0x23c0 [ost]
Apr 13 13:04:57 nmxtp-n-s00006-ge kernel: [166565.364064] [<ffffffff887cafc8>] ptlrpc_send_reply+0x5f8/0x610 [ptlrpc]
Apr 13 13:04:57 nmxtp-n-s00006-ge kernel: [166565.364167] [<ffffffff887d5a85>] lustre_msg_set_last_committed+0x45/0x120 [ptlrpc]
Apr 13 13:04:57 nmxtp-n-s00006-ge kernel: [166565.364356] [<ffffffff887cf5b0>] lustre_msg_check_version_v2+0x10/0x30 [ptlrpc]
Apr 13 13:04:57 nmxtp-n-s00006-ge kernel: [166565.364543] [<ffffffff800a1c30>] default_wake_function+0x0/0x10
Apr 13 13:04:57 nmxtp-n-s00006-ge kernel: [166565.364652] [<ffffffff88a8dfc3>] ost_handle+0x2c23/0x5680 [ost]
Apr 13 13:04:57 nmxtp-n-s00006-ge kernel: [166565.364762] [<ffffffff887d41e0>] lustre_free_reply_state+0x280/0x290 [ptlrpc]
Apr 13 13:04:57 nmxtp-n-s00006-ge kernel: [166565.364946] [<ffffffff887cf3f5>] lustre_msg_get_opc+0x35/0xf0 [ptlrpc]
Apr 13 13:04:57 nmxtp-n-s00006-ge kernel: [166565.365050] [<ffffffff887dc827>] ptlrpc_server_handle_request+0xaa7/0x10b0 [ptlrpc]
Apr 13 13:04:57 nmxtp-n-s00006-ge kernel: [166565.365205] [<ffffffff8004fe6c>] try_to_wake_up+0x52c/0x550
Apr 13 13:04:57 nmxtp-n-s00006-ge kernel: [166565.365300] [<ffffffff800a1c3d>] default_wake_function+0xd/0x10
Apr 13 13:04:57 nmxtp-n-s00006-ge kernel: [166565.365383] [<ffffffff8009fa44>] __wake_up_common+0x44/0x80
Apr 13 13:04:57 nmxtp-n-s00006-ge kernel: [166565.365508] [<ffffffff80072371>] _spin_lock_bh+0x11/0x30
Apr 13 13:04:57 nmxtp-n-s00006-ge kernel: [166565.365625] [<ffffffff887e030a>] ptlrpc_main+0x12ea/0x14c0 [ptlrpc]
Apr 13 13:04:57 nmxtp-n-s00006-ge kernel: [166565.365722] [<ffffffff800a1c30>] default_wake_function+0x0/0x10
Apr 13 13:04:57 nmxtp-n-s00006-ge kernel: [166565.365812] [<ffffffff8006afb1>] child_rip+0xa/0x11
Apr 13 13:04:57 nmxtp-n-s00006-ge kernel: [166565.365917] [<ffffffff887df020>] ptlrpc_main+0x0/0x14c0 [ptlrpc]
Apr 13 13:04:57 nmxtp-n-s00006-ge kernel: [166565.366000] [<ffffffff8006afa7>] child_rip+0x0/0x11
Apr 13 13:04:57 nmxtp-n-s00006-ge kernel: [166565.366080]
Apr 13 13:04:57 nmxtp-n-s00006-ge kernel: [166565.366292] LustreError: dumping log to /tmp/lustre-log.1302721495.8924
Apr 13 13:06:01 nmxtp-n-s00006-ge kernel: [166631.345270] Lustre: Service thread pid 8855 was inactive for 514.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Apr 13 13:06:01 nmxtp-n-s00006-ge kernel: [166631.345515] Pid: 8855, comm: ll_ost_io_105
Apr 13 13:06:01 nmxtp-n-s00006-ge kernel: [166631.345591]
Apr 13 13:06:01 nmxtp-n-s00006-ge kernel: [166631.345592] Call Trace:
Apr 13 13:06:01 nmxtp-n-s00006-ge kernel: [166631.345771] [<ffffffff8885c7bf>] kiblnd_check_sends+0x38f/0x4a0 [ko2iblnd]
Apr 13 13:06:01 nmxtp-n-s00006-ge kernel: [166631.345864] [<ffffffff80071dec>] __down_read+0x8c/0xa9
Apr 13 13:06:01 nmxtp-n-s00006-ge kernel: [166631.345950] [<ffffffff800bbd59>] down_read+0x19/0x20
Apr 13 13:06:01 nmxtp-n-s00006-ge kernel: [166631.346044] [<ffffffff88a21606>] ldiskfs_get_blocks+0x56/0x2b0 [ldiskfs]
Apr 13 13:06:01 nmxtp-n-s00006-ge kernel: [166631.346202] [<ffffffff88a22408>] ldiskfs_get_block+0xb8/0x100 [ldiskfs]
Apr 13 13:06:01 nmxtp-n-s00006-ge kernel: [166631.346313] [<ffffffff8003969a>] generic_block_bmap+0x3a/0x40
Apr 13 13:06:01 nmxtp-n-s00006-ge kernel: [166631.346397] [<ffffffff8003a3dc>] mapping_tagged+0x4c/0x70
Apr 13 13:06:01 nmxtp-n-s00006-ge kernel: [166631.346489] [<ffffffff88a1f142>] ldiskfs_bmap+0xc2/0xe0 [ldiskfs]
Apr 13 13:06:02 nmxtp-n-s00006-ge kernel: [166631.346580] [<ffffffff88a1f080>] ldiskfs_bmap+0x0/0xe0 [ldiskfs]
Apr 13 13:06:02 nmxtp-n-s00006-ge kernel: [166631.346668] [<ffffffff88ae70a9>] filter_commitrw_write+0x389/0x2b30 [obdfilter]
Apr 13 13:06:02 nmxtp-n-s00006-ge kernel: [166631.346818] [<ffffffff8018c31a>] __next_cpu+0x1a/0x30
Apr 13 13:06:02 nmxtp-n-s00006-ge kernel: [166631.346899] [<ffffffff800a0348>] find_busiest_group+0x278/0x700
Apr 13 13:06:02 nmxtp-n-s00006-ge kernel: [166631.347017] [<ffffffff88ae00b8>] filter_commitrw+0x58/0x2a0 [obdfilter]
Apr 13 13:06:02 nmxtp-n-s00006-ge kernel: [166631.347176] [<ffffffff88a8ac1f>] ost_brw_write+0x1c3f/0x23c0 [ost]
Apr 13 13:06:02 nmxtp-n-s00006-ge kernel: [166631.347310] [<ffffffff887cafc8>] ptlrpc_send_reply+0x5f8/0x610 [ptlrpc]
Apr 13 13:06:02 nmxtp-n-s00006-ge kernel: [166631.347414] [<ffffffff887d5a85>] lustre_msg_set_last_committed+0x45/0x120 [ptlrpc]
Apr 13 13:06:02 nmxtp-n-s00006-ge kernel: [166631.347601] [<ffffffff887cf5b0>] lustre_msg_check_version_v2+0x10/0x30 [ptlrpc]
Apr 13 13:06:02 nmxtp-n-s00006-ge kernel: [166631.347756] [<ffffffff800a1c30>] default_wake_function+0x0/0x10
Apr 13 13:06:02 nmxtp-n-s00006-ge kernel: [166631.347849] [<ffffffff88a8dfc3>] ost_handle+0x2c23/0x5680 [ost]
Apr 13 13:06:02 nmxtp-n-s00006-ge kernel: [166631.347959] [<ffffffff887d41e0>] lustre_free_reply_state+0x280/0x290 [ptlrpc]
Apr 13 13:06:02 nmxtp-n-s00006-ge kernel: [166631.349794] [<ffffffff887cf3f5>] lustre_msg_get_opc+0x35/0xf0 [ptlrpc]
Apr 13 13:06:02 nmxtp-n-s00006-ge kernel: [166631.349899] [<ffffffff887dc827>] ptlrpc_server_handle_request+0xaa7/0x10b0 [ptlrpc]
Apr 13 13:06:02 nmxtp-n-s00006-ge kernel: [166631.350053] [<ffffffff8004fe6c>] try_to_wake_up+0x52c/0x550
Apr 13 13:06:02 nmxtp-n-s00006-ge kernel: [166631.350202] [<ffffffff800a1c3d>] default_wake_function+0xd/0x10
Apr 13 13:06:02 nmxtp-n-s00006-ge kernel: [166631.350312] [<ffffffff8009fa44>] __wake_up_common+0x44/0x80
Apr 13 13:06:02 nmxtp-n-s00006-ge kernel: [166631.350394] [<ffffffff80072371>] _spin_lock_bh+0x11/0x30
Apr 13 13:06:02 nmxtp-n-s00006-ge kernel: [166631.350498] [<ffffffff887e030a>] ptlrpc_main+0x12ea/0x14c0 [ptlrpc]
Apr 13 13:06:03 nmxtp-n-s00006-ge kernel: [166631.350591] [<ffffffff800a1c30>] default_wake_function+0x0/0x10
Apr 13 13:06:03 nmxtp-n-s00006-ge kernel: [166631.350680] [<ffffffff8006afb1>] child_rip+0xa/0x11
Apr 13 13:06:03 nmxtp-n-s00006-ge kernel: [166631.350785] [<ffffffff887df020>] ptlrpc_main+0x0/0x14c0 [ptlrpc]
Apr 13 13:06:03 nmxtp-n-s00006-ge kernel: [166631.350868] [<ffffffff8006afa7>] child_rip+0x0/0x11
Apr 13 13:06:03 nmxtp-n-s00006-ge kernel: [166631.350948]
Apr 13 13:06:03 nmxtp-n-s00006-ge kernel: [166631.351162] LustreError: dumping log to /tmp/lustre-log.1302721561.8855
Apr 13 13:06:46 nmxtp-n-s00006-ge kernel: [166676.694536] Lustre: lustre-OST0005: slow journal start 147s due to heavy IO load
Apr 13 13:06:46 nmxtp-n-s00006-ge kernel: [166676.694543] Lustre: lustre-OST0005: slow journal start 30s due to heavy IO load

Comment by Ariel Martinez (Inactive) [ 13/Apr/11 ]

The user is in the process of changing the default striping from a stripe count of 2 to 1. We're trying to backtrack the changes they made in addition to upgrading Lustre to 1.8.4; they had changed the stripe count from 1 to 2 after they upgraded. They are also in the process of purging files to help alleviate the fragmentation; they estimate they could clear about 20TB. This should buy them some time.

Comment by Ariel Martinez (Inactive) [ 14/Apr/11 ]

The stack traces seem to have decreased somewhat, but there are still ost_io processes in D state and slow IO errors, except that they mostly appear to be "slow setattr" due to heavy load. This is an example of how it looks now, but the errors occur on all servers and OSTs:

Here's an overview of this particular system:
UUID bytes Used Available Use% Mounted on
lustre-MDT0000_UUID 566.2G 2.1G 557.6G 0% /lustre[MDT:0]
lustre-OST0000_UUID 7.2T 4.8T 2.3T 67% /lustre[OST:0]
lustre-OST0001_UUID 7.2T 4.9T 2.2T 68% /lustre[OST:1]
lustre-OST0002_UUID 7.2T 4.8T 2.3T 67% /lustre[OST:2]
lustre-OST0003_UUID 7.2T 4.8T 2.3T 67% /lustre[OST:3]
lustre-OST0004_UUID 7.2T 4.9T 2.2T 68% /lustre[OST:4]
lustre-OST0005_UUID 7.2T 4.9T 2.2T 68% /lustre[OST:5]
lustre-OST0006_UUID 7.2T 4.8T 2.3T 67% /lustre[OST:6]
lustre-OST0007_UUID 7.2T 4.9T 2.2T 68% /lustre[OST:7]
lustre-OST0008_UUID 7.2T 4.9T 2.2T 68% /lustre[OST:8]
lustre-OST0009_UUID 7.2T 4.8T 2.3T 67% /lustre[OST:9]
lustre-OST000a_UUID 7.2T 4.9T 2.2T 67% /lustre[OST:10]
lustre-OST000b_UUID 7.2T 4.8T 2.3T 66% /lustre[OST:11]
filesystem summary: 86.0T 58.3T 26.8T 67% /lustre

*24 GB per OSS*
----------------
nmxtp-n-s00005
----------------
/dev/mpath/lun1 on /mnt/nmxtp-n-s00005/target0 type lustre (rw)
/dev/mpath/lun2 on /mnt/nmxtp-n-s00005/target1 type lustre (rw)
/dev/mpath/lun3 on /mnt/nmxtp-n-s00005/target2 type lustre (rw)
----------------
nmxtp-n-s00006
----------------
/dev/mpath/lun4 on /mnt/nmxtp-n-s00006/target3 type lustre (rw)
/dev/mpath/lun5 on /mnt/nmxtp-n-s00006/target4 type lustre (rw)
/dev/mpath/lun6 on /mnt/nmxtp-n-s00006/target5 type lustre (rw)
----------------
nmxtp-n-s00007
----------------
/dev/mpath/lun7 on /mnt/nmxtp-n-s00007/target0 type lustre (rw)
/dev/mpath/lun8 on /mnt/nmxtp-n-s00007/target1 type lustre (rw)
/dev/mpath/lun9 on /mnt/nmxtp-n-s00007/target2 type lustre (rw)
----------------
nmxtp-n-s00008
----------------
/dev/mpath/lun10 on /mnt/nmxtp-n-s00008/target3 type lustre (rw)
/dev/mpath/lun11 on /mnt/nmxtp-n-s00008/target4 type lustre (rw)
/dev/mpath/lun12 on /mnt/nmxtp-n-s00008/target5 type lustre (rw)

*The behavior after they purged from 82% to 68% use*
11890 ? D 1:25 [ll_ost_io_43]
11905 ? D 1:25 [ll_ost_io_58]
11907 ? D 1:26 [ll_ost_io_60]
11924 ? D 1:26 [ll_ost_io_77]
11949 ? D 1:26 [ll_ost_io_102]
12025 ? D 1:26 [ll_ost_io_178]
12027 ? D 1:25 [ll_ost_io_180]
12068 ? D 1:27 [ll_ost_io_221]
12075 ? D 1:30 [ll_ost_io_228]
12092 ? D 1:26 [ll_ost_io_245]

Apr 13 23:33:00 nmxtp-n-s00005-ge kernel: [205304.612577] Lustre: lustre-OST0000: slow setattr 31s due to heavy IO load
Apr 13 23:33:14 nmxtp-n-s00005-ge kernel: [205318.443304] Lustre: lustre-OST0002: slow setattr 45s due to heavy IO load
Apr 13 23:33:24 nmxtp-n-s00005-ge kernel: [205328.571388] Lustre: lustre-OST0002: slow setattr 55s due to heavy IO load
Apr 13 23:33:24 nmxtp-n-s00007-ge kernel: [205355.169231] Lustre: lustre-OST0008: slow setattr 55s due to heavy IO load
Apr 13 23:33:27 nmxtp-n-s00007-ge kernel: [205358.219174] Lustre: lustre-OST0008: slow setattr 58s due to heavy IO load

Comment by Ariel Martinez (Inactive) [ 14/Apr/11 ]

Uploaded logs and stack traces from yesterday to Di's server. This is before the customer purged the filesystem.
*lctl and lustre logs in /tmp*
2011-04-13_nmxtp-n-s00005_lustrelog.bz2
2011-04-13_nmxtp-n-s00006_lustrelog.bz2
2011-04-13_nmxtp-n-s00007_lustrelog.bz2
2011-04-13_nmxtp-n-s00008_lustrelog.bz2

*syslog server that catches messages from both clients and servers*
2011-04-13_syslog_messages.bz2

*uploaded current syslog after the purge*
2011-04-14_01_syslog-messages_afterpurge.bz2

Comment by Ariel Martinez (Inactive) [ 14/Apr/11 ]

I have a meeting with the customer in the morning and I'll see what more info I can get. Hopefully the user has other jobs in the queue that we can run to see whether the slow IO and ldiskfs stack traces have decreased after the purge. A few things I need to know today:
1) Whether the ldiskfs patch should be integrated into our series as is: http://www.spinics.net/lists/linux-ext4/msg18816.html
2) Whether Whamcloud can test this before we attempt to roll it into our series and build it
3) Any other suggested path

Hopefully we've bought some time with tonight's purge, disabling readahead cache, and decreasing the preallocation table to exclude 2MB. I need to devise a timeline for the customer soon and determine when they can apply this patch. Hopefully the upgrade will happen before their FS fills up again. Please let me know.

Thanks

Comment by Matt Ezell [ 14/Apr/11 ]

We saw performance issues on Cray Lustre 1.6 when OSTs got fairly full (around 80%). It was traced back to slow loading of the block bitmaps.

The workaround we use is to spawn off a background debugfs process against each OST immediately after mount:

debugfs -R stats /dev/<device> > /dev/null

BZ22710 (private between Cray and Oracle, unfortunately) was opened, and I've been told it has transitioned to an enhancement request to preemptively and asynchronously read in the filesystem bitmaps at mount time.
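
For reference, a minimal sketch of backgrounding this against every OST device right after mount (the /dev/mpath/lun* glob is only a placeholder for the actual OST devices):

for dev in /dev/mpath/lun*; do
    nohup debugfs -R stats $dev > /dev/null 2>&1 &
done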

Comment by Ariel Martinez (Inactive) [ 14/Apr/11 ]

Hi Matt,

I heard about this as well, but as you say we weren't privy to the details and workarounds. Ultimately we'd like to confirm that this is what's occurring. Now that the FS is at 62%, we're going to try to run the user jobs that trigger the slowdown.

I assume it's safe to do this for each OST even after it's been mounted for a couple of days?
$ nohup debugfs -R stats /dev/mpath/<dev> &> /dev/null &

Thanks,
Ariel

Comment by John Salinas (Inactive) [ 14/Apr/11 ]

It seems the site set req_buffer_history_max=10240 a while ago and has been running with this in production. We need a recommendation to pass along to them on the impact of running with this set all the time. The site's problems seem to have become worse when they upgraded from 1.6 -> 1.8, and it seems that four sites have had issues that appear similar to this. We would like to know what options we have, now that they are at DDN 1.8.4, to help reduce the impact of this - for example disabling caching, running debugfs in the background, etc. These recommendations need to take into account that this would be on a live customer system in a production environment. In general we would like to find out more about the root cause of this issue and how running 1.8 brings it out. Ariel has uploaded syslogs and dumps (lctl and /tmp/Lustre-XXX); please let us know if there is any additional information we could provide.
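
For reference, a quick way to check the current value on a running OSS (a sketch; the exact proc location may vary between versions, hence the find):

for ent in $(find /proc/fs/lustre -name "req_buffer_history_max"); do echo "$ent: $(cat $ent)"; done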

Comment by Matt Ezell [ 14/Apr/11 ]

Ariel,

We have safely run "debugfs -R stats" on our live file system, but be aware that it does read the disk and will slow down I/O. That's why we do it immediately after mount. Using DDN9900s, 8TB luns, 7 OSTs per OSS, and a file system about 70% full, it seems to take about 45 minutes to complete (running concurrently on them all).

1.6 doesn't do OSS read cache, so we don't seem to have problems due to memory pressure evicting the bitmaps.

Comment by Ariel Martinez (Inactive) [ 14/Apr/11 ]

Thanks, that's really useful info; this way we can run that during a period of light IO or temporarily pause the job scheduler.

I've requested that Di ask Andreas a question about the differences between the ldiskfs (mballoc) code in the Lustre 1.8 ext3 and ext4 versions. Di understands my question and has looked over the code. Basically, I need to understand whether users who don't need 16TB LUN support should be running the Lustre 1.8 ext4 version, or whether they should stick with the ext3 version to avoid this type of problem as the FS becomes fragmented (>80%) in conjunction with 1.8's read caching.

Yesterday, before the purge, the OSSes couldn't even handle 1/10 of the current load. The write IO is fairly constant at 50-100MB/s at all times, but one user's jobs begin with short bursts (10 minutes or so) of very high read IO (up to 1.6GB/s) across several hundred jobs (best case scenario), causing a random-like read IO pattern which the 9900 doesn't appreciate. Yesterday, the system kept coming to a screeching halt with less than 1/10th of the jobs (approximately 20) running, which caused slow IO messages, stack traces, and lustre dumps. Unfortunately, it was more important to get them back up and running than to spend time debugging the issue; we'll resume that on Monday after we confirm that we can keep their cluster running stably for a few days.

As of this afternoon the system is back in full throttle after the FS was purged down from 82% to 60%, with the prealloc_table excluding 2048, read caching disabled, and async journaling enabled. We're going to let the cluster run in normal (unthrottled) operation through the night, and if the jobs succeed without any IO errors or failures, then tomorrow morning we'll re-include 2048 in the prealloc_table (we want to avoid having the system fragment too quickly again). I've asked them to add a cron job that inserts df output into /var/log/messages so we have a record of disk usage in the logs (see the sketch below).
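
One possible crontab entry for that (a sketch only; the hourly schedule, mount point, and syslog tag are assumptions):

0 * * * * /usr/bin/lfs df -h /lustre | /usr/bin/logger -t lustre-df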

If the system is stable through the weekend, then on Monday we'll go ahead and artificially fill the FS back up over 80% to see if the problem manifests again, and then start removing the other tunables, e.g. the read_caching parameters on the clients and servers, bumping OST threads back to 512, etc.

Also, please pardon any typos or lack of cohesion in my writing, I've had very little sleep these past few days... =)

Thanks

Comment by Di Wang [ 15/Apr/11 ]

I just went through all the logs Ariel sent me. Currently all signs point to bug 24183, so the patch
https://bugzilla.lustre.org/attachment.cgi?id=32420 should help here, unless Andreas has a different opinion. But I would still like to see mb_groups, mb_history and memory information when the problem happens again. Ariel will help me get this information.

Since we will include this patch in 1.8.6 anyway, I would suggest you include it in your own branch now, if necessary.

Thanks.

Comment by Ariel Martinez (Inactive) [ 15/Apr/11 ]

My other questions are based on the difference between the Lustre 1.8 ext3 and ext4 (16TB LUN support) versions. I'll refer to support for <=8TB as "ext3" and >8TB support as "ext4". The caching features appear not to play well with the default settings, perhaps causing unnecessary contention on the OSS.
1) Should users who have <=8TB OSTs install the ext3 or ext4 version?
2) Are there any compatibility issues if the users simply upgrade to the ext4 version directly?
3) Is the ldiskfs patch referenced in this bug applicable to both versions of the source?
4) Informational: Why were two versions of the source provided for download? Are there any significant differences?

Until we get to the point where we stop distributing our DDN-patched version, we'll need to confirm whether we need two different branches of 1.8 to support/upgrade our customers who have been on the 1.4/1.6 versions of Lustre.
5) Is there an estimate as to when 1.8.6 will be released that will include the ldiskfs and the readahead alignment patches?

Thanks

Comment by Andreas Dilger [ 15/Apr/11 ]

Reading through the full history of this bug, I think that all of the major issues have been discussed already, but I don't know which of the recommendations are currently in effect at the most recent site.

  • the patch in bug 24183 is definitely an improvement for the mballoc slowness, but it isn't a cure, because if the bitmaps are being pushed out of memory by reads they will need to be reloaded.
  • 1.8 OSS read cache appears to be slowing down concurrent write performance significantly. The read/writethrough cache can be limited to small files only with "lctl set_param obdfilter.*.readcache_max_filesize=32M", e.g. to cache only files smaller than 32MB. This looks like it would be quite helpful to keep the large-file reads from wiping out the filesystem metadata cache. The read cache can also be disabled completely on OSS nodes with "lctl set_param obdfilter.*.read_cache_enable=0". Note that "lctl set_param" only stays in effect until the next reboot, so it would need to be put into a startup script (a minimal sketch follows after this list).
  • increasing the amount of RAM would definitely help in such cases. With 15 OSTs per OSS node, the recommended minimum amount of RAM is 2GB + 2GB/OST = 32GB just to hold the filesystem metadata and have the bare minimum needed to handle failover. Going to 48 or 64GB/OSS is strongly recommended.
  • as previously mentioned, formatting the filesystem with the flex_bg feature (usable with ext4 only) will improve the bitmap loading time dramatically, but doesn't help with existing filesystems. The workaround that was suggested to use debugfs at mount time to prime the bitmap cache is useful if this problem happens mostly after mount, but doesn't completely avoid the longer-term memory pressure due to reads.
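
A minimal sketch of such a startup script, re-applying the parameters above after every reboot (32M is just the example value from above; the commented-out line is the disable-completely alternative):

#!/bin/sh
# lctl set_param does not persist across reboots, so re-apply the OSS read cache tunables here
lctl set_param obdfilter.*.readcache_max_filesize=32M
# or disable the OSS read cache entirely instead:
# lctl set_param obdfilter.*.read_cache_enable=0
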
Comment by Ariel Martinez (Inactive) [ 15/Apr/11 ]

Thanks for responding

Earlier I did the following to simply confirm that we could get the system back into a usable state:

OSS

  1. lowered ost thread count from 512 to 256
  2. lctl set_param obdfilter.*.readcache_max_filesize=0
  3. lctl set_param obdfilter.*.read_cache_enable=0
  4. for ent in $(find /proc/fs/ -name "prealloc_table"); do echo "4 8 16 32 64 128 256 512 1024" > $ent; done
  5. purged the FS from 82% to 60%

Client

  1. for ent in $(find /proc/fs/ -name "max_read_ahead_mb"); do echo 80 > $ent; done

The system responded very well after the purge. I had hoped to apply one setting at a time to determine what helped the most, but unfortunately this customer is a corporation, not academia, and they can't afford to stay down for very long, so I ended up applying all of the above at the same time.

Today we re-included 2048 in the prealloc_table to avoid unnecessary fragmentation. We'll see how this goes over the weekend. As I mentioned two comments ago, on Monday I'll start to remove some of the settings and artificially fill the filesystem. At that point I'll collect my usual info and give it to Di. Regarding the readcache filesize: unfortunately the average file size for reads is between 10-20 MB, but hundreds of them occur in parallel (burst read I/O as the jobs start up). I'm not sure that caching is of any use given this user's I/O pattern, which is to read in bulk and overwrite the files serially. I've asked whether it's possible to have their jobs delete the files instead of overwriting them (in hopes of freeing up contiguous blocks), but I'm not sure how well mballoc would handle this much data being freed in bulk (frequently).

Would this io pattern negate the priming? (I'd like to hope not, since lustre is often used as scratch, but that's why I ask.)

It seems that there's ultimately no difference between the ext3 and ext4 versions of Lustre 1.8. Until a permanent solution is found, shall we have our customers who upgrade from 1.6 to 1.8 (ext4) do some combination of the following:
0) Run the ldiskfs patched version of 1.8 (ext3 or ext4)
1) Disable readahead caching and/or reduce cache filesize threshold
2) Increase ram and possibly pin the bitmaps to memory
3) Run the backgrounded debugfs immediately after a mount (expecting reduced IO during this process)
4) Reformat the OSTs with flex_bg (see the sketch below)
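
For item 4, a sketch of what such a reformat could look like (reformatting destroys the OST's data, and the device, fsname, index, and MGS NID below are placeholders only):

mkfs.lustre --reformat --ost --fsname=lustre --index=0 --mgsnode=192.168.1.1@o2ib \
    --mkfsoptions="-O flex_bg -G 256" /dev/mpath/lun1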

Thanks

Comment by Di Wang [ 15/Apr/11 ]

"Just want to mention, pining all bitmaps to memory is probably not very realistic. Because it needs too much memory, for example, a 8T partition with 4K block_size, you will need 8*1024*1024 /128 (4KB block bitmap can describe 128MB) = 65536 * 4KB = 256G to put all bitmaps in memory."

I was wrong in the previous calculation: an 8TB partition only needs 8*1024*1024 /128 = 65536 * 4KB = 256MB, so it is possible to pin the bitmaps in memory if we had the right algorithm. Sorry about that.

The current default max cache_filesize is 32M. If many sites complain about performance after upgrading from 1.6 to 1.8, that probably means the read cache has a negative impact on their performance, so I will tune the default down to 8M in 1.8.6. Customers can tune it back up later if they find it helpful.

Comment by Ariel Martinez (Inactive) [ 18/Apr/11 ]

Well for this site they have 3 OSTs/OSS with 24 GB of RAM:
3 OSTs * 2GB = 6GB (OSTs) * 2 = 12GB (failover) + 2GB (OSS) = 14GB – which leaves ~8GB for metadata cache and OS.
0.5GB (per 16TB OST bitmap) * 3 = 1.5GB * 2 (failover) = 3GB (bitmap) – which leaves 5 GB for the OSS ...

Does that sound about right?

I just wonder whether pinning the bitmaps in memory would be best, provided the OSS has sufficient memory, thus preventing them from [potentially] being pushed out of memory repeatedly.

Comment by Kit Westneat (Inactive) [ 18/Apr/11 ]

Robin Humble gave a talk at LUG on how to pin inodes and dentries that might be useful; it was based on this email:
http://www.mail-archive.com/lustre-discuss@lists.lustre.org/msg08508.html
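
For reference, a minimal sketch of that approach, assuming the tunable discussed in that thread is vm.vfs_cache_pressure (worth double-checking against the email):

sysctl -w vm.vfs_cache_pressure=0     # tell the VM never to reclaim dentry/inode caches
# to persist across reboots:
echo "vm.vfs_cache_pressure = 0" >> /etc/sysctl.conf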

Comment by Ariel Martinez (Inactive) [ 18/Apr/11 ]

Thanks, but I had something different in mind for preventing the bitmaps from being flushed out. I also tend to be a bit wary of setting the vfs values too low in a production environment. It seems that a value of 0 was the only setting that had any significant impact, which seems a little dangerous. I guess for now the way to keep things working for 1.6 compatibility is to keep the FS below 80% and disable most, if not all, of the caching features in 1.8. It kind of seems like cron job galore to keep these workarounds in place.

Comment by Di Wang [ 19/Apr/11 ]

Reassigning this bug to yangsheng for porting/adding the patch
https://bugzilla.lustre.org/attachment.cgi?id=32804
to the ldiskfs series (both 1.8.6 and 2.1, please).

Comment by Build Master (Inactive) [ 20/Apr/11 ]

Integrated in lustre-b1_8 » x86_64,client,ubuntu1004,inkernel #17
LU-15 Strange slow IO messages and bad performance

Johann Lombardi : c5c2986be490b2fbceb4b38d6c983d279f4bbcf8
Files :

  • ldiskfs/kernel_patches/patches/ext4-mballoc-group_check-rhel5.patch
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
Comment by Build Master (Inactive) [ 20/Apr/11 ]

Integrated in lustre-b1_8 » x86_64,client,el5,inkernel #17
LU-15 Strange slow IO messages and bad performance

Johann Lombardi : c5c2986be490b2fbceb4b38d6c983d279f4bbcf8
Files :

  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
  • ldiskfs/kernel_patches/patches/ext4-mballoc-group_check-rhel5.patch
Comment by Build Master (Inactive) [ 20/Apr/11 ]

Integrated in lustre-b1_8 » i686,client,el6,inkernel #17
LU-15 Strange slow IO messages and bad performance

Johann Lombardi : c5c2986be490b2fbceb4b38d6c983d279f4bbcf8
Files :

  • ldiskfs/kernel_patches/patches/ext4-mballoc-group_check-rhel5.patch
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
Comment by Build Master (Inactive) [ 20/Apr/11 ]

Integrated in lustre-b1_8 » x86_64,client,el6,inkernel #17
LU-15 Strange slow IO messages and bad performance

Johann Lombardi : c5c2986be490b2fbceb4b38d6c983d279f4bbcf8
Files :

  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
  • ldiskfs/kernel_patches/patches/ext4-mballoc-group_check-rhel5.patch
Comment by Build Master (Inactive) [ 20/Apr/11 ]

Integrated in lustre-b1_8 » x86_64,server,el5,inkernel #17
LU-15 Strange slow IO messages and bad performance

Johann Lombardi : c5c2986be490b2fbceb4b38d6c983d279f4bbcf8
Files :

  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
  • ldiskfs/kernel_patches/patches/ext4-mballoc-group_check-rhel5.patch
Comment by Build Master (Inactive) [ 20/Apr/11 ]

Integrated in lustre-b1_8 » x86_64,client,el5,ofa #17
LU-15 Strange slow IO messages and bad performance

Johann Lombardi : c5c2986be490b2fbceb4b38d6c983d279f4bbcf8
Files :

  • ldiskfs/kernel_patches/patches/ext4-mballoc-group_check-rhel5.patch
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
Comment by Build Master (Inactive) [ 20/Apr/11 ]

Integrated in lustre-b1_8 » i686,client,el5,ofa #17
LU-15 Strange slow IO messages and bad performance

Johann Lombardi : c5c2986be490b2fbceb4b38d6c983d279f4bbcf8
Files :

  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
  • ldiskfs/kernel_patches/patches/ext4-mballoc-group_check-rhel5.patch
Comment by Build Master (Inactive) [ 20/Apr/11 ]

Integrated in lustre-b1_8 » x86_64,server,el5,ofa #17
LU-15 Strange slow IO messages and bad performance

Johann Lombardi : c5c2986be490b2fbceb4b38d6c983d279f4bbcf8
Files :

  • ldiskfs/kernel_patches/patches/ext4-mballoc-group_check-rhel5.patch
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
Comment by Build Master (Inactive) [ 20/Apr/11 ]

Integrated in lustre-b1_8 » i686,server,el5,ofa #17
LU-15 Strange slow IO messages and bad performance

Johann Lombardi : c5c2986be490b2fbceb4b38d6c983d279f4bbcf8
Files :

  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
  • ldiskfs/kernel_patches/patches/ext4-mballoc-group_check-rhel5.patch
Comment by Build Master (Inactive) [ 20/Apr/11 ]

Integrated in lustre-b1_8 » i686,client,el5,inkernel #17
LU-15 Strange slow IO messages and bad performance

Johann Lombardi : c5c2986be490b2fbceb4b38d6c983d279f4bbcf8
Files :

  • ldiskfs/kernel_patches/patches/ext4-mballoc-group_check-rhel5.patch
  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
Comment by Build Master (Inactive) [ 20/Apr/11 ]

Integrated in lustre-b1_8 » i686,server,el5,inkernel #17
LU-15 Strange slow IO messages and bad performance

Johann Lombardi : c5c2986be490b2fbceb4b38d6c983d279f4bbcf8
Files :

  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
  • ldiskfs/kernel_patches/patches/ext4-mballoc-group_check-rhel5.patch
Comment by Peter Jones [ 21/Apr/11 ]

Does this fix need landing to master?

Yes, it needs to be landed to master.

Comment by Ariel Martinez (Inactive) [ 21/Apr/11 ]

Was the readahead_align_1M patch landed as well?

Yes, it will be landed to 1.8.6 and 2.1 as well.

Comment by Build Master (Inactive) [ 26/Apr/11 ]

Integrated in lustre-b1_8 » x86_64,client,el5,inkernel #19
LU-15 slow IO with read intense application

Johann Lombardi : 12aae18c0ec495f459a1026ad2e86bbf0232b0fe
Files :

  • lustre/obdfilter/filter_internal.h
Comment by Build Master (Inactive) [ 26/Apr/11 ]

Integrated in lustre-b1_8 » x86_64,client,ubuntu1004,inkernel #19
LU-15 slow IO with read intense application

Johann Lombardi : 12aae18c0ec495f459a1026ad2e86bbf0232b0fe
Files :

  • lustre/obdfilter/filter_internal.h
Comment by Build Master (Inactive) [ 26/Apr/11 ]

Integrated in lustre-b1_8 » x86_64,client,el5,ofa #19
LU-15 slow IO with read intense application

Johann Lombardi : 12aae18c0ec495f459a1026ad2e86bbf0232b0fe
Files :

  • lustre/obdfilter/filter_internal.h
Comment by Build Master (Inactive) [ 26/Apr/11 ]

Integrated in lustre-b1_8 » x86_64,server,el5,inkernel #19
LU-15 slow IO with read intense application

Johann Lombardi : 12aae18c0ec495f459a1026ad2e86bbf0232b0fe
Files :

  • lustre/obdfilter/filter_internal.h
Comment by Build Master (Inactive) [ 26/Apr/11 ]

Integrated in lustre-b1_8 » x86_64,client,el6,inkernel #20
LU-15 slow IO with read intense application

Johann Lombardi : 12aae18c0ec495f459a1026ad2e86bbf0232b0fe
Files :

  • lustre/obdfilter/filter_internal.h
Comment by Build Master (Inactive) [ 26/Apr/11 ]

Integrated in lustre-b1_8 » i686,server,el5,ofa #20
LU-15 slow IO with read intense application

Johann Lombardi : 12aae18c0ec495f459a1026ad2e86bbf0232b0fe
Files :

  • lustre/obdfilter/filter_internal.h
Comment by Build Master (Inactive) [ 26/Apr/11 ]

Integrated in lustre-b1_8 » i686,client,el6,inkernel #20
LU-15 slow IO with read intense application

Johann Lombardi : 12aae18c0ec495f459a1026ad2e86bbf0232b0fe
Files :

  • lustre/obdfilter/filter_internal.h
Comment by Build Master (Inactive) [ 26/Apr/11 ]

Integrated in lustre-b1_8 » i686,client,el5,ofa #20
LU-15 slow IO with read intense application

Johann Lombardi : 12aae18c0ec495f459a1026ad2e86bbf0232b0fe
Files :

  • lustre/obdfilter/filter_internal.h
Comment by Build Master (Inactive) [ 26/Apr/11 ]

Integrated in lustre-b1_8 » i686,client,el5,inkernel #20
LU-15 slow IO with read intense application

Johann Lombardi : 12aae18c0ec495f459a1026ad2e86bbf0232b0fe
Files :

  • lustre/obdfilter/filter_internal.h
Comment by Build Master (Inactive) [ 26/Apr/11 ]

Integrated in lustre-b1_8 » i686,server,el5,inkernel #20
LU-15 slow IO with read intense application

Johann Lombardi : 12aae18c0ec495f459a1026ad2e86bbf0232b0fe
Files :

  • lustre/obdfilter/filter_internal.h
Comment by Build Master (Inactive) [ 28/Apr/11 ]

Integrated in lustre-b1_8 » x86_64,server,el5,inkernel #22
LU-15 slow IO with read-intense application

Johann Lombardi : b7eb1d769cc80216e353c4e4217dfe9070927139
Files :

  • lustre/llite/rw.c
Comment by Build Master (Inactive) [ 28/Apr/11 ]

Integrated in lustre-b1_8 » x86_64,client,ubuntu1004,inkernel #22
LU-15 slow IO with read-intense application

Johann Lombardi : b7eb1d769cc80216e353c4e4217dfe9070927139
Files :

  • lustre/llite/rw.c
Comment by Build Master (Inactive) [ 28/Apr/11 ]

Integrated in lustre-b1_8 » x86_64,client,el5,inkernel #22
LU-15 slow IO with read-intense application

Johann Lombardi : b7eb1d769cc80216e353c4e4217dfe9070927139
Files :

  • lustre/llite/rw.c
Comment by Build Master (Inactive) [ 28/Apr/11 ]

Integrated in lustre-b1_8 » x86_64,client,el6,inkernel #22
LU-15 slow IO with read-intense application

Johann Lombardi : b7eb1d769cc80216e353c4e4217dfe9070927139
Files :

  • lustre/llite/rw.c
Comment by Build Master (Inactive) [ 28/Apr/11 ]

Integrated in lustre-b1_8 » i686,client,el6,inkernel #22
LU-15 slow IO with read-intense application

Johann Lombardi : b7eb1d769cc80216e353c4e4217dfe9070927139
Files :

  • lustre/llite/rw.c
Comment by Build Master (Inactive) [ 28/Apr/11 ]

Integrated in lustre-b1_8 » x86_64,client,el5,ofa #22
LU-15 slow IO with read-intense application

Johann Lombardi : b7eb1d769cc80216e353c4e4217dfe9070927139
Files :

  • lustre/llite/rw.c
Comment by Build Master (Inactive) [ 28/Apr/11 ]

Integrated in lustre-b1_8 » x86_64,server,el5,ofa #22
LU-15 slow IO with read-intense application

Johann Lombardi : b7eb1d769cc80216e353c4e4217dfe9070927139
Files :

  • lustre/llite/rw.c
Comment by Build Master (Inactive) [ 28/Apr/11 ]

Integrated in lustre-b1_8 » i686,server,el5,inkernel #22
LU-15 slow IO with read-intense application

Johann Lombardi : b7eb1d769cc80216e353c4e4217dfe9070927139
Files :

  • lustre/llite/rw.c
Comment by Build Master (Inactive) [ 28/Apr/11 ]

Integrated in lustre-b1_8 » i686,client,el5,inkernel #22
LU-15 slow IO with read-intense application

Johann Lombardi : b7eb1d769cc80216e353c4e4217dfe9070927139
Files :

  • lustre/llite/rw.c
Comment by Build Master (Inactive) [ 28/Apr/11 ]

Integrated in lustre-b1_8 » i686,server,el5,ofa #22
LU-15 slow IO with read-intense application

Johann Lombardi : b7eb1d769cc80216e353c4e4217dfe9070927139
Files :

  • lustre/llite/rw.c
Comment by Build Master (Inactive) [ 28/Apr/11 ]

Integrated in lustre-b1_8 » i686,client,el5,ofa #22
LU-15 slow IO with read-intense application

Johann Lombardi : b7eb1d769cc80216e353c4e4217dfe9070927139
Files :

  • lustre/llite/rw.c
Comment by Build Master (Inactive) [ 20/May/11 ]

Integrated in lustre-master » x86_64,client,sles11,inkernel #130
LU-15 shrink the cache to alleviate OST memory pressure

Oleg Drokin : 491a448008d8f8a4dc61789c14bd96de205f856e
Files :

  • lustre/obdfilter/filter_internal.h
Comment by Build Master (Inactive) [ 20/May/11 ]

Integrated in lustre-master » x86_64,client,el5,inkernel #130
LU-15 shrink the cache to alleviate OST memory pressure

Oleg Drokin : 491a448008d8f8a4dc61789c14bd96de205f856e
Files :

  • lustre/obdfilter/filter_internal.h
Comment by Build Master (Inactive) [ 20/May/11 ]

Integrated in lustre-master » i686,client,el5,inkernel #130
LU-15 shrink the cache to alleviate OST memory pressure

Oleg Drokin : 491a448008d8f8a4dc61789c14bd96de205f856e
Files :

  • lustre/obdfilter/filter_internal.h
Comment by Build Master (Inactive) [ 20/May/11 ]

Integrated in lustre-master » x86_64,server,el5,inkernel #130
LU-15 shrink the cache to alleviate OST memory pressure

Oleg Drokin : 491a448008d8f8a4dc61789c14bd96de205f856e
Files :

  • lustre/obdfilter/filter_internal.h
Comment by Build Master (Inactive) [ 20/May/11 ]

Integrated in lustre-master » x86_64,client,ubuntu1004,inkernel #130
LU-15 shrink the cache to alleviate OST memory pressure

Oleg Drokin : 491a448008d8f8a4dc61789c14bd96de205f856e
Files :

  • lustre/obdfilter/filter_internal.h
Comment by Build Master (Inactive) [ 20/May/11 ]

Integrated in lustre-master » i686,client,el5,ofa #130
LU-15 shrink the cache to alleviate OST memory pressure

Oleg Drokin : 491a448008d8f8a4dc61789c14bd96de205f856e
Files :

  • lustre/obdfilter/filter_internal.h
Comment by Build Master (Inactive) [ 20/May/11 ]

Integrated in lustre-master » x86_64,client,el5,ofa #130
LU-15 shrink the cache to alleviate OST memory pressure

Oleg Drokin : 491a448008d8f8a4dc61789c14bd96de205f856e
Files :

  • lustre/obdfilter/filter_internal.h
Comment by Build Master (Inactive) [ 20/May/11 ]

Integrated in lustre-master » x86_64,server,el5,ofa #130
LU-15 shrink the cache to alleviate OST memory pressure

Oleg Drokin : 491a448008d8f8a4dc61789c14bd96de205f856e
Files :

  • lustre/obdfilter/filter_internal.h
Comment by Build Master (Inactive) [ 20/May/11 ]

Integrated in lustre-master » x86_64,client,el6,inkernel #130
LU-15 shrink the cache to alleviate OST memory pressure

Oleg Drokin : 491a448008d8f8a4dc61789c14bd96de205f856e
Files :

  • lustre/obdfilter/filter_internal.h
Comment by Build Master (Inactive) [ 20/May/11 ]

Integrated in lustre-master » x86_64,client,ubuntu1004,ofa #130
LU-15 shrink the cache to alleviate OST memory pressure

Oleg Drokin : 491a448008d8f8a4dc61789c14bd96de205f856e
Files :

  • lustre/obdfilter/filter_internal.h
Comment by Build Master (Inactive) [ 20/May/11 ]

Integrated in lustre-master » i686,server,el5,inkernel #130
LU-15 shrink the cache to alleviate OST memory pressure

Oleg Drokin : 491a448008d8f8a4dc61789c14bd96de205f856e
Files :

  • lustre/obdfilter/filter_internal.h
Comment by Build Master (Inactive) [ 20/May/11 ]

Integrated in lustre-master » i686,server,el5,ofa #130
LU-15 shrink the cache to alleviate OST memory pressure

Oleg Drokin : 491a448008d8f8a4dc61789c14bd96de205f856e
Files :

  • lustre/obdfilter/filter_internal.h
Comment by Build Master (Inactive) [ 20/May/11 ]

Integrated in lustre-master » x86_64,server,el6,inkernel #130
LU-15 shrink the cache to alleviate OST memory pressure

Oleg Drokin : 491a448008d8f8a4dc61789c14bd96de205f856e
Files :

  • lustre/obdfilter/filter_internal.h
Comment by Build Master (Inactive) [ 20/May/11 ]

Integrated in lustre-master » i686,client,el6,inkernel #130
LU-15 shrink the cache to alleviate OST memory pressure

Oleg Drokin : 491a448008d8f8a4dc61789c14bd96de205f856e
Files :

  • lustre/obdfilter/filter_internal.h
Comment by Build Master (Inactive) [ 20/May/11 ]

Integrated in lustre-master » i686,server,el6,inkernel #130
LU-15 shrink the cache to alleviate OST memory pressure

Oleg Drokin : 491a448008d8f8a4dc61789c14bd96de205f856e
Files :

  • lustre/obdfilter/filter_internal.h
Comment by Peter Jones [ 06/Jun/11 ]

Can this ticket be marked as resolved, or is there still some outstanding work?

Comment by Sebastien Buisson (Inactive) [ 16/Jun/11 ]

Peter,

I found this ticket while searching the Jira database for "slow IO" issues, because one of our customers using Lustre 2.0 is having some trouble with this.
I saw that there have been 3 commits in b1_8 to fix this issue, but only one in master. Is there any reason why http://review.whamcloud.com/441 and http://review.whamcloud.com/437 have not been ported and landed to master?

Sebastien.

Comment by Yang Sheng [ 16/Jun/11 ]

Hi, Sebastien,

http://review.whamcloud.com/442 and http://review.whamcloud.com/438 are both master patches for the slow IO issue.

YangSheng

Comment by Sebastien Buisson (Inactive) [ 16/Jun/11 ]

Oh great, thank you.
I did a search in Gerrit but was unable to find those two. Is there any trick to searching successfully in Gerrit? I entered the string 'LU-15' in the search field, but it returns no results.

Sebastien.

Comment by Peter Jones [ 16/Jun/11 ]

Sorry Sebastien, I do not have a better answer for you. I have been able to search in gerrit by engineer and branch, but not by subject. We are hoping to have better automated updates between gerrit and JIRA in the future, but for now it relies on manual updates from the engineer about changesets, which can sometimes not happen. I track that patches landed on 1.8.x end up landing on master (when needed), but on this occasion I had prematurely marked this issue as having been ported after the first patch landed. I am now tracking it correctly, and we will provide further updates as the remaining patches land.

Comment by Build Master (Inactive) [ 22/Jun/11 ]

Integrated in lustre-b1_8 » i686,client,el5,inkernel #90
LU-410 Revert LU-15 slow IO with read intense application

Johann Lombardi : ec54d726360ddd09f3fa7489535bdbf9875e4306
Files :

  • lustre/obdfilter/filter_internal.h
  • lustre/ChangeLog
Comment by Build Master (Inactive) [ 22/Jun/11 ]

Integrated in lustre-b1_8 » x86_64,client,el6,inkernel #90
LU-410 Revert LU-15 slow IO with read intense application

Johann Lombardi : ec54d726360ddd09f3fa7489535bdbf9875e4306
Files :

  • lustre/obdfilter/filter_internal.h
  • lustre/ChangeLog
Comment by Build Master (Inactive) [ 22/Jun/11 ]

Integrated in lustre-b1_8 » x86_64,server,el5,inkernel #90
LU-410 Revert LU-15 slow IO with read intense application

Johann Lombardi : ec54d726360ddd09f3fa7489535bdbf9875e4306
Files :

  • lustre/obdfilter/filter_internal.h
  • lustre/ChangeLog
Comment by Build Master (Inactive) [ 22/Jun/11 ]

Integrated in lustre-b1_8 » x86_64,client,ubuntu1004,inkernel #90
LU-410 Revert LU-15 slow IO with read intense application

Johann Lombardi : ec54d726360ddd09f3fa7489535bdbf9875e4306
Files :

  • lustre/ChangeLog
  • lustre/obdfilter/filter_internal.h
Comment by Build Master (Inactive) [ 22/Jun/11 ]

Integrated in lustre-b1_8 » i686,server,el5,ofa #90
LU-410 Revert LU-15 slow IO with read intense application

Johann Lombardi : ec54d726360ddd09f3fa7489535bdbf9875e4306
Files :

  • lustre/obdfilter/filter_internal.h
  • lustre/ChangeLog
Comment by Build Master (Inactive) [ 22/Jun/11 ]

Integrated in lustre-b1_8 » x86_64,client,el5,inkernel #90
LU-410 Revert LU-15 slow IO with read intense application

Johann Lombardi : ec54d726360ddd09f3fa7489535bdbf9875e4306
Files :

  • lustre/obdfilter/filter_internal.h
  • lustre/ChangeLog
Comment by Build Master (Inactive) [ 22/Jun/11 ]

Integrated in lustre-b1_8 » i686,client,el5,ofa #90
LU-410 Revert LU-15 slow IO with read intense application

Johann Lombardi : ec54d726360ddd09f3fa7489535bdbf9875e4306
Files :

  • lustre/obdfilter/filter_internal.h
  • lustre/ChangeLog
Comment by Build Master (Inactive) [ 22/Jun/11 ]

Integrated in lustre-b1_8 » i686,client,el6,inkernel #90
LU-410 Revert LU-15 slow IO with read intense application

Johann Lombardi : ec54d726360ddd09f3fa7489535bdbf9875e4306
Files :

  • lustre/ChangeLog
  • lustre/obdfilter/filter_internal.h
Comment by Build Master (Inactive) [ 22/Jun/11 ]

Integrated in lustre-b1_8 » i686,server,el5,inkernel #90
LU-410 Revert LU-15 slow IO with read intense application

Johann Lombardi : ec54d726360ddd09f3fa7489535bdbf9875e4306
Files :

  • lustre/obdfilter/filter_internal.h
  • lustre/ChangeLog
Comment by Build Master (Inactive) [ 22/Jun/11 ]

Integrated in lustre-b1_8 » x86_64,client,el5,ofa #90
LU-410 Revert LU-15 slow IO with read intense application

Johann Lombardi : ec54d726360ddd09f3fa7489535bdbf9875e4306
Files :

  • lustre/ChangeLog
  • lustre/obdfilter/filter_internal.h
Comment by Build Master (Inactive) [ 22/Jun/11 ]

Integrated in lustre-b1_8 » x86_64,server,el5,ofa #90
LU-410 Revert LU-15 slow IO with read intense application

Johann Lombardi : ec54d726360ddd09f3fa7489535bdbf9875e4306
Files :

  • lustre/ChangeLog
  • lustre/obdfilter/filter_internal.h
Comment by Build Master (Inactive) [ 27/Jun/11 ]

Integrated in lustre-master » x86_64,client,el6,inkernel #187
Revert "LU-15 shrink the cache to alleviate OST memory pressure"

Oleg Drokin : 78ed4636ff68a70c8719f845d434c9f142d32fb4
Files :

  • lustre/obdfilter/filter_internal.h
Comment by Build Master (Inactive) [ 27/Jun/11 ]

Integrated in lustre-master » x86_64,server,el6,inkernel #187
Revert "LU-15 shrink the cache to alleviate OST memory pressure"

Oleg Drokin : 78ed4636ff68a70c8719f845d434c9f142d32fb4
Files :

  • lustre/obdfilter/filter_internal.h
Comment by Build Master (Inactive) [ 27/Jun/11 ]

Integrated in lustre-master » x86_64,server,el5,ofa #187
Revert "LU-15 shrink the cache to alleviate OST memory pressure"

Oleg Drokin : 78ed4636ff68a70c8719f845d434c9f142d32fb4
Files :

  • lustre/obdfilter/filter_internal.h
Comment by Build Master (Inactive) [ 27/Jun/11 ]

Integrated in lustre-master » x86_64,client,el5,inkernel #187
Revert "LU-15 shrink the cache to alleviate OST memory pressure"

Oleg Drokin : 78ed4636ff68a70c8719f845d434c9f142d32fb4
Files :

  • lustre/obdfilter/filter_internal.h
Comment by Build Master (Inactive) [ 27/Jun/11 ]

Integrated in lustre-master » i686,client,el5,ofa #187
Revert "LU-15 shrink the cache to alleviate OST memory pressure"

Oleg Drokin : 78ed4636ff68a70c8719f845d434c9f142d32fb4
Files :

  • lustre/obdfilter/filter_internal.h
Comment by Build Master (Inactive) [ 27/Jun/11 ]

Integrated in lustre-master » i686,client,el5,inkernel #187
Revert "LU-15 shrink the cache to alleviate OST memory pressure"

Oleg Drokin : 78ed4636ff68a70c8719f845d434c9f142d32fb4
Files :

  • lustre/obdfilter/filter_internal.h
Comment by Build Master (Inactive) [ 27/Jun/11 ]

Integrated in lustre-master » i686,client,el6,inkernel #187
Revert "LU-15 shrink the cache to alleviate OST memory pressure"

Oleg Drokin : 78ed4636ff68a70c8719f845d434c9f142d32fb4
Files :

  • lustre/obdfilter/filter_internal.h
Comment by Build Master (Inactive) [ 27/Jun/11 ]

Integrated in lustre-master » x86_64,client,sles11,inkernel #187
Revert "LU-15 shrink the cache to alleviate OST memory pressure"

Oleg Drokin : 78ed4636ff68a70c8719f845d434c9f142d32fb4
Files :

  • lustre/obdfilter/filter_internal.h
Comment by Build Master (Inactive) [ 27/Jun/11 ]

Integrated in lustre-master » x86_64,client,el5,ofa #187
Revert "LU-15 shrink the cache to alleviate OST memory pressure"

Oleg Drokin : 78ed4636ff68a70c8719f845d434c9f142d32fb4
Files :

  • lustre/obdfilter/filter_internal.h
Comment by Build Master (Inactive) [ 27/Jun/11 ]

Integrated in lustre-master » x86_64,server,el5,inkernel #187
Revert "LU-15 shrink the cache to alleviate OST memory pressure"

Oleg Drokin : 78ed4636ff68a70c8719f845d434c9f142d32fb4
Files :

  • lustre/obdfilter/filter_internal.h
Comment by Build Master (Inactive) [ 27/Jun/11 ]

Integrated in lustre-master » x86_64,client,ubuntu1004,inkernel #187
Revert "LU-15 shrink the cache to alleviate OST memory pressure"

Oleg Drokin : 78ed4636ff68a70c8719f845d434c9f142d32fb4
Files :

  • lustre/obdfilter/filter_internal.h
Comment by Build Master (Inactive) [ 27/Jun/11 ]

Integrated in lustre-master » i686,server,el5,ofa #187
Revert "LU-15 shrink the cache to alleviate OST memory pressure"

Oleg Drokin : 78ed4636ff68a70c8719f845d434c9f142d32fb4
Files :

  • lustre/obdfilter/filter_internal.h
Comment by Build Master (Inactive) [ 27/Jun/11 ]

Integrated in lustre-master » i686,server,el5,inkernel #187
Revert "LU-15 shrink the cache to alleviate OST memory pressure"

Oleg Drokin : 78ed4636ff68a70c8719f845d434c9f142d32fb4
Files :

  • lustre/obdfilter/filter_internal.h
Comment by Build Master (Inactive) [ 27/Jun/11 ]

Integrated in lustre-master » i686,server,el6,inkernel #187
Revert "LU-15 shrink the cache to alleviate OST memory pressure"

Oleg Drokin : 78ed4636ff68a70c8719f845d434c9f142d32fb4
Files :

  • lustre/obdfilter/filter_internal.h
Comment by Build Master (Inactive) [ 16/Aug/11 ]

Integrated in lustre-master #258 on the following builds: x86_64,client,el5,ofa; i686,client,el6,inkernel; x86_64,server,el5,ofa; x86_64,client,el5,inkernel; x86_64,client,sles11,inkernel; i686,client,el5,inkernel; x86_64,server,el6,inkernel; x86_64,client,el6,inkernel; x86_64,client,ubuntu1004,inkernel; x86_64,server,el5,inkernel; i686,server,el5,ofa; i686,server,el6,inkernel; i686,server,el5,inkernel; i686,client,el5,ofa
LU-15 Strange slow IO messages and bad performance

Oleg Drokin : cb714729eb58d4691293eac727c4186f224ba8f8
Files :

  • ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5-ext4.series
  • ldiskfs/kernel_patches/patches/ext4-mballoc-group_check-rhel5.patch
Comment by Yang Sheng [ 17/Aug/11 ]

Porting http://review.whamcloud.com/#change,437 to master; the patch has been uploaded to http://review.whamcloud.com/#change,1255.

Comment by Build Master (Inactive) [ 25/Oct/11 ]

Integrated in lustre-master #311 on the following builds: x86_64,server,el5,ofa; x86_64,client,el5,inkernel; i686,client,el6,inkernel; x86_64,server,el6,inkernel; x86_64,server,el5,inkernel; x86_64,client,sles11,inkernel; x86_64,client,ubuntu1004,inkernel; x86_64,client,el5,ofa; x86_64,client,el6,inkernel; i686,client,el5,ofa; i686,client,el5,inkernel; i686,server,el5,ofa; i686,server,el5,inkernel; i686,server,el6,inkernel
LU-15 slow IO with read-intense application

Oleg Drokin : 93b6adce588ab0aa6203f980cdfca35a19887d00
Files :

  • lustre/llite/vvp_io.c
  • lustre/llite/rw.c
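This change touches the client readahead path (lustre/llite/rw.c and lustre/llite/vvp_io.c). Purely as an illustration of the kind of 1 MiB alignment suggested by the attached readahead_align_1M.patch, the following is a minimal, standalone sketch of rounding a readahead window (tracked in pages) to 1 MiB boundaries; the struct and function names are hypothetical and are not taken from the Lustre source tree.

#include <stdio.h>

/*
 * Illustrative only: none of these names come from the Lustre tree.
 * The idea sketched is rounding a per-file readahead window so that
 * it starts and ends on 1 MiB boundaries, expressed in 4 KiB pages.
 */
#define RA_PAGE_SIZE   4096UL
#define RA_ALIGN_BYTES (1024UL * 1024UL)                /* 1 MiB */
#define RA_ALIGN_PAGES (RA_ALIGN_BYTES / RA_PAGE_SIZE)  /* 256 pages */

struct ra_window {
        unsigned long start;  /* first page index of the window */
        unsigned long count;  /* number of pages in the window  */
};

/* Round the window start down and the window end up to 1 MiB boundaries. */
static void ra_align_1m(struct ra_window *w)
{
        unsigned long end = w->start + w->count;

        w->start -= w->start % RA_ALIGN_PAGES;
        if (end % RA_ALIGN_PAGES)
                end += RA_ALIGN_PAGES - (end % RA_ALIGN_PAGES);
        w->count = end - w->start;
}

int main(void)
{
        struct ra_window w = { .start = 300, .count = 100 };

        ra_align_1m(&w);
        /* start 300 rounds down to 256, end 400 rounds up to 512 -> 256 pages */
        printf("aligned window: start=%lu count=%lu\n", w.start, w.count);
        return 0;
}

An aligned window means the reads issued by readahead map onto whole 1 MiB RPCs rather than partial ones; again, this is only the general idea, not the actual patch.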
Comment by Yang Sheng [ 26/Oct/11 ]

No more work needs to be done.
