[LU-11] Lustre 2.x functionality regression: Missing aggregate MDT stats Created: 03/Nov/10  Updated: 12/Nov/10  Resolved: 12/Nov/10

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.0.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Michael MacDonald (Inactive) Assignee: Zhenyu Xu
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Bugzilla ID: 21420
Rank (Obsolete): 10163

 Description   

LLNL has pointed out that LMT performance at scale (e.g. 20k clients) will greatly suffer if LMT has to read the per-client-export stats in order to recreate the missing aggregate MDT stats. By extension, any other monitoring tools depending on aggregate MDT stats will also be affected by this regression.
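To make the cost concrete, here is a minimal sketch of the workaround a monitoring tool would need: read every per-export stats file and sum the counters. It is not LMT code. The function names are hypothetical, and the line format (`<name> <count> samples [<unit>]`) is assumed from the sample output later in this ticket. The input is passed as strings so the sketch runs without a Lustre mount.

```python
import re
from collections import Counter

# One counter line, as seen in this ticket: "open 1 samples [reqs]".
# Assumption: any further fields after the unit can be ignored here.
_LINE = re.compile(r"^\s*(\S+)\s+(\d+)\s+samples\s+\[(\S+)\]")


def parse_stats(text):
    """Return {counter_name: sample_count} for one stats file's contents."""
    counts = {}
    for line in text.splitlines():
        m = _LINE.match(line)
        if m:  # the snapshot_time line and blank lines do not match
            counts[m.group(1)] = int(m.group(2))
    return counts


def aggregate_exports(export_texts):
    """Sum per-export counters; the work grows with the number of exports."""
    total = Counter()
    for text in export_texts:
        total.update(parse_stats(text))
    return dict(total)


export = """snapshot_time 1289058074.119639 secs.usecs
open 1 samples [reqs]
close 1 samples [reqs]
mkdir 1 samples [reqs]
"""

# Three identical exports stand in for three clients.
print(aggregate_exports([export] * 3))
```

With 20k clients, each polling interval would mean opening, reading, and parsing roughly 20k such files on the MDS instead of one aggregate file.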

----- Forwarded message from Brian Behlendorf <behlendorf1@llnl.gov> -----

Date: Tue, 2 Nov 2010 14:43:49 -0700
From: Brian Behlendorf <behlendorf1@llnl.gov>
To: Jim Garlick <garlick@llnl.gov>
Subject: Re: [mjmac@whamcloud.com: Re: LMT work]

Check out bug 21420 comment #40, specifically commit 9eb3d1db in HEAD.
This is where they moved the stats from being global to being
per-export, they appear to think they were useless.

commit 9eb3d1db42d2937daef25950f6527ccb46221f8e
Author: LiuYing <emoly.liu@sun.com>
Date: Fri Oct 8 10:48:14 2010 +0800

b=21420 Add mds/mgs stats to HEAD

1)remove useless counter from mds and move some definitions
from mds to mdt;
2)move LPROCFS_MD_OP_INITs from lprocfs_alloc_md_stats() to
lprocfs_init_mps_stats(), which is needed by this stats;
3)increase mdt counter for each type operation

i=andreas
i=wangdi



 Comments   
Comment by Robert Read (Inactive) [ 03/Nov/10 ]

I reopened 21420 and requested that the functionality be restored. However, it looks like the MDS aggregate stats were removed in 2008, in commit 69a3513021212ed1eb8823a50f80853e22e607b3. This patch only removed the unused initialization code.

Comment by Robert Read (Inactive) [ 05/Nov/10 ]

I had this chat with Andreas earlier today:

[11:00] adilger: looking at the patch, it _does_ appear that there should be MDT global stats - see mdt_lproc.c::mdt_procfs_init() hunk, and that mdt_counter_incr() is incrementing the obd_stats counter in addition to the per-export counter
[11:00] rread: true, but i couldn't find the stats when i tested this
[11:01] rread: mdt_stats_counter_init is only called for the nid_stats
[11:02] rread: don't we also need to call this with obd_stats somewhere?
[11:04] adilger: the stats init for the obd devices is done as part of the lprocfs_alloc_md_stats() code
[11:05] adilger: I wonder if the stats are being collected, but the MDT obd device itself is not being hooked into lprocfs?
[11:06] rread: the stats file was there, just no stats
[11:58] adilger: sorry, was on another concall...  I suspect this is a bug in the MDT device setup due to the half-finished MDS->MDT code reorg
[11:59] adilger: i.e. something foolish like the "old" MDT has an OBD device, and the "new" CMD MDT has a separate MDT device
[12:00] adilger: err, a separate OBD device

Comment by Robert Read (Inactive) [ 05/Nov/10 ]

Bobi Jam, please review the comments here and, for some context, the most recent ones on 21420. It appears there is just an initialization problem here.

Comment by Zhenyu Xu [ 06/Nov/10 ]

Found the root cause: mdt_counter_incr() should act on obd->md_stats instead of obd->obd_stats. The former records md ops, while the latter records obd ops (such as connect and disconnect).
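The effect of incrementing the wrong stats set can be illustrated with a toy model. This is not Lustre code: `StatsSet`, `ToyObd`, and the two `counter_incr_*` functions are hypothetical stand-ins, and only the field names `obd_stats` and `md_stats` come from the comment above.

```python
class StatsSet:
    """A set of named counters, standing in for one stats file."""

    def __init__(self):
        self.counters = {}

    def incr(self, name):
        self.counters[name] = self.counters.get(name, 0) + 1


class ToyObd:
    """Hypothetical stand-in for an OBD device with two separate stats sets."""

    def __init__(self):
        self.obd_stats = StatsSet()  # obd ops: connect, disconnect, ...
        self.md_stats = StatsSet()   # md ops: open, close, mkdir, ...


def counter_incr_buggy(obd, export_stats, op):
    export_stats.incr(op)
    obd.obd_stats.incr(op)  # wrong set: the md_stats file stays empty


def counter_incr_fixed(obd, export_stats, op):
    export_stats.incr(op)
    obd.md_stats.incr(op)   # the aggregate md op lands in md_stats


def run(incr):
    """Apply three md ops and return (md_stats, per-export stats)."""
    obd, export_stats = ToyObd(), StatsSet()
    for op in ("open", "close", "mkdir"):
        incr(obd, export_stats, op)
    return obd.md_stats.counters, export_stats.counters


print(run(counter_incr_buggy))
print(run(counter_incr_fixed))
```

In the buggy variant the per-export counters are populated while the aggregate md stats stay empty, which matches the symptom reported in the chat: the stats file was there, just no stats.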

Comment by Zhenyu Xu [ 06/Nov/10 ]

I've tested my patch:
================ w/o patch ==================================================

  # cat /proc/fs/lustre/mdt/lustre-MDT0000/md_stats
    snapshot_time 1289058069.650218 secs.usecs
  # cat /proc/fs/lustre/mdt/lustre-MDT0000/exports/0@lo/stats
    snapshot_time 1289058074.119639 secs.usecs
    open 1 samples [reqs]
    close 1 samples [reqs]
    mkdir 1 samples [reqs]

================ with patch ==================================================

  # cat /proc/fs/lustre/mdt/lustre-MDT0000/md_stats
    snapshot_time 1289057387.663119 secs.usecs
    open 1 samples [reqs]
    close 1 samples [reqs]
    mkdir 1 samples [reqs]
  # cat /proc/fs/lustre/mdt/lustre-MDT0000/exports/0@lo/stats
    snapshot_time 1289057384.833959 secs.usecs
    open 1 samples [reqs]
    close 1 samples [reqs]
    mkdir 1 samples [reqs]
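
The comparison above can be automated with a small check that reports counters present in an export's stats but absent from the aggregate md_stats file. This is a sketch only: the function names are hypothetical, and the sample text is copied from the output above rather than read from /proc.

```python
def counter_names(text):
    """Collect counter names from stats text, ignoring the snapshot_time line."""
    names = set()
    for line in text.splitlines():
        fields = line.split()
        # Counter lines look like: "<name> <count> samples [<unit>]"
        if len(fields) >= 3 and fields[2] == "samples":
            names.add(fields[0])
    return names


def missing_from_aggregate(md_stats_text, export_stats_text):
    """Names counted for an export but absent from the aggregate file."""
    missing = counter_names(export_stats_text) - counter_names(md_stats_text)
    return sorted(missing)


OPS = (
    "open 1 samples [reqs]\n"
    "close 1 samples [reqs]\n"
    "mkdir 1 samples [reqs]\n"
)
MD_WITHOUT_PATCH = "snapshot_time 1289058069.650218 secs.usecs\n"
EXPORT_WITHOUT_PATCH = "snapshot_time 1289058074.119639 secs.usecs\n" + OPS
MD_WITH_PATCH = "snapshot_time 1289057387.663119 secs.usecs\n" + OPS
EXPORT_WITH_PATCH = "snapshot_time 1289057384.833959 secs.usecs\n" + OPS

print(missing_from_aggregate(MD_WITHOUT_PATCH, EXPORT_WITHOUT_PATCH))
print(missing_from_aggregate(MD_WITH_PATCH, EXPORT_WITH_PATCH))
```

An empty result for the patched case means every operation seen in the export is also present in md_stats.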
Comment by Zhenyu Xu [ 06/Nov/10 ]

posted patch for review at http://review.whamcloud.com/#change,124

Comment by Zhenyu Xu [ 08/Nov/10 ]

posted patch in bz 21420.

Comment by Zhenyu Xu [ 12/Nov/10 ]

Patch (https://bugzilla.lustre.org/attachment.cgi?id=32148) has landed.
