[LU-4768] ost-survey hangs on client 2.4 Created: 14/Mar/14 Updated: 09/Dec/14 Resolved: 05/Oct/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.2 |
| Fix Version/s: | Lustre 2.7.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | francesco de giorgi | Assignee: | James Nunez (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
mds, oss: lustre 2.4.2 (CentOS 6.5 with lustre patched kernel RPM) |
||
| Issue Links: |
|
||
| Epic/Theme: | test |
| Severity: | 3 |
| Rank (Obsolete): | 13113 |
| Description |
|
ost-survey hangs on this environment (mds, oss: lustre 2.4.2, CentOS 6.5 with lustre patched kernel RPM).

Command line: ost-survey -s 100 /lustre

The script hangs, chewing all the CPU. On a lustre 2.1.6 client the output of max_cached_mb appears in one form, but on a 2.4.2 client it is different. So probably the script has not been fixed to work with the new output.
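The kind of format change being described looks roughly like the following (illustrative only; the device name and values are made up, and exact field names can vary by client version):

# on a lustre 2.1.x client (illustrative):
lctl get_param llite.*.max_cached_mb
llite.lustre-ffff880123456789.max_cached_mb=32000

# on a lustre 2.4.x client the same query returns a multi-line report,
# which a script expecting a single number will not parse cleanly:
lctl get_param llite.*.max_cached_mb
llite.lustre-ffff880123456789.max_cached_mb=
users: 5
max_cached_mb: 32000
used_mb: 0
unused_mb: 32000
reclaim_count: 0
|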
| Comments |
| Comment by Andreas Dilger [ 17/Mar/14 ] |
|
Two things are wrong here:
|
| Comment by Scott Nolin [ 07/Aug/14 ] |
|
I'm pretty sure this affects the 2.5 client also, but I don't have one up right now to examine.

Anyway, I'd just add that while it's a minor problem for things like initial setup testing, it can become a big problem when trying to debug possibly bad filesystems. If you forget to remove or modify ost-survey and start testing from a client, that client hangs. If you're not someone aware of the problem, it can cause a bit of a sysadmin freak-out.

ost-survey has other problems too. For example:

It looks to me like it assumes you only have one filesystem mounted, so when it does things like count OSTs, it looks at:

It uses the long-deprecated positional parameters for setstripe, so it spews a lot of errors (see the example below).

I think this tool really does need to be fixed or just removed. If it's too low a priority to get a proper fix and review, then just not distributing it would be preferable.

Scott
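To illustrate the setstripe point, here is a sketch of the two syntaxes (the file name and values are made up; this is not what ost-survey literally runs):

# deprecated positional form: file, stripe size, starting OST index, stripe count
lfs setstripe /lustre/ost_survey_testfile 1048576 0 1

# equivalent option-based form that current lfs expects
lfs setstripe -S 1M -i 0 -c 1 /lustre/ost_survey_testfile
|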
| Comment by James Nunez (Inactive) [ 17/Sep/14 ] |
|
Proposed master patch at: http://review.whamcloud.com/11971

This patch only changes ost-survey so that it can run; it does not modify ll_max_cached_mb_seq_write(). The suggestion was to have ost-survey set max_cached_mb to 256 pages. The patch sets max_cached_mb to 256 pages expressed in MB, i.e. 256 * pagesize / (1024 * 1024), but this may make the read numbers overly generous because some reads are served from cache. For example, on the system I was testing this patch on, the page size is 2621440, so max_cached_mb is set to 640.

Here are the results for max_cached_mb = 2 and max_cached_mb = 640:

With max_cached_mb = 2:

Ost#  Read(MB/s)  Write(MB/s)  Read-time  Write-time
----------------------------------------------------
0     258.701     125.005      3.865      8.000
1     341.453     107.399      2.929      9.311
2     340.931     117.782      2.933      8.490
3     277.857     105.958      3.599      9.438

With max_cached_mb = 640:

Ost#  Read(MB/s)  Write(MB/s)  Read-time  Write-time
----------------------------------------------------
0     1227.975    128.612      0.814      7.775
1     1109.066    108.828      0.902      9.189
2     1069.722    127.521      0.935      7.842
3     1050.668    107.104      0.952      9.337

Write performance is about the same in each case, but read performance is much higher with max_cached_mb = 640. Do we really want max_cached_mb to be 256 pages?
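A minimal shell sketch of the calculation described above (assuming getconf and lctl are available on the client; this is not the patch itself):

# compute 256 pages expressed in MB and apply it to the llite cache limit
pagesize=$(getconf PAGESIZE)
cache_mb=$(( 256 * pagesize / (1024 * 1024) ))
lctl set_param llite.*.max_cached_mb=$cache_mb

With a 4096-byte page size this evaluates to 1 MB, while the 2621440 page size mentioned above yields 640.
|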
| Comment by James Nunez (Inactive) [ 17/Sep/14 ] |
|
The patch for |
| Comment by James A Simmons [ 18/Sep/14 ] |
|
Yes the patch for |
| Comment by James Nunez (Inactive) [ 18/Sep/14 ] |
|
I've updated the patch to correctly get the page size and, as Andreas requested, modified ll_max_cached_mb_seq_write() so that the requested number of pages is set to the max of the requested pages and PTLRPC_MAX_BRW_PAGES. |
| Comment by Peter Jones [ 05/Oct/14 ] |
|
Landed for 2.7 |