[LU-1810] Striping of mount point Created: 31/Aug/12  Updated: 21/Sep/12  Resolved: 21/Sep/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.3
Fix Version/s: None

Type: Task Priority: Minor
Reporter: Douglas Allen Cain (Inactive) Assignee: Cliff White (Inactive)
Resolution: Fixed Votes: 0
Labels: striping
Environment:

rhel6


Attachments: lu-1810.docx (Microsoft Word)
Rank (Obsolete): 10236

 Description   

After running <lfs setstripe -d /data>, we ran the command below to start striping /data:

lfs setstripe -s 0 -c 7 --offset=4

When we execute <lfs getstripe -v /data> it shows:

/data stripe count: 7 stripe_size: 1048576 stripe_offset: -1

Why is it that when we execute <watch lfs df> it still shows only ost0 being written to? We stopped all services that write to this directory and unmounted /data, but after remounting /data and restarting the services, it still shows writes going only to ost0. It seems to us that the default striping is still in effect even though we removed it before running the <lfs getstripe>.



 Comments   
Comment by Peter Jones [ 31/Aug/12 ]

Cliff

Could you please help with this one?

Thanks

Peter

Comment by Cliff White (Inactive) [ 31/Aug/12 ]

How large are the files you are creating?
If you run lfs getstripe <filename>, what do you get?

Remember, a change in striping policy only affects files created after the change.
Existing files are not re-striped.
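
A quick way to see this in action (a minimal sketch; the directory and file names are illustrative, assuming a client mount at /data):

# mkdir /data/demo
# lfs setstripe -c 7 /data/demo       - new default layout for files created here
# touch /data/demo/newfile            - created after the policy change
# lfs getstripe /data/demo/newfile    - should report lmm_stripe_count: 7

Files that existed before the setstripe keep their original layout.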

Comment by Cain, Douglas CTR (US) [ 04/Sep/12 ]

Cliff,

I remember reading that, but what I don't understand is: when we upgraded to
v2.2.0, I mounted the file system and then ran setstripe. Once I started
the service, all OSTs started striping. I did the same thing with v2.1.3, but
only one OST is striping. So, in order to get striping to work, will I need
to remove all files under /data, run lfs setstripe -d /data, and then run the
new setstripe?

To answer your question ("How large are the files you are creating?"): after
running lfs getstripe, we receive:
lmm_stripe_count: 1
lmm_stripe_size: 1048576
lmm_stripe_offset: 0
obdidx objid objid group
0 497369 0x796d9 0

V/r,
Douglas Cain
Cyber Systems Administrator, CENTAUR Operations
Northrop Grumman Information Systems
DISA PEO-MA
COM 850-452-3560
DSN 312-922-3560

Comment by George Jackson (Inactive) [ 04/Sep/12 ]

Cliff,

Actually, in reviewing the data being sent to the filesystem, the files written are anywhere from 250K to 10M in size.

Thanks,
George

Comment by Cliff White (Inactive) [ 04/Sep/12 ]

As you can see from the lfs getstripe output (lmm_stripe_count: 1), the file you examined was created with a stripe count of 1.
Was this file created after the change in striping policy or before?
Please create a new file in the directory with the new striping policy and run lfs getstripe on that file.

Striping policy changes only apply to files created after the striping policy was set, as file striping is fixed at file creation time.
If you wish to re-stripe an existing file, copy it to a new filename and rename it back - the copy is created with the new layout. So,
# cp myfile tmpname; mv tmpname myfile
will re-stripe any file. (Note: copying back over the existing file would truncate it in place and keep its old layout, so use mv for the second step.)
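
As a hedged illustration of that copy-and-rename pattern (filenames are made up; make sure nothing is writing to the file while you do this):

# lfs getstripe myfile        - shows the old layout
# cp myfile myfile.tmp        - the copy is created with the directory's current default layout
# mv myfile.tmp myfile        - the rename replaces the old file with the restriped copy
# lfs getstripe myfile        - should now show the new layout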

If you run lfs getstripe on the directory, what is the result?

Also, lfs df is not especially useful for instantaneous measurement of performance.
If you want detailed results, you can look at rpc_stats on the client or brw_stats on a server.
On a client, for example for OST0001:

# lctl get_param osc.lustre-OST0001*.rpc_stats

If you wish to check IO on an OST, from a command prompt on the OSS, run:

# lctl get_param obdfilter.*.brw_stats

That provides a much more accurate picture. However, in this case, I think the issue is with the stripe settings for the directory;
lfs getstripe on the directory should show you the issue.
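
One possible way to use those counters around a test write (a sketch only; the '=clear' reset is an assumption to verify on your release):

# lctl set_param osc.*.rpc_stats=clear     - reset the per-OSC RPC histograms (assumption: writing 'clear' resets them)
# dd if=/dev/zero of=/data/big/probe bs=1M count=64 conv=fsync
# lctl get_param osc.*.rpc_stats           - every OST in the file's stripe should now show write RPCs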

Comment by George Jackson (Inactive) [ 05/Sep/12 ]

Cliff,

Even after creating a new file once lfs setstripe has been run on a newly created directory, /data/big, we are still unable to stripe. lfs getstripe reports:

/data/big
stripe_count: 7 stripe_size: 1048576 stripe_offset: 4
.
.
.
lmm_stripe_count: 1
lmm_stripe_size: 1048576
lmm_stripe_pattern: 1
lmm_stripe_offset: 0
.
.

The newly created file shows an lmm_stripe_count of 1 as well. Could it be because /data is the mount point of the filesystem and needs to be reformatted to include a stripe count of 7? Also, at what size does the file start striping? In other words, we ran a 'dd if=/dev/zero of=/data/big/test1.file bs=5000000' but even after 2 GB we still didn't see striping occurring.

If there is any other info you need please let us know. Thanks, George

Comment by Cliff White (Inactive) [ 05/Sep/12 ]

It looks like you have set a fixed stripe_offset on /data/big. That is likely causing your problem.
You should not have stripe_offset = 4. Setting stripe_offset forces allocation to start on
a specific OST. Make sure stripe_offset = -1 (lfs setstripe -i -1).
Please do this:
lfs setstripe -i -1 /data/big
cd /data/big
touch foo
lfs getstripe foo

Attach the complete output of lfs getstripe to the bug.
You do not have to reformat. Stripe policy is per-directory, regardless of the policy at the mount point.

You should really leave the rest of the defaults alone and only set stripe_count (-c). Also, there is no need to erase striping (the -d option);
just set the new policy.
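
For example, the entire recommended setup is just (a minimal sketch, reusing /data/big from above):

# lfs setstripe -c 7 /data/big   - set only the count; stripe_size stays 1048576 and stripe_offset stays -1 (balanced allocation)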

Comment by Cain, Douglas CTR (US) [ 05/Sep/12 ]

Cliff,

Ran the below commands:

rm -rf /data/big
mkdir /data/big
lfs getstripe /data/big - saw stripe_count: 1 stripe_size: 1048576 stripe_offset: -1
(that's fine because we did not set anything yet)

Then ran:
lfs setstripe -c 7 /data/big - we have 7 OSTs
touch test
lfs getstripe /data/big
/data/big
stripe_count: 7 stripe_size: 1048576 stripe_offset: -1
/data/big/test
lmm_stripe_count: 1
lmm_stripe_size: 1048576
lmm_stripe_offset: 0
obdidx objid objid group
0 671056 0xa3d50 0

V/r,
Douglas Cain
Cyber Systems Administrator, CENTAUR Operations
Northrop Grumman Information Systems
DISA PEO-MA
COM 850-452-3560
DSN 312-922-3560

Comment by Cliff White (Inactive) [ 05/Sep/12 ]

That's quite bizarre.
If I do the same thing here on a sample system, I see:

# mkdir foo
# lfs getstripe foo
foo
stripe_count:   1 stripe_size:    1048576 stripe_offset:  -1 
# lfs setstripe -c 7 foo
# lfs getstripe foo
foo
stripe_count:   7 stripe_size:    1048576 stripe_offset:  -1 
# cd foo
# touch bar
# lfs getstripe bar
bar
lmm_stripe_count:   7
lmm_stripe_size:    1048576
lmm_layout_gen:     0
lmm_stripe_offset:  14
        obdidx           objid          objid            group
            14          238461        0x3a37d                0
            12          238461        0x3a37d                0
            15          238429        0x3a35d                0
             5          238366        0x3a31e                0
            18          238494        0x3a39e                0
            19          238366        0x3a31e                0
            17          238367        0x3a31f                0

Do you have some mount options set? Did you set any --mkfsoptions when creating the filesystem?
First, run the command 'script lu-1829' - this will create a typescript file capturing your session.
Then, repeat the striping/file creation as above.
After you are done, exit script with ctrl-D. Attach the output file to this bug.
I need to see exactly what you are doing.

Comment by Cliff White (Inactive) [ 05/Sep/12 ]

Did you actually create the 'test' file in the /data/big directory, or did you 'mv' it there?
I can replicate your results if I do this:

#  touch bob
#  lfs getstripe bob
bob
lmm_stripe_count:   1
lmm_stripe_size:    1048576
lmm_layout_gen:     0
lmm_stripe_offset:  18
        obdidx           objid          objid            group
            18          238434        0x3a362                0

#  mv bob foo
#  lfs getstripe foo/bob
foo/bob
lmm_stripe_count:   1
lmm_stripe_size:    1048576
lmm_layout_gen:     0
lmm_stripe_offset:  18
        obdidx           objid          objid            group
            18          238434        0x3a362                0

A 'mv' does not restripe the file. If I wish to restripe an existing file, this works:

# touch baz 
#  lfs getstripe baz
baz
lmm_stripe_count:   1
lmm_stripe_size:    1048576
lmm_layout_gen:     0
lmm_stripe_offset:  33
        obdidx           objid          objid            group
            33          239920        0x3a930                0

#  cp baz foo/baz
#  lfs getstripe foo/baz
foo/baz
lmm_stripe_count:   7
lmm_stripe_size:    1048576
lmm_layout_gen:     0
lmm_stripe_offset:  4
        obdidx           objid          objid            group
             4          240095        0x3a9df                0
             7          240031        0x3a99f                0
            10          240096        0x3a9e0                0
            11          239743        0x3a87f                0
             9          239746        0x3a882                0
             8          239488        0x3a780                0
            14          240160        0x3aa20                0

Comment by Cain, Douglas CTR (US) [ 05/Sep/12 ]

Cliff,

We are not getting the same results. To answer your question ("Did you
actually create the 'test' file in the /data/big directory, or did you 'mv'
it there?"): I created it in the /data/big directory with the following
commands:

rm -rf /data/big
mkdir /data/big
lfs getstripe /data/big - saw stripe_count: 1 stripe_size: 1048576 stripe_offset: -1
(that's fine because we did not set anything yet)

Then ran:
lfs setstripe -c 7 /data/big - we have 7 osts
touch /data/big/test
lfs getstripe /data/big
/data/big
stripe_count: 7 stripe_size: 1048576 stripe_offset: -1
/data/big/test
lmm_stripe_count: 1
lmm_stripe_size: 1048576
lmm_stripe_offset: 0
obdidx objid objid group
0 671056 0xa3d50 0

We are running version 2.1.3 on RHEL 6. We formatted the OST file systems
with:
mkfs.lustre --ost --fsname=lustre01
--failnode=ip_address@o2ib:ip_address@tcp0
--mgsnode=ip_address@o2ib:ip_address@tcp0 /dev/mapper/mpathXX

Thanks,
Douglas

Comment by Cliff White (Inactive) [ 05/Sep/12 ]

Another possibility from our testing - are you certain all the OSTs are online and up?
If some OSTs are off-line, files will only put stripes on the functioning OSTs.

Comment by Cliff White (Inactive) [ 05/Sep/12 ]

I would take a step back at this point. Is the filesystem otherwise healthy? Any other client errors?

1) Verify that all your OSTs are online and accessible to the Lustre MGS

  • check mounts on the OSS nodes.
  • check syslogs on all server nodes. Report any LustreError messages.

2) Please attach the contents of 'tune2fs -print <your device>' for both the MGS and MDS (if MGS is separate) where <your device> is replaced by the actual /dev/ path.

3) If stripe_count == 1, each new file created should go to a different OST. To verify this, try something like:

mkdir st1
lfs setstripe -c 1 st1
for i in a b c d e f g; do touch st1/$i; done

If you then run 'lfs getstripe st1', each file should have a different obdidx value (and a different lmm_stripe_offset). Please verify this on your system, as sketched below.
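
To tabulate where each file landed, something like this may help (a sketch; if your lfs lacks the 'getstripe -i' shortcut, grep the obdidx lines from the full output instead):

# for i in a b c d e f g; do echo -n "st1/$i starts on OST "; lfs getstripe -i st1/$i; done

With round-robin allocation working, the printed indexes should all differ.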

Comment by Cain, Douglas CTR (US) [ 06/Sep/12 ]

Yes, all OSTs are online.

Comment by Cain, Douglas CTR (US) [ 06/Sep/12 ]

Cliff,

I have verified that all mount points are mounted on our stand-alone MGS and
two MDSs.
I have verified that all OSTs, nine in total, are up and have all mount points
mounted: seven OSTs for one mount point on mdt01 and two for one mount point on
mdt02.

After logging into the MDS and executing lctl dl, I confirmed that the MDS
is talking to the MGS by seeing UP for the mgc and UP for all listed osc, lov,
mdt, and mds devices.

After running your 'for i in ...' loop, I received the same obdidx value = 0 and the
same lmm_stripe_offset = 0 for every file.

After running tune2fs on both the MGS and MDS, it returned:
Setting multiple mount protection update interval to 5 seconds

Thank you,
Douglas

Comment by Cliff White (Inactive) [ 06/Sep/12 ]

Douglas, I asked you specifically for the output of 'tunefs.lustre --print <device>' run on all your MGS, MDS, and OST devices. Please attach that to the bug.

Comment by Cain, Douglas CTR (US) [ 06/Sep/12 ]

Cliff,

I am waiting on my boss so I can get permission due to the area that we work
in.

V/r,
Douglas

Comment by Cain, Douglas CTR (US) [ 06/Sep/12 ]

Cliff,

Still awaiting permission to send you the output file, but I can tell you
that on all OSTs, the MDS, and the MGS we receive:
tune2fs 1.41.90.wc4 (01-Sep-2011)
Setting multiple mount protection update interval to 5 seconds

I have to unmount all mount points; if I do not unmount, then I receive:
tune2fs 1.41.90.wc4 (01-Sep-2011)
tune2fs: MMP: device currently active while trying to open
/dev/mapper/mpathX - where X is the path.

V/r,
Douglas

Comment by Cliff White (Inactive) [ 06/Sep/12 ]

You seem to be misreading my instructions. The command I asked you to run is "tunefs.lustre --print <device>" - the --print option is crucial. Please attach the full, complete output to this bug.

Comment by Cliff White (Inactive) [ 06/Sep/12 ]

And do not unmount the devices; tunefs.lustre --print can be run on a live system.

Comment by Cain, Douglas CTR (US) [ 06/Sep/12 ]

Cliff,

In step 2 of your earlier comment, you asked me to run tune2fs. I will run tunefs.lustre now.

V/r,
Douglas

Comment by Cain, Douglas CTR (US) [ 06/Sep/12 ]

Cliff,

I updated the ticket through the website, but I have been advised not to send
the output from the server itself because it is in a classified area. I
have provided an attachment where I typed out the results from tunefs.lustre.

V/r,
Douglas

Comment by Cliff White (Inactive) [ 06/Sep/12 ]

Thank you for explaining this is a classified site. That explains a few things. What you are attempting is a very, very basic part of Lustre. It's worked for a long time for a lot of people.
The code in that area is mature, hasn't changed in quite a while, and is routinely tested.

I see absolutely nothing unique or unusual in your setup, based on the data you have provided me. In the absence of error messages from a Lustre client or server, or a full script capture of
your actions, I can see no reason why you are having this problem.
Things I can suggest:

  • You appear to have two filesystems, lustre01 with 7 OSTs and lustre02 with 2 OSTs. Do you have clients mounting one or both filesystems? Try your striping tests on both, if possible, and compare results.
  • Check everywhere for errors. Check every system console, every system error log (/var/log/messages, normally). Monitor the logs when you do tests.
  • As I've already mentioned, try creating many files with stripe_count = 1 and verify they are allocated on different OSTs.
  • Try different values for stripe_count - try stripe_count = -1 (all OSTs).
  • Try using stripe_index to force the objects onto specific OSTs (see the sketch below).
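
A sketch of those last two suggestions (directory names illustrative):

# mkdir /data/all; lfs setstripe -c -1 /data/all      - stripe new files across every available OST
# mkdir /data/pin; lfs setstripe -c 1 -i 3 /data/pin  - force new files to start on OST0003
# touch /data/pin/probe; lfs getstripe /data/pin/probe   - obdidx should be 3
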
Comment by Cliff White (Inactive) [ 06/Sep/12 ]

The important thing is that 'lfs getstripe' should return a list of objects when you create a striped file. Focus on getting that to work.

Comment by Cain, Douglas CTR (US) [ 07/Sep/12 ]

Cliff,

You asked if we were receiving any errors. On one of our clients we are
receiving this error:

(file.c:2196:ll_inode_revalidate_fini()) failure -13

Could you please shed some light on this?

V/r,
Douglas

Comment by Cliff White (Inactive) [ 07/Sep/12 ]

The error is EACCES 13 /* Permission denied */. I would need more context from the system log to be able to say more. It's unlikely to have anything to do with the striping issue.

Comment by Cain, Douglas CTR (US) [ 07/Sep/12 ]

Cliff,

Also, on the MDS side it shows an error while communicating with
ip_address@tcp: the ost_connect operation failed with -19. We know that -19
means ENODEV - no such device is available; the server stopped or failed over.
However, I checked, and the server is up and running and hasn't failed over.

Thanks,
Douglas

Comment by Cliff White (Inactive) [ 07/Sep/12 ]

Again, I would need to see the actual error, and some context.

Comment by George Jackson (Inactive) [ 21/Sep/12 ]

Cliff, our striping issue is now resolved. Your comments on this were very helpful, but we found some underlying issues with our initial configuration related to the indexes of the OSTs. We ended up reformatting all MGT, MDT, and OST devices using the correct indexing and are now able to stripe. Thanks again for your help; you may close this issue as resolved.
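
For readers who land here with the same symptom: the underlying fix was re-formatting the targets with consistent indexes. A hedged sketch of what explicit indexing looks like at format time (device paths, NIDs, and fsname are illustrative):

# mkfs.lustre --ost --index=0 --fsname=lustre01 --mgsnode=ip_address@o2ib /dev/mapper/mpatha
# mkfs.lustre --ost --index=1 --fsname=lustre01 --mgsnode=ip_address@o2ib /dev/mapper/mpathb

The --index option pins each target's identity (OST0000, OST0001, ...) so the MGS, MDS, and clients all agree on the OST numbering.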

Comment by Peter Jones [ 21/Sep/12 ]

Thanks for letting us know George!
