[LU-1632] FID sequence numbers not working properly with filesystems formatted using 1.8? Created: 13/Jul/12  Updated: 07/Nov/13  Resolved: 21/Dec/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.3.0, Lustre 2.4.0, Lustre 2.1.2
Fix Version/s: Lustre 2.4.0, Lustre 2.1.4

Type: Bug Priority: Blocker
Reporter: jason.rappleye@nasa.gov (Inactive) Assignee: Di Wang
Resolution: Fixed Votes: 0
Labels: fid
Environment:

Lustre 2.1.2


Attachments: File dk.seq.debug.gz     Text File files.txt    
Issue Links:
Related
is related to LU-3318 mdc_set_lock_data() ASSERTION( old_in... Resolved
Severity: 3
Rank (Obsolete): 4004

 Description   

On a 2.1 filesystem - servers and clients both running 2.1.2, and the filesystem created with Lustre 2.1.x - many files are created on the same client using the same FID sequence number. That's how I expect it to work.

With servers and clients running 2.1.2, but with a filesystem that was originally created with Lustre 1.8, only one file is created per sequence number before the client requests another one from the MDS. For example, for several consecutive files created on the same client, I get FIDs like this:

[0x33c0e251df8:0x1:0x0]
[0x33c0e254b37:0x1:0x0]
[0x33c0e257876:0x1:0x0]
[0x33c0e25a5b5:0x1:0x0]
[0x33c0e25d2f4:0x1:0x0]
[0x33c0e260033:0x1:0x0]
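
(For reference, FIDs like the above can be listed with a loop along these lines; the directory is just a placeholder:)

$ for i in {1..6}; do touch /mnt/nbp6/test/f-${i}; lfs path2fid /mnt/nbp6/test/f-${i}; done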

lctl get_param 'seq.cli-srv-nbp*.*' shows a space of [0x0 - 0x0]:0:0 for the filesystems that were formatted under Lustre 1.8.

Is this the way it's supposed to work?

Thanks,
Jason



 Comments   
Comment by Andreas Dilger [ 13/Jul/12 ]

Definitely seems unusual, and not what should be happening.

Comment by Di Wang [ 13/Jul/12 ]

Hmm, definitely not right once you are using a 2.x client. Are both the client and server little-endian nodes? Could you please collect a -1 debug log on the client side when you create these files?

Comment by jason.rappleye@nasa.gov (Inactive) [ 13/Jul/12 ]

files.txt: ls -l including fid for each file
dk.seq.debug.gz: logs collected while creating each file, with all debug flags enabled

Comment by jason.rappleye@nasa.gov (Inactive) [ 13/Jul/12 ]

Interesting - this only happens when touching a file, e.g.

$ for i in {1..100}; do touch foo-${i}; done

If I write data to the file, it behaves as expected:

$ for i in {1..100}; do echo foo > foo-${i}; done
$ for i in $(ls foo*); do echo -n "$(ls -l $i) "; lfs path2fid $i; done

-rw-r--r-- 1 jrappley cstaff 4 Jul 13 18:04 foo-1 [0x33c5f3bb04a:0x1:0x0]
-rw-r--r-- 1 jrappley cstaff 4 Jul 13 18:04 foo-10 [0x33c5f3bb04a:0xa:0x0]
-rw-r--r-- 1 jrappley cstaff 4 Jul 13 18:04 foo-100 [0x33c5f3bb04a:0x64:0x0]
-rw-r--r-- 1 jrappley cstaff 4 Jul 13 18:04 foo-11 [0x33c5f3bb04a:0xb:0x0]
-rw-r--r-- 1 jrappley cstaff 4 Jul 13 18:04 foo-12 [0x33c5f3bb04a:0xc:0x0]
-rw-r--r-- 1 jrappley cstaff 4 Jul 13 18:04 foo-13 [0x33c5f3bb04a:0xd:0x0]
...

Still smells like a bug, though.

Comment by Andreas Dilger [ 13/Jul/12 ]

Definitely shouldn't be happening this way. This causes many orders of magnitude (10^5) too many sequences to be allocated by the MDS. While it isn't fatal, it isn't what we expect and would cause some long-term overhead on the clients to have to fetch so many new FLDB entries.

Comment by Di Wang [ 14/Jul/12 ]

Hmm, it seems very strange. Did you do an umount/mount between the touch and echo tests? Did you do that on the same client? It seems the seq width of this client, which is exposed under /proc, was somehow reset to 0. Could you please try this:

lctl get_param seq.*.width

And post the result here. Thanks.

Comment by jason.rappleye@nasa.gov (Inactive) [ 16/Jul/12 ]

No umount/mount, and the client didn't disconnect/reconnect to the MDS, either.

Width is fine, but space isn't what I'd expect:

$ lctl get_param 'seq.cli-srv-nbp*.*'
seq.cli-srv-nbp6-MDT0000-mdc-ffff88040990cc00.fid=[0x0:0x0:0x0]
seq.cli-srv-nbp6-MDT0000-mdc-ffff88040990cc00.server=nbp6-MDT0000_UUID
seq.cli-srv-nbp6-MDT0000-mdc-ffff88040990cc00.space=[0x0 - 0x0]:0:0
seq.cli-srv-nbp6-MDT0000-mdc-ffff88040990cc00.width=131072

Note that I can reproduce this on multiple filesystems and from different clients.

We have six production filesystems; one was created with 2.1.x, and the rest were 1.8 before upgrading to 2.1. We did not use the Xyratex migration patch.

Since upgrading from 1.8 to 2.1.1, and now 2.1.2, we've had several incidents of high load average on our MDSes that are apparently due to a large number of SEQ_QUERY RPCs. They might be related to this issue. We see many mdss threads with this stack trace:

[<ffffffffa0577a13>] ? cfs_alloc+0x63/0x90 [libcfs]
[<ffffffff815221f5>] schedule_timeout+0x215/0x2e0
[<ffffffffa079b6c4>] ? sptlrpc_svc_alloc_rs+0x74/0x2d0 [ptlrpc]
[<ffffffffa076d2b4>] ? lustre_msg_add_version+0x94/0x110 [ptlrpc]
[<ffffffff81523112>] __down+0x72/0xb0
[<ffffffff81095e11>] down+0x41/0x50
[<ffffffffa08a2531>] seq_server_alloc_meta+0x41/0x720 [fid]
[<ffffffffa063c830>] ? lustre_swab_lu_seq_range+0x0/0x30 [obdclass]
[<ffffffffa08a2fc8>] seq_query+0x3b8/0x680 [fid]
[<ffffffffa076c004>] ? lustre_msg_get_opc+0x94/0x100 [ptlrpc]
[<ffffffffa0bdfc65>] mdt_handle_common+0x8d5/0x1810 [mdt]
[<ffffffffa076c004>] ? lustre_msg_get_opc+0x94/0x100 [ptlrpc]
[<ffffffffa0be0c15>] mdt_mdss_handle+0x15/0x20 [mdt]

and one like this:

[<ffffffff810902de>] ? prepare_to_wait+0x4e/0x80
[<ffffffffa0a79785>] jbd2_log_wait_commit+0xc5/0x140 [jbd2]
[<ffffffff8108fff0>] ? autoremove_wake_function+0x0/0x40
[<ffffffffa0a79836>] ? __jbd2_log_start_commit+0x36/0x40 [jbd2]
[<ffffffffa0a71b4b>] jbd2_journal_stop+0x2cb/0x320 [jbd2]
[<ffffffffa0aca048>] __ldiskfs_journal_stop+0x68/0xa0 [ldiskfs]
[<ffffffffa0c448f8>] osd_trans_stop+0xb8/0x290 [osd_ldiskfs]
[<ffffffffa08a3b06>] ? seq_store_write+0xc6/0x2b0 [fid]
[<ffffffffa08a3867>] seq_store_trans_stop+0x57/0xe0 [fid]
[<ffffffffa08a3d8c>] seq_store_update+0x9c/0x1e0 [fid]
[<ffffffffa08a299a>] seq_server_alloc_meta+0x4aa/0x720 [fid]
[<ffffffffa063c830>] ? lustre_swab_lu_seq_range+0x0/0x30 [obdclass]
[<ffffffffa08a2fc8>] seq_query+0x3b8/0x680 [fid]
[<ffffffffa076c004>] ? lustre_msg_get_opc+0x94/0x100 [ptlrpc]
[<ffffffffa0bdfc65>] mdt_handle_common+0x8d5/0x1810 [mdt]
[<ffffffffa076c004>] ? lustre_msg_get_opc+0x94/0x100 [ptlrpc]
[<ffffffffa0be0c15>] mdt_mdss_handle+0x15/0x20 [mdt]

I haven't looked into how the sequence allocation works on the server side, but my first guess is that we're bound by the time it takes to commit the latest (sequence number, width) to disk (in the OI)? Of course, if we had fewer SEQ_QUERY RPCs being issued by clients, there might not be a problem!

I'm going to use SystemTap to see if I can understand what's going on. I'll report back what I find.

Comment by Di Wang [ 16/Jul/12 ]

Ah, I know where the problem is. You did not erase the config log when you upgraded from 1.8 to 2.1, right? So the problem is:

void ll_delete_inode(struct inode *inode)
{
        struct ll_sb_info *sbi = ll_i2sbi(inode);
        int rc;
        ENTRY;

        rc = obd_fid_delete(sbi->ll_md_exp, ll_inode2fid(inode));
        if (rc)
                CERROR("fid_delete() failed, rc %d\n", rc);

It will call obd_fid_delete to delete the LMV object, but since you did not erase the config log, the LMV layer was never created at all. So the call goes to the MDC layer directly, which does something obsolete there that is completely wrong in the 2.1 structure. I will cook a fix right now.
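
(One way to confirm this from the MGS might be to dump the client config log and look for lmv records; the log name below follows the nbp6 example above and is an assumption:)

$ lctl --device MGS llog_print nbp6-client
# the log for the filesystem formatted with 2.1 should contain lmv
# attach/setup records; the logs for the upgraded filesystems presumably do not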

Comment by jason.rappleye@nasa.gov (Inactive) [ 16/Jul/12 ]

My investigation with SystemTap so far has shown that every time seq_client_alloc_fid is called, the seq parameter is zeroed out. Some time after the call to seq_client_alloc_fid, mdc_fid_delete is called. It calls seq_client_flush, which zeros out the FID.

On the filesystem that was formatted with Lustre 2.1, this call sequence does not happen.
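
For reference, probes along these lines are enough to show that call path (a hypothetical sketch, not the exact script used; it assumes debuginfo for the mdc and fid modules is available):

$ stap -e 'probe module("mdc").function("mdc_fid_delete"),
           module("fid").function("seq_client_flush")
           { printf("%s\n", pp()); print_backtrace(); }'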

Comment by Di Wang [ 16/Jul/12 ]

Btw: it seems like a pretty serious problem, since it will cause one extra RPC for every create.

Comment by jason.rappleye@nasa.gov (Inactive) [ 16/Jul/12 ]

Yup, I see that in the `lctl dl` output - on the clients, the filesystem that's OK has an lmv device, while the rest don't.
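
(For reference, a quick check on a client is something along these lines; the device listing will differ per site:)

$ lctl dl | grep lmv
# only the filesystem formatted with 2.1 shows an lmv device on the client;
# the filesystems upgraded from 1.8 show only their mdc devices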

Will unmounting the MGS, removing the client config log, and remounting the MGS cause the client config log to be regenerated, presumably allowing new client mounts to pick up the correct config log? Or do we need to do the usual `tunefs.lustre --writeconf` procedure? Though that would only be slightly less intrusive for us than a client upgrade.

We've definitely seen operational issues due to the extra RPCs - see my previous comment with stack traces. We've seen load averages of ~500 while all of the mdss threads are processing requests.

Comment by jason.rappleye@nasa.gov (Inactive) [ 16/Jul/12 ]

Also, I don't see anything in the manual regarding deleting the config logs as part of the 1.8 -> 2.x upgrade procedure.

Comment by Di Wang [ 16/Jul/12 ]

No, you should not need to erase the config log, indeed. But this problem will only happen if you do not erase the log, so it is a bug. Deleting only the client config log probably would not work right now, and you probably need the tunefs.lustre --writeconf procedure here.
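
(For reference, the usual writeconf procedure is roughly the following; the mount points and device paths are placeholders, and the config logs are regenerated when the targets are next mounted:)

# unmount every client, then the MDT, then every OST
client# umount /mnt/nbp6
mds# umount /mnt/nbp6-mdt
oss# umount /mnt/nbp6-ost0
# regenerate the configuration logs
mds# tunefs.lustre --writeconf /dev/mdtdev
oss# tunefs.lustre --writeconf /dev/ostdev      # on every OSS, for every OST
# remount in order: MGS/MDT first, then the OSTs, then the clients
mds# mount -t lustre /dev/mdtdev /mnt/nbp6-mdt
oss# mount -t lustre /dev/ostdev /mnt/nbp6-ost0
client# mount -t lustre <mgsnid>:/nbp6 /mnt/nbp6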

Comment by Di Wang [ 16/Jul/12 ]

http://review.whamcloud.com/#change,3422 Here is the fix based on b2_1.

Comment by jason.rappleye@nasa.gov (Inactive) [ 17/Jul/12 ]

Would it be sufficient to use this patch (after it's been reviewed, of course) to fix this problem, at least until we can take the filesystem down to regenerate the client config logs? Might there be other unintended consequences of not having the lmv layer in place?

I ask because it's relatively easy to update the Lustre clients, versus taking each filesystem down.

Comment by Andreas Dilger [ 17/Jul/12 ]

Jason, Di can answer authoritatively, but I believe the fix on the client should be enough to resolve the problem. The LMV layer is only needed on the client when DNE is enabled on the server. This means you have at least until 2.4 to regenerate the config.

Comment by Di Wang [ 17/Jul/12 ]

Yes, this patch, which is only on the client side, should be enough to fix the problem. As Andreas said, the LMV layer is only needed when you enable DNE on your system, which will not happen until 2.4.

Comment by Jay Lan (Inactive) [ 17/Jul/12 ]

What is DNE?
Also, on the filesystems where we did not erase the config log when we upgraded, there is no lmv device. With this client-side patch we will still not see the lmv device, but it will work correctly. Is my understanding correct? Thanks!

Comment by Di Wang [ 17/Jul/12 ]

DNE means Distributed NamespacE, with which you can have multiple metadata targets; see http://wiki.whamcloud.com/display/PUB/Remote+Directories+Solution+Architecture.

Yes, if you apply this patch on the client side, you do not need to erase the config log. It will work correctly even though there is no lmv device.

Comment by Jay Lan (Inactive) [ 05/Nov/12 ]

I tried to forward-port the 2.1 patch to 2.3, but it seems some of the routines involved have changed.

Does the 2.3 client still need the patch?

Comment by Di Wang [ 05/Nov/12 ]

Yes, the 2.3 client needs this patch. Hmm, the patch seems to have landed only on 2.1 right now.

Comment by Di Wang [ 12/Nov/12 ]

The patch has been landed to 2.1, and I will make a patch for 2.4 soon.

Comment by Di Wang [ 16/Nov/12 ]

http://review.whamcloud.com/#change,4606 patch for current master.

Comment by Jodi Levi (Inactive) [ 21/Dec/12 ]

Landed to Master
