async update cross-MDTs (LU-3534)

[LU-3538] commit on share for cross-MDT operation. Created: 29/Jun/13  Updated: 28/Jan/16  Resolved: 28/Jan/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.8.0

Type: Technical task Priority: Major
Reporter: Di Wang Assignee: Lai Siyao
Resolution: Fixed Votes: 0
Labels: dne2

Issue Links:
Related
is related to LU-6831 The ticket for tracking all DNE2 bugs Reopened
is related to LU-7546 conf-sanity conf-sanity: lod_device_f... Resolved
Rank (Obsolete): 8902

 Description   

During recovery, if the replay of one update fails, all related updates may fail in the subsequent replay process. For example, client1 creates a remote directory on MDT1 whose name entry is on MDT0, and other clients then create files under this remote directory on MDT1. Suppose MDT0 fails before the name entry insertion has been committed to disk. If recovery then fails for some reason, i.e. the directory is never connected to the namespace, none of the files under this directory can be accessed. To avoid this, commit on share will be applied to cross-MDT operations: if the MDT finds that the object being updated was modified by a previous cross-MDT operation, that cross-MDT operation must be committed first. So in the example above, the creation of the remote directory must be committed to disk before any files are created under it.
Commit on Share (COS) will be implemented with COS locks, similar to the current local COS implementation. During a cross-MDT operation, all locks on remote objects (remote locks) will be held on the master MDT, and all remote locks will be COS locks. If these COS locks are revoked, the master MDT will sync not only itself but also the remote MDTs.
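The rule described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: ToyMDT, ToyCluster, and all method names are hypothetical and do not correspond to Lustre code. Lock revocation is reduced to a plain dictionary lookup, and "sync" simply moves in-memory updates to a committed list.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMDT:
    """Hypothetical in-memory stand-in for one MDT (not a Lustre structure)."""
    name: str
    uncommitted: list = field(default_factory=list)  # updates only in memory
    committed: list = field(default_factory=list)    # updates "on disk"

    def sync(self):
        """Move every pending update to the committed list."""
        self.committed.extend(self.uncommitted)
        self.uncommitted.clear()


class ToyCluster:
    """Toy model of commit on share across several MDTs."""

    def __init__(self, names):
        self.mdts = {n: ToyMDT(n) for n in names}
        # object -> names of MDTs touched by the last cross-MDT update of it
        self.pending = {}

    def commit_on_share(self, obj):
        """If obj was modified by a cross-MDT update that this model has not
        yet synced, sync every MDT that took part in that update."""
        for name in self.pending.pop(obj, ()):
            self.mdts[name].sync()

    def cross_mdt_update(self, obj, updates):
        """Apply one update per MDT and remember that obj depends on them."""
        self.commit_on_share(obj)
        for name, update in updates.items():
            self.mdts[name].uncommitted.append(update)
        self.pending[obj] = set(updates)

    def local_update(self, mdt_name, obj, update):
        """Update obj on a single MDT, committing earlier shared work first."""
        self.commit_on_share(obj)
        self.mdts[mdt_name].uncommitted.append(update)


def demo():
    """Replay the remote-directory example from the description."""
    cluster = ToyCluster(["MDT0", "MDT1"])
    # client1 creates a remote directory: name entry on MDT0, object on MDT1
    cluster.cross_mdt_update(
        "remote_dir",
        {"MDT0": "insert name entry", "MDT1": "create directory object"},
    )
    # another client creates a file under the remote directory on MDT1
    cluster.local_update("MDT1", "remote_dir", "create file under remote_dir")
    return cluster
```

In this model, by the time the file creation is recorded on MDT1, both halves of the directory creation have already been moved to the committed lists on MDT0 and MDT1, which is the ordering the description asks for.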



 Comments   
Comment by Di Wang [ 19/Sep/14 ]

http://review.whamcloud.com/#/c/8356/ This is the patch I did a year ago. It may not work anymore, but it may give you some ideas.

Comment by Lai Siyao [ 10/Nov/14 ]

Patch is on http://review.whamcloud.com/#/c/12530/

Comment by Gerrit Updater [ 19/Dec/14 ]

Lai Siyao (lai.siyao@intel.com) uploaded a new patch: http://review.whamcloud.com/13139
Subject: LU-3538 ldlm: add strict COS lock
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c53ec90f63f2cb6c91de03c71de521d2a956ae3d

Comment by Gerrit Updater [ 04/Aug/15 ]

Lai Siyao (lai.siyao@intel.com) uploaded a new patch: http://review.whamcloud.com/15844
Subject: LU-3538 dne: enable CoS for DNE by default
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 51acbd23b5fddc34761178f94618409f204a6313

Comment by Gerrit Updater [ 31/Aug/15 ]

Lai Siyao (lai.siyao@intel.com) uploaded a new patch: http://review.whamcloud.com/16140
Subject: LU-3538 cos: combo patch for DNE COS support
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 06d51df4e17491db55ddb0c6843f0af2a71d4e4a

Comment by James A Simmons [ 17/Sep/15 ]

Is this targeted for 2.8 or 2.9?

Comment by Di Wang [ 17/Sep/15 ]

Hmm, the patch seems OK, and right now we want to see how badly it impacts performance before landing it. But I never got time to run the performance test with the patch.

Richard: Could you please find someone to run a performance test comparing this build https://build.hpdd.intel.com/job/lustre-reviews/34289/ (build with COS) and https://build.hpdd.intel.com/job/lustre-master/3187/ (current master)?

Comment by Di Wang [ 23/Sep/15 ]

I just ran a few tests on OpenSFS to see if COS impacts performance.

8 clients, 4 MDSs (8 MDTs) and 2 OSSs (4 OSTs)

According to the results, performance is not impacted.

mpirun -np 64 -machinefile /home/di.wang/machine_file /usr/lib64/lustre/tests/mdsrate --mknod --nfiles 1048576 --dir /mnt/lustre/tests --filefmt 'f%%d'
mpirun -np 64 -machinefile /home/di.wang/machine_file /usr/lib64/lustre/tests/mdsrate --unlink --nfiles 1048576 --dir /mnt/lustre/tests --filefmt 'f%%d'

With the patch

Rate: 23998.42 eff 23997.34 aggr 374.96 avg client mknods/sec (total: 64 threads 1048576 mknods 1 dirs 64 threads/dir 43.69 secs)
0: c01 finished at Tue Sep 22 20:30:24 2015

Rate: 7637.39 eff 7637.19 aggr 119.33 avg client unlinks/sec (total: 64 threads 1048576 unlinks 1 dirs 64 threads/dir 137.30 secs)
0: c01 finished at Tue Sep 22 20:33:51 2015

Without the patch

0: c01 starting at Tue Sep 22 22:03:45 2015
Rate: 24108.73 eff 24106.99 aggr 376.67 avg client mknods/sec (total: 64 threads 1048576 mknods 1 dirs 64 threads/dir 43.49 secs)
0: c01 finished at Tue Sep 22 22:04:29 2015

Rate: 7505.92 eff 7505.87 aggr 117.28 avg client unlinks/sec (total: 64 threads 1048576 unlinks 1 dirs 64 threads/dir 139.70 secs)
0: c01 finished at Tue Sep 22 22:07:54 2015
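
To put the two sets of results side by side, the relative difference can be computed from the first figure on each Rate line (taken here to be the effective rate in operations per second). This is a minimal sketch; the dictionary layout and the function name relative_change_pct are illustrative and not part of mdsrate. The ticket shows one run per configuration, so differences this small may simply be run-to-run variation.

```python
# First figure from each "Rate:" line above, in operations per second.
rates = {
    "mknod": {"with_cos": 23998.42, "without_cos": 24108.73},
    "unlink": {"with_cos": 7637.39, "without_cos": 7505.92},
}


def relative_change_pct(with_cos, without_cos):
    """Percentage change of the COS build relative to current master."""
    return (with_cos - without_cos) / without_cos * 100.0


for op, r in rates.items():
    print(f"{op}: {relative_change_pct(**r):+.2f}%")
# prints:
# mknod: -0.46%
# unlink: +1.75%
```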

Comment by Andreas Dilger [ 28/Sep/15 ]

The patches for this ticket need to be refreshed for landing on master for 2.8.0.

Comment by Gerrit Updater [ 29/Oct/15 ]

Lai Siyao (lai.siyao@intel.com) uploaded a new patch: http://review.whamcloud.com/16984
Subject: LU-3538 cos: CoS takes effect for cross-MDT op only now
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: db7a784fcb8e594896d4f624d409e6acfb700308

Comment by Gerrit Updater [ 20/Nov/15 ]

Lai Siyao (lai.siyao@intel.com) uploaded a new patch: http://review.whamcloud.com/17304
Subject: LU-3538 cos: reuse cos lock if compatible
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 825bf4d3c71d9c3bcd583db15e834b7839a7f258

Comment by Andreas Dilger [ 04/Dec/15 ]

Lai, can you please post the performance results in this ticket?

Comment by Gerrit Updater [ 28/Jan/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12530/
Subject: LU-3538 dne: Commit-on-Sharing for DNE
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: b50bb830f92e87da9bfdc84d14e4f3f78c80c9ac

Comment by Joseph Gmitter (Inactive) [ 28/Jan/16 ]

Landed for 2.8
