[LU-3534] async update cross-MDTs Created: 29/Jun/13  Updated: 14/Jun/18  Resolved: 28/Jan/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.8.0

Type: New Feature
Priority: Minor
Reporter: Di Wang
Assignee: Di Wang
Resolution: Fixed
Votes: 0
Labels: None

Issue Links:
Duplicate
is duplicated by LU-5302 Test failure sanity-lfsck test_13: md... Resolved
Related
is related to LU-5222 Some operation should never happened ... Open
is related to LU-6663 DNE2 directories has very very bad pe... Resolved
is related to LU-6262 replay-single test_101: Oops at out_t... Resolved
is related to LU-6297 Move rename is_subdir check from MDD ... Open
is related to LU-6380 OI scrub should be able to scan the u... Resolved
is related to LU-5571 Test failure sanity-lfsck test_13: (2... Resolved
is related to LU-7102 replay-dual test_26: FAIL: set defaul... Resolved
is related to LU-6287 sanity test 17n ASSERTION( atomic_rea... Resolved
is related to LU-6288 conf-sanity test_2: test failed to re... Resolved
is related to LU-6289 sanity-scrub test_1c: unable to handl... Resolved
is related to LU-6290 sanity-lfsck test_14: unable to handl... Resolved
is related to LU-6291 conf-sanity test_41a: failed to respo... Resolved
is related to LU-6292 replay-single test_101: osd_trans_exe... Resolved
is related to LU-6293 runtests test_1: panic on dbuf_dirty ... Resolved
is related to LU-6294 sanity-scrub test_1a: test failed to ... Resolved
is related to LU-6295 sanity-lfsck test_4: oom on MDT0 Resolved
is related to LU-6296 insanity test_1: check_for_recovery_r... Resolved
is related to LU-6328 sanity-lfsck test_14:unexpected size Resolved
is related to LU-6329 replay-single test_101: kernel panic ... Resolved
is related to LU-6330 sanity test_17n:migrate failed -1 Resolved
is related to LU-4583 Test failure on test suite replay-sin... Closed
is related to LU-6362 DNE2: add dt object and thandle check... Closed
is related to LU-3541 add sanity tests for async updates be... Resolved
is related to LU-4837 DNE 2 async update cross-MDTs Test Plan Closed
Sub-Tasks:
Key      Summary                                   Type            Status    Assignee
LU-3535  Send all of updates of for one operat...  Technical task  Resolved  Di Wang
LU-3536  log updates for cross-MDT operation.      Technical task  Resolved  Di Wang
LU-3537  allow cross-MDT for all metadata oper...  Technical task  Resolved  Di Wang
LU-3538  commit on share for cross-MDT operation.  Technical task  Resolved  Lai Siyao
LU-3539  Change update RPC format                  Technical task  Resolved  Li Wei
LU-3540  recovery for cross-MDT operation          Technical task  Resolved  Di Wang
LU-3541  add sanity tests for async updates be...  Technical task  Resolved  Di Wang
LU-4076  Create local FLDB for each non0-MDT, ...  Technical task  Resolved  Di Wang
LU-4837  DNE 2 async update cross-MDTs Test Plan   Technical task  Closed    Di Wang
Severity: 3
Rank (Obsolete): 8898

 Description   

This bug is for tracking async updates between MDTs.



 Comments   
Comment by Gerrit Updater [ 17/Feb/15 ]

wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/13786
Subject: LU-3534 osp: transfer updates with bulk RPC
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 4432ca911c9bbc17fcd4f38ace16c4a931321b18

Comment by Gerrit Updater [ 09/Apr/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/10794/
Subject: LU-3534 osp: move RPC pack from declare to execution phase
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: de8572645d287d17c409b99dabdf176822d91486

Comment by James A Simmons [ 27/May/15 ]

Newbie alert. I managed to create a file system with the latest patches to test DNE2. When I attempted to create a striped directory, I got this error:

[root@ninja06 scratch]# lfs setdirstripe -c 14 jsimmons
error on LL_IOC_LMV_SETSTRIPE 'jsimmons' (3): Inappropriate ioctl for device
error: setdirstripe: create stripe dir 'jsimmons' failed

Did I miss something?

Comment by Di Wang [ 27/May/15 ]

Hmm, which build are you using? Could you get me the -1 debug log? Btw: with the current patches, you can only create a striped directory with a maximum of 16 stripes, and I am working on a patch to resolve this in LU-6602.

Comment by James A Simmons [ 28/May/15 ]

I was attempting to collect all the patches and piece them together, but that didn't work out so well. I grabbed your full DNE2 patch and it seems to work so far. I'm setting up my test system right now to start testing at a smaller scale. I will post my results.

Comment by James A Simmons [ 28/May/15 ]

The first issue I'm seeing is that when I create a directory with a specific MDS stripe count, new directories created under it do not inherit that stripe setting.

Comment by Di Wang [ 28/May/15 ]

Only the default dirstripe is inherited, i.e. only a stripe setting set by "lfs setdirstripe -D xxx" will be inherited. If that is your case, could you please post the command lines here? Thanks.
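
The inheritance rule described in this comment can be pictured with a small toy model. This is only an illustrative sketch, not Lustre code: the `Directory` class, its fields, and `set_default_dirstripe` are hypothetical names invented here, and a stripe count of 1 for a non-inheriting child is an assumption. The thread does not say whether the default itself propagates to deeper levels, so the model covers only one level.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Directory:
    """Toy stand-in for a directory; hypothetical, not a Lustre structure."""
    stripe_count: int = 1                        # stripes of this directory itself
    default_stripe_count: Optional[int] = None   # set only by the "-D" style call
    children: Dict[str, "Directory"] = field(default_factory=dict)

    def set_default_dirstripe(self, count: int) -> None:
        # Models "lfs setdirstripe -D": records a default for new subdirectories.
        self.default_stripe_count = count

    def mkdir(self, name: str) -> "Directory":
        # A new subdirectory looks only at the parent's default, never at the
        # parent's own stripe count (assumed fallback: a single stripe).
        if self.default_stripe_count is not None:
            count = self.default_stripe_count
        else:
            count = 1
        child = Directory(stripe_count=count)
        self.children[name] = child
        return child


# Parent striped without -D, as in "lfs setdirstripe -c 14 jsimmons".
striped = Directory(stripe_count=14)
plain_child = striped.mkdir("a")        # no default set: nothing inherited

striped.set_default_dirstripe(14)       # the "-D" step that was missed
inherited_child = striped.mkdir("b")    # now the default is inherited

print(plain_child.stripe_count, inherited_child.stripe_count)
```

In this model the parent's own stripe count never influences new subdirectories; only the separately recorded default does, which matches the behavior reported above.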

Comment by James A Simmons [ 29/May/15 ]

Sorry, I missed the -D option, which explains why it didn't inherit. User error there. With my DNE2 setup I attempted to start some jobs on our test cluster, and the jobs got stuck for hours attempting to run. So I did some testing to see what was breaking. A simple md5sum on files showed the problem very easily. For normal directories, md5sum on a file came back very fast.

[root@ninja06 johndoe]# date;md5sum ior;date
Fri May 29 10:08:24 EDT 2015
4ba1b26f0a4b71dccb237d3fd25f3b67 ior
Fri May 29 10:08:24 EDT 2015

But for DNE2 directories I saw this:

[root@ninja06 jsimmons]# date;md5sum simul;date
Fri May 29 10:08:38 EDT 2015
9fef8669fb0e6669ac646d69062521d3 simul
Fri May 29 10:09:59 EDT 2015

This is not an issue for stat: an ls on the file simul comes back very fast. I did this test without your DNE2 patch set and the problem still exists. At this point I think I should open a ticket about this.
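
For repeating the comparison above (under a second in the normal directory versus 81 seconds in the DNE2 directory, going by the date output), the read-and-checksum time can be measured directly instead of bracketing md5sum with date. The following is a minimal sketch using only the Python standard library; the function name `timed_md5` and the 1 MiB chunk size are choices made here, not anything from the ticket.

```python
import hashlib
import sys
import time


def timed_md5(path, chunk_size=1 << 20):
    """Return (hex digest, elapsed seconds) for reading and hashing one file."""
    start = time.monotonic()
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in fixed-size chunks so large files are not loaded into memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest(), time.monotonic() - start


if __name__ == "__main__":
    # Usage: pass one file from a normal directory and one from a striped one.
    for path in sys.argv[1:]:
        checksum, elapsed = timed_md5(path)
        print(f"{checksum}  {path}  {elapsed:.3f}s")
```

Because the measured time covers opening and reading the file, a large gap between the two directories points at the data read path rather than at metadata lookups, which is consistent with ls returning quickly.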

Comment by Gerrit Updater [ 01/Jun/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/11909/
Subject: LU-3534 mdt: move last_rcvd obj update to LOD
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: c233f39a015943b3df8ddb555765a81b85b31083

Comment by Gerrit Updater [ 05/Jun/15 ]

wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/15163
Subject: LU-3534 tests: a few tests cases for async update.
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 0394d9938e655b774d86a4eef15408e7c7e01e3f

Comment by James A Simmons [ 08/Jun/15 ]

In my testing I noticed that even after a clean shutdown of the file system, when the file system is restarted all the MDS servers go into recovery with each other. The LWP layers reconnect fine without going into recovery, but the OSP layer always enters recovery mode. Is this the expected behavior?

Comment by Di Wang [ 09/Jun/15 ]

Yes, this is expected. LWP (Light Weight Proxy) is not a replayable client, so it will never go into recovery mode. But OSP is a replayable client, which will always enter recovery mode when required.

Comment by Gerrit Updater [ 11/Jun/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12825/
Subject: LU-3534 osp: send updates by separate thread
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 2fe22edfe3c365b5c270050fdeed0a86fa74a919

Comment by Gerrit Updater [ 14/Jun/15 ]

wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/15275
Subject: LU-3534 tests: Add dne-2.5 upgrade test
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: f17fe3836034a2359f53839cd116d08d640c1c31

Comment by Gerrit Updater [ 16/Jun/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12450/
Subject: LU-3534 update: change sync updates to async updates
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 0136a91b6d629556ef091f5ca210c13772207df9

Comment by Gerrit Updater [ 27/Jun/15 ]

Liang Zhen (liang.zhen@intel.com) uploaded a new patch: http://review.whamcloud.com/15421
Subject: LU-3534 ptlrpc: mbits is sent within ptlrpc_body
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b363b70a0c82bccb0d50026ccee756f72f7d3d3d

Comment by Gerrit Updater [ 29/Jun/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15421/
Subject: LU-3534 ptlrpc: mbits is sent within ptlrpc_body
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: d099fdd6cd15d0d00d9b573da5d3bfd3e4bbcb9d

Comment by Gerrit Updater [ 02/Jul/15 ]

wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/15482
Subject: LU-3534 osp: transfer updates with bulk RPC
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e1e3d405464b7807f0a016aff8c7df9fb174fc47

Comment by Gerrit Updater [ 03/Jul/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13786/
Subject: LU-3534 osp: transfer updates with bulk RPC
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 044fbff00b25f127560338d35caa4d89faa4c207

Comment by Gerrit Updater [ 28/Aug/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15163/
Subject: LU-3534 tests: a few tests cases for async update.
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: c2b82ef8a12e6c55f143fbd8986c094425ed667e

Comment by James A Simmons [ 22/Sep/15 ]

WangDi, since we are near completion of this current work, I recommend you open a ticket for continued DNE2 work for 2.9. I see several tickets linked here that are to be completed for 2.9, and I would hate to see them disappear.

Comment by Di Wang [ 23/Sep/15 ]

Right now, all DNE2 work is linked under LU-6831; I will move the unfinished tickets there.

Comment by Gerrit Updater [ 26/Sep/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15275/
Subject: LU-3534 tests: Add dne-2.5 upgrade test
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 38d45a59d8a29bc60d690620e9bcc3eaba108d9b

Comment by Joseph Gmitter (Inactive) [ 28/Jan/16 ]

All sub-tasks have now landed for 2.8.

Generated at Sat Feb 10 01:34:44 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.