[LU-6713] Noisy error messages on client while creating DNE filesystem Created: 12/Jun/15  Updated: 01/Jul/16  Resolved: 28/Jul/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0, Lustre 2.8.0
Fix Version/s: Lustre 2.8.0

Type: Bug
Priority: Major
Reporter: Robert Read (Inactive)
Assignee: Di Wang
Resolution: Fixed
Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Seen on Lustre 2.7.0.

While creating a 128-MDT filesystem, I noticed that clients sometimes take a long time to connect to new MDTs after they've been added. I saw a lot of these messages on a client's console:

Jun 12 00:08:13 client00 kernel: LustreError: 1275:0:(fld_request.c:170:fld_client_add_target()) Skipped 12 previous similar messages
Jun 12 00:08:13 client00 kernel: Lustre: 1275:0:(lmv_obd.c:300:lmv_init_ea_size()) scratch-clilmv-ffff880772a0ec00: NULL export for 11
Jun 12 00:08:13 client00 kernel: Lustre: 1275:0:(lmv_obd.c:300:lmv_init_ea_size()) Skipped 462 previous similar messages
Jun 12 00:08:19 client00 kernel: LustreError: 1277:0:(fld_request.c:170:fld_client_add_target()) cli-scratch-clilmv-ffff880772a0ec00: Attempt to add target scratch-MDT0025-mdc-ffff880772a0ec00 (idx 37) on fly - skip it
Jun 12 00:08:19 client00 kernel: LustreError: 1277:0:(fld_request.c:170:fld_client_add_target()) Skipped 13 previous similar messages
Jun 12 00:08:19 client00 kernel: Lustre: 1277:0:(lmv_obd.c:300:lmv_init_ea_size()) scratch-clilmv-ffff880772a0ec00: NULL export for 12
Jun 12 00:08:19 client00 kernel: Lustre: 1277:0:(lmv_obd.c:300:lmv_init_ea_size()) Skipped 258 previous similar messages
Jun 12 00:08:25 client00 kernel: Lustre: 1278:0:(lmv_obd.c:300:lmv_init_ea_size()) scratch-clilmv-ffff880772a0ec00: NULL export for 13
Jun 12 00:08:25 client00 kernel: Lustre: 1278:0:(lmv_obd.c:300:lmv_init_ea_size()) Skipped 56 previous similar messages
Jun 12 00:08:25 client00 kernel: LustreError: 1278:0:(fld_request.c:170:fld_client_add_target()) cli-scratch-clilmv-ffff880772a0ec00: Attempt to add target scratch-MDT0027-mdc-ffff880772a0ec00 (idx 39) on fly - skip it
Jun 12 00:08:25 client00 kernel: LustreError: 1278:0:(fld_request.c:170:fld_client_add_target()) Skipped 8 previous similar messages

Eventually the client did connect to all the MDTs, but it took about 20 minutes.



 Comments   
Comment by Andreas Dilger [ 12/Jun/15 ]

We saw what is likely a related problem during performance testing. If "lfs mkdir -c" is used to create striped directories right after mount (before the MDSes have all connected to each other), the striped directories will have too few stripes.

It would be interesting to get a debug log from the client and one of the MDSes, if possible, to see where it is spending so much time. Even with the MDSes creating 128*128=16384 connections between themselves, that shouldn't be more than a few seconds of RPCs.
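As a quick check of the connection count mentioned in this comment, here is a minimal sketch in C. The function names are hypothetical and are not part of Lustre. It assumes every MDT connects to every MDT (the 128*128 product), and also shows the count if self-connections are excluded.

```c
#include <stdint.h>

/* Hypothetical helper, not a Lustre function: number of MDT-to-MDT
 * connections if each of n_mdts targets connects to every target,
 * itself included (n * n). */
uint64_t mdt_connection_count(uint64_t n_mdts)
{
    return n_mdts * n_mdts;
}

/* Hypothetical helper: the same count with self-connections
 * excluded (n * (n - 1)). */
uint64_t mdt_connection_count_no_self(uint64_t n_mdts)
{
    return n_mdts == 0 ? 0 : n_mdts * (n_mdts - 1);
}
```

For 128 MDTs the all-pairs product is 16384, and 16256 without self-connections. Either way the total is in the tens of thousands, which is consistent with the expectation that the RPCs alone should not take 20 minutes.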

Comment by Gerrit Updater [ 13/Jun/15 ]

wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/15269
Subject: LU-6713 lmv: lock necessary part of lmv_add_target
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e04b00e82d25202950380d7cc2b31db9aff7d27a

Comment by Di Wang [ 13/Jun/15 ]

This slowness might be because lmv->lmv_init_mutex covers too large an area in lmv_add_target(). I have shrunk the area protected by the mutex in lmv_add_target(); see the patch above.
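The general idea of narrowing a mutex's critical section can be sketched in userspace C with pthreads. This is only an illustration and is not the code from the patch: struct target_table, slow_connect(), add_target_wide() and add_target_narrow() are hypothetical names, and the assumption that the slow step is a connect call is made purely for the example.

```c
#include <pthread.h>
#include <stdbool.h>

#define MAX_TARGETS 256

/* Hypothetical stand-in for a target table guarded by an init mutex. */
struct target_table {
    pthread_mutex_t init_mutex;
    bool            lock_held;             /* demo bookkeeping only */
    int             slow_calls_under_lock; /* demo bookkeeping only */
    int             count;
    unsigned int    idx[MAX_TARGETS];
};

void table_init(struct target_table *t)
{
    pthread_mutex_init(&t->init_mutex, NULL);
    t->lock_held = false;
    t->slow_calls_under_lock = 0;
    t->count = 0;
}

void table_lock(struct target_table *t)
{
    pthread_mutex_lock(&t->init_mutex);
    t->lock_held = true;
}

void table_unlock(struct target_table *t)
{
    t->lock_held = false;
    pthread_mutex_unlock(&t->init_mutex);
}

/* Stand-in for a slow step (assumed here to be a remote connect).
 * It records whether it ran while the mutex was held; this check is
 * only meaningful in a single-threaded demo. */
int slow_connect(struct target_table *t, unsigned int idx)
{
    (void)idx;
    if (t->lock_held)
        t->slow_calls_under_lock++;
    return 0;
}

/* Wide critical section: the mutex covers the slow step as well as
 * the table update, so other callers wait for the slow step too. */
int add_target_wide(struct target_table *t, unsigned int idx)
{
    int rc;

    table_lock(t);
    rc = slow_connect(t, idx);
    if (rc == 0) {
        if (t->count < MAX_TARGETS)
            t->idx[t->count++] = idx;
        else
            rc = -1;
    }
    table_unlock(t);
    return rc;
}

/* Narrow critical section: the slow step runs unlocked, and the
 * mutex protects only the shared table update. */
int add_target_narrow(struct target_table *t, unsigned int idx)
{
    int rc = slow_connect(t, idx);

    if (rc != 0)
        return rc;

    table_lock(t);
    if (t->count < MAX_TARGETS)
        t->idx[t->count++] = idx;
    else
        rc = -1;
    table_unlock(t);
    return rc;
}
```

In the narrow variant, concurrent callers serialize only on the short table update, so one slow connection no longer delays every other target being added. Whether that is safe depends on what the slow step touches, which this sketch does not model.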

Comment by Gerrit Updater [ 27/Jul/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15269/
Subject: LU-6713 lmv: lock necessary part of lmv_add_target
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 1670c57315340db997c9058950148a05634f43f1

Comment by Peter Jones [ 28/Jul/15 ]

Landed for 2.8

Generated at Sat Feb 10 02:02:35 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.