[LU-16806] mkdir: cannot create directory ‘test5’: Object is remote Created: 09/May/23  Updated: 02/Feb/24  Resolved: 05/Jul/23

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0, Lustre 2.15.2
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Gian-Carlo Defazio Assignee: Lai Siyao
Resolution: Fixed Votes: 0
Labels: llnl
Environment:

server (garter):
toss 4.5-6rc6
4.18.0-425.19.2.1toss.t4.x86_64
2.14.0_19.llnl

client (mutt):
toss 4.5-6
4.18.0-425.19.2.1toss.t4.x86_64
2.15.2_5.llnl


Attachments: File dk.mutt12.1683589208     File dk.mutt12.1683589684     File mkdir-attempt-client-and-mds.tar.gz    
Issue Links:
Related
Severity: 3

 Description   

Attempts to create directories are intermittently failing when using "mkdir".

So far, "lfs mkdir" succeeds every time, as does file creation using "touch".

 



 Comments   
Comment by Gian-Carlo Defazio [ 09/May/23 ]

For my notes, the local issue is TOSS-6002.

Comment by Gian-Carlo Defazio [ 09/May/23 ]

The LLNL Lustre versions are at:

https://github.com/LLNL/lustre/tree/2.14.0_19.llnl

https://github.com/LLNL/lustre/tree/2.15.2_5.llnl

 

Comment by Gian-Carlo Defazio [ 09/May/23 ]

Here's an example of a few failures followed by a successful create:

 

(mutt12):mdts_3_osts_3$ pwd
/p/lflood/defazio1/mdtest/mdts_3_osts_3
(mutt12):mdts_3_osts_3$ mkdir junk20
mkdir: cannot create directory ‘junk20’: Object is remote
(mutt12):mdts_3_osts_3(1)$ mkdir junk20
mkdir: cannot create directory ‘junk20’: Object is remote
(mutt12):mdts_3_osts_3(1)$ mkdir junk20
mkdir: cannot create directory ‘junk20’: Object is remote
(mutt12):mdts_3_osts_3(1)$ mkdir junk20
(mutt12):mdts_3_osts_3$ ls -ld junk20
drwx------ 2 defazio1 defazio1 26624 May  8 17:53 junk20
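
The "Object is remote" text in the transcript above is how glibc renders errno EREMOTE. A minimal sketch of the same experiment at the mkdir(2) level follows; it retries only on EREMOTE and reports how many attempts were needed. The helper name mkdir_retry_eremote and its parameters are hypothetical and do not come from this ticket. It is meant as a diagnostic aid for counting failures, not as a workaround anyone here has recommended.

```c
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Hypothetical helper: call mkdir(2) up to max_tries times, retrying
 * only when the failure is EREMOTE ("Object is remote").
 * Returns 0 on success, otherwise the errno of the last failed attempt
 * (EINVAL if max_tries is not positive and no attempt was made).
 * If tries_out is non-NULL it receives the number of attempts made.
 */
static int mkdir_retry_eremote(const char *path, mode_t mode,
			       int max_tries, int *tries_out)
{
	int err = EINVAL;
	int tries = 0;

	while (tries < max_tries) {
		tries++;
		if (mkdir(path, mode) == 0) {
			err = 0;
			break;
		}
		err = errno;
		/* Any other error (EEXIST, EACCES, ...) is final. */
		if (err != EREMOTE)
			break;
	}
	if (tries_out)
		*tries_out = tries;
	return err;
}
```

Calling mkdir_retry_eremote("junk20", 0700, 10, &tries) from the affected directory would correspond to the repeated manual attempts shown above, with tries recording how many calls were made before success or a non-EREMOTE error.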

 

 

The parent directory is /p/lflood/defazio1/mdtest/mdts_3_osts_3

(mutt12):mdts_3_osts_3$ stat .
  File: .
  Size: 26624           Blocks: 52         IO Block: 131072 directory
Device: 11602e98h/291516056d    Inode: 198161923999531372  Links: 16
Access: (0700/drwx------)  Uid: (28153/defazio1)   Gid: (28153/defazio1)
Access: 2023-05-08 16:51:39.000000000 -0700
Modify: 2023-05-08 17:53:11.000000000 -0700
Change: 2023-05-08 17:53:11.000000000 -0700
 Birth: 2023-05-08 11:43:54.000000000 -0700

It was created with the commands:

 

lfs setdirstripe --mdt-index 3 /p/lflood/defazio1/mdtest/mdts_3_osts_3
lfs setstripe --ost 3 /p/lflood/defazio1/mdtest/mdts_3_osts_3
chown defazio1:defazio1 /p/lflood/defazio1/mdtest/mdts_3_osts_3

 

(mutt12):mdts_3_osts_3(4)$ lfs  getdirstripe  .
lmv_stripe_count: 0 lmv_stripe_offset: 3 lmv_hash_type: none

(mutt12):mdts_3_osts_3$ lfs  getstripe  -d .
stripe_count:  1 stripe_size:   1048576 pattern:       raid0 stripe_offset: 3



 

Comment by Gian-Carlo Defazio [ 09/May/23 ]

In this case, some directories are created on the first try, but most fail:

(mutt12):mdts_3_osts_3$ for x in {1..20}; do mkdir test_name$x; done
mkdir: cannot create directory ‘test_name1’: Object is remote
mkdir: cannot create directory ‘test_name2’: Object is remote
mkdir: cannot create directory ‘test_name3’: Object is remote
mkdir: cannot create directory ‘test_name4’: Object is remote
mkdir: cannot create directory ‘test_name5’: Object is remote
mkdir: cannot create directory ‘test_name6’: Object is remote
mkdir: cannot create directory ‘test_name8’: Object is remote
mkdir: cannot create directory ‘test_name9’: Object is remote
mkdir: cannot create directory ‘test_name10’: Object is remote
mkdir: cannot create directory ‘test_name11’: Object is remote
mkdir: cannot create directory ‘test_name12’: Object is remote
mkdir: cannot create directory ‘test_name13’: Object is remote
mkdir: cannot create directory ‘test_name14’: Object is remote
mkdir: cannot create directory ‘test_name15’: Object is remote
mkdir: cannot create directory ‘test_name16’: Object is remote
mkdir: cannot create directory ‘test_name18’: Object is remote
mkdir: cannot create directory ‘test_name19’: Object is remote
mkdir: cannot create directory ‘test_name20’: Object is remote
(mutt12):mdts_3_osts_3(1)$ for x in {1..20}; do mkdir test_name$x; done
mkdir: cannot create directory ‘test_name1’: Object is remote
mkdir: cannot create directory ‘test_name2’: Object is remote
mkdir: cannot create directory ‘test_name3’: Object is remote
mkdir: cannot create directory ‘test_name4’: Object is remote
mkdir: cannot create directory ‘test_name5’: Object is remote
mkdir: cannot create directory ‘test_name7’: File exists
mkdir: cannot create directory ‘test_name8’: Object is remote
mkdir: cannot create directory ‘test_name9’: Object is remote
mkdir: cannot create directory ‘test_name10’: Object is remote
mkdir: cannot create directory ‘test_name12’: Object is remote
mkdir: cannot create directory ‘test_name13’: Object is remote
mkdir: cannot create directory ‘test_name14’: Object is remote
mkdir: cannot create directory ‘test_name16’: Object is remote
mkdir: cannot create directory ‘test_name17’: File exists
mkdir: cannot create directory ‘test_name18’: Object is remote
mkdir: cannot create directory ‘test_name19’: Object is remote
mkdir: cannot create directory ‘test_name20’: Object is remote

 

The directories that are created inherit the striping (this was run after more mkdir attempts, so more directories exist than the 2 indicated above):

(mutt12):mdts_3_osts_3$ lfs getdirstripe test_name*
lmv_stripe_count: 0 lmv_stripe_offset: 3 lmv_hash_type: none
lmv_stripe_count: 0 lmv_stripe_offset: 3 lmv_hash_type: none
lmv_stripe_count: 0 lmv_stripe_offset: 3 lmv_hash_type: none
lmv_stripe_count: 0 lmv_stripe_offset: 3 lmv_hash_type: none
lmv_stripe_count: 0 lmv_stripe_offset: 3 lmv_hash_type: none
lmv_stripe_count: 0 lmv_stripe_offset: 3 lmv_hash_type: none
lmv_stripe_count: 0 lmv_stripe_offset: 3 lmv_hash_type: none
lmv_stripe_count: 0 lmv_stripe_offset: 3 lmv_hash_type: none
lmv_stripe_count: 0 lmv_stripe_offset: 3 lmv_hash_type: none

 

However, using lfs mkdir seems to work fine:

(mutt12):mdts_3_osts_3$ for x in {1..20}; do lfs mkdir -i $(($x % 4)) lfs_name$x; done
(mutt12):mdts_3_osts_3$ ls lfs_name*
lfs_name1:
lfs_name10:
lfs_name11:
lfs_name12:
lfs_name13:
lfs_name14:
lfs_name15:
lfs_name16:
lfs_name17:
lfs_name18:
lfs_name19:
lfs_name2:
lfs_name20:
lfs_name3:
lfs_name4:
lfs_name5:
lfs_name6:
lfs_name7:
lfs_name8:
lfs_name9:
Comment by Gian-Carlo Defazio [ 09/May/23 ]

I've uploaded debug logs for the attempted creation of the directory junk8.

The debug log that contains a failed attempt for junk8 is
dk.mutt12.1683589208

The creation then succeeded (after failing about 10 times) in
dk.mutt12.1683589684

Comment by Peter Jones [ 09/May/23 ]

Serguei

Can you please advise?

Thanks

Peter

Comment by Serguei Smirnov [ 09/May/23 ]

It looks like this one needs the attention of experts in higher-level Lustre.

For example, is LUDOC-289 relevant here?

Comment by Olaf Faaland [ 11/May/23 ]

Hi Serguei or Peter,

Can you ask an appropriate Whamcloud-er to advise?  The two test clusters involved will be taken away from us for other testing on Tuesday or Wednesday of next week.

thanks

Comment by Lai Siyao [ 12/May/23 ]

Could you collect MDS debug logs upon failure? I suspect a backward-compatibility patch is missing, possibly on the server side.

Comment by Gian-Carlo Defazio [ 17/May/23 ]

I've added mkdir-attempt-client-and-mds.tar.gz, which has debug logs for the client and the MDSs, collected with debug=-1.

Comment by Gerrit Updater [ 18/May/23 ]

"Lai Siyao <lai.siyao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51045
Subject: LU-16806 lod: access DMV when it's valid
Project: fs/lustre-release
Branch: b2_14
Current Patch Set: 1
Commit: 65887f6728d68d8b2e10969df25e5303603a4797

Comment by Lai Siyao [ 18/May/23 ]

This is caused by a bug in the 2.14 code, but it's not an issue in later releases because the related code has been removed.

Comment by Peter Jones [ 18/May/23 ]

So LLNL's exposure to this issue will disappear once they upgrade to 2.15.x?

Comment by Lai Siyao [ 19/May/23 ]

Yes, upgrading to 2.15.x avoids this issue, since the related code has been removed.

Comment by Gian-Carlo Defazio [ 24/May/23 ]

The patch fixed the issue on 2.14.

I'm a bit confused by what the patch is doing.

Is the idea that (lds->lds_dir_def_striping_set == 0) means the parent directory doesn't have default striping set for its sub-directories, so a newly created sub-directory should instead inherit from the root of the whole filesystem?

 

Comment by Olaf Faaland [ 24/May/23 ]

Hi Lai,

Please also confirm whether you believe 2.14 servers with this patch will work properly with both Lustre 2.12 and 2.15 clients.  As we work to switch to Lustre 2.15, we will have:

Clients: Lustre 2.12, 2.15
Routers: Lustre 2.12, 2.15

thanks

Comment by Lai Siyao [ 25/May/23 ]

Gian, this code checks whether the client sent the mkdir request to the wrong MDT. If the parent has a default LMV (the 2.14 code doesn't check this) and that default LMV is not space balanced, mkdir on a remote MDT is not allowed.

				if (hint->dah_parent &&
				    dt_object_remote(hint->dah_parent) && lds &&
				    lds->lds_dir_def_striping_set &&
				    lds->lds_dir_def_stripe_offset !=
				    LMV_OFFSET_DEFAULT)
					GOTO(out, rc = -EREMOTE);

As for your question: yes, if the parent doesn't have a default LMV, the filesystem default LMV is applied.
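
The decision made by the check quoted above can be modelled outside the kernel. The sketch below is only an illustration: struct dmv_model, mkdir_guard, mkdir_guard_unchecked, and MODEL_LMV_OFFSET_DEFAULT are hypothetical names, the value (uint32_t)-1 for the "default offset" constant is an assumption, and the unchecked variant is a guess at the effect of "2.14 code doesn't check this", not the actual pre-patch source.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed stand-in for LMV_OFFSET_DEFAULT (taken here to be (uint32_t)-1). */
#define MODEL_LMV_OFFSET_DEFAULT ((uint32_t)-1)

/* Hypothetical model of the two default-LMV fields used by the check. */
struct dmv_model {
	bool     def_striping_set;   /* models lds_dir_def_striping_set */
	uint32_t def_stripe_offset;  /* models lds_dir_def_stripe_offset */
};

/*
 * Mirrors the quoted condition: refuse with -EREMOTE only when the
 * parent is remote, the default LMV is marked valid, and its offset
 * names a specific MDT rather than the space-balanced default.
 */
static int mkdir_guard(bool parent_is_remote, const struct dmv_model *lds)
{
	if (parent_is_remote && lds != NULL &&
	    lds->def_striping_set &&
	    lds->def_stripe_offset != MODEL_LMV_OFFSET_DEFAULT)
		return -EREMOTE;
	return 0;
}

/*
 * Assumed model of the same test without the validity check: the
 * offset field is consulted even when no default LMV is set, so a
 * leftover value in that field can trigger -EREMOTE.
 */
static int mkdir_guard_unchecked(bool parent_is_remote,
				 const struct dmv_model *lds)
{
	if (parent_is_remote && lds != NULL &&
	    lds->def_stripe_offset != MODEL_LMV_OFFSET_DEFAULT)
		return -EREMOTE;
	return 0;
}
```

With a model where def_striping_set is false but def_stripe_offset still holds 3, mkdir_guard allows the create while mkdir_guard_unchecked rejects it, which is one way to picture why checking lds_dir_def_striping_set first matters.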

Comment by Lai Siyao [ 25/May/23 ]

Olaf, yes, it works with both 2.12 and 2.15 clients.

Comment by Peter Jones [ 03/Jul/23 ]

Is there any further work outstanding here? AFAIK the b2_14 fix has been proven to work and this issue does not exist on more current releases, so I would think that we can close out the ticket...

Comment by Gian-Carlo Defazio [ 05/Jul/23 ]

This issue is fixed for us on 2.14 and has been pulled into our local branch.

Comment by Gian-Carlo Defazio [ 05/Jul/23 ]

reopening to remove topllnl

Comment by Gian-Carlo Defazio [ 05/Jul/23 ]

reclosing after removing topllnl
