[LU-1262] mkdir followed by rmdir on a different client fails -- Object doesn't exist! Created: 27/Mar/12  Updated: 09/Apr/12  Resolved: 09/Apr/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.1
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Roger Spellman (Inactive) Assignee: Lai Siyao
Resolution: Duplicate Votes: 0
Labels: None
Environment:

Lustre servers are running 2.6.32-220.el6, with Lustre 2.1.1.rc4.
Lustre clients are running 2.6.38.2, with special code created for this release, with http://review.whamcloud.com/#change,2170.


Severity: 2
Rank (Obsolete): 6427

 Description   

The customer creates a directory on one client and puts a file in that directory. Then, on a different client, they recursively remove the directory. Back on the first client, recreating the directory fails.

Here are the exact steps:

usrs400 $ mkdir /mnt/lustre/foo
usrs400 $ touch /mnt/lustre/foo/bar

usrs399 $ rm -rf /mnt/lustre/foo

usrs400 $ mkdir /mnt/lustre/foo
mkdir: cannot create directory `foo': File exists
usrs400 $ ls /mnt/lustre/foo
ls: cannot access foo: No such file or directory
usrs400 $ mkdir /mnt/lustre/foo
mkdir: cannot create directory `foo': File exists
usrs400 $ rmdir /mnt/lustre/foo
rmdir: failed to remove `foo': No such file or directory
usrs400 $ mkdir /mnt/lustre/foo

The customer waited 10 minutes for this last mkdir to complete. For the /var/log/messages output below, the customer only waited a second or two. Also, to keep things simpler, the customer unmounted and remounted the clients.

usrs400:/var/log/messages:

Mar 27 12:26:57 usrs400 kernel: [677137.616534] Lustre: Lustre: Build Version: ../lustre/scripts-20120222220600-PRISTINE-../lustre/scripts
Mar 27 12:26:57 usrs400 kernel: [677137.758165] Lustre: Added LNI 192.168.185.6@tcp [8/256/0/180]
Mar 27 12:26:57 usrs400 kernel: [677137.758237] Lustre: Accept secure, port 988
Mar 27 12:27:01 usrs400 kernel: [677141.529389] Lustre: MGC192.168.185.35@tcp: Reactivating import
Mar 27 12:27:01 usrs400 kernel: [677141.568728] LustreError: 31684:0:(obd_config.c:1147:class_process_proc_param()) writing proc entry checksum_pages err -11
Mar 27 12:27:01 usrs400 kernel: [677141.774591] Lustre: Client xxxxxx-client has started
Mar 27 12:27:32 usrs400 kernel: [677173.081643] LustreError: 31698:0:(file.c:2228:ll_inode_revalidate_fini()) failure -2 inode 144115239078592513
Mar 27 12:27:32 usrs400 kernel: [677173.208438] LustreError: 31698:0:(file.c:2228:ll_inode_revalidate_fini()) failure -2 inode 144115239078592513

usrs399:/var/log/messages:

Mar 27 12:26:20 usrs399 kernel: [677221.853364] Lustre: Lustre: Build Version: ../lustre/scripts-20120222220600-PRISTINE-../lustre/scripts
Mar 27 12:26:20 usrs399 kernel: [677221.993703] Lustre: Added LNI 192.168.185.7@tcp [8/256/0/180]
Mar 27 12:26:20 usrs399 kernel: [677221.993777] Lustre: Accept secure, port 988
Mar 27 12:26:39 usrs399 kernel: [677240.559658] Lustre: MGC192.168.185.35@tcp: Reactivating import
Mar 27 12:26:39 usrs399 kernel: [677240.592294] LustreError: 621:0:(obd_config.c:1147:class_process_proc_param()) writing proc entry checksum_pages err -11
Mar 27 12:26:39 usrs399 kernel: [677240.825955] Lustre: Client xxxxxx-client has started

On the MDS (I think that there is some clock skew):

Mar 27 12:26:39 ts-xxxxxxxx-01 kernel: Lustre: 2523:0:(ldlm_lib.c:877:target_handle_connect()) MGS: connection from b92afcf0-1504-ed4b-819e-d31039236758@192.168.185.7@tcp t0 exp (null) cur 1332851199 last 0
Mar 27 12:26:39 ts-xxxxxxxx-01 kernel: Lustre: 2523:0:(sec.c:1474:sptlrpc_import_sec_adapt()) import MGS->NET_0x20000c0a8b907_UUID netid 20000: select flavor null
Mar 27 12:26:39 ts-xxxxxxxx-01 kernel: Lustre: 10205:0:(ldlm_lib.c:877:target_handle_connect()) xxxxxx-MDT0000: connection from 11909088-d4a2-77fa-030f-1c9e2a493436@192.168.185.7@tcp t0 exp (null) cur 1332851199 last 0
Mar 27 12:26:39 ts-xxxxxxxx-01 kernel: Lustre: 10205:0:(sec.c:1474:sptlrpc_import_sec_adapt()) import xxxxxx-MDT0000->NET_0x20000c0a8b907_UUID netid 20000: select flavor null
Mar 27 12:27:01 ts-xxxxxxxx-01 kernel: Lustre: 2523:0:(ldlm_lib.c:877:target_handle_connect()) MGS: connection from 8927fc00-6820-4bea-63ad-35146f43bb3e@192.168.185.6@tcp t0 exp (null) cur 1332851221 last 0
Mar 27 12:27:01 ts-xxxxxxxx-01 kernel: Lustre: 2523:0:(sec.c:1474:sptlrpc_import_sec_adapt()) import MGS->NET_0x20000c0a8b906_UUID netid 20000: select flavor null
Mar 27 12:27:32 ts-xxxxxxxx-01 kernel: Lustre: 2541:0:(mdt_handler.c:1010:mdt_getattr_name_lock()) header@ffff88061fac1ec0[0x0, 1, [0x200000be0:0x1:0x0] hash lru]{
Mar 27 12:27:32 ts-xxxxxxxx-01 kernel: Lustre: 2541:0:(mdt_handler.c:1010:mdt_getattr_name_lock()) ....mdt@ffff88061fac1f18mdt-object@ffff88061fac1ec0(ioepoch=0 flags=0x0, epochcount=0, writecount=0)
Mar 27 12:27:32 ts-xxxxxxxx-01 kernel: Lustre: 2541:0:(mdt_handler.c:1010:mdt_getattr_name_lock()) ....cmm@ffff88061fbc83c0[local]
Mar 27 12:27:32 ts-xxxxxxxx-01 kernel: Lustre: 2541:0:(mdt_handler.c:1010:mdt_getattr_name_lock()) ....mdd@ffff88061fd80380mdd-object@ffff88061fd80380(open_count=0, valid=0, cltime=0, flags=0)
Mar 27 12:27:32 ts-xxxxxxxx-01 kernel: Lustre: 2541:0:(mdt_handler.c:1010:mdt_getattr_name_lock()) ....osd-ldiskfs@ffff88061e88b180osd-ldiskfs-object@ffff88061e88b180(i:(null):0/0)[plain]
Mar 27 12:27:32 ts-xxxxxxxx-01 kernel: Lustre: 2541:0:(mdt_handler.c:1010:mdt_getattr_name_lock()) } header@ffff88061fac1ec0
Mar 27 12:27:32 ts-xxxxxxxx-01 kernel: Lustre: 2541:0:(mdt_handler.c:1010:mdt_getattr_name_lock()) Object doesn't exist!

Mar 27 12:27:32 ts-xxxxxxxx-01 kernel: Lustre: 2541:0:(mdt_handler.c:1010:mdt_getattr_name_lock()) header@ffff88061fac1ec0[0x0, 1, [0x200000be0:0x1:0x0] hash lru]{
Mar 27 12:27:32 ts-xxxxxxxx-01 kernel: Lustre: 2541:0:(mdt_handler.c:1010:mdt_getattr_name_lock()) ....mdt@ffff88061fac1f18mdt-object@ffff88061fac1ec0(ioepoch=0 flags=0x0, epochcount=0, writecount=0)
Mar 27 12:27:32 ts-xxxxxxxx-01 kernel: Lustre: 2541:0:(mdt_handler.c:1010:mdt_getattr_name_lock()) ....cmm@ffff88061fbc83c0[local]
Mar 27 12:27:32 ts-xxxxxxxx-01 kernel: Lustre: 2541:0:(mdt_handler.c:1010:mdt_getattr_name_lock()) ....mdd@ffff88061fd80380mdd-object@ffff88061fd80380(open_count=0, valid=0, cltime=0, flags=0)
Mar 27 12:27:32 ts-xxxxxxxx-01 kernel: Lustre: 2541:0:(mdt_handler.c:1010:mdt_getattr_name_lock()) ....osd-ldiskfs@ffff88061e88b180osd-ldiskfs-object@ffff88061e88b180(i:(null):0/0)[plain]
Mar 27 12:27:32 ts-xxxxxxxx-01 kernel: Lustre: 2541:0:(mdt_handler.c:1010:mdt_getattr_name_lock()) } header@ffff88061fac1ec0
Mar 27 12:27:32 ts-xxxxxxxx-01 kernel: Lustre: 2541:0:(mdt_handler.c:1010:mdt_getattr_name_lock()) Object doesn't exist!



 Comments   
Comment by Peter Jones [ 27/Mar/12 ]

Lai

Could you please comment on this one?

Thanks

Peter

Comment by Lai Siyao [ 30/Mar/12 ]

This can't be reproduced on the master branch; I'll check the changes made after 2.1.

Comment by Roger Spellman (Inactive) [ 30/Mar/12 ]

What are you using for a client?

As stated in the "Environment" section:

Lustre clients are running 2.6.38.2, with special code created for this release, with http://review.whamcloud.com/#change,2170.

Comment by Roger Spellman (Inactive) [ 30/Mar/12 ]

Customer reports the following.

So after a little more playing around, I found that you don't need to have a file in the directory to hit the problem. However, the problem only manifests itself if you use "rm -r" to remove the directory; the difference is that "rmdir" uses the rmdir() system call, while "rm -r" uses the unlinkat() system call.
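
For reference, a minimal single-node sketch of the two removal paths described above (the /mnt/lustre/foo_* paths are hypothetical test names; this only illustrates the two system calls, not the full two-client reproduction):

/* Sketch: "rmdir foo" maps to rmdir(2), while "rm -r foo" removes an
 * empty directory via unlinkat(2) with AT_REMOVEDIR. */
#include <fcntl.h>      /* AT_FDCWD, AT_REMOVEDIR */
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
        /* hypothetical test directories on a mounted client */
        mkdir("/mnt/lustre/foo_rmdir", 0755);
        mkdir("/mnt/lustre/foo_unlinkat", 0755);

        /* path taken by "rmdir" */
        if (rmdir("/mnt/lustre/foo_rmdir") != 0)
                perror("rmdir");

        /* path taken by "rm -r" for an empty directory */
        if (unlinkat(AT_FDCWD, "/mnt/lustre/foo_unlinkat", AT_REMOVEDIR) != 0)
                perror("unlinkat(AT_REMOVEDIR)");

        return 0;
}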

I also have a new fun bug for you:

client1 $ mkdir /mnt/lustre/test

client2 $ echo foo > /mnt/lustre/test/foo

client1 $ mv /mnt/lustre/test /mnt/lustre/test2

client2 $ echo bar > /mnt/lustre/test/bar #### SUCCEEDS!!!
client2 $ ls /mnt/lustre/test
ls: cannot open directory /mnt/lustre/test: No such file or directory
client2 $ echo baz > /mnt/lustre/test/baz #### SUCCEEDS!!!

client1 $ ls /mnt/lustre/test2
bar baz foo

client2 $ mkdir /mnt/lustre/test
mkdir: cannot create directory `/mnt/lustre/test': File exists

It seems that client2 cached the dentry for the directory path and kept using that cached entry when writing files, but neither ls nor the attempt to recreate the directory invalidated the cache entry.

/var/log/messages in both clients and xxxxxxx-01 (the MDS) have no messages caused by this test.

I would actually consider this a more serious bug than the first one, because our users are likely to have multiple runs that write to the same directory path; they may move the directory between runs, yet the old directory will still receive the data.

Comment by Lai Siyao [ 30/Mar/12 ]

Hmm, kernels >= 2.6.38 use d_set_d_op() to set dentry operations, and .d_delete is called before the refcount is decreased; I'll update http://review.whamcloud.com/#change,2170 later.
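
For context, a rough sketch of the kernel API being referred to (illustrative only, not the actual change,2170 patch; the example_* names are made up):

/* On kernels >= 2.6.38, dentry operations are attached per-dentry with
 * d_set_d_op(), typically from ->lookup() before the dentry is hashed,
 * and .d_delete runs while the last reference is still held.  Returning 1
 * asks the VFS to unhash the dentry instead of keeping it cached. */
#include <linux/dcache.h>

static int example_d_delete(const struct dentry *dentry)
{
        return 1;       /* do not keep this dentry in the cache */
}

static const struct dentry_operations example_dentry_ops = {
        .d_delete = example_d_delete,
};

/* called on a newly allocated dentry from ->lookup() */
static void example_init_dentry(struct dentry *dentry)
{
        d_set_d_op(dentry, &example_dentry_ops);
}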

Comment by Lai Siyao [ 30/Mar/12 ]

http://review.whamcloud.com/#change,2170 has been updated; could you try again?

Comment by Peter Jones [ 09/Apr/12 ]

As per Terascala, the latest code fixes this issue.

Comment by Roger Spellman (Inactive) [ 09/Apr/12 ]

Agreed. This can be closed.

Comment by Peter Jones [ 09/Apr/12 ]

Landing the patch for this work is tracked under LU-506
