[LU-1262] mkdir followed by rmdir on a different client fails -- Object doesn't exist! Created: 27/Mar/12 Updated: 09/Apr/12 Resolved: 09/Apr/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Roger Spellman (Inactive) | Assignee: | Lai Siyao |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Lustre servers are running 2.6.32-220.el6, with Lustre 2.1.1.rc4.
Lustre clients are running 2.6.38.2, with special code created for this release, with http://review.whamcloud.com/#change,2170. |
||
| Severity: | 2 |
| Rank (Obsolete): | 6427 |
| Description |
|
Customer creates a directory on one node and puts a file in that directory. Then, on a different client, the customer recursively removes the directory. Then, back on the first client, the customer tries to make the directory again; this fails. Here are the exact steps:

usrs400 $ mkdir /mnt/lustre/foo
usrs399 $ rm -rf /mnt/lustre/foo
usrs400 $ mkdir /mnt/lustre/foo

The customer waited 10 minutes for the final mkdir to complete. For the following output from /var/log/messages, the customer only waited a second or two. Also, the customer unmounted and remounted the clients to keep things simpler.

usrs400:/var/log/messages:
Mar 27 12:26:57 usrs400 kernel: [677137.616534] Lustre: Lustre: Build Version: ../lustre/scripts-20120222220600-PRISTINE-../lustre/scripts

usrs399:/var/log/messages:
Mar 27 12:26:20 usrs399 kernel: [677221.853364] Lustre: Lustre: Build Version: ../lustre/scripts-20120222220600-PRISTINE-../lustre/scripts

On the MDS (I think that there is some clock skew):
Mar 27 12:26:39 ts-xxxxxxxx-01 kernel: Lustre: 2523:0:(ldlm_lib.c:877:target_handle_connect()) MGS: connection from b92afcf0-1504-ed4b-819e-d31039236758@192.168.185.7@tcp t0 exp (null) cur 1332851199 last 0 header@ffff88061fac1ec0 header@ffff88061fac1ec0 |
| Comments |
| Comment by Peter Jones [ 27/Mar/12 ] |
|
Lai

Could you please comment on this one?

Thanks

Peter |
| Comment by Lai Siyao [ 30/Mar/12 ] |
|
I can't reproduce this on the master branch; I'll check the changes made after 2.1. |
| Comment by Roger Spellman (Inactive) [ 30/Mar/12 ] |
|
What are you using for a client? As stated in the "Environment" section: Lustre clients are running 2.6.38.2, with special code created for this release, with http://review.whamcloud.com/#change,2170. |
| Comment by Roger Spellman (Inactive) [ 30/Mar/12 ] |
|
Customer reports the following:

A little more playing around, and I found out that you don't need to have a file in the directory to hit the problem. However, the problem only manifests itself if you use "rm -r" to remove the directory. The difference is that "rmdir" uses the rmdir() system call, while "rm -r" uses the unlinkat() system call.

I also have a new fun bug for you:

client1 $ mkdir /mnt/lustre/test
client2 $ echo foo > /mnt/lustre/test/foo
client1 $ mv /mnt/lustre/test /mnt/lustre/test2
client2 $ echo bar > /mnt/lustre/test/bar #### SUCCEEDS!!!
client1 $ ls /mnt/lustre/test2
client2 $ mkdir /mnt/lustre/test

It seems that client2 cached the name of the directory node and used the cache when writing a file, but neither running ls nor trying to make the directory invalidated the cache entry. /var/log/messages on both clients and on xxxxxxx-01 (the MDS) has no messages caused by this test. I would actually consider this a more serious bug than the first one, because our users are likely to have multiple runs that write to the same directory path; they may move the directory for subsequent runs, but the old directory will receive the data. |
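|
To make the distinction between the two removal paths concrete, here is a minimal user-space sketch (an illustration, not code from the ticket; the directory paths are placeholders, not the customer's paths):

/* Sketch: the syscall /bin/rmdir issues versus the one "rm -r" issues
 * for a directory. Paths below are placeholders. */
#include <fcntl.h>      /* AT_FDCWD, AT_REMOVEDIR */
#include <stdio.h>      /* perror */
#include <unistd.h>     /* rmdir, unlinkat */

int main(void)
{
        /* What /bin/rmdir does: the rmdir() system call on the path. */
        if (rmdir("/mnt/lustre/dir1") != 0)
                perror("rmdir");

        /* What "rm -r" does for a directory: unlinkat() with AT_REMOVEDIR. */
        if (unlinkat(AT_FDCWD, "/mnt/lustre/dir2", AT_REMOVEDIR) != 0)
                perror("unlinkat");

        return 0;
}

The same difference can be confirmed by running the two removal commands under strace on an ordinary Linux client. |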
| Comment by Lai Siyao [ 30/Mar/12 ] |
|
Hmm, kernels >= 2.6.38 use d_set_d_op() to set dentry operations, and .d_delete is called before the refcount is decreased. I'll update http://review.whamcloud.com/#change,2170 later. |
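|
For context, here is a rough sketch of the kernel interface mentioned above (illustration only, with made-up example_* names; this is not the actual change in http://review.whamcloud.com/#change,2170). On 2.6.38 and later, a filesystem attaches its dentry operations per dentry with d_set_d_op(), and the .d_delete hook tells the VFS whether a dentry may stay in the dcache when its last reference is dropped:

/* Illustrative sketch only -- not the Lustre patch. */
#include <linux/dcache.h>

/* Returning 1 asks the VFS to drop the dentry rather than keep a
 * possibly stale entry cached once the last reference goes away. */
static int example_d_delete(const struct dentry *dentry)
{
        return 1;
}

static const struct dentry_operations example_dentry_ops = {
        .d_delete = example_d_delete,
};

/* Called on each new dentry, e.g. from the filesystem's lookup path. */
static void example_attach_ops(struct dentry *dentry)
{
        d_set_d_op(dentry, &example_dentry_ops);
}

The unconditional return of 1 is just for illustration; a real filesystem would decide based on its own state. The point is where the hook runs in the dput() path. |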
| Comment by Lai Siyao [ 30/Mar/12 ] |
|
http://review.whamcloud.com/#change,2170 has been updated; could you try again? |
| Comment by Peter Jones [ 09/Apr/12 ] |
|
As per Terascala, the latest code fixes this issue. |
| Comment by Roger Spellman (Inactive) [ 09/Apr/12 ] |
|
Agreed. This can be closed. |
| Comment by Peter Jones [ 09/Apr/12 ] |
|
Landing the patch for this work is tracked under |