[LU-9766] DNE phase 2 - wrong directory inheritance Created: 12/Jul/17  Updated: 09/Nov/21  Resolved: 09/Nov/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0, Lustre 2.9.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Jean-Baptiste Riaux (Inactive) Assignee: Jean-Baptiste Riaux (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: cea

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Inheritance of the default directory striping set with "lfs setdirstripe -D" does not work reliably:

Setting the default MDT stripe count of a directory to 2:
[root@vm4 test]# lfs setdirstripe -D -c 2 2-stripes

Creating directories (which should inherit the striping from the parent):
[root@vm4 test]# mkdir 2-stripes/foo.{0..9}

Only some of them get the expected stripe count:
[root@vm4 test]#  lfs getdirstripe 2-stripes/
2-stripes/
lmv_stripe_count: 2 lmv_stripe_offset: 0
mdtidx           FID[seq:oid:ver]
    0           [0x300000401:0x60:0x0]
    1           [0x340000402:0x1:0x0]
2-stripes//foo.2
lmv_stripe_count: 2 lmv_stripe_offset: 1
mdtidx           FID[seq:oid:ver]
    1           [0x340000401:0xb4:0x0]
    0           [0x300000402:0x2:0x0]
2-stripes//foo.6
lmv_stripe_count: 2 lmv_stripe_offset: 1
mdtidx           FID[seq:oid:ver]
    1           [0x340000401:0xb6:0x0]
    0           [0x300000402:0x4:0x0]
2-stripes//foo.5
lmv_stripe_count: 1 lmv_stripe_offset: 0
mdtidx           FID[seq:oid:ver]
    0           [0x300000401:0x63:0x0]
2-stripes//foo.9
lmv_stripe_count: 1 lmv_stripe_offset: 0
mdtidx           FID[seq:oid:ver]
    0           [0x300000401:0x65:0x0]
2-stripes//foo.0
lmv_stripe_count: 2 lmv_stripe_offset: 1
mdtidx           FID[seq:oid:ver]
    1           [0x340000401:0xb3:0x0]
    0           [0x300000402:0x1:0x0]
2-stripes//foo.1
lmv_stripe_count: 1 lmv_stripe_offset: 0
mdtidx           FID[seq:oid:ver]
    0           [0x300000401:0x61:0x0]
2-stripes//foo.8
lmv_stripe_count: 2 lmv_stripe_offset: 1
mdtidx           FID[seq:oid:ver]
    1           [0x340000401:0xb7:0x0]
    0           [0x300000402:0x5:0x0]
2-stripes//foo.7
lmv_stripe_count: 1 lmv_stripe_offset: 0
mdtidx           FID[seq:oid:ver]
    0           [0x300000401:0x64:0x0]
2-stripes//foo.3
lmv_stripe_count: 1 lmv_stripe_offset: 0
mdtidx           FID[seq:oid:ver]
    0           [0x300000401:0x62:0x0]
2-stripes//foo.4
lmv_stripe_count: 2 lmv_stripe_offset: 1
mdtidx           FID[seq:oid:ver]
    1           [0x340000401:0xb5:0x0]
    0           [0x300000402:0x3:0x0]
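
A quick way to spot the children that did not inherit the default is to parse the listing. The following is a minimal sketch, not part of any Lustre tooling: it assumes the text layout shown above, and parse_getdirstripe, mismatched, and the abbreviated SAMPLE (three of the ten children) are hypothetical names introduced here for illustration.

```python
import re

# Abbreviated sample copied from the listing in this ticket.
SAMPLE = """\
2-stripes//foo.2
lmv_stripe_count: 2 lmv_stripe_offset: 1
mdtidx           FID[seq:oid:ver]
    1           [0x340000401:0xb4:0x0]
    0           [0x300000402:0x2:0x0]
2-stripes//foo.5
lmv_stripe_count: 1 lmv_stripe_offset: 0
mdtidx           FID[seq:oid:ver]
    0           [0x300000401:0x63:0x0]
2-stripes//foo.9
lmv_stripe_count: 1 lmv_stripe_offset: 0
mdtidx           FID[seq:oid:ver]
    0           [0x300000401:0x65:0x0]
"""

HEADER = re.compile(r"lmv_stripe_count:\s*(\d+)\s+lmv_stripe_offset:\s*(-?\d+)")


def parse_getdirstripe(text):
    """Return {path: (stripe_count, stripe_offset)} for each directory entry."""
    result = {}
    path = None
    for line in text.splitlines():
        match = HEADER.match(line.strip())
        if match:
            if path is not None:
                result[path] = (int(match.group(1)), int(match.group(2)))
            continue
        # Unindented lines other than the "mdtidx" column header are paths.
        if line and not line[0].isspace() and not line.startswith("mdtidx"):
            path = line.strip()
    return result


def mismatched(text, expected_count):
    """List the paths whose stripe count differs from the expected default."""
    parsed = parse_getdirstripe(text)
    return sorted(p for p, (count, _) in parsed.items() if count != expected_count)


print(mismatched(SAMPLE, 2))
```

Run against the full listing with an expected count of 2, the same function would report every child that came out with a single stripe.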

On the MDS, the debug logs show that lod_cache_parent_striping does not always return the striping defined on the parent; sometimes it returns the default filesystem striping instead:

 57168 00000004:00000001:1.0:1499850278.062218:0:8981:0:(lod_object.c:3008:lod_cache_parent_lmv_striping()) Process leaving
 57169 00000004:00000001:1.0:1499850278.062219:0:8981:0:(lod_object.c:3053:lod_cache_parent_striping()) Process leaving (rc=0 : 0 : 0)
 57170 00000004:00000040:1.0:1499850278.062220:0:8981:0:(lod_object.c:3155:lod_ah_init()) inherit default EA nr:1 off:-1 t2
 57171 00000004:00000040:1.0:1499850278.062220:0:8981:0:(lod_object.c:3187:lod_ah_init()) inherit EA nr:1 off:-1
 57172 00000004:00000040:1.0:1499850278.062221:0:8981:0:(lod_object.c:3195:lod_ah_init()) final striping count:1, offset:-1
 57173 00000004:00000001:1.0:1499850278.062221:0:8981:0:(lod_object.c:3246:lod_ah_init()) Process leaving


581753 00000004:00000001:1.0:1499850525.180299:0:9133:0:(lod_object.c:3053:lod_cache_parent_striping()) Process leaving (rc=0 : 0 : 0)
581754 00000004:00000040:1.0:1499850525.180299:0:9133:0:(lod_object.c:3155:lod_ah_init()) inherit default EA nr:1 off:-1 t2
581755 00000004:00000040:1.0:1499850525.180300:0:9133:0:(lod_object.c:3175:lod_ah_init()) set stripe EA nr:2 off:0
581756 00000004:00000040:1.0:1499850525.180300:0:9133:0:(lod_object.c:3195:lod_ah_init()) final striping count:2, offset:0
581757 00000004:00000001:1.0:1499850525.180301:0:9133:0:(lod_object.c:3246:lod_ah_init()) Process leaving
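
To compare many such traces, the final decision of lod_ah_init can be pulled out of the debug log with a small script. This is only a sketch under assumptions: the sixth colon-separated field (8981 and 9133 in the excerpts) is taken to be the process ID, and final_striping and LOG are hypothetical names introduced here, with LOG holding four lines copied from the excerpts above.

```python
import re

# Four lines copied from the two log excerpts in this ticket.
LOG = """\
57171 00000004:00000040:1.0:1499850278.062220:0:8981:0:(lod_object.c:3187:lod_ah_init()) inherit EA nr:1 off:-1
57172 00000004:00000040:1.0:1499850278.062221:0:8981:0:(lod_object.c:3195:lod_ah_init()) final striping count:1, offset:-1
581755 00000004:00000040:1.0:1499850525.180300:0:9133:0:(lod_object.c:3175:lod_ah_init()) set stripe EA nr:2 off:0
581756 00000004:00000040:1.0:1499850525.180300:0:9133:0:(lod_object.c:3195:lod_ah_init()) final striping count:2, offset:0
"""

FINAL = re.compile(r"final striping count:(\d+), offset:(-?\d+)")


def final_striping(log_text):
    """Return (pid, stripe_count, stripe_offset) for each 'final striping' line."""
    results = []
    for line in log_text.splitlines():
        match = FINAL.search(line)
        if not match:
            continue
        # The first whitespace-separated token containing ':' is the log prefix.
        prefix = next(tok for tok in line.split() if ":" in tok)
        fields = prefix.split(":")
        # Assumption: the sixth field of the prefix is the process ID.
        results.append((fields[5], int(match.group(1)), int(match.group(2))))
    return results


for pid, count, offset in final_striping(LOG):
    print(pid, count, offset)
```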

This is a problem because when the stripe count is wrong, the directory is placed on MDT 0, so MDT0 fills up faster than the other MDTs.
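
The imbalance can be counted directly from the getdirstripe listing in this description. The sketch below is illustrative only: CHILD_STRIPES is a hand transcription of the mdtidx column for the ten children, and stripes_per_mdt is a hypothetical helper name.

```python
from collections import Counter

# MDT index of each stripe per child, transcribed from the listing above.
CHILD_STRIPES = {
    "foo.0": [1, 0],
    "foo.1": [0],
    "foo.2": [1, 0],
    "foo.3": [0],
    "foo.4": [1, 0],
    "foo.5": [0],
    "foo.6": [1, 0],
    "foo.7": [0],
    "foo.8": [1, 0],
    "foo.9": [0],
}


def stripes_per_mdt(children):
    """Count how many directory stripes land on each MDT index."""
    tally = Counter()
    for mdt_indexes in children.values():
        tally.update(mdt_indexes)
    return dict(tally)


print(stripes_per_mdt(CHILD_STRIPES))
```

For this listing, MDT0 ends up holding 10 stripes against 5 on MDT1, because every single-stripe child lands on MDT0.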

Also, "lfs mkdir -i 1" does not work: it creates a directory with a stripe count of 0 and one MDT index. A workaround is to run "lfs setdirstripe -D -c 1" on the parent directory and then create directories with mkdir.

When creating directories where default striping was specified, I sometimes get timeouts in 2.7 and panics on clients in 2.9.

2.7:

[root@vm4]# mkdir 1-stripe-1/foo.0/foo.{0..9}
mkdir: cannot create directory `1-stripe-1/foo.0/foo.0': Input/output error
mkdir: cannot create directory `1-stripe-1/foo.0/foo.1': Cannot send after transport endpoint shutdown
mkdir: cannot create directory `1-stripe-1/foo.0/foo.2': Cannot send after transport endpoint shutdown
mkdir: cannot create directory `1-stripe-1/foo.0/foo.3': Cannot send after transport endpoint shutdown
mkdir: cannot create directory `1-stripe-1/foo.0/foo.4': Cannot send after transport endpoint shutdown
mkdir: cannot create directory `1-stripe-1/foo.0/foo.5': Cannot send after transport endpoint shutdown
mkdir: cannot create directory `1-stripe-1/foo.0/foo.6': Cannot send after transport endpoint shutdown
mkdir: cannot create directory `1-stripe-1/foo.0/foo.7': Cannot send after transport endpoint shutdown
mkdir: cannot create directory `1-stripe-1/foo.0/foo.8': Cannot send after transport endpoint shutdown
mkdir: cannot create directory `1-stripe-1/foo.0/foo.9': Cannot send after transport endpoint shutdown
[root@vm4]# mkdir 1-stripe-1/foo.0/foo.{0..9}
mkdir: cannot create directory `1-stripe-1/foo.0/foo.0': Input/output error

2.9:

crash> bt 2135
PID: 2135   TASK: ffff880035860000  CPU: 1   COMMAND: "mkdir"
 #0 [ffff880016c4b670] machine_kexec at ffffffff81059cdb
 #1 [ffff880016c4b6d0] __crash_kexec at ffffffff81105182
 #2 [ffff880016c4b7a0] crash_kexec at ffffffff81105270
 #3 [ffff880016c4b7b8] oops_end at ffffffff8168efc8
 #4 [ffff880016c4b7e0] no_context at ffffffff8167ebd3
 #5 [ffff880016c4b830] __bad_area_nosemaphore at ffffffff8167ec69
 #6 [ffff880016c4b878] bad_area at ffffffff8167ef8d
 #7 [ffff880016c4b8a0] __do_page_fault at ffffffff81691e5f
 #8 [ffff880016c4b900] do_page_fault at ffffffff81691f05
 #9 [ffff880016c4b930] page_fault at ffffffff8168e1c8
    [exception RIP: memcpy+22]
    RIP: ffffffff813269a6  RSP: ffff880016c4b9e0  RFLAGS: 00010283
    RAX: ffff8800395fb4c0  RBX: ffff880016c4baf8  RCX: ffff880016c4bfd8
    RDX: ffffffffffffffe5  RSI: 0000000000000000  RDI: ffff8800395fb4c0
    RBP: ffff880016c4bab8   R8: 0000000000019a80   R9: 0000000000000000
    R10: ffff8800395fb4c0  R11: 0000000000aaaaaa  R12: 0000000000000025
    R13: ffff880016c4bae8  R14: ffff8800358789a0  R15: 0000000000000025
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#10 [ffff880016c4b9e0] ll_lookup_it_finish at ffffffffa0ab5715 [lustre]
#11 [ffff880016c4bac0] ll_lookup_it at ffffffffa0ab70ae [lustre]
#12 [ffff880016c4bb78] ll_lookup_nd at ffffffffa0ab89dd [lustre]
#13 [ffff880016c4bc10] lookup_real at ffffffff812083dd
#14 [ffff880016c4bc30] __lookup_hash at ffffffff81208d52
#15 [ffff880016c4bc60] lookup_slow at ffffffff816833cb
#16 [ffff880016c4bc98] link_path_walk at ffffffff8120b96f
#17 [ffff880016c4bd48] path_lookupat at ffffffff8120bb6b
#18 [ffff880016c4bde0] filename_lookup at ffffffff8120c2cb
#19 [ffff880016c4be18] filename_create at ffffffff8120c3a2
#20 [ffff880016c4bee8] user_path_create at ffffffff8120eee1
#21 [ffff880016c4bf18] sys_mkdirat at ffffffff812101f6
#22 [ffff880016c4bf70] sys_mkdir at ffffffff812102a9
#23 [ffff880016c4bf80] system_call_fastpath at ffffffff81696709
    RIP: 00007f9ddb6d29a7  RSP: 00007ffcb2ec5690  RFLAGS: 00010246
    RAX: 0000000000000053  RBX: ffffffff81696709  RCX: 00007ffcb2ec57f0
    RDX: 00000000000001ff  RSI: 00000000000001ff  RDI: 00007ffcb2ec9790
    RBP: 00007ffcb2ec87d0   R8: 00000000000001ff   R9: 00000000004029f0
    R10: 000000000000000b  R11: 0000000000000206  R12: ffffffff812102a9
    R13: ffff880016c4bf78  R14: 00000000000001ff  R15: 00007ffcb2ec8820
    ORIG_RAX: 0000000000000053  CS: 0033  SS: 002b


 Comments   
Comment by Peter Jones [ 06/Oct/17 ]

Di/Lai

Do you have any advice here?

Peter

Comment by Di Wang [ 06/Oct/17 ]

Do you still have the debug log? It seems there is some communication issue between the MDTs, which would explain why the stripe is only created on MDT0.

According to the debug log you posted, the parent's default stripe count is 1:

57168 00000004:00000001:1.0:1499850278.062218:0:8981:0:(lod_object.c:3008:lod_cache_parent_lmv_striping()) Process leaving
 57169 00000004:00000001:1.0:1499850278.062219:0:8981:0:(lod_object.c:3053:lod_cache_parent_striping()) Process leaving (rc=0 : 0 : 0)
 57170 00000004:00000040:1.0:1499850278.062220:0:8981:0:(lod_object.c:3155:lod_ah_init()) inherit default EA nr:1 off:-1 t2
 57171 00000004:00000040:1.0:1499850278.062220:0:8981:0:(lod_object.c:3187:lod_ah_init()) inherit EA nr:1 off:-1
 57172 00000004:00000040:1.0:1499850278.062221:0:8981:0:(lod_object.c:3195:lod_ah_init()) final striping count:1, offset:-1
 57173 00000004:00000001:1.0:1499850278.062221:0:8981:0:(lod_object.c:3246:lod_ah_init()) Process leaving

So the child inherits stripe count correctly.

In the bottom half, though:

581753 00000004:00000001:1.0:1499850525.180299:0:9133:0:(lod_object.c:3053:lod_cache_parent_striping()) Process leaving (rc=0 : 0 : 0)
581754 00000004:00000040:1.0:1499850525.180299:0:9133:0:(lod_object.c:3155:lod_ah_init()) inherit default EA nr:1 off:-1 t2
581755 00000004:00000040:1.0:1499850525.180300:0:9133:0:(lod_object.c:3175:lod_ah_init()) set stripe EA nr:2 off:0
581756 00000004:00000040:1.0:1499850525.180300:0:9133:0:(lod_object.c:3195:lod_ah_init()) final striping count:2, offset:0
581757 00000004:00000001:1.0:1499850525.180301:0:9133:0:(lod_object.c:3246:lod_ah_init(.....

The child seems to have been created by "setdirstripe -c2", which overrides the default stripe and creates the directory with 2 stripes.

[root@vm4]# mkdir 1-stripe-1/foo.0/foo.{0..9}
mkdir: cannot create directory `1-stripe-1/foo.0/foo.0': Input/output error
mkdir: cannot create directory `1-stripe-1/foo.0/foo.1': Cannot send after transport endpoint shutdown
mkdir: cannot create directory `1-stripe-1/foo.0/foo.2': Cannot send after transport en...

These failures also suggest there are some communication issues between MDTs.

Comment by Jean-Baptiste Riaux (Inactive) [ 02/Nov/17 ]

Thanks for the input.
No, I do not have the logs anymore, but I can reproduce the problem.

All MDTs were on the same MDS node, and the network failures looked more like a consequence of the test than its cause.
I will rerun the test with a small Lustre setup on a single node to avoid network traffic.

Comment by Andreas Dilger [ 09/Nov/21 ]

Tested that this is working properly in (at least) 2.14.0 and later.
