[LU-9958] Create striped directory fail in 2.10(with LU-9500 patch) Created: 08/Sep/17  Updated: 14/Oct/17  Resolved: 05/Oct/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: sebg-crd-pm (Inactive) Assignee: Lai Siyao
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

Lustre 2.10 (lustre-release-58fd06e) + LU-9500 patch for OFED4.0
Mellanox IB EDR + MLNX_OFED_LINUX-4.0-2.0.0.1-rhel7.3-x86_64.tgz
Test with 2 MDS servers (1 MDT/Server)


Attachments: File client.log     File mdt0.log     File mdt1.log    
Issue Links:
Duplicate
is duplicated by LU-10010 Create striped directory fail in 2.10... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Hi,
I am trying to use the OFED 4.0 driver with Lustre 2.10 plus the LU-9500 patch (https://review.whamcloud.com/#/c/28237/), but creating a striped directory fails.
From LU-9461 I learned that Lustre 2.9 needs LU-9026, LU-9472, and LU-9500 applied.
I therefore tested OFED 4.0 with Lustre 2.10 plus the LU-9500 patch (LU-9026 and LU-9472 are already in Lustre 2.10).
Should I apply any other patch for this issue? Thanks.

// The two MDTs must be on different servers
[root@hsm client]# lfs mkdir -c 2 dir1
error on LL_IOC_LMV_SETSTRIPE 'dir1' (3): Input/output error
error: mkdir: create stripe dir 'dir1' failed



 Comments   
Comment by Brad Hoagland (Inactive) [ 08/Sep/17 ]

Hello,

Please attach the entire log for us to review.

Thanks,

Brad

Comment by sebg-crd-pm (Inactive) [ 11/Sep/17 ]

FYI

lfs mkdir -c 2 /mnt/client/dir3
error on LL_IOC_LMV_SETSTRIPE 'dir3' (3): Input/output error
error: mkdir: create stripe dir 'dir3' failed

See the attached files mdt1.log, mdt0.log, and client.log.

Comment by sebg-crd-pm (Inactive) [ 13/Sep/17 ]

Hi Brad,

Do you have any update after reviewing the logs?

Thanks!

Comment by Peter Jones [ 14/Sep/17 ]

Lai

Can you please advise on this one?

Thanks

Peter

Comment by Lai Siyao [ 15/Sep/17 ]

In mdt1.log:

00010000:00000001:6.0:1505117481.059504:0:5182:0:(ldlm_lib.c:3268:target_bulk_io()) Process leaving (rc=18446744073709551506 : -110 : ffffffffffffff92)
00000020:00000001:6.0:1505117481.059508:0:5182:0:(out_handler.c:982:out_handle()) Process leaving via out_free (rc=18446744073709547410 : -4206 : 0xffffffffffffef92)

which caused the following on mdt0:

00000004:00000001:9.0:1505117529.647439:0:3485:0:(osp_trans.c:1204:osp_send_update_req()) Process leaving (rc=18446744073709551506 : -110 : ffffffffffffff92)
...
00000020:00000001:1.0:1505117529.647589:0:3442:0:(update_trans.c:1091:top_trans_stop()) Process leaving (rc=18446744073709551611 : -5 : fffffffffffffffb)
...
00000004:00000001:1.0:1505117529.647779:0:3442:0:(mdt_reint.c:526:mdt_create()) Process leaving via put_child (rc=18446744073709551611 : -5 : 0xfffffffffffffffb)

It looks like network I/O on mdt1 timed out. Could you verify that the network on mdt1 is working correctly?
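
Each "Process leaving" line above prints the return code three ways: as an unsigned 64-bit decimal, as a signed value, and in hex. The following is a minimal Python sketch of that conversion. The helper names and the small errno table (Linux values) are illustrative assumptions, not part of Lustre.

```python
# Hypothetical helpers: reinterpret the unsigned 64-bit rc printed in
# Lustre debug logs as a signed value and name a few Linux errnos.

# Linux errno numbers (assumed Linux numbering; other OSes differ).
LINUX_ERRNO = {5: "EIO", 22: "EINVAL", 110: "ETIMEDOUT"}


def to_signed64(value: int) -> int:
    """Interpret an unsigned 64-bit integer as two's-complement signed."""
    value &= (1 << 64) - 1
    return value - (1 << 64) if value >= (1 << 63) else value


def describe_rc(value: int) -> str:
    """Return e.g. '-110 (ETIMEDOUT)' for an rc as printed in the log."""
    rc = to_signed64(value)
    if rc >= 0:
        return str(rc)
    # Codes missing from the small table above are reported as unknown.
    return f"{rc} ({LINUX_ERRNO.get(-rc, 'unknown')})"


if __name__ == "__main__":
    for raw in (18446744073709551506, 18446744073709551611):
        print(raw, "->", describe_rc(raw))
```

Applied to the values above, this gives -110 (ETIMEDOUT) for target_bulk_io() and -5 (EIO) for mdt_create(), which matches the "Input/output error" reported by the client.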

Comment by sebg-crd-pm (Inactive) [ 19/Sep/17 ]

I tested this again on 2.10.1-RC1 (with no additional patches).

It looks like network I/O on mdt1 timed out. Could you verify that the network on mdt1 is working correctly?
=> All OSP import states look normal:
[mdt1 server]
./osp/jlustre-MDT0000-osp-MDT0001/state:current_state: FULL
./osp/jlustre-MDT0000-osp-MDT0001/import: state: FULL
./osp/jlustre-OST0000-osc-MDT0001/state:current_state: FULL
./osp/jlustre-OST0000-osc-MDT0001/import: state: FULL
[mdt0 server]
./osp/jlustre-MDT0001-osp-MDT0000/state:current_state: FULL
./osp/jlustre-MDT0001-osp-MDT0000/import: state: FULL
./osp/jlustre-OST0000-osc-MDT0000/state:current_state: FULL
./osp/jlustre-OST0000-osc-MDT0000/import: state: FULL
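
A check like the one above can be scripted. The following is a minimal Python sketch that scans lines in the format shown and reports any import that is not FULL. The function name and regular expression are illustrative assumptions based only on the lines pasted here.

```python
# Hypothetical checker for "path:current_state: FULL" / "path: state: FULL"
# lines as pasted above; not a Lustre tool.
import re

STATE_RE = re.compile(
    r"^(?P<path>\S+?):\s*(?:current_state|state):\s*(?P<state>\S+)\s*$"
)


def non_full_imports(lines):
    """Return (path, state) pairs for every parsed line whose state is not FULL."""
    problems = []
    for line in lines:
        match = STATE_RE.match(line.strip())
        if match and match.group("state") != "FULL":
            problems.append((match.group("path"), match.group("state")))
    return problems


if __name__ == "__main__":
    sample = [
        "./osp/jlustre-MDT0000-osp-MDT0001/state:current_state: FULL",
        "./osp/jlustre-MDT0000-osp-MDT0001/import: state: FULL",
    ]
    print(non_full_imports(sample) or "all imports FULL")
```

A FULL state only shows that the import is connected. In this ticket the imports were FULL while bulk transfers were still failing, so this check alone does not rule out a network problem.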

[/var/log/messages on mdt0 server]
Sep 19 02:50:14 ossb2 kernel: LNetError: 21764:0:(o2iblnd.c:1940:kiblnd_fmr_pool_map()) Failed to map mr 10/11 elements
Sep 19 02:50:14 ossb2 kernel: LNetError: 21764:0:(o2iblnd_cb.c:560:kiblnd_fmr_map_tx()) Can't map 41033 pages: -22
Sep 19 02:50:14 ossb2 kernel: LNetError: 21764:0:(o2iblnd_cb.c:1554:kiblnd_send()) Can't setup GET sink for 172.20.110.209@o2ib: -22
Sep 19 02:50:14 ossb2 kernel: LustreError: 21764:0:(events.c:449:server_bulk_callback()) event type 5, status -5, desc ffff88086ea2e400
Sep 19 02:51:54 ossb2 kernel: LustreError: 21764:0:(ldlm_lib.c:3237:target_bulk_io()) @@@ timeout on bulk WRITE after 100+0s req@ffff880457208c50 x1578948605516272/t0(0) o1000->jlustre-MDT0001-mdtlov_UUID@172.20.110.209@o2ib:210/0 lens 376/0 e 4 to 0 dl 1505803920 ref 1 fl Interpret:/0/ffffffff rc 0/-1

[/var/log/messages in mdt1 server]
Sep 19 14:51:22 ossb1 kernel: LustreError: 11-0: jlustre-MDT0000-osp-MDT0001: operation out_update to node 172.20.110.210@o2ib failed: rc = -110
Sep 19 14:51:22 ossb1 kernel: LustreError: 31069:0:(layout.c:2085:__req_capsule_get()) @@@ Wrong buffer for field `object_update_reply' (1 of 1) in format `OUT_UPDATE': 0 vs. 4096 (server)#012 req@ffff8807d3aa7800 x1578948605516272/t0(0) o1000->jlustre-MDT0000-osp-MDT0001@172.20.110.210@o2ib:24/4 lens 376/192 e 4 to 0 dl 1505803889 ref 2 fl Interpret:ReM/0/0 rc -110/-110
Sep 19 14:51:24 ossb1 kernel: LustreError: 30780:0:(llog_cat.c:773:llog_cat_cancel_records()) jlustre-MDT0000-osp-MDT0001: fail to cancel 1 of 1 llog

Comment by sebg-crd-pm (Inactive) [ 19/Sep/17 ]

It looks like network I/O on mdt1 timed out. Could you verify that the network on mdt1 is working correctly?
=> How can I verify that the network on mdt1 is working correctly? Could you give me any suggestions? Thanks.

Comment by sebg-crd-pm (Inactive) [ 28/Sep/17 ]

Hi Lai,

Do you need more detailed logs, or have you already reproduced it at your site? Thanks.

Comment by Lai Siyao [ 28/Sep/17 ]

Can you test 'lfs mkdir -i 1 dir1' to create a remote directory?

Comment by sebg-crd-pm (Inactive) [ 02/Oct/17 ]

Creating a remote directory also fails:

[root@robin client]# lfs mkdir -i 0 dir0
[root@robin client]# lfs mkdir -i 1 dir1
error on LL_IOC_LMV_SETSTRIPE 'dir1' (3): Input/output error
error: mkdir: create stripe dir 'dir1' failed
[root@robin client]# lfs mkdir -c 2 dir2
error on LL_IOC_LMV_SETSTRIPE 'dir2' (3): Input/output error
error: mkdir: create stripe dir 'dir2' failed

Comment by sebg-crd-pm (Inactive) [ 03/Oct/17 ]

I also tested with both MDTs on the same server (messages transferred over the loopback device), and creating a striped directory succeeded.
So I suspect something is wrong with message transfer between the MDTs over IB.

Any update? Thanks.

Comment by sebg-crd-pm (Inactive) [ 05/Oct/17 ]

This bug cannot be reproduced in release 2.10.1.

Comment by Peter Jones [ 05/Oct/17 ]

Good news - thanks

Generated at Sat Feb 10 02:30:47 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.