
Creating a striped directory fails in 2.10 (with LU-9500 patch)

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Critical
    • Severity: 3

    Description

      Hi,
      I tried to use the OFED 4.0 driver with Lustre 2.10 plus the LU-9500 patch (https://review.whamcloud.com/#/c/28237/), but creating a striped directory fails.
      In LU-9461 I was told that Lustre 2.9 needs LU-9026/LU-9472/LU-9500 applied, so I tested OFED 4.0 on Lustre 2.10 + the LU-9500 patch (LU-9026/LU-9472 are already in Lustre 2.10).
      Should I apply any other patch for this issue? Thanks.

      // the two MDTs are on different servers
      [root@hsm client]# lfs mkdir -c 2 dir1
      error on LL_IOC_LMV_SETSTRIPE 'dir1' (3): Input/output error
      error: mkdir: create stripe dir 'dir1' failed
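
      One way to double-check the DNE setup from the client before retrying (a minimal sketch; the exact device names depend on the filesystem, and these commands are only illustrative):

      [root@hsm client]# lfs mdts                     # list the MDTs visible to this mount
      [root@hsm client]# lctl get_param mdc.*.state   # each MDC import should report current_state: FULL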

      Attachments

        1. client.log
          3.71 MB
        2. mdt0.log
          4.49 MB
        3. mdt1.log
          4.22 MB


          Activity

            [LU-9958] Creating a striped directory fails in 2.10 (with LU-9500 patch)
            pjones Peter Jones added a comment -

            Good news - thanks


            sebg-crd-pm sebg-crd-pm (Inactive) added a comment -

            This bug cannot be reproduced in release 2.10.1.

            sebg-crd-pm sebg-crd-pm (Inactive) added a comment -

            I have also tested that creating a striped directory succeeds when the two MDTs are on the same server (messages go over the loopback device), so I suspect something is wrong with MDT-to-MDT message transfer over IB.

            Any update? Thanks.
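
            Since the loopback case works but the IB case does not, LNet self-test is one way to exercise the MDT-to-MDT IB path directly. A minimal sketch (the NIDs are taken from the syslog excerpts later in this thread; the session/group/batch names are arbitrary):

            # on both MDT servers and on the node driving the test
            modprobe lnet_selftest
            # on the node driving the test
            export LST_SESSION=$$
            lst new_session mdt_bulk
            lst add_group mdt0 172.20.110.210@o2ib
            lst add_group mdt1 172.20.110.209@o2ib
            lst add_batch bulk
            lst add_test --batch bulk --from mdt0 --to mdt1 brw write size=1M
            lst run bulk
            lst stat mdt0 mdt1        # watch for RPC errors or zero throughput
            lst end_session

            A 1M brw transfer is closer in size to the failing out_update bulk than a simple ping, so it is more likely to reproduce the problem.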


            sebg-crd-pm sebg-crd-pm (Inactive) added a comment -

            Creating a remote directory => fails:

            [root@robin client]# lfs mkdir -i 0 dir0
            [root@robin client]# lfs mkdir -i 1 dir1
            error on LL_IOC_LMV_SETSTRIPE 'dir1' (3): Input/output error
            error: mkdir: create stripe dir 'dir1' failed
            [root@robin client]# lfs mkdir -c 2 dir2
            error on LL_IOC_LMV_SETSTRIPE 'dir2' (3): Input/output error
            error: mkdir: create stripe dir 'dir2' failed

            laisiyao Lai Siyao added a comment -

            can you test 'lfs mkdir -i 1 dir1' to create a remote directory?


            sebg-crd-pm sebg-crd-pm (Inactive) added a comment -

            Hi Lai,

            Do you need a more detailed log, or have you already reproduced it on your site? Thanks.

            sebg-crd-pm sebg-crd-pm (Inactive) added a comment -

            "It looks like network I/O on mdt1 timed out; could you verify the network on mdt1 is working correctly?"
            => How do I verify that the network on mdt1 is working correctly? Could you give me any guidance? Thanks.
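
            A couple of basic checks, sketched here (ossb2/ossb1 and the NIDs come from the syslog excerpts in the following comment; adjust to the real hosts):

            # on the mdt0 server (ossb2): confirm the local o2ib NID and ping the mdt1 server
            lctl list_nids
            lctl ping 172.20.110.209@o2ib
            # on the mdt1 server (ossb1): ping back the other way
            lctl ping 172.20.110.210@o2ib

            Note that lctl ping only sends small messages, so it can succeed even while large bulk transfers fail; the LNet self-test sketch earlier in the thread is the stronger check.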

            sebg-crd-pm sebg-crd-pm (Inactive) added a comment -

            I tested this bug again on 2.10.1-RC1 (without adding any patches).

            "It looks like network I/O on mdt1 timed out; could you verify the network on mdt1 is working correctly?"
            => All OSP import states look normal:
            [mdt1 server]
            ./osp/jlustre-MDT0000-osp-MDT0001/state:current_state: FULL
            ./osp/jlustre-MDT0000-osp-MDT0001/import: state: FULL
            ./osp/jlustre-OST0000-osc-MDT0001/state:current_state: FULL
            ./osp/jlustre-OST0000-osc-MDT0001/import: state: FULL
            [mdt0 server]
            ./osp/jlustre-MDT0001-osp-MDT0000/state:current_state: FULL
            ./osp/jlustre-MDT0001-osp-MDT0000/import: state: FULL
            ./osp/jlustre-OST0000-osc-MDT0000/state:current_state: FULL
            ./osp/jlustre-OST0000-osc-MDT0000/import: state: FULL

            [/var/log/messages on the mdt0 server]
            Sep 19 02:50:14 ossb2 kernel: LNetError: 21764:0:(o2iblnd.c:1940:kiblnd_fmr_pool_map()) Failed to map mr 10/11 elements
            Sep 19 02:50:14 ossb2 kernel: LNetError: 21764:0:(o2iblnd_cb.c:560:kiblnd_fmr_map_tx()) Can't map 41033 pages: -22
            Sep 19 02:50:14 ossb2 kernel: LNetError: 21764:0:(o2iblnd_cb.c:1554:kiblnd_send()) Can't setup GET sink for 172.20.110.209@o2ib: -22
            Sep 19 02:50:14 ossb2 kernel: LustreError: 21764:0:(events.c:449:server_bulk_callback()) event type 5, status -5, desc ffff88086ea2e400
            Sep 19 02:51:54 ossb2 kernel: LustreError: 21764:0:(ldlm_lib.c:3237:target_bulk_io()) @@@ timeout on bulk WRITE after 100+0s req@ffff880457208c50 x1578948605516272/t0(0) o1000->jlustre-MDT0001-mdtlov_UUID@172.20.110.209@o2ib:210/0 lens 376/0 e 4 to 0 dl 1505803920 ref 1 fl Interpret:/0/ffffffff rc 0/-1

            [/var/log/messages in mdt1 server]
            Sep 19 14:51:22 ossb1 kernel: LustreError: 11-0: jlustre-MDT0000-osp-MDT0001: operation out_update to node 172.20.110.210@o2ib failed: rc = -110
            Sep 19 14:51:22 ossb1 kernel: LustreError: 31069:0:(layout.c:2085:__req_capsule_get()) @@@ Wrong buffer for field `object_update_reply' (1 of 1) in format `OUT_UPDATE': 0 vs. 4096 (server)#012 req@ffff8807d3aa7800 x1578948605516272/t0(0) o1000->jlustre-MDT0000-osp-MDT0001@172.20.110.210@o2ib:24/4 lens 376/192 e 4 to 0 dl 1505803889 ref 2 fl Interpret:ReM/0/0 rc -110/-110
            Sep 19 14:51:24 ossb1 kernel: LustreError: 30780:0:(llog_cat.c:773:llog_cat_cancel_records()) jlustre-MDT0000-osp-MDT0001: fail to cancel 1 of 1 llog
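
            The kiblnd_fmr_pool_map / "Can't map 41033 pages: -22" errors on the mdt0 server point at the o2iblnd FMR/map-on-demand handling that LU-9500/LU-9472 touch, so it may be worth recording the ko2iblnd settings on both servers. A sketch (the exact parameters available depend on the ko2iblnd version actually loaded):

            # on both MDT servers
            cat /sys/module/ko2iblnd/parameters/map_on_demand
            cat /sys/module/ko2iblnd/parameters/fmr_pool_size
            cat /sys/module/ko2iblnd/parameters/fmr_flush_trigger
            grep -ri ko2iblnd /etc/modprobe.d/    # any local option overrides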

            laisiyao Lai Siyao added a comment -

            in mdt1.log:

            00010000:00000001:6.0:1505117481.059504:0:5182:0:(ldlm_lib.c:3268:target_bulk_io()) Process leaving (rc=18446744073709551506 : -110 : ffffffffffffff92)
            00000020:00000001:6.0:1505117481.059508:0:5182:0:(out_handler.c:982:out_handle()) Process leaving via out_free (rc=18446744073709547410 : -4206 : 0xffffffffffffef92)
            

            which in turn caused this on mdt0:

            00000004:00000001:9.0:1505117529.647439:0:3485:0:(osp_trans.c:1204:osp_send_update_req()) Process leaving (rc=18446744073709551506 : -110 : ffffffffffffff92)
            ...
            00000020:00000001:1.0:1505117529.647589:0:3442:0:(update_trans.c:1091:top_trans_stop()) Process leaving (rc=18446744073709551611 : -5 : fffffffffffffffb)
            ...
            00000004:00000001:1.0:1505117529.647779:0:3442:0:(mdt_reint.c:526:mdt_create()) Process leaving via put_child (rc=18446744073709551611 : -5 : 0xfffffffffffffffb)
            

            It looks like network I/O on mdt1 timed out (-110 is ETIMEDOUT, which surfaces at the client as -5/EIO); could you verify the network on mdt1 is working correctly?

            pjones Peter Jones added a comment -

            Lai

            Can you please advise on this one?

            Thanks

            Peter


            People

              laisiyao Lai Siyao
              sebg-crd-pm sebg-crd-pm (Inactive)
              Votes: 0
              Watchers: 5
