Lustre / LU-9958

Create striped directory fails in 2.10 (with LU-9500 patch)

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Critical

    Description

      Hi,
      I tried to use the OFED 4.0 driver with Lustre 2.10 plus the LU-9500 patch (https://review.whamcloud.com/#/c/28237/), but got an error creating a striped directory.
      In LU-9461 I was told that Lustre 2.9 needs LU-9026, LU-9472, and LU-9500 applied.
      I then tested OFED 4.0 on Lustre 2.10 + the LU-9500 patch (LU-9026 and LU-9472 are already in Lustre 2.10).
      Should I apply any other patch for this issue? Thanks.

      // the two MDTs must be on different servers
      [root@hsm client]# lfs mkdir -c 2 dir1
      error on LL_IOC_LMV_SETSTRIPE 'dir1' (3): Input/output error
      error: mkdir: create stripe dir 'dir1' failed
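
For context, a typical sequence for checking the setup around this failure might look like the following (a sketch of standard `lfs` commands; the `/mnt/client` mount point is taken from a later comment in this ticket):

```shell
# List the MDTs visible to the client; a -c 2 directory needs two of them.
lfs mdts /mnt/client

# Attempt the striped-directory create that fails in this report.
lfs mkdir -c 2 /mnt/client/dir1

# On success, inspect the resulting stripe layout.
lfs getdirstripe /mnt/client/dir1
```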

      Attachments

        1. client.log
          3.71 MB
        2. mdt0.log
          4.49 MB
        3. mdt1.log
          4.22 MB

        Issue Links

          Activity

            [LU-9958] Create striped directory fails in 2.10 (with LU-9500 patch)

            sebg-crd-pm sebg-crd-pm (Inactive) added a comment -

            I tested this bug again in 2.10.1-RC1 (without adding any patch).

            It looks like network I/O on mdt1 timed out; could you verify that the network on mdt1 is working correctly?
            => All OSP import states look normal:
            [mdt1 server]
            ./osp/jlustre-MDT0000-osp-MDT0001/state:current_state: FULL
            ./osp/jlustre-MDT0000-osp-MDT0001/import: state: FULL
            ./osp/jlustre-OST0000-osc-MDT0001/state:current_state: FULL
            ./osp/jlustre-OST0000-osc-MDT0001/import: state: FULL
            [mdt0 server]
            ./osp/jlustre-MDT0001-osp-MDT0000/state:current_state: FULL
            ./osp/jlustre-MDT0001-osp-MDT0000/import: state: FULL
            ./osp/jlustre-OST0000-osc-MDT0000/state:current_state: FULL
            ./osp/jlustre-OST0000-osc-MDT0000/import: state: FULL
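
For reference, these import states can be gathered on each MDS with `lctl get_param`, and basic LNet reachability to the peer MDS checked with `lctl ping` (a sketch of commands to run on the servers; the NID is the peer MDS NID from the logs in this ticket):

```shell
# Dump the OSP/OSC import states shown above.
lctl get_param osp.*.state
lctl get_param osp.*.import | grep 'state:'

# Check LNet reachability to the peer MDS NID.
lctl ping 172.20.110.209@o2ib
```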

            [/var/log/messages on the mdt0 server]
            Sep 19 02:50:14 ossb2 kernel: LNetError: 21764:0:(o2iblnd.c:1940:kiblnd_fmr_pool_map()) Failed to map mr 10/11 elements
            Sep 19 02:50:14 ossb2 kernel: LNetError: 21764:0:(o2iblnd_cb.c:560:kiblnd_fmr_map_tx()) Can't map 41033 pages: -22
            Sep 19 02:50:14 ossb2 kernel: LNetError: 21764:0:(o2iblnd_cb.c:1554:kiblnd_send()) Can't setup GET sink for 172.20.110.209@o2ib: -22
            Sep 19 02:50:14 ossb2 kernel: LustreError: 21764:0:(events.c:449:server_bulk_callback()) event type 5, status -5, desc ffff88086ea2e400
            Sep 19 02:51:54 ossb2 kernel: LustreError: 21764:0:(ldlm_lib.c:3237:target_bulk_io()) @@@ timeout on bulk WRITE after 100+0s req@ffff880457208c50 x1578948605516272/t0(0) o1000->jlustre-MDT0001-mdtlov_UUID@172.20.110.209@o2ib:210/0 lens 376/0 e 4 to 0 dl 1505803920 ref 1 fl Interpret:/0/ffffffff rc 0/-1

            [/var/log/messages on the mdt1 server]
            Sep 19 14:51:22 ossb1 kernel: LustreError: 11-0: jlustre-MDT0000-osp-MDT0001: operation out_update to node 172.20.110.210@o2ib failed: rc = -110
            Sep 19 14:51:22 ossb1 kernel: LustreError: 31069:0:(layout.c:2085:__req_capsule_get()) @@@ Wrong buffer for field `object_update_reply' (1 of 1) in format `OUT_UPDATE': 0 vs. 4096 (server)#012 req@ffff8807d3aa7800 x1578948605516272/t0(0) o1000->jlustre-MDT0000-osp-MDT0001@172.20.110.210@o2ib:24/4 lens 376/192 e 4 to 0 dl 1505803889 ref 2 fl Interpret:ReM/0/0 rc -110/-110
            Sep 19 14:51:24 ossb1 kernel: LustreError: 30780:0:(llog_cat.c:773:llog_cat_cancel_records()) jlustre-MDT0000-osp-MDT0001: fail to cancel 1 of 1 llog
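
Note that the same request xid, x1578948605516272, appears in both server logs: mdt0's bulk WRITE timeout and mdt1's `Wrong buffer` error refer to the same request. A small sketch (a hypothetical helper, not part of Lustre) that extracts and compares the xid:

```python
import re

# Abridged copies of the two log lines quoted above.
mdt0_line = ("LustreError: 21764:0:(ldlm_lib.c:3237:target_bulk_io()) @@@ "
             "timeout on bulk WRITE after 100+0s req@ffff880457208c50 "
             "x1578948605516272/t0(0)")
mdt1_line = ("LustreError: 31069:0:(layout.c:2085:__req_capsule_get()) @@@ "
             "Wrong buffer for field `object_update_reply' "
             "req@ffff8807d3aa7800 x1578948605516272/t0(0)")

def extract_xid(line):
    """Return the request xid (the x<digits> token) from a Lustre log line."""
    match = re.search(r"\bx(\d+)/t", line)
    return match.group(1) if match else None

# Both servers report the same request.
assert extract_xid(mdt0_line) == extract_xid(mdt1_line) == "1578948605516272"
```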

            laisiyao Lai Siyao added a comment -

            in mdt1.log:

            00010000:00000001:6.0:1505117481.059504:0:5182:0:(ldlm_lib.c:3268:target_bulk_io()) Process leaving (rc=18446744073709551506 : -110 : ffffffffffffff92)
            00000020:00000001:6.0:1505117481.059508:0:5182:0:(out_handler.c:982:out_handle()) Process leaving via out_free (rc=18446744073709547410 : -4206 : 0xffffffffffffef92)
            

            which caused mdt0:

            00000004:00000001:9.0:1505117529.647439:0:3485:0:(osp_trans.c:1204:osp_send_update_req()) Process leaving (rc=18446744073709551506 : -110 : ffffffffffffff92)
            ...
            00000020:00000001:1.0:1505117529.647589:0:3442:0:(update_trans.c:1091:top_trans_stop()) Process leaving (rc=18446744073709551611 : -5 : fffffffffffffffb)
            ...
            00000004:00000001:1.0:1505117529.647779:0:3442:0:(mdt_reint.c:526:mdt_create()) Process leaving via put_child (rc=18446744073709551611 : -5 : 0xfffffffffffffffb)
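
The large rc values in these debug lines are 64-bit two's-complement encodings of negative error codes: 18446744073709551506 is -110 (ETIMEDOUT) and 18446744073709551611 is -5 (EIO), as the logs themselves note. A quick conversion sketch:

```python
def rc_to_errno(rc):
    """Interpret a 64-bit unsigned rc from a Lustre debug log as a signed
    value (two's complement), matching the decoded value printed beside it."""
    return rc - (1 << 64) if rc >= (1 << 63) else rc

assert rc_to_errno(18446744073709551506) == -110   # ETIMEDOUT
assert rc_to_errno(18446744073709551611) == -5     # EIO
assert rc_to_errno(18446744073709547410) == -4206
```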
            

            It looks like network I/O on mdt1 timed out; could you verify that the network on mdt1 is working correctly?

            pjones Peter Jones added a comment -

            Lai

            Can you please advise on this one?

            Thanks

            Peter


            sebg-crd-pm sebg-crd-pm (Inactive) added a comment -

            Hi Brad,

            Do you have any update after reviewing logs?

            Thanks!


            sebg-crd-pm sebg-crd-pm (Inactive) added a comment -

            FYI

            lfs mkdir -c 2 /mnt/client/dir3
            error on LL_IOC_LMV_SETSTRIPE 'dir3' (3): Input/output error
            error: mkdir: create stripe dir 'dir3' failed

            see attached file mdt1.log mdt0.log client.log


            bhoagland Brad Hoagland (Inactive) added a comment -

            Hello,

            Please attach the entire log for us to review.

            Thanks,

            Brad


            People

              Assignee: laisiyao Lai Siyao
              Reporter: sebg-crd-pm sebg-crd-pm (Inactive)
              Votes: 0
              Watchers: 5