
MDS BUG on lfs migrate [osd_it_ea_rec]

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.12.0
    • Affects Version/s: Lustre 2.12.0
    • Labels: None
    • Environment: CentOS 7.6 clients and servers (kernel 3.10.0-957.1.3.el7_lustre.x86_64)
    • Severity: 3

    Description

      Client doing a migration of a directory from mdt0 to mdt1:

      $ cd /fir/users/sthiell/mdtest
      $ lfs migrate -m 1 32142854
      

      MDT0:

      [153404.321665] BUG: unable to handle kernel paging request at ffff903cb4a7d000
      [153404.328760] IP: [<ffffffffb9d7f29e>] strncpy+0x1e/0x30
      [153404.334023] PGD 6ae452067 PUD 2038b43063 PMD 20386e0063 PTE 8000002034a7d061
      [153404.341259] Oops: 0003 [#1] SMP 
      [153404.344632] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) ldiskfs(OE) lustre(OE) lmv(OE) mdc(OE) osc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache ib_ucm rpcrdma rdma_ucm ib_uverbs ib_iser ib_umad rdma_cm iw_cm libiscsi ib_ipoib scsi_transport_iscsi ib_cm mlx5_ib ib_core mpt2sas mptctl mptbase dell_rbu sunrpc vfat fat dm_round_robin dcdbas amd64_edac_mod edac_mce_amd kvm_amd kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd ses dm_multipath enclosure ipmi_si pcspkr ipmi_devintf dm_mod ccp sg ipmi_msghandler i2c_piix4 k10temp acpi_power_meter ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif
      [153404.417761]  crct10dif_generic i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops mlx5_core ttm ahci mlxfw libahci drm devlink crct10dif_pclmul crct10dif_common tg3 crc32c_intel ptp megaraid_sas libata drm_panel_orientation_quirks pps_core mpt3sas(OE) raid_class scsi_transport_sas
      [153404.443948] CPU: 5 PID: 45301 Comm: mdt_out01_004 Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.1.3.el7_lustre.x86_64 #1
      [153404.456538] Hardware name: Dell Inc. PowerEdge R6415/065PKD, BIOS 1.3.6 04/20/2018
      [153404.464191] task: ffff902c9aa3b0c0 ti: ffff903cb5bec000 task.ti: ffff903cb5bec000
      [153404.471757] RIP: 0010:[<ffffffffb9d7f29e>]  [<ffffffffb9d7f29e>] strncpy+0x1e/0x30
      [153404.479435] RSP: 0018:ffff903cb5befaf0  EFLAGS: 00010206
      [153404.484835] RAX: ffff903cb4a7d000 RBX: ffff903cb4a7cfe0 RCX: ffff903cb4a7d000
      [153404.492053] RDX: 0000000000000064 RSI: ffff903ca7d1213e RDI: ffff903cb4a7d000
      [153404.499275] RBP: ffff903cb5befaf0 R08: ffff903cb4a7d010 R09: 0000000000000018
      [153404.506495] R10: ffff902cbca82400 R11: ffff902cbca82400 R12: ffff901de9582000
      [153404.513714] R13: ffff903ca7d12118 R14: 0000000000000000 R15: 0000000000000010
      [153404.520933] FS:  00007ff8bee70740(0000) GS:ffff903cbf640000(0000) knlGS:0000000000000000
      [153404.529105] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [153404.534939] CR2: ffff903cb4a7d000 CR3: 00000006ade10000 CR4: 00000000003407e0
      [153404.542157] Call Trace:
      [153404.544711]  [<ffffffffc118a6ac>] osd_it_ea_rec+0x2ec/0x610 [osd_ldiskfs]
      [153404.551605]  [<ffffffffc0c334b9>] dt_index_page_build+0x149/0x470 [obdclass]
      [153404.558756]  [<ffffffffc0c330e0>] dt_index_walk+0x1a0/0x430 [obdclass]
      [153404.565386]  [<ffffffffc0c33370>] ? dt_index_walk+0x430/0x430 [obdclass]
      [153404.572190]  [<ffffffffc0c34444>] dt_index_read+0x394/0x6a0 [obdclass]
      [153404.578848]  [<ffffffffc0eceb32>] tgt_obd_idx_read+0x612/0x860 [ptlrpc]
      [153404.585579]  [<ffffffffc0ed135a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      [153404.592570]  [<ffffffffc0eaaa51>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
      [153404.600233]  [<ffffffffc0aacbde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
      [153404.607399]  [<ffffffffc0e7592b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [153404.615171]  [<ffffffffc0e727b5>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
      [153404.622046]  [<ffffffffb9ad67c2>] ? default_wake_function+0x12/0x20
      [153404.628403]  [<ffffffffb9acba9b>] ? __wake_up_common+0x5b/0x90
      [153404.634352]  [<ffffffffc0e7925c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
      [153404.640729]  [<ffffffffc0e78760>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
      [153404.648208]  [<ffffffffb9ac1c31>] kthread+0xd1/0xe0
      [153404.653172]  [<ffffffffb9ac1b60>] ? insert_kthread_work+0x40/0x40
      [153404.659357]  [<ffffffffba174c24>] ret_from_fork_nospec_begin+0xe/0x21
      [153404.665886]  [<ffffffffb9ac1b60>] ? insert_kthread_work+0x40/0x40
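
A few facts can be read directly off the oops above. The sketch below decodes them in Python. The bit meanings are the standard x86-64 page-fault error code and page-table-entry flags, a 4 KiB page size is assumed, and the helper names are invented for illustration. It is a reading aid for the trace, not a statement of the root cause.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, the x86_64 default


def decode_fault_error(code):
    """Decode the low bits of an x86 page-fault error code ("Oops: 0003")."""
    return {
        # bit 0 set: page was present, access violated its protection
        "protection_violation": bool(code & 0x1),
        # bit 1 set: the faulting access was a write
        "write": bool(code & 0x2),
        # bit 2 set: the access came from user mode
        "user_mode": bool(code & 0x4),
    }


def decode_pte_flags(pte):
    """Decode a few flag bits of an x86-64 page-table entry."""
    return {
        "present": bool(pte & 0x1),
        "writable": bool(pte & 0x2),
        "no_execute": bool((pte >> 63) & 0x1),
    }


# Values copied from the oops above
cr2 = 0xFFFF903CB4A7D000   # faulting address
rbx = 0xFFFF903CB4A7CFE0   # RBX at the time of the fault
pte = 0x8000002034A7D061   # PTE printed on the PGD/PUD/PMD/PTE line
error_code = 0x0003

fault = decode_fault_error(error_code)
flags = decode_pte_flags(pte)
cr2_page_offset = cr2 % PAGE_SIZE   # position of the fault inside its page
rbx_to_page_end = cr2 - rbx         # distance from RBX to the faulting address

print(fault)
print(flags)
print(hex(cr2_page_offset), rbx_to_page_end)
```

Read this way, the fault is a kernel-mode write to a page that is mapped but not writable, at an address that is exactly page-aligned and 32 bytes above the value in RBX. That pattern is what one would expect if strncpy ran off the end of a page-sized buffer, although the trace alone does not show which buffer was involved.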
      

      Lustre 2.12.0 RC2

      Thanks,
      Stephane

          Activity

            [LU-11753] MDS BUG on lfs migrate [osd_it_ea_rec]

            gerrit Gerrit Updater added a comment -

            Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34376
            Subject: LU-11753 utils: print out DNE2 directory hash flags
            Project: fs/lustre-release
            Branch: b2_10
            Current Patch Set: 1
            Commit: 47096b801de9b70f01caa7aae4104f2b851bc474

            pjones Peter Jones added a comment -

            Landed for 2.12


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33865/
            Subject: LU-11753 obdclass: lu_dirent record length missing '0'
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 77f01308c5095030ee84f83339c085bcbcf04155

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33837/
            Subject: LU-11753 obdclass: index_page support variable length rec
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: a13e4991a1350f54f97c6ba13686d33c7a3eeb57

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33843/
            Subject: LU-11753 utils: print out DNE2 directory hash flags
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 795eea12271f5c1bab3414803db3538d7c266d66

            sthiell Stephane Thiell added a comment -

            Ok I see, thanks Andreas! (the correct LU# seems to be LU-7607)

            adilger Andreas Dilger added a comment - edited

            > I got the following error from another terminal with a CWD within the migrated directory:
            >
            > [root@fir-rbh01 32142854-copy4]# lfs getdirstripe .
            > lfs getdirstripe: cannot open '.': No such file or directory (2)
            > error: getdirstripe failed for ..
            >
            > This is probably a limitation of lfs migrate at the moment but I'm fine with that!

            Yes, migrating files/directories between MDTs changes their FIDs, which is important to note if your HSM copytool depends on FIDs... There is LU-7607 "Preserve inode number after MDT migration" but it is not implemented yet.

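
Since a migration between MDTs changes FIDs, a tool that caches FIDs (an HSM copytool, for example) would need to notice which of its cached entries went stale. The Python sketch below shows one way to do that check, assuming FIDs in the bracketed `[seq:oid:ver]` hexadecimal form printed by Lustre tools such as `lfs path2fid`. The function names, paths, and FID values are all hypothetical, and how the FIDs are collected is left out.

```python
import re
from typing import NamedTuple


class Fid(NamedTuple):
    seq: int
    oid: int
    ver: int


# Matches "[0x200000401:0x1:0x0]" with or without the surrounding brackets.
_FID_RE = re.compile(
    r"^\[?(0x[0-9a-fA-F]+):(0x[0-9a-fA-F]+):(0x[0-9a-fA-F]+)\]?$"
)


def parse_fid(text):
    """Parse a FID string into its three numeric components."""
    match = _FID_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a FID: {text!r}")
    return Fid(*(int(group, 16) for group in match.groups()))


def stale_entries(cached, current):
    """Return the paths whose cached FID differs from the current one."""
    return sorted(
        path for path, fid in cached.items() if current.get(path) != fid
    )


# Hypothetical FIDs recorded before and after a migration
cached = {
    "dir/a": parse_fid("[0x200000401:0x1:0x0]"),
    "dir/b": parse_fid("[0x200000401:0x2:0x0]"),
}
current = {
    "dir/a": parse_fid("[0x240000401:0x5:0x0]"),
    "dir/b": parse_fid("[0x200000401:0x2:0x0]"),
}

print(stale_entries(cached, current))
```

With these made-up values only `dir/a` is reported, because its FID differs between the two snapshots.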
            pjones Peter Jones added a comment -

            Fantastic news - thanks Stephane!


            sthiell Stephane Thiell added a comment -

            Patches tested on top of 2.12.0 RC2:
            https://review.whamcloud.com/#/c/33837/4
            https://review.whamcloud.com/#/c/33865/1

            Result: success!

            [root@fir-rbh01 mdtest]# lfs getdirstripe 32142854-copy6
            lmv_stripe_count: 0 lmv_stripe_offset: 0 lmv_hash_type: none
            [root@fir-rbh01 mdtest]# lfs migrate -m 1 32142854-copy6
            [root@fir-rbh01 mdtest]# lfs getdirstripe 32142854-copy6
            lmv_stripe_count: 0 lmv_stripe_offset: 1 lmv_hash_type: none

            BTW I believe the "File exists" errors are due to the fact that the original directory 32142854 is partially migrated and the MDS crashed several times, so maybe the two directories are present on both MDTs? I don't want to resume its migration now because the directory is part of my reproducer for the current issue, but we can probably confirm that later.

            Thank you all!
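
The before/after `lfs getdirstripe` output in the comment above can be checked mechanically. The Python sketch below parses the two output lines exactly as quoted and compares `lmv_stripe_offset`, which is taken here to be the index of the MDT holding the directory (an assumption based on the value changing from 0 to 1 after `lfs migrate -m 1`). The helper names are invented for illustration.

```python
def parse_dirstripe(line):
    """Parse one 'key: value key: value ...' line of lfs getdirstripe output."""
    tokens = line.split()
    fields = {}
    # Tokens alternate between "key:" and "value".
    for key, value in zip(tokens[0::2], tokens[1::2]):
        fields[key.rstrip(":")] = int(value) if value.isdigit() else value
    return fields


def migrated_to(before, after, target_mdt):
    """True if the directory moved onto target_mdt between the two snapshots."""
    return (
        before["lmv_stripe_offset"] != target_mdt
        and after["lmv_stripe_offset"] == target_mdt
    )


# Output lines copied from the test run above
before = parse_dirstripe(
    "lmv_stripe_count: 0 lmv_stripe_offset: 0 lmv_hash_type: none"
)
after = parse_dirstripe(
    "lmv_stripe_count: 0 lmv_stripe_offset: 1 lmv_hash_type: none"
)

print(migrated_to(before, after, target_mdt=1))
```

For the two lines quoted in the comment, the check reports that the directory moved to MDT index 1.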

            sthiell Stephane Thiell added a comment -

            Peter, ok great, no problem! Working on it now.

            People

              Assignee: laisiyao Lai Siyao
              Reporter: sthiell Stephane Thiell