Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.13.0
    • Affects Version/s: Lustre 2.10.5
    • Labels: None
    • Severity: 1

    Description

      The server keeps crashing with the following error:

      [  981.957669] Lustre: nbp13-OST0008: trigger OI scrub by RPC for the [0x100080000:0x217edd:0x0] with flags 0x4a, rc = 0
      [  981.989579] Lustre: Skipped 11 previous similar messages
      [ 1045.404615] ------------[ cut here ]------------
      [ 1045.418484] kernel BUG at /tmp/rpmbuild-lustre-jlan-ItUrr9b3/BUILD/lustre-2.10.5/ldiskfs/ldiskfs.h:1907!
      [ 1045.446989] invalid opcode: 0000 [#1] SMP 
      [ 1045.459302] Modules linked in: ofd(OE) ost(OE) osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) fid(OE) fld(OE) dm_service_time ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) lpfc ib_iser(OE) libiscsi scsi_transport_iscsi crct10dif_generic scsi_transport_fc scsi_tgt rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) bonding ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) sunrpc dm_mirror dm_region_hash dm_log mlx5_ib(OE) ib_core(OE) intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul mgag200 ghash_clmulni_intel i2c_algo_bit ttm dm_multipath aesni_intel drm_kms_helper lrw syscopyarea gf128mul sysfillrect sysimgblt glue_helper fb_sys_fops ablk_helper mlx5_core(OE) mlxfw(OE) tg3 ses cryptd mlx_compat(OE) drm ptp ipmi_si enclosure mei_me i2c_core pps_core hpwdt hpilo ipmi_devintf lpc_ich dm_mod mfd_core mei shpchp pcspkr wmi ipmi_msghandler acpi_power_meter binfmt_misc tcp_bic ip_tables virtio_scsi virtio_ring virtio xfs libcrc32c ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_common sg usb_storage smartpqi(E) crc32c_intel scsi_transport_sas [last unloaded: pps_core]
      [ 1045.776428] CPU: 5 PID: 11348 Comm: lfsck Tainted: G           OE  ------------   3.10.0-693.21.1.el7.20180508.x86_64.lustre2105 #1
      [ 1045.811992] Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380 Gen10, BIOS U30 06/15/2018
      [ 1045.837624] task: ffff882ddca23f40 ti: ffff882bd280c000 task.ti: ffff882bd280c000
      [ 1045.860117] RIP: 0010:[<ffffffffa10fbd04>]  [<ffffffffa10fbd04>] ldiskfs_rec_len_to_disk.part.9+0x4/0x10 [ldiskfs]
      [ 1045.891259] RSP: 0018:ffff882bd280f980  EFLAGS: 00010207
      [ 1045.907218] RAX: 0000000000000000 RBX: ffff882bd280fb58 RCX: ffff882bd280f994
      [ 1045.928666] RDX: 00000000ffffffac RSI: ffffffffffffff81 RDI: 00000000ffffff81
      [ 1045.950113] RBP: ffff882bd280f980 R08: 00000000ffffff81 R09: ffffffffa10fded0
      [ 1045.971560] R10: ffff88303f803b00 R11: 0000000000ffffff R12: 000000000000003c
      [ 1045.993006] R13: ffff881e2eae7708 R14: ffff881e2eae7690 R15: 0000000000000000
      [ 1046.014452] FS:  0000000000000000(0000) GS:ffff882f7ef40000(0000) knlGS:0000000000000000
      [ 1046.038775] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 1046.056039] CR2: 00007ffff20df034 CR3: 0000002ef4268000 CR4: 00000000003607e0
      [ 1046.077485] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 1046.098932] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 1046.120378] Call Trace:
      [ 1046.127717]  [<ffffffffa10fe245>] htree_inlinedir_to_tree+0x445/0x450 [ldiskfs]
      [ 1046.149690]  [<ffffffff8123002e>] ? __generic_file_splice_read+0x4ee/0x5e0
      [ 1046.170356]  [<ffffffff81234cdd>] ? __getblk+0x2d/0x2e0
      [ 1046.186052]  [<ffffffff81234c4c>] ? __find_get_block+0xbc/0x120
      [ 1046.203841]  [<ffffffff81234cdd>] ? __getblk+0x2d/0x2e0
      [ 1046.219541]  [<ffffffffa10cdfa0>] ? __ldiskfs_get_inode_loc+0x110/0x3e0 [ldiskfs]
      [ 1046.242039]  [<ffffffffa10c89ef>] ? ldiskfs_xattr_find_entry+0x9f/0x130 [ldiskfs]
      [ 1046.264536]  [<ffffffffa10c0277>] ldiskfs_htree_fill_tree+0x137/0x2f0 [ldiskfs]
      [ 1046.286507]  [<ffffffff811df826>] ? kmem_cache_alloc_trace+0x1d6/0x200
      [ 1046.306126]  [<ffffffffa10ae5ec>] ldiskfs_readdir+0x61c/0x850 [ldiskfs]
      [ 1046.326012]  [<ffffffffa1147640>] ? osd_declare_ref_del+0x130/0x130 [osd_ldiskfs]
      [ 1046.348507]  [<ffffffff812256b2>] ? generic_getxattr+0x52/0x70
      [ 1046.366036]  [<ffffffffa1145cde>] osd_ldiskfs_it_fill+0xbe/0x260 [osd_ldiskfs]
      [ 1046.387747]  [<ffffffffa1145eb7>] osd_it_ea_load+0x37/0x100 [osd_ldiskfs]
      [ 1046.408158]  [<ffffffffa122808c>] lfsck_open_dir+0x11c/0x3a0 [lfsck]
      [ 1046.427257]  [<ffffffffa1228cb2>] lfsck_master_oit_engine+0x9a2/0x1190 [lfsck]
      [ 1046.448969]  [<ffffffff816946f7>] ? __schedule+0x477/0xa30
      [ 1046.465453]  [<ffffffffa1229d96>] lfsck_master_engine+0x8f6/0x1360 [lfsck]
      [ 1046.486120]  [<ffffffff810c4d40>] ? wake_up_state+0x20/0x20
      [ 1046.502865]  [<ffffffffa12294a0>] ? lfsck_master_oit_engine+0x1190/0x1190 [lfsck]
      [ 1046.525360]  [<ffffffff810b1131>] kthread+0xd1/0xe0
      [ 1046.540011]  [<ffffffff810b1060>] ? insert_kthread_work+0x40/0x40
      [ 1046.558323]  [<ffffffff816a14dd>] ret_from_fork+0x5d/0xb0
      [ 1046.574540]  [<ffffffff810b1060>] ? insert_kthread_work+0x40/0x40
      [ 1046.592852] Code: 44 04 02 48 8d 44 03 c8 48 01 c7 e8 b7 f6 22 e0 48 83 c4 10 5b 41 5c 41 5d 41 5e 41 5f 5d c3 0f 0b 0f 0b 0f 1f 40 00 55 48 89 e5 <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 85 f6 48 
      [ 1046.650192] RIP  [<ffffffffa10fbd04>] ldiskfs_rec_len_to_disk.part.9+0x4/0x10 [ldiskfs]
      
      

      Attachments

        1. debug-lfsck-nbp15-MDT0000.gz
          60 kB
        2. dumpe2fs.out
          36 kB
        3. nbp13.debug.gz
          24.76 MB
        4. nbp13.lfsck.debug.out1.gz
          297 kB
        5. nbp13.lfsck.debug.out2.gz
          4 kB
        6. oi_scrub.out
          6 kB

        Issue Links

          Activity

            [LU-11584] kernel BUG at ldiskfs.h:1907!

            jgmitter Joseph Gmitter (Inactive) added a comment -

            Patch landed to master.

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33546/
            Subject: LU-11584 osd-ldiskfs: fix lost+found object replace
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 900352f2bc15906a8fba9cb889df4b166a53bade

            mhanafi Mahmoud Hanafi added a comment -

            Opened new prio 1 case LU-11737: after deleting the quarantined files we are hitting an LBUG.

            adilger Andreas Dilger added a comment -

            Mahmoud, the orphan OST objects can be cleaned up with LFSCK layout checking. The orphans are linked into the $MOUNT/.lustre/lost+found directory if "lctl lfsck_start -o -t layout" is used (the "-o" option can be used as part of a full LFSCK run as well).
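
            For reference, a minimal command sequence along the lines Andreas describes might look like the following (the MDT device name and client mount point are illustrative, not taken from this ticket):

            # start a layout LFSCK that also links orphan OST objects into lost+found
            mds# lctl lfsck_start -M nbp13-MDT0000 -t layout -o
            # monitor progress of the layout scan
            mds# lctl get_param -n mdd.*.lfsck_layout
            # once phase2 completes, recovered orphans appear under .lustre/lost+found on a client
            client# ls /mnt/nbp13/.lustre/lost+found/MDT0000/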

            mhanafi Mahmoud Hanafi added a comment -

            I was able to find all the inodes with bad LMA and delete them via ldiskfs. What we have left now are files that trigger OI scrub and report "?" for size/uid/etc. The user has been able to recover all the affected files, so we just need a way to delete them.

            If we delete the files via ldiskfs, how can we make sure that the objects will be cleaned up?
            bzzz Alex Zhuravlev added a comment - - edited

            I still don't understand why the nbp13 log doesn't contain the "unsupported incompat LMA feature" message.

            bzzz Alex Zhuravlev added a comment -

            I'm modifying the test to simulate an additional broken LinkEA and will report the results ASAP.

            mhanafi Mahmoud Hanafi added a comment -

            Any comments on the output of nbp13.lfsck?

            gerrit Gerrit Updater added a comment -

            Li Dongyang (dongyangli@ddn.com) uploaded a new patch: https://review.whamcloud.com/33576
            Subject: LU-11584 e2fsck: check xattr 'system.data' before setting inline_data feature
            Project: tools/e2fsprogs
            Branch: master-lustre
            Current Patch Set: 1
            Commit: 64b71635ffa84a01946199e3cd31b1ee9fd9a15f
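
            As a rough way to check whether a given inode was wrongly marked inline_data, the inode flags and its extended attributes can be inspected with debugfs; the inode number and device below are placeholders, and ea_list needs a reasonably recent e2fsprogs:

            # "Flags:" in the stat output includes 0x10000000 when inline_data is set
            debugfs -c -R "stat <INUM>" /dev/sdX
            # a genuine inline_data inode should also carry the system.data xattr
            debugfs -c -R "ea_list <INUM>" /dev/sdX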

            mhanafi Mahmoud Hanafi added a comment -

            Here are the nbp13 lfsck runs.

             nbp13-srv1 ~ # lctl get_param -n  mdd.*.lfsck_namespace
            name: lfsck_namespace
            magic: 0xa0621a0b
            version: 2
            status: completed
            flags: inconsistent
            param: dryrun
            last_completed_time: 1541281433
            time_since_last_completed: 341 seconds
            latest_start_time: 1541281072
            time_since_latest_start: 702 seconds
            last_checkpoint_time: 1541281433
            time_since_last_checkpoint: 341 seconds
            latest_start_position: 77, N/A, N/A
            last_checkpoint_position: 317719759, N/A, N/A
            first_failure_position: 153388517, [0x2000020af:0x39d9:0x0], 0x753a410c57f07b3
            checked_phase1: 30987846
            checked_phase2: 111
            inconsistent_phase1: 2
            inconsistent_phase2: 3
            failed_phase1: 21
            failed_phase2: 3
            directories: 2709152
            dirent_inconsistent: 0
            linkea_inconsistent: 2
            nlinks_inconsistent: 0
            multiple_linked_checked: 5
            multiple_linked_inconsistent: 0
            unknown_inconsistency: 0
            unmatched_pairs_inconsistent: 0
            dangling_inconsistent: 0
            multiple_referenced_inconsistent: 3
            bad_file_type_inconsistent: 0
            lost_dirent_inconsistent: 0
            local_lost_found_scanned: 3
            local_lost_found_moved: 3
            local_lost_found_skipped: 0
            local_lost_found_failed: 0
            striped_dirs_scanned: 0
            striped_dirs_inconsistent: 0
            striped_dirs_failed: 0
            striped_dirs_disabled: 0
            striped_dirs_skipped: 0
            striped_shards_scanned: 0
            striped_shards_inconsistent: 0
            striped_shards_failed: 0
            striped_shards_skipped: 0
            name_hash_inconsistent: 0
            linkea_overflow_inconsistent: 0
            success_count: 3
            run_time_phase1: 362 seconds
            run_time_phase2: 0 seconds
            average_speed_phase1: 85601 items/sec
            average_speed_phase2: 111 objs/sec
            average_speed_total: 85366 items/sec
            real_time_speed_phase1: N/A
            real_time_speed_phase2: N/A
            current_position: N/A
            

            nbp13.lfsck.debug.out2.gz nbp13.lfsck.debug.out1.gz
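
            For completeness, a report like the one above would typically come from a dry-run namespace LFSCK started and queried roughly as follows (the MDT device name is an assumption):

            nbp13-srv1# lctl lfsck_start -M nbp13-MDT0000 -t namespace --dryrun on
            nbp13-srv1# lctl get_param -n mdd.*.lfsck_namespace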


            People

              bzzz Alex Zhuravlev
              mhanafi Mahmoud Hanafi
              Votes: 0
              Watchers: 9
