[LU-2868] sanity-scrub test 1c lod_dev.c:69:lod_fld_lookup()) ASSERTION( fid_is_sane(fid) ) failed: Invalid FID [0x0:0x932ec000:0xffff8800]

Details

    • Story
    • Resolution: Fixed
    • Major
    • Lustre 2.4.0
    • Lustre 2.4.0
    • None
    • 6934

    Description

      I just tried sanity-scrub on my system as a prerequisite for running it in reviews and it crashed in test 1c on the first run:

      [ 2204.888058] Lustre: lustre-MDT0000: used disk, loading
      [ 2204.996060] Lustre: lustre-OST0001: deleting orphan objects from 0x0:38 to 64
      [ 2205.528062] LustreError: 8125:0:(lod_dev.c:69:lod_fld_lookup()) ASSERTION( fid_is_sane(fid) ) failed: Invalid FID [0x0:0x932ec000:0xffff8800]
      [ 2205.528581] LustreError: 8125:0:(lod_dev.c:69:lod_fld_lookup()) LBUG
      [ 2205.528831] Pid: 8125, comm: lfsck
      [ 2205.529035]
      [ 2205.529036] Call Trace:
      [ 2205.529415]  [<ffffffffa0837915>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      [ 2205.529679]  [<ffffffffa0837f17>] lbug_with_loc+0x47/0xb0 [libcfs]
      [ 2205.529930]  [<ffffffffa0633ac5>] lod_fld_lookup+0x295/0x410 [lod]
      [ 2205.530297]  [<ffffffffa0636276>] lod_object_alloc+0x1f6/0x4a0 [lod]
      [ 2205.530556]  [<ffffffffa0b7a132>] mdd_object_init+0xf2/0x1f0 [mdd]
      [ 2205.530830]  [<ffffffffa055d02d>] lu_object_alloc+0xcd/0x300 [obdclass]
      [ 2205.531102]  [<ffffffffa055d3a9>] ? htable_lookup+0x119/0x1c0 [obdclass]
      [ 2205.531371]  [<ffffffffa055db95>] lu_object_find_at+0x205/0x360 [obdclass]
      [ 2205.531629]  [<ffffffff81054aaa>] ? enqueue_task_fair+0x14a/0x4e0
      [ 2205.531873]  [<ffffffff81096d6a>] ? sched_clock_cpu+0x6a/0x110
      [ 2205.532129]  [<ffffffffa055dd2f>] lu_object_find_slice+0x1f/0x80 [obdclass]
      [ 2205.532390]  [<ffffffffa0b77370>] mdd_object_find+0x10/0x70 [mdd]
      [ 2205.532641]  [<ffffffffa0b9e07b>] mdd_lfsck_oit_engine+0x21b/0x1160 [mdd]
      [ 2205.532896]  [<ffffffff8100bc0e>] ? apic_timer_interrupt+0xe/0x20
      [ 2205.533143]  [<ffffffff814faf1b>] ? _spin_unlock_irqrestore+0x1b/0x30
      [ 2205.533402]  [<ffffffff81051f73>] ? __wake_up+0x53/0x70
      [ 2205.533637]  [<ffffffffa0ba2180>] ? mdd_lfsck_main+0x0/0x1020 [mdd]
      [ 2205.535020]  [<ffffffffa0ba27ac>] mdd_lfsck_main+0x62c/0x1020 [mdd]
      [ 2205.535306]  [<ffffffffa0ba2180>] ? mdd_lfsck_main+0x0/0x1020 [mdd]
      [ 2205.535585]  [<ffffffff8100c14a>] child_rip+0xa/0x20
      [ 2205.535846]  [<ffffffffa0ba2180>] ? mdd_lfsck_main+0x0/0x1020 [mdd]
      [ 2205.536178]  [<ffffffffa0ba2180>] ? mdd_lfsck_main+0x0/0x1020 [mdd]
      [ 2205.536455]  [<ffffffff8100c140>] ? child_rip+0x0/0x20
      [ 2205.536700]
      [ 2205.550028] Kernel panic - not syncing: LBUG
      

        Activity

          pjones Peter Jones added a comment - Landed for 2.4
          yong.fan nasf (Inactive) added a comment - This is the patch: http://review.whamcloud.com/#change,5622

          yong.fan nasf (Inactive) added a comment - There is one corner case that may cause otable-based iteration to access an uninitialized area: when there are no objects to be scanned right at the beginning (especially when resuming from the latest checkpoint), otable-based iteration may return an invalid FID from the lower layer via dt_it_ops::rec().
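The corner case above can be illustrated outside the kernel. The following is a minimal userspace sketch, not the code from the patch: `struct toy_fid`, `struct toy_it`, and the `toy_*` functions are hypothetical stand-ins for `struct lu_fid` and the `dt_it_ops` load/next/rec callbacks. It assumes one possible guard: the iterator tracks whether it is positioned on a real entry, `rec()` refuses to hand out a record otherwise, and the scan loop never calls `rec()` when loading finds nothing to scan (for example, when the resume position is already at the end of the table).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for struct lu_fid: sequence, object ID, version. */
struct toy_fid {
	uint64_t seq;
	uint32_t oid;
	uint32_t ver;
};

/* Hypothetical iterator over an in-memory table of FIDs. */
struct toy_it {
	const struct toy_fid *table;
	size_t count;	/* number of entries in the table */
	size_t pos;	/* index of the current entry */
	bool valid;	/* true only while pos refers to a real entry */
};

/* Position the iterator at 'start' (e.g. a checkpoint position).
 * Returns 0 when positioned on an entry, 1 when there is nothing to scan. */
static int toy_it_load(struct toy_it *it, const struct toy_fid *table,
		       size_t count, size_t start)
{
	assert(it != NULL);	/* internal invariant, not external data */
	it->table = table;
	it->count = count;
	it->pos = start;
	it->valid = (table != NULL && start < count);
	return it->valid ? 0 : 1;
}

/* Advance to the next entry. Returns 0 on success, 1 at the end. */
static int toy_it_next(struct toy_it *it)
{
	assert(it != NULL);
	if (!it->valid)
		return 1;
	it->pos++;
	it->valid = (it->pos < it->count);
	return it->valid ? 0 : 1;
}

/* Copy out the current record. If the iterator is not positioned on an
 * entry, zero the output and return -1 instead of exposing stale memory. */
static int toy_it_rec(const struct toy_it *it, struct toy_fid *out)
{
	assert(it != NULL && out != NULL);
	if (!it->valid) {
		memset(out, 0, sizeof(*out));
		return -1;
	}
	*out = it->table[it->pos];
	return 0;
}

/* Scan loop: returns how many entries were visited from 'start'. */
static size_t toy_scan(const struct toy_fid *table, size_t count, size_t start)
{
	struct toy_it it;
	struct toy_fid fid;
	size_t visited = 0;

	if (toy_it_load(&it, table, count, start) != 0)
		return 0;	/* nothing to scan: never call rec() */

	do {
		if (toy_it_rec(&it, &fid) != 0)
			break;
		visited++;
	} while (toy_it_next(&it) == 0);

	return visited;
}
```

With this shape, resuming at a position past the last entry makes `toy_scan()` return 0 without ever reading a record, so no garbage FID can reach the object lookup path.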

          adilger Andreas Dilger added a comment - There appear to be quite a few places in the code checking LASSERT(fid_is_sane()) which are probably operating on a FID from the disk or network. I fixed a few similar problems in http://review.whamcloud.com/5456, but more need to be fixed.
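The distinction drawn above, asserting on internal invariants versus validating data that arrives from disk or the network, can be sketched as follows. This is a userspace illustration under stated assumptions, not Lustre code: `struct demo_fid`, `demo_fid_is_sane()`, and `demo_lookup_untrusted()` are hypothetical, the sanity rule is deliberately simplified (the real `fid_is_sane()` accepts more cases), and the "lookup" is a placeholder computation.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for struct lu_fid. */
struct demo_fid {
	uint64_t seq;
	uint32_t oid;
	uint32_t ver;
};

/* Simplified sanity rule, for illustration only: non-zero sequence,
 * non-zero object ID, zero version. */
static bool demo_fid_is_sane(const struct demo_fid *fid)
{
	return fid != NULL && fid->seq != 0 && fid->oid != 0 && fid->ver == 0;
}

/* Lookup on a FID that may come from disk or the network.
 * A bad FID is reported as -EINVAL rather than asserted on, so corrupt
 * or uninitialized input cannot bring the whole node down. */
static int demo_lookup_untrusted(const struct demo_fid *fid,
				 uint32_t *index_out)
{
	/* A NULL output pointer is a caller bug, so an assertion fits here. */
	assert(index_out != NULL);

	if (!demo_fid_is_sane(fid))
		return -EINVAL;

	/* Placeholder for the real lookup: map the sequence to an index. */
	*index_out = (uint32_t)(fid->seq % 4);
	return 0;
}
```

Passing the FID from this ticket, `[0x0:0x932ec000:0xffff8800]`, to `demo_lookup_untrusted()` yields `-EINVAL` and leaves the output untouched, where an assertion at the same point would have stopped the process.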

          yong.fan nasf (Inactive) added a comment - I will investigate this bug.

          People

            Assignee: yong.fan nasf (Inactive)
            Reporter: green Oleg Drokin