[LU-14175] OI Scrub triggered followed by LBUG ASSERTION( idx1 == 0 || idx1 == osd->od_index ) failed

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.15.0
    • Affects Version/s: Lustre 2.12.5
    • Labels: None
    • Environment: CentOS 7.6
    • Severity: 2

    Description

      I'm opening this as Sev2 because we have an OST down on Oak. One OST has been failing since this morning (note that Oak was recently upgraded from 2.10 to 2.12.5):

       

      Dec 02 09:13:01 oak-io1-s2 kernel: Lustre: oak-OST000b: Recovery over after 2:42, of 1789 clients 1631 recovered and 158 were evicted.
      Dec 02 09:13:01 oak-io1-s2 kernel: Lustre: Skipped 3 previous similar messages
      Dec 02 09:13:01 oak-io1-s2 kernel: Lustre: oak-OST000b: deleting orphan objects from 0x10400013a0:371764 to 0x10400013a0:371809
      Dec 02 09:13:01 oak-io1-s2 kernel: Lustre: oak-OST000b: deleting orphan objects from 0x1040000bd0:3790954 to 0x1040000bd0:3790977
      Dec 02 09:13:01 oak-io1-s2 kernel: Lustre: oak-OST000b: deleting orphan objects from 0x0:33786809 to 0x0:33786849
      Dec 02 09:13:01 oak-io1-s2 kernel: Lustre: oak-OST000b: deleting orphan objects from 0x1040000400:3170249 to 0x1040000400:3170273
      Dec 02 09:13:02 oak-io1-s2 kernel: Lustre: oak-OST000b: trigger OI scrub by RPC for the [0x1000b0000:0x10c759a:0x0] with flags 0x4a, rc = 0
      
      [root@oak-io1-s2 ~]# lctl get_param -n osd-ldiskfs.oak-OST000b.oi_scrub
      name: OI_scrub
      magic: 0x4c5fd252
      oi_files: 64
      status: scanning
      flags: auto
      param:
      time_since_last_completed: N/A
      time_since_latest_start: 16 seconds
      time_since_last_checkpoint: 16 seconds
      latest_start_position: 12
      last_checkpoint_position: 11
      first_failure_position: N/A
      checked: 1186
      updated: 0
      failed: 0
      prior_updated: 0
      noscrub: 4
      igif: 0
      success_count: 0
      run_time: 16 seconds
      average_speed: 74 objects/sec
      real-time_speed: 74 objects/sec
      current_position: 1263
      scrub_in_prior: no
      scrub_full_speed: yes
      partial_scan: no
      lf_scanned: 0
      lf_repaired: 0
      lf_failed: 0
      [root@oak-io1-s2 ~]# 
      Message from syslogd@oak-io1-s2 at Dec  2 09:13:19 ...
       kernel:LustreError: 291930:0:(osd_compat.c:701:osd_obj_update_entry()) ASSERTION( idx1 == 0 || idx1 == osd->od_index ) failed: invalid given FID [0x1000a0000:0x1d37dd1:0x0], not match the device index 11
      
      Message from syslogd@oak-io1-s2 at Dec  2 09:13:19 ...
       kernel:LustreError: 291930:0:(osd_compat.c:701:osd_obj_update_entry()) LBUG
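
The assertion message can be read directly from the FID it prints. The sketch below assumes the usual IDIF layout as I understand it from Lustre's FID helpers (sequences 0x100000000 through 0x1ffffffff, with the OST index in bits 16 to 31 of the sequence); the function names are illustrative and are not Lustre APIs. Under that assumption the reported FID names OST index 10 while the device index is 11, which is exactly the case the assertion rejects.

```python
import re

# Assumed IDIF sequence range (not taken from this ticket).
FID_SEQ_IDIF = 0x100000000
FID_SEQ_IDIF_MAX = 0x1FFFFFFFF


def parse_fid(text):
    """Parse '[0xSEQ:0xOID:0xVER]' into a (seq, oid, ver) tuple of ints."""
    m = re.search(r"\[(0x[0-9a-fA-F]+):(0x[0-9a-fA-F]+):(0x[0-9a-fA-F]+)\]", text)
    if m is None:
        raise ValueError("no FID found in %r" % text)
    return tuple(int(g, 16) for g in m.groups())


def idif_ost_index(seq):
    """OST index assumed to be stored in bits 16..31 of an IDIF sequence."""
    if not FID_SEQ_IDIF <= seq <= FID_SEQ_IDIF_MAX:
        raise ValueError("sequence %#x is not in the IDIF range" % seq)
    return (seq >> 16) & 0xFFFF


def assertion_holds(fid_index, device_index):
    """Same condition as the log: idx1 == 0 || idx1 == osd->od_index."""
    return fid_index == 0 or fid_index == device_index


msg = "invalid given FID [0x1000a0000:0x1d37dd1:0x0], not match the device index 11"
seq, oid, ver = parse_fid(msg)
device_index = int(re.search(r"device index (\d+)", msg).group(1))
fid_index = idif_ost_index(seq)
print(fid_index, device_index, assertion_holds(fid_index, device_index))
```

By the same decoding, the FID in the "trigger OI scrub by RPC" line, [0x1000b0000:0x10c759a:0x0], carries index 11 and would pass the check on this OST.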
      

      The backtrace from the first occurrence (03:41) is:

      Dec  2 03:41:08 oak-io1-s2 kernel: LustreError: 255421:0:(osd_compat.c:701:osd_obj_update_entry()) ASSERTION( idx1 == 0 || idx1 == osd->od_index ) failed: invalid given FID [0x1000a0000:0x1d37dd1:0x0], not match the device index 11
      Dec  2 03:41:08 oak-io1-s2 kernel: LustreError: 255421:0:(osd_compat.c:701:osd_obj_update_entry()) LBUG
      Dec  2 03:41:08 oak-io1-s2 kernel: Pid: 255421, comm: OI_scrub 3.10.0-957.27.2.el7_lustre.pl2.x86_64 #1 SMP Thu Nov 7 15:26:16 PST 2019
      Dec  2 03:41:08 oak-io1-s2 kernel: Call Trace:
      Dec  2 03:41:08 oak-io1-s2 kernel: [<ffffffffc0b3e7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
      Dec  2 03:41:08 oak-io1-s2 kernel: [<ffffffffc0b3e87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
      Dec  2 03:41:08 oak-io1-s2 kernel: [<ffffffffc1458149>] osd_obj_update_entry+0x969/0x980 [osd_ldiskfs]
      Dec  2 03:41:08 oak-io1-s2 kernel: [<ffffffffc145a8a0>] osd_obj_map_update+0x1a0/0x340 [osd_ldiskfs]
      Dec  2 03:41:08 oak-io1-s2 kernel: [<ffffffffc14471a9>] osd_oi_update+0x69/0x290 [osd_ldiskfs]
      Dec  2 03:41:08 oak-io1-s2 kernel: [<ffffffffc145c71c>] osd_scrub_refresh_mapping+0x27c/0x440 [osd_ldiskfs]
      Dec  2 03:41:08 oak-io1-s2 kernel: [<ffffffffc14611e0>] osd_scrub_check_update+0x280/0x10f0 [osd_ldiskfs]
      Dec  2 03:41:08 oak-io1-s2 kernel: [<ffffffffc14620b5>] osd_scrub_exec+0x65/0x4f0 [osd_ldiskfs]
      Dec  2 03:41:08 oak-io1-s2 kernel: [<ffffffffc14629e8>] osd_inode_iteration+0x4a8/0xcf0 [osd_ldiskfs]
      Dec  2 03:41:08 oak-io1-s2 kernel: [<ffffffffc1463ad9>] osd_scrub_main+0x8a9/0xe40 [osd_ldiskfs]
      Dec  2 03:41:08 oak-io1-s2 kernel: [<ffffffffaa4c2e81>] kthread+0xd1/0xe0
      Dec  2 03:41:08 oak-io1-s2 kernel: [<ffffffffaab77c37>] ret_from_fork_nospec_end+0x0/0x39
      Dec  2 03:41:08 oak-io1-s2 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
      

      We ran fsck on the device and then the issue occurred again:

      Dec  2 09:13:19 oak-io1-s2 kernel: LustreError: 291930:0:(osd_compat.c:701:osd_obj_update_entry()) ASSERTION( idx1 == 0 || idx1 == osd->od_index ) failed: invalid given FID [0x1000a0000:0x1d37dd1:0x0], not match the device index 11
      Dec  2 09:13:19 oak-io1-s2 kernel: LustreError: 291930:0:(osd_compat.c:701:osd_obj_update_entry()) LBUG
      Dec  2 09:13:19 oak-io1-s2 kernel: Pid: 291930, comm: OI_scrub 3.10.0-957.27.2.el7_lustre.pl2.x86_64 #1 SMP Thu Nov 7 15:26:16 PST 2019
      Dec  2 09:13:19 oak-io1-s2 kernel: Call Trace:
      Dec  2 09:13:19 oak-io1-s2 kernel: [<ffffffffc0cbe7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
      Dec  2 09:13:19 oak-io1-s2 kernel: [<ffffffffc0cbe87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
      Dec  2 09:13:19 oak-io1-s2 kernel: [<ffffffffc15c6149>] osd_obj_update_entry+0x969/0x980 [osd_ldiskfs]
      Dec  2 09:13:19 oak-io1-s2 kernel: [<ffffffffc15c88a0>] osd_obj_map_update+0x1a0/0x340 [osd_ldiskfs]
      Dec  2 09:13:19 oak-io1-s2 kernel: [<ffffffffc15b51a9>] osd_oi_update+0x69/0x290 [osd_ldiskfs]
      Dec  2 09:13:19 oak-io1-s2 kernel: [<ffffffffc15ca71c>] osd_scrub_refresh_mapping+0x27c/0x440 [osd_ldiskfs]
      Dec  2 09:13:19 oak-io1-s2 kernel: [<ffffffffc15cf1e0>] osd_scrub_check_update+0x280/0x10f0 [osd_ldiskfs]
      Dec  2 09:13:19 oak-io1-s2 kernel: [<ffffffffc15d00b5>] osd_scrub_exec+0x65/0x4f0 [osd_ldiskfs]
      Dec  2 09:13:19 oak-io1-s2 kernel: [<ffffffffc15d09e8>] osd_inode_iteration+0x4a8/0xcf0 [osd_ldiskfs]
      Dec  2 09:13:19 oak-io1-s2 kernel: [<ffffffffc15d1ad9>] osd_scrub_main+0x8a9/0xe40 [osd_ldiskfs]
      Dec  2 09:13:19 oak-io1-s2 kernel: [<ffffffffbcac2e81>] kthread+0xd1/0xe0
      Dec  2 09:13:19 oak-io1-s2 kernel: [<ffffffffbd177c37>] ret_from_fork_nospec_end+0x0/0x39
      Dec  2 09:13:19 oak-io1-s2 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
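
The two traces differ only in their absolute addresses (the modules were loaded at different bases after the restart), so it can help to compare them by symbol and offset alone. A minimal sketch, using a few frames copied from each trace; the helper name and regular expression are illustrative and not part of any Lustre tool.

```python
import re

# Capture 'symbol+0xOFF/0xSIZE' and ignore the absolute address in '[<...>]'.
FRAME_RE = re.compile(r"\[<[0-9a-f]+>\]\s+(\S+\+0x[0-9a-f]+/0x[0-9a-f]+)")


def frames(trace_text):
    """Return the symbol+offset part of every call-trace line, in order."""
    return FRAME_RE.findall(trace_text)


first = """\
Dec  2 03:41:08 oak-io1-s2 kernel: [<ffffffffc1458149>] osd_obj_update_entry+0x969/0x980 [osd_ldiskfs]
Dec  2 03:41:08 oak-io1-s2 kernel: [<ffffffffc145a8a0>] osd_obj_map_update+0x1a0/0x340 [osd_ldiskfs]
Dec  2 03:41:08 oak-io1-s2 kernel: [<ffffffffc14471a9>] osd_oi_update+0x69/0x290 [osd_ldiskfs]
"""

second = """\
Dec  2 09:13:19 oak-io1-s2 kernel: [<ffffffffc15c6149>] osd_obj_update_entry+0x969/0x980 [osd_ldiskfs]
Dec  2 09:13:19 oak-io1-s2 kernel: [<ffffffffc15c88a0>] osd_obj_map_update+0x1a0/0x340 [osd_ldiskfs]
Dec  2 09:13:19 oak-io1-s2 kernel: [<ffffffffc15b51a9>] osd_oi_update+0x69/0x290 [osd_ldiskfs]
"""

same_path = frames(first) == frames(second)
print(frames(first))
print("same call path:", same_path)
```

Matching frames, together with the identical FID in both assertion messages, suggest the scrub reaches the same inode each time it runs.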
      

      Do you have any idea how to find which file this is? I'm thinking of remounting with noscrub to avoid the LBUG; that will be my next step.
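
One possible starting point for locating the object is to compute where an object with this FID would conventionally be stored. This is only a sketch under assumptions not stated in the ticket: that the FID is an IDIF, that the object ID is the OID combined with the low 16 bits of the sequence, and that ldiskfs OSTs keep such objects under O/0/d<id % 32>/<id>. The function name is hypothetical. Because the inode was found on the device with index 11 while its FID names index 10, the inode may well not sit at this path; a log message that includes the inode number would be more direct.

```python
def idif_object_path(seq, oid, subdirs=32):
    """Conventional O/0/dN/<id> path for an IDIF FID (assumed layout)."""
    if not 0x100000000 <= seq <= 0x1FFFFFFFF:
        raise ValueError("sequence %#x is not in the IDIF range" % seq)
    # Assumed: object ID = low 16 bits of the sequence (high part) plus the OID.
    objid = ((seq & 0xFFFF) << 32) | oid
    return "O/0/d%d/%d" % (objid % subdirs, objid)


# FID from the assertion message: [0x1000a0000:0x1d37dd1:0x0]
path = idif_object_path(0x1000A0000, 0x1D37DD1)
print(path)
```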

      Thanks!
      Stephane

          Activity

            pjones Peter Jones added a comment -

            Seems to have landed for 2.15.

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/43153/
            Subject: LU-14175 osd: print inode number with FID in OI scrub
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 5bab4acf8320b46076c81f32f7954f91dae21bc9

            adilger Andreas Dilger added a comment -

            The patch only improves the error message; it does not actually fix the crash.

            gerrit Gerrit Updater added a comment -

            Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43153
            Subject: LU-14175 osd: print inode number with FID in OI scrub
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 1cbe61483dad10b67d93db3fe2b9b522a7ed50b3

            pjones Peter Jones added a comment -

            Lai

            Can you please assist?

            Thanks

            Peter


            sthiell Stephane Thiell added a comment -

            Mounting oak-OST000b with noscrub seems to avoid the LBUG for now. The status of OI Scrub is 'crashed':

            [root@oak-io1-s2 ~]# lctl get_param osd-ldiskfs.oak-OST000b.oi_scrub
            osd-ldiskfs.oak-OST000b.oi_scrub=
            name: OI_scrub
            magic: 0x4c5fd252
            oi_files: 64
            status: crashed
            flags: auto
            param:
            time_since_last_completed: N/A
            time_since_latest_start: 2154 seconds
            time_since_last_checkpoint: 2154 seconds
            latest_start_position: 12
            last_checkpoint_position: 11
            first_failure_position: N/A
            checked: 0
            updated: 0
            failed: 0
            prior_updated: 0
            noscrub: 0
            igif: 0
            success_count: 0
            run_time: 0 seconds
            average_speed: 0 objects/sec
            real-time_speed: N/A
            current_position: N/A
            lf_scanned: 0
            lf_repaired: 0
            lf_failed: 0

            Perhaps something could be improved here so that OI Scrub does not hit the LBUG. Let me know how I can help; I have 2 crash dumps if needed. Thanks!
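
For anyone who wants to watch this state from a script, the oi_scrub output is simple "key: value" text. The sketch below parses excerpts of the two outputs quoted in this ticket and flags the 'crashed' status; the function names are illustrative, only the status value seen here is handled, and the speed check is an inference from a single sample rather than a documented formula.

```python
def parse_scrub(text):
    """Turn 'key: value' lines from oi_scrub output into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep:  # skip lines without a colon, e.g. the 'param=' header line
            fields[key.strip()] = value.strip()
    return fields


def leading_int(value):
    """Integer at the start of values such as '16 seconds' or '74 objects/sec'."""
    return int(value.split()[0])


def is_crashed(fields):
    """True when the scrub reports the 'crashed' status seen in this ticket."""
    return fields.get("status") == "crashed"


scanning = parse_scrub("""\
status: scanning
checked: 1186
run_time: 16 seconds
average_speed: 74 objects/sec
""")

crashed = parse_scrub("""\
osd-ldiskfs.oak-OST000b.oi_scrub=
status: crashed
checked: 0
run_time: 0 seconds
average_speed: 0 objects/sec
""")

# In the first sample, checked / run_time (rounded down) matches average_speed.
speed = leading_int(scanning["checked"]) // leading_int(scanning["run_time"])
print(speed, leading_int(scanning["average_speed"]))
print(is_crashed(scanning), is_crashed(crashed))
```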

            People

              Assignee: laisiyao Lai Siyao
              Reporter: sthiell Stephane Thiell