[LU-14175] OI Scrub triggered followed by LBUG ASSERTION( idx1 == 0 || idx1 == osd->od_index ) failed Created: 02/Dec/20 Updated: 05/May/22 Resolved: 05/May/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.5 |
| Fix Version/s: | Lustre 2.15.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Stephane Thiell | Assignee: | Lai Siyao |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
CentOS 7.6 |
||
| Severity: | 2 | ||||||||
| Description |
|
I'm opening this with Sev2 as we have an OST down on Oak. We have had a problem this morning with one OST on Oak (note that Oak was recently upgraded from 2.10 to 2.12.5):

Dec 02 09:13:01 oak-io1-s2 kernel: Lustre: oak-OST000b: Recovery over after 2:42, of 1789 clients 1631 recovered and 158 were evicted.
Dec 02 09:13:01 oak-io1-s2 kernel: Lustre: Skipped 3 previous similar messages
Dec 02 09:13:01 oak-io1-s2 kernel: Lustre: oak-OST000b: deleting orphan objects from 0x10400013a0:371764 to 0x10400013a0:371809
Dec 02 09:13:01 oak-io1-s2 kernel: Lustre: oak-OST000b: deleting orphan objects from 0x1040000bd0:3790954 to 0x1040000bd0:3790977
Dec 02 09:13:01 oak-io1-s2 kernel: Lustre: oak-OST000b: deleting orphan objects from 0x0:33786809 to 0x0:33786849
Dec 02 09:13:01 oak-io1-s2 kernel: Lustre: oak-OST000b: deleting orphan objects from 0x1040000400:3170249 to 0x1040000400:3170273
Dec 02 09:13:02 oak-io1-s2 kernel: Lustre: oak-OST000b: trigger OI scrub by RPC for the [0x1000b0000:0x10c759a:0x0] with flags 0x4a, rc = 0

[root@oak-io1-s2 ~]# lctl get_param -n osd-ldiskfs.oak-OST000b.oi_scrub
name: OI_scrub
magic: 0x4c5fd252
oi_files: 64
status: scanning
flags: auto
param:
time_since_last_completed: N/A
time_since_latest_start: 16 seconds
time_since_last_checkpoint: 16 seconds
latest_start_position: 12
last_checkpoint_position: 11
first_failure_position: N/A
checked: 1186
updated: 0
failed: 0
prior_updated: 0
noscrub: 4
igif: 0
success_count: 0
run_time: 16 seconds
average_speed: 74 objects/sec
real-time_speed: 74 objects/sec
current_position: 1263
scrub_in_prior: no
scrub_full_speed: yes
partial_scan: no
lf_scanned: 0
lf_repaired: 0
lf_failed: 0
[root@oak-io1-s2 ~]#
Message from syslogd@oak-io1-s2 at Dec 2 09:13:19 ...
kernel:LustreError: 291930:0:(osd_compat.c:701:osd_obj_update_entry()) ASSERTION( idx1 == 0 || idx1 == osd->od_index ) failed: invalid given FID [0x1000a0000:0x1d37dd1:0x0], not match the device index 11
Message from syslogd@oak-io1-s2 at Dec 2 09:13:19 ...
kernel:LustreError: 291930:0:(osd_compat.c:701:osd_obj_update_entry()) LBUG

The backtrace is:

Dec 2 03:41:08 oak-io1-s2 kernel: LustreError: 255421:0:(osd_compat.c:701:osd_obj_update_entry()) ASSERTION( idx1 == 0 || idx1 == osd->od_index ) failed: invalid given FID [0x1000a0000:0x1d37dd1:0x0], not match the device index 11
Dec 2 03:41:08 oak-io1-s2 kernel: LustreError: 255421:0:(osd_compat.c:701:osd_obj_update_entry()) LBUG
Dec 2 03:41:08 oak-io1-s2 kernel: Pid: 255421, comm: OI_scrub 3.10.0-957.27.2.el7_lustre.pl2.x86_64 #1 SMP Thu Nov 7 15:26:16 PST 2019
Dec 2 03:41:08 oak-io1-s2 kernel: Call Trace:
Dec 2 03:41:08 oak-io1-s2 kernel: [<ffffffffc0b3e7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
Dec 2 03:41:08 oak-io1-s2 kernel: [<ffffffffc0b3e87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
Dec 2 03:41:08 oak-io1-s2 kernel: [<ffffffffc1458149>] osd_obj_update_entry+0x969/0x980 [osd_ldiskfs]
Dec 2 03:41:08 oak-io1-s2 kernel: [<ffffffffc145a8a0>] osd_obj_map_update+0x1a0/0x340 [osd_ldiskfs]
Dec 2 03:41:08 oak-io1-s2 kernel: [<ffffffffc14471a9>] osd_oi_update+0x69/0x290 [osd_ldiskfs]
Dec 2 03:41:08 oak-io1-s2 kernel: [<ffffffffc145c71c>] osd_scrub_refresh_mapping+0x27c/0x440 [osd_ldiskfs]
Dec 2 03:41:08 oak-io1-s2 kernel: [<ffffffffc14611e0>] osd_scrub_check_update+0x280/0x10f0 [osd_ldiskfs]
Dec 2 03:41:08 oak-io1-s2 kernel: [<ffffffffc14620b5>] osd_scrub_exec+0x65/0x4f0 [osd_ldiskfs]
Dec 2 03:41:08 oak-io1-s2 kernel: [<ffffffffc14629e8>] osd_inode_iteration+0x4a8/0xcf0 [osd_ldiskfs]
Dec 2 03:41:08 oak-io1-s2 kernel: [<ffffffffc1463ad9>] osd_scrub_main+0x8a9/0xe40 [osd_ldiskfs]
Dec 2 03:41:08 oak-io1-s2 kernel: [<ffffffffaa4c2e81>] kthread+0xd1/0xe0
Dec 2 03:41:08 oak-io1-s2 kernel: [<ffffffffaab77c37>] ret_from_fork_nospec_end+0x0/0x39
Dec 2 03:41:08 oak-io1-s2 kernel: [<ffffffffffffffff>] 0xffffffffffffffff

We ran fsck on the device and then the issue occurred again:

Dec 2 09:13:19 oak-io1-s2 kernel: LustreError: 291930:0:(osd_compat.c:701:osd_obj_update_entry()) ASSERTION( idx1 == 0 || idx1 == osd->od_index ) failed: invalid given FID [0x1000a0000:0x1d37dd1:0x0], not match the device index 11
Dec 2 09:13:19 oak-io1-s2 kernel: LustreError: 291930:0:(osd_compat.c:701:osd_obj_update_entry()) LBUG
Dec 2 09:13:19 oak-io1-s2 kernel: Pid: 291930, comm: OI_scrub 3.10.0-957.27.2.el7_lustre.pl2.x86_64 #1 SMP Thu Nov 7 15:26:16 PST 2019
Dec 2 09:13:19 oak-io1-s2 kernel: Call Trace:
Dec 2 09:13:19 oak-io1-s2 kernel: [<ffffffffc0cbe7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
Dec 2 09:13:19 oak-io1-s2 kernel: [<ffffffffc0cbe87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
Dec 2 09:13:19 oak-io1-s2 kernel: [<ffffffffc15c6149>] osd_obj_update_entry+0x969/0x980 [osd_ldiskfs]
Dec 2 09:13:19 oak-io1-s2 kernel: [<ffffffffc15c88a0>] osd_obj_map_update+0x1a0/0x340 [osd_ldiskfs]
Dec 2 09:13:19 oak-io1-s2 kernel: [<ffffffffc15b51a9>] osd_oi_update+0x69/0x290 [osd_ldiskfs]
Dec 2 09:13:19 oak-io1-s2 kernel: [<ffffffffc15ca71c>] osd_scrub_refresh_mapping+0x27c/0x440 [osd_ldiskfs]
Dec 2 09:13:19 oak-io1-s2 kernel: [<ffffffffc15cf1e0>] osd_scrub_check_update+0x280/0x10f0 [osd_ldiskfs]
Dec 2 09:13:19 oak-io1-s2 kernel: [<ffffffffc15d00b5>] osd_scrub_exec+0x65/0x4f0 [osd_ldiskfs]
Dec 2 09:13:19 oak-io1-s2 kernel: [<ffffffffc15d09e8>] osd_inode_iteration+0x4a8/0xcf0 [osd_ldiskfs]
Dec 2 09:13:19 oak-io1-s2 kernel: [<ffffffffc15d1ad9>] osd_scrub_main+0x8a9/0xe40 [osd_ldiskfs]
Dec 2 09:13:19 oak-io1-s2 kernel: [<ffffffffbcac2e81>] kthread+0xd1/0xe0
Dec 2 09:13:19 oak-io1-s2 kernel: [<ffffffffbd177c37>] ret_from_fork_nospec_end+0x0/0x39
Dec 2 09:13:19 oak-io1-s2 kernel: [<ffffffffffffffff>] 0xffffffffffffffff

Do you have any idea how to find which file it is? I'm thinking of remounting with noscrub to avoid the LBUG; that will be my next step. Thanks! |
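A note on reading the assertion: the FID in the message, [0x1000a0000:0x1d37dd1:0x0], appears to be an IDIF-style OST object FID, where the OST index is encoded in the sequence. The sketch below decodes that index in Python. It assumes the IDIF layout from the Lustre headers (sequence range starting at 0x100000000, OST index in bits 16-31 of the sequence); the constant values and all function names here are illustrative and not taken from this ticket. Under that assumption the failing FID carries index 10, while the device (oak-OST000b) is index 11, which is the mismatch the assertion reports.

```python
import re

# Assumed IDIF sequence range (not stated in this ticket).
FID_SEQ_IDIF = 0x100000000
FID_SEQ_IDIF_MAX = 0x1FFFFFFFF

_FID_RE = re.compile(r"\[?(0x[0-9a-fA-F]+):(0x[0-9a-fA-F]+):(0x[0-9a-fA-F]+)\]?")


def parse_fid(text):
    """Parse '[0xSEQ:0xOID:0xVER]' into a (seq, oid, ver) tuple of ints."""
    match = _FID_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a FID string: {text!r}")
    return tuple(int(group, 16) for group in match.groups())


def idif_ost_index(seq):
    """Return the OST index held in bits 16-31 of an IDIF sequence."""
    if not FID_SEQ_IDIF <= seq <= FID_SEQ_IDIF_MAX:
        raise ValueError(f"sequence {seq:#x} is outside the assumed IDIF range")
    return (seq >> 16) & 0xFFFF


def fid_matches_device(fid_text, device_index):
    """Mirror the asserted condition: idx1 == 0 or idx1 == device index."""
    seq, _oid, _ver = parse_fid(fid_text)
    idx1 = idif_ost_index(seq)
    return idx1 == 0 or idx1 == device_index


if __name__ == "__main__":
    # FIDs copied from the console messages in this ticket.
    for fid in ("[0x1000a0000:0x1d37dd1:0x0]", "[0x1000b0000:0x10c759a:0x0]"):
        index = idif_ost_index(parse_fid(fid)[0])
        print(fid, "-> OST index", index, "| matches device 11:",
              fid_matches_device(fid, 11))
```

This only interprets the numbers in the log message; it does not identify the file that owns the object.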
| Comments |
| Comment by Stephane Thiell [ 02/Dec/20 ] |
|
Mounting oak-OST000b with noscrub seems to avoid the LBUG for now. The status of OI Scrub is 'crashed':

[root@oak-io1-s2 ~]# lctl get_param osd-ldiskfs.oak-OST000b.oi_scrub
osd-ldiskfs.oak-OST000b.oi_scrub=
name: OI_scrub
magic: 0x4c5fd252
oi_files: 64
status: crashed
flags: auto
param:
time_since_last_completed: N/A
time_since_latest_start: 2154 seconds
time_since_last_checkpoint: 2154 seconds
latest_start_position: 12
last_checkpoint_position: 11
first_failure_position: N/A
checked: 0
updated: 0
failed: 0
prior_updated: 0
noscrub: 0
igif: 0
success_count: 0
run_time: 0 seconds
average_speed: 0 objects/sec
real-time_speed: N/A
current_position: N/A
lf_scanned: 0
lf_repaired: 0
lf_failed: 0

Perhaps there is something to improve here to avoid the LBUG on OI Scrub. Let me know how I can help; I do have 2 crash dumps if needed. Thanks! |
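For monitoring the scrub state reported above, a minimal Python sketch follows. It assumes the oi_scrub parameter is printed one "key: value" pair per line, as lctl normally shows it; the function names are hypothetical, and the sample text is a short excerpt of the output in this comment rather than a complete dump.

```python
def parse_oi_scrub(text):
    """Turn 'key: value' lines from an oi_scrub dump into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            # Skip lines without a colon, e.g. the 'param.name=' header line.
            continue
        fields[key.strip()] = value.strip()
    return fields


def scrub_crashed(fields):
    """Return True when the dump reports status 'crashed'."""
    return fields.get("status") == "crashed"


# Excerpt of the output shown in this comment.
SAMPLE = """\
osd-ldiskfs.oak-OST000b.oi_scrub=
name: OI_scrub
status: crashed
flags: auto
latest_start_position: 12
last_checkpoint_position: 11
checked: 0
"""

if __name__ == "__main__":
    info = parse_oi_scrub(SAMPLE)
    print("status:", info["status"], "| crashed:", scrub_crashed(info))
```

Values are kept as strings because several fields hold non-numeric text such as "N/A" or "16 seconds".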
| Comment by Peter Jones [ 03/Dec/20 ] |
|
Lai, can you please assist? Thanks, Peter |
| Comment by Gerrit Updater [ 27/Mar/21 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43153 |
| Comment by Andreas Dilger [ 27/Mar/21 ] |
|
Patch is only improving the error message, not actually fixing the crash. |
| Comment by Gerrit Updater [ 15/Apr/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/43153/ |
| Comment by Peter Jones [ 05/May/22 ] |
|
Seems to be landed for 2.15 |