[LU-14175] OI Scrub triggered followed by LBUG ASSERTION( idx1 == 0 || idx1 == osd->od_index ) failed Created: 02/Dec/20  Updated: 05/May/22  Resolved: 05/May/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.5
Fix Version/s: Lustre 2.15.0

Type: Bug Priority: Major
Reporter: Stephane Thiell Assignee: Lai Siyao
Resolution: Fixed Votes: 0
Labels: None
Environment:

CentOS 7.6


Issue Links:
Related
is related to LU-14119 FID-in-LMA [fid1] does not match the ... Resolved
Severity: 2
Rank (Obsolete): 9223372036854775807

 Description   

I'm opening this as Sev2 because we have an OST down on Oak. We hit a problem this morning with one OST on Oak (note that Oak was recently upgraded from 2.10 to 2.12.5):

 

Dec 02 09:13:01 oak-io1-s2 kernel: Lustre: oak-OST000b: Recovery over after 2:42, of 1789 clients 1631 recovered and 158 were evicted.
Dec 02 09:13:01 oak-io1-s2 kernel: Lustre: Skipped 3 previous similar messages
Dec 02 09:13:01 oak-io1-s2 kernel: Lustre: oak-OST000b: deleting orphan objects from 0x10400013a0:371764 to 0x10400013a0:371809
Dec 02 09:13:01 oak-io1-s2 kernel: Lustre: oak-OST000b: deleting orphan objects from 0x1040000bd0:3790954 to 0x1040000bd0:3790977
Dec 02 09:13:01 oak-io1-s2 kernel: Lustre: oak-OST000b: deleting orphan objects from 0x0:33786809 to 0x0:33786849
Dec 02 09:13:01 oak-io1-s2 kernel: Lustre: oak-OST000b: deleting orphan objects from 0x1040000400:3170249 to 0x1040000400:3170273
Dec 02 09:13:02 oak-io1-s2 kernel: Lustre: oak-OST000b: trigger OI scrub by RPC for the [0x1000b0000:0x10c759a:0x0] with flags 0x4a, rc = 0
[root@oak-io1-s2 ~]# lctl get_param -n osd-ldiskfs.oak-OST000b.oi_scrub
name: OI_scrub
magic: 0x4c5fd252
oi_files: 64
status: scanning
flags: auto
param:
time_since_last_completed: N/A
time_since_latest_start: 16 seconds
time_since_last_checkpoint: 16 seconds
latest_start_position: 12
last_checkpoint_position: 11
first_failure_position: N/A
checked: 1186
updated: 0
failed: 0
prior_updated: 0
noscrub: 4
igif: 0
success_count: 0
run_time: 16 seconds
average_speed: 74 objects/sec
real-time_speed: 74 objects/sec
current_position: 1263
scrub_in_prior: no
scrub_full_speed: yes
partial_scan: no
lf_scanned: 0
lf_repaired: 0
lf_failed: 0
[root@oak-io1-s2 ~]# 
Message from syslogd@oak-io1-s2 at Dec  2 09:13:19 ...
 kernel:LustreError: 291930:0:(osd_compat.c:701:osd_obj_update_entry()) ASSERTION( idx1 == 0 || idx1 == osd->od_index ) failed: invalid given FID [0x1000a0000:0x1d37dd1:0x0], not match the device index 11

Message from syslogd@oak-io1-s2 at Dec  2 09:13:19 ...
 kernel:LustreError: 291930:0:(osd_compat.c:701:osd_obj_update_entry()) LBUG

The backtrace is:

Dec  2 03:41:08 oak-io1-s2 kernel: LustreError: 255421:0:(osd_compat.c:701:osd_obj_update_entry()) ASSERTION( idx1 == 0 || idx1 == osd->od_index ) failed: invalid given FID [0x1000a0000:0x1d37dd1:0x0], not match the device index 11
Dec  2 03:41:08 oak-io1-s2 kernel: LustreError: 255421:0:(osd_compat.c:701:osd_obj_update_entry()) LBUG
Dec  2 03:41:08 oak-io1-s2 kernel: Pid: 255421, comm: OI_scrub 3.10.0-957.27.2.el7_lustre.pl2.x86_64 #1 SMP Thu Nov 7 15:26:16 PST 2019
Dec  2 03:41:08 oak-io1-s2 kernel: Call Trace:
Dec  2 03:41:08 oak-io1-s2 kernel: [<ffffffffc0b3e7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
Dec  2 03:41:08 oak-io1-s2 kernel: [<ffffffffc0b3e87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
Dec  2 03:41:08 oak-io1-s2 kernel: [<ffffffffc1458149>] osd_obj_update_entry+0x969/0x980 [osd_ldiskfs]
Dec  2 03:41:08 oak-io1-s2 kernel: [<ffffffffc145a8a0>] osd_obj_map_update+0x1a0/0x340 [osd_ldiskfs]
Dec  2 03:41:08 oak-io1-s2 kernel: [<ffffffffc14471a9>] osd_oi_update+0x69/0x290 [osd_ldiskfs]
Dec  2 03:41:08 oak-io1-s2 kernel: [<ffffffffc145c71c>] osd_scrub_refresh_mapping+0x27c/0x440 [osd_ldiskfs]
Dec  2 03:41:08 oak-io1-s2 kernel: [<ffffffffc14611e0>] osd_scrub_check_update+0x280/0x10f0 [osd_ldiskfs]
Dec  2 03:41:08 oak-io1-s2 kernel: [<ffffffffc14620b5>] osd_scrub_exec+0x65/0x4f0 [osd_ldiskfs]
Dec  2 03:41:08 oak-io1-s2 kernel: [<ffffffffc14629e8>] osd_inode_iteration+0x4a8/0xcf0 [osd_ldiskfs]
Dec  2 03:41:08 oak-io1-s2 kernel: [<ffffffffc1463ad9>] osd_scrub_main+0x8a9/0xe40 [osd_ldiskfs]
Dec  2 03:41:08 oak-io1-s2 kernel: [<ffffffffaa4c2e81>] kthread+0xd1/0xe0
Dec  2 03:41:08 oak-io1-s2 kernel: [<ffffffffaab77c37>] ret_from_fork_nospec_end+0x0/0x39
Dec  2 03:41:08 oak-io1-s2 kernel: [<ffffffffffffffff>] 0xffffffffffffffff

We ran fsck on the device and then the issue occurred again:

Dec  2 09:13:19 oak-io1-s2 kernel: LustreError: 291930:0:(osd_compat.c:701:osd_obj_update_entry()) ASSERTION( idx1 == 0 || idx1 == osd->od_index ) failed: invalid given FID [0x1000a0000:0x1d37dd1:0x0], not match the device index 11
Dec  2 09:13:19 oak-io1-s2 kernel: LustreError: 291930:0:(osd_compat.c:701:osd_obj_update_entry()) LBUG
Dec  2 09:13:19 oak-io1-s2 kernel: Pid: 291930, comm: OI_scrub 3.10.0-957.27.2.el7_lustre.pl2.x86_64 #1 SMP Thu Nov 7 15:26:16 PST 2019
Dec  2 09:13:19 oak-io1-s2 kernel: Call Trace:
Dec  2 09:13:19 oak-io1-s2 kernel: [<ffffffffc0cbe7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
Dec  2 09:13:19 oak-io1-s2 kernel: [<ffffffffc0cbe87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
Dec  2 09:13:19 oak-io1-s2 kernel: [<ffffffffc15c6149>] osd_obj_update_entry+0x969/0x980 [osd_ldiskfs]
Dec  2 09:13:19 oak-io1-s2 kernel: [<ffffffffc15c88a0>] osd_obj_map_update+0x1a0/0x340 [osd_ldiskfs]
Dec  2 09:13:19 oak-io1-s2 kernel: [<ffffffffc15b51a9>] osd_oi_update+0x69/0x290 [osd_ldiskfs]
Dec  2 09:13:19 oak-io1-s2 kernel: [<ffffffffc15ca71c>] osd_scrub_refresh_mapping+0x27c/0x440 [osd_ldiskfs]
Dec  2 09:13:19 oak-io1-s2 kernel: [<ffffffffc15cf1e0>] osd_scrub_check_update+0x280/0x10f0 [osd_ldiskfs]
Dec  2 09:13:19 oak-io1-s2 kernel: [<ffffffffc15d00b5>] osd_scrub_exec+0x65/0x4f0 [osd_ldiskfs]
Dec  2 09:13:19 oak-io1-s2 kernel: [<ffffffffc15d09e8>] osd_inode_iteration+0x4a8/0xcf0 [osd_ldiskfs]
Dec  2 09:13:19 oak-io1-s2 kernel: [<ffffffffc15d1ad9>] osd_scrub_main+0x8a9/0xe40 [osd_ldiskfs]
Dec  2 09:13:19 oak-io1-s2 kernel: [<ffffffffbcac2e81>] kthread+0xd1/0xe0
Dec  2 09:13:19 oak-io1-s2 kernel: [<ffffffffbd177c37>] ret_from_fork_nospec_end+0x0/0x39
Dec  2 09:13:19 oak-io1-s2 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
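
For context on why the assertion fires: the sequence 0x1000a0000 in the reported FID falls in the IDIF range, where the OST index is encoded in bits 16..31 of the sequence, so this FID claims to belong to OST index 10 (OST000a) while oak-OST000b is device index 11. The short userspace sketch below is only an illustration mirroring Lustre's fid_idif_ost_idx() helper, not code taken from osd_compat.c; it shows the decoding and the condition the LASSERT checks:

#include <stdio.h>
#include <stdint.h>

/* IDIF sequence range, as defined in the Lustre FID headers. */
#define FID_SEQ_IDIF      0x100000000ULL
#define FID_SEQ_IDIF_MAX  0x1ffffffffULL

/* OST index embedded in bits 16..31 of an IDIF sequence. */
static uint32_t fid_idif_ost_idx(uint64_t seq)
{
        return (seq >> 16) & 0xffff;
}

static int fid_is_idif(uint64_t seq)
{
        return seq >= FID_SEQ_IDIF && seq <= FID_SEQ_IDIF_MAX;
}

int main(void)
{
        uint64_t seq = 0x1000a0000ULL;  /* from FID [0x1000a0000:0x1d37dd1:0x0] */
        uint32_t od_index = 11;         /* oak-OST000b */

        if (fid_is_idif(seq)) {
                uint32_t idx = fid_idif_ost_idx(seq);

                printf("FID seq %#llx encodes OST index %u, device index is %u\n",
                       (unsigned long long)seq, idx, od_index);
                if (idx != 0 && idx != od_index)
                        printf("-> this is the condition the LASSERT trips on\n");
        }
        return 0;
}

In other words, OI scrub found an object on OST000b whose FID encodes OST index 10, which is consistent with the FID-in-LMA mismatch tracked in the related LU-14119.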

Do you have an idea of how to find which file this is? I'm thinking of remounting with noscrub to avoid the LBUG; that will be my next step.

Thanks!
Stephane



 Comments   
Comment by Stephane Thiell [ 02/Dec/20 ]

Mounting oak-OST000b with noscrub seems to avoid the LBUG for now. The status of OI Scrub is 'crashed':

[root@oak-io1-s2 ~]# lctl get_param osd-ldiskfs.oak-OST000b.oi_scrub
osd-ldiskfs.oak-OST000b.oi_scrub=
name: OI_scrub
magic: 0x4c5fd252
oi_files: 64
status: crashed
flags: auto
param:
time_since_last_completed: N/A
time_since_latest_start: 2154 seconds
time_since_last_checkpoint: 2154 seconds
latest_start_position: 12
last_checkpoint_position: 11
first_failure_position: N/A
checked: 0
updated: 0
failed: 0
prior_updated: 0
noscrub: 0
igif: 0
success_count: 0
run_time: 0 seconds
average_speed: 0 objects/sec
real-time_speed: N/A
current_position: N/A
lf_scanned: 0
lf_repaired: 0
lf_failed: 0

Perhaps there is something to improve here so that OI Scrub avoids the LBUG. Let me know how I can help; I have two crash dumps available if needed. Thanks!
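
One hypothetical direction (a userspace sketch of the idea only, not Lustre code and not the fix that eventually landed) would be to count an OST-index mismatch as a per-object scrub failure, much like the checked/failed counters shown above, instead of asserting and killing the OI_scrub thread:

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* OST index embedded in bits 16..31 of an IDIF sequence. */
static uint32_t idif_ost_idx(uint64_t seq)
{
        return (seq >> 16) & 0xffff;
}

int main(void)
{
        /* A few example IDIF sequences; 0x1000a0000 is the bad one from the logs. */
        const uint64_t seqs[] = { 0x1000b0000ULL, 0x1000a0000ULL, 0x1000b0000ULL };
        const uint32_t od_index = 11;   /* oak-OST000b */
        unsigned int checked = 0, failed = 0;

        for (size_t i = 0; i < sizeof(seqs) / sizeof(seqs[0]); i++) {
                uint32_t idx = idif_ost_idx(seqs[i]);

                checked++;
                if (idx != 0 && idx != od_index) {
                        /* Today this condition LBUGs; recording it and moving on
                         * would keep the scrub thread alive. */
                        fprintf(stderr, "seq %#llx: index %u != device %u, skipping\n",
                                (unsigned long long)seqs[i], idx, od_index);
                        failed++;
                        continue;
                }
                /* ... refresh the OI mapping for this object ... */
        }
        printf("checked: %u, failed: %u\n", checked, failed);
        return 0;
}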

Comment by Peter Jones [ 03/Dec/20 ]

Lai

Can you please assist?

Thanks

Peter

Comment by Gerrit Updater [ 27/Mar/21 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43153
Subject: LU-14175 osd: print inode number with FID in OI scrub
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1cbe61483dad10b67d93db3fe2b9b522a7ed50b3

Comment by Andreas Dilger [ 27/Mar/21 ]

The patch only improves the error message; it does not actually fix the crash.

Comment by Gerrit Updater [ 15/Apr/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/43153/
Subject: LU-14175 osd: print inode number with FID in OI scrub
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 5bab4acf8320b46076c81f32f7954f91dae21bc9

Comment by Peter Jones [ 05/May/22 ]

This seems to have landed for 2.15.
