Details
Type: Bug
Resolution: Not a Bug
Priority: Major
Fix Version/s: None
Affects Version/s: Lustre 2.4.2
Labels: None
Environment: RHEL6
Severity: 3
Rank: 14692
Description
Context:
OI scrub was triggered after failover of the MDT onto the failover MDS (related to LU-4554):
---8<---
LustreError: 0-0: ptmp2-MDT0000: trigger OI scrub by RPC for [0x22cb1aa25:0xfabf:0x0], rc = 0 [1]
LustreError: 0-0: spool2-MDT0000: trigger OI scrub by RPC for [0x20cf1887f:0x92c:0x0], rc = 0 [1]
---8<---
Issue:
Lustre clients hung while trying to read from or write to the filesystem: the server returned EINPROGRESS for each request until the OI scrub completed.
However, the following commands were still working: ls, cd, df.
Because of the number of inodes on the MDT, the OI scrub took about 3 hours to complete, blocking production for that whole period.
OI_Scrub status once completed:
---8<---
- cat /proc/fs/lustre/osd-ldiskfs/ptmp2-MDT0000/oi_scrub
name: OI_scrub
magic: 0x4c5fd252
oi_files: 1
status: completed
flags:
param:
time_since_last_completed: 382 seconds
time_since_latest_start: 11068 seconds
time_since_last_checkpoint: 382 seconds
latest_start_position: 12
last_checkpoint_position: 499122177
first_failure_position: N/A
checked: 190095126
updated: 2
failed: 0
prior_updated: 0
noscrub: 1965
igif: 239
success_count: 3
run_time: 10685 seconds
average_speed: 17790 objects/sec
real-time_speed: N/A
current_position: N/A
---8<---
run_time/3600 = 10685/3600 ~= 2.97 hours.
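For reference, these figures can be recomputed directly from the oi_scrub file quoted above; a minimal sketch (path and field names are exactly those shown in the output):
---8<---
# Derive run time in hours and average speed from the reported counters:
# 10685 s / 3600 ≈ 2.97 h and 190095126 objects / 10685 s ≈ 17790 objects/sec.
awk '/^run_time:/ {t=$2} /^checked:/ {c=$2} END {printf "%.2f hours, %.0f objects/sec\n", t/3600, c/t}' \
    /proc/fs/lustre/osd-ldiskfs/ptmp2-MDT0000/oi_scrub
---8<---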
As a workaround, auto_scrub has been disabled (echo 0 > /proc/fs/lustre/osd-ldiskfs/ptmp2-MDT0000/auto_scrub).
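For completeness, the equivalent lctl commands should look roughly like the following (a sketch, assuming the osd-ldiskfs.*.auto_scrub tunable mirrors the proc file above; the setting does not persist across an MDT remount):
---8<---
# Disable automatic OI scrub on this MDT (same tunable as the proc file above):
lctl set_param osd-ldiskfs.ptmp2-MDT0000.auto_scrub=0
# Verify the current value and the scrub status:
lctl get_param osd-ldiskfs.ptmp2-MDT0000.auto_scrub
lctl get_param osd-ldiskfs.ptmp2-MDT0000.oi_scrub
---8<---
If the setting has to survive a remount, mounting the MDT with the noscrub option should have the same effect, though we have not verified that here.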
We have since upgraded to Lustre 2.4.3 with the patch from LU-4554. The customer would like to enable the auto_scrub feature in order to get a consistent OI table, but cannot accept such an impact on the production systems.
According to the "OI Scrub and inode Iterator Solution Architecture", clients should be able to access the MDT while OI scrub is running: apart from FID-to-path resolution and looking up the parent of a non-directory child, all other operations are expected to behave normally.
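Given that, one way to get a consistent OI table without letting an automatic scrub collide with peak production might be to trigger the scrub manually in a quiet window and cap its speed. A rough sketch (lctl lfsck_start/lfsck_stop and the -s objects-per-second limit are assumed to be available in this 2.4.x release; the speed value is only an illustration to be tuned):
---8<---
# Start a scrub/LFSCK run on the MDT, limited to ~1000 objects/sec (illustrative value):
lctl lfsck_start -M ptmp2-MDT0000 -s 1000
# Monitor progress through the same file quoted above:
lctl get_param osd-ldiskfs.ptmp2-MDT0000.oi_scrub
# Stop it again if it interferes with production:
lctl lfsck_stop -M ptmp2-MDT0000
---8<---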