[LU-5265] Lustre clients hang while OI_Scrub is running Created: 27/Jun/14 Updated: 27/Aug/14 Resolved: 27/Aug/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Bruno Travouillon (Inactive) | Assignee: | nasf (Inactive) |
| Resolution: | Not a Bug | Votes: | 0 |
| Labels: | None | ||
| Environment: |
RHEL6 |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 14692 |
| Description |
|
Context: ---
Issue: However, the following commands were still working: ls, cd, df.
Due to the number of inodes, the OI_Scrub took about 3 hours to complete, hanging production. OI_Scrub status once completed:
run_time/3600 = 10685/3600 ~= 2.97 hours.
As a workaround, auto_scrub has been disabled (echo 0 > /proc/fs/lustre/osd-ldiskfs/ptmp2-MDT0000/auto_scrub).
We have since upgraded to Lustre 2.4.3 with the patch from
According to the "OI Scrub and inode Iterator Solution Architecture", clients can access the MDT while OI Scrub is running. Except for FID-to-path operations or accessing the parent from a non-directory child, other operations behave as normal. |
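The run-time arithmetic above can be reproduced with a short script. This is only a sketch: the `run_time` value of 10685 seconds comes from the report, but the sample status text (its other fields and the `run_time: N seconds` line format) is assumed for illustration and is not the actual output from this system. `scrub_run_time_hours` is a hypothetical helper name.

```python
import re

# Assumed sample of an OI scrub status dump.
# Only the run_time value (10685 seconds) is taken from the report.
SAMPLE_STATUS = """\
name: OI_scrub
status: completed
run_time: 10685 seconds
"""


def scrub_run_time_hours(status_text):
    """Return the run_time field of a status dump, converted to hours."""
    match = re.search(r"^run_time:\s*(\d+)", status_text, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no run_time field found in status text")
    return int(match.group(1)) / 3600


if __name__ == "__main__":
    print(f"{scrub_run_time_hours(SAMPLE_STATUS):.2f} hours")
```

On a live MDS the same function could be fed the real status text instead of the sample, provided the `run_time` line has the assumed shape.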
| Comments |
| Comment by Bruno Travouillon (Inactive) [ 27/Jun/14 ] |
|
Top on the MDS while OI_Scrub was running |
| Comment by Peter Jones [ 27/Jun/14 ] |
|
Fan Yong, could you please advise on this issue? Thanks, Peter |
| Comment by nasf (Inactive) [ 30/Jun/14 ] |
|
While OI scrub is rebuilding the OI files, a client that accesses the system with name-based RPCs, such as lookup, is not affected. But if the client sends a FID-based RPC to the MDS and the related FID mapping has not been rebuilt yet, it gets -EINPROGRESS until that mapping has been rebuilt; in the worst case, the application has to wait until OI scrub finishes. FID-based RPCs usually come from previously connected clients that cached FIDs on the client side before the upgrade or before the MDS file-level backup/restore. For a newly connected client, a FID-based RPC always follows a name-based RPC (except for FID-to-path), so newly connected clients are not affected. So for your case above, this is normal.
Since your system has already run OI scrub, the inconsistencies should already have been fixed. So even if you enable "auto_scrub", OI scrub should not be triggered unless it finds some new inconsistency (very rare). On the other hand, even while OI scrub is rebuilding the OI files, NOT all FIDs are affected. That is, if the application tries to open and read/write a file whose FID is not cached on the client, or whose FID mapping has already been rebuilt, the application should not be affected by OI scrub.
So please tell me whether your system often hits OI mapping failures (which trigger OI scrub). If not, enabling "auto_scrub" will be OK. Otherwise, it means OI scrub cannot rebuild the OI files completely, and there must be other hidden bugs. |
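The behaviour described in this comment can be illustrated with a toy model. It is not Lustre code: `ToyMdt` and its methods are hypothetical names, FIDs are plain strings, and the model assumes that a name-based lookup leaves the corresponding FID mapping usable afterwards, which is one reading of why newly connected clients are not affected.

```python
import errno


class ToyMdt:
    """Toy model of an MDT whose FID mappings are being rebuilt by OI scrub."""

    def __init__(self, files):
        # files maps name -> FID; no mapping counts as rebuilt at the start.
        self.names = dict(files)
        self.rebuilt = set()

    def scrub_step(self, fid):
        """Simulate OI scrub rebuilding the mapping for one FID."""
        self.rebuilt.add(fid)

    def lookup_by_name(self, name):
        """Name-based RPC: resolved via the namespace, not blocked by the scrub."""
        fid = self.names[name]
        # Assumption of this sketch: the lookup makes the mapping usable.
        self.rebuilt.add(fid)
        return fid

    def access_by_fid(self, fid):
        """FID-based RPC: returns -EINPROGRESS until the mapping is rebuilt."""
        if fid not in self.rebuilt:
            return -errno.EINPROGRESS
        return 0


if __name__ == "__main__":
    mdt = ToyMdt({"a.txt": "fid-1", "b.txt": "fid-2"})
    mdt.scrub_step("fid-1")                  # scrub has reached fid-1 only
    print(mdt.access_by_fid("fid-1"))        # mapping rebuilt
    print(mdt.access_by_fid("fid-2"))        # cached FID, mapping not rebuilt yet
    print(mdt.access_by_fid(mdt.lookup_by_name("b.txt")))  # lookup first
```

A client holding a cached FID corresponds to calling `access_by_fid` directly; a newly connected client corresponds to calling `lookup_by_name` first.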
| Comment by Bruno Travouillon (Inactive) [ 09/Jul/14 ] |
|
Thanks for your clear answer. However, can you tell me how to check whether OI scrub is triggered while auto_scrub is off? In osd_fid_lookup(), the LCONSOLE message "trigger OI scrub by RPC for DFID" is only displayed when auto_scrub is on. Should I check the clients' consoles? |
| Comment by nasf (Inactive) [ 11/Jul/14 ] |
|
If auto_scrub is disabled, OI scrub will NOT be triggered automatically even if some inconsistency is detected during normal processing, so you will NOT find the message about OI scrub auto-running on the MDS. In that case, the administrator can trigger OI scrub manually via "lctl lfsck_start". OI scrub is server-side work; in any case, the client will NOT print any message. |
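A small helper can make the two states discussed here explicit: whether auto_scrub is currently enabled, and which command an administrator would run to start the scrub by hand. This is a sketch under assumptions. The /proc path is the one quoted in the description, "0" is taken to mean disabled as in the workaround, and the `-M` device option is as documented in the Lustre manual but should be checked against the installed version. All function names are hypothetical.

```python
from pathlib import Path

# Directory taken from the workaround command quoted in the description.
DEFAULT_PROC_ROOT = "/proc/fs/lustre/osd-ldiskfs"


def auto_scrub_path(mdt, proc_root=DEFAULT_PROC_ROOT):
    """Build the path of the auto_scrub tunable for one MDT."""
    return Path(proc_root) / mdt / "auto_scrub"


def auto_scrub_enabled(mdt, proc_root=DEFAULT_PROC_ROOT):
    """Return True unless the tunable reads "0" (assumed to mean disabled)."""
    return auto_scrub_path(mdt, proc_root).read_text().strip() != "0"


def manual_scrub_command(mdt):
    """Return the argument list for starting the scrub by hand (not executed here)."""
    # Assumption: -M selects the MDT device; verify on your Lustre version.
    return ["lctl", "lfsck_start", "-M", mdt]


if __name__ == "__main__":
    print(auto_scrub_path("ptmp2-MDT0000"))
    print(" ".join(manual_scrub_command("ptmp2-MDT0000")))
```

The command is only built as a list so that it can be reviewed before being passed to something like `subprocess.run` on the MDS.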
| Comment by Bruno Travouillon (Inactive) [ 17/Jul/14 ] |
|
Understood. We should enable the auto_scrub by the beginning of September. We will then be able to monitor the OI mapping failures and open a new ticket if we hit some issue. Thanks. |