[LU-13980] Kernel panic on OST after removing files under '/O' folder Created: 23/Sep/20 Updated: 19/May/21 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.8, Lustre 2.12.4 |
| Fix Version/s: | None |
| Type: | Task | Priority: | Trivial |
| Reporter: | Runzhou Han | Assignee: | Andreas Dilger |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Environment: |
CentOS Linux release 7.7.1908 (Core) with kernel 3.10.0-957.1.3.el7_lustre.x86_64 for Lustre 2.10.8, and CentOS Linux release 7.7.1908 (Core) with kernel 3.10.0-1062.9.1.el7_lustre.x86_64 for Lustre 2.12.4. |
| Attachments: |
|
| Issue Links: |
|
| Description |
|
I removed some data stripes under the '/O' folder on an OST and then started LFSCK. The OST was then forced to reboot because of a kernel panic. Looking into the vmcore, the relevant lines are:

[ 1057.367833] Lustre: lustre-OST0000: new disk, initializing
[ 1057.367877] Lustre: srv-lustre-OST0000: No data found on store. Initialize space
[ 1057.417121] Lustre: lustre-OST0000: Imperative Recovery not enabled, recovery window 300-900
[ 1062.018722] Lustre: lustre-OST0000: Connection restored to lustre-MDT0000-mdtlov_UUID (at 10.0.0.122@tcp)
[ 1089.010284] Lustre: lustre-OST0000: Connection restored to 89c68bff-12c8-9f48-f01e-f6306c666eb9 (at 10.0.0.98@tcp)
[ 1281.516928] LustreError: 10410:0:(osd_handler.c:1982:osd_object_release()) LBUG
[ 1281.516939] Pid: 10410, comm: ll_ost_out00_00 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Mon May 27 03:45:37 UTC 2019
[ 1281.516944] Call Trace:
[ 1281.516960] [<ffffffffc05fd7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[ 1281.516986] [<ffffffffc05fd87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[ 1281.517004] [<ffffffffc0b93820>] osd_get_ldiskfs_dirent_param+0x0/0x130 [osd_ldiskfs]
[ 1281.517173] [<ffffffffc07442b0>] lu_object_put+0x190/0x3e0 [obdclass]
[ 1281.517244] [<ffffffffc09d8bc3>] out_handle+0x1503/0x1bc0 [ptlrpc]
[ 1281.517369] [<ffffffffc09ce7ca>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
[ 1281.517481] [<ffffffffc097705b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
[ 1281.517582] [<ffffffffc097a7a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]

(The full dmesg log collected from the vmcore is in the attachment.)

I also found that after removing files under '/O' on the OST, even a simple write operation can trigger the same kernel panic. I am curious why 'osd_object_release()' is invoked in these situations, and where the LFSCK functions sit in the error call trace. Thanks a lot! |
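For context, the LBUG at osd_handler.c:1982 in the trace above is raised by osd_object_release() in the ldiskfs OSD. The sketch below is paraphrased from memory of the 2.12-era code (the exact condition and assertion macro are assumptions, not a quote of the source); it illustrates the kind of sanity check that trips when an object is released while its backing inode has i_nlink == 0 even though Lustre itself never destroyed the object, which is the state left behind when the object file is unlinked directly through the ldiskfs mount.

```c
/*
 * Paraphrased sketch of the check behind the LBUG at osd_handler.c:1982.
 * Field names follow struct osd_object in osd-ldiskfs; the exact form of
 * the real assertion may differ.
 */
static void osd_object_release(const struct lu_env *env, struct lu_object *l)
{
	struct osd_object *o = osd_obj(l);

	/*
	 * Releasing an object whose backing inode has i_nlink == 0 while
	 * Lustre never marked it destroyed (oo_destroyed == 0) means the
	 * on-disk state changed underneath the OSD, e.g. the object file
	 * was removed via the ldiskfs mount.  The assertion turns this
	 * into an LBUG, which panics the OST.
	 */
	LASSERT(!(o->oo_destroyed == 0 && o->oo_inode != NULL &&
		  o->oo_inode->i_nlink == 0));
}
```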
| Comments |
| Comment by Andreas Dilger [ 23/Sep/20 ] |
|
Can you please provide some more information about this problem? How did you remove the objects under the /O directory? Was the filesystem mounted as both type "lustre" and type "ldiskfs" at the same time, or was the "lustre" OST unmounted first? Which files were removed?

The log messages make it appear that the filesystem was sufficiently corrupted that the OST startup process was not able to detect the Lustre configuration files.

If you are able to reproduce this, please enable full debugging with "lctl set_param debug=-1" on the OST before starting LFSCK, and then attach the debug log, which should be written to /tmp/lustre_log.<timestamp> when the LBUG is triggered, or can be dumped manually with "lctl dk /tmp/lustre_log.txt". |
| Comment by Runzhou Han [ 23/Sep/20 ] |
|
Thank you for your reply. I mounted the OST as both type "lustre" and type "ldiskfs" at the same time. The file removed is a data stripe of a client file. My stripe setting is:

lfs setstripe -i 0 -c -1 -S 64K /lustre

For example, on the client node I create a file with the following command:

dd if=/dev/zero of=/lustre/10M bs=1M count=10

Then I use "lfs getstripe /lustre/10M" to locate its data stripes on the OSTs:

[root@mds Desktop]# lfs getstripe 10M
10M
lmm_stripe_count:  3
lmm_stripe_size:   65536
lmm_pattern:       1
lmm_layout_gen:    0
lmm_stripe_offset: 0
        obdidx       objid       objid       group
             0           2         0x2           0
             1           2         0x2           0
             2           2         0x2           0

Next I remove one of them under one OST's "ldiskfs" mount point:

[root@oss0 osboxes]# rm -f /ost0_ldiskfs/O/0/d2/2

Then running LFSCK on the MDT triggers the kernel panic caused by the LBUG. I am able to reproduce the LBUG. However, once the kernel panic occurs I can no longer interact with the VM (I am using a virtual machine cluster): the VM either freezes, or, if I configure the kernel to reboot x seconds after the panic, I cannot find /tmp/lustre_log.<timestamp> after the next boot.
|
| Comment by Andreas Dilger [ 24/Sep/20 ] |
|
Mounting the OST filesystem as both "lustre" and "ldiskfs" at the same time is not supported, since (as you can see with this assertion) the state is being changed underneath the filesystem in an unexpected manner. It would be the same as modifying the blocks underneath ext4 while it is mounted.

I mistakenly thought that the "lustre-OST0000: new disk, initializing" message was caused by a large number of files being deleted from the filesystem before startup, but I can now see from the low OST object numbers in your "lfs getstripe" output that this is a new filesystem, so this message is expected.

I agree that it would be good to handle this error more gracefully (e.g. return an error instead of LBUG). Looking elsewhere in Jira, it seems that this LBUG is hit often enough that the error handling should really be more tolerant, since the design policy is that the server should not LASSERT() on bad values that come from the client or disk. |
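Purely as a hypothetical illustration of that direction (this is not the content of the Gerrit patches referenced below, and the specific recovery action shown here is an assumption), the same check could report the inconsistency and tolerate it instead of asserting:

```c
/*
 * Hypothetical illustration only; not the actual fix from the patches
 * referenced in this ticket.  Shows one way the inconsistency could be
 * reported and tolerated instead of triggering an LBUG.
 */
static void osd_object_release(const struct lu_env *env, struct lu_object *l)
{
	struct osd_object *o = osd_obj(l);

	if (unlikely(o->oo_destroyed == 0 && o->oo_inode != NULL &&
		     o->oo_inode->i_nlink == 0)) {
		CERROR("object "DFID" released with zero nlink but not "
		       "destroyed; on-disk state changed under the OSD\n",
		       PFID(lu_object_fid(l)));
		/*
		 * Assumed mitigation for illustration: mark the object as
		 * destroyed so the release path can proceed without
		 * asserting, leaving any repair to LFSCK.
		 */
		o->oo_destroyed = 1;
	}
}
```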
| Comment by Runzhou Han [ 24/Sep/20 ] |
|
I see. Maybe I should not mount them at the same time. Actually, I was trying to emulate some special cases in which the underlying file system is corrupted by accident while the system is still running, to see how Lustre reacts to these unexpected failures (especially LFSCK's reaction). In fact, to help develop more robust fsck tools for PFS/DFS, I am doing the same thing to other systems (e.g., BeeGFS, OrangeFS, and Ceph). Since Lustre relies heavily on kernel modules, I have observed more kernel crashes in Lustre when injecting faults. That is why I am here to learn more about Lustre. |
| Comment by Gerrit Updater [ 26/Sep/20 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40058 |
| Comment by Gerrit Updater [ 24/Nov/20 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40738 |
| Comment by Gerrit Updater [ 03/Dec/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40738/ |