[LU-1749] llog_lvfs_create()) error looking up logfile Created: 15/Aug/12 Updated: 16/Aug/16 Resolved: 16/Aug/16 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.3.0, Lustre 2.4.0, Lustre 2.1.4 |
| Type: | Bug | Priority: | Major |
| Reporter: | Christopher Morrone | Assignee: | Lai Siyao |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | llnl | ||
| Environment: | |||
| Severity: | 3 |
| Rank (Obsolete): | 4411 |
| Description |
|
We had an MDS (running 2.1.1-17chaos) lock up due to this bug:

2012-08-14 13:48:55 LustreError: 3697:0:(llog_lvfs.c:616:llog_lvfs_create()) error looking up logfile 0x1a768b0a:0xa2f17dca: rc -116
2012-08-14 13:48:55 LustreError: 3697:0:(llog_cat.c:174:llog_cat_id2handle()) error opening log id 0x1a768b0a:a2f17dca: rc -116
2012-08-14 13:48:55 LustreError: 3697:0:(llog_obd.c:318:cat_cancel_cb()) Cannot find handle for log 0x1a768b0a
2012-08-14 13:48:55 LustreError: 3472:0:(llog_obd.c:391:llog_obd_origin_setup()) llog_process() with cat_cancel_cb failed: -116
2012-08-14 13:48:55 LustreError: 3472:0:(llog_obd.c:218:llog_setup_named()) obd lsa-OST00bd-osc ctxt 2 lop_setup=ffffffffa05b0a60 failed -116
2012-08-14 13:48:55 LustreError: 3472:0:(osc_request.c:4186:__osc_llog_init()) failed LLOG_MDS_OST_ORIG_CTXT
2012-08-14 13:48:55 LustreError: 3472:0:(osc_request.c:4203:__osc_llog_init()) osc 'lsa-OST00bd-osc' tgt 'mdd_obd-lsa-MDT0000' catid ffff8808312bd860 rc=-116
2012-08-14 13:48:55 LustreError: 3472:0:(osc_request.c:4205:__osc_llog_init()) logid 0x1a7680f6:0x615e782e
2012-08-14 13:48:55 LustreError: 3472:0:(osc_request.c:4233:osc_llog_init()) rc: -116
2012-08-14 13:48:55 LustreError: 3472:0:(lov_log.c:248:lov_llog_init()) error osc_llog_init idx 189 osc 'lsa-OST00bd-osc' tgt 'mdd_obd-lsa-MDT0000' (rc=-116)
2012-08-14 13:48:55 LustreError: 3698:0:(llog_lvfs.c:616:llog_lvfs_create()) error looking up logfile 0x1a768b0a:0xa2f17dca: rc -116
2012-08-14 13:48:55 LustreError: 3698:0:(llog_cat.c:174:llog_cat_id2handle()) error opening log id 0x1a768b0a:a2f17dca: rc -116
2012-08-14 13:48:55 LustreError: 3698:0:(llog_obd.c:318:cat_cancel_cb()) Cannot find handle for log 0x1a768b0a
2012-08-14 13:48:55 LustreError: 3407:0:(llog_obd.c:391:llog_obd_origin_setup()) llog_process() with cat_cancel_cb failed: -116
2012-08-14 13:48:55 LustreError: 3407:0:(llog_obd.c:218:llog_setup_named()) obd lsa-OST00bd-osc ctxt 2 lop_setup=ffffffffa05b0a60 failed -116
2012-08-14 13:48:55 LustreError: 3407:0:(osc_request.c:4186:__osc_llog_init()) failed LLOG_MDS_OST_ORIG_CTXT
2012-08-14 13:48:55 LustreError: 3407:0:(osc_request.c:4203:__osc_llog_init()) osc 'lsa-OST00bd-osc' tgt 'mdd_obd-lsa-MDT0000' catid ffff8804333af960 rc=-116
2012-08-14 13:48:55 LustreError: 3407:0:(osc_request.c:4205:__osc_llog_init()) logid 0x1a7680f6:0x615e782e
2012-08-14 13:48:56 LustreError: 3407:0:(osc_request.c:4233:osc_llog_init()) rc: -116
2012-08-14 13:48:56 LustreError: 3407:0:(lov_log.c:248:lov_llog_init()) error osc_llog_init idx 189 osc 'lsa-OST00bd-osc' tgt 'mdd_obd-lsa-MDT0000' (rc=-116)

I don't see a file in the OBJECTS directory that seems to match 0x1a768b0a:0xa2f17dca (if that is where we are looking). Since -116 is -ESTALE, I'm not sure we're even getting to the lower-level lookup; it may be that mds_lvfs_fid2dentry() is returning -ESTALE because the id is 0. This has left the OST connection "inactive" on the MDS, so any users with data on that OST are currently dead in the water. |
| Comments |
| Comment by Peter Jones [ 15/Aug/12 ] |
|
Oleg is looking into this one |
| Comment by Oleg Drokin [ 15/Aug/12 ] |
|
Do you have any confirmation of clients' inability to access data on that OST? |
| Comment by Oleg Drokin [ 15/Aug/12 ] |
|
OK, now regarding the ESTALE theory: there are way too many ways to get that error. The oid 0x1a768b0a itself is suspect; you need to have a pretty sizeable fs and use a lot of the inodes there, otherwise (quoting from ext3_nfs_get_inode):

if (ino > le32_to_cpu(EXT3_SB(sb)->s_es->s_inodes_count))
        return ERR_PTR(-ESTALE);

Do you have more than 443976458 inodes on this fs? If not, then it appears to be a case of the CATALOGS file, or a catalog in the llogs, being corrupted and containing garbage. Another possibility is that this inode (443976458) does exist but has a different generation. (Can you please check that too with debugfs?) |
| Comment by Christopher Morrone [ 15/Aug/12 ] |
|
Sigh, actually I didn't even think about that. The admins led me to believe it was a serious problem, but now that I look it appears that the clients are all connected to the OST just fine. I think we can lower the priority on this one. |
| Comment by Christopher Morrone [ 15/Aug/12 ] |
|
We created the filesystem with 1074003968 inodes. 210379586 are in use and 863624382 are free (as reported by df -i on the MDS). |
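Oleg's ext3_nfs_get_inode check, applied to these numbers, can be modeled in a few lines. This is a simplified illustrative sketch, not kernel code; the function name and arguments are invented for this example:

```python
# Simplified model of the two ways the ext3 NFS-export inode lookup can
# return -ESTALE, per Oleg's analysis: either the inode number exceeds
# the filesystem's inode count, or the stored generation differs.
ESTALE = 116

def nfs_get_inode_check(ino, generation, s_inodes_count, disk_generation):
    """Return -ESTALE in the same cases ext3_nfs_get_inode would, else 0."""
    if ino > s_inodes_count:
        return -ESTALE          # inode number out of range
    if generation != disk_generation:
        return -ESTALE          # inode exists but was reused
    return 0

# 0x1a768b0a is 443976458, well under the 1074003968 inodes this fs was
# created with, so the -116 must come from a generation mismatch.
print(nfs_get_inode_check(0x1a768b0a, 0xa2f17dca, 1074003968, 0xea7675ee))  # -116
```

Since the inode number is in range here, the out-of-range branch is ruled out, which is what points the investigation at a reused inode.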
| Comment by Oleg Drokin [ 15/Aug/12 ] |
|
OK, so the inode is real and valid. Can you please check whether it is in use and what generation it has? |
| Comment by Oleg Drokin [ 15/Aug/12 ] |
|
Overall (pending verification of the inode being in use by some other file), I suspect this will turn out to be another instance of bug 22658 (commit id c4f5d67193d61a3948bc1e01c5d602b8ffb7d011), only instead of the llog being already deleted, it has been deleted and the inode reused for something else, producing ESTALE. |
| Comment by Christopher Morrone [ 15/Aug/12 ] |
|
It looks like inode 443976458 is /OBJECTS/1a768b0a:ea7675ee, which debugfs's "stat" shows has generation number 3933631982. Is that the right generation number? Is a number that high normal? So it looks like you are correct about the inode being reused. |
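A quick sanity check of that observation, assuming (as the OBJECTS naming here suggests) that the llog file name encodes inode:generation as two hex fields:

```python
# Converting the hex fields of the file name should reproduce the
# decimal inode and generation numbers that debugfs reports.
name = "1a768b0a:ea7675ee"
ino, gen = (int(x, 16) for x in name.split(":"))
print(ino, gen)  # 443976458 3933631982
```

So the name matches the inode, but the generation in the name (0xea7675ee) differs from the 0xa2f17dca the catalog is asking for, exactly the reuse scenario Oleg suspected.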
| Comment by Oleg Drokin [ 15/Aug/12 ] |
|
Well, 0xea7675ee IS 3933631982, which means the inode now corresponds to a valid llog, just a different one than we expect. Weird that it coincided like that. There is no upper limit on generation, so that's fine. So I suspect the root cause is the same: a reused llog inode. As the fix, we need to also handle ESTALE in the llog create path, similarly to how we handle ENOENT now. |
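The shape of the fix Oleg describes, sketched here in Python for clarity (hypothetical names; the real change lives in Lustre's C llog create path):

```python
# Hedged sketch, not actual Lustre code: treat -ESTALE, which signals a
# deleted-and-reused llog inode, the same way -ENOENT (a deleted llog)
# is already treated, instead of failing the whole llog setup.
import errno

def open_catalog_entry(lookup, logid):
    """lookup(logid) is a stand-in for the low-level llog open."""
    rc = lookup(logid)
    if rc in (-errno.ENOENT, -errno.ESTALE):
        # The log is gone, or its inode was reused for another file:
        # cancel the stale catalog entry and continue processing,
        # rather than aborting setup and deactivating the OST.
        return "cancel-stale-entry"
    return rc
```

The point is only that ESTALE joins ENOENT in the "stale entry, skip it" branch; everything else stays as-is.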
| Comment by Peter Jones [ 16/Aug/12 ] |
|
Lai, could you please create a fix to address the issue Oleg has identified in his analysis? Thanks, Peter |
| Comment by Lai Siyao [ 17/Aug/12 ] |
|
Review is at http://review.whamcloud.com/#change,3708 |
| Comment by Ned Bass [ 27/Aug/12 ] |
|
What normally causes the llog to be deleted by itself? The inactive OST breaks quota reporting, so we need to decide whether to take down the MDS to manually remove the file. |
| Comment by Diego Moreno (Inactive) [ 12/Sep/12 ] |
|
They have also hit this bug at CEA, so we're interested in this patch. Thanks! |
| Comment by D. Marc Stearman (Inactive) [ 01/Oct/12 ] |
|
Has any progress been made on this? |
| Comment by Lai Siyao [ 03/Oct/12 ] |
|
Patch for b2_3 is at: http://review.whamcloud.com/#change,4163 |
| Comment by Peter Jones [ 16/Oct/12 ] |
|
Landed for 2.3 and 2.4 |
| Comment by D. Marc Stearman (Inactive) [ 17/Oct/12 ] |
|
Is there a patch for 2.1? |
| Comment by Christopher Morrone [ 17/Oct/12 ] |
|
The patch looks pretty simple and applies cleanly to 2.1. |
| Comment by Emoly Liu [ 05/Dec/12 ] |
|
Patch for b2_1 is at http://review.whamcloud.com/4742 |
| Comment by Ned Bass [ 29/Mar/13 ] |
|
We're still affected by this bug. OST lsa-OST00bd remains inactive on the MDS. The MDS is at tag 2.1.4-3chaos, which includes patch http://review.whamcloud.com/4742. The file /OBJECTS/1a7680f6:615e782e does not exist on the MDT.

2013-03-20 10:05:33 LustreError: 3488:0:(llog_lvfs.c:616:llog_lvfs_create()) error looking up logfile 0x1a7680f6:0x615e782e: rc -116
2013-03-20 10:05:33 LustreError: 3488:0:(llog_obd.c:220:llog_setup_named()) obd lsa-OST00bd-osc ctxt 2 lop_setup=ffffffffa06a0f20 failed -116
2013-03-20 10:05:33 LustreError: 3488:0:(osc_request.c:4229:__osc_llog_init()) failed LLOG_MDS_OST_ORIG_CTXT
2013-03-20 10:05:33 LustreError: 3488:0:(osc_request.c:4246:__osc_llog_init()) osc 'lsa-OST00bd-osc' tgt 'mdd_obd-lsa-MDT0000' catid ffff88022f6ff8c0 rc=-116
2013-03-20 10:05:33 LustreError: 3488:0:(osc_request.c:4248:__osc_llog_init()) logid 0x1a7680f6:0x615e782e
2013-03-20 10:05:33 LustreError: 3488:0:(osc_request.c:4276:osc_llog_init()) rc: -116
2013-03-20 10:05:33 LustreError: 3488:0:(lov_log.c:248:lov_llog_init()) error osc_llog_init idx 189 osc 'lsa-OST00bd-osc' tgt 'mdd_obd-lsa-MDT0000' (rc=-116) |
| Comment by Ned Bass [ 29/Mar/13 ] |
|
Reopening. |
| Comment by Ned Bass [ 03/Apr/13 ] |
|
It appears that inodes 443973878 and 443975021 were again reused for valid llogs (note that the inode and generation numbers match the file names):

# levi-mds1 /root > debugfs -c -R 'stat /OBJECTS/1a7680f6:46b56639' /dev/sdb
debugfs 1.41.12 (17-May-2010)
/dev/sdb: catastrophic mode - not reading inode or group bitmaps
Inode: 443973878   Type: regular   Mode: 0666   Flags: 0x80000
Generation: 1186293305   Version: 0x00000000:00000000
# levi-mds1 /root > printf "%d:%d\n" 0x1a7680f6 0x46b56639
443973878:1186293305
# levi-mds1 /root > debugfs -c -R 'stat /OBJECTS/1a76856d:a619a721' /dev/sdb
debugfs 1.41.12 (17-May-2010)
/dev/sdb: catastrophic mode - not reading inode or group bitmaps
Inode: 443975021   Type: regular   Mode: 0666   Flags: 0x0
Generation: 2786699041   Version: 0x00000000:00000000
# levi-mds1 /root > printf "%d:%d\n" 0x1a76856d 0xa619a721
443975021:2786699041

This explains why we are still seeing -ESTALE: there still seem to be catalog entries referring to those inodes, but with a different generation number:

LustreError: 3488:0:(llog_lvfs.c:616:llog_lvfs_create()) error looking up logfile 0x1a7680f6:0x615e782e: rc -116
LustreError: 3488:0:(llog_obd.c:220:llog_setup_named()) obd lsa-OST00bd-osc ctxt 2 lop_setup=ffffffffa06a0f20 failed -116
LustreError: 3488:0:(osc_request.c:4229:__osc_llog_init()) failed LLOG_MDS_OST_ORIG_CTXT
LustreError: 3488:0:(osc_request.c:4246:__osc_llog_init()) osc 'lsa-OST00bd-osc' tgt 'mdd_obd-lsa-MDT0000' catid ffff88022f6ff8c0 rc=-116
LustreError: 3488:0:(osc_request.c:4248:__osc_llog_init()) logid 0x1a7680f6:0x615e782e
LustreError: 3488:0:(osc_request.c:4276:osc_llog_init()) rc: -116
LustreError: 3488:0:(lov_log.c:248:lov_llog_init()) error osc_llog_init idx 189 osc 'lsa-OST00bd-osc' tgt 'mdd_obd-lsa-MDT0000' (rc=-116)
LustreError: 3770:0:(llog_lvfs.c:616:llog_lvfs_create()) error looking up logfile 0x1a768d8e:0xead6b12b: rc -116
LustreError: 3770:0:(llog_cat.c:174:llog_cat_id2handle()) error opening log id 0x1a768d8e:ead6b12b: rc -116
LustreError: 3770:0:(llog_obd.c:320:cat_cancel_cb()) Cannot find handle for log 0x1a768d8e: -116

But even with the patch for this bug, ESTALE is not handled gracefully and the OST is left deactivated:

LustreError: 5254:0:(lov_log.c:160:lov_llog_origin_connect()) error osc_llog_connect tgt 189 (-107)
LustreError: 5254:0:(mds_lov.c:872:__mds_lov_synchronize()) mdd_obd-lsa-MDT0000: lsa-OST00bd_UUID failed at llog_origin_connect: -107
Lustre: lsa-OST00bd_UUID: Sync failed deactivating: rc -107 |
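The inode/generation match Ned points out can be verified mechanically with the same hex-to-decimal conversion the printf commands above perform:

```python
# For both reused inodes, the file name's hex inode:generation fields
# match the decimal values debugfs reports for that file.
pairs = {
    "1a7680f6:46b56639": (443973878, 1186293305),
    "1a76856d:a619a721": (443975021, 2786699041),
}
for name, expected in pairs.items():
    decoded = tuple(int(x, 16) for x in name.split(":"))
    assert decoded == expected, (name, decoded)
print("all matched")
```

Both names decode to the live inode/generation pairs, while the catalog still asks for generation 0x615e782e on inode 0x1a7680f6, which is why the lookup keeps returning -ESTALE.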
| Comment by Ned Bass [ 03/Apr/13 ] |
|
Raising priority because this bug breaks quota reporting. |
| Comment by Lai Siyao [ 06/May/13 ] |
|
Landed. |
| Comment by James A Simmons [ 16/Aug/16 ] |
|
Old ticket for unsupported version |