  1. Lustre
  2. LU-1749

llog_lvfs_create()) error looking up logfile

Details


    Description

      We had an MDS (running 2.1.1-17chaos) lock up due to bug LU-1276. The admins rebooted the node to work around the issue, and after the reboot, when the MDS started up, we hit the following errors:

      2012-08-14 13:48:55 LustreError: 3697:0:(llog_lvfs.c:616:llog_lvfs_create()) error looking up logfile 0x1a768b0a:0xa2f17dca: rc -116
      2012-08-14 13:48:55 LustreError: 3697:0:(llog_cat.c:174:llog_cat_id2handle()) error opening log id 0x1a768b0a:a2f17dca: rc -116
      2012-08-14 13:48:55 LustreError: 3697:0:(llog_obd.c:318:cat_cancel_cb()) Cannot find handle for log 0x1a768b0a
      2012-08-14 13:48:55 LustreError: 3472:0:(llog_obd.c:391:llog_obd_origin_setup()) llog_process() with cat_cancel_cb failed: -116
      2012-08-14 13:48:55 LustreError: 3472:0:(llog_obd.c:218:llog_setup_named()) obd lsa-OST00bd-osc ctxt 2 lop_setup=ffffffffa05b0a60 failed -116
      2012-08-14 13:48:55 LustreError: 3472:0:(osc_request.c:4186:__osc_llog_init()) failed LLOG_MDS_OST_ORIG_CTXT
      2012-08-14 13:48:55 LustreError: 3472:0:(osc_request.c:4203:__osc_llog_init()) osc 'lsa-OST00bd-osc' tgt 'mdd_obd-lsa-MDT0000' catid ffff8808312bd860 rc=-116
      2012-08-14 13:48:55 LustreError: 3472:0:(osc_request.c:4205:__osc_llog_init()) logid 0x1a7680f6:0x615e782e
      2012-08-14 13:48:55 LustreError: 3472:0:(osc_request.c:4233:osc_llog_init()) rc: -116
      2012-08-14 13:48:55 LustreError: 3472:0:(lov_log.c:248:lov_llog_init()) error osc_llog_init idx 189 osc 'lsa-OST00bd-osc' tgt 'mdd_obd-lsa-MDT0000' (rc=-116)
      2012-08-14 13:48:55 LustreError: 3698:0:(llog_lvfs.c:616:llog_lvfs_create()) error looking up logfile 0x1a768b0a:0xa2f17dca: rc -116
      2012-08-14 13:48:55 LustreError: 3698:0:(llog_cat.c:174:llog_cat_id2handle()) error opening log id 0x1a768b0a:a2f17dca: rc -116
      2012-08-14 13:48:55 LustreError: 3698:0:(llog_obd.c:318:cat_cancel_cb()) Cannot find handle for log 0x1a768b0a
      2012-08-14 13:48:55 LustreError: 3407:0:(llog_obd.c:391:llog_obd_origin_setup()) llog_process() with cat_cancel_cb failed: -116
      2012-08-14 13:48:55 LustreError: 3407:0:(llog_obd.c:218:llog_setup_named()) obd lsa-OST00bd-osc ctxt 2 lop_setup=ffffffffa05b0a60 failed -116
      2012-08-14 13:48:55 LustreError: 3407:0:(osc_request.c:4186:__osc_llog_init()) failed LLOG_MDS_OST_ORIG_CTXT
      2012-08-14 13:48:55 LustreError: 3407:0:(osc_request.c:4203:__osc_llog_init()) osc 'lsa-OST00bd-osc' tgt 'mdd_obd-lsa-MDT0000' catid ffff8804333af960 rc=-116
      2012-08-14 13:48:55 LustreError: 3407:0:(osc_request.c:4205:__osc_llog_init()) logid 0x1a7680f6:0x615e782e
      2012-08-14 13:48:56 LustreError: 3407:0:(osc_request.c:4233:osc_llog_init()) rc: -116
      2012-08-14 13:48:56 LustreError: 3407:0:(lov_log.c:248:lov_llog_init()) error osc_llog_init idx 189 osc 'lsa-OST00bd-osc' tgt 'mdd_obd-lsa-MDT0000' (rc=-116)
      

      I don't see a file in the OBJECTS directory that matches 0x1a768b0a:0xa2f17dca (if that is where we are looking). However, -116 is -ESTALE, so I'm not sure we're even getting to the lower-level lookup. It may be that mds_lvfs_fid2dentry() is returning -ESTALE because the id is 0.
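      As a sanity check on that reading of the return code, a quick sketch with Python's errno module (not part of the original report) confirms that errno 116 is ESTALE on Linux:

```python
import errno
import os

rc = -116  # return code from the llog_lvfs_create() messages above

# Map the errno back to its symbolic name and description (Linux values).
print(errno.errorcode[-rc])  # → ESTALE
print(os.strerror(-rc))
```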

      This has left the OST connection "inactive" on the MDS, so any users with data on that OST are currently dead in the water.


        Activity


          simmonsja James A Simmons added a comment - Old ticket for unsupported version
          laisiyao Lai Siyao added a comment - Landed.

          nedbass Ned Bass (Inactive) added a comment - Raising priority because this bug breaks quota reporting.

          nedbass Ned Bass (Inactive) added a comment -

          It appears that inodes 443976458 and 443975021 were again reused for valid llogs (note that the inode and generation numbers match the file names):

          # levi-mds1 /root > debugfs -c -R 'stat /OBJECTS/1a7680f6:46b56639' /dev/sdb    
          debugfs 1.41.12 (17-May-2010)                                                   
          /dev/sdb: catastrophic mode - not reading inode or group bitmaps                
          Inode: 443973878   Type: regular    Mode:  0666   Flags: 0x80000                
          Generation: 1186293305    Version: 0x00000000:00000000        
          
          # levi-mds1 /root > printf "%d:%d\n" 0x1a7680f6 0x46b56639
          443973878:1186293305
          
          # levi-mds1 /root > debugfs -c -R 'stat /OBJECTS/1a76856d:a619a721' /dev/sdb 
          debugfs 1.41.12 (17-May-2010)
          /dev/sdb: catastrophic mode - not reading inode or group bitmaps
          Inode: 443975021   Type: regular    Mode:  0666   Flags: 0x0
          Generation: 2786699041    Version: 0x00000000:00000000
          
          # levi-mds1 /root > printf "%d:%d\n" 0x1a76856d 0xa619a721
          443975021:2786699041
          
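          The file-name arithmetic in the transcript above can be captured in a small helper (the function name is mine, not Lustre's): llog files under /OBJECTS are named "<inode>:<generation>" in hex, and debugfs reports the same pair in decimal:

```python
def llog_name_to_dec(name: str) -> str:
    """Convert an /OBJECTS llog file name ("inode:generation" in hex,
    as shown in the debugfs transcript) to its decimal form."""
    ino_hex, gen_hex = name.split(":")
    return f"{int(ino_hex, 16)}:{int(gen_hex, 16)}"

# Mirrors the printf "%d:%d" conversions in the transcript:
print(llog_name_to_dec("1a7680f6:46b56639"))  # → 443973878:1186293305
print(llog_name_to_dec("1a76856d:a619a721"))  # → 443975021:2786699041
```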

          This explains why we are still seeing -ESTALE: there still seems to be a catalog entry referring to those inodes, but with a different generation number:

          LustreError: 3488:0:(llog_lvfs.c:616:llog_lvfs_create()) error looking up logfile 0x1a7680f6:0x615e782e: rc -116
          LustreError: 3488:0:(llog_obd.c:220:llog_setup_named()) obd lsa-OST00bd-osc ctxt 2 lop_setup=ffffffffa06a0f20 failed -116
          LustreError: 3488:0:(osc_request.c:4229:__osc_llog_init()) failed LLOG_MDS_OST_ORIG_CTXT
          LustreError: 3488:0:(osc_request.c:4246:__osc_llog_init()) osc 'lsa-OST00bd-osc' tgt 'mdd_obd-lsa-MDT0000' catid ffff88022f6ff8c0 rc=-116
          LustreError: 3488:0:(osc_request.c:4248:__osc_llog_init()) logid 0x1a7680f6:0x615e782e
          LustreError: 3488:0:(osc_request.c:4276:osc_llog_init()) rc: -116
          LustreError: 3488:0:(lov_log.c:248:lov_llog_init()) error osc_llog_init idx 189 osc 'lsa-OST00bd-osc' tgt 'mdd_obd-lsa-MDT0000' (rc=-116)
          LustreError: 3770:0:(llog_lvfs.c:616:llog_lvfs_create()) error looking up logfile 0x1a768d8e:0xead6b12b: rc -116
          LustreError: 3770:0:(llog_cat.c:174:llog_cat_id2handle()) error opening log id 0x1a768d8e:ead6b12b: rc -116
          LustreError: 3770:0:(llog_obd.c:320:cat_cancel_cb()) Cannot find handle for log 0x1a768d8e: -116
          
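          In other words, the catalog still records logid 0x1a7680f6:0x615e782e while the reused inode on disk now carries generation 0x46b56639. A minimal sketch of that kind of generation check (hypothetical names, not the actual mds_lvfs_fid2dentry() code) shows why the lookup keeps returning -116:

```python
ESTALE = 116  # Linux errno for "Stale file handle"

def fid2dentry_check(cat_generation: int, disk_generation: int) -> int:
    # If the generation recorded in the catalog no longer matches the
    # generation of the (reused) inode on disk, the handle is stale.
    if cat_generation != disk_generation:
        return -ESTALE
    return 0

# Catalog logid generation vs. the reused inode's on-disk generation:
print(fid2dentry_check(0x615e782e, 0x46b56639))  # → -116
```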

          But even with the patch for this bug, ESTALE is not handled gracefully and the OST is left deactivated:

          LustreError: 5254:0:(lov_log.c:160:lov_llog_origin_connect()) error osc_llog_connect tgt 189 (-107)
          LustreError: 5254:0:(mds_lov.c:872:__mds_lov_synchronize()) mdd_obd-lsa-MDT0000: lsa-OST00bd_UUID failed at llog_origin_connect: -107
          Lustre: lsa-OST00bd_UUID: Sync failed deactivating: rc -107
          

          nedbass Ned Bass (Inactive) added a comment - Reopening.

          nedbass Ned Bass (Inactive) added a comment -

          We're still affected by this bug. OST lsa-OST00bd remains inactive on the MDS. The MDS is at tag 2.1.4-3chaos, which includes patch http://review.whamcloud.com/4742. The file /OBJECTS/1a7680f6:615e782e does not exist on the MDT.

          2013-03-20 10:05:33 LustreError: 3488:0:(llog_lvfs.c:616:llog_lvfs_create()) error looking up logfile 0x1a7680f6:0x615e782e: rc -116
          2013-03-20 10:05:33 LustreError: 3488:0:(llog_obd.c:220:llog_setup_named()) obd lsa-OST00bd-osc ctxt 2 lop_setup=ffffffffa06a0f20 failed -116
          2013-03-20 10:05:33 LustreError: 3488:0:(osc_request.c:4229:__osc_llog_init()) failed LLOG_MDS_OST_ORIG_CTXT
          2013-03-20 10:05:33 LustreError: 3488:0:(osc_request.c:4246:__osc_llog_init()) osc 'lsa-OST00bd-osc' tgt 'mdd_obd-lsa-MDT0000' catid ffff88022f6ff8c0 rc=-116
          2013-03-20 10:05:33 LustreError: 3488:0:(osc_request.c:4248:__osc_llog_init()) logid 0x1a7680f6:0x615e782e
          2013-03-20 10:05:33 LustreError: 3488:0:(osc_request.c:4276:osc_llog_init()) rc: -116
          2013-03-20 10:05:33 LustreError: 3488:0:(lov_log.c:248:lov_llog_init()) error osc_llog_init idx 189 osc 'lsa-OST00bd-osc' tgt 'mdd_obd-lsa-MDT0000' (rc=-116)
          
          emoly.liu Emoly Liu added a comment - Patch for b2_1 is at http://review.whamcloud.com/4742

          morrone Christopher Morrone (Inactive) added a comment - Patch looks pretty simple and applies cleanly to 2.1.

          marc@llnl.gov D. Marc Stearman (Inactive) added a comment - Is there a patch for 2.1?
          pjones Peter Jones added a comment - Landed for 2.3 and 2.4

          People

            laisiyao Lai Siyao
            morrone Christopher Morrone (Inactive)
            Votes: 0
            Watchers: 8
