  Lustre / LU-11578

ldiskfs_map_blocks: comm mdt00_100: lblock 0 mapped to illegal pblock

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.10.4
    • Labels: None
    • Environment: CentOS 7.4
    • Severity: 3

    Description

      We hit the following issue today on Oak's MDT0. We just added another MDT a few days ago, so it was empty, but today I started to lfs migrate a test directory (from MDT0 to MDT1) when this happened on MDT0. lfs migrate actually worked for a while (about 40k inodes were migrated) until MDT0 did this:

      Oct 26 17:26:13 oak-md1-s2 kernel: LDISKFS-fs error (device dm-0): ldiskfs_map_blocks:594: inode #659619751: block 774843950: comm mdt00_100: lblock 0 mapped to illegal pblock (length 1)
      Oct 26 17:26:13 oak-md1-s2 kernel: Aborting journal on device dm-0-8.
      Oct 26 17:26:13 oak-md1-s2 kernel: LustreError: 3844:0:(osd_handler.c:1586:osd_trans_commit_cb()) transaction @0xffff881b024c6a80 commit error: 2
      Oct 26 17:26:13 oak-md1-s2 kernel: LDISKFS-fs (dm-0): Remounting filesystem read-only
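
      The exact migration command isn't recorded here; for reference, a minimal sketch of the usual form of an MDT-to-MDT directory migration, where the path is hypothetical and the target is MDT index 1:

      # Migrate the directory tree's metadata from MDT0000 to MDT0001;
      # the path below is only an example, not the real test directory.
      lfs migrate -m 1 /oak/some/test_directory
      # Confirm the directory's MDT striping afterwards.
      lfs getdirstripe /oak/some/test_directory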
      

      I performed an fsck with the new e2fsprogs-1.44.3.wc1-0.el7.x86_64 (see the sketch below).
      I'm not sure this fixed our issue, as I don't see any reference to inode 659619751 in its output. I'm attaching the full fsck logs.
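
      A minimal sketch of how such a check is typically run with the MDT unmounted; the device name is taken from the debugfs session further down in this ticket:

      # Read-only pass first, then a repair pass (MDT must be unmounted).
      e2fsck -fn /dev/mapper/md1-rbod1-mdt0
      e2fsck -fy /dev/mapper/md1-rbod1-mdt0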

      Fortunately, MDT0 then restarted without problems, but I haven't touched the partially migrated directory yet, to avoid further issues on a Friday night on this production system. My feeling is that lfs migrate hit some migrated inode that somehow got corrupted on MDT0. The system has been working fine for weeks now (we only have the issue reported in LU-11205 regarding changelog_clear errors), so I assume this is due to my running lfs migrate. I can do more troubleshooting next week. Any recommendation on how to avoid this in the future is welcome, thanks much!

      Stephane

      Attachments

        Issue Links

          Activity

            ys Yang Sheng added a comment -

            Closing as a duplicate of LU-10581.

            ys Yang Sheng added a comment -

            Hi, Stephane,

            Looks like it is similar to LU-10581: some code decrements nlink and then fails on its exit path but leaves the inode as is, so the cached inode is inconsistent with the on-disk data. I'll try to dig out where that happens. Thanks for your help.

            Thanks,
            YangSheng


            sthiell Stephane Thiell added a comment -

            Hi YangSheng!

            Results:

            [root@oak-md1-s2 ~]# debugfs /dev/mapper/md1-rbod1-mdt0
            debugfs 1.44.3.wc1 (23-July-2018)
            debugfs:  ncheck 564154962
            Inode	Pathname
            564154962	/REMOTE_PARENT_DIR/0x200002b64:0x65a2:0x0/openssl-1.0.2l-0/ssl/cacert.pem
            debugfs:  stat /REMOTE_PARENT_DIR/0x200002b64:0x65a2:0x0/openssl-1.0.2l-0/ssl/cacert.pem
            Inode: 564154962   Type: regular    Mode:  0640   Flags: 0x0
            Generation: 529132813    Version: 0x00000005:44289388
            User: 282232   Group:  3199   Size: 0
            File ACL: 0
            Links: 1   Blockcount: 0
            Fragment:  Address: 0    Number: 0    Size: 0
             ctime: 0x59790e8c:00000000 -- Wed Jul 26 14:50:04 2017
             atime: 0x597945cf:00000000 -- Wed Jul 26 18:45:51 2017
             mtime: 0x59270fc5:00000000 -- Thu May 25 10:09:25 2017
            crtime: 0x59790e35:e633c9d8 -- Wed Jul 26 14:48:37 2017
            Size of extra inode fields: 28
            Extended attributes:
              system.posix_acl_access (28) = 01 00 00 00 01 00 06 00 04 00 07 00 08 00 07 00 7f 0c 00 00 10 00 04 00 20 00 00 00 
              trusted.lma (24) = 00 00 00 00 00 00 00 00 64 2b 00 00 02 00 00 00 11 ce 00 00 00 00 00 00 
              lma: fid=[0x200002b64:0xce11:0x0] compat=0 incompat=0
              trusted.lov (56)
              trusted.link (80)
            BLOCKS:
            
            debugfs:
            

            On a client:

            [root@oak-gw02 ~]# lfs fid2path /oak [0x200002b64:0xce11:0x0]
            /oak/stanford/groups/ruthm/sthiell/anaconda2.off/pkgs/openssl-1.0.2l-0/ssl/cacert.pem
            /oak/stanford/groups/ruthm/sthiell/anaconda2.off/ssl/cacert.pem
            [root@oak-gw02 ~]# 
            

             

            ys Yang Sheng added a comment -

            Hi, Stephane,

            I have extracted some data from the vmcore, as below:

            crash> struct osd_object ffff88202b660400 -x
            struct osd_object {
              oo_dt = {
                do_lu = {
                  lo_header = 0xffff881c939377b0,
                  lo_dev = 0xffff882027e22000,
                  lo_ops = 0xffffffffc1058060 <osd_lu_obj_ops>,
                  lo_linkage = {
                    next = 0xffff881c939377f0,
                    prev = 0xffff881c87ad52c8
                  },
                  lo_dev_ref = {<No data fields>}
                },
                do_ops = 0xffffffffc1057f60 <osd_obj_ops>,
                do_body_ops = 0xffffffffc1059b20 <osd_body_ops>,
                do_index_ops = 0x0
              },
              oo_inode = 0xffff88125f9d9a38,
              oo_hl_head = 0xffff881c87497200,
              oo_ext_idx_sem = {
                {
                  count = {
                    counter = 0x0
                  },
                  __UNIQUE_ID_rh_kabi_hide3 = {
                    count = 0x0
                  },
                  {<No data fields>}
                },
                wait_lock = {
                  raw_lock = {
                    val = {
                      counter = 0x0
                    }
                  }
                },
                osq = {
                  tail = {
                    counter = 0x0
                  }
                },
                wait_list = {
                  next = 0xffff88202b660460
                },
                owner = 0x0
              },
              oo_sem = {
                {
                  count = {
                    counter = 0x0
                  },
                  __UNIQUE_ID_rh_kabi_hide3 = {
                    count = 0x0
                  },
                  {<No data fields>}
                },
                wait_lock = {
                  raw_lock = {
                    val = {
                      counter = 0x0
                    }
                  }
                },
                osq = {
                  tail = {
                    counter = 0x0
                  }
                },
                wait_list = {
                  next = 0xffff88202b660480
                },
                owner = 0x0
              },
              oo_dir = 0x0,
              oo_guard = {
                {
                  rlock = {
                    raw_lock = {
                      val = {
                        counter = 0x0
                      }
                    }
                  }
                }
              },
              oo_destroyed = 0x0,
              oo_lma_flags = 0x0,
              oo_compat_dot_created = 0x1,
              oo_compat_dotdot_created = 0x1,
              oo_owner = 0x0,
              oo_xattr_list = {
                next = 0xffff88202b6604b8,
                prev = 0xffff88202b6604b8
              }
            }
            .....
            crash> struct inode 0xffff88125f9d9a38
            struct inode {
              i_mode = 33184,
              i_opflags = 0,
              i_uid = {
                val = 282232
              },
              i_gid = {
                val = 3199
              },
              i_flags = 128,
              i_acl = 0xffffffffffffffff,
              i_default_acl = 0xffffffffffffffff,
              i_op = 0xffffffffc0f5afc0 <ldiskfs_file_inode_operations>,
              i_sb = 0xffff8802d8e7b000,
              i_mapping = 0xffff88125f9d9b88,
              i_security = 0x0,
              i_ino = 564154962,
              {
                i_nlink = 0,
                __i_nlink = 0
              },
              i_rdev = 0,
              i_size = 0,
              i_atime = {
                tv_sec = 1501119951,
                tv_nsec = 0
              },
            ...........
            crash> struct lu_object_header 0xffff881c939377b0 -x
            struct lu_object_header {
              loh_fid = {
                f_seq = 0x200002b64,
                f_oid = 0xce11,
                f_ver = 0x0
              },
              loh_flags = 0x0,
              loh_ref = {
                counter = 0x0
              },
              loh_attr = 0x8001,
              loh_hash = {
                next = 0x0,
                pprev = 0xffffc90016edd928
              },
              loh_lru = {
                next = 0xffff881c939377e0,
                prev = 0xffff881c939377e0
              },
              loh_layers = {
                next = 0xffff881c93937818,
                prev = 0xffff88202b660418
              },
              loh_reference = {<No data fields>}
            }
            ..........
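
            For reference, a minimal sketch of how this kind of data can be pulled from the dump with crash(8); the vmcore path is hypothetical and the debuginfo paths assume the packages listed elsewhere in this ticket:

            # Open the dump with the matching kernel debuginfo (paths are assumptions).
            crash /usr/lib/debug/lib/modules/3.10.0-693.2.2.el7_lustre.pl1.x86_64/vmlinux \
                  /path/to/LU-11578-vmcore-oak-md1-s2-2018-11-06-11-00-23
            # Load the Lustre OSD module symbols, then dump the structures of interest.
            crash> mod -s osd_ldiskfs
            crash> struct osd_object ffff88202b660400 -x
            crash> struct inode 0xffff88125f9d9a38
            crash> struct lu_object_header 0xffff881c939377b0 -x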
            

            So could you please help collect the following info:

            on MDS:
            # debugfs [your mds device]
            debugfs: ncheck 564154962
            Inode	Pathname
            564154962   [pathname]
            debugfs: stat [pathname]
            paste output in here
            ......
            On client:
            # lfs fid2path <your mount point> [0x200002b64:0xce11:0x0]
            
            

            Thanks,
            YangSheng

            ys Yang Sheng added a comment -

            Hi, Stephane,

            Sorry for the delay. I have downloaded the packages and will start analyzing them soon. Thanks for your patience.

            Thanks,
            YangSheng


            sthiell Stephane Thiell added a comment -

            Hey YangSheng,

            Have you been able to take a look at this? Let me know if you need anything else from me.

            Thanks!
            Stephane


            sthiell Stephane Thiell added a comment -

            Hi YangSheng,

            Sure, thanks for taking a look. I just uploaded the vmcore and matching debuginfo rpms:

            • lustre-debuginfo-2.10.4-1.el7.centos.x86_64.rpm
            • kernel-debuginfo-3.10.0-693.2.2.el7_lustre.pl1.x86_64.rpm
            • kernel-debuginfo-common-x86_64-3.10.0-693.2.2.el7_lustre.pl1.x86_64.rpm
            • and one of the vmcores as LU-11578-vmcore-oak-md1-s2-2018-11-06-11-00-23

            Thanks!
            Stephane

            ys Yang Sheng added a comment -

            Hi, Stephane,

            Sorry for the late reply. Could you please upload the vmcore to our FTP site, ftp.whamcloud.com? This crash is really strange to me. Please also upload the debuginfo rpms if you still have them; if not, then to confirm what is mentioned in this ticket: you are running 2.10.4 + the 3.10.0-693.2.2 kernel on your MDS?
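
            Something along these lines should work for the upload; the uploads/LU-11578 target directory here is an assumption, so adjust the remote path as needed:

            # Hedged example: anonymous FTP upload of the vmcore and debuginfo rpms.
            curl -T <vmcore-file> ftp://ftp.whamcloud.com/uploads/LU-11578/ --ftp-create-dirs
            curl -T <debuginfo-rpm> ftp://ftp.whamcloud.com/uploads/LU-11578/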

            Thanks,
            YangSheng


            sthiell Stephane Thiell added a comment -

            Hi YangSheng,

            Any ideas of what could have happened here?

            Thanks!
            Stephane


            sthiell Stephane Thiell added a comment -

            I tried a second time after shutting down the client that ran the lfs migrate, but the same LBUG happened. I was able to put the MDT back into production on a third attempt by running an abort_recovery after most of the clients had recovered (see the sketch after the log below).

            [  156.180372] Lustre: oak-MDT0000: Denying connection for new client cdb722e9-f2fd-2e1e-3be0-ae3ca5c64dbb(at 10.8.15.2@o2ib6), waiting for 1417 known clients (1406 recovered, 9 in progress, and 0 evicted) to recover in 8:45
            [  156.202310] Lustre: Skipped 1 previous similar message
            [  159.651156] LustreError: 3530:0:(mdt_handler.c:6167:mdt_iocontrol()) oak-MDT0000: Aborting recovery for device
            [  159.662323] LustreError: 3530:0:(ldlm_lib.c:2606:target_stop_recovery_thread()) oak-MDT0000: Aborting recovery
            [  159.673500] Lustre: 3850:0:(ldlm_lib.c:2059:target_recovery_overseer()) recovery is aborted, evict exports in recovery
            [  159.685661] Lustre: oak-MDT0000: disconnecting 11 stale clients
            [  159.720906] Lustre: 3850:0:(ldlm_lib.c:1619:abort_req_replay_queue()) @@@ aborted:  req@ffff88100d339200 x1616374803034688/t0(279195624051) o36->ebc6458d-1c97-7178-4cec-be2598c570b5@10.9.114.12@o2ib4:319/0 lens 624/0 e 0 to 0 dl 1541532139 ref 1 fl Complete:/4/ffffffff rc 0/-1
            [  159.748315] LustreError: 3850:0:(ldlm_lib.c:1640:abort_lock_replay_queue()) @@@ aborted:  req@ffff88102435a100 x1610729933017152/t0(0) o101->10a6d338-b751-f002-23d5-4f33526aa200@10.12.4.29@o2ib:537/0 lens 328/0 e 0 to 0 dl 1541532357 ref 1 fl Complete:/40/ffffffff rc 0/-1
            [  159.789840] Lustre: oak-MDT0000: Recovery over after 0:55, of 1417 clients 1406 recovered and 11 were evicted.
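
            For reference, an abort_recovery is normally issued with lctl on the MDS, roughly as follows; the device name is taken from the log above:

            # On the MDS: check recovery status, then abort the remaining recovery.
            lctl get_param mdt.oak-MDT0000.recovery_status
            lctl --device oak-MDT0000 abort_recovery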
            

            I have two vmcores of the above LBUG, and system logs are also available if needed.

            Best,
            Stephane


            People

              Assignee: ys Yang Sheng
              Reporter: sthiell Stephane Thiell
              Votes: 0
              Watchers: 5
