LU-8527

Lustre 2.8 server crashed on bring up during large scale test shot

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version: Lustre 2.9.0
    • Affects Version: Lustre 2.8.0
    • Environment: RHEL 6.7 servers running the latest Lustre 2.8, with the latest Lustre 2.8 clients
    • Severity: 3

    Description

      In our first attempt to test Lustre 2.8 servers at scale, the MDS crashed during bring-up. The backtrace is:

      <2>[ 8011.527071] kernel BUG at fs/jbd2/transaction.c:1028!
      <4>[ 8011.532938] invalid opcode: 0000 [#1] SMP
      <4>[ 8011.537761] last sysfs file: /sys/devices/virtual/block/dm-7/uevent
      <4>[ 8011.544986] CPU 5
      <4>[ 8011.547040] Modules linked in: osp(U) mdd(U) lod(U) mdt(U) lfsck(U) mgc(U) osd_ldiskfs(U) ldiskfs(U) mbcache jbd2 lquota(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ptlrpc(U) obdclass(U) ko2iblnd(U) lnet(U) sha512_generic crc32c_intel libcfs(U) mpt2sas mptctl mptbase dell_rbu autofs4 8021q garp stp llc nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_REJECT xt_comment nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx4_ib ib_sa ib_mad ib_core ib_addr dm_mirror dm_region_hash dm_log dm_round_robin scsi_dh_rdac dm_multipath dm_mod sg ipmi_devintf sd_mod crc_t10dif iTCO_wdt iTCO_vendor_support microcode wmi power_meter acpi_ipmi ipmi_si ipmi_msghandler dcdbas mpt3sas scsi_transport_sas raid_class shpchp sb_edac edac_core lpc_ich mfd_core ahci ipv6 nfs lockd fscache auth_rpcgss nfs_acl sunrpc tg3 mlx4_en ptp pps_core mlx4_core [last unloaded: scsi_wait_scan]
      <4>[ 8011.646238]
      <4>[ 8011.648140] Pid: 30211, comm: mdt01_503 Not tainted 2.6.32-573.26.1.el6.dne2.x86_64 #1 Dell Inc. PowerEdge R630/0CNCJW
      <4>[ 8011.660578] RIP: 0010:[<ffffffffa0c9d92d>] [<ffffffffa0c9d92d>] jbd2_journal_dirty_metadata+0x10d/0x150 [jbd2]
      <4>[ 8011.672316] RSP: 0018:ffff883e6751f580 EFLAGS: 00010246
      <4>[ 8011.678473] RAX: ffff883e3a8be380 RBX: ffff883ce650e3d8 RCX: ffff883e9a7656e0
      <4>[ 8011.686676] RDX: 0000000000000000 RSI: ffff883e9a7656e0 RDI: ffff883ce650e3d8
      <4>[ 8011.694873] RBP: ffff883e6751f5a0 R08: c010000000000000 R09: 0000000000000000
      <4>[ 8011.703072] R10: 0000000000000001 R11: 0000000000000008 R12: ffff883df92c8ba8
      <4>[ 8011.716541] R13: ffff883e9a7656e0 R14: ffff884023ca1800 R15: 0000000000000058
      <4>[ 8011.724738] FS: 0000000000000000(0000) GS:ffff88018fca0000(0000) knlGS:0000000000000000
      <4>[ 8011.734220] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
      <4>[ 8011.740863] CR2: 00007fc59a79b380 CR3: 0000000001a8d000 CR4: 00000000001407e0
      <4>[ 8011.749059] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      <4>[ 8011.757263] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      <4>[ 8011.765460] Process mdt01_503 (pid: 30211, threadinfo ffff883e6751c000, task ffff883e6750c040)
      <4>[ 8011.775518] Stack:
      <4>[ 8011.777983] ffff883ce650e3d8 ffffffffa0d827b0 ffff883e9a7656e0 0000000000000000
      <4>[ 8011.786313] <d> ffff883e6751f5e0 ffffffffa0cc0fab ffff883e6751f5e0 ffffffffa0ccf968
      <4>[ 8011.795363] <d> 0000000000000000 0000000000000058 0000000000000058 ffff883e9a7656e0
      <4>[ 8011.804839] Call Trace:
      <4>[ 8011.807798] [<ffffffffa0cc0fab>] __ldiskfs_handle_dirty_metadata+0x7b/0x100 [ldiskfs]
      <4>[ 8011.817097] [<ffffffffa0ccf968>] ? ldiskfs_bread+0x18/0x80 [ldiskfs]
      <4>[ 8011.824517] [<ffffffffa0d6635c>] osd_ldiskfs_write_record+0xec/0x340 [osd_ldiskfs]
      <4>[ 8011.833519] [<ffffffffa0d690a3>] osd_write+0x183/0x5b0 [osd_ldiskfs]
      <4>[ 8011.840965] [<ffffffffa05f8b4d>] dt_record_write+0x3d/0x130 [obdclass]
      <4>[ 8011.848680] [<ffffffffa05bb44f>] llog_osd_write_rec+0xb6f/0x1ad0 [obdclass]
      <4>[ 8011.856787] [<ffffffffa0d7ba0b>] ? dynlock_unlock+0x16b/0x1d0 [osd_ldiskfs]
      <4>[ 8011.864894] [<ffffffffa05a9426>] llog_write_rec+0xb6/0x270 [obdclass]
      <4>[ 8011.872419] [<ffffffffa05b2893>] llog_cat_add_rec+0x1c3/0x7b0 [obdclass]
      <4>[ 8011.880234] [<ffffffffa05a9239>] llog_add+0x89/0x1c0 [obdclass]
      <4>[ 8011.887176] [<ffffffffa108641d>] osp_sync_add_rec+0x26d/0x9b0 [osp]
      <4>[ 8011.894501] [<ffffffffa1086c07>] osp_sync_add+0x77/0x80 [osp]
      <4>[ 8011.901248] [<ffffffffa0fb502e>] ? lod_sub_get_thandle+0x24e/0x3c0 [lod]
      <4>[ 8011.909060] [<ffffffffa1077823>] osp_object_destroy+0x173/0x230 [osp]
      <4>[ 8011.916582] [<ffffffffa0fb85ad>] lod_sub_object_destroy+0x1fd/0x440 [lod]

          Activity


            Lustre: 18235:0:(osd_handler.c:1272:osd_trans_dump_creds()) attr_set: 2/2/1, xattr_set: 1011/92/0
            Lustre: 18235:0:(osd_handler.c:1282:osd_trans_dump_creds()) write: 5043/43356/0, punch: 0/0/0, quota 4/52/0
            Lustre: 18235:0:(osd_handler.c:1289:osd_trans_dump_creds()) insert: 1009/17152/0, delete: 2/9/1
            Lustre: 18235:0:(osd_handler.c:1296:osd_trans_dump_creds()) ref_add: 1/1/0, ref_del: 2/2/1

            the transaction declared at least 2+92+43356+52+17152+9+2+2 = 60667 credits, while it consumed only 1+1+1 = 3.
            well, probably a bit more since we don't track everything, but still way less than 60667.
            it looks like a false assertion, though I don't understand the root cause yet.

            bzzz Alex Zhuravlev added a comment
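
            For illustration, here is a minimal standalone sketch of the declared-vs-executed bookkeeping that osd_trans_dump_creds() reports and osd_trans_exec_op() enforces. It is not the actual osd-ldiskfs code; the structure, field names, and the warn-or-assert knob are simplified assumptions modelled on the comments in this ticket:

            /*
             * Simplified sketch (NOT the real osd-ldiskfs code): each operation type
             * records how many operations and jbd2 credits were declared before the
             * transaction started; executing an operation more times than it was
             * declared either warns or asserts, depending on a
             * track_declares_assert-style knob.
             */
            #include <assert.h>
            #include <stdio.h>

            enum op_type { OP_ATTR_SET, OP_WRITE, OP_INSERT, OP_DELETE, OP_MAX };

            struct trans_creds {
                    unsigned declared_ops[OP_MAX];      /* operations declared per type */
                    unsigned declared_credits[OP_MAX];  /* jbd2 credits reserved per type */
                    unsigned used_ops[OP_MAX];          /* operations actually executed */
            };

            static int track_declares_assert = 1;       /* 1: assert on mismatch, 0: warn only */

            static void declare_op(struct trans_creds *tc, enum op_type op, unsigned credits)
            {
                    tc->declared_ops[op]++;
                    tc->declared_credits[op] += credits;
            }

            static void exec_op(struct trans_creds *tc, enum op_type op)
            {
                    if (tc->used_ops[op] >= tc->declared_ops[op]) {
                            fprintf(stderr, "op %d executed but only %u declared\n",
                                    op, tc->declared_ops[op]);
                            if (track_declares_assert)
                                    assert(0);          /* stands in for LBUG() */
                    }
                    tc->used_ops[op]++;
            }

            int main(void)
            {
                    struct trans_creds tc = { 0 };

                    declare_op(&tc, OP_ATTR_SET, 2);    /* cf. the "attr_set: ..." dump lines */
                    declare_op(&tc, OP_WRITE, 43356);

                    exec_op(&tc, OP_ATTR_SET);          /* fine: declared above */
                    exec_op(&tc, OP_DELETE);            /* never declared -> warn or assert */
                    return 0;
            }

            Per the dump quoted above, the transaction declared far more credits than it consumed, so the open question is why the per-operation check still fired.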

            I'm trying to come up with a debugging patch.

            bzzz Alex Zhuravlev added a comment

            Any updates?

            simmonsja James A Simmons added a comment
            yujian Jian Yu added a comment -

            Hi Alex,

            It's Lustre b2_8_fe branch with commit 71badf2 on the tip.


            Please tell us the latest commit for the branch you're using.

            bzzz Alex Zhuravlev added a comment

            The first crash happened while running IOR on a large-stripe single shared file. After the first crash, this bug came back at random times.

            simmonsja James A Simmons added a comment

            It seems possible that this is related to creating or deleting a wide-striped file. I found LU-3102, which may be related.

            adilger Andreas Dilger added a comment
            yujian Jian Yu added a comment -

            Thank you Dustin for the logs.

            Hi Alex,

            Could you please look into the logs? And although patch http://review.whamcloud.com/22058 doesn't fix the issue in this ticket, do you think it is still worth having for Lustre b2_8_fe servers?


            I cut the log down to show times from 7:11PM (recovery started and /proc/fs/lustre/osd-ldiskfs enabled) to 7:18PM when the crashdump kernel took over. Please see output below:

            Aug 23 19:05:18 atlas1-mds1.ccs.ornl.gov gedi;lustre-mgmt1.ccs.ornl.gov;test_lustre kernel: Kernel logging (proc) stopped.
            Aug 23 19:05:18 atlas1-mds1.ccs.ornl.gov kernel: imklog 5.8.10, log source = /proc/kmsg started.
            Aug 23 19:05:47 atlas1-mds1.ccs.ornl.gov kernel: [ 138.744691] LNet: HW CPU cores: 8, npartitions: 2
            Aug 23 19:05:47 atlas1-mds1.ccs.ornl.gov kernel: [ 138.757447] alg: No test for adler32 (adler32-zlib)
            Aug 23 19:05:47 atlas1-mds1.ccs.ornl.gov kernel: [ 138.763082] alg: No test for crc32 (crc32-table)
            Aug 23 19:05:47 atlas1-mds1.ccs.ornl.gov kernel: [ 138.768422] alg: No test for crc32 (crc32-pclmul)
            Aug 23 19:05:55 atlas1-mds1.ccs.ornl.gov kernel: [ 146.861939] Lustre: Lustre: Build Version: 2.8.0-g242e67d-CHANGED-2.6.32-573.26.1.el6.dne2.x86_64
            Aug 23 19:05:56 atlas1-mds1.ccs.ornl.gov kernel: [ 147.181854] LNet: Added LNI 10.36.226.117@o2ib [63/2560/0/180]
            Aug 23 19:05:56 atlas1-mds1.ccs.ornl.gov kernel: [ 147.452347] LNetError: 2837:0:(o2iblnd_cb.c:2310:kiblnd_passive_connect()) Can't accept conn from 10.36.230.233@o2ib200 on NA (ib0:1:10.36.226.117): bad dst nid 10.36.226.117@o2ib200
            Aug 23 19:05:56 atlas1-mds1.ccs.ornl.gov kernel: [ 147.589168] LNet: Added LNI 10.36.226.117@o2ib200 [63/2560/0/180]
            Aug 23 19:10:27 atlas1-mds1.ccs.ornl.gov kernel: [ 419.034586] LDISKFS-fs (dm-7): warning: mounting fs with errors, running e2fsck is recommended
            Aug 23 19:10:27 atlas1-mds1.ccs.ornl.gov kernel: [ 419.050076] LDISKFS-fs (dm-7): recovery complete
            Aug 23 19:10:27 atlas1-mds1.ccs.ornl.gov kernel: [ 419.065246] LDISKFS-fs (dm-7): mounted filesystem with ordered data mode. quota=on. Opts:
            Aug 23 19:10:29 atlas1-mds1.ccs.ornl.gov kernel: [ 420.483435] Lustre: atlas1-MDT0000: Not available for connect from 17469@gni100 (not set up)
            Aug 23 19:10:31 atlas1-mds1.ccs.ornl.gov kernel: [ 422.098522] Lustre: atlas1-MDT0000: Imperative Recovery enabled, recovery window shrunk from 1800-5400 down to 900-2700
            Aug 23 19:10:32 atlas1-mds1.ccs.ornl.gov kernel: [ 423.169261] Lustre: atlas1-MDT0000: Will be in recovery for at least 15:00, or until 18437 clients reconnect
            Aug 23 19:10:32 atlas1-mds1.ccs.ornl.gov kernel: [ 423.180828] Lustre: atlas1-MDT0000: Connection restored to 7e9ab0dc-95d7-895f-67d0-472e66bcea23 (at 232@gni100)
            Aug 23 19:10:33 atlas1-mds1.ccs.ornl.gov kernel: [ 424.194505] Lustre: atlas1-MDT0000: Connection restored to b8f051f0-79b3-229d-0cab-f02d9100888a (at 18990@gni100)
            Aug 23 19:10:34 atlas1-mds1.ccs.ornl.gov kernel: [ 425.308076] Lustre: atlas1-MDT0000: Connection restored to 6954d66d-f1d9-df22-81b6-a2a84a767326 (at 17860@gni100)
            Aug 23 19:10:34 atlas1-mds1.ccs.ornl.gov kernel: [ 425.320064] Lustre: Skipped 5 previous similar messages
            Aug 23 19:10:36 atlas1-mds1.ccs.ornl.gov kernel: [ 427.309298] Lustre: atlas1-MDT0000: Connection restored to 97ba02a4-3cbf-2339-af5c-bc672e524fcf (at 15033@gni100)
            Aug 23 19:10:36 atlas1-mds1.ccs.ornl.gov kernel: [ 427.321286] Lustre: Skipped 249 previous similar messages
            Aug 23 19:10:40 atlas1-mds1.ccs.ornl.gov kernel: [ 431.312192] Lustre: atlas1-MDT0000: Connection restored to 32e83f22-2827-c76a-be5a-444297abac0f (at 1031@gni100)
            Aug 23 19:10:40 atlas1-mds1.ccs.ornl.gov kernel: [ 431.324082] Lustre: Skipped 11156 previous similar messages
            Aug 23 19:12:02 atlas1-mds1.ccs.ornl.gov kernel: [ 513.742286] Lustre: atlas1-MDT0000: Connection restored to 68400d40-3736-e9e3-b4eb-7d2f1a9c2a9b (at 10.36.225.10@o2ib)
            Aug 23 19:12:02 atlas1-mds1.ccs.ornl.gov kernel: [ 513.754757] Lustre: Skipped 2849 previous similar messages
            Aug 23 19:12:41 atlas1-mds1.ccs.ornl.gov kernel: [ 552.533598] Lustre: atlas1-MDT0000: Connection restored to d6edd526-b008-01f3-beb9-910b627acce0 (at 10.36.205.218@o2ib)
            Aug 23 19:13:32 atlas1-mds1.ccs.ornl.gov kernel: [ 603.859284] Lustre: atlas1-MDT0000: Connection restored to fdf3ca98-024f-2d65-83bc-b10712d608f5 (at 10.36.205.209@o2ib)
            Aug 23 19:13:32 atlas1-mds1.ccs.ornl.gov kernel: [ 603.871786] Lustre: Skipped 1 previous similar message
            Aug 23 19:14:51 atlas1-mds1.ccs.ornl.gov kernel: [ 682.177088] Lustre: atlas1-MDT0000: Connection restored to c281fcc2-9ed2-632b-dfca-a063cbb468fd (at 16884@gni100)
            Aug 23 19:14:51 atlas1-mds1.ccs.ornl.gov kernel: [ 682.189010] Lustre: Skipped 1677 previous similar messages
            Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.646938] Lustre: 18235:0:(osd_handler.c:1265:osd_trans_dump_creds()) create: 1008/8064/0, destroy: 1/4/1
            Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.658473] Lustre: 18235:0:(osd_handler.c:1272:osd_trans_dump_creds()) attr_set: 2/2/1, xattr_set: 1011/92/0
            Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.670198] Lustre: 18235:0:(osd_handler.c:1282:osd_trans_dump_creds()) write: 5043/43356/0, punch: 0/0/0, quota 4/52/0
            Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.682891] Lustre: 18235:0:(osd_handler.c:1289:osd_trans_dump_creds()) insert: 1009/17152/0, delete: 2/9/1
            Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.694423] Lustre: 18235:0:(osd_handler.c:1296:osd_trans_dump_creds()) ref_add: 1/1/0, ref_del: 2/2/1
            Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.703059] Lustre: atlas1-MDT0000: Recovery over after 4:34, of 18437 clients 18437 recovered and 0 were evicted.
            Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.717547] LustreError: 18235:0:(osd_internal.h:1073:osd_trans_exec_op()) atlas1-MDT0000-osd: op = 7, rb = 7
            Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.729087] LustreError: 18235:0:(osd_internal.h:1081:osd_trans_exec_op()) LBUG
            Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.737712] Pid: 18235, comm: mdt01_399
            Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.742233]
            Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.742233] Call Trace:
            Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.747087] [<ffffffffa0423875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
            Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.755118] [<ffffffffa0423e77>] lbug_with_loc+0x47/0xb0 [libcfs]
            Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.762280] [<ffffffffa0c4b340>] osd_write+0x420/0x5b0 [osd_ldiskfs]
            Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.769753] [<ffffffffa0569b4d>] dt_record_write+0x3d/0x130 [obdclass]
            Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.777402] [<ffffffffa052c44f>] llog_osd_write_rec+0xb6f/0x1ad0 [obdclass]
            Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.785527] [<ffffffffa051a426>] llog_write_rec+0xb6/0x270 [obdclass]
            Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.793080] [<ffffffffa0523893>] llog_cat_add_rec+0x1c3/0x7b0 [obdclass]
            Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.800936] [<ffffffffa051a239>] llog_add+0x89/0x1c0 [obdclass]
            Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.807904] [<ffffffffa0f1e41d>] osp_sync_add_rec+0x26d/0x9b0 [osp]
            Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.815261] [<ffffffffa0f1ec07>] osp_sync_add+0x77/0x80 [osp]
            Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.822015] [<ffffffffa0f0f823>] osp_object_destroy+0x173/0x230 [osp]
            Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.829581] [<ffffffffa0e635ad>] lod_sub_object_destroy+0x1fd/0x440 [lod]
            Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.837516] [<ffffffffa0e5745b>] lod_object_destroy+0x36b/0x770 [lod]
            Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.845067] [<ffffffffa042ed81>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
            Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.852726] [<ffffffffa0ebde6b>] mdd_finish_unlink+0x28b/0x3d0 [mdd]
            Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.860164] [<ffffffffa042c1d8>] ? libcfs_log_return+0x28/0x40 [libcfs]
            Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.867922] [<ffffffffa0ec2845>] mdd_unlink+0xab5/0xf70 [mdd]
            Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.874700] [<ffffffffa042ed81>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
            Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.882383] [<ffffffffa0d800c8>] mdo_unlink+0x18/0x50 [mdt]
            Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.888962] [<ffffffffa0d88f70>] mdt_reint_unlink+0xbb0/0x1060 [mdt]
            Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.896416] [<ffffffffa0d8015d>] mdt_reint_rec+0x5d/0x200 [mdt]
            Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.903394] [<ffffffffa0d6bddb>] mdt_reint_internal+0x62b/0x9f0 [mdt]
            Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.910960] [<ffffffffa0d6c63b>] mdt_reint+0x6b/0x120 [mdt]
            Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.917602] [<ffffffffa07e79cc>] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
            Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.925557] [<ffffffffa0794a91>] ptlrpc_main+0xd21/0x1800 [ptlrpc]
            Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.932824] [<ffffffff81539b0e>] ? thread_return+0x4e/0x7d0
            Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.939414] [<ffffffffa0793d70>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
            Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.946690] [<ffffffff810a138e>] kthread+0x9e/0xc0
            Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.952407] [<ffffffff8100c28a>] child_rip+0xa/0x20
            Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.958206] [<ffffffff810a12f0>] ? kthread+0x0/0xc0
            Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.964016] [<ffffffff8100c280>] ? child_rip+0x0/0x20
            Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.970033]
            Aug 23 19:19:57 atlas1-mds1.ccs.ornl.gov gedi;lustre-mgmt1.ccs.ornl.gov;test_lustre kernel: imklog 5.8.10, log source = /proc/kmsg started.
            Aug 23 19:19:57 atlas1-mds1.ccs.ornl.gov gedi;lustre-mgmt1.ccs.ornl.gov;test_lustre kernel: [ 0.000000] Initializing cgroup subsys cpuset
            Aug 23 19:19:57 atlas1-mds1.ccs.ornl.gov gedi;lustre-mgmt1.ccs.ornl.gov;test_lustre kernel: [ 0.000000] Initializing cgroup subsys cpu
            Aug 23 19:19:57 atlas1-mds1.ccs.ornl.gov gedi;lustre-mgmt1.ccs.ornl.gov;test_lustre kernel: [ 0.000000] Linux version 2.6.32-573.26.1.el6.dne2.x86_64 (jsimmons@lgmgmt1) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-16) (GCC) ) #1 SMP Thu Jul 21 13:22:19 EDT 2016
            Aug 23 19:19:57 atlas1-mds1.ccs.ornl.gov gedi;lustre-mgmt1.ccs.ornl.gov;test_lustre kernel: [ 0.000000] Command line: initrd=initrd-2.6.32-573.26.1.el6.dne2.x86_64-gedi selinux=0 audit=0 panic=10 console=tty0 console=ttyS1,115200n8 console=ttyS2,115200n8 crashkernel=192M init=/gedi-preinit BOOT_IMAGE=vmlinuz-2.6.32-573.26.1.el6.dne2.x86_64 BOOTIF=01-20-47-47-7e-c5-e0

            dustb100 Dustin Leverman added a comment

            LOD itself doesn't declare credits, but it asks OSD to declare operations. Is it possible to load the osd-ldiskfs module with ldiskfs_track_declares_assert=1 and attach dmesg output starting a bit earlier? Hopefully we'll hit our assert with additional information. Thanks in advance.

            bzzz Alex Zhuravlev added a comment
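
            As a concrete example of that suggestion (hedged: the file name below is illustrative, and ldiskfs_track_declares_assert is assumed to be an ordinary module parameter of osd_ldiskfs, as the comment above implies):

                # set the parameter when the module is loaded
                modprobe osd_ldiskfs ldiskfs_track_declares_assert=1

                # or persistently, e.g. via a line in /etc/modprobe.d/lustre.conf
                options osd_ldiskfs ldiskfs_track_declares_assert=1

                # if the parameter is writable at runtime, it may also be toggled via sysfs
                echo 1 > /sys/module/osd_ldiskfs/parameters/ldiskfs_track_declares_assert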

            No, patch 22058 doesn't fix it for us. We are in a state of constant crashing when the MDS finishes recovery.

            simmonsja James A Simmons added a comment

            People

              Assignee: bobijam Zhenyu Xu
              Reporter: simmonsja James A Simmons
              Votes: 0
              Watchers: 11
