I cut the log down to show times from 7:11PM (recovery
started and /proc/fs/lustre/osd-ldiskfs enabled) to 7:18PM when the
crashdump kernel took over. Please see output below:
Aug 23 19:05:18 atlas1-mds1.ccs.ornl.gov gedi;lustre-mgmt1.ccs.ornl.gov;test_lustre kernel: Kernel logging (proc) stopped.
Aug 23 19:05:18 atlas1-mds1.ccs.ornl.gov kernel: imklog 5.8.10, log source = /proc/kmsg started.
Aug 23 19:05:47 atlas1-mds1.ccs.ornl.gov kernel: [ 138.744691] LNet: HW CPU cores: 8, npartitions: 2
Aug 23 19:05:47 atlas1-mds1.ccs.ornl.gov kernel: [ 138.757447] alg: No test for adler32 (adler32-zlib)
Aug 23 19:05:47 atlas1-mds1.ccs.ornl.gov kernel: [ 138.763082] alg: No test for crc32 (crc32-table)
Aug 23 19:05:47 atlas1-mds1.ccs.ornl.gov kernel: [ 138.768422] alg: No test for crc32 (crc32-pclmul)
Aug 23 19:05:55 atlas1-mds1.ccs.ornl.gov kernel: [ 146.861939] Lustre: Lustre: Build Version: 2.8.0-g242e67d-CHANGED-2.6.32-573.26.1.el6.dne2.x86_64
Aug 23 19:05:56 atlas1-mds1.ccs.ornl.gov kernel: [ 147.181854] LNet: Added LNI 10.36.226.117@o2ib [63/2560/0/180]
Aug 23 19:05:56 atlas1-mds1.ccs.ornl.gov kernel: [ 147.452347] LNetError: 2837:0:(o2iblnd_cb.c:2310:kiblnd_passive_connect()) Can't accept conn from 10.36.230.233@o2ib200 on NA (ib0:1:10.36.226.117): bad dst nid 10.36.226.117@o2ib200
Aug 23 19:05:56 atlas1-mds1.ccs.ornl.gov kernel: [ 147.589168] LNet: Added LNI 10.36.226.117@o2ib200 [63/2560/0/180]
Aug 23 19:10:27 atlas1-mds1.ccs.ornl.gov kernel: [ 419.034586] LDISKFS-fs (dm-7): warning: mounting fs with errors, running e2fsck is recommended
Aug 23 19:10:27 atlas1-mds1.ccs.ornl.gov kernel: [ 419.050076] LDISKFS-fs (dm-7): recovery complete
Aug 23 19:10:27 atlas1-mds1.ccs.ornl.gov kernel: [ 419.065246] LDISKFS-fs (dm-7): mounted filesystem with ordered data mode. quota=on. Opts:
Aug 23 19:10:29 atlas1-mds1.ccs.ornl.gov kernel: [ 420.483435] Lustre: atlas1-MDT0000: Not available for connect from 17469@gni100 (not set up)
Aug 23 19:10:31 atlas1-mds1.ccs.ornl.gov kernel: [ 422.098522] Lustre: atlas1-MDT0000: Imperative Recovery enabled, recovery window shrunk from 1800-5400 down to 900-2700
Aug 23 19:10:32 atlas1-mds1.ccs.ornl.gov kernel: [ 423.169261] Lustre: atlas1-MDT0000: Will be in recovery for at least 15:00, or until 18437 clients reconnect
Aug 23 19:10:32 atlas1-mds1.ccs.ornl.gov kernel: [ 423.180828] Lustre: atlas1-MDT0000: Connection restored to 7e9ab0dc-95d7-895f-67d0-472e66bcea23 (at 232@gni100)
Aug 23 19:10:33 atlas1-mds1.ccs.ornl.gov kernel: [ 424.194505] Lustre: atlas1-MDT0000: Connection restored to b8f051f0-79b3-229d-0cab-f02d9100888a (at 18990@gni100)
Aug 23 19:10:34 atlas1-mds1.ccs.ornl.gov kernel: [ 425.308076] Lustre: atlas1-MDT0000: Connection restored to 6954d66d-f1d9-df22-81b6-a2a84a767326 (at 17860@gni100)
Aug 23 19:10:34 atlas1-mds1.ccs.ornl.gov kernel: [ 425.320064] Lustre: Skipped 5 previous similar messages
Aug 23 19:10:36 atlas1-mds1.ccs.ornl.gov kernel: [ 427.309298] Lustre: atlas1-MDT0000: Connection restored to 97ba02a4-3cbf-2339-af5c-bc672e524fcf (at 15033@gni100)
Aug 23 19:10:36 atlas1-mds1.ccs.ornl.gov kernel: [ 427.321286] Lustre: Skipped 249 previous similar messages
Aug 23 19:10:40 atlas1-mds1.ccs.ornl.gov kernel: [ 431.312192] Lustre: atlas1-MDT0000: Connection restored to 32e83f22-2827-c76a-be5a-444297abac0f (at 1031@gni100)
Aug 23 19:10:40 atlas1-mds1.ccs.ornl.gov kernel: [ 431.324082] Lustre: Skipped 11156 previous similar messages
Aug 23 19:12:02 atlas1-mds1.ccs.ornl.gov kernel: [ 513.742286] Lustre: atlas1-MDT0000: Connection restored to 68400d40-3736-e9e3-b4eb-7d2f1a9c2a9b (at 10.36.225.10@o2ib)
Aug 23 19:12:02 atlas1-mds1.ccs.ornl.gov kernel: [ 513.754757] Lustre: Skipped 2849 previous similar messages
Aug 23 19:12:41 atlas1-mds1.ccs.ornl.gov kernel: [ 552.533598] Lustre: atlas1-MDT0000: Connection restored to d6edd526-b008-01f3-beb9-910b627acce0 (at 10.36.205.218@o2ib)
Aug 23 19:13:32 atlas1-mds1.ccs.ornl.gov kernel: [ 603.859284] Lustre: atlas1-MDT0000: Connection restored to fdf3ca98-024f-2d65-83bc-b10712d608f5 (at 10.36.205.209@o2ib)
Aug 23 19:13:32 atlas1-mds1.ccs.ornl.gov kernel: [ 603.871786] Lustre: Skipped 1 previous similar message
Aug 23 19:14:51 atlas1-mds1.ccs.ornl.gov kernel: [ 682.177088] Lustre: atlas1-MDT0000: Connection restored to c281fcc2-9ed2-632b-dfca-a063cbb468fd (at 16884@gni100)
Aug 23 19:14:51 atlas1-mds1.ccs.ornl.gov kernel: [ 682.189010] Lustre: Skipped 1677 previous similar messages
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.646938] Lustre: 18235:0:(osd_handler.c:1265:osd_trans_dump_creds()) create: 1008/8064/0, destroy: 1/4/1
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.658473] Lustre: 18235:0:(osd_handler.c:1272:osd_trans_dump_creds()) attr_set: 2/2/1, xattr_set: 1011/92/0
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.670198] Lustre: 18235:0:(osd_handler.c:1282:osd_trans_dump_creds()) write: 5043/43356/0, punch: 0/0/0, quota 4/52/0
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.682891] Lustre: 18235:0:(osd_handler.c:1289:osd_trans_dump_creds()) insert: 1009/17152/0, delete: 2/9/1
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.694423] Lustre: 18235:0:(osd_handler.c:1296:osd_trans_dump_creds()) ref_add: 1/1/0, ref_del: 2/2/1
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.703059] Lustre: atlas1-MDT0000: Recovery over after 4:34, of 18437 clients 18437 recovered and 0 were evicted.
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.717547] LustreError: 18235:0:(osd_internal.h:1073:osd_trans_exec_op()) atlas1-MDT0000-osd: op = 7, rb = 7
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.729087] LustreError: 18235:0:(osd_internal.h:1081:osd_trans_exec_op()) LBUG
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.737712] Pid: 18235, comm: mdt01_399
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.742233]
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.742233] Call Trace:
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.747087] [<ffffffffa0423875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.755118] [<ffffffffa0423e77>] lbug_with_loc+0x47/0xb0 [libcfs]
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.762280] [<ffffffffa0c4b340>] osd_write+0x420/0x5b0 [osd_ldiskfs]
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.769753] [<ffffffffa0569b4d>] dt_record_write+0x3d/0x130 [obdclass]
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.777402] [<ffffffffa052c44f>] llog_osd_write_rec+0xb6f/0x1ad0 [obdclass]
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.785527] [<ffffffffa051a426>] llog_write_rec+0xb6/0x270 [obdclass]
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.793080] [<ffffffffa0523893>] llog_cat_add_rec+0x1c3/0x7b0 [obdclass]
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.800936] [<ffffffffa051a239>] llog_add+0x89/0x1c0 [obdclass]
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.807904] [<ffffffffa0f1e41d>] osp_sync_add_rec+0x26d/0x9b0 [osp]
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.815261] [<ffffffffa0f1ec07>] osp_sync_add+0x77/0x80 [osp]
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.822015] [<ffffffffa0f0f823>] osp_object_destroy+0x173/0x230 [osp]
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.829581] [<ffffffffa0e635ad>] lod_sub_object_destroy+0x1fd/0x440 [lod]
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.837516] [<ffffffffa0e5745b>] lod_object_destroy+0x36b/0x770 [lod]
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.845067] [<ffffffffa042ed81>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.852726] [<ffffffffa0ebde6b>] mdd_finish_unlink+0x28b/0x3d0 [mdd]
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.860164] [<ffffffffa042c1d8>] ? libcfs_log_return+0x28/0x40 [libcfs]
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.867922] [<ffffffffa0ec2845>] mdd_unlink+0xab5/0xf70 [mdd]
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.874700] [<ffffffffa042ed81>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.882383] [<ffffffffa0d800c8>] mdo_unlink+0x18/0x50 [mdt]
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.888962] [<ffffffffa0d88f70>] mdt_reint_unlink+0xbb0/0x1060 [mdt]
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.896416] [<ffffffffa0d8015d>] mdt_reint_rec+0x5d/0x200 [mdt]
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.903394] [<ffffffffa0d6bddb>] mdt_reint_internal+0x62b/0x9f0 [mdt]
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.910960] [<ffffffffa0d6c63b>] mdt_reint+0x6b/0x120 [mdt]
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.917602] [<ffffffffa07e79cc>] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.925557] [<ffffffffa0794a91>] ptlrpc_main+0xd21/0x1800 [ptlrpc]
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.932824] [<ffffffff81539b0e>] ? thread_return+0x4e/0x7d0
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.939414] [<ffffffffa0793d70>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.946690] [<ffffffff810a138e>] kthread+0x9e/0xc0
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.952407] [<ffffffff8100c28a>] child_rip+0xa/0x20
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.958206] [<ffffffff810a12f0>] ? kthread+0x0/0xc0
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.964016] [<ffffffff8100c280>] ? child_rip+0x0/0x20
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.970033]
Aug 23 19:19:57 atlas1-mds1.ccs.ornl.gov gedi;lustre-mgmt1.ccs.ornl.gov;test_lustre kernel: imklog 5.8.10, log source = /proc/kmsg started.
Aug 23 19:19:57 atlas1-mds1.ccs.ornl.gov gedi;lustre-mgmt1.ccs.ornl.gov;test_lustre kernel: [ 0.000000] Initializing cgroup subsys cpuset
Aug 23 19:19:57 atlas1-mds1.ccs.ornl.gov gedi;lustre-mgmt1.ccs.ornl.gov;test_lustre kernel: [ 0.000000] Initializing cgroup subsys cpu
Aug 23 19:19:57 atlas1-mds1.ccs.ornl.gov gedi;lustre-mgmt1.ccs.ornl.gov;test_lustre kernel: [ 0.000000] Linux version 2.6.32-573.26.1.el6.dne2.x86_64 (jsimmons@lgmgmt1) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-16) (GCC) ) #1 SMP Thu Jul 21 13:22:19 EDT 2016
Aug 23 19:19:57 atlas1-mds1.ccs.ornl.gov gedi;lustre-mgmt1.ccs.ornl.gov;test_lustre kernel: [ 0.000000] Command line: initrd=initrd-2.6.32-573.26.1.el6.dne2.x86_64-gedi selinux=0 audit=0 panic=10 console=tty0 console=ttyS1,115200n8 console=ttyS2,115200n8 crashkernel=192M init=/gedi-preinit BOOT_IMAGE=vmlinuz-2.6.32-573.26.1.el6.dne2.x86_64 BOOTIF=01-20-47-47-7e-c5-e0
Lustre: 18235:0:(osd_handler.c:1272:osd_trans_dump_creds()) attr_set: 2/2/1, xattr_set: 1011/92/0
Lustre: 18235:0:(osd_handler.c:1282:osd_trans_dump_creds()) write: 5043/43356/0, punch: 0/0/0, quota 4/52/0
Lustre: 18235:0:(osd_handler.c:1289:osd_trans_dump_creds()) insert: 1009/17152/0, delete: 2/9/1
Lustre: 18235:0:(osd_handler.c:1296:osd_trans_dump_creds()) ref_add: 1/1/0, ref_del: 2/2/1
the transaction declared at least 2+92+43356+52+17152+9+2+2=60667 credits, while consumed 1+1+1=3
well, probably a bit more as we don't track everything, but still way less than 60667.
it looks like a false assertion though I don't understand the root cause yet.