[LU-8527] Lustre 2.8 server crashed on bring up during large scale test shot Created: 23/Aug/16  Updated: 18/Oct/16  Resolved: 08/Oct/16

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Major
Reporter: James A Simmons Assignee: Zhenyu Xu
Resolution: Fixed Votes: 0
Labels: None
Environment:

RHEL 6.7 servers running the latest Lustre 2.8, with the latest Lustre 2.8 clients.


Attachments: Text File dmesg.txt    
Issue Links:
Related
is related to LU-3102 kernel BUG at fs/jbd2/transaction.c:1033 Resolved
is related to LU-8267 sanity test_129: (osd_internal.h:1137... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

In our first attempt to test Lustre 2.8 servers at scale, the MDS crashed at bring-up. The backtrace is:

<2>[ 8011.527071] kernel BUG at fs/jbd2/transaction.c:1028!
<4>[ 8011.532938] invalid opcode: 0000 [#1] SMP
<4>[ 8011.537761] last sysfs file: /sys/devices/virtual/block/dm-7/uevent
<4>[ 8011.544986] CPU 5
<4>[ 8011.547040] Modules linked in: osp(U) mdd(U) lod(U) mdt(U) lfsck(U) mgc(U) osd_ldiskfs(U) ldiskfs(U) mbcache jbd2 lquota(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ptlrpc(U) obdclass(U) ko2iblnd(U) lnet(U) sha512_generic crc32c_intel libcfs(U) mpt2sas mptctl mptbase dell_rbu autofs4 8021q garp stp llc nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_REJECT xt_comment nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx4_ib ib_sa ib_mad ib_core ib_addr dm_mirror dm_region_hash dm_log dm_round_robin scsi_dh_rdac dm_multipath dm_mod sg ipmi_devintf sd_mod crc_t10dif iTCO_wdt iTCO_vendor_support microcode wmi power_meter acpi_ipmi ipmi_si ipmi_msghandler dcdbas mpt3sas scsi_transport_sas raid_class shpchp sb_edac edac_core lpc_ich mfd_core ahci ipv6 nfs lockd fscache auth_rpcgss nfs_acl sunrpc tg3 mlx4_en ptp pps_core mlx4_core [last unloaded: scsi_wait_scan]
<4>[ 8011.646238]
<4>[ 8011.648140] Pid: 30211, comm: mdt01_503 Not tainted 2.6.32-573.26.1.el6.dne2.x86_64 #1 Dell Inc. PowerEdge R630/0CNCJW
<4>[ 8011.660578] RIP: 0010:[<ffffffffa0c9d92d>] [<ffffffffa0c9d92d>] jbd2_journal_dirty_metadata+0x10d/0x150 [jbd2]
<4>[ 8011.672316] RSP: 0018:ffff883e6751f580 EFLAGS: 00010246
<4>[ 8011.678473] RAX: ffff883e3a8be380 RBX: ffff883ce650e3d8 RCX: ffff883e9a7656e0
<4>[ 8011.686676] RDX: 0000000000000000 RSI: ffff883e9a7656e0 RDI: ffff883ce650e3d8
<4>[ 8011.694873] RBP: ffff883e6751f5a0 R08: c010000000000000 R09: 0000000000000000
<4>[ 8011.703072] R10: 0000000000000001 R11: 0000000000000008 R12: ffff883df92c8ba8
<4>[ 8011.716541] R13: ffff883e9a7656e0 R14: ffff884023ca1800 R15: 0000000000000058
<4>[ 8011.724738] FS: 0000000000000000(0000) GS:ffff88018fca0000(0000) knlGS:0000000000000000
<4>[ 8011.734220] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
<4>[ 8011.740863] CR2: 00007fc59a79b380 CR3: 0000000001a8d000 CR4: 00000000001407e0
<4>[ 8011.749059] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[ 8011.757263] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>[ 8011.765460] Process mdt01_503 (pid: 30211, threadinfo ffff883e6751c000, task ffff883e6750c040)
<4>[ 8011.775518] Stack:
<4>[ 8011.777983] ffff883ce650e3d8 ffffffffa0d827b0 ffff883e9a7656e0 0000000000000000
<4>[ 8011.786313] <d> ffff883e6751f5e0 ffffffffa0cc0fab ffff883e6751f5e0 ffffffffa0ccf968
<4>[ 8011.795363] <d> 0000000000000000 0000000000000058 0000000000000058 ffff883e9a7656e0
<4>[ 8011.804839] Call Trace:
<4>[ 8011.807798] [<ffffffffa0cc0fab>] __ldiskfs_handle_dirty_metadata+0x7b/0x100 [ldiskfs]
<4>[ 8011.817097] [<ffffffffa0ccf968>] ? ldiskfs_bread+0x18/0x80 [ldiskfs]
<4>[ 8011.824517] [<ffffffffa0d6635c>] osd_ldiskfs_write_record+0xec/0x340 [osd_ldiskfs]
<4>[ 8011.833519] [<ffffffffa0d690a3>] osd_write+0x183/0x5b0 [osd_ldiskfs]
<4>[ 8011.840965] [<ffffffffa05f8b4d>] dt_record_write+0x3d/0x130 [obdclass]
<4>[ 8011.848680] [<ffffffffa05bb44f>] llog_osd_write_rec+0xb6f/0x1ad0 [obdclass]
<4>[ 8011.856787] [<ffffffffa0d7ba0b>] ? dynlock_unlock+0x16b/0x1d0 [osd_ldiskfs]
<4>[ 8011.864894] [<ffffffffa05a9426>] llog_write_rec+0xb6/0x270 [obdclass]
<4>[ 8011.872419] [<ffffffffa05b2893>] llog_cat_add_rec+0x1c3/0x7b0 [obdclass]
<4>[ 8011.880234] [<ffffffffa05a9239>] llog_add+0x89/0x1c0 [obdclass]
<4>[ 8011.887176] [<ffffffffa108641d>] osp_sync_add_rec+0x26d/0x9b0 [osp]
<4>[ 8011.894501] [<ffffffffa1086c07>] osp_sync_add+0x77/0x80 [osp]
<4>[ 8011.901248] [<ffffffffa0fb502e>] ? lod_sub_get_thandle+0x24e/0x3c0 [lod]
<4>[ 8011.909060] [<ffffffffa1077823>] osp_object_destroy+0x173/0x230 [osp]
<4>[ 8011.916582] [<ffffffffa0fb85ad>] lod_sub_object_destroy+0x1fd/0x440 [lod]



 Comments   
Comment by James A Simmons [ 23/Aug/16 ]

Small detail: this crash happened during an IOR run. Also, it doesn't seem repeatable.

Comment by James A Simmons [ 23/Aug/16 ]

Here is the dmesg from the crash.

Comment by Jian Yu [ 23/Aug/16 ]

Hi Bobi,

I saw a similar stack trace reported in LU-3102 before. Could you please look into this failure?

Thank you.

Comment by Zhenyu Xu [ 23/Aug/16 ]

is this FE 2.8? If so, I think http://review.whamcloud.com/19732 would be a promising patch.

Comment by Zhenyu Xu [ 23/Aug/16 ]

a port of it is at http://review.whamcloud.com/22058

Comment by James A Simmons [ 23/Aug/16 ]

Okay, I'm building it now. Will let you know the results.

Comment by James A Simmons [ 23/Aug/16 ]

Nope, patch 22058 doesn't fix it for us. We are in a state of constant crashing when the MDS finishes recovery.

Comment by Alex Zhuravlev [ 24/Aug/16 ]

LOD itself doesn't declare credits, but it asks OSD to declare operations. Is it possible to load the osd-ldiskfs module with ldiskfs_track_declares_assert=1 and attach dmesg output starting a bit earlier (hopefully we'll hit our assert with additional information)? Thanks in advance.
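
For reference, a minimal way to enable this would be at module load time (assuming the parameter is exported by the osd_ldiskfs module, as in current Lustre trees; verify the exact name against your build):

    modprobe osd_ldiskfs ldiskfs_track_declares_assert=1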

Comment by Dustin Leverman [ 25/Aug/16 ]

I cut the log down to show times from 7:11PM (recovery started and /proc/fs/lustre/osd-ldiskfs enabled) to 7:18PM, when the crashdump kernel took over. Please see the output below:

Aug 23 19:05:18 atlas1-mds1.ccs.ornl.gov gedi;lustre-mgmt1.ccs.ornl.gov;test_lustre kernel: Kernel logging (proc) stopped.
Aug 23 19:05:18 atlas1-mds1.ccs.ornl.gov kernel: imklog 5.8.10, log source = /proc/kmsg started.
Aug 23 19:05:47 atlas1-mds1.ccs.ornl.gov kernel: [ 138.744691] LNet: HW CPU cores: 8, npartitions: 2
Aug 23 19:05:47 atlas1-mds1.ccs.ornl.gov kernel: [ 138.757447] alg: No test for adler32 (adler32-zlib)
Aug 23 19:05:47 atlas1-mds1.ccs.ornl.gov kernel: [ 138.763082] alg: No test for crc32 (crc32-table)
Aug 23 19:05:47 atlas1-mds1.ccs.ornl.gov kernel: [ 138.768422] alg: No test for crc32 (crc32-pclmul)
Aug 23 19:05:55 atlas1-mds1.ccs.ornl.gov kernel: [ 146.861939] Lustre: Lustre: Build Version: 2.8.0-g242e67d-CHANGED-2.6.32-573.26.1.el6.dne2.x86_64
Aug 23 19:05:56 atlas1-mds1.ccs.ornl.gov kernel: [ 147.181854] LNet: Added LNI 10.36.226.117@o2ib [63/2560/0/180]
Aug 23 19:05:56 atlas1-mds1.ccs.ornl.gov kernel: [ 147.452347] LNetError: 2837:0:(o2iblnd_cb.c:2310:kiblnd_passive_connect()) Can't accept conn from 10.36.230.233@o2ib200 on NA (ib0:1:10.36.226.117): bad dst nid 10.36.226.117@o2ib200
Aug 23 19:05:56 atlas1-mds1.ccs.ornl.gov kernel: [ 147.589168] LNet: Added LNI 10.36.226.117@o2ib200 [63/2560/0/180]
Aug 23 19:10:27 atlas1-mds1.ccs.ornl.gov kernel: [ 419.034586] LDISKFS-fs (dm-7): warning: mounting fs with errors, running e2fsck is recommended
Aug 23 19:10:27 atlas1-mds1.ccs.ornl.gov kernel: [ 419.050076] LDISKFS-fs (dm-7): recovery complete
Aug 23 19:10:27 atlas1-mds1.ccs.ornl.gov kernel: [ 419.065246] LDISKFS-fs (dm-7): mounted filesystem with ordered data mode. quota=on. Opts:
Aug 23 19:10:29 atlas1-mds1.ccs.ornl.gov kernel: [ 420.483435] Lustre: atlas1-MDT0000: Not available for connect from 17469@gni100 (not set up)
Aug 23 19:10:31 atlas1-mds1.ccs.ornl.gov kernel: [ 422.098522] Lustre: atlas1-MDT0000: Imperative Recovery enabled, recovery window shrunk from 1800-5400 down to 900-2700
Aug 23 19:10:32 atlas1-mds1.ccs.ornl.gov kernel: [ 423.169261] Lustre: atlas1-MDT0000: Will be in recovery for at least 15:00, or until 18437 clients reconnect
Aug 23 19:10:32 atlas1-mds1.ccs.ornl.gov kernel: [ 423.180828] Lustre: atlas1-MDT0000: Connection restored to 7e9ab0dc-95d7-895f-67d0-472e66bcea23 (at 232@gni100)
Aug 23 19:10:33 atlas1-mds1.ccs.ornl.gov kernel: [ 424.194505] Lustre: atlas1-MDT0000: Connection restored to b8f051f0-79b3-229d-0cab-f02d9100888a (at 18990@gni100)
Aug 23 19:10:34 atlas1-mds1.ccs.ornl.gov kernel: [ 425.308076] Lustre: atlas1-MDT0000: Connection restored to 6954d66d-f1d9-df22-81b6-a2a84a767326 (at 17860@gni100)
Aug 23 19:10:34 atlas1-mds1.ccs.ornl.gov kernel: [ 425.320064] Lustre: Skipped 5 previous similar messages
Aug 23 19:10:36 atlas1-mds1.ccs.ornl.gov kernel: [ 427.309298] Lustre: atlas1-MDT0000: Connection restored to 97ba02a4-3cbf-2339-af5c-bc672e524fcf (at 15033@gni100)
Aug 23 19:10:36 atlas1-mds1.ccs.ornl.gov kernel: [ 427.321286] Lustre: Skipped 249 previous similar messages
Aug 23 19:10:40 atlas1-mds1.ccs.ornl.gov kernel: [ 431.312192] Lustre: atlas1-MDT0000: Connection restored to 32e83f22-2827-c76a-be5a-444297abac0f (at 1031@gni100)
Aug 23 19:10:40 atlas1-mds1.ccs.ornl.gov kernel: [ 431.324082] Lustre: Skipped 11156 previous similar messages
Aug 23 19:12:02 atlas1-mds1.ccs.ornl.gov kernel: [ 513.742286] Lustre: atlas1-MDT0000: Connection restored to 68400d40-3736-e9e3-b4eb-7d2f1a9c2a9b (at 10.36.225.10@o2ib)
Aug 23 19:12:02 atlas1-mds1.ccs.ornl.gov kernel: [ 513.754757] Lustre: Skipped 2849 previous similar messages
Aug 23 19:12:41 atlas1-mds1.ccs.ornl.gov kernel: [ 552.533598] Lustre: atlas1-MDT0000: Connection restored to d6edd526-b008-01f3-beb9-910b627acce0 (at 10.36.205.218@o2ib)
Aug 23 19:13:32 atlas1-mds1.ccs.ornl.gov kernel: [ 603.859284] Lustre: atlas1-MDT0000: Connection restored to fdf3ca98-024f-2d65-83bc-b10712d608f5 (at 10.36.205.209@o2ib)
Aug 23 19:13:32 atlas1-mds1.ccs.ornl.gov kernel: [ 603.871786] Lustre: Skipped 1 previous similar message
Aug 23 19:14:51 atlas1-mds1.ccs.ornl.gov kernel: [ 682.177088] Lustre: atlas1-MDT0000: Connection restored to c281fcc2-9ed2-632b-dfca-a063cbb468fd (at 16884@gni100)
Aug 23 19:14:51 atlas1-mds1.ccs.ornl.gov kernel: [ 682.189010] Lustre: Skipped 1677 previous similar messages
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.646938] Lustre: 18235:0:(osd_handler.c:1265:osd_trans_dump_creds()) create: 1008/8064/0, destroy: 1/4/1
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.658473] Lustre: 18235:0:(osd_handler.c:1272:osd_trans_dump_creds()) attr_set: 2/2/1, xattr_set: 1011/92/0
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.670198] Lustre: 18235:0:(osd_handler.c:1282:osd_trans_dump_creds()) write: 5043/43356/0, punch: 0/0/0, quota 4/52/0
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.682891] Lustre: 18235:0:(osd_handler.c:1289:osd_trans_dump_creds()) insert: 1009/17152/0, delete: 2/9/1
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.694423] Lustre: 18235:0:(osd_handler.c:1296:osd_trans_dump_creds()) ref_add: 1/1/0, ref_del: 2/2/1
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.703059] Lustre: atlas1-MDT0000: Recovery over after 4:34, of 18437 clients 18437 recovered and 0 were evicted.
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.717547] LustreError: 18235:0:(osd_internal.h:1073:osd_trans_exec_op()) atlas1-MDT0000-osd: op = 7, rb = 7
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.729087] LustreError: 18235:0:(osd_internal.h:1081:osd_trans_exec_op()) LBUG
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.737712] Pid: 18235, comm: mdt01_399
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.742233]
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.742233] Call Trace:
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.747087] [<ffffffffa0423875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.755118] [<ffffffffa0423e77>] lbug_with_loc+0x47/0xb0 [libcfs]
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.762280] [<ffffffffa0c4b340>] osd_write+0x420/0x5b0 [osd_ldiskfs]
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.769753] [<ffffffffa0569b4d>] dt_record_write+0x3d/0x130 [obdclass]
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.777402] [<ffffffffa052c44f>] llog_osd_write_rec+0xb6f/0x1ad0 [obdclass]
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.785527] [<ffffffffa051a426>] llog_write_rec+0xb6/0x270 [obdclass]
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.793080] [<ffffffffa0523893>] llog_cat_add_rec+0x1c3/0x7b0 [obdclass]
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.800936] [<ffffffffa051a239>] llog_add+0x89/0x1c0 [obdclass]
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.807904] [<ffffffffa0f1e41d>] osp_sync_add_rec+0x26d/0x9b0 [osp]
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.815261] [<ffffffffa0f1ec07>] osp_sync_add+0x77/0x80 [osp]
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.822015] [<ffffffffa0f0f823>] osp_object_destroy+0x173/0x230 [osp]
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.829581] [<ffffffffa0e635ad>] lod_sub_object_destroy+0x1fd/0x440 [lod]
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.837516] [<ffffffffa0e5745b>] lod_object_destroy+0x36b/0x770 [lod]
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.845067] [<ffffffffa042ed81>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.852726] [<ffffffffa0ebde6b>] mdd_finish_unlink+0x28b/0x3d0 [mdd]
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.860164] [<ffffffffa042c1d8>] ? libcfs_log_return+0x28/0x40 [libcfs]
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.867922] [<ffffffffa0ec2845>] mdd_unlink+0xab5/0xf70 [mdd]
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.874700] [<ffffffffa042ed81>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.882383] [<ffffffffa0d800c8>] mdo_unlink+0x18/0x50 [mdt]
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.888962] [<ffffffffa0d88f70>] mdt_reint_unlink+0xbb0/0x1060 [mdt]
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.896416] [<ffffffffa0d8015d>] mdt_reint_rec+0x5d/0x200 [mdt]
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.903394] [<ffffffffa0d6bddb>] mdt_reint_internal+0x62b/0x9f0 [mdt]
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.910960] [<ffffffffa0d6c63b>] mdt_reint+0x6b/0x120 [mdt]
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.917602] [<ffffffffa07e79cc>] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.925557] [<ffffffffa0794a91>] ptlrpc_main+0xd21/0x1800 [ptlrpc]
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.932824] [<ffffffff81539b0e>] ? thread_return+0x4e/0x7d0
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.939414] [<ffffffffa0793d70>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.946690] [<ffffffff810a138e>] kthread+0x9e/0xc0
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.952407] [<ffffffff8100c28a>] child_rip+0xa/0x20
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.958206] [<ffffffff810a12f0>] ? kthread+0x0/0xc0
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.964016] [<ffffffff8100c280>] ? child_rip+0x0/0x20
Aug 23 19:15:06 atlas1-mds1.ccs.ornl.gov kernel: [ 697.970033]
Aug 23 19:19:57 atlas1-mds1.ccs.ornl.gov gedi;lustre-mgmt1.ccs.ornl.gov;test_lustre kernel: imklog 5.8.10, log source = /proc/kmsg started.
Aug 23 19:19:57 atlas1-mds1.ccs.ornl.gov gedi;lustre-mgmt1.ccs.ornl.gov;test_lustre kernel: [ 0.000000] Initializing cgroup subsys cpuset
Aug 23 19:19:57 atlas1-mds1.ccs.ornl.gov gedi;lustre-mgmt1.ccs.ornl.gov;test_lustre kernel: [ 0.000000] Initializing cgroup subsys cpu
Aug 23 19:19:57 atlas1-mds1.ccs.ornl.gov gedi;lustre-mgmt1.ccs.ornl.gov;test_lustre kernel: [ 0.000000] Linux version 2.6.32-573.26.1.el6.dne2.x86_64 (jsimmons@lgmgmt1) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-16) (GCC) ) #1 SMP Thu Jul 21 13:22:19 EDT 2016
Aug 23 19:19:57 atlas1-mds1.ccs.ornl.gov gedi;lustre-mgmt1.ccs.ornl.gov;test_lustre kernel: [ 0.000000] Command line: initrd=initrd-2.6.32-573.26.1.el6.dne2.x86_64-gedi selinux=0 audit=0 panic=10 console=tty0 console=ttyS1,115200n8 console=ttyS2,115200n8 crashkernel=192M init=/gedi-preinit BOOT_IMAGE=vmlinuz-2.6.32-573.26.1.el6.dne2.x86_64 BOOTIF=01-20-47-47-7e-c5-e0

Comment by Jian Yu [ 25/Aug/16 ]

Thank you Dustin for the logs.

Hi Alex,

Could you please look into the logs? And although patch http://review.whamcloud.com/22058 doesn't fix the issue in this ticket, do you think it is still a good-to-have patch for Lustre b2_8_fe servers?

Comment by Andreas Dilger [ 26/Aug/16 ]

It seems possible that this is related to creating or deleting a wide-striped file. I found LU-3102, which may be related.

Comment by James A Simmons [ 01/Sep/16 ]

The first crash happened while running IOR against a large-stripe single shared file. After the first crash, this bug came back at random times.

Comment by Alex Zhuravlev [ 02/Sep/16 ]

please tell us the latest commit of the branch you're using.

Comment by Jian Yu [ 02/Sep/16 ]

Hi Alex,

It's the Lustre b2_8_fe branch with commit 71badf2 at the tip.

Comment by James A Simmons [ 07/Sep/16 ]

Any updates?

Comment by Alex Zhuravlev [ 08/Sep/16 ]

i'm trying to come up with a debugging patch.

Comment by Alex Zhuravlev [ 14/Sep/16 ]

Lustre: 18235:0:(osd_handler.c:1272:osd_trans_dump_creds()) attr_set: 2/2/1, xattr_set: 1011/92/0
Lustre: 18235:0:(osd_handler.c:1282:osd_trans_dump_creds()) write: 5043/43356/0, punch: 0/0/0, quota 4/52/0
Lustre: 18235:0:(osd_handler.c:1289:osd_trans_dump_creds()) insert: 1009/17152/0, delete: 2/9/1
Lustre: 18235:0:(osd_handler.c:1296:osd_trans_dump_creds()) ref_add: 1/1/0, ref_del: 2/2/1

the transaction declared at least 2+92+43356+52+17152+9+2+2 = 60667 credits, while it consumed 1+1+1 = 3.
well, probably a bit more, as we don't track everything, but still far less than 60667.
it looks like a false assertion, though I don't understand the root cause yet.

Comment by Andreas Dilger [ 14/Sep/16 ]

My first guess would be to take a look at the ldiskfs xattr inode (wide striping) patch. It may be that it is declaring too many credits in some cases.

Comment by nasf (Inactive) [ 23/Sep/16 ]

Lustre: 18235:0:(osd_handler.c:1272:osd_trans_dump_creds()) attr_set: 2/2/1, xattr_set: 1011/92/0

The "xattr_set" seems abnormal, under which case the oti_declare_ops[xattr_set] can be so much larger than oti_declare_ops_cred[xattr_set] ?

Comment by nasf (Inactive) [ 26/Sep/16 ]

It is suspected that multiple concurrent llog append operations, such as concurrent unlinks of large-striped (large EA) files, can fill up a llog file in a very short time. Because the llog may be shared (at the same time) by many concurrent modifications, in some extreme cases an unlink of a large-striped file may not have declared the creation of a new llog file; but during the real llog append for recording the striped OST-objects, it finds the current llog already full and needs to create a new llog file. Since it did not declare the related credits, the lower layer may hit trouble.

I am not sure whether this is the actual case or not. But since it is reproducible on the customer site, I will make a patch for verification.

Comment by Gerrit Updater [ 26/Sep/16 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/22726
Subject: LU-8527 obdclass: declare more credits for llog append
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c38efe8101d4bc977ceb9f07b041ac56795dd90d

Comment by Alex Zhuravlev [ 26/Sep/16 ]

Fan Yong, there was no problem with write credits in the customer's case:
Lustre: 18235:0:(osd_handler.c:1282:osd_trans_dump_creds()) write: 5043/43356/0, punch: 0/0/0, quota 4/52/0
not a single credit was used for write.

we do declare an additional create:

        if (!IS_ERR_OR_NULL(next)) {
                if (!llog_exist(next)) {
                        if (dt_object_remote(cathandle->lgh_obj)) {
                        } else {
                                rc = llog_declare_create(env, next, th);
                                llog_declare_write_rec(env, cathandle,
                                                       &lirec->lid_hdr, -1, th);

if you think it's possible to hit a race where we miss declaring an additional llog (say, the next one was just created), then it makes sense to add an assertion and hit it.

Comment by nasf (Inactive) [ 26/Sep/16 ]

The case may be that when the unlink RPC service thread declares, the next llog already exists, so it does not declare the credits for creating a new llog file; but after the unlink transaction has started, the related llog file may have been filled up by others (plus itself). Then it will try to create a new llog file using the credits declared for the subsequent write, causing transaction trouble.

Lustre: 18235:0:(osd_handler.c:1282:osd_trans_dump_creds()) write: 5043/43356/0, punch: 0/0/0, quota 4/52/0

I am not sure whether it is the same as the original trouble. According to our current implementation, there are some nested transactions (one transaction started during another transaction's declare phase), which may cause the OSD layer's credits information to be incorrect. That is another issue to be fixed.

Comment by Alex Zhuravlev [ 26/Sep/16 ]

check llog_cat_current_log().. when the current log is full, the next one is taken as current and chd_next_log is set to NULL. in any case, there is no point in declaring yet another "next"; we just need to make sure that we've declared at least one. for example, if we find "next" already existing, then we can repeat from the beginning.

Comment by nasf (Inactive) [ 26/Sep/16 ]

As you can see:

int llog_cat_declare_add_rec(const struct lu_env *env,
                             struct llog_handle *cathandle,
                             struct llog_rec_hdr *rec, struct thandle *th)
{
...
        if (!llog_exist(cathandle->u.chd.chd_current_log)) {
...
        next = cathandle->u.chd.chd_next_log;
        if (next) {
(1)==>          if (!llog_exist(next)) {
                        /* declare create */
...
}

static struct llog_handle *llog_cat_current_log(struct llog_handle *cathandle,
                                                struct thandle *th)
{
...
        loghandle = cathandle->u.chd.chd_next_log;
        cathandle->u.chd.chd_current_log = loghandle;
(2)==>        cathandle->u.chd.chd_next_log = NULL;
...
}

During the above checking, we do NOT take "cathandle->lgh_lock", so it is possible (at least in theory) that thread1 has already reached line (1) when another thread2 sets "chd_current_log" to "chd_next_log", sets "chd_next_log" to NULL at line (2), and then creates the related llog object via llog_cat_new_log(). So thread1 will find that "next" exists and will not declare the creation of "next". Such a case must exist; otherwise the check at line (1) would be redundant and could be replaced with LASSERT(!llog_exist(next)).
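
To make the window concrete, here is the interleaving this analysis suggests (an illustrative sketch based on the code above; the two thread roles are hypothetical labels, not actual thread names):

    thread1 (llog_cat_declare_add_rec)    thread2 (llog_cat_current_log + new log)
    ----------------------------------    ----------------------------------------
    next = chd_next_log   (non-NULL)
                                          loghandle = chd_next_log
                                          chd_current_log = loghandle
                                          chd_next_log = NULL           <== (2)
                                          llog_cat_new_log() creates the object
    llog_exist(next) == true   <== (1)
    => skips llog_declare_create(), so
       no credits are declared for a
       later llog creation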

Comment by Alex Zhuravlev [ 26/Sep/16 ]

yes, this is possible, but we usually declare a few creations because of striping (those are independent from the llog point of view). I think it's unlikely to hit this race often, while most of the time the new llog declarations are not used.
also, we do have per-operation accounting in OSD and I can't remember any report of a lack of credits at the create operation. notice OSD checks credits before the operation and after; in all the cases it's the write which fails. this is why I suggested enabling the assertion in the declaration checks. hopefully we'll see something useful.
my current suspicion is the write itself; probably it's too optimistic.

Comment by nasf (Inactive) [ 26/Sep/16 ]

Yes, it is unlikely. One important reason is that we seldom test large-striped files during our daily Maloo tests, which may hide some corner cases in llog handle switching. In this case, large striping is the factor most different from our daily testing, so I suspect that point.

As for the write itself being too optimistic, I have checked the related code. The "offset" is the most important parameter for deciding how to optimise, but in the llog case this parameter is either "0" (for modifying the header, including the bitmap) or "-1" (for appending a record). For these two cases, the OSD layer does very limited optimisation, so I cannot find any suspect points.

Comment by Alex Zhuravlev [ 26/Sep/16 ]

if create was missing a declaration or credits, then we'd get a warning or an assertion (when enabled). I just checked with a skipped declaration and low credits.

Comment by nasf (Inactive) [ 26/Sep/16 ]

Another suspected point: "osd_thandle::ot_credits" is defined as "unsigned short". Are two bytes large enough for the credits declared by large-striping operations? Possible overflow?
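
The credit dump above makes this plausible: summing its declared-credit column gives 8064 + 4 + 2 + 92 + 43356 + 52 + 17152 + 9 + 1 + 2 = 68734, which exceeds USHRT_MAX (65535), so a 16-bit ot_credits would silently wrap to 3198 and (assuming ot_credits is what gets handed to the journal start) the handle would begin with far fewer credits than were declared. A minimal standalone sketch of the truncation (the struct here is a simplified stand-in, not the actual Lustre definition):

    #include <limits.h>
    #include <stdio.h>

    /* simplified stand-in for struct osd_thandle before the 32-bit fix */
    struct osd_thandle {
            unsigned short ot_credits;  /* 16 bits: wraps past 65535 */
    };

    int main(void)
    {
            /* declared-credit column from the osd_trans_dump_creds() output above */
            unsigned int creds[] = { 8064, 4, 2, 92, 43356, 52, 17152, 9, 1, 2 };
            struct osd_thandle oh = { 0 };
            unsigned int total = 0;
            unsigned int i;

            for (i = 0; i < sizeof(creds) / sizeof(creds[0]); i++) {
                    total += creds[i];
                    oh.ot_credits += creds[i];  /* silently truncated */
            }
            /* prints: declared 68734 credits, ot_credits holds 3198 (max 65535) */
            printf("declared %u credits, ot_credits holds %u (max %u)\n",
                   total, oh.ot_credits, (unsigned int)USHRT_MAX);
            return 0;
    }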

Comment by Alex Zhuravlev [ 26/Sep/16 ]

I was looking at this as well. this could potentially explain our assertion, but not the BUG() in JBD2.

Comment by Gerrit Updater [ 28/Sep/16 ]

Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: http://review.whamcloud.com/22782
Subject: LU-8527 osd: ot_credits must be 32bit
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c6ed09256d2d76e255a3decd1d54b756b6360977

Comment by Gerrit Updater [ 08/Oct/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/22782/
Subject: LU-8527 osd: ot_credits must be 32bit
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 9edc89c11eb673b0c0da08381a6a779e40ff8ac2

Comment by Peter Jones [ 08/Oct/16 ]

Landed for 2.9

Comment by James A Simmons [ 18/Oct/16 ]

We were able to reproduce the problem at a smaller scale, and after we applied the LU-8527 patch the MDS no longer crashed. Thank you.
