Details
-
Bug
-
Resolution: Cannot Reproduce
-
Major
-
None
-
Lustre 2.12.0
-
None
-
CentOS 7.6, ldiskfs
-
3
-
9223372036854775807
Description
The following backtraces can be seen on a new MDS with 2.12.0 (with DNE, PFL and DoM enabled):
[717785.597502] Lustre: fir-MDT0000: Client 85c45b71-10aa-be9f-071b-9a9c50548db2 (at 10.8.15.7@o2ib6) reconnecting [717785.607591] Lustre: Skipped 1 previous similar message [717785.612864] Lustre: fir-MDT0000: Connection restored to ff1032ee-a0fe-1065-6b12-2d8b61fa9745 (at 10.8.15.7@o2ib6) [717785.623208] Lustre: Skipped 1 previous similar message [717786.125444] Lustre: 5936:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1548496141/real 0] req@ffff911f54e31e00 x1623664975829664/t0(0) o13->fir-OST0000-osc-MDT0000@10.0.10.101@o2ib7:7/4 lens 224/368 e 0 to 1 dl 1548496148 ref 2 fl Rpc:X/0/ffffffff rc 0/-1 [717786.152781] Lustre: 5936:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 3 previous similar messages [717786.162526] Lustre: fir-OST0000-osc-MDT0000: Connection to fir-OST0000 (at 10.0.10.101@o2ib7) was lost; in progress operations using this service will wait for recovery to complete [717786.178676] Lustre: Skipped 7 previous similar messages [717787.122451] NMI watchdog: BUG: soft lockup - CPU#9 stuck for 23s! [mdt_io01_012:7223] [717787.157454] NMI watchdog: BUG: soft lockup - CPU#13 stuck for 22s! [mdt_io01_014:8952] [717787.130365] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) ldiskfs(OE) lustre(OE) lmv(OE) mdc(OE) osc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd [717787.157456] Modules linked in: [717787.157457] osp(OE) [717787.157458] mdd(OE) [717787.157459] lod(OE) [717787.157460] mdt(OE) [717787.157460] lfsck(OE) [717787.157461] mgs(OE) [717787.157461] mgc(OE) [717787.157462] osd_ldiskfs(OE) [717787.157462] lquota(OE) [717787.157463] ldiskfs(OE) [717787.157463] lustre(OE) [717787.157464] lmv(OE) [717787.157465] mdc(OE) [717787.157465] osc(OE) [717787.157466] lov(OE) [717787.157468] fid(OE) [717787.157469] fld(OE) [717787.157469] ko2iblnd(OE) [717787.157470] ptlrpc(OE) [717787.157470] obdclass(OE) [717787.157471] lnet(OE) [717787.157472] libcfs(OE) [717787.157472] rpcsec_gss_krb5 [717787.157473] auth_rpcgss [717787.157474] nfsv4 [717787.157474] dns_resolver [717787.157475] nfs [717787.157476] lockd [717787.157476] grace [717787.157477] fscache [717787.157477] rdma_ucm(OE) [717787.157478] ib_ucm(OE) [717787.157479] rdma_cm(OE) [717787.157479] iw_cm(OE) [717787.157480] ib_ipoib(OE) [717787.157481] ib_cm(OE) [717787.157481] ib_umad(OE) [717787.157482] mlx5_fpga_tools(OE) [717787.157483] mlx4_en(OE) [717787.157483] mlx4_ib(OE) [717787.157484] mlx4_core(OE) [717787.157485] dell_rbu [717787.157485] sunrpc [717787.157486] vfat [717787.157487] fat [717787.157487] dm_round_robin [717787.157488] dcdbas [717787.157489] amd64_edac_mod [717787.157489] edac_mce_amd [717787.157490] kvm_amd [717787.157490] kvm [717787.157491] irqbypass [717787.157492] crc32_pclmul [717787.157492] ghash_clmulni_intel [717787.157493] aesni_intel [717787.157493] lrw [717787.157494] gf128mul [717787.157495] glue_helper [717787.157495] ablk_helper [717787.157496] cryptd [717787.157496] ses [717787.157497] enclosure [717787.157497] dm_multipath [717787.157498] pcspkr [717787.157498] ipmi_si [717787.157499] ipmi_devintf [717787.157499] dm_mod [717787.157500] ccp [717787.157501] k10temp [717787.157501] ipmi_msghandler [717787.157502] sg [717787.157502] i2c_piix4 [717787.157503] acpi_power_meter [717787.157504] knem(OE) [717787.157504] ip_tables [717787.157505] ext4 [717787.157506] mbcache [717787.157506] jbd2 [717787.157507] sd_mod [717787.157507] crc_t10dif [717787.157507] crct10dif_generic [717787.157508] mlx5_ib(OE) [717787.157509] ib_uverbs(OE) [717787.157510] ib_core(OE) [717787.157510] i2c_algo_bit [717787.157511] drm_kms_helper [717787.157512] ahci [717787.157512] mlx5_core(OE) [717787.157513] syscopyarea [717787.157514] sysfillrect [717787.157514] sysimgblt [717787.157515] fb_sys_fops [717787.157516] mlxfw(OE) [717787.157516] libahci [717787.157517] devlink [717787.157517] ttm [717787.157518] crct10dif_pclmul [717787.157519] crct10dif_common [717787.157519] mlx_compat(OE) [717787.157520] tg3 [717787.157521] drm [717787.157521] crc32c_intel [717787.157522] libata [717787.157522] ptp [717787.157523] drm_panel_orientation_quirks [717787.157523] megaraid_sas [717787.157524] pps_core [717787.157525] mpt3sas(OE) [717787.157525] raid_class [717787.157526] scsi_transport_sas [717787.157526] [last unloaded: libcfs] [717787.157530] CPU: 13 PID: 8952 Comm: mdt_io01_014 Kdump: loaded Tainted: G OE ------------ 3.10.0-957.1.3.el7_lustre.x86_64 #1 [717787.157531] Hardware name: Dell Inc. PowerEdge R6415/065PKD, BIOS 1.6.7 10/29/2018 [717787.157533] task: ffff91599d3d1040 ti: ffff9159f42a4000 task.ti: ffff9159f42a4000 [717787.157534] RIP: 0010:[<ffffffffadd1228e>] [717787.157541] [<ffffffffadd1228e>] native_queued_spin_lock_slowpath+0x1ce/0x200 [717787.157542] RSP: 0018:ffff9159f42a7798 EFLAGS: 00000202 [717787.157543] RAX: 0000000000000001 RBX: ffff911e33293a30 RCX: 0000000000000001 [717787.157544] RDX: 0000000000000101 RSI: 0000000000000001 RDI: ffff9149fce5fc80 [717787.157545] RBP: ffff9159f42a7798 R08: 0000000000000101 R09: ffffffffc12b0d1a [717787.157546] R10: ffff9139ff6df140 R11: ffffdbb24f22f600 R12: ffff9159f42a7738 [717787.157546] R13: ffff911e33293b18 R14: 0000000000000000 R15: 1fffffffffffffff [717787.157548] FS: 00007f277afbf700(0000) GS:ffff9139ff6c0000(0000) knlGS:0000000000000000 [717787.157549] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [717787.157550] CR2: 00007f2781327000 CR3: 000000402c810000 CR4: 00000000003407e0 [717787.157551] Call Trace: [717787.157556] [<ffffffffae35bfcb>] queued_spin_lock_slowpath+0xb/0xf [717787.157560] [<ffffffffae36a480>] _raw_spin_lock+0x20/0x30 [717787.157574] [<ffffffffc129d507>] ldiskfs_es_lru_add+0x57/0x90 [ldiskfs] [717787.157586] [<ffffffffc12b0d35>] ldiskfs_ext_map_blocks+0x7b5/0xf60 [ldiskfs] [717787.157593] [<ffffffffc0b34ae2>] ? kiblnd_launch_tx+0xd2/0xc20 [ko2iblnd] [717787.157599] [<ffffffffadf210f2>] ? crypto_create_tfm+0x32/0xd0 [717787.157609] [<ffffffffc1336181>] ? qsd_op_begin+0xb1/0x4b0 [lquota] [717787.157618] [<ffffffffc126c518>] ldiskfs_map_blocks+0x98/0x700 [ldiskfs] [717787.157625] [<ffffffffc1336181>] ? qsd_op_begin+0xb1/0x4b0 [lquota] [717787.157633] [<ffffffffc126ab85>] ? ldiskfs_inode_attach_jinode+0x55/0xd0 [ldiskfs] [717787.157647] [<ffffffffc13a2443>] osd_ldiskfs_map_inode_pages+0x143/0x420 [osd_ldiskfs] [717787.157657] [<ffffffffc13a3b32>] osd_write_commit+0x3a2/0x8c0 [osd_ldiskfs] [717787.157667] [<ffffffffc12922c9>] ? __ldiskfs_journal_start_sb+0x69/0xe0 [ldiskfs] [717787.157689] [<ffffffffc1552b93>] mdt_obd_commitrw+0x953/0x1150 [mdt] [717787.157742] [<ffffffffc0ed0e2c>] obd_commitrw+0x9c/0x370 [ptlrpc] [717787.157784] [<ffffffffc0ed3a9d>] tgt_brw_write+0x100d/0x1a90 [ptlrpc] [717787.157821] [<ffffffffc0e6d2e0>] ? lustre_msg_buf_v2+0x1b0/0x1b0 [ptlrpc] [717787.157828] [<ffffffffadce011c>] ? update_curr+0x14c/0x1e0 [717787.157830] [<ffffffffadcdca7e>] ? account_entity_dequeue+0xae/0xd0 [717787.157862] [<ffffffffc0e28a30>] ? target_send_reply_msg+0x170/0x170 [ptlrpc] [717787.157903] [<ffffffffc0ed735a>] tgt_request_handle+0xaea/0x1580 [ptlrpc] [717787.157943] [<ffffffffc0eb0a51>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc] [717787.157951] [<ffffffffc0abcbde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs] [717787.157989] [<ffffffffc0e7b92b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] [717787.158026] [<ffffffffc0e787b5>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc] [717787.158029] [<ffffffffadcd67c2>] ? default_wake_function+0x12/0x20 [717787.158032] [<ffffffffadccba9b>] ? __wake_up_common+0x5b/0x90 [717787.158070] [<ffffffffc0e7f25c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc] [717787.158106] [<ffffffffc0e7e760>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc] [717787.158110] [<ffffffffadcc1c31>] kthread+0xd1/0xe0 [717787.158112] [<ffffffffadcc1b60>] ? insert_kthread_work+0x40/0x40 [717787.158115] [<ffffffffae374c24>] ret_from_fork_nospec_begin+0xe/0x21 [717787.158117] [<ffffffffadcc1b60>] ? insert_kthread_work+0x40/0x40 [717787.158117] Code: [717787.158118] 37 [717787.158118] 81 [717787.158119] fe [717787.158119] 00 [717787.158120] 01 [717787.158120] 00 [717787.158120] 00 [717787.158120] 74 [717787.158121] f4 [717787.158121] e9 [717787.158121] 93 [717787.158122] fe [717787.158122] ff [717787.158122] ff [717787.158123] 0f [717787.158123] 1f [717787.158123] 80 [717787.158124] 00 [717787.158124] 00 [717787.158124] 00 [717787.158125] 00 [717787.158125] 83 [717787.158125] fa [717787.158126] 01 [717787.158126] 75 [717787.158126] 11 [717787.158127] 0f [717787.158127] 1f [717787.158127] 00 [717787.158128] e9 [717787.158128] 68 [717787.158128] fe [717787.158129] ff [717787.158129] ff [717787.158129] 0f [717787.158129] 1f [717787.158130] 00 [717787.158130] 85 [717787.158130] c0 [717787.158131] 74 [717787.158131] 0c [717787.158131] f3 [717787.158132] 90 [717787.158132] <8b> [717787.158133] 07 [717787.158133] 0f [717787.158133] b6 [717787.158134] c0 [717787.158134] 83 [717787.158134] f8 [717787.158135] 03 [717787.158135] 75 [717787.158135] f0 [717787.158135] b8 [717787.158136] 01 [717787.158136] 00 [717787.158136] 00 [717787.158137] 00 [717787.158137] 66 [717787.158137] 89 [717787.158138] 07 [717787.158138] 5d [717787.158138] c3 [717787.158139] 66 [717787.165452] NMI watchdog: BUG: soft lockup - CPU#14 stuck for 22s! [mdt_io02_012:7222] etc.
ldiskfs config:
[root@fir-md1-s1 ~]# dumpe2fs -h /dev/mapper/md1-rbod1-mdt1
dumpe2fs 1.44.3.wc1 (23-July-2018)
Filesystem volume name: fir-MDT0001
Last mounted on: /
Filesystem UUID: ab333144-bee5-4e2a-b4bb-aa55c29b1718
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr dir_index filetype needs_recovery extent 64bit mmp flex_bg dirdata sparse_super large_file huge_file uninit_bg dir_nlink quota
Filesystem flags: signed_directory_hash
Default mount options: user_xattr acl
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 288005760
Block count: 4681213440
Reserved block count: 234060672
Free blocks: 4607819718
Free inodes: 288004250
First block: 0
Block size: 4096
Fragment size: 4096
Group descriptor size: 64
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 2016
Inode blocks per group: 504
Flex block group size: 16
Filesystem created: Thu Jan 24 14:02:56 2019
Last mount time: Fri Jan 25 11:45:51 2019
Last write time: Fri Jan 25 11:45:51 2019
Mount count: 4
Maximum mount count: -1
Last checked: Thu Jan 24 14:02:56 2019
Check interval: 0 (<none>)
Lifetime writes: 275 GB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 1024
Required extra isize: 32
Desired extra isize: 32
Journal inode: 8
Default directory hash: half_md4
Directory Hash Seed: df4e92b3-3ca4-4642-bcf8-333ebb00b894
Journal backup: inode blocks
MMP block number: 10335
MMP update interval: 5
User quota inode: 3
Group quota inode: 4
Journal features: journal_incompat_revoke journal_64bit
Journal size: 4096M
Journal length: 1048576
Journal sequence: 0x00289850
Journal start: 316606
MMP_block:
mmp_magic: 0x4d4d50
mmp_check_interval: 10
mmp_sequence: 0x004078
mmp_update_date: Sat Jan 26 10:41:07 2019
mmp_update_time: 1548528067
mmp_node_name: fir-md1-s2
mmp_device_name: dm-2
The impact seems to have been temporary as right now it's working fine again. We're running some large fpsync to synchronize datasets from an old lustre filesystem, so this is mostly writes.
Stephane