Details
Type: Bug
Resolution: Fixed
Priority: Major
Lustre 1.8.x (1.8.0 - 1.8.5)
Lustre version: 1.8.5.54-20110316022453-PRISTINE-2.6.18-194.17.1.el5_lustre.20110315140510
lctl version: 1.8.5.54-20110316022453-PRISTINE-2.6.18-194.17.1.el5_lustre.20110315140510
Red Hat Enterprise Linux Server release 5.4 (Tikanga)
Authentication via LDAP and Kerberos
Quota enabled only for groups on the Lustre filesystem
Description
The Lustre infrastructure is based on two HP blade servers with
Hitachi shared storage. The first server hosts the MDS, MGS, and
OST0/1/2; the second server hosts OST3/4.
The first server is osiride-lp-030 and the second is osiride-lp-031.
The clustering of these services is based on Red Hat Cluster Suite.
The Lustre infrastructure crashes daily, and we see these dumps in
the log:
Dec 9 11:27:08 osiride-lp-030 kernel: BUG: soft lockup - CPU#8 stuck for 10s! [ll_mdt_06:21936]
Dec 9 11:27:08 osiride-lp-030 kernel: CPU 8:
Dec 9 11:27:08 osiride-lp-030 kernel: Modules linked in: obdfilter(U) ost(U) mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U) ldiskfs(U) crc16(U) lock_dlm(U) gfs2(U) dlm(U) configfs(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lvfs(U) lnet(U) libcfs(U) bonding(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) video(U) backlight(U) sbs(U) power_meter(U) hwmon(U) i2c_ec(U) i2c_core(U) dell_wmi(U) wmi(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) dm_round_robin(U) dm_multipath(U) scsi_dh(U) parport_pc(U) lp(U) parport(U) joydev(U) bnx2x(U) sg(U) amd64_edac_mod(U) shpchp(U) bnx2(U) serio_raw(U) tg3(U) pcspkr(U) edac_mc(U) hpilo(U) dm_raid45(U) dm_message(U) dm_region_hash(U) dm_mem_cache(U) dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_log(U) dm_mod(U) usb_storage(U) qla2xxx(U) scsi_transport_fc(U) cciss(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Dec 9 11:27:08 osiride-lp-030 kernel: Pid: 21936, comm: ll_mdt_06 Tainted: G 2.6.18-194.17.1.el5_lustre.20110315140510 #1
Dec 9 11:27:08 osiride-lp-030 kernel: RIP: 0010:[<ffffffff8882a270>] [<ffffffff8882a270>] :lquota:dquot_create_oqaq+0x2b0/0x510
Dec 9 11:27:08 osiride-lp-030 kernel: RSP: 0018:ffff8104484e3ac0 EFLAGS: 00000246
Dec 9 11:27:08 osiride-lp-030 kernel: RAX: 0000000000000000 RBX: ffff81041eee3ef0 RCX: 000000000000000c
Dec 9 11:27:08 osiride-lp-030 kernel: RDX: 0000000000000000 RSI: 0000000000001400 RDI: 0000000000001400
Dec 9 11:27:08 osiride-lp-030 kernel: RBP: 0000000000000004 R08: 000000000000000c R09: 0000000001000000
Dec 9 11:27:08 osiride-lp-030 kernel: R10: 000000000000000c R11: 0000000000500000 R12: ffffffffffffffff
Dec 9 11:27:08 osiride-lp-030 kernel: R13: 003fffffffffffff R14: 0000000000000282 R15: ffff81041eee3f00
Dec 9 11:27:08 osiride-lp-030 kernel: FS: 00002b6411676230(0000) GS:ffff81010fc954c0(0000) knlGS:00000000f6cf2b90
Dec 9 11:27:08 osiride-lp-030 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Dec 9 11:27:08 osiride-lp-030 kernel: CR2: 00000000f6140000 CR3: 0000000000201000 CR4: 00000000000006e0
Dec 9 11:27:08 osiride-lp-030 kernel:
Dec 9 11:27:08 osiride-lp-030 kernel: Call Trace:
Dec 9 11:27:08 osiride-lp-030 kernel: [<ffffffff8882ad69>] :lquota:lustre_dqget+0x679/0x7e0
Dec 9 11:27:08 osiride-lp-030 kernel: [<ffffffff8882b086>] :lquota:init_oqaq+0x56/0x1c0
Dec 9 11:27:08 osiride-lp-030 kernel: [<ffffffff8883285e>] :lquota:mds_set_dqblk+0x8de/0x2010
Dec 9 11:27:08 osiride-lp-030 kernel: [<ffffffff88732fd3>] :ptlrpc:ptl_send_buf+0x3f3/0x5b0
Dec 9 11:27:08 osiride-lp-030 kernel: [<ffffffff8873b94a>] :ptlrpc:lustre_pack_reply_flags+0x86a/0x950
Dec 9 11:27:08 osiride-lp-030 kernel: [<ffffffff80150d56>] __next_cpu+0x19/0x28
Dec 9 11:27:08 osiride-lp-030 kernel: [<ffffffff88823e9a>] :lquota:mds_quota_ctl+0x16a/0x3c0
Dec 9 11:27:08 osiride-lp-030 kernel: [<ffffffff8873ba59>] :ptlrpc:lustre_pack_reply+0x29/0xb0
Dec 9 11:27:08 osiride-lp-030 kernel: [<ffffffff88afe78f>] :mds:mds_handle+0x3d7f/0x4d10
Dec 9 11:27:08 osiride-lp-030 kernel: [<ffffffff800767ae>] smp_send_reschedule+0x4e/0x53
Dec 9 11:27:08 osiride-lp-030 kernel: [<ffffffff8008c92d>] enqueue_task+0x41/0x56
Dec 9 11:27:08 osiride-lp-030 kernel: [<ffffffff8873da35>] :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
Dec 9 11:27:08 osiride-lp-030 kernel: [<ffffffff887473b9>] :ptlrpc:ptlrpc_server_handle_request+0x989/0xe00
Dec 9 11:27:08 osiride-lp-030 kernel: [<ffffffff88747b15>] :ptlrpc:ptlrpc_wait_event+0x2e5/0x310
Dec 9 11:27:08 osiride-lp-030 kernel: [<ffffffff8008b3bd>] __wake_up_common+0x3e/0x68
Dec 9 11:27:08 osiride-lp-030 kernel: [<ffffffff88748ac8>] :ptlrpc:ptlrpc_main+0xf88/0x1150
Dec 9 11:27:08 osiride-lp-030 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Dec 9 11:27:08 osiride-lp-030 kernel: [<ffffffff8008c92d>] enqueue_task+0x41/0x56
Dec 9 11:27:08 osiride-lp-030 kernel: [<ffffffff8873da35>] :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
Dec 9 11:27:08 osiride-lp-030 kernel: [<ffffffff887473b9>] :ptlrpc:ptlrpc_server_handle_request+0x989/0xe00
Dec 9 11:27:08 osiride-lp-030 kernel: [<ffffffff88747b15>] :ptlrpc:ptlrpc_wait_event+0x2e5/0x310
Dec 9 11:27:08 osiride-lp-030 kernel: [<ffffffff8008b3bd>] __wake_up_common+0x3e/0x68
Dec 9 11:27:08 osiride-lp-030 kernel: [<ffffffff88748ac8>] :ptlrpc:ptlrpc_main+0xf88/0x1150
Dec 9 11:27:08 osiride-lp-030 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Dec 9 11:27:08 osiride-lp-030 kernel: [<ffffffff88747b40>] :ptlrpc:ptlrpc_main+0x0/0x1150
Dec 9 11:27:08 osiride-lp-030 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Dec 9 11:27:08 osiride-lp-030 kernel:
Dec 9 11:27:15 osiride-lp-030 kernel: Lustre: Service thread pid 23639 was inactive for 218.00s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
This saturates the resources of the server, and the clients are unable to
access the filesystem.
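For triage, a dump like the one above can be reduced to the lockup header plus the module and symbol of each call-trace frame. The following is a minimal Python sketch and not part of the original report: the function name `parse_lockup` and both regular expressions are assumptions derived only from the log format shown here, and frames without a `:module:` prefix are labeled `kernel` purely as a convention of this sketch.

```python
import re
from collections import Counter

# Header line, e.g. "BUG: soft lockup - CPU#8 stuck for 10s! [ll_mdt_06:21936]".
LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)
# One stack frame, e.g. "[<ffffffff8882ad69>] :lquota:lustre_dqget+0x679/0x7e0".
# The ":module:" prefix is optional (core kernel symbols do not carry one).
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?::(?P<module>\w+):)?(?P<symbol>\w+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def parse_lockup(lines):
    """Return (header, frames): header is a dict or None, frames is a
    list of (module, symbol) tuples collected after 'Call Trace:'."""
    header = None
    frames = []
    in_trace = False
    for line in lines:
        match = LOCKUP_RE.search(line)
        if match:
            header = match.groupdict()
            continue
        if "Call Trace:" in line:
            in_trace = True
            continue
        if in_trace:
            frame = FRAME_RE.search(line)
            if frame:
                frames.append((frame.group("module") or "kernel",
                               frame.group("symbol")))
    return header, frames


# A few lines copied from the dump above.
sample = [
    "Dec 9 11:27:08 osiride-lp-030 kernel: BUG: soft lockup - CPU#8 stuck for 10s! [ll_mdt_06:21936]",
    "Dec 9 11:27:08 osiride-lp-030 kernel: RIP: 0010:[<ffffffff8882a270>] [<ffffffff8882a270>] :lquota:dquot_create_oqaq+0x2b0/0x510",
    "Dec 9 11:27:08 osiride-lp-030 kernel: Call Trace:",
    "Dec 9 11:27:08 osiride-lp-030 kernel: [<ffffffff8882ad69>] :lquota:lustre_dqget+0x679/0x7e0",
    "Dec 9 11:27:08 osiride-lp-030 kernel: [<ffffffff8882b086>] :lquota:init_oqaq+0x56/0x1c0",
    "Dec 9 11:27:08 osiride-lp-030 kernel: [<ffffffff80150d56>] __next_cpu+0x19/0x28",
]

header, frames = parse_lockup(sample)
per_module = Counter(module for module, _ in frames)
print(header["comm"], header["pid"], dict(per_module))
```

Counting frames per module in this way makes it quicker to see that the stuck thread is inside the `lquota` code path when the request is a quota control operation.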
Regards