Details
-
Bug
-
Resolution: Fixed
-
Major
-
None
-
Lustre 2.1.1
-
None
-
Lustre servers are running 2.6.32-220.el6, with Lustre 2.1.1.rc4.
Lustre clients are running 2.6.38.2, with special code created for this release, with http://review.whamcloud.com/#change,2170. (patch 8)
-
2
-
6412
Description
Customer running a script that calls a Java program (see attachments). Two clients panic'd.
2012-04-13 19:21:49 +0000 [26759.401564] -----------[ cut here ]-----------
2012-04-13 19:21:49 +0000 [26759.424756] WARNING: at fs/libfs.c:363 simple_setattr+0x99/0xb0()
2012-04-13 19:21:49 +0000 [26759.455860] Hardware name: ProLiant BL460c G7
2012-04-13 19:21:49 +0000 [26759.477542] Modules linked in: lmv mgc lustre lquota lov osc mdc fid fld ksocklnd ptlrpc obdclass lnet lvfs libcfs ppdev 8021q garp bridge stp llc nfsd nfs lockd nfs_acl auth_rpcgss sunrpc netconsole configfs dm_crypt dm_mod crc32c ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi iptable_filter be2net ip_tables x_tables hpilo joydev hpwdt rtc_cmos psmouse rtc_core lp rtc_lib evdev parport mac_hid loop serio_raw tcp_scalable fuse virtio_blk virtio virtio_ring xenfs ext4 mbcache jbd2 xfs usbhid exportfs raid1 mptspi mptsas mptscsih mptbase mpt2sas raid_class arcmsr aic94xx libsas libata scsi_transport_sas aic7xxx aic79xx scsi_transport_spi megaraid_sas cciss sd_mod sg hpsa scsi_mod uhci_hcd ehci_hcd [last unloaded: libcfs]
2012-04-13 19:21:49 +0000 [26759.830142] Pid: 11122, comm: java Not tainted 2.6.38.2-ts4 #11
2012-04-13 19:21:49 +0000 [26759.860423] Call Trace:
2012-04-13 19:21:49 +0000 [26759.872545] [<ffffffff8105d10f>] ? warn_slowpath_common+0x7f/0xc0
2012-04-13 19:21:49 +0000 [26759.903121] [<ffffffff8105d16a>] ? warn_slowpath_null+0x1a/0x20
2012-04-13 19:21:49 +0000 [26759.933790] [<ffffffff8115bea9>] ? simple_setattr+0x99/0xb0
2012-04-13 19:21:50 +0000 [26759.962234] [<ffffffff8115be10>] ? simple_setattr+0x0/0xb0
2012-04-13 19:21:50 +0000 [26759.990161] [<ffffffffa1200af6>] ? ll_md_setattr+0x3e6/0x840 [lustre]
2012-04-13 19:21:50 +0000 [26760.024040] [<ffffffffa12011b4>] ? ll_setattr_raw+0x264/0xe40 [lustre]
2012-04-13 19:21:50 +0000 [26760.056779] [<ffffffffa0ba3126>] ? cfs_hash_del+0xa6/0x1d0 [libcfs]
2012-04-13 19:21:50 +0000 [26760.089867] [<ffffffffa1201ded>] ? ll_setattr+0x5d/0x100 [lustre]
2012-04-13 19:21:50 +0000 [26760.120906] [<ffffffff81153761>] ? notify_change+0x161/0x2c0
2012-04-13 19:21:50 +0000 [26760.150011] [<ffffffff811383d1>] ? do_truncate+0x61/0x90
2012-04-13 19:21:50 +0000 [26760.176623] [<ffffffff8113862d>] ? sys_ftruncate+0xdd/0xf0
2012-04-13 19:21:50 +0000 [26760.205178] [<ffffffff8100bfc2>] ? system_call_fastpath+0x16/0x1b
2012-04-13 19:21:50 +0000 [26760.235981] --[ end trace 402d40ca74c5ea86 ]--
On one OSS, about an hour before the client crash, I saw this:
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: Lustre: Service thread pid 16292 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: Pid: 16292, comm: ll_ost_io_254
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel:
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: Call Trace:
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa026f1a9>] ? LNetGet+0x3e9/0x830 [lnet]
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa02699c5>] ? LNetMDBind+0x135/0x490 [lnet]
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff8104fdef>] ? lock_timer_base+0x2b/0x4f
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff8104febd>] ? try_to_del_timer_sync+0xaa/0xb7
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff814aa778>] schedule_timeout+0x1c6/0x1ee
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff810500ce>] ? process_timeout+0x0/0x10
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa0221351>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa03b417f>] target_bulk_io+0x3af/0xa40 [ptlrpc]
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa0221910>] ? cfs_alloc+0x30/0x60 [libcfs]
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff81038bc6>] ? default_wake_function+0x0/0x14
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa038752f>] ost_brw_write+0x12af/0x2040 [ost]
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa03ef7cc>] ? lustre_msg_set_timeout+0x9c/0x110 [ptlrpc]
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff810c6edb>] ? free_hot_page+0x3f/0x44
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff810c7081>] ? __free_pages+0x5a/0x70
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa03b19c0>] ? target_bulk_timeout+0x0/0x100 [ptlrpc]
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa038e104>] ost_handle+0x2604/0x57e0 [ost]
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa03ea29e>] ? lustre_msg_get_opc+0x8e/0xf0 [ptlrpc]
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa03f735c>] ptlrpc_server_handle_request+0x4ec/0xfc0 [ptlrpc]
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff81037e95>] ? enqueue_task+0x7c/0x8b
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa02212ae>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa022deb0>] ? lc_watchdog_touch+0x70/0x150 [libcfs]
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa022dfd7>] ? lc_watchdog_disable+0x47/0x120 [libcfs]
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa03fa714>] ? ptlrpc_wait_event+0xa4/0x2d0 [ptlrpc]
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff81030fda>] ? __wake_up+0x48/0x55
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa03fae7f>] ptlrpc_main+0x53f/0x1670 [ptlrpc]
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff81003ada>] child_rip+0xa/0x20
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa03fa940>] ? ptlrpc_main+0x0/0x1670 [ptlrpc]
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff81003ad0>] ? child_rip+0x0/0x20
I don't know if it is related, or not.
Attached files:
Reproduce.java: Java program used when reproducing the problem
reproduce.sh: Shell script that calls Reproduce.Java in a loop
messages-mds: /var/log/messages from the MDS
messages-oss-1 from the first OSS
messages-oss-2 from the second OSS
users*: Client output via netconsole.