Details
-
Bug
-
Resolution: Fixed
-
Major
-
None
-
Lustre 2.1.1
-
None
-
Lustre servers are running 2.6.32-220.el6, with Lustre 2.1.1.rc4.
Lustre clients are running 2.6.38.2, with special code created for this release, with http://review.whamcloud.com/#change,2170. (patch 8)
-
2
-
6412
Description
Customer running a script that calls a Java program (see attachments). Two clients panic'd.
2012-04-13 19:21:49 +0000 [26759.401564] -----------[ cut here ]-----------
2012-04-13 19:21:49 +0000 [26759.424756] WARNING: at fs/libfs.c:363 simple_setattr+0x99/0xb0()
2012-04-13 19:21:49 +0000 [26759.455860] Hardware name: ProLiant BL460c G7
2012-04-13 19:21:49 +0000 [26759.477542] Modules linked in: lmv mgc lustre lquota lov osc mdc fid fld ksocklnd ptlrpc obdclass lnet lvfs libcfs ppdev 8021q garp bridge stp llc nfsd nfs lockd nfs_acl auth_rpcgss sunrpc netconsole configfs dm_crypt dm_mod crc32c ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi iptable_filter be2net ip_tables x_tables hpilo joydev hpwdt rtc_cmos psmouse rtc_core lp rtc_lib evdev parport mac_hid loop serio_raw tcp_scalable fuse virtio_blk virtio virtio_ring xenfs ext4 mbcache jbd2 xfs usbhid exportfs raid1 mptspi mptsas mptscsih mptbase mpt2sas raid_class arcmsr aic94xx libsas libata scsi_transport_sas aic7xxx aic79xx scsi_transport_spi megaraid_sas cciss sd_mod sg hpsa scsi_mod uhci_hcd ehci_hcd [last unloaded: libcfs]
2012-04-13 19:21:49 +0000 [26759.830142] Pid: 11122, comm: java Not tainted 2.6.38.2-ts4 #11
2012-04-13 19:21:49 +0000 [26759.860423] Call Trace:
2012-04-13 19:21:49 +0000 [26759.872545] [<ffffffff8105d10f>] ? warn_slowpath_common+0x7f/0xc0
2012-04-13 19:21:49 +0000 [26759.903121] [<ffffffff8105d16a>] ? warn_slowpath_null+0x1a/0x20
2012-04-13 19:21:49 +0000 [26759.933790] [<ffffffff8115bea9>] ? simple_setattr+0x99/0xb0
2012-04-13 19:21:50 +0000 [26759.962234] [<ffffffff8115be10>] ? simple_setattr+0x0/0xb0
2012-04-13 19:21:50 +0000 [26759.990161] [<ffffffffa1200af6>] ? ll_md_setattr+0x3e6/0x840 [lustre]
2012-04-13 19:21:50 +0000 [26760.024040] [<ffffffffa12011b4>] ? ll_setattr_raw+0x264/0xe40 [lustre]
2012-04-13 19:21:50 +0000 [26760.056779] [<ffffffffa0ba3126>] ? cfs_hash_del+0xa6/0x1d0 [libcfs]
2012-04-13 19:21:50 +0000 [26760.089867] [<ffffffffa1201ded>] ? ll_setattr+0x5d/0x100 [lustre]
2012-04-13 19:21:50 +0000 [26760.120906] [<ffffffff81153761>] ? notify_change+0x161/0x2c0
2012-04-13 19:21:50 +0000 [26760.150011] [<ffffffff811383d1>] ? do_truncate+0x61/0x90
2012-04-13 19:21:50 +0000 [26760.176623] [<ffffffff8113862d>] ? sys_ftruncate+0xdd/0xf0
2012-04-13 19:21:50 +0000 [26760.205178] [<ffffffff8100bfc2>] ? system_call_fastpath+0x16/0x1b
2012-04-13 19:21:50 +0000 [26760.235981] --[ end trace 402d40ca74c5ea86 ]--
On one OSS, about an hour before the client crash, I saw this:
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: Lustre: Service thread pid 16292 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: Pid: 16292, comm: ll_ost_io_254
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel:
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: Call Trace:
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa026f1a9>] ? LNetGet+0x3e9/0x830 [lnet]
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa02699c5>] ? LNetMDBind+0x135/0x490 [lnet]
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff8104fdef>] ? lock_timer_base+0x2b/0x4f
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff8104febd>] ? try_to_del_timer_sync+0xaa/0xb7
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff814aa778>] schedule_timeout+0x1c6/0x1ee
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff810500ce>] ? process_timeout+0x0/0x10
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa0221351>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa03b417f>] target_bulk_io+0x3af/0xa40 [ptlrpc]
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa0221910>] ? cfs_alloc+0x30/0x60 [libcfs]
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff81038bc6>] ? default_wake_function+0x0/0x14
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa038752f>] ost_brw_write+0x12af/0x2040 [ost]
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa03ef7cc>] ? lustre_msg_set_timeout+0x9c/0x110 [ptlrpc]
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff810c6edb>] ? free_hot_page+0x3f/0x44
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff810c7081>] ? __free_pages+0x5a/0x70
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa03b19c0>] ? target_bulk_timeout+0x0/0x100 [ptlrpc]
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa038e104>] ost_handle+0x2604/0x57e0 [ost]
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa03ea29e>] ? lustre_msg_get_opc+0x8e/0xf0 [ptlrpc]
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa03f735c>] ptlrpc_server_handle_request+0x4ec/0xfc0 [ptlrpc]
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff81037e95>] ? enqueue_task+0x7c/0x8b
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa02212ae>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa022deb0>] ? lc_watchdog_touch+0x70/0x150 [libcfs]
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa022dfd7>] ? lc_watchdog_disable+0x47/0x120 [libcfs]
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa03fa714>] ? ptlrpc_wait_event+0xa4/0x2d0 [ptlrpc]
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff81030fda>] ? __wake_up+0x48/0x55
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa03fae7f>] ptlrpc_main+0x53f/0x1670 [ptlrpc]
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff81003ada>] child_rip+0xa/0x20
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa03fa940>] ? ptlrpc_main+0x0/0x1670 [ptlrpc]
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff81003ad0>] ? child_rip+0x0/0x20
I don't know if it is related, or not.
Attached files:
Reproduce.java: Java program used when reproducing the problem
reproduce.sh: Shell script that calls Reproduce.Java in a loop
messages-mds: /var/log/messages from the MDS
messages-oss-1 from the first OSS
messages-oss-2 from the second OSS
users*: Client output via netconsole.
I'm moving back to this thread to continue the file issue since it had gotten merged in with the OST failure...
I ran the test yesterday so the lines from May 21 don't pertain to the test. The two fids Lai asks about ([0x2003acec9:0x1:0x0] & [0x2003acec9:0x2:0x0]) don't exist in the file system, but I assume they were located under the directory indicated by [0x200000400:0xa:0x0] which is owned by one of our users who is more tolerant of random failures and so he hasn't been telling me about any issues he has had since he knows it is "use at your own risk" at this time.
I have uploaded a new tarball where I ran the test again. It contains the following files:
client.out - debug_kernel output from the client
mds.out - debug_kernel output from the mds
fid_mapping - mapping of fids to files for the fids found in client.out
error_fid_mapping - mapping of paths to fids and OSTs for the files that weren't found typescript - the output of "script"
The tarball is enoent-20120523.tar.gz.