[LU-1328] Failing customer's file creation test - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Major
Fix Version/s: None
Affects Version/s: Lustre 2.1.1
Labels:
None
Environment:

Hide
Lustre servers are running 2.6.32-220.el6, with Lustre 2.1.1.rc4.
Lustre clients are running 2.6.38.2, with special code created for this release, with http://review.whamcloud.com/#change,2170. (patch 8)

Show
Lustre servers are running 2.6.32-220.el6, with Lustre 2.1.1.rc4. Lustre clients are running 2.6.38.2, with special code created for this release, with http://review.whamcloud.com/#change,2170 . (patch 8)

Severity:
2
Rank (Obsolete):
6412

Description

Customer running a script that calls a Java program (see attachments). Two clients panic'd.

2012-04-13 19:21:49 +0000 [26759.401564] -----------~~[ cut here ]~~-----------
2012-04-13 19:21:49 +0000 [26759.424756] WARNING: at fs/libfs.c:363 simple_setattr+0x99/0xb0()
2012-04-13 19:21:49 +0000 [26759.455860] Hardware name: ProLiant BL460c G7
2012-04-13 19:21:49 +0000 [26759.477542] Modules linked in: lmv mgc lustre lquota lov osc mdc fid fld ksocklnd ptlrpc obdclass lnet lvfs libcfs ppdev 8021q garp bridge stp llc nfsd nfs lockd nfs_acl auth_rpcgss sunrpc netconsole configfs dm_crypt dm_mod crc32c ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi iptable_filter be2net ip_tables x_tables hpilo joydev hpwdt rtc_cmos psmouse rtc_core lp rtc_lib evdev parport mac_hid loop serio_raw tcp_scalable fuse virtio_blk virtio virtio_ring xenfs ext4 mbcache jbd2 xfs usbhid exportfs raid1 mptspi mptsas mptscsih mptbase mpt2sas raid_class arcmsr aic94xx libsas libata scsi_transport_sas aic7xxx aic79xx scsi_transport_spi megaraid_sas cciss sd_mod sg hpsa scsi_mod uhci_hcd ehci_hcd [last unloaded: libcfs]
2012-04-13 19:21:49 +0000 [26759.830142] Pid: 11122, comm: java Not tainted 2.6.38.2-ts4 #11
2012-04-13 19:21:49 +0000 [26759.860423] Call Trace:
2012-04-13 19:21:49 +0000 [26759.872545] [<ffffffff8105d10f>] ? warn_slowpath_common+0x7f/0xc0
2012-04-13 19:21:49 +0000 [26759.903121] [<ffffffff8105d16a>] ? warn_slowpath_null+0x1a/0x20
2012-04-13 19:21:49 +0000 [26759.933790] [<ffffffff8115bea9>] ? simple_setattr+0x99/0xb0
2012-04-13 19:21:50 +0000 [26759.962234] [<ffffffff8115be10>] ? simple_setattr+0x0/0xb0
2012-04-13 19:21:50 +0000 [26759.990161] [<ffffffffa1200af6>] ? ll_md_setattr+0x3e6/0x840 [lustre]
2012-04-13 19:21:50 +0000 [26760.024040] [<ffffffffa12011b4>] ? ll_setattr_raw+0x264/0xe40 [lustre]
2012-04-13 19:21:50 +0000 [26760.056779] [<ffffffffa0ba3126>] ? cfs_hash_del+0xa6/0x1d0 [libcfs]
2012-04-13 19:21:50 +0000 [26760.089867] [<ffffffffa1201ded>] ? ll_setattr+0x5d/0x100 [lustre]
2012-04-13 19:21:50 +0000 [26760.120906] [<ffffffff81153761>] ? notify_change+0x161/0x2c0
2012-04-13 19:21:50 +0000 [26760.150011] [<ffffffff811383d1>] ? do_truncate+0x61/0x90
2012-04-13 19:21:50 +0000 [26760.176623] [<ffffffff8113862d>] ? sys_ftruncate+0xdd/0xf0
2012-04-13 19:21:50 +0000 [26760.205178] [<ffffffff8100bfc2>] ? system_call_fastpath+0x16/0x1b
2012-04-13 19:21:50 +0000 [26760.235981] --~~[ end trace 402d40ca74c5ea86 ]~~--

On one OSS, about an hour before the client crash, I saw this:

Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: Lustre: Service thread pid 16292 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: Pid: 16292, comm: ll_ost_io_254
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel:
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: Call Trace:
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa026f1a9>] ? LNetGet+0x3e9/0x830 [lnet]
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa02699c5>] ? LNetMDBind+0x135/0x490 [lnet]
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff8104fdef>] ? lock_timer_base+0x2b/0x4f
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff8104febd>] ? try_to_del_timer_sync+0xaa/0xb7
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff814aa778>] schedule_timeout+0x1c6/0x1ee
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff810500ce>] ? process_timeout+0x0/0x10
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa0221351>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa03b417f>] target_bulk_io+0x3af/0xa40 [ptlrpc]
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa0221910>] ? cfs_alloc+0x30/0x60 [libcfs]
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff81038bc6>] ? default_wake_function+0x0/0x14
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa038752f>] ost_brw_write+0x12af/0x2040 [ost]
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa03ef7cc>] ? lustre_msg_set_timeout+0x9c/0x110 [ptlrpc]
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff810c6edb>] ? free_hot_page+0x3f/0x44
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff810c7081>] ? __free_pages+0x5a/0x70
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa03b19c0>] ? target_bulk_timeout+0x0/0x100 [ptlrpc]
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa038e104>] ost_handle+0x2604/0x57e0 [ost]
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa03ea29e>] ? lustre_msg_get_opc+0x8e/0xf0 [ptlrpc]
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa03f735c>] ptlrpc_server_handle_request+0x4ec/0xfc0 [ptlrpc]
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff81037e95>] ? enqueue_task+0x7c/0x8b
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa02212ae>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa022deb0>] ? lc_watchdog_touch+0x70/0x150 [libcfs]
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa022dfd7>] ? lc_watchdog_disable+0x47/0x120 [libcfs]
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa03fa714>] ? ptlrpc_wait_event+0xa4/0x2d0 [ptlrpc]
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff81030fda>] ? __wake_up+0x48/0x55
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa03fae7f>] ptlrpc_main+0x53f/0x1670 [ptlrpc]
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff81003ada>] child_rip+0xa/0x20
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa03fa940>] ? ptlrpc_main+0x0/0x1670 [ptlrpc]
Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff81003ad0>] ? child_rip+0x0/0x20

I don't know if it is related, or not.

Attached files:

Reproduce.java: Java program used when reproducing the problem
reproduce.sh: Shell script that calls Reproduce.Java in a loop
messages-mds: /var/log/messages from the MDS
messages-oss-1 from the first OSS
messages-oss-2 from the second OSS
users*: Client output via netconsole.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

20120606-netconsole.tbz2
8 kB
06/Jun/12 1:47 PM
debug.tar.gz
9.42 MB
22/May/12 3:51 PM
enoent-20120523.tar.gz
9.64 MB
23/May/12 2:05 PM
enoent-20120524.tar.bz2
5.64 MB
25/May/12 10:30 AM
messages.mds
108 kB
22/May/12 10:29 AM
messages-mds
177 kB
16/Apr/12 5:03 PM
messages-oss-1
43 kB
16/Apr/12 5:03 PM
messages-oss-2
67 kB
16/Apr/12 5:03 PM
Reproduce.java
1 kB
16/Apr/12 5:03 PM
reproduce.sh
0.7 kB
16/Apr/12 5:03 PM
staterrs.tar.gz
2.12 MB
15/May/12 12:58 PM
usrs388.netconsole
3 kB
16/Apr/12 5:03 PM
usrs389.netconsole
3 kB
16/Apr/12 5:03 PM
usrs390.netconsole
3 kB
16/Apr/12 5:03 PM
usrs391.netconsole
3 kB
16/Apr/12 5:03 PM
usrs392.netconsole
3 kB
16/Apr/12 5:03 PM
usrs393.messages
20 kB
17/Apr/12 1:28 PM
usrs393.netconsole
2 kB
16/Apr/12 5:03 PM
usrs394.netconsole
3 kB
16/Apr/12 5:03 PM
usrs395.netconsole
3 kB
16/Apr/12 5:03 PM
usrs396.netconsole
4 kB
16/Apr/12 5:03 PM
usrs397.netconsole
3 kB
16/Apr/12 5:03 PM
usrs398.netconsole
3 kB
16/Apr/12 5:03 PM
usrs399.netconsole
3 kB
16/Apr/12 5:03 PM
usrs400.netconsole
3 kB
16/Apr/12 5:03 PM

Activity

[LU-1328] Failing customer's file creation test

Roger Spellman (Inactive) added a comment - 23/May/12 2:05 PM

I'm moving back to this thread to continue the file issue since it had gotten merged in with the OST failure...

I ran the test yesterday so the lines from May 21 don't pertain to the test. The two fids Lai asks about ([0x2003acec9:0x1:0x0] & [0x2003acec9:0x2:0x0]) don't exist in the file system, but I assume they were located under the directory indicated by [0x200000400:0xa:0x0] which is owned by one of our users who is more tolerant of random failures and so he hasn't been telling me about any issues he has had since he knows it is "use at your own risk" at this time.

I have uploaded a new tarball where I ran the test again. It contains the following files:

client.out - debug_kernel output from the client
mds.out - debug_kernel output from the mds
fid_mapping - mapping of fids to files for the fids found in client.out
error_fid_mapping - mapping of paths to fids and OSTs for the files that weren't found typescript - the output of "script"

The tarball is enoent-20120523.tar.gz.

Roger Spellman (Inactive) added a comment - 23/May/12 2:05 PM I'm moving back to this thread to continue the file issue since it had gotten merged in with the OST failure... I ran the test yesterday so the lines from May 21 don't pertain to the test. The two fids Lai asks about ( [0x2003acec9:0x1:0x0] & [0x2003acec9:0x2:0x0] ) don't exist in the file system, but I assume they were located under the directory indicated by [0x200000400:0xa:0x0] which is owned by one of our users who is more tolerant of random failures and so he hasn't been telling me about any issues he has had since he knows it is "use at your own risk" at this time. I have uploaded a new tarball where I ran the test again. It contains the following files: client.out - debug_kernel output from the client mds.out - debug_kernel output from the mds fid_mapping - mapping of fids to files for the fids found in client.out error_fid_mapping - mapping of paths to fids and OSTs for the files that weren't found typescript - the output of "script" The tarball is enoent-20120523.tar.gz.

Lai Siyao added a comment - 22/May/12 11:02 PM

When is the test conducted?

I saw two lines of error messages suspicious in messages.mds, but they are at May 21 23:00:01:
May 21 23:00:01 ts-xxxxxxxx-01 kernel: LustreError: 6386:0:(mdt_open.c:1472:mdt_reint_open()) open [0x200000400:0xa:0x0]/(logdates->[0x2003acec9:0x1:0x0]) cr_flag=01102 mode=0042775 msg_flag=0x0 failed: -21
May 21 23:00:01 ts-xxxxxxxx-01 kernel: LustreError: 6386:0:(mdt_open.c:1472:mdt_reint_open()) open [0x200000400:0xa:0x0]/(logdates->[0x2003acec9:0x2:0x0]) cr_flag=01102 mode=0042775 msg_flag=0x0 failed: -21

Could you upload script output along with this test? And you can use `lfs fid2path [0x2003acec9:0x1:0x0]` and `lfs fid2path [0x2003acec9:0x2:0x0]` to check which two files are failed to open.

Lai Siyao added a comment - 22/May/12 11:02 PM When is the test conducted? I saw two lines of error messages suspicious in messages.mds, but they are at May 21 23:00:01: May 21 23:00:01 ts-xxxxxxxx-01 kernel: LustreError: 6386:0:(mdt_open.c:1472:mdt_reint_open()) open [0x200000400:0xa:0x0] /(logdates-> [0x2003acec9:0x1:0x0] ) cr_flag=01102 mode=0042775 msg_flag=0x0 failed: -21 May 21 23:00:01 ts-xxxxxxxx-01 kernel: LustreError: 6386:0:(mdt_open.c:1472:mdt_reint_open()) open [0x200000400:0xa:0x0] /(logdates-> [0x2003acec9:0x2:0x0] ) cr_flag=01102 mode=0042775 msg_flag=0x0 failed: -21 Could you upload script output along with this test? And you can use `lfs fid2path [0x2003acec9:0x1:0x0] ` and `lfs fid2path [0x2003acec9:0x2:0x0] ` to check which two files are failed to open.

Roger Spellman (Inactive) added a comment - 22/May/12 3:51 PM

This file contains client.out and mds.out with the debug info you requested.

Roger Spellman (Inactive) added a comment - 22/May/12 3:51 PM This file contains client.out and mds.out with the debug info you requested.

Lai Siyao added a comment - 22/May/12 11:46 AM

Yes, it's better to enable debug trace on both client and server. Upon error, you can run `lctl dk /tmp/create.log`, then you can check debug trace in the log file, which contains recent debug traces, and this is why this command should be run immediately after error occurs, otherwise the old debug trace is dropped.

Lai Siyao added a comment - 22/May/12 11:46 AM Yes, it's better to enable debug trace on both client and server. Upon error, you can run `lctl dk /tmp/create.log`, then you can check debug trace in the log file, which contains recent debug traces, and this is why this command should be run immediately after error occurs, otherwise the old debug trace is dropped.

Roger Spellman (Inactive) added a comment - 22/May/12 11:17 AM

What command is run to get view the debug data?

Roger Spellman (Inactive) added a comment - 22/May/12 11:17 AM What command is run to get view the debug data?

Roger Spellman (Inactive) added a comment - 22/May/12 10:30 AM

> I have no way to debug without debug logs and messages; it would really to useful to dump debuglog upon error on both client and MDS, and it's best to enable 'info' and 'vfstrace' debug trace, you can check it by `lctl get_param debug`, if it doesn't contain 'info' or 'vfstrace', you can enable it with `lctl set_param debug=+"info vfstrace"`.

I will ask the customer to enable these. Are these enabled on both the client and servers?

Roger Spellman (Inactive) added a comment - 22/May/12 10:30 AM > I have no way to debug without debug logs and messages; it would really to useful to dump debuglog upon error on both client and MDS, and it's best to enable 'info' and 'vfstrace' debug trace, you can check it by `lctl get_param debug`, if it doesn't contain 'info' or 'vfstrace', you can enable it with `lctl set_param debug=+"info vfstrace"`. I will ask the customer to enable these. Are these enabled on both the client and servers?

Roger Spellman (Inactive) added a comment - 22/May/12 10:29 AM

This is the messages file from the MDS.

Roger Spellman (Inactive) added a comment - 22/May/12 10:29 AM This is the messages file from the MDS.

Lai Siyao added a comment - 22/May/12 3:28 AM

I have no way to debug without debug logs and messages; it would really to useful to dump debuglog upon error on both client and MDS, and it's best to enable 'info' and 'vfstrace' debug trace, you can check it by `lctl get_param debug`, if it doesn't contain 'info' or 'vfstrace', you can enable it with `lctl set_param debug=+"info vfstrace"`.

Lai Siyao added a comment - 22/May/12 3:28 AM I have no way to debug without debug logs and messages; it would really to useful to dump debuglog upon error on both client and MDS, and it's best to enable 'info' and 'vfstrace' debug trace, you can check it by `lctl get_param debug`, if it doesn't contain 'info' or 'vfstrace', you can enable it with `lctl set_param debug=+"info vfstrace"`.

Lai Siyao added a comment - 21/May/12 10:39 PM

Is there any error message on MDS server?

Lai Siyao added a comment - 21/May/12 10:39 PM Is there any error message on MDS server?

Roger Spellman (Inactive) added a comment - 21/May/12 12:04 PM

We applied the patch. The customer reports the following:

$ lfs find . -type f | sh -c 'while read file; do stat $file & done >/dev/null'
stat: cannot stat `./20120507-091222/d3/3/270.bin': No such file or directory
stat: cannot stat `./20120507-091222/d5/32/681.bin': No such file or directory
stat: cannot stat `./20120507-091222/d5/32/1525.bin': No such file or directory
stat: cannot stat `./20120507-091222/d12/0/284.bin': No such file or directory
stat: cannot stat `./20120507-091222/d12/0/1037.bin': No such file or directory
stat: cannot stat `./20120507-091222/d12/0/917.bin': No such file or directory
stat: cannot stat `./20120507-091222/d11/32/1807.bin': No such file or directory
stat: cannot stat `./20120507-091222/d10/6/2587.bin': No such file or directory
stat: cannot stat `./20120507-091222/d10/6/891.bin': No such file or directory
stat: cannot stat `./20120507-091222/d10/6/750.bin': No such file or directory
stat: cannot stat `./20120507-091222/d7/34/151.bin': No such file or directory

Here are some additional details:

This is a single client with 24 cores running the test.

"ls -l' on each file succeeds - if I perform the stat sequentially it succeeds - it only fails when I launch the stat processes in the background thus making them parallel.

Each time I run the test, I get different files that can't be found.

Roger Spellman (Inactive) added a comment - 21/May/12 12:04 PM We applied the patch. The customer reports the following: $ lfs find . -type f | sh -c 'while read file; do stat $file & done >/dev/null' stat: cannot stat `./20120507-091222/d3/3/270.bin': No such file or directory stat: cannot stat `./20120507-091222/d5/32/681.bin': No such file or directory stat: cannot stat `./20120507-091222/d5/32/1525.bin': No such file or directory stat: cannot stat `./20120507-091222/d12/0/284.bin': No such file or directory stat: cannot stat `./20120507-091222/d12/0/1037.bin': No such file or directory stat: cannot stat `./20120507-091222/d12/0/917.bin': No such file or directory stat: cannot stat `./20120507-091222/d11/32/1807.bin': No such file or directory stat: cannot stat `./20120507-091222/d10/6/2587.bin': No such file or directory stat: cannot stat `./20120507-091222/d10/6/891.bin': No such file or directory stat: cannot stat `./20120507-091222/d10/6/750.bin': No such file or directory stat: cannot stat `./20120507-091222/d7/34/151.bin': No such file or directory Here are some additional details: This is a single client with 24 cores running the test. "ls -l' on each file succeeds - if I perform the stat sequentially it succeeds - it only fails when I launch the stat processes in the background thus making them parallel. Each time I run the test, I get different files that can't be found.

Lai Siyao added a comment - 17/May/12 8:21 PM

Yes, the patch affects Lustre MDS server only, and only touches one file.

Lai Siyao added a comment - 17/May/12 8:21 PM Yes, the patch affects Lustre MDS server only, and only touches one file.

People

Assignee:: Lai Siyao

Reporter:: Roger Spellman (Inactive)

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 16/Apr/12 5:03 PM

Updated:: 27/Sep/12 5:49 PM

Resolved:: 01/Jul/12 10:58 PM