[LU-2124] Test failure on test suite obdfilter-survey, subtest test_1a Created: 09/Oct/12  Updated: 04/Nov/13  Resolved: 04/Nov/13

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0, Lustre 2.4.1
Fix Version/s: None

Type: Bug
Priority: Blocker
Reporter: Maloo
Assignee: WC Triage
Resolution: Fixed
Votes: 0
Labels: zfs

Issue Links:
Related
is related to LU-2887 sanity-quota test_12a: slow due to ZF... Resolved
Severity: 3
Rank (Obsolete): 5125

 Description   

This issue was created by maloo for Li Wei <liwei@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/584999d6-1207-11e2-a663-52540035b04c.

The sub-test test_1a failed with the following error:

test failed to respond and timed out

Info required for matching: obdfilter-survey 1a



 Comments   
Comment by Nathaniel Clark [ 23/Jul/13 ]

OST console log:

21:57:00:Lustre: DEBUG MARKER: == obdfilter-survey test 1a: Object Storage Targets survey =========================================== 21:56:49 (1349758609)
21:57:00:Lustre: DEBUG MARKER: lctl dl | grep obdfilter
21:57:00:Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep tcp | cut -f 1 -d '@'
22:48:59:hrtimer: interrupt took 55369 ns
Comment by Jian Yu [ 09/Sep/13 ]

Lustre build: http://build.whamcloud.com/job/lustre-b2_4/45/ (2.4.1 RC2)
Distro/Arch: RHEL6.4/x86_64
FSTYPE=zfs

obdfilter-survey test 1a hung as follows:

== obdfilter-survey test 1a: Object Storage Targets survey == 23:44:44 (1378622684)
CMD: client-24vm4 lctl dl | grep obdfilter
CMD: client-24vm4 /usr/sbin/lctl list_nids | grep tcp | cut -f 1 -d '@'
+ NETTYPE=tcp thrlo=8 nobjhi=1 thrhi=16 size=1024 case=disk rslt_loc=/tmp targets="10.10.4.119:lustre-OST0000 10.10.4.119:lustre-OST0001 10.10.4.119:lustre-OST0002 10.10.4.119:lustre-OST0003 10.10.4.119:lustre-OST0004 10.10.4.119:lustre-OST0005 10.10.4.119:lustre-OST0006" /usr/bin/obdfilter-survey
Warning: Permanently added '10.10.4.119' (RSA) to the list of known hosts.
Sat Sep  7 23:44:51 PDT 2013 Obdfilter-survey for case=disk from client-24vm2.lab.whamcloud.com
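
For reference, the seven-entry targets string in the invocation above follows the usual host:fsname-OSTxxxx pattern, with a four-digit hexadecimal OST index. The sketch below rebuilds that string and the environment shown in the log. The helper names `build_targets` and `survey_env` are hypothetical and are not part of the Lustre test framework; the values are copied from the logged command line.

```python
def build_targets(host, fsname, ost_count):
    """Return a space-separated target list such as 'host:lustre-OST0000 ...'.

    Hypothetical helper: assumes OST indices start at 0 and are
    formatted as four hexadecimal digits.
    """
    return " ".join(f"{host}:{fsname}-OST{i:04x}" for i in range(ost_count))


def survey_env(host, fsname, ost_count):
    """Return the environment variables seen in the logged invocation."""
    return {
        "NETTYPE": "tcp",
        "thrlo": "8",
        "nobjhi": "1",
        "thrhi": "16",
        "size": "1024",
        "case": "disk",
        "rslt_loc": "/tmp",
        "targets": build_targets(host, fsname, ost_count),
    }


env = survey_env("10.10.4.119", "lustre", 7)
for name, value in env.items():
    print(f"{name}={value}")
```

Passing a smaller count to `build_targets` shortens the list accordingly, which may be useful when comparing runs with different numbers of OSTs.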

Dmesg on OSS node client-24vm4 showed lctl blocked in uninterruptible sleep (state D), waiting on ZFS I/O:

lctl          D 0000000000000000     0 19552  19496 0x00000080
 ffff88001bf65748 0000000000000086 ffff8800ffffffff 0000126bad99a78e
 ffff880061356070 ffff8800618efec0 00000000003e8684 ffffffffadd3ec96
 ffff88001fefdaf8 ffff88001bf65fd8 000000000000fb88 ffff88001fefdaf8
Call Trace:
 [<ffffffff810a2431>] ? ktime_get_ts+0xb1/0xf0
 [<ffffffff8150ed03>] io_schedule+0x73/0xc0
 [<ffffffffa03e6d4c>] cv_wait_common+0x8c/0x100 [spl]
 [<ffffffff81096da0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa03e6dd8>] __cv_wait_io+0x18/0x20 [spl]
 [<ffffffffa052939b>] zio_wait+0xfb/0x190 [zfs]
 [<ffffffffa049f07d>] dmu_buf_hold_array_by_dnode+0x1dd/0x560 [zfs]
 [<ffffffffa049ff88>] dmu_buf_hold_array_by_bonus+0x68/0x90 [zfs]
 [<ffffffffa0dc1b33>] osd_bufs_get+0x493/0xa30 [osd_zfs]
 [<ffffffffa0e609cb>] ofd_preprw_read+0x14b/0x7f0 [ofd]
 [<ffffffffa0e617ea>] ofd_preprw+0x77a/0x1480 [ofd]
 [<ffffffffa05a7473>] echo_client_iocontrol+0x2003/0x3b40 [obdecho]
 [<ffffffff81281826>] ? vsnprintf+0x336/0x5e0
 [<ffffffffa071049f>] class_handle_ioctl+0x12ff/0x1ec0 [obdclass]
 [<ffffffffa06f82ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
 [<ffffffff81195352>] vfs_ioctl+0x22/0xa0
 [<ffffffff8103c7d8>] ? pvclock_clocksource_read+0x58/0xd0
 [<ffffffff811954f4>] do_vfs_ioctl+0x84/0x580
 [<ffffffff81195a71>] sys_ioctl+0x81/0xa0
 [<ffffffff810dc685>] ? __audit_syscall_exit+0x265/0x290
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
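
To make the blocked path in the call trace above easier to read, the following minimal sketch extracts the function and module from each frame and drops the entries marked with "?", which the kernel prints for addresses it found on the stack but could not verify as part of the call chain. The regular expression and the `parse_frames` helper are illustrative assumptions, not an existing tool; the trace excerpt is copied from the dmesg output.

```python
import re

# Frame lines look like: " [<ffffffffa052939b>] zio_wait+0xfb/0x190 [zfs]"
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?"
)


def parse_frames(text):
    """Return (function, module, unreliable) tuples for each frame line."""
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.search(line)
        if m is None:
            continue
        unreliable = m.group(1) is not None
        # Frames without a trailing [module] belong to the core kernel.
        frames.append((m.group(2), m.group(3) or "kernel", unreliable))
    return frames


TRACE = """\
 [<ffffffff810a2431>] ? ktime_get_ts+0xb1/0xf0
 [<ffffffff8150ed03>] io_schedule+0x73/0xc0
 [<ffffffffa03e6d4c>] cv_wait_common+0x8c/0x100 [spl]
 [<ffffffff81096da0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa03e6dd8>] __cv_wait_io+0x18/0x20 [spl]
 [<ffffffffa052939b>] zio_wait+0xfb/0x190 [zfs]
 [<ffffffffa049f07d>] dmu_buf_hold_array_by_dnode+0x1dd/0x560 [zfs]
 [<ffffffffa049ff88>] dmu_buf_hold_array_by_bonus+0x68/0x90 [zfs]
 [<ffffffffa0dc1b33>] osd_bufs_get+0x493/0xa30 [osd_zfs]
 [<ffffffffa0e609cb>] ofd_preprw_read+0x14b/0x7f0 [ofd]
 [<ffffffffa0e617ea>] ofd_preprw+0x77a/0x1480 [ofd]
 [<ffffffffa05a7473>] echo_client_iocontrol+0x2003/0x3b40 [obdecho]
"""

# Keep only the frames the kernel did not flag with "?".
reliable = [(func, mod) for func, mod, unreliable in parse_frames(TRACE) if not unreliable]
for func, mod in reliable:
    print(f"{mod:10s} {func}")
```

Read from the bottom up, the filtered frames go from the echo client ioctl (obdecho) through ofd and osd_zfs into the ZFS DMU, ending in zio_wait and io_schedule, which matches a read that is waiting on disk I/O.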

Maloo report: https://maloo.whamcloud.com/test_sets/f9d6f946-18ab-11e3-aa54-52540035b04c

The same failure also occurred on previous Lustre b2_4 builds:
https://maloo.whamcloud.com/test_sets/0ce085aa-169c-11e3-aa2a-52540035b04c
https://maloo.whamcloud.com/test_sets/92b17690-16b4-11e3-8c83-52540035b04c
https://maloo.whamcloud.com/test_sets/f1befe08-1657-11e3-aa2a-52540035b04c
https://maloo.whamcloud.com/test_sets/ecb6f352-1409-11e3-980d-52540035b04c
https://maloo.whamcloud.com/test_sets/e10c9e46-13f3-11e3-9e61-52540035b04c

Comment by Jian Yu [ 01/Nov/13 ]

Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/46/
FSTYPE=zfs

The same failure occurred:
https://maloo.whamcloud.com/test_sets/42ba0f84-3064-11e3-b28a-52540035b04c

We'll see whether the timeout failure disappears after TEI-790 is resolved.

Comment by Jian Yu [ 04/Nov/13 ]

Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/47/
FSTYPE=zfs

With OSTCOUNT=2, obdfilter-survey test 1a passed:
https://maloo.whamcloud.com/test_sets/a488f632-4453-11e3-8472-52540035b04c

Let's close this ticket.
