[LU-6008] sanity test_102b: setstripe failed Created: 09/Dec/14  Updated: 23/Dec/15  Resolved: 23/Dec/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.4
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Duplicate Votes: 0
Labels: zfs

Issue Links:
Related: is related to LU-5242 "Test hang sanity test_132, test_133: ..." (Resolved)
Severity: 3
Rank (Obsolete): 16745

 Description   

This issue was created by maloo for Li Wei <liwei@whamcloud.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/896d09b2-7cbd-11e4-b42a-5254006e85c2.

The sub-test test_102b failed with the following error:

setstripe failed

More detailed test output:

== sanity test 102b: getfattr/setfattr for trusted.lov EAs ============== 06:04:38 (1417759478)
get/set/list trusted.lov xattr ...
error on ioctl 0x4008669a for '/mnt/lustre/f102b.sanity' (3): File too large
error: setstripe: create stripe file '/mnt/lustre/f102b.sanity' failed
 sanity test_102b: @@@@@@ FAIL: setstripe failed 
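
As a hypothetical minimal reproduction of the failing step (the stripe size, index, and count below are illustrative, not necessarily the exact values used by sanity test_102b):

  # On a client, as root; the failing ioctl is issued by lfs setstripe:
  lfs setstripe -S 65536 -i 1 -c 2 /mnt/lustre/f102b.sanity
  # The test then reads the striping back through the trusted.lov EA:
  getfattr -n trusted.lov -e hex /mnt/lustre/f102b.sanity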

From the MDS debug log:

00000004:02000400:1.0:1417759538.570621:0:15379:0:(osp_precreate.c:1021:osp_precreate_timeout_condition()) lustre-OST0001-osc-MDT0000: slow creates, last=[0x100010000:0x1561:0x0], next=[0x100010000:0x1561:0x0], reserved=0, syn_changes=35, syn_rpc_in_progress=1, status=-19
00000004:00000001:1.0:1417759538.574056:0:15379:0:(osp_precreate.c:1121:osp_precreate_reserve()) Process leaving (rc=18446744073709551506 : -110 : ffffffffffffff92)
00000004:00000001:1.0:1417759538.574060:0:15379:0:(osp_object.c:219:osp_declare_object_create()) Process leaving (rc=18446744073709551506 : -110 : ffffffffffffff92)
00020000:00001000:1.0:1417759538.574063:0:15379:0:(lod_qos.c:587:lod_qos_declare_object_on()) can't declare creation on #1: -110
00000004:00000001:1.0:1417759538.574065:0:15379:0:(osp_object.c:398:osp_object_release()) Process entered
00000004:00000001:1.0:1417759538.574067:0:15379:0:(osp_object.c:413:osp_object_release()) Process leaving
00000020:00000001:1.0:1417759538.574069:0:15379:0:(lustre_fid.h:719:fid_flatten32()) Process leaving (rc=4261412608 : 4261412608 : fdffff00)
00000004:00000010:1.0:1417759538.574073:0:15379:0:(osp_object.c:390:osp_object_free()) slab-freed '(obj)': 200 at ffff880034f0c9e0.
00020000:00000001:1.0:1417759538.574076:0:15379:0:(lod_qos.c:593:lod_qos_declare_object_on()) Process leaving (rc=18446744073709551506 : -110 : ffffffffffffff92)
00020000:00001000:1.0:1417759538.574077:0:15379:0:(lod_qos.c:912:lod_alloc_specific()) can't declare new object on #1: -110
00020000:00020000:1.0:1417759538.574080:0:15379:0:(lod_qos.c:941:lod_alloc_specific()) can't lstripe objid [0x2000013a0:0x176:0x0]: have 1 want 2
00020000:00000001:1.0:1417759538.575511:0:15379:0:(lod_qos.c:950:lod_alloc_specific()) Process leaving (rc=18446744073709551589 : -27 : ffffffffffffffe5)
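
The precreate state that osp_precreate_timeout_condition() is complaining about can be inspected on the MDS. The parameter names below are a sketch based on standard OSP tunables and were not collected from this run:

  # status=-19 (ENODEV) in the log should show up as the prealloc status:
  lctl get_param osp.lustre-OST0001-osc-MDT0000.prealloc_status
  lctl get_param osp.lustre-OST0001-osc-MDT0000.prealloc_last_id
  lctl get_param osp.lustre-OST0001-osc-MDT0000.prealloc_next_id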

From the OSS dmesg log (before this test):

[...]
INFO: task txg_sync:9775 blocked for more than 120 seconds.
      Tainted: P           ---------------    2.6.32-431.29.2.el6_lustre.g6b22a20.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
txg_sync      D 0000000000000000     0  9775      2 0x00000080
 ffff88006b6f7ba0 0000000000000046 00000000ffffffff 00001204a652a9e0
 ffff88006b6f7b10 ffff8800695079f0 0000000000148ca8 ffffffffab7ecaf4
 ffff880073bed098 ffff88006b6f7fd8 000000000000fbc8 ffff880073bed098
Call Trace:
 [<ffffffff810a6d31>] ? ktime_get_ts+0xb1/0xf0
 [<ffffffff81529ea3>] io_schedule+0x73/0xc0
 [<ffffffffa014341c>] cv_wait_common+0x8c/0x100 [spl]
 [<ffffffff8109afa0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa01434a8>] __cv_wait_io+0x18/0x20 [spl]
 [<ffffffffa02890ab>] zio_wait+0xfb/0x1b0 [zfs]
 [<ffffffffa021e9e3>] dsl_pool_sync+0xb3/0x3f0 [zfs]
 [<ffffffffa0236e4b>] spa_sync+0x40b/0xa60 [zfs]
 [<ffffffffa0240711>] ? spa_txg_history_set+0xb1/0xd0 [zfs]
 [<ffffffffa0243916>] txg_sync_thread+0x2e6/0x510 [zfs]
 [<ffffffff810591a9>] ? set_user_nice+0xc9/0x130
 [<ffffffffa0243630>] ? txg_sync_thread+0x0/0x510 [zfs]
 [<ffffffffa013ec2f>] thread_generic_wrapper+0x5f/0x70 [spl]
 [<ffffffffa013ebd0>] ? thread_generic_wrapper+0x0/0x70 [spl]
 [<ffffffff8109abf6>] kthread+0x96/0xa0
 [<ffffffff8100c20a>] child_rip+0xa/0x20
 [<ffffffff8109ab60>] ? kthread+0x0/0xa0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20
Lustre: lustre-OST0001: Client lustre-MDT0000-mdtlov_UUID (at 10.1.5.68@tcp) reconnecting
Lustre: Skipped 25 previous similar messages
Lustre: lustre-OST0001: Client lustre-MDT0000-mdtlov_UUID (at 10.1.5.68@tcp) refused reconnection, still busy with 1 active RPCs
Lustre: Skipped 25 previous similar messages
Lustre: lustre-OST0001: Client lustre-MDT0000-mdtlov_UUID (at 10.1.5.68@tcp) reconnecting
Lustre: Skipped 51 previous similar messages
Lustre: lustre-OST0001: Client lustre-MDT0000-mdtlov_UUID (at 10.1.5.68@tcp) refused reconnection, still busy with 1 active RPCs
Lustre: Skipped 51 previous similar messages
[...]
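
If this reproduces, the following OSS-side checks would show whether txg_sync is making any progress. These are standard ZFS-on-Linux and kernel interfaces, not output taken from this run, and the pool name "lustre-ost1" is a placeholder:

  # Per-txg history; a stuck sync shows the same syncing txg across samples:
  cat /proc/spl/kstat/zfs/lustre-ost1/txgs
  # Check whether the backing vdevs are doing any I/O at all:
  zpool iostat -v lustre-ost1 5 3
  # Re-dump blocked tasks for comparison with the trace above:
  echo w > /proc/sysrq-trigger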

See also the portions of the OSS dmesg log covering the later test failures in the same session.

Info required for matching: sanity 102b



 Comments   
Comment by Andreas Dilger [ 09/Dec/14 ]

There is some chance that the OST ZFS sync/wait problem is related to LU-5242. The "File too large" error being printed comes from the -EFBIG return code of lod_alloc_specific() on the MDS; it just means that the MDS could not allocate enough objects for the requested striping.

Comment by Andreas Dilger [ 23/Dec/15 ]

Closing this as a duplicate of LU-5242, since it hasn't failed in a long time.
