[LU-4177] Failure on test suite parallel-scale-nfsv3 test_iorssf Created: 29/Oct/13  Updated: 08/Jan/14  Resolved: 30/Oct/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Di Wang
Resolution: Duplicate Votes: 0
Labels: dne
Environment:

client and server: lustre-b2_5 build #2 RHEL6 ldiskfs


Severity: 3
Bugzilla ID: 1639
Rank (Obsolete): 11308

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/861fabce-40a6-11e3-a5c2-52540035b04c.

The sub-test test_iorssf failed with the following error:

test_iorssf returned 1

client dmesg shows:

Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test iorssf: iorssf == 01:40:35 (1383036035)
INFO: task IOR:13356 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
IOR           D 0000000000000000     0 13356  13352 0x00000080
 ffff880324b4bc88 0000000000000082 0000000000000000 ffff880324b4bc4c
 0000000000000000 ffff88033fc24300 ffff880032e76700 0000000000000400
 ffff880323f4c638 ffff880324b4bfd8 000000000000fb88 ffff880323f4c638
Call Trace:
 [<ffffffff8150f78e>] __mutex_lock_slowpath+0x13e/0x180
 [<ffffffff8150f62b>] mutex_lock+0x2b/0x50
 [<ffffffff8111c4b1>] generic_file_aio_write+0x71/0x100
 [<ffffffff8100b9ce>] ? common_interrupt+0xe/0x13
 [<ffffffffa0386f8e>] nfs_file_write+0xde/0x1f0 [nfs]
 [<ffffffff8118106a>] do_sync_write+0xfa/0x140
 [<ffffffff81096da0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8100bb8e>] ? apic_timer_interrupt+0xe/0x20
 [<ffffffff8121bed6>] ? security_file_permission+0x16/0x20
 [<ffffffff81181368>] vfs_write+0xb8/0x1a0
 [<ffffffff81181c61>] sys_write+0x51/0x90
 [<ffffffff810dc685>] ? __audit_syscall_exit+0x265/0x290
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Lustre: DEBUG MARKER: parallel-scale-nfsv3 test_iorssf: @@@@@@ FAIL: test_iorssf failed with 1


 Comments   
Comment by Sarah Liu [ 29/Oct/13 ]

same error seen in parallel-scale-nfsv4 test_iorssf:

https://maloo.whamcloud.com/test_sets/01509f4c-40a7-11e3-a5c2-52540035b04c

Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test iorssf: iorssf == 02:27:02 (1383038822)
INFO: task IOR:17809 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
IOR           D 0000000000000002     0 17809  17806 0x00000080
 ffff8803293e9c88 0000000000000086 0000000000000000 ffff880032e36700
 0000000000000002 0000000000000001 ffff8803293e9c28 ffffffff81065d54
 ffff88032cfb1058 ffff8803293e9fd8 000000000000fb88 ffff88032cfb1058
Call Trace:
 [<ffffffff81065d54>] ? enqueue_task_fair+0x64/0x100
 [<ffffffff8150f78e>] __mutex_lock_slowpath+0x13e/0x180
 [<ffffffff8150f62b>] mutex_lock+0x2b/0x50
 [<ffffffff8111c4b1>] generic_file_aio_write+0x71/0x100
 [<ffffffffa0386f8e>] nfs_file_write+0xde/0x1f0 [nfs]
 [<ffffffff8118106a>] do_sync_write+0xfa/0x140
 [<ffffffff81096da0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8121bed6>] ? security_file_permission+0x16/0x20
 [<ffffffff81181368>] vfs_write+0xb8/0x1a0
 [<ffffffff81181c61>] sys_write+0x51/0x90
 [<ffffffff810dc685>] ? __audit_syscall_exit+0x265/0x290
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
INFO: task IOR:17810 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
IOR           D 0000000000000002     0 17810  17806 0x00000080
 ffff88032bb81c88 0000000000000086 0000000000000000 ffff88032bb81c88
 0000000000000000 0000000000000000 00000000ffffffff ffff88032bb80000
 ffff880334f01ab8 ffff88032bb81fd8 000000000000fb88 ffff880334f01ab8
Call Trace:
 [<ffffffff8150e18a>] ? thread_return+0xa8/0x76e
 [<ffffffff8150f78e>] __mutex_lock_slowpath+0x13e/0x180
 [<ffffffff8150f62b>] mutex_lock+0x2b/0x50
 [<ffffffff8111c4b1>] generic_file_aio_write+0x71/0x100
 [<ffffffffa0386f8e>] nfs_file_write+0xde/0x1f0 [nfs]
 [<ffffffff8118106a>] do_sync_write+0xfa/0x140
 [<ffffffff81096da0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff811c419b>] ? inotify_inode_queue_event+0xeb/0x120
 [<ffffffff8100bb8e>] ? apic_timer_interrupt+0xe/0x20
 [<ffffffff8121bed6>] ? security_file_permission+0x16/0x20
 [<ffffffff81181368>] vfs_write+0xb8/0x1a0
 [<ffffffff8117f8fa>] ? vfs_llseek+0x2a/0x40
 [<ffffffff81181c61>] sys_write+0x51/0x90
 [<ffffffff810dc685>] ? __audit_syscall_exit+0x265/0x290
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
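Both traces show IOR writer tasks parked in mutex_lock() inside generic_file_aio_write(), i.e. queued on the shared file's inode mutex. As a hedged userspace sketch (not from the original logs; the lock name is a stand-in for the kernel's per-inode i_mutex), single-shared-file writes funneled through one client serialize like this:

```python
import threading, time

i_mutex = threading.Lock()  # stand-in for the VFS per-inode i_mutex

def writer(task_id, results):
    # every write to the shared file takes the inode mutex first,
    # so N tasks on one client make progress one at a time
    with i_mutex:
        time.sleep(0.01)  # stand-in for the actual NFS write
        results.append(task_id)

results = []
threads = [threading.Thread(target=writer, args=(i, results)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(results))  # all 4 writes complete, strictly one at a time
```

With enough contending writers and slow enough writes, a task can sit on the mutex past the 120-second hung-task threshold, which is what the warnings above report.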
Comment by Jodi Levi (Inactive) [ 29/Oct/13 ]

Di,
Could you please have a look and comment on this one?

Comment by Di Wang [ 30/Oct/13 ]

Path: /mnt/lustre/d0.ior.ssf
FS: 28.2 GiB Used FS: 4.4% Inodes: 1.8 Mi Used Inodes: 0.0%
Participating tasks: 4

Summary:
api = POSIX
test filename = /mnt/lustre/d0.ior.ssf/iorData
access = single-shared-file
pattern = segmented (1 segment)
ordering in a file = sequential offsets
ordering inter file= no tasks offsets
clients = 4 (2 per node)
repetitions = 1
xfersize = 2 MiB
blocksize = 3 GiB
aggregate filesize = 12 GiB

Commencing write performance test.
Tue Oct 29 01:40:36 2013

Commencing read performance test.
Tue Oct 29 01:52:55 2013

Operation Max (MiB) Min (MiB) Mean (MiB) Std Dev Max (OPs) Min (OPs) Mean (OPs) Std Dev Mean (s) Op grep #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize

--------- --------- --------- ---------- ------- --------- --------- ---------- ------- --------
write 16.64 16.64 16.64 0.00 8.32 8.32 8.32 0.00 738.47005 4 2 1 0 0 1 0 0 1 3221225472 2097152 12884901888 -1 POSIX EXCEL
read 177.79 177.79 177.79 0.00 88.89 88.89 88.89 0.00 69.11676 4 2 1 0 0 1 0 0 1 3221225472 2097152 12884901888 -1 POSIX EXCEL

Max Write: 16.64 MiB/sec (17.45 MB/sec)
Max Read: 177.79 MiB/sec (186.42 MB/sec)

Run finished: Tue Oct 29 01:54:04 2013
rm: cannot remove `/mnt/lustre/d0.ior.ssf': Directory not empty
parallel-scale-nfsv3 test_iorssf: @@@@@@ FAIL: test_iorssf failed with 1
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:4264:error_noexit()
= /usr/lib64/lustre/tests/test-framework.sh:4291:error()
= /usr/lib64/lustre/tests/test-framework.sh:4530:run_one()
= /usr/lib64/lustre/tests/test-framework.sh:4563:run_one_logged()
= /usr/lib64/lustre/tests/test-framework.sh:4433:run_test()
= /usr/lib64/lustre/tests/parallel-scale-nfs.sh:93:main()
Dumping lctl log to /tmp/test_logs/2013-10-29/012032/parallel-scale-nfsv3.test_iorssf.*.1383036845.log
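The paired MiB/s and MB/s figures above differ only in units (1 MiB = 2^20 bytes, 1 MB = 10^6 bytes). A quick sketch to check the conversion:

```python
def mib_to_mb(mib_per_s):
    """Convert MiB/s (2**20 bytes) to MB/s (10**6 bytes)."""
    return mib_per_s * 2**20 / 10**6

print(round(mib_to_mb(16.64), 2))   # write: 17.45 MB/sec
print(round(mib_to_mb(177.79), 2))  # read: ~186.4 MB/sec
```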

It seems to be a duplicate of LU-1639.

Comment by Di Wang [ 30/Oct/13 ]

I checked the debug log on the MDS; it seems to show the directory is indeed not empty:

00000004:00000001:3.0:1383036844.842826:0:3111:0:(osd_handler.c:4729:osd_it_ea_next()) Process entered
00000004:00000001:3.0:1383036844.842827:0:3111:0:(osd_handler.c:4737:osd_it_ea_next()) Process leaving (rc=0 : 0 : 0)
00000004:00000001:3.0:1383036844.842828:0:3111:0:(osd_handler.c:4565:osd_it_ea_fini()) Process entered
00000020:00000001:3.0:1383036844.842830:0:3111:0:(lustre_fid.h:719:fid_flatten32()) Process leaving (rc=29361006 : 29361006 : 1c0036e)
00000004:00000001:3.0:1383036844.842831:0:3111:0:(osd_handler.c:4568:osd_it_ea_fini()) Process leaving
00000004:00000001:3.0:1383036844.842832:0:3111:0:(mdd_dir.c:271:mdd_dir_is_empty()) Process leaving (rc=18446744073709551577 : -39 : ffffffffffffffd9)
00000004:00000001:3.0:1383036844.842833:0:3111:0:(mdd_dir.c:462:mdd_may_delete()) Process leaving (rc=18446744073709551577 : -39 : ffffffffffffffd9)
00000004:00000001:3.0:1383036844.842835:0:3111:0:(mdd_dir.c:1351:mdd_unlink_sanity_check()) Process leaving (rc=18446744073709551577 : -39 : ffffffffffffffd9)
00000004:00000001:3.0:1383036844.842836:0:3111:0:(mdd_dir.c:1502:mdd_unlink()) Process leaving via cleanup (rc=18446744073709551577 : -39 : 0xffffffffffffffd9)

Though I have no idea why iorData was not deleted before the rmdir of d0.ior.ssf, and unfortunately there are no debug logs here to determine the reason. In any case, this has nothing to do with DNE and should be a duplicate of LU-1639.
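As an illustration (not part of the original log), the rc=-39 returned by mdd_dir_is_empty() is -ENOTEMPTY; on Linux, where ENOTEMPTY is 39, the same error surfaces in userspace when rmdir hits a directory with a leftover entry, which is exactly the `rm: cannot remove ... Directory not empty` failure above:

```python
import errno, os, tempfile

# rc=-39 in the mdd_dir_is_empty() trace above is -ENOTEMPTY (Linux value 39)
assert errno.ENOTEMPTY == 39

# reproduce the userspace symptom: rmdir on a directory with a leftover entry
d = tempfile.mkdtemp()
leftover = os.path.join(d, "iorData")  # stand-in for the undeleted test file
open(leftover, "w").close()
try:
    os.rmdir(d)
    rc = 0
except OSError as e:
    rc = -e.errno
print(rc)  # -39, matching the rc in the MDS debug log

# clean up the sketch's temporary directory
os.unlink(leftover)
os.rmdir(d)
```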

Generated at Sat Feb 10 01:40:21 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.