[LU-2179] parallel-scale test_write_append_truncate: APPEND-after-trunc bad file size 1048576 != 1215563 Created: 15/Oct/12  Updated: 01/May/13  Resolved: 01/May/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.3.0, Lustre 2.4.0
Fix Version/s: Lustre 2.4.0

Type: Bug
Priority: Critical
Reporter: Maloo
Assignee: Jinshan Xiong (Inactive)
Resolution: Fixed
Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 5217

 Description   

This issue was created by maloo for yujian <yujian@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/b5ed6690-167a-11e2-80d0-52540035b04c.

Lustre Tag: v2_3_0_RC3
Lustre Build: http://build.whamcloud.com/job/lustre-b2_3/36
Distro/Arch: RHEL6.3/x86_64(server), FC15/x86_64(client)
Network: TCP
ENABLE_QUOTA=yes

The sub-test test_write_append_truncate failed with the following error:

r= 0: create /mnt/lustre/d0.write_append_truncate/f0.wat, max size: 3703701, seed 1350245712: No such file or directory
r= 0 l=0000: WR A  645675/0x09da2b, AP a 1219262/0x129abe, TR@  709840/0x0ad4d0
r= 0 l=1000: WR M  954949/0x0e9245, AP m  194499/0x02f7c3, TR@ 1136109/0x1155ed
r= 0 l=1926: APPEND-after-trunc bad file size 1048576 != 1215563
r= 0 l=1926: append-after-TRUNC bad [880684-1047299]/[0xd702c-0xffb03] != 0
r= 0 l=1926: WR C  880684/0x0d702c, AP c  168263/0x029147, TR@ 1047300/0x0ffb04
000000   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C
*
0d7020   C   C   C   C   C   C   C   C   C   C   C   C   c   c   c   c
0d7030   c   c   c   c   c   c   c   c   c   c   c   c   c   c   c   c
*
100000
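
The size in the failure message can be checked by hand from the l=1926 line. The following is a minimal sketch, not part of the test suite: parse_line and expected_size_after_append are hypothetical helper names, and the rule "expected size = truncate offset + append length" is an assumption inferred from the logged numbers (1047300 + 168263 = 1215563), not taken from the test source.

```python
import re

# Hypothetical parser for one verbose "WR/AP/TR@" line of the test output.
LINE_RE = re.compile(
    r"l=(?P<loop>\d+): WR (?P<wc>\S)\s+(?P<wr>\d+)/0x(?P<wr_hex>[0-9a-f]+), "
    r"AP (?P<ac>\S)\s+(?P<ap>\d+)/0x(?P<ap_hex>[0-9a-f]+), "
    r"TR@\s+(?P<tr>\d+)/0x(?P<tr_hex>[0-9a-f]+)"
)


def parse_line(line):
    """Return loop number and write/append/truncate values as integers."""
    m = LINE_RE.search(line)
    if m is None:
        raise ValueError("not a WR/AP/TR@ line: %r" % line)
    fields = {"loop": int(m["loop"])}
    for key in ("wr", "ap", "tr"):
        fields[key] = int(m[key])
        # The log prints each value in decimal and hex; they must agree.
        if fields[key] != int(m[key + "_hex"], 16):
            raise ValueError("decimal/hex mismatch for %s" % key)
    return fields


def expected_size_after_append(fields):
    # Assumption: the final append adds `ap` bytes at the truncated EOF.
    return fields["tr"] + fields["ap"]


line = ("r= 0 l=1926: WR C  880684/0x0d702c, "
        "AP c  168263/0x029147, TR@ 1047300/0x0ffb04")
fields = parse_line(line)
expected = expected_size_after_append(fields)
observed = 1048576  # size reported in the failure message

print("expected:", expected)
print("observed:", observed, "(1 MiB)" if observed == 1 << 20 else "")
print("missing bytes:", expected - observed)
```

Under that assumption the expected size matches the 1215563 in the error message, while the observed size is exactly 1 MiB (0x100000), which is also where the hexdump above ends.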

Info required for matching: parallel-scale write_append_truncate



 Comments   
Comment by Peter Jones [ 15/Oct/12 ]

Oleg will be looking into this

Comment by Peter Jones [ 16/Oct/12 ]

Jinshan is working on a fix for this issue

Comment by Jinshan Xiong (Inactive) [ 16/Oct/12 ]

patch is at: http://review.whamcloud.com/4281

Comment by Jian Yu [ 17/Oct/12 ]

After applying patch set 2 on b2_3 (based on commit e5d5cd2) and building FC15 client packages manually, I ran the write_append_truncate test with the following parameters on the FC15 clients with RHEL6.3/x86_64 2.3.0 RC3 (build #36) servers:

== parallel-scale test write_append_truncate: write_append_truncate ================================== 03:08:12 (1350468492)
OPTIONS:
clients=client-18,client-5
write_REP=10000
write_THREADS=8
MACHINEFILE=/tmp/parallel-scale.machines
client-18
client-5
+ write_append_truncate -v -s 1350245712 -n 10000 /mnt/lustre/d0.write_append_truncate/f0.wat
+ chmod 0777 /mnt/lustre
drwxrwxrwx 4 root root 4096 Oct 17 03:08 /mnt/lustre
+ su mpiuser sh -c "/usr/lib64/openmpi/bin/mpirun -mca orte_rsh_agent rsh:ssh -np 16 -machinefile /tmp/parallel-scale.machines write_append_truncate -v -s 1350245712 -n 10000 /mnt/lustre/d0.write_append_truncate/f0.wat "
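
For reference, the process count in the mpirun line is consistent with two clients running write_THREADS=8 processes each (2 x 8 = 16). The sketch below rebuilds the command from those parameters; build_mpirun_cmd is a hypothetical helper, and the "clients times threads" rule is an assumption inferred from the logged values rather than read from the parallel-scale script.

```python
import shlex


def build_mpirun_cmd(clients, threads_per_client, seed, reps, path, machinefile):
    """Rebuild the mpirun invocation shown in the log (illustrative only)."""
    # Assumption: total MPI ranks = number of clients * threads per client.
    nprocs = len(clients) * threads_per_client
    test_cmd = ["write_append_truncate", "-v",
                "-s", str(seed), "-n", str(reps), path]
    return ["/usr/lib64/openmpi/bin/mpirun",
            "-mca", "orte_rsh_agent", "rsh:ssh",
            "-np", str(nprocs),
            "-machinefile", machinefile] + test_cmd


cmd = build_mpirun_cmd(
    clients=["client-18", "client-5"],
    threads_per_client=8,          # write_THREADS
    seed=1350245712,               # same seed as the failing run
    reps=10000,                    # write_REP
    path="/mnt/lustre/d0.write_append_truncate/f0.wat",
    machinefile="/tmp/parallel-scale.machines",
)
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Reusing the seed from the failing run (-s 1350245712) is what makes the patched run comparable to the original failure.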

So far the test has completed 5 runs successfully. The remaining runs, up to 10 in total, are still in progress.

Comment by Jian Yu [ 17/Oct/12 ]

The sixth run hung:

r= 0 l=6790: WR E   84295/0x014947, AP e  372625/0x05af91, TR@  136326/0x021486
r= 0 l=6791: WR F  596548/0x091a44, AP f  686234/0x0a789a, TR@ 1005524/0x0f57d4
r= 0 l=6792: WR G  812779/0x0c66eb, AP g  926155/0x0e21cb, TR@ 1305652/0x13ec34
r= 0 l=6793: WR H  841164/0x0cd5cc, AP h 1080566/0x107cf6, TR@ 1399403/0x155a6b

The stack trace on the client node showed:

[206222.460480] write_append_tr S 0000000000000000     0 15826  15801 0x00000080
[206222.467627]  ffff8802f3635938 0000000000000082 0000000000000000 ffff880269654560
[206222.475152]  ffff8802f3635fd8 ffff8802f3635fd8 0000000000013840 0000000000013840
[206222.482681]  ffff8803272f1720 ffff880269654560 0000000000000000 0000000100000000
[206222.490208] Call Trace:
[206222.492737]  [<ffffffff81474d9d>] schedule_hrtimeout_range_clock+0x50/0x111
[206222.499770]  [<ffffffff81080b33>] ? arch_local_irq_save+0x15/0x1b
[206222.505931]  [<ffffffff8147588c>] ? _raw_spin_unlock_irqrestore+0x17/0x19
[206222.512792]  [<ffffffff8106f1d0>] ? add_wait_queue+0x3d/0x45
[206222.518520]  [<ffffffff81474e71>] schedule_hrtimeout_range+0x13/0x15
[206222.524940]  [<ffffffff8112fd7f>] poll_schedule_timeout+0x48/0x64
[206222.531101]  [<ffffffff81130585>] do_select+0x4b1/0x4f5
[206222.536397]  [<ffffffff8112fe45>] ? __pollwait+0x0/0xcc
[206222.541699]  [<ffffffff8112ff11>] ? pollwake+0x0/0x54
[206222.546822]  [<ffffffff8112ff11>] ? pollwake+0x0/0x54
[206222.551946]  [<ffffffff8122c9d4>] ? radix_tree_lookup_slot+0xe/0x10
[206222.558287]  [<ffffffff8104127e>] ? should_resched+0xe/0x2d
[206222.563928]  [<ffffffff814742d0>] ? _cond_resched+0xe/0x22
[206222.569482]  [<ffffffff810d9d04>] ? filemap_fault+0x20d/0x36c
[206222.575298]  [<ffffffff810d8118>] ? unlock_page+0x27/0x2b
[206222.580774]  [<ffffffff81059b0b>] ? current_fs_time+0x37/0x3e
[206222.586590]  [<ffffffff8113485b>] ? touch_atime+0x116/0x131
[206222.592239]  [<ffffffff8104127e>] ? should_resched+0xe/0x2d
[206222.597898]  [<ffffffff814742d0>] ? _cond_resched+0xe/0x22
[206222.603452]  [<ffffffff8112fb60>] ? might_fault+0x21/0x23
[206222.608921]  [<ffffffff8113072c>] core_sys_select+0x163/0x202
[206222.614737]  [<ffffffff811212b2>] ? do_sync_read+0xbf/0xff
[206222.620299]  [<ffffffff8113085c>] sys_select+0x91/0xb9
[206222.625508]  [<ffffffff81009bc2>] system_call_fastpath+0x16/0x1b
[206222.631582] write_append_tr S 0000000000000000     0 15827  15801 0x00000080
[206222.638729]  ffff8802f3637a58 0000000000000082 0000000000000000 ffff880269c9ae40
[206222.646255]  ffff8802f3637fd8 ffff8802f3637fd8 0000000000013840 0000000000013840
[206222.653782]  ffff8803272d9720 ffff880269c9ae40 ffff8802f3637681 0000000000000000
[206222.661311] Call Trace:
[206222.663840]  [<ffffffff81474d9d>] schedule_hrtimeout_range_clock+0x50/0x111
[206222.670872]  [<ffffffff81080b33>] ? arch_local_irq_save+0x15/0x1b
[206222.677033]  [<ffffffff8147588c>] ? _raw_spin_unlock_irqrestore+0x17/0x19
[206222.683894]  [<ffffffff8106f1d0>] ? add_wait_queue+0x3d/0x45
[206222.689622]  [<ffffffff81474e71>] schedule_hrtimeout_range+0x13/0x15
[206222.696042]  [<ffffffff8112fd7f>] poll_schedule_timeout+0x48/0x64
[206222.702203]  [<ffffffff81130d13>] do_sys_poll+0x2f4/0x386
[206222.707671]  [<ffffffff8112fe45>] ? __pollwait+0x0/0xcc
[206222.712967]  [<ffffffff8112ff11>] ? pollwake+0x0/0x54
[206222.718097]  [<ffffffff8112ff11>] ? pollwake+0x0/0x54
[206222.723220]  [<ffffffff8106f23d>] ? autoremove_wake_function+0x2b/0x3d
[206222.729813]  [<ffffffff81059b0b>] ? current_fs_time+0x37/0x3e
[206222.735637]  [<ffffffff8113472b>] ? file_update_time+0xf9/0x113
[206222.741633]  [<ffffffff81128c90>] ? pipe_write+0x448/0x45a
[206222.747198]  [<ffffffff81121099>] ? fsnotify_modify+0x5f/0x67
[206222.753019]  [<ffffffff81130e48>] sys_poll+0x51/0xbb
[206222.758063]  [<ffffffff81009bc2>] system_call_fastpath+0x16/0x1b

Maloo report: https://maloo.whamcloud.com/test_sets/8782dbba-18c7-11e2-a6a7-52540035b04c

Comment by Jian Yu [ 18/Oct/12 ]

Hi Xiong,
The write_append_truncate test passed 10 times on the FC15 clients provisioned with the packages in client-18:/root/rpmbuild/RPMS/x86_64/. All of the logs are in /home/yujian/test_logs/2012-10-17/224352/ on the brent node.

Comment by Jodi Levi (Inactive) [ 18/Oct/12 ]

http://review.whamcloud.com/4295

Comment by Jinshan Xiong (Inactive) [ 20/Oct/12 ]

patch for master is at: http://review.whamcloud.com/4317

Comment by Jodi Levi (Inactive) [ 19/Apr/13 ]

With changes 4295 and 4317 landed, can this ticket be closed?
