[LU-4839] Test failure sanity-hsm test_60: Timed out waiting for progress update Created: 31/Mar/14  Updated: 10/Aug/15  Resolved: 23/Apr/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0, Lustre 2.5.3
Fix Version/s: Lustre 2.8.0

Type: Bug
Priority: Critical
Reporter: Maloo
Assignee: Nathaniel Clark
Resolution: Fixed
Votes: 0
Labels: 22pl, HB, mq115

Issue Links:
Related
Severity: 3
Rank (Obsolete): 13335

 Description   

This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

This issue relates to the following test suite run:
http://maloo.whamcloud.com/test_sets/a72e2f24-a25e-11e3-80d1-52540035b04c
https://maloo.whamcloud.com/test_sets/e3e8358a-b803-11e3-8f4e-52540035b04c

The sub-test test_60 failed with the following error:

Timed out waiting for progress update!

Info required for matching: sanity-hsm 60



 Comments   
Comment by Jian Yu [ 20/Aug/14 ]

While testing patch http://review.whamcloud.com/11097 on Lustre b2_5 branch with FSTYPE=zfs, sanity-hsm test 60 hit the same failure:
https://testing.hpdd.intel.com/test_sets/bee64df4-2878-11e4-85c7-5254006e85c2

Comment by Jian Yu [ 22/Aug/14 ]

While testing patch http://review.whamcloud.com/11539 on Lustre b2_5 branch with FSTYPE=zfs, sanity-hsm test 60 hit the same failure:
https://testing.hpdd.intel.com/test_sets/a4162380-29ce-11e4-a2a1-5254006e85c2

Comment by Jian Yu [ 28/Aug/14 ]

While testing patch http://review.whamcloud.com/11574 on Lustre b2_5 branch with FSTYPE=zfs, sanity-hsm test 60 hit the same failure:
https://testing.hpdd.intel.com/test_sets/1a9cd4da-2d5e-11e4-b550-5254006e85c2

Comment by Bruno Faccini (Inactive) [ 14/Sep/14 ]

+1 at https://testing.hpdd.intel.com/test_sets/03614ef8-3b88-11e4-ad5c-5254006e85c2, during auto-tests of patch http://review.whamcloud.com/11895 (for LU-5042) on Lustre b2_5 branch with FSTYPE=zfs.

It is noteworthy that all these failures occurred with ZFS, and that each time the copytool seems to have handled the request and been in the process of archiving the file, but triggered a "bandwith control" event.

Comment by Peter Jones [ 23/Sep/14 ]

Nathaniel

Does Bruno's comment give you some insight on how to avoid this failure?

Thanks

Peter

Comment by Nathaniel Clark [ 23/Sep/14 ]

Re: "bandwith control" events

The passing tests also seem to all have them in the copytool log.

Re: ZFS

This is NOT zfs-only, this also happens during DNE testing:
https://testing.hpdd.intel.com/test_sets/45976e74-40ec-11e4-b0c7-5254006e85c2
https://testing.hpdd.intel.com/test_sets/b39b363a-4123-11e4-9ca5-5254006e85c2
https://testing.hpdd.intel.com/test_sets/64967486-40f2-11e4-9ca5-5254006e85c2

Comment by Nathaniel Clark [ 27/Sep/14 ]

This seems like it might be a timing issue where the action completes before it can be picked up by the while loop.

Comment by Nathaniel Clark [ 27/Sep/14 ]

Fix bandwidth control in lhsmtool. The active request was failing too quickly.
http://review.whamcloud.com/12093
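
For readers unfamiliar with the "bandwith control" events discussed above, here is a minimal Python sketch of a rate throttle of that general kind. It assumes the sleep time is simply the excess byte count divided by the configured limit; this is an illustration, not lhsmtool's actual code, and `throttle_delay` and its parameter names are hypothetical.

```python
def throttle_delay(bytes_written, elapsed_s, bandwidth_bps):
    """Return how many seconds to sleep so that the average transfer
    rate stays at or below bandwidth_bps (hypothetical helper)."""
    if bandwidth_bps <= 0:
        # No limit configured: never sleep.
        return 0.0
    # Bytes we were allowed to write in the elapsed time.
    allowed = elapsed_s * bandwidth_bps
    excess = bytes_written - allowed
    if excess <= 0:
        # At or under the limit: no need to sleep.
        return 0.0
    # Sleep long enough for the allowance to catch up with the excess.
    return excess / bandwidth_bps


# 2 MiB written after 1 s at a 1 MiB/s limit leaves 1 MiB of excess.
print(throttle_delay(2 * 1048576, 1.0, 1048576))
```

Under this assumption, an excess of 1048576 bytes at a limit of 1048576 B/s gives a one-second sleep.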

Comment by Jian Yu [ 29/Sep/14 ]

More instances on Lustre b2_5 branch:
https://testing.hpdd.intel.com/test_sets/cce87cfe-4760-11e4-8a80-5254006e85c2
https://testing.hpdd.intel.com/test_sets/874176a4-4672-11e4-8deb-5254006e85c2
https://testing.hpdd.intel.com/test_sets/2809a6e8-4672-11e4-b3aa-5254006e85c2
https://testing.hpdd.intel.com/test_sets/14968cdc-488b-11e4-8e19-5254006e85c2
https://testing.hpdd.intel.com/test_sets/e6dbc1fe-4ca9-11e4-9a20-5254006e85c2

Comment by Jodi Levi (Inactive) [ 15/Oct/14 ]

Patch landed to Master.

Comment by Jian Yu [ 25/Oct/14 ]

One more instance on Lustre b2_5 branch: https://testing.hpdd.intel.com/test_sets/ea696b02-5c6a-11e4-b364-5254006e85c2
I'll back-port the patch.

Comment by Jian Yu [ 25/Oct/14 ]

Just found James had back-ported it to Lustre b2_5 branch: http://review.whamcloud.com/12405

Comment by Dmitry Eremin (Inactive) [ 06/Nov/14 ]

Failed again in master https://testing.hpdd.intel.com/test_sets/d8ddbeb4-65bc-11e4-9c16-5254006e85c2

Comment by nasf (Inactive) [ 07/Nov/14 ]

Another failure instance on b2_5:
https://testing.hpdd.intel.com/sub_tests/a622f1fa-65e0-11e4-b6c7-5254006e85c2

Comment by Andreas Dilger [ 08/Nov/14 ]

Dmitry, was your master test failure based on a tree that had this fix applied?

Nasf, the b2_5 patch hasn't landed yet.

Comment by Dmitry Eremin (Inactive) [ 10/Nov/14 ]

Sure, it was on the latest master at that time. My patch is on top, and the patch for this bug is the last one in this list.

3 days ago	Dmitry Eremin	LU-5577 libcfs: fix warnings in libcfs/curproc.h 79/12379/3	commit | commitdiff | tree | snapshot
4 days ago	John L. Hammond	LU-5814 lov: remove unused {get,set}_info handlers 45/12445/4	commit | commitdiff | tree | snapshot
5 days ago	Frank Zago	LU-5691 hsm: remove a request from the index if not... 42/12142/2	commit | commitdiff | tree | snapshot
5 days ago	Bob Glossman	LU-5853 build: fix el7 build regression 46/12546/2	commit | commitdiff | tree | snapshot
5 days ago	Dmitry Eremin	LU-5383 utils: fix array index out of bounds 24/12524/2	commit | commitdiff | tree | snapshot
5 days ago	Dmitry Eremin	LU-5577 changelog: fix comparison between signed and... 74/12474/2	commit | commitdiff | tree | snapshot
5 days ago	John L. Hammond	LU-5814 echo: remove userspace LSM handling 46/12446/4	commit | commitdiff | tree | snapshot
5 days ago	John L. Hammond	LU-5814 lov: remove LL_IOC_RECREATE_{FID,OBJ} 42/12442/4	commit | commitdiff | tree | snapshot
5 days ago	John L. Hammond	LU-2675 utils: remove loadgen 95/12395/2	commit | commitdiff | tree | snapshot
5 days ago	Dmitry Eremin	LU-5577 obdclass: change uuid_unpack arg to size_t 89/12389/2	commit | commitdiff | tree | snapshot
5 days ago	Dmitry Eremin	LU-5577 obdclass: change lu_site->ls_purge_start to... 84/12384/2	commit | commitdiff | tree | snapshot
5 days ago	Dmitry Eremin	LU-5577 mdd: lu_dirent_calc_size() return type to size_t 83/12383/2	commit | commitdiff | tree | snapshot
5 days ago	Dmitry Eremin	LU-5577 ldlm: count of pools is unsigned long 04/12304/3	commit | commitdiff | tree | snapshot
5 days ago	Wei Liu	LU-5387 test: Skip sanity test_239 if MDS version older... 41/12241/2	commit | commitdiff | tree | snapshot
5 days ago	Johann Lombardi	LU-5668 test: enable ior data consistency check 58/12058/9	commit | commitdiff | tree | snapshot
5 days ago	Jian Yu	LU-5443 lustre: replace direct HZ access with kernel... 52/12052/8	commit | commitdiff | tree | snapshot
5 days ago	Liang Zhen	LU-5545 ptlrpc: false alarm in AT network latency measuring 18/12018/5	commit | commitdiff | tree | snapshot
5 days ago	Andriy Skulysh	LU-5651: ptlrpc: fix import state during replay 15/12015/4	commit | commitdiff | tree | snapshot
5 days ago	Dmitry Eremin	LU-5591 lod: fix Null pointer dereference in lod_ah_init() 70/11770/8	commit | commitdiff | tree | snapshot
5 days ago	Dmitry Eremin	LU-5589 obdclass: fix NULL pointer dereference 69/11769/5	commit | commitdiff | tree | snapshot
5 days ago	John L. Hammond	LU-2675 obd: cleanup struct md_op_data and uses 34/11734/4	commit | commitdiff | tree | snapshot
5 days ago	Emoly Liu	LU-4167 tests: correct version check to enable ff_convert 56/11556/6	commit | commitdiff | tree | snapshot
5 days ago	Dmitry Eremin	LU-5577 mdc: fix comparison between signed and unsigned 79/11379/17	commit | commitdiff | tree | snapshot
5 days ago	Bruno Faccini	LU-4176 tests: re-enable sanity-hsm/test_31a 77/9577/5	commit | commitdiff | tree | snapshot
5 days ago	Niu Yawei	LU-5807 qos: enable QOS_DEBUG() 34/12434/3	commit | commitdiff | tree | snapshot
5 days ago	Jian Yu	LU-4856 obdclass: check val in proc_max_dirty_pages_in_mb() 69/12269/4	commit | commitdiff | tree | snapshot
5 days ago	Alexander.Boyko	LU-5380 at: net AT after connect 55/11155/2	commit | commitdiff | tree | snapshot
5 days ago	Niu Yawei	LU-4810 utils: print messages when set tunables 65/9865/5	commit | commitdiff | tree | snapshot
5 days ago	Jinshan Xiong	LU-3259 clio: cl_lock simplification 58/10858/15	commit | commitdiff | tree | snapshot
6 days ago	Amir Shehata	LU-4181 tests: cleanup lustre before starting lnet... 69/12469/3	commit | commitdiff | tree | snapshot
6 days ago	Jian Yu	LU-5079 tests: decrease at_max value in replay-vbr... 90/12490/2	commit | commitdiff | tree | snapshot
6 days ago	Bob Glossman	LU-5825 kernel: kernel update [RHEL7 3.10.0-123.9.2... 78/12478/3	commit | commitdiff | tree | snapshot
6 days ago	Bob Glossman	LU-5795 kernel: kernel update [SLES11 SP3 3.0.101-0.40] 01/12401/2	commit | commitdiff | tree | snapshot
6 days ago	Kit Westneat	LU-5842 tests: reduce time to run sanity-sec tests... 32/12532/2	commit | commitdiff | tree | snapshot
7 days ago	Lai Siyao	LU-3270 statahead: race in start/stop statahead 66/9666/8	commit | commitdiff | tree | snapshot
7 days ago	Lai Siyao	LU-2272 statahead: ll_intent_drop_lock() called in... 65/9665/9	commit | commitdiff | tree | snapshot
7 days ago	Lai Siyao	LU-3270 statahead: use dcache-like interface for sa... 64/9664/11	commit | commitdiff | tree | snapshot
7 days ago	Joshua Walgenbach	LU-4647 nodemap: add mapping functionality 99/9299/44	commit | commitdiff | tree | snapshot
9 days ago	Liang Zhen	LU-5435 lnet: lustre network latency simulation 09/11409/14	commit | commitdiff | tree | snapshot
9 days ago	Jinshan Xiong	LU-4665 utils: lfs setstripe to specify OSTs 83/9383/29	commit | commitdiff | tree | snapshot
9 days ago	Liang Zhen	LU-5435 lnet: LNet drop rule implementation 14/11314/10	commit | commitdiff | tree | snapshot
9 days ago	Jinshan Xiong	LU-4198 clio: generalize cl_sync_io 56/8656/18	commit | commitdiff | tree | snapshot
9 days ago	Liang Zhen	LU-5435 libcfs: copy out ioctl inline buffer 13/11313/14	commit | commitdiff | tree | snapshot
9 days ago	Fan Yong	LU-5519 lfsck: repair slave LMV for striped directory 48/11848/15	commit | commitdiff | tree | snapshot
9 days ago	Henri Doreau	LU-3613 llite: Add ioctl to get parent fids from link EA. 69/7069/17	commit | commitdiff | tree | snapshot
10 days ago	Fan Yong	LU-5519 lfsck: repair master LMV for striped directory 47/11847/12	commit | commitdiff | tree | snapshot
10 days ago	Fan Yong	LU-5519 lfsck: repair bad name hash for striped directory 46/11846/13	commit | commitdiff | tree | snapshot
10 days ago	Yang Sheng	LU-5584 llite: ensure all data flush out when umount 03/12103/10	commit | commitdiff | tree | snapshot
10 days ago	Oleg Drokin	Revert "LU-5568 lnet: fix kernel crash when network... 02/12502/2	commit | commitdiff | tree | snapshot
11 days ago	Wang Shilong	LU-5568 lnet: fix kernel crash when network failed... 18/11718/11	commit | commitdiff | tree | snapshot
11 days ago	Frank Zago	LU-5756 hsm: add missing return code in llapi_hsm_copyt... 14/12314/6	commit | commitdiff | tree | snapshot
11 days ago	Nathaniel Clark	LU-5743 build: Update to zfs/spl 0.6.3-1.1 73/12273/3	commit | commitdiff | tree | snapshot
11 days ago	Bob Glossman	LU-5641 tests: ensure user daemon is in group bin 44/12044/4	commit | commitdiff | tree | snapshot
11 days ago	Niu Yawei	LU-5287 export: hold exp_lock when modify exp_flags 71/11871/3	commit | commitdiff | tree | snapshot
11 days ago	Minh Diep	LU-5674 test: print spl debug info 80/11580/18	commit | commitdiff | tree | snapshot
11 days ago	Vitaly Fertman	LU-4942 at: per-export lock callback timeout 36/9336/9	commit | commitdiff | tree | snapshot
11 days ago	Patrick Farrell	LU-5626 ldiskfs: update non-htree dotdot in rename 39/11939/11	commit | commitdiff | tree | snapshot
11 days ago	Johann Lombardi	LU-5675 quota: correctly set II_FL_NONUNQ in dt_index_r... 74/12074/3	commit | commitdiff | tree | snapshot
11 days ago	Fan Yong	LU-5519 lfsck: LFSCK code framework adjustment (2) 45/11845/13	commit | commitdiff | tree | snapshot
11 days ago	Fan Yong	LU-5518 lfsck: recover orphans from backend lost+found 36/11536/25	commit | commitdiff | tree | snapshot
11 days ago	Fan Yong	LU-5517 lfsck: repair invalid nlink count 16/11516/29	commit | commitdiff | tree | snapshot
11 days ago	Niu Yawei	LU-5727 ldlm: revert changes to ldlm_cancel_aged_policy() 48/12448/3	commit | commitdiff | tree | snapshot
11 days ago	Niu Yawei	LU-5777 quota: reserve enough credits for setattr 61/12361/3	commit | commitdiff | tree | snapshot
13 days ago	Jian Yu	LU-5606 tests: add version check codes to conf-sanity... 76/12376/2	commit | commitdiff | tree | snapshot
13 days ago	Henri Doreau	LU-1996 lustre: Flexible changelog format. 60/4060/25	commit | commitdiff | tree | snapshot
2014-10-25	Fan Yong	LU-5624 tests: ignore bad lfsck performance for ZFS... 22/12322/2	commit | commitdiff | tree | snapshot
2014-10-25	John L. Hammond	LU-2675 llog: remove obd_llog_init() and obd_llod_finish() 81/11781/2	commit | commitdiff | tree | snapshot
2014-10-25	John L. Hammond	LU-2675 osc: remove obsolete llog handling 74/11774/4	commit | commitdiff | tree | snapshot
2014-10-25	John L. Hammond	LU-2675 lustre: remove linux/obd_support.h 31/11931/3	commit | commitdiff | tree | snapshot
2014-10-25	John L. Hammond	LU-4075 osd: handle getxattr for trusted.version 49/11649/2	commit | commitdiff | tree | snapshot
2014-10-25	Li Xi	LU-5054 llite: enforce pool name length limit 06/10306/11	commit | commitdiff | tree | snapshot
2014-10-25	John L. Hammond	LU-5352 dt: correct if condition in dt_index_read() 21/11121/6	commit | commitdiff | tree | snapshot
2014-10-24	John L. Hammond	LU-2675 mgc: remove libmgc.c 72/11772/4	commit | commitdiff | tree | snapshot
2014-10-24	John L. Hammond	LU-2675 libcfs: add libcfs/byteorder.h 86/11986/2	commit | commitdiff | tree | snapshot
2014-10-24	John L. Hammond	LU-2675 libcfs: remove LUSTRE_{,SRV_}LNET_PID 85/11985/2	commit | commitdiff | tree | snapshot
2014-10-24	John L. Hammond	LU-5779 test: wait for CT registration in sanity-hsm... 67/12367/2	commit | commitdiff | tree | snapshot
2014-10-22	Nathaniel Clark	LU-5706 tests: Ensure preconditions in conf-sanity/57 36/12236/6	commit | commitdiff | tree | snapshot
2014-10-22	John L. Hammond	LU-2675 libcfs: ignore CDEBUG_ENTRY_EXIT for userspace 81/12281/2	commit | commitdiff | tree | snapshot
2014-10-22	Amir Shehata	LU-2456 lnet: lnetctl utility man page 59/11859/10	commit | commitdiff | tree | snapshot
2014-10-22	Amir Shehata	LU-2456 lnet: configure lnet on startup 98/11798/11	commit | commitdiff | tree | snapshot
2014-10-22	Amir Shehata	LU-2456 lnet: DLC user space Configuration utility 26/8026/65	commit | commitdiff | tree | snapshot
2014-10-22	Amir Shehata	LU-2456 lnet: DLC user space Configuration library 25/8025/63	commit | commitdiff | tree | snapshot
2014-10-22	Fan Yong	LU-5506 lfsck: skip orphan MDT-object handling for... 44/11444/23	commit | commitdiff | tree | snapshot
2014-10-22	Fan Yong	LU-5516 lfsck: repair orphan parent MDT-object 91/11391/29	commit | commitdiff | tree | snapshot
2014-10-21	Alexander.Boyko	LU-5079 ptlrpc: fix early reply timeout for recovery 13/11213/11	commit | commitdiff | tree | snapshot
2014-10-20	Henri Doreau	LU-5752 doc: Added missing manpages to Makefile.am 08/12308/2	commit | commitdiff | tree | snapshot
2014-10-20	Fan Yong	LU-4976 osp: add doxygen comments for osp_object.c... 99/10799/12	commit | commitdiff | tree | snapshot
2014-10-17	James Nunez	LU-4298 utils: do not create file with no striping... 75/8375/8	commit | commitdiff | tree | snapshot
2014-10-16	Alex Zhuravlev	LU-4974 lod: documentation for lod_object.c 22/11022/10	commit | commitdiff | tree | snapshot
2014-10-16	Fan Yong	LU-5516 lfsck: repair the lost name entry 49/12249/3	commit | commitdiff | tree | snapshot
2014-10-16	Fan Yong	LU-5515 lfsck: repair bad file type in name entry 48/12248/3	commit | commitdiff | tree | snapshot
2014-10-16	Fan Yong	LU-5513 lfsck: repair multiple referenced name entry 47/12247/5	commit | commitdiff | tree | snapshot
2014-10-15	John L. Hammond	LU-2675 libcfs: remove {ENTRY,EXIT}_NESTING macros 84/11984/5	commit | commitdiff | tree | snapshot
2014-10-15	Yang Sheng	LU-951 test: re-enable replay-single test_73a 27/12227/3	commit | commitdiff | tree | snapshot
2014-10-11	Oleg Drokin	New tag 2.6.54 2.6.54 v2_6_54 v2_6_54_0	commit | commitdiff | tree | snapshot
2014-10-11	Bruno Faccini	LU-5573 obdclass: strengthen against concurrent server... 14/12114/4	commit | commitdiff | tree | snapshot
2014-10-11	Bobi Jam	LU-4943 obdclass: detach MGC dev on error 29/10129/14	commit | commitdiff | tree | snapshot
2014-10-11	Nathaniel Clark	LU-4839 utils: fix bandwidth ctl in lhsmtool 93/12093/7	commit | commitdiff | tree | snapshot

Comment by nasf (Inactive) [ 10/Nov/14 ]

We need the b2_5 patch, another failure on b2_5:
https://testing.hpdd.intel.com/test_sets/741c533a-683e-11e4-a449-5254006e85c2

Comment by Nathaniel Clark [ 10/Nov/14 ]

Patch for b2_5
http://review.whamcloud.com/12654 (Abandoned)

Comment by Jian Yu [ 10/Nov/14 ]

Hi Nathaniel Clark,

Just found James had back-ported it to Lustre b2_5 branch: http://review.whamcloud.com/12405

The patch for Lustre b2_5 branch is ready to land.

Comment by Bob Glossman (Inactive) [ 10/Nov/14 ]

still seen in master:
https://testing.hpdd.intel.com/test_sets/f9b35f48-682a-11e4-acbe-5254006e85c2

Comment by Nathaniel Clark [ 11/Nov/14 ]

Current failures have a delay during copytool startup:

1415239123.290216 lhsmtool_posix[22116]: action=0 src=(null) dst=(null) mount_point=/mnt/lustre
1415239123.678180 lhsmtool_posix[22117]: waiting for message from kernel
1415239133.981741 lhsmtool_posix[22117]: copytool fs=lustre archive#=2 item_count=1
1415239133.982096 lhsmtool_posix[22117]: waiting for message from kernel
1415239133.982184 lhsmtool_posix[22118]: '[0x200000401:0x1c2:0x0]' action ARCHIVE reclen 72, cookie=0x545ad59d
1415239133.984387 lhsmtool_posix[22118]: processing file 'd60.sanity-hsm/f60.sanity-hsm'
1415239134.028282 lhsmtool_posix[22118]: archiving '/mnt/lustre/.lustre/fid/0x200000401:0x1c2:0x0' to '/home/autotest2/.autotest/shared_dir/2014-11-05/074239-70147036187460/arc1/01c2/0000/0401/0000/0002/0000/0x200000401:0x1c2:0x0_tmp'
1415239149.689299 lhsmtool_posix[22118]: saving stripe info of '/mnt/lustre/.lustre/fid/0x200000401:0x1c2:0x0' in /home/autotest2/.autotest/shared_dir/2014-11-05/074239-70147036187460/arc1/01c2/0000/0401/0000/0002/0000/0x200000401:0x1c2:0x0_tmp.lov
1415239151.495965 lhsmtool_posix[22118]: start copy of 39000000 bytes from '/mnt/lustre/.lustre/fid/0x200000401:0x1c2:0x0' to '/home/autotest2/.autotest/shared_dir/2014-11-05/074239-70147036187460/arc1/01c2/0000/0401/0000/0002/0000/0x200000401:0x1c2:0x0_tmp'
1415239156.616170 lhsmtool_posix[22118]: %13 
1415239156.625435 lhsmtool_posix[22118]: bandwith control: 1048576B/s excess=1048576 sleep for 1.000000000s
1415239161.652771 lhsmtool_posix[22118]: %26 
1415239161.661059 lhsmtool_posix[22118]: bandwith control: 1048576B/s excess=1048576 sleep for 1.000000000s
1415239166.690009 lhsmtool_posix[22118]: %40 
1415239166.699557 lhsmtool_posix[22118]: bandwith control: 1048576B/s excess=1048576 sleep for 1.000000000s
1415239171.725522 lhsmtool_posix[22118]: %53 
1415239171.729102 lhsmtool_posix[22118]: bandwith control: 1048576B/s excess=1048576 sleep for 1.000000000s
1415239176.737525 lhsmtool_posix[22118]: %67 
1415239176.740985 lhsmtool_posix[22118]: bandwith control: 1048576B/s excess=1048576 sleep for 1.000000000s
exiting: Interrupt

Notice the amount of time from the first log message to the first bandwidth control message (about 33 sec). This would let the 30 sec timeout expire before any copying had actually occurred. Some runs even take as long as a minute.
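
The 33-second figure can be checked directly from the log timestamps. The Python sketch below parses a few of the log lines quoted above and measures the time from the first message to the first "bandwith control" message. The regular expression and the `seconds_until` helper are assumptions made for this illustration; the 30-second value is the timeout mentioned in this comment.

```python
import re

# Assumed line layout: "<epoch seconds> <program>[<pid>]: <message>"
LOG_LINE = re.compile(r"^(\d+\.\d+) (\S+?)\[(\d+)\]: (.*)$")

TIMEOUT_S = 30  # test timeout mentioned in the comment above


def seconds_until(lines, needle):
    """Seconds from the first parsable log line to the first line whose
    message contains needle, or None if needle never appears."""
    first = None
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # e.g. "exiting: Interrupt" has no timestamp
        stamp = float(match.group(1))
        if first is None:
            first = stamp
        if needle in match.group(4):
            return stamp - first
    return None


# Lines copied from the copytool log above (a subset).
sample = [
    "1415239123.290216 lhsmtool_posix[22116]: action=0 src=(null) dst=(null) mount_point=/mnt/lustre",
    "1415239123.678180 lhsmtool_posix[22117]: waiting for message from kernel",
    "1415239156.616170 lhsmtool_posix[22118]: %13",
    "1415239156.625435 lhsmtool_posix[22118]: bandwith control: 1048576B/s excess=1048576 sleep for 1.000000000s",
]

delay = seconds_until(sample, "bandwith control")
print(f"{delay:.1f}s to first bandwidth control message; "
      f"exceeds timeout: {delay > TIMEOUT_S}")
```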

Comment by Nathaniel Clark [ 12/Nov/14 ]

http://review.whamcloud.com/12682

Comment by nasf (Inactive) [ 16/Nov/14 ]

another failure instance on b2_5:
https://testing.hpdd.intel.com/test_sets/296862b8-6d14-11e4-9bc9-5254006e85c2

Comment by Gerrit Updater [ 17/Nov/14 ]

Nathaniel Clark (nathaniel.l.clark@intel.com) uploaded a new patch: http://review.whamcloud.com/12682
Subject: LU-4839 tests: Give copytool more time to start
Project: fs/lustre-release
Branch: master
Current Patch Set: 4
Commit: 2295411ba1e28647312a53b6bf642c999df23e27

Comment by Gerrit Updater [ 23/Nov/14 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12682/
Subject: LU-4839 tests: Give copytool more time to start
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 6948e80bc149aa689e09334a70941340143fa2ce

Comment by Gerrit Updater [ 23/Nov/14 ]

Jian Yu (jian.yu@intel.com) uploaded a new patch: http://review.whamcloud.com/12823
Subject: LU-4839 tests: Give copytool more time to start
Project: fs/lustre-release
Branch: b2_5
Current Patch Set: 1
Commit: e0fe5dcb9225ec44b481f263741330b5d0312549

Comment by Jian Yu [ 23/Nov/14 ]

http://review.whamcloud.com/12682

Here is the back-ported patch for Lustre b2_5 branch: http://review.whamcloud.com/12823

Comment by Gerrit Updater [ 01/Dec/14 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12405/
Subject: LU-4839 utils: fix bandwidth ctl in lhsmtool
Project: fs/lustre-release
Branch: b2_5
Current Patch Set:
Commit: dd7d3a41b8f14f5013a8f5d605cdc85a16825b75

Comment by Jodi Levi (Inactive) [ 01/Dec/14 ]

Patches landed to Master.
Patch landings for other versions tracked externally.

Comment by Jian Yu [ 02/Dec/14 ]

The failure still occurred on Lustre b2_5 branch after the patches were landed:
https://testing.hpdd.intel.com/test_sets/1d41c008-7a28-11e4-807e-5254006e85c2

Comment by Jian Yu [ 03/Dec/14 ]

The failure still occurred on Lustre b2_5 branch after the patches were landed

I just found the patch http://review.whamcloud.com/12823 has not been landed on Lustre b2_5 branch.

Comment by Gerrit Updater [ 04/Dec/14 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12823/
Subject: LU-4839 tests: Give copytool more time to start
Project: fs/lustre-release
Branch: b2_5
Current Patch Set:
Commit: 3d9be957e67d81cba1a12aad04245c2716308ee9

Comment by Jian Yu [ 08/Dec/14 ]

Lustre b2_5 build: https://build.hpdd.intel.com/job/lustre-b2_5/105/ (which contains the patch http://review.whamcloud.com/12823)
FSTYPE=zfs

The same failure still occurred: https://testing.hpdd.intel.com/test_sets/c0eea1ea-7dba-11e4-a179-5254006e85c2

Comment by Li Wei (Inactive) [ 09/Dec/14 ]

Indeed. b2_5: https://testing.hpdd.intel.com/test_sets/4aecf848-7cc0-11e4-b42a-5254006e85c2

Comment by Jian Yu [ 12/Dec/14 ]

One more instance on Lustre b2_5 branch:
https://testing.hpdd.intel.com/test_sets/8521c5ea-8105-11e4-9c9a-5254006e85c2

Comment by Andreas Dilger [ 12/Dec/14 ]

Still seeing this test fail on master. 7x in the past week:
https://testing.hpdd.intel.com/test_sets/66760558-81f4-11e4-90bc-5254006e85c2
https://testing.hpdd.intel.com/test_sets/b432c4fe-80cb-11e4-9ec8-5254006e85c2
https://testing.hpdd.intel.com/test_sets/e5c6de08-801f-11e4-b486-5254006e85c2
https://testing.hpdd.intel.com/test_sets/8547fb7e-80c0-11e4-a434-5254006e85c2
https://testing.hpdd.intel.com/test_sets/3b87e076-79db-11e4-9e8a-5254006e85c2
https://testing.hpdd.intel.com/test_sets/5eb0c2c8-79e7-11e4-807e-5254006e85c2

Comment by Andreas Dilger [ 15/Dec/14 ]

Nathaniel, is it possible the test still isn't giving enough time for this to pass on review-zfs? This seems like one of the more common failures in review-zfs, so if we increase the wait time only for ZFS backed filesystems it will hopefully allow more passes (assuming there isn't some other real failure here, I haven't looked into the logs).

Comment by Nathaniel Clark [ 29/Dec/14 ]

Unfortunately, the wait time can't be increased much more without compromising the test, since the test is trying to ensure that updates happen every 5 seconds instead of the default 30. I've already pushed the wait time up to 20 seconds. There seems to be a significant delay between "archiving" and "saving stripe info":

1417489810.840712 lhsmtool_posix[8894]: processing file 'd60.sanity-hsm/f60.sanity-hsm'
1417489810.895598 lhsmtool_posix[8894]: archiving '/mnt/lustre/.lustre/fid/0x400000401:0x1c2:0x0' to '/home/autotest2/.autotest/shared_dir/2014-12-01/143223-70364285431720/arc1/01c2/0000/0401/0000/0004/0000/0x400000401:0x1c2:0x0_tmp'
1417489841.241595 lhsmtool_posix[8894]: saving stripe info of '/mnt/lustre/.lustre/fid/0x400000401:0x1c2:0x0' in /home/autotest2/.autotest/shared_dir/2014-12-01/143223-70364285431720/arc1/01c2/0000/0401/0000/0004/0000/0x400000401:0x1c2:0x0_tmp.lov
1417489845.934025 lhsmtool_posix[8894]: start copy of 39000000 bytes from '/mnt/lustre/.lustre/fid/0x400000401:0x1c2:0x0' to '/home/autotest2/.autotest/shared_dir/2014-12-01/143223-70364285431720/arc1/01c2/0000/0401/0000/0004/0000/0x400000401:0x1c2:0x0_tmp'
1417489850.089915 lhsmtool_posix[8894]: %13 

This step is all about creating the destination directories and opening the destination file for write (apart from an open_by_fid). My current theory is that the issue lies with the shared directory that the copytool writes to being on NFS.
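
To locate the slow step mechanically, one can look for the widest gap between consecutive log messages. The Python sketch below does that using the timestamps from the log excerpt above; the short labels stand in for the full log messages and `largest_gap` is a hypothetical helper, so treat this as an illustration only.

```python
def largest_gap(events):
    """Given (timestamp, label) pairs in log order, return
    (gap_seconds, earlier_label, later_label) for the widest gap
    between consecutive events, or None if there are fewer than two."""
    best = None
    for (t0, before), (t1, after) in zip(events, events[1:]):
        gap = t1 - t0
        if best is None or gap > best[0]:
            best = (gap, before, after)
    return best


# Timestamps from the log above; labels are abbreviated messages.
events = [
    (1417489810.840712, "processing file"),
    (1417489810.895598, "archiving"),
    (1417489841.241595, "saving stripe info"),
    (1417489845.934025, "start copy"),
    (1417489850.089915, "%13"),
]

gap, before, after = largest_gap(events)
print(f"largest gap: {gap:.1f}s between '{before}' and '{after}'")
```

With these timestamps the widest gap is the roughly 30 seconds between "archiving" and "saving stripe info", which on its own exceeds the 20-second wait.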

Comment by Gerrit Updater [ 30/Dec/14 ]

Nathaniel Clark (nathaniel.l.clark@intel.com) uploaded a new patch: http://review.whamcloud.com/13214
Subject: LU-4839 utils: DEBUG ONLY add debugging to hsmtool
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ae604857d0a7949db0c78e2a4df8d7309250b830

Comment by Gerrit Updater [ 11/Feb/15 ]

Nathaniel Clark (nathaniel.l.clark@intel.com) uploaded a new patch: http://review.whamcloud.com/13731
Subject: LU-4839 tests: wait for copytool start sanity-hsm/60
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: bd32b390a4077db4c763148f14855a96b9ec8545

Comment by nasf (Inactive) [ 26/Feb/15 ]

Another failure instance on b2_5:
https://testing.hpdd.intel.com/test_sets/a234c7d0-bd24-11e4-a946-5254006e85c2

Comment by Bruno Faccini (Inactive) [ 26/Feb/15 ]

Nasf, Nathaniel,
When browsing for LU-6203/test_251 issues, I found that both sanity-hsm/test_[60,251], and only those, failed in this last autotest session. Looking at the debug traces for test_60, I found the same issue I already strongly suspect in LU-6203: the lock flush/cancel between the client running the sub-test (which creates the file being archived) and the OSS/OST can take more than 20s. That causes the test failure, because the copytool is stuck during that time waiting to access and copy data for the archive operation.
So I wonder whether the solution for this ticket could be the same one I proposed for LU-6203/test_251: add a cancel_lru_locks() before the hsm_archive in order to flush/cancel locks early.

BTW, I don't know what causes such a delay; it looks like it only occurs with ZFS and could also be related to some VM/disk issue...

Comment by Gerrit Updater [ 03/Mar/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13731/
Subject: LU-4839 tests: wait for copytool start sanity-hsm/60
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: d0636eede1ab340421177c6a97ec27689099f953

Comment by Gerrit Updater [ 04/Mar/15 ]

Nathaniel Clark (nathaniel.l.clark@intel.com) uploaded a new patch: http://review.whamcloud.com/13962
Subject: LU-4839 tests: wait for copytool start sanity-hsm/60
Project: fs/lustre-release
Branch: b2_5
Current Patch Set: 1
Commit: 9a9e80b5518a36790a4472486b81a5162c2277a3

Comment by Peter Jones [ 23/Apr/15 ]

Landed for 2.8
