[LU-3582] Runtests failed: old and new files are different: rc=11 Created: 13/Jul/13 Updated: 07/Aug/17 Resolved: 21/Oct/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.1, Lustre 2.5.0 |
| Fix Version/s: | Lustre 2.5.0, Lustre 2.4.2 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Keith Mannthey (Inactive) | Assignee: | Keith Mannthey (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | HB |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9079 |
| Description |
|
https://maloo.whamcloud.com/test_sets/e97cb600-eb22-11e2-847f-52540035b04c

From the set logs:

11:15:04:Files /bin/date and /mnt/lustre/runtest.22993//bin/date differ
11:15:04:Files /bin/rmdir and /mnt/lustre/runtest.22993//bin/rmdir differ
11:15:04:Files /bin/sed and /mnt/lustre/runtest.22993//bin/sed differ
11:15:04: runtests : @@@@@@ FAIL: old and new files are different: rc=11
11:15:04: Trace dump:
11:15:05: = /usr/lib64/lustre/tests/test-framework.sh:4057:error_noexit()
11:15:05: = /usr/lib64/lustre/tests/test-framework.sh:4084:error()
11:15:05: = /usr/lib64/lustre/tests/test-framework.sh:4089:error_exit()
11:15:05: = /usr/lib64/lustre/tests/runtests:84:main()
11:15:05:Dumping lctl log to /logdir/test_logs/2013-07-11/lustre-reviews-el6-x86_64--review-zfs--1_3_1__16600__-70253551900780-174227/runtests..*.1373652893.log
11:15:05:CMD: wtm-15vm3,wtm-15vm4,wtm-15vm5,wtm-15vm6.rosso.whamcloud.com /usr/sbin/lctl dk > /logdir/test_logs/2013-07-11/lustre-reviews-el6-x86_64--review-zfs--1_3_1__16600__-70253551900780-174227/runtests..debug_log.\$(hostname -s).1373652893.log;
11:15:05: dmesg > /logdir/test_logs/2013-07-11/lustre-reviews-el6-x86_64--review-zfs--1_3_1__16600__-70253551900780-174227/runtests..dmesg.\$(hostname -s).1373652893.log
11:15:16:status script Total(sec) E(xcluded) S(low)
11:15:16:------------------------------------------------------------------------------------
11:15:16:test-framework exiting on error
11:15:16:runtests returned 0
11:15:16:running: sanity-hsm
11:15:16:run_suite sanity-hsm /usr/lib64/lustre/tests/sanity-hsm.sh

The test fails. The autotest logging is pretty minimal for this test. |
| Comments |
| Comment by Keith Mannthey (Inactive) [ 13/Jul/13 ] |
|
https://maloo.whamcloud.com/test_sets/e97cb600-eb22-11e2-847f-52540035b04c is a second occurrence. Both of these have been on review-zfs. |
| Comment by Andreas Dilger [ 15/Jul/13 ] |
|
Keith, can you please submit a patch to change runtests to have multiple subtests (i.e. test_1(), maybe test_2(), etc). This would make the test reporting better (1/1 tests pass instead of 0/0), and this would also dump the client/server debug and console logs on failures. |
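For illustration, a minimal sketch of the subtest wrapper Andreas is describing, following the usual test-framework test_N()/run_test convention; the helper name copy_and_verify_files is hypothetical and stands in for the existing runtests body:

# Hypothetical sketch only: wrap the existing runtests body in a numbered subtest
# so a failure calls error(), which dumps the client/server debug and console logs.
test_1() {
	copy_and_verify_files || error "old and new files are different: rc=$?"
}
run_test 1 "copy files, remount, and verify the copies are unchanged"

With something like this in place the suite reports 1/1 subtests instead of 0/0, and a failure leaves the usual debug logs behind.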
| Comment by Keith Mannthey (Inactive) [ 17/Jul/13 ] |
|
Simple patch: http://review.whamcloud.com/7014 |
| Comment by Keith Mannthey (Inactive) [ 26/Jul/13 ] |
|
OK, that simple patch adding run_test to runtests has landed. Now to watch for this issue again and see if we can better sort out what happens during the failure. |
| Comment by Nathaniel Clark [ 31/Jul/13 ] |
|
Failure with test_1 data: |
| Comment by Keith Mannthey (Inactive) [ 31/Jul/13 ] |
|
Great. So it seems the OST is blocked while processing. The check that fails in runtests is:

log "comparing $COUNT previously copied files"
cat $FILES | tr "\0" "\n" | ( rc=0; while read f; do
[ $V ] && log "verifying $DST/$f"
diff -q "$f" "$DST/$f" || rc=22
done
[ "$rc" = 0 ] || error_exit "old and new files are different: rc=$rc" )
We see the log message and then we get a timeout. I think the core issue is related to this:

22:22:51:Lustre: lustre-OST0003: deleting orphan objects from 0x0:5512 to 0x0:5537
22:22:51:Lustre: lustre-OST0004: deleting orphan objects from 0x0:5288 to 0x0:5313
22:22:51:Lustre: lustre-OST0001: deleting orphan objects from 0x240000400:34 to 0x240000400:161
22:22:51:Lustre: lustre-OST0002: deleting orphan objects from 0x280000400:34 to 0x280000400:161
22:22:51:LustreError: 17155:0:(osd_handler.c:2122:osd_fld_lookup()) lustre-OST0002-osd: cannot find FLD range for [0x280000400:0xa1:0x0]: rc = -5
22:22:51:LustreError: 17155:0:(osd_handler.c:2122:osd_fld_lookup()) Skipped 1983 previous similar messages
22:22:51:LustreError: 17155:0:(osd_oi.c:493:fid_is_on_ost()) lustre-OST0002-osd: Can not lookup fld for [0x280000400:0xa1:0x0]
22:22:51:LustreError: 17155:0:(osd_oi.c:493:fid_is_on_ost()) Skipped 1823 previous similar messages
22:22:51:Lustre: lustre-OST0004: deleting orphan objects from 0x300000400:34 to 0x300000400:161
22:22:51:Lustre: lustre-OST0003: deleting orphan objects from 0x2c0000400:34 to 0x2c0000400:193

"cannot find FLD range for" appears over the course of an hour on both the OSS and MDS. There is more investigation to be done as to why things are orphaned at this point and why the FLDs are messed up. |
| Comment by Keith Mannthey (Inactive) [ 07/Aug/13 ] |
|
So the test writes some files, then checks that they are OK. Then it umounts and remounts the filesystem and does the check again. We are orphaning data on unmount, and the comparison on remount is where we fail. From this set of logs: https://maloo.whamcloud.com/test_sets/ba704840-fd1f-11e2-b90c-52540035b04c

21:34:22:Lustre: DEBUG MARKER: /usr/sbin/lctl mark finished at Sat Aug 3 21:34:16 PDT 2013 \(1913\)
21:34:22:Lustre: DEBUG MARKER: finished at Sat Aug 3 21:34:16 PDT 2013 (1913)
21:34:22:LustreError: 28879:0:(ofd_grant.c:163:ofd_grant_sanity_check()) ofd_statfs: tot_granted 10223616 != fo_tot_granted 1116471296
21:35:13:LustreError: 7013:0:(ofd_grant.c:163:ofd_grant_sanity_check()) ofd_destroy_export: tot_granted 10223616 != fo_tot_granted 1116471296
21:35:13:LustreError: 7013:0:(ofd_grant.c:163:ofd_grant_sanity_check()) ofd_destroy_export: tot_granted 10223616 != fo_tot_granted 1093664768
21:35:13:LustreError: 28878:0:(ofd_grant.c:163:ofd_grant_sanity_check()) ofd_statfs: tot_granted 10223616 != fo_tot_granted 1093664768
21:35:13:LustreError: 28869:0:(service.c:1999:ptlrpc_server_handle_request()) @@@ Dropping timed-out request from 12345-10.10.16.146@tcp: deadline 6:6s ago
21:35:13: req@ffff8800730eb400 x1442409413140504/t0(0) o103->40b08f6e-9581-2760-df9e-3c6d4451a181@10.10.16.146@tcp:0/0 lens 328/0 e 0 to 0 dl 1375590863 ref 1 fl Interpret:/0/ffffffff rc 0/-1
21:35:13:Lustre: 28869:0:(service.c:2031:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (6:6s); client may timeout. req@ffff8800730eb400 x1442409413140504/t0(0) o103->40b08f6e-9581-2760-df9e-3c6d4451a181@10.10.16.146@tcp:0/0 lens 328/0 e 0 to 0 dl 1375590863 ref 1 fl Interpret:/0/ffffffff rc 0/-1
21:35:13:LustreError: 7013:0:(ofd_grant.c:163:ofd_grant_sanity_check()) ofd_destroy_export: tot_granted 10223616 != fo_tot_granted 1094189056
21:35:13:LustreError: 7013:0:(ofd_grant.c:163:ofd_grant_sanity_check()) Skipped 1 previous similar message
21:35:13:Lustre: 29341:0:(service.c:2031:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (6:2s); client may timeout. req@ffff880069710800 x1442409413140528/t0(0) o9->40b08f6e-9581-2760-df9e-3c6d4451a181@10.10.16.146@tcp:0/0 lens 224/192 e 0 to 0 dl 1375590869 ref 1 fl Complete:/0/0 rc 0/0
21:35:13:Lustre: 28875:0:(service.c:2031:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (6:10s); client may timeout. req@ffff880069a67000 x1442409413140520/t0(0) o9->40b08f6e-9581-2760-df9e-3c6d4451a181@10.10.16.146@tcp:0/0 lens 224/192 e 0 to 0 dl 1375590863 ref 1 fl Complete:/0/0 rc 0/0
21:35:13:LustreError: 11-0: lustre-MDT0000-lwp-OST0000: Communicating with 10.10.16.143@tcp, operation obd_ping failed with -107.
21:35:13:LustreError: Skipped 6 previous similar messages
21:35:13:Lustre: lustre-MDT0000-lwp-OST0000: Connection to lustre-MDT0000 (at 10.10.16.143@tcp) was lost; in progress operations using this service will wait for recovery to complete
21:35:13:Lustre: Skipped 6 previous similar messages
21:35:13:Lustre: 7026:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1375590879/real 1375590879] req@ffff8800730f0800 x1442406562087668/t0(0) o400->MGC10.10.16.143@tcp@10.10.16.143@tcp:26/25 lens 224/224 e 0 to 1 dl 1375590886 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
21:35:13:LustreError: 166-1: MGC10.10.16.143@tcp: Connection to MGS (at 10.10.16.143@tcp) was lost; in progress operations using this service will fail
21:35:13:Lustre: DEBUG MARKER: grep -c /mnt/ost1' ' /proc/mounts
21:35:13:Lustre: DEBUG MARKER: umount -d -f /mnt/ost1
.....
21:37:16:Lustre: DEBUG MARKER: mkdir -p /mnt/ost3; mount -t lustre lustre-ost3/ost3 /mnt/ost3
21:37:16:Lustre: lustre-OST0001: deleting orphan objects from 0x0:425 to 0x0:449

The OSTs were still trying to access the MDT when it was unmounted. The Manual states a mount order (and it makes logical sense), and I would argue we should preserve the corresponding order on unmount in test-framework (a rough sketch follows this comment). I don't quite trust the log timestamps, but it is clear to me the MDS is not present when the OST unmounts. I need to look into test-framework a bit more and start work on a patch or two. I don't think there is a core Lustre issue present at this time. |
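A minimal sketch of the unmount ordering being argued for above, assuming a simple one-client, two-OST, single-MDS setup with hypothetical mount points; this is an illustration of the idea, not the actual test-framework stopall code:

# Hypothetical teardown in the reverse of the documented mount order,
# without forcing (-f), so the OSTs can still reach the MDT while they flush.
umount /mnt/lustre        # client first
umount /mnt/ost1          # then the OSTs
umount /mnt/ost2
umount /mnt/mds1          # MDS/MGS last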
| Comment by Keith Mannthey (Inactive) [ 07/Aug/13 ] |
|
I made a quick patch and it can be seen here: http://review.whamcloud.com/7255 |
| Comment by Keith Mannthey (Inactive) [ 07/Aug/13 ] |
|
This problem is seen on both zfs and ldiskfs. After chatting with Oleg, it seems the server order should not really matter, as once the clients are unmounted the data should be synced. I am also setting up a local repro for this issue to see "how" the files are corrupted. |
| Comment by Keith Mannthey (Inactive) [ 07/Aug/13 ] |
|
My v1 patch didn't help; we actually hit the exact same error with the stopall changes in place (unmount OSTs before the MDS and drop the -f flag from the unmount). This is starting to look like a Lustre core issue now. The research continues. |
| Comment by Keith Mannthey (Inactive) [ 08/Aug/13 ] |
|
My single-VM local testing did not trigger the error after 425 loops on ldiskfs. I am going to set up multi-VM testing now. |
| Comment by Keith Mannthey (Inactive) [ 09/Aug/13 ] |
|
I am still not getting a local repro; we will see how it goes overnight. I am starting to think there may be two distinct errors here. One is the rc=11/rc=22 errors (where the diff of the files takes place); the other is a timeout like the one below. When we remount for the first time, we just hang.

18:23:51:Lustre: DEBUG MARKER: /usr/sbin/lctl mark comparing 725 previously copied files
18:23:51:Lustre: DEBUG MARKER: comparing 725 previously copied files
18:23:51:LustreError: 28675:0:(fld_handler.c:145:fld_server_lookup()) srv-lustre-MDT0000: Cannot find sequence 0x240000401: rc = -5
18:23:51:LustreError: 28675:0:(fld_handler.c:145:fld_server_lookup()) Skipped 3597 previous similar messages
18:23:51:Lustre: 28681:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1376009297/real 1376009297] req@ffff8800563e2800 x1442840111838728/t0(0) o5->lustre-OST0000-osc-MDT0000@10.10.17.34@tcp:28/4 lens 432/432 e 5 to 1 dl 1376010224 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
18:23:51:Lustre: 28681:0:(client.c:1868:ptlrpc_expire_one_request()) Skipped 16 previous similar messages
18:23:51:Lustre: lustre-OST0000-osc-MDT0000: Connection to lustre-OST0000 (at 10.10.17.34@tcp) was lost; in progress operations using this service will wait for recovery to complete
18:23:51:LustreError: 28681:0:(osp_precreate.c:734:osp_precreate_cleanup_orphans()) lustre-OST0000-osc-MDT0000: cannot cleanup orphans: rc = -11
18:23:51:LustreError: 11-0: lustre-OST0000-osc-MDT0000: Communicating with 10.10.17.34@tcp, operation ost_connect failed with -16.
18:23:51:LustreError: Skipped 6 previous similar messages
18:23:51:LustreError: 11-0: lustre-OST0000-osc-MDT0000: Communicating with 10.10.17.34@tcp, operation ost_connect failed with -16.
18:23:51:LustreError: Skipped 3 previous similar messages
18:23:51:LustreError: 11-0: lustre-OST0000-osc-MDT0000: Communicating with 10.10.17.34@tcp, operation ost_connect failed with -16.
18:23:51:LustreError: Skipped 7 previous similar messages
18:23:51:LustreError: 11-0: lustre-OST0000-osc-MDT0000: Communicating with 10.10.17.34@tcp, operation ost_connect failed with -16.
18:23:51:LustreError: Skipped 15 previous similar messages
18:23:51:LustreError: 11-0: lustre-OST0000-osc-MDT0000: Communicating with 10.10.17.34@tcp, operation ost_connect failed with -16.
18:23:51:LustreError: Skipped 30 previous similar messages
18:23:51:LustreError: 11-0: lustre-OST0000-osc-MDT0000: Communicating with 10.10.17.34@tcp, operation ost_connect failed with -16.
18:23:51:LustreError: Skipped 60 previous similar messages

and the OST is not any better:

17:48:47:Lustre: DEBUG MARKER: Using TIMEOUT=20
17:48:47:Lustre: DEBUG MARKER: /usr/sbin/lctl mark comparing 725 previously copied files
17:48:47:Lustre: DEBUG MARKER: comparing 725 previously copied files
17:48:47:Lustre: lustre-OST0000: deleting orphan objects from 0x200000402:905 to 0x200000402:1032
17:48:47:Lustre: lustre-OST0001: deleting orphan objects from 0x240000401:1800 to 0x240000401:1927
17:48:47:Lustre: lustre-OST0002: deleting orphan objects from 0x280000400:1800 to 0x280000400:1927
17:48:47:LustreError: 15183:0:(osd_handler.c:2123:osd_fld_lookup()) lustre-OST0001-osd: cannot find FLD range for [0x240000401:0x787:0x0]: rc = -5
17:48:47:LustreError: 15183:0:(osd_handler.c:2123:osd_fld_lookup()) Skipped 3597 previous similar messages
17:48:47:LustreError: 15183:0:(osd_oi.c:493:fid_is_on_ost()) lustre-OST0001-osd: Can not lookup fld for [0x240000401:0x787:0x0]
17:48:47:LustreError: 15183:0:(osd_oi.c:493:fid_is_on_ost()) Skipped 3405 previous similar messages
17:48:47:Lustre: lustre-OST0003: deleting orphan objects from 0x2c0000402:1799 to 0x2c0000402:1926
17:48:47:Lustre: lustre-OST0004: deleting orphan objects from 0x300000400:1800 to 0x300000400:1927
17:52:08:Lustre: lustre-OST0005: deleting orphan objects from 0x340000401:1800 to 0x340000401:1927
17:52:08:Lustre: lustre-OST0006: deleting orphan objects from 0x380000400:1801 to 0x380000400:2575
17:52:08:INFO: task ll_ost00_000:13457 blocked for more than 120 seconds.
17:52:08:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
17:52:08:ll_ost00_000 D 0000000000000000 0 13457 2 0x00000080
17:52:08: ffff8800575a3640 0000000000000046 0000000000000001 000000003510abfb
17:52:08: 0000000000000002 ffff8800575a36f0 ffffffffa0c960f0 ffffffffa0c960ec
17:52:08: ffff880062a685f8 ffff8800575a3fd8 000000000000fb88 ffff880062a685f8
17:52:08:Call Trace:
17:52:08: [<ffffffff8150f81e>] __mutex_lock_slowpath+0x13e/0x180
17:52:08: [<ffffffff81281964>] ? snprintf+0x34/0x40
17:52:08: [<ffffffff8150f6bb>] mutex_lock+0x2b/0x50
17:52:08: [<ffffffffa0c7d6aa>] osd_obj_map_lookup+0x20a/0x750 [osd_ldiskfs]
17:52:08: [<ffffffff81166b9a>] ? kmem_getpages+0xba/0x170
17:52:08: [<ffffffffa0c6c89e>] osd_oi_lookup+0x23e/0x4a0 [osd_ldiskfs]
17:52:08: [<ffffffffa0c691e2>] osd_object_init+0x752/0x1110 [osd_ldiskfs]
17:52:08: [<ffffffff81167dc3>] ? kmem_cache_alloc_trace+0x1a3/0x1b0
17:52:08: [<ffffffffa05d0238>] lu_object_alloc+0xd8/0x320 [obdclass]
17:52:08: [<ffffffffa05d0de0>] lu_object_find_at+0x210/0x360 [obdclass]
17:52:08: [<ffffffffa0d18eb3>] ? ofd_key_init+0x53/0x1a0 [ofd]
17:52:08: [<ffffffffa05cd87f>] ? keys_fill+0x6f/0x190 [obdclass]
17:52:08: [<ffffffffa05d0f46>] lu_object_find+0x16/0x20 [obdclass]
17:52:08: [<ffffffffa0d2c515>] ofd_object_find+0x35/0xf0 [ofd]
17:52:08: [<ffffffffa0d3b6a3>] ofd_lvbo_init+0x323/0x940 [ofd]
17:52:08: [<ffffffffa071c48c>] ldlm_resource_get+0x36c/0x8f0 [ptlrpc]
17:52:08: [<ffffffffa0716955>] ldlm_lock_create+0x55/0xa10 [ptlrpc]
17:52:08: [<ffffffffa073de46>] ldlm_handle_enqueue0+0x156/0x10b0 [ptlrpc]
17:52:08: [<ffffffffa073edfe>] ldlm_handle_enqueue+0x5e/0x70 [ptlrpc]
17:52:08: [<ffffffffa073ee10>] ? ldlm_server_completion_ast+0x0/0x6c0 [ptlrpc]
17:52:08: [<ffffffffa0cf8420>] ? ost_blocking_ast+0x0/0x10e0 [ost]
17:52:08: [<ffffffffa07398f0>] ? ldlm_server_glimpse_ast+0x0/0x3b0 [ptlrpc]
17:52:08: [<ffffffffa0d0188c>] ost_handle+0x1bbc/0x4030 [ost]
17:52:08: [<ffffffffa0462d64>] ? libcfs_id2str+0x74/0xb0 [libcfs]
17:52:08: [<ffffffffa076e598>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
17:52:08: [<ffffffffa045754e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
17:52:08: [<ffffffffa0468a6f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
17:52:08: [<ffffffffa07659a9>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
17:52:08: [<ffffffff81055ad3>] ? __wake_up+0x53/0x70
17:52:08: [<ffffffffa076f91d>] ptlrpc_main+0xabd/0x1700 [ptlrpc]
17:52:08: [<ffffffffa076ee60>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
17:52:08: [<ffffffff81096956>] kthread+0x96/0xa0
17:52:08: [<ffffffff8100c0ca>] child_rip+0xa/0x20
17:52:08: [<ffffffff810968c0>] ? kthread+0x0/0xa0
17:52:08: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

The timeouts go on for a while:

18:04:17:Lustre: 15154:0:(service.c:1339:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-207), not sending early reply
18:04:17: req@ffff88006d5fbc00 x1442840111838728/t0(0) o5->lustre-MDT0000-mdtlov_UUID@10.10.17.33@tcp:0/0 lens 432/432 e 5 to 0 dl 1376010109 ref 2 fl Interpret:/0/0 rc 0/0
18:04:17:Lustre: lustre-OST0000: Client lustre-MDT0000-mdtlov_UUID (at 10.10.17.33@tcp) reconnecting
18:04:17:Lustre: lustre-OST0000: Client lustre-MDT0000-mdtlov_UUID (at 10.10.17.33@tcp) refused reconnection, still busy with 1 active RPCs
18:04:17:Lustre: lustre-OST0000: Client lustre-MDT0000-mdtlov_UUID (at 10.10.17.33@tcp) refused reconnection, still busy with 1 active RPCs
18:04:17:Lustre: lustre-OST0000: Client lustre-MDT0000-mdtlov_UUID (at 10.10.17.33@tcp) reconnecting
18:04:17:Lustre: Skipped 1 previous similar message
18:04:17:Lustre: lustre-OST0000: Client lustre-MDT0000-mdtlov_UUID (at 10.10.17.33@tcp) refused reconnection, still busy with 1 active RPCs
18:04:17:Lustre: lustre-OST0000: Client lustre-MDT0000-mdtlov_UUID (at 10.10.17.33@tcp) refused reconnection, still busy with 1 active RPCs
18:04:17:Lustre: lustre-OST0000: Client lustre-MDT0000-mdtlov_UUID (at 10.10.17.33@tcp) refused reconnection, still busy with 1 active RPCs

I don't know why time is off so far in these logs. |
| Comment by Keith Mannthey (Inactive) [ 09/Aug/13 ] |
|
Multi-VM testing is not reproducing the issue after over 400 attempts. I will try tweaking some things. Also, it was noted in the logs that in some cases time is jumping around on the servers (random 30+ minute gains); I opened a TT to track the issue: https://jira.hpdd.intel.com/browse/TT-1615 |
| Comment by Keith Mannthey (Inactive) [ 09/Aug/13 ] |
|
I am not seeing

17:48:47:LustreError: 15183:0:(osd_handler.c:2123:osd_fld_lookup()) lustre-OST0001-osd: cannot find FLD range for [0x240000401:0x787:0x0]: rc = -5
17:48:47:LustreError: 15183:0:(osd_handler.c:2123:osd_fld_lookup()) Skipped 3597 previous similar messages
17:48:47:LustreError: 15183:0:(osd_oi.c:493:fid_is_on_ost()) lustre-OST0001-osd: Can not lookup fld for [0x240000401:0x787:0x0]
17:48:47:LustreError: 15183:0:(osd_oi.c:493:fid_is_on_ost()) Skipped 3405 previous similar messages

in my local testing. |
| Comment by Keith Mannthey (Inactive) [ 10/Aug/13 ] |
|
I think those errors are from the scrub process somehow, but I am not 100% sure yet. I don't get these errors locally yet. There are a lot of "time" issues with the rosso runs; I followed up some with TT-1615. In runs that pass there might be 1-2 minutes of drift here and there; in runs that fail, 10-60+ minutes of drift is seen. On Torro runs the clocks are synced to within a few seconds. There might be some correlation between time drift and the timeout errors. It is not clear yet, but there is some strong evidence. This issue is starting to get worse (it is reported in about 10% of runs right now). |
| Comment by Keith Mannthey (Inactive) [ 13/Aug/13 ] |
|
No sign of the -22 and -11 errors on Torro. No sign of these same timeout errors on Torro either. At this point Rosso seems to be part of the trigger. Weekend testing did not reproduce this issue locally. |
| Comment by Keith Mannthey (Inactive) [ 13/Aug/13 ] |
|
Since we only see this in autotest, I created a second patch, http://review.whamcloud.com/7311, which removes the -q from the diff. I asked it to run runtests several times. |
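For reference, a minimal sketch of what the verification loop looks like with -q dropped, adapted from the runtests snippet quoted earlier in this ticket; this is an illustration, not the exact content of patch 7311:

# Without -q, diff prints the differing lines for text files; binaries still only
# report "Binary files ... differ", so cmp -l would be needed for byte-level detail.
cat $FILES | tr "\0" "\n" | ( rc=0; while read f; do
	[ $V ] && log "verifying $DST/$f"
	diff "$f" "$DST/$f" || rc=22
done
[ "$rc" = 0 ] || error_exit "old and new files are different: rc=$rc" )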
| Comment by Keith Mannthey (Inactive) [ 20/Aug/13 ] |
|
Hmm, testing has not run for the above patch; TT-1644 was opened last week. |
| Comment by Andreas Dilger [ 23/Aug/13 ] |
|
Di, some questions for you in this bug:
Keith, the main focus for your investigation should be why the OST is timing out in the first place. This is supposed to be a very simple, "normal usage" kind of test, so if there are timeouts something is very wrong. |
| Comment by Keith Mannthey (Inactive) [ 24/Aug/13 ] |
|
Also, rc=22 (after another remount) is a common error code. The timeouts look like:

15:44:33:INFO: task mount.lustre:12729 blocked for more than 120 seconds.
15:44:33:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
15:44:33:mount.lustre D 0000000000000000 0 12729 12728 0x00000080
15:44:33: ffff880064447558 0000000000000086 ffff880064447528 000000001cd3dcca
15:44:33: ffff880064447578 ffffffffa03d7faf ffffffffa0c03317 ffffffffa0c03313
15:44:33: ffff88007b3845f8 ffff880064447fd8 000000000000fb88 ffff88007b3845f8
15:44:33:Call Trace:
15:44:33: [<ffffffffa03d7faf>] ? start_this_handle+0xdf/0x4a0 [jbd2]
15:44:33: [<ffffffff8150f81e>] __mutex_lock_slowpath+0x13e/0x180
15:44:33: [<ffffffffa043c2b8>] ? __ldiskfs_journal_stop+0x68/0xa0 [ldiskfs]
15:44:33: [<ffffffff8150f6bb>] mutex_lock+0x2b/0x50
15:44:33: [<ffffffffa0be7e16>] osd_obj_map_recover+0x186/0x9e0 [osd_ldiskfs]
15:44:33: [<ffffffffa0bec61c>] ? osd_scrub_get_fid+0x12c/0x210 [osd_ldiskfs]
15:44:33: [<ffffffffa0bf0946>] osd_ios_lf_fill+0x4a6/0x5e0 [osd_ldiskfs]
15:44:33: [<ffffffffa0406b2f>] ldiskfs_readdir+0x3cf/0x730 [ldiskfs]
15:44:33: [<ffffffffa0bf04a0>] ? osd_ios_lf_fill+0x0/0x5e0 [osd_ldiskfs]
15:44:33: [<ffffffffa040623c>] ? free_rb_tree_fname+0x7c/0xe0 [ldiskfs]
15:44:33: [<ffffffffa0bebe2e>] osd_ios_general_scan+0xae/0x1a0 [osd_ldiskfs]
15:44:33: [<ffffffffa0bd8f57>] ? osd_oi_init+0x127/0x7d0 [osd_ldiskfs]
15:44:33: [<ffffffffa0bf2b83>] osd_scrub_setup+0x5f3/0xe50 [osd_ldiskfs]
15:44:33: [<ffffffffa05a5bcc>] ? lprocfs_alloc_stats+0x15c/0x760 [obdclass]
15:44:33: [<ffffffffa05cfdfb>] ? lu_site_init+0x2ab/0x4a0 [obdclass]
15:44:33: [<ffffffffa0bc84a9>] osd_device_alloc+0x5f9/0x830 [osd_ldiskfs]
15:44:33: [<ffffffffa05b8f97>] obd_setup+0x1d7/0x2e0 [obdclass]
15:44:33: [<ffffffffa05b92a8>] class_setup+0x208/0x870 [obdclass]
15:44:33: [<ffffffffa05c0abc>] class_process_config+0xc7c/0x1c30 [obdclass]
15:44:33: [<ffffffffa05c5b03>] ? lustre_cfg_new+0x2d3/0x6e0 [obdclass]
15:44:33: [<ffffffffa05c6059>] do_lcfg+0x149/0x480 [obdclass]
15:44:33: [<ffffffffa05c6424>] lustre_start_simple+0x94/0x200 [obdclass]
15:44:33: [<ffffffffa05feb6d>] server_fill_super+0xc8d/0x1a24 [obdclass]
15:44:33: [<ffffffffa05cbe48>] lustre_fill_super+0x1d8/0x530 [obdclass]
15:44:33: [<ffffffffa05cbc70>] ? lustre_fill_super+0x0/0x530 [obdclass]
15:44:33: [<ffffffff8118435f>] get_sb_nodev+0x5f/0xa0
15:44:33: [<ffffffffa05c3ae5>] lustre_get_sb+0x25/0x30 [obdclass]
15:44:33: [<ffffffff8118399b>] vfs_kern_mount+0x7b/0x1b0
15:44:33: [<ffffffff81183b42>] do_kern_mount+0x52/0x130
15:44:33: [<ffffffff811a3d92>] do_mount+0x2d2/0x8d0
15:44:33: [<ffffffff811a4420>] sys_mount+0x90/0xe0
15:44:33: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

It looks like osd_scrub_setup and friends are getting stuck. I will look around some more and see if I can sort it out. There is a jbd2 thread also. |
| Comment by Di Wang [ 25/Aug/13 ] |
|
Andreas, 1. Yes, FIDs on OST should not use NORMAL seq without DNE. This looks weird. |
| Comment by Keith Mannthey (Inactive) [ 30/Aug/13 ] |
|
Maloo had some hassles with test logs that slowed down the search for timeout issues. I have had a few cycles to look at the scrub and jbd2 code paths, but I am still looking. I will update when I have a more solid idea of the root cause. |
| Comment by Keith Mannthey (Inactive) [ 01/Sep/13 ] |
|
I have one set of timeout hang data; at this point Maloo is still without many error logs.

INFO: task ll_ost00_000:13457 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ll_ost00_000 D 0000000000000000 0 13457 2 0x00000080
ffff8800575a3640 0000000000000046 0000000000000001 000000003510abfb
0000000000000002 ffff8800575a36f0 ffffffffa0c960f0 ffffffffa0c960ec
ffff880062a685f8 ffff8800575a3fd8 000000000000fb88 ffff880062a685f8
Call Trace:
[<ffffffff8150f81e>] __mutex_lock_slowpath+0x13e/0x180
[<ffffffff81281964>] ? snprintf+0x34/0x40
[<ffffffff8150f6bb>] mutex_lock+0x2b/0x50
[<ffffffffa0c7d6aa>] osd_obj_map_lookup+0x20a/0x750 [osd_ldiskfs]
[<ffffffff81166b9a>] ? kmem_getpages+0xba/0x170
[<ffffffffa0c6c89e>] osd_oi_lookup+0x23e/0x4a0 [osd_ldiskfs]
[<ffffffffa0c691e2>] osd_object_init+0x752/0x1110 [osd_ldiskfs]
[<ffffffff81167dc3>] ? kmem_cache_alloc_trace+0x1a3/0x1b0
[<ffffffffa05d0238>] lu_object_alloc+0xd8/0x320 [obdclass]
[<ffffffffa05d0de0>] lu_object_find_at+0x210/0x360 [obdclass]
[<ffffffffa0d18eb3>] ? ofd_key_init+0x53/0x1a0 [ofd]
[<ffffffffa05cd87f>] ? keys_fill+0x6f/0x190 [obdclass]
[<ffffffffa05d0f46>] lu_object_find+0x16/0x20 [obdclass]
[<ffffffffa0d2c515>] ofd_object_find+0x35/0xf0 [ofd]
[<ffffffffa0d3b6a3>] ofd_lvbo_init+0x323/0x940 [ofd]
[<ffffffffa071c48c>] ldlm_resource_get+0x36c/0x8f0 [ptlrpc]
[<ffffffffa0716955>] ldlm_lock_create+0x55/0xa10 [ptlrpc]
[<ffffffffa073de46>] ldlm_handle_enqueue0+0x156/0x10b0 [ptlrpc]
[<ffffffffa073edfe>] ldlm_handle_enqueue+0x5e/0x70 [ptlrpc]
[<ffffffffa073ee10>] ? ldlm_server_completion_ast+0x0/0x6c0 [ptlrpc]
[<ffffffffa0cf8420>] ? ost_blocking_ast+0x0/0x10e0 [ost]
[<ffffffffa07398f0>] ? ldlm_server_glimpse_ast+0x0/0x3b0 [ptlrpc]
[<ffffffffa0d0188c>] ost_handle+0x1bbc/0x4030 [ost]
[<ffffffffa0462d64>] ? libcfs_id2str+0x74/0xb0 [libcfs]
[<ffffffffa076e598>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
[<ffffffffa045754e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
[<ffffffffa0468a6f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
[<ffffffffa07659a9>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
[<ffffffff81055ad3>] ? __wake_up+0x53/0x70
[<ffffffffa076f91d>] ptlrpc_main+0xabd/0x1700 [ptlrpc]
[<ffffffffa076ee60>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
[<ffffffff81096956>] kthread+0x96/0xa0
[<ffffffff8100c0ca>] child_rip+0xa/0x20
[<ffffffff810968c0>] ? kthread+0x0/0xa0
[<ffffffff8100c0c0>] ? child_rip+0x0/0x20

There is a second thread with this exact error path (2, and only 2, of the OST ll_ost00_XXX threads). I don't see any sign of an ill-acquired directory mutex in this code path; we don't even touch the inode in most of it. In osd_obj_map_recover we are blocked trying to acquire a mutex from a directory inode. So far there is no sign of who actually holds it; a crashdump would be helpful. In general we are stuck waiting in osd_oi_lookup, and this is a newer chunk of code, the OI scrub code. From commit 3b657b6b81913afe21e4140f374f0edab20a7f38:

iget:
inode = osd_iget(info, dev, id);
if (IS_ERR(inode)) {
result = PTR_ERR(inode);
if (result == -ENOENT || result == -ESTALE) {
if (!in_oi) {
fid_zero(&oic->oic_fid);
GOTO(out, result = 0);
}
/* XXX: There are three possible cases:
* 1. Backup/restore caused the OI invalid.
* 2. Someone unlinked the object but NOT removed
* the OI mapping, such as mount target device
* as ldiskfs, and modify something directly.
* 3. Someone just removed the object between the
* former oi_lookup and the iget. It is normal.
*
* It is diffcult to distinguish the 2nd from the
* 1st case. Relatively speaking, the 1st case is
* common than the 2nd case, trigger OI scrub. */
result = osd_oi_lookup(info, dev, fid, id, true); <-- soon we hang on the directory mutex.
As all the timeout errors that I have seen from runtests have been OI scrub related, I will add Fan Yong and email him directly to see if he can help further identify the root cause. |
| Comment by Keith Mannthey (Inactive) [ 01/Sep/13 ] |
|
Fan Yong, we seem to be getting tasks that hang in some of the OI scrub paths. Could you take a look at these backtraces and share any ideas you might have? |
| Comment by nasf (Inactive) [ 03/Sep/13 ] |
|
I have found a similar system hang when testing sanity-scrub. I am investigating it ( |
| Comment by Keith Mannthey (Inactive) [ 04/Sep/13 ] |
|
Looks to be related to the timeout issues. |
| Comment by Keith Mannthey (Inactive) [ 04/Sep/13 ] |
|
nasf, It is interesting your patch for |
| Comment by nasf (Inactive) [ 05/Sep/13 ] |
|
The OI scrub has the functionality to recover orphan OST-objects from /lost+found back to their original places. We used to perform such recovery offline via the Lustre tool ll_recover_lost_found_objs; now we can do it online via OI scrub. Usually it is e2fsck that puts orphan OST-objects under /lost+found, and when the OST mounts up, the OI scrub will scan /lost+found to perform the related recovery. I do not think runtests itself will link anything under /lost+found. If it is a new filesystem that has never run e2fsck, then /lost+found should be empty. |
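For context, a rough sketch of the older offline recovery flow being contrasted with the online OI scrub; the device name and mount point are hypothetical, and the exact ll_recover_lost_found_objs options should be checked against the installed release:

# Hypothetical offline recovery of orphan OST objects (the pre-online-OI-scrub approach).
umount /mnt/ost1                                      # take the OST out of service
e2fsck -fy /dev/sdb1                                  # e2fsck may link orphan objects into lost+found
mount -t ldiskfs /dev/sdb1 /mnt/ost1                  # mount the backing filesystem directly
ll_recover_lost_found_objs -d /mnt/ost1/lost+found    # move objects back under O/<seq>/d*
umount /mnt/ost1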
| Comment by Jian Yu [ 06/Sep/13 ] |
|
Lustre build: http://build.whamcloud.com/job/lustre-b2_4/44/ (2.4.1 RC1)

runtests failed as follows:

comparing 725 previously copied files
Files /bin/gzip and /mnt/lustre/runtest.31371//bin/gzip differ
Files /bin/findmnt and /mnt/lustre/runtest.31371//bin/findmnt differ
<~snip~>
Files /bin/ipcalc and /mnt/lustre/runtest.31371//bin/ipcalc differ
Files /bin/touch and /mnt/lustre/runtest.31371//bin/touch differ
 runtests : @@@@@@ FAIL: old and new files are different: rc=22

Maloo report: https://maloo.whamcloud.com/test_sets/847a9a8e-1641-11e3-aa2a-52540035b04c

FYI, the same test passed on Lustre 2.4.1 RC1 with FSTYPE=zfs and MDSCOUNT=1. |
| Comment by Keith Mannthey (Inactive) [ 11/Sep/13 ] |
|
Di, my initial debug patch has never run; I am leaving it in place for the Tools team. I created a new patch without the test params: |
| Comment by Di Wang [ 11/Sep/13 ] |
|
Hmm, runtests should not do any cross-MDT operation, unless I am missing something. I will take a look. |
| Comment by Di Wang [ 11/Sep/13 ] |
|
Hmm, I cannot reproduce this problem locally with "MDSCOUNT=4 FSTYPE=zfs sh llmount.sh". I checked the test scripts, and it seems all of the tests should be done on MDT0. According to the Maloo logs, sometimes runtests can pass with DNE+ZFS as well, and sometimes it even fails with a single MDT + ldiskfs. |
| Comment by Keith Mannthey (Inactive) [ 11/Sep/13 ] |
|
Could you try running runtests in a loop? I think there is something about the 700 or so files and the unmount/remount window. I have not been able to hit this issue in my local environment with a single MDS + ldiskfs. |
| Comment by Di Wang [ 12/Sep/13 ] |
|
I tried 20 times "for ((i=0;i<20;i++)) do sh auster -r -s -f testnode runtests ; done" with 2MDS(4MDTs) + ZFS. Still can not reproduce the problem. |
| Comment by Jodi Levi (Inactive) [ 13/Sep/13 ] |
|
Is this still happening regularly? |
| Comment by Keith Mannthey (Inactive) [ 16/Sep/13 ] |
|
Sorry for the late response. We don't seem to be hitting this issue right now. I looked at Maloo on Friday and it had been almost 2 weeks since we had seen issues. It is unclear what has happened, but we are definitely not hitting this issue like we were. |
| Comment by Jodi Levi (Inactive) [ 17/Sep/13 ] |
|
Reducing priority for a while longer to see if this occurs again and then will close. |
| Comment by Jodi Levi (Inactive) [ 21/Oct/13 ] |
|
Patches landed to Master and issue does not seem to be occurring any longer. |
| Comment by Jian Yu [ 25/Nov/13 ] |
Patch for Lustre b2_4 branch is in http://review.whamcloud.com/8384. |
| Comment by Jian Yu [ 26/Nov/13 ] |
|
Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/58/

runtests failed again:

I ran the test with the same configuration 6 times manually; all runs passed: |
| Comment by Gerrit Updater [ 07/Oct/16 ] |
|
Mike Pershin (mike.pershin@intel.com) uploaded a new patch: http://review.whamcloud.com/22997 |
| Comment by Gerrit Updater [ 13/Jul/17 ] |
|
Mike Pershin (mike.pershin@intel.com) uploaded a new patch: https://review.whamcloud.com/28009 |
| Comment by Gerrit Updater [ 04/Aug/17 ] |
|
Minh Diep (minh.diep@intel.com) merged in patch https://review-stage.hpdd.intel.com/28009/ |
| Comment by Gerrit Updater [ 07/Aug/17 ] |
|
Minh Diep (minh.diep@intel.com) merged in patch https://review-stage.hpdd.intel.com/28009/ |