[LU-3582] Runtests failed: old and new files are different: rc=11 Created: 13/Jul/13  Updated: 07/Aug/17  Resolved: 21/Oct/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.1, Lustre 2.5.0
Fix Version/s: Lustre 2.5.0, Lustre 2.4.2

Type: Bug Priority: Minor
Reporter: Keith Mannthey (Inactive) Assignee: Keith Mannthey (Inactive)
Resolution: Fixed Votes: 0
Labels: HB

Issue Links:
Related
is related to LU-3126 conf-sanity test_41b: fld_server_look... Resolved
is related to LU-3706 sanity-scrub test_13 timeout Resolved
is related to LU-3895 conf-sanity test 47: umount -d -f /mn... Closed
Severity: 3
Rank (Obsolete): 9079

 Description   

https://maloo.whamcloud.com/test_sets/e97cb600-eb22-11e2-847f-52540035b04c

From the set logs:

11:15:04:Files /bin/date and /mnt/lustre/runtest.22993//bin/date differ
11:15:04:Files /bin/rmdir and /mnt/lustre/runtest.22993//bin/rmdir differ
11:15:04:Files /bin/sed and /mnt/lustre/runtest.22993//bin/sed differ
11:15:04: runtests : @@@@@@ FAIL: old and new files are different: rc=11 
11:15:04:  Trace dump:
11:15:05:  = /usr/lib64/lustre/tests/test-framework.sh:4057:error_noexit()
11:15:05:  = /usr/lib64/lustre/tests/test-framework.sh:4084:error()
11:15:05:  = /usr/lib64/lustre/tests/test-framework.sh:4089:error_exit()
11:15:05:  = /usr/lib64/lustre/tests/runtests:84:main()
11:15:05:Dumping lctl log to /logdir/test_logs/2013-07-11/lustre-reviews-el6-x86_64--review-zfs--1_3_1__16600__-70253551900780-174227/runtests..*.1373652893.log
11:15:05:CMD: wtm-15vm3,wtm-15vm4,wtm-15vm5,wtm-15vm6.rosso.whamcloud.com /usr/sbin/lctl dk > /logdir/test_logs/2013-07-11/lustre-reviews-el6-x86_64--review-zfs--1_3_1__16600__-70253551900780-174227/runtests..debug_log.\$(hostname -s).1373652893.log;
11:15:05:         dmesg > /logdir/test_logs/2013-07-11/lustre-reviews-el6-x86_64--review-zfs--1_3_1__16600__-70253551900780-174227/runtests..dmesg.\$(hostname -s).1373652893.log
11:15:16:status        script            Total(sec) E(xcluded) S(low) 
11:15:16:------------------------------------------------------------------------------------
11:15:16:test-framework exiting on error
11:15:16:runtests returned 0
11:15:16:running: sanity-hsm 
11:15:16:run_suite sanity-hsm /usr/lib64/lustre/tests/sanity-hsm.sh

The test fails.

The autotest logging for this test is pretty minimal.



 Comments   
Comment by Keith Mannthey (Inactive) [ 13/Jul/13 ]

https://maloo.whamcloud.com/test_sets/e97cb600-eb22-11e2-847f-52540035b04c is a second occurrence. Both of these have been on review-zfs.

Comment by Andreas Dilger [ 15/Jul/13 ]

Keith, can you please submit a patch to change runtests to have multiple subtests (i.e. test_1(), maybe test_2(), etc). This would make the test reporting better (1/1 tests pass instead of 0/0), and this would also dump the client/server debug and console logs on failures.
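
For illustration, a rough sketch of the shape being suggested (run_test, complete, check_and_cleanup_lustre and exit_status are test-framework helpers used by the other test scripts; this is not the actual patch):

test_1() {
        : # existing runtests body: copy the files, remount, compare
}
run_test 1 "copy files, remount and compare old vs new"

complete $SECONDS
check_and_cleanup_lustre
exit_status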

Comment by Keith Mannthey (Inactive) [ 17/Jul/13 ]

Simple patch: http://review.whamcloud.com/7014

Comment by Keith Mannthey (Inactive) [ 26/Jul/13 ]

OK, that simple patch adding run_test to runtests has landed. Now to watch for this issue again and see if we can better sort out what happens during the failure.

Comment by Nathaniel Clark [ 31/Jul/13 ]

Failure with test_1 data:
https://maloo.whamcloud.com/test_sets/3f05e348-f9c5-11e2-befb-52540035b04c

Comment by Keith Mannthey (Inactive) [ 31/Jul/13 ]

Great.

So it seems the OST is blocked while processing.

log "comparing $COUNT previously copied files"
cat $FILES | tr "\0" "\n" | ( rc=0; while read f; do
        [ $V ] && log "verifying $DST/$f"
        diff -q "$f" "$DST/$f" || rc=22
done
[ "$rc" = 0 ] || error_exit "old and new files are different: rc=$rc" )

We see the log message, and then we get a timeout.

I think the core issue is related to this:

22:22:51:Lustre: lustre-OST0003: deleting orphan objects from 0x0:5512 to 0x0:5537
22:22:51:Lustre: lustre-OST0004: deleting orphan objects from 0x0:5288 to 0x0:5313
22:22:51:Lustre: lustre-OST0001: deleting orphan objects from 0x240000400:34 to 0x240000400:161
22:22:51:Lustre: lustre-OST0002: deleting orphan objects from 0x280000400:34 to 0x280000400:161
22:22:51:LustreError: 17155:0:(osd_handler.c:2122:osd_fld_lookup()) lustre-OST0002-osd: cannot find FLD range for [0x280000400:0xa1:0x0]: rc = -5
22:22:51:LustreError: 17155:0:(osd_handler.c:2122:osd_fld_lookup()) Skipped 1983 previous similar messages
22:22:51:LustreError: 17155:0:(osd_oi.c:493:fid_is_on_ost()) lustre-OST0002-osd: Can not lookup fld for [0x280000400:0xa1:0x0]
22:22:51:LustreError: 17155:0:(osd_oi.c:493:fid_is_on_ost()) Skipped 1823 previous similar messages
22:22:51:Lustre: lustre-OST0004: deleting orphan objects from 0x300000400:34 to 0x300000400:161
22:22:51:Lustre: lustre-OST0003: deleting orphan objects from 0x2c0000400:34 to 0x2c0000400:193

"cannot find FLD range for" appears over the course of an hour on both the OSS and MDS. There is more investigation to be done as to why things are orphaned at this point and why the lfd are messed up.

Comment by Keith Mannthey (Inactive) [ 07/Aug/13 ]

So the test writes some files, checks that they are OK, then unmounts and remounts the filesystem and does the check again.

We are orphaning data on unmount, and the comparison after the remount is where we fail.
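
In outline (a paraphrase, not the verbatim script; stopall/setupall stand in for the exact remount steps the script uses):

# 1. copy ~700 system files (e.g. /bin/*) into $DST on Lustre and verify them
# 2. unmount and remount everything
sync
stopall
setupall
# 3. compare every copy against its original again -- the rc=11/22 failure and
#    the orphan-cleanup messages show up in this window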

From this set of logs: https://maloo.whamcloud.com/test_sets/ba704840-fd1f-11e2-b90c-52540035b04c
There are 715 files total, and the logs on the OST show this:

21:34:22:Lustre: DEBUG MARKER: /usr/sbin/lctl mark finished at Sat Aug  3 21:34:16 PDT 2013 \(1913\)
21:34:22:Lustre: DEBUG MARKER: finished at Sat Aug 3 21:34:16 PDT 2013 (1913)
21:34:22:LustreError: 28879:0:(ofd_grant.c:163:ofd_grant_sanity_check()) ofd_statfs: tot_granted 10223616 != fo_tot_granted 1116471296
21:35:13:LustreError: 7013:0:(ofd_grant.c:163:ofd_grant_sanity_check()) ofd_destroy_export: tot_granted 10223616 != fo_tot_granted 1116471296
21:35:13:LustreError: 7013:0:(ofd_grant.c:163:ofd_grant_sanity_check()) ofd_destroy_export: tot_granted 10223616 != fo_tot_granted 1093664768
21:35:13:LustreError: 28878:0:(ofd_grant.c:163:ofd_grant_sanity_check()) ofd_statfs: tot_granted 10223616 != fo_tot_granted 1093664768
21:35:13:LustreError: 28869:0:(service.c:1999:ptlrpc_server_handle_request()) @@@ Dropping timed-out request from 12345-10.10.16.146@tcp: deadline 6:6s ago
21:35:13:  req@ffff8800730eb400 x1442409413140504/t0(0) o103->40b08f6e-9581-2760-df9e-3c6d4451a181@10.10.16.146@tcp:0/0 lens 328/0 e 0 to 0 dl 1375590863 ref 1 fl Interpret:/0/ffffffff rc 0/-1
21:35:13:Lustre: 28869:0:(service.c:2031:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (6:6s); client may timeout.  req@ffff8800730eb400 x1442409413140504/t0(0) o103->40b08f6e-9581-2760-df9e-3c6d4451a181@10.10.16.146@tcp:0/0 lens 328/0 e 0 to 0 dl 1375590863 ref 1 fl Interpret:/0/ffffffff rc 0/-1
21:35:13:LustreError: 7013:0:(ofd_grant.c:163:ofd_grant_sanity_check()) ofd_destroy_export: tot_granted 10223616 != fo_tot_granted 1094189056
21:35:13:LustreError: 7013:0:(ofd_grant.c:163:ofd_grant_sanity_check()) Skipped 1 previous similar message
21:35:13:Lustre: 29341:0:(service.c:2031:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (6:2s); client may timeout.  req@ffff880069710800 x1442409413140528/t0(0) o9->40b08f6e-9581-2760-df9e-3c6d4451a181@10.10.16.146@tcp:0/0 lens 224/192 e 0 to 0 dl 1375590869 ref 1 fl Complete:/0/0 rc 0/0
21:35:13:Lustre: 28875:0:(service.c:2031:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (6:10s); client may timeout.  req@ffff880069a67000 x1442409413140520/t0(0) o9->40b08f6e-9581-2760-df9e-3c6d4451a181@10.10.16.146@tcp:0/0 lens 224/192 e 0 to 0 dl 1375590863 ref 1 fl Complete:/0/0 rc 0/0
21:35:13:LustreError: 11-0: lustre-MDT0000-lwp-OST0000: Communicating with 10.10.16.143@tcp, operation obd_ping failed with -107.
21:35:13:LustreError: Skipped 6 previous similar messages
21:35:13:Lustre: lustre-MDT0000-lwp-OST0000: Connection to lustre-MDT0000 (at 10.10.16.143@tcp) was lost; in progress operations using this service will wait for recovery to complete
21:35:13:Lustre: Skipped 6 previous similar messages
21:35:13:Lustre: 7026:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1375590879/real 1375590879]  req@ffff8800730f0800 x1442406562087668/t0(0) o400->MGC10.10.16.143@tcp@10.10.16.143@tcp:26/25 lens 224/224 e 0 to 1 dl 1375590886 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
21:35:13:LustreError: 166-1: MGC10.10.16.143@tcp: Connection to MGS (at 10.10.16.143@tcp) was lost; in progress operations using this service will fail
21:35:13:Lustre: DEBUG MARKER: grep -c /mnt/ost1' ' /proc/mounts
21:35:13:Lustre: DEBUG MARKER: umount -d -f /mnt/ost1
.....

21:37:16:Lustre: DEBUG MARKER: mkdir -p /mnt/ost3; mount -t lustre   		                   lustre-ost3/ost3 /mnt/ost3
21:37:16:Lustre: lustre-OST0001: deleting orphan objects from 0x0:425 to 0x0:449

The OSTs were still trying to access the MDT when it was unmounted.

The manual states (and it makes logical sense) that the mount order should be:
1. Mount the MGT.
2. Mount the MDT.
3. Mount the OST(s).
4. Mount the client(s).

I would argue we should preserve this ordering on unmount in test-framework.

I don't quite trust the log timestamps, but it is clear to me that the MDS is not present when the OST unmounts.
Also, we are unmounting with the -f flag, which I understand means no chance for recovery.
I think both of these factors lead to the orphaned data.

I need to look into the test-framework a bit more and start work on a patch or two. I don't think there is a core Lustre issue present at this time.
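
A rough sketch of the unmount-ordering idea for test-framework (facet helpers such as stop and zconf_umount_clients are test-framework's own; this is just an illustration, not the eventual patch):

stopall_ordered() {
        zconf_umount_clients $CLIENTS $MOUNT    # clients first
        for num in $(seq $OSTCOUNT); do
                stop ost$num                    # then the OSTs, without -f
        done
        for num in $(seq $MDSCOUNT); do
                stop mds$num                    # then the MDT(s)
        done
        stop mgs 2>/dev/null || true            # and the MGS, if separate
}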

Comment by Keith Mannthey (Inactive) [ 07/Aug/13 ]

I made a quick patch and it can be seen here: http://review.whamcloud.com/7255

Comment by Keith Mannthey (Inactive) [ 07/Aug/13 ]

This problem is seen on both zfs and ldiskfs.

After chatting with Oleg, it seems the server order should not really matter, as once the clients are unmounted the data should be synced.

I am also setting up a local repro for this issue to see "how" the files are corrupted.

Comment by Keith Mannthey (Inactive) [ 07/Aug/13 ]

My v1 patch didn't help; we hit the exact same error with the stopall changes in place (unmounting OSTs before the MDS and dropping the -f flag from the unmount). This is starting to look like a Lustre core issue now. The research continues.

Comment by Keith Mannthey (Inactive) [ 08/Aug/13 ]

My single-VM local testing did not trigger the error after 425 loops on ldiskfs.

I am going to set up multi-VM testing now.

Comment by Keith Mannthey (Inactive) [ 09/Aug/13 ]

I am still not getting a local repro; we will see how it goes overnight.

I am starting to think there may be two distinct errors. One is the rc=11/rc=22 errors (where the diff of the files takes place); the other is a timeout like the one below:
https://maloo.whamcloud.com/test_sets/5298321c-00a0-11e3-bb00-52540035b04c

When we remount for the first time we just hang.
The MDS is unhappy:

18:23:51:Lustre: DEBUG MARKER: /usr/sbin/lctl mark comparing 725 previously copied files
18:23:51:Lustre: DEBUG MARKER: comparing 725 previously copied files
18:23:51:LustreError: 28675:0:(fld_handler.c:145:fld_server_lookup()) srv-lustre-MDT0000: Cannot find sequence 0x240000401: rc = -5
18:23:51:LustreError: 28675:0:(fld_handler.c:145:fld_server_lookup()) Skipped 3597 previous similar messages
18:23:51:Lustre: 28681:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1376009297/real 1376009297]  req@ffff8800563e2800 x1442840111838728/t0(0) o5->lustre-OST0000-osc-MDT0000@10.10.17.34@tcp:28/4 lens 432/432 e 5 to 1 dl 1376010224 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
18:23:51:Lustre: 28681:0:(client.c:1868:ptlrpc_expire_one_request()) Skipped 16 previous similar messages
18:23:51:Lustre: lustre-OST0000-osc-MDT0000: Connection to lustre-OST0000 (at 10.10.17.34@tcp) was lost; in progress operations using this service will wait for recovery to complete
18:23:51:LustreError: 28681:0:(osp_precreate.c:734:osp_precreate_cleanup_orphans()) lustre-OST0000-osc-MDT0000: cannot cleanup orphans: rc = -11
18:23:51:LustreError: 11-0: lustre-OST0000-osc-MDT0000: Communicating with 10.10.17.34@tcp, operation ost_connect failed with -16.
18:23:51:LustreError: Skipped 6 previous similar messages
18:23:51:LustreError: 11-0: lustre-OST0000-osc-MDT0000: Communicating with 10.10.17.34@tcp, operation ost_connect failed with -16.
18:23:51:LustreError: Skipped 3 previous similar messages
18:23:51:LustreError: 11-0: lustre-OST0000-osc-MDT0000: Communicating with 10.10.17.34@tcp, operation ost_connect failed with -16.
18:23:51:LustreError: Skipped 7 previous similar messages
18:23:51:LustreError: 11-0: lustre-OST0000-osc-MDT0000: Communicating with 10.10.17.34@tcp, operation ost_connect failed with -16.
18:23:51:LustreError: Skipped 15 previous similar messages
18:23:51:LustreError: 11-0: lustre-OST0000-osc-MDT0000: Communicating with 10.10.17.34@tcp, operation ost_connect failed with -16.
18:23:51:LustreError: Skipped 30 previous similar messages
18:23:51:LustreError: 11-0: lustre-OST0000-osc-MDT0000: Communicating with 10.10.17.34@tcp, operation ost_connect failed with -16.
18:23:51:LustreError: Skipped 60 previous similar messages

and the OST is not any better:

17:48:47:Lustre: DEBUG MARKER: Using TIMEOUT=20
17:48:47:Lustre: DEBUG MARKER: /usr/sbin/lctl mark comparing 725 previously copied files
17:48:47:Lustre: DEBUG MARKER: comparing 725 previously copied files
17:48:47:Lustre: lustre-OST0000: deleting orphan objects from 0x200000402:905 to 0x200000402:1032
17:48:47:Lustre: lustre-OST0001: deleting orphan objects from 0x240000401:1800 to 0x240000401:1927
17:48:47:Lustre: lustre-OST0002: deleting orphan objects from 0x280000400:1800 to 0x280000400:1927
17:48:47:LustreError: 15183:0:(osd_handler.c:2123:osd_fld_lookup()) lustre-OST0001-osd: cannot find FLD range for [0x240000401:0x787:0x0]: rc = -5
17:48:47:LustreError: 15183:0:(osd_handler.c:2123:osd_fld_lookup()) Skipped 3597 previous similar messages
17:48:47:LustreError: 15183:0:(osd_oi.c:493:fid_is_on_ost()) lustre-OST0001-osd: Can not lookup fld for [0x240000401:0x787:0x0]
17:48:47:LustreError: 15183:0:(osd_oi.c:493:fid_is_on_ost()) Skipped 3405 previous similar messages
17:48:47:Lustre: lustre-OST0003: deleting orphan objects from 0x2c0000402:1799 to 0x2c0000402:1926
17:48:47:Lustre: lustre-OST0004: deleting orphan objects from 0x300000400:1800 to 0x300000400:1927
17:52:08:Lustre: lustre-OST0005: deleting orphan objects from 0x340000401:1800 to 0x340000401:1927
17:52:08:Lustre: lustre-OST0006: deleting orphan objects from 0x380000400:1801 to 0x380000400:2575
17:52:08:INFO: task ll_ost00_000:13457 blocked for more than 120 seconds.
17:52:08:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
17:52:08:ll_ost00_000  D 0000000000000000     0 13457      2 0x00000080
17:52:08: ffff8800575a3640 0000000000000046 0000000000000001 000000003510abfb
17:52:08: 0000000000000002 ffff8800575a36f0 ffffffffa0c960f0 ffffffffa0c960ec
17:52:08: ffff880062a685f8 ffff8800575a3fd8 000000000000fb88 ffff880062a685f8
17:52:08:Call Trace:
17:52:08: [<ffffffff8150f81e>] __mutex_lock_slowpath+0x13e/0x180
17:52:08: [<ffffffff81281964>] ? snprintf+0x34/0x40
17:52:08: [<ffffffff8150f6bb>] mutex_lock+0x2b/0x50
17:52:08: [<ffffffffa0c7d6aa>] osd_obj_map_lookup+0x20a/0x750 [osd_ldiskfs]
17:52:08: [<ffffffff81166b9a>] ? kmem_getpages+0xba/0x170
17:52:08: [<ffffffffa0c6c89e>] osd_oi_lookup+0x23e/0x4a0 [osd_ldiskfs]
17:52:08: [<ffffffffa0c691e2>] osd_object_init+0x752/0x1110 [osd_ldiskfs]
17:52:08: [<ffffffff81167dc3>] ? kmem_cache_alloc_trace+0x1a3/0x1b0
17:52:08: [<ffffffffa05d0238>] lu_object_alloc+0xd8/0x320 [obdclass]
17:52:08: [<ffffffffa05d0de0>] lu_object_find_at+0x210/0x360 [obdclass]
17:52:08: [<ffffffffa0d18eb3>] ? ofd_key_init+0x53/0x1a0 [ofd]
17:52:08: [<ffffffffa05cd87f>] ? keys_fill+0x6f/0x190 [obdclass]
17:52:08: [<ffffffffa05d0f46>] lu_object_find+0x16/0x20 [obdclass]
17:52:08: [<ffffffffa0d2c515>] ofd_object_find+0x35/0xf0 [ofd]
17:52:08: [<ffffffffa0d3b6a3>] ofd_lvbo_init+0x323/0x940 [ofd]
17:52:08: [<ffffffffa071c48c>] ldlm_resource_get+0x36c/0x8f0 [ptlrpc]
17:52:08: [<ffffffffa0716955>] ldlm_lock_create+0x55/0xa10 [ptlrpc]
17:52:08: [<ffffffffa073de46>] ldlm_handle_enqueue0+0x156/0x10b0 [ptlrpc]
17:52:08: [<ffffffffa073edfe>] ldlm_handle_enqueue+0x5e/0x70 [ptlrpc]
17:52:08: [<ffffffffa073ee10>] ? ldlm_server_completion_ast+0x0/0x6c0 [ptlrpc]
17:52:08: [<ffffffffa0cf8420>] ? ost_blocking_ast+0x0/0x10e0 [ost]
17:52:08: [<ffffffffa07398f0>] ? ldlm_server_glimpse_ast+0x0/0x3b0 [ptlrpc]
17:52:08: [<ffffffffa0d0188c>] ost_handle+0x1bbc/0x4030 [ost]
17:52:08: [<ffffffffa0462d64>] ? libcfs_id2str+0x74/0xb0 [libcfs]
17:52:08: [<ffffffffa076e598>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
17:52:08: [<ffffffffa045754e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
17:52:08: [<ffffffffa0468a6f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
17:52:08: [<ffffffffa07659a9>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
17:52:08: [<ffffffff81055ad3>] ? __wake_up+0x53/0x70
17:52:08: [<ffffffffa076f91d>] ptlrpc_main+0xabd/0x1700 [ptlrpc]
17:52:08: [<ffffffffa076ee60>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
17:52:08: [<ffffffff81096956>] kthread+0x96/0xa0
17:52:08: [<ffffffff8100c0ca>] child_rip+0xa/0x20
17:52:08: [<ffffffff810968c0>] ? kthread+0x0/0xa0
17:52:08: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

The timeouts go on for a while.

18:04:17:Lustre: 15154:0:(service.c:1339:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-207), not sending early reply
18:04:17:  req@ffff88006d5fbc00 x1442840111838728/t0(0) o5->lustre-MDT0000-mdtlov_UUID@10.10.17.33@tcp:0/0 lens 432/432 e 5 to 0 dl 1376010109 ref 2 fl Interpret:/0/0 rc 0/0
18:04:17:Lustre: lustre-OST0000: Client lustre-MDT0000-mdtlov_UUID (at 10.10.17.33@tcp) reconnecting
18:04:17:Lustre: lustre-OST0000: Client lustre-MDT0000-mdtlov_UUID (at 10.10.17.33@tcp) refused reconnection, still busy with 1 active RPCs
18:04:17:Lustre: lustre-OST0000: Client lustre-MDT0000-mdtlov_UUID (at 10.10.17.33@tcp) refused reconnection, still busy with 1 active RPCs
18:04:17:Lustre: lustre-OST0000: Client lustre-MDT0000-mdtlov_UUID (at 10.10.17.33@tcp) reconnecting
18:04:17:Lustre: Skipped 1 previous similar message
18:04:17:Lustre: lustre-OST0000: Client lustre-MDT0000-mdtlov_UUID (at 10.10.17.33@tcp) refused reconnection, still busy with 1 active RPCs
18:04:17:Lustre: lustre-OST0000: Client lustre-MDT0000-mdtlov_UUID (at 10.10.17.33@tcp) refused reconnection, still busy with 1 active RPCs
18:04:17:Lustre: lustre-OST0000: Client lustre-MDT0000-mdtlov_UUID (at 10.10.17.33@tcp) refused reconnection, still busy with 1 active RPCs

I don't know why the time is so far off in these logs.

Comment by Keith Mannthey (Inactive) [ 09/Aug/13 ]

Multi-VM testing is not reproducing the issue after over 400 attempts. I will try tweaking some things.

Also, it was noted in the logs that in some cases time jumps around on the servers (random 30+ minute gains); I opened a TT to track the issue: https://jira.hpdd.intel.com/browse/TT-1615

Comment by Keith Mannthey (Inactive) [ 09/Aug/13 ]

I am not seeing

17:48:47:LustreError: 15183:0:(osd_handler.c:2123:osd_fld_lookup()) lustre-OST0001-osd: cannot find FLD range for [0x240000401:0x787:0x0]: rc = -5
17:48:47:LustreError: 15183:0:(osd_handler.c:2123:osd_fld_lookup()) Skipped 3597 previous similar messages
17:48:47:LustreError: 15183:0:(osd_oi.c:493:fid_is_on_ost()) lustre-OST0001-osd: Can not lookup fld for [0x240000401:0x787:0x0]
17:48:47:LustreError: 15183:0:(osd_oi.c:493:fid_is_on_ost()) Skipped 3405 previous similar messages

in my local testing.
A spam of over 7000 messages in a second makes me think there is some nasty "just retry until it works" logic in play. I am looking into this a bit.

Comment by Keith Mannthey (Inactive) [ 10/Aug/13 ]

I think those errors are somehow from the scrub process, but I am not 100% sure yet. I don't get these errors locally yet.

There are a lot of "time" issues with the Rosso runs. I followed up some more in TT-1615.

In runs that pass there might be 1-2 minutes of drift here and there; in runs that fail, 10-60+ minutes of drift is seen. On Torro runs the clocks are synced to within a few seconds. There might be some correlation between time drift and the timeout errors; it is not clear yet, but there is some strong evidence.

This issue is starting to get worse (it is hitting roughly 10% of runs right now).
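
A quick way to eyeball the drift across the test nodes (hostnames taken from the lctl dk command earlier in this ticket; this is just an illustration):

for h in wtm-15vm3 wtm-15vm4 wtm-15vm5 wtm-15vm6; do
        echo -n "$h: "; ssh $h 'date +%s'
done
# epochs that differ by minutes between nodes would confirm the drift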

Comment by Keith Mannthey (Inactive) [ 13/Aug/13 ]

No sign of the -22 and -11 errors on Torro, and no sign of these same timeout errors on Torro. At this point Rosso appears to be part of the trigger.

Weekend testing did not reproduce this issue locally.

Comment by Keith Mannthey (Inactive) [ 13/Aug/13 ]

Since we only see this in autotest, I created a 2nd patch, http://review.whamcloud.com/7311, which removes the -q from the diff. I asked it to run runtests several times.
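
The intent is roughly the following change to the comparison loop (a sketch, not the patch itself; for binaries plain diff still only prints "Binary files ... differ", so the cmp line is an extra assumption to get byte-level detail):

diff "$f" "$DST/$f" || {
        rc=22
        cmp -l "$f" "$DST/$f" | head -5    # first differing bytes, if any
}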

Comment by Keith Mannthey (Inactive) [ 20/Aug/13 ]

Hmm, testing has not run for the above patch; TT-1644 was opened last week.

Comment by Andreas Dilger [ 23/Aug/13 ]

Di, some questions for you in this bug:

  • why are the OSTs using non-zero FID_SEQ_NORMAL when not running in DNE mode? This shouldn't happen.
  • I think that LU-3126 patch http://review.whamcloud.com/7266 should reduce the FLDB usage in the OSD, but does it eliminate FLDB usage completely for 2.5? That may avoid some or all of the problems here, if the timeouts are due to OSD->FLDB timeouts or deadlocks.

Keith, the main focus for your investigation should be why the OST is timing out in the first place. This is supposed to be a very simple, "normal usage" kind of test, so if there are timeouts something is very wrong.

Comment by Keith Mannthey (Inactive) [ 24/Aug/13 ]

Also rc=22 (after another remount) is a common error code.

The timeouts look like:

15:44:33:INFO: task mount.lustre:12729 blocked for more than 120 seconds.
15:44:33:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
15:44:33:mount.lustre  D 0000000000000000     0 12729  12728 0x00000080
15:44:33: ffff880064447558 0000000000000086 ffff880064447528 000000001cd3dcca
15:44:33: ffff880064447578 ffffffffa03d7faf ffffffffa0c03317 ffffffffa0c03313
15:44:33: ffff88007b3845f8 ffff880064447fd8 000000000000fb88 ffff88007b3845f8
15:44:33:Call Trace:
15:44:33: [<ffffffffa03d7faf>] ? start_this_handle+0xdf/0x4a0 [jbd2]
15:44:33: [<ffffffff8150f81e>] __mutex_lock_slowpath+0x13e/0x180
15:44:33: [<ffffffffa043c2b8>] ? __ldiskfs_journal_stop+0x68/0xa0 [ldiskfs]
15:44:33: [<ffffffff8150f6bb>] mutex_lock+0x2b/0x50
15:44:33: [<ffffffffa0be7e16>] osd_obj_map_recover+0x186/0x9e0 [osd_ldiskfs]
15:44:33: [<ffffffffa0bec61c>] ? osd_scrub_get_fid+0x12c/0x210 [osd_ldiskfs]
15:44:33: [<ffffffffa0bf0946>] osd_ios_lf_fill+0x4a6/0x5e0 [osd_ldiskfs]
15:44:33: [<ffffffffa0406b2f>] ldiskfs_readdir+0x3cf/0x730 [ldiskfs]
15:44:33: [<ffffffffa0bf04a0>] ? osd_ios_lf_fill+0x0/0x5e0 [osd_ldiskfs]
15:44:33: [<ffffffffa040623c>] ? free_rb_tree_fname+0x7c/0xe0 [ldiskfs]
15:44:33: [<ffffffffa0bebe2e>] osd_ios_general_scan+0xae/0x1a0 [osd_ldiskfs]
15:44:33: [<ffffffffa0bd8f57>] ? osd_oi_init+0x127/0x7d0 [osd_ldiskfs]
15:44:33: [<ffffffffa0bf2b83>] osd_scrub_setup+0x5f3/0xe50 [osd_ldiskfs]
15:44:33: [<ffffffffa05a5bcc>] ? lprocfs_alloc_stats+0x15c/0x760 [obdclass]
15:44:33: [<ffffffffa05cfdfb>] ? lu_site_init+0x2ab/0x4a0 [obdclass]
15:44:33: [<ffffffffa0bc84a9>] osd_device_alloc+0x5f9/0x830 [osd_ldiskfs]
15:44:33: [<ffffffffa05b8f97>] obd_setup+0x1d7/0x2e0 [obdclass]
15:44:33: [<ffffffffa05b92a8>] class_setup+0x208/0x870 [obdclass]
15:44:33: [<ffffffffa05c0abc>] class_process_config+0xc7c/0x1c30 [obdclass]
15:44:33: [<ffffffffa05c5b03>] ? lustre_cfg_new+0x2d3/0x6e0 [obdclass]
15:44:33: [<ffffffffa05c6059>] do_lcfg+0x149/0x480 [obdclass]
15:44:33: [<ffffffffa05c6424>] lustre_start_simple+0x94/0x200 [obdclass]
15:44:33: [<ffffffffa05feb6d>] server_fill_super+0xc8d/0x1a24 [obdclass]
15:44:33: [<ffffffffa05cbe48>] lustre_fill_super+0x1d8/0x530 [obdclass]
15:44:33: [<ffffffffa05cbc70>] ? lustre_fill_super+0x0/0x530 [obdclass]
15:44:33: [<ffffffff8118435f>] get_sb_nodev+0x5f/0xa0
15:44:33: [<ffffffffa05c3ae5>] lustre_get_sb+0x25/0x30 [obdclass]
15:44:33: [<ffffffff8118399b>] vfs_kern_mount+0x7b/0x1b0
15:44:33: [<ffffffff81183b42>] do_kern_mount+0x52/0x130
15:44:33: [<ffffffff811a3d92>] do_mount+0x2d2/0x8d0
15:44:33: [<ffffffff811a4420>] sys_mount+0x90/0xe0
15:44:33: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

Looks like osd_scrub_setup and friends are getting stuck. I will look around some more and see if I can sort it out. There is a jbd2 thread involved as well.

Comment by Di Wang [ 25/Aug/13 ]

Andreas,

1. Yes, FIDs on the OST should not use the NORMAL sequence without DNE. This looks weird.
2. No, the LU-3126 patch only eliminates the FLDB lookup for checking MDT and OST FIDs; we still need to check remote FIDs in the OSD. But that is only for FIDs on the MDT. On the OST, it should not check the FLDB in the OSD layer after 3126 lands.

Comment by Keith Mannthey (Inactive) [ 30/Aug/13 ]

Maloo had some hassles with test logs that slowed down the search for timeout issues. I have had a few cycles to look at the scrub and jbd2 code paths, but I am still looking. I will update when I have a more solid idea of the root cause.

Comment by Keith Mannthey (Inactive) [ 01/Sep/13 ]

I have one set of timeout hang data at this point; Maloo is still missing many error logs.
My data has the following timeout hang:

INFO: task ll_ost00_000:13457 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ll_ost00_000  D 0000000000000000     0 13457      2 0x00000080
 ffff8800575a3640 0000000000000046 0000000000000001 000000003510abfb
 0000000000000002 ffff8800575a36f0 ffffffffa0c960f0 ffffffffa0c960ec
 ffff880062a685f8 ffff8800575a3fd8 000000000000fb88 ffff880062a685f8
Call Trace:
 [<ffffffff8150f81e>] __mutex_lock_slowpath+0x13e/0x180
 [<ffffffff81281964>] ? snprintf+0x34/0x40
 [<ffffffff8150f6bb>] mutex_lock+0x2b/0x50
 [<ffffffffa0c7d6aa>] osd_obj_map_lookup+0x20a/0x750 [osd_ldiskfs]
 [<ffffffff81166b9a>] ? kmem_getpages+0xba/0x170
 [<ffffffffa0c6c89e>] osd_oi_lookup+0x23e/0x4a0 [osd_ldiskfs]
 [<ffffffffa0c691e2>] osd_object_init+0x752/0x1110 [osd_ldiskfs]
 [<ffffffff81167dc3>] ? kmem_cache_alloc_trace+0x1a3/0x1b0
 [<ffffffffa05d0238>] lu_object_alloc+0xd8/0x320 [obdclass]
 [<ffffffffa05d0de0>] lu_object_find_at+0x210/0x360 [obdclass]
 [<ffffffffa0d18eb3>] ? ofd_key_init+0x53/0x1a0 [ofd]
 [<ffffffffa05cd87f>] ? keys_fill+0x6f/0x190 [obdclass]
 [<ffffffffa05d0f46>] lu_object_find+0x16/0x20 [obdclass]
 [<ffffffffa0d2c515>] ofd_object_find+0x35/0xf0 [ofd]
 [<ffffffffa0d3b6a3>] ofd_lvbo_init+0x323/0x940 [ofd]
 [<ffffffffa071c48c>] ldlm_resource_get+0x36c/0x8f0 [ptlrpc]
 [<ffffffffa0716955>] ldlm_lock_create+0x55/0xa10 [ptlrpc]
 [<ffffffffa073de46>] ldlm_handle_enqueue0+0x156/0x10b0 [ptlrpc]
 [<ffffffffa073edfe>] ldlm_handle_enqueue+0x5e/0x70 [ptlrpc]
 [<ffffffffa073ee10>] ? ldlm_server_completion_ast+0x0/0x6c0 [ptlrpc]
 [<ffffffffa0cf8420>] ? ost_blocking_ast+0x0/0x10e0 [ost]
 [<ffffffffa07398f0>] ? ldlm_server_glimpse_ast+0x0/0x3b0 [ptlrpc]
 [<ffffffffa0d0188c>] ost_handle+0x1bbc/0x4030 [ost]
 [<ffffffffa0462d64>] ? libcfs_id2str+0x74/0xb0 [libcfs]
 [<ffffffffa076e598>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
 [<ffffffffa045754e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
 [<ffffffffa0468a6f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
 [<ffffffffa07659a9>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
 [<ffffffff81055ad3>] ? __wake_up+0x53/0x70
 [<ffffffffa076f91d>] ptlrpc_main+0xabd/0x1700 [ptlrpc]
 [<ffffffffa076ee60>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
 [<ffffffff81096956>] kthread+0x96/0xa0
 [<ffffffff8100c0ca>] child_rip+0xa/0x20
 [<ffffffff810968c0>] ? kthread+0x0/0xa0
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

There is a 2nd thread with this exact error path (2, and only 2, of the OST ll_ost00_XXX threads). I don't see any sign of an improperly acquired directory mutex in this code path; we don't even touch the inode for most of it. In osd_obj_map_recover we are blocked trying to acquire a mutex from a directory inode. So far there is no sign of who actually holds it; a crashdump would be helpful.
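
Short of a crashdump, one way to see every task's stack (including whoever holds the mutex) is a sysrq task dump on the OSS, assuming console/dmesg access:

echo t > /proc/sysrq-trigger        # dump all task states and stacks to the kernel log
dmesg | grep -A 40 'osd_obj_map'    # then look for the tasks sitting in the osd paths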

In general we are stuck waiting in osd_oi_lookup, which is a newer chunk of OI scrub code. From commit 3b657b6b81913afe21e4140f374f0edab20a7f38:

iget:
        inode = osd_iget(info, dev, id);
        if (IS_ERR(inode)) {
                result = PTR_ERR(inode);
                if (result == -ENOENT || result == -ESTALE) {
                        if (!in_oi) {
                                fid_zero(&oic->oic_fid);
                                GOTO(out, result = 0);
                        }

                        /* XXX: There are three possible cases:
                         *      1. Backup/restore caused the OI invalid.
                         *      2. Someone unlinked the object but NOT removed
                         *         the OI mapping, such as mount target device
                         *         as ldiskfs, and modify something directly.
                         *      3. Someone just removed the object between the
                         *         former oi_lookup and the iget. It is normal.
                         *
                         *      It is diffcult to distinguish the 2nd from the
                         *      1st case. Relatively speaking, the 1st case is
                         *      common than the 2nd case, trigger OI scrub. */
                        result = osd_oi_lookup(info, dev, fid, id, true); <-- soon we hang on the directory mutex. 

As all the timeout errors that I have seen from runtests have been OI scrub related, I will add Fan Yong and email him directly to see if he can help further identify the root cause.

Comment by Keith Mannthey (Inactive) [ 01/Sep/13 ]

Fan Yong, we seem to be getting tasks that hang in some of the OI scrub paths. Could you take a look at these backtraces and share any ideas you might have?

Comment by nasf (Inactive) [ 03/Sep/13 ]

I have seen a similar system hang when testing sanity-scrub. I am investigating it (LU-3706), but I am not sure whether it is the root cause of this bug (old and new files being different).

Comment by Keith Mannthey (Inactive) [ 04/Sep/13 ]

It looks to be related to the timeout issues.

Comment by Keith Mannthey (Inactive) [ 04/Sep/13 ]

nasf, it is interesting that your patch for LU-3706 deals with files that have been moved to lost+found. Do you think runtests has moved files to lost+found? That would not be the expected behaviour, but it might explain a few things.

Comment by nasf (Inactive) [ 05/Sep/13 ]

OI scrub has the functionality to recover orphan OST-objects from /lost+found to their original places. We used to perform such recovery offline via the Lustre tool ll_recover_lost_found_objs; now we can do it online via OI scrub.

Usually it is e2fsck that puts orphan OST-objects under /lost+found, and when the OST mounts up, the OI scrub will scan /lost+found to perform the related recovery. I do not think runtests itself will link anything under /lost+found. If it is a new file system and e2fsck has never been run, then /lost+found should be empty.
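
For reference, the offline flow being described looks roughly like this (device and mount-point names are examples only, not from this ticket):

umount /mnt/ost1
e2fsck -fy /dev/ost1_dev                        # may link orphan objects into lost+found
mount -t ldiskfs /dev/ost1_dev /mnt/ost1
ll_recover_lost_found_objs -d /mnt/ost1/lost+found
umount /mnt/ost1
mount -t lustre /dev/ost1_dev /mnt/ost1         # online OI scrub handles this case now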

Comment by Jian Yu [ 06/Sep/13 ]

Lustre build: http://build.whamcloud.com/job/lustre-b2_4/44/ (2.4.1 RC1)
Distro/Arch: RHEL6.4/x86_64
FSTYPE=zfs
MDSCOUNT=4

runtests failed as follows:

comparing 725 previously copied files
Files /bin/gzip and /mnt/lustre/runtest.31371//bin/gzip differ
Files /bin/findmnt and /mnt/lustre/runtest.31371//bin/findmnt differ
<~snip~>
Files /bin/ipcalc and /mnt/lustre/runtest.31371//bin/ipcalc differ
Files /bin/touch and /mnt/lustre/runtest.31371//bin/touch differ
 runtests : @@@@@@ FAIL: old and new files are different: rc=22 

Maloo report: https://maloo.whamcloud.com/test_sets/847a9a8e-1641-11e3-aa2a-52540035b04c
Unfortunately, the test logs were not uploaded to Maloo, and I could not find the logdir on the Rosso cluster. Patch http://review.whamcloud.com/7014 needs to be cherry-picked to the Lustre b2_4 branch so that logs can be uploaded to Maloo the next time the test fails.
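
For reference, pulling a Gerrit change onto b2_4 looks roughly like this (the ref follows Gerrit's refs/changes/<last-two-digits>/<change>/<patchset> convention; patch set 1 is just a placeholder):

git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/14/7014/1
git checkout -b lu-3582-b2_4 origin/b2_4
git cherry-pick FETCH_HEAD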

FYI, the same test passed on Lustre 2.4.1 RC1 with FSTYPE=zfs and MDSCOUNT=1.

Comment by Keith Mannthey (Inactive) [ 11/Sep/13 ]

Di,
MDSCOUNT=4 is interesting with 2.4. Perhaps we are looking at a DNE issue, as we seem to have DNE behaviour in these runs?

My initial debug patch has never run; I am leaving it in place for the Tools team.

I created a new patch without the test params:
http://review.whamcloud.com/7603

Comment by Di Wang [ 11/Sep/13 ]

Hmm, runtests should not do any cross-MDT operation, unless I am missing something. I will take a look.

Comment by Di Wang [ 11/Sep/13 ]

Hmm, I cannot reproduce this problem locally with "MDSCOUNT=4 FSTYPE=zfs sh llmount.sh". I checked the test scripts, and it seems all of the tests should be done on MDT0.

According to the Maloo logs, it seems runtests can sometimes pass with DNE+ZFS as well:
https://maloo.whamcloud.com/test_sets/518f76e6-1865-11e3-bf95-52540035b04c

Sometimes it even fails with a single MDT + ldiskfs:
https://maloo.whamcloud.com/test_sets/bb6e5d76-0446-11e3-90ba-52540035b04c

Comment by Keith Mannthey (Inactive) [ 11/Sep/13 ]

Could you try running runtests in a loop? I think there is something about the 700 or so files and the unmount/remount window.

I have not been able to hit this issue in my local environment with single mds + ldiskfs.

Comment by Di Wang [ 12/Sep/13 ]

I tried 20 iterations with "for ((i=0; i<20; i++)); do sh auster -r -s -f testnode runtests; done" on 2 MDS (4 MDTs) + ZFS, and still cannot reproduce the problem.

Comment by Jodi Levi (Inactive) [ 13/Sep/13 ]

Is this still happening regularly?

Comment by Keith Mannthey (Inactive) [ 16/Sep/13 ]

Sorry for the late response.

We don't seem to be hitting this issue right now. I looked at Maloo on Friday and it had been almost two weeks since we had seen the issue. It is unclear what has changed, but we are definitely not hitting this issue like we were.

Comment by Jodi Levi (Inactive) [ 17/Sep/13 ]

Reducing priority for a while longer to see if this occurs again and then will close.

Comment by Jodi Levi (Inactive) [ 21/Oct/13 ]

Patches landed to Master and issue does not seem to be occurring any longer.

Comment by Jian Yu [ 25/Nov/13 ]

Patch http://review.whamcloud.com/7014 needs to be cherry-picked to the Lustre b2_4 branch so that logs can be uploaded to Maloo the next time the test fails.

Patch for Lustre b2_4 branch is in http://review.whamcloud.com/8384.

Comment by Jian Yu [ 26/Nov/13 ]

Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/58/
Distro/Arch: RHEL6.4/x86_64
FSTYPE=zfs
MDSCOUNT=1

runtests failed again:
https://maloo.whamcloud.com/test_sets/275948da-5603-11e3-8e94-52540035b04c

I ran the test with the same configuration 6 times manually; all runs passed:
https://maloo.whamcloud.com/test_sessions/d184f670-567f-11e3-82a8-52540035b04c
https://maloo.whamcloud.com/test_sessions/cfcafbe0-567f-11e3-9304-52540035b04c
https://maloo.whamcloud.com/test_sessions/cdad4016-567f-11e3-9304-52540035b04c
https://maloo.whamcloud.com/test_sessions/cd3494b8-567f-11e3-9288-52540035b04c
https://maloo.whamcloud.com/test_sessions/cb6fd476-567f-11e3-9304-52540035b04c
https://maloo.whamcloud.com/test_sessions/cad680fa-567f-11e3-9288-52540035b04c

Comment by Gerrit Updater [ 07/Oct/16 ]

Mike Pershin (mike.pershin@intel.com) uploaded a new patch: http://review.whamcloud.com/22997
Subject: LU-3582 mdc: add cl_device to the MDC
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 642730ade7da65f86a8b57100c86e2241d81569a

Comment by Gerrit Updater [ 13/Jul/17 ]

Mike Pershin (mike.pershin@intel.com) uploaded a new patch: https://review.whamcloud.com/28009
Subject: LU-3582 mdc: add cl_device to the MDC
Project: fs/lustre-release
Branch: dom
Current Patch Set: 1
Commit: 8271c53f0ff7c68b6c175ffb7462e55c6904dbeb

Comment by Gerrit Updater [ 04/Aug/17 ]

Minh Diep (minh.diep@intel.com) merged in patch https://review-stage.hpdd.intel.com/28009/
Subject: LU-3582 mdc: add cl_device to the MDC
Project: fs/lustre-release
Branch: dom
Current Patch Set:
Commit: 8adbb3aa6cdddd84f1a94ff40e76406a441e9336

Comment by Gerrit Updater [ 07/Aug/17 ]

Minh Diep (minh.diep@intel.com) merged in patch https://review-stage.hpdd.intel.com/28009/
Subject: LU-3582 mdc: add cl_device to the MDC
Project: fs/lustre-release
Branch: dom
Current Patch Set:
Commit: f5510004de33ab768d016f6d26612b005680eb4d
