[LU-650] recovery-double-scale: (llite_lib.c:1591:ll_update_inode()) LBUG Created: 31/Aug/11  Updated: 10/Sep/11  Resolved: 10/Sep/11

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.0
Fix Version/s: Lustre 2.1.0

Type: Bug Priority: Blocker
Reporter: Jian Yu Assignee: Zhenyu Xu
Resolution: Fixed Votes: 0
Labels: None
Environment:

Lustre Tag: v2_1_0_0_RC1
Lustre Build: http://newbuild.whamcloud.com/job/lustre-master/274/
e2fsprogs Build: http://newbuild.whamcloud.com/job/e2fsprogs-master/42/
Distro/Arch: RHEL6/x86_64(in-kernel OFED, kernel version: 2.6.32-131.6.1.el6.x86_64)
ENABLE_QUOTA=yes
FAILURE_MODE=HARD

MGS/MDS Nodes: fat-amd-1-ib(active), fat-amd-2-ib(passive)
\ /
1 combined MGS/MDT

OSS Nodes: fat-amd-3-ib(active), fat-amd-4-ib(active)
\ /
OST1 (active in fat-amd-3-ib)
OST2 (active in fat-amd-4-ib)
OST3 (active in fat-amd-3-ib)
OST4 (active in fat-amd-4-ib)
OST5 (active in fat-amd-3-ib)
OST6 (active in fat-amd-4-ib)

Client Nodes: client-[1,2,4,5,12,13,15],fat-intel-4

Network Addresses:
fat-amd-1-ib: 192.168.4.132
fat-amd-2-ib: 192.168.4.133
fat-amd-3-ib: 192.168.4.134
fat-amd-4-ib: 192.168.4.135
client-1-ib: 192.168.4.1
client-2-ib: 192.168.4.2
client-4-ib: 192.168.4.4
client-5-ib: 192.168.4.5
client-12-ib: 192.168.4.12
client-13-ib: 192.168.4.13
client-15-ib: 192.168.4.15
fat-intel-4-ib: 192.168.4.131


Attachments: File recovery-double-scale.1314722275.log.tar.bz2    
Severity: 3
Rank (Obsolete): 4905

 Description   

recovery-double-scale test failed as follows:

==== START === test 5: failover OST, then 2 clients ====
<~snip~>
Started client load: dd on client-13-ib
Started client load: tar on client-1-ib
==== Checking the clients loads AFTER  failovers -- ERRORS_OK=yes
Done checking / re-Starting client loads. PASS

==== START === test 6: failover OST, then MDS ========== 
==== Checking the clients loads BEFORE failover -- failure NOT OK
Client load failed on node client-1-ib, rc=1
2011-08-30 09:37:48 Terminating clients loads ...

Server failover period: 300 seconds
Exited after:           0 seconds
Status: FAIL: rc=4

Console log on client node client-1-ib showed that:

Lustre: DEBUG MARKER: Started client load: tar on client-1-ib^M
Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failovers -- ERRORS_OK=yes^M
LustreError: 2344:0:(llite_lib.c:1586:ll_update_inode()) lsm mismatch for inode 144115238860488711^M
LustreError: 2344:0:(llite_lib.c:1587:ll_update_inode()) lli_smd:^M
LustreError: 2344:0:(debug.c:71:dump_lsm()) lsm ffff8803168da740, objid 0x7, maxbytes 0x0, magic 0x0BD10BD0, stripe_size 1048576, stripe_count 1 pool ^M
LustreError: 11-0: an error occurred while communicating with 192.168.4.134@o2ib. The ost_connect operation failed with -16^M
LustreError: 2344:0:(llite_lib.c:1589:ll_update_inode()) lsm:^M
LustreError: 2344:0:(debug.c:71:dump_lsm()) lsm ffff8803168da3c0, objid 0x7, maxbytes 0xfffffffffff, magic 0x0BD10BD0, stripe_size 1048576, stripe_count 1 pool ^M
LustreError: 2344:0:(llite_lib.c:1591:ll_update_inode()) LBUG^M
Pid: 2344, comm: tar^M
^M
Call Trace:^M
 [<ffffffffa02f4855>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]^M
 [<ffffffffa02f4e95>] lbug_with_loc+0x75/0xe0 [libcfs]^M
 [<ffffffffa072d018>] ll_update_inode+0xd48/0xea0 [lustre]^M
 [<ffffffff8118d6d4>] ? ifind+0x74/0xd0^M
 [<ffffffffa07410f0>] ? ll_test_inode+0x0/0x1a0 [lustre]^M
 [<ffffffffa0741b9b>] ll_iget+0x21b/0x280 [lustre]^M
 [<ffffffffa02f59ae>] ? cfs_free+0xe/0x10 [libcfs]^M
 [<ffffffffa07d6de8>] ? lmv_intent_lookup+0x718/0x1190 [lmv]^M
 [<ffffffffa0741380>] ? ll_md_blocking_ast+0x0/0x600 [lustre]^M
 [<ffffffffa072d6f9>] ll_prep_inode+0x589/0xb50 [lustre]^M
 [<ffffffffa0742a05>] ll_lookup_it_finish+0x85/0xa80 [lustre]^M
 [<ffffffffa02f59ae>] ? cfs_free+0xe/0x10 [libcfs]^M
 [<ffffffffa0727581>] ? ll_finish_md_op_data+0x2a1/0x4a0 [lustre]^M
 [<ffffffffa07252b8>] ? ll_prep_md_op_data+0x148/0x410 [lustre]^M
 [<ffffffffa0743835>] ll_lookup_it+0x435/0xd60 [lustre]^M
 [<ffffffffa0741380>] ? ll_md_blocking_ast+0x0/0x600 [lustre]^M
 [<ffffffffa07441e8>] ll_lookup_nd+0x88/0x440 [lustre]^M^@
 [<ffffffff8118ac3e>] ? d_alloc+0x13e/0x1b0^M
 [<ffffffff811806cb>] do_lookup+0x18b/0x220^M
 [<ffffffff81180cc9>] __link_path_walk+0x569/0x820^M
 [<ffffffffa06fc000>] ? return_if_equal+0x0/0x30 [lustre]^M
 [<ffffffff8118164a>] path_walk+0x6a/0xe0^M
 [<ffffffff8118181b>] do_path_lookup+0x5b/0xa0^M
 [<ffffffff811819a7>] user_path_at+0x57/0xa0^M
 [<ffffffff810dbe25>] ? call_rcu_sched+0x15/0x20^M
 [<ffffffff810dbe3e>] ? call_rcu+0xe/0x10^M
 [<ffffffff81262a05>] ? _atomic_dec_and_lock+0x55/0x80^M
 [<ffffffff811910e0>] ? mntput_no_expire+0x30/0x110^M
 [<ffffffff811740d1>] ? __fput+0x1a1/0x210^M
 [<ffffffff811a16f7>] do_utimes+0xc7/0xf0^M
 [<ffffffff810d1ad2>] ? audit_syscall_entry+0x272/0x2a0^M
 [<ffffffff811a1822>] sys_utimensat+0x32/0x90^M
 [<ffffffff8100b172>] system_call_fastpath+0x16/0x1b^M
^M
Kernel panic - not syncing: LBUG^M
Pid: 2344, comm: tar Tainted: G           ---------------- T 2.6.32-131.6.1.el6.x86_64 #1^M
Call Trace:^M
 [<ffffffff814da518>] ? panic+0x78/0x143^M
 [<ffffffffa02f4eeb>] ? lbug_with_loc+0xcb/0xe0 [libcfs]^M
 [<ffffffffa072d018>] ? ll_update_inode+0xd48/0xea0 [lustre]^M
 [<ffffffff8118d6d4>] ? ifind+0x74/0xd0^M
 [<ffffffffa07410f0>] ? ll_test_inode+0x0/0x1a0 [lustre]^M
 [<ffffffffa0741b9b>] ? ll_iget+0x21b/0x280 [lustre]^M
 [<ffffffffa02f59ae>] ? cfs_free+0xe/0x10 [libcfs]^M
 [<ffffffffa07d6de8>] ? lmv_intent_lookup+0x718/0x1190 [lmv]^M
 [<ffffffffa0741380>] ? ll_md_blocking_ast+0x0/0x600 [lustre]^M
 [<ffffffffa072d6f9>] ? ll_prep_inode+0x589/0xb50 [lustre]^M
 [<ffffffffa0742a05>] ? ll_lookup_it_finish+0x85/0xa80 [lustre]^M
 [<ffffffffa02f59ae>] ? cfs_free+0xe/0x10 [libcfs]^M
 [<ffffffffa0727581>] ? ll_finish_md_op_data+0x2a1/0x4a0 [lustre]^M
 [<ffffffffa07252b8>] ? ll_prep_md_op_data+0x148/0x410 [lustre]^M
 [<ffffffffa0743835>] ? ll_lookup_it+0x435/0xd60 [lustre]^M
 [<ffffffffa0741380>] ? ll_md_blocking_ast+0x0/0x600 [lustre]^M
 [<ffffffffa07441e8>] ? ll_lookup_nd+0x88/0x440 [lustre]^M
 [<ffffffff8118ac3e>] ? d_alloc+0x13e/0x1b0^M
 [<ffffffff811806cb>] ? do_lookup+0x18b/0x220^M
 [<ffffffff81180cc9>] ? __link_path_walk+0x569/0x820^M
 [<ffffffffa06fc000>] ? return_if_equal+0x0/0x30 [lustre]^M
 [<ffffffff8118164a>] ? path_walk+0x6a/0xe0^M
 [<ffffffff8118181b>] ? do_path_lookup+0x5b/0xa0^M
 [<ffffffff811819a7>] ? user_path_at+0x57/0xa0^M
 [<ffffffff810dbe25>] ? call_rcu_sched+0x15/0x20^M
 [<ffffffff810dbe3e>] ? call_rcu+0xe/0x10^M
 [<ffffffff81262a05>] ? _atomic_dec_and_lock+0x55/0x80^M
 [<ffffffff811910e0>] ? mntput_no_expire+0x30/0x110^M
 [<ffffffff811740d1>] ? __fput+0x1a1/0x210^M
 [<ffffffff811a16f7>] ? do_utimes+0xc7/0xf0^M
 [<ffffffff810d1ad2>] ? audit_syscall_entry+0x272/0x2a0^M
 [<ffffffff811a1822>] ? sys_utimensat+0x32/0x90^M
 [<ffffffff8100b172>] ? system_call_fastpath+0x16/0x1b^M

Syslog on OSS node fat-amd-3-ib showed that:

Aug 30 09:31:50 fat-amd-3-ib kernel: Lustre: DEBUG MARKER: Started client load: tar on client-1-ib
Aug 30 09:31:50 fat-amd-3-ib xinetd[2179]: EXIT: shell status=0 pid=4039 duration=0(sec)Aug 30 09:31:53 fat-amd-3-ib xinetd[2179]: START: shell pid=4063 from=::ffff:192.168.4.131
Aug 30 09:31:53 fat-amd-3-ib rshd[4064]: root@fat-intel-4-ib.lab.whamcloud.com as root: cmd='(PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /us
r/lib64/lustre/tests; LUSTRE="/usr/lib64/lustre"  sh -c "/usr/sbin/lctl mark ==== Checking the clients loads AFTER  failovers -- ERRORS_OK=yes");echo XXRETCODE:$?'
Aug 30 09:31:53 fat-amd-3-ib kernel: Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failovers -- ERRORS_OK=yes
Aug 30 09:31:53 fat-amd-3-ib xinetd[2179]: EXIT: shell status=0 pid=4063 duration=0(sec)
Aug 30 09:31:53 fat-amd-3-ib kernel: LustreError: 2620:0:(ldlm_lib.c:910:target_handle_connect()) lustre-OST0005: denying connection for new client 192.168.4.1@o2ib (1e1bdc24-235d-ddd3-8241-31c1b1eae3e1): 2 clients in recovery for 75s
Aug 30 09:31:53 fat-amd-3-ib kernel: LustreError: 2620:0:(ldlm_lib.c:910:target_handle_connect()) Skipped 59 previous similar messages
Aug 30 09:31:54 fat-amd-3-ib xinetd[2179]: START: shell pid=4087 from=::ffff:192.168.4.131
Aug 30 09:31:54 fat-amd-3-ib rshd[4088]: root@fat-intel-4-ib.lab.whamcloud.com as root: cmd='(PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /us
r/lib64/lustre/tests; LUSTRE="/usr/lib64/lustre"  sh -c "/usr/sbin/lctl mark Done checking \/ re-Starting client loads. PASS");echo XXRETCODE:$?'Aug 30 09:31:54 fat-amd-3-ib kernel: Lustre: DEBUG MARKER: Done checking / re-Starting client loads. PASS
Aug 30 09:31:54 fat-amd-3-ib xinetd[2179]: EXIT: shell status=0 pid=4087 duration=0(sec)
Aug 30 09:32:12 fat-amd-3-ib kernel: INFO: task tgt_recov:3840 blocked for more than 120 seconds.
Aug 30 09:32:12 fat-amd-3-ib kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 30 09:32:12 fat-amd-3-ib kernel: tgt_recov     D 0000000000000009     0  3840      2 0x00000080Aug 30 09:32:12 fat-amd-3-ib kernel: ffff8802178c3d30 0000000000000046 ffff8802178c3d30 ffffffffa05bbef6Aug 30 09:32:12 fat-amd-3-ib kernel: 0000000000000000 0000000000000000 ffffffffa06557a0 0000000000000007
Aug 30 09:32:12 fat-amd-3-ib kernel: ffff880217cca638 ffff8802178c3fd8 000000000000f598 ffff880217cca638
Aug 30 09:32:12 fat-amd-3-ib kernel: Call Trace:
Aug 30 09:32:12 fat-amd-3-ib kernel: [<ffffffffa05bbef6>] ? check_for_next_transno+0x2d6/0x4f0 [ptlrpc]
Aug 30 09:32:12 fat-amd-3-ib kernel: [<ffffffffa05bbc20>] ? check_for_next_transno+0x0/0x4f0 [ptlrpc]
Aug 30 09:32:12 fat-amd-3-ib kernel: [<ffffffffa05bd765>] target_recovery_overseer+0x85/0x200 [ptlrpc]Aug 30 09:32:12 fat-amd-3-ib kernel: [<ffffffffa05ba780>] ? exp_req_replay_healthy+0x0/0x30 [ptlrpc]Aug 30 09:32:12 fat-amd-3-ib kernel: [<ffffffff8108e160>] ? autoremove_wake_function+0x0/0x40Aug 30 09:32:12 fat-amd-3-ib kernel: [<ffffffffa05be364>] target_recovery_thread+0x4a4/0xf50 [ptlrpc]
Aug 30 09:32:12 fat-amd-3-ib kernel: [<ffffffffa05bdec0>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
Aug 30 09:32:12 fat-amd-3-ib kernel: [<ffffffff8100c1ca>] child_rip+0xa/0x20
Aug 30 09:32:12 fat-amd-3-ib kernel: [<ffffffffa05bdec0>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
Aug 30 09:32:12 fat-amd-3-ib kernel: [<ffffffffa05bdec0>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
Aug 30 09:32:12 fat-amd-3-ib kernel: [<ffffffff8100c1c0>] ? child_rip+0x0/0x20
Aug 30 09:32:38 fat-amd-3-ib kernel: Lustre: lustre-OST0000: haven't heard from client 1e1bdc24-235d-ddd3-8241-31c1b1eae3e1 (at 192.168.4.1@o2ib) in 48 seconds. I think it's de
ad, and I am evicting it. exp ffff88031783e000, cur 1314721958 expire 1314721928 last 1314721910
Aug 30 09:32:38 fat-amd-3-ib kernel: Lustre: Skipped 4 previous similar messages
Aug 30 09:33:08 fat-amd-3-ib kernel: Lustre: 3840:0:(ldlm_lib.c:1569:target_recovery_overseer()) recovery is timed out, evict stale exports
Aug 30 09:33:08 fat-amd-3-ib kernel: Lustre: 3840:0:(ldlm_lib.c:1569:target_recovery_overseer()) recovery is timed out, evict stale exports
Aug 30 09:33:38 fat-amd-3-ib kernel: Lustre: 2657:0:(ldlm_lib.c:877:target_handle_connect()) lustre-OST0005: connection from 53ce78e4-dbf9-09af-3e1f-ba4d62ada8c7@192.168.4.13@o2ib recovering/t0 exp (null) cur 1314722018 last 0
Aug 30 09:33:38 fat-amd-3-ib kernel: Lustre: 2657:0:(ldlm_lib.c:877:target_handle_connect()) Skipped 137 previous similar messages
Aug 30 09:36:08 fat-amd-3-ib kernel: Lustre: 3840:0:(ldlm_lib.c:1569:target_recovery_overseer()) recovery is timed out, evict stale exports
Aug 30 09:36:08 fat-amd-3-ib kernel: Lustre: 3840:0:(ldlm_lib.c:1569:target_recovery_overseer()) recovery is timed out, evict stale exports

Maloo report: https://maloo.whamcloud.com/test_sets/a4d4597c-d387-11e0-8d02-52540025f9af

Please find console, syslog, dmesg and debug logs in the attached recovery-double-scale.1314722275.log.tar.bz2.



 Comments   
Comment by Peter Jones [ 31/Aug/11 ]

Bobi

Could you look into this one please?

Thanks

Peter

Comment by Oleg Drokin [ 31/Aug/11 ]

I think this could be related to LU-16, see how maxbytes is different in one of the lsm and that's the only change.

Comment by Zhenyu Xu [ 01/Sep/11 ]

Just wondering why in the first place the inode got 0 maxbytes in its lli->lli_smd, it should at least be LUSTRE_STRIPE_MAXBYTES as lov_alloc_memmd() alloc the lov_stripe_md.

LustreError: 2344:0:(llite_lib.c:1587:ll_update_inode()) lli_smd:
LustreError: 2344:0:(debug.c:71:dump_lsm()) lsm ffff8803168da740, objid 0x7, maxbytes 0x0, magic 0x0BD10BD0, stripe_size 1048576, stripe_count 1 pool
LustreError: 11-0: an error occurred while communicating with 192.168.4.134@o2ib. The ost_connect operation failed with -16
LustreError: 2344:0:(llite_lib.c:1589:ll_update_inode()) lsm:
LustreError: 2344:0:(debug.c:71:dump_lsm()) lsm ffff8803168da3c0, objid 0x7, maxbytes 0xfffffffffff, magic 0x0BD10BD0, stripe_size 1048576, stripe_count 1 pool

Comment by Peter Jones [ 01/Sep/11 ]

Consider as blocker for 2.1

Comment by Zhenyu Xu [ 02/Sep/11 ]

patch tracking at http://review.whamcloud.com/1327

Comment by Jessica A. Popp (Inactive) [ 08/Sep/11 ]

Bobi -

I checked on the status and Build #2070 failed - but the error didn't report back correctly to Gerrit so I asked in test tools chat room and the failure was a false negative. Robert added the build back to the queue, but he didn't have a build number for me. If it works correctly this time the results should post back to your 1327 Gerrit tracker. But I just wanted to let you know there is a new job waiting in the build queue. The queue is pretty long - so it might not have even run by the time you start your day. Hopefully the build works this time.

-Jessica

Comment by Build Master (Inactive) [ 10/Sep/11 ]

Integrated in lustre-master » i686,client,el6,inkernel #277
LU-650 Max possible file size is mutable

Oleg Drokin : 303f41aa0ff07a43a31e397a3d163a6b8b6e4864
Files :

  • lustre/include/obd.h
  • lustre/lov/lov_ea.c
Comment by Build Master (Inactive) [ 10/Sep/11 ]

Integrated in lustre-master » x86_64,server,el6,inkernel #277
LU-650 Max possible file size is mutable

Oleg Drokin : 303f41aa0ff07a43a31e397a3d163a6b8b6e4864
Files :

  • lustre/include/obd.h
  • lustre/lov/lov_ea.c
Comment by Build Master (Inactive) [ 10/Sep/11 ]

Integrated in lustre-master » x86_64,server,el5,ofa #277
LU-650 Max possible file size is mutable

Oleg Drokin : 303f41aa0ff07a43a31e397a3d163a6b8b6e4864
Files :

  • lustre/lov/lov_ea.c
  • lustre/include/obd.h
Comment by Build Master (Inactive) [ 10/Sep/11 ]

Integrated in lustre-master » x86_64,client,sles11,inkernel #277
LU-650 Max possible file size is mutable

Oleg Drokin : 303f41aa0ff07a43a31e397a3d163a6b8b6e4864
Files :

  • lustre/lov/lov_ea.c
  • lustre/include/obd.h
Comment by Build Master (Inactive) [ 10/Sep/11 ]

Integrated in lustre-master » x86_64,client,el6,inkernel #277
LU-650 Max possible file size is mutable

Oleg Drokin : 303f41aa0ff07a43a31e397a3d163a6b8b6e4864
Files :

  • lustre/include/obd.h
  • lustre/lov/lov_ea.c
Comment by Build Master (Inactive) [ 10/Sep/11 ]

Integrated in lustre-master » x86_64,client,ubuntu1004,inkernel #277
LU-650 Max possible file size is mutable

Oleg Drokin : 303f41aa0ff07a43a31e397a3d163a6b8b6e4864
Files :

  • lustre/lov/lov_ea.c
  • lustre/include/obd.h
Comment by Build Master (Inactive) [ 10/Sep/11 ]

Integrated in lustre-master » x86_64,client,el5,inkernel #277
LU-650 Max possible file size is mutable

Oleg Drokin : 303f41aa0ff07a43a31e397a3d163a6b8b6e4864
Files :

  • lustre/include/obd.h
  • lustre/lov/lov_ea.c
Comment by Build Master (Inactive) [ 10/Sep/11 ]

Integrated in lustre-master » i686,server,el5,inkernel #277
LU-650 Max possible file size is mutable

Oleg Drokin : 303f41aa0ff07a43a31e397a3d163a6b8b6e4864
Files :

  • lustre/include/obd.h
  • lustre/lov/lov_ea.c
Comment by Build Master (Inactive) [ 10/Sep/11 ]

Integrated in lustre-master » i686,server,el6,inkernel #277
LU-650 Max possible file size is mutable

Oleg Drokin : 303f41aa0ff07a43a31e397a3d163a6b8b6e4864
Files :

  • lustre/lov/lov_ea.c
  • lustre/include/obd.h
Comment by Build Master (Inactive) [ 10/Sep/11 ]

Integrated in lustre-master » i686,client,el5,ofa #277
LU-650 Max possible file size is mutable

Oleg Drokin : 303f41aa0ff07a43a31e397a3d163a6b8b6e4864
Files :

  • lustre/lov/lov_ea.c
  • lustre/include/obd.h
Comment by Build Master (Inactive) [ 10/Sep/11 ]

Integrated in lustre-master » x86_64,server,el5,inkernel #277
LU-650 Max possible file size is mutable

Oleg Drokin : 303f41aa0ff07a43a31e397a3d163a6b8b6e4864
Files :

  • lustre/lov/lov_ea.c
  • lustre/include/obd.h
Comment by Build Master (Inactive) [ 10/Sep/11 ]

Integrated in lustre-master » x86_64,client,el5,ofa #277
LU-650 Max possible file size is mutable

Oleg Drokin : 303f41aa0ff07a43a31e397a3d163a6b8b6e4864
Files :

  • lustre/lov/lov_ea.c
  • lustre/include/obd.h
Comment by Build Master (Inactive) [ 10/Sep/11 ]

Integrated in lustre-master » i686,client,el5,inkernel #277
LU-650 Max possible file size is mutable

Oleg Drokin : 303f41aa0ff07a43a31e397a3d163a6b8b6e4864
Files :

  • lustre/include/obd.h
  • lustre/lov/lov_ea.c
Generated at Sat Feb 10 01:09:06 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.