[LU-292] Test failure on test suite recovery-small Created: 07/May/11  Updated: 23/Aug/12  Resolved: 10/Aug/11

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.0
Fix Version/s: Lustre 2.1.0

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Liang Zhen (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 4922

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/4f87c578-77b4-11e0-9b1b-52540025f9af.
03:33:24:Lustre: DEBUG MARKER: == recovery-small test 28: handle error adding new clients (bug 6086) ================================ 20:33:24 (1304652804)
03:33:35:Lustre: 12843:0:(client.c:1775:ptlrpc_expire_one_request()) @@@ Request x1368023838678515 sent from lustre-MDT0000 to NID 192.168.4.23@o2ib has timed out for slow reply: [sent 1304652804] [real_sent 1304652804] [current 1304652815] [deadline 11s] [delay 0s] req@ffff810037e66c00 x1368023838678515/t0(0) o-1->Àx<9a>^V^D<81>ÿÿ^G@NET_0x50000c0a80417_UUID:15/16 lens 296/192 e 0 to 1 dl 1304652815 ref 1 fl Rpc:XN/ffffffff/ffffffff rc 0/-1
03:33:35:LustreError: 138-a: lustre-MDT0000: A client on nid 192.168.4.23@o2ib was evicted due to a lock blocking callback time out: rc -107
03:33:36:LustreError: 12843:0:(mdt_handler.c:2813:mdt_recovery()) operation 41 on unconnected MDS from 12345-192.168.4.23@o2ib
03:33:36:LustreError: 12843:0:(ldlm_lib.c:2118:target_send_reply_msg()) @@@ processing error (107) req@ffff81041baef800 x1368025347789244/t0(0) o-1><?>@<?>:0/0 lens 192/0 e 0 to 0 dl 1304652857 ref 1 fl Interpret:/ffffffff/ffffffff rc -107/-1
03:33:36:LustreError: 12843:0:(obd_support.h:455:obd_fail_check_set()) *** obd_fail_loc=12f ***
03:33:36:LustreError: 12843:0:(mdt_recovery.c:549:mdt_client_new()) no room for 0 clients - fix LR_MAX_CLIENTS
03:33:55:Lustre: Failing over lustre-MDT0000
03:33:55:Lustre: Skipped 8 previous similar messages
03:33:55:Lustre: 21999:0:(quota_master.c:793:close_quota_files()) quota[0] is off already
03:33:55:Lustre: 21999:0:(quota_master.c:793:close_quota_files()) Skipped 1 previous similar message
03:33:55:Lustre: mdd_obd-lustre-MDT0000-0: shutting down for failover; client state will be preserved.
03:33:55:LustreError: 21999:0:(ldlm_request.c:1169:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
03:33:55:LustreError: 21999:0:(ldlm_request.c:1796:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
03:33:55:LustreError: 5682:0:(service.c:2686:ptlrpc_unregister_service()) ASSERTION(service->srv_n_queued_reqs == 0) failed
03:33:55:LustreError: 5682:0:(service.c:2686:ptlrpc_unregister_service()) LBUG
03:33:55:Pid: 5682, comm: obd_zombid
03:33:55:
03:33:55:Call Trace:
03:33:55: [<ffffffff8877c5f1>] libcfs_debug_dumpstack+0x51/0x60 [libcfs]
03:33:55: [<ffffffff8877cb2a>] lbug_with_loc+0x7a/0xd0 [libcfs]
03:33:55: [<ffffffff88787960>] cfs_tracefile_init+0x0/0x10a [libcfs]
03:33:55: [<ffffffff889412e3>] ptlrpc_unregister_service+0x4e3/0xbd0 [ptlrpc]
03:33:55: [<ffffffff8002e244>] __wake_up+0x38/0x4f
03:33:55: [<ffffffff88c390bb>] mgs_cleanup+0xeb/0x220 [mgs]
03:33:55: [<ffffffff8884b2cf>] class_decref+0x43f/0x5b0 [obdclass]
03:33:55: [<ffffffff88c392d6>] mgs_destroy_export+0xe6/0xf0 [mgs]
03:33:55: [<ffffffff88831302>] obd_zombie_impexp_cull+0x402/0x4f0 [obdclass]
03:33:55: [<ffffffff88838847>] obd_zombie_impexp_thread+0x1f7/0x2a0 [obdclass]
03:33:55: [<ffffffff8008cf99>] default_wake_function+0x0/0xe
03:33:55: [<ffffffff8005dfb1>] child_rip+0xa/0x11
03:33:55: [<ffffffff88838650>] obd_zombie_impexp_thread+0x0/0x2a0 [obdclass]
03:33:55: [<ffffffff8005dfa7>] child_rip+0x0/0x11
03:33:55:
03:33:55:Kernel panic - not syncing: LBUG
03:33:55: <7>APIC error on CPU0: 00(04)



 Comments   
Comment by Mikhail Pershin [ 07/May/11 ]

this is quite similar to bug ORI-125 I am working on currently, if this issue can be reproduced I have the possible fix already to check it.

UPD: no, this is quite different, LBUG occurs before exports and zombie barriers.

Comment by Jian Yu [ 29/Jul/11 ]

LU-546 was closed as a duplicate of this ticket.

Comment by Jian Yu [ 01/Aug/11 ]

Lustre Clients:
Tag: 1.8.6-wc1
Distro/Arch: RHEL6/x86_64 (kernel version: 2.6.32_131.2.1.el6)
Build: http://newbuild.whamcloud.com/job/lustre-b1_8/100/arch=x86_64,build_type=client,distro=el6,ib_stack=inkernel/
Network: IB (inkernel OFED)
ENABLE_QUOTA=yes

Lustre Servers:
Tag: v2_0_66_0
Distro/Arch: RHEL6/x86_64 (kernel version: 2.6.32-131.2.1.el6_lustre)
Build: http://newbuild.whamcloud.com/job/lustre-master/228/arch=x86_64,build_type=server,distro=el6,ib_stack=inkernel/
Network: IB (inkernel OFED)

After running sanity-benchmark, unmounting MDS hit the same LBUG:

Aug  1 04:54:30 fat-amd-1 rshd[12437]: root@fat-amd-3-ib.lab.whamcloud.com as root: cmd='(PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /usr/li
b64/lustre/tests; sh -c "umount -d -f /mnt/mds");echo XXRETCODE:$?'
Aug  1 04:54:31 fat-amd-1 kernel: LustreError: 2568:0:(service.c:2704:ptlrpc_unregister_service()) ASSERTION(service->srv_n_queued_reqs == 0) failed
Aug  1 04:54:31 fat-amd-1 kernel: LustreError: 2568:0:(service.c:2704:ptlrpc_unregister_service()) LBUG
Aug  1 04:54:31 fat-amd-1 kernel: Pid: 2568, comm: obd_zombid
Aug  1 04:54:31 fat-amd-1 kernel:
Aug  1 04:54:31 fat-amd-1 kernel: Call Trace:
Aug  1 04:54:31 fat-amd-1 kernel: [<ffffffffa0370855>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
Aug  1 04:54:31 fat-amd-1 kernel: [<ffffffffa0370e95>] lbug_with_loc+0x75/0xe0 [libcfs]
Aug  1 04:54:31 fat-amd-1 kernel: [<ffffffffa037bcb6>] libcfs_assertion_failed+0x66/0x70 [libcfs]
Aug  1 04:54:31 fat-amd-1 kernel: [<ffffffffa057ea03>] ptlrpc_unregister_service+0xb83/0xc20 [ptlrpc]
Aug  1 04:54:31 fat-amd-1 kernel: [<ffffffff8105dc72>] ? default_wake_function+0x12/0x20
Aug  1 04:54:31 fat-amd-1 kernel: [<ffffffff8104af29>] ? __wake_up_common+0x59/0x90
Aug  1 04:54:31 fat-amd-1 kernel: [<ffffffff8104f843>] ? __wake_up+0x53/0x70
Aug  1 04:54:31 fat-amd-1 kernel: [<ffffffffa08a6acc>] mgs_cleanup+0x4c/0x220 [mgs]
Aug  1 04:54:31 fat-amd-1 kernel: [<ffffffffa046375a>] class_decref+0x19a/0x610 [obdclass]
Aug  1 04:54:31 fat-amd-1 kernel: [<ffffffff810dbebe>] ? call_rcu+0xe/0x10
Aug  1 04:54:31 fat-amd-1 kernel: [<ffffffffa08a632f>] ? mgs_destroy_export+0x3f/0x110 [mgs]
Aug  1 04:54:31 fat-amd-1 kernel: [<ffffffffa044eac5>] obd_zombie_impexp_cull+0x335/0x5a0 [obdclass]
Aug  1 04:54:31 fat-amd-1 kernel: [<ffffffff8108e51c>] ? remove_wait_queue+0x3c/0x50
Aug  1 04:54:31 fat-amd-1 kernel: [<ffffffffa044ee35>] obd_zombie_impexp_thread+0x105/0x270 [obdclass]
Aug  1 04:54:31 fat-amd-1 kernel: [<ffffffff8105dc60>] ? default_wake_function+0x0/0x20
Aug  1 04:54:31 fat-amd-1 kernel: [<ffffffff8100c1ca>] child_rip+0xa/0x20
Aug  1 04:54:31 fat-amd-1 kernel: [<ffffffffa044ed30>] ? obd_zombie_impexp_thread+0x0/0x270 [obdclass]
Aug  1 04:54:31 fat-amd-1 kernel: [<ffffffff8100c1c0>] ? child_rip+0x0/0x20
Aug  1 04:54:31 fat-amd-1 kernel:
Aug  1 04:54:31 fat-amd-1 kernel: Kernel panic - not syncing: LBUG
Aug  1 04:54:31 fat-amd-1 kernel: Pid: 2568, comm: obd_zombid Tainted: G           ---------------- T 2.6.32-131.2.1.el6_lustre.x86_64 #1
Aug  1 04:54:31 fat-amd-1 kernel: Call Trace:
Aug  1 04:54:31 fat-amd-1 kernel: [<ffffffff814db1b8>] ? panic+0x78/0x143
Aug  1 04:54:31 fat-amd-1 kernel: [<ffffffffa0370eeb>] ? lbug_with_loc+0xcb/0xe0 [libcfs]
Aug  1 04:54:31 fat-amd-1 kernel: [<ffffffffa037bcb6>] ? libcfs_assertion_failed+0x66/0x70 [libcfs]
Aug  1 04:54:31 fat-amd-1 kernel: [<ffffffffa057ea03>] ? ptlrpc_unregister_service+0xb83/0xc20 [ptlrpc]
Aug  1 04:54:31 fat-amd-1 kernel: [<ffffffff8105dc72>] ? default_wake_function+0x12/0x20
Aug  1 04:54:31 fat-amd-1 kernel: [<ffffffff8104af29>] ? __wake_up_common+0x59/0x90
Aug  1 04:54:31 fat-amd-1 kernel: [<ffffffff8104f843>] ? __wake_up+0x53/0x70
Aug  1 04:54:31 fat-amd-1 kernel: [<ffffffffa08a6acc>] ? mgs_cleanup+0x4c/0x220 [mgs]
Aug  1 04:54:31 fat-amd-1 kernel: [<ffffffffa046375a>] ? class_decref+0x19a/0x610 [obdclass]
Aug  1 04:54:31 fat-amd-1 kernel: [<ffffffff810dbebe>] ? call_rcu+0xe/0x10
Aug  1 04:54:31 fat-amd-1 kernel: [<ffffffffa08a632f>] ? mgs_destroy_export+0x3f/0x110 [mgs]
Aug  1 04:54:31 fat-amd-1 kernel: [<ffffffffa044eac5>] ? obd_zombie_impexp_cull+0x335/0x5a0 [obdclass]
Aug  1 04:54:31 fat-amd-1 kernel: [<ffffffff8108e51c>] ? remove_wait_queue+0x3c/0x50
Aug  1 04:54:31 fat-amd-1 kernel: [<ffffffffa044ee35>] ? obd_zombie_impexp_thread+0x105/0x270 [obdclass]
Aug  1 04:54:31 fat-amd-1 kernel: [<ffffffff8105dc60>] ? default_wake_function+0x0/0x20
Aug  1 04:54:31 fat-amd-1 kernel: [<ffffffff8100c1ca>] ? child_rip+0xa/0x20
Aug  1 04:54:31 fat-amd-1 kernel: [<ffffffffa044ed30>] ? obd_zombie_impexp_thread+0x0/0x270 [obdclass]
Aug  1 04:54:31 fat-amd-1 kernel: [<ffffffff8100c1c0>] ? child_rip+0x0/0x20

sanity-quota test 29 also hit the above LBUG: https://maloo.whamcloud.com/test_sets/8f500b08-bc3c-11e0-8bdf-52540025f9af

Comment by Liang Zhen (Inactive) [ 01/Aug/11 ]

I've posted a patch for this:
http://review.whamcloud.com/#change,1172

Comment by Peter Jones [ 03/Aug/11 ]

Liang has worked on this

Comment by Liang Zhen (Inactive) [ 04/Aug/11 ]

the test failed because this:
== sanity test 180a: test obdecho on osc == 12:24:11 (1312313051)
WARNING: All config files need .conf: /etc/modprobe.d/lustre-lnet, it will be ignored in a future release.

      • glibc detected *** awk: malloc(): memory corruption: 0x0000000000cb31b0 ***

it seems that it's a userspace tool issue, and I don't think it has anything to do with this patch

Comment by Build Master (Inactive) [ 10/Aug/11 ]

Integrated in lustre-master » x86_64,client,el5,ofa #250
LU-292 srv_n_queued_reqs is screwed up by ptlrpc_unregister_service

Oleg Drokin : 90444d82f5bbeeca44a71809f88aed71515a2da8
Files :

  • lustre/ptlrpc/service.c
Comment by Build Master (Inactive) [ 10/Aug/11 ]

Integrated in lustre-master » x86_64,server,el5,inkernel #250
LU-292 srv_n_queued_reqs is screwed up by ptlrpc_unregister_service

Oleg Drokin : 90444d82f5bbeeca44a71809f88aed71515a2da8
Files :

  • lustre/ptlrpc/service.c
Comment by Build Master (Inactive) [ 10/Aug/11 ]

Integrated in lustre-master » i686,server,el6,inkernel #250
LU-292 srv_n_queued_reqs is screwed up by ptlrpc_unregister_service

Oleg Drokin : 90444d82f5bbeeca44a71809f88aed71515a2da8
Files :

  • lustre/ptlrpc/service.c
Comment by Build Master (Inactive) [ 10/Aug/11 ]

Integrated in lustre-master » x86_64,server,el6,inkernel #250
LU-292 srv_n_queued_reqs is screwed up by ptlrpc_unregister_service

Oleg Drokin : 90444d82f5bbeeca44a71809f88aed71515a2da8
Files :

  • lustre/ptlrpc/service.c
Comment by Build Master (Inactive) [ 10/Aug/11 ]

Integrated in lustre-master » x86_64,server,el5,ofa #250
LU-292 srv_n_queued_reqs is screwed up by ptlrpc_unregister_service

Oleg Drokin : 90444d82f5bbeeca44a71809f88aed71515a2da8
Files :

  • lustre/ptlrpc/service.c
Comment by Build Master (Inactive) [ 10/Aug/11 ]

Integrated in lustre-master » x86_64,client,el5,inkernel #250
LU-292 srv_n_queued_reqs is screwed up by ptlrpc_unregister_service

Oleg Drokin : 90444d82f5bbeeca44a71809f88aed71515a2da8
Files :

  • lustre/ptlrpc/service.c
Comment by Build Master (Inactive) [ 10/Aug/11 ]

Integrated in lustre-master » x86_64,client,el6,inkernel #250
LU-292 srv_n_queued_reqs is screwed up by ptlrpc_unregister_service

Oleg Drokin : 90444d82f5bbeeca44a71809f88aed71515a2da8
Files :

  • lustre/ptlrpc/service.c
Comment by Build Master (Inactive) [ 10/Aug/11 ]

Integrated in lustre-master » i686,client,el5,inkernel #250
LU-292 srv_n_queued_reqs is screwed up by ptlrpc_unregister_service

Oleg Drokin : 90444d82f5bbeeca44a71809f88aed71515a2da8
Files :

  • lustre/ptlrpc/service.c
Comment by Build Master (Inactive) [ 10/Aug/11 ]

Integrated in lustre-master » i686,server,el5,ofa #250
LU-292 srv_n_queued_reqs is screwed up by ptlrpc_unregister_service

Oleg Drokin : 90444d82f5bbeeca44a71809f88aed71515a2da8
Files :

  • lustre/ptlrpc/service.c
Comment by Build Master (Inactive) [ 10/Aug/11 ]

Integrated in lustre-master » x86_64,client,sles11,inkernel #250
LU-292 srv_n_queued_reqs is screwed up by ptlrpc_unregister_service

Oleg Drokin : 90444d82f5bbeeca44a71809f88aed71515a2da8
Files :

  • lustre/ptlrpc/service.c
Comment by Build Master (Inactive) [ 10/Aug/11 ]

Integrated in lustre-master » x86_64,client,ubuntu1004,inkernel #250
LU-292 srv_n_queued_reqs is screwed up by ptlrpc_unregister_service

Oleg Drokin : 90444d82f5bbeeca44a71809f88aed71515a2da8
Files :

  • lustre/ptlrpc/service.c
Comment by Build Master (Inactive) [ 10/Aug/11 ]

Integrated in lustre-master » i686,client,el6,inkernel #250
LU-292 srv_n_queued_reqs is screwed up by ptlrpc_unregister_service

Oleg Drokin : 90444d82f5bbeeca44a71809f88aed71515a2da8
Files :

  • lustre/ptlrpc/service.c
Comment by Peter Jones [ 10/Aug/11 ]

Landed for 2.1

Comment by Build Master (Inactive) [ 10/Aug/11 ]

Integrated in lustre-master » i686,server,el5,inkernel #250
LU-292 srv_n_queued_reqs is screwed up by ptlrpc_unregister_service

Oleg Drokin : 90444d82f5bbeeca44a71809f88aed71515a2da8
Files :

  • lustre/ptlrpc/service.c
Comment by Build Master (Inactive) [ 10/Aug/11 ]

Integrated in lustre-master » i686,client,el5,ofa #250
LU-292 srv_n_queued_reqs is screwed up by ptlrpc_unregister_service

Oleg Drokin : 90444d82f5bbeeca44a71809f88aed71515a2da8
Files :

  • lustre/ptlrpc/service.c
Comment by Nathan Rutman [ 22/Aug/12 ]

bz23289 screwed this up. The fix in LU-292 fixes that, but the wrong way – the queued counter only counts incoming (pre-req_in RPCs, not all queued rpc's (those waiting for handling).
The correct fix should be:

diff --git a/lustre/ptlrpc/service.c b/lustre/ptlrpc/service.c
index a0bd65b..7d37f0c 100644
--- a/lustre/ptlrpc/service.c
+++ b/lustre/ptlrpc/service.c
@@ -1516,7 +1516,6 @@ ptlrpc_server_handle_req_in(struct ptlrpc_service *svc)
         req = cfs_list_entry(svc->srv_req_in_queue.next,
                              struct ptlrpc_request, rq_list);
         cfs_list_del_init (&req->rq_list);
-        svc->srv_n_queued_reqs--;
         /* Consider this still a "queued" request as far as stats are
            concerned */
         cfs_spin_unlock(&svc->srv_lock);
@@ -1631,6 +1630,7 @@ ptlrpc_server_handle_req_in(struct ptlrpc_service *svc)
 
 err_req:
         cfs_spin_lock(&svc->srv_rq_lock);
+        svc->srv_n_queued_reqs--;
         svc->srv_n_active_reqs++;
         cfs_spin_unlock(&svc->srv_rq_lock);
         ptlrpc_server_finish_request(svc, req);
@@ -1691,6 +1691,7 @@ ptlrpc_server_handle_request(struct ptlrpc_service *svc,
         }
 
         cfs_list_del_init(&request->rq_list);
+        svc->srv_n_queued_reqs--;
         svc->srv_n_active_reqs++;
         if (request->rq_hp)
                 svc->srv_n_active_hpreq++;
@@ -2749,6 +2750,7 @@ int ptlrpc_unregister_service(struct ptlrpc_service *service)
 
                 req = ptlrpc_server_request_get(service, 1);
                 cfs_list_del(&req->rq_list);
+                service->srv_n_queued_reqs--;
                 service->srv_n_active_reqs++;
                 ptlrpc_server_finish_request(service, req);
         }
Comment by Liang Zhen (Inactive) [ 23/Aug/12 ]

Nathan, because svc::srv_n_queued_reqs is protected by srv_lock, not svc::srv_rq_lock, so we can't fix it by your way.

Generated at Sat Feb 10 01:05:37 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.