[LU-2827] mdt_intent_fixup_resent() cannot find the proper lock in hash Created: 18/Feb/13 Updated: 02/Nov/15 Resolved: 20/Jun/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | Lustre 2.6.0, Lustre 2.5.4 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Andrew Perepechko | Assignee: | Jian Yu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | LB, llnl | ||
| Attachments: |
|
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 6847 |
| Description |
|
If a successful reply is lost for an intent lock request, the MDS will not recover correctly from this situation on resend. The cause appears to be the interaction between ldlm_handle_enqueue0() and mdt_intent_fixup_resent():

int ldlm_handle_enqueue0(struct ldlm_namespace *ns,
                         struct ptlrpc_request *req,
                         const struct ldlm_request *dlm_req,
                         const struct ldlm_callback_suite *cbs)
{
        ...
        /* The lock's callback data might be set in the policy function */
        lock = ldlm_lock_create(ns, &dlm_req->lock_desc.l_resource.lr_name,
                                dlm_req->lock_desc.l_resource.lr_type,
                                dlm_req->lock_desc.l_req_mode, cbs,
                                NULL, 0);
        ...
        lock->l_export = class_export_lock_get(req->rq_export, lock);
        if (lock->l_export->exp_lock_hash) {
                cfs_hash_add(lock->l_export->exp_lock_hash,
                             &lock->l_remote_handle,
                             &lock->l_exp_hash);
        }
        ...
        err = ldlm_lock_enqueue(ns, &lock, cookie, &flags);
        ...
}

static void mdt_intent_fixup_resent(struct mdt_thread_info *info,
                                    struct ldlm_lock *new_lock,
                                    struct ldlm_lock **old_lock,
                                    struct mdt_lock_handle *lh)
{
        struct ptlrpc_request  *req = mdt_info_req(info);
        struct obd_export      *exp = req->rq_export;
        struct lustre_handle    remote_hdl;
        struct ldlm_request    *dlmreq;
        struct ldlm_lock       *lock;

        if (!(lustre_msg_get_flags(req->rq_reqmsg) & MSG_RESENT))
                return;

        dlmreq = req_capsule_client_get(info->mti_pill, &RMF_DLM_REQ);
        remote_hdl = dlmreq->lock_handle[0];

        lock = cfs_hash_lookup(exp->exp_lock_hash, &remote_hdl);
        if (lock) {
                if (lock != new_lock) {
        ...
}

On resend, ldlm_handle_enqueue0() adds the new lock to the export lock hash even though a granted lock with the same remote handle is already there. mdt_intent_fixup_resent() then finds the newly added lock in the hash and ignores it, because it is the new lock itself. As a result the resent request enqueues and waits on the newly created lock, which deadlocks against the already-granted lock and ends in client eviction.

Alexey thinks that the problem has existed since we moved away from the original (correct) code:

static void fixup_handle_for_resent_req(struct ptlrpc_request *req, int offset,
                                        struct ldlm_lock *new_lock,
                                        struct ldlm_lock **old_lock,
                                        struct lustre_handle *lockh)
{
        struct obd_export *exp = req->rq_export;
        struct ldlm_request *dlmreq =
                lustre_msg_buf(req->rq_reqmsg, offset, sizeof(*dlmreq));
        struct lustre_handle remote_hdl = dlmreq->lock_handle[0];
        struct list_head *iter;

        if (!(lustre_msg_get_flags(req->rq_reqmsg) & MSG_RESENT))
                return;

        spin_lock(&exp->exp_ldlm_data.led_lock);
        list_for_each(iter, &exp->exp_ldlm_data.led_held_locks) {
                struct ldlm_lock *lock;

                lock = list_entry(iter, struct ldlm_lock, l_export_chain);
                if (lock == new_lock)
                        continue;                <==================== N.B.
                if (lock->l_remote_handle.cookie == remote_hdl.cookie) {
                        lockh->cookie = lock->l_handle.h_cookie;
                        LDLM_DEBUG(lock, "restoring lock cookie");
                        DEBUG_REQ(D_DLMTRACE, req,
                                  "restoring lock cookie "LPX64,
                                  lockh->cookie);
                        if (old_lock)
                                *old_lock = LDLM_LOCK_GET(lock);
                        spin_unlock(&exp->exp_ldlm_data.led_lock);
                        return;
                }
        }
        ...
}

Logs for this issue will follow. |
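For illustration, a minimal sketch of the guard that the old list-walk code provided and that the hash-based lookup above lacks; it mirrors the quoted fixup_handle_for_resent_req() (locking omitted, helper name hypothetical) and is not the fix that eventually landed:

/* Sketch: when resolving a resent intent request, look for an existing lock
 * with the same remote handle while skipping the lock that was just created
 * for the resend itself; otherwise the lookup returns the new, ungranted
 * lock and the fixup is silently skipped. */
static struct ldlm_lock *
find_lock_for_resend(struct obd_export *exp, struct ldlm_lock *new_lock,
                     const struct lustre_handle *remote_hdl)
{
        struct ldlm_lock *lock;

        list_for_each_entry(lock, &exp->exp_ldlm_data.led_held_locks,
                            l_export_chain) {
                if (lock == new_lock)   /* skip the resend's own lock */
                        continue;
                if (lock->l_remote_handle.cookie == remote_hdl->cookie)
                        return LDLM_LOCK_GET(lock);
        }
        return NULL;
}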
| Comments |
| Comment by Andrew Perepechko [ 18/Feb/13 ] |
|
Logs for the issue (xid 1427344285761632) |
| Comment by Andrew Perepechko [ 18/Feb/13 ] |
|
The original patch presumably introducing the issue: https://bugzilla.lustre.org/show_bug.cgi?id=16777 |
| Comment by Alexey Lyashkov [ 23/Feb/13 ] |
|
Per Xyratex internal discussion: changing cfs_hash to insert the lock at the tail instead of the head looks sufficient as a short-term fix, but if we have too many resends we may double the hash size, and that will slow down lookup operations - so we need to look for a better solution. |
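For what it's worth, a conceptual sketch of why the insertion order matters here, using the plain kernel list API rather than the real cfs_hash code (all names are invented for the illustration):

/* A duplicate keyed entry added at the head of a chain shadows the original
 * on lookup, while adding it at the tail leaves the original (granted) entry
 * reachable first. */
struct chain_entry {
        struct list_head link;
        __u64            key;
};

static struct chain_entry *chain_lookup(struct list_head *chain, __u64 key)
{
        struct chain_entry *e;

        list_for_each_entry(e, chain, link)
                if (e->key == key)
                        return e;               /* first match wins */
        return NULL;
}

/* head insertion: list_add(&dup->link, chain)      -> lookup now returns dup
 * tail insertion: list_add_tail(&dup->link, chain) -> lookup still returns
 *                                                     the original entry   */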
| Comment by Keith Mannthey (Inactive) [ 26/Feb/13 ] |
|
Under what conditions are you seeing a successful reply being lost for an intent lock request? How can the problem be recreated? |
| Comment by Andrew Perepechko [ 26/Feb/13 ] |
|
A reply can be lost because of a network error. A simple reproducer for this issue, I believe, is:
[root@panda tests]# touch /mnt/lustre/file
[root@panda tests]# lctl set_param fail_loc=0x8000030c
fail_loc=0x8000030c
[root@panda tests]# setfattr -n user.attr -v value /mnt/lustre/file |
| Comment by Keith Mannthey (Inactive) [ 26/Feb/13 ] |
|
The network should not lose things. Can you be more specific? |
| Comment by Alexey Lyashkov [ 27/Feb/13 ] |
|
Keith, do you really think the network will never lose anything? PS. LNet does not guarantee packet delivery or delivery ordering. The network may flap or lose packets.. it's NORMAL. |
| Comment by Oleg Drokin [ 27/Feb/13 ] |
|
Well, it is true that ideally the network should not lose anything. In practice, though, packets do get lost sometimes, and we must deal with this unfortunate fact of life correctly. The exact reason why a reply might have been lost is not that important here (in the case of the reproducer it is lost deliberately, as a test). |
| Comment by Keith Mannthey (Inactive) [ 05/Mar/13 ] |
|
I think "lost" just means different things to different people As step one on this issue I will work to add a test. |
| Comment by Andreas Dilger [ 08/Mar/13 ] |
|
Panda, do you already have a patch for this bug? Could you please push it to Gerrit? |
| Comment by Keith Mannthey (Inactive) [ 11/Mar/13 ] |
|
I don't seem to be able to trigger a deadlock and client eviction in my local test VMs.
[root@panda tests]# touch /mnt/lustre/file
[root@panda tests]# lctl set_param fail_loc=0x8000030c
fail_loc=0x8000030c
[root@panda tests]# setfattr -n user.attr -v value /mnt/lustre/file
I tried adding more I/O but I was not able to trigger any form of deadlock or eviction. Would you mind writing a Lustre test for the deadlock and eviction that you see? It would help to address the issue. |
| Comment by Andrew Perepechko [ 12/Mar/13 ] |
|
What does dmesg show after running setfattr? |
| Comment by Andrew Perepechko [ 12/Mar/13 ] |
|
> Panda, do you already have a patch for this bug? Could you please push it to Gerrit?
No, I don't have a patch. I could try coding one once we agree on how it can best be fixed. I can't agree with Alexey that changing the hash to link new entries at the tail instead of the head is a good solution. |
| Comment by Keith Mannthey (Inactive) [ 12/Mar/13 ] |
|
I don't have those logs at this point; that VM setup is pretty busy right now. I don't recall seeing anything about setfattr in the dmesg. What do you see on your system? How do you trigger this issue? |
| Comment by Keith Mannthey (Inactive) [ 13/Mar/13 ] |
|
Lucky day. I was cleaning up some old terminals and I found the output from one of my reproduction experiments. I created a test, replay-single test 91. I think there was a 60 second sleep in this version before reverting the fail_loc status.
Lustre: DEBUG MARKER: == replay-single test 91: time out lock replay (3764) == 18:48:14 (1363052894)
Lustre: 3258:0:(client.c:1866:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1363052895/real 1363052895] req@ffff88003ba76000 x1429244805151980/t0(0) o101->lustre-MDT0000-mdc-ffff88003d8e2c00@192.168.56.103@tcp:12/10 lens 576/1080 e 0 to 1 dl 1363052902 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
Lustre: lustre-MDT0000-mdc-ffff88003d8e2c00: Connection to lustre-MDT0000 (at 192.168.56.103@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: lustre-MDT0000-mdc-ffff88003d8e2c00: Connection restored to lustre-MDT0000 (at 192.168.56.103@tcp)
Lustre: DEBUG MARKER: == replay-single test complete, duration 62 sec == 18:49:12 (1363052952)
Can you provide your debug log or test for this issue? I should have some machine cycles tomorrow if you have further suggestions on how to reproduce the issue. |
| Comment by Andrew Perepechko [ 14/Mar/13 ] |
|
The defect clearly reproduces for me with WC repository commit 9fb5d1ff837880a7596d98f24ff6bb1bf0033f2b: [root@panda tests]# bash llmount.sh Stopping clients: panda /mnt/lustre (opts:) Stopping clients: panda /mnt/lustre2 (opts:) Loading modules from /mnt/nfs/xyratex/lustre-release/lustre/tests/.. detected 4 online CPUs by sysfs Force libcfs to create 2 CPU partitions ../libcfs/libcfs/libcfs options: 'cpu_npartitions=2' debug=vfstrace rpctrace dlmtrace neterror ha config ioctl super subsystem_debug=all -lnet -lnd -pinger gss/krb5 is not supported quota/lquota options: 'hash_lqs_cur_bits=3' Formatting mgs, mds, osts Format mds1: /tmp/lustre-mdt1 Format ost1: /tmp/lustre-ost1 Format ost2: /tmp/lustre-ost2 Checking servers environments Checking clients panda environments Loading modules from /mnt/nfs/xyratex/lustre-release/lustre/tests/.. detected 4 online CPUs by sysfs Force libcfs to create 2 CPU partitions debug=vfstrace rpctrace dlmtrace neterror ha config ioctl super subsystem_debug=all -lnet -lnd -pinger gss/krb5 is not supported Setup mgs, mdt, osts Starting mds1: -o loop /tmp/lustre-mdt1 /mnt/mds1 Started lustre-MDT0000 Starting ost1: -o loop /tmp/lustre-ost1 /mnt/ost1 Started lustre-OST0000 Starting ost2: -o loop /tmp/lustre-ost2 /mnt/ost2 Started lustre-OST0001 Starting client: panda: -o user_xattr,flock panda@tcp:/lustre /mnt/lustre Using TIMEOUT=20 enable jobstats, set job scheduler as procname_uid Waiting 90 secs for update Updated after 2s: wanted 'procname_uid' got 'procname_uid' disable quota as required [root@panda tests]# cd /mnt/lustre [root@panda lustre]# touch /mnt/lustre/file [root@panda lustre]# lctl set_param fail_loc=0x8000030c fail_loc=0x8000030c [root@panda lustre]# setfattr -n user.attr -v value /mnt/lustre/file [root@panda lustre]# dmesg Lustre: Mounted lustre-client Lustre: DEBUG MARKER: Using TIMEOUT=20 Lustre: ctl-lustre-MDT0000: super-sequence allocation rc = 0 [0x0000000200000400-0x0000000240000400):0:mdt Lustre: *** cfs_fail_loc=30c, val=2147483648*** LustreError: 30243:0:(ldlm_lib.c:2422:target_send_reply_msg()) @@@ dropping reply req@ffff88007d7d6850 x1429464107975760/t0(0) o101->ab287eba-76b5-0c4e-723e-377c1d2b0157@0@lo:0/0 lens 576/624 e 0 to 0 dl 1363243290 ref 1 fl Interpret:/0/0 rc 0/0 Lustre: 30711:0:(client.c:1866:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1363243284/real 1363243284] req@ffff88006b6db400 x1429464107975760/t0(0) o101->lustre-MDT0000-mdc-ffff88006b6db800@0@lo:12/10 lens 576/1048 e 0 to 1 dl 1363243291 ref 2 fl Rpc:X/0/ffffffff rc 0/-1 Lustre: lustre-MDT0000-mdc-ffff88006b6db800: Connection to lustre-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete Lustre: lustre-MDT0000: Client ab287eba-76b5-0c4e-723e-377c1d2b0157 (at 0@lo) reconnecting Lustre: lustre-MDT0000-mdc-ffff88006b6db800: Connection restored to lustre-MDT0000 (at 0@lo) LNet: Service thread pid 30243 was inactive for 40.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: Pid: 30243, comm: mdt01_001 Call Trace: [<ffffffff814ead32>] schedule_timeout+0x192/0x2e0 [<ffffffff8107cb50>] ? process_timeout+0x0/0x10 [<ffffffffa0fe46d1>] cfs_waitq_timedwait+0x11/0x20 [libcfs] [<ffffffffa12bb22d>] ldlm_completion_ast+0x4ed/0x960 [ptlrpc] [<ffffffffa12b6950>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc] [<ffffffff8105fa40>] ? 
default_wake_function+0x0/0x20 [<ffffffffa12ba968>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc] [<ffffffffa12bad40>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] [<ffffffffa0981c60>] ? mdt_blocking_ast+0x0/0x2a0 [mdt] [<ffffffffa098492b>] mdt_object_lock0+0x33b/0xaf0 [mdt] [<ffffffffa0981c60>] ? mdt_blocking_ast+0x0/0x2a0 [mdt] [<ffffffffa12bad40>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] [<ffffffffa09851a4>] mdt_object_lock+0x14/0x20 [mdt] [<ffffffffa0985211>] mdt_object_find_lock+0x61/0x170 [mdt] [<ffffffffa09a7671>] mdt_reint_setxattr+0x461/0x1850 [mdt] [<ffffffffa1164c70>] ? lu_ucred+0x20/0x30 [obdclass] [<ffffffffa09811e5>] ? mdt_ucred+0x15/0x20 [mdt] [<ffffffffa099bf8c>] ? mdt_root_squash+0x2c/0x410 [mdt] [<ffffffffa130fe96>] ? __req_capsule_get+0x166/0x700 [ptlrpc] [<ffffffffa09a0781>] mdt_reint_rec+0x41/0xe0 [mdt] [<ffffffffa0999de3>] mdt_reint_internal+0x4e3/0x7d0 [mdt] [<ffffffffa099a114>] mdt_reint+0x44/0xe0 [mdt] [<ffffffffa098b008>] mdt_handle_common+0x628/0x1620 [mdt] [<ffffffffa09c27d5>] mds_regular_handle+0x15/0x20 [mdt] [<ffffffffa12f402c>] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] [<ffffffffa0fe45de>] ? cfs_timer_arm+0xe/0x10 [libcfs] [<ffffffffa12eb759>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] [<ffffffff81052223>] ? __wake_up+0x53/0x70 [<ffffffffa12f5576>] ptlrpc_main+0xb76/0x1870 [ptlrpc] [<ffffffffa12f4a00>] ? ptlrpc_main+0x0/0x1870 [ptlrpc] [<ffffffff8100c0ca>] child_rip+0xa/0x20 [<ffffffffa12f4a00>] ? ptlrpc_main+0x0/0x1870 [ptlrpc] [<ffffffffa12f4a00>] ? ptlrpc_main+0x0/0x1870 [ptlrpc] [<ffffffff8100c0c0>] ? child_rip+0x0/0x20 LustreError: dumping log to /tmp/lustre-log.1363243331.30243 LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 157s: evicting client at 0@lo ns: mdt-ffff8800cd97e000 lock: ffff8800cdaa2600/0x3d633a3159cdb915 lrc: 3/0,0 mode: PR/PR res: 8589935616/1 bits 0x1b rrc: 3 type: IBT flags: 0x200000000020 nid: 0@lo remote: 0x3d633a3159cdb8f9 expref: 9 pid: 30243 timeout: 4297311737 lvb_type: 0 LNet: Service thread pid 30243 completed after 150.27s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). LustreError: 11-0: lustre-MDT0000-mdc-ffff88006b6db800: Communicating with 0@lo, operation obd_ping failed with -107. Lustre: lustre-MDT0000-mdc-ffff88006b6db800: Connection to lustre-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete LustreError: 167-0: lustre-MDT0000-mdc-ffff88006b6db800: This client was evicted by lustre-MDT0000; in progress operations using this service will fail. Lustre: lustre-MDT0000-mdc-ffff88006b6db800: Connection restored to lustre-MDT0000 (at 0@lo) |
| Comment by Keith Mannthey (Inactive) [ 14/Mar/13 ] |
[root@client tests]# bash llmount.sh Stopping clients: client /mnt/lustre (opts:) Stopping clients: client /mnt/lustre2 (opts:) Loading modules from /usr/lib64/lustre/tests/.. detected 1 online CPUs by sysfs libcfs will create CPU partition based on online CPUs debug=vfstrace rpctrace dlmtrace neterror ha config ioctl super subsystem_debug=all -lnet -lnd -pinger gss/krb5 is not supported quota/lquota options: 'hash_lqs_cur_bits=3' Formatting mgs, mds, osts Format mds1: /dev/sdb Format ost1: /dev/sdb Checking servers environments Checking clients client environments Loading modules from /usr/lib64/lustre/tests/.. detected 1 online CPUs by sysfs libcfs will create CPU partition based on online CPUs debug=vfstrace rpctrace dlmtrace neterror ha config ioctl super subsystem_debug=all -lnet -lnd -pinger gss/krb5 is not supported Setup mgs, mdt, osts Starting mds1: /dev/sdb /mnt/mds1 Started lustre-MDT0000 Starting ost1: /dev/sdb /mnt/ost1 Started lustre-OST0000 Starting client: client: -o user_xattr,flock mds@tcp:/lustre /mnt/lustre Starting client client: -o user_xattr,flock mds@tcp:/lustre /mnt/lustre Started clients client: mds@tcp:/lustre on /mnt/lustre type lustre (rw,user_xattr,flock) Using TIMEOUT=20 enable jobstats, set job scheduler as procname_uid Waiting 90 secs for update Updated after 4s: wanted 'procname_uid' got 'procname_uid' disable quota as required [root@client tests]# cd /mnt/lustre [root@client lustre]# touch /mnt/lustre/file [root@client lustre]# lctl set_param fail_loc=0x8000030c fail_loc=0x8000030c [root@client lustre]# setfattr -n user.attr -v value /mnt/lustre/file [root@client lustre]# dmesg <snip> Lustre: Mounted lustre-client Lustre: DEBUG MARKER: Using TIMEOUT=20 I have waited way past 157s (well over 15 min) and there is no change to dmesg. This is the same behavior I saw before. My version is Lustre: Build Version: 2.3.62--PRISTINE-2.6.32-279.19.1.el6_lustre.x86_64 . It is a Master build from last week it is post 9fb5d1ff837880a7596d98f24ff6bb1bf0033f2b by a good bit. Any more ideas for reproduction? I am inclined to think this is not a generic Lustre Master issue. |
| Comment by Oleg Drokin [ 15/Mar/13 ] |
|
Keith, I don't see the marker for the fail injection being hit in your logs:
Starting client: client: -o user_xattr,flock mds@tcp:/lustre /mnt/lustre
...
[root@client lustre]# lctl set_param fail_loc=0x8000030c
What is this mds node that you mount? The fail_loc needs to be injected on the MDS node, and you seem to be running it on a client (unless mds is just a fancy alias for client in your setup). |
| Comment by Andrew Perepechko [ 15/Mar/13 ] |
|
I've just checked out the latest version: 20baff1f7f77b02f629e5f7980807ad151423b3b. And I'm sorry if my explanation on fail_loc was not detailed enough. |
| Comment by Keith Mannthey (Inactive) [ 15/Mar/13 ] |
|
I learn something every day Ok so a repeat of the test setting the fail_loc on the mds. From the mds Lustre: DEBUG MARKER: Using TIMEOUT=20 Lustre: ctl-lustre-MDT0000: super-sequence allocation rc = 0 [0x0000000200000400-0x0000000240000400):0:mdt Lustre: *** cfs_fail_loc=30c, val=2147483648*** LustreError: 1304:0:(ldlm_lib.c:2422:target_send_reply_msg()) @@@ dropping reply req@ffff880034a99000 x1429516885090712/t0(0) o101->6ea10d6d-cb72-918b-ed2d-ac3bfeae442f@192.168.56.101@tcp:0/0 lens 576/568 e 0 to 0 dl 1363372217 ref 1 fl Interpret:/0/0 rc 0/0 Lustre: lustre-MDT0000: Client 6ea10d6d-cb72-918b-ed2d-ac3bfeae442f (at 192.168.56.101@tcp) reconnecting [root@mds ~]# from the client ustre: Mounted lustre-client Lustre: DEBUG MARKER: Using TIMEOUT=20 Lustre: 3056:0:(client.c:1866:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1363372217/real 1363372217] req@ffff88003a3d7800 x1429516885090712/t0(0) o101->lustre-MDT0000-mdc-ffff88003b0bc000@192.168.56.103@tcp:12/10 lens 576/1080 e 0 to 1 dl 1363372224 ref 2 fl Rpc:X/0/ffffffff rc 0/-1 Lustre: lustre-MDT0000-mdc-ffff88003b0bc000: Connection to lustre-MDT0000 (at 192.168.56.103@tcp) was lost; in progress operations using this service will wait for recovery to complete Lustre: lustre-MDT0000-mdc-ffff88003b0bc000: Connection restored to lustre-MDT0000 (at 192.168.56.103@tcp) [root@client mnt]# The "setfattr -n user.attr -v value /mnt/lustre/file" hung for a small amount of time 10-20 seconds maybe. |
| Comment by Keith Mannthey (Inactive) [ 15/Mar/13 ] |
|
All right I rebooted everything and did it again and this time I got it. from the mds: Lustre: Lustre: Build Version: 2.3.62--PRISTINE-2.6.32-279.19.1.el6_lustre.x86_64 LDISKFS-fs (sdb): mounted filesystem with ordered data mode. quota=on. Opts: LDISKFS-fs (sdb): mounted filesystem with ordered data mode. quota=on. Opts: Lustre: Setting parameter lustre-MDT0000-mdtlov.lov.stripesize in log lustre-MDT0000 Lustre: ctl-lustre-MDT0000: No data found on store. Initialize space Lustre: lustre-MDT0000: new disk, initializing Lustre: 1413:0:(mdt_lproc.c:380:lprocfs_wr_identity_upcall()) lustre-MDT0000: identity upcall set to /usr/sbin/l_getidentity LustreError: 11-0: lustre-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation mds_connect failed with -11. Lustre: DEBUG MARKER: Using TIMEOUT=20 Lustre: ctl-lustre-MDT0000: super-sequence allocation rc = 0 [0x0000000200000400-0x0000000240000400):0:mdt Lustre: *** cfs_fail_loc=30c, val=2147483648*** LustreError: 1390:0:(ldlm_lib.c:2422:target_send_reply_msg()) @@@ dropping reply req@ffff88003453d000 x1429600593772960/t0(0) o101->27ba2a8e-a9c1-f2e8-0c49-e7d56aa2868c@192.168.56.101@tcp:0/0 lens 576/624 e 0 to 0 dl 1363373869 ref 1 fl Interpret:/0/0 rc 0/0 Lustre: lustre-MDT0000: Client 27ba2a8e-a9c1-f2e8-0c49-e7d56aa2868c (at 192.168.56.101@tcp) reconnecting LNet: Service thread pid 1390 was inactive for 40.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: Pid: 1390, comm: mdt00_001 Call Trace: [<ffffffff814ead12>] schedule_timeout+0x192/0x2e0 [<ffffffff8107cb50>] ? process_timeout+0x0/0x10 [<ffffffffa03376d1>] cfs_waitq_timedwait+0x11/0x20 [libcfs] [<ffffffffa065622d>] ldlm_completion_ast+0x4ed/0x960 [ptlrpc] [<ffffffffa0651950>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc] [<ffffffff8105fa40>] ? default_wake_function+0x0/0x20 [<ffffffffa0655968>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc] [<ffffffffa0655d40>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] [<ffffffffa0d1cc60>] ? mdt_blocking_ast+0x0/0x2a0 [mdt] [<ffffffffa0d1f92b>] mdt_object_lock0+0x33b/0xaf0 [mdt] [<ffffffffa0d1cc60>] ? mdt_blocking_ast+0x0/0x2a0 [mdt] [<ffffffffa0655d40>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] [<ffffffffa0d201a4>] mdt_object_lock+0x14/0x20 [mdt] [<ffffffffa0d20211>] mdt_object_find_lock+0x61/0x170 [mdt] [<ffffffffa0d42671>] mdt_reint_setxattr+0x461/0x1850 [mdt] [<ffffffffa04fcc80>] ? lu_ucred+0x20/0x30 [obdclass] [<ffffffffa0d1c1e5>] ? mdt_ucred+0x15/0x20 [mdt] [<ffffffffa0d36f8c>] ? mdt_root_squash+0x2c/0x410 [mdt] [<ffffffffa06aaeb6>] ? __req_capsule_get+0x166/0x700 [ptlrpc] [<ffffffffa0d3b781>] mdt_reint_rec+0x41/0xe0 [mdt] [<ffffffffa0d34de3>] mdt_reint_internal+0x4e3/0x7d0 [mdt] [<ffffffffa0d35114>] mdt_reint+0x44/0xe0 [mdt] [<ffffffffa0d26008>] mdt_handle_common+0x628/0x1620 [mdt] [<ffffffffa0d5e6e5>] mds_regular_handle+0x15/0x20 [mdt] [<ffffffffa068f04c>] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] [<ffffffffa03375de>] ? cfs_timer_arm+0xe/0x10 [libcfs] [<ffffffffa0686799>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] [<ffffffff81052223>] ? __wake_up+0x53/0x70 [<ffffffffa0690596>] ptlrpc_main+0xb76/0x1870 [ptlrpc] [<ffffffffa068fa20>] ? ptlrpc_main+0x0/0x1870 [ptlrpc] [<ffffffff8100c0ca>] child_rip+0xa/0x20 [<ffffffffa068fa20>] ? ptlrpc_main+0x0/0x1870 [ptlrpc] [<ffffffffa068fa20>] ? ptlrpc_main+0x0/0x1870 [ptlrpc] [<ffffffff8100c0c0>] ? 
child_rip+0x0/0x20 LustreError: dumping log to /tmp/lustre-log.1363373910.1390 LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 157s: evicting client at 192.168.56.101@tcp ns: mdt-ffff88003acdf000 lock: ffff88003cfb3000/0xd91a117c53c742cb lrc: 3/0,0 mode: PR/PR res: 8589935616/2 bits 0x1b rrc: 3 type: IBT flags: 0x200000000020 nid: 192.168.56.101@tcp remote: 0xedc553c6f6f67ee9 expref: 8 pid: 1390 timeout: 4295563571 lvb_type: 0 LNet: Service thread pid 1390 completed after 150.43s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). Would you mind making a Lustre test for this issue? |
| Comment by Oleg Drokin [ 18/Mar/13 ] |
|
Keith, since you are able to reproduce it now, I think it's trivial for you to create the test. |
| Comment by Keith Mannthey (Inactive) [ 18/Mar/13 ] |
|
Oleg, which test covers this presently? I can take care of the test. Is there a clear path forward to fix the issue? |
| Comment by Keith Mannthey (Inactive) [ 20/Mar/13 ] |
|
Well, either the test is not working at all or it just does not fail often. This is with a 10 min wait; from the MDS dmesg:
Lustre: DEBUG MARKER: Using TIMEOUT=20
Lustre: DEBUG MARKER: == replay-single test 91: Find proper lock in hash LU-2827 == 17:06:09 (1363737969)
Lustre: *** cfs_fail_loc=30c, val=2147483648***
LustreError: 3835:0:(ldlm_lib.c:2422:target_send_reply_msg()) @@@ dropping reply req@ffff880034928800 x1429982887805068/t0(0) o101->ac9ba92b-1fcc-8ab8-fe18-f7a10315f583@192.168.56.101@tcp:0/0 lens 576/568 e 0 to 0 dl 1363737965 ref 1 fl Interpret:/0/0 rc 0/0
Lustre: lustre-MDT0000: Client ac9ba92b-1fcc-8ab8-fe18-f7a10315f583 (at 192.168.56.101@tcp) reconnecting
Lustre: DEBUG MARKER: == replay-single test complete, duration 631 sec == 17:16:22 (1363738582)
The fail_loc is set on the MDS but we don't seem to trigger the hang. This is the test that is not working:
#LU-2827 mdt_intent_fixup_resent need to find the proper lock in hash
test_91() {
touch $DIR/$tname
#define OBD_FAIL_LDLM_REPLY 0x30c
do_facet $SINGLEMDS "lctl set_param fail_loc=0x8000030c"
setfattr -n user.attr -v value $DIR/$tname
sleep 600
#check the state of the system
sanity_mount_check ||
error "Client has been evicted"
do_facet $SINGLEMDS "lctl set_param fail_loc=0x0"
}
run_test 91 "Find proper lock in hash LU-2827"
I tried it by hand a few times (including after a fresh reboot) exactly the same way I triggered it before and did not encounter the error. I see the client reconnecting quickly and no eviction. Andrew, you seem to reproduce this quite easily. How many attempts does it take you to trigger it? What should the timeout be from the setfattr to the mount check? My initial thought was 180s, but I increased it thinking I was not waiting long enough. I am going to run a 5 min wait loop overnight to see if I can catch the error again. |
| Comment by Andrew Perepechko [ 20/Mar/13 ] |
|
Hello Keith! The issue has been reproducing very easily for me. Actually, I had to try a lot to make the test not work. This is how it works for me. 00010000:00010000:1.0:1363770071.095970:0:2835:0:(ldlm_lock.c:631:ldlm_add_bl_work_item()) ### lock incompatible; sending blocking AST. ns: mdt-ffff8801281dd000 lock: ffff8800c85
d8480/0x37eadca6bd308e8b lrc: 2/0,0 mode: PR/PR res: 8589935619/4 bits 0x3 rrc: 3 type: IBT flags: 0x4000000 nid: 0@lo remote: 0x37eadca6bd308e6f expref: 8 pid: 3330 timeout: 0
=============
00010000:00010000:1.0:1363770071.095987:0:2835:0:(ldlm_lock.c:631:ldlm_add_bl_work_item()) ### lock incompatible; sending blocking AST. ns: mdt-ffff8801281dd000 lock: ffff8800c85
d8a80/0x37eadca6bd308eae lrc: 2/0,0 mode: PR/PR res: 8589935619/4 bits 0x3 rrc: 3 type: IBT flags: 0x4000000 nid: 0@lo remote: 0x37eadca6bd308e6f expref: 8 pid: 2835 timeout: 0
00010000:00000001:1.0:1363770071.096007:0:2835:0:(ldlm_lock.c:666:ldlm_add_ast_work_item()) Process leaving
=============
00010000:00010000:1.0:1363770071.096129:0:2835:0:(ldlm_lockd.c:834:ldlm_server_blocking_ast()) ### server preparing blocking AST ns: mdt-ffff8801281dd000 lock: ffff8800c85d8a80/0x37eadca6bd308eae lrc: 3/0,0 mode: PR/PR res: 8589935619/4 bits 0x3 rrc: 3 type: IBT flags: 0x4000020 nid: 0@lo remote: 0x37eadca6bd308e6f expref: 8 pid: 2835 timeout: 0
00010000:00010000:1.0:1363770071.096144:0:2835:0:(ldlm_lockd.c:487:ldlm_add_waiting_lock()) ### adding to wait list(timeout: 150, AT: on) ns: mdt-ffff8801281dd000 lock: ffff8800c85d8a80/0x37eadca6bd308eae lrc: 4/0,0 mode: PR/PR res: 8589935619/4 bits 0x3 rrc: 3 type: IBT flags: 0x4000020 nid: 0@lo remote: 0x37eadca6bd308e6f expref: 8 pid: 2835 timeout: 4299631388
00010000:00010000:1.0:1363770071.096482:0:2835:0:(ldlm_lockd.c:834:ldlm_server_blocking_ast()) ### server preparing blocking AST ns: mdt-ffff8801281dd000 lock: ffff8800c85d8480/0x37eadca6bd308e8b lrc: 3/0,0 mode: PR/PR res: 8589935619/4 bits 0x3 rrc: 3 type: IBT flags: 0x4000020 nid: 0@lo remote: 0x37eadca6bd308e6f expref: 8 pid: 3330 timeout: 0
00010000:00010000:1.0:1363770071.096499:0:2835:0:(ldlm_lockd.c:487:ldlm_add_waiting_lock()) ### adding to wait list(timeout: 150, AT: on) ns: mdt-ffff8801281dd000 lock: ffff8800c85d8480/0x37eadca6bd308e8b lrc: 4/0,0 mode: PR/PR res: 8589935619/4 bits 0x3 rrc: 3 type: IBT flags: 0x4000020 nid: 0@lo remote: 0x37eadca6bd308e6f expref: 8 pid: 3330 timeout: 4299631389
==============
00010000:00010000:0.0:1363770071.097126:0:2972:0:(ldlm_lockd.c:1524:ldlm_handle_bl_callback()) ### client blocking AST callback handler ns: lustre-MDT0000-mdc-ffff8800cdb54c00 lock: ffff8800cd89abc0/0x37eadca6bd308e6f lrc: 2/0,0 mode: PR/PR res: 8589935619/4 bits 0x3 rrc: 1 type: IBT flags: 0x10000000 nid: local remote: 0x37eadca6bd308eae expref: -99 pid: 3430 timeout: 0
00010000:00010000:1.0:1363770071.097574:0:3217:0:(ldlm_lockd.c:1524:ldlm_handle_bl_callback()) ### client blocking AST callback handler ns: lustre-MDT0000-mdc-ffff8800cdb54c00 lock: ffff8800cd89abc0/0x37eadca6bd308e6f lrc: 3/0,0 mode: --/PR res: 8589935619/4 bits 0x3 rrc: 1 type: IBT flags: 0x30002090 nid: local remote: 0x37eadca6bd308eae expref: -99 pid: 3430 timeout: 0
00010000:00010000:1.0:1363770071.097614:0:3217:0:(ldlm_request.c:1251:ldlm_cli_cancel()) ### lock is already being destroyed <==================== no reply sent?
However, if the two cancels are processed sequentially rather than in parallel, ldlm_callback_handler() will return -EINVAL:
lock = ldlm_handle2lock_long(&dlm_req->lock_handle[0], 0);
if (!lock) {
CDEBUG(D_DLMTRACE, "callback on lock "LPX64" - lock "
"disappeared\n", dlm_req->lock_handle[0].cookie);
rc = ldlm_callback_reply(req, -EINVAL);
ldlm_callback_errmsg(req, "Operate with invalid parameter", rc,
&dlm_req->lock_handle[0]);
RETURN(0);
}
if ((lock->l_flags & LDLM_FL_FAIL_LOC) &&
lustre_msg_get_opc(req->rq_reqmsg) == LDLM_BL_CALLBACK)
OBD_RACE(OBD_FAIL_LDLM_CP_BL_RACE);
/* Copy hints/flags (e.g. LDLM_FL_DISCARD_DATA) from AST. */
lock_res_and_lock(lock);
lock->l_flags |= ldlm_flags_from_wire(dlm_req->lock_flags &
LDLM_AST_FLAGS);
if (lustre_msg_get_opc(req->rq_reqmsg) == LDLM_BL_CALLBACK) {
/* If somebody cancels lock and cache is already dropped,
* or lock is failed before cp_ast received on client,
* we can tell the server we have no lock. Otherwise, we
* should send cancel after dropping the cache. */
if (((lock->l_flags & LDLM_FL_CANCELING) &&
(lock->l_flags & LDLM_FL_BL_DONE)) ||
(lock->l_flags & LDLM_FL_FAILED)) {
LDLM_DEBUG(lock, "callback on lock "
LPX64" - lock disappeared\n",
dlm_req->lock_handle[0].cookie);
unlock_res_and_lock(lock);
LDLM_LOCK_RELEASE(lock);
rc = ldlm_callback_reply(req, -EINVAL);
ldlm_callback_errmsg(req, "Operate on stale lock", rc,
&dlm_req->lock_handle[0]);
RETURN(0);
}
/* BL_AST locks are not needed in LRU.
* Let ldlm_cancel_lru() be fast. */
ldlm_lock_remove_from_lru(lock);
lock->l_flags |= LDLM_FL_BL_AST;
}
unlock_res_and_lock(lock);
Perhaps, adding an OBD_RACE barrier right before the ldlm_handle2lock_long() call might help the issue become more easily reproducible. |
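A rough sketch of that suggestion, only to show the placement; OBD_FAIL_LDLM_BL_CB_RACE is a hypothetical fail_loc value that would have to be defined, and this is not a tested patch:

/* In ldlm_callback_handler(), right before the handle lookup, so that two
 * racing blocking-AST callbacks can be made to collide deterministically and
 * the "lock is already being destroyed" window becomes reproducible: */
OBD_RACE(OBD_FAIL_LDLM_BL_CB_RACE);     /* hypothetical fail_loc */

lock = ldlm_handle2lock_long(&dlm_req->lock_handle[0], 0);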
| Comment by Andrew Perepechko [ 20/Mar/13 ] |
|
By the way, the sleep 600 is not required by your test. And could you please make sure that sanity_mount_check() really checks whether there was an eviction? |
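For example, a sketch of a more direct check than sanity_mount_check(), assuming the client console message quoted earlier in this ticket ("This client was evicted by lustre-MDT0000") is the symptom to detect; check_not_evicted is a made-up helper and $DIR/$tname follows the draft test above:

check_not_evicted() {
        # look for the eviction message the client prints when the MDS drops it
        dmesg | grep -q "evicted by ${FSNAME}-MDT0000" &&
                error "client was evicted"
        # a follow-up metadata operation on the same file should also succeed
        getfattr -n user.attr $DIR/$tname > /dev/null ||
                error "getfattr failed after resend"
}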
| Comment by Keith Mannthey (Inactive) [ 20/Mar/13 ] |
|
I have yet to reproduce the issue with the test. I will make sure sanity_mount_check, or some other check, is the right thing to catch the eviction. Perhaps I will work on this in a single-VM environment, as overnight testing didn't get me anything. The one time I triggered it (documented up above) was right after a fresh reboot. |
| Comment by Keith Mannthey (Inactive) [ 21/Mar/13 ] |
|
I put the beta test here: http://review.whamcloud.com/5795 A timeout is needed so that we can detect the error and take corrective action. I tried to recreate the issue in a single VM for a while today but I didn't get the error. Can you try the test to see if it fails for you? Perhaps you can help finish the test, as you seem to be able to readily reproduce the issue and I can't for some reason. |
| Comment by Oleg Drokin [ 26/Mar/13 ] |
|
Keith, the problem you have is that your test creates the double lock but does not really try to revoke it. The previous test for this was replay-single test 52 (I think; I still cannot find the original report I dug up way back). But now stat does not cancel the lock at all (not that it ever did, I guess). Potentially the MDS intent lock is ill-suited for this test now; instead we might want a test that drops the lock reply on the OST so that the resend creates another lock, and then we cause a cancel (e.g. by unlinking the file or doing a stat from another client - it is important to do it from another client to avoid ELC cancelling our copy of the lock beforehand). |
| Comment by Keith Mannthey (Inactive) [ 29/Mar/13 ] |
|
replay-single test 52 stemmed from https://projectlava.xyratex.com/show_bug.cgi?id=3764 (circa 2004). I have started working on a two-client test; I hope to have code to share tomorrow. |
| Comment by Keith Mannthey (Inactive) [ 02/Apr/13 ] |
|
Well, I spent good cycles trying to recreate the issue with 2 clients and VMs for the MDT/MDS and OSS/OST. I tried all sorts of operations. In general the first operations would block for a few seconds, then things work like normal. The fail_loc setting only really seems to change the behavior of the initial operations. I did not encounter the issue today at all. I spent some time trying to work out how to drop locks on the OST but I couldn't find any examples of this. Any hints? |
| Comment by Oleg Drokin [ 02/Apr/13 ] |
|
To force a lock drop, do some conflicting operation:
touch /mnt/lustre/file          # creates the file
ls -l /mnt/lustre/file          # obtains read locks on the OST objects of the file
echo aaa >/mnt/lustre/file      # attempts to write - cancels the read lock in the process (unlink would do too), but due to ELC the client does it all by itself
# now, assuming /mnt/lustre2 is the same lustre fs on a different mountpoint
echo aaa >>/mnt/lustre2/file    # this actually forces the OST to cancel the lock given out by a previous operation |
| Comment by Oleg Drokin [ 02/Apr/13 ] |
|
Actually, looking at the title of this ticket (duh), this is an mdt_intent_fixup_resent() bug, so it's MDS-only; as such it's best to concentrate on MDS operations after all, like setxattr. So the example from above would probably be:
touch somefile ; ls -l somefile ; rm -rf othermountpoint/somefile
with the reply drop around the ls -l |
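Putting the two suggestions above together, a sketch of the MDS-focused sequence (assuming the test-framework helpers from the draft test earlier, and that /mnt/lustre and /mnt/lustre2 are two mounts of the same filesystem as in the llmount.sh output; exactly which operation consumes the dropped reply may need tuning):

touch /mnt/lustre/file
do_facet $SINGLEMDS "lctl set_param fail_loc=0x8000030c"   # drop one lock reply (OBD_FAIL_LDLM_REPLY, fail-once)
ls -l /mnt/lustre/file                                     # intent lock request; reply dropped, client resends
do_facet $SINGLEMDS "lctl set_param fail_loc=0x0"
rm -f /mnt/lustre2/file      # conflicting operation from the second mount forces the MDS to revoke the lock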
| Comment by Andrew Perepechko [ 02/Apr/13 ] |
|
Alexey, Vitaly and I have discussed a possible solution. I'm coding it at the moment. |
| Comment by Minh Diep [ 04/Apr/13 ] |
|
Panda, I have reproduced this in a single-node cluster but not with multiple nodes. Have you ever tried to reproduce this in a multi-node cluster? If not, could you try that? |
| Comment by Andrew Perepechko [ 08/Apr/13 ] |
|
Minh, I haven't tried to reproduce with multi-node configurations. But, as I wrote in https://jira.hpdd.intel.com/browse/LU-2827?focusedCommentId=54450&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-54450, it seems clear that without special fault injection code it might be hard to reproduce the issue on a multi-node setup. I've posted a prototype patch for discussion/review: http://review.whamcloud.com/5978 (it does not include a test case). |
| Comment by Andrew Perepechko [ 09/Apr/13 ] |
|
Xyratex-bug-id: MRP-975 |
| Comment by Andrew Perepechko [ 10/Apr/13 ] |
|
A test case: http://review.whamcloud.com/#change,6000 |
| Comment by Peter Jones [ 26/Apr/13 ] |
|
Lowering priority. This patch is not ready, and this limitation has been present in earlier versions of Lustre. |
| Comment by Andrew Perepechko [ 29/Apr/13 ] |
|
There's no new patch at the moment. We will be discussing a different solution from the one coded in the Gerrit patches. If Whamcloud is going to code its own patch, please let us know. |
| Comment by Bruno Faccini (Inactive) [ 30/Apr/14 ] |
|
Since both tickets (and their patches!) are highly related, I tried running the reproducer of |
| Comment by James A Simmons [ 02/May/14 ] |
|
I also see this patch does NOT apply well to the b2_5 branch. |
| Comment by Bruno Faccini (Inactive) [ 06/May/14 ] |
|
I just had a look into the MDS crash-dump I got when running the reproducer of |
| Comment by James A Simmons [ 06/May/14 ] |
|
I just tried this patch with the LASSERT removed. For my simul run I see the following:
[22441.866214] INFO: task mdt03_000:14714 blocked for more than 120 seconds. |
| Comment by Oleg Drokin [ 08/May/14 ] |
|
Bruno: The assert was introduced specifically to catch this case. Basically, what we are trying to catch is this: the server has processed a request and returned the reply to the client, but the client cannot accept it because its buffer is too small, so it resends. |
| Comment by Bruno Faccini (Inactive) [ 10/May/14 ] |
|
Oleg, BTW, it reproduces solidly on homogeneous b2_4/b2_5/master servers/client platforms and again triggers the LBUG/LASSERT(lock->l_granted_mode != lock->l_req_mode) that the change introduced in ldlm_process_inodebits_lock() when running with this ticket/Gerrit-change |
| Comment by Bruno Faccini (Inactive) [ 20/May/14 ] |
|
I spent some time to reproduce the MDS LBUG//LASSERT(lock->l_granted_mode != lock->l_req_mode) I had when running the simple reproducer from _ in fact even a simple dd to create a file with a default and significant striping (I am using 154 OSTs), right after file-system start/mount triggers the problem ... _ what I see in the Client/MDS full debug logs then is that, as expected after Client 1st/original request 00000080:00010000:2.0:1400265787.405960:0:2750:0:(file.c:3998:ll_layout_refresh()) ### lustre: requeue layout lock for file [0x200002340:0x1:0x0](ffff880320dbc678) 00000080:00000001:2.0:1400265787.405962:0:2750:0:(obd_class.h:1691:md_enqueue()) Process entered 00800000:00000001:2.0:1400265787.405963:0:2750:0:(lmv_obd.c:1949:lmv_enqueue()) Process entered 00800000:00000001:2.0:1400265787.405964:0:2750:0:(lmv_obd.c:620:lmv_check_connect()) Process entered 00800000:00000001:2.0:1400265787.405965:0:2750:0:(lmv_obd.c:623:lmv_check_connect()) Process leaving (rc=0 : 0 : 0) 00800000:00000002:2.0:1400265787.405967:0:2750:0:(lmv_obd.c:1956:lmv_enqueue()) ENQUEUE 'layout' on [0x200002340:0x1:0x0] 00800000:00000002:2.0:1400265787.405968:0:2750:0:(lmv_obd.c:1963:lmv_enqueue()) ENQUEUE 'layout' on [0x200002340:0x1:0x0] -> mds #0 00800000:00000001:2.0:1400265787.405970:0:2750:0:(obd_class.h:1691:md_enqueue()) Process entered 00000002:00000001:2.0:1400265787.405971:0:2750:0:(mdc_locks.c:802:mdc_enqueue()) Process entered 00000002:00000001:2.0:1400265787.405972:0:2750:0:(mdc_locks.c:511:mdc_intent_layout_pack()) Process entered 00000100:00000010:2.0:1400265787.405973:0:2750:0:(client.c:410:ptlrpc_request_cache_alloc()) slab-alloced 'req': 976 at ffff8802faff9400. 00000020:00000040:2.0:1400265787.405974:0:2750:0:(genops.c:1022:class_import_get()) import ffff880328196800 refcount=6 obd=lustre-MDT0000-mdc-ffff880331322400 00010000:00000001:2.0:1400265787.405977:0:2750:0:(ldlm_request.c:766:ldlm_prep_elc_req()) Process entered 00010000:00000001:2.0:1400265787.405978:0:2750:0:(ldlm_request.c:1649:ldlm_prepare_lru_list()) Process entered 00010000:00000001:2.0:1400265787.405980:0:2750:0:(ldlm_lock.c:211:ldlm_lock_put()) Process entered 00010000:00000001:2.0:1400265787.405981:0:2750:0:(ldlm_lock.c:244:ldlm_lock_put()) Process leaving 00010000:00000001:2.0:1400265787.405982:0:2750:0:(ldlm_request.c:1771:ldlm_prepare_lru_list()) Process leaving (rc=0 : 0 : 0) 00000100:00000001:2.0:1400265787.405984:0:2750:0:(client.c:584:__ptlrpc_request_bufs_pack()) Process entered 02000000:00000001:2.0:1400265787.405985:0:2750:0:(sec.c:416:sptlrpc_req_get_ctx()) Process entered 02000000:00000001:2.0:1400265787.405986:0:2750:0:(sec.c:434:sptlrpc_req_get_ctx()) Process leaving (rc=0 : 0 : 0) 02000000:00000010:2.0:1400265787.405988:0:2750:0:(sec_null.c:174:null_alloc_reqbuf()) kmalloced 'req->rq_reqbuf': 512 at ffff8802f7eb1800. 
00000100:00000001:2.0:1400265787.405990:0:2750:0:(client.c:638:__ptlrpc_request_bufs_pack()) Process leaving (rc=0 : 0 : 0) 00010000:00000001:2.0:1400265787.405991:0:2750:0:(ldlm_request.c:1886:ldlm_cli_cancel_list()) Process entered 00010000:00000001:2.0:1400265787.405992:0:2750:0:(ldlm_request.c:1889:ldlm_cli_cancel_list()) Process leaving (rc=0 : 0 : 0) 00010000:00000001:2.0:1400265787.405994:0:2750:0:(ldlm_request.c:1886:ldlm_cli_cancel_list()) Process entered 00010000:00000001:2.0:1400265787.405995:0:2750:0:(ldlm_request.c:1889:ldlm_cli_cancel_list()) Process leaving (rc=0 : 0 : 0) 00010000:00000001:2.0:1400265787.405996:0:2750:0:(ldlm_request.c:817:ldlm_prep_elc_req()) Process leaving (rc=0 : 0 : 0) 00000002:00000001:2.0:1400265787.405998:0:2750:0:(mdc_locks.c:538:mdc_intent_layout_pack()) Process leaving (rc=18446612145115206656 : -131928594344960 : ffff8802faff9400) 00000002:00000001:2.0:1400265787.406000:0:2750:0:(lustre_mdc.h:107:mdc_get_rpc_lock()) Process entered 00010000:00000001:2.0:1400265787.406001:0:2750:0:(ldlm_request.c:875:ldlm_cli_enqueue()) Process entered 00010000:00000001:2.0:1400265787.406002:0:2750:0:(ldlm_lock.c:1596:ldlm_lock_create()) Process entered 00010000:00000001:2.0:1400265787.406003:0:2750:0:(lustre_fid.h:707:fid_flatten32()) Process leaving (rc=36962305 : 36962305 : 2340001) 00010000:00000040:2.0:1400265787.406004:0:2750:0:(ldlm_resource.c:1189:ldlm_resource_getref()) getref res: ffff880325333840 count: 2 00010000:00000001:2.0:1400265787.406006:0:2750:0:(ldlm_lock.c:472:ldlm_lock_new()) Process entered 00010000:00000010:2.0:1400265787.406007:0:2750:0:(ldlm_lock.c:477:ldlm_lock_new()) slab-alloced 'lock': 504 at ffff880320f970c0. 00000020:00000001:2.0:1400265787.406008:0:2750:0:(lustre_handles.c:80:class_handle_hash()) Process entered 00000020:00000040:2.0:1400265787.406009:0:2750:0:(lustre_handles.c:114:class_handle_hash()) added object ffff880320f970c0 with handle 0x5121400b7cdaafb2 to hash 00000020:00000001:2.0:1400265787.406011:0:2750:0:(lustre_handles.c:115:class_handle_hash()) Process leaving 00010000:00000001:2.0:1400265787.406011:0:2750:0:(ldlm_lock.c:515:ldlm_lock_new()) Process leaving (rc=18446612145752338624 : -131927957212992 : ffff880320f970c0) 00010000:00000001:2.0:1400265787.406013:0:2750:0:(ldlm_lock.c:1636:ldlm_lock_create()) Process leaving (rc=18446612145752338624 : -131927957212992 : ffff880320f970c0) 00010000:00000001:2.0:1400265787.406015:0:2750:0:(ldlm_lock.c:274:ldlm_lock_remove_from_lru()) Process entered 00010000:00000001:2.0:1400265787.406016:0:2750:0:(ldlm_lock.c:283:ldlm_lock_remove_from_lru()) Process leaving 00010000:00010000:2.0:1400265787.406017:0:2750:0:(ldlm_lock.c:797:ldlm_lock_addref_internal_nolock()) ### ldlm_lock_addref(CR) ns: lustre-MDT0000-mdc-ffff880331322400 lock: ffff880320f970c0/0x5121400b7cdaafb2 lrc: 3/1,0 mode: --/CR res: [0x200002340:0x1:0x0].0 bits 0x0 rrc: 2 type: IBT flags: 0x10000000000000 nid: local remote: 0x0 expref: -99 pid: 2750 timeout: 0 lvb_type: 3 00010000:00010000:2.0:1400265787.406023:0:2750:0:(ldlm_request.c:926:ldlm_cli_enqueue()) ### client-side enqueue START, flags 1000 ns: lustre-MDT0000-mdc-ffff880331322400 lock: ffff880320f970c0/0x5121400b7cdaafb2 lrc: 3/1,0 mode: --/CR res: [0x200002340:0x1:0x0].0 bits 0x8 rrc: 2 type: IBT flags: 0x0 nid: local remote: 0x0 expref: -99 pid: 2750 timeout: 0 lvb_type: 3 00010000:00010000:2.0:1400265787.406028:0:2750:0:(ldlm_request.c:988:ldlm_cli_enqueue()) ### sending request ns: lustre-MDT0000-mdc-ffff880331322400 lock: 
ffff880320f970c0/0x5121400b7cdaafb2 lrc: 3/1,0 mode: --/CR res: [0x200002340:0x1:0x0].0 bits 0x8 rrc: 2 type: IBT flags: 0x0 nid: local remote: 0x0 expref: -99 pid: 2750 timeout: 0 lvb_type: 3 00000100:00000001:2.0:1400265787.406038:0:2750:0:(client.c:2676:ptlrpc_queue_wait()) Process entered 00000100:00000001:2.0:1400265787.406039:0:2750:0:(client.c:859:ptlrpc_prep_set()) Process entered 00000100:00000010:2.0:1400265787.406040:0:2750:0:(client.c:860:ptlrpc_prep_set()) kmalloced 'set': 144 at ffff8803287f55c0. 00000100:00000001:2.0:1400265787.406041:0:2750:0:(client.c:876:ptlrpc_prep_set()) Process leaving (rc=18446612145878554048 : -131927830997568 : ffff8803287f55c0) 00000100:00000001:2.0:1400265787.406043:0:2750:0:(client.c:2605:ptlrpc_request_addref()) Process entered 00000100:00000001:2.0:1400265787.406044:0:2750:0:(client.c:2607:ptlrpc_request_addref()) Process leaving (rc=18446612145115206656 : -131928594344960 : ffff8802faff9400) 00000020:00000001:2.0:1400265787.406046:0:2750:0:(class_obd.c:138:lustre_get_jobid()) Process entered 00000020:00000001:2.0:1400265787.406047:0:2750:0:(class_obd.c:143:lustre_get_jobid()) Process leaving (rc=0 : 0 : 0) 00000100:00000001:2.0:1400265787.406048:0:2750:0:(client.c:2111:ptlrpc_set_wait()) Process entered 00000100:00000001:2.0:1400265787.406049:0:2750:0:(client.c:1415:ptlrpc_send_new_req()) Process entered 00000100:00000040:2.0:1400265787.406051:0:2750:0:(lustre_net.h:3296:ptlrpc_rqphase_move()) @@@ move req "New" -> "Rpc" req@ffff8802faff9400 x1466802977730748/t0(0) o101->lustre-MDT0000-mdc-ffff880331322400@10.10.4.17@tcp:12/10 lens 376/1544 e 0 to 0 dl 0 ref 2 fl New:/0/ffffffff rc 0/-1 00000100:00000001:2.0:1400265787.406057:0:2750:0:(client.c:1064:ptlrpc_import_delay_req()) Process entered 00000100:00000001:2.0:1400265787.406058:0:2750:0:(client.c:1115:ptlrpc_import_delay_req()) Process leaving (rc=0 : 0 : 0) 02000000:00000001:2.0:1400265787.406059:0:2750:0:(sec.c:659:sptlrpc_req_refresh_ctx()) Process entered 02000000:00000001:2.0:1400265787.406060:0:2750:0:(sec.c:686:sptlrpc_req_refresh_ctx()) Process leaving (rc=0 : 0 : 0) 00000100:00100000:2.0:1400265787.406062:0:2750:0:(client.c:1478:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc file:a56bcdbb-ca09-9568-70fd-b3f47d606d12:2750:1466802977730748:10.10.4.17@tcp:101 00000100:00000001:2.0:1400265787.406065:0:2750:0:(niobuf.c:682:ptl_send_rpc()) Process entered 02000000:00000001:2.0:1400265787.406066:0:2750:0:(sec.c:961:sptlrpc_cli_wrap_request()) Process entered 02000000:00000001:2.0:1400265787.406066:0:2750:0:(sec.c:997:sptlrpc_cli_wrap_request()) Process leaving (rc=0 : 0 : 0) 02000000:00000001:2.0:1400265787.406068:0:2750:0:(sec.c:1689:sptlrpc_cli_alloc_repbuf()) Process entered 02000000:00000010:2.0:1400265787.406069:0:2750:0:(sec_null.c:217:null_alloc_repbuf()) kmalloced 'req->rq_repbuf': 2048 at ffff880308075800. 02000000:00000001:2.0:1400265787.406070:0:2750:0:(sec.c:1699:sptlrpc_cli_alloc_repbuf()) Process leaving (rc=0 : 0 : 0) 00000400:00000010:2.0:1400265787.406072:0:2750:0:(lib-lnet.h:500:lnet_me_alloc()) alloc '(me)': 96 at ffff880320454dc0 (tot 83133036). 00000400:00000010:2.0:1400265787.406075:0:2750:0:(lib-lnet.h:468:lnet_md_alloc()) alloc '(md)': 128 at ffff880320454d40 (tot 83133164). 
00000100:00000200:2.0:1400265787.406077:0:2750:0:(niobuf.c:808:ptl_send_rpc()) Setup reply buffer: 2048 bytes, xid 1466802977730748, portal 10 00000100:00000001:2.0:1400265787.406079:0:2750:0:(client.c:2605:ptlrpc_request_addref()) Process entered 00000100:00000001:2.0:1400265787.406079:0:2750:0:(client.c:2607:ptlrpc_request_addref()) Process leaving (rc=18446612145115206656 : -131928594344960 : ffff8802faff9400) 00000100:00000040:2.0:1400265787.406082:0:2750:0:(niobuf.c:829:ptl_send_rpc()) @@@ send flg=0 req@ffff8802faff9400 x1466802977730748/t0(0) o101->lustre-MDT0000-mdc-ffff880331322400@10.10.4.17@tcp:12/10 lens 376/1544 e 0 to 0 dl 1400265794 ref 3 fl Rpc:/0/ffffffff rc 0/-1 00000100:00000001:2.0:1400265787.406088:0:2750:0:(niobuf.c:60:ptl_send_buf()) Process entered Server handle+reply 00000100:00100000:0.0:1400265787.402458:0:11385:0:(service.c:1939:ptlrpc_server_handle_req_in()) got req x1466802977730748 00000020:00000001:0.0:1400265787.402460:0:11385:0:(genops.c:741:class_conn2export()) Process entered 00000020:00000040:0.0:1400265787.402460:0:11385:0:(genops.c:753:class_conn2export()) looking for export cookie 0xd169abc5abfab541 00000020:00000001:0.0:1400265787.402462:0:11385:0:(lustre_handles.c:172:class_handle2object()) Process entered 00000020:00000040:0.0:1400265787.402463:0:11385:0:(genops.c:839:class_export_get()) GETting export ffff880212978800 : new refcount 12 00000020:00000001:0.0:1400265787.402464:0:11385:0:(lustre_handles.c:195:class_handle2object()) Process leaving (rc=18446612141216073728 : -131932493477888 : ffff880212978800) 00000020:00000001:0.0:1400265787.402466:0:11385:0:(genops.c:755:class_conn2export()) Process leaving (rc=18446612141216073728 : -131932493477888 : ffff880212978800) 00000100:00000001:0.0:1400265787.402468:0:11385:0:(service.c:1068:ptlrpc_update_export_timer()) Process entered 00000100:00000001:0.0:1400265787.402469:0:11385:0:(service.c:1081:ptlrpc_update_export_timer()) Process leaving 00000020:00000010:0.0:1400265787.402470:0:11385:0:(lu_object.c:1643:keys_init()) kmalloced 'ctx->lc_value': 320 at ffff880268932c00. 02000000:00000010:0.0:1400265787.402472:0:11385:0:(capa.c:416:lu_capainfo_key_init()) kmalloced 'value': 128 at ffff880212f7cac0. 00000020:00000010:0.0:1400265787.402474:0:11385:0:(lu_ucred.c:53:lu_ucred_key_init()) kmalloced 'value': 72 at ffff880213336bc0. 00000020:00000010:0.0:1400265787.402475:0:11385:0:(tgt_main.c:187:tgt_ses_key_init()) kmalloced 'value': 144 at ffff880215e7a740. 
00000100:00000040:0.0:1400265787.402477:0:11385:0:(service.c:1203:ptlrpc_at_set_timer()) armed mdt at +1s 00000100:00000001:0.0:1400265787.402479:0:11385:0:(service.c:1670:ptlrpc_server_request_add()) Process entered 00000100:00000001:0.0:1400265787.402479:0:11385:0:(service.c:1571:ptlrpc_server_hpreq_init()) Process entered 00000100:00000001:0.0:1400265787.402481:0:11385:0:(service.c:1622:ptlrpc_server_hpreq_init()) Process leaving (rc=0 : 0 : 0) 00000100:00000001:0.0:1400265787.402482:0:11385:0:(service.c:1685:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0) 00000100:00000001:0.0:1400265787.402486:0:11385:0:(service.c:2002:ptlrpc_server_handle_req_in()) Process leaving (rc=1 : 1 : 1) 00000400:00000001:0.0:1400265787.402488:0:11385:0:(watchdog.c:433:lc_watchdog_disable()) Process entered 00000400:00000001:0.0:1400265787.402489:0:11385:0:(watchdog.c:441:lc_watchdog_disable()) Process leaving 00000400:00000001:0.0:1400265787.402491:0:11385:0:(watchdog.c:416:lc_watchdog_touch()) Process entered 00000400:00000001:0.0:1400265787.402492:0:11385:0:(watchdog.c:427:lc_watchdog_touch()) Process leaving 00000100:00000001:0.0:1400265787.402493:0:11385:0:(service.c:2025:ptlrpc_server_handle_request()) Process entered 00000100:00000001:0.0:1400265787.402494:0:11385:0:(service.c:1803:ptlrpc_server_request_get()) Process entered 00000100:00100000:0.0:1400265787.402496:0:11385:0:(nrs_fifo.c:182:nrs_fifo_req_get()) NRS start fifo request from 12345-10.10.4.10@tcp, seq: 172 00000100:00000040:0.0:1400265787.402498:0:11385:0:(service.c:1841:ptlrpc_server_request_get()) RPC GETting export ffff880212978800 : new rpc_count 1 00000100:00000001:0.0:1400265787.402499:0:11385:0:(service.c:1843:ptlrpc_server_request_get()) Process leaving (rc=18446612141172062208 : -131932537489408 : ffff88020ff7f800) 00000100:00000040:0.0:1400265787.402502:0:11385:0:(lustre_net.h:3296:ptlrpc_rqphase_move()) @@@ move req "New" -> "Interpret" req@ffff88020ff7f800 x1466802977730748/t0(0) o101->a56bcdbb-ca09-9568-70fd-b3f47d606d12@10.10.4.10@tcp:0/0 lens 376/0 e 0 to 0 dl 1400265793 ref 1 fl New:/0/ffffffff rc 0/-1 00000100:00000001:0.0:1400265787.402508:0:11385:0:(service.c:1068:ptlrpc_update_export_timer()) Process entered 00000100:00000001:0.0:1400265787.402509:0:11385:0:(service.c:1081:ptlrpc_update_export_timer()) Process leaving 00000100:00100000:0.0:1400265787.402511:0:11385:0:(service.c:2086:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_002:a56bcdbb-ca09-9568-70fd-b3f47d606d12+12:2750:x1466802977730748:12345-10.10.4.10@tcp:101 00000100:00000200:0.0:1400265787.402513:0:11385:0:(service.c:2091:ptlrpc_server_handle_request()) got req 1466802977730748 00000020:00000001:0.0:1400265787.402515:0:11385:0:(tgt_handler.c:589:tgt_request_handle()) Process entered 00000020:00000001:0.0:1400265787.402516:0:11385:0:(tgt_handler.c:553:tgt_handler_find_check()) Process entered 00000020:00000001:0.0:1400265787.402517:0:11385:0:(tgt_handler.c:575:tgt_handler_find_check()) Process leaving (rc=18446744072111134752 : -1598416864 : ffffffffa0ba1820) 00000020:00000001:0.0:1400265787.402519:0:11385:0:(tgt_handler.c:492:tgt_handle_recovery()) Process entered 00000020:00000001:0.0:1400265787.402520:0:11385:0:(tgt_handler.c:542:tgt_handle_recovery()) Process leaving (rc=1 : 1 : 1) 00000020:00000001:0.0:1400265787.402522:0:11385:0:(tgt_handler.c:377:tgt_handle_request0()) Process entered 00000020:00000001:0.0:1400265787.402522:0:11385:0:(tgt_handler.c:307:tgt_request_preprocess()) Process entered 
00000020:00000001:0.0:1400265787.402524:0:11385:0:(tgt_handler.c:362:tgt_request_preprocess()) Process leaving (rc=0 : 0 : 0) 00000020:00000001:0.0:1400265787.402525:0:11385:0:(tgt_handler.c:1220:tgt_enqueue()) Process entered 00010000:00000001:0.0:1400265787.402526:0:11385:0:(ldlm_lockd.c:1165:ldlm_handle_enqueue0()) Process entered 00010000:00010000:0.0:1400265787.402527:0:11385:0:(ldlm_lockd.c:1167:ldlm_handle_enqueue0()) ### server-side enqueue handler START 00010000:00000001:0.0:1400265787.402528:0:11385:0:(ldlm_lockd.c:1556:ldlm_request_cancel()) Process entered 00010000:00000001:0.0:1400265787.402529:0:11385:0:(ldlm_lockd.c:1560:ldlm_request_cancel()) Process leaving (rc=0 : 0 : 0) 00010000:00000001:0.0:1400265787.402531:0:11385:0:(ldlm_lock.c:1596:ldlm_lock_create()) Process entered 00010000:00000001:0.0:1400265787.402532:0:11385:0:(lustre_fid.h:707:fid_flatten32()) Process leaving (rc=36962305 : 36962305 : 2340001) 00010000:00000040:0.0:1400265787.402534:0:11385:0:(ldlm_resource.c:1189:ldlm_resource_getref()) getref res: ffff8802bdc23a80 count: 2 00010000:00000001:0.0:1400265787.402535:0:11385:0:(ldlm_lock.c:472:ldlm_lock_new()) Process entered 00010000:00000010:0.0:1400265787.402537:0:11385:0:(ldlm_lock.c:477:ldlm_lock_new()) slab-alloced 'lock': 504 at ffff880212a98ac0. 00000020:00000001:0.0:1400265787.402538:0:11385:0:(lustre_handles.c:80:class_handle_hash()) Process entered 00000020:00000040:0.0:1400265787.402539:0:11385:0:(lustre_handles.c:114:class_handle_hash()) added object ffff880212a98ac0 with handle 0xd169abc5abfab66e to hash 00000020:00000001:0.0:1400265787.402541:0:11385:0:(lustre_handles.c:115:class_handle_hash()) Process leaving 00010000:00000001:0.0:1400265787.402542:0:11385:0:(ldlm_lock.c:515:ldlm_lock_new()) Process leaving (rc=18446612141217254080 : -131932492297536 : ffff880212a98ac0) 00010000:00000001:0.0:1400265787.402544:0:11385:0:(ldlm_lock.c:1636:ldlm_lock_create()) Process leaving (rc=18446612141217254080 : -131932492297536 : ffff880212a98ac0) 00010000:00010000:0.0:1400265787.402546:0:11385:0:(ldlm_lockd.c:1258:ldlm_handle_enqueue0()) ### server-side enqueue handler, new lock created ns: mdt-lustre-MDT0000_UUID lock: ffff880212a98ac0/0xd169abc5abfab66e lrc: 2/0,0 mode: --/CR res: [0x200002340:0x1:0x0].0 bits 0x0 rrc: 2 type: IBT flags: 0x40000000000000 nid: local remote: 0x5121400b7cdaafb2 expref: -99 pid: 11385 timeout: 0 lvb_type: 0 00010000:00000040:0.0:1400265787.402552:0:11385:0:(ldlm_lockd.c:1270:ldlm_handle_enqueue0()) lock GETting export ffff880212978800 : new locks_count 9 00000020:00000040:0.0:1400265787.402553:0:11385:0:(genops.c:839:class_export_get()) GETting export ffff880212978800 : new refcount 13 00010000:00000001:0.0:1400265787.402555:0:11385:0:(ldlm_lock.c:1666:ldlm_lock_enqueue()) Process entered 00000004:00000001:0.0:1400265787.402556:0:11385:0:(mdt_handler.c:3497:mdt_intent_policy()) Process entered 00000004:00000001:0.0:1400265787.402558:0:11385:0:(mdt_handler.c:3434:mdt_intent_opc()) Process entered 00000004:00000001:0.0:1400265787.402559:0:11385:0:(mdt_handler.c:2735:mdt_unpack_req_pack_rep()) Process entered 00000004:00000001:0.0:1400265787.402560:0:11385:0:(mdt_handler.c:2753:mdt_unpack_req_pack_rep()) Process leaving (rc=0 : 0 : 0) 00000004:00000001:0.0:1400265787.402562:0:11385:0:(mdt_handler.c:3247:mdt_intent_layout()) Process entered 00000004:00000001:0.0:1400265787.402562:0:11385:0:(mdt_handler.c:2181:mdt_object_find()) Process entered 00000004:00000040:0.0:1400265787.402563:0:11385:0:(mdt_handler.c:2183:mdt_object_find()) 
Find object for [0x200002340:0x1:0x0] 00000020:00000001:0.0:1400265787.402565:0:11385:0:(lustre_fid.h:707:fid_flatten32()) Process leaving (rc=36962305 : 36962305 : 2340001) 00000020:00000001:0.0:1400265787.402566:0:11385:0:(lustre_fid.h:707:fid_flatten32()) Process leaving (rc=36962305 : 36962305 : 2340001) 00000004:00000001:0.0:1400265787.402568:0:11385:0:(mdt_handler.c:2190:mdt_object_find()) Process leaving (rc=18446612141272470360 : -131932437081256 : ffff880215f41358) 00000004:00000001:0.0:1400265787.402570:0:11385:0:(mdd_object.c:238:mdd_xattr_get()) Process entered 00000004:00000001:0.0:1400265787.402572:0:11385:0:(lod_object.c:498:lod_xattr_get()) Process entered 00000004:00000001:0.0:1400265787.402575:0:11385:0:(lod_object.c:502:lod_xattr_get()) Process leaving (rc=3728 : 3728 : e90) 00000004:00000001:0.0:1400265787.402577:0:11385:0:(mdd_object.c:257:mdd_xattr_get()) Process leaving (rc=3728 : 3728 : e90) 00000004:00000001:0.0:1400265787.402579:0:11385:0:(mdt_internal.h:559:mdt_object_put()) Process entered 00000020:00000001:0.0:1400265787.402579:0:11385:0:(lustre_fid.h:707:fid_flatten32()) Process leaving (rc=36962305 : 36962305 : 2340001) 00000004:00000001:0.0:1400265787.402581:0:11385:0:(mdt_internal.h:561:mdt_object_put()) Process leaving 00000100:00000001:0.0:1400265787.402582:0:11385:0:(pack_generic.c:337:lustre_pack_reply_v2()) Process entered 02000000:00000001:0.0:1400265787.402583:0:11385:0:(sec.c:2113:sptlrpc_svc_alloc_rs()) Process entered 02000000:00000010:0.0:1400265787.402587:0:11385:0:(sec_null.c:320:null_alloc_rs()) kmalloced 'rs': 4336 at ffff88020f756000. 02000000:00000001:0.0:1400265787.402588:0:11385:0:(sec.c:2149:sptlrpc_svc_alloc_rs()) Process leaving (rc=0 : 0 : 0) 00000100:00000001:0.0:1400265787.402590:0:11385:0:(pack_generic.c:371:lustre_pack_reply_v2()) Process leaving (rc=0 : 0 : 0) 00000004:00000001:0.0:1400265787.402592:0:11385:0:(mdt_handler.c:3284:mdt_intent_layout()) Process leaving (rc=0 : 0 : 0) 00000004:00000001:0.0:1400265787.402593:0:11385:0:(mdt_handler.c:3483:mdt_intent_opc()) Process leaving (rc=0 : 0 : 0) 00000004:00000001:0.0:1400265787.402594:0:11385:0:(mdt_handler.c:3536:mdt_intent_policy()) Process leaving (rc=0 : 0 : 0) 00010000:00000001:0.0:1400265787.402596:0:11385:0:(ldlm_inodebits.c:188:ldlm_process_inodebits_lock()) Process entered 00010000:00000001:0.0:1400265787.402597:0:11385:0:(ldlm_inodebits.c:90:ldlm_inodebits_compat_queue()) Process entered 00010000:00000001:0.0:1400265787.402598:0:11385:0:(ldlm_inodebits.c:163:ldlm_inodebits_compat_queue()) Process leaving (rc=1 : 1 : 1) 00010000:00000001:0.0:1400265787.402600:0:11385:0:(ldlm_inodebits.c:90:ldlm_inodebits_compat_queue()) Process entered 00010000:00000001:0.0:1400265787.402601:0:11385:0:(ldlm_inodebits.c:163:ldlm_inodebits_compat_queue()) Process leaving (rc=1 : 1 : 1) 00010000:00000001:0.0:1400265787.402602:0:11385:0:(ldlm_lock.c:1146:ldlm_grant_lock()) Process entered 00010000:00000001:0.0:1400265787.402603:0:11385:0:(ldlm_lock.c:1123:ldlm_grant_lock_with_skiplist()) Process entered 00010000:00000001:0.0:1400265787.402604:0:11385:0:(ldlm_lock.c:1003:search_granted_lock()) Process entered 00010000:00000001:0.0:1400265787.402605:0:11385:0:(ldlm_lock.c:1074:search_granted_lock()) Process leaving 00010000:00000001:0.0:1400265787.402605:0:11385:0:(ldlm_lock.c:1086:ldlm_granted_list_add_lock()) Process entered 00010000:00000040:0.0:1400265787.402606:0:11385:0:(ldlm_resource.c:1414:ldlm_resource_dump()) --- Resource: [0x200002340:0x1:0x0].0 (ffff8802bdc23a80) refcount = 2 
00010000:00000040:0.0:1400265787.402608:0:11385:0:(ldlm_resource.c:1417:ldlm_resource_dump()) Granted locks (in reverse order): 00010000:00000040:0.0:1400265787.402610:0:11385:0:(ldlm_resource.c:1420:ldlm_resource_dump()) ### ### ns: mdt-lustre-MDT0000_UUID lock: ffff880212a982c0/0xd169abc5abfab64b lrc: 2/0,0 mode: PR/PR res: [0x200002340:0x1:0x0].0 bits 0x2 rrc: 2 type: IBT flags: 0x40200000000000 nid: 10.10.4.10@tcp remote: 0x5121400b7cdaafa4 expref: 13 pid: 11385 timeout: 0 lvb_type: 0 00010000:00010000:0.0:1400265787.402617:0:11385:0:(ldlm_lock.c:1091:ldlm_granted_list_add_lock()) ### About to add lock: ns: mdt-lustre-MDT0000_UUID lock: ffff880212a98ac0/0xd169abc5abfab66e lrc: 3/0,0 mode: CR/CR res: [0x200002340:0x1:0x0].0 bits 0x8 rrc: 2 type: IBT flags: 0x50000000000000 nid: 10.10.4.10@tcp remote: 0x5121400b7cdaafb2 expref: 13 pid: 11385 timeout: 0 lvb_type: 3 00010000:00000001:0.0:1400265787.402622:0:11385:0:(ldlm_lock.c:1113:ldlm_granted_list_add_lock()) Process leaving 00010000:00000001:0.0:1400265787.402623:0:11385:0:(ldlm_lock.c:1129:ldlm_grant_lock_with_skiplist()) Process leaving 00010000:00000001:0.0:1400265787.402624:0:11385:0:(ldlm_pool.c:345:ldlm_srv_pool_recalc()) Process entered 00010000:00000001:0.0:1400265787.402625:0:11385:0:(ldlm_pool.c:349:ldlm_srv_pool_recalc()) Process leaving (rc=0 : 0 : 0) 00010000:00000001:0.0:1400265787.402626:0:11385:0:(ldlm_lock.c:1165:ldlm_grant_lock()) Process leaving 00010000:00000001:0.0:1400265787.402627:0:11385:0:(ldlm_inodebits.c:238:ldlm_process_inodebits_lock()) Process leaving (rc=0 : 0 : 0) 00010000:00000001:0.0:1400265787.402628:0:11385:0:(ldlm_lock.c:1762:ldlm_lock_enqueue()) Process leaving via out (rc=0 : 0 : 0x0) 00010000:00000001:0.0:1400265787.402630:0:11385:0:(ldlm_lockd.c:1380:ldlm_handle_enqueue0()) Process leaving 00010000:00010000:0.0:1400265787.402632:0:11385:0:(ldlm_lockd.c:1393:ldlm_handle_enqueue0()) ### server-side enqueue handler, sending reply(err=0, rc=0) ns: mdt-lustre-MDT0000_UUID lock: ffff880212a98ac0/0xd169abc5abfab66e lrc: 3/0,0 mode: CR/CR res: [0x200002340:0x1:0x0].0 bits 0x8 rrc: 2 type: IBT flags: 0x40000000000000 nid: 10.10.4.10@tcp remote: 0x5121400b7cdaafb2 expref: 13 pid: 11385 timeout: 0 lvb_type: 3 00000004:00000001:0.0:1400265787.402639:0:11385:0:(mdt_lvb.c:104:mdt_lvbo_fill()) Process entered 00000020:00000010:0.0:1400265787.402640:0:11385:0:(lu_object.c:1643:keys_init()) kmalloced 'ctx->lc_value': 320 at ffff880212baaa00. 00000020:00000010:0.0:1400265787.402642:0:11385:0:(lu_object.c:429:lu_global_key_init()) kmalloced 'value': 512 at ffff8801f657ce00. 00000020:00000010:0.0:1400265787.402645:0:11385:0:(dt_object.c:55:dt_global_key_init()) kmalloced 'value': 1216 at ffff88032a027000. 00000040:00000010:0.0:1400265787.402647:0:11385:0:(llog_obd.c:278:llog_key_init()) kmalloced 'value': 224 at ffff8802b9611ac0. 00000020:00000010:0.0:1400265787.402649:0:11385:0:(tgt_main.c:146:tgt_key_init()) kmalloced 'value': 1152 at ffff8803277a9800. 80000000:00000010:0.0:1400265787.402651:0:11385:0:(fld_handler.c:60:fld_key_init()) kmalloced 'value': 72 at ffff880212e4a440. 40000000:00000010:0.0:1400265787.402653:0:11385:0:(fid_handler.c:406:seq_key_init()) kmalloced 'value': 48 at ffff8802128a1a00. 00040000:00000010:0.0:1400265787.402656:0:11385:0:(lquota_lib.c:53:lquota_key_init()) kmalloced 'value': 240 at ffff88029dd554c0. 00040000:00000010:0.0:1400265787.402659:0:11385:0:(qmt_dev.c:336:qmt_key_init()) kmalloced 'value': 376 at ffff8801f67b2e00. 
00040000:00000010:0.0:1400265787.402661:0:11385:0:(qsd_lib.c:71:qsd_key_init()) kmalloced 'value': 376 at ffff8801f67b2600. 00000004:00000010:0.0:1400265787.402665:0:11385:0:(osd_handler.c:5503:osd_key_init()) kmalloced 'info': 4400 at ffff88020f758000. 00000004:00000010:0.0:1400265787.402668:0:11385:0:(osd_handler.c:5507:osd_key_init()) kmalloced 'info->oti_it_ea_buf': 5120 at ffff88020f75a000. 00100000:00000010:0.0:1400265787.402671:0:11385:0:(lfsck_lib.c:46:lfsck_key_init()) kmalloced 'value': 1144 at ffff88020f723800. 00000004:00000010:0.0:1400265787.402673:0:11385:0:(mdt_handler.c:5817:mdt_key_init()) kmalloced 'value': 1464 at ffff88020f723000. 00000004:00000010:0.0:1400265787.402675:0:11385:0:(lod_dev.c:804:lod_key_init()) kmalloced 'value': 336 at ffff8801f67b2c00. 00000004:00000010:0.0:1400265787.402678:0:11385:0:(mdd_device.c:1553:mdd_key_init()) kmalloced 'value': 1504 at ffff88020f724800. 00000004:00000010:0.0:1400265787.402680:0:11385:0:(osp_dev.c:1171:osp_key_init()) kmalloced 'value': 568 at ffff880212882800. 00000004:00000001:0.0:1400265787.402682:0:11385:0:(mdt_handler.c:2181:mdt_object_find()) Process entered 00000004:00000040:0.0:1400265787.402683:0:11385:0:(mdt_handler.c:2183:mdt_object_find()) Find object for [0x200002340:0x1:0x0] 00000020:00000001:0.0:1400265787.402684:0:11385:0:(lustre_fid.h:707:fid_flatten32()) Process leaving (rc=36962305 : 36962305 : 2340001) 00000020:00000001:0.0:1400265787.402686:0:11385:0:(lustre_fid.h:707:fid_flatten32()) Process leaving (rc=36962305 : 36962305 : 2340001) 00000004:00000001:0.0:1400265787.402688:0:11385:0:(mdt_handler.c:2190:mdt_object_find()) Process leaving (rc=18446612141272470360 : -131932437081256 : ffff880215f41358) 00000004:00000001:0.0:1400265787.402689:0:11385:0:(mdd_object.c:238:mdd_xattr_get()) Process entered 00000004:00000001:0.0:1400265787.402690:0:11385:0:(lod_object.c:498:lod_xattr_get()) Process entered 00000004:00000001:0.0:1400265787.402692:0:11385:0:(lod_object.c:502:lod_xattr_get()) Process leaving (rc=3728 : 3728 : e90) 00000004:00000001:0.0:1400265787.402694:0:11385:0:(mdd_object.c:257:mdd_xattr_get()) Process leaving (rc=3728 : 3728 : e90) 00000004:00000001:0.0:1400265787.402695:0:11385:0:(mdd_object.c:238:mdd_xattr_get()) Process entered 00000004:00000001:0.0:1400265787.402696:0:11385:0:(lod_object.c:498:lod_xattr_get()) Process entered 00000004:00000001:0.0:1400265787.402698:0:11385:0:(lod_object.c:502:lod_xattr_get()) Process leaving (rc=3728 : 3728 : e90) 00000004:00000001:0.0:1400265787.402699:0:11385:0:(mdd_object.c:257:mdd_xattr_get()) Process leaving (rc=3728 : 3728 : e90) 00000004:00000001:0.0:1400265787.402700:0:11385:0:(mdt_internal.h:559:mdt_object_put()) Process entered 00000020:00000001:0.0:1400265787.402701:0:11385:0:(lustre_fid.h:707:fid_flatten32()) Process leaving (rc=36962305 : 36962305 : 2340001) 00000004:00000001:0.0:1400265787.402703:0:11385:0:(mdt_internal.h:561:mdt_object_put()) Process leaving 00000020:00000010:0.0:1400265787.402704:0:11385:0:(lu_object.c:429:lu_global_key_fini()) kfreed 'info': 512 at ffff8801f657ce00. 00000020:00000010:0.0:1400265787.402706:0:11385:0:(dt_object.c:56:dt_global_key_fini()) kfreed 'info': 1216 at ffff88032a027000. 00000040:00000010:0.0:1400265787.402708:0:11385:0:(llog_obd.c:278:llog_key_fini()) kfreed 'info': 224 at ffff8802b9611ac0. 00000020:00000010:0.0:1400265787.402710:0:11385:0:(tgt_main.c:163:tgt_key_fini()) kfreed 'info': 1152 at ffff8803277a9800. 
80000000:00000010:0.0:1400265787.402712:0:11385:0:(fld_handler.c:60:fld_key_fini()) kfreed 'info': 72 at ffff880212e4a440. 40000000:00000010:0.0:1400265787.402714:0:11385:0:(fid_handler.c:406:seq_key_fini()) kfreed 'info': 48 at ffff8802128a1a00. 00040000:00000010:0.0:1400265787.402716:0:11385:0:(lquota_lib.c:53:lquota_key_fini()) kfreed 'info': 240 at ffff88029dd554c0. 00040000:00000010:0.0:1400265787.402717:0:11385:0:(qmt_dev.c:336:qmt_key_fini()) kfreed 'info': 376 at ffff8801f67b2e00. 00040000:00000010:0.0:1400265787.402719:0:11385:0:(qsd_lib.c:71:qsd_key_fini()) kfreed 'info': 376 at ffff8801f67b2600. 00000004:00000010:0.0:1400265787.402721:0:11385:0:(osd_handler.c:5533:osd_key_fini()) kfreed 'info->oti_it_ea_buf': 5120 at ffff88020f75a000. 00000004:00000010:0.0:1400265787.402723:0:11385:0:(osd_handler.c:5536:osd_key_fini()) kfreed 'info': 4400 at ffff88020f758000. 00100000:00000010:0.0:1400265787.402725:0:11385:0:(lfsck_lib.c:55:lfsck_key_fini()) kfreed 'info': 1144 at ffff88020f723800. 00000004:00000010:0.0:1400265787.402727:0:11385:0:(mdt_handler.c:5829:mdt_key_fini()) kfreed 'info': 1464 at ffff88020f723000. 00000004:00000010:0.0:1400265787.402729:0:11385:0:(lod_dev.c:819:lod_key_fini()) kfreed 'info': 336 at ffff8801f67b2c00. 00000004:00000010:0.0:1400265787.402731:0:11385:0:(mdd_device.c:1563:mdd_key_fini()) kfreed 'info': 1504 at ffff88020f724800. 00000004:00000010:0.0:1400265787.402732:0:11385:0:(osp_dev.c:1171:osp_key_fini()) kfreed 'info': 568 at ffff880212882800. 00000020:00000010:0.0:1400265787.402734:0:11385:0:(lu_object.c:1593:keys_fini()) kfreed 'ctx->lc_value': 320 at ffff880212baaa00. 00000004:00000001:0.0:1400265787.402735:0:11385:0:(mdt_lvb.c:179:mdt_lvbo_fill()) Process leaving (rc=3728 : 3728 : e90) 00010000:00000001:0.0:1400265787.402737:0:11385:0:(ldlm_lock.c:2072:ldlm_reprocess_all()) Process entered 00010000:00000001:0.0:1400265787.402738:0:11385:0:(ldlm_lock.c:1793:ldlm_reprocess_queue()) Process entered 00010000:00000001:0.0:1400265787.402739:0:11385:0:(ldlm_lock.c:1812:ldlm_reprocess_queue()) Process leaving (rc=1 : 1 : 1) 00010000:00000001:0.0:1400265787.402740:0:11385:0:(ldlm_lock.c:1793:ldlm_reprocess_queue()) Process entered 00010000:00000001:0.0:1400265787.402741:0:11385:0:(ldlm_lock.c:1812:ldlm_reprocess_queue()) Process leaving (rc=1 : 1 : 1) 00010000:00000001:0.0:1400265787.402742:0:11385:0:(ldlm_lock.c:1976:ldlm_run_ast_work()) Process leaving (rc=0 : 0 : 0) 00010000:00000001:0.0:1400265787.402743:0:11385:0:(ldlm_lock.c:2100:ldlm_reprocess_all()) Process leaving 00010000:00000001:0.0:1400265787.402744:0:11385:0:(ldlm_lock.c:211:ldlm_lock_put()) Process entered 00010000:00000001:0.0:1400265787.402745:0:11385:0:(ldlm_lock.c:244:ldlm_lock_put()) Process leaving 00010000:00010000:0.0:1400265787.402746:0:11385:0:(ldlm_lockd.c:1436:ldlm_handle_enqueue0()) ### server-side enqueue handler END (lock ffff880212a98ac0, rc 0) 00000020:00000001:0.0:1400265787.402747:0:11385:0:(tgt_handler.c:1231:tgt_enqueue()) Process leaving (rc=0 : 0 : 0) 00010000:00000040:0.0:1400265787.402749:0:11385:0:(ldlm_lib.c:2363:target_committed_to_req()) last_committed 21474836483, transno 0, xid 1466802977730748 00010000:00000001:0.0:1400265787.402750:0:11385:0:(ldlm_lib.c:2422:target_send_reply()) Process entered 00010000:00000200:0.0:1400265787.402753:0:11385:0:(ldlm_lib.c:2410:target_send_reply_msg()) @@@ sending reply req@ffff88020ff7f800 x1466802977730748/t0(0) o101->a56bcdbb-ca09-9568-70fd-b3f47d606d12@10.10.4.10@tcp:0/0 lens 376/4072 e 0 to 0 dl 1400265793 ref 1 fl 
Interpret:/0/0 rc 0/0 Reply truncated at Client side, request is resent 00000100:00000001:0.0:1400265787.406980:0:7821:0:(events.c:97:reply_in_callback()) Process entered 00000100:00000200:0.0:1400265787.406983:0:7821:0:(events.c:99:reply_in_callback()) @@@ type 2, status 0 req@ffff8802faff9400 x1466802977730748/t0(0) o101->lustre-MDT0000-mdc-ffff880331322400@10.10.4.17@tcp:12/10 lens 376/1544 e 0 to 0 dl 1400265794 ref 2 fl Rpc:/0/ffffffff rc 0/-1 00000100:00100000:0.0:1400265787.406989:0:7821:0:(events.c:126:reply_in_callback()) truncate req ffff8802faff9400 rpc 1544 - 4040+192 00000100:00000001:0.0:1400265787.406993:0:7821:0:(events.c:173:reply_in_callback()) Process leaving 00000400:00000010:0.0:1400265787.406995:0:7821:0:(lib-lnet.h:528:lnet_msg_free()) kfreed 'msg': 352 at ffff8802f7eb7400 (tot 83132940). 00000800:00000001:0.0:1400265787.406997:0:7821:0:(socklnd_cb.c:346:ksocknal_receive()) Process entered 00000800:00000001:0.0:1400265787.406999:0:7821:0:(socklnd_cb.c:384:ksocknal_receive()) Process leaving (rc=1 : 1 : 1) 00000800:00000001:0.1:1400265787.407011:0:7821:0:(socklnd_lib-linux.c:1002:ksocknal_data_ready()) Process entered 00000800:00000001:0.1:1400265787.407013:0:7821:0:(socklnd_cb.c:1566:ksocknal_read_callback()) Process entered 00000800:00000001:0.1:1400265787.407014:0:7821:0:(socklnd_cb.c:1585:ksocknal_read_callback()) Process leaving 00000800:00000001:0.1:1400265787.407015:0:7821:0:(socklnd_lib-linux.c:1017:ksocknal_data_ready()) Process leaving 00000800:00000001:0.0:1400265787.407017:0:7821:0:(socklnd_cb.c:346:ksocknal_receive()) Process entered 00000800:00000001:0.0:1400265787.407020:0:7821:0:(socklnd_cb.c:384:ksocknal_receive()) Process leaving (rc=1 : 1 : 1) 00000800:00000001:0.0:1400265787.407021:0:7821:0:(socklnd_cb.c:346:ksocknal_receive()) Process entered 00000800:00000001:0.0:1400265787.407023:0:7821:0:(socklnd_cb.c:384:ksocknal_receive()) Process leaving (rc=1 : 1 : 1) 00000100:00000001:2.0:1400265787.407086:0:2750:0:(client.c:1525:ptlrpc_check_set()) Process entered 00000400:00000001:2.0:1400265787.407089:0:2750:0:(lib-msg.c:48:lnet_build_unlink_event()) Process entered 00000400:00000001:2.0:1400265787.407090:0:2750:0:(lib-msg.c:57:lnet_build_unlink_event()) Process leaving 00000100:00000001:2.0:1400265787.407091:0:2750:0:(events.c:97:reply_in_callback()) Process entered 00000100:00000200:2.0:1400265787.407095:0:2750:0:(events.c:99:reply_in_callback()) @@@ type 6, status 0 req@ffff8802faff9400 x1466802977730748/t0(0) o101->lustre-MDT0000-mdc-ffff880331322400@10.10.4.17@tcp:12/10 lens 376/1544 e 0 to 0 dl 1400265794 ref 2 fl Rpc:R/0/ffffffff rc -75/-1 00000100:00000200:2.0:1400265787.407103:0:2750:0:(events.c:120:reply_in_callback()) @@@ unlink req@ffff8802faff9400 x1466802977730748/t0(0) o101->lustre-MDT0000-mdc-ffff880331322400@10.10.4.17@tcp:12/10 lens 376/1544 e 0 to 0 dl 1400265794 ref 2 fl Rpc:R/0/ffffffff rc -75/-1 00000100:00000001:2.0:1400265787.407109:0:2750:0:(events.c:173:reply_in_callback()) Process leaving 00000400:00000010:2.0:1400265787.407110:0:2750:0:(lib-lnet.h:508:lnet_me_free()) kfreed 'me': 96 at ffff880320454dc0 (tot 83132844). 00000400:00000200:2.0:1400265787.407112:0:2750:0:(lib-md.c:73:lnet_md_unlink()) Unlinking md ffff880320454d40 00000400:00000010:2.0:1400265787.407114:0:2750:0:(lib-lnet.h:491:lnet_md_free()) kfreed 'md': 128 at ffff880320454d40 (tot 83132716). 
00000100:00000001:2.0:1400265787.407115:0:2750:0:(client.c:2389:ptlrpc_unregister_reply()) Process leaving (rc=1 : 1 : 1) 00000100:00000001:2.0:1400265787.407117:0:2750:0:(client.c:1220:after_reply()) Process entered 02000000:00000001:2.0:1400265787.407118:0:2750:0:(sec.c:1710:sptlrpc_cli_free_repbuf()) Process entered 02000000:00000010:2.0:1400265787.407119:0:2750:0:(sec_null.c:231:null_free_repbuf()) kfreed 'req->rq_repbuf': 2048 at ffff880308075800. 02000000:00000001:2.0:1400265787.407121:0:2750:0:(sec.c:1724:sptlrpc_cli_free_repbuf()) Process leaving 00000100:00000001:2.0:1400265787.407124:0:2750:0:(client.c:1244:after_reply()) Process leaving (rc=0 : 0 : 0) 00000100:00000001:2.0:1400265787.407125:0:2750:0:(client.c:1886:ptlrpc_check_set()) Process leaving (rc=0 : 0 : 0) 00000100:00000001:2.0:1400265787.407126:0:2750:0:(client.c:1525:ptlrpc_check_set()) Process entered 00000100:00000001:2.0:1400265787.407127:0:2750:0:(client.c:2381:ptlrpc_unregister_reply()) Process leaving (rc=1 : 1 : 1) 00000100:00000001:2.0:1400265787.407129:0:2750:0:(client.c:1064:ptlrpc_import_delay_req()) Process entered 00000100:00000001:2.0:1400265787.407129:0:2750:0:(client.c:1115:ptlrpc_import_delay_req()) Process leaving (rc=0 : 0 : 0) 02000000:00000001:2.0:1400265787.407131:0:2750:0:(sec.c:659:sptlrpc_req_refresh_ctx()) Process entered 02000000:00000001:2.0:1400265787.407132:0:2750:0:(sec.c:686:sptlrpc_req_refresh_ctx()) Process leaving (rc=0 : 0 : 0) 00000100:00000001:2.0:1400265787.407133:0:2750:0:(niobuf.c:682:ptl_send_rpc()) Process entered 02000000:00000001:2.0:1400265787.407135:0:2750:0:(sec.c:961:sptlrpc_cli_wrap_request()) Process entered 02000000:00000001:2.0:1400265787.407136:0:2750:0:(sec.c:997:sptlrpc_cli_wrap_request()) Process leaving (rc=0 : 0 : 0) 02000000:00000001:2.0:1400265787.407137:0:2750:0:(sec.c:1689:sptlrpc_cli_alloc_repbuf()) Process entered 02000000:00000010:2.0:1400265787.407139:0:2750:0:(sec_null.c:217:null_alloc_repbuf()) kmalloced 'req->rq_repbuf': 8192 at ffff8803006c2000. 02000000:00000001:2.0:1400265787.407141:0:2750:0:(sec.c:1699:sptlrpc_cli_alloc_repbuf()) Process leaving (rc=0 : 0 : 0) 00000400:00000010:2.0:1400265787.407142:0:2750:0:(lib-lnet.h:500:lnet_me_alloc()) alloc '(me)': 96 at ffff880320454d40 (tot 83132812). 00000400:00000010:2.0:1400265787.407145:0:2750:0:(lib-lnet.h:468:lnet_md_alloc()) alloc '(md)': 128 at ffff880320454dc0 (tot 83132940). 00000100:00000200:2.0:1400265787.407147:0:2750:0:(niobuf.c:808:ptl_send_rpc()) Setup reply buffer: 8192 bytes, xid 1466802977730748, portal 10 00000100:00000001:2.0:1400265787.407149:0:2750:0:(client.c:2605:ptlrpc_request_addref()) Process entered 00000100:00000001:2.0:1400265787.407150:0:2750:0:(client.c:2607:ptlrpc_request_addref()) Process leaving (rc=18446612145115206656 : -131928594344960 : ffff8802faff9400) 00000100:00000040:2.0:1400265787.407153:0:2750:0:(niobuf.c:829:ptl_send_rpc()) @@@ send flg=2 req@ffff8802faff9400 x1466802977730748/t0(0) o101->lustre-MDT0000-mdc-ffff880331322400@10.10.4.17@tcp:12/10 lens 376/4232 e 0 to 0 dl 1400265794 ref 3 fl Rpc:/2/ffffffff rc 0/-1 00000100:00000001:2.0:1400265787.407159:0:2750:0:(niobuf.c:60:ptl_send_buf()) Process entered 00000100:00000040:2.0:1400265787.407161:0:2750:0:(niobuf.c:64:ptl_send_buf()) conn=ffff8803282dacc0 id 12345-10.10.4.17@tcp 00000400:00000010:2.0:1400265787.407162:0:2750:0:(lib-lnet.h:468:lnet_md_alloc()) alloc '(md)': 128 at ffff8802f8c444c0 (tot 83133068). 
00000100:00000200:2.0:1400265787.407164:0:2750:0:(niobuf.c:86:ptl_send_buf()) Sending 376 bytes to portal 12, xid 1466802977730748, offset 0 00000400:00000010:2.0:1400265787.407166:0:2750:0:(lib-lnet.h:517:lnet_msg_alloc()) alloc '(msg)': 352 at ffff8802f7244000 (tot 83133420). 00000400:00000200:2.0:1400265787.407169:0:2750:0:(lib-move.c:2176:LNetPut()) LNetPut -> 12345-10.10.4.17@tcp 00000800:00000200:2.0:1400265787.407173:0:2750:0:(socklnd_cb.c:947:ksocknal_send()) sending 376 bytes in 1 frags to 12345-10.10.4.17@tcp 00000800:00000010:2.0:1400265787.407175:0:2750:0:(socklnd_cb.c:51:ksocknal_alloc_tx()) alloc '(tx)': 232 at ffff880310ae9180 (tot 83133652). 00000800:00000200:2.0:1400265787.407178:0:2750:0:(socklnd.c:199:ksocknal_find_peer_locked()) got peer [ffff880338348380] -> 12345-10.10.4.17@tcp (5) 00000800:00000200:2.0:1400265787.407182:0:2750:0:(socklnd_cb.c:706:ksocknal_queue_tx_locked()) Sending to 12345-10.10.4.17@tcp ip 10.10.4.17:988 00000800:00000200:2.0:1400265787.407184:0:2750:0:(socklnd_cb.c:725:ksocknal_queue_tx_locked()) Packet ffff880310ae9180 type 1, nob 472 niov 2 nkiov 0 00000100:00000001:2.0:1400265787.407189:0:2750:0:(niobuf.c:101:ptl_send_buf()) Process leaving (rc=0 : 0 : 0) 00000100:00000001:2.0:1400265787.407191:0:2750:0:(niobuf.c:837:ptl_send_rpc()) Process leaving via out (rc=0 : 0 : 0x0) 00000100:00000001:2.0:1400265787.407193:0:2750:0:(client.c:1886:ptlrpc_check_set()) Process leaving (rc=1 : 1 : 1) 00000100:00000001:2.0:1400265787.407196:0:2750:0:(client.c:2058:ptlrpc_set_next_timeout()) Process entered 00000100:00000001:2.0:1400265787.407197:0:2750:0:(client.c:2095:ptlrpc_set_next_timeout()) Process leaving (rc=6 : 6 : 6) 00000100:00100000:2.0:1400265787.407198:0:2750:0:(client.c:2132:ptlrpc_set_wait()) set ffff8803287f55c0 going to sleep for 6 seconds 00000100:00000001:2.0:1400265787.407199:0:2750:0:(client.c:1525:ptlrpc_check_set()) Process entered 00000100:00000001:2.0:1400265787.407200:0:2750:0:(client.c:1886:ptlrpc_check_set()) Process leaving (rc=0 : 0 : 0) 00000100:00000001:2.0:1400265787.407201:0:2750:0:(client.c:1525:ptlrpc_check_set()) Process entered 00000100:00000001:2.0:1400265787.407202:0:2750:0:(client.c:1886:ptlrpc_check_set()) Process leaving (rc=0 : 0 : 0) 00000800:00000200:1.0:1400265787.407277:0:7820:0:(socklnd_cb.c:514:ksocknal_process_transmit()) send(0) 0 00000800:00000001:1.0:1400265787.407285:0:7820:0:(socklnd_cb.c:392:ksocknal_tx_done()) Process entered 00000800:00000010:1.0:1400265787.407286:0:7820:0:(socklnd_cb.c:106:ksocknal_free_tx()) kfreed 'tx': 232 at ffff880310ae9180 (tot 83133420). 
00000100:00000001:1.0:1400265787.407289:0:7820:0:(events.c:62:request_out_callback()) Process entered 00000100:00000200:1.0:1400265787.407293:0:7820:0:(events.c:68:request_out_callback()) @@@ type 5, status 0 req@ffff8802faff9400 x1466802977730748/t0(0) o101->lustre-MDT0000-mdc-ffff880331322400@10.10.4.17@tcp:12/10 lens 376/4232 e 0 to 0 dl 1400265794 ref 3 fl Rpc:/2/ffffffff rc 0/-1 00000100:00000001:1.0:1400265787.407300:0:7820:0:(client.c:2313:__ptlrpc_req_finished()) Process entered 00000100:00000040:1.0:1400265787.407302:0:7820:0:(client.c:2325:__ptlrpc_req_finished()) @@@ refcount now 2 req@ffff8802faff9400 x1466802977730748/t0(0) o101->lustre-MDT0000-mdc-ffff880331322400@10.10.4.17@tcp:12/10 lens 376/4232 e 0 to 0 dl 1400265794 ref 3 fl Rpc:/2/ffffffff rc 0/-1 00000100:00000001:1.0:1400265787.407307:0:7820:0:(client.c:2332:__ptlrpc_req_finished()) Process leaving (rc=0 : 0 : 0) 00000100:00000001:1.0:1400265787.407309:0:7820:0:(events.c:87:request_out_callback()) Process leaving Server detects resend, but since not expected for layout-lock, handles again, hence the LBUG 00000100:00000001:1.0:1400265787.403522:0:11347:0:(events.c:296:request_in_callback()) Process entered 00000100:00000200:1.0:1400265787.403523:0:11347:0:(events.c:306:request_in_callback()) event type 2, status 0, service mdt00000100:00000010:1.0:1400265787.403525:0:11347:0:(client.c:410:ptlrpc_request_cache_alloc()) slab-alloced 'req': 976 at ffff880214c69000. 00000100:00000040:1.0:1400265787.403526:0:11347:0:(events.c:350:request_in_callback()) incoming req@ffff880214c69000 x1466802977730748 msgsize 376 00000100:00100000:1.0:1400265787.403529:0:11347:0:(events.c:352:request_in_callback()) peer: 12345-10.10.4.10@tcp 00000100:00000001:1.0:1400265787.403533:0:11347:0:(events.c:385:request_in_callback()) Process leaving00000400:00000010:1.0:1400265787.403534:0:11347:0:(lib-lnet.h:528:lnet_msg_free()) kfreed 'msg': 352 at ffff880212baaa00 (tot 117245148). 
00000800:00000001:1.0:1400265787.403536:0:11347:0:(socklnd_cb.c:346:ksocknal_receive()) Process entered 00000800:00000001:1.0:1400265787.403538:0:11347:0:(socklnd_cb.c:384:ksocknal_receive()) Process leaving (rc=1 : 1 : 1) 00000400:00000001:0.0:1400265787.403585:0:11385:0:(watchdog.c:416:lc_watchdog_touch()) Process entered 00000400:00000001:0.0:1400265787.403587:0:11385:0:(watchdog.c:427:lc_watchdog_touch()) Process leaving 00000100:00000001:0.0:1400265787.403590:0:11385:0:(service.c:1860:ptlrpc_server_handle_req_in()) Process entered 02000000:00000001:0.0:1400265787.403591:0:11385:0:(sec.c:2053:sptlrpc_svc_unwrap_request()) Process entered 00000100:00000001:0.0:1400265787.403592:0:11385:0:(pack_generic.c:582:__lustre_unpack_msg()) Process entered 00000100:00000001:0.0:1400265787.403593:0:11385:0:(pack_generic.c:601:__lustre_unpack_msg()) Process leaving (rc=0 : 0 : 0) 02000000:00000001:0.0:1400265787.403595:0:11385:0:(sec.c:2100:sptlrpc_svc_unwrap_request()) Process leaving (rc=0 : 0 : 0) 00000100:00100000:0.0:1400265787.403597:0:11385:0:(service.c:1939:ptlrpc_server_handle_req_in()) got req x1466802977730748 00000020:00000001:0.0:1400265787.403598:0:11385:0:(genops.c:741:class_conn2export()) Process entered 00000020:00000040:0.0:1400265787.403599:0:11385:0:(genops.c:753:class_conn2export()) looking for export cookie 0xd169abc5abfab541 00000020:00000001:0.0:1400265787.403600:0:11385:0:(lustre_handles.c:172:class_handle2object()) Process entered 00000020:00000040:0.0:1400265787.403601:0:11385:0:(genops.c:839:class_export_get()) GETting export ffff880212978800 : new refcount 13 00000020:00000001:0.0:1400265787.403603:0:11385:0:(lustre_handles.c:195:class_handle2object()) Process leaving (rc=18446612141216073728 : -131932493477888 : ffff880212978800) 00000020:00000001:0.0:1400265787.403605:0:11385:0:(genops.c:755:class_conn2export()) Process leaving (rc=18446612141216073728 : -131932493477888 : ffff880212978800) 00000100:00000001:0.0:1400265787.403607:0:11385:0:(service.c:1068:ptlrpc_update_export_timer()) Process entered 00000100:00000001:0.0:1400265787.403608:0:11385:0:(service.c:1081:ptlrpc_update_export_timer()) Process leaving 00000020:00000010:0.0:1400265787.403609:0:11385:0:(lu_object.c:1643:keys_init()) kmalloced 'ctx->lc_value': 320 at ffff880268932c00. 02000000:00000010:0.0:1400265787.403611:0:11385:0:(capa.c:416:lu_capainfo_key_init()) kmalloced 'value': 128 at ffff880213336bc0. 00000020:00000010:0.0:1400265787.403612:0:11385:0:(lu_ucred.c:53:lu_ucred_key_init()) kmalloced 'value': 72 at ffff880212f7cac0. 00000020:00000010:0.0:1400265787.403614:0:11385:0:(tgt_main.c:187:tgt_ses_key_init()) kmalloced 'value': 144 at ffff880215e7a740. 
00000100:00000040:0.0:1400265787.403616:0:11385:0:(service.c:1203:ptlrpc_at_set_timer()) armed mdt at +1s 00000100:00000001:0.0:1400265787.403618:0:11385:0:(service.c:1670:ptlrpc_server_request_add()) Process entered 00000100:00000001:0.0:1400265787.403618:0:11385:0:(service.c:1571:ptlrpc_server_hpreq_init()) Process entered 00000100:00000001:0.0:1400265787.403620:0:11385:0:(service.c:1622:ptlrpc_server_hpreq_init()) Process leaving (rc=0 : 0 : 0) 00000100:00000001:0.0:1400265787.403621:0:11385:0:(service.c:1685:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0) 00000100:00000001:0.0:1400265787.403625:0:11385:0:(service.c:2002:ptlrpc_server_handle_req_in()) Process leaving (rc=1 : 1 : 1) 00000400:00000001:0.0:1400265787.403627:0:11385:0:(watchdog.c:433:lc_watchdog_disable()) Process entered 00000400:00000001:0.0:1400265787.403628:0:11385:0:(watchdog.c:441:lc_watchdog_disable()) Process leaving 00000400:00000001:0.0:1400265787.403629:0:11385:0:(watchdog.c:416:lc_watchdog_touch()) Process entered 00000400:00000001:0.0:1400265787.403630:0:11385:0:(watchdog.c:427:lc_watchdog_touch()) Process leaving 00000100:00000001:0.0:1400265787.403631:0:11385:0:(service.c:2025:ptlrpc_server_handle_request()) Process entered 00000100:00000001:0.0:1400265787.403632:0:11385:0:(service.c:1803:ptlrpc_server_request_get()) Process entered 00000100:00100000:0.0:1400265787.403634:0:11385:0:(nrs_fifo.c:182:nrs_fifo_req_get()) NRS start fifo request from 12345-10.10.4.10@tcp, seq: 173 00000100:00000040:0.0:1400265787.403636:0:11385:0:(service.c:1841:ptlrpc_server_request_get()) RPC GETting export ffff880212978800 : new rpc_count 1 00000100:00000001:0.0:1400265787.403637:0:11385:0:(service.c:1843:ptlrpc_server_request_get()) Process leaving (rc=18446612141252710400 : -131932456841216 : ffff880214c69000) 00000100:00000040:0.0:1400265787.403640:0:11385:0:(lustre_net.h:3296:ptlrpc_rqphase_move()) @@@ move req "New" -> "Interpret" req@ffff880214c69000 x1466802977730748/t0(0) o101->a56bcdbb-ca09-9568-70fd-b3f47d606d12@10.10.4.10@tcp:0/0 lens 376/0 e 0 to 0 dl 1400265793 ref 1 fl New:/2/ffffffff rc 0/-1 00000100:00000001:0.0:1400265787.403652:0:11385:0:(service.c:1068:ptlrpc_update_export_timer()) Process entered 00000100:00000001:0.0:1400265787.403653:0:11385:0:(service.c:1081:ptlrpc_update_export_timer()) Process leaving 00000100:00100000:0.0:1400265787.403655:0:11385:0:(service.c:2086:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_002:a56bcdbb-ca09-9568-70fd-b3f47d606d12+13:2750:x1466802977730748:12345-10.10.4.10@tcp:101 00000100:00000200:0.0:1400265787.403657:0:11385:0:(service.c:2091:ptlrpc_server_handle_request()) got req 1466802977730748 00000020:00000001:0.0:1400265787.403659:0:11385:0:(tgt_handler.c:589:tgt_request_handle()) Process entered 00000020:00000001:0.0:1400265787.403660:0:11385:0:(tgt_handler.c:553:tgt_handler_find_check()) Process entered 00000020:00000001:0.0:1400265787.403661:0:11385:0:(tgt_handler.c:575:tgt_handler_find_check()) Process leaving (rc=18446744072111134752 : -1598416864 : ffffffffa0ba1820) 00000020:00000001:0.0:1400265787.403663:0:11385:0:(tgt_handler.c:492:tgt_handle_recovery()) Process entered 00000020:00000001:0.0:1400265787.403664:0:11385:0:(tgt_handler.c:542:tgt_handle_recovery()) Process leaving (rc=1 : 1 : 1) 00000020:00000001:0.0:1400265787.403666:0:11385:0:(tgt_handler.c:377:tgt_handle_request0()) Process entered 00000020:00000001:0.0:1400265787.403667:0:11385:0:(tgt_handler.c:307:tgt_request_preprocess()) Process entered 
00000020:00000001:0.0:1400265787.403668:0:11385:0:(tgt_handler.c:362:tgt_request_preprocess()) Process leaving (rc=0 : 0 : 0) 00000020:00000001:0.0:1400265787.403670:0:11385:0:(tgt_handler.c:1220:tgt_enqueue()) Process entered 00010000:00000001:0.0:1400265787.403671:0:11385:0:(ldlm_lockd.c:1165:ldlm_handle_enqueue0()) Process entered 00010000:00010000:0.0:1400265787.403671:0:11385:0:(ldlm_lockd.c:1167:ldlm_handle_enqueue0()) ### server-side enqueue handler START 00010000:00000001:0.0:1400265787.403673:0:11385:0:(ldlm_lockd.c:1556:ldlm_request_cancel()) Process entered 00010000:00000001:0.0:1400265787.403673:0:11385:0:(ldlm_lockd.c:1560:ldlm_request_cancel()) Process leaving (rc=0 : 0 : 0) 00010000:00010000:0.0:1400265787.403677:0:11385:0:(ldlm_lockd.c:1241:ldlm_handle_enqueue0()) @@@ found existing lock cookie 0xd169abc5abfab66e req@ffff880214c69000 x1466802977730748/t0(0) o101->a56bcdbb-ca09-9568-70fd-b3f47d606d12@10.10.4.10@tcp:0/0 lens 376/0 e 0 to 0 dl 1400265793 ref 1 fl Interpret:/2/ffffffff rc 0/-1 00010000:00000001:0.0:1400265787.403683:0:11385:0:(ldlm_lockd.c:1242:ldlm_handle_enqueue0()) Process leaving via existing_lock (rc=0 : 0 : 0x0) 00010000:00000001:0.0:1400265787.403684:0:11385:0:(ldlm_lock.c:1666:ldlm_lock_enqueue()) Process entered 00000004:00000001:0.0:1400265787.403685:0:11385:0:(mdt_handler.c:3497:mdt_intent_policy()) Process entered 00000004:00000001:0.0:1400265787.403687:0:11385:0:(mdt_handler.c:3434:mdt_intent_opc()) Process entered 00000004:00000001:0.0:1400265787.403688:0:11385:0:(mdt_handler.c:2735:mdt_unpack_req_pack_rep()) Process entered 00000004:00000001:0.0:1400265787.403689:0:11385:0:(mdt_handler.c:2753:mdt_unpack_req_pack_rep()) Process leaving (rc=0 : 0 : 0) 00000004:00000001:0.0:1400265787.403690:0:11385:0:(mdt_handler.c:3247:mdt_intent_layout()) Process entered 00000004:00000001:0.0:1400265787.403691:0:11385:0:(mdt_handler.c:2181:mdt_object_find()) Process entered 00000004:00000040:0.0:1400265787.403692:0:11385:0:(mdt_handler.c:2183:mdt_object_find()) Find object for [0x200002340:0x1:0x0] 00000020:00000001:0.0:1400265787.403693:0:11385:0:(lustre_fid.h:707:fid_flatten32()) Process leaving (rc=36962305 : 36962305 : 2340001) 00000020:00000001:0.0:1400265787.403695:0:11385:0:(lustre_fid.h:707:fid_flatten32()) Process leaving (rc=36962305 : 36962305 : 2340001) 00000004:00000001:0.0:1400265787.403697:0:11385:0:(mdt_handler.c:2190:mdt_object_find()) Process leaving (rc=18446612141272470360 : -131932437081256 : ffff880215f41358) 00000004:00000001:0.0:1400265787.403699:0:11385:0:(mdd_object.c:238:mdd_xattr_get()) Process entered 00000004:00000001:0.0:1400265787.403700:0:11385:0:(lod_object.c:498:lod_xattr_get()) Process entered 00000004:00000001:0.0:1400265787.403704:0:11385:0:(lod_object.c:502:lod_xattr_get()) Process leaving (rc=3728 : 3728 : e90) 00000004:00000001:0.0:1400265787.403706:0:11385:0:(mdd_object.c:257:mdd_xattr_get()) Process leaving (rc=3728 : 3728 : e90) 00000004:00000001:0.0:1400265787.403707:0:11385:0:(mdt_internal.h:559:mdt_object_put()) Process entered 00000020:00000001:0.0:1400265787.403708:0:11385:0:(lustre_fid.h:707:fid_flatten32()) Process leaving (rc=36962305 : 36962305 : 2340001) 00000004:00000001:0.0:1400265787.403710:0:11385:0:(mdt_internal.h:561:mdt_object_put()) Process leaving 00000100:00000001:0.0:1400265787.403711:0:11385:0:(pack_generic.c:337:lustre_pack_reply_v2()) Process entered 02000000:00000001:0.0:1400265787.403712:0:11385:0:(sec.c:2113:sptlrpc_svc_alloc_rs()) Process entered 
02000000:00000010:0.0:1400265787.403714:0:11385:0:(sec_null.c:320:null_alloc_rs()) kmalloced 'rs': 4336 at ffff88020f758000. 02000000:00000001:0.0:1400265787.403715:0:11385:0:(sec.c:2149:sptlrpc_svc_alloc_rs()) Process leaving (rc=0 : 0 : 0) 00000100:00000001:0.0:1400265787.403717:0:11385:0:(pack_generic.c:371:lustre_pack_reply_v2()) Process leaving (rc=0 : 0 : 0) 00000004:00000001:0.0:1400265787.403718:0:11385:0:(mdt_handler.c:3284:mdt_intent_layout()) Process leaving (rc=0 : 0 : 0) 00000004:00000001:0.0:1400265787.403719:0:11385:0:(mdt_handler.c:3483:mdt_intent_opc()) Process leaving (rc=0 : 0 : 0) 00000004:00000001:0.0:1400265787.403721:0:11385:0:(mdt_handler.c:3536:mdt_intent_policy()) Process leaving (rc=0 : 0 : 0) 00010000:00000001:0.0:1400265787.403722:0:11385:0:(ldlm_inodebits.c:188:ldlm_process_inodebits_lock()) Process entered 00010000:00040000:0.0:1400265787.403723:0:11385:0:(ldlm_inodebits.c:190:ldlm_process_inodebits_lock()) ASSERTION( lock->l_granted_mode != lock->l_req_mode ) failed: 00010000:00000010:0.1:1400265787.415728:0:11385:0:(ldlm_lock.c:453:lock_handle_free()) slab-freed 'lock': 504 at ffff8802107f1700. 00010000:00040000:0.0:1400265787.415731:0:11385:0:(ldlm_inodebits.c:190:ldlm_process_inodebits_lock()) LBUG 00000400:00000001:0.0:1400265787.503768:0:11385:0:(linux-debug.c:151:libcfs_run_lbug_upcall()) Process entered 00000400:00000001:0.0:1400265787.503770:0:11385:0:(linux-debug.c:118:libcfs_run_upcall()) Process entered 00010000:00000010:0.1:1400265787.504023:0:0:0:(ldlm_lock.c:453:lock_handle_free()) slab-freed 'lock': 504 at ffff880212a986c0. 00000400:00080000:3.0:1400265787.504316:0:11385:0:(linux-debug.c:142:libcfs_run_upcall()) Invoked LNET upcall /usr/lib/lustre/lnet_upcall LBUG,/var/lib/jenkins/workspace/lustre-reviews/arch/x86_64/build_type/server/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-2.5.58/lustre/ptlrpc/../../lustre/ldlm/ldlm_inodebits.c,ldlm_process_inodebits_lock,190 I have an add-on for current Gerrit-change |
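To make the failure mode in the trace above easier to follow: the first enqueue creates and grants the CR layout lock (cookie 0xd169abc5abfab66e), the reply is truncated on the client and the request is resent, and on the resend ldlm_handle_enqueue0() does find the existing lock and takes the existing_lock path, but the intent code still feeds the already-granted lock back into ldlm_process_inodebits_lock(), which asserts that the lock is not yet granted (l_granted_mode != l_req_mode) and LBUGs. Below is a minimal userspace sketch of that sequence and of the kind of resend guard the fix needs; everything in it (toy_lock, handle_enqueue, process_lock) is invented for illustration and does not reflect the actual Gerrit patch.

/*
 * Toy userspace model of the resend path shown in the log; none of these
 * types or names come from the Lustre tree.  "req_mode"/"granted_mode"
 * stand in for l_req_mode/l_granted_mode, and "resent" for MSG_RESENT.
 */
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

struct toy_lock {
	int  req_mode;      /* requested mode (e.g. CR)          */
	int  granted_mode;  /* 0 until the lock has been granted */
	bool resent;        /* request carried MSG_RESENT        */
};

/* Models ldlm_process_inodebits_lock(): only valid for ungranted locks. */
static void process_lock(struct toy_lock *lk)
{
	assert(lk->granted_mode != lk->req_mode);  /* the LBUG in the log */
	lk->granted_mode = lk->req_mode;           /* grant the lock      */
}

/* Models the enqueue handler with a resend guard in front of processing. */
static void handle_enqueue(struct toy_lock *lk)
{
	if (lk->resent && lk->granted_mode == lk->req_mode) {
		/* Already granted by the first pass: just repack the reply. */
		printf("resend: reuse granted lock, resend reply\n");
		return;
	}
	process_lock(lk);
	printf("granted, reply sent\n");
}

int main(void)
{
	struct toy_lock lk = { .req_mode = 1 /* CR */ };

	handle_enqueue(&lk);   /* first enqueue: lock gets granted         */
	/* ... reply truncated on the client, request is resent ...        */
	lk.resent = true;
	handle_enqueue(&lk);   /* without the guard, the assert would fire */
	return 0;
}

Run as-is, the program grants the lock once and short-circuits the resend; dropping the guard in handle_enqueue() reproduces, in miniature, the ASSERTION( lock->l_granted_mode != lock->l_req_mode ) seen at ldlm_inodebits.c:190.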
| Comment by James A Simmons [ 29/May/14 ] |
|
Here is the b2_5 version of the patch I'm testing with: |
| Comment by James A Simmons [ 29/May/14 ] |
|
I tested it first on my smallest test bed, and the two patches that I combined and back-ported to b2_5 did pretty well. So I moved the test to a larger system, and when I went to run my simul reproducer I now get this:
May 29 18:27:03 atlas-tds-mds1 kernel: [10809.308755] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
It's better in that the MDS doesn't assert any more. I also canceled the job and the system recovered, which is a good sign.
| Comment by James A Simmons [ 29/May/14 ] |
|
An IOR single-shared-file run is producing the messages below. I have also uploaded log dumps to ftp.whamcloud.com/uploads/
May 29 18:47:39 atlas-tds-mds1 kernel: [12046.175064] Lustre: 17065:0:(service.c:1349:ptlrpc_at_send_early_reply()) Skipped 8 previous similar messages
| Comment by Bruno Faccini (Inactive) [ 03/Jun/14 ] |
|
I am afraid I am not able to conclude what is going wrong from this MDS Lustre debug log alone. Are there still similar stacks being dumped in dmesg/syslog? Also, I did not find any eviction trace in this log.
| Comment by James A Simmons [ 03/Jun/14 ] |
|
With the patches the evictions are gone, which is good. Now it's just locking up.
| Comment by James A Simmons [ 03/Jun/14 ] |
|
I tested master with the patches as well and got the same results. |
| Comment by Jodi Levi (Inactive) [ 18/Jun/14 ] |
|
http://review.whamcloud.com/#/c/5978/ |
| Comment by Peter Jones [ 20/Jun/14 ] |
|
Landed for 2.6 |
| Comment by Bruno Faccini (Inactive) [ 30/Jun/14 ] |
|
The merged b2_4 backport of both the #5978 and #10378 master changes for this ticket is at http://review.whamcloud.com/10902.
| Comment by Christopher Morrone [ 30/Jun/14 ] |
|
We would really, really prefer that you guys not merge together multiple patches when backporting. That makes sanity checking and rebasing quite a bit more complicated for us. |
| Comment by Bruno Faccini (Inactive) [ 01/Jul/14 ] |
|
The merge was done to avoid repeating the miss in the original patch, where mdt_intent_layout() had been forgotten because it was not present in the Xyratex source tree at the time. BTW, my b2_4 patch/back-port has a problem and needs some rework, because the MDS bombs with "(ldlm_lock.c:851:ldlm_lock_decref_internal_nolock()) ASSERTION( lock->l_readers > 0 ) failed" when running the LLNL reproducer from
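For reference, ldlm_lock_decref_internal_nolock() drops a reference on the lock in the given mode, so the quoted assertion fires when a read-mode reference is dropped that was never taken (or was already dropped), i.e. a mismatched addref/decref pair introduced somewhere in the back-port. A purely illustrative toy of the invariant it checks, with invented names and no relation to the real Lustre call paths:

#include <assert.h>

/* Toy lock with only the reader count the assertion above is about. */
struct toy_lock {
	int readers;
};

static void toy_addref_read(struct toy_lock *lk)
{
	lk->readers++;
}

static void toy_decref_read(struct toy_lock *lk)
{
	assert(lk->readers > 0);   /* mirrors ASSERTION( lock->l_readers > 0 ) */
	lk->readers--;
}

int main(void)
{
	struct toy_lock lk = { 0 };

	toy_addref_read(&lk);
	toy_decref_read(&lk);      /* balanced pair: fine                       */
	/* toy_decref_read(&lk);      a second, unpaired decref trips the assert */
	return 0;
}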
| Comment by Bruno Faccini (Inactive) [ 09/Oct/14 ] |
|
I should have commented this before, sorry. |
| Comment by Antoine Percher [ 11/Feb/15 ] |
| Comment by Bruno Faccini (Inactive) [ 11/Feb/15 ] |
|
Antoine, |
| Comment by Gerrit Updater [ 13/Mar/15 ] |
|
Jian Yu (jian.yu@intel.com) uploaded a new patch: http://review.whamcloud.com/14056 |