[LU-4112] Random eviction of clients on lock callback timeout Created: 16/Oct/13 Updated: 14/Aug/16 Resolved: 14/Aug/16 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Sebastien Buisson (Inactive) | Assignee: | Bruno Faccini (Inactive) |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 11068 |
| Description |
|
Hi, At Meteo France, Lustre clients are randomly evicted with the following message:
1381739209 2013 Oct 14 08:26:49 beaufixoss9 kern err kernel LustreError: 0:0:(ldlm_lockd.c:358:waiting_locks_callback()) ### lock callback timer expired after 101s: evicting client at 172.23.67.157@o2ib ns: filter-scratch-OST0024_UUID lock: ffff880215b3c6c0/0xe3c417cfbe584025 lrc: 3/0,0 mode: PW/PW res: 87670641/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x20 remote: 0x964c642755ae9fd expref: 4 pid: 20133 timeout 4581190301
Some attempts were made to increase the ldlm_timeout, but even with a timeout of 300 seconds, the timer finally expires and evicts the client. After collecting some debug logs, it seems that there is a communication problem: the client sends the lock callback RPC, but it is never received by the OSS, leading it to time out. Here is an example:
[root@beaufixmngt0 beaufix1046]# grep evict lustre-logs-20131011-082535
172.23.80.69 = beaufixoss15-ib1
[root@beaufixmngt0 beaufix1046]# head -n1 ../beaufixoss15/lustre-logs-20131011-082458
OSS debug log covers from 1381479186 to 1381479898
Here is the list of RPCs sent during this time (total of 5 RPCs):
[root@beaufixmngt0 beaufix1046]# grep -f requests ../beaufixoss15/lustre-logs-20131011-082458 | grep incoming
-> Only 1 RPC received out of 5 sent by the clients
This example only shows what happens on top of LNet, but we see no error at the Infiniband level yet. This is a blocking problem, as it is preventing the cluster from being used for production. I will upload traces to help the analysis. Thanks for your help, |
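For reference, a minimal sketch of how the lock callback timeout mentioned above is typically inspected and raised on the servers (assuming the usual /proc/sys/lustre interface of Lustre 2.1.x; the 300 s value is the one quoted above, not a recommendation):
  # on the OSS/MDS
  cat /proc/sys/lustre/ldlm_timeout
  echo 300 > /proc/sys/lustre/ldlm_timeout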
| Comments |
| Comment by Sebastien Buisson (Inactive) [ 16/Oct/13 ] |
|
Here are the logs. To avoid the 10 MB file size limitation, the .tar.bz2 file has been cut with 'split'. beaufixoss15 is the OSS, and beaufix1046 is the client. |
| Comment by Bruno Faccini (Inactive) [ 16/Oct/13 ] |
|
Hello Seb, |
| Comment by Sebastien Buisson (Inactive) [ 16/Oct/13 ] |
|
Hi Bruno, I do not know if LNET routers are involved in that particular path between beaufixoss15 and beaufix1046. I will ask the Support team, and also ask for the IB network topology and the debug mask used. Thanks, |
| Comment by Sebastien Buisson (Inactive) [ 16/Oct/13 ] |
|
Here is the network information you requested. There are no LNet routers in this configuration. Cheers, |
| Comment by Sebastien Buisson (Inactive) [ 16/Oct/13 ] |
|
And for your question regarding the debug level:
echo -1 > /proc/sys/lnet/debug
echo -1 > /proc/sys/lnet/subsystem_debug |
| Comment by Sebastien Piechurski [ 16/Oct/13 ] |
|
In case this eases the investigation, I am currently uploading a reproducer for this issue to the ftp site under uploads/
in launch_run.sh:
The issue was reproduced with 8 instances of this job, running on a total of 32 nodes. I don't know yet if it shows with fewer. It might be worth noting that I could reproduce it on a 2.4.1 version as well. |
| Comment by Sebastien Buisson (Inactive) [ 16/Oct/13 ] |
|
Apart from that, I think priority could be lowered to Critical. |
| Comment by Bruno Faccini (Inactive) [ 16/Oct/13 ] |
|
Thanks for all of this information and these hints; I will try to update you soon. |
| Comment by Bruno Faccini (Inactive) [ 17/Oct/13 ] |
|
I agree that the Lustre debug logs provided seem to indicate that some requests (even OBD_PINGs) from client beaufix1046 never reach OSS beaufixoss15-ib1 ... BTW, it seems to me that these requests don't even have their send acknowledged by the local Portal/IB layers, since some of the corresponding request_out_callback() traces are missing !! So unless tracing is buggy/lazy, it looks like LNet/IB networking does not work at all, at least between some node pairs. Did you simply check that IB/LNet work between the problematic, identified pairs (like beaufix1046 and beaufixoss15-ib1) with "lnet ping", IPoIB ping, and low-level IB diags/commands ? Concerning the reproducer, and regarding its pre-requisites (Intel compilers/run-time+MPI+MKL), even at Intel it may take some time for me to set up such a test platform ... Is it possible to use alternative/free products ? |
| Comment by Sebastien Buisson (Inactive) [ 24/Oct/13 ] |
|
Hi Bruno, After detailed analysis of the arpege reproducer, it turned out that the issue seems related to ldlm locks. To work around this, the Support team introduced an 'lctl set_param ldlm.namespaces.osc.lru_size=0' between the copy of the input files and the beginning of the job itself. This way, the locks are given back prior to the start of the MPI job, and no reclaim occurs during the job itself. We now need to understand why the timeout occurs. The Support team is currently trying to write a home-made reproducer that mimics the behavior of the MPI job, to provide it to you and better understand what exactly triggers the problem. Cheers, |
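For illustration only, a sketch of where this work-around sits in the job script; the staging command, directory variables and job name are assumptions, only the lctl line is the one quoted above:
  cp $INPUT_DIR/* $WORK_DIR/                        # stage the input files
  lctl set_param ldlm.namespaces.osc.lru_size=0     # give back client ldlm locks before the run
  mpirun ... ./arpege_job                           # MPI job then runs without lock reclaim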
| Comment by Bruno Faccini (Inactive) [ 24/Oct/13 ] |
|
Thanks for this additional analysis and the details !! Do you think we may be facing an issue caused by too much lock grant/revoke traffic, or rather some tricky locking pattern ? BTW, do you already know if the MPI job tasks use extent locks (which does not seem to be the case in the lustre logs you already provided ...) ? And also many thanks for working on a new reproducer, because the first one is very difficult to set up outside of its original environment. |
| Comment by Sebastien Piechurski [ 25/Oct/13 ] |
|
Here is a reproducer for the issue described. I was able to reproduce it with 96 MPI tasks across 2 nodes, but it is best reproduced with 96 tasks across 4 nodes. I tried to strip out as many things as I could while still reproducing it.
Note that when removing either the stat() call or the lseek() calls, the issue disappears. So it looks like these induce a timing pattern that makes it possible to fall into the issue. |
| Comment by Bruno Faccini (Inactive) [ 28/Oct/13 ] |
|
Thanks for providing this new reproducer; I am currently setting up a test platform to run it. |
| Comment by Bruno Faccini (Inactive) [ 29/Oct/13 ] |
|
The reproducer is now running on a platform set up as suggested. Will let you know of any results. |
| Comment by Bruno Faccini (Inactive) [ 29/Oct/13 ] |
|
Hummm, I just faced a hang situation with following Console msgs/stack being printed regularly : BUG: soft lockup - CPU#3 stuck for 67s! [LU-4112-reprodu:15078] Modules linked in: nfs fscache lmv(U) mgc(U) lustre(U) lov(U) osc(U) lquota(U) mdc(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc cpufreq_on demand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 mlx4_ib ib_sa ib_mad ib_core mlx4_en mlx4_core igb ptp pps_core microcode serio_raw i2c_i801 i2c_core sg iTCO_wd t iTCO_vendor_support i7core_edac edac_core ioatdma dca shpchp ext3 jbd mbcache sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan] CPU 3 Modules linked in: nfs fscache lmv(U) mgc(U) lustre(U) lov(U) osc(U) lquota(U) mdc(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc cpufreq_on demand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 mlx4_ib ib_sa ib_mad ib_core mlx4_en mlx4_core igb ptp pps_core microcode serio_raw i2c_i801 i2c_core sg iTCO_wd t iTCO_vendor_support i7core_edac edac_core ioatdma dca shpchp ext3 jbd mbcache sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan] Pid: 15078, comm: LU-4112-reprodu Not tainted 2.6.32-358.11.1.el6_lustre.g33d0394.x86_64 #1 Supermicro X8DTT/X8DTT RIP: 0010:[<ffffffff815107ce>] [<ffffffff815107ce>] _spin_lock+0x1e/0x30 RSP: 0018:ffff880305cdf988 EFLAGS: 00000297 RAX: 0000000000009f99 RBX: ffff880305cdf988 RCX: 0000000000000000 RDX: 0000000000009f98 RSI: ffff8802d5a180c8 RDI: ffff880320986774 RBP: ffffffff8100bb8e R08: 0000000000000000 R09: 0000000000000001 R10: 0000000000000000 R11: 0000000000000000 R12: 00000000526fce72 R13: 0000000000000000 R14: 0000000000000000 R15: ffff8802f3f4b2b8 FS: 00007f47e6ebb700(0000) GS:ffff880032e60000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007f8b043f9018 CR3: 00000002e1fd1000 CR4: 00000000000007e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process LU-4112-reprodu (pid: 15078, threadinfo ffff880305cde000, task ffff8802f4998040) Stack: ffff880305cdfa28 ffffffffa08bf50c ffff8802d5a262b8 ffff8802d5a181b8 <d> ffff8802d5a180c8 ffff8802d58c00a8 ffff8802d5a18180 ffff8802d5a18160 Call Trace: [<ffffffffa08bf50c>] ? osc_io_submit+0x31c/0x540 [osc] [<ffffffffa04bc128>] ? cl_io_submit_rw+0x78/0x180 [obdclass] [<ffffffffa093ca75>] ? lov_io_submit+0x355/0xc30 [lov] [<ffffffffa04bc128>] ? cl_io_submit_rw+0x78/0x180 [obdclass] [<ffffffffa04be851>] ? cl_io_read_page+0xb1/0x170 [obdclass] [<ffffffffa04b2909>] ? cl_page_assume+0xf9/0x2d0 [obdclass] [<ffffffffa09f4676>] ? ll_readpage+0x96/0x200 [lustre] [<ffffffff81096ce0>] ? wake_bit_function+0x0/0x50 [<ffffffff8111b6fc>] ? generic_file_aio_read+0x1fc/0x700 [<ffffffffa0a1bcab>] ? vvp_io_read_start+0x13b/0x3e0 [lustre] [<ffffffffa04bc29a>] ? cl_io_start+0x6a/0x140 [obdclass] [<ffffffffa04c0adc>] ? cl_io_loop+0xcc/0x190 [obdclass] [<ffffffffa09cc047>] ? ll_file_io_generic+0x3a7/0x560 [lustre] [<ffffffffa09cc339>] ? ll_file_aio_read+0x139/0x2c0 [lustre] [<ffffffffa09cc7f9>] ? ll_file_read+0x169/0x2a0 [lustre] [<ffffffff81181805>] ? vfs_read+0xb5/0x1a0 [<ffffffff81181941>] ? sys_read+0x51/0x90 [<ffffffff810dc595>] ? 
__audit_syscall_exit+0x265/0x290 [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b Code: 00 00 00 01 74 05 e8 62 30 d7 ff c9 c3 55 48 89 e5 0f 1f 44 00 00 b8 00 00 01 00 f0 0f c1 07 0f b7 d0 c1 e8 10 39 c2 74 0e f3 90 <0f> b7 17 eb f5 83 3f 00 75 f4 eb df c9 c3 0f 1f 40 00 55 48 89 Call Trace: [<ffffffffa08be9d5>] ? osc_cl_page_osc+0x15/0x60 [osc] [<ffffffffa08bf50c>] ? osc_io_submit+0x31c/0x540 [osc] [<ffffffffa04bc128>] ? cl_io_submit_rw+0x78/0x180 [obdclass] [<ffffffffa093ca75>] ? lov_io_submit+0x355/0xc30 [lov] [<ffffffffa04bc128>] ? cl_io_submit_rw+0x78/0x180 [obdclass] [<ffffffffa04be851>] ? cl_io_read_page+0xb1/0x170 [obdclass] [<ffffffffa04b2909>] ? cl_page_assume+0xf9/0x2d0 [obdclass] [<ffffffffa09f4676>] ? ll_readpage+0x96/0x200 [lustre] [<ffffffff81096ce0>] ? wake_bit_function+0x0/0x50 [<ffffffff8111b6fc>] ? generic_file_aio_read+0x1fc/0x700 [<ffffffffa0a1bcab>] ? vvp_io_read_start+0x13b/0x3e0 [lustre] [<ffffffffa04bc29a>] ? cl_io_start+0x6a/0x140 [obdclass] [<ffffffffa04c0adc>] ? cl_io_loop+0xcc/0x190 [obdclass] [<ffffffffa09cc047>] ? ll_file_io_generic+0x3a7/0x560 [lustre] [<ffffffffa09cc339>] ? ll_file_aio_read+0x139/0x2c0 [lustre] [<ffffffffa09cc7f9>] ? ll_file_read+0x169/0x2a0 [lustre] [<ffffffff81181805>] ? vfs_read+0xb5/0x1a0 [<ffffffff81181941>] ? sys_read+0x51/0x90 [<ffffffff810dc595>] ? __audit_syscall_exit+0x265/0x290 [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b And according to forced crash-dump analysis, this is due to : PID: 7051 TASK: ffff8803346ca080 CPU: 3 COMMAND: "ptlrpcd-brw"
#0 [ffff88032812ba30] schedule at ffffffff8150de92 <<<<< but yield from CPU !!!
#1 [ffff88032812baf8] __cond_resched at ffffffff81064d6a
#2 [ffff88032812bb18] _cond_resched at ffffffff8150e8a0
#3 [ffff88032812bb28] cl_lock_put at ffffffffa04b85d1 [obdclass]
#4 [ffff88032812bb68] osc_page_putref_lock at ffffffffa08ba248 [osc]
#5 [ffff88032812bb98] osc_page_completion_read at ffffffffa08ba2b8 [osc]
#6 [ffff88032812bba8] cl_page_completion at ffffffffa04b1fdf [obdclass]
#7 [ffff88032812bc48] osc_completion at ffffffffa08baa82 [osc]
#8 [ffff88032812bc98] osc_ap_completion at ffffffffa08a4754 [osc]
#9 [ffff88032812bcf8] brw_interpret at ffffffffa08b0f69 [osc] <<<< owns cl_loi_list_lock
#10 [ffff88032812bd98] ptlrpc_check_set at ffffffffa05f7c9b [ptlrpc]
#11 [ffff88032812be38] ptlrpcd_check at ffffffffa0628f20 [ptlrpc]
#12 [ffff88032812be68] ptlrpcd at ffffffffa06292b3 [ptlrpc]
#13 [ffff88032812bf48] kernel_thread at ffffffff8100c0ca
PID: 15078 TASK: ffff8802f4998040 CPU: 3 COMMAND: "LU-4112-reprodu"
#0 [ffff880032e67e90] crash_nmi_callback at ffffffff8102d3a6
#1 [ffff880032e67ea0] notifier_call_chain at ffffffff815139d5
#2 [ffff880032e67ee0] atomic_notifier_call_chain at ffffffff81513a3a
#3 [ffff880032e67ef0] notify_die at ffffffff8109cc1e
#4 [ffff880032e67f20] do_nmi at ffffffff8151169b
#5 [ffff880032e67f50] nmi at ffffffff81510f60
[exception RIP: _spin_lock+33]
RIP: ffffffff815107d1 RSP: ffff880305cdf988 RFLAGS: 00000297
RAX: 0000000000009f99 RBX: ffff880309ebfdd8 RCX: 0000000000000000
RDX: 0000000000009f98 RSI: ffff8802d5a180c8 RDI: ffff880320986774
RBP: ffff880305cdf988 R8: 0000000000000000 R9: 0000000000000001
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8802d5a18128
R13: ffff8802d3af4810 R14: ffff8802d3a1f740 R15: 0000000000000001
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#6 [ffff880305cdf988] _spin_lock at ffffffff815107d1 <<<<< spin on cl_loi_list_lock
#7 [ffff880305cdf990] osc_io_submit at ffffffffa08bf50c [osc]
#8 [ffff880305cdfa30] cl_io_submit_rw at ffffffffa04bc128 [obdclass]
#9 [ffff880305cdfa80] lov_io_submit at ffffffffa093ca75 [lov]
#10 [ffff880305cdfb10] cl_io_submit_rw at ffffffffa04bc128 [obdclass]
#11 [ffff880305cdfb60] cl_io_read_page at ffffffffa04be851 [obdclass]
#12 [ffff880305cdfbb0] ll_readpage at ffffffffa09f4676 [lustre]
#13 [ffff880305cdfbf0] generic_file_aio_read at ffffffff8111b6fc
#14 [ffff880305cdfcd0] vvp_io_read_start at ffffffffa0a1bcab [lustre]
#15 [ffff880305cdfd40] cl_io_start at ffffffffa04bc29a [obdclass]
#16 [ffff880305cdfd70] cl_io_loop at ffffffffa04c0adc [obdclass]
#17 [ffff880305cdfdb0] ll_file_io_generic at ffffffffa09cc047 [lustre]
#18 [ffff880305cdfe20] ll_file_aio_read at ffffffffa09cc339 [lustre]
#19 [ffff880305cdfe80] ll_file_read at ffffffffa09cc7f9 [lustre]
#20 [ffff880305cdfef0] vfs_read at ffffffff81181805
#21 [ffff880305cdff30] sys_read at ffffffff81181941
#22 [ffff880305cdff80] system_call_fastpath at ffffffff8100b072
RIP: 0000003dece0e54d RSP: 00007fff5fb8ee88 RFLAGS: 00010246
RAX: 0000000000000000 RBX: ffffffff8100b072 RCX: 0000000000000000
RDX: 000000000000c000 RSI: 0000000002476130 RDI: 0000000000000018
RBP: 00007fff5fb8f630 R8: 0000000000000001 R9: 00007fff5fb8f700
R10: 0000000000000000 R11: 0000000000000293 R12: 00007fff5fb90840
R13: 0000000000400a50 R14: 00000000526fce84 R15: 0000000000400a50
ORIG_RAX: 0000000000000000 CS: 0033 SS: 002b
This leads to a lock callback time-out and the eviction !! Could this be the issue you see at your site ?? |
| Comment by Sebastien Piechurski [ 29/Oct/13 ] |
|
At least this is the same consequence. Do you see this on the first rank of the MPI job ? It is highly likely the same thing, so good shot ! Thanks for this Bruno. Seb. |
| Comment by Bruno Faccini (Inactive) [ 30/Oct/13 ] |
|
I am currently trying to find an existing patch or back-port, or to develop one, for this dead-lock condition. On the other hand, if it is still easy for you to reproduce it (I mean with the same reproducer), would it be possible to do so with full lustre-debug enabled and again attach the evicted client(s) and servers full debug-logs ?? Also, you said that you reproduced with 96 MPI tasks on 2/4 nodes, but what is the number of cores on the client/compute nodes at your site ? |
| Comment by Sebastien Piechurski [ 30/Oct/13 ] |
|
I can reproduce it easily (almost every time), so I will collect full debug logs again and report here. The nodes I use have 24 cores with SMT enabled, so 48 logical cores each. |
| Comment by Bruno Faccini (Inactive) [ 30/Oct/13 ] |
|
Cool, thanks one more time for your help Seb !! I forgot to answer you about the rank of the dead-locked task. I can check more carefully, but I assume it is rank 0 since it is the 1st task on the 1st node … |
| Comment by Bruno Faccini (Inactive) [ 31/Oct/13 ] |
|
This problem is no longer present in master (since 2.2) because of commit 9fe4b52ad2ffadf125d9b5c78bb2ff9a01725707 for Change-Id I3ef619c1f07eefd201236ab55e5fd858791d41e0 in It is part of Jinshan's "New IO Engine", known to "Reduce loi_list lock contention" with the introduction of a "per object lock". On the other hand, is the resched()/yield in cl_lock_put()/cl_lock_free() absolutely required ?? |
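For anyone tracking this down later, the referenced change can be inspected in a lustre master checkout with standard git commands (the hash and Change-Id are the ones quoted above):
  git show 9fe4b52ad2ffadf125d9b5c78bb2ff9a01725707
  git log --grep=I3ef619c1f07eefd201236ab55e5fd858791d41e0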
| Comment by Bruno Faccini (Inactive) [ 01/Nov/13 ] |
|
Since backport of Another way could be to switch the cl_loi_list lock to a blocking mutex, as noted in the comment on the field definition in "lustre/include/obd.h", but this looks like a change with potentially significant side-effects. |
| Comment by Sebastien Piechurski [ 04/Nov/13 ] |
|
I just put on the Ftp in the |
| Comment by Bruno Faccini (Inactive) [ 05/Nov/13 ] |
|
Thanks for the logs Seb, I am currently working on them and will let you know if I find something else in them. |
| Comment by Sebastien Piechurski [ 05/Nov/13 ] |
|
Bad news: I recompiled our version with the patch at http://review.whamcloud.com/8135 and tested again on our benchmark cluster. It seems we need to find another way to understand what happens. |
| Comment by Bruno Faccini (Inactive) [ 05/Nov/13 ] |
|
Yes, I was able to get the dead-lock situation several times simply by running your reproducer in a loop, on a platform with 2x Clients / 1x MDS+MGS+MDT / 1x OSS / 2x OSTs and an IB network, running with the latest b2_1 build. OK, maybe we conflated things here and assumed that the dead-lock I triggered was the cause of your evictions … BTW, I am still reviewing the logs, so please wait for me to finish with them. Also, in the message you sent me you indicate that you instrumented a Lustre build and have been able to take a crash-dump of the client node at the time of the eviction; can you also upload it ?? And very importantly, you also mentioned that you trigger the problem when running with a genuine 2.1.6 version, without any additional patch and on all servers/clients. Thus my next step is to set up a new platform to test your reproducer against a full-2.1.6 platform (I will also run/try the b2_1 build + my patch just to see what happens), maybe with 4 clients now since you indicate that is the best configuration to reproduce. |
| Comment by Sebastien Piechurski [ 05/Nov/13 ] |
|
I am currently uploading the dump taken at the time of the eviction on the ftp site under |
| Comment by Sebastien Piechurski [ 07/Nov/13 ] |
|
A new set of crash dumps and full debug logs is currently uploading under the name of |
| Comment by Bruno Faccini (Inactive) [ 08/Nov/13 ] |
|
Thanks Seb, I grabbed this new information and am currently reviewing it. I may call you during the day if I need more, or need a new test to be run on your platform. |
| Comment by Sebastien Piechurski [ 12/Nov/13 ] |
|
Here are our latest findings:
Example with the provided reproducer (v2): 32 threads spread over 4 nodes with 8 cores enabled: hangs < 30 seconds. Given that the site we opened this ticket for has clients with 48 logical cores enabled (using SMT), almost all hangs there are over 100 seconds and finally cause evictions. |
| Comment by Sebastien Piechurski [ 12/Nov/13 ] |
|
New version of the reproducer, showing hang times for each type of operation. |
| Comment by Bruno Faccini (Inactive) [ 13/Nov/13 ] |
|
Thanks one more time for your great work in gathering more information to fully qualify this problem, Seb. I will run this new reproducer and see if I can at least get the same timing differences/issues. On the other hand, I found some log entries on the Server/OSS side which may also explain the pseudo-blocking situation you describe, sometimes leading to the eviction and reconnect+restart, due to the possible delays they seem to introduce in the "Blocking AST" process :
…………
00010000:00010000:5.0:1383842563.112425:0:7257:0:(ldlm_lock.c:626:ldlm_add_bl_work_item()) ### lock incompatible; sending blocking AST. ns: filter-scratna-OST0003_UUID lock: ffff8801c4000900/0xcc4bc67228f30369 lrc: 2/0,0 mode : --/PW res: 1995653/0 rrc: 5 type: EXT [0->4095] (req 0->4095) flags: 0x0 remote: 0xd7687ea28330a549 expref: 55 pid: 20473 timeout 0
…………
00010000:00010000:5.0:1383842563.112473:0:7257:0:(ldlm_lockd.c:790:ldlm_server_blocking_ast()) ### lock not granted, not sending blocking AST ns: filter-scratna-OST0003_UUID lock: ffff8801c4000900/0xcc4bc67228f30369 lrc: 3/0, 0 mode: --/PW res: 1995653/0 rrc: 5 type: EXT [0->4095] (req 0->4095) flags: 0x20 remote: 0xd7687ea28330a549 expref: 55 pid: 20473 timeout 0
…………
I will also investigate this in parallel. Could it also be possible for you to do some profiling, during a run with the maximum delays encountered and with per-CPU/core details, just to see if we are not missing something outside of Lustre's scope ? Anyway, I will run this on my side if I can reproduce the timing issues. |
| Comment by Bruno Faccini (Inactive) [ 14/Nov/13 ] |
|
No success with the reproducer on the small configuration that was immediately available: no message of the threshold (default = 3s) being exceeded after several hours of consecutive runs. I have been able to get another platform with nodes of 32 cores, and I am currently configuring it. |
| Comment by Bruno Faccini (Inactive) [ 15/Nov/13 ] |
|
Again no success in reproducing with the 32-cores-per-node platform running with the latest b2_1 build #220. I will now run the reproducer against the 2.1.6 distro (from our downloads) on the same platform, since you reported that you also trigger the problem running with a genuine 2.1.6 version. BTW, I am adding your email which highlights the versions you use :
Client: 2.1.6-Bull.3.002, Server: 2.1.6-Bull.3.002
Client: 2.1.6 vanilla, Server: 2.1.6-Bull.3.002
Client: 2.1.3-Bull.3.002, Server: 2.1.6-Bull.3.002
Where 2.1.6-Bull.3.002 = 2.1.6 vanilla + :
- ORNL-22 general ptlrpcd threads pool support
- [#316929]/LU-1144 implement a NUMA aware ptlrpcd binding policy
- [#316874]/LU-1110 MDS Oops in osd_xattr_get() during file open by FID
- [#319248]/LU-2613 too much unreclaimable slab space
- [#319615]/LU-2624 ptlrpc fix thread stop
- [#317941]/LU-2683 client deadlock in cl_lock_mutex_get
- [#319613] too many ll_inode_revalidate_fini failure msgs in syslog => change CERROR to CDEBUG, to avoid too many msgs
- [NF#13194] LBUG in mdt_reconstruct_open() => LU-2943
- [NF#15746] oss_num_threads max value is too low to feed disk controllers => raise oss_num_threads (512 -> 2048)
And 2.1.3-Bull.3.002 = 2.1.3 vanilla + :
- [#318217]/LU-1749 mdt failing to open llog result in invalidating ost
- [#318179]/LU-1331 changelogs: RNMTO record not always after RNMFRM
- [#317829]/LU-1592 ldlm: protect obd_export:exp_imp_reverse's change
- [#316929]/LU-1144 update patch to add ptlrpc option in modprobe.conf file
- [#314206]/LU-1448 prevent NULL pointer dereference on disabled OSC
- [#318383]/LU-1057 quota speed up lookup in osc_quota_chkdq.patch
- [#318566]/LU-1714 properly initialise sg_magic value.patch
- [#315241] disable 'lfs catinfo'
- ORNL-22 general ptlrpcd threads pool support
- [#316929]/LU-1144 implement a NUMA aware ptlrpcd binding policy
- [#316874]/LU-1110 MDS Oops in osd_xattr_get() during file open by FID
- [#317433]/LU-1363 llite: Not held lock when calling security_d_instantiate
- [#317699]/LU-645/BZ23978 getcwd failure
- [#317365]/LU-1299 (patch set 11) loading large enough binary from lustre trigger OOM killer
- [#317554]/LU-549 Improve statfs performance if selinux is disabled (complement)
- [#317852]/LU-1650 client crash in osc_req_attr_set() routine |
| Comment by Bruno Faccini (Inactive) [ 15/Nov/13 ] |
|
Bad news again: even running the reproducer with the stock 2.1.6 version I am unable to reproduce; after several hours of runs in a loop I got no message at all indicating any I/O syscall elapsed time greater than 3s. |
| Comment by Bruno Faccini (Inactive) [ 15/Nov/13 ] |
|
Wow, after more hours running the reproducer with the stock 2.1.6 version I have been able to reproduce what I think is the dead-lock documented in But on the other hand, I am still unable to reproduce the wrong behavior reported in this ticket; let's see ... |
| Comment by Bruno Faccini (Inactive) [ 20/Nov/13 ] |
|
Just an update to indicate that we are currently investigating (on-site and on my own test platform) the possible implications of the Kernel (profiling on clients, for instance) and of OpenMPI vs IntelMPI. |
| Comment by Alexandre Louvet [ 27/Nov/13 ] |
|
We observed that the evictions disappear if we run :
|
| Comment by Bruno Faccini (Inactive) [ 27/Nov/13 ] |
|
Yes, as we have seen while working on this during the past days, it seems that everything that forces frequent/regular context switches/schedule() calls makes the problem disappear !! nohz=on is the default with recent kernels, isn't it ? Maybe we face some tricky scenario with this feature on the IVY Bridge platform ? Also, you forgot to mention that running with OpenMPI also helps versus the original IntelMPI runs; thus, are there any known parameters/env-vars we can use to make the MPI threads yield CPUs/cores frequently ? I will investigate on my side, but I am sure you also have MPI specialists who can help here !! Last, thinking more about it, I wonder if you could also give a try to the "old" single-ptlrpcd-thread configuration ?? |
| Comment by Bruno Faccini (Inactive) [ 28/Nov/13 ] |
|
For the record, and provided by Seb, here is the typical /proc/cpuinfo output for the IVY Bridge processors on Bull's platforms :
vendor_id : GenuineIntel
cpu family : 6
model : 62
model name : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
stepping : 4
cpu MHz : 2700.000
cache size : 30720 KB
physical id : 1
siblings : 24
core id : 12
cpu cores : 12
apicid : 57
initial apicid : 57
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms
bogomips : 5399.25
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
I am currently looking for a similar internal platform … |
| Comment by Andreas Dilger [ 29/Nov/13 ] |
|
Could someone please fill me in on the details of what nohz=on actually does? Is it possible that in this case the jiffies/HZ are very small? I'm wondering if we might get into trouble in a case like this with math that assumes e.g. (HZ/100) is non-zero and we sleep for that amount of time, but in fact (HZ/100) == 0 and it means "sleep forever"? I also see mention of "lctl set_param ldlm.namespaces.osc.lru_size=0" being set on the node in order to flush the locks. That is probably a bad idea, since it means there will be a lot of thrashing of LDLM locks on the client when it is not needed, and it will essentially impose O_SYNC writes on the client (i.e. every write will immediately be sent in an RPC in order to drop the LDLM lock from the LRU). If you just want to flush the page cache, you can do either "lctl set_param ldlm.namespaces.osc.lru_size=clear", which will drop all of the locks in the LRU but not impose a future LRU size of 0, or "echo 3 > /proc/sys/vm/drop_caches" to flush the client cache without dropping all of the locks. |
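Summarizing the two alternatives above as commands (taken from the comment itself; depending on the setup the parameter may need a wildcard such as ldlm.namespaces.*.lru_size):
  lctl set_param ldlm.namespaces.osc.lru_size=clear   # drop the locks currently in the LRU, keep normal LRU sizing
  echo 3 > /proc/sys/vm/drop_caches                   # flush the client cache without cancelling the locks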
| Comment by Bruno Faccini (Inactive) [ 02/Dec/13 ] |
|
Andreas, thanks for this very important comment on the danger of using "ldlm.namespaces.osc.lru_size=0"; I am sure Sebastien will change the current on-site work-around to the "clear" option you documented. About nohz, I am still not able to fully explain its low-level implications, but the latency it introduces seems to have a big impact here. On the other hand, I was on-site last Friday and here are some more interesting points :
_ the problem has also been reproduced on Sandy Bridge there, whereas I can't reproduce it on Rosso … But I need to track down any differences that could explain this; the main candidates are Intel-MPI vs Open-MPI, BIOS config, Kernel scheduling parameters/features, ...
_ changing "max C-State" to 1 in the BIOS also avoids the problem, which is consistent with the nohz feature design, since one of its main purposes is to allow processors to remain in power-saving states for a longer period of time.
_ on the Intel-MPI side (version v4.1.1.036 is used on-site), trying to reduce I_MPI_SPIN_COUNT to the minimum value of 1 gives poor results.
_ profiling sessions, even if not reproducing the problem, show a main hot-spot in the MPIDI_CH3I_Progress() routine.
Last, since IVY Bridge nodes are available on the Endeavour internal system, I am currently trying to reproduce the issue there, with similar Lustre/IB/Intel-MPI pieces. |
| Comment by Bruno Faccini (Inactive) [ 04/Dec/13 ] |
|
Hello Seb, |
| Comment by Bruno Faccini (Inactive) [ 04/Dec/13 ] |
|
BTW, I have been able to reproduce the delays on our internal system with IVY Bridge nodes and running with Intel-MPI. |
| Comment by Bruno Faccini (Inactive) [ 09/Dec/13 ] |
|
As I already commented, I have been able to reproduce on the Endeavour system, but much less frequently/severely (only a few <30s hang/delay periods per reproducer run). The reason for this could be the "intel_idle.max_cstate=1" kernel boot-parameter and some performance-oriented BIOS settings … |
| Comment by Patrick Farrell (Inactive) [ 08/Jan/14 ] |
|
Andreas - The first answer here does a very good job of explaining NOHZ: The main thing is that without NOHZ, every CPU is woken up 'HZ' times per second (values range from 100 to 250, or higher in some situations). This 'wake up' does various things related to process management & scheduling, and is problematic for power management, because it happens even on idle CPUs. My reading suggests NOHZ only comes into play on idle CPUs, the idea being to stop waking up idle CPUs and taking them out of power-saving mode. For non-idle CPUs it has no effect, or at least that was true when it was first added. So I think the main effect of disabling NOHZ here is to get more frequent calls to the scheduler. The trade-off is essentially latency and timer accuracy vs power consumption, I believe. Cray is seeing a bug that we believe may be this one. I'm going to try the noted changes with respect to NOHZ and the other settings and see what happens. |
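For reference, a minimal sketch of how the tickless mode is toggled on a RHEL6-era client: nohz is a boot parameter, so it is appended to the kernel command line and takes effect after a reboot (the grub entry below is illustrative, not taken from the ticket):
  # /boot/grub/grub.conf, kernel line of the client's boot entry
  kernel /vmlinuz-2.6.32-358.11.1.el6.x86_64 ro root=/dev/sda2 nohz=off
  # after reboot, check it was picked up
  cat /proc/cmdline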
| Comment by Bruno Faccini (Inactive) [ 08/Jan/14 ] |
|
Hello Patrick, |
| Comment by Patrick Farrell (Inactive) [ 08/Jan/14 ] |
|
Bruno - Nothing to report yet. Are you changing the NOHZ setting on the clients, or the servers, or both? |
| Comment by Bruno Faccini (Inactive) [ 08/Jan/14 ] |
|
On clients only. Servers do not seem to be impacted because their workload is Lustre only !!… The NOHZ=on impact seems to occur only when a multi-tasked/threaded CPU-bound workload is running concurrently without itself yielding the CPU frequently enough (Intel-MPI vs OpenMPI ?) to allow re-scheduling. |
| Comment by Patrick Farrell (Inactive) [ 08/Jan/14 ] |
|
Bruno - I changed settings from NOHZ=ON and HZ=100 to NOHZ=OFF and HZ=1000 on my clients, and was able to reproduce the bug equally well in both configurations. However, I am running master on both the client and server; it's possible that if I were running 2.1 on one or both I might see different behavior. These are virtual systems, so I don't know of any way to change CPU-related BIOS settings that will affect the virtual machines in the expected way. Do you have any other thoughts for how to get the problem to stop happening? I am very curious to see whether these timing changes do affect the problem for us. I'd be happy to share full DK logs of all systems and my MPI-based reproducer if you'd like them. |
| Comment by Patrick Farrell (Inactive) [ 08/Jan/14 ] |
|
Actually, since we're on different schedules, here's the reproducer in case you're interested. Since the HZ changes didn't affect my reproducer, I wonder if we're actually hitting different bugs. Here's the command I'm using; the file 'logfile' has setstripe 8 (the bug does not occur with setstripe 0). I also did not see the issue with np set to fewer than 8:
mpirun --hostfile hosts -np 16 mpi_log.o
Here's the code. All it does is open a file, wait until all processes have opened the file, then write to the file a few times, close the file, and exit:
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char** argv)
{
    int size;
    int rank;

    // Initialize the MPI environment
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Every rank appends to the same striped file on Lustre
    FILE* logfile = fopen("/mnt/centfs01/mpi_test/logfile", "a");
    if (logfile == NULL) {
        perror("Failed to open logfile");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    fprintf(logfile, "Pre-barrier from %d of %d\n", rank, size);

    // Wait until all ranks have opened the file
    MPI_Barrier(MPI_COMM_WORLD);

    fprintf(logfile, "Post-barrier start from %d of %d\n", rank, size);
    fprintf(logfile, "Post-barrier middle: Hello world from %d of %d\n", rank, size);
    fprintf(logfile, "Post-barrier end from %d of %d\n", rank, size);
    fclose(logfile);

    MPI_Barrier(MPI_COMM_WORLD);

    // Finalize the MPI environment.
    MPI_Finalize();
    return 0;
}
|
| Comment by Bruno Faccini (Inactive) [ 09/Jan/14 ] |
|
Hello Patrick, |
| Comment by Patrick Farrell (Inactive) [ 29/Jan/14 ] |
|
Bruno - We've opened I still suspect this may be the same issue. |
| Comment by Manish Patel (Inactive) [ 12/Feb/14 ] |
|
Hello, We are also seeing the same issue at IU, which is on Lustre v2.1.6 with Here is an example from one client.
?--------- ? ? ? ? ? testtemp_2014_02_08_01_04_21
?--------- ? ? ? ? ? testtemp_2014_02_08_01_04_212014_02_08_01_04_21_test.log
?--------- ? ? ? ? ? testtemp_2014_02_08_01_06_22
?--------- ? ? ? ? ? testtemp_2014_02_08_01_06_222014_02_08_01_06_22_test.log
?--------- ? ? ? ? ? testtemp_2014_02_08_08_57_04
In addition to that, in the stack trace we are also seeing:
00010000:00010000:5.0:1392065759.472527:0:10020:0:(ldlm_lockd.c:642:ldlm_handle_ast_error()) ### client (nid 979@gni) returned -22 from blocking AST - normal race ns: mdt-ffff881843d1a000 lock: ffff881bf2728000/0xcf86cac758e2f8c0 lrc: 1/0,0 mode: --/PR res: 8628747995/115206 bits 0x3 rrc: 2 type: IBT flags: 0x240000a0 remote: 0x4e285e5bd84c2533 expref: 1102 pid: 4259 timeout: 7083106330 |
| Comment by Bruno Faccini (Inactive) [ 17/Feb/14 ] |
|
Hello Manish, are some Nodes/Clients more frequently evicted ? |
| Comment by Manish Patel (Inactive) [ 20/Feb/14 ] |
|
Hi Bruno, Nodes are getting evicted daily. The workload consists of creating a directory, writing output data from the analysis to that directory, and then reading that data back from the directory. The problem is reproduced just by doing 'ls' on the problematic directory. We tried testing nohz=off but the problem still persists. Most of IU's clients are 1.8.9, running kernel version 2.6.18-348.3.1.el5_lustre.g171bd56, which isn't tickless, so setting nohz=off would have no impact there. IU tested on their 2.1.6 client, which is running kernel 2.6.32-358.11.1.el6.x86_64, and a long listing against the known problematic directory still results in a hang. Here is a brief explanation from IU of what they are seeing and experiencing: we're actually experiencing two issues, or at least different manifestations of a single problem. First, we have clients that hang when stat'ing certain directories and eventually get evicted from the MDS, and only the MDS. These clients log "The mds_connect operation failed with -16" until they detect that they've been evicted by the MDS. Second, we're seeing clients evicted from OSSs with:
Feb 7 15:42:32 dc2oss16 kernel: : LustreError: 138-a: dc2-OST009f: A client on nid 10.10.0.46@tcp was evicted due to a lock completion callback time out: rc -107
Feb 7 15:43:52 dc2oss16 kernel: : LustreError: 138-a: dc2-OST00a5: A client on nid 10.10.0.70@tcp was evicted due to a lock blocking callback time out: rc -107
Feb 7 16:38:46 dc2oss13 kernel: : LustreError: 138-a: dc2-OST007e: A client on nid 10.10.0.192@tcp was evicted due to a lock completion callback time out: rc -107
Feb 7 16:38:46 dc2oss13 kernel: : LustreError: 138-a: dc2-OST0082: A client on nid 10.10.0.250@tcp was evicted due to a lock completion callback time out: rc -107
Feb 7 17:25:40 dc2oss03 kernel: : LustreError: 138-a: dc2-OST001b: A client on nid 10.10.0.247@tcp was evicted due to a lock blocking callback time out: rc -107
Feb 7 17:25:40 dc2oss03 kernel: : LustreError: 138-a: dc2-OST0019: A client on nid 10.10.0.94@tcp was evicted due to a lock blocking callback time out: rc -107
Feb 7 17:25:40 dc2oss13 kernel: : LustreError: 138-a: dc2-OST007e: A client on nid 149.165.226.203@tcp was evicted due to a lock blocking callback time out: rc -107
Thank You, |
| Comment by Bruno Faccini (Inactive) [ 21/Feb/14 ] |
|
Manish, can you provide at least the client's debug log upon eviction ? Having a big trace buffer configured and dlmtrace+rpctrace enabled ("all" would be best!) would be very helpful. |
| Comment by Manish Patel (Inactive) [ 21/Feb/14 ] |
|
Client and Server logs with Client trace logs. run6-client.out file. |
| Comment by Manish Patel (Inactive) [ 21/Feb/14 ] |
|
Hi Bruno, I already have those traces, which were collected at that time. I have attached them here. The log files are: Let me know if that sheds some light on it; otherwise we will collect a new set of logs. Let me know what buffer size we should set and what is the right way to get those dlmtrace+rpctrace logs. Thank You, |
| Comment by Bruno Faccini (Inactive) [ 24/Feb/14 ] |
|
Thanks Manish, I will review these logs/infos and let you know asap if I can get something from them. |
| Comment by Sebastien Piechurski [ 25/Feb/14 ] |
|
Hi Bruno, I am coming back to this issue. I was able today to test the following:
I am still reproducing the issue very consistently on every run. The customer is running his jobs by clearing the ldlm LRU each time files are created or copied from one node and before they are read from another set of nodes, but this is not compatible with the development of new codes and is suboptimal. What can we do to move forward on the resolution of this issue ? |
| Comment by Patrick Farrell (Inactive) [ 25/Feb/14 ] |
|
Sebastien - You must've missed my latest comment in the Gerrit review of |
| Comment by Sebastien Piechurski [ 26/Feb/14 ] |
|
Hi Patrick, Thanks for your comment. I indeed did not see the comment in gerrit saying that only patchset 1 was solving the issue. |
| Comment by Bruno Faccini (Inactive) [ 26/Feb/14 ] |
|
Humm, I think this is definitely the proof of what I strongly suspected and affirmed from the beginning when people wanted to link this/ Seb: will try to come up with a more acceptable solution/work-around by the end of the week. |
| Comment by Bruno Faccini (Inactive) [ 05/Mar/14 ] |
|
Hello Manish, sorry for the delay in finishing the analysis of the logs you uploaded. Unfortunately it is impossible to conclude anything from them, mainly because the client Lustre debug log you provided is not from the same day as the servers' syslogs/msgs, and also does not contain the client's eviction timeframe. If you want us to pursue this, you will need to provide at least the client Lustre debug log upon eviction, by first configuring a big log buffer (echo 2048 > /proc/sys/lnet/debug_mb) and at least enabling dlmtrace+rpctrace traces (echo +dlmtrace +rpctrace > /proc/sys/lnet/debug); full debug traces (echo -1 > /proc/sys/lnet/debug) would be best. Also, if you want the debug log to be dumped to disk upon eviction you can enable that (echo 1 > /proc/sys/lustre/dump_on_eviction), or you will have to request it manually (lctl dk <file>). Last, since there is still no clear evidence that you suffer from the same problem as the original one in this ticket, I am convinced that you had better open a specific ticket to address your own issue. |
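Putting the collection steps above together as a client-side sketch (commands exactly as given in the comment; <file> is a placeholder for a destination of your choice):
  echo 2048 > /proc/sys/lnet/debug_mb                # enlarge the debug buffer
  echo +dlmtrace +rpctrace > /proc/sys/lnet/debug    # or: echo -1 > /proc/sys/lnet/debug for full debug
  echo 1 > /proc/sys/lustre/dump_on_eviction         # auto-dump the debug log when the client is evicted
  lctl dk <file>                                     # or dump it manually after the event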
| Comment by Bruno Faccini (Inactive) [ 05/Mar/14 ] |
|
Seb: To ensure I did not miss anything during your/our earlier testing/reproducing/debugging sessions, I again went through all the information/logs/crash-dumps/... and any other material that we generated before. Going deeper into the crash-dump that you generated on a client upon eviction (with your Lustre distro including the patch introducing an LBUG() in ptlrpc_invalidate_import_thread() when obd_dump_on_eviction != 0), I found something really interesting that may be a new direction to investigate (a possible live-lock/scheduling issue during cl_lock mutex operations that may cause the timeouts/evictions ...). I know it is more work for you, but since you are the only ones able to really reproduce on your platform ..., could it be possible for you to re-run the reproducer on a set of clients running with the same patch, in order to provide me with a new crash-dump to analyze and help me confirm this ?? |
| Comment by Bruno Faccini (Inactive) [ 31/Mar/14 ] |
|
Seb, can you give some update ?? Thanks in advance. |
| Comment by Sebastien Piechurski [ 15/Apr/14 ] |
|
Hi Bruno, Sorry I was not able to give an update earlier. |
| Comment by Bruno Faccini (Inactive) [ 21/Apr/14 ] |
|
Yes, a crash-dump is required to confirm the behavior found in the debug traces, and also what I think I have found in the crash-dump taken when running with the instrumented Lustre modules where you inserted an LBUG() in ptlrpc_invalidate_import_thread() when obd_dump_on_eviction != 0. Can you do the same ? |
| Comment by Sebastien Piechurski [ 25/Apr/14 ] |
|
Hi Bruno, I am currently uploading an archive on the ftp site under uploads/
This was reproduced with the kernel (2.6.32-358.11.1.el6_lustre) and lustre version (vanilla 2.1.6) from the downloads.whamcloud.com site, except for the LBUG() call introduced in ptlrpc_invalidate_import_thread(). Note that the nodes which were used are based on a 10-core (40-thread) version of the processor, instead of the 12-core version used for the first occurrences, and that it took me many more attempts to reproduce, whereas it was systematically reproduced with the 12-core version (48 threads). I hope you will be able to confirm your thoughts with these traces. |
| Comment by Bruno Faccini (Inactive) [ 30/Apr/14 ] |
|
Hello Seb, |
| Comment by Bruno Faccini (Inactive) [ 15/May/14 ] |
|
Seb, just to let you know that your latest information confirmed my earlier thoughts. I will try to give an update on where I am in the "livelock" analysis by the end of today. |
| Comment by Bruno Faccini (Inactive) [ 22/May/14 ] |
|
Humm, I am late again, but this is a really tricky problem to debug from crash-dumps and lustre debug-logs ...
# sysctl -n kernel.sched_features
3183
# sysctl -w kernel.sched_features=1135
kernel.sched_features = 1135
# sysctl -n kernel.sched_features
1135
#
BTW, Bull's Kernel version (2.6.32-279.5.2.bl6.Bull.36.x86_64) that seems to be installed on-site does not allow this, but the Kernel versions (>= 2.6.32-358.11.1.el6_lustre.x86_64) we ship with Lustre 2.1.6/2.4.3 from our downloads area are OK. |
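As a side note (an assumption, not part of the original comment): on kernels built with CONFIG_SCHED_DEBUG and with debugfs mounted, the same bitmask can be inspected symbolically, which makes it easier to see which scheduler feature the 3183 -> 1135 change actually toggles:
  mount -t debugfs none /sys/kernel/debug 2>/dev/null
  cat /sys/kernel/debug/sched_features                       # lists feature names; a NO_ prefix means disabled
  echo NO_SOME_FEATURE > /sys/kernel/debug/sched_features    # SOME_FEATURE is a placeholder name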
| Comment by Sebastien Piechurski [ 30/May/14 ] |
|
I was able today to perform new tests. Running the reproducer on a regular kernel was taking between 1 and 5 minutes (of course leading to multiple lock callback timeouts after 160 seconds).
So this is very encouraging and I guess we now need to understand the phenomenon more deeply to see if there is actually a problem with our platform or if there is simply something bad when running this kind of load on systems with lots of cores (3 nodes x 24 cores in this case). |
| Comment by James A Simmons [ 14/Aug/16 ] |
|
Old blocker for an unsupported version. |