[LU-4112] Random eviction of clients on lock callback timeout Created: 16/Oct/13  Updated: 14/Aug/16  Resolved: 14/Aug/16

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.6
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Sebastien Buisson (Inactive) Assignee: Bruno Faccini (Inactive)
Resolution: Won't Fix Votes: 0
Labels: None

Attachments: File 2014-02-11-run6-client.out.gz     File LU-4112-reproducer-v2.tar.gz     File LU-4112-reproducer.tar.gz     Text File lustre-logs.tar.bz2.aa     Text File lustre-logs.tar.bz2.ab     Text File server_log_2_10_2014.log     Text File xfer_client_2_10_2014.log    
Severity: 3
Rank (Obsolete): 11068

 Description   

Hi,

At Meteo France, Lustre clients are randomly evicted with the following message:

1381739209 2013 Oct 14 08:26:49 beaufixoss9 kern err kernel LustreError: 0:0:(ldlm_lockd.c:358:waiting_locks_callback()) ### lock callback timer expired after 101s: evicting client at 172.23.67.157@o2ib ns: filter-scratch-OST0024_UUID lock: ffff880215b3c6c0/0xe3c417cfbe584025 lrc: 3/0,0 mode: PW/PW res: 87670641/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x20 remote: 0x964c642755ae9fd expref: 4 pid: 20133 timeout 4581190301

Some attempts were made to increase the ldlm_timeout, but even with a timeout of 300 seconds, the timer eventually expires and evicts the client.
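
For reference, ldlm_timeout is a server-side tunable on the OSSes. A minimal sketch of raising it at runtime on a 2.1.x server, using the 300s value quoted above (the exact method used on-site is not stated in the ticket):

# On each OSS (runtime setting, lost at reboot)
echo 300 > /proc/sys/lustre/ldlm_timeout      # same knob as sysctl lustre.ldlm_timeout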

After collecting some debug logs, it seems that there is a communication problem: the client sends the lock callback RPC, but it is never received by the OSS, leading it to time out.

Here is an example:

[root@beaufixmngt0 beaufix1046]# grep evict lustre-logs-20131011-082535
00000100:00080000:17.0:1381479935.274633:0:4383:0:(import.c:915:ptlrpc_connect_interpret()) @@@ scratch-OST0017-osc-ffff880454e9b400: evicting (reconnect/recover flags not set: 4) req@ffff88087945e000 x1448401857359985/t0(0) o8->scratch-OST0017-osc-ffff880454e9b400@172.23.80.69@o2ib1:28/4 lens 368/264 e 0 to 0 dl 1381479946 ref 1 fl Interpret:RN/0/0 rc 0/0
00000100:02020000:17.0:1381479935.274641:0:4383:0:(import.c:1299:ptlrpc_import_recovery_state_machine()) 167-0: This client was evicted by scratch-OST0017; in progress operations using this service will fail.
00000100:00080000:17.0:1381479935.285351:0:4383:0:(import.c:1303:ptlrpc_import_recovery_state_machine()) evicted from scratch-OST0017_UUID@172.23.80.69@o2ib1; invalidating

172.23.80.69 = beaufixoss15-ib1

[root@beaufixmngt0 beaufix1046]# head -n1 ../beaufixoss15/lustre-logs-20131011-082458
00010000:00000010:14.1F:1381479186.063701:0:0:0:(ldlm_lock.c:198:ldlm_lock_free()) slab-freed 'lock': 560 at ffff880067e6c480.
[root@beaufixmngt0 beaufix1046]# tail -n 2 ../beaufixoss15/lustre-logs-20131011-082458 | head -n1
00002000:00000001:2.0:1381479898.458969:0:20716:0:(filter_io_26.c:325:filter_do_bio()) Process entered

The OSS debug log covers the period from 1381479186 to 1381479898.

Here is the list of RPCs sent during this time (total of 5 RPCs):
[root@beaufixmngt0 beaufix1046]# grep -f requests lustre-logs-20131011-082535 | grep ptl_send_buf
00000100:00000200:22.0:1381479685.229939:0:4404:0:(niobuf.c:87:ptl_send_buf()) Sending 192 bytes to portal 28, xid 1448401857358474, offset 0
00000100:00000200:46.0:1381479697.155610:0:4424:0:(niobuf.c:87:ptl_send_buf()) Sending 296 bytes to portal 28, xid 1448401857358807, offset 0
00000100:00000200:6.0:1381479797.753246:0:4384:0:(niobuf.c:87:ptl_send_buf()) Sending 456 bytes to portal 6, xid 1448401857359681, offset 0
00000100:00000200:27.0:1381479835.230381:0:4411:0:(niobuf.c:87:ptl_send_buf()) Sending 192 bytes to portal 28, xid 1448401857359731, offset 0
00000100:00000200:6.0:1381479885.229892:0:4409:0:(niobuf.c:87:ptl_send_buf()) Sending 192 bytes to portal 28, xid 1448401857359816, offset 0

[root@beaufixmngt0 beaufix1046]# grep -f requests ../beaufixoss15/lustre-logs-20131011-082458 | grep incoming
00000100:00000040:0.0:1381479885.230810:0:10038:0:(events.c:284:request_in_callback()) incoming req@ffff88005b69f400 x1448401857359816 msgsize 192

-> Only 1 RPC received out of 5 sent by the client

This example only shows what happens at the LNet level, but so far we see no errors at the InfiniBand level.
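
As an illustration, the cross-check above can be scripted; a minimal sketch assuming two files, client.log and oss.log, holding the client and OSS debug logs (file names and paths are hypothetical):

# Extract the xids of requests the client reports as sent...
grep ptl_send_buf client.log | sed 's/.*xid \([0-9]*\),.*/\1/' > xids
# ...and report any of them that never show up as incoming requests on the OSS
for x in $(cat xids); do
    grep -q "incoming.*x$x " oss.log || echo "xid $x never received by OSS"
done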

This is a blocking problem as it is preventing the cluster from being used for production.

I will upload traces to help analysis.

Thanks for your help,
Sebastien.



 Comments   
Comment by Sebastien Buisson (Inactive) [ 16/Oct/13 ]

Here are the logs. To avoid the 10MB file size limitation, the .tar.bz2 file has been cut with 'split'.

beaufixoss15 is the OSS, and beaufix1046 is the client.

Comment by Bruno Faccini (Inactive) [ 16/Oct/13 ]

Hello Seb,
Can you provide me with some details about the IB network config/topology at Meteo? Are there any LNet routers being used?
I am waiting for the logs/traces, but what was the debug level/mask used?

Comment by Sebastien Buisson (Inactive) [ 16/Oct/13 ]

Hi Bruno,

I do not know if LNET routers are involved in that particular path between beaufixoss15 and beaufix1046. I will ask the Support team, as well as about the IB network topology and the debug mask used.

Thanks,
Sebastien.

Comment by Sebastien Buisson (Inactive) [ 16/Oct/13 ]

Here is the network information you requested.

There are no LNet routers on this configuration.
The topology is a fat tree with the 16 OSSs spread on 4 separate switches.
Each OSS has 2 IB ports (1 in o2ib and 1 in o2ib1). Half of the OSTs on each OSS are exported through o2ib0 and the other half through o2ib1.
Clients have only 1 IB port with 2 addresses, allowing them to be in both o2ib0 and o2ib1.

Cheers,
Sebastien.

Comment by Sebastien Buisson (Inactive) [ 16/Oct/13 ]

And for your question regarding the debug level:

echo -1 > /proc/sys/lnet/debug
echo -1 > /proc/sys/lnet/subsystem_debug
Comment by Sebastien Piechurski [ 16/Oct/13 ]

In case this eases the research, I am currently uploading a reproducer for this issue on the ftp site under uploads/LU-4112.
The tarball contains a directory in which a few files need to be edited:
in DATA/env_arpege:

  • change ARPEGE_LIB_DIR to point to the full path of the ARPEGE/LIBRARIES directory
  • change ALL_LIBRARIRES to include the path to the Intel compiler and Intel MPI libraries

in launch_run.sh:

  • change ROOT_DIR to point to the full path of the ARPEGE directory
  • DATA_DIR to $ROOT_DIR/DATA
  • WORK_DIR to a specific dir for a single run (currently relies on the SLURM job id)
  • Adapt the COMMAND_SRUN variable according to the batch manager (or leave it empty if there is no batch manager).

The issue was reproduced with 8 instances of this job, running on a total of 32 nodes. I don't know yet if it shows up with fewer.

It might be interesting to know that I could reproduce it on a 2.4.1 version as well.

Comment by Sebastien Buisson (Inactive) [ 16/Oct/13 ]

Apart from that, I think the priority could be lowered to Critical.

Comment by Bruno Faccini (Inactive) [ 16/Oct/13 ]

Thanks for all of these infos/hints already, will try to update you soon.

Comment by Bruno Faccini (Inactive) [ 17/Oct/13 ]

I agree that the Lustre debug logs provided seem to indicate that some requests (even OBD_PINGs) from client beaufix1046 never reach OSS beaufixoss15-ib1 ...

BTW, it seems to me that these requests do not even have their send acknowledged by the local Portals/IB layers, since some of the corresponding request_out_callback() traces are missing! So unless tracing is buggy/lazy, it looks like LNet/IB networking does not work at all, at least between some node pairs.

Did you simply check that IB/LNet work between identified problematic pairs (like beaufix1046 and beaufixoss15-ib1) with "lnet ping", IPoIB ping, and low-level IB diags/cmds?

Concerning the reproducer, and regarding its prerequisites (Intel compilers/run-time + MPI + MKL), even at Intel it may take some time for me to set up such a test platform ... Is it possible to use alternative/free products?

Comment by Sebastien Buisson (Inactive) [ 24/Oct/13 ]

Hi Bruno,

After a detailed analysis of the ARPEGE reproducer, it turned out that the issue seems related to LDLM locks.
Indeed, before the MPI job begins, the client which will then host the first rank of the MPI job copies the input files from another filesystem (not Lustre) to the scratch filesystem (Lustre) (121 files of 8.8 MB in that particular case). This client then holds the read/write LDLM locks on those files. When the MPI job begins, all ranks access those files in read-only mode, and so each client has to get a read lock on each file. The OSSes then reclaim the write locks from the client hosting the first MPI rank, and this is where the timeout occurs.

To work around this, the Support team introduced an 'lctl set_param ldlm.namespaces.osc.lru_size=0' between the copy of the input files and the beginning of the job itself. This way, the locks are released prior to the start of the MPI job, and no reclaim occurs during the job itself.

We now need to understand why the timeout occurs. The Support team is currently trying to write a homemade reproducer that mimics the behavior of the MPI job, both to provide it to you and to better understand what exactly triggers the problem.

Cheers,
Sebastien.
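
As a concrete illustration, the workaround described above amounts to something like the following in the job script (paths, file names and the srun invocation are hypothetical; only the lctl command comes from the description, written here with the usual wildcard form of the namespace path):

#!/bin/bash
# Stage the input files onto the Lustre scratch filesystem from the node
# that will host MPI rank 0
cp /other_fs/inputs/* /scratch/run_dir/

# Workaround: drop this client's LDLM locks before the MPI job starts, so
# the OSSes never have to revoke rank 0's write locks during the run
# (a later comment suggests lru_size=clear as a gentler alternative)
lctl set_param 'ldlm.namespaces.*osc*.lru_size=0'

srun ./arpege_job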

Comment by Bruno Faccini (Inactive) [ 24/Oct/13 ]

Thanks for this additional analysis and details !!

Do you think we may be facing an issue caused by too much lock grant/revoke traffic, or rather by some tricky locking pattern? BTW, do you already know if the MPI job tasks use extent locks (which does not seem to be the case in the Lustre logs you already provided ...)?

And also many thanks for working on a new reproducer, because the first one is very difficult to set up outside of its original environment.

Comment by Sebastien Piechurski [ 25/Oct/13 ]

Here is a reproducer for the issue described.

I was able to reproduce it with 96 MPI tasks across 2 nodes, but it is best reproduced with 96 tasks across 4 nodes.

I tried to strip out as much as I could while still reproducing it.
Currently, the reproducer just creates some files from the first rank, and then has all ranks:

  • Do a stat()
  • open the file
  • read the file (partially except for the first rank) with lots of lseek()
  • close the file

This is done twice for each file, by all ranks at once (synchronized for each file).

Note that when removing either the stat() call or the lseek() calls, the issue disappears. So it looks like these induce the timing needed to fall into the issue.
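
For illustration only, here is a minimal sketch of the access pattern described above; it is NOT the attached reproducer, and the file count, file size, chunk size, mount point and (omitted) error handling are simplified or made up:

/*
 * Sketch of the pattern: rank 0 creates the input files; then every rank
 * stat()s, opens, reads them with many lseek() calls and closes them,
 * twice per file, with all ranks synchronized on each file.
 */
#include <mpi.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/stat.h>

#define NFILES   8
#define FILESIZE (8 * 1024 * 1024)   /* roughly the 8.8MB input size mentioned earlier */
#define CHUNK    4096

int main(int argc, char **argv)
{
  int rank, pass, f, fd;
  off_t off, end;
  struct stat st;
  char path[256], buf[CHUNK];
  const char *dir = (argc > 1) ? argv[1] : "/mnt/lustre/lu4112";  /* assumed mount point */

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if (rank == 0) {                       /* rank 0 writes the input files */
    memset(buf, 'x', sizeof(buf));
    for (f = 0; f < NFILES; f++) {
      snprintf(path, sizeof(path), "%s/input.%d", dir, f);
      fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
      for (off = 0; off < FILESIZE; off += CHUNK)
        write(fd, buf, CHUNK);           /* error checks omitted for brevity */
      close(fd);
    }
  }
  MPI_Barrier(MPI_COMM_WORLD);

  for (pass = 0; pass < 2; pass++) {     /* each file is gone through twice */
    for (f = 0; f < NFILES; f++) {
      snprintf(path, sizeof(path), "%s/input.%d", dir, f);
      stat(path, &st);                   /* the stat() step */
      fd = open(path, O_RDONLY);
      /* read with lots of lseek(); non-zero ranks only read part of the file */
      end = (rank == 0) ? st.st_size : st.st_size / 4;
      for (off = 0; off < end; off += CHUNK) {
        lseek(fd, off, SEEK_SET);
        read(fd, buf, CHUNK);
      }
      close(fd);
      MPI_Barrier(MPI_COMM_WORLD);       /* all ranks synchronized per file */
    }
  }

  MPI_Finalize();
  return 0;
}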

Comment by Bruno Faccini (Inactive) [ 28/Oct/13 ]

Thanks for providing this new reproducer; I am currently setting up a test platform to run it.

Comment by Bruno Faccini (Inactive) [ 29/Oct/13 ]

The reproducer is now running on a platform set up as suggested. I will let you know of any results.

Comment by Bruno Faccini (Inactive) [ 29/Oct/13 ]

Hummm, I just faced a hang situation, with the following console messages/stack being printed regularly:

BUG: soft lockup - CPU#3 stuck for 67s! [LU-4112-reprodu:15078]
Modules linked in: nfs fscache lmv(U) mgc(U) lustre(U) lov(U) osc(U) lquota(U) mdc(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc cpufreq_on
demand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 mlx4_ib ib_sa ib_mad ib_core mlx4_en mlx4_core igb ptp pps_core microcode serio_raw i2c_i801 i2c_core sg iTCO_wd
t iTCO_vendor_support i7core_edac edac_core ioatdma dca shpchp ext3 jbd mbcache sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
CPU 3 
Modules linked in: nfs fscache lmv(U) mgc(U) lustre(U) lov(U) osc(U) lquota(U) mdc(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc cpufreq_on
demand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 mlx4_ib ib_sa ib_mad ib_core mlx4_en mlx4_core igb ptp pps_core microcode serio_raw i2c_i801 i2c_core sg iTCO_wd
t iTCO_vendor_support i7core_edac edac_core ioatdma dca shpchp ext3 jbd mbcache sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]

Pid: 15078, comm: LU-4112-reprodu Not tainted 2.6.32-358.11.1.el6_lustre.g33d0394.x86_64 #1 Supermicro X8DTT/X8DTT
RIP: 0010:[<ffffffff815107ce>]  [<ffffffff815107ce>] _spin_lock+0x1e/0x30
RSP: 0018:ffff880305cdf988  EFLAGS: 00000297
RAX: 0000000000009f99 RBX: ffff880305cdf988 RCX: 0000000000000000
RDX: 0000000000009f98 RSI: ffff8802d5a180c8 RDI: ffff880320986774
RBP: ffffffff8100bb8e R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000000 R12: 00000000526fce72
R13: 0000000000000000 R14: 0000000000000000 R15: ffff8802f3f4b2b8
FS:  00007f47e6ebb700(0000) GS:ffff880032e60000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f8b043f9018 CR3: 00000002e1fd1000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process LU-4112-reprodu (pid: 15078, threadinfo ffff880305cde000, task ffff8802f4998040)
Stack:
 ffff880305cdfa28 ffffffffa08bf50c ffff8802d5a262b8 ffff8802d5a181b8
<d> ffff8802d5a180c8 ffff8802d58c00a8 ffff8802d5a18180 ffff8802d5a18160
Call Trace:
 [<ffffffffa08bf50c>] ? osc_io_submit+0x31c/0x540 [osc]
 [<ffffffffa04bc128>] ? cl_io_submit_rw+0x78/0x180 [obdclass]
 [<ffffffffa093ca75>] ? lov_io_submit+0x355/0xc30 [lov]
 [<ffffffffa04bc128>] ? cl_io_submit_rw+0x78/0x180 [obdclass]
 [<ffffffffa04be851>] ? cl_io_read_page+0xb1/0x170 [obdclass]
 [<ffffffffa04b2909>] ? cl_page_assume+0xf9/0x2d0 [obdclass]
 [<ffffffffa09f4676>] ? ll_readpage+0x96/0x200 [lustre]
 [<ffffffff81096ce0>] ? wake_bit_function+0x0/0x50
 [<ffffffff8111b6fc>] ? generic_file_aio_read+0x1fc/0x700
 [<ffffffffa0a1bcab>] ? vvp_io_read_start+0x13b/0x3e0 [lustre]
 [<ffffffffa04bc29a>] ? cl_io_start+0x6a/0x140 [obdclass]
 [<ffffffffa04c0adc>] ? cl_io_loop+0xcc/0x190 [obdclass]
 [<ffffffffa09cc047>] ? ll_file_io_generic+0x3a7/0x560 [lustre]
 [<ffffffffa09cc339>] ? ll_file_aio_read+0x139/0x2c0 [lustre]
 [<ffffffffa09cc7f9>] ? ll_file_read+0x169/0x2a0 [lustre]
 [<ffffffff81181805>] ? vfs_read+0xb5/0x1a0
 [<ffffffff81181941>] ? sys_read+0x51/0x90
 [<ffffffff810dc595>] ? __audit_syscall_exit+0x265/0x290
 [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
Code: 00 00 00 01 74 05 e8 62 30 d7 ff c9 c3 55 48 89 e5 0f 1f 44 00 00 b8 00 00 01 00 f0 0f c1 07 0f b7 d0 c1 e8 10 39 c2 74 0e f3 90 <0f> b7 17 eb f5 83 3f 00 75 f4 eb df c9 c3 0f 1f 40 00 55 48 89 
Call Trace:
 [<ffffffffa08be9d5>] ? osc_cl_page_osc+0x15/0x60 [osc]
 [<ffffffffa08bf50c>] ? osc_io_submit+0x31c/0x540 [osc]
 [<ffffffffa04bc128>] ? cl_io_submit_rw+0x78/0x180 [obdclass]
 [<ffffffffa093ca75>] ? lov_io_submit+0x355/0xc30 [lov]
 [<ffffffffa04bc128>] ? cl_io_submit_rw+0x78/0x180 [obdclass]
 [<ffffffffa04be851>] ? cl_io_read_page+0xb1/0x170 [obdclass]
 [<ffffffffa04b2909>] ? cl_page_assume+0xf9/0x2d0 [obdclass]
 [<ffffffffa09f4676>] ? ll_readpage+0x96/0x200 [lustre]
 [<ffffffff81096ce0>] ? wake_bit_function+0x0/0x50
 [<ffffffff8111b6fc>] ? generic_file_aio_read+0x1fc/0x700
 [<ffffffffa0a1bcab>] ? vvp_io_read_start+0x13b/0x3e0 [lustre]
 [<ffffffffa04bc29a>] ? cl_io_start+0x6a/0x140 [obdclass]
 [<ffffffffa04c0adc>] ? cl_io_loop+0xcc/0x190 [obdclass]
 [<ffffffffa09cc047>] ? ll_file_io_generic+0x3a7/0x560 [lustre]
 [<ffffffffa09cc339>] ? ll_file_aio_read+0x139/0x2c0 [lustre]
 [<ffffffffa09cc7f9>] ? ll_file_read+0x169/0x2a0 [lustre]
 [<ffffffff81181805>] ? vfs_read+0xb5/0x1a0
 [<ffffffff81181941>] ? sys_read+0x51/0x90
 [<ffffffff810dc595>] ? __audit_syscall_exit+0x265/0x290
 [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b

And according to forced crash-dump analysis, this is due to :

PID: 7051   TASK: ffff8803346ca080  CPU: 3   COMMAND: "ptlrpcd-brw"
 #0 [ffff88032812ba30] schedule at ffffffff8150de92         <<<<< but yield from CPU !!!
 #1 [ffff88032812baf8] __cond_resched at ffffffff81064d6a
 #2 [ffff88032812bb18] _cond_resched at ffffffff8150e8a0
 #3 [ffff88032812bb28] cl_lock_put at ffffffffa04b85d1 [obdclass]
 #4 [ffff88032812bb68] osc_page_putref_lock at ffffffffa08ba248 [osc]
 #5 [ffff88032812bb98] osc_page_completion_read at ffffffffa08ba2b8 [osc]
 #6 [ffff88032812bba8] cl_page_completion at ffffffffa04b1fdf [obdclass]
 #7 [ffff88032812bc48] osc_completion at ffffffffa08baa82 [osc]
 #8 [ffff88032812bc98] osc_ap_completion at ffffffffa08a4754 [osc]
 #9 [ffff88032812bcf8] brw_interpret at ffffffffa08b0f69 [osc]   <<<<   owns cl_loi_list_lock
#10 [ffff88032812bd98] ptlrpc_check_set at ffffffffa05f7c9b [ptlrpc]
#11 [ffff88032812be38] ptlrpcd_check at ffffffffa0628f20 [ptlrpc]
#12 [ffff88032812be68] ptlrpcd at ffffffffa06292b3 [ptlrpc]
#13 [ffff88032812bf48] kernel_thread at ffffffff8100c0ca

PID: 15078  TASK: ffff8802f4998040  CPU: 3   COMMAND: "LU-4112-reprodu"
 #0 [ffff880032e67e90] crash_nmi_callback at ffffffff8102d3a6
 #1 [ffff880032e67ea0] notifier_call_chain at ffffffff815139d5
 #2 [ffff880032e67ee0] atomic_notifier_call_chain at ffffffff81513a3a
 #3 [ffff880032e67ef0] notify_die at ffffffff8109cc1e
 #4 [ffff880032e67f20] do_nmi at ffffffff8151169b
 #5 [ffff880032e67f50] nmi at ffffffff81510f60
    [exception RIP: _spin_lock+33]
    RIP: ffffffff815107d1  RSP: ffff880305cdf988  RFLAGS: 00000297
    RAX: 0000000000009f99  RBX: ffff880309ebfdd8  RCX: 0000000000000000
    RDX: 0000000000009f98  RSI: ffff8802d5a180c8  RDI: ffff880320986774
    RBP: ffff880305cdf988   R8: 0000000000000000   R9: 0000000000000001
    R10: 0000000000000000  R11: 0000000000000000  R12: ffff8802d5a18128
    R13: ffff8802d3af4810  R14: ffff8802d3a1f740  R15: 0000000000000001
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #6 [ffff880305cdf988] _spin_lock at ffffffff815107d1        <<<<< spin on cl_loi_list_lock
 #7 [ffff880305cdf990] osc_io_submit at ffffffffa08bf50c [osc]
 #8 [ffff880305cdfa30] cl_io_submit_rw at ffffffffa04bc128 [obdclass]
 #9 [ffff880305cdfa80] lov_io_submit at ffffffffa093ca75 [lov]
#10 [ffff880305cdfb10] cl_io_submit_rw at ffffffffa04bc128 [obdclass]
#11 [ffff880305cdfb60] cl_io_read_page at ffffffffa04be851 [obdclass]
#12 [ffff880305cdfbb0] ll_readpage at ffffffffa09f4676 [lustre]
#13 [ffff880305cdfbf0] generic_file_aio_read at ffffffff8111b6fc
#14 [ffff880305cdfcd0] vvp_io_read_start at ffffffffa0a1bcab [lustre]
#15 [ffff880305cdfd40] cl_io_start at ffffffffa04bc29a [obdclass]
#16 [ffff880305cdfd70] cl_io_loop at ffffffffa04c0adc [obdclass]
#17 [ffff880305cdfdb0] ll_file_io_generic at ffffffffa09cc047 [lustre]
#18 [ffff880305cdfe20] ll_file_aio_read at ffffffffa09cc339 [lustre]
#19 [ffff880305cdfe80] ll_file_read at ffffffffa09cc7f9 [lustre]
#20 [ffff880305cdfef0] vfs_read at ffffffff81181805
#21 [ffff880305cdff30] sys_read at ffffffff81181941
#22 [ffff880305cdff80] system_call_fastpath at ffffffff8100b072
    RIP: 0000003dece0e54d  RSP: 00007fff5fb8ee88  RFLAGS: 00010246
    RAX: 0000000000000000  RBX: ffffffff8100b072  RCX: 0000000000000000
    RDX: 000000000000c000  RSI: 0000000002476130  RDI: 0000000000000018
    RBP: 00007fff5fb8f630   R8: 0000000000000001   R9: 00007fff5fb8f700
    R10: 0000000000000000  R11: 0000000000000293  R12: 00007fff5fb90840
    R13: 0000000000400a50  R14: 00000000526fce84  R15: 0000000000400a50
    ORIG_RAX: 0000000000000000  CS: 0033  SS: 002b

This leads to a lock callback timeout and the eviction!

Could this be the issue you see at your site?

Comment by Sebastien Piechurski [ 29/Oct/13 ]

At least this is the same consequence. Do you see this on the first rank of the MPI job ?
We never saw the soft lockup messages though. I thought this was because kernel.watchdog was disabled, but I just reproduced it with the watchdog enabled and still got no messages.

It is highly likely the same thing, so good shot !

Thanks for this Bruno.

Seb.

Comment by Bruno Faccini (Inactive) [ 30/Oct/13 ]

I am currently trying to find an existing patch to back-port, or to develop one, for this deadlock condition.

On the other hand, if it is still easy for you to reproduce it (I mean with the same reproducer), could you do so with full Lustre debug enabled and again attach the full debug logs of the evicted client(s) and the servers?

Also, you said that you reproduced with 96 MPI tasks on 2/4 nodes, but what is the number of cores on the client/compute nodes at your site?

Comment by Sebastien Piechurski [ 30/Oct/13 ]

I can reproduce it easily (almost every time), so I will collect full debug logs again and report here.

The nodes I use have 24 cores with SMT enabled, so 48 logical cores each.

Comment by Bruno Faccini (Inactive) [ 30/Oct/13 ]

Cool, thanks one more time for your help Seb !!

I forgot to answer your question about the rank of the deadlocked task. I can check more carefully, but I assume it is rank 0 since it is the 1st task on the 1st node ...

Comment by Bruno Faccini (Inactive) [ 31/Oct/13 ]

This problem is no longer present in master (since 2.2) because of commit 9fe4b52ad2ffadf125d9b5c78bb2ff9a01725707 for Change-Id I3ef619c1f07eefd201236ab55e5fd858791d41e0 in LU-1030 ("osc: new IO engine implementation").

It is part of Jinshan's "New IO Engine", known to "Reduce loi_list lock contention" with the introduction of a "per object lock".

On the other hand, is the resched()/yield in cl_lock_put()/cl_lock_free() absolutely required?

Comment by Bruno Faccini (Inactive) [ 01/Nov/13 ]

Since a backport of the LU-1030 changes appears huge and risky, I am giving a try to avoiding the resched()/yield in cl_lock_put()/cl_lock_free(), with the patch at http://review.whamcloud.com/8135.

Another way could be to switch the cl_loi_list lock to a blocking mutex, as suggested in the comment next to its field definition in "lustre/include/obd.h", but this looks like a change with very likely side-effects.

Comment by Sebastien Piechurski [ 04/Nov/13 ]

I just put the full debug logs of 2 OSSes and 4 clients, covering the reproducer run, on the FTP site in the LU-4112 directory.
In the end, sid3 evicts sid614.

Comment by Bruno Faccini (Inactive) [ 05/Nov/13 ]

Thanks for the logs Seb, I am currently working on them and will let you know if I find something else in them.

Comment by Sebastien Piechurski [ 05/Nov/13 ]

Bad news: I recompiled our version with the patch at http://review.whamcloud.com/8135 and tested again on our benchmark cluster.
The issue is still there.

It seems we need to find another way to understand what happens.
Just to be sure: do you reproduce the issue often with my reproducer?

Comment by Bruno Faccini (Inactive) [ 05/Nov/13 ]

Yes, I was able to get the deadlock situation several times simply by running your reproducer in a loop, on a platform with 2xClients/1xMDS+MGS+MDT/1xOSS/2xOSTs and an IB network, running with the latest b2_1 build.

OK, maybe we conflated things here and assumed that the deadlock I triggered was the cause of your evictions ...

BTW, I am still reviewing the logs, so please wait until I finish with them.

Also, in the message you sent me, you indicate you instrumented a Lustre build and were able to take a crash-dump of the client node at the time of the eviction; can you also upload it?

And very importantly, you also mentioned that you trigger the problem when running with a genuine 2.1.6 version, without any additional patch, on all servers/clients.

Thus my next step is to set up a new platform to test your reproducer against a full 2.1.6 installation (I will also run/try the b2_1 build + my patch just to see what happens), maybe with 4 clients now since you indicate that is the best way to reproduce.

Comment by Sebastien Piechurski [ 05/Nov/13 ]

I am currently uploading the dump taken at the time of the eviction to the FTP site under LU-4112/sid661-LU4112-state-on-eviction.tar.gz.
It was taken by introducing an LBUG() in ptlrpc_invalidate_import_thread() when obd_dump_on_eviction != 0.
You might see odd things in the dump: we have another issue where a panic does not trigger the dump but freezes the system, and an NMI has to be sent afterwards to trigger it. The state of all the processes should nevertheless be the expected one.

Comment by Sebastien Piechurski [ 07/Nov/13 ]

A new set of crash dumps and full debug logs is currently uploading under the name LU-4112-complete-dumps-20131107.tar.xz.
It will be complete when the size reaches 378.24MB (in about 40 minutes).
EDIT: I forgot to include the servers' debug logs. I just added them in the lustre-debuglogs-sid2_3.tar.xz archive.

Comment by Bruno Faccini (Inactive) [ 08/Nov/13 ]

Thanks Seb, I grabbed this new material and am currently reviewing it. I may call you during the day if I need more, or a new test to be run on your platform.

Comment by Sebastien Piechurski [ 12/Nov/13 ]

Here are our latest findings:

  • some operations (read, stat, open) sometimes take a huge amount of time to complete (even though there is little to no activity on the server side). All threads hang at this time for several tens of seconds. When this takes more than 100 seconds, the client is evicted, but no I/O error is reported at the application level. Instead, the application unblocks and resumes as if nothing wrong had happened.
  • The number of active cores on the client system has an impact on the average hang time, even if the application is using the same number of threads.

Example with the provided reproducer (v2):

32 threads spread on 4 nodes with 8 cores enabled: hangs < 30 seconds
32 threads spread on 4 nodes with 12 cores enabled: hangs < 75 seconds
32 threads spread on 4 nodes with 16 cores enabled: hangs > 100 seconds

Given the fact that the site we opened this ticket for has clients with 48 logical cores enabled (using SMT), almost all hangs are over 100 seconds and finally cause evictions.

Comment by Sebastien Piechurski [ 12/Nov/13 ]

New version of the reproducer showing hang times on each type of operations.

Comment by Bruno Faccini (Inactive) [ 13/Nov/13 ]

Thanks one more time, Seb, for your great job digging up more information to fully qualify this problem.

I will run this new reproducer and see if I can at least get the same timing differences/issues.

On the other hand, I found some log entries on the server/OSS side which may also explain the pseudo-blocking situation you describe (sometimes leading to the eviction and reconnect+restart), due to the possible delays they seem to introduce in the "Blocking AST" process:

…………

00010000:00010000:5.0:1383842563.112425:0:7257:0:(ldlm_lock.c:626:ldlm_add_bl_work_item()) ### lock incompatible; sending blocking AST. ns: filter-scratna-OST0003_UUID lock: ffff8801c4000900/0xcc4bc67228f30369 lrc: 2/0,0 mode
: --/PW res: 1995653/0 rrc: 5 type: EXT [0->4095] (req 0->4095) flags: 0x0 remote: 0xd7687ea28330a549 expref: 55 pid: 20473 timeout 0

…………

00010000:00010000:5.0:1383842563.112473:0:7257:0:(ldlm_lockd.c:790:ldlm_server_blocking_ast()) ### lock not granted, not sending blocking AST ns: filter-scratna-OST0003_UUID lock: ffff8801c4000900/0xcc4bc67228f30369 lrc: 3/0,
0 mode: --/PW res: 1995653/0 rrc: 5 type: EXT [0->4095] (req 0->4095) flags: 0x20 remote: 0xd7687ea28330a549 expref: 55 pid: 20473 timeout 0

…………

I will also investigate this in parallel.

Could you also do some profiling during a run showing the maximum delays encountered, with per-CPU/core details, just to see whether we are missing something outside of Lustre's scope? Anyway, I will run this on my side if I can reproduce the timing issues.

Comment by Bruno Faccini (Inactive) [ 14/Nov/13 ]

No success with the reproducer on the small configuration that was immediately available: no message of the threshold (default=3s) being exceeded after several hours of consecutive runs.

I have been able to get another platform with 32-core nodes, and I am currently configuring it.

Comment by Bruno Faccini (Inactive) [ 15/Nov/13 ]

Again no success in reproducing with the 32-cores-per-node platform running the latest b2_1 build #220.
I only got the same crash/race reported in LU-3614 ... At least we now have a crash-dump to investigate the problem for that particular ticket ...

I will now run the reproducer against the 2.1.6 distro (from our downloads) on the same platform, since you reported that you also trigger the problem running with a genuine 2.1.6 version.

BTW, I am adding your email which highlights the versions you use :

Client: 2.1.6-Bull.3.002,  Server: 2.1.6-Bull.3.002
Client: 2.1.6 vanilla,  Server: 2.1.6-Bull.3.002
Client: 2.1.3-Bull.3.002 Server: 2.1.6-Bull.3.002

Where 2.1.6-Bull.3.002 = 2.1.6 vanilla + :
- ORNL-22 general ptlrpcd threads pool support
- [#316929]/LU-1144 implement a NUMA aware ptlrpcd binding policy
- [#316874]/LU-1110 MDS Oops in osd_xattr_get() during file open by FID
- [#319248]/LU-2613 to much unreclaimable slab space
- [#319615]/LU-2624 ptlrpc fix thread stop
- [#317941]/LU-2683 client deadlock in cl_lock_mutex_get
- [#319613] too many ll_inode_revalidate_fini failure msg in syslog  => Change CERROR in CDEBUG, to avoid too many msgs.
- [NF#13194] LBUG in mdt_reconstruct_open()  => LU-2943
- [NF#15746] oss_num_threads max value is too low to feed disk controllers => To raise oss_num_threads (512 -> 2048).


2.1.3-Bull.3.002: 2.1.3 vanilla + :
- [#318217]/LU-1749 mdt failing to open llog result in invalidating ost
- [#318179]/LU-1331 changelogs: RNMTO record not always after RNMFRM
- [#317829]/LU-1592 ldlm: protect obd_export:exp_imp_reverse's change
- [#316929]/LU-1144 update patch to add ptlrpc option in modprobe.conf file
- [#314206]/LU-1448 prevent NULL pointer dereference on disabled OSC
- [#318383]/LU-1057 quota speed up lookup in osc_quota_chkdq.patch
- [#318566]/LU-1714 properly initialyse sg_magic value.patch
- [#315241] disable 'lfs catinfo'
- ORNL-22 general ptlrpcd threads pool support
- [#316929]/LU-1144 implement a NUMA aware ptlrpcd binding policy
- [#316874]/LU-1110 MDS Oops in osd_xattr_get() during file open by FID
- [#317433]/LU-1363 llite: Not held lock when calling security_d_instantiate
- [#317699]/LU-645/BZ23978 getcwd failure
- [#317365]/LU-1299 (patch set 11) loading large enough binary from lustre trigger OOM killer
- [#317554]/LU-549 Improve statfs performance if selinux is disabled (complement)
- [#317852]/LU-1650 client crash in osc_req_attr_set() routine
Comment by Bruno Faccini (Inactive) [ 15/Nov/13 ]

Bad news again: even running the reproducer with the stock 2.1.6 version I am unable to reproduce; after several hours of runs in a loop I got no message at all indicating any I/O syscall elapsed time greater than 3s.

Comment by Bruno Faccini (Inactive) [ 15/Nov/13 ]

Wow, after more hours running the reproducer with the stock 2.1.6 version I have been able to reproduce what I think is the deadlock documented in LU-2683! If live + crash-dump analysis confirms this, it will definitely prove that the LU-2683 change at http://review.whamcloud.com/5208 (patch-set #2) has to be back-ported to the b2_1 tree, as you/Bull did in your own 2.1.6 distro.

But on the other hand, I am still unable to reproduce the wrong behavior for this ticket; let's see ...

Comment by Bruno Faccini (Inactive) [ 20/Nov/13 ]

Just an update to indicate that we are currently investigating (on-site and on my own test platform) possible kernel implications (profiling on clients, for instance) and OpenMPI vs IntelMPI.

Comment by Alexandre Louvet [ 27/Nov/13 ]

We observed that the evictions disappear if:

  • we run a background program which forces a schedule every 1 ms
  • we boot the kernel with nohz=off (Bruno, I checked and the kernel is built with CONFIG_HZ=100)
  • we profile the client (perf)
  • we do not run on a node with Ivy Bridge CPUs (???). On that point, do you have a way to test with such CPUs?
Comment by Bruno Faccini (Inactive) [ 27/Nov/13 ]

Yes, as we have seen while working on this over the past days, it seems that everything that forces frequent/regular context switches/schedule() calls makes the problem disappear!

nohz=on is the default with recent kernels, isn't it? Maybe we are facing some tricky scenario with this feature on the Ivy Bridge platform?

Also, you forgot to mention that running with OpenMPI also helps compared to the original runs with IntelMPI. Thus, are there any known parameters/env-vars we can use to make MPI threads yield CPUs/cores frequently? I will investigate on my side, but I am sure you also have MPI specialists who can help here!

Last, thinking more about it, I wonder if you could also give a try to the "old" single-ptlrpcd-thread configuration?

Comment by Bruno Faccini (Inactive) [ 28/Nov/13 ]

For the record, here is the typical /proc/cpuinfo output, provided by Seb, for the Ivy Bridge processors on Bull's platforms:

vendor_id       : GenuineIntel
cpu family      : 6
model           : 62
model name      : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
stepping        : 4
cpu MHz         : 2700.000
cache size      : 30720 KB
physical id     : 1
siblings        : 24
core id         : 12
cpu cores       : 12
apicid          : 57
initial apicid  : 57
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms
bogomips        : 5399.25
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

and I am currently looking for a similar internal platform ...

Comment by Andreas Dilger [ 29/Nov/13 ]

Could someone please fill me in on the details of what nohz=on actually does? Is it possible that in this case the jiffies/HZ are very small? I'm wondering if we might get into trouble in a case like this with math that assumes e.g. (HZ/100) is non-zero and we sleep for that amount of time, but in fact (HZ/100) == 0 and it means "sleep forever"?

I also see mention of "lctl set_param ldlm.namespaces.osc.lru_size=0" being set on the node in order to flush the locks. That is probably a bad idea, since it means there will be a lot of thrashing of LDLM locks on the client when it is not needed, and will essentially impose O_SYNC writes on the client (i.e. every write will immediately be sent in an RPC in order to drop the LDLM lock from the LRU). If you just want to flush the page cache, you can do either "lctl set_param ldlm.namespaces.osc.lru_size=clear" which will drop all of the locks in the LRU, but not impose a future LRU size of 0, or "echo 3 > /proc/sys/vm/drop_caches" to flush the client cache without dropping all of the locks.
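
Collected as commands, the two gentler alternatives described above (the wildcard form of the namespace path is assumed):

# Drop all locks currently in the LRU once, without pinning the LRU size to 0
lctl set_param 'ldlm.namespaces.*osc*.lru_size=clear'

# Or flush the client page/dentry/inode caches without cancelling the locks
echo 3 > /proc/sys/vm/drop_caches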

Comment by Bruno Faccini (Inactive) [ 02/Dec/13 ]

Andreas, thanks for this very important comment on the danger of using "ldlm.namespaces.osc.lru_size=0"; I am sure Sebastien will change the current on-site work-around to the "clear" option you documented.

About nohz, I am still not able to fully explain its low-level implications, but the latency it introduces seems to have a big impact here.

On the other hand, I was on-site last Friday and here are some more interesting points:

_ the problem has also been reproduced there on Sandy Bridge, while I can't reproduce it on Rosso ... I need to track down any differences that could explain this; the main candidates are Intel-MPI vs Open-MPI, BIOS config, kernel scheduling parameters/features, ...

_ setting "max C-State" to 1 in the BIOS also avoids the problem, which is consistent with the nohz feature design, since one of its main purposes is to allow processors to remain in power-saving states for longer periods of time.

_ on the Intel-MPI side (version 4.1.1.036 is used on-site), trying to reduce I_MPI_SPIN_COUNT to the minimum value of 1 gives poor results.

_ profiling sessions, even when not reproducing the problem, show a main hot-spot in the MPIDI_CH3I_Progress() routine.

Last, since Ivy Bridge nodes are available on the internal Endeavour system, I am currently trying to reproduce the issue there with similar Lustre/IB/Intel-MPI pieces.

Comment by Bruno Faccini (Inactive) [ 04/Dec/13 ]

Hello Seb,
Could you give the "I_MPI_WAIT_MODE=1" Intel-MPI env. var a try and see if it has some (good?) impact on a reproducer run? It is a hint coming from an Intel-MPI teammate.

Comment by Bruno Faccini (Inactive) [ 04/Dec/13 ]

BTW, I have been able to reproduce the delays on our internal system with IVY Bridge nodes and running with Intel-MPI.
And they seem to disappear when I use "I_MPI_WAIT_MODE=1".

Comment by Bruno Faccini (Inactive) [ 09/Dec/13 ]

As I already commented, I have been able to reproduce on the Endeavour system, but much less frequently/severely (only a few <30s hang/delay periods per reproducer run). The reason could be the use of the "intel_idle.max_cstate=1" kernel boot parameter and some performance-oriented BIOS settings ...

Comment by Patrick Farrell (Inactive) [ 08/Jan/14 ]

Andreas - The first answer here does a very good job of explaining NOHZ:
http://stackoverflow.com/questions/9775042/how-nohz-on-affects-do-timer-in-linux-kernel

The main thing is that without NOHZ, every CPU is woken up 'HZ' times per second (values range from 100 to 250 to higher in some situations). This 'wake up' does various things related to process management & scheduling, and is problematic for power management situations, because it happens even on idle CPUs.

My reading suggests NOHZ only comes into play on idle CPUs, the idea being to stop waking up idle CPUs and taking them out of power-saving mode. For non-idle CPUs it has no effect, or at least that was true when it was first added.

So I think the main effect of disabling NOHZ here is to get more frequent calls to the scheduler. The trade-off is essentially latency and timer accuracy vs power consumption, I believe.

Cray is seeing a bug we believe may be this. I'm going to try the noted changes with respect to NOHZ and the other settings and see what happens.
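
For reference, a quick way to check whether a given client kernel is tickless and whether nohz was overridden at boot (standard procfs paths; the config file location assumes the usual /boot layout):

# Was the kernel built with tickless support?
grep CONFIG_NO_HZ /boot/config-$(uname -r)
# Was nohz= overridden on the kernel command line?
tr ' ' '\n' < /proc/cmdline | grep '^nohz=' || echo "nohz not set (kernel default)"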

Comment by Bruno Faccini (Inactive) [ 08/Jan/14 ]

Hello Patrick,
Thanks for adding some comments+link about NOHZ feature behavior.
After the intensive testing we did for this ticket, it seems that running with the [default] NOHZ=on may induce unexpected scheduling consequences on some platforms (depending on kernel version, HW+BIOS [C,P]-states implementation+configuration, ...) when running CPU-bound applications (Intel-MPI in polling mode), with bad side-effects on the Lustre client.
Did you find similar issues?

Comment by Patrick Farrell (Inactive) [ 08/Jan/14 ]

Bruno - Nothing to report yet. Are you changing the NOHZ setting on the clients, or the servers, or both?

Comment by Bruno Faccini (Inactive) [ 08/Jan/14 ]

On clients only. Servers seem not to be impacted because their workload is Lustre only! The NOHZ=on impact seems to occur only when a multi-process/multi-threaded CPU-bound workload is running concurrently without itself yielding the CPU frequently enough (Intel-MPI vs OpenMPI?) to allow re-scheduling.

Comment by Patrick Farrell (Inactive) [ 08/Jan/14 ]

Bruno - I changed settings from NOHZ=ON and HZ=100 to NOHZ=OFF and HZ=1000 on my clients, and was able to reproduce the bug equally well in both configurations. However, I am running master on both the client and server; it's possible that if I were running 2.1 on one or both I might see different behavior.

These are virtual systems, so I don't know of any way to change CPU-related BIOS settings that will affect the virtual machines in the expected way.

Do you have any other thoughts for how to get the problem to stop happening? I am very curious to see if these timing things do affect the problem for us. I'd be happy to share full DK logs of all systems and my MPI based reproducer if you'd like them.

Comment by Patrick Farrell (Inactive) [ 08/Jan/14 ]

Actually, since we're on different schedules, here's the reproducer in case you're interested.

Since the HZ changes didn't affect my reproducer, I wonder if we're actually hitting different bugs.

Here's the command I'm using; the file 'logfile' is setstripe 8 (the bug does not occur with setstripe 0).

I also did not see the issue with np set to fewer than 8:

mpirun --hostfile hosts -np 16 mpi_log.o

Here's the code. All it does is open a file, wait until all processes have opened a file, then write to the file a few times, close the file, and exit:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
 
int main(int argc, char** argv) {
  int size;
  int rank;
  // Initialize the MPI environment
  MPI_Init(&argc, &argv);
 
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  
  FILE* logfile = fopen("/mnt/centfs01/mpi_test/logfile","a");

  if (logfile == NULL) {
	fprintf(stderr, "Failed to open logfile\n");
	MPI_Abort(MPI_COMM_WORLD, 1);  /* abort instead of writing through a NULL FILE* */
  }

  fprintf(logfile,"Pre-barrier from %d of %d\n", rank, size);

  MPI_Barrier(MPI_COMM_WORLD);
  
  fprintf(logfile,"Post-barrier start from %d of %d\n", rank, size);

  fprintf(logfile,"Post-barrier middle: Hello world from %d of %d\n", rank, size);
  
  fprintf(logfile,"Post-barrier end from %d of %d\n", rank, size);
 
  fclose(logfile);

  MPI_Barrier(MPI_COMM_WORLD);

  // Finalize the MPI environment.
  MPI_Finalize();
}
Comment by Bruno Faccini (Inactive) [ 09/Jan/14 ]

Hello Patrick,
Given that NOHZ=off has no impact, that your reproducer is much simpler than the one from Bull, and also that you run on a VM-based platform, I am definitely convinced it is not the same issue.

Comment by Patrick Farrell (Inactive) [ 29/Jan/14 ]

Bruno -

We've opened LU-4495 (https://jira.hpdd.intel.com/browse/LU-4495, patch at http://review.whamcloud.com/8971) for the issue I described above.

I still suspect this may be the same issue.

Comment by Manish Patel (Inactive) [ 12/Feb/14 ]

Hello,

We are also seeing the same issue at IU, which is on Lustre v2.1.6 with the LU-2943 and LU-3645 patches, and it seems we hit this bug. The issue seems similar: sometimes clients get evicted, and sometimes a client does not list the directory metadata information.

Here is the example from one client.

 
?--------- ? ?      ?            ?            ? testtemp_2014_02_08_01_04_21
?--------- ? ?      ?            ?            ? testtemp_2014_02_08_01_04_212014_02_08_01_04_21_test.log
?--------- ? ?      ?            ?            ? testtemp_2014_02_08_01_06_22
?--------- ? ?      ?            ?            ? testtemp_2014_02_08_01_06_222014_02_08_01_06_22_test.log
?--------- ? ?      ?            ?            ? testtemp_2014_02_08_08_57_04

In addition to that, in the stack trace we are also seeing:

00010000:00010000:5.0:1392065759.472527:0:10020:0:(ldlm_lockd.c:642:ldlm_handle_ast_error()) ### client (nid 979@gni) returned -22 from blocking AST - normal race ns: mdt-ffff881843d1a000 lock: ffff881bf2728000/0xcf86cac758e2f8c0 lrc: 1/0,0 mode: --/PR res: 8628747995/115206 bits 0x3 rrc: 2 type: IBT flags: 0x240000a0 remote: 0x4e285e5bd84c2533 expref: 1102 pid: 4259 timeout: 7083106330
Comment by Bruno Faccini (Inactive) [ 17/Feb/14 ]

Hello Manish, are some nodes/clients evicted more frequently than others?
If yes, do you have an idea of the workload they run at the time of the problem?
And if there are nodes known to encounter the problem, would it be possible to have at least one of them booted with nohz=off?

Comment by Manish Patel (Inactive) [ 20/Feb/14 ]

Hi Bruno,

Nodes are getting evicted daily. The workload consists of creating a directory, writing output data from the analysis to that directory, and then reading those data back from it. The problem is reproduced just by doing 'ls' on the problematic directory.

We tried testing nohz=off but the problem still persists. Most of IU's clients are 1.8.9 clients running kernel version 2.6.18-348.3.1.el5_lustre.g171bd56, which isn't tickless, so setting nohz=off would have no impact there. IU tested on their 2.1.6 client, which is running kernel 2.6.32-358.11.1.el6.x86_64, and a long listing against the known problematic directory still results in a hang.

Here is a brief explanation from IU of what they are seeing and experiencing.

we're actually experiencing two issues, or at least different manifestations of a single problem. First, we have clients that hang when stat'ing certain directories and eventually get evicted from the MDS, and only the MDS. These clients log "The mds_connect operation failed with -16" until they detect that they've been evicted by the MDS.

Second, we're seeing clients evicted from OSSes with "lock blocking callback time out: rc -107", which usually is not accompanied by an eviction from the MDS. Often 2-3 clients will get evicted in the same second; see the samples below:

Feb  7 15:42:32 dc2oss16 kernel: : LustreError: 138-a: dc2-OST009f: A client on nid 10.10.0.46@tcp was evicted due to a lock completion callback time out: rc -107
Feb  7 15:43:52 dc2oss16 kernel: : LustreError: 138-a: dc2-OST00a5: A client on nid 10.10.0.70@tcp was evicted due to a lock blocking callback time out: rc -107

Feb  7 16:38:46 dc2oss13 kernel: : LustreError: 138-a: dc2-OST007e: A client on nid 10.10.0.192@tcp was evicted due to a lock completion callback time out: rc -107
Feb  7 16:38:46 dc2oss13 kernel: : LustreError: 138-a: dc2-OST0082: A client on nid 10.10.0.250@tcp was evicted due to a lock completion callback time out: rc -107

Feb  7 17:25:40 dc2oss03 kernel: : LustreError: 138-a: dc2-OST001b: A client on nid 10.10.0.247@tcp was evicted due to a lock blocking callback time out: rc -107
Feb  7 17:25:40 dc2oss03 kernel: : LustreError: 138-a: dc2-OST0019: A client on nid 10.10.0.94@tcp was evicted due to a lock blocking callback time out: rc -107
Feb  7 17:25:40 dc2oss13 kernel: : LustreError: 138-a: dc2-OST007e: A client on nid 149.165.226.203@tcp was evicted due to a lock blocking callback time out: rc -107

Thank You,
Manish

Comment by Bruno Faccini (Inactive) [ 21/Feb/14 ]

Manish, can you provide at least the client's debug log upon eviction? Having a big trace buffer configured and dlmtrace+rpctrace enabled (enabling all debug flags would be ideal!) would be very helpful.

Comment by Manish Patel (Inactive) [ 21/Feb/14 ]

Client and Server logs with Client trace logs. run6-client.out file.

Comment by Manish Patel (Inactive) [ 21/Feb/14 ]

Hi Bruno,

I already have those traces, which were collected at that time. I have attached them here.

Log files are:
2014-02-11-run6-client.out.gz
server_log_2_10_2014.log
xfer_client_2_10_2014.log

Let me know if that sheds some light on it, otherwise we will collect a new set of logs. Let me know what buffer size we should set and what is the right way to get those dlmtrace+rpctrace logs.

Thank You,
Manish

Comment by Bruno Faccini (Inactive) [ 24/Feb/14 ]

Thanks Manish, I will review these logs/infos and let you know asap if I can get something from them.

Comment by Sebastien Piechurski [ 25/Feb/14 ]

Hi Bruno,

I am coming back to this issue. Today I was able to test the following:

  • Use of the I_MPI_WAIT_MODE=1 variable -> did not change the behavior at all (still some LDLM actions taking hundreds of seconds, leading to evictions)
  • Use of clients with the patch from LU-4495, patch set 3: same thing.

I am still reproducing the issue very consistently on every run.

The customer is running their jobs by clearing the LDLM LRU each time files are created or copied from one node, before they are read from another set of nodes, but this is not compatible with the development of new codes and is suboptimal.

What can we do to move forward on the resolution of this issue?
I guess having a setup that reproduces the issue the way ours does would be a big step, as you do not seem to reproduce it the same way we do. This would also allow us to compare what differs between the two setups.

Comment by Patrick Farrell (Inactive) [ 25/Feb/14 ]

Sebastien - You must've missed my latest comment in the Gerrit review of LU-4495. Patch set 3 does NOT resolve the issue. Patch set 1, which takes a different approach, does. You'll have to try that one if you want to rule out LU-4495.

Comment by Sebastien Piechurski [ 26/Feb/14 ]

Hi Patrick,

Thanks for your comment. I indeed did not see the comment in Gerrit saying that only patch set 1 solves the issue.
I just recompiled with patch set 1 and retried my reproducer with it.
Unfortunately this still does not solve our issue. Some operations still take up to 100 seconds (sometimes less), leading to evictions.
Thanks for your help anyway.

Comment by Bruno Faccini (Inactive) [ 26/Feb/14 ]

Humm, I think this is definitely proof of what I strongly suspected and affirmed from the beginning, when people wanted to link this LU-4112 problem with the one now tracked in LU-4495.

Seb: I will try to come up with a more acceptable solution/work-around by the end of the week.

Comment by Bruno Faccini (Inactive) [ 05/Mar/14 ]

Hello Manish, sorry for the delay in finishing the analysis of the logs you uploaded. Unfortunately it is impossible to conclude anything from them, mainly because the client Lustre debug log you provided is not from the same day as the servers' syslogs/messages, and also does not contain the client's eviction timeframe.

If you want us to pursue this, you will need to provide at least the client Lustre debug log upon eviction, by first configuring a big log buffer (echo 2048 > /proc/sys/lnet/debug_mb) and at least enabling dlmtrace+rpctrace traces (echo +dlmtrace +rpctrace > /proc/sys/lnet/debug); full debug traces (echo -1 > /proc/sys/lnet/debug) would be even better. Also, if you want the debug log to be dumped to disk upon eviction you can enable that (echo 1 > /proc/sys/lustre/dump_on_eviction); otherwise you will have to request it manually (lctl dk <file>).
Last, if you already have an idea about how to reproduce the evictions, it would be very useful to temporarily configure the same settings (buffer size, traces) on the server side, so the same data can be collected there right after any client eviction (you will have to request the debug log dump by hand with "lctl dk <file>").
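
Collected together, the steps above look like the following on the client (and, temporarily, on the servers), using the standard /proc paths for these 2.x versions:

echo 2048 > /proc/sys/lnet/debug_mb                 # enlarge the debug buffer
echo +dlmtrace +rpctrace > /proc/sys/lnet/debug     # or: echo -1 for full debug
echo 1 > /proc/sys/lustre/dump_on_eviction          # auto-dump the log on eviction
# ... reproduce ...
lctl dk /tmp/lustre-debug.log                       # manual dump (needed on the servers)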

Last, since there is still no clear evidence that you are suffering from the same problem as the original one in this ticket, I am convinced that you had better open a specific ticket to address your own issue.

Comment by Bruno Faccini (Inactive) [ 05/Mar/14 ]

Seb: to ensure I did not miss anything during your/our earlier testing/reproducing/debugging sessions, I again went through all the infos/logs/crash-dumps/... and any material we generated before.

And going deeper into the crash-dump that you generated on a client upon eviction (with your Lustre distro including the patch introducing an LBUG() in ptlrpc_invalidate_import_thread() when obd_dump_on_eviction != 0), I found something really interesting that may give a new direction to investigate (a possible live-lock/scheduling issue during cl_lock mutex operations that may cause the timeouts/evictions ...).

I know it is more work for you, but since the problem only really reproduces on your platform, could you re-run the reproducer on a set of clients running with the same patch, in order to provide me with a new crash-dump to analyze and help me confirm this?

Comment by Bruno Faccini (Inactive) [ 31/Mar/14 ]

Seb, can you give us an update? Thanks in advance.

Comment by Sebastien Piechurski [ 15/Apr/14 ]

Hi Bruno,

Sorry I was not able to give an update earlier.
I had the people onsite take debug traces on a reproduced occurrence of the problem.
I have uploaded them to the ftp site under uploads/LU-4112/LU-4112-occurence-20140401.tar.xz
Unfortunately, I forgot to tell them to get a crash dump of the clients as well.
Tell me if the debug traces are not enough, and I will arrange to get new traces along with a crash dump.

Comment by Bruno Faccini (Inactive) [ 21/Apr/14 ]

Yes, a crash-dump is required to confirm the behavior found in the debug traces, and also what I think I found in the earlier crash-dump taken while running with the instrumented Lustre modules, where you inserted an LBUG() in ptlrpc_invalidate_import_thread() when obd_dump_on_eviction != 0. Can you do the same?

Comment by Sebastien Piechurski [ 25/Apr/14 ]

Hi Bruno,

I am currently uploading an archive on the ftp site under uploads/LU-4112/LU-4112-complete-dumps-20140425.tar.bz2 which contains:

  • crash dumps of all involved nodes
  • Lustre traces (all debug flags enabled) for each node as well. The one for sid683, which is the client that was crashed at eviction time, was extracted from the dump, but for an unknown reason does not contain the last traces, so I included the last lines from the console in the file final_message.sid683
  • nids: a file to match nids to node names
  • nodes: the role of each node
  • the vmlinux and System.map files
  • the lustre modules

This was reproduced with the kernel (2.6.32-358.11.1.el6_lustre) and lustre version (vanilla 2.1.6) from the downloads.whamcloud.com site except for the LBUG() call introduced in ptlrpc_invalidate_import_thread().
I was also able to reproduce the problem with version 2.4.3.

Note that the nodes used this time have a 10-core (40-thread) version of the processor, instead of the 12-core version used for the first occurrences, and that it took me many more attempts to reproduce, whereas it was systematically reproduced with the 12-core (48-thread) version.
This leads me to think that we are close to a limit that exposes the scheduling problem.

I hope you will be able to confirm your thoughts with these traces.
Tell me if anything is missing.

Comment by Bruno Faccini (Inactive) [ 30/Apr/14 ]

Hello Seb,
Thanks again for the extra work this caused you ... I am currently working on this new material and will let you know soon if it confirms my latest idea.

Comment by Bruno Faccini (Inactive) [ 15/May/14 ]

Seb, just to let you know that your latest info confirmed my earlier thoughts. I will try to give an update on where I am in the "livelock" analysis by the end of today.

Comment by Bruno Faccini (Inactive) [ 22/May/14 ]

Humm I am late again, but this is a really tricky problem to debug from crash-dumps and lustre debug-logs ...
Seb, I wonder if you could re-run the same last test session with the OWNER_SPIN sched_feature disabled?
This can be done as follows:

# sysctl -n kernel.sched_features
3183
# sysctl -w kernel.sched_features=1135
kernel.sched_features = 1135
# sysctl -n kernel.sched_features
1135
#

BTW, Bull's kernel version (2.6.32-279.5.2.bl6.Bull.36.x86_64) that seems to be installed on-site does not allow this, but the kernel versions (>= 2.6.32-358.11.1.el6_lustre.x86_64) we ship with Lustre 2.1.6/2.4.3 from our downloads area are OK.
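
For reference, the two values above differ by a single bit: 3183 ^ 1135 = 2048, so the sysctl clears exactly one feature flag (the OWNER_SPIN bit on this kernel, per the procedure above). A quick way to check the arithmetic:

printf 'cleared feature bit: %d\n' $((3183 ^ 1135))    # prints 2048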

Comment by Sebastien Piechurski [ 30/May/14 ]

Today I was able to perform new tests.

Running the reproducer on a regular kernel took between 1 and 5 minutes (of course leading to multiple lock callback timeouts after 160 seconds).
Using the kernel.sched_features parameter requires a kernel compiled with CONFIG_SCHED_DEBUG, so I installed one of our debug kernels, which has CONFIG_SCHED_DEBUG among a lot of other DEBUG features.
Running the same reproducer with this kernel and the default sched_features completed the whole run in 20/21 seconds quite consistently.
Running with your kernel (2.6.32-358.11.1.el6_lustre), which has far fewer DEBUG features, returned the following times:

  • between 40 seconds and 2.5 minutes with the default sched_features (=3183)
  • constant time of 15 seconds with sched_features=1135

So this is very encouraging, and I guess we now need to understand the phenomenon more deeply, to see whether there is actually a problem with our platform or whether there is simply something problematic about running this kind of load on systems with lots of cores (3 nodes x 24 cores in this case).

Comment by James A Simmons [ 14/Aug/16 ]

Old blocker for unsupported version
