[LU-4786] Apparent denial of service from client to mdt Created: 20/Mar/14  Updated: 24/Nov/14  Resolved: 18/Aug/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.5.3

Type: Bug Priority: Minor
Reporter: Manish Patel (Inactive) Assignee: Oleg Drokin
Resolution: Fixed Votes: 0
Labels: None

Attachments: Text File kern_logs_20140325.tar.gz     File kern_logs_20140328.tar.gz     File kern_logs_20140331.tar.gz     File kern_logs_20140401.tar.gz     File kern_logs_20140529.tar.gz     File kern_logs_20140807.tar.gz     File lnet_test.tar     File logs_client_hgs4.tar.gz     Text File logs_client_other.tar.gz     Text File logs_client_other_headnodes.tar.gz     Text File logs_server.tar.gz    
Issue Links:
Related
is related to LU-3321 2.x single thread/process throughput ... Resolved
is related to LU-4579 Timeout system horribly broken Resolved
Severity: 3
Rank (Obsolete): 13165

 Description   

We generally use "1.8.9wc1-sanger1-PRISTINE-2.6.32-44-server” as our standard client however we are using "lustre: 2.5.0,kernel: patchless_client,build: v2_5_0_0CHANGED-3.8.0-34-generic” on a small number of systems ( hgs4 ) which are also running docker and lxc, we do have a tiny amount of "v2_5_0_0-CHANGED-3.2.0-59-virtual"

Scratch 109, 110 and 113 all run "2.2.0--PRISTINE-2.6.32-gmpc-2.6.32-220.el6-lustre.2.2" on the servers.

We have recovered access to a filesystem which was effectively unavailable for around 24 hours. We have been supported by DDN ( in particular Sven Tautmann ) during the investigation of the issue, and the state of our other filesystems was discussed.

We have a number of Lustre systems and we saw a similar issue on scratch113, scratch110 and finally scratch109. The symptom was that new access to the filesystem was lost. We determined that the issue was with the MDT, so we took the approach of stopping the MDT, removing the Lustre modules and restarting the MDT. This appeared to allow us to regain access, although we did have to abort recovery on the MDS on some of the systems after they had been stuck in recovery for more than 30 minutes. Normally we would look for a user job that was causing the issue, however nothing was obviously apparent.
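
For illustration only, a rough sketch of that sequence with hypothetical device names ( not the exact commands we ran ):

umount /export/MDT                                     # stop the MDT (hypothetical mount point)
lustre_rmmod                                           # unload all Lustre modules
mount -t lustre /dev/mapper/<mdt-device> /export/MDT   # restart the MDT service
lctl --device <mdt-device-index> abort_recovery        # only if recovery hangs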

With scratch109, which is one of our general-purpose filesystems and holds our reference genomes ( and therefore affects our production pipelines ), restarting the MDT service did not solve the issue: we saw the system fail almost immediately after we aborted the recovery.

With assistance from DDN we took the following approach ( this took place during the day of the 18th of March ):

1) cleared up some old configuration from scratch109 MDS config:

The following was seen in the log: "Mar 18 08:46:04 lus09-mds1 kernel: LustreError: 3192:0:(mgs_handler.c:785:mgs_handle_fslog_hack()) Invalid logname received: params"

root@lus09-mds1:/# mount -t ldiskfs /dev/mapper/lus09-mgs-lus09 /export/MGS
root@lus09-mds1:/export/MGS# mkdir old/CONFIGS
root@lus09-mds1:/export/MGS# mv CONFIGS/params old/CONFIGS/
root@lus09-mds1:/export/MGS# mv CONFIGS/lus09-params old/CONFIGS/
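
( For completeness, the MGS target would then be unmounted and remounted normally; a hypothetical example of the follow-up, not a transcript of what we ran: )

root@lus09-mds1:/export/MGS# cd / && umount /export/MGS
root@lus09-mds1:/# mount -t lustre /dev/mapper/lus09-mgs-lus09 /export/MGS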

2) We saw issues with the uid/gid lookup, for example:

Mar 18 12:10:02 lus09-mds2 kernel: LustreError: 13888:0:(upcall_cache.c:236:upcall_cache_get_entry()) Skipped 208 previous similar messages
Mar 18 13:31:17 lus09-mds2 kernel: Lustre: 4134:0:(mdt_handler.c:4767:mdt_process_config()) For 1.8 interoperability, skip this mdt.group_upcall. It is obsolete

We then moved the system to a new LDAP caching layer, which did appear to clear those messages up.

3) We then used iptables to limit access to the MDS to a small collection of clients and showed that the MDT and those clients were stable. We then spent a number of hours reintroducing clients to the MDT until all bar hgs4 ( our 3.8-kernel, Lustre 2.5 clients ) had access to the filesystem.
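
For illustration, a minimal sketch of the style of iptables rules involved ( assuming Lustre's default TCP port 988; the trusted client range here is hypothetical ):

iptables -A INPUT -p tcp --dport 988 -s 172.17.125.0/24 -j ACCEPT   # allow a trusted client range
iptables -A INPUT -p tcp --dport 988 -j DROP                        # block all other clients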

We would like to understand why the MDS failed, why it did not recover in any useful amount of time when it was restarted, and whether this is related to either the workload or the 2.5.0 clients. We have attached the output of "ddn_lustre_showall.sh" from each of the server systems, hgs4 and a random collection of clients ( to provide a baseline ).

The following are a number of observations:

On the client we saw:

Mar 18 09:14:49 hgs4a kernel: [71024.072174] LustreError: 6751:0:(import.c:323:ptlrpc_invalidate_import()) lus13-OST0013_UUID: rc = -110 waiting for callback (1 != 0)
Mar 18 09:14:49 hgs4a kernel: [71024.072247] LustreError: 6751:0:(import.c:349:ptlrpc_invalidate_import()) @@@ still on sending list req@ffff881fe470ac00 x1462829581185164/t0(0) o4->lus13-OST0013-osc-ffff882fe7cbcc00@172.17.117.147@tcp:6/4 lens 488/448 e 0 to 0 dl 1395076147 ref 2 fl Rpc:RE/0/ffffffff rc -5/-1
Mar 18 09:14:49 hgs4a kernel: [71024.072347] LustreError: 6751:0:(import.c:365:ptlrpc_invalidate_import()) lus13-OST0013_UUID: RPCs in "Unregistering" phase found (0). Network is sluggish? Waiting them to error out.

While on the server the following was interesting

Mar 17 11:52:15 lus13-mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 1211s: evicting client at 172.17.125.17@tcp ns: mdt-ffff88082ad49000 lock: ffff88047a8be6c0/0xce376efee12486e2 lrc: 3/0,0 mode: PR/PR res: 8589950579/1911 bits 0x3 rrc: 5 type: IBT flags: 0x4000020 remote: 0xefde62456d81d72a expref: 851 pid: 14082 timeout: 6100853168
Mar 17 12:12:13 lus13-mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 959s: evicting client at 172.17.125.15@tcp ns: mdt-ffff88082ad49000 lock: ffff880712d596c0/0xce376efee25dddbe lrc: 3/0,0 mode: PR/PR res: 26214401/3857707524 bits 0x3 rrc: 896 type: IBT flags: 0x4000020 remote: 0xc6c5372816307bb3 expref: 821 pid: 5438 timeout: 6102051020
Mar 17 12:18:26 lus13-mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 226s: evicting client at 172.17.125.15@tcp ns: mdt-ffff88082ad49000 lock: ffff880d85b62d80/0xce376efee273fb44 lrc: 3/0,0 mode: PR/PR res: 26214401/3857707524 bits 0x3 rrc: 1023 type: IBT flags: 0x4000020 remote: 0xc6c5372816308243 expref: 7 pid: 14047 timeout: 6102424312
Mar 17 12:53:40 lus13-mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 427s: evicting client at 172.17.125.18@tcp ns: mdt-ffff88082ad49000 lock: ffff8806be599900/0xce376efee28b9046 lrc: 3/0,0 mode: PR/PR res: 26214401/3857707524 bits 0x3 rrc: 595 type: IBT flags: 0x4000020 remote: 0x90884a7b57617f83 expref: 814 pid: 14089 timeout: 6104538355
Mar 17 13:02:09 lus13-mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 848s: evicting client at 172.17.139.124@tcp ns: mdt-ffff88082ad49000 lock: ffff8806170ee6c0/0xce376efee28c39ff lrc: 3/0,0 mode: PR/PR res: 26214401/3857707524 bits 0x3 rrc: 759 type: IBT flags: 0x4000020 remote: 0xc2f2926bc4355241 expref: 7 pid: 5296 timeout: 6105047775
Mar 17 13:48:43 lus13-mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 1455s: evicting client at 172.17.125.18@tcp ns: mdt-ffff88082ad49000 lock: ffff88078f3d1900/0xce376efee2a43b88 lrc: 3/0,0 mode: PR/PR res: 26214401/3857707524 bits 0x3 rrc: 899 type: IBT flags: 0x4000020 remote: 0x90884a7b57619c01 expref: 24 pid: 5468 timeout: 6107841367
Mar 17 14:15:20 lus13-mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 1597s: evicting client at 172.17.7.98@tcp ns: mdt-ffff88082ad49000 lock: ffff88074ab79240/0xce376efee2ac4cab lrc: 3/0,0 mode: PR/PR res: 26214401/3857707524 bits 0x3 rrc: 912 type: IBT flags: 0x4000030 remote: 0x91b7902dc49a4f87 expref: 625 pid: 5533 timeout: 6109438527
Mar 17 15:00:10 lus13-mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 162s: evicting client at 172.17.125.19@tcp ns: mdt-ffff88082034f000 lock: ffff881007719240/0xc336d9366cbafe36 lrc: 3/0,0 mode: PR/PR res: 26214401/3857707524 bits 0x1 rrc: 576 type: IBT flags: 0x4000020 remote: 0x4126dea1636f365 expref: 556 pid: 7107 timeout: 4296236743
Mar 18 12:51:46 lus13-mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 454s: evicting client at 172.17.125.15@tcp ns: mdt-ffff88082034f000 lock: ffff88079a183d80/0xc336d9367b259007 lrc: 3/0,0 mode: PR/PR res: 26214401/3857707524 bits 0x3 rrc: 242 type: IBT flags: 0x4000020 remote: 0x39f2e4c10c71e35f expref: 52 pid: 7113 timeout: 4374932968
Mar 18 14:53:38 lus13-mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 779s: evicting client at 172.17.125.15@tcp ns: mdt-ffff88082034f000 lock: ffff880fe2393d80/0xc336d9367c7c0550 lrc: 3/0,0 mode: PR/PR res: 26214401/3857707524 bits 0x3 rrc: 973 type: IBT flags: 0x4000020 remote: 0x9e99423585827dd3 expref: 808 pid: 7202 timeout: 4382244335
Mar 18 15:02:04 lus13-mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 506s: evicting client at 172.17.125.15@tcp ns: mdt-ffff88082034f000 lock: ffff881014370b40/0xc336d9367c88fe44 lrc: 3/0,0 mode: PR/PR res: 26214401/3857707524 bits 0x3 rrc: 1025 type: IBT flags: 0x4000030 remote: 0x9e99423585827ed6 expref: 5 pid: 5093 timeout: 4382750051
Mar 18 15:56:26 lus13-mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 1208s: evicting client at 172.17.125.15@tcp ns: mdt-ffff88082034f000 lock: ffff881006da2480/0xc336d9367d16b067 lrc: 3/0,0 mode: PR/PR res: 26214401/3857707524 bits 0x3 rrc: 316 type: IBT flags: 0x4000020 remote: 0x9e994235858281e6 expref: 14 pid: 7275 timeout: 4386012442
Mar 18 16:12:39 lus13-mds2 kernel: LustreError: 21100:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 973s: evicting client at 172.17.125.15@tcp ns: mdt-ffff88082034f000 lock: ffff881015642900/0xc336d9367d5ecb60 lrc: 3/0,0 mode: PR/PR res: 26214401/3857707524 bits 0x3 rrc: 669 type: IBT flags: 0x4000020 remote: 0x9e994235858282fe expref: 13 pid: 7116 timeout: 4386985544

And similarly on lus09

Mar 18 03:12:28 lus09-mds2 kernel: LustreError: 5245:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 10808s: evicting client at 172.17.143.41@tcp ns: mdt-ffff881819bc3000 lock: ffff880b1f8936c0/0x5e448803b245d974 lrc: 3/0,0 mode: PR/PR res: 411041793/558694943 bits 0x3 rrc: 1001 type: IBT flags: 0x4000030 remote: 0xdd627947772d3070 expref: 28 pid: 5399 timeout: 4328657010
Mar 18 03:32:29 lus09-mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 12009s: evicting client at 172.17.140.69@tcp ns: mdt-ffff881819bc3000 lock: ffff8816d6c0eb40/0x5e448803b245dad9 lrc: 3/0,0 mode: PR/PR res: 411041793/558694943 bits 0x3 rrc: 1004 type: IBT flags: 0x4000030 remote: 0x8bba98044cf50445 expref: 16 pid: 5453 timeout: 4329858036
Mar 18 03:52:30 lus09-mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 13210s: evicting client at 172.17.27.140@tcp ns: mdt-ffff881819bc3000 lock: ffff880b21f7a480/0x5e448803b245dbb9 lrc: 3/0,0 mode: PR/PR res: 411041793/558694943 bits 0x3 rrc: 1006 type: IBT flags: 0x4000030 remote: 0x4620c27f480e4650 expref: 17 pid: 5224 timeout: 4331059019
Mar 18 04:12:31 lus09-mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 14411s: evicting client at 172.17.48.147@tcp ns: mdt-ffff881819bc3000 lock: ffff880b2180d900/0x5e448803b245dc29 lrc: 3/0,0 mode: PR/PR res: 411041793/558694943 bits 0x3 rrc: 1003 type: IBT flags: 0x4000030 remote: 0x7f400f68202358bc expref: 18 pid: 5248 timeout: 4332260008
Mar 18 04:32:32 lus09-mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 15612s: evicting client at 172.17.27.156@tcp ns: mdt-ffff881819bc3000 lock: ffff880b20e34900/0x5e448803b245dc4c lrc: 3/0,0 mode: PR/PR res: 411041793/558694943 bits 0x3 rrc: 1003 type: IBT flags: 0x4000030 remote: 0xdd23ca15f588f2f7 expref: 11 pid: 5040 timeout: 4333461005
Mar 18 04:52:33 lus09-mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 16813s: evicting client at 172.17.140.22@tcp ns: mdt-ffff881819bc3000 lock: ffff880b15dd5480/0x5e448803b245dc68 lrc: 3/0,0 mode: PR/PR res: 411041793/558694943 bits 0x3 rrc: 1004 type: IBT flags: 0x4000030 remote: 0xc21e8f4395007441 expref: 24 pid: 5030 timeout: 4334662005
Mar 18 05:12:34 lus09-mds2 kernel: LustreError: 8644:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 18014s: evicting client at 172.17.140.24@tcp ns: mdt-ffff881819bc3000 lock: ffff880b21811b40/0x5e448803b245dcbc lrc: 3/0,0 mode: PR/PR res: 411041793/558694943 bits 0x3 rrc: 1005 type: IBT flags: 0x4000030 remote: 0x72efc45909efcb5 expref: 22 pid: 5445 timeout: 4335863011
Mar 18 05:32:35 lus09-mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 19215s: evicting client at 172.17.140.78@tcp ns: mdt-ffff881819bc3000 lock: ffff8816e4d81480/0x5e448803b245dd1e lrc: 3/0,0 mode: PR/PR res: 411041793/558694943 bits 0x3 rrc: 1008 type: IBT flags: 0x4000030 remote: 0x37ca8d5924c467b8 expref: 15 pid: 4157 timeout: 4337064016
Mar 18 05:52:36 lus09-mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 20416s: evicting client at 172.17.115.125@tcp ns: mdt-ffff881819bc3000 lock: ffff880b15615d80/0x5e448803b245dd87 lrc: 3/0,0 mode: PR/PR res: 411041793/558694943 bits 0x3 rrc: 1008 type: IBT flags: 0x4000030 remote: 0x39f8634c334bd35f expref: 19 pid: 5288 timeout: 4338265010
Mar 18 06:12:37 lus09-mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 21617s: evicting client at 172.17.140.31@tcp ns: mdt-ffff881819bc3000 lock: ffff880b20e8f900/0x5e448803b245ddcd lrc: 3/0,0 mode: PR/PR res: 411041793/558694943 bits 0x3 rrc: 1014 type: IBT flags: 0x4000030 remote: 0xe18d86102d68e1c2 expref: 22 pid: 5211 timeout: 4339466035
Mar 18 06:32:38 lus09-mds2 kernel: LustreError: 8642:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 22818s: evicting client at 172.17.143.37@tcp ns: mdt-ffff881819bc3000 lock: ffff880b20a0a000/0x5e448803b245ded0 lrc: 3/0,0 mode: PR/PR res: 411041793/558694943 bits 0x3 rrc: 1013 type: IBT flags: 0x4000030 remote: 0x26a60eb067b19ece expref: 13 pid: 5323 timeout: 4340667019
Mar 18 06:52:39 lus09-mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 24019s: evicting client at 172.17.140.25@tcp ns: mdt-ffff881819bc3000 lock: ffff8816e4ee5900/0x5e448803b245e00b lrc: 3/0,0 mode: PR/PR res: 411041793/558694943 bits 0x3 rrc: 1016 type: IBT flags: 0x4000030 remote: 0xbb9b59a85baadc32 expref: 20 pid: 5346 timeout: 4341868025
Mar 18 07:12:40 lus09-mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 25220s: evicting client at 172.17.140.26@tcp ns: mdt-ffff881819bc3000 lock: ffff880b1b902240/0x5e448803b245e03c lrc: 3/0,0 mode: PR/PR res: 411041793/558694943 bits 0x3 rrc: 1015 type: IBT flags: 0x4000030 remote: 0xe8399533ab3b360c expref: 20 pid: 5375 timeout: 4343069015
Mar 18 07:32:41 lus09-mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 26421s: evicting client at 172.17.7.141@tcp ns: mdt-ffff881819bc3000 lock: ffff8816e3802d80/0x5e448803b245e100 lrc: 3/0,0 mode: PR/PR res: 411041793/558694943 bits 0x3 rrc: 1017 type: IBT flags: 0x4000030 remote: 0x7c7cfa487563f6e6 expref: 16 pid: 5034 timeout: 4344270019
Mar 18 07:52:42 lus09-mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 27622s: evicting client at 172.17.30.61@tcp ns: mdt-ffff881819bc3000 lock: ffff8816e1f3b480/0x5e448803b245e1cb lrc: 3/0,0 mode: PR/PR res: 411041793/558694943 bits 0x3 rrc: 1018 type: IBT flags: 0x4000030 remote: 0xc61848c213a21135 expref: 22 pid: 5262 timeout: 4345471015
Mar 18 08:12:43 lus09-mds2 kernel: LustreError: 8642:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 28823s: evicting client at 172.17.119.20@tcp ns: mdt-ffff881819bc3000 lock: ffff880b21da5480/0x5e448803b245e591 lrc: 3/0,0 mode: PR/PR res: 411041793/558694943 bits 0x3 rrc: 1023 type: IBT flags: 0x4000030 remote: 0x69128bab309e0a76 expref: 37 pid: 4994 timeout: 4346672050
Mar 18 08:32:44 lus09-mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 30024s: evicting client at 172.17.12.65@tcp ns: mdt-ffff881819bc3000 lock: ffff880b220666c0/0x5e448803b245e647 lrc: 3/0,0 mode: PR/PR res: 411041793/558694943 bits 0x3 rrc: 1019 type: IBT flags: 0x4000030 remote: 0xbfcd2a0005c3b47a expref: 10 pid: 5356 timeout: 4347873010
Mar 18 08:52:45 lus09-mds2 kernel: LustreError: 8642:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 20417s: evicting client at 172.17.143.60@tcp ns: mdt-ffff881819bc3000 lock: ffff8816e3c25000/0x5e448803b2461293 lrc: 3/0,0 mode: PR/PR res: 411041793/558694943 bits 0x3 rrc: 1167 type: IBT flags: 0x4000030 remote: 0x9a1bba4fa51a1301 expref: 31 pid: 5453 timeout: 4349074754
Mar 18 09:12:48 lus09-mds2 kernel: LustreError: 8645:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 1201s: evicting client at 172.17.125.21@tcp ns: mdt-ffff881819bc3000 lock: ffff8816dad51900/0x5e448803b2467901 lrc: 3/0,0 mode: PR/PR res: 411041793/558694943 bits 0x3 rrc: 1044 type: IBT flags: 0x4000020 remote: 0x1d8fffd2b0c2971f expref: 19 pid: 5313 timeout: 4350277824

The length of the timers on the locks does appear to be excessive; having a timer expire after 8 hours seems less than optimal and might be related to LU-4579.

It appeared that the problem was migrating around the filesystems and that, for example, an eviction from one system would cause an eviction from another system.


Mar 17 12:12:13 hgs4a kernel: [1213093.322040] LustreError: 167-0: lus13-MDT0000-mdc-ffff882fe0884400: This client was evicted by lus13-MDT0000; in progress operations using this service will fail.
Mar 17 12:12:21 hgs4a kernel: [1213101.030435] LustreError: 167-0: lus10-MDT0000-mdc-ffff882fac577c00: This client was evicted by lus10-MDT0000; in progress operations using this service will fail.
Mar 17 12:14:40 hgs4a kernel: [1213239.422115] LustreError: 167-0: lus11-MDT0000-mdc-ffff882f8ac5dc00: This client was evicted by lus11-MDT0000; in progress operations using this service will fail.
Mar 17 12:18:26 hgs4a kernel: [1213465.907318] LustreError: 167-0: lus13-MDT0000-mdc-ffff882fe0884400: This client was evicted by lus13-MDT0000; in progress operations using this service will fail.
Mar 17 12:22:05 hgs4a kernel: [1213684.252829] LustreError: 167-0: lus10-MDT0000-mdc-ffff882fac577c00: This client was evicted by lus10-MDT0000; in progress operations using this service will fail.
Mar 17 12:26:56 hgs4a kernel: [1213974.940999] LustreError: 167-0: lus08-MDT0000-mdc-ffff882fe0886c00: This client was evicted by lus08-MDT0000; in progress operations using this service will fail.
Mar 17 12:33:53 hgs4a kernel: [1214391.253811] LustreError: 167-0: lus10-MDT0000-mdc-ffff882fac577c00: This client was evicted by lus10-MDT0000; in progress operations using this service will fail.

In hindsight, it does appear that our Lustre 2.5 clients get more evictions than our 1.8.9 clients do:
host count percent
hgs4g.internal.sanger.ac.uk 158 4.887102 ( 2.5.0 )
hgs4k.internal.sanger.ac.uk 155 4.794309 ( 2.5.0 )
hgs4c.internal.sanger.ac.uk 142 4.392205 ( 2.5.0 )
hgs4a.internal.sanger.ac.uk 61 1.886792 ( 2.5.0 )
hgs4d.internal.sanger.ac.uk 14 0.433034 ( 2.5.0 )
deskpro100563.internal.sanger.ac.uk 13 0.402103 ( 1.8.9 however not in the datacenter)
hgs4e.internal.sanger.ac.uk 12 0.371172 ( 2.5.0 )
pcs5a.internal.sanger.ac.uk 7 0.216517 ( 1.8.9 shares networking with hgs4…. )
lustre-utils01.internal.sanger.ac.uk 7 0.216517 ( 2.5.1-RC1 and head )
uk10k-farm-srv1.internal.sanger.ac.uk 6 0.185586 ( 1.8.9 )
pcs-genedb-dev.internal.sanger.ac.uk 6 0.185586 ( 1.8.9 a virtual machine )
ht-1-1-15.internal.sanger.ac.uk 6 0.185586 ( 1.8.8 !! )
bc-20-1-01.internal.sanger.ac.uk 6 0.185586 ( 1.8.8 !! )
vr-1-1-05.internal.sanger.ac.uk 5 0.154655 ( 1.8.9 )
....



 Comments   
Comment by James Beal [ 20/Mar/14 ]

Given that we suspected the clients on hgs4 of causing issues, we decided to restrict the access of those machines to just scratch111 (2.4.2 with LU-4222) and scratch113 ( Lustre 2.2 ), which are the specific filesystems allocated to the group that uses hgs4. Not long after access was granted, the MDT of scratch113 started to have issues and was completely unserviceable this morning. We have put iptables rules in place to restrict access from hgs4, and access to the filesystem has been restored ( after aborting recovery, having waited 25 minutes ). We have a new filesystem, scratch114 ( Lustre 2.2 ), which we will try to use to isolate the issue.

Comment by James Beal [ 20/Mar/14 ]

Various log files and a set of lnet self test results.

Comment by Oleg Drokin [ 20/Mar/14 ]

Can you describe your filesystems in a bit more detail please? For example, do you have hundreds of OSTs?

Comment by James Beal [ 20/Mar/14 ]

scratch113 has 40 OSTs on 8 OSSs.

The attachments have more logs than is probably useful

Scratch113 is currently very full, although we saw the same issue on scratch109 and scratch110, which are more reasonably utilized.

It is interesting to note that with this morning's issue with hgs4 the timeline looked something like:

08:23 firewall hgs4 from lus13
08:26 lustre modules unloaded.
08:27 lustre modules loaded ( mdt service started ).
09:13 recovery aborted.

And the system was mostly usable by 9:30.

Looking at the logs for scratch111, which is the other filesystem hgs4 had mounted, there were issues reported until around 09:34:34, when hgs4a was evicted from scratch111.

lfs df /lustre/scratch113
UUID 1K-blocks Used Available Use% Mounted on
lus13-MDT0000_UUID 439366532 29978312 380091580 7% /lustre/scratch113[MDT:0]
lus13-OST0000_UUID 22935567680 20854637240 933780104 96% /lustre/scratch113[OST:0]
lus13-OST0001_UUID 22935567680 20817505556 970917932 96% /lustre/scratch113[OST:1]
lus13-OST0002_UUID 22935567680 20990514020 797905372 96% /lustre/scratch113[OST:2]
lus13-OST0003_UUID 22935567680 21262439720 525981720 98% /lustre/scratch113[OST:3]
lus13-OST0004_UUID 22935567680 21047320248 741101192 97% /lustre/scratch113[OST:4]
lus13-OST0005_UUID 22935567680 20866383892 922023212 96% /lustre/scratch113[OST:5]
lus13-OST0006_UUID 22935567680 21040308464 748050512 97% /lustre/scratch113[OST:6]
lus13-OST0007_UUID 22935567680 20697172868 1091244476 95% /lustre/scratch113[OST:7]
lus13-OST0008_UUID 22935567680 20770308620 1018112820 95% /lustre/scratch113[OST:8]
lus13-OST0009_UUID 22935567680 20857702884 930719996 96% /lustre/scratch113[OST:9]
lus13-OST000a_UUID 22935567680 20584897904 1203525540 94% /lustre/scratch113[OST:10]
lus13-OST000b_UUID 22935567680 20881652348 906764996 96% /lustre/scratch113[OST:11]
lus13-OST000c_UUID 22935567680 20630841852 1157575492 95% /lustre/scratch113[OST:12]
lus13-OST000d_UUID 22935567680 20898139204 890256636 96% /lustre/scratch113[OST:13]
lus13-OST000e_UUID 22935567680 20683806712 1104612680 95% /lustre/scratch113[OST:14]
lus13-OST000f_UUID 22935567680 20905437584 882986928 96% /lustre/scratch113[OST:15]
lus13-OST0010_UUID 22935567680 21213940192 574317612 97% /lustre/scratch113[OST:16]
lus13-OST0011_UUID 22935567680 20605668824 1182754660 95% /lustre/scratch113[OST:17]
lus13-OST0012_UUID 22935567680 20656381368 1131973152 95% /lustre/scratch113[OST:18]
lus13-OST0013_UUID 22935567680 20825344820 963041020 96% /lustre/scratch113[OST:19]
lus13-OST0014_UUID 22935567680 20931734060 856672588 96% /lustre/scratch113[OST:20]
lus13-OST0015_UUID 22935567680 20799911712 988511776 95% /lustre/scratch113[OST:21]
lus13-OST0016_UUID 22935567680 21014513412 773908028 96% /lustre/scratch113[OST:22]
lus13-OST0017_UUID 22935567680 21399192148 389203692 98% /lustre/scratch113[OST:23]
lus13-OST0018_UUID 22935567680 20760529812 1027892652 95% /lustre/scratch113[OST:24]
lus13-OST0019_UUID 22935567680 21068011884 720402388 97% /lustre/scratch113[OST:25]
lus13-OST001a_UUID 22935567680 21320938456 467471512 98% /lustre/scratch113[OST:26]
lus13-OST001b_UUID 22935567680 21194621248 593802240 97% /lustre/scratch113[OST:27]
lus13-OST001c_UUID 22935567680 20713308580 1075112508 95% /lustre/scratch113[OST:28]
lus13-OST001d_UUID 22935567680 20661740064 1126648908 95% /lustre/scratch113[OST:29]
lus13-OST001e_UUID 22935567680 20467095412 1321291212 94% /lustre/scratch113[OST:30]
lus13-OST001f_UUID 22935567680 20865537396 922885068 96% /lustre/scratch113[OST:31]
lus13-OST0020_UUID 22935567680 20755964760 1032419032 95% /lustre/scratch113[OST:32]
lus13-OST0021_UUID 22935567680 21008129620 780281168 96% /lustre/scratch113[OST:33]
lus13-OST0022_UUID 22935567680 20755160028 1033262932 95% /lustre/scratch113[OST:34]
lus13-OST0023_UUID 22935567680 21454041488 334383024 98% /lustre/scratch113[OST:35]
lus13-OST0024_UUID 22935567680 20578561808 1209808432 94% /lustre/scratch113[OST:36]
lus13-OST0025_UUID 22935567680 21000937336 787474948 96% /lustre/scratch113[OST:37]
lus13-OST0026_UUID 22935567680 20701957636 1086464828 95% /lustre/scratch113[OST:38]
lus13-OST0027_UUID 22935567680 20502428216 1285996296 94% /lustre/scratch113[OST:39]

filesystem summary: 917422707200 835044719396 36491539284 96% /lustre/scratch113

Comment by Oleg Drokin [ 20/Mar/14 ]

This looks a bit like LU-4570 to me. Your evicted locks are all of PR mode - those can only come from readdir. Apparently they are somehow matched by a client and then never released, causing deadlocks and lock timeouts.
The biggest difference is that in your case there are no "still busy with XXX RPCs" messages visible in your logs.

Am I correct in reading that you only have this issue with 2.2.0 servers, and that 2.4.2 servers appear to be fine?

Do you have some job that makes this happen reliably that you can run? I'd be interested in getting some log traces from the MDT and client to trace one such stuck lock's lifetime on both ends, if feasible.

Comment by James Beal [ 20/Mar/14 ]

We had had a look at LU-4570 and thought it looked interesting.

scratch111 is a 2.4.2 server and we can see messages showing that it is unhappy about the client; however, the users have not reported issues with it, and looking at the throughput graphs it is not clear whether it was affected or not. The logs for lus11 are attached. The job has not been run against scratch111, so we cannot be sure that 2.4.2 is good.

We think we have a job that reliably causes the issue, and we have a new filesystem which we can use to test the theory ( that filesystem will only be available for a short while; if we need a system to test for longer, we have a test system that we can bring into play ).

If you tell us exactly how to produce the traces we will happily do that and attach them to the call.

Comment by Oleg Drokin [ 21/Mar/14 ]

So, ideally you will need to reproduce with as few clients as possible (to contain the log size).

Increase the client debug level on the clients to include as much as possible: +dlmtrace +vfstrace +inode +ha +reada at the very minimum; +trace would be great.
This will produce a lot of internal logs, so you also need to grow the debug buffer size to, say, 1G: echo 1024 >/proc/sys/lnet/debug_mb. That is likely still not going to be enough, so if you have some non-Lustre storage on the clients, you can also run the debug daemon to continuously stream the logs to that storage with: lctl debug_daemon start /path/file_name.bin 10240

On the MDS just add +dlmtrace to the default debug level (echo +dlmtrace > /proc/sys/lnet/debug) and probably also increase the debug buffer size and run the debug daemon, just in case.

Then run your reproducer and, once it hits, stop the debug daemons with lctl debug_daemon stop on all nodes where it runs, collect the files and make them available for download (you can also convert them to text form with lctl df /path/file_name.bin > /somewhere/file_name.log ).
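
Putting this together, a minimal sketch of the client-side sequence ( file paths are placeholders ):

echo +dlmtrace +vfstrace +inode +ha +reada +trace > /proc/sys/lnet/debug
echo 1024 > /proc/sys/lnet/debug_mb                        # grow the debug buffer to 1G
lctl debug_daemon start /local/debug/file_name.bin 10240   # stream debug logs to local, non-Lustre storage
# ... run the reproducer, then on every node:
lctl debug_daemon stop
lctl df /local/debug/file_name.bin > /local/debug/file_name.log   # convert the binary log to text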

Comment by James Beal [ 22/Mar/14 ]

I don't believe we have managed to run the reproduction yet, but those instructions look good, thank you.

Comment by Helen Brimmer [ 24/Mar/14 ]

So, we have been running the pipeline job again today with the increased debug logging against scratch114, which has only just been put into production and has the same configuration as the problematic scratch113. It takes a while to get going, so a lot of the log files have wrapped. After five hours, out of 11 clients running lustre 2.5.0, we've had 3 succumb to LBUG kernel panics (assertion failure in CLS_QUEUING), 3 suffer seemingly permanent evictions from some OSTs and not others (different set for each affected client) and not seen any problems on the MDT (which is what we are waiting to see). The dispatching process for the pipeline is still running, though the I/O activity has dropped to pretty much nil. We are now looking at devising ways to add some extra load to the filesystem to see if we can prompt the lockups to occur.

Comment by Helen Brimmer [ 27/Mar/14 ]

We are planning to try and reproduce the problem with a different set of 2.5.0 clients which have a different kernel version (just to rule it out as a 'difference' with the hgs4 systems). As the filesystem is now stuck, we will need to reboot the MDS. Do you need us to preserve any state from the run with 'hgs4'?

Comment by Oleg Drokin [ 27/Mar/14 ]

No. I am interested in client traces (+ lock timeout info from server logs) only, as that hopefully will answer my question about why the locks are not released there. So it's fine to reboot the MDS.
I'll check your client logs, thanks.

Comment by Oleg Drokin [ 28/Mar/14 ]

It looks like you have the wrong debug level or subsystems set on your clients.
So far I only see three kinds of messages: one coming from ldlm pools, one coming from the tracing in obdclass code, and one doing RPC tracing occasionally.

Sadly, that means there is no useful info in those logs that I can use.

Please double check that when you do cat /proc/sys/lnet/debug on the clients, you see that all desired debug levels are listed in return.

Also please make sure that doing cat /proc/sys/lnet/subsystem_debug you see all subsystems listed like this:
undefined mdc mds osc ost class log llite rpc mgmt filter echo ldlm lov lquota osd lfsck lmv sec gss mgc mgs fid fld

It's kind of bad that the logs accumulate so fast.
Perhaps we really should be filtering based on subsystem (which is almost what you are doing now, I feel).
We can probably remove osc and class from the list of subsystems (by doing echo -osc -cache >/proc/sys/lnet/subsystem_debug) and that should cut down quite a bit of the chatter, at least all of it related to the currently plentiful messages.

How to make sure you got debug logs that contain useful info:
From a mds eviction message like this:

lus14-mds2-mar24-mar25-kern.log:Mar 25 12:42:53 lus14-mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 781s: evicting client at 172.17.125.18@tcp  ns: mdt-ffff88100430e000 lock: ffff880fb8e15b40/0xa0166cc2f1abd865 lrc: 3/0,0 mode: PR/PR res: 341311489/3520050123 bits 0x3 rrc: 39 type: IBT flags: 0x4000020 remote: 0x6b99a1f28ce9b702 expref: 14 pid: 45823 timeout: 5591090236

see the number after "remote:"
Check for it in the corresponding client log after you processed it into a text format.
If you see match like this (should be very first match) (I used a different lock from my testing here for illustration purposes, the remote we are looking for is 0xb71459050dddbe18):

00010000:00010000:3.0:1395970265.268930:0:13661:0:(ldlm_lock.c:797:ldlm_lock_addref_internal_nolock()) ### ldlm_lock_addref(CR) ns: lustre-MDT0000-mdc-ffff88005315d7f0 lock: ffff8800456b7db8/0xb71459050dddbe18 lrc: 3/1,0 mode: --/CR res: [0x200000007:0x1:0x0].0 bits 0x0 rrc: 1 type: IBT flags: 0x10000000000000 nid: local remote: 0x0 expref: -99 pid: 13661 timeout: 0 lvb_type: 0

Note how, on the client side, this remote handle has moved to the lock: addr/handle location; remote is 0, meaning this lock has not been replied to by the server yet.
That is a very good sign: it means we have likely captured the entire lock lifetime in the log.

If you see at least

00010000:00010000:3.0:1395970265.287973:0:13661:0:(ldlm_request.c:700:ldlm_cli_enqueue_fini()) ### client-side enqueue END ns: lustre-MDT0000-mdc-ffff88005315d7f0 lock: ffff8800456b7db8/0xb71459050dddbe18 lrc: 4/1,0 mode: PR/PR res: [0x200000007:0x1:0x0].0 bits 0x13 rrc: 1 type: IBT flags: 0x0 nid: local remote: 0xb71459050dddbe26 expref: -99 pid: 13661 timeout: 0 lvb_type: 0

That means we have the lifetime of the lock since it was granted by the server, so it is still a very good log.

If you do not have any matches for the handle in the client log, that means this log does not contain anything useful related to the lock in question and is therefore likely useless.
If it contains some matches, but not the ones mentioned above, it still might be useful.
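
For example, a quick way to check a converted client log for the handle ( file name is a placeholder; the handle is taken from the example eviction message above ):

grep -m1 '0xa0166cc2f1abd865' /somewhere/file_name.log   # first match should be the ldlm_lock_addref line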

Comment by Helen Brimmer [ 28/Mar/14 ]

Here is confirmation of how the hgs4 clients were set up for debugging:

[root@hgs4g ~]$ cat /proc/sys/lnet/debug
trace inode ioctl neterror warning dlmtrace error emerg ha vfstrace reada config console

[root@hgs4g ~]$ cat /proc/sys/lnet/subsystem_debug
undefined mdc mds osc ost class log llite rpc mgmt lnet lnd pinger filter echo ldlm lov lquota osd lfsck lmv sec gss mgc mgs fid fld

[root@hgs4g ~]$ cat /proc/sys/lnet/debug_path
/tmp/lustre-log

[root@hgs4g ~]$ cat /proc/sys/lnet/debug_mb
2048

[root@hgs4g ~]$ cat /proc/sys/lnet/debug_log_upcall
/usr/lib/lustre/lnet_debug_log_upcall

Please let me know if there is anything amiss here. As the log files were rotating so quickly, we did create several sets of debug output so I will see if some of the other files contain anything useful.

We'll run the pipeline tests on the other set of lustre 2.5.0 clients today (they have a different kernel, so we are checking whether that is a potential issue specific to hgs4). I'll adjust the debug capture so that it doesn't collect osc and class messages, as you suggest.

Comment by Oleg Drokin [ 28/Mar/14 ]

This output seems correct, so it is kind of strange that you ended up with the results you got.

Comment by Helen Brimmer [ 28/Mar/14 ]

We have had some success recreating the issue with a different set of clients today. The filesystem went into the error state we have been seeing within about five minutes of the pipeline being started. The sequence of events appears to be that the worker nodes start being evicted by various OSTs (a different set in each case). Some time later (~20 minutes today), the MDS starts to evict certain worker nodes but not others. Several hours later, we see stuck threads and lustre error dumps on the MDS.

I believe that this time we have managed to capture some thread lifetimes. The debug_daemon files started off accruing a lot of data, but then appeared to calm down. We did take the precaution of restarting it as the file size approached maximum, though one client (ht-1-1-03) continued to generate a lot of data so some might have wrapped. I have uploaded compressed trace text files from all of the clients involved in running the pipeline (ht-1-1-*) and the MDS (lus14-mds2). They are available from the same URL as before. There are two each, with I think the most interesting stuff happening in the file stamped '115504' (which is generally a lot smaller than the earlier one).

I'll be attaching kernel logs from all of the systems for reference. The pipeline was started at 11:22 from ht-1-1-03, which acted as the master controlling node. Processing jobs were then issued to the remaining six systems. We saw OST evictions on all of the workers except ht-1-1-03 and ht-1-1-07. MDS evictions occurred on ht-1-1-04, ht-1-1-05, ht-1-1-06, ht-1-1-08 and ht-1-1-09, with ht-1-1-09 being particularly troublesome. I believe that we have captured lock lifetimes on ht-1-1-09, ht-1-1-04 and possibly ht-1-1-06 as well. All of the clients are running Ubuntu 12.04.4, with kernel 3.2.0-58-generic and lustre 2.5.0. Debug settings are as below:

[root@ht-1-1-03 ~]$ cat /proc/sys/lnet/debug
trace inode ioctl neterror warning dlmtrace error emerg ha vfstrace reada config console
[root@ht-1-1-03 ~]$ cat /proc/sys/lnet/subsystem_debug
undefined mdc mds ost log llite rpc mgmt lnet lnd pinger filter echo ldlm lov lquota osd lfsck lmv sec gss mgc mgs fid fld
[root@ht-1-1-03 ~]$ cat /proc/sys/lnet/debug_path
/tmp/lustre-log
[root@ht-1-1-03 ~]$ cat /proc/sys/lnet/debug_mb
4096
[root@ht-1-1-03 ~]$ cat /proc/sys/lnet/debug_log_upcall
/usr/lib/lustre/lnet_debug_log_upcall

lus14-mds2 was rebooted prior to starting this run in order to reset its state from the previous run. It is also running Ubuntu 12.04.4 LTS but has a custom-built kernel, 2.6.32-gmpc-2.6.32-220.el6-lustre.2.2, in order to support Lustre version 2.2.0. It had the following debug settings:

[root@lus14-mds2 ~]$ cat /proc/sys/lnet/debug
ioctl neterror warning dlmtrace error emerg ha config console
[root@lus14-mds2 ~]$ cat /proc/sys/lnet/subsystem_debug
undefined mdc mds osc ost class log llite rpc lnet lnd pinger filter echo ldlm lov lquota lmv sec gss mgc mgs fid fld
[root@lus14-mds2 ~]$ cat /proc/sys/lnet/debug_path
/tmp/lustre-log
[root@lus14-mds2 ~]$ cat /proc/sys/lnet/debug_mb
4081
[root@lus14-mds2 ~]$ cat /proc/sys/lnet/debug_log_upcall
/usr/lib/lustre/lnet_debug_log_upcall

Please let me know if there are any problems with this set of files. We are still trace logging into new files, so we can upload those too if you need to see what came next.

Comment by Oleg Drokin [ 30/Mar/14 ]

OK, this confirms that the client got the cancel request, prepared the reply in time and handed it off to a ptlrpcd thread for asynchronous sending. The server apparently did not get the reply.
Unfortunately, I only just noticed that rpctrace is somehow not in the default debug level, so it needs to be added separately too (echo +rpctrace > /proc/sys/lnet/debug) on both the clients and the MDS (in addition to the existing settings).

So the mystery now shifts to finding whether the request was sent to the server at all and somehow lost there, or whether it was never sent from the client.
In fact, I think there is a ticket for exactly this - a stuck ptlrpcd thread that might cause the request to never leave the client: LU-4509. The patch from there does not apply cleanly to 2.4 clients, so here is the b2_4-based patch for you:
http://review.whamcloud.com/#/c/9747/

Please try your workload with this patch and with all of the debug enabled (including rpctrace this time). If the issue reproduces, please gather the logs as before; if it is gone, then that might have been the culprit.

Comment by Helen Brimmer [ 31/Mar/14 ]

In fact I think there's just the ticket for a stuck ptlrpcd thread that might cause the request not to ever leave the client, it's LU-4509. The patch from there does not apply cleanly to 2.4 clients, so here is the b2_4 based patch for you:
http://review.whamcloud.com/#/c/9747/

We are using 2.5.0 clients; can you confirm that this patch is applicable to that version?

Comment by Oleg Drokin [ 31/Mar/14 ]

http://review.whamcloud.com/8922 is the patch that applies cleanly to 2.5 clients.

I'll check your new logs, thanks.

Comment by Oleg Drokin [ 31/Mar/14 ]

So in the new logs we can clearly see an RPC is being formed for the lock cancel, but never sent.

I am really interested to see if the patch from LU-4509 fixes the issue for you.

Comment by Helen Brimmer [ 31/Mar/14 ]

We are in the process of compiling it up now. We should be able to test it out tomorrow.

Comment by Oleg Drokin [ 31/Mar/14 ]

Thank you for your effort in gathering those logs, btw. Really helpful.

Comment by Helen Brimmer [ 01/Apr/14 ]

We have tested lustre 2.5.0 with the LU-4509 patch applied. Unfortunately, it does not appear to have helped and we still see the lock timeouts and evictions (ht-1-1-09 and ht-1-1-04 seem particularly prone). I have uploaded trace logs from today's run to our FTP site and will attach the kernel logs from the same period. For reference, here are the lustre version and debug settings from one of the clients and the MDS:

[root@ht-1-1-03 lustre-dbug]# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 12.04.4 LTS
Release: 12.04
Codename: precise
[root@ht-1-1-03 lustre-dbug]# uname -a
Linux ht-1-1-03 3.2.0-58-generic #88-Ubuntu SMP Tue Dec 3 17:37:58 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
[root@ht-1-1-03 lustre-dbug]# cat /proc/fs/lustre/version
lustre: 2.5.0
kernel: patchless_client
build: 2.5.0-gca629fd-CHANGED-3.2.0-58-generic
[root@ht-1-1-03 lustre-dbug]# cat /proc/sys/lnet/debug
trace inode ioctl neterror warning dlmtrace error emerg ha rpctrace vfstrace reada config console
[root@ht-1-1-03 lustre-dbug]# cat /proc/sys/lnet/subsystem_debug
undefined mdc mds ost log llite rpc mgmt lnet lnd pinger filter echo ldlm lov lquota osd lfsck lmv sec gss mgc mgs fid fld
[root@ht-1-1-03 lustre-dbug]# cat /proc/sys/lnet/debug_mb
4096

[root@lus14-mds2 ~]# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 12.04.4 LTS
Release: 12.04
Codename: precise
[root@lus14-mds2 ~]# uname -a
Linux lus14-mds2 2.6.32-gmpc-2.6.32-220.el6-lustre.2.2 #2 SMP Wed Dec 4 16:58:34 GMT 2013 x86_64 x86_64 x86_64 GNU/Linux
[root@lus14-mds2 ~]# cat /proc/fs/lustre/version
lustre: 2.2.0
kernel: patchless_client
build: 2.2.0--PRISTINE-2.6.32-gmpc-2.6.32-220.el6-lustre.2.2
[root@lus14-mds2 ~]# cat /proc/sys/lnet/debug
ioctl neterror warning dlmtrace error emerg ha config console
[root@lus14-mds2 ~]# cat /proc/sys/lnet/subsystem_debug
undefined mdc mds osc ost class log llite rpc lnet lnd pinger filter echo ldlm lov lquota lmv sec gss mgc mgs fid fld
[root@lus14-mds2 ~]# cat /proc/sys/lnet/debug_mb
4081

Comment by Oleg Drokin [ 02/Apr/14 ]

Thanks for the logs again.
In these logs I see clear signs of LU-4300
Please apply the patch from here: http://review.whamcloud.com/#/c/9451/ and see if it helps.
If not, please gather another set of logs, and please add +info to the client side of things. This is rapidly shifting into issues in clio, which is currently excluded from logging (and if we add that back in, I am afraid the logs would grow even faster, which is not really sustainable).
In any case, let's see if this patch fixes things; if not, we'll see whether something else is visible from the logs, or whether something else is needed.
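
Adding the extra flag follows the same pattern as before ( on the clients ):

echo +info > /proc/sys/lnet/debug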

Comment by James Beal [ 02/Apr/14 ]

I am having issues applying the patch on the 2.5.0 branch.

root@isg-dev-1-01:/local/lustre-2.5.0-lu4509-lu4300/lustre-release# git log 
commit ca629fd7c6f48801b993be88d10ec7569f209336
Author: Liang Zhen <liang.zhen@intel.com>
Date:   Mon Jan 20 20:52:51 2014 +0800

    LU-4509 ptlrpc: re-enqueue ptlrpcd worker
.....

commit be39b2aa4bf1469db0c936d1415301dca3a1effc
Author: Oleg Drokin <oleg.drokin@intel.com>
Date:   Fri Oct 11 18:24:56 2013 -0400

    Tag 2.5.0-RC1


root@isg-dev-1-01:/local/lustre-2.5.0-lu4509-lu4300/lustre-release# git cherry-pick bfae5a4e4a37e4057200bcc70dbff14e18a00797
Automatic cherry-pick failed.  After resolving the conflicts,
mark the corrected paths with 'git add <paths>' or 'git rm <paths>'
and commit the result with: 

        git commit -c bfae5a4e4a37e4057200bcc70dbff14e18a00797

root@isg-dev-1-01:/local/lustre-2.5.0-lu4509-lu4300/lustre-release# git status
# Not currently on any branch.
# Changes to be committed:
#   (use "git reset HEAD <file>..." to unstage)
#
#	modified:   lustre/include/lustre_dlm.h
#	modified:   lustre/mdc/mdc_request.c
#
# Unmerged paths:
#   (use "git reset HEAD <file>..." to unstage)
#   (use "git add/rm <file>..." as appropriate to mark resolution)
#
#	both modified:      lustre/ldlm/ldlm_request.c
#	both modified:      lustre/osc/osc_lock.c
#	both modified:      lustre/osc/osc_request.c
#
# Untracked files:
#   (use "git add <file>..." to include in what will be committed)
#
#	lustre/scripts/make_META.pl

I am not confident in resolving the conflicts in the three files, so we are awaiting further instruction.

Comment by Oleg Drokin [ 02/Apr/14 ]

It looks like you are trying to cherry-pick the wrong patch. The one you are trying to cherry-pick is from master; the one you need is the one I referenced, which was specifically made for b2_5. It seems to cherry-pick fine for me:

Oleg-Drokins-MacBook-Pro:lustre green$ git checkout 2.5.0
HEAD is now at be39b2a... Tag 2.5.0-RC1
Oleg-Drokins-MacBook-Pro:lustre green$ git cherry-pick d1dcff3084e929f5768dc733cdc104cddf168c06
[detached HEAD e73ec59] LU-4509 ptlrpc: re-enqueue ptlrpcd worker
 Author: Liang Zhen <liang.zhen@intel.com>
 1 file changed, 50 insertions(+), 34 deletions(-)
Oleg-Drokins-MacBook-Pro:lustre green$ git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/51/9451/3 && git cherry-pick FETCH_HEAD
From http://review.whamcloud.com/fs/lustre-release
 * branch            refs/changes/51/9451/3 -> FETCH_HEAD
[detached HEAD d237657] LU-4300 ldlm: ELC picks locks in a safer policy
 Author: Jinshan Xiong <jinshan.xiong@intel.com>
 1 file changed, 6 insertions(+), 7 deletions(-)

Comment by James A Simmons [ 02/Apr/14 ]

The LU-4509 patch does cherry-pick to b2_5 but does not build. You can grab the patch for b2_5 at http://review.whamcloud.com/#/c/9705

Comment by James Beal [ 03/Apr/14 ]

Oleg, I have followed your instructions and we have a new client package. We are having to put the filesystem we were using into production; however, we have another system ( although smaller ) that we have configured in the same way. We plan to verify that we can still see the issue on that system before trying the new client.

Comment by Helen Brimmer [ 10/Apr/14 ]

Apologies for the lack of updates. We are still in the process of putting the new test environment together. This is an older set which is smaller than the last one; 4 OSS servers providing 30 OSTs. The Lustre version on the servers is still 2.2.0 and we will be re-running the test of LU-4509 on the clients to check that the issue is still present. I'll update the ticket when we have some results.

Comment by Helen Brimmer [ 29/Apr/14 ]

Again, apologies for the long delay between updates. We have been having trouble reproducing the issue exactly on our alternative test system, and other issues locally have intervened. However, there are definitely still issues with the LU-4509 patch on the smaller test server; the client systems still get evicted by a random selection of OSTs (a different set for each client, and not all from the same OSS) and then lose access to the filesystem. Unfortunately, the load or volume of locks (or whatever the trigger is) on the MDS does not reach a high enough level to cause it to start to lock up. Other clients are still able to connect and use it, which is a slightly less extreme service loss than we saw with our bigger production systems.

As there is still an issue, I have moved on to testing the LU-4300 patch. Unfortunately, two test runs have seen all of the test clients crash consistently with the (osc_lock.c:511:osc_lock_upcall()) ASSERTION( lock->cll_state >= CLS_QUEUING ) LBUG after 20 minutes of running the test pipeline. The OSSs and MDS only start evicting the clients after the systems have actually gone down and there is no sign of the 'callback timer expired' messages. I can't say whether this is due to LU-4300 fixing the issue or because the client systems can't run long enough to trigger it.

I believe that the osc_lock_upcall issue is fixed in 2.5.1, so we will be looking to test this version next.

Comment by Helen Brimmer [ 30/May/14 ]

It's been a busy month, but we have now been able to test Lustre 2.5.1 with patches LU-4300 and LU-4509. Unfortunately, we still see lock-ups on clients when our test pipeline is run. I have uploaded trace logs from two clients that were evicted during the latest run performed yesterday to our FTP site (URL is the same as before). The conditions on the clients were as follows:

[root@ht-1-1-09 lustre-dbug]$ cat /proc/fs/lustre/version 
lustre: 2.5.1
kernel: patchless_client
build:  2.5.1-gd6b5d00-CHANGED-3.2.0-58-generic

[root@ht-1-1-09 lustre-dbug]$ cat /proc/sys/lnet/debug
trace inode info ioctl neterror warning dlmtrace error emerg ha rpctrace vfstrace reada config console

[root@ht-1-1-09 lustre-dbug]$ cat /proc/sys/lnet/subsystem_debug 
undefined mdc mds ost log llite rpc mgmt lnet lnd pinger filter  echo ldlm lov lquota osd lfsck   lmv  sec gss  mgc mgs fid fld

[root@ht-1-1-09 lustre-dbug]$ cat /proc/sys/lnet/debug_mb 
4096

To try and ease the log gathering, we set the debug_daemon to restart at regular intervals throughout the run, so there are quite a number of files to look through this time (hopefully we have managed to capture something useful). The timeframe is from around 16:30 - 16:50 local time, by which time the machines had been running jobs for around 25 minutes. The first stuck ptlrpcd threads occur on ht-1-1-07 at around 16:32. Both machines were evicted by a different selection of OSTs, but so far we have not seen any evictions or problems with the MDT in this smaller test environment. I'll attach kernel logs from all of the machines separately. Let me know if there is anything else we can add.

Comment by James A Simmons [ 30/May/14 ]

Can you try the tip of b2_5? I have been having success with normal file operations except for single shared files. That is being addressed in LU-2827.

Comment by Helen Brimmer [ 02/Jun/14 ]

Hi James, Can you clarify what you want us to test now? I'm afraid that I'm not familiar with Lustre development terminology.

Comment by Peter Jones [ 02/Jun/14 ]

Thanks for your input James.

Helen, one of our engineers will give a recommendation after having concluded the analysis of the latest information.

Comment by Oleg Drokin [ 05/Jun/14 ]

Helen, I got your logs, but it seems only the client logs from ht-1-1-09 and ht-1-1-07 are available.
I really need at least something from the server (lus03?) so that I can get an idea of which lock handles to look at, for example.

Comment by Helen Brimmer [ 05/Jun/14 ]

Many apologies, I overlooked the MDS (lus03-mds2) trace logs. I have now uploaded these (they are one large file, as the rate of production is not as high as for the clients). We did not see any evictions from the MDS during the test run though, only from the OSSes, from which we are not currently collecting debug traces. If you need us to do this, please let me know and I will get another run going as soon as I can.

Comment by Oleg Drokin [ 11/Jun/14 ]

If you can upload the syslog part of the OSS logs that shows the evicted lock handles, that might be enough.
So let's do that as a first step.

Comment by Helen Brimmer [ 11/Jun/14 ]

kern_logs_20140529.tar.gz attached to this ticket has kernel log files from all of the servers covering the period of the latest test.

Comment by Oleg Drokin [ 12/Jun/14 ]

Thanks! I totally missed that you uploaded the kernel logs to the ticket rather than your FTP site, and there's no visual feedback in Jira that shows a file was attached to that comment. Sorry.

Anyway, I looked at the logs.
It seems you are still having LU-4300 issues. I know you have the patch for that applied, but that was a simplified patch for the 2.5 backport. It appears that the simplified version still left a different avenue for the problem to occur.
We'll work on backporting the full version for you to try.

Also, if you try without full debug first and you see "INFO: task ptlrpcd_26:2213 blocked for more than 120 seconds" sort of messages in your client logs, that alone might be enough in some cases, so it might be a good idea to do that before diving into the full debug runs (though I hope the problem will be gone this time around).

Comment by Helen Brimmer [ 12/Jun/14 ]

Thanks for looking through those logs and confirming the issue. We will wait for the new patch to try.

We ran a full debug run yesterday anyway, including '+dlmtrace' logs from the OSSs. In case those traces are useful, I've posted them to the FTP site and put the kernel logs there as well this time. On yesterday's run, all of the client systems (ht-1-1-*) suffered ptlrpcd lockups and effectively lost access to the filesystem, but only one actually got evicted (ht-1-1-07) by a few of the OSTs.

Do you know if LU-4300 and LU-4509 are going to be included in the 2.5.2 release?

Comment by Oleg Drokin [ 12/Jun/14 ]

The LU-4509 patch will be included.
The LU-4300 patch will be included in the version that you ran, which did not prove to be very effective for you; the updated version probably won't make it in time for 2.5.2.

Comment by Oleg Drokin [ 17/Jun/14 ]

So, I have a question: do you have a test system where you can try things out?
Since we have a fuller version of this fix in the pre-release master version, it would be great if you could use a prerelease version of the 2.6 client (just clients is enough).
Or, if you prefer something closer to 2.5, we can provide you with a special client RPM that has this and some other patches (which we did not plan to include in 2.5 releases as too invasive, but the current LU-4300 fix is based on them, which is why extensive porting is needed) so you can try this on your load and ensure the problem is fully solved.

If you are interested in testing the actual 2.6 prerelease code, you can grab the RPMs at http://build.whamcloud.com/job/lustre-master/2085/ - pick the client version for your distro.

Comment by Helen Brimmer [ 17/Jun/14 ]

We can try out 2.6 candidates on the set of machines that we have already been using to test 2.5.x. We are an Ubuntu shop though, so I am afraid that RPMs are not that helpful for us (the test clients are running Ubuntu 12.04.4 with kernel 3.2.0-58-generic). We have been building the 2.5.x + patch variants from the git sources. If you can confirm the git release tag to check out from 2.6, we can try building it from there.

Comment by Oleg Drokin [ 18/Jun/14 ]

It's git hash 5c4573e327c5d027a6b877f75910767244161a1f
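
For reference, a rough sketch of a patchless client build at that hash, assuming an existing lustre-release checkout as used earlier in this ticket ( options are illustrative, not a prescribed build recipe ):

git checkout 5c4573e327c5d027a6b877f75910767244161a1f
sh ./autogen.sh
./configure --disable-server     # client-only build against the running kernel
make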

Comment by Helen Brimmer [ 19/Jun/14 ]

We have successfully got 2.5.59 build #2085 to build on our systems and have been running the test pipeline on six worker nodes for around 22 hours now. So far, we have not seen any stuck ptlrpcd threads and the jobs continue to flow through our batch scheduler. With the previous versions, we would have seen the nodes lockup within an hour of receiving pipeline jobs.

To confirm that we've got the version you wanted us to test:

[root@ht-1-1-07 lustre-dbug]$ cat /proc/fs/lustre/version 
lustre: 2.5.59
kernel: patchless_client
build:  2.5.59-g5c4573e-CHANGED-3.2.0-58-generic

We have not seen any problems on the servers either, which are still running v2.2.0. This looks very promising, but we do need to have a stable release for our production environment.

Comment by Jinshan Xiong (Inactive) [ 23/Jun/14 ]

I'm working on this issue.

Comment by Jinshan Xiong (Inactive) [ 24/Jun/14 ]

Can you please try patch: http://review.whamcloud.com/10795

Comment by Helen Brimmer [ 26/Jun/14 ]

We have applied that patch to Lustre 2.5.1 and restarted the pipeline. The test nodes have now been running the jobs for over 18 hours and we have not seen any stuck threads yet.

[root@ht-1-1-09 root]$ cat /proc/fs/lustre/version 
lustre: 2.5.1
kernel: patchless_client
build:  2.5.1-g0b9a5ad-CHANGED-3.2.0-58-generic

Comment by Helen Brimmer [ 09/Jul/14 ]

We have continued to run the problematic pipeline and other workloads on the 2.5.1 clients patched with LU-4786. After nearly two weeks, we have not had any problems with stuck threads or full system crashes. We will need to check how it behaves with our larger production systems, but it does look like this has solved the issue for this test environment at least.

Comment by Oleg Drokin [ 11/Aug/14 ]

Hm, so the client was evicted by the server because it did not reply to a request to release a contended lock.
Unfortunately, because you only included the client log starting from the moment of eviction, I do not know whether the client complained about anything before then. Please consider adding 30 minutes of kernel log from before the eviction on that client.

After the eviction we see the import invalidation thread stuck on a semaphore and then a bunch of ptlrpcd threads stuck on a semaphore, potentially indicating there is some client-side deadlock involved. Whether it was triggered by the eviction or was a separate event is not yet clear.
I discussed it with Jinshan and he is going to provide his opinion after some more log reading.

This is most definitely a different issue than what you had before, so that's a good sign, I guess.

Comment by Helen Brimmer [ 11/Aug/14 ]

Jira doesn't seem to want to accept attachments at the moment, so I've uploaded the whole kern.log file for last week and the equivalent syslog files to our ftp site (filenames ht-1-1-16-*.gz). This particular client reported some 'revalidate FID' errors a few days before, but other than that was very quiet.

Comment by Jinshan Xiong (Inactive) [ 12/Aug/14 ]

There is not enough information to identify the problem.

Can you please reproduce the problem, and when you see it again:
1. run the command `lctl dk <logfile>' on the victim client node to dump the debug log;
2. run `echo t > /proc/sysrq-trigger' and `dmesg > dmesg.txt' to collect stack traces of all processes on the system.

Thanks.

From the limited log, I can see that the file has 40 stripes - do you have any specific reason to have wide-striped files? Also, what is the access mode to the files - does the workload do append writes to them?

Jinshan

Comment by Helen Brimmer [ 12/Aug/14 ]

This latest test was run against one of our production filesystems, so it may take a little time to get permission to run it again. Since it is relatively quiet at the moment due to summer holidays, hopefully we will get a chance this week. In the meantime, I have run the commands you suggested anyway on the client, since it has not done anything since the run last week (and the Lustre modules were not unloaded after the filesystem was unmounted). It looks like 'lctl dk' has captured information from the duration of the pipeline run, but I'm not sure that dmesg still contains anything useful. I've uploaded the following files to our FTP site anyway, in case they are useful:

ht-1-1-16-20140812-lustre-dmp.txt.gz
ht-1-1-16-20140812-dmesg.txt.gz
ht-1-1-16-20140812-pydmesg.txt.gz

(pydmesg is an in-house wrapper for dmesg that converts the 'time since uptime' timestamps into realtime values)

The pipeline is produced and run by one of our scientific research teams; I will have to check with them about the exact details of its file accesses. Is there a way to identify the file that the log refers to? The pipeline generates over 3000 individual batch jobs, each with their own set of reference files and output files...

Comment by Jinshan Xiong (Inactive) [ 12/Aug/14 ]

If the system is still alive, you can try this command to get the name of the affected file.

lfs fid2path <lustre_dir> "[0x20000b8ab:0x777f:0x0]"

<lustre_dir> is the directory where you mount the Lustre file system; it is /lustre/scratch113 in your case.

I will take a look at the files you attached.
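
As an aside, the stripe count of a suspect file can be checked with lfs getstripe, for example ( path is hypothetical ):

lfs getstripe -c /lustre/scratch113/path/to/suspect_file   # prints the stripe count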

Comment by Jinshan Xiong (Inactive) [ 12/Aug/14 ]

I checked the attachments but didn't find anything useful there.

My gut feeling is that this may be related to the access pattern of those widely striped files.

Comment by Helen Brimmer [ 13/Aug/14 ]

Okay, if you need us to increase the debug level on the clients, please let us know which flags to apply.

The problematic file appears to be

/lustre/scratch113/teams/hgi/dddtest/variant_calling/20140805-lustest/pooled/.jobs/call_variants.w.1501.o

which has zero length. It is one of many .o files in that directory, the largest of which is only 16K. The '.jobs' directory is where the STDOUT and STDERR from each pipeline job is stored. These files will have data appended to them as the jobs produce their output (or errors). I will check with the pipeline owners if the striping is intentional - we don't recommend striping files this small...

Comment by Peter Jones [ 14/Aug/14 ]

Helen

We have landed the existing patch you are carrying for 2.5.3 as that has clearly improved the situation markedly. Could you please open a new ticket to track this residual issue? This will make things easier for us (and others) to track what changes are in a given release.

Thanks

Peter

Comment by Jinshan Xiong (Inactive) [ 16/Aug/14 ]

Please apply the following debug flags on the client side:

lctl set_param debug="vfstrace dlmlock cache reada"

Yes, there is no reason to use wide striping if the files are so small.

I think I understand why it happened - if this is the output of STDOUT and STDERR, most people would do something like:

command >> output 2>&1

The output file is a Lustre file and it will be opened and written with O_APPEND mode, which is really bad for a widely striped file.

Just my guess - please verify it.
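
If that turns out to be the case, one possible mitigation ( only a sketch, not something tested here ) would be to set a default stripe count of 1 on the job-output directory so that newly created log files are not widely striped:

lfs setstripe -c 1 /lustre/scratch113/teams/hgi/dddtest/variant_calling/20140805-lustest/pooled/.jobs   # new files in this directory default to a single stripe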

Comment by Peter Jones [ 18/Aug/14 ]

OK, I will close this ticket to reflect that the original patch has landed for 2.5.3. If/when any additional information is supplied about this residual issue, we can track that under a new ticket and link it back to this one.
