  Lustre / LU-4786

Apparent denial of service from client to mdt

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.5.3
    • None
    • None
    • 3
    • 13165

    Description

      We generally use "1.8.9wc1-sanger1-PRISTINE-2.6.32-44-server" as our standard client; however, we are running "lustre: 2.5.0, kernel: patchless_client, build: v2_5_0_0CHANGED-3.8.0-34-generic" on a small number of systems (hgs4), which are also running Docker and LXC, and we have a small number of "v2_5_0_0-CHANGED-3.2.0-59-virtual" clients.

      Scratch 109, 110 and 113 all run "2.2.0--PRISTINE-2.6.32-gmpc-2.6.32-220.el6-lustre.2.2" on the servers.

      We have recovered access to a filesystem which was effectively unavailable for around 24 hours. We have been supported by DDN (in particular Sven Tautmann) during the investigation of the issue, and the state of our other filesystems was also discussed.

      We have a number of Lustre systems and we saw a similar issue on scratch113, scratch110 and finally scratch109. The symptom was that new access to the filesystem was lost. We determined that the issue was with the MDT, so we took the approach of stopping the MDT, removing the Lustre modules and restarting the MDT. This appeared to restore access, although we did have to abort recovery on the MDS on some of the systems after they had been stuck in recovery for more than 30 minutes. Normally we would look for a user job that was causing the issue, however nothing was obviously apparent.

      With scratch109, which is one of our general-purpose filesystems and holds our reference genomes (and therefore affects our production pipelines), restarting the MDT service did not solve the issue: the system failed almost immediately after we aborted the recovery.

      With assistance from DDN we took the following approach (this took place during the day of the 18th of March):

      1) We cleared up some old configuration from the scratch109 MDS config.

      The following was seen in the log: "Mar 18 08:46:04 lus09-mds1 kernel: LustreError: 3192:0:(mgs_handler.c:785:mgs_handle_fslog_hack()) Invalid logname received: params"

      root@lus09-mds1:/# mount -t ldiskfs /dev/mapper/lus09-mgs-lus09 /export/MGS
      root@lus09-mds1:/export/MGS# mkdir old/CONFIGS
      root@lus09-mds1:/export/MGS# mv CONFIGS/params old/CONFIGS/
      root@lus09-mds1:/export/MGS# mv CONFIGS/lus09-params old/CONFIGS/
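
      For completeness, a minimal sketch of releasing the ldiskfs mount and bringing the MGS back up as a Lustre target afterwards (the Lustre mount point shown is illustrative):

      root@lus09-mds1:/export/MGS# cd / && umount /export/MGS
      root@lus09-mds1:/# mount -t lustre /dev/mapper/lus09-mgs-lus09 /mnt/mgs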

      2) We saw issues with the uid/gid lookup, for example:

      Mar 18 12:10:02 lus09-mds2 kernel: LustreError: 13888:0:(upcall_cache.c:236:upcall_cache_get_entry()) Skipped 208 previous similar messages
      Mar 18 13:31:17 lus09-mds2 kernel: Lustre: 4134:0:(mdt_handler.c:4767:mdt_process_config()) For 1.8 interoperability, skip this mdt.group_upcall. It is obsolete

      We moved the system to a new LDAP caching layer, which did appear to clear those messages up.
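
      For reference, a minimal sketch of how the MDT identity (uid/gid) upcall and cache can be inspected and flushed on the MDS; the target name is illustrative:

      root@lus09-mds1:/# lctl get_param mdt.*.identity_upcall
      root@lus09-mds1:/# lctl set_param mdt.lus09-MDT0000.identity_flush=-1   # -1 flushes the whole identity cache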

      3) We then used iptables to limit access to the MDS to a small collection of clients and showed that the MDT and those clients were stable. We then spent a number of hours reintroducing clients to the MDT until all bar hgs4 (our 3.8-kernel Lustre 2.5 clients) had access to the filesystem.
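
      A hedged sketch of the kind of iptables rules used for this on the MDS, assuming TCP LNET on the default acceptor port 988; the client address shown is illustrative:

      # allow one known-good client, drop all other Lustre traffic to this MDS
      iptables -A INPUT -p tcp --dport 988 -s 172.17.125.10 -j ACCEPT
      iptables -A INPUT -p tcp --dport 988 -j DROP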

      We would like to understand why the MDS failed, why it did not recover in any useful amount of time when restarted, and whether this is related to either the workload or the 2.5.0 clients. We have attached the output of "ddn_lustre_showall.sh" from each of the server systems, from hgs4, and from a random collection of clients (to provide a baseline).

      The following are a number of observations:

      On the client we saw:

      Mar 18 09:14:49 hgs4a kernel: [71024.072174] LustreError: 6751:0:(import.c:323:ptlrpc_invalidate_import()) lus13-OST0013_UUID: rc = -110 waiting for callback (1 != 0)
      Mar 18 09:14:49 hgs4a kernel: [71024.072247] LustreError: 6751:0:(import.c:349:ptlrpc_invalidate_import()) @@@ still on sending list req@ffff881fe470ac00 x1462829581185164/t0(0) o4->lus13-OST0013-osc-ffff882fe7cbcc00@172.17.117.147@tcp:6/4 lens 488/448 e 0 to 0 dl 1395076147 ref 2 fl Rpc:RE/0/ffffffff rc -5/-1
      Mar 18 09:14:49 hgs4a kernel: [71024.072347] LustreError: 6751:0:(import.c:365:ptlrpc_invalidate_import()) lus13-OST0013_UUID: RPCs in "Unregistering" phase found (0). Network is sluggish? Waiting them to error out.
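
      For context, a minimal sketch of how the stuck import state can be inspected on a client in this situation (the osc device name is taken from the messages above and will differ per mount):

      # show import state and details for the affected OSC
      lctl get_param osc.lus13-OST0013-osc-*.state
      lctl get_param osc.lus13-OST0013-osc-*.import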

      While on the server, the following was interesting:

      Mar 17 11:52:15 lus13-mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 1211s: evicting client at 172.17.125.17@tcp ns: mdt-ffff88082ad49000 lock: ffff88047a8be6c0/0xce376efee12486e2 lrc: 3/0,0 mode: PR/PR res: 8589950579/1911 bits 0x3 rrc: 5 type: IBT flags: 0x4000020 remote: 0xefde62456d81d72a expref: 851 pid: 14082 timeout: 6100853168
      Mar 17 12:12:13 lus13-mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 959s: evicting client at 172.17.125.15@tcp ns: mdt-ffff88082ad49000 lock: ffff880712d596c0/0xce376efee25dddbe lrc: 3/0,0 mode: PR/PR res: 26214401/3857707524 bits 0x3 rrc: 896 type: IBT flags: 0x4000020 remote: 0xc6c5372816307bb3 expref: 821 pid: 5438 timeout: 6102051020
      Mar 17 12:18:26 lus13-mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 226s: evicting client at 172.17.125.15@tcp ns: mdt-ffff88082ad49000 lock: ffff880d85b62d80/0xce376efee273fb44 lrc: 3/0,0 mode: PR/PR res: 26214401/3857707524 bits 0x3 rrc: 1023 type: IBT flags: 0x4000020 remote: 0xc6c5372816308243 expref: 7 pid: 14047 timeout: 6102424312
      Mar 17 12:53:40 lus13-mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 427s: evicting client at 172.17.125.18@tcp ns: mdt-ffff88082ad49000 lock: ffff8806be599900/0xce376efee28b9046 lrc: 3/0,0 mode: PR/PR res: 26214401/3857707524 bits 0x3 rrc: 595 type: IBT flags: 0x4000020 remote: 0x90884a7b57617f83 expref: 814 pid: 14089 timeout: 6104538355
      Mar 17 13:02:09 lus13-mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 848s: evicting client at 172.17.139.124@tcp ns: mdt-ffff88082ad49000 lock: ffff8806170ee6c0/0xce376efee28c39ff lrc: 3/0,0 mode: PR/PR res: 26214401/3857707524 bits 0x3 rrc: 759 type: IBT flags: 0x4000020 remote: 0xc2f2926bc4355241 expref: 7 pid: 5296 timeout: 6105047775
      Mar 17 13:48:43 lus13-mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 1455s: evicting client at 172.17.125.18@tcp ns: mdt-ffff88082ad49000 lock: ffff88078f3d1900/0xce376efee2a43b88 lrc: 3/0,0 mode: PR/PR res: 26214401/3857707524 bits 0x3 rrc: 899 type: IBT flags: 0x4000020 remote: 0x90884a7b57619c01 expref: 24 pid: 5468 timeout: 6107841367
      Mar 17 14:15:20 lus13-mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 1597s: evicting client at 172.17.7.98@tcp ns: mdt-ffff88082ad49000 lock: ffff88074ab79240/0xce376efee2ac4cab lrc: 3/0,0 mode: PR/PR res: 26214401/3857707524 bits 0x3 rrc: 912 type: IBT flags: 0x4000030 remote: 0x91b7902dc49a4f87 expref: 625 pid: 5533 timeout: 6109438527
      Mar 17 15:00:10 lus13-mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 162s: evicting client at 172.17.125.19@tcp ns: mdt-ffff88082034f000 lock: ffff881007719240/0xc336d9366cbafe36 lrc: 3/0,0 mode: PR/PR res: 26214401/3857707524 bits 0x1 rrc: 576 type: IBT flags: 0x4000020 remote: 0x4126dea1636f365 expref: 556 pid: 7107 timeout: 4296236743
      Mar 18 12:51:46 lus13-mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 454s: evicting client at 172.17.125.15@tcp ns: mdt-ffff88082034f000 lock: ffff88079a183d80/0xc336d9367b259007 lrc: 3/0,0 mode: PR/PR res: 26214401/3857707524 bits 0x3 rrc: 242 type: IBT flags: 0x4000020 remote: 0x39f2e4c10c71e35f expref: 52 pid: 7113 timeout: 4374932968
      Mar 18 14:53:38 lus13-mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 779s: evicting client at 172.17.125.15@tcp ns: mdt-ffff88082034f000 lock: ffff880fe2393d80/0xc336d9367c7c0550 lrc: 3/0,0 mode: PR/PR res: 26214401/3857707524 bits 0x3 rrc: 973 type: IBT flags: 0x4000020 remote: 0x9e99423585827dd3 expref: 808 pid: 7202 timeout: 4382244335
      Mar 18 15:02:04 lus13-mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 506s: evicting client at 172.17.125.15@tcp ns: mdt-ffff88082034f000 lock: ffff881014370b40/0xc336d9367c88fe44 lrc: 3/0,0 mode: PR/PR res: 26214401/3857707524 bits 0x3 rrc: 1025 type: IBT flags: 0x4000030 remote: 0x9e99423585827ed6 expref: 5 pid: 5093 timeout: 4382750051
      Mar 18 15:56:26 lus13-mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 1208s: evicting client at 172.17.125.15@tcp ns: mdt-ffff88082034f000 lock: ffff881006da2480/0xc336d9367d16b067 lrc: 3/0,0 mode: PR/PR res: 26214401/3857707524 bits 0x3 rrc: 316 type: IBT flags: 0x4000020 remote: 0x9e994235858281e6 expref: 14 pid: 7275 timeout: 4386012442
      Mar 18 16:12:39 lus13-mds2 kernel: LustreError: 21100:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 973s: evicting client at 172.17.125.15@tcp ns: mdt-ffff88082034f000 lock: ffff881015642900/0xc336d9367d5ecb60 lrc: 3/0,0 mode: PR/PR res: 26214401/3857707524 bits 0x3 rrc: 669 type: IBT flags: 0x4000020 remote: 0x9e994235858282fe expref: 13 pid: 7116 timeout: 4386985544

      And similarly on lus09:

      Mar 18 03:12:28 lus09-mds2 kernel: LustreError: 5245:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 10808s: evicting client at 172.17.143.41@tcp ns: mdt-ffff881819bc3000 lock: ffff880b1f8936c0/0x5e448803b245d974 lrc: 3/0,0 mode: PR/PR res: 411041793/558694943 bits 0x3 rrc: 1001 type: IBT flags: 0x4000030 remote: 0xdd627947772d3070 expref: 28 pid: 5399 timeout: 4328657010
      Mar 18 03:32:29 lus09-mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 12009s: evicting client at 172.17.140.69@tcp ns: mdt-ffff881819bc3000 lock: ffff8816d6c0eb40/0x5e448803b245dad9 lrc: 3/0,0 mode: PR/PR res: 411041793/558694943 bits 0x3 rrc: 1004 type: IBT flags: 0x4000030 remote: 0x8bba98044cf50445 expref: 16 pid: 5453 timeout: 4329858036
      Mar 18 03:52:30 lus09-mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 13210s: evicting client at 172.17.27.140@tcp ns: mdt-ffff881819bc3000 lock: ffff880b21f7a480/0x5e448803b245dbb9 lrc: 3/0,0 mode: PR/PR res: 411041793/558694943 bits 0x3 rrc: 1006 type: IBT flags: 0x4000030 remote: 0x4620c27f480e4650 expref: 17 pid: 5224 timeout: 4331059019
      Mar 18 04:12:31 lus09-mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 14411s: evicting client at 172.17.48.147@tcp ns: mdt-ffff881819bc3000 lock: ffff880b2180d900/0x5e448803b245dc29 lrc: 3/0,0 mode: PR/PR res: 411041793/558694943 bits 0x3 rrc: 1003 type: IBT flags: 0x4000030 remote: 0x7f400f68202358bc expref: 18 pid: 5248 timeout: 4332260008
      Mar 18 04:32:32 lus09-mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 15612s: evicting client at 172.17.27.156@tcp ns: mdt-ffff881819bc3000 lock: ffff880b20e34900/0x5e448803b245dc4c lrc: 3/0,0 mode: PR/PR res: 411041793/558694943 bits 0x3 rrc: 1003 type: IBT flags: 0x4000030 remote: 0xdd23ca15f588f2f7 expref: 11 pid: 5040 timeout: 4333461005
      Mar 18 04:52:33 lus09-mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 16813s: evicting client at 172.17.140.22@tcp ns: mdt-ffff881819bc3000 lock: ffff880b15dd5480/0x5e448803b245dc68 lrc: 3/0,0 mode: PR/PR res: 411041793/558694943 bits 0x3 rrc: 1004 type: IBT flags: 0x4000030 remote: 0xc21e8f4395007441 expref: 24 pid: 5030 timeout: 4334662005
      Mar 18 05:12:34 lus09-mds2 kernel: LustreError: 8644:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 18014s: evicting client at 172.17.140.24@tcp ns: mdt-ffff881819bc3000 lock: ffff880b21811b40/0x5e448803b245dcbc lrc: 3/0,0 mode: PR/PR res: 411041793/558694943 bits 0x3 rrc: 1005 type: IBT flags: 0x4000030 remote: 0x72efc45909efcb5 expref: 22 pid: 5445 timeout: 4335863011
      Mar 18 05:32:35 lus09-mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 19215s: evicting client at 172.17.140.78@tcp ns: mdt-ffff881819bc3000 lock: ffff8816e4d81480/0x5e448803b245dd1e lrc: 3/0,0 mode: PR/PR res: 411041793/558694943 bits 0x3 rrc: 1008 type: IBT flags: 0x4000030 remote: 0x37ca8d5924c467b8 expref: 15 pid: 4157 timeout: 4337064016
      Mar 18 05:52:36 lus09-mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 20416s: evicting client at 172.17.115.125@tcp ns: mdt-ffff881819bc3000 lock: ffff880b15615d80/0x5e448803b245dd87 lrc: 3/0,0 mode: PR/PR res: 411041793/558694943 bits 0x3 rrc: 1008 type: IBT flags: 0x4000030 remote: 0x39f8634c334bd35f expref: 19 pid: 5288 timeout: 4338265010
      Mar 18 06:12:37 lus09-mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 21617s: evicting client at 172.17.140.31@tcp ns: mdt-ffff881819bc3000 lock: ffff880b20e8f900/0x5e448803b245ddcd lrc: 3/0,0 mode: PR/PR res: 411041793/558694943 bits 0x3 rrc: 1014 type: IBT flags: 0x4000030 remote: 0xe18d86102d68e1c2 expref: 22 pid: 5211 timeout: 4339466035
      Mar 18 06:32:38 lus09-mds2 kernel: LustreError: 8642:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 22818s: evicting client at 172.17.143.37@tcp ns: mdt-ffff881819bc3000 lock: ffff880b20a0a000/0x5e448803b245ded0 lrc: 3/0,0 mode: PR/PR res: 411041793/558694943 bits 0x3 rrc: 1013 type: IBT flags: 0x4000030 remote: 0x26a60eb067b19ece expref: 13 pid: 5323 timeout: 4340667019
      Mar 18 06:52:39 lus09-mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 24019s: evicting client at 172.17.140.25@tcp ns: mdt-ffff881819bc3000 lock: ffff8816e4ee5900/0x5e448803b245e00b lrc: 3/0,0 mode: PR/PR res: 411041793/558694943 bits 0x3 rrc: 1016 type: IBT flags: 0x4000030 remote: 0xbb9b59a85baadc32 expref: 20 pid: 5346 timeout: 4341868025
      Mar 18 07:12:40 lus09-mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 25220s: evicting client at 172.17.140.26@tcp ns: mdt-ffff881819bc3000 lock: ffff880b1b902240/0x5e448803b245e03c lrc: 3/0,0 mode: PR/PR res: 411041793/558694943 bits 0x3 rrc: 1015 type: IBT flags: 0x4000030 remote: 0xe8399533ab3b360c expref: 20 pid: 5375 timeout: 4343069015
      Mar 18 07:32:41 lus09-mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 26421s: evicting client at 172.17.7.141@tcp ns: mdt-ffff881819bc3000 lock: ffff8816e3802d80/0x5e448803b245e100 lrc: 3/0,0 mode: PR/PR res: 411041793/558694943 bits 0x3 rrc: 1017 type: IBT flags: 0x4000030 remote: 0x7c7cfa487563f6e6 expref: 16 pid: 5034 timeout: 4344270019
      Mar 18 07:52:42 lus09-mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 27622s: evicting client at 172.17.30.61@tcp ns: mdt-ffff881819bc3000 lock: ffff8816e1f3b480/0x5e448803b245e1cb lrc: 3/0,0 mode: PR/PR res: 411041793/558694943 bits 0x3 rrc: 1018 type: IBT flags: 0x4000030 remote: 0xc61848c213a21135 expref: 22 pid: 5262 timeout: 4345471015
      Mar 18 08:12:43 lus09-mds2 kernel: LustreError: 8642:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 28823s: evicting client at 172.17.119.20@tcp ns: mdt-ffff881819bc3000 lock: ffff880b21da5480/0x5e448803b245e591 lrc: 3/0,0 mode: PR/PR res: 411041793/558694943 bits 0x3 rrc: 1023 type: IBT flags: 0x4000030 remote: 0x69128bab309e0a76 expref: 37 pid: 4994 timeout: 4346672050
      Mar 18 08:32:44 lus09-mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 30024s: evicting client at 172.17.12.65@tcp ns: mdt-ffff881819bc3000 lock: ffff880b220666c0/0x5e448803b245e647 lrc: 3/0,0 mode: PR/PR res: 411041793/558694943 bits 0x3 rrc: 1019 type: IBT flags: 0x4000030 remote: 0xbfcd2a0005c3b47a expref: 10 pid: 5356 timeout: 4347873010
      Mar 18 08:52:45 lus09-mds2 kernel: LustreError: 8642:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 20417s: evicting client at 172.17.143.60@tcp ns: mdt-ffff881819bc3000 lock: ffff8816e3c25000/0x5e448803b2461293 lrc: 3/0,0 mode: PR/PR res: 411041793/558694943 bits 0x3 rrc: 1167 type: IBT flags: 0x4000030 remote: 0x9a1bba4fa51a1301 expref: 31 pid: 5453 timeout: 4349074754
      Mar 18 09:12:48 lus09-mds2 kernel: LustreError: 8645:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 1201s: evicting client at 172.17.125.21@tcp ns: mdt-ffff881819bc3000 lock: ffff8816dad51900/0x5e448803b2467901 lrc: 3/0,0 mode: PR/PR res: 411041793/558694943 bits 0x3 rrc: 1044 type: IBT flags: 0x4000020 remote: 0x1d8fffd2b0c2971f expref: 19 pid: 5313 timeout: 4350277824

      The length of the timers on the locks does appear to be excessive; having a timer expire after 8 hours seems less than optimal, and this might be related to LU-4579.
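
      For context, a minimal sketch of the timeout tunables that feed into these lock callback timers, which can be checked on both servers and clients (whether they explain the multi-hour expiries here is left open):

      lctl get_param timeout at_min at_max at_history
      lctl get_param ldlm.namespaces.*.lru_size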

      It appeared that the problem was migrating around the filesystems, and that, for example, an eviction from one system would cause an eviction from another system.


      Mar 17 12:12:13 hgs4a kernel: [1213093.322040] LustreError: 167-0: lus13-MDT0000-mdc-ffff882fe0884400: This client was evicted by lus13-MDT0000; in progress operations using this service will fail.
      Mar 17 12:12:21 hgs4a kernel: [1213101.030435] LustreError: 167-0: lus10-MDT0000-mdc-ffff882fac577c00: This client was evicted by lus10-MDT0000; in progress operations using this service will fail.
      Mar 17 12:14:40 hgs4a kernel: [1213239.422115] LustreError: 167-0: lus11-MDT0000-mdc-ffff882f8ac5dc00: This client was evicted by lus11-MDT0000; in progress operations using this service will fail.
      Mar 17 12:18:26 hgs4a kernel: [1213465.907318] LustreError: 167-0: lus13-MDT0000-mdc-ffff882fe0884400: This client was evicted by lus13-MDT0000; in progress operations using this service will fail.
      Mar 17 12:22:05 hgs4a kernel: [1213684.252829] LustreError: 167-0: lus10-MDT0000-mdc-ffff882fac577c00: This client was evicted by lus10-MDT0000; in progress operations using this service will fail.
      Mar 17 12:26:56 hgs4a kernel: [1213974.940999] LustreError: 167-0: lus08-MDT0000-mdc-ffff882fe0886c00: This client was evicted by lus08-MDT0000; in progress operations using this service will fail.
      Mar 17 12:33:53 hgs4a kernel: [1214391.253811] LustreError: 167-0: lus10-MDT0000-mdc-ffff882fac577c00: This client was evicted by lus10-MDT0000; in progress operations using this service will fail.

      And in hindsight it does appear that our Lustre 2.5 clients get more evictions than our 1.8.9 clients (a sketch of how these per-host counts can be derived from the server syslogs follows the table):
      host count percent
      hgs4g.internal.sanger.ac.uk 158 4.887102 ( 2.5.0 )
      hgs4k.internal.sanger.ac.uk 155 4.794309 ( 2.5.0 )
      hgs4c.internal.sanger.ac.uk 142 4.392205 ( 2.5.0 )
      hgs4a.internal.sanger.ac.uk 61 1.886792 ( 2.5.0 )
      hgs4d.internal.sanger.ac.uk 14 0.433034 ( 2.5.0 )
      deskpro100563.internal.sanger.ac.uk 13 0.402103 ( 1.8.9 however not in the datacenter)
      hgs4e.internal.sanger.ac.uk 12 0.371172 ( 2.5.0 )
      pcs5a.internal.sanger.ac.uk 7 0.216517 ( 1.8.9 shares networking with hgs4…. )
      lustre-utils01.internal.sanger.ac.uk 7 0.216517 ( 2.5.1-RC1 and head )
      uk10k-farm-srv1.internal.sanger.ac.uk 6 0.185586 ( 1.8.9 )
      pcs-genedb-dev.internal.sanger.ac.uk 6 0.185586 ( 1.8.9 a virtual machine )
      ht-1-1-15.internal.sanger.ac.uk 6 0.185586 ( 1.8.8 !! )
      bc-20-1-01.internal.sanger.ac.uk 6 0.185586 ( 1.8.8 !! )
      vr-1-1-05.internal.sanger.ac.uk 5 0.154655 ( 1.8.9 )
      ....
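
      A hedged sketch of how a per-client eviction count like the table above can be derived from a server syslog (the log path and message format are assumptions based on the excerpts in this ticket; the table groups by resolved hostname rather than NID):

      grep -oE 'evicting client at [0-9.]+@tcp' /var/log/kern.log \
          | awk '{print $4}' | sort | uniq -c | sort -rn | head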

      Attachments

        1. kern_logs_20140325.tar.gz
          0.2 kB
        2. kern_logs_20140328.tar.gz
          124 kB
        3. kern_logs_20140331.tar.gz
          2 kB
        4. kern_logs_20140401.tar.gz
          26 kB
        5. kern_logs_20140529.tar.gz
          292 kB
        6. kern_logs_20140807.tar.gz
          209 kB
        7. lnet_test.tar
          300 kB
        8. logs_client_hgs4.tar.gz
          1.64 MB
        9. logs_client_other_headnodes.tar.gz
          0.3 kB
        10. logs_client_other.tar.gz
          0.2 kB
        11. logs_server.tar.gz
          0.2 kB

        Issue Links

          Activity

            [LU-4786] Apparent denial of service from client to mdt

            hbrimmer Helen Brimmer added a comment -

            We have successfully got 2.5.59 build #2085 to build on our systems and have been running the test pipeline on six worker nodes for around 22 hours now. So far, we have not seen any stuck ptlrpcd threads and the jobs continue to flow through our batch scheduler. With the previous versions, we would have seen the nodes lock up within an hour of receiving pipeline jobs.

            To confirm that we've got the version you wanted us to test:

            [root@ht-1-1-07 lustre-dbug]$ cat /proc/fs/lustre/version 
            lustre: 2.5.59
            kernel: patchless_client
            build:  2.5.59-g5c4573e-CHANGED-3.2.0-58-generic
            

            We have not seen any problems on the servers either, which are still running v2.2.0. This looks very promising, but we do need to have a stable release for our production environment.

            green Oleg Drokin added a comment -

            It's git hash 5c4573e327c5d027a6b877f75910767244161a1f

            hbrimmer Helen Brimmer added a comment -

            We can try out 2.6 candidates on the set of machines that we have already been using to test 2.5.x. We are an Ubuntu shop though, so I am afraid that RPMs are not that helpful for us (the test clients are running Ubuntu 12.04.4 with kernel 3.2.0-58-generic). We have been building the 2.5.x +patch variants from the git sources. If you can confirm the git release tag to checkout from 2.6, we can try building it from there.
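
            For reference, a rough sketch of a client-only build from the git sources on an Ubuntu 12.04 node with the 3.2.0-58 kernel headers, checking out the hash Oleg gives above (further configure options may be needed depending on the tree):

            git clone git://git.whamcloud.com/fs/lustre-release.git
            cd lustre-release
            git checkout 5c4573e327c5d027a6b877f75910767244161a1f
            sh ./autogen.sh
            ./configure --disable-server --with-linux=/usr/src/linux-headers-3.2.0-58-generic
            make -j4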

            green Oleg Drokin added a comment -

            So, I have a question. Do you have a test system where you can try stuff on?
            Since we have a fuller version of this fix in the pre-release master version, it would be great if you could use a prerelease version of the 2.6 client (just the clients is enough).
            Or, if you prefer something closer to 2.5, we can provide you with a special client RPM that has this and some other patches (which we did not plan to include in the 2.5 releases as too invasive, but the current LU-4300 fix is based on them, which is why extensive porting is needed) so you can try this on your load and ensure the problem is fully solved.

            If you are interested in testing the actual 2.6 prerelease code, you can grab the RPMs at http://build.whamcloud.com/job/lustre-master/2085/ - pick the client version for your distro.

            green Oleg Drokin added a comment -

            The LU-4509 patch will be included.
            The LU-4300 patch will be included in the version that you ran, which did not prove to be too effective for you; the updated version probably won't make it in time for 2.5.2.

            hbrimmer Helen Brimmer added a comment -

            Thanks for looking through those logs and confirming the issue. We will wait for the new patch to try.

            We ran a full debug run yesterday anyway, including '+dlmtrace' logs from the OSSs. In case those traces are useful, I've posted them to the FTP site and put the kernel logs there as well this time. On yesterday's run, all of the client systems (ht-1-1-*) suffered ptlrpcd lockups and effectively lost access to the filesystem, but only one actually got evicted (ht-1-1-07) by a few of the OSTs.

            Do you know if LU-4300 and LU-4509 are going to be included in the 2.5.2 release?
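
            For reference, a minimal sketch of how +dlmtrace debugging of this kind is typically enabled and dumped on a server (buffer size and output path are illustrative):

            lctl set_param debug=+dlmtrace      # add dlmtrace to the debug mask
            lctl set_param debug_mb=512         # enlarge the in-memory debug buffer
            lctl dk /tmp/oss_dlmtrace.log       # dump and clear the kernel debug log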

            green Oleg Drokin added a comment -

            Thanks! I totally missed that you uploaded the kernel logs to the ticket and not to your FTP site, and there's no visual feedback in Jira that shows a file was attached to that comment. Sorry.

            Anyway, I looked at the logs.
            It seems you are still having LU-4300 issues. I know you have the patch for that applied, but that was a simplified patch for the 2.5 backport. It appears that the simplified version still left a different avenue for the problem to happen.
            We'll work on backporting the full version for you to try.

            Also, if you are trying without full debug first: if you see messages like "INFO: task ptlrpcd_26:2213 blocked for more than 120 seconds" in your client logs, that alone might be enough in some cases, so it might be a good idea to check for that before diving into the full debug runs (though I hope the problem will be over with this time around).
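
            A minimal sketch of checking a client for those hung-task messages before setting up a full debug run (the log path is illustrative):

            grep 'blocked for more than 120 seconds' /var/log/kern.log | grep ptlrpcd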

            hbrimmer Helen Brimmer added a comment -

            kern_logs_20140529.tar.gz attached to this ticket has kernel log files from all of the servers covering the period of the latest test.

            green Oleg Drokin added a comment -

            If you can upload the syslog part of the OSS logs that shows the evicted lock handles, that might be enough.
            So let's do that as a first step?

            hbrimmer Helen Brimmer added a comment -

            Many apologies, I overlooked the MDS (lus03-mds2) trace logs. I have now uploaded these (they are one large file, as the rate of production is not as high as for the clients). We did not see any evictions from the MDS during the test run though, only from the OSSes, which we are not currently collecting debug traces from. If you need us to do this, please let me know and I will get another run going as soon as I can.

            green Oleg Drokin added a comment -

            Helen, I got your logs, but it seems only the client logs from ht-1-1-09 and ht-1-1-07 are available.
            I really need at least something from the server (lus03?) so that I can get an idea of, for example, what lock handles to look for.


            People

              Assignee: green Oleg Drokin
              Reporter: manish Manish Patel (Inactive)
              Votes: 0
              Watchers: 12

              Dates

                Created:
                Updated:
                Resolved: