[LU-10887] 2 MDTs stuck in WAITING Created: 07/Apr/18 Updated: 03/May/18 Resolved: 19/Apr/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.3 |
| Fix Version/s: | Lustre 2.12.0, Lustre 2.10.4 |
| Type: | Bug | Priority: | Critical |
| Reporter: | SC Admin (Inactive) | Assignee: | nasf (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Environment: |
centos7.4, 3.10.0-693.17.1.el7.x86_64, zfs 0.7.6, lustre 2.10.3 + patch from https://review.whamcloud.com/#/c/31354/, OPA, skylake, raidz1 over hardware raid1 MDTs on SSD. |
||
| Attachments: |
|
| Issue Links: |
|
| Epic/Theme: | dne |
| Severity: | 1 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Hi,

finds were hanging on the main filesystem from one client. These processes looked to be unkillable. I rebooted the client running the finds and restarted the find sweep, but they hung again. I then failed over all the MDTs to one MDS (we have 2), and that went ok. I then failed all the MDTs back to the other MDS and it LBUG'd:

kernel: LustreError: 49321:0:(lu_object.c:1177:lu_device_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1

Since then 2 of the MDTs won't connect. They are stuck in WAITING state and never get to RECOVERING or COMPLETE.

[warble1]root: cat /proc/fs/lustre/mdt/dagg-MDT0001/recovery_status
status: WAITING
non-ready MDTs: 0000
recovery_start: 1523093864
time_waited: 388

[warble1]root: cat /proc/fs/lustre/mdt/dagg-MDT0002/recovery_status
status: WAITING
non-ready MDTs: 0000
recovery_start: 1523093864
time_waited: 391

The other MDT is ok.

[warble2]root: cat /proc/fs/lustre/mdt/dagg-MDT0000/recovery_status
status: COMPLETE
recovery_start: 1523093168
recovery_duration: 30
completed_clients: 122/122
replayed_requests: 0
last_transno: 214748364800
VBR: DISABLED
IR: DISABLED

I've tried umounting a few times and remounting, but time_waited just keeps incrementing. It gets to 900s, spits out a message, and then looks like it keeps going forever.

any ideas?

cheers, |
| Comments |
| Comment by Peter Jones [ 07/Apr/18 ] |
|
Robin, just to acknowledge that we have seen this ticket. I am discussing with engineering. Peter |
| Comment by Peter Jones [ 07/Apr/18 ] |
|
Robin, could you please attach MDT logs while this is happening? Peter |
| Comment by SC Admin (Inactive) [ 07/Apr/18 ] |
|
Hi Peter, logs attached. Happy to try 2.11 if you like. cheers, |
| Comment by Mikhail Pershin [ 07/Apr/18 ] |
|
Hello Robin, have you tried to abort recovery on these 'WAITING' servers? Mike |
| Comment by SC Admin (Inactive) [ 07/Apr/18 ] |
|
Hi Mike,

haven't heard of abort recovery before... will give it a try though.

[warble1]root: lctl abort_recovery dagg-MDT0001
abort recovery on a restarting MDT or OST device
[warble1]root: lctl abort_recovery dagg-MDT0002
abort recovery on a restarting MDT or OST device

and the same on MDT0 (may as well, even though that one seems fine) and it didn't do anything.

try umount and remount with -o abort_recov ->

[warble1]root: df
Filesystem 1K-blocks Used Available Use% Mounted on
ramroot 395507840 1220964 394286876 1% /
tmpfs 395507840 16924 395490916 1% /run
tmpfs 395507840 0 395507840 0% /sys/fs/cgroup
none 395483644 0 395483644 0% /dev
/dev/ram 20 0 20 0% /ram
tmpfs 395507840 39648 395468192 1% /dev/shm
/dev/mapper/centos-root 52403200 9595628 42807572 19% /mnt/root
warble1-MGT-pool/MGT 245120 1920 241152 1% /lustre/MGT
warble1-images-MDT0-pool/MDT0 109845504 2376320 107467136 3% /lustre/images/MDT0
warble1-dagg-MDT1-pool/MDT1 630229504 16695168 613532288 3% /lustre/dagg/MDT1
warble1-dagg-MDT2-pool/MDT2 632871296 15122688 617746560 3% /lustre/dagg/MDT2
warble2-dagg-MDT0-pool/MDT0 632858752 12885248 619971456 3% /lustre/dagg/MDT0
warble2-apps-MDT0-pool/MDT0 109811840 9916032 99893760 10% /lustre/apps/MDT0
warble2-home-MDT0-pool/MDT0 109846400 2883584 106960768 3% /lustre/home/MDT0

[warble1]root: pcs property set maintenance-mode=true
[warble1]root: umount /lustre/dagg/MDT2
[warble1]root: umount /lustre/dagg/MDT1
[warble1]root: mount -t lustre -o abort_recov warble1-dagg-MDT2-pool/MDT2 /lustre/dagg/MDT2

is so far hanging.

[warble1]root: mount -t lustre -o abort_recov warble1-dagg-MDT1-pool/MDT1 /lustre/dagg/MDT1

which is also hanging. This is recovery_status now:

[warble1]root: cat /proc/fs/lustre/mdt/dagg-MDT0000/recovery_status
status: COMPLETE
recovery_start: 1523100736
recovery_duration: 21
completed_clients: 123/123
replayed_requests: 0
last_transno: 219043332105
VBR: DISABLED
IR: ENABLED
[warble1]root: cat /proc/fs/lustre/mdt/dagg-MDT0001/recovery_status
cat: /proc/fs/lustre/mdt/dagg-MDT0001/recovery_status: No such file or directory
[warble1]root: cat /proc/fs/lustre/mdt/dagg-MDT0002/recovery_status
status: WAITING
non-ready MDTs: 0000 0001
recovery_start: 0
time_waited: 1523106303
[warble1]root:

cheers, |
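For reference, the "abort recovery on a restarting MDT or OST device" output above looks like lctl printing its usage text, i.e. the MDT name was not accepted as an argument; abort_recovery acts on the currently selected device, so the intended invocation is probably something along these lines (a sketch only, not a verified sequence):

lctl dl                                      # list local OBD devices and their names/indexes
lctl --device dagg-MDT0001 abort_recovery    # abort recovery on the selected device
lctl --device dagg-MDT0002 abort_recovery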
| Comment by SC Admin (Inactive) [ 07/Apr/18 ] |
|
Nothing happened after waiting an hour or 2, so abort recovery didn't work. ^C'd the mounts, and because lustre and mount were now conflicted about what was mounted and what wasn't, I was forced to reboot both MDS's. pacemaker/corosync is now in tiny broken shards, sigh.

However, it's still the same from lustre after mounting everything by hand: dagg-MDT0 COMPLETED, and MDT1,2 in WAITING state.

I suspect we'll have to roll back a zfs transaction or two in order to get this to mount. Does that sound right to y'all? Anyone done it before and know where to start without losing too much data? cheers, |
| Comment by Nathaniel Clark [ 07/Apr/18 ] |
|
Can you attach zpool status from MDTs 1 and 2? Hopefully the pools aren't corrupt. Also the messages file? |
| Comment by SC Admin (Inactive) [ 07/Apr/18 ] |
|
Hi Nathaniel, messages are already attached to this ticket. We have centrally syslog'd messages and minimal state on the lustre servers. Currently everything is mounted on warble1 with nothing running on warble2, so I'll attach zpool status and zfs list for those. But as per usual with zfs, these look very neat and tidy and there's nothing interesting. It's not what you'd call verbose about what it's doing... cheers, |
| Comment by Oleg Drokin [ 07/Apr/18 ] |
|
It looks like dagg-MDT1 and 2 cannot talk to each other and MDT0; this is evident in MDT2 saying that MDT0 and 1 are "not ready", and DNE recovery depends on all MDT targets being available, otherwise you need to abort recovery. That said, it's strange that your MDT1 mount hangs; are there any additional stack traces printed into messages/dmesg (since the snapshot you provided appears to have been from before the hang)? Did you try to mount with -o abort_recov after your latest reboot? That seems like the thing to do. |
| Comment by Andreas Dilger [ 07/Apr/18 ] |
|
DNE doesn't allow abort_recovery (though maybe it should?). We need MDT0000 to be able to connect to MDT0001 and MDT0002. |
| Comment by Oleg Drokin [ 07/Apr/18 ] |
|
In my discussions with Di at the time of DNE landing he told me that the way to recover from a lost MDT server is by administrative action (which I think was abort recovery) followed by fsck or some such to ensure the namespace is coherent. |
| Comment by Di Wang [ 07/Apr/18 ] |
|
Could you please dump the stack trace of the hanging MDS? I suspect the update log (for DNE recovery) might be corrupted somehow. If that is the case, you might also need to delete those update logs manually from ZFS to abort recovery. (Hmm, this should be added to the abort recovery process.) |
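A minimal sketch of the manual inspection step being described here, using the dataset names from the df output earlier in this ticket; whether an rm over the ZFS mount actually removes the llog objects is exactly what the next comment tests:

# on the MDS, with the MDT not mounted as lustre
zfs set canmount=on warble1-dagg-MDT1-pool/MDT1
zfs mount warble1-dagg-MDT1-pool/MDT1
ls -l /warble1-dagg-MDT1-pool/MDT1/update_log_dir/   # per-MDT update llogs, named by FID
# ... inspect / remove suspect update llog files here ...
umount /warble1-dagg-MDT1-pool/MDT1
zfs set canmount=off warble1-dagg-MDT1-pool/MDT1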
| Comment by Di Wang [ 08/Apr/18 ] |
|
Hmm, I tried to mount the zfs osd, delete those update log files, then mount the MDT again, but it failed:

[root@onyx-82 mdt1]# ls
CATALOGS lfsck_layout oi.100 oi.109 oi.117 oi.125 oi.19 oi.27 oi.35 oi.43 oi.51 oi.6 oi.68 oi.76 oi.84 oi.92 quota_master changelog_catalog lfsck_namespace oi.101 oi.11 oi.118 oi.126 oi.2 oi.28 oi.36 oi.44 oi.52 oi.60 oi.69 oi.77 oi.85 oi.93 quota_slave changelog_users lov_objid oi.102 oi.110 oi.119 oi.127 oi.20 oi.29 oi.37 oi.45 oi.53 oi.61 oi.7 oi.78 oi.86 oi.94 reply_data CONFIGS NIDTBL_VERSIONS oi.103 oi.111 oi.12 oi.13 oi.21 oi.3 oi.38 oi.46 oi.54 oi.62 oi.70 oi.79 oi.87 oi.95 ROOT fld nodemap oi.104 oi.112 oi.120 oi.14 oi.22 oi.30 oi.39 oi.47 oi.55 oi.63 oi.71 oi.8 oi.88 oi.96 seq_ctl hsm_actions O oi.105 oi.113 oi.121 oi.15 oi.23 oi.31 oi.4 oi.48 oi.56 oi.64 oi.72 oi.80 oi.89 oi.97 seq_srv last_rcvd oi.0 oi.106 oi.114 oi.122 oi.16 oi.24 oi.32 oi.40 oi.49 oi.57 oi.65 oi.73 oi.81 oi.9 oi.98 update_log LFSCK oi.1 oi.107 oi.115 oi.123 oi.17 oi.25 oi.33 oi.41 oi.5 oi.58 oi.66 oi.74 oi.82 oi.90 oi.99 update_log_dir lfsck_bookmark oi.10 oi.108 oi.116 oi.124 oi.18 oi.26 oi.34 oi.42 oi.50 oi.59 oi.67 oi.75 oi.83 oi.91 PENDING
[root@onyx-82 mdt1]# rm -rf update_log
[root@onyx-82 mdt1]# cd update_log_dir/
[root@onyx-82 update_log_dir]# ls
[0x200000400:0x1:0x0] [0x200000401:0x1:0x0]
[root@onyx-82 update_log_dir]# rm -rf *
[root@onyx-82 tests]# zfs set canmount=off lustre-mdt1/mdt1
[root@onyx-82 tests]# mount -t lustre lustre-mdt1/mdt1 /mnt/lustre-mds1
mount.lustre: mount lustre-mdt1/mdt1 at /mnt/lustre-mds1 failed: File exists

According to the debug log, it seems rm over the zfs mount can not really delete those update log files.

00080000:00000001:22.0:1523145755.399366:0:91949:0:(osd_object.c:1654:osd_create()) Process entered
00080000:00000001:22.0:1523145755.399403:0:91949:0:(osd_oi.c:253:fid_is_on_ost()) Process entered
00080000:00000001:22.0:1523145755.399403:0:91949:0:(osd_oi.c:281:fid_is_on_ost()) Process leaving (rc=0 : 0 : 0)
00080000:00000001:22.0:1523145755.399444:0:91949:0:(osd_object.c:1707:osd_create()) Process leaving via out (rc=18446744073709551599 : -17 : 0xffffffffffffffef)
00080000:00000001:22.0:1523145755.399450:0:91949:0:(osd_object.c:1750:osd_create()) Process leaving (rc=18446744073709551599 : -17 : ffffffffffffffef)
00000020:00000001:22.0:1523145755.399451:0:91949:0:(local_storage.c:264:local_object_create()) Process leaving (rc=18446744073709551599 : -17 : ffffffffffffffef)
00000020:00000001:22.0:1523145755.399453:0:91949:0:(local_storage.c:382:__local_file_create()) Process leaving via unlock (rc=18446744073709551599 : -17 : 0xffffffffffffffef)
00080000:00000001:22.0:1523145755.399455:0:91949:0:(osd_handler.c:284:osd_trans_stop()) Process entered
00080000:00000001:22.0:1523145755.399458:0:91949:0:(osd_handler.c:324:osd_trans_stop()) Process leaving (rc=0 : 0 : 0)
00080000:00000010:22.0:1523145755.399461:0:91949:0:(osd_object.c:506:osd_object_free()) slab-freed '(obj)': 312 at ffff8810512a6220.
00000020:00000010:22.0:1523145755.399463:0:91949:0:(local_storage.c:68:ls_object_free()) kfreed 'obj': 144 at ffff8810285c69c0.
00000020:00000001:22.0:1523145755.399464:0:91949:0:(local_storage.c:425:__local_file_create()) Process leaving (rc=18446744073709551599 : -17 : ffffffffffffffef)
00000004:00000001:22.0:1523145755.399466:0:91949:0:(lod_dev.c:1204:lod_prepare()) Process leaving via out_put (rc=18446744073709551599 : -17 : 0xffffffffffffffef)

I remember Alex mentioned this once, but I forgot the details. Any suggestions on how to delete the llog file on the zfs osd?
Alex, Andreas, Oleg? |
| Comment by Andreas Dilger [ 08/Apr/18 ] |
|
Di, maybe try running "lctl lfsck_start -A" on the MDS after removing the files, so the OI entries are cleaned up? |
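A rough sketch of what that would look like, assuming the standard lfsck interface (exact parameter names may vary by version):

lctl lfsck_start -A                      # start LFSCK on all targets
lctl get_param mdd.*.lfsck_namespace     # watch phase/status while it runs
lctl lfsck_stop -A                       # stop it again if needed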
| Comment by Di Wang [ 08/Apr/18 ] |
|
Andreas, I deleted the local llog file over the zfs mount, instead of lustre; after that the MDT mount fails. Can lfsck be used in this case? Besides, this failure seems to have happened during object creation, instead of OI insertion, so it looks to me like this local llog file becomes an orphan after deletion. Is there a ZFS tool to fix this? Thanks. |
| Comment by SC Admin (Inactive) [ 08/Apr/18 ] |
|
Hi Di, sysrq traces attached now. cheers, |
| Comment by SC Admin (Inactive) [ 08/Apr/18 ] |
|
Hi Oleg and Nathaniel, all of the central messages for the warbles for yesterday are attached now - warbles-messages-20180408.txt. There are indeed stack traces for the hung mount -o abort_recov's in there, from mount.lustre around Apr 7 23:02:41. cheers, |
| Comment by SC Admin (Inactive) [ 08/Apr/18 ] |
|
there's also about 100 kbytes/sec of IB network traffic and a load of 2 or 3 on warble1 (all the MDTs are on this machine still)

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+  P WCHAN      COMMAND
35047 root 20 0 0 0 0 S 65.8 0.0 58:51.46 0 - mdt_out00_002
43540 root 20 0 0 0 0 R 59.8 0.0 169:30.37 7 - mdt_out01_003
29210 root 20 0 0 0 0 R 51.8 0.0 401:32.95 8 ptlrpc_se+ lod0001_rec0000
31314 root 20 0 0 0 0 R 32.2 0.0 246:22.82 9 ptlrpc_se+ lod0002_rec0000
35586 root 20 0 0 0 0 S 19.6 0.0 216:59.66 13 ptlrpc_wa+ mdt_out01_002
35475 root 20 0 0 0 0 S 3.3 0.0 46:55.62 4 ptlrpc_wa+ mdt_out00_003
37366 root 20 0 0 0 0 S 0.7 0.0 189:31.48 0 ptlrpc_wa+ mdt_out00_004
287990 root 20 0 158836 3352 1560 R 0.7 0.0 0:00.22 14 - top
16 root 0 -20 0 0 0 S 0.3 0.0 0:06.63 1 worker_th+ kworker/1:0H
14525 ganglia 20 0 104468 10036 2556 S 0.3 0.0 2:25.02 8 poll_sche+ gmond
1 root 20 0 52644 4772 2464 S 0.0 0.0 0:16.06 9 ep_poll systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:01.93 6 kthreadd kthreadd
3 root 20 0 0 0 0 S 0.0 0.0 0:00.31 0 smpboot_t+ ksoftirqd/0
5 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 0 worker_th+ kworker/0:0H
8 root rt 0 0 0 0 S 0.0 0.0 0:01.64 0 smpboot_t+ migration/0
9 root 20 0 0 0 0 S 0.0 0.0 0:00.00 0 rcu_gp_kt+ rcu_bh
10 root 20 0 0 0 0 S 0.0 0.0 0:14.42 4 rcu_gp_kt+ rcu_sched
|
| Comment by SC Admin (Inactive) [ 08/Apr/18 ] |
|
Yesterday I noticed in recovery_status that when dagg-MDT1 and dagg-MDT2 first mount they say "non-ready MDTs: 0000 0002" and "non-ready MDTs: 0000 0001" respectively, but then quickly see each other and go to the above "non-ready MDTs: 0000".

As a bit of background, dagg-MDT1 and 2 are usually on warble1 (as the zpool name in the df above indicates) and dagg-MDT0 is usually on warble2, but when warble1 LBUG'd on umount (as part of the initial failover everything, failback everything attempt) then actually all the MDTs were on warble1, so I'm not sure that means anything.

All the lnet connections look ok BTW - lctl ping, list_nids - I just thought I'd better check that 'cos I forgot to do it yesterday, and this seemed like it might be part of a "connection issue" related problem. cheers, |
| Comment by Andreas Dilger [ 08/Apr/18 ] |
|
Did you try mounting the MDTs (in particular MDT0000) on the normal MDS? It may be that failover for the MDTs is not configured properly, so when they are started on the backup MDS it isn't possible to find MDT0000 and the startup/recovery of MDT0001 and MDT0002 is stuck. |
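One way to sanity-check the stored failover configuration, assuming tunefs.lustre can be pointed at the ZFS datasets to print their parameters (dataset names as in the df output above); this is a read-only check:

tunefs.lustre --dryrun warble2-dagg-MDT0-pool/MDT0   # look for failover.node / mgsnode entries
tunefs.lustre --dryrun warble1-dagg-MDT1-pool/MDT1
tunefs.lustre --dryrun warble1-dagg-MDT2-pool/MDT2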
| Comment by Andreas Dilger [ 08/Apr/18 ] |
|
Di, actually, it might be possible to try editing the OI directly... With ZFS the oi "files" just appear to be directories with filenames that are FIDs, so if you also know the FID of the files being deleted (lfs path2fid <file> should print the FID even if it is mounted locally), then you can also try deleting that file from the appropriate oi.N directory. I haven't tried this myself, but should work on your test system. |
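A hypothetical sketch of that on the test system from the earlier comment, assuming the OI entries show up as FID-style names when the dataset is mounted through the ZFS POSIX layer (untested, as noted above):

# the llogs removed from update_log_dir were named by FID, e.g. [0x200000400:0x1:0x0]
cd /lustre-mdt1/mdt1        # wherever the zfs dataset is mounted (path is an assumption)
for d in oi.*; do
    ls "$d" 2>/dev/null | grep -F '0x200000400:0x1:0x0' >/dev/null && echo "entry found in $d"
done
# ... then remove the matching oi.N entry along with the update_log_dir file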
| Comment by Di Wang [ 08/Apr/18 ] |
|
Hmm, according to the trace MDT1 and 2 are indeed still recovering. Apr 8 13:25:00 warble1 kernel: lod0002_rec0000 R running task 0 31314 2 0x00000000 Apr 8 13:25:00 warble1 kernel: Call Trace: Apr 8 13:25:00 warble1 kernel: [<ffffffff816ab8a9>] schedule+0x29/0x70 Apr 8 13:25:00 warble1 kernel: [<ffffffff816a91f4>] schedule_timeout+0x174/0x2c0 Apr 8 13:25:00 warble1 kernel: [<ffffffff8109a8a0>] ? internal_add_timer+0x70/0x70 Apr 8 13:25:00 warble1 kernel: [<ffffffffc0f34e40>] ptlrpc_set_wait+0x4c0/0x910 [ptlrpc] Apr 8 13:25:00 warble1 kernel: [<ffffffff810c6620>] ? wake_up_state+0x20/0x20 Apr 8 13:25:00 warble1 kernel: [<ffffffffc0f3530d>] ptlrpc_queue_wait+0x7d/0x220 [ptlrpc] Apr 8 13:25:00 warble1 kernel: [<ffffffffc1307d6d>] osp_md_read+0x4cd/0x8c0 [osp] Apr 8 13:25:00 warble1 kernel: [<ffffffffc0d69164>] dt_read+0x14/0x50 [obdclass] Apr 8 13:25:00 warble1 kernel: [<ffffffffc0d2b7ca>] llog_osd_next_block+0x36a/0xbe0 [obdclass] Apr 8 13:25:00 warble1 kernel: [<ffffffff816ab2f4>] ? __schedule+0x424/0x9b0 Apr 8 13:25:00 warble1 kernel: [<ffffffffc0d1f294>] llog_process_thread+0x294/0x1180 [obdclass] Apr 8 13:25:00 warble1 kernel: [<ffffffffc0d25660>] ? llog_cat_cancel_records+0x3d0/0x3d0 [obdclass] Apr 8 13:25:00 warble1 kernel: [<ffffffffc0d2023c>] llog_process_or_fork+0xbc/0x450 [obdclass] Apr 8 13:25:00 warble1 kernel: [<ffffffffc0d25660>] ? llog_cat_cancel_records+0x3d0/0x3d0 [obdclass] Apr 8 13:25:00 warble1 kernel: [<ffffffffc0d24b39>] llog_cat_process_or_fork+0x199/0x2a0 [obdclass] Apr 8 13:25:00 warble1 kernel: [<ffffffffc12aacfa>] ? lod_sub_prep_llog+0x24a/0x783 [lod] Apr 8 13:25:00 warble1 kernel: [<ffffffffc127a520>] ? lodname2mdt_index+0x2f0/0x2f0 [lod] Apr 8 13:25:00 warble1 kernel: [<ffffffffc0d24c6e>] llog_cat_process+0x2e/0x30 [obdclass] Apr 8 13:25:00 warble1 kernel: [<ffffffffc1276a69>] lod_sub_recovery_thread+0x419/0xaf0 [lod] Apr 8 13:25:00 warble1 kernel: [<ffffffffc1276650>] ? lod_trans_stop+0x340/0x340 [lod] Apr 8 13:25:00 warble1 kernel: [<ffffffff810b270f>] kthread+0xcf/0xe0 Apr 8 13:25:00 warble1 kernel: [<ffffffff810b2640>] ? insert_kthread_work+0x40/0x40 Apr 8 13:25:00 warble1 kernel: [<ffffffff816b8798>] ret_from_fork+0x58/0x90 Apr 8 13:25:00 warble1 kernel: [<ffffffff810b2640>] ? insert_kthread_work+0x40/0x40 Apr 8 13:24:53 warble1 kernel: lod0001_rec0000 R running task 0 29210 2 0x00000000 Apr 8 13:24:53 warble1 kernel: Call Trace: Apr 8 13:24:53 warble1 kernel: [<ffffffff810bf254>] ? __wake_up+0x44/0x50 Apr 8 13:24:53 warble1 kernel: [<ffffffff816ab8a9>] ? schedule+0x29/0x70 Apr 8 13:24:53 warble1 kernel: [<ffffffff816a91f4>] ? schedule_timeout+0x174/0x2c0 Apr 8 13:24:53 warble1 kernel: [<ffffffffc0f30158>] ? ptlrpc_send_new_req+0x468/0xa60 [ptlrpc] Apr 8 13:24:53 warble1 kernel: [<ffffffff8133fdfd>] ? list_del+0xd/0x30 Apr 8 13:24:53 warble1 kernel: [<ffffffffc0f34e40>] ? ptlrpc_set_wait+0x4c0/0x910 [ptlrpc] Apr 8 13:24:53 warble1 kernel: [<ffffffffc0d322d9>] ? lustre_get_jobid+0x99/0x4d0 [obdclass] Apr 8 13:24:53 warble1 kernel: [<ffffffff810c6620>] ? wake_up_state+0x20/0x20 Apr 8 13:24:53 warble1 kernel: [<ffffffffc0f2cd4d>] ? ptlrpc_prep_bulk_imp+0x5d/0x180 [ptlrpc] Apr 8 13:24:53 warble1 kernel: [<ffffffffc0f3530d>] ? ptlrpc_queue_wait+0x7d/0x220 [ptlrpc] Apr 8 13:24:53 warble1 kernel: [<ffffffffc0f2ea30>] ? ptlrpc_req_finished+0x10/0x20 [ptlrpc] Apr 8 13:24:53 warble1 kernel: [<ffffffffc1307d6d>] ? osp_md_read+0x4cd/0x8c0 [osp] Apr 8 13:24:53 warble1 kernel: [<ffffffffc12f22c2>] ? 
osp_attr_get+0x102/0x6f0 [osp] Apr 8 13:24:53 warble1 kernel: [<ffffffffc0d69164>] ? dt_read+0x14/0x50 [obdclass] Apr 8 13:24:53 warble1 kernel: [<ffffffffc0d2b7ca>] ? llog_osd_next_block+0x36a/0xbe0 [obdclass] Apr 8 13:24:53 warble1 kernel: [<ffffffff816ab2f4>] ? __schedule+0x424/0x9b0 Apr 8 13:24:53 warble1 kernel: [<ffffffffc0d1f294>] ? llog_process_thread+0x294/0x1180 [obdclass] Apr 8 13:24:53 warble1 kernel: [<ffffffffc0d25660>] ? llog_cat_cancel_records+0x3d0/0x3d0 [obdclass] Apr 8 13:24:53 warble1 kernel: [<ffffffffc0d2023c>] ? llog_process_or_fork+0xbc/0x450 [obdclass] Apr 8 13:24:53 warble1 kernel: [<ffffffffc0d25660>] ? llog_cat_cancel_records+0x3d0/0x3d0 [obdclass] Apr 8 13:24:53 warble1 kernel: [<ffffffffc0d24b39>] ? llog_cat_process_or_fork+0x199/0x2a0 [obdclass] Apr 8 13:24:53 warble1 kernel: [<ffffffffc12aacfa>] ? lod_sub_prep_llog+0x24a/0x783 [lod] Apr 8 13:24:53 warble1 kernel: [<ffffffffc127a520>] ? lodname2mdt_index+0x2f0/0x2f0 [lod] Apr 8 13:24:53 warble1 kernel: [<ffffffffc0d24c6e>] ? llog_cat_process+0x2e/0x30 [obdclass] Apr 8 13:24:53 warble1 kernel: [<ffffffffc1276a69>] ? lod_sub_recovery_thread+0x419/0xaf0 [lod] Apr 8 13:24:53 warble1 kernel: [<ffffffffc1276650>] ? lod_trans_stop+0x340/0x340 [lod] Apr 8 13:24:53 warble1 kernel: [<ffffffff810b270f>] ? kthread+0xcf/0xe0 Apr 8 13:24:53 warble1 kernel: [<ffffffff810b2640>] ? insert_kthread_work+0x40/0x40 Apr 8 13:24:53 warble1 kernel: [<ffffffff816b8798>] ? ret_from_fork+0x58/0x90 Apr 8 13:24:53 warble1 kernel: [<ffffffff810b2640>] ? insert_kthread_work+0x40/0x40 According to these 2 traces, MDT1/2 indeed keep reading update log. It is either because of network issue or corrupted llog. I did see some timeout connecting request from the console, but seems temporary issue. Do you have recent console log, i.e. when the stack strace is created around Apr 8 13:24:53. Apr 8 01:10:07 warble1 kernel: Lustre: 23085:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1523113801/real 0] req@ffff88bd81d0bc00 x1597100427070368/t0(0) o38->dagg-MDT0001-osp-MDT0002@192.168.44.22@o2ib44:24/4 lens 520/544 e 0 to 1 dl 1523113807 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1 And could you please also mount your mdt0 as zfs to see how much update log here. After you mount your MDT0 as zfs, it should be here. [root@onyx-82 mdt1]# ls CATALOGS lfsck_layout oi.100 oi.109 oi.117 oi.125 oi.19 oi.27 oi.35 oi.43 oi.51 oi.6 oi.68 oi.76 oi.84 oi.92 quota_master changelog_catalog lfsck_namespace oi.101 oi.11 oi.118 oi.126 oi.2 oi.28 oi.36 oi.44 oi.52 oi.60 oi.69 oi.77 oi.85 oi.93 quota_slave changelog_users lov_objid oi.102 oi.110 oi.119 oi.127 oi.20 oi.29 oi.37 oi.45 oi.53 oi.61 oi.7 oi.78 oi.86 oi.94 reply_data CONFIGS NIDTBL_VERSIONS oi.103 oi.111 oi.12 oi.13 oi.21 oi.3 oi.38 oi.46 oi.54 oi.62 oi.70 oi.79 oi.87 oi.95 ROOT fld nodemap oi.104 oi.112 oi.120 oi.14 oi.22 oi.30 oi.39 oi.47 oi.55 oi.63 oi.71 oi.8 oi.88 oi.96 seq_ctl hsm_actions O oi.105 oi.113 oi.121 oi.15 oi.23 oi.31 oi.4 oi.48 oi.56 oi.64 oi.72 oi.80 oi.89 oi.97 seq_srv last_rcvd oi.0 oi.106 oi.114 oi.122 oi.16 oi.24 oi.32 oi.40 oi.49 oi.57 oi.65 oi.73 oi.81 oi.9 oi.98 update_log LFSCK oi.1 oi.107 oi.115 oi.123 oi.17 oi.25 oi.33 oi.41 oi.5 oi.58 oi.66 oi.74 oi.82 oi.90 oi.99 update_log_dir lfsck_bookmark oi.10 oi.108 oi.116 oi.124 oi.18 oi.26 oi.34 oi.42 oi.50 oi.59 oi.67 oi.75 oi.83 oi.91 PENDING [root@onyx-82 mdt1]# cd update_log_dir/ [root@onyx-82 update_log_dir]# ls -l |
| Comment by Di Wang [ 08/Apr/18 ] |
|
Andreas, ah, ok. thanks |
| Comment by SC Admin (Inactive) [ 08/Apr/18 ] |
|
Hi Andreas, I checked that the zfs lustre params are fine. also these filesystems have been failed over and back dozens of times before so it seems unlikely. but anyway, sure, why not this is with MDT0 mounted on warble1: [warble2]root: for f in /proc/fs/lustre/mdt/dagg-MDT000*/recovery_status; do echo $f; cat $f; done /proc/fs/lustre/mdt/dagg-MDT0000/recovery_status status: COMPLETE recovery_start: 1523177661 recovery_duration: 40 completed_clients: 123/123 replayed_requests: 0 last_transno: 227633266742 VBR: DISABLED IR: ENABLED MDT1,2 saw no change after the above, so I umount/mount'd them again just to see and for a while they said: [warble1]root: for f in /proc/fs/lustre/mdt/dagg-MDT000*/recovery_status; do echo $f; cat $f; done /proc/fs/lustre/mdt/dagg-MDT0001/recovery_status status: WAITING non-ready MDTs: 0000 0002 recovery_start: 1523177790 time_waited: 54 /proc/fs/lustre/mdt/dagg-MDT0002/recovery_status status: WAITING non-ready MDTs: 0000 0001 recovery_start: 1523177783 time_waited: 61 and after ~60s they saw each other, but still not MDT0: [warble1]root: for f in /proc/fs/lustre/mdt/dagg-MDT000*/recovery_status; do echo $f; cat $f; done /proc/fs/lustre/mdt/dagg-MDT0001/recovery_status status: WAITING non-ready MDTs: 0000 recovery_start: 1523177790 time_waited: 57 /proc/fs/lustre/mdt/dagg-MDT0002/recovery_status status: WAITING non-ready MDTs: 0000 recovery_start: 1523177783 time_waited: 64 syslog for this stuff Apr 8 18:37:26 warble1 kernel: Lustre: 29650:0:(ldlm_lib.c:1773:extend_recovery_timer()) dagg-MDT0001: extended recovery timer reaching hard limit: 900, extend: 1 Apr 8 18:37:26 warble1 kernel: Lustre: 29650:0:(ldlm_lib.c:1773:extend_recovery_timer()) Skipped 59 previous similar messages Apr 8 18:47:26 warble1 kernel: Lustre: 29650:0:(ldlm_lib.c:1773:extend_recovery_timer()) dagg-MDT0001: extended recovery timer reaching hard limit: 900, extend: 1 Apr 8 18:47:26 warble1 kernel: Lustre: 29650:0:(ldlm_lib.c:1773:extend_recovery_timer()) Skipped 59 previous similar messages Apr 8 18:49:55 warble2 crmd[17643]: notice: State transition S_IDLE -> S_POLICY_ENGINE Apr 8 18:49:55 warble2 pengine[17642]: error: Resource start-up disabled since no STONITH resources have been defined Apr 8 18:49:55 warble2 pengine[17642]: error: Either configure some or disable STONITH with the stonith-enabled option Apr 8 18:49:55 warble2 pengine[17642]: error: NOTE: Clusters with shared data need STONITH to ensure data integrity Apr 8 18:49:55 warble2 pengine[17642]: notice: Delaying fencing operations until there are resources to manage Apr 8 18:49:55 warble2 pengine[17642]: notice: Calculated transition 72, saving inputs in /var/lib/pacemaker/pengine/pe-input-1.bz2 Apr 8 18:49:55 warble2 pengine[17642]: notice: Configuration ERRORs found during PE processing. Please run "crm_verify -L" to identify issues. 
Apr 8 18:49:55 warble2 crmd[17643]: notice: Transition 72 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-1.bz2): Complete Apr 8 18:49:55 warble2 crmd[17643]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE Apr 8 18:52:54 warble1 kernel: Lustre: Failing over dagg-MDT0000 Apr 8 18:52:54 warble1 kernel: LustreError: 11-0: dagg-MDT0000-osp-MDT0001: operation out_update to node 0@lo failed: rc = -19 Apr 8 18:52:54 warble1 kernel: LustreError: Skipped 3 previous similar messages Apr 8 18:52:54 warble1 kernel: Lustre: dagg-MDT0000-osp-MDT0001: Connection to dagg-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete Apr 8 18:52:55 warble1 kernel: LustreError: 23085:0:(import.c:662:ptlrpc_connect_import()) can't connect to a closed import Apr 8 18:52:55 warble1 kernel: Lustre: dagg-MDT0000: Not available for connect from 192.168.44.218@o2ib44 (stopping) Apr 8 18:52:55 warble1 kernel: Lustre: server umount dagg-MDT0000 complete Apr 8 18:52:55 warble1 kernel: LustreError: 137-5: dagg-MDT0000_UUID: not available for connect from 192.168.44.206@o2ib44 (no target). If you are running an HA pair check that the target is mounted on the other server. Apr 8 18:52:55 warble1 kernel: LustreError: Skipped 106 previous similar messages Apr 8 18:52:56 warble1 kernel: Lustre: dagg-MDT0000-lwp-MDT0001: Connection to dagg-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete Apr 8 18:52:56 warble1 kernel: Lustre: Skipped 2 previous similar messages Apr 8 18:53:03 warble1 kernel: LustreError: 137-5: dagg-MDT0000_UUID: not available for connect from 192.168.44.180@o2ib44 (no target). If you are running an HA pair check that the target is mounted on the other server. Apr 8 18:53:03 warble1 kernel: LustreError: Skipped 47 previous similar messages Apr 8 18:53:19 warble1 kernel: LustreError: 137-5: dagg-MDT0000_UUID: not available for connect from 192.168.44.34@o2ib44 (no target). If you are running an HA pair check that the target is mounted on the other server. Apr 8 18:53:19 warble1 kernel: LustreError: Skipped 86 previous similar messages Apr 8 18:53:27 warble1 kernel: Lustre: 23085:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1523177601/real 1523177601] req@ffff885e775b8600 x1597136535119920/t0(0) o38->dagg-MDT0000-lwp-MDT0002@192.168.44.22@o2ib44:12/10 lens 520/544 e 0 to 1 dl 1523177607 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Apr 8 18:53:27 warble1 kernel: Lustre: 23085:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 1 previous similar message Apr 8 18:53:52 warble1 kernel: Lustre: 23085:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1523177626/real 1523177626] req@ffff885e775bda00 x1597136535120752/t0(0) o38->dagg-MDT0000-lwp-MDT0001@192.168.44.22@o2ib44:12/10 lens 520/544 e 0 to 1 dl 1523177632 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Apr 8 18:53:52 warble1 kernel: Lustre: 23085:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 2 previous similar messages Apr 8 18:54:00 warble1 kernel: LustreError: 137-5: dagg-MDT0000_UUID: not available for connect from 192.168.44.183@o2ib44 (no target). If you are running an HA pair check that the target is mounted on the other server. 
Apr 8 18:54:00 warble1 kernel: LustreError: Skipped 8 previous similar messages Apr 8 18:54:04 warble1 zed: eid=36 class=pool_export pool_guid=0xFC863664CC04C875 Apr 8 18:54:04 warble1 zed: eid=37 class=config_sync pool_guid=0xFC863664CC04C875 Apr 8 18:54:11 warble2 zed: eid=1 class=history_event pool_guid=0xFC863664CC04C875 Apr 8 18:54:11 warble2 zed: eid=2 class=config_sync pool_guid=0xFC863664CC04C875 Apr 8 18:54:11 warble2 zed: eid=3 class=pool_import pool_guid=0xFC863664CC04C875 Apr 8 18:54:11 warble2 zed: eid=4 class=history_event pool_guid=0xFC863664CC04C875 Apr 8 18:54:11 warble2 zed: eid=5 class=config_sync pool_guid=0xFC863664CC04C875 Apr 8 18:54:15 warble1 kernel: Lustre: MGS: Connection restored to 192.168.44.22@o2ib44 (at 192.168.44.22@o2ib44) Apr 8 18:54:15 warble1 kernel: Lustre: Skipped 89 previous similar messages Apr 8 18:54:15 warble1 kernel: Lustre: dagg-MDT0001: Received LWP connection from 192.168.44.22@o2ib44, removing former export from 0@lo Apr 8 18:54:15 warble1 kernel: Lustre: Skipped 1 previous similar message Apr 8 18:54:16 warble2 kernel: Lustre: dagg-MDT0000: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900 Apr 8 18:54:21 warble2 kernel: Lustre: 64976:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1523177655/real 1523177655] req@ffff885e132b0300 x1597150865654064/t0(0) o8->dagg-OST0003-osc-MDT0000@192.168.44.32@o2ib44:28/4 lens 520/544 e 0 to 1 dl 1523177660 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Apr 8 18:54:21 warble2 kernel: Lustre: dagg-MDT0000: Will be in recovery for at least 2:30, or until 123 clients reconnect Apr 8 18:54:21 warble2 kernel: Lustre: dagg-MDT0000: Connection restored to 06cf7e6b-adc2-7bcc-a2b7-855f8f3a8fa2 (at 192.168.44.218@o2ib44) Apr 8 18:54:21 warble2 kernel: Lustre: Skipped 1 previous similar message Apr 8 18:54:21 warble2 kernel: Lustre: dagg-MDT0000: Connection restored to 7a40ea20-4af2-a97e-1081-d13018e4c510 (at 192.168.44.118@o2ib44) Apr 8 18:54:21 warble2 kernel: Lustre: Skipped 8 previous similar messages Apr 8 18:54:22 warble1 kernel: Lustre: 23085:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1523177651/real 1523177651] req@ffff885e61065d00 x1597136535121888/t0(0) o38->dagg-MDT0000-osp-MDT0002@192.168.44.22@o2ib44:24/4 lens 520/544 e 0 to 1 dl 1523177662 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Apr 8 18:54:22 warble2 kernel: Lustre: dagg-MDT0000: Connection restored to bd95096f-2790-2a0e-d286-cdfd62034abc (at 192.168.44.13@o2ib44) Apr 8 18:54:22 warble2 kernel: Lustre: Skipped 22 previous similar messages Apr 8 18:54:24 warble2 kernel: Lustre: dagg-MDT0000: Connection restored to 8b6c4c51-04f0-03fc-6e16-7a25afda9657 (at 192.168.44.106@o2ib44) Apr 8 18:54:24 warble2 kernel: Lustre: Skipped 53 previous similar messages Apr 8 18:54:29 warble2 kernel: Lustre: dagg-MDT0000: Connection restored to 192.168.44.33@o2ib44 (at 192.168.44.33@o2ib44) Apr 8 18:54:29 warble2 kernel: Lustre: Skipped 38 previous similar messages Apr 8 18:54:36 warble1 kernel: LustreError: 167-0: dagg-MDT0000-lwp-MDT0001: This client was evicted by dagg-MDT0000; in progress operations using this service will fail. Apr 8 18:54:36 warble1 kernel: LustreError: 343298:0:(ldlm_resource.c:1100:ldlm_resource_complain()) dagg-MDT0000-lwp-MDT0001: namespace resource [0x200000006:0x1010000:0x0].0x0 (ffff88bd842b35c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Apr 8 18:54:36 warble1 kernel: LustreError: 343298:0:(ldlm_resource.c:1682:ldlm_resource_dump()) --- Resource: [0x200000006:0x1010000:0x0].0x0 (ffff88bd842b35c0) refcount = 2 Apr 8 18:54:36 warble1 kernel: LustreError: 343298:0:(ldlm_resource.c:1685:ldlm_resource_dump()) Granted locks (in reverse order): Apr 8 18:54:36 warble1 kernel: LustreError: 343298:0:(ldlm_resource.c:1688:ldlm_resource_dump()) ### ### ns: dagg-MDT0000-lwp-MDT0001 lock: ffff885ea1726e00/0xf6b5c0247189cd2c lrc: 2/1,0 mode: CR/CR res: [0x200000006:0x1010000:0x0].0x0 rrc: 3 type: PLN flags: 0x1106400000000 nid: local remote: 0xf6b5c024718da06f expref: -99 pid: 29158 timeout: 0 lvb_type: 2 Apr 8 18:54:36 warble1 kernel: Lustre: dagg-MDT0000-lwp-MDT0001: Connection restored to 192.168.44.22@o2ib44 (at 192.168.44.22@o2ib44) Apr 8 18:54:36 warble1 kernel: Lustre: Skipped 2 previous similar messages Apr 8 18:54:38 warble2 kernel: Lustre: dagg-MDT0000: Connection restored to 192.168.44.38@o2ib44 (at 192.168.44.38@o2ib44) Apr 8 18:54:38 warble2 kernel: Lustre: Skipped 3 previous similar messages Apr 8 18:54:55 warble2 kernel: Lustre: dagg-MDT0000: Connection restored to 192.168.44.37@o2ib44 (at 192.168.44.37@o2ib44) Apr 8 18:54:55 warble2 kernel: Lustre: Skipped 5 previous similar messages Apr 8 18:55:01 warble1 kernel: LustreError: 167-0: dagg-MDT0000-lwp-MDT0002: This client was evicted by dagg-MDT0000; in progress operations using this service will fail. Apr 8 18:55:01 warble2 kernel: Lustre: dagg-MDT0000: Recovery over after 0:40, of 123 clients 123 recovered and 0 were evicted. Apr 8 18:55:01 warble1 kernel: LustreError: 343319:0:(ldlm_resource.c:1100:ldlm_resource_complain()) dagg-MDT0000-lwp-MDT0002: namespace resource [0x200000006:0x1010000:0x0].0x0 (ffff885e658b6780) refcount nonzero (1) after lock cleanup; forcing cleanup. Apr 8 18:55:01 warble1 kernel: LustreError: 343319:0:(ldlm_resource.c:1682:ldlm_resource_dump()) --- Resource: [0x200000006:0x1010000:0x0].0x0 (ffff885e658b6780) refcount = 2 Apr 8 18:55:01 warble1 kernel: LustreError: 343319:0:(ldlm_resource.c:1685:ldlm_resource_dump()) Granted locks (in reverse order): Apr 8 18:55:01 warble1 kernel: LustreError: 343319:0:(ldlm_resource.c:1688:ldlm_resource_dump()) ### ### ns: dagg-MDT0000-lwp-MDT0002 lock: ffff885e75e2d200/0xf6b5c0247189db87 lrc: 2/1,0 mode: CR/CR res: [0x200000006:0x1010000:0x0].0x0 rrc: 3 type: PLN flags: 0x1106400000000 nid: local remote: 0xf6b5c024718e5ef0 expref: -99 pid: 31307 timeout: 0 lvb_type: 2 Apr 8 18:55:22 warble1 kernel: LustreError: 343369:0:(ldlm_resource.c:1100:ldlm_resource_complain()) dagg-MDT0000-lwp-MDT0001: namespace resource [0x200000006:0x1010000:0x0].0x0 (ffff88bd842b35c0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Apr 8 18:55:22 warble1 kernel: LustreError: 343369:0:(ldlm_resource.c:1682:ldlm_resource_dump()) --- Resource: [0x200000006:0x1010000:0x0].0x0 (ffff88bd842b35c0) refcount = 2 Apr 8 18:55:22 warble1 kernel: LustreError: 343369:0:(ldlm_resource.c:1685:ldlm_resource_dump()) Granted locks (in reverse order): Apr 8 18:55:22 warble1 kernel: LustreError: 343369:0:(ldlm_resource.c:1688:ldlm_resource_dump()) ### ### ns: dagg-MDT0000-lwp-MDT0001 lock: ffff885ea1726e00/0xf6b5c0247189cd2c lrc: 2/1,0 mode: CR/CR res: [0x200000006:0x1010000:0x0].0x0 rrc: 3 type: PLN flags: 0x1106400000000 nid: local remote: 0xf6b5c024718da06f expref: -99 pid: 29158 timeout: 0 lvb_type: 2 Apr 8 18:55:22 warble2 kernel: LNet: 64970:0:(o2iblnd_cb.c:396:kiblnd_handle_rx()) PUT_NACK from 192.168.44.21@o2ib44 Apr 8 18:55:23 warble1 kernel: Lustre: Failing over dagg-MDT0001 Apr 8 18:55:23 warble1 kernel: LustreError: 29210:0:(llog_osd.c:882:llog_osd_next_block()) dagg-MDT0000-osp-MDT0001: can't read llog block from log [0x2000013ca:0x1:0x0] offset 163840: rc = -5 Apr 8 18:55:23 warble1 kernel: LustreError: 29210:0:(llog.c:607:llog_process_thread()) dagg-MDT0000-osp-MDT0001 retry remote llog process Apr 8 18:55:23 warble1 kernel: LustreError: 29210:0:(lod_dev.c:419:lod_sub_recovery_thread()) dagg-MDT0000-osp-MDT0001 getting update log failed: rc = -11 Apr 8 18:55:23 warble1 kernel: LustreError: 23085:0:(import.c:662:ptlrpc_connect_import()) can't connect to a closed import Apr 8 18:55:23 warble1 kernel: LustreError: 23085:0:(import.c:662:ptlrpc_connect_import()) Skipped 1 previous similar message Apr 8 18:55:23 warble1 kernel: LustreError: 343369:0:(ldlm_lib.c:2595:target_stop_recovery_thread()) dagg-MDT0001: Aborting recovery Apr 8 18:55:23 warble1 kernel: Lustre: 29650:0:(ldlm_lib.c:2048:target_recovery_overseer()) recovery is aborted, evict exports in recovery Apr 8 18:55:23 warble1 kernel: Lustre: 29650:0:(ldlm_lib.c:2038:target_recovery_overseer()) dagg-MDT0001 recovery is aborted by hard timeout Apr 8 18:55:23 warble1 kernel: Lustre: dagg-MDT0001: Recovery over after 1065:59, of 125 clients 0 recovered and 125 were evicted. 
Apr 8 18:55:23 warble1 kernel: LustreError: 28706:0:(client.c:1166:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff885e706caa00 x1597136538578848/t0(0) o5->dagg-OST0000-osc-MDT0001@192.168.44.31@o2ib44:28/4 lens 432/432 e 0 to 0 dl 0 ref 2 fl Rpc:N/0/ffffffff rc 0/-1 Apr 8 18:55:23 warble1 kernel: LustreError: 28706:0:(client.c:1166:ptlrpc_import_delay_req()) Skipped 1 previous similar message Apr 8 18:55:23 warble1 kernel: LustreError: 28710:0:(osp_precreate.c:903:osp_precreate_cleanup_orphans()) dagg-OST0001-osc-MDT0001: cannot cleanup orphans: rc = -5 Apr 8 18:55:23 warble1 kernel: LustreError: 28710:0:(osp_precreate.c:903:osp_precreate_cleanup_orphans()) Skipped 3 previous similar messages Apr 8 18:55:23 warble1 kernel: Lustre: server umount dagg-MDT0001 complete Apr 8 18:55:23 warble1 kernel: LustreError: 343369:0:(ldlm_resource.c:1682:ldlm_resource_dump()) --- Resource: [0x200000006:0x1010000:0x0].0x0 (ffff885e658b6780) refcount = 2 Apr 8 18:55:23 warble1 kernel: LustreError: 343369:0:(ldlm_resource.c:1685:ldlm_resource_dump()) Granted locks (in reverse order): Apr 8 18:55:23 warble1 kernel: Lustre: Failing over dagg-MDT0002 Apr 8 18:55:24 warble1 kernel: LustreError: 31314:0:(llog_osd.c:882:llog_osd_next_block()) dagg-MDT0000-osp-MDT0002: can't read llog block from log [0x2000013cb:0x1:0x0] offset 131072: rc = -5 Apr 8 18:55:24 warble1 kernel: LustreError: 23085:0:(import.c:662:ptlrpc_connect_import()) can't connect to a closed import Apr 8 18:55:24 warble1 kernel: LustreError: 31314:0:(llog.c:607:llog_process_thread()) dagg-MDT0000-osp-MDT0002 retry remote llog process Apr 8 18:55:24 warble1 kernel: LustreError: 31314:0:(lod_dev.c:419:lod_sub_recovery_thread()) dagg-MDT0000-osp-MDT0002 getting update log failed: rc = -11 Apr 8 18:55:24 warble1 kernel: LustreError: 343369:0:(ldlm_lib.c:2595:target_stop_recovery_thread()) dagg-MDT0002: Aborting recovery Apr 8 18:55:24 warble1 kernel: Lustre: 31538:0:(ldlm_lib.c:2048:target_recovery_overseer()) recovery is aborted, evict exports in recovery Apr 8 18:55:24 warble1 kernel: Lustre: 31538:0:(ldlm_lib.c:2048:target_recovery_overseer()) Skipped 2 previous similar messages Apr 8 18:55:24 warble1 kernel: Lustre: 31538:0:(ldlm_lib.c:2038:target_recovery_overseer()) dagg-MDT0002 recovery is aborted by hard timeout Apr 8 18:55:24 warble1 kernel: Lustre: 31538:0:(ldlm_lib.c:2038:target_recovery_overseer()) Skipped 1 previous similar message Apr 8 18:55:24 warble1 kernel: Lustre: dagg-MDT0002: Recovery over after 1065:53, of 125 clients 0 recovered and 125 were evicted. Apr 8 18:55:24 warble1 kernel: LustreError: 30416:0:(client.c:1166:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff885e706c0c00 x1597136538662400/t0(0) o5->dagg-OST0002-osc-MDT0002@192.168.44.31@o2ib44:28/4 lens 432/432 e 0 to 0 dl 0 ref 2 fl Rpc:N/0/ffffffff rc 0/-1 Apr 8 18:55:24 warble1 kernel: LustreError: 30506:0:(osp_precreate.c:903:osp_precreate_cleanup_orphans()) dagg-OST0007-osc-MDT0002: cannot cleanup orphans: rc = -5 Apr 8 18:55:24 warble1 kernel: LustreError: 30506:0:(osp_precreate.c:903:osp_precreate_cleanup_orphans()) Skipped 13 previous similar messages Apr 8 18:55:24 warble1 kernel: LustreError: 30416:0:(client.c:1166:ptlrpc_import_delay_req()) Skipped 27 previous similar messages Apr 8 18:55:24 warble1 kernel: Lustre: dagg-MDT0002: Not available for connect from 192.168.44.34@o2ib44 (stopping) Apr 8 18:55:24 warble1 kernel: LustreError: 137-5: dagg-MDT0001_UUID: not available for connect from 192.168.44.34@o2ib44 (no target). 
If you are running an HA pair check that the target is mounted on the other server. Apr 8 18:55:24 warble1 kernel: LustreError: Skipped 94 previous similar messages Apr 8 18:55:24 warble1 kernel: Lustre: Skipped 2 previous similar messages Apr 8 18:55:25 warble1 kernel: Lustre: server umount dagg-MDT0002 complete Apr 8 18:55:38 warble2 kernel: Lustre: 64976:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1523177655/real 1523177655] req@ffff88bdf2308c00 x1597150865654752/t0(0) o400->dagg-MDT0001-osp-MDT0000@192.168.44.21@o2ib44:24/4 lens 224/224 e 3 to 1 dl 1523177738 ref 1 fl Rpc:X/c0/ffffffff rc 0/-1 Apr 8 18:55:38 warble2 kernel: Lustre: 64976:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 1 previous similar message Apr 8 18:55:49 warble1 zed: eid=38 class=pool_export pool_guid=0x13C167322AB5F5E9 Apr 8 18:55:49 warble1 zed: eid=39 class=config_sync pool_guid=0x13C167322AB5F5E9 Apr 8 18:55:49 warble2 kernel: LustreError: 137-5: dagg-MDT0002_UUID: not available for connect from 192.168.44.34@o2ib44 (no target). If you are running an HA pair check that the target is mounted on the other server. Apr 8 18:55:49 warble2 kernel: LustreError: Skipped 2 previous similar messages Apr 8 18:55:53 warble2 kernel: LustreError: 137-5: dagg-MDT0002_UUID: not available for connect from 192.168.44.38@o2ib44 (no target). If you are running an HA pair check that the target is mounted on the other server. Apr 8 18:55:53 warble2 kernel: LustreError: Skipped 2 previous similar messages Apr 8 18:55:58 warble2 kernel: LustreError: 137-5: dagg-MDT0002_UUID: not available for connect from 192.168.44.219@o2ib44 (no target). If you are running an HA pair check that the target is mounted on the other server. Apr 8 18:56:04 warble2 kernel: LustreError: 137-5: dagg-MDT0002_UUID: not available for connect from 192.168.44.196@o2ib44 (no target). If you are running an HA pair check that the target is mounted on the other server. Apr 8 18:56:04 warble2 kernel: LustreError: Skipped 1 previous similar message Apr 8 18:56:08 warble1 zed: eid=40 class=pool_export pool_guid=0xE146DD6489A1819A Apr 8 18:56:08 warble1 zed: eid=41 class=config_sync pool_guid=0xE146DD6489A1819A Apr 8 18:56:08 warble2 kernel: LustreError: 137-5: dagg-MDT0001_UUID: not available for connect from 192.168.44.31@o2ib44 (no target). If you are running an HA pair check that the target is mounted on the other server. Apr 8 18:56:08 warble2 kernel: LustreError: Skipped 27 previous similar messages Apr 8 18:56:14 warble1 zed: eid=42 class=history_event pool_guid=0x13C167322AB5F5E9 Apr 8 18:56:14 warble1 zed: eid=43 class=config_sync pool_guid=0x13C167322AB5F5E9 Apr 8 18:56:14 warble1 zed: eid=44 class=pool_import pool_guid=0x13C167322AB5F5E9 Apr 8 18:56:14 warble1 zed: eid=45 class=history_event pool_guid=0x13C167322AB5F5E9 Apr 8 18:56:14 warble1 zed: eid=46 class=config_sync pool_guid=0x13C167322AB5F5E9 Apr 8 18:56:16 warble2 kernel: LustreError: 137-5: dagg-MDT0001_UUID: not available for connect from 192.168.44.195@o2ib44 (no target). If you are running an HA pair check that the target is mounted on the other server. 
Apr 8 18:56:16 warble2 kernel: LustreError: Skipped 57 previous similar messages Apr 8 18:56:18 warble1 zed: eid=47 class=history_event pool_guid=0xE146DD6489A1819A Apr 8 18:56:18 warble1 zed: eid=48 class=config_sync pool_guid=0xE146DD6489A1819A Apr 8 18:56:18 warble1 zed: eid=49 class=pool_import pool_guid=0xE146DD6489A1819A Apr 8 18:56:18 warble1 zed: eid=50 class=history_event pool_guid=0xE146DD6489A1819A Apr 8 18:56:18 warble1 zed: eid=51 class=config_sync pool_guid=0xE146DD6489A1819A Apr 8 18:56:23 warble2 kernel: Lustre: dagg-MDT0000: Connection restored to 192.168.44.21@o2ib44 (at 192.168.44.21@o2ib44) Apr 8 18:56:23 warble2 kernel: Lustre: Skipped 5 previous similar messages Apr 8 18:56:23 warble1 kernel: Lustre: dagg-MDT0002: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900 Apr 8 18:56:23 warble1 kernel: Lustre: dagg-MDT0002: Will be in recovery for at least 2:30, or until 124 clients reconnect Apr 8 18:56:23 warble1 kernel: Lustre: Skipped 1 previous similar message Apr 8 18:56:23 warble1 kernel: Lustre: dagg-MDT0002: Connection restored to 13163cd6-3177-0a4d-6223-24b277d10918 (at 192.168.44.219@o2ib44) Apr 8 18:56:23 warble1 kernel: Lustre: Skipped 3 previous similar messages Apr 8 18:56:28 warble1 kernel: Lustre: 23085:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1523177783/real 1523177783] req@ffff88bd81d0b300 x1597136538664240/t0(0) o8->dagg-OST0003-osc-MDT0002@192.168.44.32@o2ib44:28/4 lens 520/544 e 0 to 1 dl 1523177788 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Apr 8 18:56:28 warble1 kernel: Lustre: 23085:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 2 previous similar messages Apr 8 18:56:29 warble1 kernel: LustreError: 11-0: dagg-MDT0002-osp-MDT0001: operation mds_connect to node 0@lo failed: rc = -114 Apr 8 18:56:29 warble1 kernel: LustreError: Skipped 3 previous similar messages Apr 8 18:56:29 warble1 kernel: Lustre: dagg-MDT0001: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900 Apr 8 18:56:30 warble1 kernel: Lustre: dagg-MDT0001: Will be in recovery for at least 2:30, or until 124 clients reconnect Apr 8 18:56:32 warble2 kernel: LustreError: 137-5: dagg-MDT0002_UUID: not available for connect from 192.168.44.36@o2ib44 (no target). If you are running an HA pair check that the target is mounted on the other server. Apr 8 18:56:32 warble2 kernel: LustreError: Skipped 101 previous similar messages Apr 8 19:00:23 warble1 kernel: Lustre: 346201:0:(ldlm_lib.c:1773:extend_recovery_timer()) dagg-MDT0002: extended recovery timer reaching hard limit: 900, extend: 1 Apr 8 19:00:23 warble1 kernel: Lustre: 346201:0:(ldlm_lib.c:1773:extend_recovery_timer()) Skipped 49 previous similar messages Apr 8 19:04:55 warble2 crmd[17643]: notice: State transition S_IDLE -> S_POLICY_ENGINE Apr 8 19:04:55 warble2 pengine[17642]: error: Resource start-up disabled since no STONITH resources have been defined Apr 8 19:04:55 warble2 pengine[17642]: error: Either configure some or disable STONITH with the stonith-enabled option Apr 8 19:04:55 warble2 pengine[17642]: error: NOTE: Clusters with shared data need STONITH to ensure data integrity Apr 8 19:04:55 warble2 pengine[17642]: notice: Delaying fencing operations until there are resources to manage Apr 8 19:04:55 warble2 pengine[17642]: notice: Calculated transition 73, saving inputs in /var/lib/pacemaker/pengine/pe-input-1.bz2 Apr 8 19:04:55 warble2 pengine[17642]: notice: Configuration ERRORs found during PE processing. 
Please run "crm_verify -L" to identify issues. Apr 8 19:04:55 warble2 crmd[17643]: notice: Transition 73 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-1.bz2): Complete Apr 8 19:04:55 warble2 crmd[17643]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE cheers, |
| Comment by SC Admin (Inactive) [ 08/Apr/18 ] |
|
Hi Di, attached now is conman-warble1-traces.txt, which is the console log that I took at the same time as the syslog warble1-traces.txt. I'll remount as zfs now. cheers, |
| Comment by SC Admin (Inactive) [ 08/Apr/18 ] |
[warble2]root: umount /lustre/dagg/MDT0 [warble2]root: zfs set canmount=on warble2-dagg-MDT0-pool/MDT0 [warble2]root: zfs mount warble2-dagg-MDT0-pool/MDT0 [warble2]root: ls /warble2-dagg-MDT0-pool/MDT0/ CATALOGS LFSCK oi.0 oi.104 oi.110 oi.117 oi.123 oi.15 oi.21 oi.28 oi.34 oi.40 oi.47 oi.53 oi.6 oi.66 oi.72 oi.79 oi.85 oi.91 oi.98 seq_ctl changelog_catalog lfsck_bookmark oi.1 oi.105 oi.111 oi.118 oi.124 oi.16 oi.22 oi.29 oi.35 oi.41 oi.48 oi.54 oi.60 oi.67 oi.73 oi.8 oi.86 oi.92 oi.99 seq_srv changelog_users lfsck_layout oi.10 oi.106 oi.112 oi.119 oi.125 oi.17 oi.23 oi.3 oi.36 oi.42 oi.49 oi.55 oi.61 oi.68 oi.74 oi.80 oi.87 oi.93 PENDING update_log CONFIGS lfsck_namespace oi.100 oi.107 oi.113 oi.12 oi.126 oi.18 oi.24 oi.30 oi.37 oi.43 oi.5 oi.56 oi.62 oi.69 oi.75 oi.81 oi.88 oi.94 quota_master update_log_dir fld lov_objid oi.101 oi.108 oi.114 oi.120 oi.127 oi.19 oi.25 oi.31 oi.38 oi.44 oi.50 oi.57 oi.63 oi.7 oi.76 oi.82 oi.89 oi.95 quota_slave hsm_actions nodemap oi.102 oi.109 oi.115 oi.121 oi.13 oi.2 oi.26 oi.32 oi.39 oi.45 oi.51 oi.58 oi.64 oi.70 oi.77 oi.83 oi.9 oi.96 reply_data last_rcvd O oi.103 oi.11 oi.116 oi.122 oi.14 oi.20 oi.27 oi.33 oi.4 oi.46 oi.52 oi.59 oi.65 oi.71 oi.78 oi.84 oi.90 oi.97 ROOT [warble2]root: ls -l /warble2-dagg-MDT0-pool/MDT0/update_log_dir/ total 4685145 -rw-r--r-- 1 root root 80000 Jan 1 1970 [0x200000400:0x1:0x0] -rw-r--r-- 1 root root 164928 Jan 1 1970 [0x2000013ca:0x1:0x0] -rw-r--r-- 1 root root 156224 Jan 1 1970 [0x2000013cb:0x1:0x0] -rw-r--r-- 1 sforsaye 1039 0 Nov 21 18:16 [0x200005231:0x236:0x0] -rw-r--r-- 1 root root 0 Nov 21 18:13 [0x200005232:0x4:0x0] -rw-r--r-- 1 sforsaye 1039 87169968 Nov 22 00:08 [0x200005236:0x932c:0x0] -rw-r--r-- 1 sforsaye 1039 0 Nov 22 00:08 [0x200005236:0x95e3:0x0] -rw-r--r-- 1 sforsaye 1039 90696248 Nov 21 23:11 [0x200005237:0x2e2b:0x0] -rw-r--r-- 1 sforsaye 1039 0 Nov 21 23:11 [0x200005237:0x9097:0x0] -rw-r--r-- 1 sforsaye apps 86947864 Dec 15 18:10 [0x200008104:0x1050:0x0] -rw-r--r-- 1 sforsaye apps 0 Dec 15 18:10 [0x200008104:0x1305:0x0] -rw-r--r-- 1 sforsaye apps 0 Dec 15 18:11 [0x200008107:0x850c:0x0] -rw-r--r-- 1 sforsaye apps 0 Dec 15 18:31 [0x2000088d3:0x1906c:0x0] -rw-r--r-- 1 sforsaye apps 0 Dec 15 18:31 [0x2000088d4:0x18846:0x0] -rw-r--r-- 1 sforsaye apps 0 Dec 15 18:30 [0x2000088d6:0xc:0x0] -rw-r--r-- 1 sforsaye apps 0 Dec 15 18:30 [0x2000088d7:0x11:0x0] -rw-r--r-- 1 sforsaye apps 0 Dec 15 18:31 [0x2000088d8:0x2:0x0] -rw-r--r-- 1 sforsaye apps 0 Dec 15 18:31 [0x2000088d9:0x2:0x0] -rw-r--r-- 1 sforsaye apps 0 Dec 16 11:24 [0x2000090a2:0xe143:0x0] -rw-r--r-- 1 sforsaye apps 0 Dec 16 11:24 [0x2000090a3:0xe0b2:0x0] -rw-r--r-- 1 sforsaye apps 0 Dec 16 11:46 [0x2000090a7:0xb0c6:0x0] -rw-r--r-- 1 sforsaye apps 0 Dec 16 11:46 [0x2000090a8:0xb8dc:0x0] -rw-r--r-- 1 sforsaye apps 0 Dec 17 09:29 [0x2000090b2:0xe083:0x0] -rw-r--r-- 1 sforsaye apps 0 Dec 17 09:30 [0x2000090b3:0x909d:0x0] -rw-r--r-- 1 sforsaye apps 0 Dec 17 09:44 [0x2000090bb:0x4d04:0x0] -rw-r--r-- 1 sforsaye apps 0 Dec 17 09:45 [0x2000090bc:0x2cc9:0x0] -rw-r--r-- 1 sforsaye apps 0 Dec 17 09:47 [0x200009870:0x2:0x0] -rw-r--r-- 1 sforsaye apps 0 Dec 17 09:47 [0x200009871:0x2:0x0] -rw-r--r-- 1 rhumble hpcadmin 0 Feb 7 01:56 [0x20000b822:0x1c5be:0x0] -rw-r--r-- 1 rhumble hpcadmin 0 Feb 7 01:57 [0x20000b823:0x1ce14:0x0] -rw-r--r-- 1 rhumble hpcadmin 0 Feb 7 02:10 [0x20000b827:0xbec6:0x0] -rw-r--r-- 1 rhumble hpcadmin 0 Feb 7 02:10 [0x20000b828:0x9b71:0x0] -rw-r--r-- 1 rhumble hpcadmin 0 Feb 7 02:18 [0x20000b833:0x85c7:0x0] -rw-r--r-- 1 rhumble hpcadmin 0 Feb 7 
02:18 [0x20000b834:0x85b8:0x0] -rw-r--r-- 1 rhumble hpcadmin 0 Feb 16 15:30 [0x20000c080:0xa4d7:0x0] -rw-r--r-- 1 rhumble hpcadmin 0 Feb 16 15:30 [0x20000c081:0xc3c6:0x0] -rw-r--r-- 1 rhumble hpcadmin 7266824 Feb 16 15:31 [0x20000c083:0x1:0x0] -rw-r--r-- 1 rhumble hpcadmin 0 Feb 16 15:29 [0x20000c083:0x2:0x0] -rw-r--r-- 1 rhumble hpcadmin 0 Feb 18 00:18 [0x20000d061:0x1efc3:0x0] -rw-r--r-- 1 rhumble hpcadmin 0 Feb 18 00:18 [0x20000d062:0x1ed67:0x0] -rw-r--r-- 1 rhumble hpcadmin 132572576 Feb 19 00:50 [0x20000d8c5:0xfd26:0x0] -rw-r--r-- 1 rhumble hpcadmin 0 Feb 19 00:54 [0x20000d8c5:0xfd27:0x0] -rw-r--r-- 1 rhumble hpcadmin 0 Feb 19 00:56 [0x20000d8c6:0xfb26:0x0] -rw-r--r-- 1 soslowsk p002_swin 0 Sep 17 2014 [0x20000f651:0x1256d:0x0] -rw-r--r-- 1 soslowsk cas 0 Oct 6 2017 [0x20000f652:0x13631:0x0] -rw-r--r-- 1 soslowsk p002_swin 0 Aug 6 2017 [0x20000fe00:0x7c5c:0x0] -rw-r--r-- 1 soslowsk p002_swin 2097848 Jan 25 2016 [0x20000fe02:0x40fb:0x0] -rw-r--r-- 1 soslowsk p002_swin 2097848 May 19 2016 [0x20000fe02:0x4d8c:0x0] -rw-r--r-- 1 rhumble hpcadmin 134218424 Mar 29 20:28 [0x20001072b:0x15ac4:0x0] -rw-r--r-- 1 lakerink p014_swin 40920624 Jun 17 2014 [0x20001072b:0x166a0:0x0] -rw-r--r-- 1 lakerink p014_swin 0 Jun 17 2014 [0x20001072b:0x166a1:0x0] -rw-r--r-- 1 lakerink p014_swin 0 Jun 17 2014 [0x20001072d:0xeae3:0x0] -rw-r--r-- 1 root root 5834904 Jan 1 1970 [0x200010da0:0x1:0x0] -rw-r--r-- 1 root root 0 Jan 1 1970 [0x200010da0:0x2:0x0] -rw-r--r-- 1 root root 5796648 Jan 1 1970 [0x200010da2:0x1:0x0] -rw-r--r-- 1 root root 0 Jan 1 1970 [0x200010da2:0x2:0x0] [warble2]root: [warble2]root: ls -l /warble2-dagg-MDT0-pool/MDT0/update_log -rw-r--r-- 1 root root 96 Jan 1 1970 /warble2-dagg-MDT0-pool/MDT0/update_log [warble2]root: [warble2]root: df Filesystem 1K-blocks Used Available Use% Mounted on ramroot 395507840 1221184 394286656 1% / tmpfs 395507840 16924 395490916 1% /run tmpfs 395507840 0 395507840 0% /sys/fs/cgroup none 395483644 0 395483644 0% /dev /dev/ram 20 0 20 0% /ram tmpfs 395507840 54624 395453216 1% /dev/shm /dev/mapper/centos-root 52403200 9473616 42929584 19% /mnt/root warble2-dagg-MDT0-pool/MDT0 637841792 12884096 624957696 3% /warble2-dagg-MDT0-pool/MDT0 [warble2]root: umount /warble2-dagg-MDT0-pool/MDT0 [warble2]root: zfs set canmount=off warble2-dagg-MDT0-pool/MDT0 [warble2]root: mount -t lustre warble2-dagg-MDT0-pool/MDT0 /lustre/dagg/MDT0 and there has now been a change - all 3 of the MDT's are now in WAITING state. 
[warble2]root: for f in /proc/fs/lustre/mdt/dagg-MDT000*/recovery_status; do echo $f; cat $f; done /proc/fs/lustre/mdt/dagg-MDT0000/recovery_status status: WAITING non-ready MDTs: 0001 0002 recovery_start: 1523179407 time_waited: 411 [warble1]root: for f in /proc/fs/lustre/mdt/dagg-MDT000*/recovery_status; do echo $f; cat $f; done /proc/fs/lustre/mdt/dagg-MDT0001/recovery_status status: WAITING non-ready MDTs: 0000 recovery_start: 1523177790 time_waited: 2064 /proc/fs/lustre/mdt/dagg-MDT0002/recovery_status status: WAITING non-ready MDTs: 0000 recovery_start: 1523177783 time_waited: 2071 syslog for this stuff is -> Apr 8 19:10:23 warble1 kernel: Lustre: 346201:0:(ldlm_lib.c:1773:extend_recovery_timer()) dagg-MDT0002: extended recovery timer reaching hard limit: 900, extend: 1 Apr 8 19:10:23 warble1 kernel: Lustre: 346201:0:(ldlm_lib.c:1773:extend_recovery_timer()) Skipped 59 previous similar messages Apr 8 19:17:38 warble2 kernel: Lustre: Failing over dagg-MDT0000 Apr 8 19:17:38 warble2 kernel: LustreError: 64976:0:(import.c:662:ptlrpc_connect_import()) can't connect to a closed import Apr 8 19:17:38 warble1 kernel: LustreError: 11-0: dagg-MDT0000-osp-MDT0001: operation out_update to node 192.168.44.22@o2ib44 failed: rc = -19 Apr 8 19:17:38 warble1 kernel: Lustre: dagg-MDT0000-osp-MDT0002: Connection to dagg-MDT0000 (at 192.168.44.22@o2ib44) was lost; in progress operations using this service will wait for recovery to complete Apr 8 19:17:38 warble1 kernel: LustreError: Skipped 1 previous similar message Apr 8 19:17:38 warble2 kernel: Lustre: dagg-MDT0000: Not available for connect from 192.168.44.21@o2ib44 (stopping) Apr 8 19:17:39 warble2 kernel: Lustre: server umount dagg-MDT0000 complete Apr 8 19:17:56 warble1 kernel: Lustre: 23086:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1523179069/real 1523179069] req@ffff88bd5f433600 x1597136742364608/t0(0) o400->dagg-MDT0000-lwp-MDT0001@192.168.44.22@o2ib44:12/10 lens 224/224 e 0 to 1 dl 1523179076 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Apr 8 19:17:56 warble1 kernel: Lustre: dagg-MDT0000-lwp-MDT0002: Connection to dagg-MDT0000 (at 192.168.44.22@o2ib44) was lost; in progress operations using this service will wait for recovery to complete Apr 8 19:17:56 warble1 kernel: Lustre: Skipped 1 previous similar message Apr 8 19:17:56 warble1 kernel: Lustre: 23086:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 4 previous similar messages Apr 8 19:18:02 warble1 kernel: Lustre: 23085:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1523179076/real 1523179076] req@ffff88bd5f436300 x1597136742364768/t0(0) o38->dagg-MDT0000-lwp-MDT0001@192.168.44.22@o2ib44:12/10 lens 520/544 e 0 to 1 dl 1523179082 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Apr 8 19:18:11 warble1 kernel: LustreError: 137-5: dagg-MDT0000_UUID: not available for connect from 192.168.44.131@o2ib44 (no target). If you are running an HA pair check that the target is mounted on the other server. Apr 8 19:18:11 warble1 kernel: LustreError: Skipped 281 previous similar messages Apr 8 19:18:17 warble2 zed: eid=6 class=history_event pool_guid=0xFC863664CC04C875 Apr 8 19:18:27 warble1 kernel: LustreError: 137-5: dagg-MDT0000_UUID: not available for connect from 192.168.44.206@o2ib44 (no target). If you are running an HA pair check that the target is mounted on the other server. 
Apr 8 19:18:27 warble1 kernel: LustreError: Skipped 92 previous similar messages Apr 8 19:18:57 warble1 kernel: Lustre: 23085:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1523179126/real 1523179126] req@ffff88bd6a100000 x1597136742366992/t0(0) o38->dagg-MDT0000-lwp-MDT0001@192.168.44.22@o2ib44:12/10 lens 520/544 e 0 to 1 dl 1523179137 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Apr 8 19:18:57 warble1 kernel: Lustre: 23085:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 1 previous similar message Apr 8 19:19:01 warble1 kernel: LustreError: 137-5: dagg-MDT0000_UUID: not available for connect from 192.168.44.131@o2ib44 (no target). If you are running an HA pair check that the target is mounted on the other server. Apr 8 19:19:01 warble1 kernel: LustreError: Skipped 47 previous similar messages Apr 8 19:19:52 warble1 kernel: Lustre: 23085:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1523179176/real 1523179176] req@ffff885e69668300 x1597136742368944/t0(0) o38->dagg-MDT0000-lwp-MDT0002@192.168.44.22@o2ib44:12/10 lens 520/544 e 0 to 1 dl 1523179192 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Apr 8 19:19:52 warble1 kernel: Lustre: 23085:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 3 previous similar messages Apr 8 19:20:05 warble1 kernel: LustreError: 137-5: dagg-MDT0000_UUID: not available for connect from 192.168.44.139@o2ib44 (no target). If you are running an HA pair check that the target is mounted on the other server. Apr 8 19:20:05 warble1 kernel: LustreError: Skipped 216 previous similar messages Apr 8 19:20:23 warble1 kernel: Lustre: 346201:0:(ldlm_lib.c:1773:extend_recovery_timer()) dagg-MDT0002: extended recovery timer reaching hard limit: 900, extend: 1 Apr 8 19:20:23 warble1 kernel: Lustre: 346201:0:(ldlm_lib.c:1773:extend_recovery_timer()) Skipped 61 previous similar messages Apr 8 19:20:47 warble1 kernel: Lustre: 23085:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1523179226/real 1523179226] req@ffff885e68219800 x1597136742371472/t0(0) o38->dagg-MDT0000-lwp-MDT0001@192.168.44.22@o2ib44:12/10 lens 520/544 e 0 to 1 dl 1523179247 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Apr 8 19:20:47 warble1 kernel: Lustre: 23085:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 3 previous similar messages Apr 8 19:22:13 warble1 kernel: LustreError: 137-5: dagg-MDT0000_UUID: not available for connect from 192.168.44.107@o2ib44 (no target). If you are running an HA pair check that the target is mounted on the other server. 
Apr 8 19:22:13 warble1 kernel: LustreError: Skipped 305 previous similar messages Apr 8 19:23:02 warble1 kernel: Lustre: 23085:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1523179351/real 1523179351] req@ffff885e5d173300 x1597136742376992/t0(0) o38->dagg-MDT0000-lwp-MDT0001@192.168.44.22@o2ib44:12/10 lens 520/544 e 0 to 1 dl 1523179382 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Apr 8 19:23:02 warble1 kernel: Lustre: 23085:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 7 previous similar messages Apr 8 19:23:08 warble2 zed: eid=7 class=history_event pool_guid=0xFC863664CC04C875 Apr 8 19:23:21 warble1 kernel: Lustre: MGS: Connection restored to 192.168.44.22@o2ib44 (at 192.168.44.22@o2ib44) Apr 8 19:23:21 warble1 kernel: Lustre: Skipped 277 previous similar messages Apr 8 19:23:22 warble2 kernel: LustreError: 11-0: dagg-MDT0002-osp-MDT0000: operation mds_connect to node 192.168.44.21@o2ib44 failed: rc = -16 Apr 8 19:23:22 warble1 kernel: Lustre: dagg-MDT0001: Received new LWP connection from 192.168.44.22@o2ib44, removing former export from same NID Apr 8 19:23:22 warble1 kernel: Lustre: dagg-MDT0002: Client dagg-MDT0000-mdtlov_UUID (at 192.168.44.22@o2ib44) refused connection, still busy with 6 references Apr 8 19:23:22 warble1 kernel: Lustre: Skipped 1 previous similar message Apr 8 19:23:22 warble2 kernel: Lustre: dagg-MDT0000: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900 Apr 8 19:23:27 warble2 kernel: Lustre: 64976:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1523179402/real 1523179402] req@ffff885e0a45ad00 x1597150865729056/t0(0) o8->dagg-OST0003-osc-MDT0000@192.168.44.32@o2ib44:28/4 lens 520/544 e 0 to 1 dl 1523179407 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Apr 8 19:23:27 warble2 kernel: Lustre: 64976:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 1 previous similar message Apr 8 19:23:27 warble2 kernel: Lustre: dagg-MDT0000: Will be in recovery for at least 2:30, or until 123 clients reconnect Apr 8 19:23:27 warble2 kernel: Lustre: dagg-MDT0000: Connection restored to 48d09b9d-d5c2-384c-b51b-bf52a51b2aa5 (at 192.168.44.203@o2ib44) Apr 8 19:23:27 warble2 kernel: Lustre: Skipped 3 previous similar messages Apr 8 19:23:44 warble2 kernel: Lustre: dagg-MDT0000: Connection restored to 192.168.44.35@o2ib44 (at 192.168.44.35@o2ib44) Apr 8 19:23:44 warble2 kernel: Lustre: Skipped 121 previous similar messages Apr 8 19:23:46 warble1 kernel: LustreError: 167-0: dagg-MDT0000-lwp-MDT0001: This client was evicted by dagg-MDT0000; in progress operations using this service will fail. Apr 8 19:23:47 warble1 kernel: LustreError: 353524:0:(ldlm_resource.c:1100:ldlm_resource_complain()) dagg-MDT0000-lwp-MDT0001: namespace resource [0x200000006:0x1010000:0x0].0x0 (ffff885eb028dec0) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Apr 8 19:23:47 warble1 kernel: LustreError: 353525:0:(ldlm_resource.c:1682:ldlm_resource_dump()) --- Resource: [0x200000006:0x1010000:0x0].0x0 (ffff885e6d5946c0) refcount = 2 Apr 8 19:23:47 warble1 kernel: LustreError: 353525:0:(ldlm_resource.c:1685:ldlm_resource_dump()) Granted locks (in reverse order): Apr 8 19:23:47 warble1 kernel: LustreError: 353525:0:(ldlm_resource.c:1688:ldlm_resource_dump()) ### ### ns: dagg-MDT0000-lwp-MDT0002 lock: ffff885e6c247400/0xf6b5c0247a72a883 lrc: 2/1,0 mode: CR/CR res: [0x200000006:0x1010000:0x0].0x0 rrc: 3 type: PLN flags: 0x1106400000000 nid: local remote: 0xe9240daaf3c09eac expref: -99 pid: 345973 timeout: 0 lvb_type: 2 Apr 8 19:23:47 warble1 kernel: LustreError: 353525:0:(ldlm_resource.c:1688:ldlm_resource_dump()) Skipped 1 previous similar message Apr 8 19:23:47 warble1 kernel: Lustre: dagg-MDT0000-lwp-MDT0002: Connection restored to 192.168.44.22@o2ib44 (at 192.168.44.22@o2ib44) Apr 8 19:23:47 warble1 kernel: LustreError: 353524:0:(ldlm_resource.c:1100:ldlm_resource_complain()) Skipped 2 previous similar messages Apr 8 19:23:47 warble1 kernel: LustreError: 353524:0:(ldlm_resource.c:1682:ldlm_resource_dump()) --- Resource: [0x200000006:0x1010000:0x0].0x0 (ffff885eb028dec0) refcount = 2 Apr 8 19:23:47 warble1 kernel: LustreError: 353524:0:(ldlm_resource.c:1685:ldlm_resource_dump()) Granted locks (in reverse order): Apr 8 19:23:52 warble2 kernel: LustreError: 11-0: dagg-MDT0001-osp-MDT0000: operation mds_connect to node 192.168.44.21@o2ib44 failed: rc = -16 Apr 8 19:23:52 warble2 kernel: LustreError: Skipped 1 previous similar message Apr 8 19:23:52 warble1 kernel: Lustre: dagg-MDT0002: Received new LWP connection from 192.168.44.22@o2ib44, removing former export from same NID Apr 8 19:23:52 warble1 kernel: Lustre: dagg-MDT0001: Client dagg-MDT0000-mdtlov_UUID (at 192.168.44.22@o2ib44) refused connection, still busy with 6 references Apr 8 19:23:52 warble1 kernel: Lustre: Skipped 1 previous similar message Apr 8 19:23:52 warble1 kernel: Lustre: Skipped 1 previous similar message Apr 8 19:24:17 warble2 kernel: LustreError: 11-0: dagg-MDT0001-osp-MDT0000: operation mds_connect to node 192.168.44.21@o2ib44 failed: rc = -16 Apr 8 19:24:17 warble2 kernel: LustreError: Skipped 1 previous similar message Apr 8 19:24:17 warble1 kernel: Lustre: dagg-MDT0002: Received new LWP connection from 192.168.44.22@o2ib44, removing former export from same NID Apr 8 19:24:17 warble1 kernel: Lustre: dagg-MDT0001: Received new LWP connection from 192.168.44.22@o2ib44, removing former export from same NID Apr 8 19:24:17 warble1 kernel: Lustre: dagg-MDT0001: Client dagg-MDT0000-mdtlov_UUID (at 192.168.44.22@o2ib44) refused connection, still busy with 6 references Apr 8 19:24:17 warble1 kernel: Lustre: Skipped 1 previous similar message Apr 8 19:24:42 warble2 kernel: LustreError: 11-0: dagg-MDT0001-osp-MDT0000: operation mds_connect to node 192.168.44.21@o2ib44 failed: rc = -16 Apr 8 19:24:42 warble2 kernel: LustreError: Skipped 1 previous similar message Apr 8 19:24:42 warble1 kernel: Lustre: dagg-MDT0002: Received new LWP connection from 192.168.44.22@o2ib44, removing former export from same NID Apr 8 19:24:42 warble1 kernel: Lustre: dagg-MDT0001: Client dagg-MDT0000-mdtlov_UUID (at 192.168.44.22@o2ib44) refused connection, still busy with 6 references Apr 8 19:24:42 warble1 kernel: Lustre: Skipped 1 previous similar message Apr 8 19:24:42 warble1 kernel: Lustre: Skipped 1 previous similar message Apr 8 19:25:07 warble2 kernel: LustreError: 11-0: dagg-MDT0002-osp-MDT0000: 
operation mds_connect to node 192.168.44.21@o2ib44 failed: rc = -16 Apr 8 19:25:07 warble2 kernel: LustreError: Skipped 1 previous similar message Apr 8 19:25:07 warble1 kernel: Lustre: dagg-MDT0001: Received new LWP connection from 192.168.44.22@o2ib44, removing former export from same NID Apr 8 19:25:07 warble1 kernel: Lustre: dagg-MDT0002: Client dagg-MDT0000-mdtlov_UUID (at 192.168.44.22@o2ib44) refused connection, still busy with 6 references Apr 8 19:25:07 warble1 kernel: Lustre: Skipped 1 previous similar message Apr 8 19:25:07 warble1 kernel: Lustre: Skipped 1 previous similar message Apr 8 19:25:32 warble2 kernel: LustreError: 11-0: dagg-MDT0001-osp-MDT0000: operation mds_connect to node 192.168.44.21@o2ib44 failed: rc = -16 Apr 8 19:25:32 warble2 kernel: LustreError: Skipped 1 previous similar message Apr 8 19:25:32 warble1 kernel: Lustre: dagg-MDT0002: Received new LWP connection from 192.168.44.22@o2ib44, removing former export from same NID Apr 8 19:25:32 warble1 kernel: Lustre: dagg-MDT0001: Client dagg-MDT0000-mdtlov_UUID (at 192.168.44.22@o2ib44) refused connection, still busy with 6 references Apr 8 19:25:32 warble1 kernel: Lustre: Skipped 1 previous similar message Apr 8 19:25:32 warble1 kernel: Lustre: Skipped 1 previous similar message Apr 8 19:25:57 warble2 kernel: LustreError: 11-0: dagg-MDT0002-osp-MDT0000: operation mds_connect to node 192.168.44.21@o2ib44 failed: rc = -16 Apr 8 19:25:57 warble2 kernel: LustreError: Skipped 1 previous similar message Apr 8 19:25:57 warble1 kernel: Lustre: dagg-MDT0001: Client dagg-MDT0000-mdtlov_UUID (at 192.168.44.22@o2ib44) refused connection, still busy with 6 references Apr 8 19:25:57 warble1 kernel: Lustre: Skipped 2 previous similar messages Apr 8 19:26:22 warble2 kernel: Lustre: 86126:0:(ldlm_lib.c:1773:extend_recovery_timer()) dagg-MDT0000: extended recovery timer reaching hard limit: 900, extend: 1 Apr 8 19:26:22 warble1 kernel: Lustre: dagg-MDT0001: Received new LWP connection from 192.168.44.22@o2ib44, removing former export from same NID Apr 8 19:26:22 warble1 kernel: Lustre: Skipped 3 previous similar messages Apr 8 19:26:47 warble2 kernel: LustreError: 11-0: dagg-MDT0001-osp-MDT0000: operation mds_connect to node 192.168.44.21@o2ib44 failed: rc = -16 Apr 8 19:26:47 warble2 kernel: LustreError: Skipped 3 previous similar messages Apr 8 19:26:47 warble1 kernel: Lustre: dagg-MDT0002: Client dagg-MDT0000-mdtlov_UUID (at 192.168.44.22@o2ib44) refused connection, still busy with 6 references Apr 8 19:26:47 warble1 kernel: Lustre: Skipped 3 previous similar messages Apr 8 19:27:22 warble2 kernel: Lustre: 86126:0:(ldlm_lib.c:1773:extend_recovery_timer()) dagg-MDT0000: extended recovery timer reaching hard limit: 900, extend: 1 Apr 8 19:27:22 warble2 kernel: Lustre: 86126:0:(ldlm_lib.c:1773:extend_recovery_timer()) Skipped 1 previous similar message Apr 8 19:27:37 warble1 kernel: Lustre: dagg-MDT0002: Received new LWP connection from 192.168.44.22@o2ib44, removing former export from same NID Apr 8 19:27:37 warble1 kernel: Lustre: Skipped 5 previous similar messages Apr 8 19:28:02 warble2 kernel: LustreError: 11-0: dagg-MDT0001-osp-MDT0000: operation mds_connect to node 192.168.44.21@o2ib44 failed: rc = -16 Apr 8 19:28:02 warble2 kernel: LustreError: Skipped 5 previous similar messages Apr 8 19:28:02 warble1 kernel: Lustre: dagg-MDT0002: Client dagg-MDT0000-mdtlov_UUID (at 192.168.44.22@o2ib44) refused connection, still busy with 6 references Apr 8 19:28:02 warble1 kernel: Lustre: Skipped 5 previous similar messages Apr 8 
19:28:22 warble2 kernel: Lustre: 86126:0:(ldlm_lib.c:1773:extend_recovery_timer()) dagg-MDT0000: extended recovery timer reaching hard limit: 900, extend: 1 Apr 8 19:28:22 warble2 kernel: Lustre: 86126:0:(ldlm_lib.c:1773:extend_recovery_timer()) Skipped 2 previous similar messages Apr 8 19:29:22 warble2 kernel: Lustre: 86126:0:(ldlm_lib.c:1773:extend_recovery_timer()) dagg-MDT0000: extended recovery timer reaching hard limit: 900, extend: 1 Apr 8 19:29:22 warble2 kernel: Lustre: 86126:0:(ldlm_lib.c:1773:extend_recovery_timer()) Skipped 2 previous similar messages Apr 8 19:30:07 warble1 kernel: Lustre: dagg-MDT0001: Received new LWP connection from 192.168.44.22@o2ib44, removing former export from same NID Apr 8 19:30:07 warble1 kernel: Lustre: Skipped 10 previous similar messages Apr 8 19:30:22 warble2 kernel: Lustre: 86126:0:(ldlm_lib.c:1773:extend_recovery_timer()) dagg-MDT0000: extended recovery timer reaching hard limit: 900, extend: 1 Apr 8 19:30:22 warble2 kernel: Lustre: 86126:0:(ldlm_lib.c:1773:extend_recovery_timer()) Skipped 2 previous similar messages Apr 8 19:30:23 warble1 kernel: Lustre: 346201:0:(ldlm_lib.c:1773:extend_recovery_timer()) dagg-MDT0002: extended recovery timer reaching hard limit: 900, extend: 1 Apr 8 19:30:23 warble1 kernel: Lustre: 346201:0:(ldlm_lib.c:1773:extend_recovery_timer()) Skipped 59 previous similar messages Apr 8 19:30:32 warble2 kernel: LustreError: 11-0: dagg-MDT0001-osp-MDT0000: operation mds_connect to node 192.168.44.21@o2ib44 failed: rc = -16 Apr 8 19:30:32 warble2 kernel: LustreError: Skipped 11 previous similar messages Apr 8 19:30:32 warble1 kernel: Lustre: dagg-MDT0002: Client dagg-MDT0000-mdtlov_UUID (at 192.168.44.22@o2ib44) refused connection, still busy with 6 references Apr 8 19:30:32 warble1 kernel: Lustre: Skipped 11 previous similar messages Apr 8 19:31:22 warble2 kernel: Lustre: 86126:0:(ldlm_lib.c:1773:extend_recovery_timer()) dagg-MDT0000: extended recovery timer reaching hard limit: 900, extend: 1 Apr 8 19:31:22 warble2 kernel: Lustre: 86126:0:(ldlm_lib.c:1773:extend_recovery_timer()) Skipped 2 previous similar messages Apr 8 19:32:22 warble2 kernel: Lustre: 86126:0:(ldlm_lib.c:1773:extend_recovery_timer()) dagg-MDT0000: extended recovery timer reaching hard limit: 900, extend: 1 Apr 8 19:32:22 warble2 kernel: Lustre: 86126:0:(ldlm_lib.c:1773:extend_recovery_timer()) Skipped 2 previous similar messages cheers, |
| Comment by SC Admin (Inactive) [ 08/Apr/18 ] |
|
please note we have had a ticket open for this. cheers, |
| Comment by nasf (Inactive) [ 08/Apr/18 ] |
|
Robin, According to Di's analysis, it is suspected that there is some trouble in the DNE update logs. If that is true, then we can try to remove and re-create the DNE update logs, which should allow the system to be mounted as Lustre again. The risk is that if there were uncompleted cross-MDT modifications before the former MDT umount, those modifications can no longer be recovered. In that case the Lustre namespace will become inconsistent and we will have to run namespace LFSCK to repair it, but namespace LFSCK cannot guarantee that the system globally recovers to the state before the former umount. Still, considering another ticket LU-10677, since your system is already inconsistent, it is worth trying the following: 1) mount the MDT0 up as "zfs" Please do the same for MDT1 and MDT2, replacing 0x200000009:0x0:0x0 with 0x200000009:0x1:0x0/0x200000009:0x2:0x0 on MDT1/MDT2. Then remount all the MDTs as "lustre". If the system can be mounted up, then the related update log objects will be left behind on the MDTs, but compared with an unmountable system that is not fatal. If it does not work, then please rename them back and collect -1 level Lustre kernel debug logs on the MDTs while mounting as "lustre". |
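For reference, a minimal sketch of that sequence for MDT0, reconstructed from the dataset names and commands that appear elsewhere in this ticket (the "bak" directory name is illustrative; MDT1/MDT2 would use oi.9/0x200000009:0x1:0x0 and oi.9/0x200000009:0x2:0x0 instead):

umount /lustre/dagg/MDT0                          # stop the lustre mount on the MDS
zfs mount warble2-dagg-MDT0-pool/MDT0             # mount the backing dataset as plain zfs
mkdir -p /warble2-dagg-MDT0-pool/MDT0/bak
mv /warble2-dagg-MDT0-pool/MDT0/update_log \
   /warble2-dagg-MDT0-pool/MDT0/oi.9/0x200000009:0x0:0x0 \
   /warble2-dagg-MDT0-pool/MDT0/bak/              # set the DNE update log objects aside
zfs umount warble2-dagg-MDT0-pool/MDT0
zfs set canmount=off warble2-dagg-MDT0-pool/MDT0  # keep zfs from auto-mounting it again
mount -t lustre warble2-dagg-MDT0-pool/MDT0 /lustre/dagg/MDT0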
| Comment by SC Admin (Inactive) [ 08/Apr/18 ] |
|
Hi nasf, so I'm clear on 1,2,3 and doing more or less the same on mdt1,2. I get confused after that... a) did you want me to collect -1 logs as I try to mount lustre the first time? b) (if all that worked ok) you want me to then umount and mount as zfs again, and mv the OI files back again, and then try to mount as lustre again, and collect -1 logs there too? c) lfsck at which stage? and is this done with "lctl lfsck_start"? cheers, |
| Comment by nasf (Inactive) [ 08/Apr/18 ] |
If your system can mount up after removing the update logs, then Lustre debug logs are not needed.
No, if the MDT can be mounted as "lustre" after removing the update logs, then the system will re-create new update logs after the mount.
namespace LFSCK needs to be run after the MDTs are mounted up as "lustre", started from MDT0 (if the system can be mounted after removing the update logs):
lctl lfsck_start -M dagg-MDT0000 -t namespace -A -r
To be safe, you can use the "--dryrun" option to check namespace consistency first, before the repair-mode LFSCK. |
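As a hedged sketch of that advice (device name as above; the exact "--dryrun" form can be confirmed with lctl lfsck_start --help on your build):

lctl lfsck_start -M dagg-MDT0000 -t namespace -A -r --dryrun on   # check only, no repairs
lctl get_param mdd.dagg-MDT0000.lfsck_namespace                   # watch "status:" and the repair counters
lctl lfsck_start -M dagg-MDT0000 -t namespace -A -r               # repair mode once the dry run looks sane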
| Comment by SC Admin (Inactive) [ 08/Apr/18 ] |
|
ok. thanks! cheers, |
| Comment by nasf (Inactive) [ 08/Apr/18 ] |
|
Anyway, everything depends on whether we can mount the MDTs up as "lustre" after removing the update logs. |
| Comment by SC Admin (Inactive) [ 08/Apr/18 ] |
|
yay! moving update_log and oi.9/0x200000009:0x[012]:0x0 to bak has worked [warble1]root: for f in /proc/fs/lustre/mdt/dagg-MDT000*/recovery_status; do echo $f; cat $f; done /proc/fs/lustre/mdt/dagg-MDT0000/recovery_status status: COMPLETE recovery_start: 1523196577 recovery_duration: 57 completed_clients: 123/123 replayed_requests: 0 last_transno: 231928233984 VBR: DISABLED IR: ENABLED /proc/fs/lustre/mdt/dagg-MDT0001/recovery_status status: COMPLETE recovery_start: 1523196601 recovery_duration: 160 completed_clients: 123/124 replayed_requests: 0 last_transno: 236223281864 VBR: ENABLED IR: ENABLED /proc/fs/lustre/mdt/dagg-MDT0002/recovery_status status: COMPLETE recovery_start: 1523196603 recovery_duration: 161 completed_clients: 123/124 replayed_requests: 0 last_transno: 236223281775 VBR: ENABLED IR: ENABLED I just mounted them all on warble1 for simplicity. I'll try a dry-run of lfsck now. cheers, |
| Comment by SC Admin (Inactive) [ 08/Apr/18 ] |
|
lfsck went well (included below if you are interested). looks like only minor repairs. impressively quick too. many thanks again to everybody involved, especially as this happened over a weekend. we'll probably try more mount/umount and failover stuff during the week, just to give ourselves a bit of confidence that it's fully back to being ok again. we'll have to reconstruct the pacemaker/corosync config first though 'cos that fell in a heap after one of the reboots, so might take a few days... please let us know if there's more that we should do to make sure it's happy. I presume the namespace lfsck was sufficient? I guess it goes without saying that it would be nice to know why the LBUG and subsequent WAITING happened in the first place, and have future lustre be more robust against such a failure, and have an easier way to put it back together if something like this happens again too. cheers, [warble1]root: lctl get_param -n mdd.dagg-MDT000'*'.lfsck_namespace name: lfsck_namespace magic: 0xa0621a0b version: 2 status: completed flags: param: all_targets last_completed_time: 1523199317 time_since_last_completed: 244 seconds latest_start_time: 1523198926 time_since_latest_start: 635 seconds last_checkpoint_time: 1523199317 time_since_last_checkpoint: 244 seconds latest_start_position: 266, N/A, N/A last_checkpoint_position: 35184372088832, N/A, N/A first_failure_position: N/A, N/A, N/A checked_phase1: 2708764 checked_phase2: 1284327 updated_phase1: 0 updated_phase2: 0 failed_phase1: 0 failed_phase2: 0 directories: 208318 dirent_repaired: 0 linkea_repaired: 0 nlinks_repaired: 0 multiple_linked_checked: 19975 multiple_linked_repaired: 0 unknown_inconsistency: 0 unmatched_pairs_repaired: 0 dangling_repaired: 0 multiple_referenced_repaired: 0 bad_file_type_repaired: 0 lost_dirent_repaired: 0 local_lost_found_scanned: 0 local_lost_found_moved: 0 local_lost_found_skipped: 0 local_lost_found_failed: 0 striped_dirs_scanned: 0 striped_dirs_repaired: 0 striped_dirs_failed: 0 striped_dirs_disabled: 0 striped_dirs_skipped: 0 striped_shards_scanned: 147021 striped_shards_repaired: 0 striped_shards_failed: 0 striped_shards_skipped: 0 name_hash_repaired: 0 linkea_overflow_cleared: 0 success_count: 2 run_time_phase1: 206 seconds run_time_phase2: 184 seconds average_speed_phase1: 13149 items/sec average_speed_phase2: 6980 objs/sec average_speed_total: 10238 items/sec real_time_speed_phase1: N/A real_time_speed_phase2: N/A current_position: N/A name: lfsck_namespace magic: 0xa0621a0b version: 2 status: completed flags: param: all_targets last_completed_time: 1523199284 time_since_last_completed: 277 seconds latest_start_time: 1523198926 time_since_latest_start: 635 seconds last_checkpoint_time: 1523199284 time_since_last_checkpoint: 277 seconds latest_start_position: 266, N/A, N/A last_checkpoint_position: 35184372088832, N/A, N/A first_failure_position: N/A, N/A, N/A checked_phase1: 2396759 checked_phase2: 1091779 updated_phase1: 1 updated_phase2: 0 failed_phase1: 0 failed_phase2: 0 directories: 196099 dirent_repaired: 0 linkea_repaired: 0 nlinks_repaired: 0 multiple_linked_checked: 19766 multiple_linked_repaired: 0 unknown_inconsistency: 0 unmatched_pairs_repaired: 0 dangling_repaired: 3 multiple_referenced_repaired: 0 bad_file_type_repaired: 0 lost_dirent_repaired: 0 local_lost_found_scanned: 0 local_lost_found_moved: 0 local_lost_found_skipped: 0 local_lost_found_failed: 0 striped_dirs_scanned: 0 striped_dirs_repaired: 0 striped_dirs_failed: 0 striped_dirs_disabled: 0 striped_dirs_skipped: 0 
striped_shards_scanned: 147022 striped_shards_repaired: 0 striped_shards_failed: 0 striped_shards_skipped: 0 name_hash_repaired: 0 linkea_overflow_cleared: 0 success_count: 2 run_time_phase1: 189 seconds run_time_phase2: 151 seconds average_speed_phase1: 12681 items/sec average_speed_phase2: 7230 objs/sec average_speed_total: 10260 items/sec real_time_speed_phase1: N/A real_time_speed_phase2: N/A current_position: N/A name: lfsck_namespace magic: 0xa0621a0b version: 2 status: completed flags: param: all_targets last_completed_time: 1523199290 time_since_last_completed: 271 seconds latest_start_time: 1523198926 time_since_latest_start: 635 seconds last_checkpoint_time: 1523199290 time_since_last_checkpoint: 271 seconds latest_start_position: 266, N/A, N/A last_checkpoint_position: 35184372088832, N/A, N/A first_failure_position: N/A, N/A, N/A checked_phase1: 2397511 checked_phase2: 1092353 updated_phase1: 0 updated_phase2: 1 failed_phase1: 0 failed_phase2: 0 directories: 196040 dirent_repaired: 0 linkea_repaired: 0 nlinks_repaired: 0 multiple_linked_checked: 19931 multiple_linked_repaired: 0 unknown_inconsistency: 0 unmatched_pairs_repaired: 1 dangling_repaired: 0 multiple_referenced_repaired: 0 bad_file_type_repaired: 0 lost_dirent_repaired: 0 local_lost_found_scanned: 0 local_lost_found_moved: 0 local_lost_found_skipped: 0 local_lost_found_failed: 0 striped_dirs_scanned: 0 striped_dirs_repaired: 0 striped_dirs_failed: 0 striped_dirs_disabled: 0 striped_dirs_skipped: 0 striped_shards_scanned: 147021 striped_shards_repaired: 0 striped_shards_failed: 0 striped_shards_skipped: 0 name_hash_repaired: 0 linkea_overflow_cleared: 0 success_count: 2 run_time_phase1: 190 seconds run_time_phase2: 157 seconds average_speed_phase1: 12618 items/sec average_speed_phase2: 6957 objs/sec average_speed_total: 10057 items/sec real_time_speed_phase1: N/A real_time_speed_phase2: N/A current_position: N/A |
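A small sketch for skimming output like the above, pulling only the per-target headers, status lines and repair counters from all three MDTs (parameter path as used in this ticket):

lctl get_param mdd.dagg-MDT000*.lfsck_namespace | grep -E 'lfsck_namespace|status:|repaired|dangling'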
| Comment by nasf (Inactive) [ 08/Apr/18 ] |
Please add the "-C" option when you run repair-mode LFSCK, to repair a dangling name entry by re-creating the lost (empty) MDT-object; otherwise, by default, it will only warn you that there is a dangling name entry. Please note that there are two possible causes of a dangling name entry: 1) An uncompleted rm operation before the crash, which destroyed the MDT-object but left the name entry in the namespace. From the LFSCK view, it cannot distinguish the two cases. To be safe with the data, we assume it is the 2nd case. You need to check the LFSCK logs (Lustre kernel debug logs on the MDT with "lfsck" enabled: lctl set_param debug+=lfsck), find the related repaired items, then decide whether it is the 2nd case or not. If it is the 1st case, then you can remove the stale name entry manually. Even if it was the 1st case and LFSCK repaired it as the 2nd case, that is only equivalent to some removed name entry (and related MDT-object) being "rolled back" because of the MDT corruption and the LFSCK repair; the Lustre namespace is still consistent. |
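One way to capture just those LFSCK log entries, sketched from the commands used later in this ticket (file names are illustrative):

lctl set_param debug+=lfsck                              # make sure the lfsck debug flag is enabled
lctl debug_daemon start /tmp/lfsck.debug 1024            # stream the debug buffer to a file (size in MB)
# ... run the repair-mode LFSCK ...
lctl debug_daemon stop
lctl debug_file /tmp/lfsck.debug /tmp/lfsck.debug.txt    # convert the binary debug log to text
grep 'repair_dangling' /tmp/lfsck.debug.txt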
| Comment by SC Admin (Inactive) [ 08/Apr/18 ] |
|
thanks for checking that output and for clarifying. so we have 3 Lustre files/dirs somewhere that point to something that doesn't actually exist on an MDT? 'lfsck' seems to already be in the default 'debug' setting on our MDS's. cheers, |
| Comment by SC Admin (Inactive) [ 08/Apr/18 ] |
|
if it's only 3 files then I'm not sure I care too much. 3 dirs might be different. cheers, |
| Comment by Alex Zhuravlev [ 08/Apr/18 ] |
|
Hi, do you still have logs with this message?
kernel: LustreError: 49321:0:(lu_object.c:1177:lu_device_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1
I think we still want to learn the root cause.
|
| Comment by Andreas Dilger [ 08/Apr/18 ] |
|
Robin, Typically, lfsck should write any repairs done to the console log (check "lctl get_param printk" to see whether "lfsck" is enabled to print on the console). |
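A sketch of that check, assuming the printk mask accepts the same +flag syntax as the debug mask:

lctl get_param printk          # list the message classes that go to the console log
lctl set_param printk=+lfsck   # add lfsck repair messages to the console log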
| Comment by nasf (Inactive) [ 09/Apr/18 ] |
Yes.
If there are so many logs that the buffer overflows, then you can enable only the "lfsck" debug flag via "lctl set_param debug=lfsck" on the MDT; the logs will then contain only LFSCK-related operations.
I am not sure whether the dangling entry is for a file or a dir, but there may not be much difference: even if it is for a dir, LFSCK can only create an empty dir object to repair it when "-C" is specified. |
| Comment by SC Admin (Inactive) [ 09/Apr/18 ] |
|
Hi Alex, the LBUG is in warbles.txt above, and also in warbles-messages-20180408.txt. the stack trace plus a little bit of important-looking context (from warbles.txt) is ->
Apr 7 16:38:09 warble1 kernel: Lustre: dagg-MDT0001: Not available for connect from 192.168.44.36@o2ib44 (stopping)
Apr 7 16:38:09 warble1 kernel: Lustre: Skipped 9 previous similar messages
Apr 7 16:38:10 warble1 kernel: LustreError: 49321:0:(osp_dev.c:1276:osp_device_free()) header@ffff885e724737b0[0x0, 1, [0x200000007:0x1:0x0] hash exist]{
Apr 7 16:38:10 warble1 kernel: LustreError: 49321:0:(osp_dev.c:1276:osp_device_free()) ....mdt@ffff885e72473800mdt-object@ffff885e724737b0( , writecount=0)
Apr 7 16:38:10 warble1 kernel: LustreError: 49321:0:(osp_dev.c:1276:osp_device_free()) ....mdd@ffff88bd62be0e60mdd-object@ffff88bd62be0e60(open_count=0, valid=0, cltime=0, flags=0)
Apr 7 16:38:10 warble1 kernel: LustreError: 22468:0:(ldlm_lockd.c:2365:ldlm_cancel_handler()) ldlm_cancel from 192.168.44.136@o2ib44 arrived at 1523083090 with bad export cookie 16766780683466865680
Apr 7 16:38:10 warble1 kernel: LustreError: 49321:0:(osp_dev.c:1276:osp_device_free()) ....lod@ffff88bbefc281a0lod-object@ffff88bbefc281a0
Apr 7 16:38:10 warble1 kernel: LustreError: 49321:0:(osp_dev.c:1276:osp_device_free()) ....osp@ffff88bdae0a6850osp-object@ffff88bdae0a6800
Apr 7 16:38:10 warble1 kernel: LustreError: 49321:0:(osp_dev.c:1276:osp_device_free()) } header@ffff885e724737b0
Apr 7 16:38:10 warble1 kernel: LustreError: 49321:0:(lu_object.c:1177:lu_device_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1
Apr 7 16:38:10 warble1 kernel: LustreError: 49321:0:(lu_object.c:1177:lu_device_fini()) LBUG
Apr 7 16:38:10 warble1 kernel: Pid: 49321, comm: umount
Apr 7 16:38:10 warble1 kernel: #012Call Trace:
Apr 7 16:38:10 warble1 kernel: [<ffffffffc052e7ae>] libcfs_call_trace+0x4e/0x60 [libcfs]
Apr 7 16:38:10 warble1 kernel: [<ffffffffc052e83c>] lbug_with_loc+0x4c/0xb0 [libcfs]
Apr 7 16:38:10 warble1 kernel: [<ffffffffc0d05128>] lu_device_fini+0xb8/0xc0 [obdclass]
Apr 7 16:38:10 warble1 kernel: [<ffffffffc0d09fce>] dt_device_fini+0xe/0x10 [obdclass]
Apr 7 16:38:10 warble1 kernel: [<ffffffffc100be52>] osp_device_free+0x42/0x1f0 [osp]
Apr 7 16:38:10 warble1 kernel: [<ffffffffc0cda73e>] class_free_dev+0x54e/0x7f0 [obdclass]
Apr 7 16:38:10 warble1 kernel: [<ffffffffc0cdac00>] class_export_put+0x220/0x2f0 [obdclass]
Apr 7 16:38:10 warble1 kernel: [<ffffffffc0cdc6f5>] class_unlink_export+0x135/0x170 [obdclass]
Apr 7 16:38:10 warble1 kernel: [<ffffffffc0cf1d70>] class_decref+0x80/0x160 [obdclass]
Apr 7 16:38:10 warble1 kernel: [<ffffffffc0cf21d3>] class_detach+0x1b3/0x2e0 [obdclass]
Apr 7 16:38:10 warble1 kernel: [<ffffffffc0cf8eb8>] class_process_config+0x1a28/0x23f0 [obdclass]
Apr 7 16:38:10 warble1 kernel: [<ffffffff810c32ad>] ? ttwu_do_activate.constprop.91+0x5d/0x70
Apr 7 16:38:10 warble1 kernel: [<ffffffffc0539bc7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
Apr 7 16:38:10 warble1 kernel: [<ffffffffc0cf9a60>] class_manual_cleanup+0x1e0/0x710 [obdclass]
Apr 7 16:38:10 warble1 kernel: [<ffffffffc100c2c8>] osp_obd_disconnect+0x178/0x210 [osp]
Apr 7 16:38:10 warble1 kernel: [<ffffffffc17034b8>] lod_putref+0x258/0x960 [lod]
Apr 7 16:38:10 warble1 kernel: [<ffffffffc170564d>] lod_fini_tgt+0xdd/0x3a0 [lod]
Apr 7 16:38:10 warble1 kernel: [<ffffffffc16f8b6c>] lod_device_fini+0x7c/0x1f0 [lod]
Apr 7 16:38:10 warble1 kernel: [<ffffffffc0cf6ac1>] class_cleanup+0x971/0xcd0 [obdclass]
Apr 7 16:38:10 warble1 kernel: [<ffffffffc0cf8e22>] class_process_config+0x1992/0x23f0 [obdclass]
Apr 7 16:38:10 warble1 kernel: [<ffffffff810c32ad>] ? ttwu_do_activate.constprop.91+0x5d/0x70
Apr 7 16:38:10 warble1 kernel: [<ffffffffc0539bc7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
Apr 7 16:38:10 warble1 kernel: [<ffffffffc0cf9a46>] class_manual_cleanup+0x1c6/0x710 [obdclass]
Apr 7 16:38:10 warble1 kernel: [<ffffffffc16f89c3>] lod_obd_disconnect+0x93/0x1c0 [lod]
Apr 7 16:38:10 warble1 kernel: [<ffffffffc176dbc9>] mdd_process_config+0x289/0x600 [mdd]
Apr 7 16:38:10 warble1 kernel: [<ffffffffc15f54cc>] mdt_stack_fini+0x2bc/0xd40 [mdt]
Apr 7 16:38:10 warble1 kernel: [<ffffffffc15f6423>] mdt_device_fini+0x4d3/0x920 [mdt]
Apr 7 16:38:10 warble1 kernel: [<ffffffffc0cf6ac1>] class_cleanup+0x971/0xcd0 [obdclass]
Apr 7 16:38:10 warble1 kernel: [<ffffffffc0cf8e22>] class_process_config+0x1992/0x23f0 [obdclass]
Apr 7 16:38:10 warble1 kernel: [<ffffffffc0539bc7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
Apr 7 16:38:10 warble1 kernel: [<ffffffffc0cf9a46>] class_manual_cleanup+0x1c6/0x710 [obdclass]
Apr 7 16:38:10 warble1 kernel: [<ffffffffc0d27f8e>] server_put_super+0x8de/0xcd0 [obdclass]
Apr 7 16:38:10 warble1 kernel: [<ffffffff812056d2>] generic_shutdown_super+0x72/0x100
Apr 7 16:38:10 warble1 kernel: [<ffffffff81205aa2>] kill_anon_super+0x12/0x20
Apr 7 16:38:10 warble1 kernel: [<ffffffffc0cfc342>] lustre_kill_super+0x32/0x50 [obdclass]
Apr 7 16:38:10 warble1 kernel: [<ffffffff81205e59>] deactivate_locked_super+0x49/0x60
Apr 7 16:38:10 warble1 kernel: [<ffffffff812065c6>] deactivate_super+0x46/0x60
Apr 7 16:38:10 warble1 kernel: [<ffffffff8122395f>] cleanup_mnt+0x3f/0x80
Apr 7 16:38:10 warble1 kernel: [<ffffffff812239f2>] __cleanup_mnt+0x12/0x20
Apr 7 16:38:10 warble1 kernel: [<ffffffff810aefe5>] task_work_run+0xc5/0xf0
Apr 7 16:38:10 warble1 kernel: [<ffffffff8102ab52>] do_notify_resume+0x92/0xb0
Apr 7 16:38:10 warble1 kernel: [<ffffffff816b8d37>] int_signal+0x12/0x17
Apr 7 16:38:10 warble1 kernel:
Apr 7 16:38:10 warble1 kernel: Kernel panic - not syncing: LBUG
cheers, |
| Comment by SC Admin (Inactive) [ 09/Apr/18 ] |
|
hmm. this did not go too well. setting this -> I think what's happened is that all the lfsck traffic (about 2.4M lines of it) goes at serial baud rate console=ttyS0,115200n8 which stops the machine from working. it basically halted. couldn't ping on some interfaces. all clients evicted from dagg-MDT. eventually I think warble1 was stonith'd (which shouldn't have happened but that's another story). on remount after reboot the lfsck seems to have automatically continued, which is unexpected, but not tragic in this case as 'printk' was back to default. the debug_daemon approach seems to work. I guess these are the problematic files? there's a lot of output [warble1]root: wc -l /mnt/root/root/lfsck.debug_daemon.dagg-MDT0000.20180409.2.txt 2375873 /mnt/root/root/lfsck.debug_daemon.dagg-MDT0000.20180409.2.txt so I'm guessing the grep line a bit -> [warble1]root: cat /mnt/root/root/lfsck.debug_daemon.dagg-MDT0000.20180409.2.txt | grep dangling | grep -v 'dangling no/no' 00100000:10000000:11.0F:1523257868.304558:0:1553:0:(lfsck_namespace.c:5112:lfsck_namespace_repair_dangling()) dagg-MDT0001-osd: namespace LFSCK assistant found dangling reference for: parent [0x28001c98f:0x165cd:0x0], child [0x68001ca85:0x19b9a:0x0], type 32768, name model_z0.778_9694. Keep the MDT-object there by default: rc = 0 00100000:10000000:15.0:1523257940.207653:0:1553:0:(lfsck_namespace.c:5112:lfsck_namespace_repair_dangling()) dagg-MDT0001-osd: namespace LFSCK assistant found dangling reference for: parent [0x28001c995:0x1b3c7:0x0], child [0x68001ca80:0xeae9:0x0], type 0, name [0x68001ca80:0xeae9:0x0]:1. Keep the MDT-object there by default: rc = 0 00100000:10000000:15.0:1523257940.207655:0:1553:0:(lfsck_striped_dir.c:2171:lfsck_namespace_striped_dir_rescan()) dagg-MDT0001-osd: namespace LFSCK repair the shard 1 [0x68001ca80:0xeae9:0x0] of the striped directory [0x28001c995:0x1b3c7:0x0] with dangling yes/yes, rename no/no, llinkea no/no, repair_lmvea no/no: rc = 0 00100000:10000000:15.0:1523257940.207698:0:1553:0:(lfsck_namespace.c:5112:lfsck_namespace_repair_dangling()) dagg-MDT0001-osd: namespace LFSCK assistant found dangling reference for: parent [0x28001c995:0x1b3c7:0x0], child [0x20001072d:0xeada:0x0], type 0, name [0x20001072d:0xeada:0x0]:2. Keep the MDT-object there by default: rc = 0 00100000:10000000:15.0:1523257940.207699:0:1553:0:(lfsck_striped_dir.c:2171:lfsck_namespace_striped_dir_rescan()) dagg-MDT0001-osd: namespace LFSCK repair the shard 2 [0x20001072d:0xeada:0x0] of the striped directory [0x28001c995:0x1b3c7:0x0] with dangling yes/yes, rename no/no, llinkea no/no, repair_lmvea no/no: rc = 0 which looks like 1 file and 2 dirs. cheers, |
| Comment by nasf (Inactive) [ 09/Apr/18 ] |
|
You can find out the parent name path via "lfs fid2path" on the client, for example: lfs fid2path $mnt_point [0x28001c98f:0x165cd:0x0] |
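For the parent FIDs reported above, that would look like the following on a client (the mount point /mnt/dagg is an assumed example):

lfs fid2path /mnt/dagg [0x28001c98f:0x165cd:0x0]
lfs fid2path /mnt/dagg [0x28001c995:0x1b3c7:0x0]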
| Comment by nasf (Inactive) [ 09/Apr/18 ] |
00100000:10000000:15.0:1523257940.207653:0:1553:0:(lfsck_namespace.c:5112:lfsck_namespace_repair_dangling()) dagg-MDT0001-osd: namespace LFSCK assistant found dangling reference for: parent [0x28001c995:0x1b3c7:0x0], child [0x68001ca80:0xeae9:0x0], type 0, name [0x68001ca80:0xeae9:0x0]:1. Keep the MDT-object there by default: rc = 0 00100000:10000000:15.0:1523257940.207698:0:1553:0:(lfsck_namespace.c:5112:lfsck_namespace_repair_dangling()) dagg-MDT0001-osd: namespace LFSCK assistant found dangling reference for: parent [0x28001c995:0x1b3c7:0x0], child [0x20001072d:0xeada:0x0], type 0, name [0x20001072d:0xeada:0x0]:2. Keep the MDT-object there by default: rc = 0 These two are for the broken striped directory [0x28001c995:0x1b3c7:0x0]. |
| Comment by SC Admin (Inactive) [ 09/Apr/18 ] |
|
Hi, we've hit a new LBUG when running lfsck with -C. 2018-04-09 23:58:34 [27203.585803] LustreError: 132624:0:(dt_object.c:213:dt_mode_to_dft()) LBUG 2018-04-09 23:58:34 [27203.593489] Pid: 132624, comm: lfsck_namespace 2018-04-09 23:58:34 [27203.598773] 2018-04-09 23:58:34 [27203.598773] Call Trace: 2018-04-09 23:58:34 [27203.604202] [<ffffffffc04ff7ae>] libcfs_call_trace+0x4e/0x60 [libcfs] 2018-04-09 23:58:34 [27203.611457] [<ffffffffc04ff83c>] lbug_with_loc+0x4c/0xb0 [libcfs] 2018-04-09 23:58:34 [27203.618390] [<ffffffffc0e5f143>] dt_mode_to_dft+0x73/0x80 [obdclass] 2018-04-09 23:58:34 [27203.625555] [<ffffffffc10ee87b>] lfsck_namespace_repair_dangling+0x61b/0xf40 [lfsck] 2018-04-09 23:58:34 [27203.634104] [<ffffffffc0e5b656>] ? htable_lookup+0xf6/0x170 [obdclass] 2018-04-09 23:58:34 [27203.641423] [<ffffffffc111a8ca>] lfsck_namespace_striped_dir_rescan+0x86a/0x1220 [lfsck] 2018-04-09 23:58:34 [27203.650290] [<ffffffffc10f0a71>] lfsck_namespace_assistant_handler_p1+0x18d1/0x1f40 [lfsck] 2018-04-09 23:58:34 [27203.659391] [<ffffffff8102954d>] ? __switch_to+0xcd/0x500 2018-04-09 23:58:34 [27203.665544] [<ffffffffc10d47a4>] lfsck_assistant_engine+0x3c4/0x2080 [lfsck] 2018-04-09 23:58:34 [27203.673333] [<ffffffff810c95d5>] ? sched_clock_cpu+0x85/0xc0 2018-04-09 23:58:35 [27203.679725] [<ffffffff8102954d>] ? __switch_to+0xcd/0x500 2018-04-09 23:58:35 [27203.685842] [<ffffffff810c6620>] ? default_wake_function+0x0/0x20 2018-04-09 23:58:35 [27203.692649] [<ffffffffc10d43e0>] ? lfsck_assistant_engine+0x0/0x2080 [lfsck] 2018-04-09 23:58:35 [27203.700397] [<ffffffff810b270f>] kthread+0xcf/0xe0 2018-04-09 23:58:35 [27203.705888] [<ffffffff810b2640>] ? kthread+0x0/0xe0 2018-04-09 23:58:35 [27203.711464] [<ffffffff816b8798>] ret_from_fork+0x58/0x90 2018-04-09 23:58:35 [27203.717463] [<ffffffff810b2640>] ? kthread+0x0/0xe0 2018-04-09 23:58:35 [27203.723029] 2018-04-09 23:58:35 [27203.725126] Kernel panic - not syncing: LBUG warble1 then failed over to warble2, the lfsck automatically continued, and then warble2 got the same LBUG. 2018-04-10 00:04:43 [169536.894412] LustreError: 253888:0:(dt_object.c:213:dt_mode_to_dft()) LBUG 2018-04-10 00:04:43 [169536.902175] Pid: 253888, comm: lfsck_namespace 2018-04-10 00:04:43 [169536.907564] 2018-04-10 00:04:43 [169536.907564] Call Trace: 2018-04-10 00:04:43 [169536.913210] [<ffffffffc08a87ae>] libcfs_call_trace+0x4e/0x60 [libcfs] 2018-04-10 00:04:43 [169536.920569] [<ffffffffc08a883c>] lbug_with_loc+0x4c/0xb0 [libcfs] 2018-04-10 00:04:43 [169536.927591] [<ffffffffc0be0143>] dt_mode_to_dft+0x73/0x80 [obdclass] 2018-04-10 00:04:43 [169536.934856] [<ffffffffc144f87b>] lfsck_namespace_repair_dangling+0x61b/0xf40 [lfsck] 2018-04-10 00:04:43 [169536.943498] [<ffffffffc0bdc656>] ? htable_lookup+0xf6/0x170 [obdclass] 2018-04-10 00:04:43 [169536.950921] [<ffffffffc147b8ca>] lfsck_namespace_striped_dir_rescan+0x86a/0x1220 [lfsck] 2018-04-10 00:04:43 [169536.959904] [<ffffffffc1451a71>] lfsck_namespace_assistant_handler_p1+0x18d1/0x1f40 [lfsck] 2018-04-10 00:04:43 [169536.969128] [<ffffffff8102954d>] ? __switch_to+0xcd/0x500 2018-04-10 00:04:43 [169536.975403] [<ffffffffc14357a4>] lfsck_assistant_engine+0x3c4/0x2080 [lfsck] 2018-04-10 00:04:43 [169536.983313] [<ffffffff810c95d5>] ? sched_clock_cpu+0x85/0xc0 2018-04-10 00:04:43 [169536.989826] [<ffffffff8102954d>] ? __switch_to+0xcd/0x500 2018-04-10 00:04:43 [169536.996073] [<ffffffff810c6620>] ? default_wake_function+0x0/0x20 2018-04-10 00:04:43 [169537.002998] [<ffffffffc14353e0>] ? 
lfsck_assistant_engine+0x0/0x2080 [lfsck] 2018-04-10 00:04:43 [169537.010872] [<ffffffff810b270f>] kthread+0xcf/0xe0 2018-04-10 00:04:43 [169537.016481] [<ffffffff810b2640>] ? kthread+0x0/0xe0 2018-04-10 00:04:43 [169537.022166] [<ffffffff816b8798>] ret_from_fork+0x58/0x90 2018-04-10 00:04:43 [169537.028269] [<ffffffff810b2640>] ? kthread+0x0/0xe0 2018-04-10 00:04:43 [169537.033925] 2018-04-10 00:04:43 [169537.036107] Kernel panic - not syncing: LBUG and now we are back to the situation of site down, and 2 MDT's in WAITING state [warble2]root: for f in /proc/fs/lustre/mdt/dagg-MDT000*/recovery_status; do echo $f; cat $f; done /proc/fs/lustre/mdt/dagg-MDT0000/recovery_status status: COMPLETE recovery_start: 1523283444 recovery_duration: 301 completed_clients: 123/125 replayed_requests: 1 last_transno: 253403072149 VBR: ENABLED IR: DISABLED /proc/fs/lustre/mdt/dagg-MDT0001/recovery_status status: WAITING non-ready MDTs: 0000 0002 recovery_start: 1523283444 time_waited: 651 /proc/fs/lustre/mdt/dagg-MDT0002/recovery_status status: WAITING non-ready MDTs: 0000 0001 recovery_start: 1523283444 time_waited: 651 hmm, actually I think perhaps it eventually completed, because it's LBUG'd again now. I had this command entered in terminal and it appeared to be hung and waiting for the MDT's to complete, but perhaps it wasn't fast enough to stop the LBUG? [warble2]root: lctl lfsck_stop -M dagg-MDT0000 -A Stopped LFSCK on the device dagg-MDT0000. maybe it only turned it off on MDT0 and not MDT1,2? thanks. cheers, |
| Comment by SC Admin (Inactive) [ 09/Apr/18 ] |
|
ah, -o skip_lfsck. perhaps this should be put into your recommended HA config defaults? cheers, |
| Comment by nasf (Inactive) [ 09/Apr/18 ] |
Yes, please add the "-o skip_lfsck" mount option if you want to prevent LFSCK from auto-resuming from its last checkpoint. I am sorry for the inconvenience. Please repeat the steps as described in the comment https://jira.hpdd.intel.com/browse/LU-10887?focusedCommentId=225392&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-225392 to make your system mountable again. I will work on the patch for the LBUG() caused by LFSCK. |
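Putting those two pieces of advice together, a hedged sketch using the device and pool names from this ticket:

lctl lfsck_stop -M dagg-MDT0000 -A    # -A broadcasts the stop to all targets
mount -t lustre -o skip_lfsck warble2-dagg-MDT0-pool/MDT0 /lustre/dagg/MDT0
mount -t lustre -o skip_lfsck warble1-dagg-MDT1-pool/MDT1 /lustre/dagg/MDT1
mount -t lustre -o skip_lfsck warble1-dagg-MDT2-pool/MDT2 /lustre/dagg/MDT2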
| Comment by SC Admin (Inactive) [ 09/Apr/18 ] |
|
nope. [warble1]root: mount -t lustre -o skip_lfsck /warble2-dagg-MDT0-pool/MDT0 /lustre/dagg/MDT0 mount.lustre: /warble2-dagg-MDT0-pool/MDT0 has not been formatted with mkfs.lustre or the backend filesystem type is not supported by this tool (edit) my bad. had a leading '/'. seems to mount now. [warble1]root: mount -t lustre -o skip_lfsck warble2-dagg-MDT0-pool/MDT0 /lustre/dagg/MDT0 seeing if they complete now... cheers, |
| Comment by SC Admin (Inactive) [ 09/Apr/18 ] |
|
yup. that worked. cool. filesystems all mounted up again, and I'll try to add -o skip_lfsck to our HA config. here is the namespace lfsck info if you are interested. [warble1]root: lctl get_param -n mdd.dagg-MDT000*.lfsck_namespace name: lfsck_namespace magic: 0xa0621a0b version: 2 status: stopped flags: param: all_targets,create_mdtobj last_completed_time: 1523258154 time_since_last_completed: 28416 seconds latest_start_time: 1523283745 time_since_latest_start: 2825 seconds last_checkpoint_time: 1523284186 time_since_last_checkpoint: 2384 seconds latest_start_position: 3643308, [0x20001072c:0x16502:0x0], 0x75092e6bed520000 last_checkpoint_position: 3643308, [0x20001072c:0x16502:0x0], 0x750934ffbf07ffff first_failure_position: 3643307, N/A, N/A checked_phase1: 2426567 checked_phase2: 0 updated_phase1: 0 updated_phase2: 0 failed_phase1: 1 failed_phase2: 0 directories: 202176 dirent_repaired: 0 linkea_repaired: 0 nlinks_repaired: 0 multiple_linked_checked: 14706 multiple_linked_repaired: 0 unknown_inconsistency: 0 unmatched_pairs_repaired: 0 dangling_repaired: 0 multiple_referenced_repaired: 0 bad_file_type_repaired: 0 lost_dirent_repaired: 0 local_lost_found_scanned: 0 local_lost_found_moved: 0 local_lost_found_skipped: 0 local_lost_found_failed: 0 striped_dirs_scanned: 0 striped_dirs_repaired: 0 striped_dirs_failed: 0 striped_dirs_disabled: 0 striped_dirs_skipped: 1 striped_shards_scanned: 142443 striped_shards_repaired: 0 striped_shards_failed: 0 striped_shards_skipped: 0 name_hash_repaired: 0 linkea_overflow_cleared: 0 success_count: 5 run_time_phase1: 180 seconds run_time_phase2: 0 seconds average_speed_phase1: 13480 items/sec average_speed_phase2: 0 objs/sec average_speed_total: 13480 items/sec real_time_speed_phase1: N/A real_time_speed_phase2: N/A current_position: N/A name: lfsck_namespace magic: 0xa0621a0b version: 2 status: crashed flags: inconsistent param: all_targets,create_mdtobj last_completed_time: 1523258119 time_since_last_completed: 28451 seconds latest_start_time: 1523284186 time_since_latest_start: 2384 seconds last_checkpoint_time: 1523282677 time_since_last_checkpoint: 3893 seconds latest_start_position: 2923498, N/A, N/A last_checkpoint_position: 2923496, N/A, N/A first_failure_position: N/A, N/A, N/A checked_phase1: 2224270 checked_phase2: 0 updated_phase1: 1 updated_phase2: 0 failed_phase1: 0 failed_phase2: 0 directories: 189475 dirent_repaired: 0 linkea_repaired: 0 nlinks_repaired: 0 multiple_linked_checked: 16255 multiple_linked_repaired: 0 unknown_inconsistency: 0 unmatched_pairs_repaired: 0 dangling_repaired: 1 multiple_referenced_repaired: 0 bad_file_type_repaired: 0 lost_dirent_repaired: 0 local_lost_found_scanned: 0 local_lost_found_moved: 0 local_lost_found_skipped: 0 local_lost_found_failed: 0 striped_dirs_scanned: 0 striped_dirs_repaired: 0 striped_dirs_failed: 0 striped_dirs_disabled: 0 striped_dirs_skipped: 0 striped_shards_scanned: 141950 striped_shards_repaired: 0 striped_shards_failed: 0 striped_shards_skipped: 0 name_hash_repaired: 0 linkea_overflow_cleared: 0 success_count: 5 run_time_phase1: 180 seconds run_time_phase2: 0 seconds average_speed_phase1: 12357 items/sec average_speed_phase2: 0 objs/sec average_speed_total: 12357 items/sec real_time_speed_phase1: N/A real_time_speed_phase2: N/A current_position: N/A name: lfsck_namespace magic: 0xa0621a0b version: 2 status: crashed flags: param: all_targets,create_mdtobj last_completed_time: 1523258118 time_since_last_completed: 28452 seconds latest_start_time: 1523284186 time_since_latest_start: 2384 
seconds last_checkpoint_time: 1523282677 time_since_last_checkpoint: 3893 seconds latest_start_position: 2965156, N/A, N/A last_checkpoint_position: 2965154, N/A, N/A first_failure_position: N/A, N/A, N/A checked_phase1: 2196781 checked_phase2: 0 updated_phase1: 0 updated_phase2: 0 failed_phase1: 0 failed_phase2: 0 directories: 188184 dirent_repaired: 0 linkea_repaired: 0 nlinks_repaired: 0 multiple_linked_checked: 12893 multiple_linked_repaired: 0 unknown_inconsistency: 0 unmatched_pairs_repaired: 0 dangling_repaired: 0 multiple_referenced_repaired: 0 bad_file_type_repaired: 0 lost_dirent_repaired: 0 local_lost_found_scanned: 0 local_lost_found_moved: 0 local_lost_found_skipped: 0 local_lost_found_failed: 0 striped_dirs_scanned: 0 striped_dirs_repaired: 0 striped_dirs_failed: 0 striped_dirs_disabled: 0 striped_dirs_skipped: 0 striped_shards_scanned: 140952 striped_shards_repaired: 0 striped_shards_failed: 0 striped_shards_skipped: 0 name_hash_repaired: 0 linkea_overflow_cleared: 0 success_count: 5 run_time_phase1: 180 seconds run_time_phase2: 0 seconds average_speed_phase1: 12204 items/sec average_speed_phase2: 0 objs/sec average_speed_total: 12204 items/sec real_time_speed_phase1: N/A real_time_speed_phase2: N/A current_position: N/A thanks for your help. cheers, |
| Comment by Gerrit Updater [ 09/Apr/18 ] |
|
Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/31915 |
| Comment by SC Admin (Inactive) [ 09/Apr/18 ] |
|
so this is probably a stupid question, but how do I see if a mounted lustre zfs dataset has been mounted with the -o skip_lfsck option? I'm trying to add this to our HA config (the obvious place being /usr/lib/ocf/resource.d/lustre/Lustre), but having done a failover that should have been safe either way, now I can't see if -o skip_lfsck was set or not. 'mount' just says 'lustre (ro)' for everything. /proc/mounts the same. mtab is a symlink to /proc/self/mounts these days. cheers, |
| Comment by nasf (Inactive) [ 09/Apr/18 ] |
|
Here is my "mount" output with "-o skip_lfsck" specified when mounting the MDT:
# mount
...
lustre-ost1/ost1 on /mnt/lustre-ost1 type lustre (ro,svname=lustre-OST0000,mgsnode=10.211.55.19@tcp,osd=osd-zfs)
lustre-ost2/ost2 on /mnt/lustre-ost2 type lustre (ro,svname=lustre-OST0001,mgsnode=10.211.55.19@tcp,osd=osd-zfs)
lustre-mdt1/mdt1 on /mnt/lustre-mds1 type lustre (ro,svname=lustre-MDT0000,skip_lfsck,mgs,osd=osd-zfs)
On the other hand, the "-o skip_lfsck" mount option is only used to prevent LFSCK from auto-resuming at mount time; after that, the option has no other effect. So as long as the LFSCK is not auto-started when the target is mounted as "lustre", that is OK. |
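As a quick check on a server where mount actually reports the option (see the patch referenced further down), a trivial sketch:

mount -l -t lustre | grep skip_lfsck   # lists any targets that recorded the skip_lfsck option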
| Comment by SC Admin (Inactive) [ 09/Apr/18 ] |
|
another LBUG. this one on umount of a MDT. 2018-04-10 03:13:18 [11087.300802] Lustre: Failing over dagg-MDT0000 2018-04-10 03:13:18 [11087.364400] LustreError: 135208:0:(lod_qos.c:208:lod_statfs_and_check()) dagg-MDT0000-mdtlov: statfs: rc = -108 2018-04-10 03:13:18 [11087.375190] LustreError: 135213:0:(ldlm_lockd.c:1415:ldlm_handle_enqueue0()) ### lock on destroyed export ffff885de9da6800 ns: mdt-dagg-MDT0000_UUID lock: ffff885deb8fba00/0x25e3f37caf9307ca lrc: 3,0 2018-04-10 03:13:18 [11087.409230] Lustre: dagg-MDT0000: Not available for connect from 192.168.44.122@o2ib44 (stopping) 2018-04-10 03:13:18 [11087.409231] Lustre: Skipped 3 previous similar messages 2018-04-10 03:13:18 [11087.510313] LustreError: 181886:0:(lod_dev.c:1672:lod_device_free()) ASSERTION( atomic_read(&lu->ld_ref) == 0 ) failed: lu is ffff885eb2e80000 2018-04-10 03:13:18 [11087.524318] LustreError: 181886:0:(lod_dev.c:1672:lod_device_free()) LBUG 2018-04-10 03:13:18 [11087.531640] Pid: 181886, comm: umount 2018-04-10 03:13:18 [11087.535827] 2018-04-10 03:13:18 [11087.535827] Call Trace: 2018-04-10 03:13:18 [11087.540753] [<ffffffffc054d7ae>] libcfs_call_trace+0x4e/0x60 [libcfs] 2018-04-10 03:13:18 [11087.547764] [<ffffffffc054d83c>] lbug_with_loc+0x4c/0xb0 [libcfs] 2018-04-10 03:13:18 [11087.554422] [<ffffffffc0bb7f76>] lod_device_free+0x296/0x2a0 [lod] 2018-04-10 03:13:18 [11087.561205] [<ffffffffc0ca773e>] class_free_dev+0x54e/0x7f0 [obdclass] 2018-04-10 03:13:18 [11087.568323] [<ffffffffc0ca7c00>] class_export_put+0x220/0x2f0 [obdclass] 2018-04-10 03:13:18 [11087.575586] [<ffffffffc0ca96f5>] class_unlink_export+0x135/0x170 [obdclass] 2018-04-10 03:13:18 [11087.583119] [<ffffffffc0cbed70>] class_decref+0x80/0x160 [obdclass] 2018-04-10 03:13:18 [11087.589961] [<ffffffffc0cbf1d3>] class_detach+0x1b3/0x2e0 [obdclass] 2018-04-10 03:13:18 [11087.596884] [<ffffffffc0cc5eb8>] class_process_config+0x1a28/0x23f0 [obdclass] 2018-04-10 03:13:18 [11087.604665] [<ffffffff810c32ad>] ? ttwu_do_activate.constprop.91+0x5d/0x70 2018-04-10 03:13:18 [11087.612113] [<ffffffffc0558bc7>] ? libcfs_debug_msg+0x57/0x80 [libcfs] 2018-04-10 03:13:18 [11087.619214] [<ffffffffc0cc6a60>] class_manual_cleanup+0x1e0/0x710 [obdclass] 2018-04-10 03:13:18 [11087.626829] [<ffffffffc0bb79c3>] lod_obd_disconnect+0x93/0x1c0 [lod] 2018-04-10 03:13:18 [11087.633748] [<ffffffffc0a5abc9>] mdd_process_config+0x289/0x600 [mdd] 2018-04-10 03:13:18 [11087.640770] [<ffffffffc11344cc>] mdt_stack_fini+0x2bc/0xd40 [mdt] 2018-04-10 03:13:18 [11087.647440] [<ffffffffc1135423>] mdt_device_fini+0x4d3/0x920 [mdt] 2018-04-10 03:13:18 [11087.654200] [<ffffffffc0cc3ac1>] class_cleanup+0x971/0xcd0 [obdclass] 2018-04-10 03:13:18 [11087.661203] [<ffffffffc0cc5e22>] class_process_config+0x1992/0x23f0 [obdclass] 2018-04-10 03:13:18 [11087.668985] [<ffffffffc0558bc7>] ? 
libcfs_debug_msg+0x57/0x80 [libcfs] 2018-04-10 03:13:18 [11087.676089] [<ffffffffc0cc6a46>] class_manual_cleanup+0x1c6/0x710 [obdclass] 2018-04-10 03:13:18 [11087.683712] [<ffffffffc0cf4f8e>] server_put_super+0x8de/0xcd0 [obdclass] 2018-04-10 03:13:18 [11087.690990] [<ffffffff812056d2>] generic_shutdown_super+0x72/0x100 2018-04-10 03:13:18 [11087.697739] [<ffffffff81205aa2>] kill_anon_super+0x12/0x20 2018-04-10 03:13:18 [11087.703796] [<ffffffffc0cc9342>] lustre_kill_super+0x32/0x50 [obdclass] 2018-04-10 03:13:18 [11087.710972] [<ffffffff81205e59>] deactivate_locked_super+0x49/0x60 2018-04-10 03:13:18 [11087.717715] [<ffffffff812065c6>] deactivate_super+0x46/0x60 2018-04-10 03:13:18 [11087.723853] [<ffffffff8122395f>] cleanup_mnt+0x3f/0x80 2018-04-10 03:13:18 [11087.729556] [<ffffffff812239f2>] __cleanup_mnt+0x12/0x20 2018-04-10 03:13:18 [11087.735427] [<ffffffff810aefe5>] task_work_run+0xc5/0xf0 2018-04-10 03:13:18 [11087.741306] [<ffffffff8102ab52>] do_notify_resume+0x92/0xb0 2018-04-10 03:13:18 [11087.747443] [<ffffffff816b8d37>] int_signal+0x12/0x17 2018-04-10 03:13:18 [11087.753051] 2018-04-10 03:13:18 [11087.755024] Kernel panic - not syncing: LBUG |
| Comment by SC Admin (Inactive) [ 09/Apr/18 ] |
|
fs's mounted up ok. cheers, |
| Comment by Andreas Dilger [ 09/Apr/18 ] |
|
Robin, I might suggest that the system be left as-is for the time being, until a suitable maintenance window is available? That will allow users to get some work done while we test and review Fan Yong's patch (which should address at least one of the problems you are seeing). |
| Comment by Gerrit Updater [ 10/Apr/18 ] |
|
Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/31929 |
| Comment by SC Admin (Inactive) [ 10/Apr/18 ] |
|
Hi Andreas, the system seems stable with all the MDTs on warble2 and mounted by-hand with -o skip_lfsck. so yup, happy to leave it that way for a while - both for patches to mature, and for us to swap out and try to diagnose potentially bad bits of warble1 hardware. obviously at some stage we'll have to test (a hopefully improved) warble1 with Lustre again though, and unfortunately there are shared components (raid controller) that we can't test separately and offline. Hi Fan Yong, so centos7 doesn't show any extra options in the 'mount' output unfortunately. even when I mount by hand I get just 'lustre (ro)'. I guess you're using a different OS. I don't suppose there's another way to see if -o skip_lfsck was asked for? 'dk', /proc, /sys? but anyway, no hurry 'cos we don't have any HA now so can't re-test anyway. cheers, |
| Comment by nasf (Inactive) [ 10/Apr/18 ] |
|
What is the output on the MDT? mount -l -t lustre |
| Comment by SC Admin (Inactive) [ 10/Apr/18 ] |
[warble2]root: mount -l -t lustre
warble2-apps-MDT0-pool/MDT0 on /lustre/apps/MDT0 type lustre (ro)
warble2-home-MDT0-pool/MDT0 on /lustre/home/MDT0 type lustre (ro)
warble1-MGT-pool/MGT on /lustre/MGT type lustre (ro)
warble1-images-MDT0-pool/MDT0 on /lustre/images/MDT0 type lustre (ro)
warble2-dagg-MDT0-pool/MDT0 on /lustre/dagg/MDT0 type lustre (ro)
warble1-dagg-MDT1-pool/MDT1 on /lustre/dagg/MDT1 type lustre (ro)
warble1-dagg-MDT2-pool/MDT2 on /lustre/dagg/MDT2 type lustre (ro)
all the dagg's were definitely mounted with -o skip_lfsck. cheers, |
| Comment by nasf (Inactive) [ 10/Apr/18 ] |
|
I know the reason. You need the patch: https://review.whamcloud.com/29228 |
| Comment by SC Admin (Inactive) [ 10/Apr/18 ] |
|
oh, that makes sense. nice find! cheers, |
| Comment by Peter Jones [ 10/Apr/18 ] |
|
Robin We have no immediate plans to do a 2.11.1 release but we can queue up this fix for inclusion in 2.10.4 Peter |
| Comment by SC Admin (Inactive) [ 10/Apr/18 ] |
|
Hi Peter, oh, I see I haven't been reading roadmaps enough. I thought the plan was to keep folks rolling forward with 2.x releases and we were ok with that. I didn't realise 2.10.x was an LTS. is it appropriate to drop this from severity 1 now? the fs is up and we're reasonably confident it'll stay that way. cheers, |
| Comment by Peter Jones [ 10/Apr/18 ] |
|
Robin Yes - 2.10.x is an LTS branch. We'd prefer to keep the ticket severity at S1 so that we correctly categorize it when we run future reports but we understand that the "all hands on deck" period is past and we're focusing on RCA and preventive actions to avoid future scenarios. Peter |
| Comment by nasf (Inactive) [ 12/Apr/18 ] |
The object reference leak issue will be fixed via https://review.whamcloud.com/#/c/31431/ (2.10.4) |
| Comment by Gerrit Updater [ 19/Apr/18 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31915/ |
| Comment by nasf (Inactive) [ 19/Apr/18 ] |
|
Robin, https://review.whamcloud.com/31915/ has been landed to master, https://review.whamcloud.com/#/c/31431/ has been landed to 2.10.4. |
| Comment by Peter Jones [ 19/Apr/18 ] |
|
Marking this as fixed in 2.12 so that this fix is also queued up for 2.10.4. Our goal would be for Swinburne to be running a vanilla release without the need for additional patches. |
| Comment by Gerrit Updater [ 19/Apr/18 ] |
|
Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/32076 |
| Comment by SC Admin (Inactive) [ 02/May/18 ] |
|
Hi, we tested warble1 hardware all we could for about a week and found no hardware issues. we also replaced sas cards and cables just to be safe. warble1 now is 3.10.0-693.21.1.el7.x86_64 and zfs 0.7.8 and has these patches applied usr/src/lustre-2.10.3/lu10212-estale.patch usr/src/lustre-2.10.3/lu10707-ksocklnd-revert-jiffies.patch usr/src/lustre-2.10.3/lu10707-lnet-route-jiffies.patch usr/src/lustre-2.10.3/lu10887-lfsck.patch usr/src/lustre-2.10.3/lu8990-put-root.patch when the dagg MDT's were mounted on warble1 they COMPLETED ok and then about 5 seconds later it hit an LBUG in lfsck. ... 2018-05-02 22:06:06 [ 2919.828067] Lustre: dagg-MDT0000: Client 22c84389-af1f-9970-0e9b-70c3a4861afd (at 10.8.49.155@tcp201) reconnecting 2018-05-02 22:06:06 [ 2919.828113] Lustre: dagg-MDT0002: Recovery already passed deadline 0:31. If you do not want to wait more, please abort the recovery by force. 2018-05-02 22:06:38 [ 2951.686211] Lustre: dagg-MDT0002: recovery is timed out, evict stale exports 2018-05-02 22:06:38 [ 2951.694197] Lustre: dagg-MDT0002: disconnecting 1 stale clients 2018-05-02 22:06:38 [ 2951.736799] Lustre: 24680:0:(ldlm_lib.c:2544:target_recovery_thread()) too long recovery - read logs 2018-05-02 22:06:38 [ 2951.746774] Lustre: dagg-MDT0002: Recovery over after 6:24, of 125 clients 124 recovered and 1 was evicted. 2018-05-02 22:06:38 [ 2951.746775] LustreError: dumping log to /tmp/lustre-log.1525262798.24680 2018-05-02 22:06:44 [ 2957.910031] LustreError: 33236:0:(dt_object.c:213:dt_mode_to_dft()) LBUG 2018-05-02 22:06:44 [ 2957.917615] Pid: 33236, comm: lfsck_namespace 2018-05-02 22:06:44 [ 2957.922760] 2018-05-02 22:06:44 [ 2957.922760] Call Trace: 2018-05-02 22:06:44 [ 2957.928142] [<ffffffffc06457ae>] libcfs_call_trace+0x4e/0x60 [libcfs] 2018-05-02 22:06:44 [ 2957.935374] [<ffffffffc064583c>] lbug_with_loc+0x4c/0xb0 [libcfs] 2018-05-02 22:06:44 [ 2957.942270] [<ffffffffc0d82573>] dt_mode_to_dft+0x73/0x80 [obdclass] 2018-05-02 22:06:44 [ 2957.949398] [<ffffffffc115ac81>] lfsck_namespace_repair_dangling+0x621/0xf40 [lfsck] 2018-05-02 22:06:44 [ 2957.957911] [<ffffffffc0d7ea22>] ? htable_lookup+0x102/0x180 [obdclass] 2018-05-02 22:06:44 [ 2957.965289] [<ffffffffc1186f4a>] lfsck_namespace_striped_dir_rescan+0x86a/0x1220 [lfsck] 2018-05-02 22:06:44 [ 2957.974129] [<ffffffffc115ce71>] lfsck_namespace_assistant_handler_p1+0x18d1/0x1f40 [lfsck] 2018-05-02 22:06:44 [ 2957.983217] [<ffffffff8102954d>] ? __switch_to+0xcd/0x500 2018-05-02 22:06:44 [ 2957.989375] [<ffffffffc114098e>] lfsck_assistant_engine+0x3ce/0x20b0 [lfsck] 2018-05-02 22:06:44 [ 2957.997154] [<ffffffff810cb0b5>] ? sched_clock_cpu+0x85/0xc0 2018-05-02 22:06:44 [ 2958.003538] [<ffffffff8102954d>] ? __switch_to+0xcd/0x500 2018-05-02 22:06:44 [ 2958.009648] [<ffffffff810c7c70>] ? default_wake_function+0x0/0x20 2018-05-02 22:06:44 [ 2958.016449] [<ffffffffc11405c0>] ? lfsck_assistant_engine+0x0/0x20b0 [lfsck] 2018-05-02 22:06:44 [ 2958.024186] [<ffffffff810b4031>] kthread+0xd1/0xe0 2018-05-02 22:06:44 [ 2958.029662] [<ffffffff810b3f60>] ? kthread+0x0/0xe0 2018-05-02 22:06:44 [ 2958.035220] [<ffffffff816c055d>] ret_from_fork+0x5d/0xb0 2018-05-02 22:06:44 [ 2958.041197] [<ffffffff810b3f60>] ? kthread+0x0/0xe0 2018-05-02 22:06:44 [ 2958.046723] 2018-05-02 22:06:44 [ 2958.048771] Kernel panic - not syncing: LBUG 2018-05-02 22:06:44 [ 2958.053576] CPU: 2 PID: 33236 Comm: lfsck_namespace Tainted: P OE ------------ 3.10.0-693.21.1.el7.x86_64 #1 2018-05-02 22:06:44 [ 2958.065051] Hardware name: Dell Inc. 
PowerEdge R740/0JM3W2, BIOS 1.3.7 02/08/2018 2018-05-02 22:06:44 [ 2958.073066] Call Trace: 2018-05-02 22:06:44 [ 2958.076060] [<ffffffff816ae7c8>] dump_stack+0x19/0x1b 2018-05-02 22:06:44 [ 2958.081738] [<ffffffff816a8634>] panic+0xe8/0x21f 2018-05-02 22:06:44 [ 2958.087058] [<ffffffffc0645854>] lbug_with_loc+0x64/0xb0 [libcfs] 2018-05-02 22:06:44 [ 2958.093781] [<ffffffffc0d82573>] dt_mode_to_dft+0x73/0x80 [obdclass] 2018-05-02 22:06:44 [ 2958.100741] [<ffffffffc115ac81>] lfsck_namespace_repair_dangling+0x621/0xf40 [lfsck] 2018-05-02 22:06:45 [ 2958.109091] [<ffffffffc0d7ea22>] ? htable_lookup+0x102/0x180 [obdclass] 2018-05-02 22:06:45 [ 2958.116294] [<ffffffffc1186f4a>] lfsck_namespace_striped_dir_rescan+0x86a/0x1220 [lfsck] 2018-05-02 22:06:45 [ 2958.124963] [<ffffffffc115ce71>] lfsck_namespace_assistant_handler_p1+0x18d1/0x1f40 [lfsck] 2018-05-02 22:06:45 [ 2958.133889] [<ffffffff8102954d>] ? __switch_to+0xcd/0x500 2018-05-02 22:06:45 [ 2958.139866] [<ffffffffc114098e>] lfsck_assistant_engine+0x3ce/0x20b0 [lfsck] 2018-05-02 22:06:45 [ 2958.147492] [<ffffffff810cb0b5>] ? sched_clock_cpu+0x85/0xc0 2018-05-02 22:06:45 [ 2958.153724] [<ffffffff8102954d>] ? __switch_to+0xcd/0x500 2018-05-02 22:06:45 [ 2958.159689] [<ffffffff810c7c70>] ? wake_up_state+0x20/0x20 2018-05-02 22:06:45 [ 2958.165742] [<ffffffffc11405c0>] ? lfsck_master_engine+0x1310/0x1310 [lfsck] 2018-05-02 22:06:45 [ 2958.173343] [<ffffffff810b4031>] kthread+0xd1/0xe0 2018-05-02 22:06:45 [ 2958.178685] [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40 2018-05-02 22:06:45 [ 2958.185227] [<ffffffff816c055d>] ret_from_fork+0x5d/0xb0 2018-05-02 22:06:45 [ 2958.191065] [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40 2018-05-02 22:06:45 [ 2958.197613] Kernel Offset: disabled I've failed the MDT's back to warble2 and mounted them by hand with -o skip_lfsck cheers, |
| Comment by nasf (Inactive) [ 02/May/18 ] |
So you have applied the patch https://review.whamcloud.com/31915/ on the MDT, right? If yes, there must be other corner cases in lfsck_namespace_repair_dangling() that miss setting the "mode". I will investigate more. Anyway, the existing patches are still valid. |
| Comment by Peter Jones [ 02/May/18 ] |
|
I've created a new ticket - |
| Comment by SC Admin (Inactive) [ 02/May/18 ] |
|
yes, that patch is what I've called lu10887-lfsck.patch and is applied. cheers, |
| Comment by Gerrit Updater [ 03/May/18 ] |
|
John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/32076/ |