[LU-5085] ofd_create() charlie-OST004f: unable to precreate: rc = -116 Created: 19/May/14 Updated: 11/Jun/14 Resolved: 11/Jun/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Daire Byrne (Inactive) | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 14021 |
| Description |
|
Hi,

Unfortunately we suffered a severe RAID controller failure over the weekend. We have gotten the RAID back online, but even after a filesystem check we seem to be getting this on two of the OSTs. How severe is this error? I recall that previous versions of Lustre had a concept of LAST_ID which the precreated objects depended upon. Maybe this has been corrupted on these filesystems? Please advise.

May 19 15:29:15 coss5 kernel: LDISKFS-fs (dm-15): mounted filesystem with ordered data mode. quota=on. Opts:
May 19 15:29:16 coss5 kernel: LDISKFS-fs (dm-12): mounted filesystem with ordered data mode. quota=on. Opts:
May 19 15:29:16 coss5 kernel: Lustre: charlie-OST0053: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-450
May 19 15:29:16 coss5 kernel: Lustre: Skipped 5 previous similar messages
May 19 15:29:17 coss5 kernel: LDISKFS-fs (dm-16): mounted filesystem with ordered data mode. quota=on. Opts:
May 19 15:29:19 coss5 rlogind[15167]: PAM authentication failed for in.rlogind
May 19 15:29:25 coss5 kernel: Lustre: charlie-OST0053: Will be in recovery for at least 2:30, or until 4 clients reconnect
May 19 15:29:25 coss5 kernel: Lustre: charlie-OST0052: Will be in recovery for at least 2:30, or until 8 clients reconnect
May 19 15:29:25 coss5 kernel: Lustre: Skipped 4 previous similar messages
May 19 15:29:25 coss5 kernel: Lustre: Skipped 4 previous similar messages
May 19 15:31:36 coss5 kernel: Lustre: charlie-OST0044: recovery is timed out, evict stale exports
May 19 15:31:36 coss5 kernel: Lustre: charlie-OST0044: disconnecting 2 stale clients
May 19 15:31:36 coss5 kernel: Lustre: charlie-OST0047: Recovery over after 2:30, of 4 clients 2 recovered and 2 were evicted.
May 19 15:31:36 coss5 kernel: Lustre: charlie-OST004a: Recovery over after 2:30, of 4 clients 2 recovered and 2 were evicted.
May 19 15:31:37 coss5 kernel: LustreError: 15571:0:(ofd_obd.c:1339:ofd_create()) charlie-OST0049: unable to precreate: rc = -116
May 19 15:31:37 coss5 kernel: LustreError: 14051:0:(ofd_obd.c:1339:ofd_create()) charlie-OST004a: unable to precreate: rc = -116
May 19 15:31:44 coss5 kernel: Lustre: charlie-OST004c: recovery is timed out, evict stale exports
May 19 15:31:44 coss5 kernel: Lustre: Skipped 7 previous similar messages
May 19 15:31:44 coss5 kernel: Lustre: charlie-OST004c: disconnecting 2 stale clients
May 19 15:31:44 coss5 kernel: Lustre: Skipped 7 previous similar messages
May 19 15:31:44 coss5 kernel: Lustre: charlie-OST004c: Recovery over after 2:30, of 4 clients 2 recovered and 2 were evicted.
May 19 15:31:44 coss5 kernel: Lustre: Skipped 6 previous similar messages
May 19 15:31:45 coss5 kernel: LustreError: 15233:0:(ofd_obd.c:1339:ofd_create()) charlie-OST004f: unable to precreate: rc = -116
May 19 15:31:46 coss5 kernel: LustreError: 14594:0:(ofd_obd.c:1339:ofd_create()) charlie-OST004a: unable to precreate: rc = -116
May 19 15:31:54 coss5 kernel: LustreError: 14053:0:(ofd_obd.c:1339:ofd_create()) charlie-OST004f: unable to precreate: rc = -116
May 19 15:31:55 coss5 kernel: Lustre: charlie-OST0052: recovery is timed out, evict stale exports
May 19 15:31:55 coss5 kernel: Lustre: Skipped 5 previous similar messages
May 19 15:31:55 coss5 kernel: Lustre: charlie-OST0054: disconnecting 2 stale clients
May 19 15:31:55 coss5 kernel: Lustre: Skipped 5 previous similar messages
May 19 15:31:55 coss5 kernel: Lustre: charlie-OST0052: Recovery over after 2:30, of 8 clients 2 recovered and 6 were evicted.
May 19 15:31:55 coss5 kernel: Lustre: Skipped 5 previous similar messages
May 19 15:32:04 coss5 kernel: LustreError: 14053:0:(ofd_obd.c:1339:ofd_create()) charlie-OST004f: unable to precreate: rc = -116
May 19 15:32:04 coss5 kernel: LustreError: 14053:0:(ofd_obd.c:1339:ofd_create()) Skipped 1 previous similar message
May 19 15:32:14 coss5 kernel: LustreError: 14053:0:(ofd_obd.c:1339:ofd_create()) charlie-OST004f: unable to precreate: rc = -116
May 19 15:32:14 coss5 kernel: LustreError: 14053:0:(ofd_obd.c:1339:ofd_create()) Skipped 1 previous similar message
May 19 15:32:34 coss5 kernel: LustreError: 15233:0:(ofd_obd.c:1339:ofd_create()) charlie-OST004f: unable to precreate: rc = -116
May 19 15:32:34 coss5 kernel: LustreError: 15233:0:(ofd_obd.c:1339:ofd_create()) Skipped 3 previous similar messages
May 19 15:33:06 coss5 kernel: LustreError: 14051:0:(ofd_obd.c:1339:ofd_create()) charlie-OST004a: unable to precreate: rc = -116
May 19 15:33:06 coss5 kernel: LustreError: 14051:0:(ofd_obd.c:1339:ofd_create()) Skipped 6 previous similar messages
May 19 15:34:14 coss5 kernel: LustreError: 15574:0:(ofd_obd.c:1339:ofd_create()) charlie-OST004f: unable to precreate: rc = -116
May 19 15:34:14 coss5 kernel: LustreError: 15574:0:(ofd_obd.c:1339:ofd_create()) Skipped 12 previous similar messages
|
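For context, rc = -116 is the Linux errno ESTALE ("Stale file handle"); in this situation it generally indicates that the OST's precreate bookkeeping (LAST_ID) no longer agrees with the objects actually present on disk. A quick, generic way to confirm the errno mapping (nothing Lustre-specific, just an illustration):

$ python3 -c 'import errno, os; print(errno.errorcode[116], os.strerror(116))'
ESTALE Stale file handle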
| Comments |
| Comment by Daire Byrne (Inactive) [ 19/May/14 ] |
|
mpathg = OST004a

# mount -t ldiskfs /dev/md0 /mnt/mdt/
# od -Ax -td8 /mnt/mdt/lov_objid
000000 7636305 7631922
000010 7604023 7607262
000020 7609194 7635483
000030 7626341 7633994
000040 7600558 7660819
000050 7652393 7617208
000060 7638862 7582615
000070 7630481 7624203
000080 7659168 7601867
000090 7614279 7634873
0000a0 7624073 7602352
0000b0 7579702 7655306
0000c0 7663219 7662922
0000d0 7639392 7669375
0000e0 7642283 7606244
0000f0 7636678 7645044
000100 7627300 7640592
000110 7589011 7618340
000120 7644965 7576647
000130 7607734 7629279
000140 7643935 7637064
000150 7622873 7664552
000160 7635320 7626228
000170 7636578 7649010
000180 7647537 7558117
000190 7635820 7610564
0001a0 7629527 7628570
0001b0 7639522 7606413
0001c0 7658757 7662943
0001d0 7628107 7592553
0001e0 7659005 7607434
0001f0 7665872 7637975
000200 7650983 7648721
000210 7627963 7634940
000220 7637635 7603939
000230 7648391 7646755
000240 7625799 7630426
000250 7622234 7621574
000260 7630184 7605893
000270 7660449 7630812
000280 7621539 7630278
000290 7629892 7639900
0002a0 7634208 7614781
0002b0 7637690 7638067
0002c0 7635077 7632355
0002d0 7643190 7635842
0002e0 7551592 7595546
0002f0 7625655 7606407
000300 7662671 7647106
000310 7605237 7601591
000320 7621326 7647585
000330
[root@coss5 ~]# debugfs -c -R 'dump /O/0/LAST_ID /tmp/LAST_ID' /dev/mapper/mpathg ; od -Ax -td8 /tmp/LAST_ID
debugfs 1.42.7.wc1 (12-Apr-2013)
/dev/mapper/mpathg: catastrophic mode - not reading inode or group bitmaps
000000 7622082
000008
[root@coss5 ~]# debugfs -c -R 'dump /O/0/LAST_ID /tmp/LAST_ID' /dev/mapper/mpathk ; od -Ax -td8 /tmp/LAST_ID
debugfs 1.42.7.wc1 (12-Apr-2013)
/dev/mapper/mpathk: catastrophic mode - not reading inode or group bitmaps
000000 7630564
000008
|
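A note on reading the dump above: lov_objid on the MDT should be a flat array of little-endian 64-bit "last allocated object ID" values, one entry per OST index, so (assuming that layout) the entry for a single OST can be pulled out directly. OST004a is index 0x4a = 74, i.e. byte offset 74 * 8 = 0x250, which is the 7622234 shown on the 000250 line:

# od -Ax -td8 -j $((74 * 8)) -N 8 /mnt/mdt/lov_objid

Comparing that value (7622234) with the OST's own LAST_ID (7622082) suggests the MDT expects objects beyond what the OST believes it has created, which is consistent with the precreate errors in the description.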
| Comment by Peter Jones [ 19/May/14 ] |
|
Niu, could you please advise on this one? Thanks. Peter |
| Comment by Daire Byrne (Inactive) [ 19/May/14 ] |
[root@coss5 ~]# ls -1s /tmp/ost/O/0/d* | grep -v [a-z] | sort -k2 -n > /tmp/objects.OST004a
ls: cannot access /tmp/ost/O/0/d10/7622090: Input/output error
ls: cannot access /tmp/ost/O/0/d11/7622091: Input/output error
ls: cannot access /tmp/ost/O/0/d13/7622093: Input/output error
ls: cannot access /tmp/ost/O/0/d14/7622094: Input/output error
ls: cannot access /tmp/ost/O/0/d16/7622096: Input/output error
ls: cannot access /tmp/ost/O/0/d20/7622100: Input/output error
ls: cannot access /tmp/ost/O/0/d21/7622101: Input/output error
ls: cannot access /tmp/ost/O/0/d25/7622105: Input/output error
ls: cannot access /tmp/ost/O/0/d26/7622106: Input/output error
ls: cannot access /tmp/ost/O/0/d3/7622083: Input/output error
ls: cannot access /tmp/ost/O/0/d30/7622110: Input/output error
ls: cannot access /tmp/ost/O/0/d31/7622111: Input/output error
ls: cannot access /tmp/ost/O/0/d5/7622085: Input/output error
ls: cannot access /tmp/ost/O/0/d6/7622086: Input/output error
ls: cannot access /tmp/ost/O/0/d7/7622087: Input/output error
[root@coss5 ~]# tail -20 /tmp/objects.OST004a
0 7622078
0 7622079
0 7622080
0 7622081
0 7622082
? 7622083
? 7622085
? 7622086
? 7622087
? 7622090
? 7622091
? 7622093
? 7622094
? 7622096
? 7622100
? 7622101
? 7622105
? 7622106
? 7622110
? 7622111
[root@coss5 ~]# ls -1s /tmp/ost/O/0/d* | grep -v [a-z] | sort -k2 -n > /tmp/objects.OST004f
ls: cannot access /tmp/ost/O/0/d10/7630570: Input/output error
ls: cannot access /tmp/ost/O/0/d26/7630586: Input/output error
ls: cannot access /tmp/ost/O/0/d30/7630590: Input/output error
ls: cannot access /tmp/ost/O/0/d31/7630591: Input/output error
ls: cannot access /tmp/ost/O/0/d5/7630565: Input/output error
ls: cannot access /tmp/ost/O/0/d8/7630568: Input/output error
[root@coss5 ~]# tail -20 /tmp/objects.OST004f
0 7630551
0 7630552
0 7630553
0 7630554
0 7630555
0 7630556
0 7630557
0 7630558
0 7630559
0 7630560
0 7630561
0 7630562
0 7630563
0 7630564
? 7630565
? 7630568
? 7630570
? 7630586
? 7630590
? 7630591
|
| Comment by Daire Byrne (Inactive) [ 19/May/14 ] |
|
Attaching the output of the e2fsck runs for all 17 OSTs on the affected server. It looks like a few of them dumped files into lost+found. mpathg and mpathk are the ones currently showing input/output errors on the precreated objects. Please note the output for each contains two runs - an initial read-only pass and then a second -y run. |
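For reference, the two-pass check described above would look something like this (device name illustrative, and the OST must be unmounted):

# e2fsck -fn /dev/mapper/mpathg    # read-only pass: report problems without fixing anything
# e2fsck -fy /dev/mapper/mpathg    # repair pass: answer "yes" to all fixes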
| Comment by Niu Yawei (Inactive) [ 20/May/14 ] |
|
I think it's because the unused/deleted inodes weren't actually cleared by e2fsck. Andreas, do you know (or how can we tell) which release of e2fsprogs has that fix? |
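One way to confirm that theory (a rough sketch, using an object path taken from the listings above) is to stat one of the unreadable objects directly with debugfs and check whether its mode, link count and deletion time look like a stale, uncleared inode:

# debugfs -c -R 'stat /O/0/d3/7622083' /dev/mapper/mpathg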
| Comment by Daire Byrne (Inactive) [ 20/May/14 ] |
|
So can I confirm that I can use debugfs -c "unlink <path>" (or rm when mounted as ldiskfs) to clear these problematic files? They all seem to be of the 0-byte pre-allocated variety. It looks like we have 3 OSTs with this issue. If it helps speed things up a little, can we raise the priority of this ticket? The filesystem is currently offline for fear of causing further damage to the OSTs.

[root@coss5 ~]# tail -n80 /tmp/objects.mpathg
4 7622018
24 7622019
4 7622020
68 7622021
136 7622022
0 7622023
0 7622024
0 7622025
0 7622026
0 7622027
0 7622028
0 7622029
0 7622030
0 7622031
0 7622032
0 7622033
0 7622034
0 7622035
0 7622036
0 7622037
0 7622038
0 7622039
0 7622040
0 7622041
0 7622042
0 7622043
0 7622044
0 7622045
0 7622046
0 7622047
0 7622048
0 7622049
0 7622050
0 7622051
0 7622052
0 7622053
0 7622054
0 7622055
0 7622056
0 7622057
0 7622058
0 7622059
0 7622060
0 7622061
0 7622062
0 7622063
0 7622064
0 7622065
0 7622066
0 7622067
0 7622068
0 7622069
0 7622070
0 7622071
0 7622072
0 7622073
0 7622074
0 7622075
0 7622076
0 7622077
0 7622078
0 7622079
0 7622080
0 7622081
0 7622082
? 7622083
? 7622085
? 7622086
? 7622087
? 7622090
? 7622091
? 7622093
? 7622094
? 7622096
? 7622100
? 7622101
? 7622105
? 7622106
? 7622110
? 7622111
[root@coss5 ~]# tail -n120 /tmp/objects.mpathh
60 7630129
8 7630130
8 7630131
4 7630132
8 7630133
4 7630134
4 7630135
8 7630136
56 7630137
88 7630138
32 7630139
8 7630140
136 7630141
40 7630142
56 7630143
52 7630144
16 7630145
0 7630146
0 7630147
0 7630148
0 7630149
0 7630150
0 7630151
0 7630152
0 7630153
0 7630154
0 7630155
0 7630156
0 7630157
0 7630158
0 7630159
0 7630160
0 7630161
0 7630162
0 7630163
0 7630164
0 7630165
0 7630166
0 7630167
0 7630168
0 7630169
0 7630170
0 7630171
0 7630172
0 7630173
0 7630174
0 7630175
0 7630176
0 7630177
0 7630178
0 7630179
0 7630180
0 7630181
0 7630182
0 7630183
0 7630184
0 7630185
0 7630186
0 7630187
0 7630188
0 7630189
0 7630190
0 7630191
0 7630192
0 7630193
0 7630194
0 7630195
0 7630196
0 7630197
0 7630198
0 7630199
0 7630200
0 7630201
0 7630202
0 7630203
0 7630204
0 7630205
0 7630206
0 7630207
0 7630208
0 7630209
? 7630210
? 7630211
? 7630215
? 7630216
? 7630219
? 7630221
? 7630232
? 7630240
? 7630241
? 7630243
? 7630247
? 7630248
? 7630249
? 7630250
? 7630257
? 7630259
? 7630260
? 7630263
? 7630265
? 7630266
? 7630268
? 7630272
? 7630274
? 7630275
? 7630276
? 7630279
? 7630280
? 7630282
? 7630284
? 7630285
? 7630286
? 7630287
? 7630288
? 7630289
? 7630291
? 7630292
? 7630294
? 7630299
? 7630304
[root@coss5 ~]# tail -n50 /tmp/objects.mpathk
32 7630521
112 7630522
0 7630523
0 7630524
0 7630525
0 7630526
0 7630527
0 7630528
0 7630529
0 7630530
0 7630531
0 7630532
0 7630533
0 7630534
0 7630535
0 7630536
0 7630537
0 7630538
0 7630539
0 7630540
0 7630541
0 7630542
0 7630543
0 7630544
0 7630545
0 7630546
0 7630547
0 7630548
0 7630549
0 7630550
0 7630551
0 7630552
0 7630553
0 7630554
0 7630555
0 7630556
0 7630557
0 7630558
0 7630559
0 7630560
0 7630561
0 7630562
0 7630563
0 7630564
? 7630565
? 7630568
? 7630570
? 7630586
? 7630590
? 7630591
|
| Comment by Daire Byrne (Inactive) [ 20/May/14 ] |
|
I went ahead and used debugfs to unlink the bad files. The filesystem seems to be operating normally again. I'll keep an eye on it. |
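For the record, each unlink would look roughly like this (object path and device are examples taken from the listings above; -w opens the filesystem read-write, so the OST must be unmounted, and debugfs unlink only removes the directory entry without adjusting the inode itself):

# debugfs -w -R 'unlink /O/0/d3/7622083' /dev/mapper/mpathg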
| Comment by Niu Yawei (Inactive) [ 20/May/14 ] |
Yes, it's safe to unlink them; they should have been removed by e2fsck anyway. Thank you. |
| Comment by John Fuchs-Chesney (Inactive) [ 20/May/14 ] |
|
Daire, |
| Comment by Daire Byrne (Inactive) [ 20/May/14 ] |
|
Yes, the filesystem is accessible again and seems stable. I am seeing a bunch of "lvbo_init failed for resource" errors for one of the OSTs, but I assume these are likely files that were moved to lost+found. A low number of missing files is not an issue for us. Hopefully it'll settle down in a day or two.

May 20 13:05:41 coss5 kernel: LustreError: 28501:0:(ldlm_resource.c:1165:ldlm_resource_get()) charlie-OST004e: lvbo_init failed for resource 0x6ef5c3:0x0: rc = -2

[root@coss5 ~]# grep "charlie-OST004e: lvbo_init" /var/log/messages | awk '{print $13}' | sort | uniq -c
3 0x187aa3:0x0:
1 0x1b8a03:0x0:
1 0x2bd183:0x0:
1 0x40a323:0x0:
1 0x427ca3:0x0:
1 0x44f7a3:0x0:
1 0x4824e3:0x0:
1 0x4f8923:0x0:
1 0x540a63:0x0:
1 0x563163:0x0:
2 0x5c4fa3:0x0:
1 0x5e8783:0x0:
1 0x6515c3:0x0:
1 0x654a23:0x0:
2 0x6db323:0x0:
1 0x6ef5c3:0x0:
1 0x6efe43:0x0:
1 0x6f0ee3:0x0:
|
| Comment by Niu Yawei (Inactive) [ 21/May/14 ] |
Yes, you can verify that by checking the lost+found directory (with debugfs). |
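A sketch of that check (the device name is a placeholder for the OST004e volume): the first number in each lvbo_init resource appears to be the missing object ID in hex, e.g. 0x6ef5c3 = 7271875, and ldiskfs OST objects normally live under /O/0/d<objid % 32>, so:

# debugfs -c -R 'ls -l /lost+found' /dev/mapper/<ost004e-device>
# debugfs -c -R 'stat /O/0/d3/7271875' /dev/mapper/<ost004e-device>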
| Comment by John Fuchs-Chesney (Inactive) [ 10/Jun/14 ] |
|
Daire – Should we anticipate any further work required on this ticket? |
| Comment by Daire Byrne (Inactive) [ 11/Jun/14 ] |
|
Yes, you can close it now. One outstanding question, though: what version of e2fsck should we use to avoid this issue in the future? |
| Comment by Peter Jones [ 11/Jun/14 ] |
|
Thanks, Daire. The fix has been pushed upstream and so is included in the latest e2fsprogs - 1.42.9.wc1 |
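If it helps when auditing servers later, the installed version can be confirmed with something like:

# e2fsck -V
# rpm -q e2fsprogs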