[LU-5085] ofd_create()) charlie-OST004f: unable to precreate: rc = -116 Created: 19/May/14  Updated: 11/Jun/14  Resolved: 11/Jun/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.1
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Daire Byrne (Inactive) Assignee: Niu Yawei (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Attachments: File e2fsck.tar.gz    
Severity: 3
Rank (Obsolete): 14021

 Description   

Hi,

Unfortunately, we suffered a severe RAID controller failure over the weekend. We have gotten the RAID back online, but even after a filesystem check we are still seeing the errors below on two of the OSTs. How severe is this error? I recall that previous versions of Lustre had a LAST_ID file that the precreated objects depended upon; perhaps this has been corrupted on these filesystems? Please advise.

May 19 15:29:15 coss5 kernel: LDISKFS-fs (dm-15): mounted filesystem with ordered data mode. quota=on. Opts: 
May 19 15:29:16 coss5 kernel: LDISKFS-fs (dm-12): mounted filesystem with ordered data mode. quota=on. Opts: 
May 19 15:29:16 coss5 kernel: Lustre: charlie-OST0053: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-450
May 19 15:29:16 coss5 kernel: Lustre: Skipped 5 previous similar messages
May 19 15:29:17 coss5 kernel: LDISKFS-fs (dm-16): mounted filesystem with ordered data mode. quota=on. Opts: 
May 19 15:29:19 coss5 rlogind[15167]: PAM authentication failed for in.rlogind
May 19 15:29:25 coss5 kernel: Lustre: charlie-OST0053: Will be in recovery for at least 2:30, or until 4 clients reconnect
May 19 15:29:25 coss5 kernel: Lustre: charlie-OST0052: Will be in recovery for at least 2:30, or until 8 clients reconnect
May 19 15:29:25 coss5 kernel: Lustre: Skipped 4 previous similar messages
May 19 15:29:25 coss5 kernel: Lustre: Skipped 4 previous similar messages
May 19 15:31:36 coss5 kernel: Lustre: charlie-OST0044: recovery is timed out, evict stale exports
May 19 15:31:36 coss5 kernel: Lustre: charlie-OST0044: disconnecting 2 stale clients
May 19 15:31:36 coss5 kernel: Lustre: charlie-OST0047: Recovery over after 2:30, of 4 clients 2 recovered and 2 were evicted.
May 19 15:31:36 coss5 kernel: Lustre: charlie-OST004a: Recovery over after 2:30, of 4 clients 2 recovered and 2 were evicted.
May 19 15:31:37 coss5 kernel: LustreError: 15571:0:(ofd_obd.c:1339:ofd_create()) charlie-OST0049: unable to precreate: rc = -116
May 19 15:31:37 coss5 kernel: LustreError: 14051:0:(ofd_obd.c:1339:ofd_create()) charlie-OST004a: unable to precreate: rc = -116
May 19 15:31:44 coss5 kernel: Lustre: charlie-OST004c: recovery is timed out, evict stale exports
May 19 15:31:44 coss5 kernel: Lustre: Skipped 7 previous similar messages
May 19 15:31:44 coss5 kernel: Lustre: charlie-OST004c: disconnecting 2 stale clients
May 19 15:31:44 coss5 kernel: Lustre: Skipped 7 previous similar messages
May 19 15:31:44 coss5 kernel: Lustre: charlie-OST004c: Recovery over after 2:30, of 4 clients 2 recovered and 2 were evicted.
May 19 15:31:44 coss5 kernel: Lustre: Skipped 6 previous similar messages
May 19 15:31:45 coss5 kernel: LustreError: 15233:0:(ofd_obd.c:1339:ofd_create()) charlie-OST004f: unable to precreate: rc = -116
May 19 15:31:46 coss5 kernel: LustreError: 14594:0:(ofd_obd.c:1339:ofd_create()) charlie-OST004a: unable to precreate: rc = -116
May 19 15:31:54 coss5 kernel: LustreError: 14053:0:(ofd_obd.c:1339:ofd_create()) charlie-OST004f: unable to precreate: rc = -116
May 19 15:31:55 coss5 kernel: Lustre: charlie-OST0052: recovery is timed out, evict stale exports
May 19 15:31:55 coss5 kernel: Lustre: Skipped 5 previous similar messages
May 19 15:31:55 coss5 kernel: Lustre: charlie-OST0054: disconnecting 2 stale clients
May 19 15:31:55 coss5 kernel: Lustre: Skipped 5 previous similar messages
May 19 15:31:55 coss5 kernel: Lustre: charlie-OST0052: Recovery over after 2:30, of 8 clients 2 recovered and 6 were evicted.
May 19 15:31:55 coss5 kernel: Lustre: Skipped 5 previous similar messages
May 19 15:32:04 coss5 kernel: LustreError: 14053:0:(ofd_obd.c:1339:ofd_create()) charlie-OST004f: unable to precreate: rc = -116
May 19 15:32:04 coss5 kernel: LustreError: 14053:0:(ofd_obd.c:1339:ofd_create()) Skipped 1 previous similar message
May 19 15:32:14 coss5 kernel: LustreError: 14053:0:(ofd_obd.c:1339:ofd_create()) charlie-OST004f: unable to precreate: rc = -116
May 19 15:32:14 coss5 kernel: LustreError: 14053:0:(ofd_obd.c:1339:ofd_create()) Skipped 1 previous similar message
May 19 15:32:34 coss5 kernel: LustreError: 15233:0:(ofd_obd.c:1339:ofd_create()) charlie-OST004f: unable to precreate: rc = -116
May 19 15:32:34 coss5 kernel: LustreError: 15233:0:(ofd_obd.c:1339:ofd_create()) Skipped 3 previous similar messages
May 19 15:33:06 coss5 kernel: LustreError: 14051:0:(ofd_obd.c:1339:ofd_create()) charlie-OST004a: unable to precreate: rc = -116
May 19 15:33:06 coss5 kernel: LustreError: 14051:0:(ofd_obd.c:1339:ofd_create()) Skipped 6 previous similar messages
May 19 15:34:14 coss5 kernel: LustreError: 15574:0:(ofd_obd.c:1339:ofd_create()) charlie-OST004f: unable to precreate: rc = -116
May 19 15:34:14 coss5 kernel: LustreError: 15574:0:(ofd_obd.c:1339:ofd_create()) Skipped 12 previous similar messages
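
For reference, the precreate counters live in the LAST_ID file on each OST and in the lov_objid file on the MDT, and the two can be compared to spot corruption. A minimal read-only sketch, using the device names that appear later in this ticket:

# On the MDS, with the MDT mounted as ldiskfs:
mount -t ldiskfs /dev/md0 /mnt/mdt/
od -Ax -td8 /mnt/mdt/lov_objid

# On the OSS, dump LAST_ID from a suspect OST without mounting it (catastrophic read-only mode):
debugfs -c -R 'dump /O/0/LAST_ID /tmp/LAST_ID' /dev/mapper/mpathg
od -Ax -td8 /tmp/LAST_ID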


 Comments   
Comment by Daire Byrne (Inactive) [ 19/May/14 ]

mpathg = OST004a
mpathk = OST004f

# mount -t ldiskfs /dev/md0 /mnt/mdt/
# od -Ax -td8 /mnt/mdt/lov_objid
000000              7636305              7631922
000010              7604023              7607262
000020              7609194              7635483
000030              7626341              7633994
000040              7600558              7660819
000050              7652393              7617208
000060              7638862              7582615
000070              7630481              7624203
000080              7659168              7601867
000090              7614279              7634873
0000a0              7624073              7602352
0000b0              7579702              7655306
0000c0              7663219              7662922
0000d0              7639392              7669375
0000e0              7642283              7606244
0000f0              7636678              7645044
000100              7627300              7640592
000110              7589011              7618340
000120              7644965              7576647
000130              7607734              7629279
000140              7643935              7637064
000150              7622873              7664552
000160              7635320              7626228
000170              7636578              7649010
000180              7647537              7558117
000190              7635820              7610564
0001a0              7629527              7628570
0001b0              7639522              7606413
0001c0              7658757              7662943
0001d0              7628107              7592553
0001e0              7659005              7607434
0001f0              7665872              7637975
000200              7650983              7648721
000210              7627963              7634940
000220              7637635              7603939
000230              7648391              7646755
000240              7625799              7630426
000250              7622234              7621574
000260              7630184              7605893
000270              7660449              7630812
000280              7621539              7630278
000290              7629892              7639900
0002a0              7634208              7614781
0002b0              7637690              7638067
0002c0              7635077              7632355
0002d0              7643190              7635842
0002e0              7551592              7595546
0002f0              7625655              7606407
000300              7662671              7647106
000310              7605237              7601591
000320              7621326              7647585
000330

[root@coss5 ~]# debugfs -c -R 'dump /O/0/LAST_ID /tmp/LAST_ID' /dev/mapper/mpathg ; od -Ax -td8 /tmp/LAST_ID
debugfs 1.42.7.wc1 (12-Apr-2013)
/dev/mapper/mpathg: catastrophic mode - not reading inode or group bitmaps
000000              7622082
000008

[root@coss5 ~]# debugfs -c -R 'dump /O/0/LAST_ID /tmp/LAST_ID' /dev/mapper/mpathk ; od -Ax -td8 /tmp/LAST_ID
debugfs 1.42.7.wc1 (12-Apr-2013)
/dev/mapper/mpathk: catastrophic mode - not reading inode or group bitmaps
000000              7630564
000008
Comment by Peter Jones [ 19/May/14 ]

Niu

Could you please advise on this one?

Thanks

Peter

Comment by Daire Byrne (Inactive) [ 19/May/14 ]
[root@coss5 ~]# ls -1s /tmp/ost/O/0/d* | grep -v [a-z] | sort -k2 -n > /tmp/objects.OST004a
ls: cannot access /tmp/ost/O/0/d10/7622090: Input/output error
ls: cannot access /tmp/ost/O/0/d11/7622091: Input/output error
ls: cannot access /tmp/ost/O/0/d13/7622093: Input/output error
ls: cannot access /tmp/ost/O/0/d14/7622094: Input/output error
ls: cannot access /tmp/ost/O/0/d16/7622096: Input/output error
ls: cannot access /tmp/ost/O/0/d20/7622100: Input/output error
ls: cannot access /tmp/ost/O/0/d21/7622101: Input/output error
ls: cannot access /tmp/ost/O/0/d25/7622105: Input/output error
ls: cannot access /tmp/ost/O/0/d26/7622106: Input/output error
ls: cannot access /tmp/ost/O/0/d3/7622083: Input/output error
ls: cannot access /tmp/ost/O/0/d30/7622110: Input/output error
ls: cannot access /tmp/ost/O/0/d31/7622111: Input/output error
ls: cannot access /tmp/ost/O/0/d5/7622085: Input/output error
ls: cannot access /tmp/ost/O/0/d6/7622086: Input/output error
ls: cannot access /tmp/ost/O/0/d7/7622087: Input/output error

[root@coss5 ~]# tail -20 /tmp/objects.OST004a 
       0 7622078
       0 7622079
      0 7622080
      0 7622081
       0 7622082
      ? 7622083
       ? 7622085
      ? 7622086
        ? 7622087
       ? 7622090
        ? 7622091
       ? 7622093
       ? 7622094
       ? 7622096
       ? 7622100
       ? 7622101
       ? 7622105
       ? 7622106
       ? 7622110
       ? 7622111

[root@coss5 ~]# ls -1s /tmp/ost/O/0/d* | grep -v [a-z] | sort -k2 -n > /tmp/objects.OST004f
ls: cannot access /tmp/ost/O/0/d10/7630570: Input/output error
ls: cannot access /tmp/ost/O/0/d26/7630586: Input/output error
ls: cannot access /tmp/ost/O/0/d30/7630590: Input/output error
ls: cannot access /tmp/ost/O/0/d31/7630591: Input/output error
ls: cannot access /tmp/ost/O/0/d5/7630565: Input/output error
ls: cannot access /tmp/ost/O/0/d8/7630568: Input/output error
[root@coss5 ~]# tail -20 /tmp/objects.OST004f 
       0 7630551
       0 7630552
       0 7630553
       0 7630554
       0 7630555
       0 7630556
      0 7630557
       0 7630558
        0 7630559
        0 7630560
       0 7630561
      0 7630562
       0 7630563
      0 7630564
       ? 7630565
        ? 7630568
       ? 7630570
       ? 7630586
       ? 7630590
        ? 7630591

Comment by Daire Byrne (Inactive) [ 19/May/14 ]

Attaching the output of the e2fsck runs for all 17 OSTs on the affected server. It looks like a few of them dumped files into lost+found. mpathg and mpathk are the ones currently showing input/output errors on the precreated objects.

Please note that the output for each contains two runs: an initial read-only run and then a second run with -y.

Comment by Niu Yawei (Inactive) [ 20/May/14 ]

I think it's because the unused/deleted inodes weren't actually cleared by e2fsck (see LU-3542); 1.42.7.wc1 may not have the fix from LU-3542.

Andreas, do you know (or know how to tell) which release of e2fsprogs has that fix?
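
For reference, the installed version can be checked directly on the servers; a minimal sketch (the rpm query assumes an RPM-based distribution):

# e2fsck prints its version as the first line of every run, e.g. "e2fsck 1.42.7.wc1 (12-Apr-2013)".
# The installed package can also be queried:
rpm -q e2fsprogs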

Comment by Daire Byrne (Inactive) [ 20/May/14 ]

So can I confirm that I can use debugfs -c "unlink <path>" (or rm when ldiskfs mounted) to clear these problematic files? They all seem to be of the 0-byte pre-allocated variety. It looks like we have 3 OSTs with this issue. If it helps speed things up a little, can we raise the priority of this ticket? The filesystem is currently offline for fear of causing further damage to the OSTs.

[root@coss5 ~]# tail -n80 /tmp/objects.mpathg
       4 7622018
     24 7622019
       4 7622020
      68 7622021
    136 7622022
        0 7622023
       0 7622024
      0 7622025
       0 7622026
        0 7622027
      0 7622028
       0 7622029
       0 7622030
        0 7622031
       0 7622032
        0 7622033
       0 7622034
       0 7622035
       0 7622036
       0 7622037
      0 7622038
       0 7622039
       0 7622040
       0 7622041
       0 7622042
       0 7622043
       0 7622044
      0 7622045
       0 7622046
       0 7622047
      0 7622048
      0 7622049
       0 7622050
      0 7622051
       0 7622052
       0 7622053
      0 7622054
        0 7622055
       0 7622056
      0 7622057
       0 7622058
        0 7622059
      0 7622060
       0 7622061
       0 7622062
        0 7622063
       0 7622064
        0 7622065
       0 7622066
       0 7622067
       0 7622068
       0 7622069
      0 7622070
       0 7622071
       0 7622072
       0 7622073
       0 7622074
       0 7622075
       0 7622076
      0 7622077
       0 7622078
       0 7622079
      0 7622080
      0 7622081
       0 7622082
      ? 7622083
       ? 7622085
      ? 7622086
        ? 7622087
       ? 7622090
        ? 7622091
       ? 7622093
       ? 7622094
       ? 7622096
       ? 7622100
       ? 7622101
       ? 7622105
       ? 7622106
       ? 7622110
       ? 7622111

[root@coss5 ~]# tail -n120 /tmp/objects.mpathh
     60 7630129
       8 7630130
       8 7630131
       4 7630132
       8 7630133
       4 7630134
       4 7630135
        8 7630136
      56 7630137
     88 7630138
      32 7630139
      8 7630140
     136 7630141
      40 7630142
       56 7630143
      52 7630144
      16 7630145
       0 7630146
      0 7630147
      0 7630148
      0 7630149
       0 7630150
       0 7630151
       0 7630152
       0 7630153
       0 7630154
      0 7630155
      0 7630156
       0 7630157
      0 7630158
       0 7630159
       0 7630160
      0 7630161
       0 7630162
       0 7630163
       0 7630164
       0 7630165
       0 7630166
       0 7630167
        0 7630168
       0 7630169
      0 7630170
       0 7630171
      0 7630172
       0 7630173
       0 7630174
        0 7630175
       0 7630176
       0 7630177
       0 7630178
      0 7630179
      0 7630180
      0 7630181
       0 7630182
       0 7630183
       0 7630184
       0 7630185
       0 7630186
      0 7630187
      0 7630188
       0 7630189
      0 7630190
       0 7630191
       0 7630192
      0 7630193
       0 7630194
       0 7630195
       0 7630196
       0 7630197
       0 7630198
       0 7630199
        0 7630200
       0 7630201
      0 7630202
       0 7630203
      0 7630204
       0 7630205
       0 7630206
        0 7630207
       0 7630208
       0 7630209
       ? 7630210
      ? 7630211
       ? 7630215
       ? 7630216
      ? 7630219
       ? 7630221
        ? 7630232
       ? 7630240
       ? 7630241
      ? 7630243
       ? 7630247
       ? 7630248
       ? 7630249
       ? 7630250
      ? 7630257
       ? 7630259
       ? 7630260
       ? 7630263
       ? 7630265
      ? 7630266
      ? 7630268
       ? 7630272
       ? 7630274
      ? 7630275
      ? 7630276
       ? 7630279
       ? 7630280
       ? 7630282
      ? 7630284
       ? 7630285
      ? 7630286
       ? 7630287
       ? 7630288
      ? 7630289
       ? 7630291
       ? 7630292
       ? 7630294
       ? 7630299
       ? 7630304

[root@coss5 ~]# tail -n50 /tmp/objects.mpathk
      32 7630521
     112 7630522
       0 7630523
       0 7630524
      0 7630525
       0 7630526
        0 7630527
        0 7630528
       0 7630529
      0 7630530
       0 7630531
      0 7630532
       0 7630533
       0 7630534
      0 7630535
        0 7630536
       0 7630537
       0 7630538
       0 7630539
      0 7630540
      0 7630541
       0 7630542
       0 7630543
        0 7630544
       0 7630545
       0 7630546
       0 7630547
       0 7630548
       0 7630549
      0 7630550
       0 7630551
       0 7630552
       0 7630553
       0 7630554
       0 7630555
       0 7630556
      0 7630557
       0 7630558
        0 7630559
        0 7630560
       0 7630561
      0 7630562
       0 7630563
      0 7630564
       ? 7630565
        ? 7630568
       ? 7630570
       ? 7630586
       ? 7630590
        ? 7630591
Comment by Daire Byrne (Inactive) [ 20/May/14 ]

I went ahead and used debugfs to unlink the bad files. The filesystem seems to be operating normally again. I'll keep an eye on it.

Comment by Niu Yawei (Inactive) [ 20/May/14 ]

So can I confirm that I can use debugfs -c "unlink <path>" (or rm when ldiskfs mounted) to clear these problematic files?
...
I went ahead and used debugfs to unlink the bad files. The filesystem seems to be operating normally again. I'll keep an eye on it.

Yes, it's safe to unlink them; they should have been removed by e2fsck anyway. Thank you.
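
For anyone hitting the same problem, a minimal sketch of this kind of cleanup, assuming the OST is unmounted and using object paths taken from the listings earlier in this ticket (these are illustrative examples, not the exact commands that were run):

# debugfs needs -w to open the device read-write; objects live under /O/0/d<objid mod 32>/<objid>.
debugfs -w -R 'unlink /O/0/d3/7622083' /dev/mapper/mpathg

# Or loop over a list of bad object IDs:
for obj in 7622083 7622085 7622086; do
    debugfs -w -R "unlink /O/0/d$((obj % 32))/$obj" /dev/mapper/mpathg
done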

Comment by John Fuchs-Chesney (Inactive) [ 20/May/14 ]

Daire,
Are you still seeing normal filesystem operations?
Have you brought the fs back into production mode?
Thanks,
~ jfc.

Comment by Daire Byrne (Inactive) [ 20/May/14 ]

Yes, the filesystem is accessible again and seems stable. I am seeing a bunch of "lvbo_init failed for resource" errors for one of the OSTs, but I assume these likely correspond to files that were moved to lost+found. A small number of missing files is not an issue for us. Hopefully it will settle down in a day or two.

May 20 13:05:41 coss5 kernel: LustreError: 28501:0:(ldlm_resource.c:1165:ldlm_resource_get()) charlie-OST004e: lvbo_init failed for resource 0x6ef5c3:0x0: rc = -2

[root@coss5 ~]# grep "charlie-OST004e: lvbo_init" /var/log/messages | awk '{print $13}' | sort | uniq -c
      3 0x187aa3:0x0:
      1 0x1b8a03:0x0:
      1 0x2bd183:0x0:
      1 0x40a323:0x0:
      1 0x427ca3:0x0:
      1 0x44f7a3:0x0:
      1 0x4824e3:0x0:
      1 0x4f8923:0x0:
      1 0x540a63:0x0:
      1 0x563163:0x0:
      2 0x5c4fa3:0x0:
      1 0x5e8783:0x0:
      1 0x6515c3:0x0:
      1 0x654a23:0x0:
      2 0x6db323:0x0:
      1 0x6ef5c3:0x0:
      1 0x6efe43:0x0:
      1 0x6f0ee3:0x0:
Comment by Niu Yawei (Inactive) [ 21/May/14 ]

Yes, the filesystem is accessible again and seems stable. I am seeing a bunch of "lvbo_init failed for resource" errors for one of the OSTs, but I assume these likely correspond to files that were moved to lost+found.

Yes, you can verify that by checking the lost+found directory (with debugfs).
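
A minimal read-only sketch of that check; the device path is a placeholder for whichever /dev/mapper entry backs OST004e:

# List what e2fsck moved into lost+found (reconnected entries are named by inode number):
debugfs -c -R 'ls -l /lost+found' /dev/mapper/mpathX

# The first field of the lvbo_init resource name should be the object id in hex;
# it can be converted to decimal to match against the object directories, e.g.:
printf '%d\n' 0x6ef5c3    # -> 7271875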

Comment by John Fuchs-Chesney (Inactive) [ 10/Jun/14 ]

Daire, should we anticipate any further work being required on this ticket?
Or may we go ahead and mark it as resolved 'Not a Bug'?
Thanks,
~ jfc.

Comment by Daire Byrne (Inactive) [ 11/Jun/14 ]

Yes, you can close it now. One outstanding question, though: what version of e2fsck should we use to avoid this issue in the future?

Comment by Peter Jones [ 11/Jun/14 ]

Thanks, Daire. The fix has been pushed upstream, so it is included in the latest e2fsprogs, 1.42.9.wc1.
