[LU-7378] Error destroying object with RC115 Created: 03/Nov/15 Updated: 16/Mar/16 Resolved: 03/Dec/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Dustin Leverman | Assignee: | nasf (Inactive) |
| Resolution: | Not a Bug | Votes: | 0 |
| Labels: | None | ||
| Attachments: | |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
We are encountering an issue on our lustre-2.5.4+ servers where destroying objects fails with RC115, after which the OI scrubber starts. We suspect that the OI scrubber may be causing some jobs to run over wallclock. I have attached some kernel logs, and below is the output from the OI scrubber:

[root@f1-oss1d8 f1-OST00bf]# cat /proc/fs/lustre/osd-ldiskfs/f1-OST00bf/oi_scrub

Thanks,
|
| Comments |
| Comment by Jian Yu [ 04/Nov/15 ] |
|
In the kernel logs, the following error messages kept occurring:

Oct 14 16:30:12 f1-oss1d8 kernel: [2604418.085966] LustreError: 10030:0:(ost_handler.c:1777:ost_blocking_ast()) Error -2 syncing data on lock cancel
Oct 14 16:33:07 f1-oss1d8 kernel: [2604593.041236] Lustre: f1-OST00bf-osd: trigger OI scrub by RPC for [0x100000000:0x33d61e5:0x0], rc = 0 [1]
Oct 14 16:33:07 f1-oss1d8 kernel: [2604593.065422] Lustre: Skipped 3 previous similar messages
Oct 14 16:33:07 f1-oss1d8 kernel: [2604593.076384] LustreError: 99512:0:(ofd_obd.c:1096:ofd_destroy()) f1-OST00bf: error destroying object [0x100000000:0x33d61e5:0x0]: -115
Oct 14 16:33:07 f1-oss1d8 kernel: [2604593.125308] LustreError: 99512:0:(ofd_obd.c:1096:ofd_destroy()) Skipped 11 previous similar messages
Oct 14 16:33:11 f1-oss1d8 kernel: [2604597.299596] LustreError: 38769:0:(osd_compat.c:598:osd_obj_update_entry()) f1-OST00bf-osd: the FID [0x100000000:0x2fef612:0x0] is used by two objects: 6037283/2343154970 562883/2343154970
Oct 14 16:33:11 f1-oss1d8 kernel: [2604597.348347] LustreError: 38769:0:(osd_compat.c:598:osd_obj_update_entry()) Skipped 15 previous similar messages

The "-115" (-EINPROGRESS) error came from ofd_destroy -> ofd_destroy_by_fid -> ofd_object_find -> … -> osd_object_init -> osd_fid_lookup():

        triggered = true;
        if (thread_is_running(&scrub->os_thread)) {
                result = -EINPROGRESS;
        } else if (!dev->od_noscrub) {
                result = osd_scrub_start(dev);
                LCONSOLE_WARN("%.16s: trigger OI scrub by RPC "
                              "for "DFID", rc = %d [1]\n",
                              osd_name(dev), PFID(fid), result);
                if (result == 0 || result == -EALREADY)
                        result = -EINPROGRESS;
                else
                        result = -EREMCHG;
        }

The "FID used by two objects" error came from osd_obj_update_entry():

        if (lu_fid_eq(fid, oi_fid)) {
                CERROR("%s: the FID "DFID" is used by two objects: "
                       "%u/%u %u/%u\n", osd_name(osd), PFID(fid),
                       oi_id->oii_ino, oi_id->oii_gen,
                       id->oii_ino, id->oii_gen);
                GOTO(out, rc = -EEXIST);
        }
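As a cross-check: the -EINPROGRESS branch above is only taken while the scrub thread is running, so the oi_scrub file already quoted in the description can be used to watch the scrub state on the OSS. A minimal sketch; the parameter path comes from the description, while the grep field names ("status", "flags", "current_position") are assumptions about typical oi_scrub output:

        # Watch the OI scrub status on this OST; -115 destroys are expected
        # only while the scrub is still running (field names are assumptions)
        lctl get_param osd-ldiskfs.f1-OST00bf.oi_scrub | grep -E 'name|status|flags|current_position'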
| Comment by Jian Yu [ 04/Nov/15 ] |
|
Hi Dustin,

Could you please run the following command on the MDT device to see the pathnames of the two inodes that use the same FID?

debugfs -c -R "ncheck 6037283 562883" /dev/{mdtdev}
|
| Comment by Matt Ezell [ 04/Nov/15 ] |
|
Hi Jian-

I think those are OST inode numbers, so I ran the command there:

[root@f1-oss1d8 ~]# debugfs -c -R "ncheck 6037283 562883" /dev/mapper/f1-ddn1d-l53
debugfs 1.42.12.wc1 (15-Sep-2014)
/dev/mapper/f1-ddn1d-l53: catastrophic mode - not reading inode or group bitmaps
Inode      Pathname
562883     /O/0/d4/54354404
6037283    /O/0/d18/50263570

[root@f1-oss1d8 ~]# debugfs -c -R "stat /O/0/d4/54354404" /dev/mapper/f1-ddn1d-l53 | grep fid
debugfs 1.42.12.wc1 (15-Sep-2014)
/dev/mapper/f1-ddn1d-l53: catastrophic mode - not reading inode or group bitmaps
  lma: fid=[0x100000000:0x2fef612:0x0] compat=8 incompat=0
  fid = "a8 6e 00 bc f6 02 00 02 00 00 00 00 02 00 00 00 " (16)
  fid: parent=[0x20002f6bc006ea8:0x0:0x0] stripe=2

[root@f1-oss1d8 ~]# debugfs -c -R "stat /O/0/d18/50263570" /dev/mapper/f1-ddn1d-l53 | grep fid
debugfs 1.42.12.wc1 (15-Sep-2014)
/dev/mapper/f1-ddn1d-l53: catastrophic mode - not reading inode or group bitmaps
  lma: fid=[0x100000000:0x2fef612:0x0] compat=8 incompat=0
  fid = "a8 6e 00 bc f6 02 00 02 00 00 00 00 02 00 00 00 " (16)
  fid: parent=[0x20002f6bc006ea8:0x0:0x0] stripe=2

That's where I hit a dead end:

[root@lfs-mgmt01.ncrc.gov ~]# lfs fid2path /lustre/f1 '[0x100000000:0x2fef612:0x0]'
ioctl err -22: Invalid argument (22)
fid2path: error on FID [0x100000000:0x2fef612:0x0]: Invalid argument
[root@lfs-mgmt01.ncrc.gov ~]# lfs fid2path /lustre/f1 '[0x20002f6bc006ea8:0x0:0x0]'
ioctl err -22: Invalid argument (22)
fid2path: error on FID [0x20002f6bc006ea8:0x0:0x0]: Invalid argument
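A side note on reading the raw "fid" xattr dump above: assuming the on-disk layout is a little-endian lu_fid (8-byte f_seq, 4-byte f_oid, with the stripe index stored in the f_ver slot for OST objects), the 16 hex bytes can be decoded by hand. A rough sketch:

        # Decode the 16-byte "fid" xattr dumped by debugfs
        # (assumed layout: bytes 0-7 parent f_seq, 8-11 parent f_oid,
        #  12-15 stripe index, all little-endian)
        echo "a8 6e 00 bc f6 02 00 02 00 00 00 00 02 00 00 00" | awk '{
                seq = ""; for (i = 8;  i >= 1;  i--) seq = seq $i;
                oid = ""; for (i = 12; i >= 9;  i--) oid = oid $i;
                str = ""; for (i = 16; i >= 13; i--) str = str $i;
                printf "parent seq=0x%s oid=0x%s stripe=0x%s\n", seq, oid, str;
        }'
        # -> parent seq=0x020002f6bc006ea8 oid=0x00000000 stripe=0x00000002

This matches the parent=[0x20002f6bc006ea8:0x0:0x0] stripe=2 line, and the decoded f_oid of 0 is what makes fid2path fail, as explained in the next comments.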
| Comment by Jian Yu [ 04/Nov/15 ] |
|
Hi Matt,

Could you please refer to the steps in the following section of the Lustre manual? https://build.hpdd.intel.com/job/lustre-manual/lastSuccessfulBuild/artifact/lustre_manual.html#dbdoclet.50438194_30872 |
| Comment by Matt Ezell [ 04/Nov/15 ] |
|
Hi Jian- This file system was formatted with Lustre 2.4, so we shouldn't have any IGIF files. According to step 2, fid2path should have been able to find it, correct? ~Matt |
| Comment by Jian Yu [ 05/Nov/15 ] |
|
Yes, Matt. Let me ask for help. Hi Nasf, Could you please advise? |
| Comment by nasf (Inactive) [ 05/Nov/15 ] |
If the "/lustre/f1" is the right mount point, then that means the FID "[0x20002f6bc006ea8:0x0:0x0]" is an invalid one. The first valid lu_fid::f_oid for MDT-object's FID is 1, not 0. So the PFID EA for the OST-object is wrong. So it is normal that you cannot locate related MDT-object via such invalid FID. On the other hand, your case happened during destroying OST-object. That means the MDT-object has been removed already. So even thought the PFID EA was right, you still could not locate the removed MDT-object.
Searched the log, only the FID [0x100000000:0x2fef612:0x0] was reported as conflict FID mapping. Such FID should be mapped to /O/0/d18/50263570 (6037283/2343154970). But for some unknown reason, the /O/0/d4/54354404 (562883/2343154970) also claims to be mapped to the FID [0x100000000:0x2fef612:0x0]. The latter one is wrong. In fact, the OI entry /O/0/d4/54354404 is corresponding to the FID [0x100000000:0x33d61e4:0x0]. Since we are destroying the OST-object that is corresponding to the FID [0x100000000:0x33d61e4:0x0]. The most simple solution is that mount the OST as ldiskfs mode, then remove /O/0/d4/54354404 directly. |
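For reference, a minimal sketch of that procedure, assuming the OST device is /dev/mapper/f1-ddn1d-l53 (from Matt's debugfs output); the temporary and final mount points here are hypothetical, and the OST must be stopped before remounting it as ldiskfs:

        # Stop the OST (unmount its Lustre mount), then mount it as plain ldiskfs
        mkdir -p /mnt/ost-ldiskfs
        mount -t ldiskfs /dev/mapper/f1-ddn1d-l53 /mnt/ost-ldiskfs

        # Remove the stale object that carries the conflicting FID mapping
        rm /mnt/ost-ldiskfs/O/0/d4/54354404

        # Unmount and bring the OST back up as Lustre
        # (mount point below is hypothetical; use the site's normal OST mount point)
        umount /mnt/ost-ldiskfs
        mount -t lustre /dev/mapper/f1-ddn1d-l53 /mnt/f1-OST00bf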
| Comment by Dustin Leverman [ 11/Nov/15 ] |
|
Thank you for the information, Nasf. This will be our plan, but we need to schedule an outage with the customer. We will let you know the result. Thanks, |
| Comment by Dustin Leverman [ 03/Dec/15 ] |
|
Nasf, thank you for your help. I think it is okay to close this ticket. Thanks, |
| Comment by Jian Yu [ 03/Dec/15 ] |
|
Thank you, Dustin. Let me close this ticket. |