[LU-7217] sanity-lfsck test_23b hangs on START_NAMESPACE Created: 28/Sep/15 Updated: 29/Sep/15 Resolved: 29/Sep/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | parinay v kondekar (Inactive) | Assignee: | nasf (Inactive) |
| Resolution: | Not a Bug | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Configuration: 4-node setup: 1 MDS / 1 OSS / 2 clients
From stdout.log:
Total allocated inode limit: 0, total allocated block limit: 0
== sanity-lfsck test 23c: LFSCK can repair dangling name entry (3) == 08:02:38 (1443427358)
#####
The objectA has multiple hard links, one of them corresponding to the name entry_B. But there is something wrong for the name entry_B and cause entry_B to references non-exist object_C. In the first-stage scanning, the LFSCK will think the entry_B as dangling, and re-create the lost object_C. And then others modified the re-created object_C. When the LFSCK comes to the second-stage scanning, it will find that the former re-creating object_C maybe wrong and try to replace the object_C with the real object_A. But because object_C has been modified, so the LFSCK cannot replace it.
#####
Inject failure stub on MDT0 to simulate dangling name entry
fail_loc=0x1621
fail_loc=0
'ls' should fail because of dangling name entry
fail_val=10
fail_loc=0x1602
Trigger namespace LFSCK to find out dangling name entry |
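For context, a minimal sketch of how these failure stubs can be injected and cleared by hand on the MDS. The MDT device name, the client path, and the expansion of $START_NAMESPACE are assumptions for illustration, not taken from this log:
# On the MDS: arm the dangling-entry stub (0x1621) seen above, create a test
# file from a client, then clear the stub.
lctl set_param fail_loc=0x1621
touch /mnt/lustre/d23c/file0        # run on a client; the path is hypothetical
lctl set_param fail_loc=0
# Delay the second-phase scan: OBD_FAIL_LFSCK_DELAY3 (0x1602) makes the
# LFSCK engine sleep fail_val seconds per scanned item.
lctl set_param fail_val=10 fail_loc=0x1602
# Trigger the namespace LFSCK; this assumes $START_NAMESPACE expands to
# something like the following, as in sanity-lfsck.sh.
lctl lfsck_start -M lustre-MDT0000 -t namespace -r -C
# Clear the injection when finished.
lctl set_param fail_loc=0 fail_val=0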
| Comments |
| Comment by Joseph Gmitter (Inactive) [ 28/Sep/15 ] |
|
Hi Fan Yong, |
| Comment by nasf (Inactive) [ 29/Sep/15 ] |
|
According to the given log, the unfinished test is test_23c, not test_23b. In test_23c, we inject a failure stub to delay the LFSCK scanning:
test_23c() {
echo "#####"
...
#define OBD_FAIL_LFSCK_DELAY3 0x1602
do_facet $SINGLEMDS $LCTL set_param fail_val=10 fail_loc=0x1602
echo "Trigger namespace LFSCK to find out dangling name entry"
$START_NAMESPACE -r -C ||
}
This means that every time the stub "0x1602" is hit (once for every item scanned during the 2nd phase), the LFSCK engine will sleep 10 seconds and then go ahead. So the whole LFSCK scan looks very slow, but it is not hung. Here is the MDS log:
00100000:10000000:0.0:1443429130.295422:0:7460:0:(lfsck_engine.c:1733:lfsck_assistant_engine()) lustre-MDT0000-osd: LFSCK assistant phase2 scan start, synced: rc = 0
00100000:10000000:0.0:1443429130.295430:0:7460:0:(lfsck_namespace.c:6121:lfsck_namespace_assistant_handler_p2()) lustre-MDT0000-osd: namespace LFSCK phase2 scan start
00100000:10000000:0.0:1443429130.295473:0:7460:0:(lfsck_namespace.c:5727:lfsck_namespace_scan_local_lpf()) lustre-MDT0000-osd: start to scan backend /lost+found
00000001:00020000:0.0:1443429130.295483:0:7460:0:(fail.c:133:__cfs_fail_timeout_set()) cfs_fail_timeout id 1602 sleeping for 10000ms
...
00000100:00100000:1.0:1443429133.621277:0:7029:0:(service.c:2121:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ll_mgs_0000:26a6aa1f-ae32-a150-7927-f6f66a7ef0c1+9:6277:x1513543135402544:12345-192.168.112.11@tcp:400 Request procesed in 35us (90us total) trans 0 rc 0/0
00000100:00100000:1.0:1443429133.621281:0:7029:0:(nrs_fifo.c:241:nrs_fifo_req_stop()) NRS stop fifo request from 12345-192.168.112.11@tcp, seq: 1561
Debug log: 18556 lines, 18556 kept, 0 dropped, 0 bad.
As we can see, the LFSCK engine hit the "0x1602" stub at time "1443429130.295483" and went to sleep. As scheduled, it should be woken up at time "1443429140.295483". But the log was dumped before that ("1443429133.xxxx"), which means the sleep had not finished yet. So it is not a bug. |
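As a side note, here is a small sketch of how one could confirm in a case like this that the second-phase scan is progressing rather than hung. The MDT name is an assumption, and the field names follow the usual mdd.*.lfsck_namespace output layout:
# On the MDS: poll the namespace LFSCK state a few times. While the 0x1602
# stub is armed with fail_val=10, "checked_phase2" should still creep up
# roughly once every 10 seconds instead of staying frozen.
for i in 1 2 3 4 5 6; do
    lctl get_param -n mdd.lustre-MDT0000.lfsck_namespace |
        grep -E '^(status|checked_phase2):'
    sleep 10
done
# Dump the kernel debug log and look for the "cfs_fail_timeout id 1602"
# sleep messages, as in the MDS log quoted above.
lctl dk /tmp/lfsck-debug.log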