[LU-11519] sanity-hsm: test_90 MDS crash in mdt_cdt_waiting_cb() Created: 15/Oct/18 Updated: 19/Dec/18 Resolved: 27/Nov/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.0 |
| Fix Version/s: | Lustre 2.12.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Maloo | Assignee: | John Hammond |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com> This issue relates to the following test suite run review-dne-zfs-part-2: BUG: unable to handle kernel paging request at 00000000240cd204 IP: mdt_cdt_waiting_cb.isra.25+0x357/0xc10 [mdt] CPU: 0 PID: 25301 Comm: hsm_cdtr Kdump: loaded Tainted: P RIP: 0010:[<ffffffffc1230617>] [<ffffffffc1230617>] mdt_cdt_waiting_cb.isra.25+0x357/0xc10 [mdt] Call Trace: mdt_coordinator_cb+0x162/0x290 [mdt] llog_process_thread+0x852/0x1550 [obdclass] llog_process_or_fork+0xbc/0x450 [obdclass] llog_cat_process_cb+0x239/0x250 [obdclass] llog_process_thread+0x852/0x1550 [obdclass] llog_process_or_fork+0xbc/0x450 [obdclass] llog_cat_process_or_fork+0x199/0x2a0 [obdclass] llog_cat_process+0x2e/0x30 [obdclass] cdt_llog_process+0xc6/0x3a0 [mdt] mdt_coordinator+0x541/0x19f0 [mdt] kthread+0xd1/0xe0 ret_from_fork_nospec_begin+0x21/0x21 |
| Comments |
| Comment by Andreas Dilger [ 19/Oct/18 ] |
|
+1 in review-dne-part-2 test: |
| Comment by James Nunez (Inactive) [ 19/Oct/18 ] |
|
We are seeing sanity-hsm test 13 crash at about a 5% failure rate with this call trace. All are for DNE with ZFS configurations. Some example are at |
| Comment by Andreas Dilger [ 23/Oct/18 ] |
|
+1 on master: https://testing.whamcloud.com/test_sets/23d994e6-d644-11e8-ad90-52540065bddc |
| Comment by Yang Sheng [ 25/Oct/18 ] |
|
The crash dump cannot be loaded to analysis. |
| Comment by Andreas Dilger [ 29/Oct/18 ] |
|
+1 on master https://testing.whamcloud.com/test_sets/30e877d2-d6d1-11e8-9238-52540065bddc There are actually many failures like this. |
| Comment by Andreas Dilger [ 01/Nov/18 ] |
|
John, could you please take a look at this, it looks like a coordinator bug and has been crashing the MDS fairly commonly lately. |
| Comment by John Hammond [ 02/Nov/18 ] |
|
I grabbed the most recent failure from maloo https://testing.whamcloud.com/test_sets/27781c7e-dde4-11e8-975a-52540065bddc which uses build 59629: [ 5043.117404] BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
[ 5043.119825] IP: [<ffffffffc1327747>] mdt_cdt_waiting_cb.isra.25+0x357/0xc10 [mdt]
[ 5043.122276] PGD 0
$ mkdir /tmp/LU-11519
$ cd /tmp/LU-11519
$ wget 'https://build.whamcloud.com/job/lustre-reviews/59629/arch=x86_64,build_type=server,distro=el7,ib_stack=inkernel/artifact/artifacts/RPMS/x86_64/lustre-debuginfo-2.11.56_56_g098be7d-1.el7.x86_64.rpm'
...
$ rpm2cpio lustre-debuginfo-2.11.56_56_g098be7d-1.el7.x86_64.rpm | cpio -id
$ nm ./usr/lib/debug/lib/modules/3.10.0-862.14.4.el7_lustre.x86_64/extra/lustre/fs/mdt.ko.debug | awk '$3 == "mdt_cdt_waiting_cb.isra.25" { print $1; }'
000000000005f3f0
$ printf '%#lx\n' $(( 0x357 + 0x5f3f0 ))
0x5f747
$ addr2line -e ./usr/lib/debug/lib/modules/3.10.0-862.14.4.el7_lustre.x86_64/extra/lustre/fs/mdt.ko.debug --functions --inlines 0x5f747
mdt_cdt_waiting_cb
/usr/src/debug/lustre-2.11.56_56_g098be7d/lustre/mdt/mdt_coordinator.c:228
$ cat -n ./usr/src/debug/lustre-2.11.56_56_g098be7d/lustre/mdt/mdt_coordinator.c | sed -n '220,230p'
220 } while (request->hal_used_sz + hai_size >
221 LDLM_MAXREQSIZE);
222 } else if (hsd->hsd_housekeeping) {
223 struct hsm_scan_request *tmp;
224
225 /* Discard the (whole) last hal */
226 hsd->hsd_request_count--;
227 tmp = &hsd->hsd_request[hsd->hsd_request_count];
228 hsd->hsd_action_count -= tmp->hal->hal_count;
229 OBD_FREE(tmp->hal, tmp->hal_sz);
230 } else {
So it looks like tmp->hal is NULL. |
| Comment by Andreas Dilger [ 02/Nov/18 ] |
I typically use the following, which is considerably easier: $ gdb .../mdt.ko.debug -ex "list *(mdt_cdt_waiting_cb+0x357)" -ex quit
(gdb) list *(mdt_cdt_waiting_cb+0x357)
0x5f777 is in mdt_cdt_waiting_cb (/usr/src/lustre-head/lustre/mdt/mdt_coordinator.c:228).
223 struct hsm_scan_request *tmp;
224
225 /* Discard the (whole) last hal */
226 hsd->hsd_request_count--;
227 tmp = &hsd->hsd_request[hsd->hsd_request_count];
228 hsd->hsd_action_count -= tmp->hal->hal_count;
229 OBD_FREE(tmp->hal, tmp->hal_sz);
230 } else {
231 /* Bailing out, this code path is too hot */
232 RETURN(LLOG_PROC_BREAK);
Note the "list *()" argument is typically the same format as how addresses are printed from stack dumps/oops messages, but in this case the ".isra.25" part needs to be dropped. |
| Comment by Jian Yu [ 03/Nov/18 ] |
|
sanity-hsm test 13 also hit the same failure in review-dne-zfs-part-2 test session on master branch: |
| Comment by Jian Yu [ 04/Nov/18 ] |
|
sanity-hsm test 13 hit this failure again: |
| Comment by John Hammond [ 05/Nov/18 ] |
|
> I typically use the following, which is considerably easier: (I like my way because it makes inlining much easier to understand. Which is not an issue here. Maybe gcc has a better way to do this in those cases.) |
| Comment by Gerrit Updater [ 05/Nov/18 ] |
|
John L. Hammond (jhammond@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33580 |
| Comment by Gerrit Updater [ 06/Nov/18 ] |
|
Quentin Bouget (quentin.bouget@cea.fr) uploaded a new patch: https://review.whamcloud.com/33590 |
| Comment by Gerrit Updater [ 21/Nov/18 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33580/ |
| Comment by Gerrit Updater [ 27/Nov/18 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33590/ |
| Comment by Peter Jones [ 27/Nov/18 ] |
|
Landed for 2.12 |