DictQuery returns 404 for valid words (先生): EDRDG/JMdictDB markup changed name="e" → name="g" #54

@torrid-fish

Summary

POST /api/DictQuery/ returns 404 No results found for words that do exist in EDRDG/JMdict (e.g. 先生). The upstream actually returns real data — the bug is in our scraper: it parses 0 results out of a valid results page and then emits its own 404.

This is purely a parser issue in api/dict_query.py. It is not a routing/backend/deploy problem.

Root cause

EDRDG's backend (Stuart McGraw's JMdictDB) changed the markup of its search-results rows. Our parser is still hard-coded to the old format.

Fetching 先生 today:

  • <tr class="resrow"> rows are present (3 rows → upstream has data)
  • but the page contains 0 occurrences of name="e"

Old vs. new row markup:

field             parser expects (old)    EDRDG actual (new)
input name        name="e"                name="g" (a checkbox)
value format      single entry id         seq.n~entrid (e.g. 2727400.1~2313503)
entry URL param   entr.py?...&e=          entr.py?...&g=

Current new-format row:

<tr class="resrow">
  <td><input name="g" type="checkbox" value="2727400.1~2313503"></td>
  <td class="seq">
    <a href="entr.py?svc=jmdict&g=2727400.1~2313503">2727400</a>
  </td>
  ...
</tr>
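
The mismatch can be checked directly against the row above. The following is a minimal sketch that uses only the standard library's html.parser instead of the BeautifulSoup calls in api/dict_query.py, so it runs without dependencies. InputNameCounter and SAMPLE_ROW are illustrative names, not part of the codebase.

```python
from collections import Counter
from html.parser import HTMLParser

# Sample row copied from the new-format markup above (elided cells omitted).
SAMPLE_ROW = """
<tr class="resrow">
  <td><input name="g" type="checkbox" value="2727400.1~2313503"></td>
  <td class="seq">
    <a href="entr.py?svc=jmdict&g=2727400.1~2313503">2727400</a>
  </td>
</tr>
"""


class InputNameCounter(HTMLParser):
    """Hypothetical helper: tallies <input> elements by their name attribute."""

    def __init__(self):
        super().__init__()
        self.names = Counter()
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            attr = dict(attrs)
            self.names[attr.get("name")] += 1
            self.values.append(attr.get("value"))


counter = InputNameCounter()
counter.feed(SAMPLE_ROW)

# The old selector matches nothing; the new one matches the checkbox.
print("name=e:", counter.names["e"], "name=g:", counter.names["g"])
```

On this sample the old name="e" lookup finds nothing while name="g" finds the checkbox, which matches the 0-occurrence observation above.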

Failure path

api/dict_query.py:66: row.find("input", {"name": "e"}) always returns None (the input is now name="g") → url_list stays empty → api/dict_query.py:125-130 hits if not url_list: and returns the self-generated 404 No results found.

Proposed fix

Switch the scraper from the retired e= selector to the new g= group selector. Three spots in api/dict_query.py must stay aligned:

  1. Results-page parsing (get_all_url, lines ~61-72) — read the entry link from td.seq a[href^="entr.py"] (or the input[name="g"] value) and build the entry URL with g=:
# Assumes `from urllib.parse import parse_qs, urlparse` at the top of the module.
soup = BeautifulSoup(response.text, "html.parser")
rows = soup.find_all("tr", class_="resrow")

url_list = []
for row in rows:
    link = row.select_one('td.seq a[href^="entr.py"]')
    if link and link.has_attr("href"):
        g = parse_qs(urlparse(link["href"]).query).get("g", [None])[0]
        if g:
            url_list.append(
                f"https://www.edrdg.org/jmwsgi/entr.py?svc=jmdict&g={g}"
            )
  2. Single-result redirect branch (lines ~52-59). When EDRDG redirects straight to entr.py, read the g query param instead of e:
if "entr.py" in str(response.url):
    g = parse_qs(urlparse(str(response.url)).query).get("g", [None])[0]
    return (
        [f"https://www.edrdg.org/jmwsgi/entr.py?svc=jmdict&g={g}"]
        if g else []
    )
  3. Entry-page parsing (get_dict). The entr.py?...&g= page still uses span.kanj / span.rdng / tr.sense / span.glossx, so the existing selectors should keep working; verify against the new page once (1) and (2) are fixed.
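
Since spots (1) and (2) both pull g out of a URL and rebuild the entry link, one way to keep them aligned is to share a pair of small helpers. This is only a sketch: extract_g, build_entry_url, and ENTRY_BASE are hypothetical names that do not exist in api/dict_query.py, and the old-format href in the usage lines is an illustrative example rather than captured output.

```python
from urllib.parse import parse_qs, urlparse

# Base URL as it appears in the proposed snippets above.
ENTRY_BASE = "https://www.edrdg.org/jmwsgi/entr.py"


def extract_g(url):
    """Return the g query parameter of an absolute URL or relative href, else None."""
    values = parse_qs(urlparse(url).query).get("g")
    return values[0] if values else None


def build_entry_url(g):
    """Build the entry-page URL for a g value such as '2727400.1~2313503'."""
    return f"{ENTRY_BASE}?svc=jmdict&g={g}"


# Relative href from a results row (new format).
g = extract_g("entr.py?svc=jmdict&g=2727400.1~2313503")
print(build_entry_url(g))

# An old-style e= link (illustrative) carries no g parameter.
print(extract_g("entr.py?svc=jmdict&e=2313503"))
```

Both the results-page loop and the redirect branch could then call extract_g on their respective URL and skip the entry when it returns None.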

Reproduce

curl -s -X POST localhost:8000/api/DictQuery/ -H 'content-type: application/json' -d '{"word":"先生"}'
# -> 404 "No results found", despite 先生 existing in JMdict

Notes / hardening

  • Consider detecting the case where resrow rows were found but yielded 0 entries, and logging it as a parser/format error instead of silently returning 404, so the next upstream HTML change is caught immediately rather than masquerading as "no results".
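
A rough sketch of that guard, assuming the caller knows how many resrow rows it saw and which URLs it extracted. ParserFormatError, check_parse_result, and the "dict_query" logger name are hypothetical; how the API maps the exception to an HTTP status is left open here.

```python
import logging

logger = logging.getLogger("dict_query")  # assumed logger name


class ParserFormatError(RuntimeError):
    """Hypothetical error: result rows were present but none could be parsed."""


def check_parse_result(row_count, url_list):
    """Return url_list unchanged, or raise if rows existed but yielded no URLs."""
    if row_count > 0 and not url_list:
        # Rows without entries point at a markup change, not at a missing word.
        logger.error(
            "Found %d resrow rows but extracted 0 entry URLs; "
            "upstream markup may have changed",
            row_count,
        )
        raise ParserFormatError(
            f"{row_count} result rows parsed into 0 entries"
        )
    return url_list


# No rows at all: a genuine "no results" case, passed through as an empty list.
print(check_parse_result(0, []))
```

With a guard like this, the existing if not url_list: branch would only be reached when the upstream page truly had no rows.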
