Summary
POST /api/DictQuery/ returns 404 No results found for words that do exist in EDRDG/JMdict (e.g. 先生). The upstream actually returns real data — the bug is in our scraper: it parses 0 results out of a valid results page and then emits its own 404.
This is purely a parser issue in api/dict_query.py. It is not a routing/backend/deploy problem.
Root cause
EDRDG's backend (Stuart McGraw's JMdictDB) changed the markup of its search-results rows. Our parser is still hard-coded to the old format.
Fetching 先生 today:
- <tr class="resrow"> rows are present (3 rows → upstream has data)
- but the page contains 0 occurrences of name="e"
Old vs. new row markup:
| | parser expects (old) | EDRDG actual (new) |
|---|---|---|
| input name | name="e" | name="g" (a checkbox) |
| value format | single entry id | seq.n~entrid (e.g. 2727400.1~2313503) |
| entry URL param | entr.py?...&e= | entr.py?...&g= |
Current new-format row:
<tr class="resrow">
  <td><input name="g" type="checkbox" value="2727400.1~2313503"></td>
  <td class="seq">
    <a href="entr.py?svc=jmdict&g=2727400.1~2313503">2727400</a>
  </td>
  ...
</tr>
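If the parts of the new value are ever needed separately, they could be split as in the sketch below. It assumes every row follows the `seq.n~entrid` layout from the table above; `GroupValue` and `split_g_value` are hypothetical names, not part of api/dict_query.py, and the meaning of the `.n` suffix is not confirmed here.

```python
from typing import NamedTuple


class GroupValue(NamedTuple):
    seq: int      # number before the dot, e.g. 2727400 (shown in td.seq)
    n: int        # the ".n" suffix; its exact meaning is an open question
    entr_id: int  # id after "~", e.g. 2313503


def split_g_value(value: str) -> GroupValue:
    """Split a 'seq.n~entrid' string; raises ValueError if the shape differs."""
    seq_part, entr_id = value.split("~", 1)
    seq, n = seq_part.split(".", 1)
    return GroupValue(int(seq), int(n), int(entr_id))
```

The proposed fix itself does not need this split, since the whole value is passed through unchanged as the `g=` parameter.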
Failure path
api/dict_query.py:66 — row.find("input", {"name": "e"}) always returns None (the input is now name="g") → url_list stays empty → api/dict_query.py:125-130 hits if not url_list: and returns the self-generated 404 No results found.
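The failure can be reproduced offline against the row quoted above. The sketch below is not the project's code: it uses only the standard library's `html.parser` instead of BeautifulSoup, and `InputNameCounter` and `SAMPLE_ROW` are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser

# The new-format row from this report, with the elided cells left out.
SAMPLE_ROW = """
<tr class="resrow">
  <td><input name="g" type="checkbox" value="2727400.1~2313503"></td>
  <td class="seq">
    <a href="entr.py?svc=jmdict&g=2727400.1~2313503">2727400</a>
  </td>
</tr>
"""


class InputNameCounter(HTMLParser):
    """Counts <input> elements by name attribute inside tr.resrow rows."""

    def __init__(self):
        super().__init__()
        self.in_row = False
        self.rows = 0
        self.names = {}

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "tr" and "resrow" in (attr.get("class") or "").split():
            self.in_row = True
            self.rows += 1
        elif tag == "input" and self.in_row:
            name = attr.get("name")
            self.names[name] = self.names.get(name, 0) + 1

    def handle_endtag(self, tag):
        if tag == "tr":
            self.in_row = False


counter = InputNameCounter()
counter.feed(SAMPLE_ROW)
# One result row, no name="e" input, one name="g" input.
print(counter.rows, counter.names.get("e", 0), counter.names.get("g", 0))
```

A row is found but the old selector has nothing to match, which is the state that leaves url_list empty.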
Proposed fix
Switch the scraper from the retired e= selector to the new g= group selector. Three spots in api/dict_query.py must stay aligned:
- Results-page parsing (get_all_url, lines ~61-72): read the entry link from td.seq a[href^="entr.py"] (or the input[name="g"] value) and build the entry URL with g=:

    # needs: from urllib.parse import parse_qs, urlparse
    soup = BeautifulSoup(response.text, "html.parser")
    rows = soup.find_all("tr", class_="resrow")
    url_list = []
    for row in rows:
        link = row.select_one('td.seq a[href^="entr.py"]')
        if link and link.has_attr("href"):
            # parse_qs returns lists; take the first g value, if any
            g = parse_qs(urlparse(link["href"]).query).get("g", [None])[0]
            if g:
                url_list.append(
                    f"https://www.edrdg.org/jmwsgi/entr.py?svc=jmdict&g={g}"
                )

- Single-result redirect branch (lines ~52-59): when EDRDG redirects straight to entr.py, read the g query param instead of e:

    if "entr.py" in str(response.url):
        g = parse_qs(urlparse(str(response.url)).query).get("g", [None])[0]
        # without a g param there is nothing usable, so return an empty list
        return (
            [f"https://www.edrdg.org/jmwsgi/entr.py?svc=jmdict&g={g}"]
            if g else []
        )

- Entry-page parsing (get_dict): the entr.py?...&g= page still uses span.kanj / span.rdng / tr.sense / span.glossx, so the existing selectors should keep working; verify against the new page once the results-page and redirect spots are fixed.
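One way to keep the results-page and redirect spots aligned is to route both through a single helper. The sketch below assumes that a row href and a redirect URL carry the same `g` parameter; `entry_url_from` and `ENTRY_URL` are hypothetical names, not existing code in api/dict_query.py.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse

# Entry URL shape taken from the snippets in this report.
ENTRY_URL = "https://www.edrdg.org/jmwsgi/entr.py?svc=jmdict&g={g}"


def entry_url_from(link: str) -> Optional[str]:
    """Return the g= entry URL for a row href or a redirect URL.

    Returns None when the link has no g parameter, for example an
    old-style e= link.
    """
    g = parse_qs(urlparse(link).query).get("g", [None])[0]
    return ENTRY_URL.format(g=g) if g else None
```

With this, the row loop would append `entry_url_from(link["href"])` when it is not None, and the redirect branch would call `entry_url_from(str(response.url))`, so a future parameter rename only has to be handled in one place.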
Reproduce
curl -s -X POST localhost:8000/api/DictQuery/ -H 'content-type: application/json' -d '{"word":"先生"}'
# -> 404 "No results found", despite 先生 existing in JMdict
Notes / hardening
- Consider detecting the case where resrow rows were found but yielded 0 entries, and logging it as a parser/format error instead of silently returning 404, so the next upstream HTML change is caught immediately rather than masquerading as "no results".
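A minimal sketch of that check follows. It assumes the caller knows how many resrow rows it saw; `UpstreamFormatError`, `check_parse_result`, and the logger name are hypothetical, and how the error maps to an HTTP status is left to the handler.

```python
import logging

logger = logging.getLogger("dict_query")  # hypothetical logger name


class UpstreamFormatError(RuntimeError):
    """Result rows were present but none produced an entry URL."""


def check_parse_result(rows_found: int, url_list: list) -> list:
    """Return url_list unchanged, or raise if rows were seen but none parsed."""
    if rows_found > 0 and not url_list:
        logger.error(
            "parsed 0 entries from %d resrow rows; upstream markup may have changed",
            rows_found,
        )
        raise UpstreamFormatError(f"{rows_found} rows found, 0 entries parsed")
    return url_list
```

Calling `check_parse_result(len(rows), url_list)` at the end of get_all_url would leave a genuinely empty results page (0 rows) on the existing 404 path, while a format mismatch surfaces as a distinct error.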