Fix duplicate queries by harikrishnatp · Pull Request #626 · scribe-org/Scribe-Data

harikrishnatp · 2025-08-04T20:17:18Z

Contributor checklist

This pull request is on a separate branch and not the main branch
I have tested my code with the pytest command as directed in the testing section of the contributing guide

Description

This pull request resolves the issue of duplicate SPARQL query forms being generated. The fix updates the query generation script so that it correctly handles the current state of both the local files and the live Wikidata dump.

Changes:

Made SPARQL parsing stateless: The script that reads existing .sparql files is now stateless. It no longer accumulates data across runs, which was the original source of the duplications.
Corrected sub-language logic: While testing, macro languages were causing issues, so the query_processing logic in check_missing_forms was updated to correctly generate a unique, filtered set of queries for each sub-language.
Improved error handling: The script no longer crashes if it encounters a language in the Wikidata dump that is not present in the local project.
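
For illustration, here is a minimal sketch of what stateless parsing and the tolerant language lookup could look like. The names `extract_forms` and `forms_for_language`, the regular expression, and the convention that each selected `?variable` is a form label are assumptions made for this example and are not taken from the Scribe-Data code.

```python
import re

# Assumption for this sketch: every `?variable` in the SELECT clause is a form label.
_VARIABLE = re.compile(r"\?(\w+)")


def extract_forms(query_text: str) -> list[str]:
    """Return the variables selected by one SPARQL query, in order, without repeats."""
    forms: list[str] = []  # created fresh on every call, so nothing carries over
    # Only look at the text before WHERE, i.e. the SELECT clause.
    select_part = re.split(r"\bWHERE\b", query_text, maxsplit=1, flags=re.IGNORECASE)[0]
    for name in _VARIABLE.findall(select_part):
        if name not in forms:
            forms.append(name)
    return forms


def forms_for_language(language: str, queries_by_language: dict[str, list[str]]) -> list[list[str]]:
    """Parse all queries for a language; an unknown language yields an empty list."""
    texts = queries_by_language.get(language)
    if texts is None:
        # Language is in the dump but not in the local project: skip instead of raising.
        return []
    return [extract_forms(text) for text in texts]
```

Because `extract_forms` builds its result list inside the function, calling it repeatedly on the same text gives the same answer each time, which is the property the description above relies on.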

As part of this fix, all SPARQL query files were deleted and regenerated from the latest Wikidata lexeme dump. This not only fixed the duplication issue but also updated the queries to reflect the current state of the data in Wikidata. It also led to the removal of some outdated queries (e.g. Ukrainian) for which the data no longer appears to be available in a parsable format in the dump.

Testing:

The changes were tested by repeatedly running the generation script and verifying the output with a custom verification script.
The pytest suite passes. An outdated test (test_list_data_types_specific_language) was also updated to reflect the changes in the available data.
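
The custom verification script is not included in this thread. A check of this kind could look roughly like the sketch below, which reports every form that shows up in more than one query. The function name `find_duplicate_forms` and the input shape (query file name mapped to its list of forms) are assumptions for illustration only.

```python
from collections import defaultdict


def find_duplicate_forms(forms_by_query: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each form that appears in more than one query to the sorted query names."""
    queries_per_form: dict[str, set[str]] = defaultdict(set)
    for query_name, forms in forms_by_query.items():
        for form in forms:
            queries_per_form[form].add(query_name)
    # Keep only the forms that were seen in two or more distinct queries.
    return {
        form: sorted(names)
        for form, names in queries_per_form.items()
        if len(names) > 1
    }
```

An empty result after each regeneration run would indicate that no form is duplicated across queries.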

Related issue

Closes #617 (Query generation is including the same query form in more than one query)

github-actions · 2025-08-04T20:17:40Z

Thank you for the pull request! ❤️

The Scribe-Data team will do our best to address your contribution as soon as we can. If you're not already a member of our public Matrix community, please consider joining! We'd suggest that you use the Element client as well as Element X for a mobile app, and definitely join the General and Data rooms once you're in. Also consider attending our bi-weekly Saturday dev syncs. It'd be great to meet you 😊

github-actions · 2025-08-04T20:17:41Z

Maintainer Checklist

The following is a checklist for maintainers to make sure this process goes as well as possible. Feel free to address the points below yourself in further commits if you realize that actions are needed :)

The linting and formatting workflow within the PR checks do not indicate new errors in the files changed
The CHANGELOG has been updated with a description of the changes for the upcoming release and the corresponding issue (if necessary)

andrewtavis · 2025-08-04T20:23:50Z

Amazing to have this PR opened, @harikrishnatp! Thanks so much for getting to this 😊 Really is key for the community right now :)

andrewtavis · 2025-08-04T20:24:18Z

@axif0, maybe we could look into this when we finalize the current PR for Scribe-Server as well? 😊

harikrishnatp · 2025-08-04T20:34:35Z

Thanks so much for the encouragement, and apologies again for the delay in getting to this. Excited to see it merged! 😊

axif0 · 2025-08-04T20:50:44Z

all SPARQL query files were deleted and regenerated from the latest Wikidata lexeme dump.

I'm not sure I support this action. When parsing the Wikidata lexeme dump and querying with the SPARQL query files, I saw that the SPARQL query files sometimes gave better results than the Wikidata lexeme dump. That is, the manually written SPARQL query files have certain forms that the dump file doesn't, and vice versa.

I'm thinking that when scribe-android is live, we should make sure it has as much language data available as possible. By merging the results from both sources, we can maximize coverage and data quality, ensuring the app starts with a rich and reliable language dataset.

andrewtavis · 2025-08-05T08:41:16Z

I think that there's value in both approaches - deleting the queries and keeping them. @axif0, what are the forms that are only available from the query service? Ideally we would be able to delete and regenerate, but then have those forms that are missing be accounted for in the process. This would assure that old fields that are from old labels or form combinations are no longer included in the queries :)

harikrishnatp · 2025-08-05T13:20:22Z

You're right, thank you both for the clarification. I agree that the best approach is to create a master list that is a union of the forms from the existing SPARQL files and the new forms found in the latest Wikidata dump. This will preserve the existing forms while still allowing for a clean regeneration process.
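
As a rough sketch of the union idea, assuming both sources can be reduced to a mapping from a (language, data type) pair to a list of form labels. The function name `merge_form_sources` and this data shape are hypothetical and not part of the PR.

```python
FormsByPair = dict[tuple[str, str], list[str]]


def merge_form_sources(existing: FormsByPair, dump: FormsByPair) -> FormsByPair:
    """Union the forms per (language, data_type), keeping existing forms first."""
    merged: FormsByPair = {}
    for pair in sorted(existing.keys() | dump.keys()):
        combined = [*existing.get(pair, []), *dump.get(pair, [])]
        # dict.fromkeys drops repeats while preserving first-seen order.
        merged[pair] = list(dict.fromkeys(combined))
    return merged
```

Forms that exist only in the hand-written queries survive the merge, and forms that are new in the dump are appended once.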

andrewtavis · 2025-09-10T15:18:10Z

Closing this now given the conversation with @harikrishnatp and @axif0. Plan from here is that @harikrishnatp will be opening a new PR on this based on using the Wikidata Query Service to derive all form property combinations for language and data type pairs.
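
To make the planned direction concrete, the sketch below shows one way the grammatical feature combinations for a language and data type pair could be requested and grouped. It only builds the query text and groups already-fetched rows; it makes no network call. The function names and the row shape are assumptions for illustration, and the follow-up PR may take a different approach. The query relies on the `wd:`, `dct:`, `wikibase:` and `ontolex:` prefixes that the Wikidata Query Service predefines.

```python
from collections import defaultdict


def build_form_features_query(language_qid: str, category_qid: str) -> str:
    """Build a SPARQL query listing each form and its grammatical features."""
    return f"""
SELECT DISTINCT ?form ?feature WHERE {{
  ?lexeme dct:language wd:{language_qid} ;
          wikibase:lexicalCategory wd:{category_qid} ;
          ontolex:lexicalForm ?form .
  ?form wikibase:grammaticalFeature ?feature .
}}
""".strip()


def feature_combinations(rows: list[tuple[str, str]]) -> set[frozenset[str]]:
    """Group (form_id, feature_qid) rows into the distinct feature sets in use."""
    features_by_form: dict[str, set[str]] = defaultdict(set)
    for form_id, feature_qid in rows:
        features_by_form[form_id].add(feature_qid)
    # frozenset makes the combination order-independent and hashable.
    return {frozenset(features) for features in features_by_form.values()}
```

For example, `build_form_features_query("Q1860", "Q1084")` would produce the query text for English nouns.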

harikrishnatp added 2 commits August 5, 2025 01:06

fix: Resolve duplicate SPARQL query generation

8ac32c6

test: Update tests to reflect current wikidata state

1d2fc38

andrewtavis requested review from andrewtavis and axif0 and removed request for axif0 August 4, 2025 20:22

andrewtavis mentioned this pull request Aug 10, 2025

[Multiple assignees possible] Add missing tests to Scribe-Data #623

Open

2 tasks

andrewtavis closed this Sep 10, 2025

harikrishnatp wants to merge 2 commits into scribe-org:main from harikrishnatp:fix-duplicate-queries

3 participants
