Fix duplicate queries #626

Closed
harikrishnatp wants to merge 2 commits into scribe-org:main from harikrishnatp:fix-duplicate-queries

Conversation

@harikrishnatp

Collaborator

Contributor checklist


Description

This pull request resolves the generation of duplicate SPARQL query forms. The fix updates the query generation script so that it correctly handles the current state of both the local files and the live Wikidata dump.

Changes:

  1. Made SPARQL parsing stateless: The script that reads existing .sparql files is now stateless. It no longer accumulates data across runs, which was the original source of the duplication.
  2. Corrected sub-language logic: Macro languages caused issues during testing, so the query_processing logic in check_missing_forms was updated to generate a unique, filtered set of queries for each sub-language.
  3. Improved error handling: The script no longer crashes if it encounters a language in the Wikidata dump that is not present in the local project.
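
To illustrate changes 1 and 3, here is a minimal sketch and not the code from this PR. The function names `extract_forms` and `missing_forms`, the regular expressions, and the choice to skip unknown languages are all assumptions made for illustration; the real script may parse the .sparql files differently.

```python
import re

# Hypothetical pattern: a grammaticalFeature clause followed by a
# comma-separated list of Wikidata QIDs.
FEATURE_CLAUSE = re.compile(r"wikibase:grammaticalFeature\s+((?:wd:Q\d+\s*,?\s*)+)")
QID = re.compile(r"Q\d+")


def extract_forms(query_text):
    """Return the QID combinations found in one query, built fresh on every call."""
    forms = set()  # local to the call, so nothing carries over between runs
    for clause in FEATURE_CLAUSE.findall(query_text):
        forms.add(frozenset(QID.findall(clause)))
    return forms


def missing_forms(dump_forms, local_forms):
    """Map each known language to the dump forms that have no local query yet."""
    missing = {}
    for language, forms in dump_forms.items():
        if language not in local_forms:
            # Assumption: a language only present in the dump is skipped
            # instead of raising a KeyError.
            continue
        missing[language] = forms - local_forms[language]
    return missing
```

Because `extract_forms` keeps its result in a local variable, calling it twice on the same text gives the same answer, which is the property that a module-level accumulator would break.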

As part of this fix, all SPARQL query files were deleted and regenerated from the latest Wikidata lexeme dump. This not only fixed the duplication issue but also updated the queries to reflect the current state of the data in Wikidata. It also removed some outdated queries (e.g. Ukrainian) for which the data no longer appears to be available in a parsable format in the dump.

Testing:

  • The changes were tested by repeatedly running the generation script and checking the output with a custom verification script.
  • pytest passes. An outdated test (test_list_data_types_specific_language) was updated to reflect the changes in the available data.
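
The custom verification script is not included in this description. As a rough idea of what such a check could look like, the sketch below reports every form that appears in more than one query. The function name `find_duplicate_forms` and the input shape (query name mapped to its forms) are hypothetical.

```python
from collections import defaultdict


def find_duplicate_forms(forms_by_query):
    """Return each form that occurs in more than one query, with the queries that contain it."""
    owners = defaultdict(list)
    for query_name, forms in forms_by_query.items():
        for form in set(forms):  # ignore repeats inside a single query
            owners[form].append(query_name)
    return {form: sorted(names) for form, names in owners.items() if len(names) > 1}
```

An empty result after every run of the generation script would indicate that no form is shared between queries.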

Related issue

@github-actions

github-actions Bot commented Aug 4, 2025

Contributor

Thank you for the pull request! ❤️

The Scribe-Data team will do our best to address your contribution as soon as we can. If you're not already a member of our public Matrix community, please consider joining! We'd suggest that you use the Element client as well as Element X for a mobile app, and definitely join the General and Data rooms once you're in. Also consider attending our bi-weekly Saturday dev syncs. It'd be great to meet you 😊

@github-actions

github-actions Bot commented Aug 4, 2025

Contributor

Maintainer Checklist

The following is a checklist for maintainers to make sure this process goes as well as possible. Feel free to address the points below yourself in further commits if you realize that actions are needed :)

  • The linting and formatting workflow within the PR checks do not indicate new errors in the files changed

  • The CHANGELOG has been updated with a description of the changes for the upcoming release and the corresponding issue (if necessary)

@andrewtavis andrewtavis requested review from andrewtavis and axif0 and removed request for axif0 August 4, 2025 20:22
@andrewtavis

Member

Amazing to have this PR opened, @harikrishnatp! Thanks so much for getting to this 😊 Really is key for the community right now :)

@andrewtavis

Member

@axif0, maybe we could look into this when we finalize the current PR for Scribe-Server as well? 😊

@harikrishnatp

Collaborator Author

Thanks so much for the encouragement, and apologies again for the delay in getting to this. Excited to see it merged! 😊

@axif0

axif0 commented Aug 4, 2025

Member

all SPARQL query files were deleted and regenerated from the latest Wikidata lexeme dump.

I'm not sure I support this. When parsing the Wikidata lexeme dump and querying with the SPARQL query files, I saw that the SPARQL query files sometimes gave better results than the Wikidata lexeme dump. That is, the manually written SPARQL query files have certain forms that the dump file doesn't, and vice versa.

When scribe-android goes live, shouldn't we make sure it has as much language data available as possible? By merging the results from both sources, we can maximize coverage and data quality, ensuring the app starts with a rich and reliable language dataset.

@andrewtavis

Member

I think that there's value in both approaches - deleting the queries and keeping them. @axif0, what are the forms that are only available from the query service? Ideally we would be able to delete and regenerate, but then have those forms that are missing be accounted for in the process. This would assure that old fields that are from old labels or form combinations are no longer included in the queries :)

@harikrishnatp

Collaborator Author

You're right, thank you both for the clarification. I agree that the best approach is to create a master list that is the union of the forms from the existing SPARQL files and the new forms found in the latest Wikidata dump. This will preserve the existing forms while still allowing for a clean regeneration process.
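
A minimal sketch of that union, assuming both sources have already been reduced to a mapping from language to a set of forms. The name `build_master_forms` and this data shape are hypothetical and not taken from the codebase.

```python
def build_master_forms(existing_forms, dump_forms):
    """Union the forms from the existing SPARQL files and the dump, language by language."""
    master = {}
    for language in existing_forms.keys() | dump_forms.keys():
        # A language missing from one source contributes an empty set there.
        master[language] = existing_forms.get(language, set()) | dump_forms.get(language, set())
    return master
```

Since the result is a set per language, a form present in both sources is kept once, so the union does not reintroduce duplicates.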

@andrewtavis

Member

Closing this now given the conversation with @harikrishnatp and @axif0. Plan from here is that @harikrishnatp will be opening a new PR on this based on using the Wikidata Query Service to derive all form property combinations for language and data type pairs.

Development

Successfully merging this pull request may close these issues.

Query generation is including the same query form in more than one query

3 participants