Fix duplicate queries #626
Thank you for the pull request! ❤️ The Scribe-Data team will do our best to address your contribution as soon as we can. If you're not already a member of our public Matrix community, please consider joining! We'd suggest that you use the Element client as well as Element X for a mobile app, and definitely join the |
Maintainer Checklist
The following is a checklist for maintainers to make sure this process goes as well as possible. Feel free to address the points below yourself in further commits if you realize that actions are needed :) |
|
Amazing to have this PR opened, @harikrishnatp! Thanks so much for getting to this 😊 Really is key for the community right now :) |
|
@axif0, maybe we could look into this when we finalize the current PR for Scribe-Server as well? 😊 |
|
Thanks so much for the encouragement, and apologies again for the delay in getting to this. Excited to see it merged! 😊 |
I'm not sure I support this action. When parsing the Wikidata lexeme dump and running the SPARQL query files, I saw that the SPARQL query files sometimes gave better results than the Wikidata lexeme dump. I mean, the manually written SPARQL query files have certain forms that the dump file doesn't, and vice versa. I'm thinking that when scribe-android is live, we should make sure it has as much language data available as possible. By merging the results from both sources, we can maximize coverage and data quality, ensuring the app starts with a rich and reliable language dataset. |
|
I think that there's value in both approaches: deleting the queries and keeping them. @axif0, what are the forms that are only available from the query service? Ideally we would be able to delete and regenerate, but then have those forms that are missing be accounted for in the process. This would ensure that old fields that come from old labels or form combinations are no longer included in the queries :) |
|
You're right, thank you both for the clarification. I agree that the best approach is to create a master list that is the union of the forms from the existing SPARQL files and the new forms found in the latest Wikidata dump. This will preserve the existing forms while still allowing for a clean regeneration process. |
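To make the master-list idea concrete, here is a minimal sketch of such a union. It assumes each form can be represented as a list of grammatical feature labels; the function name `build_master_forms`, the source tags, and the sample data are hypothetical and not taken from the Scribe-Data code base. Recording the source of each form also shows which forms come only from the SPARQL files.

```python
def build_master_forms(sparql_forms, dump_forms):
    """Union two form collections, recording which source(s) supplied each form.

    Hypothetical helper: forms are assumed to be lists of feature labels.
    """
    master = {}
    for source, forms in (("sparql", sparql_forms), ("dump", dump_forms)):
        for features in forms:
            # Sort the features so ordering differences do not create duplicates.
            key = tuple(sorted(features))
            master.setdefault(key, set()).add(source)
    return master


# Illustrative data only, not real Wikidata output.
sparql_forms = [["nominative", "singular"], ["genitive", "singular"]]
dump_forms = [["singular", "nominative"], ["nominative", "plural"]]

master = build_master_forms(sparql_forms, dump_forms)

# Forms that only the hand-written SPARQL files provide.
sparql_only = sorted(key for key, sources in master.items() if sources == {"sparql"})
```

Here `sparql_only` contains just the genitive singular form, which is the kind of form that would need to be accounted for after a regeneration from the dump alone.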
|
Closing this now given the conversation with @harikrishnatp and @axif0. Plan from here is that @harikrishnatp will be opening a new PR on this based on using the Wikidata Query Service to derive all form property combinations for language and data type pairs. |
Contributor checklist
pytest command as directed in the testing section of the contributing guide
Description
This pull request resolves the issue of duplicate SPARQL query forms being generated. The fix involves an update to the query generation script so that it correctly handles the current state of both the local files and the live Wikidata dump.
Changes:
check_missing_forms was updated to correctly generate a unique, filtered set of queries for each sub-language.
As part of this fix, all SPARQL query files were deleted and regenerated from the latest Wikidata lexeme dump. This not only fixed the duplication issue but also updated the queries to reflect the current state of the data in Wikidata. This led to the removal of some outdated queries (e.g. Ukrainian) for which the data no longer appears to be available in a parsable format in the dump.
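The deduplication described above can be illustrated with a short sketch. This is not the actual check_missing_forms implementation: the function name `unique_missing_forms`, the dictionary layout, and the sample sub-language keys are assumptions made for illustration. The idea is to keep each generated form once per sub-language, regardless of feature order, and to skip forms that existing queries already cover.

```python
def unique_missing_forms(generated, existing):
    """Per sub-language, keep each generated form once and skip covered forms.

    Hypothetical helper: both arguments map a sub-language name to a list of
    forms, and each form is a list of feature labels.
    """
    result = {}
    for sub_language, forms in generated.items():
        # Start from the forms that already have a query for this sub-language.
        seen = {frozenset(f) for f in existing.get(sub_language, [])}
        kept = []
        for features in forms:
            key = frozenset(features)  # order-insensitive identity of a form
            if key in seen:
                continue
            seen.add(key)
            kept.append(sorted(features))
        result[sub_language] = kept
    return result


# Illustrative data only.
generated = {
    "bokmål": [["singular", "definite"], ["definite", "singular"], ["plural", "definite"]],
    "nynorsk": [["singular", "indefinite"]],
}
existing = {"bokmål": [["definite", "plural"]]}

missing = unique_missing_forms(generated, existing)
```

Keeping a separate `seen` set for each sub-language means a form that one sub-language already covers is still generated for another that lacks it.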
Testing:
pytest was passing successfully. An outdated test (test_list_data_types_specific_language) was also updated to reflect the changes in the available data.
Related issue