scribe-org · andrewtavis · Nov 17, 2025 · Aug 4, 2025 · Aug 4, 2025 · Sep 29, 2025
diff --git a/.gitignore b/.gitignore
@@ -60,3 +60,4 @@ scribe_data_wikidata_dumps_export/*
 query_check_missing_features.json
 query_check_result_dump.json
 query_check_result_sparql.json
+query_check_sparql_service_features.json
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -1,25 +1,14 @@
 repos:
   - repo: https://github.com/pre-commit/pre-commit-hooks
-    rev: v4.5.0
+    rev: v6.0.0
     hooks:
       - id: trailing-whitespace
       - id: end-of-file-fixer
       - id: check-yaml
       # - id: check-added-large-files
 
-  - repo: https://github.com/tcort/markdown-link-check
-    rev: v3.13.6
-    hooks:
-      - id: markdown-link-check
-        args: [-q]
-
-  - repo: https://github.com/sphinx-contrib/sphinx-lint
-    rev: v1.0.0
-    hooks:
-      - id: sphinx-lint
-
   - repo: https://github.com/astral-sh/ruff-pre-commit
-    rev: v0.8.5
+    rev: v0.14.5
     hooks:
       - id: ruff
         args: [--fix]
@@ -32,3 +21,23 @@ repos:
       - id: numpydoc-validation
         files: ^src/
         exclude: ^(tests/|.*__init__\.py$)
+
+  - repo: https://github.com/sphinx-contrib/sphinx-lint
+    rev: v1.0.0
+    hooks:
+      - id: sphinx-lint
+
+  - repo: https://github.com/tcort/markdown-link-check
+    rev: v3.14.1
+    hooks:
+      - id: markdown-link-check
+        args: [-q]
+
+  - repo: https://github.com/to-sta/spdx-checker-pre-commit
+    rev: 0.1.3
+    hooks:
+      - id: spdx-license-checker
+        name: run spdx-checker license check
+        exclude: ^(?:.*/)?__init__\.py$
+        args: [-l, GPL-3.0-or-later]
+        types_or: [python]
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -12,6 +12,12 @@ Emojis for the following are chosen based on [gitmoji](https://gitmoji.dev/).
 
 ## [Upcoming] Scribe-Data 5.x
 
+## Scribe-Data 5.2.0
+
+### ✨ Features
+
+- The SPARQL queries for the Scribe-Data CLI are generated by a process that checks the available data via the Wikidata Query Service ([#617](https://github.com/scribe-org/Scribe-Data/issues/617)).
+
 ### 🐞 Bug Fixes
 
 - The handling of missing language directories in the SQLite conversion process has been dramatically improved to communicate to the user which languages are missing and also alert them that no SQLite databases will be created if no data is available for any of the desired languages.

diff --git a/README.md b/README.md
@@ -36,13 +36,13 @@ Check out Scribe's [architecture diagrams](https://github.com/scribe-org/Organiz
 - [Environment Setup](#environment-setup)
 - [Featured By](#featured-by)
 
-<a id="Process"></a>
+<a id="process"></a>
 
 # Process [`⇧`](#contents)
 
 The CLI commands defined within [scribe_data/cli](https://github.com/scribe-org/Scribe-Data/blob/main/src/scribe_data/cli) and the notebooks within the various [scribe_data](https://github.com/scribe-org/Scribe-Data/tree/main/src/scribe_data) directories are used to update all data for [Scribe-iOS](https://github.com/scribe-org/Scribe-iOS), with this functionality later being expanded to update [Scribe-Android](https://github.com/scribe-org/Scribe-Android) and [Scribe-Desktop](https://github.com/scribe-org/Scribe-Desktop) once they're active.
 
-The main data update process in triggers [language based SPARQL queries](https://github.com/scribe-org/Scribe-Data/tree/main/src/scribe_data/wikidata/language_data_extraction) to query language data from [Wikidata](https://www.wikidata.org/) using [SPARQLWrapper](https://github.com/RDFLib/sparqlwrapper) as a URI. The autosuggestion process derives popular words from [Wikipedia](https://www.wikipedia.org/) as well as those words that normally follow them for an effective baseline feature until natural language processing methods are employed. Functions to generate autosuggestions are ran in [gen_autosuggestions.ipynb](https://github.com/scribe-org/Scribe-Data/blob/main/src/scribe_data/wikipedia/gen_autosuggestions.ipynb). Emojis are further sourced from [Unicode CLDR](https://github.com/unicode-org/cldr), with this process being ran via the `scribe-data get -lang LANGUAGE -dt emoji-keywords` command.
+The main data update process in triggers [language based SPARQL queries](https://github.com/scribe-org/Scribe-Data/tree/main/src/scribe_data/wikidata/language_data_extraction) to query language data from [Wikidata](https://www.wikidata.org/) using [SPARQLWrapper](https://github.com/RDFLib/sparqlwrapper) as a URI. The autosuggestion process derives popular words from [Wikipedia](https://www.wikipedia.org/) as well as those words that normally follow them for an effective baseline feature until natural language processing methods are employed. Functions to generate autosuggestions are ran in [gen_autosuggestions.py](https://github.com/scribe-org/Scribe-Data/blob/main/src/scribe_data/wikipedia/generate_autosuggestions.py). Emojis are further sourced from [Unicode CLDR](https://github.com/unicode-org/cldr), with this process being ran via the `scribe-data get -lang LANGUAGE -dt emoji-keywords` command.
 
 <a id="installation"></a>
 
@@ -111,7 +111,7 @@ scribe-data total -i
 
 [Wikidata](https://www.wikidata.org/) has lots of [language data](https://www.wikidata.org/wiki/Wikidata:Lexicographical_data) available, but not all of it is useful for all applications. In order to make the functionality of the Scribe-Data `get` requests as simple as possible, we made the decision to always return all data for the given languages and data types. Adding the ability to pass desired forms to the commands seemed cumbersome, and larger Scribe-Data requests should be parsing [Wikidata lexeme dumps](https://dumps.wikimedia.org/wikidatawiki/entities/) as the data source.
 
-Scribe's solution to the get all functionality while preserving the ability to get specific forms is to allow users to filter the resulting data by contracts. The data contracts for Scribe's client applications can be found in the [data_contracts](./data_contracts/) directory. Data contracts are JSON objects where the values that are used in end applications are the keys and the resulting data identifiers based on Wikidata lexeme forms are the values. If the forms for a lexeme change, then the values would also change, but all that's needed is to update the contract for the application to function again.
+Scribe's solution to the get all functionality while preserving the ability to get specific forms is to allow users to filter the resulting data by contracts. The data contracts for Scribe's client applications can be found in the [scribe_data_contracts](./scribe_data_contracts/) directory. Data contracts are JSON objects where the values that are used in end applications are the keys and the resulting data identifiers based on Wikidata lexeme forms are the values. If the forms for a lexeme change, then the values would also change, but all that's needed is to update the contract for the application to function again.
 
 Efficient client application data updates using Scribe-Data follow as such:
 
@@ -275,7 +275,7 @@ See the [contribution guidelines](https://github.com/scribe-org/Scribe-Data/blob
 
 # Featured By [`⇧`](#contents)
 
-Please see the [blog posts page on our website](https://scri.be/docs/about/blog-posts) for a list of articles on Scribe, and feel free to open a pull request to add one that you've written at [scribe-org/scri.be](github.com/scribe-org/scri.be)!
+Please see the [blog posts page on our website](https://scri.be/docs/about/blog-posts) for a list of articles on Scribe, and feel free to open a pull request to add one that you've written at [scribe-org/scri.be](https://github.com/scribe-org/scri.be)!
 
 ### Organizations
 
@@ -309,20 +309,6 @@ Many thanks to all the [Scribe-Data contributors](https://github.com/scribe-org/
   <img src="https://contrib.rocks/image?repo=scribe-org/Scribe-Data" />
 </a>
 
-### Code and Dependencies
-
-The Scribe community would like to thank all the great software that made Scribe-Data's development possible.
-
-<details><summary><strong>List of referenced posts</strong></summary>
-<p>
-
-- [Building a Recommendation System Using Neural Network Embeddings](https://towardsdatascience.com/building-a-recommendation-system-using-neural-network-embeddings-1ef92e5c80c9) by [WillKoehrsen](https://github.com/WillKoehrsen)
-
-- [Wikipedia Data Science: Working with the World’s Largest Encyclopedia](https://towardsdatascience.com/wikipedia-data-science-working-with-the-worlds-largest-encyclopedia-c08efbac5f5c) by [WillKoehrsen](https://github.com/WillKoehrsen)
-
-</p>
-</details>
-
 ### Wikimedia Communities
 
 <div align="center">