Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
The table of contents is too big for display.
Diff view
Diff view
  •  
  •  
  •  
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -60,3 +60,4 @@ scribe_data_wikidata_dumps_export/*
query_check_missing_features.json
query_check_result_dump.json
query_check_result_sparql.json
query_check_sparql_service_features.json
35 changes: 22 additions & 13 deletions .pre-commit-config.yaml
Original file line number Diff line number Diff line change
@@ -1,25 +1,14 @@
repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v4.5.0
rev: v6.0.0
hooks:
- id: trailing-whitespace
- id: end-of-file-fixer
- id: check-yaml
# - id: check-added-large-files

- repo: https://github.com/tcort/markdown-link-check
rev: v3.13.6
hooks:
- id: markdown-link-check
args: [-q]

- repo: https://github.com/sphinx-contrib/sphinx-lint
rev: v1.0.0
hooks:
- id: sphinx-lint

- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.8.5
rev: v0.14.5
hooks:
- id: ruff
args: [--fix]
Expand All @@ -32,3 +21,23 @@ repos:
- id: numpydoc-validation
files: ^src/
exclude: ^(tests/|.*__init__\.py$)

- repo: https://github.com/sphinx-contrib/sphinx-lint
rev: v1.0.0
hooks:
- id: sphinx-lint

- repo: https://github.com/tcort/markdown-link-check
rev: v3.14.1
hooks:
- id: markdown-link-check
args: [-q]

- repo: https://github.com/to-sta/spdx-checker-pre-commit
rev: 0.1.3
hooks:
- id: spdx-license-checker
name: run spdx-checker license check
exclude: ^(?:.*/)?__init__\.py$
args: [-l, GPL-3.0-or-later]
types_or: [python]
6 changes: 6 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,12 @@ Emojis for the following are chosen based on [gitmoji](https://gitmoji.dev/).

## [Upcoming] Scribe-Data 5.x

## Scribe-Data 5.2.0

### ✨ Features

- The SPARQL queries for the Scribe-Data CLI are generated by a process that checks the available data via the Wikidata Query Service ([#617](https://github.com/scribe-org/Scribe-Data/issues/617)).

### 🐞 Bug Fixes

- The handling of missing language directories in the SQLite conversion process has been dramatically improved to communicate to the user which languages are missing and also alert them that no SQLite databases will be created if no data is available for any of the desired languages.
Expand Down
22 changes: 4 additions & 18 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -36,13 +36,13 @@ Check out Scribe's [architecture diagrams](https://github.com/scribe-org/Organiz
- [Environment Setup](#environment-setup)
- [Featured By](#featured-by)

<a id="Process"></a>
<a id="process"></a>

# Process [`⇧`](#contents)

The CLI commands defined within [scribe_data/cli](https://github.com/scribe-org/Scribe-Data/blob/main/src/scribe_data/cli) and the notebooks within the various [scribe_data](https://github.com/scribe-org/Scribe-Data/tree/main/src/scribe_data) directories are used to update all data for [Scribe-iOS](https://github.com/scribe-org/Scribe-iOS), with this functionality later being expanded to update [Scribe-Android](https://github.com/scribe-org/Scribe-Android) and [Scribe-Desktop](https://github.com/scribe-org/Scribe-Desktop) once they're active.

The main data update process in triggers [language based SPARQL queries](https://github.com/scribe-org/Scribe-Data/tree/main/src/scribe_data/wikidata/language_data_extraction) to query language data from [Wikidata](https://www.wikidata.org/) using [SPARQLWrapper](https://github.com/RDFLib/sparqlwrapper) as a URI. The autosuggestion process derives popular words from [Wikipedia](https://www.wikipedia.org/) as well as those words that normally follow them for an effective baseline feature until natural language processing methods are employed. Functions to generate autosuggestions are ran in [gen_autosuggestions.ipynb](https://github.com/scribe-org/Scribe-Data/blob/main/src/scribe_data/wikipedia/gen_autosuggestions.ipynb). Emojis are further sourced from [Unicode CLDR](https://github.com/unicode-org/cldr), with this process being ran via the `scribe-data get -lang LANGUAGE -dt emoji-keywords` command.
The main data update process in triggers [language based SPARQL queries](https://github.com/scribe-org/Scribe-Data/tree/main/src/scribe_data/wikidata/language_data_extraction) to query language data from [Wikidata](https://www.wikidata.org/) using [SPARQLWrapper](https://github.com/RDFLib/sparqlwrapper) as a URI. The autosuggestion process derives popular words from [Wikipedia](https://www.wikipedia.org/) as well as those words that normally follow them for an effective baseline feature until natural language processing methods are employed. Functions to generate autosuggestions are ran in [gen_autosuggestions.py](https://github.com/scribe-org/Scribe-Data/blob/main/src/scribe_data/wikipedia/generate_autosuggestions.py). Emojis are further sourced from [Unicode CLDR](https://github.com/unicode-org/cldr), with this process being ran via the `scribe-data get -lang LANGUAGE -dt emoji-keywords` command.

<a id="installation"></a>

Expand Down Expand Up @@ -111,7 +111,7 @@ scribe-data total -i

[Wikidata](https://www.wikidata.org/) has lots of [language data](https://www.wikidata.org/wiki/Wikidata:Lexicographical_data) available, but not all of it is useful for all applications. In order to make the functionality of the Scribe-Data `get` requests as simple as possible, we made the decision to always return all data for the given languages and data types. Adding the ability to pass desired forms to the commands seemed cumbersome, and larger Scribe-Data requests should be parsing [Wikidata lexeme dumps](https://dumps.wikimedia.org/wikidatawiki/entities/) as the data source.

Scribe's solution to the get all functionality while preserving the ability to get specific forms is to allow users to filter the resulting data by contracts. The data contracts for Scribe's client applications can be found in the [data_contracts](./data_contracts/) directory. Data contracts are JSON objects where the values that are used in end applications are the keys and the resulting data identifiers based on Wikidata lexeme forms are the values. If the forms for a lexeme change, then the values would also change, but all that's needed is to update the contract for the application to function again.
Scribe's solution to the get all functionality while preserving the ability to get specific forms is to allow users to filter the resulting data by contracts. The data contracts for Scribe's client applications can be found in the [scribe_data_contracts](./scribe_data_contracts/) directory. Data contracts are JSON objects where the values that are used in end applications are the keys and the resulting data identifiers based on Wikidata lexeme forms are the values. If the forms for a lexeme change, then the values would also change, but all that's needed is to update the contract for the application to function again.

Efficient client application data updates using Scribe-Data follow as such:

Expand Down Expand Up @@ -275,7 +275,7 @@ See the [contribution guidelines](https://github.com/scribe-org/Scribe-Data/blob

# Featured By [`⇧`](#contents)

Please see the [blog posts page on our website](https://scri.be/docs/about/blog-posts) for a list of articles on Scribe, and feel free to open a pull request to add one that you've written at [scribe-org/scri.be](github.com/scribe-org/scri.be)!
Please see the [blog posts page on our website](https://scri.be/docs/about/blog-posts) for a list of articles on Scribe, and feel free to open a pull request to add one that you've written at [scribe-org/scri.be](https://github.com/scribe-org/scri.be)!

### Organizations

Expand Down Expand Up @@ -309,20 +309,6 @@ Many thanks to all the [Scribe-Data contributors](https://github.com/scribe-org/
<img src="https://contrib.rocks/image?repo=scribe-org/Scribe-Data" />
</a>

### Code and Dependencies

The Scribe community would like to thank all the great software that made Scribe-Data's development possible.

<details><summary><strong>List of referenced posts</strong></summary>
<p>

- [Building a Recommendation System Using Neural Network Embeddings](https://towardsdatascience.com/building-a-recommendation-system-using-neural-network-embeddings-1ef92e5c80c9) by [WillKoehrsen](https://github.com/WillKoehrsen)

- [Wikipedia Data Science: Working with the World’s Largest Encyclopedia](https://towardsdatascience.com/wikipedia-data-science-working-with-the-worlds-largest-encyclopedia-c08efbac5f5c) by [WillKoehrsen](https://github.com/WillKoehrsen)

</p>
</details>

### Wikimedia Communities

<div align="center">
Expand Down
Loading
Loading