Reduce pipeline processes #298

Open
christian-monch wants to merge 68 commits into maint_0.4 from enh-reduce-pipeline-processes

@christian-monch

Collaborator

This PR partially fixes #261 and fixes #268.

This PR modifies the dataset traverser component to send almost all information it has about the traversed dataset elements to the processors. That reduces the number of processes that the processors have to execute.

@christian-monch christian-monch changed the base branch from master to maint_0.4 January 16, 2023 12:57
@christian-monch christian-monch force-pushed the enh-reduce-pipeline-processes branch from e18f550 to 113321d Compare January 26, 2023 13:30
@christian-monch christian-monch force-pushed the enh-reduce-pipeline-processes branch from 113321d to 6183a2c Compare February 27, 2023 08:02
@codecov-commenter

codecov-commenter commented Feb 27, 2023


Codecov Report

Patch and project coverage have no change.

Comparison is base (5c2181f) 86.27% compared to head (5c2181f) 86.27%.

❗ Current head 5c2181f differs from pull request most recent head 7e2b5ab. Consider uploading reports for the commit 7e2b5ab to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           maint_0.4     #298   +/-   ##
==========================================
  Coverage      86.27%   86.27%           
==========================================
  Files             88       88           
  Lines           4830     4830           
==========================================
  Hits            4167     4167           
  Misses           663      663           


@yarikoptic

Member

that's a lot of commits/work -- is it going to be merged?

This commit adds `build` and `twine` to
`requirements-devel.txt`. It also moves
the Sphinx dependencies into the development
requirements.

The datalad version requirement is updated to >=0.17.

In addition, it sorts the entries in
`requirements-devel.txt` and
`requirements.txt`.

This commit introduces AnnexedFileInfo
to hold annex-status information for a
single file. To simplify handling,
the dataclasses_json package is used
and added to the requirements.

The Python version requirement has been
set to >=3.7.

This commit extends the FileInfo dataclass
and derives the AnnexedFileInfo class from it.
The classes hold file information that is
returned by AnnexRepo.get_content_annexinfo()
or by GitRepo.status().

It adds a parameter to pass JSON-serialized
FileInfo or AnnexedFileInfo objects to the
extract process via arguments, thus removing
the need to invoke git-annex to
determine the file status.

This commit uses the --file-info parameter
to provide extractors with status information
about the element from which metadata should
be extracted.
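
To illustrate the idea of handing file status to the extract process as a JSON argument, here is a minimal sketch. It uses only the standard library instead of dataclasses_json, and every field name (`path`, `type`, `gitshasum`, `state`, `bytesize`, `key`, `has_content`) is an assumption made for illustration, not the schema used in this PR.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class FileInfo:
    # Field names are illustrative assumptions, not the PR's actual schema.
    path: str
    type: str
    gitshasum: str
    state: str = "clean"
    bytesize: Optional[int] = None

    def to_json(self) -> str:
        # Serialize all fields so the object can travel as one CLI argument.
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, text: str) -> "FileInfo":
        # Rebuild an instance of the class this method is called on.
        return cls(**json.loads(text))


@dataclass
class AnnexedFileInfo(FileInfo):
    # Extra annex-related fields (hypothetical); defaults are required
    # because the base class already has fields with defaults.
    key: str = ""
    has_content: bool = False


info = AnnexedFileInfo(
    path="data/a.dat",
    type="file",
    gitshasum="abc123",
    bytesize=3,
    key="MD5E-s3--xyz.dat",
    has_content=True,
)

# Hypothetical command line; only the --file-info option name comes from
# the commit message above.
argv = ["some-extract-command", "--file-info", info.to_json()]
```

On the receiving side, `AnnexedFileInfo.from_json(argv[2])` would restore the object without any call to git-annex.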

This commit adds code to handle repositories
that do not possess an ID; usually these are
plain git repositories.

This commit adds a pipeline provider and a pipeline
processor with definable input/output behavior. The
content that should be yielded can be defined externally,
and the rate at which content is yielded can also be
defined externally. This allows repeatable
performance measurements.
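
A provider with externally defined content and rate could look roughly like the sketch below. The name `paced_provider` and its parameters are hypothetical and do not reflect the classes added in this PR. The sleep function is injectable so that a measurement harness (or a test) can replace it.

```python
import time
from typing import Any, Callable, Iterable, Iterator


def paced_provider(
    items: Iterable[Any],
    interval: float,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[Any]:
    """Yield the given items, waiting `interval` seconds before each one."""
    for item in items:
        if interval > 0:
            sleep(interval)
        yield item


# Content and rate are both defined by the caller, which keeps runs repeatable.
for element in paced_provider([{"path": "a"}, {"path": "b"}], interval=0.01):
    print(element)
```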

This commit adds information about the object-id
and the process id (pid) of the processor and provider
probes that are executed by meta-conduct.

This commit adds an invocation count to the processor
probe that counts the invocations of this instance of
the probe.
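
As a rough sketch of what such a probe might report, the class below tags each processed element with the instance's object id, the current pid, and a per-instance invocation counter. The class name, method name, and the `"probe"` key are assumptions for illustration only.

```python
import os


class ProcessorProbe:
    """Hypothetical probe that records where and how often it was invoked."""

    def __init__(self) -> None:
        self.invocations = 0

    def process(self, element: dict) -> dict:
        # Count invocations on this particular instance.
        self.invocations += 1
        return {
            **element,
            "probe": {
                "object_id": id(self),
                "pid": os.getpid(),
                "invocation": self.invocations,
            },
        }


probe = ProcessorProbe()
first = probe.process({"path": "a"})
second = probe.process({"path": "b"})
```

Comparing `object_id` and `pid` across results shows whether elements were handled by the same instance in the same process, which is the kind of information needed to reason about process counts in a pipeline.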

This commit rebases the branch on maint_0.4
and adds code to check for the existence
of dataset IDs.

This commit fixes the reporting of datasets in
traversal.

There is still something to do for datasets, i.e.
report "state", "gitshasum", and "prev_gitshasum".

This commit adds size information to the
traverser output for non-annexed files.
Because git-ls-files does not provide
size information, an additional git-ls-tree
call is used to determine the sizes of non-annexed
files.
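
One way to obtain blob sizes with a single git-ls-tree call is the long output format (`-l`), sketched below. This is not necessarily how the PR does it; the function names are hypothetical, and the sketch assumes that `-l` together with `-r` and `-z` is an acceptable way to get the sizes.

```python
import subprocess
from typing import Dict, Optional


def parse_ls_tree_long(output: str) -> Dict[str, Optional[int]]:
    """Parse NUL-separated `git ls-tree -r -l -z` output into {path: size}."""
    sizes: Dict[str, Optional[int]] = {}
    for record in output.split("\0"):
        if not record:
            continue
        # Each record is "<mode> <type> <object> <size>\t<path>".
        meta, path = record.split("\t", 1)
        _mode, _obj_type, _sha, size = meta.split()
        # Non-blob entries such as submodules report "-" instead of a size.
        sizes[path] = int(size) if size != "-" else None
    return sizes


def tracked_file_sizes(repo_path: str, treeish: str = "HEAD") -> Dict[str, Optional[int]]:
    """Run git ls-tree once for the whole tree and return all blob sizes."""
    result = subprocess.run(
        ["git", "-C", repo_path, "ls-tree", "-r", "-l", "-z", treeish],
        check=True,
        stdout=subprocess.PIPE,
    )
    return parse_ls_tree_long(result.stdout.decode("utf-8", errors="surrogateescape"))


sample = (
    "100644 blob abc123      12\tREADME.md\0"
    "160000 commit def456       -\tsubdataset\0"
)
parsed = parse_ls_tree_long(sample)
```

The sizes reported this way are those of the blobs stored in Git, so for annexed files they would describe the symlink or pointer rather than the file content.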
@mslw

mslw commented Apr 17, 2023

Contributor

Git version 2.30.2, which is currently available in debian-stable, does not support `git ls-tree --format`. If I'm reading the docs correctly, `--format` was added in Git v2.36.0.

I very much like the spirit of this PR and I'm not sure whether debian-stable compatibility is a goal, so I'm just leaving this as an observation.

Development

Successfully merging this pull request may close these issues.

Improve file extractor performance
Lots of identical git processes, --batch and not, operating in the same repository -- expected?