Merged
4 changes: 2 additions & 2 deletions .github/ISSUE_TEMPLATE/bug-report.yml
@@ -8,7 +8,7 @@ body:
id: version
attributes:
label: Version
description: What version of "Apache StormCrawler (Incubating) are you using?"
description: What version of "Apache StormCrawler are you using?"
options:
- main branch
- stormcrawler-3.2.0
@@ -35,7 +35,7 @@ body:
attributes:
label: How to reproduce
placeholder: |
+ Which version of Apache StormCrawler (Incubating) version to use.
+ Which version of Apache StormCrawler version to use.
validations:
required: true
- type: textarea
2 changes: 1 addition & 1 deletion .github/ISSUE_TEMPLATE/feature-request.yml
@@ -1,6 +1,6 @@
name: Feature Request
title: "[FEATURE] "
description: Suggest an idea for Apache StormCrawler (Incubating)
description: Suggest an idea for Apache StormCrawler
labels: [ "feature" ]
body:

2 changes: 1 addition & 1 deletion .github/workflows/snapshots.yaml
@@ -25,7 +25,7 @@ on:

jobs:
upload_to_nightlies:
if: github.repository == 'apache/incubator-stormcrawler'
if: github.repository == 'apache/stormcrawler'
name: Publish Snapshots
runs-on: ubuntu-latest
steps:
10 changes: 0 additions & 10 deletions DISCLAIMER

This file was deleted.

2 changes: 1 addition & 1 deletion NOTICE
@@ -1,4 +1,4 @@
Apache StormCrawler (Incubating)
Apache StormCrawler
Copyright 2025 The Apache Software Foundation

This product includes software developed by The Apache Software
14 changes: 7 additions & 7 deletions README.md
@@ -1,11 +1,11 @@
[![StormCrawler](https://stormcrawler.apache.org/img/Logo-small.jpg)](https://stormcrawler.apache.org/)
=============

[![license](https://img.shields.io/github/license/apache/incubator-stormcrawler.svg?maxAge=2592000?style=plastic)](http://www.apache.org/licenses/LICENSE-2.0)
![Build Status](https://github.com/apache/incubator-stormcrawler/actions/workflows/maven.yml/badge.svg)
[![javadoc](https://javadoc.io/badge2/apache/incubator-stormcrawler-core/javadoc.svg)](https://javadoc.io/doc/org.apache.stormcrawler/stormcrawler-core/)
[![license](https://img.shields.io/github/license/apache/stormcrawler.svg?maxAge=2592000?style=plastic)](http://www.apache.org/licenses/LICENSE-2.0)
![Build Status](https://github.com/apache/stormcrawler/actions/workflows/maven.yml/badge.svg)
[![javadoc](https://javadoc.io/badge2/apache/stormcrawler-core/javadoc.svg)](https://javadoc.io/doc/org.apache.stormcrawler/stormcrawler-core/)

Apache StormCrawler (Incubating) is an open source collection of resources for building low-latency, scalable web crawlers on [Apache Storm](http://storm.apache.org/). It is provided under [Apache License](http://www.apache.org/licenses/LICENSE-2.0) and is written mostly in Java.
Apache StormCrawler is an open source collection of resources for building low-latency, scalable web crawlers on [Apache Storm](http://storm.apache.org/). It is provided under [Apache License](http://www.apache.org/licenses/LICENSE-2.0) and is written mostly in Java.

## Quickstart

@@ -24,13 +24,13 @@ You'll be asked to enter a groupId (e.g. com.mycompany.crawler), an artefactId (

This will not only create a fully formed project containing a POM with the dependency above but also the default resource files, a default CrawlTopology class and a configuration file. Enter the directory you just created (should be the same as the artefactId you specified earlier) and follow the instructions on the README file.

Alternatively if you can't or don't want to use the Maven archetype above, you can simply copy the files from [archetype-resources](https://github.com/apache/incubator-stormcrawler/tree/master/archetype/src/main/resources/archetype-resources).
Alternatively if you can't or don't want to use the Maven archetype above, you can simply copy the files from [archetype-resources](https://github.com/apache/stormcrawler/tree/master/archetype/src/main/resources/archetype-resources).

Have a look at [crawler.flux](https://github.com/apache/incubator-stormcrawler/blob/master/archetype/src/main/resources/archetype-resources/crawler.flux), the [crawler-conf.yaml](https://github.com/apache/incubator-stormcrawler/blob/master/archetype/src/main/resources/archetype-resources/crawler-conf.yaml) file as well as the files in [src/main/resources/](https://github.com/apache/incubator-stormcrawler/tree/master/archetype/src/main/resources/archetype-resources/src/main/resources), they are all that is needed to run a crawl topology : all the other components come from the core module.
Have a look at [crawler.flux](https://github.com/apache/stormcrawler/blob/master/archetype/src/main/resources/archetype-resources/crawler.flux), the [crawler-conf.yaml](https://github.com/apache/stormcrawler/blob/master/archetype/src/main/resources/archetype-resources/crawler-conf.yaml) file as well as the files in [src/main/resources/](https://github.com/apache/stormcrawler/tree/master/archetype/src/main/resources/archetype-resources/src/main/resources), they are all that is needed to run a crawl topology : all the other components come from the core module.

## Getting help

The [WIKI](https://github.com/apache/incubator-stormcrawler/wiki) is a good place to start your investigations but if you are stuck please use the tag [stormcrawler](http://stackoverflow.com/questions/tagged/stormcrawler) on StackOverflow or ask a question in the [discussions](https://github.com/apache/incubator-stormcrawler/discussions) section.
The [WIKI](https://github.com/apache/stormcrawler/wiki) is a good place to start your investigations but if you are stuck please use the tag [stormcrawler](http://stackoverflow.com/questions/tagged/stormcrawler) on StackOverflow or ask a question in the [discussions](https://github.com/apache/stormcrawler/discussions) section.

The project website has a page listing companies providing [commercial support](https://stormcrawler.apache.org/support/) for Apache StormCrawler.

38 changes: 19 additions & 19 deletions RELEASING.md
@@ -1,10 +1,10 @@
# Guide to release Apache StormCrawler (Incubating)
# Guide to release Apache StormCrawler

## Release Preparation

- Select a release manager on the dev mailing list. A release manager should be a committer and should preferably switch between releases to have a transfer in knowledge.
- Create an issue for a new release in <https://github.com/apache/incubator-stormcrawler/issues>
- Review all [issues](https://github.com/apache/incubator-stormcrawler/issues) associated with the release. All issues should be resolved and closed.
- Create an issue for a new release in <https://github.com/apache/stormcrawler/issues>
- Review all [issues](https://github.com/apache/stormcrawler/issues) associated with the release. All issues should be resolved and closed.
- Any issues assigned to the release that are not complete should be assigned to the next release. Any critical or blocker issues should be resolved on the mailing list. Discuss any issues that you are unsure of on the mailing list.

## Steps for the Release Manager
@@ -13,7 +13,7 @@ The following steps need only to be performed once.

- Make sure you have your PGP fingerprint added into <https://id.apache.org/>
- Make sure you have your PGP keys password.
- Add your PGP key to the [KEYS](https://dist.apache.org/repos/dist/release/incubator/stormcrawler/KEYS) file.
- Add your PGP key to the [KEYS](https://dist.apache.org/repos/dist/release/stormcrawler/KEYS) file.

Examples of adding your key to this file:

@@ -84,7 +84,7 @@ export GPG_TTY=$(tty)

## Release Steps

- Checkout the Apache StormCrawler main branch: `git clone git@github.com:apache/incubator-stormcrawler.git`
- Checkout the Apache StormCrawler main branch: `git clone git@github.com:apache/stormcrawler.git`
- Execute a complete test: `mvn test`
- Ensure to have a working Docker environment on your release machine. Otherwise, coverage computation goes wrong and the build will fail.
- Check the current results of the last GitHub action runs.
@@ -152,14 +152,14 @@ gpg --homedir . --output apache-stormcrawler-x.y.z-incubating-source-release.ta
- Run a global replace of the old version with the new version.
- Prepare a preview via the staging environment of the website.
- Ensure the website is updated on <https://stormcrawler.staged.apache.org>
- Note: Instruction on how to do so can be found on <https://github.com/apache/incubator-stormcrawler-site>
- Note: Instruction on how to do so can be found on <https://github.com/apache/stormcrawler-site>

### Create a draft release on Github

- Create a new Draft Release -- on <https://github.com/apache/incubator-stormcrawler/releases>, click `Draft a new release` and select the `stormcrawler-X.Y.Z` tag.
- Create a new Draft Release -- on <https://github.com/apache/stormcrawler/releases>, click `Draft a new release` and select the `stormcrawler-X.Y.Z` tag.
- Click the `Generate Release Notes` (**MAKE SURE TO SELECT THE CORRECT PREVIOUS RELEASE AS THE BASE**). Copy and paste the Disclaimer and Release Summary from the previous release and update the Release Summary as appropriate.
- Click the `Set as pre-release` button.
- Click `Publish release`. The release should not have `*-rc1` in its title, e.g.: `https://github.com/apache/incubator-stormcrawler/releases/tag/stormcrawler-3.2.0`
- Click `Publish release`. The release should not have `*-rc1` in its title, e.g.: `https://github.com/apache/stormcrawler/releases/tag/stormcrawler-3.2.0`

#### Create a VOTE Thread

@@ -171,28 +171,28 @@ The VOTE process is two-fold:
- Be sure to replace all values in `[]` with the appropriate values.

```bash
Message Subject: [VOTE] Apache StormCrawler (Incubating) [version] Release Candidate
Message Subject: [VOTE] Apache StormCrawler [version] Release Candidate

----
Hi folks,

I have posted a [Nth] release candidate for the Apache StormCrawler (Incubating) [version] release and it is ready for testing.
I have posted a [Nth] release candidate for the Apache StormCrawler[version] release and it is ready for testing.

<Add a summary to highlight notable changes>

Thank you to everyone who contributed to this release, including all of our users and the people who submitted bug reports,
contributed code or documentation enhancements.

The release was made using the Apache StormCrawler (Incubating) release process, documented here:
https://github.com/apache/incubator-stormcrawler/blob/main/RELEASING.md
The release was made using the Apache StormCrawler release process, documented here:
https://github.com/apache/stormcrawler/blob/main/RELEASING.md

Source:

https://dist.apache.org/repos/dist/dev/incubator/stormcrawler/stormcrawler-x.y.z-RC1

Tag:

https://github.com/apache/incubator-stormcrawler/releases/tag/stormcrawler-x.y.z
https://github.com/apache/stormcrawler/releases/tag/stormcrawler-x.y.z

Commit Hash:

@@ -250,7 +250,7 @@ The vote is successful if at least 3 *+1* votes are received from IPMC members a
Acknowledge the voting results on the mailing list in the VOTE thread by sending a mail.

```bash
Message Subject: [RESULT] [VOTE] Apache StormCrawler (Incubating) [version]
Message Subject: [RESULT] [VOTE] Apache StormCrawler [version]

Hi folks,

@@ -296,7 +296,7 @@ Remove the old releases from SVN under <https://dist.apache.org/repos/dist/relea

- Merge the release branch to `main` to start the website deployment.
- Check, that the website is deployed successfully.
- Instruction on how to do so can be found on <https://github.com/apache/incubator-stormcrawler-site>
- Instruction on how to do so can be found on <https://github.com/apache/stormcrawler-site>

### Make the release on Github

@@ -310,18 +310,18 @@ Remove the old releases from SVN under <https://dist.apache.org/repos/dist/relea
- It needs to be sent from your **@apache.org** email address or the email will bounce from the announce list.

```bash
Title: [ANNOUNCE] Apache StormCrawler (Incubating) <version> released
Title: [ANNOUNCE] Apache StormCrawler <version> released
TO: announce@apache.org, dev@stormcrawler.apache.org, general@incubator.apache.org
----

Message body:

----
The Apache StormCrawler (Incubating) team is pleased to announce the release of version <version> of Apache StormCrawler.
The Apache StormCrawler team is pleased to announce the release of version <version> of Apache StormCrawler.
StormCrawler is a collection of resources for building low-latency, customisable and scalable web crawlers on Apache Storm.

Apache StormCrawler (Incubating) <version> source distributions is available for download from our download page: https://stormcrawler.apache.org/download/index.html
Apache StormCrawler (Incubating) is distributed by Maven Central as well.
Apache StormCrawler <version> source distributions is available for download from our download page: https://stormcrawler.apache.org/download/index.html
Apache StormCrawler is distributed by Maven Central as well.

Changes in this version:

2 changes: 1 addition & 1 deletion core/pom.xml
@@ -32,7 +32,7 @@ under the License.
<packaging>jar</packaging>

<name>stormcrawler-core</name>
<url>https://github.com/apache/incubator-stormcrawler/tree/master/core</url>
<url>https://github.com/apache/stormcrawler/tree/master/core</url>
<description>StormCrawler core Java API.</description>

<properties>
@@ -509,7 +509,7 @@ public void run() {
metadata = new Metadata();
}

// https://github.com/apache/incubator-stormcrawler/issues/813
// https://github.com/apache/stormcrawler/issues/813
metadata.remove("fetch.exception");

boolean asap = false;
@@ -568,7 +568,7 @@ public void run() {
}

// has found sitemaps
// https://github.com/apache/incubator-stormcrawler/issues/710
// https://github.com/apache/stormcrawler/issues/710
// note: we don't care if the sitemap URLs where actually
// kept
boolean foundSitemap = (rules.getSitemaps().size() > 0);
@@ -732,7 +732,7 @@ public void run() {
mergedMD.setValue("_redirTo", redirection);
}

// https://github.com/apache/incubator-stormcrawler/issues/954
// https://github.com/apache/stormcrawler/issues/954
if (allowRedirs() && StringUtils.isNotBlank(redirection)) {
emitOutlink(fit.t, url, redirection, mergedMD);
}
@@ -347,7 +347,7 @@ public void execute(Tuple tuple) {
LOG.info("Found redir in {} to {}", url, redirection);
metadata.setValue("_redirTo", redirection);

// https://github.com/apache/incubator-stormcrawler/issues/954
// https://github.com/apache/stormcrawler/issues/954
if (allowRedirs() && StringUtils.isNotBlank(redirection)) {
emitOutlink(tuple, new URL(url), redirection, metadata);
}
@@ -256,7 +256,7 @@ public void execute(Tuple input) {
metadata = new Metadata();
}

// https://github.com/apache/incubator-stormcrawler/issues/813
// https://github.com/apache/stormcrawler/issues/813
metadata.remove("fetch.exception");

URL url;
@@ -326,7 +326,7 @@ public void execute(Tuple input) {
}

// has found sitemaps
// https://github.com/apache/incubator-stormcrawler/issues/710
// https://github.com/apache/stormcrawler/issues/710
// note: we don't care if the sitemap URLs where actually
// kept
boolean foundSitemap = (rules.getSitemaps().size() > 0);
@@ -50,7 +50,7 @@ public class BasicURLNormalizer extends URLFilter {
/** Nutch 1098 - finds URL encoded parts of the URL */
private static final Pattern unescapeRulePattern = Pattern.compile("%([0-9A-Fa-f]{2})");

/** https://github.com/apache/incubator-stormcrawler/issues/401 * */
/** https://github.com/apache/stormcrawler/issues/401 * */
private static final Pattern illegalEscapePattern = Pattern.compile("%u([0-9A-Fa-f]{4})");

// charset used for encoding URLs before escaping
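For context on the two patterns in this hunk, here is a minimal, self-contained sketch of what each regular expression matches. The regexes are copied from the hunk; the class name `EscapePatternSketch`, the constant names, and the helper `countMatches` are hypothetical and exist only for illustration. It does not reproduce what `BasicURLNormalizer` does with the matches.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EscapePatternSketch {
    // Same regular expressions as in the hunk above; the constant names are illustrative.
    static final Pattern UNESCAPE_RULE = Pattern.compile("%([0-9A-Fa-f]{2})");
    static final Pattern ILLEGAL_ESCAPE = Pattern.compile("%u([0-9A-Fa-f]{4})");

    // Hypothetical helper (not part of StormCrawler): counts the matches of p in input.
    static int countMatches(Pattern p, String input) {
        Matcher m = p.matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // One standard escape (%20) and one non-standard %uXXXX escape (%u00e9).
        String path = "/caf%u00e9/a%20b";
        System.out.println("standard escapes: " + countMatches(UNESCAPE_RULE, path));
        System.out.println("%u escapes: " + countMatches(ILLEGAL_ESCAPE, path));
    }
}
```

Note that the first pattern does not match `%u00e9`, because `u` is not a hexadecimal digit; only the second pattern picks up that form.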
@@ -112,7 +112,7 @@ public void loadJSONResources(InputStream inputStream)

// if it contains a single object
// jump directly to its content
// https://github.com/apache/incubator-stormcrawler/issues/1013
// https://github.com/apache/stormcrawler/issues/1013
if (rootNode.size() == 1 && rootNode.isObject()) {
rootNode = rootNode.fields().next().getValue();
}
@@ -39,8 +39,8 @@
* </pre>
*
* <p>Will be replaced by <a href=
* "https://github.com/apache/incubator-stormcrawler/issues/711">MetadataFilter to filter based on
* multiple key values</a>
* "https://github.com/apache/stormcrawler/issues/711">MetadataFilter to filter based on multiple
* key values</a>
*
* @since 1.14
*/
@@ -207,7 +207,7 @@ public void execute(Tuple tuple) {
if (!status.equals(Status.FETCH_ERROR)) {
metadata.remove(Constants.fetchErrorCountParamName);
}
// https://github.com/apache/incubator-stormcrawler/issues/415
// https://github.com/apache/stormcrawler/issues/415
// remove error related key values in case of success
if (status.equals(Status.FETCHED) || status.equals(Status.REDIRECTION)) {
metadata.remove(Constants.STATUS_ERROR_CAUSE);
@@ -58,7 +58,7 @@ public class ProtocolResponse {

/**
* @since 1.17
* @see <a href="https://github.com/apache/incubator-stormcrawler/issues/776">Issue 776</a>
* @see <a href="https://github.com/apache/stormcrawler/issues/776">Issue 776</a>
*/
public static final String PROTOCOL_MD_PREFIX_PARAM = "protocol.md.prefix";

@@ -186,7 +186,7 @@ private static String getCharsetFromMeta(byte buffer[], int maxlength) {
int start = html.indexOf("<meta charset=\"");
if (start != -1) {
int end = html.indexOf('"', start + 15);
// https://github.com/apache/incubator-stormcrawler/issues/870
// https://github.com/apache/stormcrawler/issues/870
// try on a slightly larger section of text if it is trimmed
if (end == -1 && ((maxlength + 10) < buffer.length)) {
return getCharsetFromMeta(buffer, maxlength + 10);
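The index arithmetic in this hunk can be illustrated with a simplified sketch. It assumes the input is already a decoded `String` rather than a byte buffer, and the class and method names (`MetaCharsetSketch`, `charsetFromMeta`) are hypothetical. Where the original retries on a slightly larger slice of the buffer when the closing quote is missing, this sketch simply returns `null`.

```java
public class MetaCharsetSketch {
    // The marker is 15 characters long, which is where "start + 15" in the hunk comes from.
    private static final String MARKER = "<meta charset=\"";

    // Hypothetical helper: returns the charset value, or null if it is absent or truncated.
    static String charsetFromMeta(String html) {
        int start = html.indexOf(MARKER);
        if (start == -1) {
            return null; // no <meta charset="..."> declaration found
        }
        int valueStart = start + MARKER.length();
        int end = html.indexOf('"', valueStart);
        if (end == -1) {
            // Closing quote not found: the text was probably cut off mid-attribute.
            // The original code retries with a larger window; this sketch gives up.
            return null;
        }
        return html.substring(valueStart, end);
    }

    public static void main(String[] args) {
        System.out.println(charsetFromMeta("<html><head><meta charset=\"utf-8\"></head>"));
        System.out.println(charsetFromMeta("<html><head><meta charset=\"utf"));
    }
}
```

The second call shows the truncated case that the linked issue 870 is concerned with: the declaration starts inside the inspected window, but its closing quote falls outside it.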
@@ -289,7 +289,7 @@ void testLowerCasing() throws MalformedURLException {
assertEquals(expectedResult, normalizedUrl, "Failed to filter query string");
}

// https://github.com/apache/incubator-stormcrawler/issues/401
// https://github.com/apache/stormcrawler/issues/401
@Test
void testNonStandardPercentEncoding() throws MalformedURLException {
URLFilter urlFilter = createFilter(false, false);
@@ -58,7 +58,7 @@ void testBasicExtraction() throws IOException {
}

@Test
// https://github.com/apache/incubator-stormcrawler/issues/219
// https://github.com/apache/stormcrawler/issues/219
void testScriptExtraction() throws IOException {
prepareParserBolt("test.jsoupfilters.json");
parse("https://stormcrawler.apache.org", "stormcrawler.apache.org.html");
@@ -28,7 +28,7 @@
import org.junit.jupiter.api.Test;

/**
* @see https://github.com/apache/incubator-stormcrawler/pull/653 *
* @see https://github.com/apache/stormcrawler/pull/653 *
*/
class StackOverflowTest extends ParsingTester {

@@ -47,7 +47,7 @@ void testStackOverflow() throws IOException {
}

/**
* @see https://github.com/apache/incubator-stormcrawler/issues/666
* @see https://github.com/apache/stormcrawler/issues/666
*/
@Test
void testNamespaceExtraction() throws IOException {