#620 Add support for shards - SolrSpout (#1343)

* #620 update spout to fetch from the corresponding shard
* #620 add Solr scripts
* #620 fix tests to operate in cloud mode
* #620 fix code format
* #620 add Solr spout test
* #620 add license
* #620 improve the Solr related scripts
* #620 add solr archetype, update readmes
* #620 minor fixes
* #620 do not set the 'shard' query parameter when we have a single shard
* #620 fix archetype includes, improve scripts and configuration files
* #620 fix java topologies
* #620 add 'injection.flux' topology
* #620 bring in change from #1390
* #620 update sample flux topologies and readme
* #620 minor comments and readme changes
apache · Nov 10, 2024 · f53e89a
1 parent 282c04c
Showing 38 changed files with 1,379 additions and 327 deletions.
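The shard handling mentioned in the commit message ("fetch from the corresponding shard", "do not set the 'shard' query parameter when we have a single shard") can be pictured with a small sketch. This is not the code from this commit. The class and method names, the modulo mapping from spout task index to shard, and the logical shard names (`shard1`, `shard2`, ...) are assumptions made for illustration. Solr's own request parameter for restricting a distributed query is spelled `shards`, which is what the sketch uses.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ShardParamSketch {

    /**
     * Builds the query parameters for one spout task.
     *
     * @param taskIndex   0-based index of the spout instance (hypothetical input)
     * @param totalShards number of shards in the collection
     */
    public static Map<String, String> queryParams(int taskIndex, int totalShards) {
        if (taskIndex < 0 || totalShards < 1) {
            throw new IllegalArgumentException("taskIndex must be >= 0 and totalShards >= 1");
        }
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "*:*");
        // With a single shard there is nothing to restrict, so the parameter is left out.
        if (totalShards > 1) {
            // Assumed mapping: task 0 -> shard1, task 1 -> shard2, ...
            int shardNumber = (taskIndex % totalShards) + 1;
            params.put("shards", "shard" + shardNumber);
        }
        return params;
    }

    public static void main(String[] args) {
        System.out.println(queryParams(0, 1));
        System.out.println(queryParams(2, 3));
    }
}
```

Leaving the parameter out in the single-shard case keeps the request identical to a plain, non-sharded query.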
diff --git a/external/solr/README.md b/external/solr/README.md
@@ -1,117 +1,30 @@
-stormcrawler-solr
-==================
+# stormcrawler-solr
 
-Set of Solr resources for StormCrawler that allows you to create topologies that consume from a Solr collection and store metrics, status or parsed content into Solr.
+Set of [Apache Solr](https://solr.apache.org/) resources for StormCrawler that allows you to create topologies that consume from a Solr collection and store metrics, status or parsed content into Solr.
 
-## How to use
+## Getting started
 
-In your project you can use this by adding the following dependency:
+The easiest way is currently to use the archetype for Solr with:
 
-```xml
-<dependency>
-    <groupId>org.apache.stormcrawler</groupId>
-    <artifactId>stormcrawler-solr</artifactId>
-    <version>${stormcrawler.version}</version>
-</dependency>
-```
+`mvn archetype:generate -DarchetypeGroupId=org.apache.stormcrawler -DarchetypeArtifactId=stormcrawler-solr-archetype -DarchetypeVersion=3.1.1-SNAPSHOT`
 
-## Available resources
-
-* `IndexerBolt`: Implementation of `AbstractIndexerBolt` that allows to index the parsed data and metadata into a specified Solr collection.
-
-* `MetricsConsumer`: Class that allows to store Storm metrics in Solr.
-
-* `SolrSpout`: Spout that allows to get URLs from a specified Solr collection.
-
-* `StatusUpdaterBolt`: Implementation of `AbstractStatusUpdaterBolt` that allows to store the status of each URL along with the serialized metadata in Solr.
-
-* `SolrCrawlTopology`: Example implementation of a topology that use the provided classes, this is intended as an example or a guide on how to use this resources.
-
-* `SeedInjector`: Topology that allow to read URLs from a specified file and store the URLs in a Solr collection using the `StatusUpdaterBolt`. This can be used as a starting point to inject URLs into Solr.
-
-## Configuration options
-
-The available configuration options can be found in the [`solr-conf.yaml`](solr-conf.yaml) file.
-
-For configuring the connection with the Solr server, the following parameters are available: `solr.TYPE.url`, `solr.TYPE.zkhost`, `solr.TYPE.collection`.
-
-> In the previous example `TYPE` can be one of the following values:
-
-> * `indexer`: To reference the configuration parameters of the `IndexerBolt` class.
-> * `status`: To reference the configuration parameters of the `SolrSpout` and `StatusUpdaterBolt` classes.
-> * `metrics`: To reference the configuration parameters of the `MetricsConsumer` class.
-
-> *Note: Some of this classes provide additional parameter configurations.*
-
-### General parameters
-
-* `solr.TYPE.url`: The URL of the Solr server including the name of the collection that you want to use.
-
-## Additional configuration options
-
-#### MetricsConsumer
-
-In the case of the `MetricsConsumer` class a couple of additional configuration parameters are provided to use the [Document Expiration](https://lucidworks.com/blog/document-expiration/) feature available in Solr since version 4.8.
+You'll be asked to enter a groupId (e.g. com.mycompany.crawler), an artefactId (e.g. stormcrawler), a version, a package name and details about the user agent to use.
 
-* `solr.metrics.ttl`: [Date expression](https://cwiki.apache.org/confluence/display/solr/Working+with+Dates) to specify when the document should expire.
-* `solr.metrics.ttl.field`: Field to be used to specify the [date expression](https://cwiki.apache.org/confluence/display/solr/Working+with+Dates) that defines when the document should expire.
+This will not only create a fully formed project containing a POM with the dependency above but also a set of resources, configuration files and sample topology classes. Enter the directory you just created (should be the same as the artefactId you specified earlier) and follow the instructions on the README file.
 
-*Note: The date expression specified in the `solr.metrics.ttl` parameter is not validated. To use this feature some changes in the Solr configuration must be done.*
+You will of course need to have both Apache Storm (2.7.0) and Apache Solr (9.7.0) installed.
 
-#### SolrSpout
+Official references:
+* [Apache Storm: Setting Up a Development Environment](https://storm.apache.org/releases/current/Setting-up-development-environment.html)
+* [Apache Solr: Installation & Deployment](https://solr.apache.org/guide/solr/latest/deployment-guide/installing-solr.html)
 
-For the `SolrSpout` class a couple of additional configuration parameters are available to guarantee some *diversity* in the URLs fetched from Solr, in the case that you want to have better coverage of your URLs. This is done using the [collapse and expand](https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results) feature available in Solr.
-
-* `solr.status.bucket.field`: Field to be used to collapse the documents.
-* `solr.status.bucket.maxsize`: Amount of documents to return for each *bucket*.
-
-For instance if you are crawling URLs from different domains, perhaps is of your interest to *balance* the amount of URLs to be processed from each domain, instead of crawling all the available URLs from one domain and then the other.
-
-For this scenario you'll want to collapse on the `host` field (that already is indexed by the `StatusUpdaterBolt`) and perhaps you just want to crawl 100 URLs per domain. For this case is enough to add this to your configuration:
-
-```yaml
-solr.status.bucket.field: host
-solr.status.bucket.maxsize: 100
-```
-
-This feature can be combined with the [partition features](https://github.com/apache/incubator-stormcrawler/wiki/Configuration#fetching-and-partitioning) provided by StormCrawler to balance the crawling process and not just the URL coverage.
-
-### Metadata
-
-The metadata associated with each URL is also persisted in the Solr collection configured. By default the metadata is stored as separated fields in the collection using a prefix that can be configured using the `solr.status.metadata.prefix` option. If no value is supplied for this option the `metadata` value is used. Take a look at the following example record:
-
-```json
-{
-  "url": "http://test.com",
-  "host": "test.com",
-  "status": "DISCOVERED",
-  "metadata.url.path": "http://test.com",
-  "metadata.depth": "1",
-  "nextFetchDate": "2015-10-30T17:26:34.386Z"
-}
-```
-
-In the previous example the `metadata.url.path` and `metadata.depth` attributes are elements taken from the `metadata` object. If the `SolrSpout` class is used to fetch URLs from Solr, the configured prefix (`metadata.` in this case) is stripped before populating the `Metadata` instance.
-
-## Using SolrCloud
-
-To use a SolrCloud cluster instead of a single Solr server, you must use the following configuration parameters **instead** of the `solr.TYPE.url`:
-
-* `solr.TYPE.zkhost`: URL of the Zookeeper host that holds the information regarding the SolrCloud cluster.
-
-* `solr.TYPE.collection`: Name of the collection that you wish to use.
-
-## Solr configuration
-
-An example collection configuration for each type of data is also provided in the [`cores`](cores) directory. The configuration is very basic but it will allow you to view all the stored data in Solr.
-
-The configuration is only useful as a testing resource, mainly because everything is stored as a `Solr.StrField` which is not very useful for search purposes. Numeric values and dates are also **stored as strings** using dynamic fields.
+## Available resources
 
-In the `metrics` collection an `id` field is configured to be populated with an auto-generated UUID for each document, this configuration is placed in the `solrconfig.xml` file. The `id` field will be used as the `uniqueKey`.
+* [IndexerBolt](https://github.com/apache/incubator-stormcrawler/blob/main/external/solr/src/main/java/org/apache/stormcrawler/solr/bolt/IndexerBolt.java): Implementation of [AbstractIndexerBolt](https://github.com/apache/incubator-stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/indexing/AbstractIndexerBolt.java) that allows to index the parsed data and metadata into a specified Solr collection.
 
-In the `parse` and `status` cores the `uniqueKey` is defined to be the `url` field.
+* [MetricsConsumer](https://github.com/apache/incubator-stormcrawler/blob/main/external/solr/src/main/java/org/apache/stormcrawler/solr/metrics/MetricsConsumer.java): Class that allows to store Storm metrics in Solr.
 
-Also keep in mind that depending on your needs you can use the [Schemaless Mode](https://cwiki.apache.org/confluence/display/solr/Schemaless+Mode) available in Solr.
+* [SolrSpout](https://github.com/apache/incubator-stormcrawler/blob/main/external/solr/src/main/java/org/apache/stormcrawler/solr/persistence/SolrSpout.java): Spout that allows to get URLs from a specified Solr collection.
 
-To start SOLR with the preconfigured cores for StormCrawler, you can do `bin/solr start -s stormcrawler/external/solr/cores`, then open the SOLR UI (http://localhost:8983) to check that they have been loaded correctly. Alternatively, create the cores (here `status`) by `bin/solr create -c status -d stormcrawler/external/solr/cores/status/`.
+* [StatusUpdaterBolt](https://github.com/apache/incubator-stormcrawler/blob/main/external/solr/src/main/java/org/apache/stormcrawler/solr/persistence/StatusUpdaterBolt.java): Implementation of [AbstractStatusUpdaterBolt](https://github.com/apache/incubator-stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/persistence/AbstractStatusUpdaterBolt.java) that allows to store the status of each URL along with the serialized metadata in Solr.
 
diff --git a/external/solr/archetype/pom.xml b/external/solr/archetype/pom.xml
@@ -0,0 +1,72 @@
+<?xml version="1.0" encoding="UTF-8"?>
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
+    <modelVersion>4.0.0</modelVersion>
+
+    <parent>
+        <groupId>org.apache.stormcrawler</groupId>
+        <artifactId>stormcrawler</artifactId>
+        <version>3.1.1-SNAPSHOT</version>
+        <relativePath>../../../pom.xml</relativePath>
+    </parent>
+
+    <artifactId>stormcrawler-solr-archetype</artifactId>
+
+    <packaging>maven-archetype</packaging>
+
+    <build>
+
+        <resources>
+            <resource>
+                <directory>src/main/resources</directory>
+                <filtering>true</filtering>
+                <includes>
+                    <include>META-INF/maven/archetype-metadata.xml</include>
+                </includes>
+            </resource>
+            <resource>
+                <directory>src/main/resources</directory>
+                <filtering>false</filtering>
+                <excludes>
+                    <exclude>META-INF/maven/archetype-metadata.xml</exclude>
+                </excludes>
+            </resource>
+        </resources>
+
+        <extensions>
+            <extension>
+                <groupId>org.apache.maven.archetype</groupId>
+                <artifactId>archetype-packaging</artifactId>
+                <version>3.3.1</version>
+            </extension>
+        </extensions>
+
+        <pluginManagement>
+            <plugins>
+                <plugin>
+                    <artifactId>maven-archetype-plugin</artifactId>
+                    <version>3.3.1</version>
+                </plugin>
+            </plugins>
+        </pluginManagement>
+    </build>
+</project>
diff --git a/external/solr/archetype/src/main/resources/META-INF/archetype-post-generate.groovy b/external/solr/archetype/src/main/resources/META-INF/archetype-post-generate.groovy
@@ -0,0 +1,21 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to you under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+def file1 = new File(request.getOutputDirectory(), request.getArtifactId() + "/setup-solr.sh")
+file1.setExecutable(true, false)
+
+def file2 = new File(request.getOutputDirectory(), request.getArtifactId() + "/clear-collections.sh")
+file2.setExecutable(true, false)
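The post-generate script above calls `java.io.File#setExecutable(boolean, boolean)` and discards the result. That method reports failure through its boolean return value rather than an exception, so a missing script would pass silently. A minimal Java sketch of the same call with the result checked follows. The class and method names are hypothetical, and the temporary file only stands in for `setup-solr.sh`.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

public class ExecutableFlagSketch {

    /** Marks the file as executable for all users; returns false if the change failed. */
    public static boolean makeExecutableForAll(File script) {
        // Second argument is ownerOnly: false applies the permission to everybody.
        return script.setExecutable(true, false);
    }

    public static void main(String[] args) throws IOException {
        // Temporary stand-in for a generated script such as setup-solr.sh
        File script = Files.createTempFile("setup-solr", ".sh").toFile();
        script.deleteOnExit();

        boolean changed = makeExecutableForAll(script);
        System.out.println("permission change succeeded: " + changed);

        // A path that does not exist cannot be changed, so the call returns false.
        File missing = new File("no-such-dir/clear-collections.sh");
        System.out.println("missing file succeeded: " + makeExecutableForAll(missing));
    }
}
```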
diff --git a/external/solr/archetype/src/main/resources/META-INF/maven/archetype-metadata.xml b/external/solr/archetype/src/main/resources/META-INF/maven/archetype-metadata.xml
@@ -0,0 +1,84 @@
+<?xml version="1.0" encoding="UTF-8"?>
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+<archetype-descriptor
+        xmlns="https://maven.apache.org/plugins/maven-archetype-plugin/archetype-descriptor/1.1.0"
+        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+        xsi:schemaLocation="https://maven.apache.org/plugins/maven-archetype-plugin/archetype-descriptor/1.1.0 http://maven.apache.org/xsd/archetype-descriptor-1.1.0.xsd"
+        name="stormcrawler-core">
+
+    <requiredProperties>
+        <requiredProperty key="http-agent-name">
+            <validationRegex>^[a-zA-Z_\-]+$</validationRegex>
+        </requiredProperty>
+        <requiredProperty key="http-agent-version" />
+        <requiredProperty key="http-agent-description" />
+        <requiredProperty key="http-agent-url" />
+        <requiredProperty key="http-agent-email">
+            <validationRegex>^\S+@\S+\.\S+$</validationRegex>
+        </requiredProperty>
+        <requiredProperty key="StormCrawlerVersion">
+            <defaultValue>${project.version}</defaultValue>
+        </requiredProperty>
+    </requiredProperties>
+
+    <fileSets>
+        <fileSet filtered="true" packaged="true" encoding="UTF-8">
+            <directory>src/main/java</directory>
+            <includes>
+                <include>**/*.java</include>
+            </includes>
+        </fileSet>
+        <fileSet filtered="true" encoding="UTF-8">
+            <directory>src/main/resources</directory>
+            <includes>
+                <include>**/*.xml</include>
+                <include>**/*.txt</include>
+                <include>**/*.yaml</include>
+                <include>**/*.json</include>
+            </includes>
+        </fileSet>
+        <fileSet filtered="true" encoding="UTF-8">
+            <directory/>
+            <includes>
+                <include>*.yaml</include>
+                <include>*.flux</include>
+                <include>seeds.txt</include>
+                <include>README.md</include>
+            </includes>
+        </fileSet>
+        <fileSet filtered="false" encoding="UTF-8">
+            <directory/>
+            <includes>
+                <include>configsets</include>
+                <include>setup-solr.sh</include>
+                <include>clear-collections.sh</include>
+            </includes>
+        </fileSet>
+        <fileSet filtered="false" encoding="UTF-8">
+            <directory>configsets</directory>
+            <includes>
+                <include>**/*</include>
+            </includes>
+        </fileSet>
+    </fileSets>
+
+</archetype-descriptor>
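The two `validationRegex` patterns in the descriptor above can be tried in isolation before running `archetype:generate`. The sketch below copies the patterns verbatim and applies them with full-string matching. The class and method names are hypothetical, and how the archetype plugin itself applies the patterns is not shown in this diff.

```java
import java.util.regex.Pattern;

public class AgentPropertyCheck {

    // Patterns copied from the validationRegex elements of archetype-metadata.xml
    private static final Pattern AGENT_NAME = Pattern.compile("^[a-zA-Z_\\-]+$");
    private static final Pattern AGENT_EMAIL = Pattern.compile("^\\S+@\\S+\\.\\S+$");

    /** True if the value would pass the http-agent-name pattern. */
    public static boolean isValidAgentName(String value) {
        return AGENT_NAME.matcher(value).matches();
    }

    /** True if the value would pass the http-agent-email pattern. */
    public static boolean isValidAgentEmail(String value) {
        return AGENT_EMAIL.matcher(value).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidAgentName("my-crawler"));      // true
        System.out.println(isValidAgentName("crawler2"));        // false: digits are not allowed
        System.out.println(isValidAgentEmail("me@example.com")); // true
        System.out.println(isValidAgentEmail("me@example"));     // false: no dot after the @
    }
}
```

Note that the agent name pattern accepts only letters, underscores and hyphens, so names containing digits or spaces are rejected.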