GitHub - ericxuhao/esg-search: ESGF Search Component

DESCRIPTION

This module contains the next-generation search functionality for the Earth System Grid Federation (ESGF),
built upon the Apache Solr search engine. The package contains functionality for:

o Publishing and unpublishing search metadata records into and from a Solr server.
Metadata records are generated by harvesting a remote metadata source (a hierarchy of THREDDS catalogs,
an OAI repository, a CAS metadata catalog).

o Searching the Solr engine content via a free text or faceted search.

For installation and running instructions, see the INSTALL file.

THE SOLR SCHEMA

The XML schema used by the Solr engine determines the syntax of the metadata records to be inserted
(i.e. which fields should be mandatory, which are optional, and how all fields are parsed)
and the format of the records returned by a search. This application comes with a specific Solr schema
(located in "src/java/test/solr/conf/schema.xml") that has been customized for the ESGF.
Specifically, the ESGF Solr schema has the following features:

o Each incoming XML record must have the following MANDATORY named fields:
- "id": the unique record identifier
- "title": the title displayed when the record is found as the result of a search
- "url": the URL that is hyperlinked to the search result
- "type": the metadata record type, used to enable searching for different products. For now, hard-wired by the software to "dataset".

o Each incoming XML record may contain the following OPTIONAL named fields:
- "description": if found, it may be displayed as additional information in a search result
- "start_datetime", "stop_datetime": used to enable time searches (not yet implemented)
- "north/east/south/west_degrees": used to enable geographic searches (not yet implemented)
- "version": optional string indicating the record version, which is converted to a long number for comparisons

o Any other field found in the incoming XML record is inserted as-is (i.e. no text processing occurs) into the Solr engine,
so that it can be used for faceted searching.

o The content of all fields (mandatory named fields, optional named fields, and all other fields) is text-processed and inserted into the Solr
engine to drive the Lucene free text search.

o Upon ingestion, a timestamp indicating the last processing time is automatically associated with each incoming record.
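
As an illustration of the field rules above, the following minimal Python sketch checks the four mandatory
fields and serializes a record. It assumes the standard Solr XML update message format
(<add><doc><field name="...">); the record layout actually produced by the ESGF publishing code may differ.
The function name, the sample values and the "project" and "variable" facet fields are hypothetical.

```python
import xml.etree.ElementTree as ET

# Mandatory named fields listed in the schema description above.
MANDATORY_FIELDS = ("id", "title", "url", "type")


def build_add_document(record):
    """Serialize a dict of field name -> value (or list of values)
    as a Solr XML <add><doc> update message (assumed format)."""
    missing = [name for name in MANDATORY_FIELDS if not record.get(name)]
    if missing:
        raise ValueError("missing mandatory fields: " + ", ".join(missing))

    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in record.items():
        # A multi-valued field becomes one <field> element per value.
        values = value if isinstance(value, (list, tuple)) else [value]
        for item in values:
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(item)
    return ET.tostring(add, encoding="unicode")


# Hypothetical record: mandatory fields, two optional fields, and two
# extra fields that the schema would treat as facets.
record = {
    "id": "example.dataset.v1",
    "title": "Example dataset",
    "url": "http://example.org/thredds/catalog.xml",
    "type": "dataset",
    "description": "An illustrative record",
    "version": "1",
    "project": "demo_project",
    "variable": ["tas", "pr"],
}
xml_payload = build_add_document(record)
print(xml_payload)
```

The sketch performs no schema lookup for "project" or "variable": under the rules above, any field that is not
one of the named fields would be stored as-is and become available for faceted searching.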

USING A FACET PROFILE

Because the ESGF Solr schema treats every "unknown" field as a search facet, new facets can be harvested from a metadata source into the Solr index by simply
inserting the field (name, value(s)) pair into the Solr input document, without the need for pre-defining the facets.

The same facets can be retrieved through a query by specifying their keys as input to the search operation,
for example by mapping the requested facets to keys via the facet profile utility.
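
To make the retrieval side concrete, the sketch below builds a faceted query URL in Python using the standard
Solr request parameters (q, wt, facet, facet.field, fq). It talks to Solr directly and does not show the facet
profile utility itself. The base URL, the function name and the facet keys "project" and "variable" are
assumptions chosen for illustration.

```python
from urllib.parse import urlencode


def build_facet_query(base_url, text="*:*", facets=(), constraints=None):
    """Return a Solr select URL requesting counts for the given facet
    keys, optionally restricted by facet constraints (name -> value)."""
    params = [("q", text), ("wt", "json")]
    if facets:
        params.append(("facet", "true"))
        # One facet.field parameter per requested facet key.
        params.extend(("facet.field", name) for name in facets)
    for name, value in (constraints or {}).items():
        # Each constraint is sent as a filter query on the facet field.
        params.append(("fq", '%s:"%s"' % (name, value)))
    return base_url + "?" + urlencode(params)


# Hypothetical Solr endpoint and facet keys.
url = build_facet_query(
    "http://localhost:8983/solr/select",
    text="temperature",
    facets=["project", "variable"],
    constraints={"project": "demo_project"},
)
print(url)
```

A facet profile would play the role of the "facets" argument here: it supplies the list of keys to request,
so that adding a facet to the search interface requires no change to the index schema.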