Skip to content

Elasticsearch Hive

Federico Mendez edited this page Mar 1, 2018 · 2 revisions

This page summarizes the initial evaluation of the Elasticsearch-Hive integration (https://www.elastic.co/guide/en/elasticsearch/hadoop/current/hive.html).

Experiment

  1. A Hive table 'occurrence_interpreted' with 500 million records was created from UAT data. This table contains only clean/interpreted data.
  2. An Elasticsearch hive table was created using the options
ADD JAR hdfs://ha-nn/user/hive/auxjars/elasticsearch-hadoop-hive-5.6.4.jar;
CREATE EXTERNAL TABLE occurrence_es (
  id STRING,
...
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES(
'es.nodes'= 'c4n1.gbif.org,c4n2.gbif.org,c4n3.gbif.org,c4n4.gbif.org,c4n5.gbif.org,c4n6.gbif.org,c4n7.gbif.org,c4n8.gbif.org,c4n9.gbif.org',
'es.resource' = 'occurrence_es/occurrence',
'es.index.auto.create' = 'false',
'es.nodes.wan.only' = 'false',
'es.mapping.id' = 'id',
'es.batch.size.entries'= '10000',
'es.batch.size.bytes'='4098');
  1. The ES backed table was populated using the Hive script:
ADD JAR hdfs://ha-nn/user/hive/auxjars/elasticsearch-hadoop-hive-5.6.4.jar;
set mapreduce.job.reduces=50;
set hive.exec.reducers.max=50;
INSERT INTO TABLE occurrence_es
  SELECT
    cast(gbifid as STRING) as id,
    ...
    if(decimalLatitude IS NOT NULL AND decimalLongitude IS NOT NULL,concat(decimalLatitude,',',decimalLongitude),'') AS coordinate
    FROM occurrence_interpreted
    LIMIT 1000000;

Findings

  1. The EsStorageHandler uses only one reducer when writing to an ES cluster which makes it unusable to import big dataset into ES. Additionally, even writing a small amount of records ~100K records is a very slow process. This issue has been consulted to the ES community: https://discuss.elastic.co/t/write-into-es-using-hive-uses-only-one-reducer/121998.
  2. The ES-Hive integration seems to be a better option to consolidate query mechanisms using Hive as engine.

Hive script

-- 1. Create a test table from an existing table

CREATE TABLE occurrence_interpreted (
  gbifid INT,
  abstract STRING,
  accessrights STRING,
  accrualmethod STRING,
  accrualperiodicity STRING,
  accrualpolicy STRING,
  alternative STRING,
  audience STRING,
  available STRING,
  bibliographiccitation STRING,
  conformsto STRING,
  contributor STRING,
  coverage STRING,
  created STRING,
  creator STRING,
  date_ STRING,
  dateaccepted STRING,
  datecopyrighted STRING,
  datesubmitted STRING,
  description STRING,
  educationlevel STRING,
  extent STRING,
  format_ STRING,
  hasformat STRING,
  haspart STRING,
  hasversion STRING,
  identifier STRING,
  instructionalmethod STRING,
  isformatof STRING,
  ispartof STRING,
  isreferencedby STRING,
  isreplacedby STRING,
  isrequiredby STRING,
  isversionof STRING,
  issued STRING,
  language STRING,
  license STRING,
  mediator STRING,
  medium STRING,
  modified STRING,
  provenance STRING,
  publisher STRING,
  references STRING,
  relation STRING,
  replaces STRING,
  requires STRING,
  rights STRING,
  rightsholder STRING,
  source STRING,
  spatial STRING,
  subject STRING,
  tableofcontents STRING,
  temporal STRING,
  title STRING,
  type STRING,
  valid STRING,
  institutionid STRING,
  collectionid STRING,
  datasetid STRING,
  institutioncode STRING,
  collectioncode STRING,
  datasetname STRING,
  ownerinstitutioncode STRING,
  basisofrecord STRING,
  informationwithheld STRING,
  datageneralizations STRING,
  dynamicproperties STRING,
  occurrenceid STRING,
  catalognumber STRING,
  recordnumber STRING,
  recordedby STRING,
  individualcount STRING,
  organismquantity STRING,
  organismquantitytype STRING,
  sex STRING,
  lifestage STRING,
  reproductivecondition STRING,
  behavior STRING,
  establishmentmeans STRING,
  occurrencestatus STRING,
  preparations STRING,
  disposition STRING,
  associatedreferences STRING,
  associatedsequences STRING,
  associatedtaxa STRING,
  othercatalognumbers STRING,
  occurrenceremarks STRING,
  organismid STRING,
  organismname STRING,
  organismscope STRING,
  associatedoccurrences STRING,
  associatedorganisms STRING,
  previousidentifications STRING,
  organismremarks STRING,
  materialsampleid STRING,
  eventid STRING,
  parenteventid STRING,
  fieldnumber STRING,
  eventdate STRING,
  eventtime STRING,
  startdayofyear STRING,
  enddayofyear STRING,
  year STRING,
  month STRING,
  day STRING,
  verbatimeventdate STRING,
  habitat STRING,
  samplingprotocol STRING,
  samplingeffort STRING,
  samplesizevalue STRING,
  samplesizeunit STRING,
  fieldnotes STRING,
  eventremarks STRING,
  locationid STRING,
  highergeographyid STRING,
  highergeography STRING,
  continent STRING,
  waterbody STRING,
  islandgroup STRING,
  island STRING,
  countrycode STRING,
  stateprovince STRING,
  county STRING,
  municipality STRING,
  locality STRING,
  verbatimlocality STRING,
  verbatimelevation STRING,
  verbatimdepth STRING,
  minimumdistanceabovesurfaceinmeters STRING,
  maximumdistanceabovesurfaceinmeters STRING,
  locationaccordingto STRING,
  locationremarks STRING,
  decimallatitude STRING,
  decimallongitude STRING,
  coordinateuncertaintyinmeters STRING,
  coordinateprecision STRING,
  pointradiusspatialfit STRING,
  verbatimcoordinatesystem STRING,
  verbatimsrs STRING,
  footprintwkt STRING,
  footprintsrs STRING,
  footprintspatialfit STRING,
  georeferencedby STRING,
  georeferenceddate STRING,
  georeferenceprotocol STRING,
  georeferencesources STRING,
  georeferenceverificationstatus STRING,
  georeferenceremarks STRING,
  geologicalcontextid STRING,
  earliesteonorlowesteonothem STRING,
  latesteonorhighesteonothem STRING,
  earliesteraorlowesterathem STRING,
  latesteraorhighesterathem STRING,
  earliestperiodorlowestsystem STRING,
  latestperiodorhighestsystem STRING,
  earliestepochorlowestseries STRING,
  latestepochorhighestseries STRING,
  earliestageorloweststage STRING,
  latestageorhigheststage STRING,
  lowestbiostratigraphiczone STRING,
  highestbiostratigraphiczone STRING,
  lithostratigraphicterms STRING,
  group_ STRING,
  formation STRING,
  member STRING,
  bed STRING,
  identificationid STRING,
  identificationqualifier STRING,
  typestatus STRING,
  identifiedby STRING,
  dateidentified STRING,
  identificationreferences STRING,
  identificationverificationstatus STRING,
  identificationremarks STRING,
  taxonid STRING,
  scientificnameid STRING,
  acceptednameusageid STRING,
  parentnameusageid STRING,
  originalnameusageid STRING,
  nameaccordingtoid STRING,
  namepublishedinid STRING,
  taxonconceptid STRING,
  scientificname STRING,
  acceptednameusage STRING,
  parentnameusage STRING,
  originalnameusage STRING,
  nameaccordingto STRING,
  namepublishedin STRING,
  namepublishedinyear STRING,
  higherclassification STRING,
  kingdom STRING,
  phylum STRING,
  class STRING,
  order_ STRING,
  family STRING,
  genus STRING,
  subgenus STRING,
  specificepithet STRING,
  infraspecificepithet STRING,
  taxonrank STRING,
  verbatimtaxonrank STRING,
  vernacularname STRING,
  nomenclaturalcode STRING,
  taxonomicstatus STRING,
  nomenclaturalstatus STRING,
  taxonremarks STRING,
  datasetkey STRING,
  publishingcountry STRING,
  lastinterpreted STRING,
  elevation STRING,
  elevationaccuracy STRING,
  depth STRING,
  depthaccuracy STRING,
  distanceabovesurface STRING,
  distanceabovesurfaceaccuracy STRING,
  issue STRING,
  mediatype STRING,
  hascoordinate STRING,
  hasgeospatialissues STRING,
  taxonkey STRING,
  kingdomkey STRING,
  phylumkey STRING,
  classkey STRING,
  orderkey STRING,
  familykey STRING,
  genuskey STRING,
  subgenuskey STRING,
  specieskey STRING,
  species STRING,
  genericname STRING,
  typifiedname STRING,
  protocol STRING,
  lastparsed STRING,
  lastcrawled STRING,
  repatriated STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' TBLPROPERTIES ("serialization.null.format"="");

-- 2. Populate test table
FROM uat.occurrence_hdfs
INSERT INTO TABLE es.occurrence_interpreted
  SELECT
    gbifid,
    abstract,
    accessrights,
    accrualmethod,
    accrualperiodicity,
    accrualpolicy,
    alternative,
    audience,
    available,
    bibliographiccitation,
    conformsto,
    contributor,
    coverage,
    created,
    creator,
    date_,
    dateaccepted,
    datecopyrighted,
    datesubmitted,
    description,
    educationlevel,
    extent,
    format_,
    hasformat,
    haspart,
    hasversion,
    identifier,
    instructionalmethod,
    isformatof,
    ispartof,
    isreferencedby,
    isreplacedby,
    isrequiredby,
    isversionof,
    issued,
    language,
    license,
    mediator,
    medium,
    toISO8601(modified) AS modified,
    provenance,
    publisher,
    references,
    relation,
    replaces,
    requires,
    rights,
    rightsholder,
    source,
    spatial,
    subject,
    tableofcontents,
    temporal,
    title,
    type,
    valid,
    institutionid,
    collectionid,
    datasetid,
    institutioncode,
    collectioncode,
    datasetname,
    ownerinstitutioncode,
    basisofrecord,
    informationwithheld,
    datageneralizations,
    dynamicproperties,
    occurrenceid,
    catalognumber,
    recordnumber,
    recordedby,
    individualcount,
    organismquantity,
    organismquantitytype,
    sex,
    lifestage,
    reproductivecondition,
    behavior,
    establishmentmeans,
    occurrencestatus,
    preparations,
    disposition,
    associatedreferences,
    associatedsequences,
    associatedtaxa,
    othercatalognumbers,
    occurrenceremarks,
    organismid,
    organismname,
    organismscope,
    associatedoccurrences,
    associatedorganisms,
    previousidentifications,
    organismremarks,
    materialsampleid,
    eventid,
    parenteventid,
    fieldnumber,
    toISO8601(eventdate) AS eventdate,
    eventtime,
    startdayofyear,
    enddayofyear,
    year,
    month,
    day,
    verbatimeventdate,
    habitat,
    samplingprotocol,
    samplingeffort,
    samplesizevalue,
    samplesizeunit,
    fieldnotes,
    eventremarks,
    locationid,
    highergeographyid,
    highergeography,
    continent,
    waterbody,
    islandgroup,
    island,
    countrycode,
    stateprovince,
    county,
    municipality,
    locality,
    verbatimlocality,
    verbatimelevation,
    verbatimdepth,
    minimumdistanceabovesurfaceinmeters,
    maximumdistanceabovesurfaceinmeters,
    locationaccordingto,
    locationremarks,
    decimallatitude,
    decimallongitude,
    coordinateuncertaintyinmeters,
    coordinateprecision,
    pointradiusspatialfit,
    verbatimcoordinatesystem,
    verbatimsrs,
    footprintwkt,
    footprintsrs,
    footprintspatialfit,
    georeferencedby,
    georeferenceddate,
    georeferenceprotocol,
    georeferencesources,
    georeferenceverificationstatus,
    georeferenceremarks,
    geologicalcontextid,
    earliesteonorlowesteonothem,
    latesteonorhighesteonothem,
    earliesteraorlowesterathem,
    latesteraorhighesterathem,
    earliestperiodorlowestsystem,
    latestperiodorhighestsystem,
    earliestepochorlowestseries,
    latestepochorhighestseries,
    earliestageorloweststage,
    latestageorhigheststage,
    lowestbiostratigraphiczone,
    highestbiostratigraphiczone,
    lithostratigraphicterms,
    group_,
    formation,
    member,
    bed,
    identificationid,
    identificationqualifier,
    typestatus,
    identifiedby,
    toISO8601(dateidentified) AS dateidentified,
    identificationreferences,
    identificationverificationstatus,
    identificationremarks,
    taxonid,
    scientificnameid,
    acceptednameusageid,
    parentnameusageid,
    originalnameusageid,
    nameaccordingtoid,
    namepublishedinid,
    taxonconceptid,
    scientificname,
    acceptednameusage,
    parentnameusage,
    originalnameusage,
    nameaccordingto,
    namepublishedin,
    namepublishedinyear,
    higherclassification,
    kingdom,
    phylum,
    class,
    order_,
    family,
    genus,
    subgenus,
    specificepithet,
    infraspecificepithet,
    taxonrank,
    verbatimtaxonrank,
    vernacularname,
    nomenclaturalcode,
    taxonomicstatus,
    nomenclaturalstatus,
    taxonremarks,
    datasetkey,
    publishingcountry,
    toISO8601(lastinterpreted) AS lastinterpreted,
    elevation,
    elevationaccuracy,
    depth,
    depthaccuracy,
    distanceabovesurface,
    distanceabovesurfaceaccuracy,
    if(issue IS NULL,'',joinArray(issue,'\\;')) AS issue,
    if(mediatype IS NULL,'',joinArray(mediatype,'\\;')) AS mediatype,
    hascoordinate,
    hasgeospatialissues,
    taxonkey,
    kingdomkey,
    phylumkey,
    classkey,
    orderkey,
    familykey,
    genuskey,
    subgenuskey,
    specieskey,
    species,
    genericname,
    typifiedname,
    protocol,
    toISO8601(lastparsed) AS lastparsed,
    toISO8601(lastcrawled) AS lastcrawled,
    repatriated;

-- 3. Elasticsearch-Hive table

ADD JAR hdfs://ha-nn/user/hive/auxjars/elasticsearch-hadoop-hive-5.6.4.jar;
CREATE EXTERNAL TABLE occurrence_es (
  id STRING,
  abstract STRING,
  accessrights STRING,
  accrualmethod STRING,
  accrualperiodicity STRING,
  accrualpolicy STRING,
  alternative STRING,
  audience STRING,
  available STRING,
  bibliographiccitation STRING,
  conformsto STRING,
  contributor STRING,
  coverage STRING,
  created STRING,
  creator STRING,
  date_ STRING,
  dateaccepted STRING,
  datecopyrighted STRING,
  datesubmitted STRING,
  description STRING,
  educationlevel STRING,
  extent STRING,
  format_ STRING,
  hasformat STRING,
  haspart STRING,
  hasversion STRING,
  identifier STRING,
  instructionalmethod STRING,
  isformatof STRING,
  ispartof STRING,
  isreferencedby STRING,
  isreplacedby STRING,
  isrequiredby STRING,
  isversionof STRING,
  issued STRING,
  language STRING,
  license STRING,
  mediator STRING,
  medium STRING,
  modified STRING,
  provenance STRING,
  publisher STRING,
  references STRING,
  relation STRING,
  replaces STRING,
  requires STRING,
  rights STRING,
  rightsholder STRING,
  source STRING,
  spatial STRING,
  subject STRING,
  tableofcontents STRING,
  temporal STRING,
  title STRING,
  type STRING,
  valid STRING,
  institutionid STRING,
  collectionid STRING,
  datasetid STRING,
  institutioncode STRING,
  collectioncode STRING,
  datasetname STRING,
  ownerinstitutioncode STRING,
  basisofrecord STRING,
  informationwithheld STRING,
  datageneralizations STRING,
  dynamicproperties STRING,
  occurrenceid STRING,
  catalognumber STRING,
  recordnumber STRING,
  recordedby STRING,
  individualcount STRING,
  organismquantity STRING,
  organismquantitytype STRING,
  sex STRING,
  lifestage STRING,
  reproductivecondition STRING,
  behavior STRING,
  establishmentmeans STRING,
  occurrencestatus STRING,
  preparations STRING,
  disposition STRING,
  associatedreferences STRING,
  associatedsequences STRING,
  associatedtaxa STRING,
  othercatalognumbers STRING,
  occurrenceremarks STRING,
  organismid STRING,
  organismname STRING,
  organismscope STRING,
  associatedoccurrences STRING,
  associatedorganisms STRING,
  previousidentifications STRING,
  organismremarks STRING,
  materialsampleid STRING,
  eventid STRING,
  parenteventid STRING,
  fieldnumber STRING,
  eventdate STRING,
  eventtime STRING,
  startdayofyear STRING,
  enddayofyear STRING,
  year STRING,
  month STRING,
  day STRING,
  verbatimeventdate STRING,
  habitat STRING,
  samplingprotocol STRING,
  samplingeffort STRING,
  samplesizevalue STRING,
  samplesizeunit STRING,
  fieldnotes STRING,
  eventremarks STRING,
  locationid STRING,
  highergeographyid STRING,
  highergeography STRING,
  continent STRING,
  waterbody STRING,
  islandgroup STRING,
  island STRING,
  countrycode STRING,
  stateprovince STRING,
  county STRING,
  municipality STRING,
  locality STRING,
  verbatimlocality STRING,
  verbatimelevation STRING,
  verbatimdepth STRING,
  minimumdistanceabovesurfaceinmeters STRING,
  maximumdistanceabovesurfaceinmeters STRING,
  locationaccordingto STRING,
  locationremarks STRING,
  decimallatitude STRING,
  decimallongitude STRING,
  coordinateuncertaintyinmeters STRING,
  coordinateprecision STRING,
  pointradiusspatialfit STRING,
  verbatimcoordinatesystem STRING,
  verbatimsrs STRING,
  footprintwkt STRING,
  footprintsrs STRING,
  footprintspatialfit STRING,
  georeferencedby STRING,
  georeferenceddate STRING,
  georeferenceprotocol STRING,
  georeferencesources STRING,
  georeferenceverificationstatus STRING,
  georeferenceremarks STRING,
  geologicalcontextid STRING,
  earliesteonorlowesteonothem STRING,
  latesteonorhighesteonothem STRING,
  earliesteraorlowesterathem STRING,
  latesteraorhighesterathem STRING,
  earliestperiodorlowestsystem STRING,
  latestperiodorhighestsystem STRING,
  earliestepochorlowestseries STRING,
  latestepochorhighestseries STRING,
  earliestageorloweststage STRING,
  latestageorhigheststage STRING,
  lowestbiostratigraphiczone STRING,
  highestbiostratigraphiczone STRING,
  lithostratigraphicterms STRING,
  group_ STRING,
  formation STRING,
  member STRING,
  bed STRING,
  identificationid STRING,
  identificationqualifier STRING,
  typestatus STRING,
  identifiedby STRING,
  dateidentified STRING,
  identificationreferences STRING,
  identificationverificationstatus STRING,
  identificationremarks STRING,
  taxonid STRING,
  scientificnameid STRING,
  acceptednameusageid STRING,
  parentnameusageid STRING,
  originalnameusageid STRING,
  nameaccordingtoid STRING,
  namepublishedinid STRING,
  taxonconceptid STRING,
  scientificname STRING,
  acceptednameusage STRING,
  parentnameusage STRING,
  originalnameusage STRING,
  nameaccordingto STRING,
  namepublishedin STRING,
  namepublishedinyear STRING,
  higherclassification STRING,
  kingdom STRING,
  phylum STRING,
  class STRING,
  order_ STRING,
  family STRING,
  genus STRING,
  subgenus STRING,
  specificepithet STRING,
  infraspecificepithet STRING,
  taxonrank STRING,
  verbatimtaxonrank STRING,
  vernacularname STRING,
  nomenclaturalcode STRING,
  taxonomicstatus STRING,
  nomenclaturalstatus STRING,
  taxonremarks STRING,
  datasetkey STRING,
  publishingcountry STRING,
  lastinterpreted STRING,
  elevation STRING,
  elevationaccuracy STRING,
  depth STRING,
  depthaccuracy STRING,
  distanceabovesurface STRING,
  distanceabovesurfaceaccuracy STRING,
  issue STRING,
  mediatype STRING,
  hascoordinate STRING,
  hasgeospatialissues STRING,
  taxonkey STRING,
  kingdomkey STRING,
  phylumkey STRING,
  classkey STRING,
  orderkey STRING,
  familykey STRING,
  genuskey STRING,
  subgenuskey STRING,
  specieskey STRING,
  species STRING,
  genericname STRING,
  typifiedname STRING,
  protocol STRING,
  lastparsed STRING,
  lastcrawled STRING,
  repatriated STRING,
  coordinate STRING)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES(
'es.nodes'= 'c4n1.gbif.org,c4n2.gbif.org,c4n3.gbif.org,c4n4.gbif.org,c4n5.gbif.org,c4n6.gbif.org,c4n7.gbif.org,c4n8.gbif.org,c4n9.gbif.org',
'es.resource' = 'occurrence_es/occurrence',
'es.index.auto.create' = 'false',
'es.nodes.wan.only' = 'false',
'es.mapping.id' = 'id',
'es.batch.size.entries'= '10000',
'es.batch.size.bytes'='4098');


-- 4. Load data into ES

set mapreduce.job.reduces=50;
set hive.exec.reducers.max=50;

INSERT INTO TABLE occurrence_es
  SELECT
    cast(gbifid as STRING) as id,
    abstract,
    accessrights,
    accrualmethod,
    accrualperiodicity,
    accrualpolicy,
    alternative,
    audience,
    available,
    bibliographiccitation,
    conformsto,
    contributor,
    coverage,
    created,
    creator,
    date_,
    dateaccepted,
    datecopyrighted,
    datesubmitted,
    description,
    educationlevel,
    extent,
    format_,
    hasformat,
    haspart,
    hasversion,
    identifier,
    instructionalmethod,
    isformatof,
    ispartof,
    isreferencedby,
    isreplacedby,
    isrequiredby,
    isversionof,
    issued,
    language,
    license,
    mediator,
    medium,
    modified,
    provenance,
    publisher,
    references,
    relation,
    replaces,
    requires,
    rights,
    rightsholder,
    source,
    spatial,
    subject,
    tableofcontents,
    temporal,
    title,
    type,
    valid,
    institutionid,
    collectionid,
    datasetid,
    institutioncode,
    collectioncode,
    datasetname,
    ownerinstitutioncode,
    basisofrecord,
    informationwithheld,
    datageneralizations,
    dynamicproperties,
    occurrenceid,
    catalognumber,
    recordnumber,
    recordedby,
    individualcount,
    organismquantity,
    organismquantitytype,
    sex,
    lifestage,
    reproductivecondition,
    behavior,
    establishmentmeans,
    occurrencestatus,
    preparations,
    disposition,
    associatedreferences,
    associatedsequences,
    associatedtaxa,
    othercatalognumbers,
    occurrenceremarks,
    organismid,
    organismname,
    organismscope,
    associatedoccurrences,
    associatedorganisms,
    previousidentifications,
    organismremarks,
    materialsampleid,
    eventid,
    parenteventid,
    fieldnumber,
    eventdate,
    eventtime,
    startdayofyear,
    enddayofyear,
    year,
    month,
    day,
    verbatimeventdate,
    habitat,
    samplingprotocol,
    samplingeffort,
    samplesizevalue,
    samplesizeunit,
    fieldnotes,
    eventremarks,
    locationid,
    highergeographyid,
    highergeography,
    continent,
    waterbody,
    islandgroup,
    island,
    countrycode,
    stateprovince,
    county,
    municipality,
    locality,
    verbatimlocality,
    verbatimelevation,
    verbatimdepth,
    minimumdistanceabovesurfaceinmeters,
    maximumdistanceabovesurfaceinmeters,
    locationaccordingto,
    locationremarks,
    decimallatitude,
    decimallongitude,
    coordinateuncertaintyinmeters,
    coordinateprecision,
    pointradiusspatialfit,
    verbatimcoordinatesystem,
    verbatimsrs,
    footprintwkt,
    footprintsrs,
    footprintspatialfit,
    georeferencedby,
    georeferenceddate,
    georeferenceprotocol,
    georeferencesources,
    georeferenceverificationstatus,
    georeferenceremarks,
    geologicalcontextid,
    earliesteonorlowesteonothem,
    latesteonorhighesteonothem,
    earliesteraorlowesterathem,
    latesteraorhighesterathem,
    earliestperiodorlowestsystem,
    latestperiodorhighestsystem,
    earliestepochorlowestseries,
    latestepochorhighestseries,
    earliestageorloweststage,
    latestageorhigheststage,
    lowestbiostratigraphiczone,
    highestbiostratigraphiczone,
    lithostratigraphicterms,
    group_,
    formation,
    member,
    bed,
    identificationid,
    identificationqualifier,
    typestatus,
    identifiedby,
    dateidentified,
    identificationreferences,
    identificationverificationstatus,
    identificationremarks,
    taxonid,
    scientificnameid,
    acceptednameusageid,
    parentnameusageid,
    originalnameusageid,
    nameaccordingtoid,
    namepublishedinid,
    taxonconceptid,
    scientificname,
    acceptednameusage,
    parentnameusage,
    originalnameusage,
    nameaccordingto,
    namepublishedin,
    namepublishedinyear,
    higherclassification,
    kingdom,
    phylum,
    class,
    order_,
    family,
    genus,
    subgenus,
    specificepithet,
    infraspecificepithet,
    taxonrank,
    verbatimtaxonrank,
    vernacularname,
    nomenclaturalcode,
    taxonomicstatus,
    nomenclaturalstatus,
    taxonremarks,
    datasetkey,
    publishingcountry,
    lastinterpreted,
    elevation,
    elevationaccuracy,
    depth,
    depthaccuracy,
    distanceabovesurface,
    distanceabovesurfaceaccuracy,
    issue,
    mediatype,
    hascoordinate,
    hasgeospatialissues,
    taxonkey,
    kingdomkey,
    phylumkey,
    classkey,
    orderkey,
    familykey,
    genuskey,
    subgenuskey,
    specieskey,
    species,
    genericname,
    typifiedname,
    protocol,
    lastparsed,
    lastcrawled,
    repatriated,
    if(decimalLatitude IS NOT NULL AND decimalLongitude IS NOT NULL,concat(decimalLatitude,',',decimalLongitude),'') AS coordinate
    FROM occurrence_interpreted
    LIMIT 1000000;
Clone this wiki locally