Analysis of Existing Infrastructure

Barbara Hui edited this page Oct 24, 2022 · 4 revisions

[Last Updated Oct 5, 2022]

This document assumes you are familiar with the existing harvester operations, the Collection Registry, and the concept of enrichment chains.

To run the code samples in this document, you must have ssh access to the Collection Registry machine, be able to use the registry role account, and have run python manage.py shell from within the avram codebase. This document also assumes you have run the following Python code before running any of the scripts listed below:

from library_collection.models import Collection
def group_collections(collection_list, group_by, dict_name='matched'):
    group_list = []
    while len(collection_list) > 0:
        remainder = []
        matched = []
        match_value = group_by(collection_list[0])
        for collection in collection_list:
            if group_by(collection) == match_value:
                matched.append(collection)
            else:
                remainder.append(collection)
        group_list.append({
            dict_name: match_value,
            'collections': matched,
            'count': len(matched)
        })
        collection_list = remainder
    return group_list
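
To make the return shape of group_collections concrete, here is a minimal sketch that runs outside the Django shell. FakeCollection and its sample values are hypothetical stand-ins for Collection rows, and the helper is repeated only so the example is self-contained.

```python
from collections import namedtuple

# Hypothetical stand-in for a Collection row; only the grouped attribute matters.
FakeCollection = namedtuple("FakeCollection", ["name", "mapper_type"])


def group_collections(collection_list, group_by, dict_name='matched'):
    # Same logic as the helper above: repeatedly peel off every item that
    # shares the first item's group_by value.
    group_list = []
    while len(collection_list) > 0:
        remainder = []
        matched = []
        match_value = group_by(collection_list[0])
        for collection in collection_list:
            if group_by(collection) == match_value:
                matched.append(collection)
            else:
                remainder.append(collection)
        group_list.append({
            dict_name: match_value,
            'collections': matched,
            'count': len(matched)
        })
        collection_list = remainder
    return group_list


sample = [
    FakeCollection("a", "oac_dc"),
    FakeCollection("b", "omeka"),
    FakeCollection("c", "oac_dc"),
]
groups = group_collections(sample, lambda c: c.mapper_type)
summary = [(g['matched'], g['count']) for g in groups]
print(summary)  # [('oac_dc', 2), ('omeka', 1)]
```

Groups come back in first-seen order, which is why the scripts below sort by 'count' before reading the most common entry.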

Collection Counts

  • Total Collections: 27,797
  • Collections with Enrichment Chains: 3,184
  • Collections Ready for Publication: 2,358
  • Collections Ready for Publication with Enrichment Chains: 2,358
num_total_collections = len(Collection.objects.all())
num_collections_with_enrichments = len(Collection.objects.exclude(enrichments_item__exact=''))
num_collections_ready_for_publication = len(Collection.objects.exclude(ready_for_publication=False))
num_collections_ready_and_with_enrichments = len(Collection.published.all())
print(
    f"- Total Collections: {num_total_collections:,}\n"
    f"- Collections with Enrichment Chains: {num_collections_with_enrichments:,}\n"
    f"- Collections Ready for Publication: {num_collections_ready_for_publication:,}\n"
    f"- Collections Ready for Publication with Enrichment Chains: {num_collections_ready_and_with_enrichments:,}"
)

Unique Enrichment Chains

  • Unique Enrichment Chains: 133
  • Number of Collections Using the Most Common Enrichment Chain: 407 (~17.3%)
  • Number of Collections Using the 10 Most Common Enrichment Chains: 1,635 (~69.3%)
  • Number of Collections Using a Unique Enrichment Chain: 59 (~2.5%)
rich_collections = Collection.published.all()
unique_enrichments = group_collections(
    rich_collections, lambda c: c.enrichments_item)
num_enrichment_chains = len(unique_enrichments)
unique_enrichments.sort(key=lambda ue: ue['count'], reverse=True)
num_collections_using_most_common_chain = unique_enrichments[0]['count']
percent_collections_using_most_common_chain = unique_enrichments[0]['count']/len(Collection.published.all())
most_common_count = [ue['count'] for ue in unique_enrichments[:10]]
num_collections_using_10_most_common_chains = sum(most_common_count)
percent_collections_using_10_most_common_chains = sum(most_common_count)/len(Collection.published.all())
long_tail = [ue for ue in unique_enrichments if ue['count'] == 1]
num_collections_using_unique_chain = len(long_tail)
percent_collections_using_unique_chain = len(long_tail)/len(Collection.published.all())
print(
    f"- Unique Enrichment Chains: {num_enrichment_chains:,}\n"
    f"- Number of Collections Using the Most Common Enrichment Chain: "
    f"{num_collections_using_most_common_chain:,} (~{percent_collections_using_most_common_chain:.1%})\n"
    f"- Number of Collections Using the 10 Most Common Enrichment Chains: "
    f"{num_collections_using_10_most_common_chains:,} (~{percent_collections_using_10_most_common_chains:.1%})\n"
    f"- Number of Collections Using a Unique Enrichment Chain: "
    f"{num_collections_using_unique_chain:,} (~{percent_collections_using_unique_chain:.1%})"
)
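
As a quick check of the arithmetic, each percentage above is the count divided by the 2,358 published collections. The sketch below uses only the numbers reported in the bullets; the variable names are illustrative.

```python
# Counts copied from the bullets above.
total_published = 2358
counts = {
    "most common chain": 407,
    "10 most common chains": 1635,
    "unique chains": 59,
}

# Format each share the same way the script does (one decimal place).
shares = {label: f"{n / total_published:.1%}" for label, n in counts.items()}
for label, share in shares.items():
    print(f"{label}: ~{share}")
```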

ID Enrichments

The vast majority of enrichment chains start by selecting an identifier. The id_enrichment value used below is defined in ucldc/avram, in library_collection.models.Collection: id_enrichment(self).

  • Unique ID Enrichments: 14
  • Number of Collections Using the Most Common ID Enrichment: 1,429 (~61%)
  • Number of Collections Using a Unique ID Enrichment: 2

List of ID Enrichments and Count of Collections:

| ID Enrichment | Count of Collections |
| --- | --- |
| /select-id?prop=id | 1429 |
| /select-oac-id | 570 |
| /select-id?prop=uid | 278 |
| /select-cmis-atom-id | 21 |
| /select-preservica-id | 18 |
| None | 15 |
| /select-id?prop=metadata/identifier | 7 |
| select-id?prop=id | 6 |
| select-oac-id | 4 |
| /select-id?prop=PID | 3 |
| /select-id?prop=identifier | 3 |
| /csl-marc-id | 2 |
| /ucsb-aleph-marc-id | 1 |
| /sfpl-marc-id | 1 |
rich_collections = Collection.published.all()
unique_id_enrichments = group_collections(rich_collections, lambda c: c.id_enrichment)
num_id_enrichments = len(unique_id_enrichments)
unique_id_enrichments.sort(key=lambda ue: ue['count'], reverse=True)
num_collections_using_most_common_id_enrichment = unique_id_enrichments[0]['count']
percent_collections_using_most_common_chain = unique_id_enrichments[0]['count']/len(Collection.published.all())
long_tail = [ue for ue in unique_id_enrichments if ue['count'] == 1]
num_collections_using_unique_id_enrichment = len(long_tail)
table_view = ""
for ue in unique_id_enrichments:
    table_view = f"{table_view}| {ue['matched']} | {ue['count']} |\n"

print(
    f"- Unique ID Enrichments: {num_id_enrichments:,}\n"
    f"- Number of Collections Using the Most Common ID Enrichment: "
    f"{num_collections_using_most_common_id_enrichment:,} (~{percent_collections_using_most_common_chain:.0%})\n"
    f"- Number of Collections Using a Unique ID Enrichment: {num_collections_using_unique_id_enrichment:,}\n"
    f"#### List of ID Enrichments and Count of Collections:\n"
    "| ID Enrichment | Count of Collections |\n"
    "| --- | --- |\n"
    f"{table_view}"
)
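
The table lists select-id?prop=id and select-oac-id both with and without a leading slash. The sketch below shows how the counts would combine if those pairs were treated as the same enrichment. That equivalence is an assumption, not something this analysis confirms; normalize is a hypothetical helper, and the None row is assumed to be Python's None.

```python
from collections import Counter

# Counts copied from the table above.
id_enrichment_counts = {
    "/select-id?prop=id": 1429,
    "/select-oac-id": 570,
    "/select-id?prop=uid": 278,
    "/select-cmis-atom-id": 21,
    "/select-preservica-id": 18,
    None: 15,
    "/select-id?prop=metadata/identifier": 7,
    "select-id?prop=id": 6,
    "select-oac-id": 4,
    "/select-id?prop=PID": 3,
    "/select-id?prop=identifier": 3,
    "/csl-marc-id": 2,
    "/ucsb-aleph-marc-id": 1,
    "/sfpl-marc-id": 1,
}


def normalize(enrichment):
    """Hypothetical normalizer: add a leading slash when it is missing."""
    if enrichment is None:
        return None
    return enrichment if enrichment.startswith("/") else "/" + enrichment


# Sum the counts of entries that normalize to the same string.
merged = Counter()
for enrichment, count in id_enrichment_counts.items():
    merged[normalize(enrichment)] += count

for enrichment, count in merged.most_common(3):
    print(enrichment, count)
```

Under that assumption the 14 ID enrichments collapse to 12, and the total across all rows is unchanged.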

Mappers

  • Unique Mappers: 55
  • Number of Collections Using the Most Common Mapper: 571 (~24%)
  • Number of Collections Using a Unique Mapper: 14

List of Mapper Types and Count of Collections:

| Mapper Type | Count of Collections |
| --- | --- |
| /dpla_mapper?mapper_type=oac_dc | 571 |
| /dpla_mapper?mapper_type=contentdm_oai_dc | 311 |
| /dpla_mapper?mapper_type=ucsd_blacklight_dc | 292 |
| /dpla_mapper?mapper_type=ucldc_nuxeo | 278 |
| /dpla_mapper?mapper_type=cavpp_islandora | 245 |
| /dpla_mapper?mapper_type=calpoly_oai_dc | 77 |
| /dpla_mapper?mapper_type=usc_oai_dc | 75 |
| /dpla_mapper?mapper_type=chapman_oai_dc | 48 |
| /dpla_mapper?mapper_type=quartex_oai | 46 |
| /dpla_mapper?mapper_type=csa_omeka | 44 |
| /dpla_mapper?mapper_type=chs_islandora | 36 |
| /dpla_mapper?mapper_type=flickr_sppl | 33 |
| /dpla_mapper?mapper_type=omeka | 28 |
| /dpla_mapper?mapper_type=sjsu_islandora | 24 |
| /dpla_mapper?mapper_type=ucb_tind_marc | 23 |
| /dpla_mapper?mapper_type=cmis_atom | 21 |
| /dpla_mapper?mapper_type=up_oai_dc | 19 |
| /dpla_mapper?mapper_type=preservica_api | 19 |
| /dpla_mapper?mapper_type=youtube_video_snippet | 17 |
| /dpla_mapper?mapper_type=ucd_json | 13 |
| /dpla_mapper?mapper_type=burbank_islandora | 11 |
| /dpla_mapper?mapper_type=arck_oai | 11 |
| /dpla_mapper?mapper_type=ucsc_oai_dpla | 10 |
| /dpla_mapper?mapper_type=csudh_contentdm_oai_dc | 10 |
| /dpla_mapper?mapper_type=lapl_oai | 10 |
| /dpla_mapper?mapper_type=black_gold_oai | 9 |
| /dpla_mapper?mapper_type=islandora_oai_dc | 9 |
| /dpla_mapper?mapper_type=pspl_oai_dc | 8 |
| /dpla_mapper?mapper_type=chico_oai_dc | 7 |
| /dpla_mapper?mapper_type=chula_vista_pl_contentdm_oai_dc | 5 |
| /dpla_mapper?mapper_type=pastperfect_xml | 5 |
| /dpla_mapper?mapper_type=flickr_sdasm | 5 |
| /dpla_mapper?mapper_type=cca_vault_oai_dc | 4 |
| /dpla_mapper?mapper_type=yosemite_oai_dc | 4 |
| /dpla_mapper?mapper_type=ucla_solr_dc | 3 |
| /dpla_mapper?mapper_type=omeka_nothumb | 3 |
| /dpla_mapper?mapper_type=oac_dc_suppress_publisher | 2 |
| /dpla_mapper?mapper_type=csu_sac_oai_dc | 2 |
| /dpla_mapper?mapper_type=ucsf_solr | 2 |
| /dpla_mapper?mapper_type=caltech_restrict | 2 |
| /dpla_mapper?mapper_type=internet_archive | 2 |
| /dpla_mapper?mapper_type=ucsb_aleph_marc | 1 |
| /dpla_mapper?mapper_type=oac_dc_suppress_desc_2 | 1 |
| /dpla_mapper?mapper_type=sfpl_marc | 1 |
| /dpla_mapper?mapper_type=lapl_26096 | 1 |
| /dpla_mapper?mapper_type=csl_marc | 1 |
| /dpla_mapper?mapper_type=contentdm_oai_dc_get_sound_thumbs | 1 |
| /dpla_mapper?mapper_type=ucb_bampfa_solr | 1 |
| /dpla_mapper?mapper_type=csuci_mets | 1 |
| /dpla_mapper?mapper_type=emuseum_xml | 1 |
| /dpla_mapper?mapper_type=csu_dspace_mets | 1 |
| /dpla_mapper?mapper_type=flickr_api | 1 |
| /dpla_mapper?mapper_type=sierramadre_marc | 1 |
| /dpla_mapper?mapper_type=sanjose_pastperfect | 1 |
| /dpla_mapper?mapper_type=tv_academy_oai_dc | 1 |
rich_collections = Collection.published.all()
unique_mappers = group_collections(rich_collections, lambda c: c.mapper_type)
num_unique_mappers = len(unique_mappers)
unique_mappers.sort(key=lambda ue: ue['count'], reverse=True)
num_collections_using_most_common_mapper = unique_mappers[0]['count']
percent_collections_using_most_common_mapper = unique_mappers[0]['count']/len(Collection.published.all())
long_tail = [ue for ue in unique_mappers if ue['count'] == 1]
num_collections_using_unique_mapper = len(long_tail)
table_view = ""
for ue in unique_mappers:
    table_view = f"{table_view}| {ue['matched']} | {ue['count']} |\n"

print(
    f"- Unique Mappers: {num_unique_mappers:,}\n"
    f"- Number of Collections Using the Most Common Mapper: "
    f"{num_collections_using_most_common_mapper:,} (~{percent_collections_using_most_common_mapper:.0%})\n"
    f"- Number of Collections Using a Unique Mapper: {num_collections_using_unique_mapper:,}\n"
    f"#### List of Mapper Types and Count of Collections:\n"
    "| Mapper Type | Count of Collections |\n"
    "| --- | --- |\n"
    f"{table_view}"
)
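
Each mapper value in the table is a path plus a mapper_type query parameter. If only the short mapper name is needed, for example to use as a grouping key or label, it can be pulled out with the standard library. This is a minimal sketch; mapper_name is a hypothetical helper, not part of avram.

```python
from urllib.parse import urlsplit, parse_qs


def mapper_name(mapper_type):
    """Hypothetical helper: return the mapper_type query value, or None."""
    query = urlsplit(mapper_type).query
    values = parse_qs(query).get("mapper_type")
    return values[0] if values else None


name = mapper_name("/dpla_mapper?mapper_type=oac_dc")
print(name)  # oac_dc
```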

Relationship of Fetchers to Mappers

This section was last updated Oct 21, 2022.

List of Mapper Types and associated fetcher types:

| Mapper Type | Fetcher Types |
| --- | --- |
| /dpla_mapper?mapper_type=oac_dc | ['OAC'] |
| /dpla_mapper?mapper_type=ucd_json | ['UCD'] |
| /dpla_mapper?mapper_type=ucldc_nuxeo | ['NUX'] |
| /dpla_mapper?mapper_type=ucsb_aleph_marc | ['ALX'] |
| /dpla_mapper?mapper_type=ucb_tind_marc | ['OAI'] |
| /dpla_mapper?mapper_type=ucsc_oai_dpla | ['OAI'] |
| /dpla_mapper?mapper_type=ucsd_blacklight_dc | ['SLR'] |
| /dpla_mapper?mapper_type=csa_omeka | ['OAI'] |
| /dpla_mapper?mapper_type=ucla_solr_dc | ['SLR'] |
| /dpla_mapper?mapper_type=oac_dc_suppress_publisher | ['OAC'] |
| /dpla_mapper?mapper_type=quartex_oai | ['OAI'] |
| /dpla_mapper?mapper_type=sjsu_islandora | ['OAI'] |
| /dpla_mapper?mapper_type=cca_vault_oai_dc | ['OAI'] |
| /dpla_mapper?mapper_type=chs_islandora | ['OAI'] |
| /dpla_mapper?mapper_type=contentdm_oai_dc | ['OAI'] |
| /dpla_mapper?mapper_type=cmis_atom | ['PRE'] |
| /dpla_mapper?mapper_type=black_gold_oai | ['OAI'] |
| /dpla_mapper?mapper_type=calpoly_oai_dc | ['OAI'] |
| /dpla_mapper?mapper_type=csu_sac_oai_dc | ['OAI'] |
| /dpla_mapper?mapper_type=csudh_contentdm_oai_dc | ['OAI'] |
| /dpla_mapper?mapper_type=oac_dc_suppress_desc_2 | ['OAC'] |
| /dpla_mapper?mapper_type=chula_vista_pl_contentdm_oai_dc | ['OAI'] |
| /dpla_mapper?mapper_type=lapl_oai | ['OAI'] |
| /dpla_mapper?mapper_type=sfpl_marc | ['MRC'] |
| /dpla_mapper?mapper_type=lapl_26096 | ['OAI'] |
| /dpla_mapper?mapper_type=ucsf_solr | ['SFX'] |
| /dpla_mapper?mapper_type=cavpp_islandora | ['OAI'] |
| /dpla_mapper?mapper_type=up_oai_dc | ['OAI'] |
| /dpla_mapper?mapper_type=chapman_oai_dc | ['OAI'] |
| /dpla_mapper?mapper_type=preservica_api | ['PRA'] |
| /dpla_mapper?mapper_type=csl_marc | ['MRC'] |
| /dpla_mapper?mapper_type=contentdm_oai_dc_get_sound_thumbs | ['OAI'] |
| /dpla_mapper?mapper_type=pspl_oai_dc | ['OAI'] |
| /dpla_mapper?mapper_type=omeka | ['OAI'] |
| /dpla_mapper?mapper_type=chico_oai_dc | ['OAI'] |
| /dpla_mapper?mapper_type=ucb_bampfa_solr | ['UCB'] |
| /dpla_mapper?mapper_type=islandora_oai_dc | ['OAI'] |
| /dpla_mapper?mapper_type=youtube_video_snippet | ['YTB'] |
| /dpla_mapper?mapper_type=csuci_mets | ['OAI'] |
| /dpla_mapper?mapper_type=pastperfect_xml | ['XML'] |
| /dpla_mapper?mapper_type=caltech_restrict | ['OAI'] |
| /dpla_mapper?mapper_type=usc_oai_dc | ['OAI'] |
| /dpla_mapper?mapper_type=yosemite_oai_dc | ['OAI'] |
| /dpla_mapper?mapper_type=emuseum_xml | ['EMS'] |
| /dpla_mapper?mapper_type=csu_dspace_mets | ['OAI'] |
| /dpla_mapper?mapper_type=flickr_api | ['FLK'] |
| /dpla_mapper?mapper_type=sierramadre_marc | ['MRC'] |
| /dpla_mapper?mapper_type=burbank_islandora | ['OAI'] |
| /dpla_mapper?mapper_type=omeka_nothumb | ['OAI'] |
| /dpla_mapper?mapper_type=sanjose_pastperfect | ['XML'] |
| /dpla_mapper?mapper_type=tv_academy_oai_dc | ['OAI'] |
| /dpla_mapper?mapper_type=flickr_sdasm | ['FLK'] |
| /dpla_mapper?mapper_type=flickr_sppl | ['FLK'] |
| /dpla_mapper?mapper_type=internet_archive | ['IAR'] |
| /dpla_mapper?mapper_type=arck_oai | ['OAI'] |
rich_collections = Collection.published.all()
unique_mappers = group_collections(rich_collections, lambda c: c.mapper_type)

fetcher_types = {}
for m in unique_mappers:
    types = []
    for c in m['collections']:
        if c.harvest_type not in types:
            types.append(c.harvest_type)
    fetcher_types[m['matched']] = types

table_view = ""
for ft in fetcher_types:
    table_view = f"{table_view}| {ft} | {fetcher_types[ft]} |\n"

print(
    f"### List of Mapper Types and associated fetcher types:\n"
    "| Mapper Type | Fetcher Types |\n"
    "| --- | --- |\n"
    f"{table_view}"
)
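
In the table above, every mapper type is paired with exactly one fetcher type, while a fetcher type such as OAI feeds many mappers. The sketch below inverts the mapping to list mappers per fetcher, using a handful of rows copied from the table. The fetcher_types dict here is a hand-built sample, not the full output of the script.

```python
from collections import defaultdict

# A few rows copied from the table above (mapper type -> fetcher types).
fetcher_types = {
    "/dpla_mapper?mapper_type=oac_dc": ["OAC"],
    "/dpla_mapper?mapper_type=oac_dc_suppress_publisher": ["OAC"],
    "/dpla_mapper?mapper_type=ucsd_blacklight_dc": ["SLR"],
    "/dpla_mapper?mapper_type=ucla_solr_dc": ["SLR"],
    "/dpla_mapper?mapper_type=ucldc_nuxeo": ["NUX"],
}

# Invert the mapping: fetcher type -> list of mapper types it feeds.
mappers_by_fetcher = defaultdict(list)
for mapper, fetchers in fetcher_types.items():
    for fetcher in fetchers:
        mappers_by_fetcher[fetcher].append(mapper)

# Mappers fed by more than one fetcher type (empty for this sample).
multi_fetcher = [m for m, f in fetcher_types.items() if len(f) > 1]

for fetcher, mappers in mappers_by_fetcher.items():
    print(fetcher, len(mappers))
```

Run against the full fetcher_types dict built by the script above, the same loop would show how many mappers depend on each fetcher.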