Analysis of Existing Infrastructure

[Last Updated Oct 5, 2022]

This document assumes you are familiar with the existing harvester operations, the Collection Registry, and the concept of enrichment chains.

To run the code samples in this document, you need ssh access to the Collection Registry machine, must be able to use the registry role account, and must have run python manage.py shell from within the avram codebase. This document also assumes you have run the following Python code before running any of the scripts listed below:

from library_collection.models import Collection
def group_collections(collection_list, group_by, dict_name='matched'):
    group_list = []
    while len(collection_list) > 0:
        remainder = []
        matched = []
        match_value = group_by(collection_list[0])
        for collection in collection_list:
            if group_by(collection) == match_value:
                matched.append(collection)
            else:
                remainder.append(collection)
        group_list.append({
            dict_name: match_value,
            'collections': matched,
            'count': len(matched)
        })
        collection_list = remainder
    return group_list
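
The helper above rescans the remaining list once for every distinct value. For reference, a single-pass variant that returns the same list-of-dicts shape might look like the sketch below. group_by_key and the SimpleNamespace stand-ins are hypothetical names introduced only for illustration, and the variant assumes the grouping value is hashable, which the original helper does not require.

```python
from types import SimpleNamespace


def group_by_key(items, key, dict_name='matched'):
    """Group items by key(item) in one pass, in first-seen order."""
    groups = {}  # dicts keep insertion order, so groups appear as first seen
    for item in items:
        groups.setdefault(key(item), []).append(item)
    return [
        {dict_name: value, 'collections': members, 'count': len(members)}
        for value, members in groups.items()
    ]


# Hypothetical stand-ins for Collection rows; only mapper_type is modelled.
sample = [
    SimpleNamespace(id=1, mapper_type='oac_dc'),
    SimpleNamespace(id=2, mapper_type='ucldc_nuxeo'),
    SimpleNamespace(id=3, mapper_type='oac_dc'),
]

groups = group_by_key(sample, lambda c: c.mapper_type)
for group in groups:
    print(group['matched'], group['count'])
```

Each returned dict carries the matched value, the list of matching objects, and their count, which is the shape the scripts below sort and summarize.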

Collection Counts

Total Collections: 27,797
Collections with Enrichment Chains: 3,184
Collections Ready for Publication: 2,358
Collections Ready for Publication with Enrichment Chains: 2,358

# .count() asks the database for the number of rows instead of loading every object
num_total_collections = Collection.objects.all().count()
num_collections_with_enrichments = Collection.objects.exclude(enrichments_item__exact='').count()
num_collections_ready_for_publication = Collection.objects.exclude(ready_for_publication=False).count()
num_collections_ready_and_with_enrichments = Collection.published.all().count()
print(
    f"- Total Collections: {num_total_collections:,}\n"
    f"- Collections with Enrichment Chains: {num_collections_with_enrichments:,}\n"
    f"- Collections Ready for Publication: {num_collections_ready_for_publication:,}\n"
    f"- Collections Ready for Publication with Enrichment Chains: {num_collections_ready_and_with_enrichments:,}"
)

Unique Enrichment Chains

Unique Enrichment Chains: 133
Number of Collections Using the Most Common Enrichment Chain: 407 (~17.3%)
Number of Collections Using the 10 Most Common Enrichment Chains: 1,635 (~69.3%)
Number of Collections Using a Unique Enrichment Chain: 59 (~2.5%)

rich_collections = Collection.published.all()
unique_enrichments = group_collections(
    rich_collections, lambda c: c.enrichments_item)
num_enrichment_chains = len(unique_enrichments)
unique_enrichments.sort(key=lambda ue: ue['count'], reverse=True)
num_collections_using_most_common_chain = unique_enrichments[0]['count']
percent_collections_using_most_common_chain = unique_enrichments[0]['count']/len(Collection.published.all())
most_common_count = [ue['count'] for ue in unique_enrichments[:10]]
num_collections_using_10_most_common_chains = sum(most_common_count)
percent_collections_using_10_most_common_chains = sum(most_common_count)/len(Collection.published.all())
long_tail = [ue for ue in unique_enrichments if ue['count'] == 1]
num_collections_using_unique_chain = len(long_tail)
percent_collections_using_unique_chain = len(long_tail)/len(Collection.published.all())
print(
    f"- Unique Enrichment Chains: {num_enrichment_chains:,}\n"
    f"- Number of Collections Using the Most Common Enrichment Chain: "
    f"{num_collections_using_most_common_chain:,} (~{percent_collections_using_most_common_chain:.1%})\n"
    f"- Number of Collections Using the 10 Most Common Enrichment Chains: "
    f"{num_collections_using_10_most_common_chains:,} (~{percent_collections_using_10_most_common_chains:.1%})\n"
    f"- Number of Collections Using a Unique Enrichment Chain: "
    f"{num_collections_using_unique_chain:,} (~{percent_collections_using_unique_chain:.1%})"
)

ID Enrichments

The vast majority of enrichment chains start by selecting an identifier. The id_enrichment used below is defined in ucldc/avram as library_collection.models.Collection: id_enrichment(self).

Unique ID Enrichments: 14
Number of Collections Using the Most Common ID Enrichment: 1,429 (~61%)
Number of Collections Using a Unique ID Enrichment: 2

List of ID Enrichments and Count of Collections:

ID Enrichment	Count of Collections
/select-id?prop=id	1429
/select-oac-id	570
/select-id?prop=uid	278
/select-cmis-atom-id	21
/select-preservica-id	18
None	15
/select-id?prop=metadata/identifier	7
select-id?prop=id	6
select-oac-id	4
/select-id?prop=PID	3
/select-id?prop=identifier	3
/csl-marc-id	2
/ucsb-aleph-marc-id	1
/sfpl-marc-id	1
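
Two pairs of rows in the table above differ only by a leading slash (/select-id?prop=id vs. select-id?prop=id, and /select-oac-id vs. select-oac-id). Whether the harvester treats those spellings as the same enrichment is not stated here. Assuming it does, the following sketch merges them using the counts copied from the table; the normalize helper is hypothetical and not part of avram.

```python
from collections import Counter

# Counts copied from the table above; None is the row printed as "None".
id_enrichment_counts = {
    '/select-id?prop=id': 1429,
    '/select-oac-id': 570,
    '/select-id?prop=uid': 278,
    '/select-cmis-atom-id': 21,
    '/select-preservica-id': 18,
    None: 15,
    '/select-id?prop=metadata/identifier': 7,
    'select-id?prop=id': 6,
    'select-oac-id': 4,
    '/select-id?prop=PID': 3,
    '/select-id?prop=identifier': 3,
    '/csl-marc-id': 2,
    '/ucsb-aleph-marc-id': 1,
    '/sfpl-marc-id': 1,
}


def normalize(enrichment):
    """Hypothetical helper: force exactly one leading slash; keep None as-is."""
    if enrichment is None:
        return None
    return '/' + enrichment.lstrip('/')


merged = Counter()
for enrichment, count in id_enrichment_counts.items():
    merged[normalize(enrichment)] += count

print(f"Distinct ID enrichments after merging: {len(merged)}")
for enrichment, count in merged.most_common(3):
    print(f"{enrichment}: {count:,}")
```

Under that assumption the 14 rows collapse to 12 distinct values, and the total number of collections is unchanged.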

rich_collections = Collection.published.all()
unique_id_enrichments = group_collections(rich_collections, lambda c: c.id_enrichment)
num_id_enrichments = len(unique_id_enrichments)
unique_id_enrichments.sort(key=lambda ue: ue['count'], reverse=True)
num_collections_using_most_common_id_enrichment = unique_id_enrichments[0]['count']
percent_collections_using_most_common_chain = unique_id_enrichments[0]['count']/len(Collection.published.all())
long_tail = [ue for ue in unique_id_enrichments if ue['count'] == 1]
num_collections_using_unique_id_enrichment = len(long_tail)
table_view = ""
for ue in unique_id_enrichments:
    table_view = f"{table_view}| {ue['matched']} | {ue['count']} |\n"

print(
    f"- Unique ID Enrichments: {num_id_enrichments:,}\n"
    f"- Number of Collections Using the Most Common ID Enrichment: "
    f"{num_collections_using_most_common_id_enrichment:,} (~{percent_collections_using_most_common_chain:.0%})\n"
    f"- Number of Collections Using a Unique ID Enrichment: {num_collections_using_unique_id_enrichment:,}\n"
    f"#### List of ID Enrichments and Count of Collections:\n"
    "| ID Enrichment | Count of Collections |\n"
    "| --- | --- |\n"
    f"{table_view}"
)

Mappers

Unique Mappers: 55
Number of Collections Using the Most Common Mapper: 571 (~24%)
Number of Collections Using a Unique Mapper: 14

List of Mapper Types and Count of Collections:

Mapper Type	Count of Collections
/dpla_mapper?mapper_type=oac_dc	571
/dpla_mapper?mapper_type=contentdm_oai_dc	311
/dpla_mapper?mapper_type=ucsd_blacklight_dc	292
/dpla_mapper?mapper_type=ucldc_nuxeo	278
/dpla_mapper?mapper_type=cavpp_islandora	245
/dpla_mapper?mapper_type=calpoly_oai_dc	77
/dpla_mapper?mapper_type=usc_oai_dc	75
/dpla_mapper?mapper_type=chapman_oai_dc	48
/dpla_mapper?mapper_type=quartex_oai	46
/dpla_mapper?mapper_type=csa_omeka	44
/dpla_mapper?mapper_type=chs_islandora	36
/dpla_mapper?mapper_type=flickr_sppl	33
/dpla_mapper?mapper_type=omeka	28
/dpla_mapper?mapper_type=sjsu_islandora	24
/dpla_mapper?mapper_type=ucb_tind_marc	23
/dpla_mapper?mapper_type=cmis_atom	21
/dpla_mapper?mapper_type=up_oai_dc	19
/dpla_mapper?mapper_type=preservica_api	19
/dpla_mapper?mapper_type=youtube_video_snippet	17
/dpla_mapper?mapper_type=ucd_json	13
/dpla_mapper?mapper_type=burbank_islandora	11
/dpla_mapper?mapper_type=arck_oai	11
/dpla_mapper?mapper_type=ucsc_oai_dpla	10
/dpla_mapper?mapper_type=csudh_contentdm_oai_dc	10
/dpla_mapper?mapper_type=lapl_oai	10
/dpla_mapper?mapper_type=black_gold_oai	9
/dpla_mapper?mapper_type=islandora_oai_dc	9
/dpla_mapper?mapper_type=pspl_oai_dc	8
/dpla_mapper?mapper_type=chico_oai_dc	7
/dpla_mapper?mapper_type=chula_vista_pl_contentdm_oai_dc	5
/dpla_mapper?mapper_type=pastperfect_xml	5
/dpla_mapper?mapper_type=flickr_sdasm	5
/dpla_mapper?mapper_type=cca_vault_oai_dc	4
/dpla_mapper?mapper_type=yosemite_oai_dc	4
/dpla_mapper?mapper_type=ucla_solr_dc	3
/dpla_mapper?mapper_type=omeka_nothumb	3
/dpla_mapper?mapper_type=oac_dc_suppress_publisher	2
/dpla_mapper?mapper_type=csu_sac_oai_dc	2
/dpla_mapper?mapper_type=ucsf_solr	2
/dpla_mapper?mapper_type=caltech_restrict	2
/dpla_mapper?mapper_type=internet_archive	2
/dpla_mapper?mapper_type=ucsb_aleph_marc	1
/dpla_mapper?mapper_type=oac_dc_suppress_desc_2	1
/dpla_mapper?mapper_type=sfpl_marc	1
/dpla_mapper?mapper_type=lapl_26096	1
/dpla_mapper?mapper_type=csl_marc	1
/dpla_mapper?mapper_type=contentdm_oai_dc_get_sound_thumbs	1
/dpla_mapper?mapper_type=ucb_bampfa_solr	1
/dpla_mapper?mapper_type=csuci_mets	1
/dpla_mapper?mapper_type=emuseum_xml	1
/dpla_mapper?mapper_type=csu_dspace_mets	1
/dpla_mapper?mapper_type=flickr_api	1
/dpla_mapper?mapper_type=sierramadre_marc	1
/dpla_mapper?mapper_type=sanjose_pastperfect	1
/dpla_mapper?mapper_type=tv_academy_oai_dc	1

rich_collections = Collection.published.all()
unique_mappers = group_collections(rich_collections, lambda c: c.mapper_type)
num_unique_mappers = len(unique_mappers)
unique_mappers.sort(key=lambda ue: ue['count'], reverse=True)
num_collections_using_most_common_mapper = unique_mappers[0]['count']
percent_collections_using_most_common_mapper = unique_mappers[0]['count']/len(Collection.published.all())
long_tail = [ue for ue in unique_mappers if ue['count'] == 1]
num_collections_using_unique_mapper = len(long_tail)
table_view = ""
for ue in unique_mappers:
    table_view = f"{table_view}| {ue['matched']} | {ue['count']} |\n"

print(
    f"- Unique Mappers: {num_unique_mappers:,}\n"
    f"- Number of Collections Using the Most Common Mapper: "
    f"{num_collections_using_most_common_mapper:,} (~{percent_collections_using_most_common_mapper:.0%})\n"
    f"- Number of Collections Using a Unique Mapper: {num_collections_using_unique_mapper:,}\n"
    f"#### List of Mapper Types and Count of Collections:\n"
    "| Mapper Type | Count of Collections |\n"
    "| --- | --- |\n"
    f"{table_view}"
)

Relationship of Fetchers to Mappers

This section was last updated Oct 21, 2022.

List of Mapper Types and associated fetcher types:

Mapper Type	Fetcher Types
/dpla_mapper?mapper_type=oac_dc	['OAC']
/dpla_mapper?mapper_type=ucd_json	['UCD']
/dpla_mapper?mapper_type=ucldc_nuxeo	['NUX']
/dpla_mapper?mapper_type=ucsb_aleph_marc	['ALX']
/dpla_mapper?mapper_type=ucb_tind_marc	['OAI']
/dpla_mapper?mapper_type=ucsc_oai_dpla	['OAI']
/dpla_mapper?mapper_type=ucsd_blacklight_dc	['SLR']
/dpla_mapper?mapper_type=csa_omeka	['OAI']
/dpla_mapper?mapper_type=ucla_solr_dc	['SLR']
/dpla_mapper?mapper_type=oac_dc_suppress_publisher	['OAC']
/dpla_mapper?mapper_type=quartex_oai	['OAI']
/dpla_mapper?mapper_type=sjsu_islandora	['OAI']
/dpla_mapper?mapper_type=cca_vault_oai_dc	['OAI']
/dpla_mapper?mapper_type=chs_islandora	['OAI']
/dpla_mapper?mapper_type=contentdm_oai_dc	['OAI']
/dpla_mapper?mapper_type=cmis_atom	['PRE']
/dpla_mapper?mapper_type=black_gold_oai	['OAI']
/dpla_mapper?mapper_type=calpoly_oai_dc	['OAI']
/dpla_mapper?mapper_type=csu_sac_oai_dc	['OAI']
/dpla_mapper?mapper_type=csudh_contentdm_oai_dc	['OAI']
/dpla_mapper?mapper_type=oac_dc_suppress_desc_2	['OAC']
/dpla_mapper?mapper_type=chula_vista_pl_contentdm_oai_dc	['OAI']
/dpla_mapper?mapper_type=lapl_oai	['OAI']
/dpla_mapper?mapper_type=sfpl_marc	['MRC']
/dpla_mapper?mapper_type=lapl_26096	['OAI']
/dpla_mapper?mapper_type=ucsf_solr	['SFX']
/dpla_mapper?mapper_type=cavpp_islandora	['OAI']
/dpla_mapper?mapper_type=up_oai_dc	['OAI']
/dpla_mapper?mapper_type=chapman_oai_dc	['OAI']
/dpla_mapper?mapper_type=preservica_api	['PRA']
/dpla_mapper?mapper_type=csl_marc	['MRC']
/dpla_mapper?mapper_type=contentdm_oai_dc_get_sound_thumbs	['OAI']
/dpla_mapper?mapper_type=pspl_oai_dc	['OAI']
/dpla_mapper?mapper_type=omeka	['OAI']
/dpla_mapper?mapper_type=chico_oai_dc	['OAI']
/dpla_mapper?mapper_type=ucb_bampfa_solr	['UCB']
/dpla_mapper?mapper_type=islandora_oai_dc	['OAI']
/dpla_mapper?mapper_type=youtube_video_snippet	['YTB']
/dpla_mapper?mapper_type=csuci_mets	['OAI']
/dpla_mapper?mapper_type=pastperfect_xml	['XML']
/dpla_mapper?mapper_type=caltech_restrict	['OAI']
/dpla_mapper?mapper_type=usc_oai_dc	['OAI']
/dpla_mapper?mapper_type=yosemite_oai_dc	['OAI']
/dpla_mapper?mapper_type=emuseum_xml	['EMS']
/dpla_mapper?mapper_type=csu_dspace_mets	['OAI']
/dpla_mapper?mapper_type=flickr_api	['FLK']
/dpla_mapper?mapper_type=sierramadre_marc	['MRC']
/dpla_mapper?mapper_type=burbank_islandora	['OAI']
/dpla_mapper?mapper_type=omeka_nothumb	['OAI']
/dpla_mapper?mapper_type=sanjose_pastperfect	['XML']
/dpla_mapper?mapper_type=tv_academy_oai_dc	['OAI']
/dpla_mapper?mapper_type=flickr_sdasm	['FLK']
/dpla_mapper?mapper_type=flickr_sppl	['FLK']
/dpla_mapper?mapper_type=internet_archive	['IAR']
/dpla_mapper?mapper_type=arck_oai	['OAI']

rich_collections = Collection.published.all()
unique_mappers = group_collections(rich_collections, lambda c: c.mapper_type)

# Map each mapper type to the distinct harvest (fetcher) types of its collections
fetcher_types = {}
for m in unique_mappers:
    types = []
    for c in m['collections']:
        if c.harvest_type not in types:
            types.append(c.harvest_type)
    fetcher_types[m['matched']] = types

table_view = ""
for ft in fetcher_types:
    table_view = f"{table_view}| {ft} | {fetcher_types[ft]} |\n"

print(
    f"### List of Mapper Types and associated fetcher types:\n"
    "| Mapper Type | Fetcher Types |\n"
    "| --- | --- |\n"
    f"{table_view}"
)
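
Every mapper in the table above is paired with exactly one fetcher type. As a rough sketch of how the fetcher_types dict could be read in the other direction, the example below inverts a handful of rows copied from the table, so its counts describe only that subset. The names mappers_by_fetcher and multi_fetcher_mappers are introduced here for illustration.

```python
from collections import defaultdict

# A few rows copied from the table above (not the full list).
fetcher_types = {
    '/dpla_mapper?mapper_type=oac_dc': ['OAC'],
    '/dpla_mapper?mapper_type=oac_dc_suppress_publisher': ['OAC'],
    '/dpla_mapper?mapper_type=ucldc_nuxeo': ['NUX'],
    '/dpla_mapper?mapper_type=sfpl_marc': ['MRC'],
    '/dpla_mapper?mapper_type=csl_marc': ['MRC'],
    '/dpla_mapper?mapper_type=contentdm_oai_dc': ['OAI'],
}

# Invert the mapping: fetcher type -> mappers that consume its output.
mappers_by_fetcher = defaultdict(list)
for mapper, fetchers in fetcher_types.items():
    for fetcher in fetchers:
        mappers_by_fetcher[fetcher].append(mapper)

# Mappers tied to more than one fetcher type would break the one-to-one pattern.
multi_fetcher_mappers = [m for m, f in fetcher_types.items() if len(f) > 1]

for fetcher, mappers in sorted(mappers_by_fetcher.items()):
    print(f"{fetcher}: {len(mappers)} mapper(s)")
print(f"Mappers with more than one fetcher type: {multi_fetcher_mappers}")
```

In the shell, the same two loops can be run against the full fetcher_types dict built by the script above.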
