Analysis of Existing Infrastructure
Barbara Hui edited this page Oct 24, 2022 · 4 revisions
[Last Updated Oct 5, 2022]
This document assumes you are familiar with the existing harvester operations, the Collection Registry, and the concept of enrichment chains.
To run the code samples in this document, you need ssh access to the Collection Registry machine, must be able to use the registry role account, and must have run `python manage.py shell` from within the avram codebase. This document also assumes you have run the following Python code before running any of the scripts listed below:
from library_collection.models import Collection
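def group_collections(collection_list, group_by, dict_name='matched'):
    """Group collections by the value group_by returns for each one.

    Returns a list of dicts, one per distinct value, each holding the
    value (under dict_name), the matching collections, and their count.
    """
    group_list = []
    while len(collection_list) > 0:
        remainder = []
        matched = []
        # Use the first remaining collection's value as this pass's group.
        match_value = group_by(collection_list[0])
        for collection in collection_list:
            if group_by(collection) == match_value:
                matched.append(collection)
            else:
                remainder.append(collection)
        group_list.append({
            dict_name: match_value,
            'collections': matched,
            'count': len(matched)
        })
        collection_list = remainder
    return group_list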
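To see the shape of the value `group_collections` returns without a registry connection, here is a minimal standalone sketch. It is not the function above: `group_by_key` is a hypothetical single-pass variant that uses a dictionary and therefore assumes the group values are hashable, which the comparison-based loop above does not require. The `SimpleNamespace` objects and their `enrichments_item` values are invented stand-ins for `Collection` instances.

```python
from types import SimpleNamespace


def group_by_key(items, group_by, dict_name='matched'):
    """Single-pass grouping that returns the same list-of-dicts shape."""
    groups = {}
    for item in items:
        # dicts preserve insertion order, so groups appear in the order
        # their value is first seen, as in the loop-based version.
        groups.setdefault(group_by(item), []).append(item)
    return [
        {dict_name: value, 'collections': members, 'count': len(members)}
        for value, members in groups.items()
    ]


# Hypothetical stand-ins for Collection objects.
sample = [
    SimpleNamespace(id=1, enrichments_item='chain-a'),
    SimpleNamespace(id=2, enrichments_item='chain-b'),
    SimpleNamespace(id=3, enrichments_item='chain-a'),
]

groups = group_by_key(sample, lambda c: c.enrichments_item)
groups.sort(key=lambda g: g['count'], reverse=True)
for g in groups:
    print(g['matched'], g['count'], [c.id for c in g['collections']])
```

Each entry carries the group value, the member objects, and a count, which is what the scripts below sort and summarize.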
- Total Collections: 27,797
- Collections with Enrichment Chains: 3,184
- Collections Ready for Publication: 2,358
- Collections Ready for Publication with Enrichment Chains: 2,358
# .count() asks the database for the number of rows instead of loading every object.
num_total_collections = Collection.objects.count()
num_collections_with_enrichments = Collection.objects.exclude(enrichments_item__exact='').count()
num_collections_ready_for_publication = Collection.objects.exclude(ready_for_publication=False).count()
num_collections_ready_and_with_enrichments = Collection.published.count()
print(
    f"- Total Collections: {num_total_collections:,}\n"
    f"- Collections with Enrichment Chains: {num_collections_with_enrichments:,}\n"
    f"- Collections Ready for Publication: {num_collections_ready_for_publication:,}\n"
    f"- Collections Ready for Publication with Enrichment Chains: {num_collections_ready_and_with_enrichments:,}"
)
- Unique Enrichment Chains: 133
- Number of Collections Using the Most Common Enrichment Chain: 407 (~17.3%)
- Number of Collections Using the 10 Most Common Enrichment Chains: 1,635 (~69.3%)
- Number of Collections Using a Unique Enrichment Chain: 59 (~2.5%)
rich_collections = Collection.published.all()
unique_enrichments = group_collections(
    rich_collections, lambda c: c.enrichments_item)
num_enrichment_chains = len(unique_enrichments)
unique_enrichments.sort(key=lambda ue: ue['count'], reverse=True)
num_collections_using_most_common_chain = unique_enrichments[0]['count']
percent_collections_using_most_common_chain = unique_enrichments[0]['count']/len(Collection.published.all())
most_common_count = [ue['count'] for ue in unique_enrichments[:10]]
num_collections_using_10_most_common_chains = sum(most_common_count)
percent_collections_using_10_most_common_chains = sum(most_common_count)/len(Collection.published.all())
long_tail = [ue for ue in unique_enrichments if ue['count'] == 1]
num_collections_using_unique_chain = len(long_tail)
percent_collections_using_unique_chain = len(long_tail)/len(Collection.published.all())
print(
    f"- Unique Enrichment Chains: {num_enrichment_chains:,}\n"
    f"- Number of Collections Using the Most Common Enrichment Chain: "
    f"{num_collections_using_most_common_chain:,} (~{percent_collections_using_most_common_chain:.1%})\n"
    f"- Number of Collections Using the 10 Most Common Enrichment Chains: "
    f"{num_collections_using_10_most_common_chains:,} (~{percent_collections_using_10_most_common_chains:.1%})\n"
    f"- Number of Collections Using a Unique Enrichment Chain: "
    f"{num_collections_using_unique_chain:,} (~{percent_collections_using_unique_chain:.1%})"
)
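The percentages reported for enrichment chains can be checked by hand against the 2,358 published collections. The short sketch below only redoes that arithmetic with the figures quoted in this document; the dictionary labels are illustrative and nothing here queries the registry.

```python
# Denominator: Collections Ready for Publication with Enrichment Chains.
published = 2358

# Counts as reported in this document.
reported = {
    'most common chain': 407,
    '10 most common chains': 1635,
    'unique chain': 59,
}

shares = {label: count / published for label, count in reported.items()}
for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```

The `.1%` format specifier multiplies by 100 and rounds to one decimal place, which is how the scripts in this document produce figures such as ~17.3%.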
The vast majority of enrichment chains start by selecting an identifier. The `id_enrichment` value used in this section is defined in ucldc/avram as `library_collection.models.Collection.id_enrichment(self)`.
- Unique ID Enrichments: 14
- Number of Collections Using the Most Common ID Enrichment: 1,429 (~61%)
- Number of Collections Using a Unique ID Enrichment: 2
ID Enrichment | Count of Collections |
---|---|
/select-id?prop=id | 1429 |
/select-oac-id | 570 |
/select-id?prop=uid | 278 |
/select-cmis-atom-id | 21 |
/select-preservica-id | 18 |
None | 15 |
/select-id?prop=metadata/identifier | 7 |
select-id?prop=id | 6 |
select-oac-id | 4 |
/select-id?prop=PID | 3 |
/select-id?prop=identifier | 3 |
/csl-marc-id | 2 |
/ucsb-aleph-marc-id | 1 |
/sfpl-marc-id | 1 |
rich_collections = Collection.published.all()
# Group published collections by their ID enrichment step.
unique_id_enrichments = group_collections(rich_collections, lambda c: c.id_enrichment)
num_id_enrichments = len(unique_id_enrichments)
unique_id_enrichments.sort(key=lambda ue: ue['count'], reverse=True)
num_collections_using_most_common_id_enrichment = unique_id_enrichments[0]['count']
percent_collections_using_most_common_id_enrichment = unique_id_enrichments[0]['count']/len(Collection.published.all())
long_tail = [ue for ue in unique_id_enrichments if ue['count'] == 1]
num_collections_using_unique_id_enrichment = len(long_tail)
# Build one Markdown table row per ID enrichment.
table_view = ""
for ue in unique_id_enrichments:
    table_view = f"{table_view}| {ue['matched']} | {ue['count']} |\n"
print(
    f"- Unique ID Enrichments: {num_id_enrichments:,}\n"
    f"- Number of Collections Using the Most Common ID Enrichment: "
    f"{num_collections_using_most_common_id_enrichment:,} (~{percent_collections_using_most_common_id_enrichment:.0%})\n"
    f"- Number of Collections Using a Unique ID Enrichment: {num_collections_using_unique_id_enrichment:,}\n"
    f"#### List of ID Enrichments and Count of Collections:\n"
    "| ID Enrichment | Count of Collections |\n"
    "| --- | --- |\n"
    f"{table_view}"
)
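Two pairs of rows in the ID enrichment table differ only by a leading slash (`/select-id?prop=id` vs `select-id?prop=id`, and `/select-oac-id` vs `select-oac-id`). The sketch below shows how the counts would combine if the two spellings were treated as the same step. That equivalence is an assumption made for illustration; this document does not state whether the harvester treats them identically.

```python
from collections import Counter

# Rows copied from the ID enrichment table in this document.
table_rows = {
    '/select-id?prop=id': 1429,
    '/select-oac-id': 570,
    'select-id?prop=id': 6,
    'select-oac-id': 4,
}

merged = Counter()
for enrichment, count in table_rows.items():
    # Assumption: a missing leading slash does not change the step's meaning.
    merged['/' + enrichment.lstrip('/')] += count

for enrichment, count in merged.most_common():
    print(f"{enrichment}: {count:,}")
```

Under that assumption the number of distinct ID enrichments would drop from 14 to 12.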
- Unique Mappers: 55
- Number of Collections Using the Most Common Mapper: 571 (~24%)
- Number of Collections Using a Unique Mapper: 14
Mapper Type | Count of Collections |
---|---|
/dpla_mapper?mapper_type=oac_dc | 571 |
/dpla_mapper?mapper_type=contentdm_oai_dc | 311 |
/dpla_mapper?mapper_type=ucsd_blacklight_dc | 292 |
/dpla_mapper?mapper_type=ucldc_nuxeo | 278 |
/dpla_mapper?mapper_type=cavpp_islandora | 245 |
/dpla_mapper?mapper_type=calpoly_oai_dc | 77 |
/dpla_mapper?mapper_type=usc_oai_dc | 75 |
/dpla_mapper?mapper_type=chapman_oai_dc | 48 |
/dpla_mapper?mapper_type=quartex_oai | 46 |
/dpla_mapper?mapper_type=csa_omeka | 44 |
/dpla_mapper?mapper_type=chs_islandora | 36 |
/dpla_mapper?mapper_type=flickr_sppl | 33 |
/dpla_mapper?mapper_type=omeka | 28 |
/dpla_mapper?mapper_type=sjsu_islandora | 24 |
/dpla_mapper?mapper_type=ucb_tind_marc | 23 |
/dpla_mapper?mapper_type=cmis_atom | 21 |
/dpla_mapper?mapper_type=up_oai_dc | 19 |
/dpla_mapper?mapper_type=preservica_api | 19 |
/dpla_mapper?mapper_type=youtube_video_snippet | 17 |
/dpla_mapper?mapper_type=ucd_json | 13 |
/dpla_mapper?mapper_type=burbank_islandora | 11 |
/dpla_mapper?mapper_type=arck_oai | 11 |
/dpla_mapper?mapper_type=ucsc_oai_dpla | 10 |
/dpla_mapper?mapper_type=csudh_contentdm_oai_dc | 10 |
/dpla_mapper?mapper_type=lapl_oai | 10 |
/dpla_mapper?mapper_type=black_gold_oai | 9 |
/dpla_mapper?mapper_type=islandora_oai_dc | 9 |
/dpla_mapper?mapper_type=pspl_oai_dc | 8 |
/dpla_mapper?mapper_type=chico_oai_dc | 7 |
/dpla_mapper?mapper_type=chula_vista_pl_contentdm_oai_dc | 5 |
/dpla_mapper?mapper_type=pastperfect_xml | 5 |
/dpla_mapper?mapper_type=flickr_sdasm | 5 |
/dpla_mapper?mapper_type=cca_vault_oai_dc | 4 |
/dpla_mapper?mapper_type=yosemite_oai_dc | 4 |
/dpla_mapper?mapper_type=ucla_solr_dc | 3 |
/dpla_mapper?mapper_type=omeka_nothumb | 3 |
/dpla_mapper?mapper_type=oac_dc_suppress_publisher | 2 |
/dpla_mapper?mapper_type=csu_sac_oai_dc | 2 |
/dpla_mapper?mapper_type=ucsf_solr | 2 |
/dpla_mapper?mapper_type=caltech_restrict | 2 |
/dpla_mapper?mapper_type=internet_archive | 2 |
/dpla_mapper?mapper_type=ucsb_aleph_marc | 1 |
/dpla_mapper?mapper_type=oac_dc_suppress_desc_2 | 1 |
/dpla_mapper?mapper_type=sfpl_marc | 1 |
/dpla_mapper?mapper_type=lapl_26096 | 1 |
/dpla_mapper?mapper_type=csl_marc | 1 |
/dpla_mapper?mapper_type=contentdm_oai_dc_get_sound_thumbs | 1 |
/dpla_mapper?mapper_type=ucb_bampfa_solr | 1 |
/dpla_mapper?mapper_type=csuci_mets | 1 |
/dpla_mapper?mapper_type=emuseum_xml | 1 |
/dpla_mapper?mapper_type=csu_dspace_mets | 1 |
/dpla_mapper?mapper_type=flickr_api | 1 |
/dpla_mapper?mapper_type=sierramadre_marc | 1 |
/dpla_mapper?mapper_type=sanjose_pastperfect | 1 |
/dpla_mapper?mapper_type=tv_academy_oai_dc | 1 |
rich_collections = Collection.published.all()
# Group published collections by their mapper step.
unique_mappers = group_collections(rich_collections, lambda c: c.mapper_type)
num_unique_mappers = len(unique_mappers)
unique_mappers.sort(key=lambda ue: ue['count'], reverse=True)
num_collections_using_most_common_mapper = unique_mappers[0]['count']
percent_collections_using_most_common_mapper = unique_mappers[0]['count']/len(Collection.published.all())
long_tail = [ue for ue in unique_mappers if ue['count'] == 1]
num_collections_using_unique_mapper = len(long_tail)
# Build one Markdown table row per mapper type.
table_view = ""
for ue in unique_mappers:
    table_view = f"{table_view}| {ue['matched']} | {ue['count']} |\n"
print(
    f"- Unique Mappers: {num_unique_mappers:,}\n"
    f"- Number of Collections Using the Most Common Mapper: "
    f"{num_collections_using_most_common_mapper:,} (~{percent_collections_using_most_common_mapper:.0%})\n"
    f"- Number of Collections Using a Unique Mapper: {num_collections_using_unique_mapper:,}\n"
    f"#### List of Mapper Types and Count of Collections:\n"
    "| Mapper Type | Count of Collections |\n"
    "| --- | --- |\n"
    f"{table_view}"
)
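The mapper values in the table are path-plus-query strings such as `/dpla_mapper?mapper_type=oac_dc`. If only the bare mapper name is needed, the standard library's `urllib.parse` can pull it out. This is a minimal sketch; `mapper_name` is a hypothetical helper and is not part of avram.

```python
from urllib.parse import parse_qs, urlsplit


def mapper_name(step):
    """Return the mapper_type query value from a step string, or None."""
    query = urlsplit(step).query
    values = parse_qs(query).get('mapper_type')
    return values[0] if values else None


print(mapper_name('/dpla_mapper?mapper_type=oac_dc'))
print(mapper_name('/select-oac-id'))
```

Steps without a `mapper_type` parameter, such as the ID enrichments, yield `None` rather than raising an error.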
This section last updated Oct 21 2022
Mapper Type | Fetcher Types |
---|---|
/dpla_mapper?mapper_type=oac_dc | ['OAC'] |
/dpla_mapper?mapper_type=ucd_json | ['UCD'] |
/dpla_mapper?mapper_type=ucldc_nuxeo | ['NUX'] |
/dpla_mapper?mapper_type=ucsb_aleph_marc | ['ALX'] |
/dpla_mapper?mapper_type=ucb_tind_marc | ['OAI'] |
/dpla_mapper?mapper_type=ucsc_oai_dpla | ['OAI'] |
/dpla_mapper?mapper_type=ucsd_blacklight_dc | ['SLR'] |
/dpla_mapper?mapper_type=csa_omeka | ['OAI'] |
/dpla_mapper?mapper_type=ucla_solr_dc | ['SLR'] |
/dpla_mapper?mapper_type=oac_dc_suppress_publisher | ['OAC'] |
/dpla_mapper?mapper_type=quartex_oai | ['OAI'] |
/dpla_mapper?mapper_type=sjsu_islandora | ['OAI'] |
/dpla_mapper?mapper_type=cca_vault_oai_dc | ['OAI'] |
/dpla_mapper?mapper_type=chs_islandora | ['OAI'] |
/dpla_mapper?mapper_type=contentdm_oai_dc | ['OAI'] |
/dpla_mapper?mapper_type=cmis_atom | ['PRE'] |
/dpla_mapper?mapper_type=black_gold_oai | ['OAI'] |
/dpla_mapper?mapper_type=calpoly_oai_dc | ['OAI'] |
/dpla_mapper?mapper_type=csu_sac_oai_dc | ['OAI'] |
/dpla_mapper?mapper_type=csudh_contentdm_oai_dc | ['OAI'] |
/dpla_mapper?mapper_type=oac_dc_suppress_desc_2 | ['OAC'] |
/dpla_mapper?mapper_type=chula_vista_pl_contentdm_oai_dc | ['OAI'] |
/dpla_mapper?mapper_type=lapl_oai | ['OAI'] |
/dpla_mapper?mapper_type=sfpl_marc | ['MRC'] |
/dpla_mapper?mapper_type=lapl_26096 | ['OAI'] |
/dpla_mapper?mapper_type=ucsf_solr | ['SFX'] |
/dpla_mapper?mapper_type=cavpp_islandora | ['OAI'] |
/dpla_mapper?mapper_type=up_oai_dc | ['OAI'] |
/dpla_mapper?mapper_type=chapman_oai_dc | ['OAI'] |
/dpla_mapper?mapper_type=preservica_api | ['PRA'] |
/dpla_mapper?mapper_type=csl_marc | ['MRC'] |
/dpla_mapper?mapper_type=contentdm_oai_dc_get_sound_thumbs | ['OAI'] |
/dpla_mapper?mapper_type=pspl_oai_dc | ['OAI'] |
/dpla_mapper?mapper_type=omeka | ['OAI'] |
/dpla_mapper?mapper_type=chico_oai_dc | ['OAI'] |
/dpla_mapper?mapper_type=ucb_bampfa_solr | ['UCB'] |
/dpla_mapper?mapper_type=islandora_oai_dc | ['OAI'] |
/dpla_mapper?mapper_type=youtube_video_snippet | ['YTB'] |
/dpla_mapper?mapper_type=csuci_mets | ['OAI'] |
/dpla_mapper?mapper_type=pastperfect_xml | ['XML'] |
/dpla_mapper?mapper_type=caltech_restrict | ['OAI'] |
/dpla_mapper?mapper_type=usc_oai_dc | ['OAI'] |
/dpla_mapper?mapper_type=yosemite_oai_dc | ['OAI'] |
/dpla_mapper?mapper_type=emuseum_xml | ['EMS'] |
/dpla_mapper?mapper_type=csu_dspace_mets | ['OAI'] |
/dpla_mapper?mapper_type=flickr_api | ['FLK'] |
/dpla_mapper?mapper_type=sierramadre_marc | ['MRC'] |
/dpla_mapper?mapper_type=burbank_islandora | ['OAI'] |
/dpla_mapper?mapper_type=omeka_nothumb | ['OAI'] |
/dpla_mapper?mapper_type=sanjose_pastperfect | ['XML'] |
/dpla_mapper?mapper_type=tv_academy_oai_dc | ['OAI'] |
/dpla_mapper?mapper_type=flickr_sdasm | ['FLK'] |
/dpla_mapper?mapper_type=flickr_sppl | ['FLK'] |
/dpla_mapper?mapper_type=internet_archive | ['IAR'] |
/dpla_mapper?mapper_type=arck_oai | ['OAI'] |
rich_collections = Collection.published.all()
unique_mappers = group_collections(rich_collections, lambda c: c.mapper_type)
fetcher_types = {}
for m in unique_mappers:
    types = []
    for c in m['collections']:
        if c.harvest_type not in types:
            types.append(c.harvest_type)
    fetcher_types[m['matched']] = types
table_view = ""
for ft in fetcher_types:
    table_view = f"{table_view}| {ft} | {fetcher_types[ft]} |\n"
print(
    f"### List of Mapper Types and associated fetcher types:\n"
    "| Mapper Type | Fetcher Types |\n"
    "| --- | --- |\n"
    f"{table_view}"
)
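In the table above, every mapper type is associated with exactly one fetcher type, so the mapping can also be read in the other direction: which mappers sit behind each fetcher. The sketch below inverts a `fetcher_types`-shaped dictionary. It uses only a handful of rows copied from the table, not the full registry data, and the name `mappers_by_fetcher` is illustrative.

```python
from collections import defaultdict

# A subset of rows from the table above (mapper type -> fetcher types).
fetcher_types = {
    '/dpla_mapper?mapper_type=oac_dc': ['OAC'],
    '/dpla_mapper?mapper_type=oac_dc_suppress_publisher': ['OAC'],
    '/dpla_mapper?mapper_type=contentdm_oai_dc': ['OAI'],
    '/dpla_mapper?mapper_type=quartex_oai': ['OAI'],
    '/dpla_mapper?mapper_type=ucldc_nuxeo': ['NUX'],
}

# Invert the mapping: fetcher type -> list of mapper types that use it.
mappers_by_fetcher = defaultdict(list)
for mapper, fetchers in fetcher_types.items():
    for fetcher in fetchers:
        mappers_by_fetcher[fetcher].append(mapper)

for fetcher, mappers in sorted(mappers_by_fetcher.items()):
    print(f"{fetcher}: {len(mappers)} mapper(s)")
```

Run against the full `fetcher_types` dictionary built by the script above, the same loop would show how many mappers depend on each fetcher.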