Skip to content

jhu-library-applications/metadata-editing-python

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

metadata-editing-python

Scripts to merge, pivot, and edit CSV data. Many of these script rely on the pandas library.

clean-up

convertUnstructuredListsToStructuredLists.py
Converts unstructured lists of values and converts them into new structured lists.

original

value multipleTerms possibleDelimiter
manuscripts; Sanskrit y ;
attention, learning,memory ,visual cortex y ,
corgis; golden retrievers;bulldog; y ;
unicorns n

structuredAndUnstructuredLists.csv

unstructuredList structuredList
manuscripts; Sanskrit ['manuscripts', 'Sanskrit']
attention, learning,memory ,visual cortex ['attention', 'learning', 'memory', 'visual cortex']
corgis; golden retrievers;bulldog; ['corgis', 'golden retrievers', 'bulldog']

deleteBlankColumnsAndRows.py
Deletes any rows and columns that are completely blank from a CSV.

deleteDuplicateRows.py
Deletes any rows with the exact same information from a CSV.

deleteRowsWithIdentifiers.py
Deletes rows from first spreadsheet using list of identifiers from second spreadsheet.

stripWhiteSpaces.py
Strips white spaces from all cells in CSV.

evaluate

getDuplicatedIdentifiersInSheet.py
Finds all identifiers in a spreadsheet that appear more than 1 time in column. Creates a new spreadsheet containing all the rows with duplicated identifiers.

getDuplicateValuesInColumn.py
Find any duplicated values within a column of a spreadsheet, and then creates a new spreadsheet containing all the duplicated rows.

getExtensionsFromFilenames.py
Loops through a spreadsheet of filenames and add columns of just file extensions to original sheet.

getIdentifiersNotRepeatedInSheet.py
Finds identifiers that only appear once in a sheet's column. Creates a new spreadsheet containing all the rows with non-repeated identifiers.

getInProgressAndCompletedRows.py
Finds rows where the cell is empty (nan) for certain columns, and creates a new spreadsheet with these rows.

getSumAndSizeForColumnInSheet.py
Gets the sum (sum of values) and size (total count of non-empty value) of each unique value in a specific column.

getTotalValuesByIdentifierInSheet.py
For spreadsheet where some identifiers are repeated in multiple rows, and each row has a number value , add up number values for each identifier.

getValuesDuplicatedInTwoColumns.py
Get any values duplicated in two columns in a sheet.

printDuplicatedIdentifiersPairsInSheet.py
Print any duplicated identifier pairs to terminal.

printUniqueIdentifiersInSheet.py
Print list of unique identifiers to terminal.

match-values

findFuzzyMatchesWithinAList.py
Matches nearly identical strings within a list (or a column of a CSV) using fuzzywuzzy. It produces CSV with the matches and a CSV with the strings that could not be matched.

getLinkForEachUniqueValue.py
This script takes a list of unique values from a CSV, and searches for them in second CSV. In the second CSV, each value will be in a column where each cell has multiple value separated by a delimiter. The script will loop through this column and find an identifier or link for each unique value.

guessLanguage.py
Guesses the language of items from their titles using langdetect.

merge

combineStringColumns.py
Combines strings from numerous columns into a new column.

mergeLeftTwoSheets.py
Joins two spreadsheets using pandas left merge (left join) on an identifier.

mergeTwoSheetsWithoutPandas.py
Joins information from two spreadsheets by their identifiers.

multiple

combineMultipleSpreadsheets.py
Combines multiple spreadsheets with the same column names into one combined, large spreadsheet. The spreadsheets to combine should all be placed into the same directory.

compareIdentifiersInMultipleSheets.py
Collects identifiers from multiple spreadsheets in a directory and lists the spreadsheets where each identifier appears.

createSpreadsheetsForEachColumnInSheets.py
Combines multiple spreadsheets in a directory into a dataframe, and makes a new spreadsheet for each column found in the combined dataframe.

getIdentifiersMissingInSecondSheet.py
Finds any identifiers that exist in the first sheet but not in the first sheet.

getStatusOfAllIdentifiersInTwoSheets.py
Creates a spreadsheet with a list of all identifiers from both sheets and records if the identifier is in both sheets, only sheet_1, or only sheet_2.

getValueCountsForEachColumnInSheets.py
Counts how many values are found in each column of spreadsheets in a directory. This count information is then added by handle (or filename) to a new CSV.

mergeMultipleSheetsOnIdentifier.py
Merges multiple spreadsheets in a directory on an identifier column found in all the sheets.

reformat-values

addPrefixToIdentifiers.py
Adds a string prefix to an identifier column.

convertToETDF.py
Converts a wide variety of date formats to EDTF using regular expressions.

createFileNameFromCallNum.py
Creates a unique filename based on an item's call number.

reformatInversedNames.py
Converts personal names without dates to follow a Secondary Name, Primary Name format.

standardizePersonalNames.py
Checks personal names from a CSV file to see if there are any punctuation and/or spacing errors. The script bases its punctuation conventions on RDA and checks that these rules are followed:

  1. A single space follows a comma.
  2. Initials are followed by a period and there is a space between two initials. (i.e. Adams, E. P., not Adams, EP or Adams E.P.)
  3. The pre-RDA abbreviations of fl., ca., cent., b., and d. are not used.
  4. There is not a period at the end of the name unless the name ends with an abbreviation or initial (this is a local convention). To remove, delete or edit rows 32-34.

The script makes a new CSV file (namesStandardized.csv) with two columns called personalName and errorType. The second column tells the user the first punctuation/spacing error that exists for that name (see below for an example.) If there is more than one error, you only see the first error!

personalName errorType
Abbott, David Phelps, 1863-1934. Remove period
Abbott, Edwin Abbott,1838-1926. Remove period
Abelson, Paul,1878- Add space after comma
Aberra, A.A. Add space after period
Albert, Jane, b. 1889 Not updated to RDA standards

stripTimeFromDateTime
Strips the time from a standard DateTime stamp.

reshape

basicMelt.py
Melts a spreadsheet on a specific identifier column.

basicPivot.py
Pivots a spreadsheet by setting the index to the value of a specific column.

basicPivotTable.py
Pivots and aggregates a spreadsheet by setting the index to the value of a specific column.

original

column1 column2
001 Smith, Jane
002 Smith, Ed
002 Austen, Jane
003 Smith, Ed

new

column2 column1
Austen, Jane 002
Smith, Ed 002 \
Smith, Jane 001

collapseAllColumnsToOneColumn.py

collectSubjectsByIdentifier.py
Combine subject strings by identifier number. Then a left merge combines the pivot table to a metadata spreadsheet if needed.

original

identifier subject
0001 horses
0001 goats
0001 cows
0002 Mars
0002 Venus

pivot_table

identifier subject
0001 horses;goats;cows
0002 Mars;Venus

explodeAndPivotByIdentifier.py
Reshapes sheet indexed by column1 (column2 aggregated) to sheet indexed by column2 (column1 aggregated).

original

column1 column2
Smith, Ed 001 | 004 | 005
Smith, Jane 004

new

column2 column1
001 Smith, Ed
004 Smith, Ed | Smith, Jane
005 Smith, Ed

explodeColumnWithList.py
Takes a column where each cell contains multiple values, and creates a new row for each value, duplicating its original identifier.

meltMultipleSubjectsByIdentifier.py
Melts a wide CSV with multiple column pairs of an auth_name and vocab_id into a long format where each column pair is in its own row with the associated bib number.

original

bib 0_auth_name 0_vocab_id 1_auth_name 1_vocab_id
0001 horses 1003 goats 20394
0002 Venus 839402 Mars 9842718

melted

bib auth_name vocab_id
0001 horses 1003
0001 goats 20394
0002 Venus 839402
0002 Mars 9842718

splitCSVByUniqueValueInColumn.py
Generates a new CSV for each unique value found in a column of a spreadsheet. The new CSVs are named based on the unique values found.

testing

generateNumberedCopiesofFile.py
Generates numbered copies of a file for testing purposes.

getEveryNthRowFromCSV.py
Collects every Nth row from a CSV to create a sample of data.

splitCSVForTesting.py
Splits a large spreadsheet evenly into ten CSVs for testing.

testLink.py
Tests URLs listed in spreadsheet to see if they work.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages