metadata-editing-python

Scripts to merge, pivot, and edit CSV data. Many of these script rely on the pandas library.

clean-up

convertUnstructuredListsToStructuredLists.py
Converts unstructured lists of values and converts them into new structured lists.

original

value	multipleTerms	possibleDelimiter
manuscripts; Sanskrit	y	;
attention, learning,memory ,visual cortex	y	,
corgis; golden retrievers;bulldog;	y	;
unicorns	n

structuredAndUnstructuredLists.csv

unstructuredList	structuredList
manuscripts; Sanskrit	['manuscripts', 'Sanskrit']
attention, learning,memory ,visual cortex	['attention', 'learning', 'memory', 'visual cortex']
corgis; golden retrievers;bulldog;	['corgis', 'golden retrievers', 'bulldog']

deleteBlankColumnsAndRows.py
Deletes any rows and columns that are completely blank from a CSV.

deleteDuplicateRows.py
Deletes any rows with the exact same information from a CSV.

deleteRowsWithIdentifiers.py
Deletes rows from first spreadsheet using list of identifiers from second spreadsheet.

stripWhiteSpaces.py
Strips white spaces from all cells in CSV.

evaluate

getDuplicatedIdentifiersInSheet.py
Finds all identifiers in a spreadsheet that appear more than 1 time in column. Creates a new spreadsheet containing all the rows with duplicated identifiers.

getDuplicateValuesInColumn.py
Find any duplicated values within a column of a spreadsheet, and then creates a new spreadsheet containing all the duplicated rows.

getExtensionsFromFilenames.py
Loops through a spreadsheet of filenames and add columns of just file extensions to original sheet.

getIdentifiersNotRepeatedInSheet.py
Finds identifiers that only appear once in a sheet's column. Creates a new spreadsheet containing all the rows with non-repeated identifiers.

getInProgressAndCompletedRows.py
Finds rows where the cell is empty (nan) for certain columns, and creates a new spreadsheet with these rows.

getSumAndSizeForColumnInSheet.py
Gets the sum (sum of values) and size (total count of non-empty value) of each unique value in a specific column.

getTotalValuesByIdentifierInSheet.py
For spreadsheet where some identifiers are repeated in multiple rows, and each row has a number value , add up number values for each identifier.

getValuesDuplicatedInTwoColumns.py
Get any values duplicated in two columns in a sheet.

printDuplicatedIdentifiersPairsInSheet.py
Print any duplicated identifier pairs to terminal.

printUniqueIdentifiersInSheet.py
Print list of unique identifiers to terminal.

match-values

findFuzzyMatchesWithinAList.py
Matches nearly identical strings within a list (or a column of a CSV) using fuzzywuzzy. It produces CSV with the matches and a CSV with the strings that could not be matched.

getLinkForEachUniqueValue.py
This script takes a list of unique values from a CSV, and searches for them in second CSV. In the second CSV, each value will be in a column where each cell has multiple value separated by a delimiter. The script will loop through this column and find an identifier or link for each unique value.

guessLanguage.py
Guesses the language of items from their titles using langdetect.

merge

combineStringColumns.py
Combines strings from numerous columns into a new column.

mergeLeftTwoSheets.py
Joins two spreadsheets using pandas left merge (left join) on an identifier.

mergeTwoSheetsWithoutPandas.py
Joins information from two spreadsheets by their identifiers.

multiple

combineMultipleSpreadsheets.py
Combines multiple spreadsheets with the same column names into one combined, large spreadsheet. The spreadsheets to combine should all be placed into the same directory.

compareIdentifiersInMultipleSheets.py
Collects identifiers from multiple spreadsheets in a directory and lists the spreadsheets where each identifier appears.

createSpreadsheetsForEachColumnInSheets.py
Combines multiple spreadsheets in a directory into a dataframe, and makes a new spreadsheet for each column found in the combined dataframe.

getIdentifiersMissingInSecondSheet.py
Finds any identifiers that exist in the first sheet but not in the first sheet.

getStatusOfAllIdentifiersInTwoSheets.py
Creates a spreadsheet with a list of all identifiers from both sheets and records if the identifier is in both sheets, only sheet_1, or only sheet_2.

getValueCountsForEachColumnInSheets.py
Counts how many values are found in each column of spreadsheets in a directory. This count information is then added by handle (or filename) to a new CSV.

mergeMultipleSheetsOnIdentifier.py
Merges multiple spreadsheets in a directory on an identifier column found in all the sheets.

reformat-values

addPrefixToIdentifiers.py
Adds a string prefix to an identifier column.

convertToETDF.py
Converts a wide variety of date formats to EDTF using regular expressions.

createFileNameFromCallNum.py
Creates a unique filename based on an item's call number.

reformatInversedNames.py
Converts personal names without dates to follow a Secondary Name, Primary Name format.

standardizePersonalNames.py
Checks personal names from a CSV file to see if there are any punctuation and/or spacing errors. The script bases its punctuation conventions on RDA and checks that these rules are followed:

A single space follows a comma.
Initials are followed by a period and there is a space between two initials. (i.e. Adams, E. P., not Adams, EP or Adams E.P.)
The pre-RDA abbreviations of fl., ca., cent., b., and d. are not used.
There is not a period at the end of the name unless the name ends with an abbreviation or initial (this is a local convention). To remove, delete or edit rows 32-34.

The script makes a new CSV file (namesStandardized.csv) with two columns called personalName and errorType. The second column tells the user the first punctuation/spacing error that exists for that name (see below for an example.) If there is more than one error, you only see the first error!

personalName	errorType
Abbott, David Phelps, 1863-1934.	Remove period
Abbott, Edwin Abbott,1838-1926.	Remove period
Abelson, Paul,1878-	Add space after comma
Aberra, A.A.	Add space after period
Albert, Jane, b. 1889	Not updated to RDA standards

stripTimeFromDateTime
Strips the time from a standard DateTime stamp.

reshape

basicMelt.py
Melts a spreadsheet on a specific identifier column.

basicPivot.py
Pivots a spreadsheet by setting the index to the value of a specific column.

basicPivotTable.py
Pivots and aggregates a spreadsheet by setting the index to the value of a specific column.

original

column1	column2
001	Smith, Jane
002	Smith, Ed
002	Austen, Jane
003	Smith, Ed

new

column2	column1
Austen, Jane	002
Smith, Ed	002 \
Smith, Jane	001

collapseAllColumnsToOneColumn.py

collectSubjectsByIdentifier.py
Combine subject strings by identifier number. Then a left merge combines the pivot table to a metadata spreadsheet if needed.

original

identifier	subject
0001	horses
0001	goats
0001	cows
0002	Mars
0002	Venus

pivot_table

identifier	subject
0001	horses;goats;cows
0002	Mars;Venus

explodeAndPivotByIdentifier.py
Reshapes sheet indexed by column1 (column2 aggregated) to sheet indexed by column2 (column1 aggregated).

original

column1	column2
Smith, Ed	001 \| 004 \| 005
Smith, Jane	004

new

column2	column1
001	Smith, Ed
004	Smith, Ed \| Smith, Jane
005	Smith, Ed

explodeColumnWithList.py
Takes a column where each cell contains multiple values, and creates a new row for each value, duplicating its original identifier.

meltMultipleSubjectsByIdentifier.py
Melts a wide CSV with multiple column pairs of an auth_name and vocab_id into a long format where each column pair is in its own row with the associated bib number.

original

bib	0_auth_name	0_vocab_id	1_auth_name	1_vocab_id
0001	horses	1003	goats	20394
0002	Venus	839402	Mars	9842718

melted

bib	auth_name	vocab_id
0001	horses	1003
0001	goats	20394
0002	Venus	839402
0002	Mars	9842718

splitCSVByUniqueValueInColumn.py
Generates a new CSV for each unique value found in a column of a spreadsheet. The new CSVs are named based on the unique values found.

testing

generateNumberedCopiesofFile.py
Generates numbered copies of a file for testing purposes.

getEveryNthRowFromCSV.py
Collects every Nth row from a CSV to create a sample of data.

splitCSVForTesting.py
Splits a large spreadsheet evenly into ten CSVs for testing.

testLink.py
Tests URLs listed in spreadsheet to see if they work.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

metadata-editing-python

clean-up

evaluate

match-values

merge

multiple

reformat-values

reshape

testing

About

Releases

Packages

Contributors 2

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 100 Commits
.idea		.idea
clean-up		clean-up
evaluate		evaluate
match-values		match-values
merge		merge
multiple		multiple
reformat-values		reformat-values
reshape		reshape
testing		testing
.gitignore		.gitignore
README.md		README.md

jhu-library-applications/metadata-editing-python

Folders and files

Latest commit

History

Repository files navigation

metadata-editing-python

clean-up

evaluate

match-values

merge

multiple

reformat-values

reshape

testing

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages