Scripts to merge, pivot, and edit CSV data. Many of these script rely on the pandas library.
convertUnstructuredListsToStructuredLists.py
Converts unstructured lists of values and converts them into new structured lists.
original
value | multipleTerms | possibleDelimiter |
---|---|---|
manuscripts; Sanskrit | y | ; |
attention, learning,memory ,visual cortex | y | , |
corgis; golden retrievers;bulldog; | y | ; |
unicorns | n |
structuredAndUnstructuredLists.csv
unstructuredList | structuredList |
---|---|
manuscripts; Sanskrit | ['manuscripts', 'Sanskrit'] |
attention, learning,memory ,visual cortex | ['attention', 'learning', 'memory', 'visual cortex'] |
corgis; golden retrievers;bulldog; | ['corgis', 'golden retrievers', 'bulldog'] |
deleteBlankColumnsAndRows.py
Deletes any rows and columns that are completely blank from a CSV.
deleteDuplicateRows.py
Deletes any rows with the exact same information from a CSV.
deleteRowsWithIdentifiers.py
Deletes rows from first spreadsheet using list of identifiers from second spreadsheet.
stripWhiteSpaces.py
Strips white spaces from all cells in CSV.
getDuplicatedIdentifiersInSheet.py
Finds all identifiers in a spreadsheet that appear more than 1 time in column. Creates a new spreadsheet containing all the rows with duplicated identifiers.
getDuplicateValuesInColumn.py
Find any duplicated values within a column of a spreadsheet, and then creates a new spreadsheet containing all the duplicated rows.
getExtensionsFromFilenames.py
Loops through a spreadsheet of filenames and add columns of just file extensions to original sheet.
getIdentifiersNotRepeatedInSheet.py
Finds identifiers that only appear once in a sheet's column. Creates a new spreadsheet containing all the rows with non-repeated identifiers.
getInProgressAndCompletedRows.py
Finds rows where the cell is empty (nan) for certain columns, and creates a new spreadsheet with these rows.
getSumAndSizeForColumnInSheet.py
Gets the sum (sum of values) and size (total count of non-empty value) of each unique value in a specific column.
getTotalValuesByIdentifierInSheet.py
For spreadsheet where some identifiers are repeated in multiple rows, and each row has a number value , add up number values for each identifier.
getValuesDuplicatedInTwoColumns.py
Get any values duplicated in two columns in a sheet.
printDuplicatedIdentifiersPairsInSheet.py
Print any duplicated identifier pairs to terminal.
printUniqueIdentifiersInSheet.py
Print list of unique identifiers to terminal.
findFuzzyMatchesWithinAList.py
Matches nearly identical strings within a list (or a column of a CSV) using fuzzywuzzy. It produces CSV with the matches and a CSV with the strings that could not be matched.
getLinkForEachUniqueValue.py
This script takes a list of unique values from a CSV, and searches for them in second CSV. In the second CSV, each value will be in a column where each cell has multiple value separated by a delimiter. The script will loop through this column and find an identifier or link for each unique value.
guessLanguage.py
Guesses the language of items from their titles using langdetect.
combineStringColumns.py
Combines strings from numerous columns into a new column.
mergeLeftTwoSheets.py
Joins two spreadsheets using pandas left merge (left join) on an identifier.
mergeTwoSheetsWithoutPandas.py
Joins information from two spreadsheets by their identifiers.
combineMultipleSpreadsheets.py
Combines multiple spreadsheets with the same column names into one combined, large spreadsheet. The spreadsheets to combine should all be placed into the same directory.
compareIdentifiersInMultipleSheets.py
Collects identifiers from multiple spreadsheets in a directory and lists the spreadsheets where each identifier appears.
createSpreadsheetsForEachColumnInSheets.py
Combines multiple spreadsheets in a directory into a dataframe, and makes a new spreadsheet for each column found in the combined dataframe.
getIdentifiersMissingInSecondSheet.py
Finds any identifiers that exist in the first sheet but not in the first sheet.
getStatusOfAllIdentifiersInTwoSheets.py
Creates a spreadsheet with a list of all identifiers from both sheets and records if the identifier is in both sheets, only sheet_1, or only sheet_2.
getValueCountsForEachColumnInSheets.py
Counts how many values are found in each column of spreadsheets in a directory. This count information is then added by handle (or filename) to a new CSV.
mergeMultipleSheetsOnIdentifier.py
Merges multiple spreadsheets in a directory on an identifier column found in all the sheets.
addPrefixToIdentifiers.py
Adds a string prefix to an identifier column.
convertToETDF.py
Converts a wide variety of date formats to EDTF using regular expressions.
createFileNameFromCallNum.py
Creates a unique filename based on an item's call number.
reformatInversedNames.py
Converts personal names without dates to follow a Secondary Name, Primary Name format.
standardizePersonalNames.py
Checks personal names from a CSV file to see if there are any punctuation and/or spacing errors. The script bases its punctuation conventions on RDA and checks that these rules are followed:
- A single space follows a comma.
- Initials are followed by a period and there is a space between two initials. (i.e. Adams, E. P., not Adams, EP or Adams E.P.)
- The pre-RDA abbreviations of fl., ca., cent., b., and d. are not used.
- There is not a period at the end of the name unless the name ends with an abbreviation or initial (this is a local convention). To remove, delete or edit rows 32-34.
The script makes a new CSV file (namesStandardized.csv) with two columns called personalName and errorType. The second column tells the user the first punctuation/spacing error that exists for that name (see below for an example.) If there is more than one error, you only see the first error!
personalName | errorType |
---|---|
Abbott, David Phelps, 1863-1934. | Remove period |
Abbott, Edwin Abbott,1838-1926. | Remove period |
Abelson, Paul,1878- | Add space after comma |
Aberra, A.A. | Add space after period |
Albert, Jane, b. 1889 | Not updated to RDA standards |
stripTimeFromDateTime
Strips the time from a standard DateTime stamp.
basicMelt.py
Melts a spreadsheet on a specific identifier column.
basicPivot.py
Pivots a spreadsheet by setting the index to the value of a specific column.
basicPivotTable.py
Pivots and aggregates a spreadsheet by setting the index to the value of a specific column.
original
column1 | column2 |
---|---|
001 | Smith, Jane |
002 | Smith, Ed |
002 | Austen, Jane |
003 | Smith, Ed |
new
column2 | column1 |
---|---|
Austen, Jane | 002 |
Smith, Ed | 002 \ |
Smith, Jane | 001 |
collapseAllColumnsToOneColumn.py
collectSubjectsByIdentifier.py
Combine subject strings by identifier number. Then a left merge combines the pivot table to a metadata spreadsheet if needed.
original
identifier | subject |
---|---|
0001 | horses |
0001 | goats |
0001 | cows |
0002 | Mars |
0002 | Venus |
pivot_table
identifier | subject |
---|---|
0001 | horses;goats;cows |
0002 | Mars;Venus |
explodeAndPivotByIdentifier.py
Reshapes sheet indexed by column1 (column2 aggregated) to sheet indexed by column2 (column1 aggregated).
original
column1 | column2 |
---|---|
Smith, Ed | 001 | 004 | 005 |
Smith, Jane | 004 |
new
column2 | column1 |
---|---|
001 | Smith, Ed |
004 | Smith, Ed | Smith, Jane |
005 | Smith, Ed |
explodeColumnWithList.py
Takes a column where each cell contains multiple values, and creates a new row for each value, duplicating its original identifier.
meltMultipleSubjectsByIdentifier.py
Melts a wide CSV with multiple column pairs of an auth_name and vocab_id into a long format where each column pair is in its own row with the associated bib number.
original
bib | 0_auth_name | 0_vocab_id | 1_auth_name | 1_vocab_id |
---|---|---|---|---|
0001 | horses | 1003 | goats | 20394 |
0002 | Venus | 839402 | Mars | 9842718 |
melted
bib | auth_name | vocab_id |
---|---|---|
0001 | horses | 1003 |
0001 | goats | 20394 |
0002 | Venus | 839402 |
0002 | Mars | 9842718 |
splitCSVByUniqueValueInColumn.py
Generates a new CSV for each unique value found in a column of a spreadsheet. The new CSVs are named based on the unique values found.
generateNumberedCopiesofFile.py
Generates numbered copies of a file for testing purposes.
getEveryNthRowFromCSV.py
Collects every Nth row from a CSV to create a sample of data.
splitCSVForTesting.py
Splits a large spreadsheet evenly into ten CSVs for testing.
testLink.py
Tests URLs listed in spreadsheet to see if they work.