Copy and paste the "less" command into a terminal and verify appropriate parts of the cpf output. This section needs more notes. Some QA examples towards the end have notes. The file name sometimes gives a cryptic reminder. Some localType attributes here may not match updated reality.
At least one of these is non-1xx and does not generate output. Some have nothing special and are either historical or negative result QA tests.
We want the output in the ./cpf directory because the files don't have the same prefix or a consistent enough suffix to enable us to clean up with "rm -f" unless the output is in a separate directory. qa_all.xml oclc_marc2cpf.xsl
See qa_marc2cpf.xsl
echo "" > qa_marc_list.xml; find cpf -type f -printf "%h/%f\n" >> qa_marc_list.xml; echo "" >> qa_marc_list.xml qa_marc_list.xml qa_marc2cpf.xsl
find ./qa -name "*.xml" -exec {} oclc_marc2cpf.xsl ;
find ./qa -name "*.xml" -exec {} oclc_marc2cpf.xsl output_dir=cpf ;
twl8n@shannon __ Fri Apr 26 09:49:51 EDT 2013 __ eta: 07:10:09 ps: zsh /lv1/home/twl8n/eac_project
ls cpf | wc -l 524
This is a not1xx record. It has only 600 and 700 fields.
None of these have cpfRelation.
The 700 is creatorOf in the resourceRelation. It happens to be the second datafield, so it is the .r01, because .r numbering for not1xx records starts at .r00.
The 600 is referencedIn as it should be.
<namePart>Sharps, Turney, Mrs.</namePart>
<roleTerm valueURI="">Creator</roleTerm>
-rw-r--r-- 1 mst3k snac 1994 Dec 3 15:56 qa_11422625_not1xx_has_600_has_700.xml
less cpf/OCLC-MEA-11422625.r00.xml
less cpf/OCLC-MEA-11422625.r01.xml
less cpf/*11422625.r00.xml
less cpf/*11422625.r01.xml
This is not1xx and has no .c file.
Not creatorOf, but should have 6 .r files .r00 through .r05 for the 6xx datafields
-rw-r--r-- 1 mst3k snac 4146 Dec 3 15:52 qa_702172281_not1xx_multi_600.xml
less cpf/*702172281.r00.xml
ls -l cpf/*702172281.*
Washburn (F|f)amily should only occur once. Test case insensitive de-duping of names.
-rw-r--r-- 1 mst3k snac 4808 Sep 25 11:14 qa_11447242_case_dup.xml
less cpf/*11447242.c.xml
less cpf/COC-11447242.c.xml
Washburn, Ruth Wendell, 1890-1975.
Is 700, is creatorOf in resourceRelation
less cpf/COC-11447242.r03.xml
multiple snac:associatedPlace, snac:associatedSubject, aka topicalSubject, geographicSubject
-rw-r--r-- 1 mst3k snac 5373 Nov 19 16:39 qa_122456647_651_multi_a.xml
less cpf/*122456647.c.xml
The 700 should be the creatorOf in the .r01. The 245$f is used as the active date only for the .r01: Morrill is the 700 (aka author), so he was active on the 245$f date, but we can't be sure the subject (610) was active on that date.
<datafield tag="700" ind1="1" ind2=" ">
<subfield code="a">Morrill, Dan L.</subfield>
<nameEntry xml:lang="en-Latn">
<part>Morrill, Dan L.</part>
<date localType=""
notAfter="1990">active approximately 1987</date>
-rw-r--r-- 1 mst3k snac 3527 Dec 3 15:51 qa_26891471_not1xx_has_700.xml
less cpf/*26891471.r01.xml
less cpf/*26891471.r00.xml
less cpf/NKM-26891471.r00.xml
less cpf/NKM-26891471.r01.xml
We only process 245$f as active for 1xx|7xx. Since the .rxx files are for 6xx, the 245$f does not apply to them.
-rw-r--r-- 1 mst3k snac 2874 Oct 24 14:56 qa_122519914_corp_body_245_f_date.xml
less cpf/*122519914.c.xml
less cpf/n-122519914.r01.xml
Sundown &\and\ La Plata Mining Company (Utah).
-rw-r--r-- 1 mst3k snac 5618 Nov 19 18:12 qa_122537190_sundown_escape_characters.xml
less cpf/*122537190.c.xml
<date localType="">1735-1???.</date>
-rw-r--r-- 1 mst3k snac 20333 Nov 26 14:55 qa_122542862_date_nqqq.xml
less cpf/*122542862.r76.xml
less cpf/*122542862.c.xml
Nov 22 2013: n/a. SNAC is the agency. We don't keep the WorldCat agency since that's for the resourceRelation, not the CPF record.
Agency code "C" returns two entries from the WorldCat registry.
-rw-r--r-- 1 mst3k snac 7534 Nov 14 13:13 qa_122583172_marc_c_two_oclc_orgs.xml
less cpf/*122583172.c.xml
Two 657 elements are identical. Tests the deduplicating code for functions. (Original XML was manually modified.)
<datafield tag="657" ind1=" " ind2="7">
<subfield code="a">Administration of nonprofit organizations.</subfield>
<subfield code="2">lcsh</subfield>
<datafield tag="657" ind1=" " ind2="7">
<subfield code="a">Administration of nonprofit organizations.</subfield>
<subfield code="2">lcsh</subfield>
<term>Administration of nonprofit organizations</term>
-rw-r--r-- 1 twl8n snac 17768 Feb 20 15:46 qa/qa_123408061a_dupe_function_657.xml
less cpf/*123408061a.c.xml
245$f dates are parsed for families. Has several cpfRelation elements, for Person and CorporateBody.
<subfield code="f">1917-1960.</subfield>
<fromDate localType=""
standardDate="1917">active 1917</fromDate>
<toDate localType=""
standardDate="1960">active 1960</toDate>
-rw-r--r-- 1 mst3k snac 5041 Oct 18 09:54 qa/qa_123410709_family_245f_date.xml
less cpf/*123410709.c.xml
Date with "or" in the middle, and a trailing "?".
<subfield code="d">1061 or 2-1121?</subfield>
<fromDate standardDate="1061"
notAfter="1062">1061 or 1062</fromDate>
<toDate standardDate="1121"
-rw-r--r-- 1 mst3k snac 3239 Oct 3 11:29 qa/qa_123415450_or_question.xml qa/qa_123415450_or_question.xml oclc_marc2cpf.xsl
less cpf/*123415450.c.xml
d. 767 or 8.
<toDate standardDate="0767"
notAfter="0768">0767 or 0768</toDate>
-rw-r--r-- 1 mst3k snac 4013 Oct 3 11:40 qa_123415456_died_or.xml qa/qa_123415456_died_or.xml oclc_marc2cpf.xsl
less cpf/*123415456.c.xml
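The two "or" tests above ("1061 or 2" and "d. 767 or 8") both expand a shortened second year against the first. A minimal Python sketch of that expansion follows. It is an illustration only, not the XSLT in oclc_marc2cpf.xsl, and the function name expand_or_year is hypothetical. It assumes the short form replaces the trailing digits of the first year, and it zero-pads to four digits to match the standardDate and notAfter values shown above.

```python
import re


def expand_or_year(text):
    """Expand 'nnnn or n' into (standardDate, notAfter) strings.

    Hypothetical helper: the digits after 'or' replace the same number
    of trailing digits of the first year. Returns None when the text
    does not look like a plain 'year or year' pair.
    """
    m = re.fullmatch(r"\s*(\d{3,4}) or (\d{1,4})\s*", text)
    if not m:
        return None
    first, alt = m.group(1), m.group(2)
    if len(alt) > len(first):
        return None
    # Keep the leading digits of the first year, swap in the short form.
    second = first[: len(first) - len(alt)] + alt
    return (f"{int(first):04d}", f"{int(second):04d}")


print(expand_or_year("1061 or 2"))
print(expand_or_year("767 or 8"))
```

Under this sketch a malformed value such as "1714 or -15" returns None, which a caller could treat as a suspicious date instead of converting an empty string to a number.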
Test parsing of "?" to mean year-1 to year+1 for "1213?-1296?"
<fromDate standardDate="1213" localType="born" notBefore="1212" notAfter="1214">1213</fromDate>
<toDate standardDate="1296" localType="died" notBefore="1295" notAfter="1297">1296</toDate>
-rw-r--r-- 1 mst3k snac 4073 Oct 3 11:52 qa_123415574_qmark.xml qa/qa_123415574_qmark.xml oclc_marc2cpf.xsl
less cpf/*123415574.c.xml
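The "?" rule tested above (year-1 to year+1) can be sketched in Python as follows. This is only an illustration of the rule as stated in these notes, not the actual XSLT; the function name parse_uncertain_year and the tuple it returns are assumptions.

```python
import re


def parse_uncertain_year(token):
    """Return (standardDate, notBefore, notAfter) for 'nnnn?'.

    Hypothetical helper: a trailing '?' widens the year by one in each
    direction. Returns None for tokens without a trailing '?'.
    """
    m = re.fullmatch(r"(\d{3,4})\?", token.strip())
    if not m:
        return None
    year = int(m.group(1))
    return (f"{year:04d}", f"{year - 1:04d}", f"{year + 1:04d}")


# The two halves of "1213?-1296?" handled separately.
print(parse_uncertain_year("1213?"))
print(parse_uncertain_year("1296?"))
```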
I think this tests 100 and 700 creators in the same file.
<mods xmlns="">
<namePart>Blackburn, Joyce.</namePart>
<roleTerm valueURI="">Creator</roleTerm>
<namePart>Clayton, Stephanie,</namePart>
<roleTerm valueURI="">Creator</roleTerm>
-rw-r--r-- 1 mst3k snac 5153 Sep 25 15:30 qa_123439095_mods_leader.xml
less cpf/*123439095.c.xml
less cpf/GEU-S-123439095.r07.xml
Tests 1xx with multi 7xx which generate .rxx, and generate mods name entries. Horowitz is duplicated in the input (differentiated by 700$4 and we don't use the $4), but we only output Horowitz once.
In other words, we only have a single cpfRelation for Horowitz.
-rw-r--r-- 1 mst3k snac 3636 Jan 14 15:10 qa/qa_123452814_dup_700_name.xml
less cpf/*123452814.r01.xml
less cpf/*123452814.c.xml
Code update Mar 2013. This record has a function. As of Dec 7 2012 the code clearly only processes $e (and $4) for 100, not for 110. Apparently, that means that this record does not have an occupation.
Has 110, corporateBody
-rw-r--r-- 1 mst3k snac 3409 Nov 5 16:43 qa_155416763_110_e_occupation.xml
less cpf/*155416763.c.xml
less cpf/OCLC-NYHVD-155416763.c.xml
date 1865(approx.)-1944.
-rw-r--r-- 1 mst3k snac 1561 Oct 3 12:13 qa_155438491_approx.xml qa/qa_155438491_approx.xml oclc_marc2cpf.xsl
less cpf/*155438491.c.xml
date -1688
-rw-r--r-- 1 mst3k snac 1904 Oct 4 11:56 qa_155448889_date_leading_hyphen.xml
less cpf/*155448889.c.xml
100$4 "col" becomes an occupation.
dpt isn't in any of our authority lists, but col is.
<subfield code="4">col</subfield>
<subfield code="4">dpt</subfield>
<occupation localType="snac:derivedFromRole">
-rw-r--r-- 1 mst3k snac 4410 Nov 2 08:30 qa_17851136_no_e_two_4_occupation.xml
less cpf/*17851136.c.xml
<languageDeclaration><language languageCode="swe">Swedish</language>
-rw-r--r-- 1 mst3k snac 1375 Nov 13 10:51 qa_209838303_lang_040b_swe.xml
less cpf/*209838303.c.xml
New: all dates are sent through the normal parser, and the normal parser tries to get anything. Dates that don't parse are simply suspicious, not thrown out (as is the case with the alt date parser).
Old: Has no existDates in output. New code Mar 2013. 245$f has a separate date parser, and only attempts to parse 4 digit numbers out of dates since the dates will only be "active". This 245$f has no 4 digit dates, so it is not parsed, thus no existDates at all.
<datafield tag="245" ind1="0" ind2="0">
<subfield code="k">Papers,</subfield>
<subfield code="f">????-????</subfield>
<existDates localType="">
<date localType="">????-????</date>
-rw-r--r-- 1 mst3k snac 2814 Nov 26 10:31 qa_210324503_date_all_question_marks.xml
less cpf/*210324503.c.xml
This verifies that questionable dates display as suspiciousDate.
<date localType="">1912-0.</date>
-rw-r--r-- 1 mst3k snac 3542 Nov 26 14:48 qa_220227335_date_nnnn_hyphen-zero.xml
less cpf/*220227335.c.xml
Has a good date and a bad date. The name is a duplicate, except that the dates are not the same which causes the names to be treated as unique. The .r01 has the questionable date.
<date localType="">1?54-</date>
-rw-r--r-- 1 mst3k snac 4352 Nov 26 11:21 qa_220426543_date_1_q_54_hypen.xml
less cpf/*220426543.r01.xml
Manually modified based on 222612265 so we would have a died ca date. The original two instances of "b. ca. 1896" changed to "d. ca. 1896" (Manually modified.)
<toDate standardDate="1986"
notAfter="1989">approximately 1986</toDate>
</existDates> qa/qa_222612265x_fake_died_ca_date.xml oclc_marc2cpf.xsl
less cpf/*222612265x.c.xml
(Manually modified) The "a" file has a born ca date. "b. ca. 1896" occurs twice. Dup 600 varies from 100 only by a period (dot) at the end of the born date so this also tests de-duplicating ignores a trailing period (and all trailing punctuation, I think).
<fromDate standardDate="1896"
notAfter="1899">approximately 1896</fromDate>
-rw-r--r-- 1 twl8n snac 3324 Feb 20 16:12 qa/qa_222612265a_b_ca_date.xml
less cpf/*222612265a.c.xml
Not a duplicate. Has nnnn-ca. nnnn to test the toDate for nnnn-nnnn 1920-ca. 1986.
The "x" file has duplicate died "d. ca. 1896" to exercise the died date code.
<toDate standardDate="1986"
notAfter="1989">approximately 1986</toDate>
-rw-r--r-- 1 twl8n snac 3571 Feb 20 16:14 qa/qa_222612265x_fake_died_ca_date.xml qa/qa_222612265a_b_ca_date.xml oclc_marc2cpf.xsl
less cpf/*222612265x.c.xml
Manually modified.
Academics. local Academics. local <existDates>
<fromDate standardDate="1893"
<toDate standardDate="1981"
-rw-r--r-- 1 twl8n snac 3766 Feb 20 16:27 qa/qa_225810091a_600_family_date_dupe_occupation.xml
less cpf/*225810091a.c.xml
(Not done in qa_marc2cpf.xsl since correspondedWith is still undecided.)
The 651$v "Correspondence" doesn't match an occupation, but 656$a "Gold miners" is an occupation, albeit not one that is looked up in an authority record.
600$v correspondence. Notes say "written by ..." so the 600 (subject) fields seem like they should be 700 (creator) fields.
We use cpfRelation arcrole correspondedWith based on the 600$v.
Has a useful looking 520$a which we currently ignore, except in the mods/abstract. The 520$b is lost.
-rw-r--r-- 1 mst3k snac 3709 Sep 20 12:56 qa_225815320_651_v.xml
less cpf/*225815320.c.xml
less cpf/*225815320.r01.xml
When taking into account topical subject concatenation, I'm not seeing a frank duplication. I'm not seeing it for a geographical subject either. In any case, this example has many topical subjects and a geographical subject.
-rw-r--r-- 1 mst3k snac 5545 Oct 26 11:43 qa_225851373_dupe_places_with_dot_dupe_topical.xml
less cpf/*225851373.c.xml
<subfield code="d">fl. 2nd cent.</subfield>
<date standardDate="0101"
notAfter="0200">active 2nd century</date>
-rw-r--r-- 1 mst3k snac 2536 Oct 11 09:29 qa_233844794_fl_2nd_cent_date.xml
less cpf/*233844794.c.xml
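The century values in these notes (2nd century as 0101 to 0200 here, 1st century as 0001 to 0100 and 19th century as 1801 to 1900 in other tests) follow one arithmetic rule. A minimal Python sketch of that rule, with a hypothetical function name and a regular expression that is an assumption about the input form, not the parser's real tokenizer:

```python
import re


def century_range(text):
    """Return (first_year, last_year) as 4-digit strings for 'Nth cent'.

    Hypothetical helper: century N spans years (N-1)*100+1 through N*100.
    Returns None when no 'Nth cent' token is found.
    """
    m = re.search(r"(\d{1,2})(?:st|nd|rd|th) cent", text)
    if not m:
        return None
    n = int(m.group(1))
    return (f"{(n - 1) * 100 + 1:04d}", f"{n * 100:04d}")


print(century_range("fl. 2nd cent."))
```

The sketch handles a single century only; forms such as "19th/20th cent." would need each century parsed separately.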
Multi 1xx has no output at this time.
-rw-r--r-- 1 mst3k snac 3706 Sep 10 11:01 qa_270613908_multi_1xx.xml
less cpf/*270613908.c.xml
<subfield code="d">d. 1601/2.</subfield>
<toDate standardDate="1601" localType="died" notBefore="1601" notAfter="1602">-1601 or 1602</toDate>
-rw-r--r-- 1 mst3k snac 1508 Oct 4 15:49 qa_270617660_date_slash.xml
less cpf/*270617660.c.xml
<subfield code="d">fl. 1724/25.</subfield>
<date standardDate="1724"
notAfter="1725">active 1724 or 1725</date>
-rw-r--r-- 1 mst3k snac 1738 Oct 8 10:07 qa_270657317_fl_date_slash_n.xml
less cpf/*270657317.c.xml
The questionable date is output in the .r01 file.
<subfield code="d">1834-1876 or later.</subfield>
<date localType="">1834-1876 or later.</date>
-rw-r--r-- 1 mst3k snac 2573 Nov 26 10:40 qa_270873349_date_or_later.xml
less cpf/*270873349.r01.xml
<subfield code="d">19th/20th cent.</subfield>
<fromDate standardDate="1801"
notAfter="1900">19th century</fromDate>
<toDate standardDate="1901"
notAfter="2000">20th century</toDate>
-rw-r--r-- 1 mst3k snac 2428 Oct 3 08:58 qa/qa_281846814_19th_slash_20th.xml
less cpf/*281846814.c.xml
<subfield code="d">1837-[1889?]</subfield>
<fromDate standardDate="1837"
<toDate standardDate="1889"
-rw-r--r-- 1 mst3k snac 4505 Oct 3 14:41 qa_313817562_sq_bracket_date.xml
less cpf/*313817562.c.xml
That is an ell, not a one. one nine four ell.
<subfield code="d">194l-</subfield>
<date localType="">194l-</date>
-rw-r--r-- 1 mst3k snac 1829 Oct 3 14:51 qa_3427618_date_194ell.xml
less cpf/*3427618.c.xml
I think this test exists to make sure subfields are concatenated in order, even if that order is a, c, a, d, as opposed to a, a, c, d as errant XSLT did at one point.
<subfield code="a">Jones,</subfield>
<subfield code="c">Mrs.</subfield>
<subfield code="a">J.C.,</subfield>
<subfield code="d">1854-</subfield>
results in:
<part>Jones, Mrs. J.C., 1854-</part>
-rw-r--r-- 1 mst3k snac 2112 Jan 14 15:14 qa/qa_367559635_100_acad_concat.xml
less cpf/*367559635.c.xml
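The ordering issue described above can be shown with a short Python sketch. It is not the XSLT itself; concat_subfields is a hypothetical name, and joining with a single space is an assumption that happens to reproduce the part value shown above.

```python
def concat_subfields(subfields):
    """Join subfield values in document order, ignoring the codes."""
    return " ".join(value.strip() for _code, value in subfields)


# Subfields in the order they occur in the record: a, c, a, d.
record = [("a", "Jones,"), ("c", "Mrs."), ("a", "J.C.,"), ("d", "1854-")]

print(concat_subfields(record))

# Grouping by code first (a, a, c, d) reorders the name parts, which is
# the behavior this test guards against.
print(concat_subfields(sorted(record, key=lambda pair: pair[0])))
```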
Tests proper de-duping that ignores trailing punctuation.
<subfield code="a">Hachimonji, Kumezô.</subfield>
<subfield code="a">Hachimonji, Kumezô</subfield>
-rw-r--r-- 1 mst3k snac 6104 Oct 26 14:09 qa_39793761_punctation_name.xml
less cpf/*39793761.c.xml
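De-duping that ignores trailing punctuation, combined with the case-insensitive comparison tested by the Washburn (F|f)amily record, could look roughly like the Python sketch below. The function names and the choice to strip every trailing ASCII punctuation character are assumptions; the real stylesheet may normalize differently.

```python
import string


def dedupe_key(name):
    """Hypothetical comparison key: drop trailing punctuation, fold case."""
    return name.rstrip(string.punctuation + string.whitespace).casefold()


def dedupe_names(names):
    """Keep the first occurrence of each name, comparing by dedupe_key."""
    seen = set()
    kept = []
    for name in names:
        key = dedupe_key(name)
        if key not in seen:
            seen.add(key)
            kept.append(name)
    return kept


print(dedupe_names(["Hachimonji, Kumezô.", "Hachimonji, Kumezô"]))
```

The key is used only for comparison, so the name that is kept retains its original punctuation.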
One of two examples of a 100 family with no 100$d date, so it uses the 245$f date as an "active" date.
<datafield tag="245" ind1="1" ind2="0">
<subfield code="a">Waterman family papers,</subfield>
<subfield code="f">1839-1906.</subfield>
<fromDate localType=""
standardDate="1839">active 1839</fromDate>
<toDate localType=""
standardDate="1906">active 1906</toDate>
-rw-r--r-- 1 mst3k snac 3826 Dec 19 11:14 qa/qa_42714894_waterman_245f_date.xml
less cpf/*42714894.c.xml
less cpf/OCLC-CUYGB-42714894.c.xml
Waterman, R. W. (Robert Whitney), 1826-1891.
Derived from 700, therefore resourceRelation/@xlink:arcrole="creatorOf" instead of "referencedIn".
less cpf/OCLC-CUYGB-42714894.r01.xml
Has a lone comma in 700$a which broke the code at one point. I can't remember why I called it "multi_sequence".
<datafield tag="700" ind1="1" ind2=" ">
<subfield code="a">,</subfield>
<subfield code="e">interviewer.</subfield>
<fromDate standardDate="1905"
<toDate standardDate="1990"
-rw-r--r-- 1 mst3k snac 3589 Nov 19 21:03 qa_44529109_multi_sequence.xml
less cpf/*44529109.c.xml
less cpf/OCLC-ZP3-44529109.c.xml
Related to a 1xx family, Daniels, Annie Seayrs, 1869-1946. Has resourceRelation/@xlink:arcrole="referencedIn". Has cpfRelation/@xlink:arcrole="associatedWith". Derived from 600.
less cpf/OCLC-ZP3-44529109.r01.xml
(n/a for automated testing in qa_marc2cpf.xsl since there are plenty of other suspicious date checks.)
Only in the .r01 file.
Tate, Jean, date -
Newer: date -
Older: date -
-rw-r--r-- 1 mst3k snac 2912 Nov 26 14:46 qa_495568547_date_hyphen_only.xml
less cpf/OCLC-DPL-495568547.r01.xml
We parse 1870s.
<fromDate standardDate="1870"
-rw-r--r-- 1 mst3k snac 2615 Oct 9 10:24 qa/qa_505818582_date_1870s.xml qa/qa_505818582_date_1870s.xml oclc_marc2cpf.xsl
less cpf/*505818582.c.xml
Test that 1800s does the whole 100 years just as 1870s does an entire 10 years. Note the -1 offset between "1800s" and "19th century"
<fromDate standardDate="1800"
</existDates> qa/qa_505818582a_date_1870s.xml oclc_marc2cpf.xsl
less cpf/*GHT-505818582a.c.xml
less cpf/*GHT-505818582a.r01.xml
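The decade and "1800s" behavior described above can be sketched in Python. This is an assumption-laden illustration: decade_range is a hypothetical name, the end year for "1800s" (1899) is inferred from the "whole 100 years" remark and is not shown in the output above, and the "'s" to "s" rewrite mirrors the tokenizer change mentioned later in these notes.

```python
import re


def decade_range(text):
    """Return (first_year, last_year) for 'nnn0s' or "nnn0's".

    Hypothetical helper: a year ending in 00 spans 100 years, any other
    year ending in 0 spans 10 years. Returns None otherwise.
    """
    token = text.strip().replace("'s", "s")
    m = re.fullmatch(r"(\d{3}0)s", token)
    if not m:
        return None
    start = int(m.group(1))
    span = 100 if start % 100 == 0 else 10
    return (start, start + span - 1)


print(decade_range("1870s"))
print(decade_range("1800s"))
```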
<datafield tag="100" ind1="1" ind2=" ">
<subfield code="a">Whipple, John Adams,</subfield>
<subfield code="d">1822-1891,</subfield>
<subfield code="e">photographer.</subfield>
<subfield code="4">att</subfield>
<fromDate standardDate="1822"
<toDate standardDate="1891"
<occupation localType="">
-rw-r--r-- 1 mst3k snac 2662 Nov 2 08:25 qa_51451353_e4_occupation.xml
less cpf/*51451353.c.xml
This may exist to exercise 650$b multiple values which maybe is supposed to be non-repeating. It does repeat, so we concatenate. At one point it broke something, perhaps a string function by sending a sequence instead of a single string.
<datafield tag="650" ind1="1" ind2="7">
<subfield code="a">CHILE</subfield>
<subfield code="b">MINISTERIO DE TIERRAS Y COLONIZACION</subfield>
<subfield code="2">renib</subfield>
<localDescription localType="">
-rw-r--r-- 1 mst3k snac 2348 Nov 19 11:17 qa_55316797_650_multi_b.xml
less cpf/*55316797.c.xml
This .r56 should be a questionable date. It broke the code by trying to turn a null string into a number. Either it made 0000 which is questionable (wrong), or Saxon died with an error.
<date localType="">1714 or -15-1757.</date>
-rw-r--r-- 1 mst3k snac 14958 Nov 26 10:35 qa_611138843_date_or_hyphen_15_hyphen_1757.xml
less cpf/*611138843.r56.xml
<fromDate standardDate="1834"
notAfter="1835">1834 or 1835</fromDate>
-rw-r--r-- 1 mst3k snac 3278 Oct 8 10:27 qa_671812214_b_dot_or.xml qa/qa_671812214_b_dot_or.xml oclc_marc2cpf.xsl
less cpf/*671812214.c.xml
Is (and should be) questionable. Make sure it doesn't crash the script.
<date localType="">16uu-17uu.</date>
-rw-r--r-- 1 mst3k snac 3801 Oct 9 11:57 qa_678631801_date_16uu.xml
less cpf/*678631801.c.xml
Oct 8 via email Daniel says person single date is active date.
<part>Longinus, 1st cent.</part>
<date standardDate="0001"
notAfter="0100">active 1st century</date>
-rw-r--r-- 1 mst3k snac 3347 Nov 20 10:44 qa_702176575_date_1st_cent.xml
less cpf/*702176575.c.xml
(n/a qa_marc2cpf.xsl automated testing since the cpf extraction will die.)
The 100$e doesn't hit any occupation, but I think at one point trying to resolve a string with a leading comma caused it to process as a sequence and not as a string and that made Saxon die with an error.
-rw-r--r-- 1 mst3k snac 6825 Nov 20 14:05 qa_733102265_comma_interviewee.xml
less cpf/*733102265.c.xml
(n/a automated testing since we have other cases like this.)
"Donors" is not an occupation in our authority files. "Authors" is.
<datafield tag="100" ind1="1" ind2=" ">
<subfield code="a">Jacobs, Jo,</subfield>
<subfield code="d">1933-</subfield>
<subfield code="e">donor</subfield>
<subfield code="e">author.</subfield>
<occupation localType="">
-rw-r--r-- 1 mst3k snac 2282 Nov 2 08:52 qa_768242927_multi_e_occupation.xml
less cpf/*768242927.c.xml
Verify that the .r08 future date is questionable, even if it is probably a typo.
<existDates localType="">
<date localType="">8030-1857.</date>
-rw-r--r-- 1 mst3k snac 7190 Nov 26 10:45 qa_777390959_date_greater_than_2012.xml qa/qa_777390959_date_greater_than_2012.xml oclc_marc2cpf.xsl
less cpf/*777390959.c.xml
less cpf/*777390959.r08.xml
look for only one .r record for "Kane, Thomas Leiper,". He shows up in the MODS record as a (co)creator.
-rw-r--r-- 1 mst3k snac 3197 Sep 18 09:15 qa_8560008_dup_600_700.xml
less cpf/*8560008.c.xml
less cpf/*8560008.r01.xml
grep "part>Kane" cpf/*8560008.*.xml
cpf/YWM-8560008.r01.xml: Kane, Thomas Leiper, 1822-1883.
MODS abstract is all 520$a concatenated with space.
-rw-r--r-- 1 mst3k snac 4168 Nov 28 12:57 qa_8560380_multi_520a.xml
less cpf/*8560380.c.xml
verify geo "Ohio--Ashtabula County" from concat 650 with multi $z
<datafield tag="650" ind1=" " ind2="0">
<subfield code="a">Real property</subfield>
<subfield code="z">Ohio</subfield>
<subfield code="z">Ashtabula County.</subfield>
<place localType="">
<placeEntry>Ohio--Ashtabula County</placeEntry>
-rw-r--r-- 1 mst3k snac 5389 Sep 27 10:26 qa_8562615_multi_650_multi_z.xml qa/qa_8562615_multi_650_multi_z.xml oclc_marc2cpf.xsl
less cpf/*8562615.c.xml
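Building "Ohio--Ashtabula County" from a 650 with multiple $z subfields can be sketched in Python as below. The function name is hypothetical, and stripping a trailing period from each $z value is an assumption based on the placeEntry shown above; it would also strip the final period of an abbreviation, which the real code may handle differently.

```python
def geo_from_subfields(subfields):
    """Join all $z values with '--', dropping trailing periods.

    Hypothetical helper; returns None when the field has no $z.
    """
    parts = [value.strip().rstrip(".").strip()
             for code, value in subfields if code == "z"]
    parts = [part for part in parts if part]
    return "--".join(parts) if parts else None


field_650 = [("a", "Real property"), ("z", "Ohio"), ("z", "Ashtabula County.")]
print(geo_from_subfields(field_650))
```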
100 is the .c and dup in 600 should not make a .r file. Value is "Longfellow, Henry Wadsworth, 1807-1882."
-rw-r--r-- 1 mst3k snac 2252 Sep 18 09:14 qa_8563535_dup_100_600.xml
less cpf/*8563535.c.xml
(n/a I can't figure out what this tests. Seems to no longer apply. Or maybe a test for 'rps' which is unknown?)
Was a test for noRegistryResults, but that has changed. Maybe now a test for Unknown repository?
<roleTerm valueURI="">Repository</roleTerm>
<span localType="noRegistryResults"/>
<span localType="original">BANC</span>
-rw-r--r-- 1 mst3k snac 4192 Nov 15 12:44 qa_86132608_marc_040a_BANC_no_result.xml
less cpf/*86132608.c.xml
Person with no 100$d uses 245$f as active date: active 1922-1949.
-rw-r--r-- 1 twl8n snac 3429 Feb 20 12:50 qa/qa_8586125_person_245f_date_range.xml
less cpf/*8586125.c.xml
(n/a If it isn't parsed, we aren't QA'ing it.)
Not parsed. The old date code did this, but the new 245$f specific active date code doesn't parse century. Person with no 100$d uses 245$f, active 1st century (manually modified data)
-rw-r--r-- 1 twl8n snac 3432 Feb 20 13:49 qa/qa_8586125a_person_245f_century.xml
less cpf/*8586125a.c.xml
(n/a If it isn't parsed, we aren't QA'ing it.)
Not parsed by new code. Person with no 100$d uses 245$f, active 1st/2nd century (manually modified data)
-rw-r--r-- 1 twl8n snac 3436 Feb 20 13:51 qa/qa_8586125b_person_245f_two_century.xml
less cpf/*8586125b.c.xml
(n/a If it isn't parsed, we aren't QA'ing it.)
Not parsed by new code. Person with no 100$d uses 245$f, single active 1st century - 2nd century (manually modified data)
I don't think there is a rule to properly parse this, but I don't think we have any of these in the real WorldCat data, so this might be a pointless QA test.
-rw-r--r-- 1 twl8n snac 3446 Feb 20 13:53 qa/qa_8586125c_person_245f_range_century.xml
less cpf/*8586125c.c.xml
(n/a duplicate.)
Person with no 100$d uses 245$f, active 1980.
-rw-r--r-- 1 twl8n snac 2245 Feb 20 13:02 qa/qa_8594295_person_single_245f_date.xml
less cpf/*8594295.c.xml
Multiple dates so we only capture min as fromDate and max as toDate since we only care about active.
-rw-r--r-- 1 twl8n snac 3346 Mar 8 15:50 qa/qa_123410649_multi_date_245f.xml
less cpf/WiMiJHS-123410649.c.xml
less cpf/WiMiJHS-123410649.r01.xml
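The min/max handling of multiple 245$f dates can be sketched in Python. This is a guess at the shape of the logic, not the tpt_parse_alt template itself: active_range is a hypothetical name, and the max_year cutoff (defaulting to 2012 here) is an assumption used to drop implausible future years.

```python
import re


def active_range(text, max_year=2012):
    """Return (fromDate, toDate) as the min and max 4-digit years found.

    Hypothetical helper: years greater than max_year are ignored.
    Returns None when no usable 4-digit year is present.
    """
    years = [int(y) for y in re.findall(r"(?<!\d)\d{4}(?!\d)", text)]
    years = [y for y in years if y <= max_year]
    if not years:
        return None
    return (min(years), max(years))


print(active_range("1917-1960."))
print(active_range("????-????"))
```

With the same cutoff, a value like "June 1947 (Tammuz 5707)." keeps only 1947, which matches the intent of the 5707 test further down.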
Multiple dates in two 245$f subfields, gets parsed as multi-date for fromDate and toDate.
Also tests Worldcat agency code lookup. Should resolve to OCLC-UCB.
-rw-r--r-- 1 twl8n snac 2048 Mar 4 11:37 qa/qa_85037313a_two_245f_alt_dates.xml
less cpf/OCLC-UCB-85037313a.c.xml
less cpf/OCLC-UCB-85037313a.r01.xml
less cpf/OCLC-UCB-85037313a.r02.xml
(n/a since the comment says it is pointless.)
Manually modified. Is a "not1xx", although we now process not1xx. Also has one (or several?) fake az (perhaps 651 Maine Freeport.) Does have a 600, therefore generates .r00, but the fake az values aren't used. Seems pointless. (Maybe the fake $a$z were processed at some point. The tpt_geo code should be/is exercised elsewhere.)
<datafield tag="600" ind1="1" ind2="0">
<subfield code="a">Cushing, E., Captain.</subfield>
<part>Cushing, E., Captain.</part>
<datafield tag="651" ind1=" " ind2="0">
<subfield code="a">Maine</subfield>
<subfield code="z">Freeport.</subfield>
<datafield tag="651" ind1=" " ind2="0">
<subfield code="a">Ohio</subfield>
<subfield code="z">Dayton.</subfield>
<subfield code="z">Main Street.</subfield>
<datafield tag="651" ind1=" " ind2="0">
<subfield code="a">Freeport (Me.)</subfield>
<subfield code="x">Commerce</subfield>
<subfield code="z">Louisiana</subfield>
<subfield code="z">New Orleans.</subfield>
-rw-r--r-- 1 twl8n snac 3033 Feb 20 16:00 qa/qa_128216482a_fake_az.xml
less cpf/*128216482a.r00.xml
Has 100$e and 656.
<date localType=""
standardDate="1976">active 1976</date>
<occupation localType="">
-rw-r--r-- 1 twl8n snac 3578 Mar 13 16:40 qa/qa_147444338_100e_different_occ_600_656a.xml
less cpf/OCLC-DLH-147444338.c.xml
Has 110$e which should become a function.
<datafield tag="110" ind1="2" ind2=" ">
<subfield code="a">Allegany County (N.Y.).</subfield>
<subfield code="b">Historian's Office,</subfield>
<subfield code="e">collector.</subfield>
<fromDate localType=""
standardDate="1989">active 1989</fromDate>
<toDate localType=""
standardDate="1994">active 1994</toDate>
<function localType="">
-rw-r--r-- 1 twl8n snac 3656 Jan 10 16:15 qa/qa_155416763_110_e_occupation.xml
less cpf/OCLC-NYHVD-155416763.c.xml
(n/a duplicate.)
100$4 "col" becomes an occupation.
<datafield tag="100" ind1="1" ind2=" ">
<subfield code="a">Herzog, George,</subfield>
<subfield code="d">1901-1983.</subfield>
<subfield code="4">col</subfield>
<subfield code="4">dpt</subfield>
<fromDate standardDate="1901"
<toDate standardDate="1983"
<occupation localType="">
-rw-r--r-- 1 twl8n snac 4657 Jan 10 16:15 qa/qa_17851136_no_e_two_4_occupation.xml
less cpf/OCLC-IJZ-17851136.c.xml
Manually modified. Dup 656 only outputs once, and only in the .c.xml file. Should be no occ in .r01.xml.
<datafield tag="656" ind1=" " ind2="7">
<subfield code="a">Academics.</subfield>
<subfield code="2">local</subfield>
<datafield tag="656" ind1=" " ind2="7">
<subfield code="a">Academics.</subfield>
<subfield code="2">local</subfield>
-rw-r--r-- 1 twl8n snac 3766 Feb 20 16:27 qa/qa_225810091a_600_family_date_dupe_occupation.xml
This is a weird record. "College teachers" is encoded as a person.
<datafield tag="600" ind1="1" ind2="0">
<subfield code="a">(data entry error)College teachers</subfield>
<subfield code="z">England</subfield>
<subfield code="x">Archives.</subfield>
(data entry error)College teachers
less cpf/*225810091a.c.xml
less cpf/OCLC-AU064-225810091a.r01.xml
(Duplicate of 100$e test above. Not implemented in qa_marc2cpf.xsl.)
Another 100$e.
<datafield tag="100" ind1="1" ind2=" ">
<subfield code="a">Whipple, John Adams,</subfield>
<subfield code="d">1822-1891,</subfield>
<subfield code="e">photographer.</subfield>
<subfield code="4">att</subfield>
<occupation localType="snac:derivedFromRole">
-rw-r--r-- 1 twl8n snac 2909 Jan 10 16:15 qa/qa_51451353_e4_occupation.xml
less cpf/OCLC-MAH-51451353.c.xml
(Duplicate of 100$e test above. Not implemented in qa_marc2cpf.xsl.)
Another 100$e. Multi $e, but only one is in our occupations file.
<datafield tag="100" ind1="1" ind2=" ">
<subfield code="a">Jacobs, Jo,</subfield>
<subfield code="d">1933-</subfield>
<subfield code="e">donor</subfield>
<subfield code="e">author.</subfield>
<occupation localType="">
-rw-r--r-- 1 twl8n snac 2529 Jan 10 16:15 qa/qa_768242927_multi_e_occupation.xml
less cpf/OCLC-EXW-768242927.c.xml
Many 545 datafields for a long bioghist.
-rw-r--r-- 1 twl8n snac 19777 Mar 14 16:25 qa/qa_123439230_multi_545_multi_b.xml
less cpf/*123439230.c.xml
less cpf/OCLC-COO-123439230.c.xml
After discussion, I think we're sure this is "active 1768 to 1792" and not "active 1768, died 1792".
<datafield tag="100" ind1="1" ind2=" ">
<subfield code="a">Meriwether, George,</subfield>
<subfield code="d">fl. 1768-1792.</subfield>
<fromDate standardDate="1768"
localType="">active 1768</fromDate>
<toDate standardDate="1792"
localType="">active 1792</toDate>
-rw-r--r-- 1 twl8n snac 6482 Mar 19 16:40 qa/qa_14638716_fl_1768-1792_date_range.xml
less cpf/OCLC-VCW-14638716.c.xml
Fixed. Parses. We used to not parse "'s" dates, only "s" dates. Unclear why this parsed at all, although I guess it is a good sign the date parser is robust enough to cope with this unsupported format. Unclear why this parsed as "1920" instead of "1920s".
Test "1980's" in alt date parsing. The "s" should be ignored, and not cause the script to crash.
<fromDate localType=""
notAfter="1923">active approximately 1920</fromDate>
<toDate localType=""
standardDate="1980">active 1980</toDate>
-rw-r--r-- 1 twl8n snac 4968 Mar 21 14:34 qa/qa_62173411_1980_quote_s_245f.xml qa/qa_62173411_1980_quote_s_245f.xml oclc_marc2cpf.xsl
less cpf/OCLC-ALK-62173411.c.xml
Test that "18th cent.?" doesn't crash the script. The parser ignores ? when related to century values.
New, corrected as of Oct 8 via email from Daniel: active 18th century
Old, wrong: 18th century qa/qa_85016803_cent_question_mark_date.xml oclc_marc2cpf.xsl
less cpf/NNFr-85016803.c.xml
Nov 26 2013: The real date parser (tpt_exist_dates and tpt_show_date) now attempts to parse all (?) dates. tpt_parse_alt is called only if this produces a suspicious date, and only in certain cases. One of those cases is the 245$f. The tpt_exist_dates tokenizer has been updated to change 1920's to 1920s.
Interesting to note that ca. has no effect on 1920s since that is already approximate via another mechanism.
Old: The 's is ignored, although it has to be removed internally to keep the script from crashing. fromDate and toDate are both approx. This probably tests all necessary 2 date paths.
<datafield tag="245" ind1="1" ind2="0">
<subfield code="a">Fisheries and wildlife research in Alaska</subfield>
<subfield code="h">[graphic],</subfield>
<subfield code="f">ca. 1920's- ca. 1980's.</subfield>
New: active approximately 1920s active 1980 s
Old: active approximately 1920 active approximately 1980 qa/qa_62173411b_todate_1980_quote_s_245f.xml oclc_marc2cpf.xsl
less cpf/OCLC-ALK-62173411b.c.xml
Test the single date 245$f approximately execution path. Due to poor code formatting/factoring (repeated sections not in functions or templates) separate tests are necessary.
<datafield tag="245" ind1="1" ind2="0">
<subfield code="a">Fisheries and wildlife research in Alaska</subfield>
<subfield code="h">[graphic],</subfield>
<subfield code="f">ca. 1920's.</subfield>
<date standardDate="1920"
notAfter="1929">active approximately 1920s</date>
old: active approximately 1920 qa/qa_62173411c_single_1920_quote_s_245f.xml oclc_marc2cpf.xsl
less cpf/OCLC-ALK-62173411c.c.xml
245$f future date with 5707 that we must ignore.
Unclear how this works, but it does. Probably via tpt_parse_alt that is simply looking for a valid looking nnnn.
June 1947 (Tammuz 5707). qa/qa_54643931_245f_date_5707.xml oclc_marc2cpf.xsl
less cpf/OCLC-YUS-54643931.c.xml
Test for "19th or 20th cent" which has caused problems. I think there's only one of these. Still...
Must not be suspicious based on >60 active range. qa/qa_311261944_19th_or_20th_cent_date.xml oclc_marc2cpf.xsl
less cpf/OCLC-UXG-311261944.c.xml
See the .r01. Discuss: fix trailing commas? From email "SNAC WorldCat extracts -- initial feedback", Dec 21.
We do not use the normative date as part of the name. Dates which are part of names are copied intact from the original, thus the trailing comma with Edwin Abbey exists because it was part of the original 100$d and not because we created a normative form for the name.
We also do not currently remove period or comma from the end of names. (We might add period to the end of names if it does not exist.) We could, but that leads to a larger discussion of formatting names for human legibility vs formatting for machine actionable. I would consider a style sheet (CSS or XSLT) "machine action".
-rw-r--r-- 1 twl8n snac 2498 Apr 2 10:42 qa/qa_227009956_name_w_date_trailing_comma.xml qa/qa_227009956_name_w_date_trailing_comma.xml oclc_marc2cpf.xsl
less cpf/OCLC-SNN-227009956.r01.xml
less cpf/OCLC-SNN-227009956.c.xml
Manually modified to remove second, reversed 651. Test 651 $a$x$z where $a and $z become separate places, not concatenated places.
-rw-r--r-- 1 twl8n snac 2670 Apr 8 09:20 qa/qa_123429932b_651_a_x_z_places_no_second_651.xml qa/qa_123429932b_651_a_x_z_places_no_second_651.xml oclc_marc2cpf.xsl
less cpf/CSt-H-123429932b.c.xml
Test 651 $a$x$z where $a and $z become separate places, not concatenated places with the additional twist that there is a second 651 with the same values in reversed order so the de-duping will result in only one each of "United States" and "Soviet Union".
-rw-r--r-- 1 twl8n snac 2853 Apr 8 09:04 qa/qa_123429932_651_a_x_z_places.xml qa/qa_123429932_651_a_x_z_places.xml oclc_marc2cpf.xsl
less cpf/CSt-H-123429932.c.xml
(n/a because we aren't using geonames right now.)
Check correct concatenation of admin1Name, admin2Name, and name accounting for duplicate strings and empty strings. Based on searching geonames manually with featureClass=P, I think it is simply "Ambridge" 5178228 and not the "Borough of Ambridge" 5178236 which results from searching featureClass=A.
<place localType="">
<placeEntry latitude="40.59118" longitude="-80.22562"
countryCode="US">Pennsylvania--Beaver County--Borough of Ambridge</placeEntry>
</place> qa/qa_609408959_ambridge_places.xml oclc_marc2cpf.xsl
less cpf/OCLC-UPM-609408959.c.xml
(n/a since we don't know what it should be, nothing to test.)
The .r01 has a strange date "d. ca. 1534-1581." The parser does not handle this correctly, and fails to mark this as suspicious. It is unclear what this date means. Nothing has been fixed. qa/qa_270613733_d_ca_1534-1581_odd_date.xml oclc_marc2cpf.xsl
less cpf/US-SNAC-270613733.r01.xml
Hughes, John.
The .r03 incorrectly had resourceRelation correspondedWith the original .c record.
Has useful looking 520$a and 520$b which we currently ignore, except in the mods/abstract.
less cpf/OCLC-IBV-34992098.c.xml
less cpf/OCLC-IBV-34992098.r03.xml
Russell, R. Fulton, 19th/20th cent.
The 100$a person is also a 600$a as 'correspondence' as is another 600$a.
We throw out the duplicate entry when de-duping the list of CPF entities. We mark the second 600$a as correspondedWith.
Has a useful looking 520$a which we currently ignore, except in the mods/abstract.
Must not be suspicious based on > 60 year active date range
less cpf/OCLC-UXG-281846814.c.xml
less cpf/OCLC-UXG-281846814.r02.xml
Chapman, Pattie, 1829 or 30-1912 or 16,
Has 600 with date 1829 or 30-1912 or 13. Has 700 with date 1829 or 30-1912 or 16.
Tests toDate with "or xx" which might use slightly different logic than the fromDate.
De-duping removes the 600 entry, so Chapman is r09 from the 700 entry, not r04. Uses 700$d date.
<fromDate standardDate="1829"
notAfter="1830">1829 or 1830</fromDate>
<toDate standardDate="1912"
notAfter="1916">1912 or 1916</toDate>
less cpf/OCLC-UXG-435496213a.r09.xml
Central of Georgia Railway. Executive Dept.
Tests 7xx$k for the 1xx record, aka the .c record. Is 1xx and has a 700$k correspondence and a 710$k correspondence. Has 2 cpfRelation elements where xlink:arcrole="". See 45038108.r05 below.
less cpf/OCLC-GHT-45038108.c.xml
Railroad Mutual Loan Association (Ga.).
Tests 7xx$k for the 7xx record, which is .r05 here. Is 710 with $k correspondence. Has cpfRelation to the 1xx correspondedWith. See 45038108.c above.
<cpfRelation xmlns:xlink=""
<relationEntry>Central of Georgia Railway. Executive Dept.</relationEntry>
<span localType="">OCLC-GHT-45038108.c</span>
less cpf/OCLC-GHT-45038108.r05.xml
Osgood Studio (Ellsworth, Me.)
active 21st century is suspicious due to toDate/@notAfter="2100"
less cpf/MeCiGCI-776715862.r03.xml