diff --git a/.nojekyll b/.nojekyll
index a3ecdf4..0136449 100644
--- a/.nojekyll
+++ b/.nojekyll
@@ -1 +1 @@
-ec9bf60c
\ No newline at end of file
+0841f5a6
\ No newline at end of file
diff --git a/chapter_1_intro/1_1_welcome.html b/chapter_1_intro/1_1_welcome.html
index 1d2c09d..833d794 100644
--- a/chapter_1_intro/1_1_welcome.html
+++ b/chapter_1_intro/1_1_welcome.html
diff --git a/chapter_1_intro/1_2_why_clojure.html b/chapter_1_intro/1_2_why_clojure.html
index 52c96c4..9bf50d6 100644
--- a/chapter_1_intro/1_2_why_clojure.html
+++ b/chapter_1_intro/1_2_why_clojure.html
diff --git a/chapter_1_intro/1_3_set_up.html b/chapter_1_intro/1_3_set_up.html
index a448937..d6228e2 100644
--- a/chapter_1_intro/1_3_set_up.html
+++ b/chapter_1_intro/1_3_set_up.html
diff --git a/chapter_2_input_output/2_1_loading_data/index.html b/chapter_2_input_output/2_1_loading_data/index.html
index f57b91d..ca0c768 100644
--- a/chapter_2_input_output/2_1_loading_data/index.html
+++ b/chapter_2_input_output/2_1_loading_data/index.html
(ns chapter-2-input-output.2-1-loading-data
   {:nextjournal.clerk/visibility {:code :hide}
    :nextjournal.clerk/toc true}
   (:require [scicloj.kind-clerk.api :as kind-clerk]))
-
+
(kind-clerk/setup!)
:ok
;; This is a work in progress of the code examples that will make up chapter 2,
;; section 1 of the Clojure data cookbook

;; # 2.1 How to get data into the notebook

;; ## How to get data into the notebook

;; ### Reading from a delimited text file
;; Easiest with standard file formats, e.g. CSV.

;; #### With Clojure's standard CSV library

(require '[clojure.data.csv :as csv]
         '[clojure.java.io :as io])

^{:nextjournal.clerk/viewer :table}
(with-open [reader (io/reader "data/co2_over_time.csv")]
  (doall (csv/read-csv reader)))

;; Returns: Lazy sequence of vectors of strings (one value per cell)
;; TODO: Link to useful explainer on lazy seqs, explain why we include doall here

;; #### With tablecloth
;; For most work involving tabular/columnar data, you'll use tablecloth,
;; Clojure's go-to data wrangling library. These all return a tech.ml.dataset
;; Dataset object. The implementation details aren't important now, but
;; tech.ml.dataset is the library that allows for efficient and fast operations
;; on columnar datasets.
;; TODO: Be consistent about you vs we -- pick one and stick with it

(require '[tablecloth.api :as tc])
(require '[nextjournal.clerk :as clerk])

;; (clerk/add-viewers!
;;  [{:pred #(= tech.v3.dataset.impl.dataset.Dataset (type %))
;;    ;; :fetch-fn (fn [_ file] {:nextjournal/content-type "image/png"
;;    ;;                         :nextjournal/value (Files/readAllBytes (.toPath file))})
;;    :render-fn v/table}])

(-> "data/co2_over_time.csv"
    tc/dataset)

;; Note the built-in pretty printing.
;; TODO: Write elsewhere about kindly and notebooks, how they know how to render different things

;; Easy things to tidy up at import time:

;; ##### Transforming headers
;; We'll require Clojure's standard string library for this example. The
;; transformation function is arbitrary though, accepting a single header value
;; and returning a single, transformed value.
(require '[clojure.string :as str])

(defn- lower-case-keyword [val]
  (-> val
      (str/replace #" +" "-")
      str/lower-case
      keyword))

(-> "data/co2_over_time.csv"
    (tc/dataset {:key-fn lower-case-keyword}))

;; ##### Specifying separators
;; Tablecloth is pretty smart about standard formats, e.g. CSV above and TSV:

(-> "data/co2_over_time.tsv"
    tc/dataset)

;; But it can also accept an arbitrary separator if for some reason you have
;; some data that uses a non-standard file format (have a look at
;; data/co2_over_time.txt). Note the separator has to be a single character.

(-> "data/co2_over_time.txt"
    (tc/dataset {:separator "/"}))

;; ##### Specify file encoding
;; TODO: does this really matter? test out different file encodings..

;; ##### Normalize values into consistent formats and types
;; Tablecloth makes it easy to apply arbitrary transformations to all values in
;; a given column. We can inspect the column metadata with tablecloth:

(def dataset (tc/dataset "data/co2_over_time.csv"))

(-> dataset
    (tc/info :columns))

;; Certain types are built-in (tablecloth knows what to do to convert them),
;; e.g. numbers:
;; TODO: Explain why numbers get rounded? Probably not here.. in addendum about numbers in Clojure

(-> dataset
    (tc/convert-types "CO2" :double)
    (tc/info :columns))

;; The full list of magic symbols representing types tablecloth supports comes
;; from the underlying tech.ml.dataset library:

(require '[tech.v3.datatype.casting :as casting])

@casting/valid-datatype-set

;; More details on supported types here.
;; TODO: Explain when to use :double vs :type/numerical? What's the difference?
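As a standalone illustration of what a `:key-fn` like `lower-case-keyword` produces, here is the same transformation applied to plain header strings. This is a sketch in Clojure core only (no tablecloth), assuming the intent of the regex is to replace runs of spaces with hyphens:

```clojure
(require '[clojure.string :as str])

;; Same shape as the :key-fn above: collapse runs of spaces to hyphens,
;; lower-case, and keywordize. The regex is an assumption about the intent.
(defn lower-case-keyword [val]
  (-> val
      (str/replace #" +" "-")
      str/lower-case
      keyword))

(map lower-case-keyword ["Date" "CO2" "adjusted CO2"])
;; => (:date :co2 :adjusted-co2)
```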
;; You can also process multiple columns at once, either by specifying a map of
;; columns to data types:

(-> dataset
    (tc/convert-types {"CO2" :double
                       "adjusted CO2" :double})
    (tc/info :columns))

;; Or by changing all columns of a certain type to another:

(-> dataset
    (tc/convert-types :type/numerical :double)
    (tc/info :columns))

;; The supported column types are:
;; :type/numerical - any numerical type
;; :type/float - floating point number (:float32 and :float64)
;; :type/integer - any integer
;; :type/datetime - any datetime type

;; Also the magical :!type qualifier exists, which will select the complement
;; set: all columns that are not the specified type.

;; For others you need to provide a casting function yourself, e.g. adding the
;; UTC start of day, accounting for local daylight saving time:

(defn to-start-of-day-UTC [local-date]
  (-> local-date
      .atStartOfDay
      (java.time.ZonedDateTime/ofLocal
       (java.time.ZoneId/systemDefault)
       (java.time.ZoneOffset/UTC))))

(-> dataset
    (tc/convert-types "Date" [[:timezone-date to-start-of-day-UTC]])
    (tc/info :columns))

;; For full details on all the possible options for type conversion of columns,
;; see the tablecloth API docs.

;; ### Reading from a URL

;; CSV:

(-> "https://vega.github.io/vega-lite/data/co2-concentration.csv"
    tc/dataset)

;; JSON: works as long as the data is an array of maps

(-> "https://vega.github.io/vega-lite/data/cars.json"
    tc/dataset)

;; Tablecloth can handle a string that points to any file that contains either
;; raw or gzipped csv/tsv, json, or xls(x), on the local file system or at a URL.
;; ### Reading an excel file
;; Tablecloth supports reading xls and xlsx files if the underlying Java
;; library for working with excel files is included:

(require '[tech.v3.libs.poi])

;; This is not included in the library by default because poi has a hard
;; dependency on log4j2, along with many other dependencies that the core team
;; at tech.ml.dataset (upon which tablecloth is built) did not want to impose
;; on all users by default
;; (https://clojurians.zulipchat.com/#narrow/stream/236259-tech.2Eml.2Edataset.2Edev/topic/working.20with.20excel.20files/near/314711378).
;; You can still require it here; you'll most likely just see an error that
;; says something like "Log4j2 could not find a logging implementation. Please
;; add log4j-core to the classpath.", unless you already have a valid log4j
;; config on your classpath.

;; This should work according to the maintainers, but does not at the moment:

(tc/dataset "data/example_XLS.xls" {:filetype "xls"})
(tc/dataset "data/example_XLSX.xlsx" {:filetype "xlsx"})

(require '[dk.ative.docjure.spreadsheet :as xl])

(def xl-workbook
  (xl/load-workbook "data/example_XLS.xls"))

;; To discover sheet names:

(->> xl-workbook
     xl/sheet-seq
     (map xl/sheet-name))

;; This will show us there is only one sheet in this workbook, named "Sheet1".
;; You can get the data out of it like this:

;; To discover header names:

(def headers
  (->> xl-workbook
       (xl/select-sheet "Sheet1")
       xl/row-seq
       first
       xl/cell-seq
       (map xl/read-cell)))

;; To get the data out of the columns:

(def column-index->header
  (zipmap [:A :B :C :D :E :F :G :H :I] headers))

(->> xl-workbook
     (xl/select-sheet "Sheet1")
     (xl/select-columns column-index->header))

;; and into a tablecloth dataset like this:

(->> xl-workbook
     (xl/select-sheet "Sheet1")
     (xl/select-columns column-index->header)
     (drop 1) ;; don't count the header row as a row
     tc/dataset)

;; You might be tempted to just iterate over each row and read each cell, but
;; it's more convenient to think of the data as column-based rather than
;; row-based for tablecloth's purposes. Setting the dataset headers is more
;; verbose when we're starting from a seq of seqs, since the header-row? option
;; does not work for a seq of seqs (this option is implemented in the low-level
;; parsing code for each supported input type and is not currently implemented
;; for a seq of seqs).

(def iterated-xl-data
  (->> xl-workbook
       (xl/select-sheet "Sheet1")
       xl/row-seq
       (map #(->> %
                  xl/cell-seq
                  (map xl/read-cell)))))

;; Note the header-row? option is not supported:

(tc/dataset iterated-xl-data {:header-row? true})

;; You can do it manually, but just working with columns from the start is more
;; idiomatic:

(let [headers (first iterated-xl-data)
      rows (rest iterated-xl-data)]
  (map #(zipmap headers %) rows))

;; ### Reading from a database

;; #### SQL database

;; (tc/dataset (,,, results from some SQL query))

;; requires com.github.seancorfield/next.jdbc {:mvn/version "1.3.847"} in deps.edn
;; Note you will also require the relevant driver for the type of db you are
;; trying to access.
;; These are some available ones:

(require '[next.jdbc :as jdbc])

;; Connect to the db:

(def db {:dbname "data/Chinook_Sqlite.sqlite"
         :dbtype "sqlite"})

(def ds (jdbc/get-datasource db))

ds

;; Pass the results of a SQL query to tablecloth to make a dataset:

(-> ds
    (jdbc/execute! ["SELECT * FROM artist"])
    (tc/dataset))

;; Passing a parameter to a query:

(-> ds
    (jdbc/execute! ["SELECT * FROM artist WHERE Name = ?" "Aerosmith"])
    (tc/dataset))

;; Note for SQLite specifically the concat operator is || not +:

(-> ds
    (jdbc/execute! ["SELECT * FROM artist WHERE Name like '%' || ? || '%'" "man"])
    (tc/dataset))

;; #### SPARQL database

(require '[grafter-2.rdf4j.repository :as repo])
(require '[grafter-2.rdf.protocols :as pr])

(def sparql (repo/sparql-repo "https://query.wikidata.org/sparql"))

;; taken from: https://query.wikidata.org/#%23Public%20sculptures%20in%20Paris%0ASELECT%20DISTINCT%20%3Fitem%20%20%3FTitre%20%3Fcreateur%20%28year%28%3Fdate%29%20as%20%3FAnneeCreation%29%20%3Fimage%20%3Fcoord%0AWHERE%0A%7B%0A%20%20%20%3Fitem%20wdt%3AP31%2Fwdt%3AP279%2a%20wd%3AQ860861.%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%23%20sculpture%0A%20%20%20%3Fitem%20wdt%3AP136%20wd%3AQ557141%20.%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%23%20genre%C2%A0%3A%20art%20public%0A%20%20%20%7B%3Fitem%20wdt%3AP131%20wd%3AQ90.%7D%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%23%20…%20situ%C3%A9e%20dans%20Paris%0A%20%20%20UNION%0A%20%20%20%7B%3Fitem%20wdt%3AP131%20%3Farr.%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%23%20…%20ou%20dans%20un%20arrondissement%20de%20Paris%20%0A%20%20%20%3Farr%20wdt%3AP131%20wd%3AQ90.%20%7D%0A%20%20%20%3Fitem%20rdfs%3Alabel%20%3FTitre%20FILTER%20%28lang%28%3FTitre%29%20%3D%20%22fr%22%29.%20%20%23%20Titre%0A%20%0A%20%20%20OPTIONAL%20%7B%3Fitem%20wdt%3AP170%20%3FQcreateur.%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%23%20cr%C3%A9ateur%2Fcr%C3%A9atrice%20%28option%29%0A%20%20%20%3FQcreateur%20rdfs%3Alabel%20%3Fcreateur%20FILTER%20%28lang%28%3Fcreateur%29%20%3D%20%22fr%22%29%20.%7D%0A%20%20%20OPTIONAL%20%7B%3Fitem%20wdt%3AP571%20%3Fdate.%7D%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%23%20date%20de%20cr%C3%A9ation%20%28option%29%0A%20%20%20OPTIONAL%20%7B%3Fitem%20wdt%3AP18%20%20%3Fimage.%7D%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%23%20image%20%28option%29%0A%20%20%20OPTIONAL%20%7B%3Fitem%20wdt%3AP625%20%3Fcoord.%7D%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%23%20coordonn%C3%A9es%20g%C3%A9ographiques%20%28option%29%0A%7D

(def sparql-results
  (let [conn (repo/->connection sparql)]
    (-> conn
        (repo/query "# Public sculptures in Paris
SELECT DISTINCT ?item ?title ?creator (year(?date) as ?year) ?coord
WHERE
{
   ?item wdt:P31/wdt:P279* wd:Q860861.            # sculpture
   ?item wdt:P136 wd:Q557141 .                    # genre: public art
   {?item wdt:P131 wd:Q90.}                       # ... located in Paris
   UNION
   {?item wdt:P131 ?arr.                          # ... or in an arrondissement of Paris
    ?arr wdt:P131 wd:Q90. }
   ?item rdfs:label ?title FILTER (lang(?title) = \"fr\").  # title
   OPTIONAL {?item wdt:P170 ?Qcreateur.           # creator (optional)
    ?Qcreateur rdfs:label ?creator FILTER (lang(?creator) = \"fr\") .}
   OPTIONAL {?item wdt:P571 ?date.}               # creation date (optional)
   OPTIONAL {?item wdt:P18  ?image.}              # image (optional)
   OPTIONAL {?item wdt:P625 ?coord.}              # geographic coordinates (optional)
}"))))

;; grafter can help format RDF values:

(def sparql-ds
  (-> sparql-results
      tc/dataset
      (tc/update-columns [:coord :title :creator]
                         (partial map pr/raw-value))))

;; ### Generating sequences

(defn seq-of-seqs [rows cols-per-row output-generator]
  (repeatedly rows (partial repeatedly cols-per-row output-generator)))

;; Of random numbers:

(defn random-number-between-0-1000 []
  (rand-int 1000))

(seq-of-seqs 10 4 random-number-between-0-1000)

(defn seq-of-maps [rows cols-per-row output-generator]
  (let [header-data (map #(str "header-" %) (range cols-per-row))
        row-data (seq-of-seqs rows cols-per-row output-generator)]
    (map #(zipmap header-data %) row-data)))

(seq-of-maps 10 4 random-number-between-0-1000)

;; dtype-next (the library underneath tech.ml.dataset, which is underneath
;; tablecloth) also has a built-in sequence generator:

(require '[tech.v3.datatype :as dtype])

(dtype/make-reader :string 4 (str "cell-" idx))

(dtype/make-reader :int32 4 (rand-int 10))

;; It is lazy, not cached, so be careful about using a computationally-heavy
;; fn as the generator.

;; ### Generating repeatable sequences of dummy data

(def consistent-data
  (map-indexed (fn [index _coll] (str "cell-" index))
               (range 10)))

(repeat (zipmap (range 10) consistent-data))

:end
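The "lazy, not cached" caveat can be demonstrated with plain Clojure, using an `eduction` as a stand-in for `dtype/make-reader` (a sketch; the names `calls` and `expensive-cell` are made up for illustration). An eduction is a non-caching view, so it re-runs its generator on every traversal, which a counter makes visible:

```clojure
;; `calls` counts how many times the (pretend-expensive) generator runs.
(def calls (atom 0))

(defn expensive-cell [i]
  (swap! calls inc)
  (* i i))

;; A non-caching view over (range 4): nothing is computed yet, and nothing
;; is remembered between traversals.
(def cells (eduction (map expensive-cell) (range 4)))

(into [] cells) ;; => [0 1 4 9]
(into [] cells) ;; => [0 1 4 9]

@calls ;; => 8, i.e. the generator ran once per cell per traversal
```

A plain `(map expensive-cell (range 4))` lazy seq, by contrast, caches its realized values, so the generator would run only four times no matter how often you walked the result.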


-source: book/chapter_2_input_output/2_1_loading_data.clj
+source: book/chapter_2_input_output/2_1_loading_data.clj
diff --git a/chapter_2_input_output/2_2_messy_data/index.html b/chapter_2_input_output/2_2_messy_data/index.html
index a27a903..2bb6569 100644
--- a/chapter_2_input_output/2_2_messy_data/index.html
+++ b/chapter_2_input_output/2_2_messy_data/index.html

(ns chapter-2-input-output.2-2-messy-data
   {:nextjournal.clerk/toc true}
   (:require [tablecloth.api :as tc]
             [tech.v3.datatype.functional :as fun]
             [scicloj.kind-clerk.api :as kind-clerk]))
-
+
(kind-clerk/setup!)
@@ -273,12 +273,12 @@

6 

6.1 Multiple types mixed in one column

Tablecloth will handle this just fine; it will simply give the column the type :object

-
+
(def mixed-types
   (tc/dataset {:A ["string" "more strings" 3]
                :B [1 2 "whoops"]}))
-
+
(tc/info mixed-types :columns)

_unnamed :column info [2 4]:

@@ -307,7 +307,7 @@

+
(tc/convert-types mixed-types :A :string)

_unnamed [3 2]:

@@ -333,7 +333,7 @@

+
(-> mixed-types
     (tc/convert-types :A :string)
     (tc/info :columns))
@@ -368,18 +368,18 @@

6.2 Multiple formats for a thing that’s supposed to have one (e.g. phone numbers, postal codes)

You can pass any arbitrary function to update a column.

-
+
(def misformatted
   (tc/dataset {:phone ["123-456-5654" "(304) 342 1235" "(423)-234-2342" "1234325984" "nope"]
                :postal-code ["t1n 0k2" "H9Q1L2" "H3H 8V0" "eu5h04" "just wrong"]}))
-
+
(require '[clojure.string :as str])
nil
-
+
(def phone-regex
   (re-pattern
    (str
@@ -391,7 +391,7 @@ 

"(\\d{4})" ; any 4 numbers )))

-
+
(defn- normalize-phone-numbers [col]
   (map (fn [v]
          (let [[match a b c] (re-matches phone-regex v)]
@@ -403,7 +403,7 @@ 

#'chapter-2-input-output.2-2-messy-data/normalize-phone-numbers

-
+
(def postal-code-regex
   (re-pattern
    (str
@@ -419,7 +419,7 @@ 

".*" "(\\d{1})")))

-
+
(defn- normalize-postal-codes [col]
   (map (fn [v]
          (let [[match a b c d e f] (->> v str/upper-case (re-matches postal-code-regex))]
@@ -431,7 +431,7 @@ 

#'chapter-2-input-output.2-2-messy-data/normalize-postal-codes

-
+
(-> misformatted
     (tc/update-columns {:phone normalize-phone-numbers
                         :postal-code normalize-postal-codes}))
@@ -471,19 +471,19 @@

6.3 Missing values

Tablecloth has many built-in helpers for dealing with missing values.

-
+
(require '[tech.v3.datatype.datetime :as dt])
nil
-
+
(def sparse
   (tc/dataset {:A [1 2 3 nil nil 6]
                :B ["test" nil "this" "is" "a" "test"]}))

Drop whole rows with any missing values:

-
+
(tc/drop-missing sparse)

_unnamed [3 2]:

@@ -510,7 +510,7 @@

Drop whole rows with any missing values in a given column:

-
+
(tc/drop-missing sparse :A)

_unnamed [4 2]:

@@ -544,12 +544,12 @@

6.4 Arbitrary values meant to indicate missing (e.g. “NONE”, “N/A”, false, etc.)

-It’s not uncommon to see missing values indicated in multiple different ways, sometimes even within the same dataset. E.g. missing cells might be blank entirely, or they might be populated with some arbitrary value meant to indicate “nothing”, like “NONE”, “N/A”, false, etc.
+It’s not uncommon to see missing values indicated in multiple different ways, sometimes even within the same dataset. E.g. missing cells might be blank entirely, or they might be populated with some arbitrary value meant to indicate “nothing”, like “NONE”, “N/A”, false, etc.
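No code accompanies this paragraph yet. As a sketch of one approach (the sentinel set and function names here are hypothetical), you can normalize the arbitrary "nothing" markers to nil first, so that tablecloth's missing-value helpers shown above can take over:

```clojure
;; Hypothetical set of values this dataset uses to mean "missing".
(def nothing-values #{"NONE" "N/A" "" false})

;; Map any sentinel to nil, leave real values alone.
(defn normalize-missing [v]
  (if (contains? nothing-values v) nil v))

(map normalize-missing ["a" "NONE" false "b" "N/A"])
;; => ("a" nil nil "b" nil)
```

Applied per column (e.g. via tc/update-columns, as in the phone-number example above), this turns all the ad-hoc markers into proper missing values that tc/drop-missing understands.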

diff --git a/chapter_2_input_output/2_3_exporting_data/index.html b/chapter_2_input_output/2_3_exporting_data/index.html
index abc07b2..a1e1d73 100644
--- a/chapter_2_input_output/2_3_exporting_data/index.html
+++ b/chapter_2_input_output/2_3_exporting_data/index.html

(ns chapter-2-input-output.2-3-exporting-data
   {:nextjournal.clerk/toc true}
   (:require
@@ -266,24 +266,24 @@ 

   [tablecloth.api :as tc]
   [scicloj.kind-clerk.api :as kind-clerk]))

-
+
(kind-clerk/setup!)
:ok
-
+
(def consistent-data
   (map-indexed (fn [index _coll] (str "cell-" index))
                (range 10)))
-
+
(def data (take 20 (repeat (zipmap (range 10) consistent-data))))
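To make the CSV-writing examples easier to follow, here is the shape of `data` (the defs above are repeated so this sketch runs on its own, Clojure core only): twenty identical rows, each a map from column index to cell string.

```clojure
(def consistent-data
  (map-indexed (fn [index _coll] (str "cell-" index))
               (range 10)))

;; 20 copies of the same {index -> "cell-index"} map.
(def data (take 20 (repeat (zipmap (range 10) consistent-data))))

(count data)          ;; => 20
(get (first data) 0)  ;; => "cell-0"
(get (first data) 9)  ;; => "cell-9"
```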

7.1 Writing to a CSV file

What to write depends on what the data looks like. For a seq of maps: the headers are not necessarily sorted, so put them in whatever order you want here. Clojure maps make no guarantees about key order, so make sure to order the values consistently, i.e. use the same header row to look up the values from each map.

-
+
(let [headers (-> data first keys sort)
       rows (->> data (map (fn [row]
                             (map (fn [header]
@@ -295,10 +295,10 @@ 

nil

Tablecloth can also export CSVs (among other formats):

-
+
(def tc-dataset (tc/dataset data))
-
+
(tc/write-csv! tc-dataset "data/tc-output.csv")
@@ -307,14 +307,14 @@

7.2 Writing nippy

-
+
(tc/write! tc-dataset "data/tc-nippy.nippy")
nil

This can also be read back with tablecloth:

-
+
(tc/dataset "data/tc-nippy.nippy")

data/tc-nippy.nippy [20 10]:

@@ -591,14 +591,14 @@

7.3 Leave data in Clojure files

-
+
(->> data pr-str (spit "data/clojure-output.edn"))
nil

This can be consumed later with:

-
+
(with-open [reader (io/reader "data/clojure-output.edn")]
   (edn/read (java.io.PushbackReader. reader)))
@@ -808,17 +808,17 @@

7.4 Notebook artifacts

Clerk supports publishing your namespaces as HTML (like this website!). To do that, call:

-
+
(comment
   (clerk/build! {:paths "path/to/files..."
                  :index "book/index.clj"}))
-More information in Clerk’s docs: https://book.clerk.vision/#static-building HTML pages Other formats, options for exporting notebooks? PDFs? Partial artifacts, e.g. export just a graph Writing to a database? +

More information in Clerk’s docs: https://book.clerk.vision/#static-building. TODO: HTML pages; other formats and options for exporting notebooks (PDFs?); partial artifacts, e.g. exporting just a graph; writing to a database?

diff --git a/chapter_3_data_manipulation/3_data_manipulation/index.html b/chapter_3_data_manipulation/3_data_manipulation/index.html index fb89a1f..a10532b 100644 --- a/chapter_3_data_manipulation/3_data_manipulation/index.html +++ b/chapter_3_data_manipulation/3_data_manipulation/index.html @@ -2,7 +2,7 @@ - + @@ -64,7 +64,7 @@ - + @@ -183,14 +183,14 @@ @@ -204,7 +204,7 @@

Table of contents

  • 8.1 Sorting -
      +
      • 8.1.1 Sorting columns
      • 8.1.2 Sorting rows
      • 8.1.3 Custom sorting functions
      • @@ -236,8 +236,7 @@

        8  8  + + +
        (ns chapter-3-data-manipulation.3-data-manipulation
           ;; {:nextjournal.clerk/visibility {:code :hide}
           ;;  :nextjournal.clerk/toc true}
        @@ -272,7 +272,7 @@ 

        8  [fastmath.stats :as stats] [scicloj.kind-clerk.api :as kind-clerk]))

        -
        +
        (kind-clerk/setup!)
        @@ -282,7 +282,7 @@

        8 

        8.1 Sorting

        -
        +
        (def dataset (tc/dataset [{:country "Canada"
                                    :size 10000000}
                                   {:country "USA"
        @@ -293,7 +293,7 @@ 

        8.1.1 Sorting columns

        Give the column headers in the order you want

        -
        +
        (-> dataset
             (tc/reorder-columns [:country :size]))
        @@ -323,7 +323,7 @@

        8.1.2 Sorting rows

        -
        +
        (-> dataset
             (tc/order-by [:size] [:desc]))
        @@ -354,7 +354,7 @@

        8.1.3 Custom sorting functions

        e.g. length of the country name

        -
        +
        (-> dataset
             (tc/order-by (fn [row] (-> row :country count))
                          :desc))
        @@ -386,7 +386,7 @@

        8.2 Selecting one column or multiple columns

        -
        +
        (-> dataset
             (tc/select-columns [:country]))
        @@ -412,8 +412,9 @@

        8.3 Randomizing order

        -
        -
        (-> dataset tc/shuffle)
        +
        +
        (-> dataset
        +    tc/shuffle)

        _unnamed [3 2]:

        @@ -441,8 +442,9 @@

        8.4 Repeatable randomisation

        -
        -
        (-> dataset (tc/shuffle {:seed 100}))
        +
        +
        (-> dataset
        +    (tc/shuffle {:seed 100}))

        _unnamed [3 2]:

        @@ -468,7 +470,7 @@

        Finding unique rows

        -
        +
        (def dupes (tc/dataset [{:country "Canada"
                                  :size 10000000}
                                 {:country "Canada"
        @@ -481,8 +483,9 @@ 

        :size 80000}]))

        (def usa-spellings #{"USA" "United States" "United states of America"}) https://scicloj.github.io/tablecloth/index.html#Unique

        -
        -
        (-> dupes tc/unique-by)
        +
        +
        (-> dupes
        +    tc/unique-by)

        _unnamed [5 2]:

        @@ -515,8 +518,9 @@

        -
        -
        (-> dupes (tc/unique-by :size))
        +
        +
        (-> dupes
        +    (tc/unique-by :size))

        _unnamed [4 2]:

        @@ -545,8 +549,9 @@

        -
        -
        (-> dupes (tc/unique-by :country))
        +
        +
        (-> dupes
        +    (tc/unique-by :country))

        _unnamed [4 2]:

        @@ -575,8 +580,9 @@

        -
        -
        (-> dupes (tc/unique-by #(-> % :country str/lower-case)))
        +
        +
        (-> dupes
        +    (tc/unique-by #(-> % :country str/lower-case)))

        _unnamed [3 2]:

        @@ -601,11 +607,13 @@

        -
        -
        (-> dupes (tc/unique-by #(-> % :country str/lower-case) {:strategy (fn [vals]
        -                                                                     (case (tdsc/column-name vals)
        -                                                                       :size (apply max vals)
        -                                                                       :country (last vals)))}))
        +
        +
        (-> dupes
        +    (tc/unique-by #(-> % :country str/lower-case)
        +                  {:strategy (fn [vals]
        +                               (case (tdsc/column-name vals)
        +                                 :size (apply max vals)
        +                                 :country (last vals)))}))

        _unnamed [3 2]:

        @@ -631,7 +639,7 @@

        We could use this to rename values to a canonical one (e.g. convert everything that matches the set of USA spellings to “USA”). Other common operations: adding computed columns to data; “lengthening” or “widening” data to make it “tidy”; e.g. converting a column of numbers to a category (>5 “yes”, <5 “no”), or summing multiple columns into a new one.
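Both of those ideas can be sketched with `tc/map-columns`; the alias set, column names, and threshold below are made up for illustration:

```clojure
(require '[tablecloth.api :as tc])

;; Hypothetical aliases we want to collapse to one canonical value
(def usa-spellings #{"USA" "United States" "United states of America"})

(-> (tc/dataset [{:country "United States" :size 9000000}
                 {:country "Canada" :size 10000000}
                 {:country "Germany" :size 80000}])
    ;; canonicalize country names against the alias set
    (tc/map-columns :country [:country]
                    (fn [c] (if (usa-spellings c) "USA" c)))
    ;; derive a yes/no category from a numeric column
    (tc/map-columns :big? [:size]
                    (fn [s] (if (> s 1000000) "yes" "no"))))
```

`tc/map-columns` takes the target column name, the source column(s), and a function of those source values, so it covers both single-column transforms and combining several columns into a new one.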

        -
        +
        (-> dataset
             (tc/add-column :area [9000000 8000000 1000000]))
        @@ -662,7 +670,7 @@

        -
        +
        (-> dataset
             (tc/add-column :population [40000000 100000000 80000000])
             (tc/rename-columns {:size :area})
        @@ -684,25 +692,25 @@ 

        Canada 10000000 -4.0E+07 +4.0e07 4.00000000 USA 9000000 -1.0E+08 +1.0e08 11.11111111 Germany 80000 -8.0E+07 +8.0e07 1000.00000000

        versus the following, which is probably preferable:

        -
        +
        (-> dataset
             (tc/add-column :population [40000000 100000000 80000000])
             (tc/rename-columns {:size :area})
        @@ -743,7 +751,7 @@ 

      • Removing columns
      -
      +
      (-> dataset
           (tc/drop-columns :size))
      @@ -776,7 +784,7 @@

      Filtering rows
    • Single filter, multiple filters
    -
    +
    (-> dataset
         (tc/select-rows (fn [row]
                           (< 1000000 (:size row)))))
    @@ -803,10 +811,10 @@

  • Aggregating rows (counts, groups)
-
+
(def co2-over-time (tc/dataset "data/co2_over_time.csv"))
-
+
(-> co2-over-time
     (tc/aggregate {:average-co2 (fn [ds]
                                   (/ (reduce + (get ds "CO2"))
@@ -826,7 +834,7 @@ 

Add a column for year

-
+
(-> co2-over-time
     (tc/map-columns "Year" "Date" (memfn getYear)))
@@ -976,7 +984,7 @@

Group by year

-
+
(-> co2-over-time
     (tc/group-by (fn [row]
                    (.getYear (get row "Date")))))
@@ -1104,14 +1112,14 @@

Get the average CO2 level per year. Tablecloth applies the aggregate fn to every group’s dataset:

-
+
(defn round2
   "Round a double to the given precision (number of decimal places)"
   [precision d]
   (let [factor (Math/pow 10 precision)]
     (/ (Math/round (* d factor)) factor)))
-
+
(-> co2-over-time
     (tc/group-by (fn [row]
                    (.getYear (get row "Date"))))
@@ -1220,7 +1228,7 @@ 

Can rename the column to be more descriptive

-
+
(-> co2-over-time
     (tc/group-by (fn [row]
                    (.getYear (get row "Date"))))
@@ -1329,18 +1337,18 @@ 

Concatenating datasets

-
+
(def ds1 (tc/dataset [{:id "id1" :b "val1"}
                       {:id "id2" :b "val2"}
                       {:id "id3" :b "val3"}]))
-
+
(def ds2 (tc/dataset [{:id "id1" :b "val4"}
                       {:id "id5" :b "val5"}
                       {:id "id6" :b "val6"}]))

Naively concats rows

-
+
(tc/concat ds1 ds2 (tc/dataset [{:id "id3" :b "other value"}]))

_unnamed [7 2]:

@@ -1382,7 +1390,7 @@

-
+
(tc/concat ds1 (tc/dataset [{:b "val4" :c "text"}
                             {:b "val5" :c "hi"}
                             {:b "val6" :c "test"}]))
@@ -1430,7 +1438,7 @@

De-duping

-
+
(tc/union ds1 ds2)

union [6 2]:

@@ -1472,16 +1480,16 @@

Merging datasets
  • When column headers are the same or different, on multiple columns TODO explain set logic and SQL joins
  • -
    +
    (def ds3 (tc/dataset {:id [1 2 3 4]
                           :b ["val1" "val2" "val3" "val4"]}))
    -
    +
    (def ds4 (tc/dataset {:id [1 2 3 4]
                           :c ["val1" "val2" "val3" "val4"]}))

    Keep all columns

    -
    +
    (tc/full-join ds3 ds4 :id)

    full-join [4 4]:

    @@ -1522,7 +1530,7 @@

    “Merge” datasets on a given column where rows have a value

    -
    +
    (tc/inner-join ds3 ds4 :id)

    inner-join [4 3]:

    @@ -1558,7 +1566,7 @@

    Drop rows missing a value

    -
    +
    (tc/inner-join (tc/dataset {:id [1 2 3 4]
                           :b ["val1" "val2" "val3"]})
                    (tc/dataset {:id [1 2 3 4]
    @@ -1597,7 +1605,7 @@ 

    -
    +
    (tc/right-join (tc/dataset {:id [1 2 3 ]
                                 :b ["val1" "val2" "val3"]})
                    (tc/dataset {:id [1 2 3 4]
    @@ -1642,7 +1650,7 @@ 

    scratch

    -
    +
    (tc/left-join (tc/dataset {:email ["asdf"]
                                 :name ["asdfads"]
                                 :entry-id [1 2 3]})
    @@ -1698,7 +1706,7 @@ 

    -
    +
    (tc/dataset {:email ["asdf"]
                  :name ["asdfads"]
                  :entry-id [1 2 3]})
    @@ -1730,7 +1738,7 @@

    -
    +
    (tc/dataset {:entry-id [1 2 3]
                  :upload-count [2 3 4]
                  :category ["art" "science"]})
    @@ -1763,7 +1771,7 @@

    See tablecloth’s join documentation. An inner join only keeps rows with the specified column value in common:

    -
    +
    (tc/inner-join ds1 ds2 :id)

    inner-join [1 3]:

    @@ -1787,7 +1795,7 @@

    Converting between wide and long formats? Signal processing/time series analysis
  • Compute rolling average to be able to plot a trend line
  • -
    +
    (def exp-moving-avg
       (let [data (get co2-over-time "adjusted CO2")
             moving-avg
    @@ -1801,7 +1809,7 @@ 
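The `exp-moving-avg` definition above is truncated by the diff. The idea of an exponential moving average can be sketched on a plain seq like this (the name `exp-moving-avg*` and the smoothing factor `alpha` are assumptions for illustration, not taken from the book):

```clojure
;; Exponential moving average: each output is a weighted blend of the
;; newest observation and the previous average.
(defn exp-moving-avg* [alpha xs]
  (rest (reductions (fn [avg x]
                      (+ (* alpha x) (* (- 1 alpha) avg)))
                    (first xs)
                    xs)))

(exp-moving-avg* 0.5 [1 2 3 4])
;; => (1.0 1.5 2.25 3.125)
```

Higher `alpha` weights recent values more heavily; the result has the same length as the input, so it can be attached to the dataset as a new column.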

  • widen the dataset to include the new column, which is already in the right order
  • -
    +
    (tc/append co2-over-time exp-moving-avg)

    data/co2_over_time.csv [741 4]:

    @@ -1952,7 +1960,7 @@

  • Rolling average over a 12 point range
  • -
    +
    (def rolling-average
       (tc/dataset [["Rolling average"
                     (-> co2-over-time
    @@ -1961,7 +1969,7 @@ 

    fun/mean {:relative-window-position :left}))]]))

    -
    +
    (tc/append co2-over-time rolling-average)

    data/co2_over_time.csv [741 4]:

    @@ -2112,7 +2120,7 @@

  • Train a model to predict the next 10 years
  • -
    +
    (-> co2-over-time
         )
    @@ -2242,7 +2250,7 @@

    Summarizing data (mean, standard deviation, confidence intervals etc.)
  • Standard deviation using fastmath
  • -
    +
    (def avg-co2-by-year
       (-> co2-over-time
           (tc/group-by (fn [row]
    @@ -2260,7 +2268,7 @@ 

  • Overall average
  • -
    +
    (stats/mean (:average-co2 avg-co2-by-year))
    @@ -2269,7 +2277,7 @@

  • Long term average 1991-2020
  • -
    +
    (-> avg-co2-by-year
         ;; (tc/select-rows (fn [row] (< 1990 (:year row))))
         ;; :average-co2
    @@ -2406,12 +2414,12 @@ 

    Run length encoding?
  • Filling nils with the last non-nil value?
  • -
    +
    (def sparse-dataset
       (tc/dataset {:a [nil 2 3 4 nil nil 7 8]
                    :b [10 11 12 nil nil nil 16 nil]}))
    -
    +
    (-> sparse-dataset
         (tc/replace-missing :up))
    @@ -2458,7 +2466,7 @@

    -
    +
    (-> sparse-dataset
         (tc/replace-missing :updown))
    @@ -2505,7 +2513,7 @@

    -
    +
    (-> sparse-dataset
         (tc/replace-missing :down))
    @@ -2552,7 +2560,7 @@

    -
    +
    (-> sparse-dataset
         (tc/replace-missing :downup))
    @@ -2599,7 +2607,7 @@

    -
    +
    (-> sparse-dataset
         (tc/replace-missing :lerp))
    @@ -2646,7 +2654,7 @@

    -
    +
    (-> sparse-dataset
         (tc/replace-missing :all :value 100))
    @@ -2693,7 +2701,7 @@

    -
    +
    (-> sparse-dataset
         (tc/replace-missing :a :value 100))
    @@ -2744,7 +2752,7 @@

    @@ -2991,8 +2999,8 @@

    diff --git a/chapter_4_data_visualisation/4_2_graphs/index.html b/chapter_4_data_visualisation/4_2_graphs/index.html index 07805f3..06e91fd 100644 --- a/chapter_4_data_visualisation/4_2_graphs/index.html +++ b/chapter_4_data_visualisation/4_2_graphs/index.html @@ -2,12 +2,12 @@ - + -Clojure Data Cookbook - 9  Graphs +Clojure Data Cookbook - 10  Graphs - + + -
    +
    (ns chapter-4-data-visualisation.4-2-graphs
       (:require [tablecloth.api :as tc]
                 [aerial.hanami.common :as hc]
    @@ -265,16 +264,16 @@ 

    9  [tablecloth.api :as tc] [scicloj.kind-clerk.api :as kind-clerk]))

    -
    +
    (kind-clerk/setup!)
    :ok
    -
    +
    (def co2-over-time (tc/dataset "data/co2_over_time.csv"))
    -
    +
    (-> co2-over-time
         (vis/hanami-plot ht/line-chart {:X "Date"
                                         :XTYPE "temporal"
    @@ -283,15 +282,12 @@ 

    9  :YSCALE {:zero false}}))

    -
    -vega
    -
    -
    +
    (def diamonds datasets/diamonds)
    -
    +
    (-> diamonds
         (vis/hanami-plot vht/boxplot-chart {:X :cut
                                             :XTYPE "nominal"
    @@ -299,13 +295,10 @@ 

    9  :WIDTH 750}))

    -
    -vega
    -
    -
    +
    (-> diamonds
         (vis/hanami-plot vht/boxplot-chart {:X :color
                                             :XTYPE "nominal"
    @@ -313,13 +306,10 @@ 

    9  :WIDTH 750}))

    -
    -vega
    -
    -
    +
    (-> diamonds
         (vis/hanami-plot vht/boxplot-chart {:X :clarity
                                             :XTYPE "nominal"
    @@ -327,13 +317,10 @@ 

    9  :WIDTH 750}))

    -
    -vega
    -
    -
    +
    :ok
    @@ -343,7 +330,7 @@

    9  book/chapter_4_data_visualisation/4_2_graphs.clj +
    source: book/chapter_4_data_visualisation/4_2_graphs.clj

    @@ -584,14 +571,11 @@

    9 

    diff --git a/chapter_4_data_visualisation/noj_examples/index.html b/chapter_4_data_visualisation/noj_examples/index.html index 976c4d0..692688a 100644 --- a/chapter_4_data_visualisation/noj_examples/index.html +++ b/chapter_4_data_visualisation/noj_examples/index.html @@ -2,12 +2,12 @@ - + -Clojure Data Cookbook - 10  Graphs with Noj +Clojure Data Cookbook - 9  Graphs with Noj - + + -
    -

    10.1 Bar graphs

    -
    +
    +

    9.1 Bar graphs

    +
    (ns chapter-4-data-visualisation.noj-examples
       (:require [tablecloth.api :as tc]
                 [aerial.hanami.common :as hc]
    @@ -283,45 +284,37 @@ 

    [clojure2d.color :as color] [scicloj.kind-clerk.api :as kind-clerk]))

    -
    +
    (kind-clerk/setup!)
    :ok
    -
    -

    10.2 Raw html

    -
    +
    +

    9.2 Raw html

    +
    (-> "<p>Hello, <i>Noj</i>.</p>"
    -    vis/raw-html)
    -
    -
    -
    - -
    -
    -
    -
    (-> [:svg {:height 210
    -           :width 500}
    -     [:line {:x1 0
    -             :y1 0
    -             :x2 200
    -             :y2 200
    -             :style "stroke:rgb(255,0,0);stroke-width:2"}]]
    -    hiccup/html
    -    vis/raw-html)
    -
    -
    -
    - -
    -
    + kind/html)

    +
    +

    +Hello, Noj. +

    +
    +
    (kind/html
    + "
    +<svg height=100 width=100>
    +<circle cx=50 cy=50 r=40 stroke='purple' stroke-width=3 fill='floralwhite' />
    +</svg> ")
    +
    + + +

    -
    -

    10.3 Visualizing datasets with Hanami

    +
    +

    9.3 Visualizing datasets with Hanami

    Noj offers a few convenience functions to make Hanami plotting work smoothly with Tablecloth and Kindly.

    -
    +
    (def random-walk
       (let [n 20]
         (-> {:x (range n)
    @@ -329,22 +322,19 @@ 

    (reductions +))} tc/dataset)))

    -
    -

    10.3.1 A simple plot

    +
    +

    9.3.1 A simple plot

    We can plot a Tablecloth dataset using a Hanami template:

    -
    +
    (-> random-walk
         (vis/hanami-plot ht/point-chart
                          {:MSIZE 200}))
    -
    -vega - -
    +

    Let us look inside the resulting vega-lite spec. We can see the dataset is included as CSV:

    -
    +
    (-> random-walk
         (vis/hanami-plot ht/point-chart
                          {:MSIZE 200})
    @@ -360,14 +350,14 @@ 

    :height 300, :data {:values - "x,y\n0,0.2696595674516514\n1,0.5994221672898448\n2,0.9041662987177651\n3,1.1641703504999699\n4,1.606396428799537\n5,1.3972382302814177\n6,1.7686488303622263\n7,1.8812856284088362\n8,2.1521859934642023\n9,1.761413935660772\n10,1.5350565538499519\n11,1.4760599735629056\n12,1.2326873858637482\n13,1.2742130826088063\n14,0.9937616484523007\n15,1.4130287588308725\n16,1.16480354577581\n17,0.6889384877674767\n18,0.821314858587385\n19,0.7473480777397288\n", + "x,y\n0,0.25915143611932323\n1,0.07679044186868467\n2,-0.16838373926426764\n3,-0.3472917379109737\n4,-0.4185674782284593\n5,-0.3275712090765166\n6,0.06499031613330208\n7,-0.12473464521100663\n8,0.24581959605889236\n9,0.3872343668945971\n10,0.20630731645770806\n11,0.4283007097190942\n12,0.8577253018355132\n13,1.029799282228336\n14,1.500296189747702\n15,1.802090709990422\n16,1.675173594897049\n17,1.5406670970402527\n18,1.5912246361060238\n19,1.7546356050436023\n", :format {:type "csv"}}}

    -
    -

    10.3.2 Additional Hanami templates

    +
    +

    9.3.2 Additional Hanami templates

    The scicloj.noj.v1.vis.hanami.templates namespace adds Hanami templates to Hanami’s own collection.

    -
    +
    (-> datasets/mtcars
         (vis/hanami-plot vht/boxplot-chart
                          {:X :gear
    @@ -375,15 +365,12 @@ 

    :Y :mpg}))

    -
    -vega
    -
    -
    -

    10.3.3 Layers

    -
    +
    +

    9.3.3 Layers

    +
    (-> random-walk
         (vis/hanami-layers
          {:TITLE "points and a line"}
    @@ -396,15 +383,12 @@ 

    :MCOLOR "brown"})]))

    -
    -vega - -
    +
    -
    -

    10.3.4 Concatenation

    -
    +
    +

    9.3.4 Concatenation

    +
    (-> random-walk
         (vis/hanami-vconcat
          {}
    @@ -421,12 +405,9 @@ 

    :WIDTH 100})]))

    -
    -vega - -
    +
    -
    +
    (-> random-walk
         (vis/hanami-hconcat
          {}
    @@ -443,15 +424,12 @@ 

    :WIDTH 100})]))

    -
    -vega - -
    +
    -
    -

    10.3.5 Linear regression

    -
    +
    +

    9.3.5 Linear regression

    +
    (-> datasets/mtcars
         (stats/add-predictions :mpg [:wt]
                                {:model-type :smile.regression/ordinary-least-square})
    @@ -472,30 +450,24 @@ 

    :YTITLE :mpg})]))

    -
    -vega - -
    +
    -
    -

    10.3.6 Histogram

    -
    +
    +

    9.3.6 Histogram

    +
    (-> datasets/iris
         (vis/hanami-histogram :sepal-width
                               {:nbins 10}))
    -
    -vega
    -
    -
    -

    10.3.7 Combining a few things together

    +
    +

    9.3.7 Combining a few things together

    The following is inspired by the example at Plotnine’s main page. Note how we add regression lines here. We take care of layout and colouring on our side, not using Vega-Lite for that.

    -
    +
    (let [pallete (->> :accent
                        color/palette
                        (mapv color/format-hex))]
    @@ -528,13 +500,10 @@ 

    (vis/hanami-vconcat nil {}))))

    -
    -vega - -
    +

    A similar example with histograms:

    -
    +
    (let [pallete (->> :accent
                        color/palette
                        (mapv color/format-hex))]
    @@ -549,13 +518,10 @@ 

    (vis/hanami-vconcat nil {}))))

    -
    -vega
    -

    Scatterplots and regression lines again, this time using Vega-Lite for layout and coloring (using its “facet” option).

    -
    +
    (-> datasets/mtcars
         (tc/group-by [:gear])
         (stats/add-predictions :mpg [:wt]
    @@ -585,12 +551,9 @@ 

    kind/vega-lite)

    -
    -vega - -
    +
    -
    +
    :bye
    @@ -600,7 +563,7 @@

    book/chapter_4_data_visualisation/noj_examples.clj +
    source: book/chapter_4_data_visualisation/noj_examples.clj

    @@ -843,11 +806,14 @@

    diff --git a/index.html b/index.html index a1c59f2..529baf1 100644 --- a/index.html +++ b/index.html @@ -2,7 +2,7 @@ - + @@ -182,14 +182,14 @@ @@ -203,7 +203,7 @@

    Table of contents

    - - -
    + + +
    (ns index
       {:nextjournal.clerk/visibility {:code :hide}}
       (:require
    @@ -268,8 +268,6 @@ 

    1 Preface

    Welcome to the Clojure Data Cookbook! This is the website for the work in progress that will become the Clojure Data Cookbook. The goal is to provide a reference for anyone who has data to work with and an interest in working with it in Clojure, documenting the current community recommendations and default stack for data science in Clojure.

    1.1 Note! All work here is in progress, subject to change, very messy, and partially done. Please bear with me as I work through this project :D

    -
    -

    Contents @@ -321,17 +319,24 @@

    Chapter_4_data_visualisation/noj_examples

    -
    +
    +

    +dev +

    +
  • +Dev +
  • +