Commit

Merge pull request #50 from CLMBRs/ib-naming-model
information bottleneck
shanest authored Dec 23, 2024
2 parents 08a70ec + 1613b1a commit e2e3b8f
Showing 94 changed files with 30,757 additions and 6,038 deletions.
2 changes: 1 addition & 1 deletion .gitignore
@@ -2,7 +2,7 @@
.vscode/
.DS_Store
src/altk.egg-info

**/*.pkl

# Distribution/build
dist/
5 changes: 2 additions & 3 deletions README.md
@@ -29,7 +29,7 @@ First, set up a virtual environment (e.g. via [miniconda](https://docs.conda.io/

## Getting started

- Check out the [examples](https://github.com/CLMBRs/ultk/tree/main/src/examples), starting with a basic signaling game. The examples folder also contains a simple efficient communication analysis of [indefinites](https://github.com/CLMBRs/ultk/tree/main/src/examples/indefinites).
- Check out the [examples](https://github.com/CLMBRs/ultk/tree/main/src/examples), starting with a simple efficient communication analysis of [indefinites](https://github.com/CLMBRs/ultk/tree/main/src/examples/indefinites) and a comparison of two approaches to efficient communication, with modals as a test case.
- To see more scaled up usage examples, visit the codebase for an efficient communication analysis of [modals](https://github.com/nathimel/modals-effcomm) or [sim-max games](https://github.com/nathimel/rdsg).
- For an introduction to efficient communication research, here is a [survey paper](https://www.annualreviews.org/doi/abs/10.1146/annurev-linguistics-011817-045406) of the field.
- For an introduction to the RSA framework, see [this online textbook](http://www.problang.org/).
@@ -51,7 +51,7 @@ Unit tests are written in [pytest](https://docs.pytest.org/en/7.3.x/) and execut
<details>
<summary>Figures:</summary>

> Kemp, C. & Regier, T. (2012) Kinship Categories Across Languages Reflect General Communicative Principles. Science. https://www.science.org/doi/10.1126/science.1218811
> Kemp, C. & Regier, T. (2012). Kinship Categories Across Languages Reflect General Communicative Principles. Science. https://www.science.org/doi/10.1126/science.1218811
> Zaslavsky, N., Kemp, C., Regier, T., & Tishby, N. (2018). Efficient compression in color naming and its evolution. Proceedings of the National Academy of Sciences, 115(31), 7937–7942. https://doi.org/10.1073/pnas.1800521115
@@ -64,7 +64,6 @@ Unit tests are written in [pytest](https://docs.pytest.org/en/7.3.x/) and execut
<details>
<summary>Links:</summary>

> Imel, N. (2023). The evolution of efficient compression in signaling games. PsyArXiv. https://doi.org/10.31234/osf.io/b62de

> Imel, N., & Steinert-Threlkeld, S. (2022). Modal semantic universals optimize the simplicity/informativeness trade-off. Semantics and Linguistic Theory, 1(0), Article 0. https://doi.org/10.3765/salt.v1i0.5346
2 changes: 1 addition & 1 deletion docs/search.js

Large diffs are not rendered by default.

17 changes: 7 additions & 10 deletions docs/ultk.html

Large diffs are not rendered by default.

44 changes: 20 additions & 24 deletions docs/ultk/effcomm.html

Large diffs are not rendered by default.

1,702 changes: 816 additions & 886 deletions docs/ultk/effcomm/agent.html

Large diffs are not rendered by default.

6 changes: 3 additions & 3 deletions docs/ultk/effcomm/analysis.html

Large diffs are not rendered by default.

286 changes: 286 additions & 0 deletions docs/ultk/effcomm/information_bottleneck.html

Large diffs are not rendered by default.

1,305 changes: 1,305 additions & 0 deletions docs/ultk/effcomm/information_bottleneck/ba.html

Large diffs are not rendered by default.

1,174 changes: 1,174 additions & 0 deletions docs/ultk/effcomm/information_bottleneck/ib.html

Large diffs are not rendered by default.

1,928 changes: 1,928 additions & 0 deletions docs/ultk/effcomm/information_bottleneck/modeling.html

Large diffs are not rendered by default.

729 changes: 729 additions & 0 deletions docs/ultk/effcomm/information_bottleneck/tools.html

Large diffs are not rendered by default.

399 changes: 221 additions & 178 deletions docs/ultk/effcomm/informativity.html

Large diffs are not rendered by default.

102 changes: 54 additions & 48 deletions docs/ultk/effcomm/optimization.html

Large diffs are not rendered by default.

532 changes: 532 additions & 0 deletions docs/ultk/effcomm/probability.html

Large diffs are not rendered by default.

255 changes: 133 additions & 122 deletions docs/ultk/effcomm/sampling.html

Large diffs are not rendered by default.

36 changes: 18 additions & 18 deletions docs/ultk/effcomm/tradeoff.html

Large diffs are not rendered by default.

6 changes: 3 additions & 3 deletions docs/ultk/language.html

Large diffs are not rendered by default.

2,970 changes: 1,677 additions & 1,293 deletions docs/ultk/language/grammar.html

Large diffs are not rendered by default.

735 changes: 340 additions & 395 deletions docs/ultk/language/language.html

Large diffs are not rendered by default.

1,498 changes: 753 additions & 745 deletions docs/ultk/language/sampling.html

Large diffs are not rendered by default.

920 changes: 431 additions & 489 deletions docs/ultk/language/semantics.html

Large diffs are not rendered by default.

280 changes: 280 additions & 0 deletions docs/ultk/util.html

Large diffs are not rendered by default.

500 changes: 500 additions & 0 deletions docs/ultk/util/frozendict.html

Large diffs are not rendered by default.

519 changes: 519 additions & 0 deletions docs/ultk/util/io.html

Large diffs are not rendered by default.

1 change: 0 additions & 1 deletion pyproject.toml
@@ -28,7 +28,6 @@ dependencies = [
"plotnine",
"pathos",
"pytest",
"rdot",
]

[project.urls]
1 change: 0 additions & 1 deletion src/examples/__init__.py
@@ -1,4 +1,3 @@
"""Minimal examples demonstrating how to use ULTK.
See `examples.signaling_game`.
"""
56 changes: 56 additions & 0 deletions src/examples/colors/README.md
@@ -0,0 +1,56 @@
# Analyzing the Relationship between Complexity and Informativity across the World's Languages

Based on [Zaslavsky, Kemp, et al.'s paper on color complexity](https://www.pnas.org/doi/full/10.1073/pnas.1800521115) and [the corresponding original repo](https://github.com/nogazs/ib-color-naming).

This example creates a "conceptual" / miniature replication of the above paper using the tools provided by the ULTK library. Right now, the final analysis produces the following plot:
![a plot showing communicative cost and complexity of natural, explored, and dominant languages](https://github.com/CLMBRs/altk/blob/main/src/examples/colors/outputs/plot.png?raw=true)

This README first explains the contents of this example directory, focusing on what the user must provide for the color case study, and then walks through the concrete steps that produce the plot above. It closes with a discussion of what is still missing relative to the paper and other next steps for this example.

## Contents
`data` consists of language and color data provided by the [World Color Survey](https://linguistics.berkeley.edu/wcs/data.html). Certain files have been lightly edited to simplify parsing, for example by adding a header row.

`outputs` contains outputs of various scripts, as outlined below.


`lang_colors` consists of per-language color distributions. Major color terms are graphed per language.

`analyze_data.py` contains functions for graphing the distribution of color terms across language expressions and languages themselves.

`color_grammar.py` contains class definitions for the ColorLanguage and other utility structures.

`generate_wcs_languages.py` contains the function for reading and converting the WCS data to ULTK language structures. It also generates

`complexity.py` calculates the complexity and informativity of the various WCS color languages, passed in as a pandas DataFrame.

`graph_colors.py` contains functions for graphing the distribution of color terms across language expressions and languages themselves.

`util.py` contains utility functions, including the argument parser for running this tool from shell.

## Usage

From the `ultk/examples` base directory:
1. Run `python -m colors.scripts.read_color_universe`: this generates the color universe (the 330 Munsell chips) to be re-used throughout. It does very light processing of the WCS data to generate a CSV file that can be easily read by ULTK.
- Consumes: `data/cnum-vhcm-lab-new.txt`
- Produces: `outputs/color_universe.csv`
2. Run `python -m colors.scripts.read_natural_languages`: this reads the natural language WCS data and produces ULTK `Language` objects. (NOTE: still a work-in-progress)
- Consumes: `data/data/term.txt`, `outputs/color_universe.csv`
- Produces: `outputs/natural_languages.yaml`
3. Run `python -m colors.scripts.measure_natural_languages`: this reads the ULTK natural languages and calculates the complexity and informativity of each language.
- Consumes: `outputs/natural_languages.yaml`
- Produces: `outputs/natural_language_information_plane.csv`
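
The three steps above can be chained from a small driver script. This is a minimal sketch, assuming the module paths listed in the steps and that it is run from the examples base directory; `build_commands` and `run_pipeline` are hypothetical helpers and are not part of ULTK.

```python
import subprocess
import sys

# Module names copied from the usage steps above, in the order they must run.
PIPELINE_MODULES = [
    "colors.scripts.read_color_universe",
    "colors.scripts.read_natural_languages",
    "colors.scripts.measure_natural_languages",
]


def build_commands(python=sys.executable):
    """Return one `python -m <module>` command per pipeline step."""
    return [[python, "-m", module] for module in PIPELINE_MODULES]


def run_pipeline():
    """Run each step in order; stop at the first step that fails."""
    for command in build_commands():
        print("Running:", " ".join(command))
        # check=True raises CalledProcessError if the step exits non-zero,
        # so later steps never consume a missing or stale output file.
        subprocess.run(command, check=True)
```

Calling `run_pipeline()` runs the steps sequentially. Stopping at the first failure matters here because each step consumes the file produced by the previous one.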


NOTE: below this is
Run `python analyze_data.py` from the `colors` folder. This calls `generate_wcs_languages` to generate the language data, then `complexity.py` to generate the complexity, then Several options are available as command-line settings.:


## Remaining Tasks

At the moment, the density of the probability function per major color term is not factored into the final graphs generated.

Additionally, when probabilities are taken into account by assigning them to the weight matrix, the computed mutual information is a large negative value, which should be impossible given that the prior is entirely uniform.
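
As a sanity check on the negative value described above: mutual information computed from a valid joint distribution is never negative, so one possible cause is that the inputs do not form a normalized joint distribution. The following is a minimal pure-Python sketch of that check; `mutual_information` is a hypothetical helper written for illustration, not the ULTK implementation.

```python
import math


def mutual_information(joint):
    """I(X;Y) in bits for a joint distribution given as a list of rows."""
    total = sum(sum(row) for row in joint)
    # Reject unnormalized input up front rather than return a meaningless value.
    if not math.isclose(total, 1.0, rel_tol=0, abs_tol=1e-9):
        raise ValueError(f"joint distribution sums to {total}, expected 1")
    p_x = [sum(row) for row in joint]        # marginal over rows
    p_y = [sum(col) for col in zip(*joint)]  # marginal over columns
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p_xy in enumerate(row):
            if p_xy > 0:  # zero cells contribute nothing
                mi += p_xy * math.log2(p_xy / (p_x[i] * p_y[j]))
    return mi


independent = [[0.25, 0.25], [0.25, 0.25]]    # X and Y share no information
deterministic = [[0.5, 0.0], [0.0, 0.5]]      # Y is fully determined by X
print(mutual_information(independent))
print(mutual_information(deterministic))
```

The explicit normalization check is the design choice of interest: it turns a silently wrong result into an immediate error.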

At the moment


File renamed without changes.
14 changes: 7 additions & 7 deletions src/examples/colors/data/lang.txt
@@ -19,15 +19,15 @@ LNUM LNAME LGEO LFW
18 Ucayali Campa Peru Allene Heitzman Jason D. Patent * Campa_DAT_new.txt new
19 Camsa * * * * Camsa_DAT_new.txt new
20 Candoshi * * * * Candoshi_DAT_new.txt new
21 Cavine{\x96}a * * * * Cavinena_DAT_new.txt new
21 Cavinena * * * * Cavinena_DAT_new.txt new
22 Cayapa Ecuador Neil Wiebe Scott Merrifield William R. Merrifield Cayapa_DAT_new.txt new
23 Ch{\x87}cobo * * * * Chacobo_DAT_new.txt new
23 Chacobo * * * * Chacobo_DAT_new.txt new
24 Chavacano * * * * Chavacano_DAT_new.txt new
25 Chayahuita * * * * Chayahuita_DAT_new.txt new
26 Chinanteco Mexico Al & Jeff Anderson Jason D. Patent * Chinantec_DAT_new.txt new
27 Chiquitano Bolivia M. Kr{\x9F}si, L. Rodriguez, E. Lyn (?) Jason Patent * Chiquitano_DAT_new.txt new
28 Chumburu * Hansford Scott Merrifield William R. Merrifield Chumburu_DAT_new.txt new
29 Cof{\x87}n * * * * Cofan_DAT_new.txt new
29 Cofan * * * * Cofan_DAT_new.txt new
30 Colorado * * * * Colorado_DAT_new.txt new
31 Eastern Cree Canada Lieselotte Bartlett Scott Merrifield William R. Merrifield Cree_DAT_new.txt new
32 Culina Peru, Brazil P. Adams and T. Fern{\x87}ndez Jason Patent * Culina_DAT_new.txt new
@@ -40,7 +40,7 @@ LNUM LNAME LGEO LFW
39 Guahibo Colombia Riena Kondo Kenneth J. Merrifield William R. Merrifield Guahibo_DAT_new.txt new
40 Guambiano * * * * Guambiano_DAT_new.txt new
41 Guarijio Mexico Ron and Sharon Stoltzfus Kenneth J. Merrifield William R. Merrifield Guarijio_DAT_new.txt new
42 Ng{\x8A}bere Panama Arosemena Patent, Jason * Guaymi_DAT_new.txt new
42 Ngbere Panama Arosemena Patent, Jason * Guaymi_DAT_new.txt new
43 Gunu Cameroon D. Heath Ken Merrifield Ken Merrifield Gunu_DAT_new.txt new
44 Halbi India F. Woods and P. Hopple Jason Patent * Halbi_DAT_new.txt new
45 Huasteco * * * * Huastec_DAT_new.txt new
@@ -72,11 +72,11 @@ LNUM LNAME LGEO LFW
71 Mikasuki U S A David West Scott Merrifield Scott Merrifield Mikasuki_DAT_new.txt new
72 Mixteco * * * * Mixtec_DAT_new.txt new
73 Mundu * * * * Mundu_DAT_new.txt new
74 M{\x9C}ra Pirah{\x8B} * * * * Mura-Piraha_DAT_new.txt new
74 Mura Piraha * * * * Mura-Piraha_DAT_new.txt new
75 Murle * * * * Murle_DAT_new.txt new
76 Murinbata * * * * Murrinh-Patha_DAT_new.txt new
77 Nafaanra * * * * Nafaanra_DAT_new.txt new
78 N{\x87}huatl * * * * Nahuatl_DAT_new.txt new
78 Nahuatl * * * * Nahuatl_DAT_new.txt new
79 Ocaina * * * * Ocaina_DAT_new.txt new
80 Papago * * * * Oodham_DAT_new.txt new
81 Patep * * * * Patep_DAT_new.txt new
@@ -85,7 +85,7 @@ LNUM LNAME LGEO LFW
84 Saramaccan * * * * Saramaccan_DAT_new.txt new
85 Seri * * * * Seri_DAT_new.txt new
86 Shipibo Peru Guillermo Ramirez Ken Merrifield Ken Merrifield Shipibo_DAT_new.txt new
87 Sirion{\x97} * * * * Siriono_DAT_new.txt new
87 Siriono * * * * Siriono_DAT_new.txt new
88 Slave Canada Monus Jason D. Patent * Slave_DAT_new.txt new
89 Sursurunga * * * * Sursurunga_DAT_new.txt new
90 Tabla * * * * Tabla_DAT_new.txt new
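
The `{\xNN}` sequences replaced in this diff look like escaped Mac OS Roman bytes (for example, `\x96` decodes to `ñ` under that encoding). That reading is an inference from the language names, not something stated in the commit. Under that assumption, here is a minimal sketch of restoring the accented names instead of dropping the characters; `decode_escapes` is a hypothetical helper.

```python
import re

# Matches escapes of the form {\x96}: braces around a backslash-x hex byte.
ESCAPE = re.compile(r"\{\\x([0-9A-Fa-f]{2})\}")


def decode_escapes(text, encoding="mac_roman"):
    """Replace each {\\xNN} escape with the character that byte encodes.

    The default encoding is an assumption about how the WCS file was produced.
    """
    def replace(match):
        byte_value = int(match.group(1), 16)
        return bytes([byte_value]).decode(encoding)

    return ESCAPE.sub(replace, text)


print(decode_escapes(r"Cavine{\x96}a"))
print(decode_escapes(r"Sirion{\x97}"))
```

Text without escapes passes through unchanged, so the helper can be applied to every field of the file.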
Binary file added src/examples/colors/data/zkrt18_prior.npy
Binary file not shown.
355 changes: 355 additions & 0 deletions src/examples/colors/demo.ipynb

Large diffs are not rendered by default.
