Merge pull request #15 from datasciencecampus/feature/doc-scoring
Topic Modelling - Redefined
ColinDaglish authored Jul 24, 2023
2 parents 3cf8da1 + 2363e1e commit 9e45f2e
Showing 15 changed files with 590 additions and 566 deletions.
68 changes: 68 additions & 0 deletions docs/user_guide/README.md
@@ -6,3 +6,71 @@ This is the user guide for the `consultation-nlp-2023` project.
:maxdepth: 2
./loading_environment_variables.md
```

## How to configure the model
The majority of the model configuration happens in `question_model_config.yaml`.

Within this file, you will find configuration options for each question that gets processed.

**example:**
```yaml
qu_12:
  max_features: null
  ngram_range: !!python/tuple [1,2]
  min_df: 2
  max_df: 0.9
  n_topics: 3
  n_top_words: 10
  max_iter:
    lda: 25
    nmf: 1000
  lowercase: true
  topic_labels:
    lda: null
    nmf:
      - "Admin Data"
      - "Research"
      - "Policy"
```
In this example, you can see that the yaml file is indented at various levels.
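
As a rough sketch of what this file becomes once loaded (assuming PyYAML; the file path and loader choice here are illustrative, and the project's own loading code may differ), note that the `!!python/tuple` tag needs a loader that understands Python-specific tags:

```python
import yaml

# Illustrative path; the actual location of the config may differ.
with open("src/question_model_config.yaml") as f:
    # UnsafeLoader understands the !!python/tuple tag; plain safe_load
    # would reject it.
    config = yaml.load(f, Loader=yaml.UnsafeLoader)

qu_12 = config["qu_12"]
print(qu_12["ngram_range"])   # (1, 2) -- a real Python tuple, not a list
print(qu_12["max_features"])  # None   -- yaml's null becomes Python's None
```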
### qu_12
type: str
At the top level of indentation, we have the question-id, in this case 'qu_12'. Each number corresponds to the column number of the raw input data (i.e. qu_12 is column 12 of the raw data csv).
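
To make the mapping concrete, here is a hypothetical sketch of how a question id could be resolved to a column of the raw csv (whether the project counts columns from 0 or 1 is an assumption here; the path is taken from `src/config.yaml` below):

```python
import pandas as pd

raw = pd.read_csv("data/raw/20230717_consultation_ingest.csv")

question_id = "qu_12"
# Assumes 0-indexed column positions; the project may count from 1.
column_number = int(question_id.split("_")[1])
responses = raw.iloc[:, column_number]  # free-text answers to question 12
```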
### max_features
type: int (or null)
This is an optional value, which can either be null (which will convert to None when transposed to Python) or an integer value for the maximum number of text features to include.
### ngram_range
type: tuple (but looks a bit like a list)
ngrams, or word-combination ranges, can help to increase the number of features in your dataset, which is useful if multi-word phrases like "admin data" are utilised a lot in the responses. The two values `[1,2]` correspond to the start and end of the range, so this example would include unigrams (individual words) and bi-grams (two-word combinations). To use single words only, change the setting to `[1,1]`. You can also include tri-grams and longer combinations if you wish.
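
The parameter names in the config mirror scikit-learn's text vectorizers, so here is a minimal sketch of the effect (assuming a `CountVectorizer`-style tokenizer, which is an assumption about the project's internals):

```python
from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(1, 2) keeps both single words and two-word phrases as features.
texts = ["admin data is useful", "admin data for research"]
vectorizer = CountVectorizer(ngram_range=(1, 2))
vectorizer.fit(texts)
print(vectorizer.get_feature_names_out())
# ['admin' 'admin data' 'data' 'data for' 'data is' 'for' 'for research'
#  'is' 'is useful' 'research' 'useful']
```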

### min_df
type: int or float
This is a way of filtering out less important words that don't appear in enough responses. `min_df` can either be a float value (e.g. 0.1), in which case it is interpreted as a proportion, or an integer value (e.g. 2), in which case it is interpreted as a number of responses.
So 0.1 would mean that a word needs to appear in at least 10% of the corpus to get through, while 2 would mean that it needs to appear in at least 2 documents.

### max_df
type: int or float
Similar to `min_df`, `max_df` is a way of filtering out words, but this time the more common ones. This field also takes floats and integers, interpreting them as proportions and absolute counts respectively. So 0.9 would stop words that appear in more than 90% of documents from making their way through, while 100 would stop words that appear in more than 100 documents.
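
A small sketch covering both `min_df` and `max_df` together (again assuming a scikit-learn-style vectorizer):

```python
from sklearn.feature_extraction.text import CountVectorizer

# min_df=2 drops words appearing in fewer than 2 documents;
# max_df=0.9 drops words appearing in more than 90% of documents.
texts = [
    "census data helps research",
    "census data informs policy",
    "census statistics are published",
]
vectorizer = CountVectorizer(min_df=2, max_df=0.9)
vectorizer.fit(texts)
print(vectorizer.get_feature_names_out())
# ['data'] -- 'census' is in all 3 docs (100% > 90%); the rest appear once
```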

### n_topics
type: int
This is the number of topics to attempt to model in the topic modelling; it must be an integer value.

### n_top_words
type: int
This is the number of top words to include in the modelling; it must be an integer value.

### max_iter
type: dictionary
This option breaks down further into `lda` and `nmf`, which are both integers. This setting controls the number of iterations each model runs through as it moves towards convergence. You may need to adjust these separately depending on model performance.
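
A hedged sketch of where the two caps would plug in, assuming the scikit-learn implementations (an assumption based on the `lda`/`nmf` names):

```python
from sklearn.decomposition import NMF, LatentDirichletAllocation

# Each model takes its own iteration cap from the config.
max_iter = {"lda": 25, "nmf": 1000}
n_topics = 3

lda = LatentDirichletAllocation(n_components=n_topics, max_iter=max_iter["lda"])
nmf = NMF(n_components=n_topics, max_iter=max_iter["nmf"])
# NMF usually needs far more iterations than LDA to converge, hence the
# separate settings.
```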

### lowercase
type: boolean
A switch setting for parsing words as lowercase or leaving them in their unadjusted form.

### topic_labels
type: dictionary
Again, this one breaks down further into `lda` and `nmf`, as after you have run the models you may wish to add specific topic labels for the plots you are generating. These can either be null or a list of strings. If you are setting labels, you must ensure there are the same number of labels as there are `n_topics`, otherwise the system will throw an error.
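
A hypothetical sketch of the check this implies (the project's actual validation may differ):

```python
# Hypothetical helper illustrating the label-count check described above.
def check_topic_labels(labels, n_topics: int) -> None:
    """Raise if the number of labels does not match n_topics."""
    if labels is not None and len(labels) != n_topics:
        raise ValueError(f"Expected {n_topics} topic labels, got {len(labels)}")

check_topic_labels(["Admin Data", "Research", "Policy"], n_topics=3)  # passes
```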
12 changes: 0 additions & 12 deletions src/config.yaml
@@ -2,15 +2,3 @@ raw_data_path: "data/raw/20230717_consultation_ingest.csv" #str
additional_stopwords: #list of words to filter; must be type str
- 'he'
lemmatize: True #bool; select False to use Stemmer
feature_count: #dict
  ngram_range: !!python/tuple [1,2] #tuple range; defaults to unigram (1,1)
  min_df: 2 #float (proportion) or int (count)
  max_df: 0.9 #float (proportion) or int (count)
  max_features: null #null converts to None, or int value
  lowercase: True #whether to convert all words to lowercase
lda: #dict
  n_topics: 3 #int greater than 0
  n_top_words: 10 #int
  max_iter: 25 #int
  title: "Topic Summary" #str
  topic_labels: null # also takes a list of strings (see additional stopwords ^)
131 changes: 0 additions & 131 deletions src/modules/analysis.py

This file was deleted.

21 changes: 21 additions & 0 deletions src/modules/named_entity_recognition.py
@@ -0,0 +1,21 @@
import spacy
from pandas import Series


def retrieve_named_entities(series: Series) -> list:
    """Retrieve any named entities from the series.

    Parameters
    ----------
    series : Series
        A series of text strings to analyse for named entities

    Returns
    -------
    list[list[str]]
        A list of lists containing strings for each named entity
    """
    nlp = spacy.load("en_core_web_sm")
    entities = []
    for doc in nlp.pipe(series):
        entities.append([str(ent) for ent in doc.ents])
    return entities
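
A usage sketch for the function above (assumes the spaCy model has been downloaded with `python -m spacy download en_core_web_sm`; the import path is inferred from the file location and may differ):

```python
import pandas as pd

from src.modules.named_entity_recognition import retrieve_named_entities

# Each response becomes a list of the entity strings spaCy finds in it.
responses = pd.Series(
    [
        "The ONS published new census data.",
        "I work for the Department for Education.",
    ]
)
print(retrieve_named_entities(responses))
# e.g. [['ONS'], ['the Department for Education']]
```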

Codecov / codecov/patch warning: added lines #L17 - L21 in src/modules/named_entity_recognition.py were not covered by tests.
