Skit.ai's calls library.
We provide means to sample calls and conversations (aka turns) from a specified environment. This data is required for analysis and for training machine learning models. Hence the current offering of this library is an aggregation of conversations over calls.
We use this project as a component within skit-pipelines.
Installation is a little quirky because this library is meant to be used within a separate project. You need credentials from skit.ai to get past the dvc pull step and everything after it.
- Miniconda or any other python version management tool.
- awscli
- poetry
- S3 access credentials from skit.ai
- tunnel secrets, requested by mailing skit.ai
- libpq-dev, postgresql-libs, or brew install postgresql on mac/mac M1.
git clone [email protected]:skit-ai/skit-calls.git
cd skit-calls
poetry install
dvc pull
The dvc pull command will create a secrets/ dir. This is where we store our queries and environment variables.
You need to change the first line to export DB_HOST="localhost".
source secrets/env.sh
Now you are ready to use the project.
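If a command later fails to connect, a quick check that the environment was actually loaded can save some time. The snippet below is only a sketch: it looks at DB_HOST because that is the one variable named above, and the helper name db_host_is_local is hypothetical, not part of skit-calls.

```python
import os


def db_host_is_local(env=None):
    """Return True when DB_HOST is set to "localhost" in the given mapping.

    Hypothetical helper for illustration; skit-calls does not ship it.
    """
    env = os.environ if env is None else env
    return env.get("DB_HOST") == "localhost"


if __name__ == "__main__":
    if db_host_is_local():
        print("DB_HOST points at localhost")
    else:
        print("DB_HOST is missing or not localhost; try: source secrets/env.sh")
```

Run it from the same shell session in which secrets/env.sh was sourced, since exported variables do not carry over to other terminals.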
Post installation, we can see what the tooling provides by running:
skit-calls -h
usage: skit-calls [-h] [-v] [--on-disk] {sample,select} ...
Skit.ai's calls library {'0.2.8'}. We provide means to sample calls and conversations
from a specified environment. Learn about this library at: https://github.com/skit-
ai/skit-calls
positional arguments:
{sample,select} Supported means to obtain calls datasets aggregated with their turns.
sample Random sample calls with a variety of call/turn filters.
select Select calls from known call-ids.
options:
-h, --help show this help message and exit
-v, --verbose Increase verbosity
--on-disk Each record is written directly to disk. Highly recommended for large
queries.
To get random samples:
❯ poetry run skit-calls sample -h
usage: skit-calls sample [-h] --lang LANG [--org-ids [ORG_IDS ...]] --start-date START_DATE
[--end-date END_DATE] [--timezone TIMEZONE]
[--call-quantity CALL_QUANTITY]
[--call-type {INBOUND,OUTBOUND,CALL_TEST}]
[--ignore-callers [IGNORE_CALLERS ...]] [--reported]
[--use-case USE_CASE] [--flow-name FLOW_NAME]
[--min-audio-duration MIN_AUDIO_DURATION]
[--asr-provider ASR_PROVIDER]
options:
-h, --help show this help message and exit
--lang LANG Search calls made in the given language.
--org-ids ORG_IDS The orgs for which you need the data.
--start-date START_DATE
Search calls made after the given date (YYYY-MM-DD).
--end-date END_DATE Search calls made before the given date.
--timezone TIMEZONE The timezone to use for the start and end dates.
--call-quantity CALL_QUANTITY
The number of calls to filter.
--call-type {INBOUND,OUTBOUND,CALL_TEST}
The type of call to filter.
--ignore-callers [IGNORE_CALLERS ...]
A comma separated list of callers to ignore.
--reported Search only reported calls.
--use-case USE_CASE Filter calls by use-case.
--flow-name FLOW_NAME
Filter calls by flow-name.
--min-audio-duration MIN_AUDIO_DURATION
Filter calls longer than given duration.
--asr-provider ASR_PROVIDER
Filter calls served via a specific ASR provider.
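As an illustration of how these options combine, here is a minimal sketch that assembles a sample invocation from Python. It only uses flags listed in the help text above; the helper name build_sample_command and the values "en", "2022-01-01" and 100 are assumptions chosen for the example, and the accepted language codes are not documented here.

```python
import shlex

CALL_TYPES = {"INBOUND", "OUTBOUND", "CALL_TEST"}


def build_sample_command(lang, start_date, end_date=None, call_quantity=None,
                         call_type=None, reported=False, on_disk=False):
    """Assemble an argv list for `skit-calls sample` (hypothetical helper)."""
    argv = ["skit-calls"]
    if on_disk:
        # --on-disk is a top-level option, so it goes before the subcommand.
        argv.append("--on-disk")
    # --lang and --start-date are the two required options in the usage line.
    argv += ["sample", "--lang", lang, "--start-date", start_date]
    if end_date is not None:
        argv += ["--end-date", end_date]
    if call_quantity is not None:
        argv += ["--call-quantity", str(call_quantity)]
    if call_type is not None:
        if call_type not in CALL_TYPES:
            raise ValueError(f"call_type must be one of {sorted(CALL_TYPES)}")
        argv += ["--call-type", call_type]
    if reported:
        argv.append("--reported")
    return argv


if __name__ == "__main__":
    # Placeholder values; print the command instead of running it.
    argv = build_sample_command("en", "2022-01-01", call_quantity=100, on_disk=True)
    print(shlex.join(argv))
```

The printed line can be pasted into a shell, or the list can be passed to subprocess.run once the environment is set up. Prefix it with poetry run when working inside the project's virtual environment.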
If you already have specific call-ids in mind:
❯ poetry run skit-calls select -h
usage: skit-calls select [-h] (--call-ids CALL_IDS [CALL_IDS ...] | --csv CSV) [--org-ids [ORG_IDs ...]] [--use-fsm-url]
[--domain-url DOMAIN_URL] [--uuid-column UUID_COLUMN] [--history]
optional arguments:
-h, --help show this help message and exit
--call-ids CALL_IDS [CALL_IDS ...]
The call-ids to select.
--csv CSV CSV file that contains the call-ids to select.
--org-ids ORG_IDS The orgs for which you need the data.
--use-fsm-url Whether to use turn audio url from fsm or s3 path.
--domain-url DOMAIN_URL
The domain to use while forming public audio_urls
--uuid-column UUID_COLUMN
The column name of the UUID column in the CSV file. Required if --csv is set.
--history Collect call history for each turn
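The usage line shows that --call-ids and --csv are mutually exclusive, and the help text says --uuid-column is required with --csv. A small sketch that enforces both rules before building the command might look like this; build_select_command is a hypothetical helper, and the file name and column name in the example are placeholders.

```python
import shlex


def build_select_command(call_ids=None, csv_path=None, uuid_column=None,
                         history=False):
    """Assemble an argv list for `skit-calls select` (hypothetical helper)."""
    # Exactly one source of call-ids may be given, mirroring the CLI's
    # (--call-ids ... | --csv CSV) group.
    if (call_ids is None) == (csv_path is None):
        raise ValueError("pass exactly one of call_ids or csv_path")
    argv = ["skit-calls", "select"]
    if call_ids is not None:
        if len(call_ids) == 0:
            raise ValueError("call_ids must contain at least one id")
        argv += ["--call-ids", *[str(c) for c in call_ids]]
    else:
        if uuid_column is None:
            raise ValueError("uuid_column is required when csv_path is set")
        argv += ["--csv", csv_path, "--uuid-column", uuid_column]
    if history:
        argv.append("--history")
    return argv


if __name__ == "__main__":
    # Placeholder file and column names; print the command instead of running it.
    print(shlex.join(build_select_command(csv_path="calls.csv",
                                          uuid_column="call_uuid",
                                          history=True)))
```

Validating the arguments up front gives a clearer error than waiting for the CLI to reject the combination.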