Geotag is a utility to quickly tag studies and samples from the NCBI Gene Expression Omnibus (GEO) on the command line.
- Display custom data table with custom columns that help to tag a samples/rows.
- Allow the user to modify the sample list to attach additional columns retrospectivly.
- Quickly display sample specific part of the relevant soft file.
- Filter and search data rows based on regular expressions.
- Custom sorting and organization of columns.
- Tag based line coloring for better overview and feedback to prevent mistakes.
- Allow adjustment of tag definitions.
- Allow creation of new tags that are shared between all users.
- Self-contained documentation to show the user all features without the need to read a manual.
- Undo/redo ability.
- Automatic backups that do not accumulate indefinitely.
- Select, show and tag multiple samples/rows simultaneously.
- Allow to make a note per sample/row.
- Quick and easy navigation.
- Log all tag relevant user interactions.
- Save the results automatically and in a human readable format.
- Save last state upon exit to allow continuing where the work was left off upon restart.
Geotag runs in Python 3.6 and and later versions. It uses the Python
curses module for an
interactive user interface on the command line,
tmux to display and organize
multiple terminals and less
to inspeact GEO soft files. Make sure tmux runs in 256 color mode,
e.g., by writing set -g default-terminal "screen-256color"
into
your ~/.tmux.conf
befor starting tmux. You can install Geotag with
pip:
pip3 install git+https://github.com/fraunhofer-izi/geotag
The base of the tagging is a table of all samples that should be tagged. The table
must be tab-separated and contain the columns gse
with the GEO Series accession
GSExxx and id
with the GEO sample accession GSMxxx. Any additional columns
may provide supportive information. An example is given in example/geo_sampe_table.tsv
:
platform_id gse id characteristics
GPL10999 GSE48305 GSM1174473 tissue: peripheral blood
GPL10999 GSE48305 GSM1174472 tissue: peripheral blood
GPL10999 GSE38993 GSM953384 cell type: lung fibroblast iPSC
GPL10999 GSE38993 GSM953383 cell type: foreskin fibroblast iPSC
The table is passed with --table example/geo_sampe_table.tsv
Most information on the sample is accessible in the GEO soft files. In order
to make the content available to Geotag all relevant soft files need to
be organized such that the soft file of GSExxx is located in
/path/to/the/soft/files/GSExxx/GSExxx_family.soft
. To download and
organize the soft files accordingly one could use the script download_soft.sh
with a list of line separated GSE numbers <gse list>
by
cat <gse list> | xargs -n1 ./download_soft.sh /path/to/the/soft/files
Per default, Geotag writes all its output into the directory ~/geotag
.
There are four different files:
- The tag file, holding the tag descriptions (default
tag.yml
). - The output file with the tags given to the samples (default
<user name>.yml
). - A log-file loging many user actions (default
<user namer>.log
). - A binary view file saving the view state of Geotag so you can continue
where you left off after restarting Geotag (default
<user name>.pkl
).
An alternative output path for each of these files can be specified
respectively with the arguments --tags
, --output
, --log
and --state
.
The output file is a yaml with the following structure:
tag definitions:
<tag name>:
col_width: <integer>
desc: <str>
editor: <str user name>
key: <char>
type: <"int" or "str">
<next tag name ...>
tags:
<tag name>:
<gse numer>_<gsm number>(<uniquifying integer if neccesarry>): <value>
<...>
<next tag name ...>
The output file is saved in an asynchronous thread after each tagging
action by the user. There is no feedback on whether the detached thread
successfully saved the file. If the user wants to make sure the latest
information is saved, a synchronous save can be triggered with the
key s
.
All tagging actions can be undone with the key u
. However, if geotag is
restarted, previous actions cannot be undone. To prevent
any loss of information a backup of the output file
with the appendix .backup_<date and time>
is saved after every 10th
action. The user can restore a backup by removing the appendix
from the file name. Geotag will keep at most ten backups by removing
the oldest backup if this number is exceeded.
To work together in a team, it is recommended to use a unique tag file
that all members can write to.
Whenever a member updates a tag description or
adds a new tag, the change becomes available to another member as soon
as she presses t
to view the tag-dialog.
Note that all tag descriptions are also stored in the output file, and upon a change of a tag description, all tags available to a user will be written to the tag file. So to remove a tag, each member of a team has to remove it.
If you want to share additional information between the team members,
e.g., the values they have tagged, it is recommended to write such
info to the input table e.g., through a periodically
repeated routine. The members can reload the displayed table by pressing l
.
Geotag needs to be run inside a tmux
session. This allows Geotag to display multiple soft files with
the reliable pager less
and while using all the window splitting
and organizing features of tmux. If the output should be stored in
the default path, you can run Geotag with
python3 -m geotag --table example/geo_sampe_table.tsv --softPath example/soft
Some issues can be resolved by restarting Geotag with the --update
option.
This will clear the current view state and leave the user at the top of
the table with default view settings. Another common issue is incomplete
keypress forwarding in the used terminal emulator. The key forwarded
to Geotag can be displayed in the status bar if you start it with --showKey
.
Press h
after getoag has loaded to receive help.
Sub-windows of Geotag list all available options at the top of the window.
Copyright (C) 2019 Fraunhofer-Gesellschaft zur Foerderung der angewandten Forschung e.V. acting on behalf of its Fraunhofer Institute for Cell Therapy and Immunology (IZI).
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.