tmtk - TranSMART data curation toolkit

Author: Jochem Bijlard
Source Code: https://github.com/thehyve/tmtk/
Generated: Jan 29, 2018
License: GPLv3
Version: 0.3.2

Philosophy

A toolkit for ETL curation for the tranSMART data warehouse for translational research.

The TranSMART curation toolkit (tmtk) aims to provide a language and set of classes for describing data to be uploaded to tranSMART. The toolkit can be used to edit and validate studies prior to loading them with transmart-batch.

Functionality currently available:
  • create a transmart-batch ready study from clinical data files.
  • load an existing study and validate its contents.
  • edit the transmart concept tree in The Arborist graphical editor.
  • create chromosomal region annotation files.
  • map HGNC gene symbols to corresponding Entrez gene IDs using mygene.info.

Note

tmtk is a python3 package meant to be run in Jupyter notebooks. Results for other setups may vary.

Basic Usage

Step 1: Opening a notebook

First open a shell and change directory to wherever your data lives. Then start the Jupyter notebook server:

cd /path/to/studies/
jupyter notebook

This should open Jupyter's file browser in your web browser; create a new notebook here.

Step 2: Using tmtk

# First import the toolkit into your environment
import tmtk

# Then create a <tmtk.Study> object by pointing to study.params of a transmart-batch study
study = tmtk.Study('~/studies/a_tm_batch_ready_study/study.params')
# Or, by using the study wizard on a directory with correctly structured clinical data files.
# (Visit the transmart-batch documentation to find out what is expected.)
study = tmtk.wizard.create_study('~/studies/dir_with_some_clinical_data_files/')

Now that we have loaded the study as a tmtk.Study object, some useful functions become available:

# Check whether transmart-batch will find any issues with the way your study is setup
study.validate_all()

# Graphically manipulate the concept tree in this study by using The Arborist
study.call_boris()

Contents

Changelog

Version 0.3.2

  • More easily extensible validator functionality
  • Added multiple validation methods
  • Fix issue with namespace cleaner

Version 0.3.1

  • Replaced deprecated pandas functionality
  • More reliably start batch job

Version 0.3.0

  • Create studies from TraIT data templates, see Data templates.
  • Create fully randomized studies of any size: tmtk.toolbox.RandomStudy.
  • Load data right from Jupyter using transmart-batch, with progress bars! This also works as a command line tool, see Using transmart-batch from Jupyter.
  • Set name and id from the main study object.

Version 0.2.2

  • Minor bug fix for Arborist installation

Version 0.2.1

  • The Arborist is now implemented as a Jupyter Notebook extension
  • Metadata tags are automatically sorted in Arborist.

Version 0.2.0

  • Create and apply tree templates in Arborist
  • Improved interaction with metadata tags in Arborist
  • Resolved issues with the validator
  • R is now an optional dependency

User examples

These examples have been extracted from Jupyter Notebooks.

Create study from clinical data.

tmtk has a wizard that can be used to quickly go from clinical data files to a study object. The main goal of this functionality is to reduce the barrier of setting up all transmart-batch specific files (i.e. parameter files, column mapping and word mapping files).

The way to use it is to call tmtk.wizard.create_study(path), where path points to a directory with clinical data files.

Note: clinical datafiles have to be in a format that is accepted by transmart-batch.

Here we will create a study from these two files:

import os
files_dir = './studies/wizard/'
os.listdir(files_dir)
['Cell-line_clinical.txt', 'Cell-line_NHTMP.txt']
# Load the toolkit
import tmtk
# Create a study object by running the wizard
study = tmtk.wizard.create_study('./studies/wizard/')
#####  Please select your clinical datafiles  #####
-    0. /home/vlad-the-impaler/tmtk/studies/wizard/Cell-line_clinical.txt
-    1. /home/vlad-the-impaler/tmtk/studies/wizard/Cell-line_NHTMP.txt
Pick number:  0
Selected files: ['Cell-line_clinical.txt']
Pick number:  1
Selected files: ['Cell-line_clinical.txt', 'Cell-line_NHTMP.txt']
Pick number:

✅ Adding 'Cell-line_clinical.txt' as clinical datafile to study.

✅ Adding 'Cell-line_NHTMP.txt' as clinical datafile to study.


The wizard walked us through some of the options for the study we want to create. Our new study is a public study with STUDY_ID=WIZARD, and you can pick an appropriate name by setting study.study_name = 'Ur a wizard harry'. None of the clinical params have been set, so tmtk will use default names for the column and word mapping files. Finally, the datafiles have been loaded and a column mapping object has been created that includes them.

Next we will run the validator and find out that some files cannot be found. This is expected, as these objects exist only in memory and not yet on disk.

study.validate_all(5)

⚠ No valid file found on disk for /home/vlad-the-impaler/tmtk/studies/wizard/clinical/word_mapping_file.txt, creating dataframe.

Validating params file at clinical

❌ WORD_MAP_FILE=word_mapping_file.txt cannot be found.

❌ COLUMN_MAP_FILE=column_mapping_file.txt cannot be found.

Detected parameter WORD_MAP_FILE=word_mapping_file.txt.

Detected parameter COLUMN_MAP_FILE=column_mapping_file.txt.

Validating params file at study

Detected parameter TOP_NODE=\Public Studies\You're a wizard Harry\.

Detected parameter STUDY_ID=WIZARD.

Detected parameter SECURITY_REQUIRED=N.


Of course, we want to write our study to disk so it can be loaded with transmart-batch.

study = study.write_to('~/studies/my_new_study')

Writing file to /home/vlad-the-impaler/studies/my_new_study/clinical/clinical.params

Writing file to /home/vlad-the-impaler/studies/my_new_study/study.params

Writing file to /home/vlad-the-impaler/studies/my_new_study/clinical/column_mapping_file.txt

Writing file to /home/vlad-the-impaler/studies/my_new_study/clinical/Cell-line_clinical.txt

Writing file to /home/vlad-the-impaler/studies/my_new_study/clinical/word_mapping_file.txt

Writing file to /home/vlad-the-impaler/studies/my_new_study/clinical/Cell-line_NHTMP.txt

Next you can use the TranSMART Arborist to modify the concept tree or use tmtk to load to transmart if you’ve set your $TMBATCH_HOME, see Using transmart-batch from Jupyter.


TranSMART Arborist

GUI editor for the concept tree.

First load the toolkit.

import tmtk

Create a study object by entering a “study.params” file.

study = tmtk.Study('../studies/valid_study/study.params')

To verify the study object is compatible with transmart-batch for loading you can run the validator

study.validate_all()

Validating Tags:

❌ Tags (2) found that cannot map to tree: (1. Cell line characteristics∕1. Cell lines∕Age and 1. Cell line characteristics∕1. Cell lines∕Gender). You might want to call_boris() to fix them.

We will ignore this issue for now as this will be fixed automatically when calling the Arborist GUI.

The GUI allows a user to interactively edit all aspects of TranSMART's concept tree, including:

  • Concept paths from the clinical column mapping.
  • Word mapping from clinical data files.
  • High dimensional paths from subject sample mapping files.
  • Metadata tags.
# In a Jupyter Notebook, this brings up the interactive concept tree editor.
study.call_boris()
[Image: the Arborist concept tree editor]

Once returned from The Arborist to the Jupyter environment, we can write the updated files to disk. You can then run transmart-batch on that study to load it into your tranSMART instance.

study.write_to('~/studies/updated_study')

Collaboration with non-technical users.

Though using Jupyter Notebooks is great for technical users, less technical domain experts might quickly feel discouraged. To allow collaboration with these users, we can upload this concept tree to a running Boris as a Service webserver, where others can make refinements to the concept tree.

study.publish_to_baas('arborist-test-trait.thehyve.net')

Once the study is updated in BaaS, we can update the local files by copying the url for the latest tree into this command.

study.update_from_baas('arborist-test-trait.thehyve.net/trees/valid-study/3/~edit')

Using transmart-batch from Jupyter

Using tmtk you can load data to tranSMART right from Jupyter. For this to work you need to download and build transmart-batch; see the transmart-batch GitHub repository for instructions.

Once you’ve done that, you need to set an environment variable to the path of the repository. The easiest way to do this is to add the following to your ~/.bash_profile:

export TMBATCH_HOME=/home/path/to/transmart-batch

Next, make sure to create a properties file with an appropriate name (e.g. batchdb.properties) in the $TMBATCH_HOME directory. tmtk will look for any *.properties file and allow you to run transmart-batch with that properties file from many objects. Examples of good names are production.properties or test-environment.properties. You will then be able to do something like this:

study.load_to.production()

API Description

Study class

class tmtk.Study(study_params_path=None, minimal=False)[source]

Bases: tmtk.utils.validate.ValidateMixin

Describes an entire TranSMART study. This is the main object used in tmtk. Studies can be initialized by pointing to a study.params file. The study has to be structured according to the transmart-batch specification.

>>> import tmtk
>>> study = tmtk.Study('./studies/valid_study/study.params')

This will create the study object which can be used as a starting point for custom curation or directly in The Arborist.

add_metadata()[source]

Create the Tags object for this study. Does nothing if it is already present.

all_files

All file objects in this study.

annotation_files

All annotation file objects in this study.

call_boris(height=650)[source]

Launch The Arborist GUI editor for the concept tree. This starts a Flask webserver in an IFrame when running in a Jupyter Notebook.

While The Arborist is open, the GIL prevents any other actions.

Parameters:height – set the height of the output cell.

clinical_files

All clinical file objects in this study.

concept_tree

ConceptTree object for this study.

concept_tree_json

Stringified JSON that is used by JSTree in The Arborist.

concept_tree_to_clipboard()[source]

Send stringified JSON that is used by JSTree in The Arborist to clipboard.

create_clinical()[source]

Add clinical data to a study object by creating empty params.

files_with_changes()[source]

Find dataframes that have changed since they have been loaded.

find_annotation(platform=None)[source]

Search for annotation data with this study and return it.

Parameters:platform – platform id to look for in this study.
Returns:an Annotations object or nothing.
find_params_for_datatype(datatypes=None)[source]

Search for parameter files within this study object and return them as list.

Parameters:datatypes – single string datatype or list of strings
Returns:a list of parameter objects for specific datatype in this study
get_object_from_params_path(path)[source]

Returns object that belongs to the params path given

get_objects_with_prop(prop: <built-in function all>)[source]

Search for objects with a certain property.

Parameters:prop – string equal to the property name.
Returns:generator for the found objects.
high_dim_files

All high dimensional file objects in this study.

load_to
publish_to_baas(url, study_name=None, username=None)[source]

Publishes a tree on a Boris as a Service instance.

Parameters:
  • url – url of a Boris as a Service instance to publish to.
  • study_name – name of the study in BaaS.
  • username – username to authenticate with.
Returns:

the url that points to the study you’ve just uploaded.

sample_mapping_files

All subject sample mapping file objects in this study.

study_id

The study ID as it is set in study params.

study_name

The study name, extracted from study param TOP_NODE.

tag_files
update_from_baas(url, username=None)[source]

Give url to a tree in BaaS.

Parameters:
  • url – url that points to a tree in BaaS.
  • username – username to authenticate with.
update_from_treefile(treefile)[source]

Give path to a treefile (from Boris as a Service or otherwise) and update the current study to match made changes.

Parameters:treefile – path to a treefile (stringified JSON).
validate_all(verbosity='WARNING')[source]

Validate all items in this study.

Parameters:verbosity – only display output of this level and above. Levels: ‘debug’, ‘info’, ‘okay’, ‘warning’, ‘error’, ‘critical’. Default is ‘WARNING’.
Returns:True if no errors or critical is encountered.
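The verbosity threshold described above can be sketched in plain Python. This is an illustration of level-based message filtering, not the tmtk implementation; the message data is made up.

```python
# Levels in the documented order, from least to most severe.
LEVELS = ['debug', 'info', 'okay', 'warning', 'error', 'critical']

def filter_messages(messages, verbosity='warning'):
    """Keep only messages at or above the given verbosity level."""
    threshold = LEVELS.index(verbosity.lower())
    return [(level, text) for level, text in messages
            if LEVELS.index(level.lower()) >= threshold]

messages = [
    ('debug', 'checking delimiter'),
    ('warning', 'word mapping file not found on disk'),
    ('error', 'COLUMN_MAP_FILE cannot be found'),
]
print(filter_messages(messages))  # keeps only the warning and the error
```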
write_to(root_dir, overwrite=False, return_new=True)[source]

Write this study to a new directory on file system.

Parameters:
  • root_dir – the base directory to write the study to.
  • overwrite – set this to True to overwrite existing files.
  • return_new – if True load the study object from the new location and return it.
Returns:

new study object if return_new == True.


Params classes

Params Container

class tmtk.params.Params.Params(study_folder=None)[source]

Bases: tmtk.utils.validate.ValidateMixin

Container class for all params files, called by Study to locate all params files.

add_params(path, parameters=None)[source]

Add a new parameter file to the Params object.

Parameters:
  • path – a path to a parameter file.
  • parameters – add dict here with parameters if you want to create a new parameter file.
static create_params(path, parameters=None, subdir=None)[source]

Create a new parameter file object.

Parameters:
  • path – a path to a parameter file.
  • parameters – add dict here with parameters if you want to create a new parameter file.
  • subdir – subdir is used as string representation.
Returns:

parameter file object.

Base class: ParamsBase

class tmtk.params.ParamsBase.ParamsBase(path=None, parameters=None, subdir=None, parent=None)[source]

Bases: tmtk.utils.validate.ValidateMixin

Base class for parameter files.

get(parameter, default=None)[source]

Return value for parameter.

Parameters:
  • parameter – string will be converted to uppercase.
  • default – return default if value is not found.
Returns:

value for this parameter if set, else None.

save()[source]

Overwrite the original file with the current parameters.

update()[source]

Iterate over parameters to change them interactively.

write_to(path, overwrite=False)[source]

Writes parameters in object to file in path. Does not overwrite existing files unless specifically told.

Parameters:
  • path – path to store parameters to.
  • overwrite – allow overwriting existing files.

AnnotationParams

class tmtk.params.AnnotationParams.AnnotationParams(path=None, parameters=None, subdir=None, parent=None)[source]

Bases: tmtk.params.ParamsBase.ParamsBase

is_viable()[source]
Returns:True if both the platform is set and the annotations file is located, else returns False.
mandatory
optional

ClinicalParams

class tmtk.params.ClinicalParams.ClinicalParams(path=None, parameters=None, subdir=None, parent=None)[source]

Bases: tmtk.params.ParamsBase.ParamsBase

is_viable()[source]
Returns:True if the column mapping file is located, else returns False.
mandatory
optional

HighDimParams

class tmtk.params.HighDimParams.HighDimParams(path=None, parameters=None, subdir=None, parent=None)[source]

Bases: tmtk.params.ParamsBase.ParamsBase

is_viable()[source]
Returns:True if both the datafile and map file are located, else returns False.
mandatory
optional

StudyParams

class tmtk.params.StudyParams.StudyParams(path=None, parameters=None, subdir=None, parent=None)[source]

Bases: tmtk.params.ParamsBase.ParamsBase

is_viable()[source]
Returns:True if STUDY_ID has been set.
mandatory
optional

TagsParams

class tmtk.params.TagsParams.TagsParams(path=None, parameters=None, subdir=None, parent=None)[source]

Bases: tmtk.params.ParamsBase.ParamsBase

is_viable()[source]
Returns:True if the tags file is located, else returns False.
mandatory
optional

Clinical classes

Clinical Container

class tmtk.clinical.Clinical(clinical_params=None)[source]

Bases: tmtk.utils.validate.ValidateMixin

Container class for all clinical data related objects, i.e. the column mapping, word mapping, and clinical data files.

This object has methods that add data files, and for lookups of clinical files and variables.

ColumnMapping
WordMapping
add_datafile(filename, dataframe=None)[source]

Add a clinical data file to study.

Parameters:
  • filename – path to file or filename of file in clinical directory.
  • dataframe – if given, add pd.DataFrame to study.
all_variables

Dictionary where {tmtk.VarID: tmtk.Variable} for all variables in the column mapping file.

apply_column_mapping_template(template)[source]

Update the column mapping by applying a template.

Parameters:template

expected input is a dictionary where keys are column names as found in clinical datafiles. Each column header name has a dictionary describing the path and data label. For example:

{'GENDER': {'path': 'Characteristics∕Demographics',
            'label': 'Gender'},
 'BPBASE': {'path': 'Lab results∕Blood',
            'label': 'Blood pressure (baseline)'}}
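Applying such a template can be sketched in plain Python. This is an illustration with made-up rows, not the tmtk implementation: template keys are column names from the datafile, and matching rows get a new concept path and data label.

```python
# Hypothetical column mapping template: keys are datafile column names.
template = {
    'GENDER': {'path': 'Characteristics+Demographics', 'label': 'Gender'},
    'BPBASE': {'path': 'Lab results+Blood', 'label': 'Blood pressure (baseline)'},
}

# One row per variable: [filename, category code, column number, data label].
column_mapping = [
    ['clinical.txt', 'Subjects', 1, 'GENDER'],
    ['clinical.txt', 'Subjects', 2, 'BPBASE'],
]

for row in column_mapping:
    update = template.get(row[3])  # look up the original column name
    if update:
        row[1] = update['path']    # new category code (concept path)
        row[3] = update['label']   # new data label

print(column_mapping)
```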

call_boris(height=650)[source]

Use The Arborist to modify only the information in the column and word mapping files.

Parameters:height – set the height of the output cell.

clinical_files
get_datafile(name: str)[source]

Find datafile object by filename.

Parameters:name – name of file.
Returns:tmtk.DataFile object.
get_variable(var_id: tuple)[source]

Return a Variable object based on variable id.

Parameters:var_id – tuple of filename and column number.
Returns:tmtk.Variable.
load_to
params
show_changes()[source]

Print changes made to the column mapping and word mapping file.

validate_all(verbosity=3)[source]

ColumnMapping

class tmtk.clinical.ColumnMapping(params=None)[source]

Bases: tmtk.utils.filebase.FileBase, tmtk.utils.validate.ValidateMixin

Class with utilities for the column mapping file for clinical data. Can be initialized by giving a clinical params file object.

append_from_datafile(datafile)[source]

Appends the column mapping file with rows based on datafile column names.

Parameters:datafiletmtk.DataFile object.
build_index(df=None)[source]

Build index for the column mapping dataframe. If pd.DataFrame (optional) is given, modify and return that.

Parameters:dfpd.DataFrame.
Returns:pd.DataFrame.
create_df()[source]

Create pd.DataFrame with a correct header.

Returns:pd.DataFrame.
get_concept_path(var_id: tuple)[source]

Return concept path for given variable identifier tuple.

Parameters:var_id – tuple of filename and column number.
Return str:concept path for this variable.
ids

A list of variable identifier tuples.

included_datafiles

List of datafiles included in column mapping file.

path_changes(silent=False)[source]

Determine changes made to column mapping file.

Parameters:silent – if True, only print output.
Returns:if silent=False return dictionary with changes since load.
path_id_dict

Dictionary with all variable ids as keys and paths as value.

select_row(var_id: tuple)[source]

Select row based on variable identifier tuple. Raises exception if variable is not in this column mapping.

Parameters:var_id – tuple of filename and column number.
Returns:list of items in selected row.
set_concept_path(var_id: tuple, path, label)[source]

Return concept path for given variable identifier tuple.

Parameters:
  • var_id – tuple of filename and column number.
  • path – new value for path.
  • label – new value for data label.
subj_id_columns

A list of tuples with datafile and column index for SUBJ_ID, e.g. (‘cell-line.txt’, 1).

DataFile

class tmtk.clinical.DataFile(path=None)[source]

Bases: tmtk.utils.filebase.FileBase

Class for clinical data files, does not do much more than tmtk.FileBase.

Variable

class tmtk.clinical.Variable(datafile, column: int = None, clinical_parent=None)[source]

Bases: object

Base class for clinical variables

column_map_data

Column mapping row as dictionary where keys are short descriptors.

Returns:dict.
concept_path

Concept path after conversions by transmart-batch.

Returns:str.
data_label

Variable data label.

Returns:str.
forced_categorical

Check if forced categorical by entering ‘CATEGORICAL’ in 7th column.

Returns:bool.
is_empty

Check if variable is fully empty.

Returns:bool.
is_in_wordmap

Check if variable is represented in word mapping file.

Returns:bool.
is_numeric

True if transmart-batch will load this concept as numerical. This includes information from word mapping and column mapping.

Returns:bool.
is_numeric_in_datafile

True if the datafile contains only numerical items.

Returns:bool.
mapped_values

Data items after word mapping.

Returns:list.
unique_values
Returns:Unique set of values in the datafile.
validate(verbosity=2)[source]
values
Returns:All values as found in the datafile.
var_id
Returns:Variable identifier tuple (datafile.name, column).
word_map_dict

A dictionary with word mapped categoricals. Keys are items in the datafile, values are what they will be mapped to through the word mapping file. Unmapped items are also added as key, value pair.

Returns:dict.
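The behaviour described above can be sketched in plain Python (an illustration, not the tmtk code): every datafile value maps to itself by default, and the word mapping overrides the mapped ones.

```python
def build_word_map_dict(datafile_values, word_mapping):
    """Map each datafile value to its word-mapped replacement;
    unmapped values map to themselves."""
    mapping = {value: value for value in datafile_values}  # identity default
    mapping.update(word_mapping)                           # apply word mapping
    return mapping

values = ['M', 'F', 'U']
word_mapping = {'M': 'Male', 'F': 'Female'}
print(build_word_map_dict(values, word_mapping))
# {'M': 'Male', 'F': 'Female', 'U': 'U'}
```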

WordMapping

class tmtk.clinical.WordMapping(params=None)[source]

Bases: tmtk.utils.filebase.FileBase, tmtk.utils.validate.ValidateMixin

Class representing the word mapping file.

build_index(df=None)[source]

Build and sort multi-index for dataframe based on filename and column number columns. If the df parameter is not set, build the index for self.df.

Parameters:dfpd.DataFrame.
Returns:pd.DataFrame.
create_df()[source]

Create pd.DataFrame with a correct header.

Returns:pd.DataFrame.
get_word_map(var_id)[source]

Return dict with value in data file, and the mapped value as keyword-value pairs.

Parameters:var_id – tuple of filename and column number.
Returns:dict.
word_map_changes(silent=False)[source]

Determine changes made to word mapping file.

Parameters:silent – if True, only print output.
Returns:if silent=False return dictionary with changes since load.
word_map_dicts

Dictionary with all variable ids as keys and word map dicts as value.

Annotations

Annotations Container

class tmtk.annotation.Annotations.Annotations(params_list=None, parent=None)[source]

Bases: object

Class containing all AnnotationFile objects.

annotation_files
validate_all(verbosity=3)[source]

Base class: AnnotationBase

class tmtk.annotation.AnnotationBase.AnnotationBase(params=None, path=None)[source]

Bases: tmtk.utils.filebase.FileBase, tmtk.utils.validate.ValidateMixin

Base class for annotation files.

load_to
marker_type

ChromosomalRegions

class tmtk.annotation.ChromosomalRegions.ChromosomalRegions(params=None, path=None)[source]

Bases: tmtk.annotation.AnnotationBase.AnnotationBase

Subclass for CNV (aCGH, qDNAseq) annotation.

biomarkers

MicroarrayAnnotation

class tmtk.annotation.MicroarrayAnnotation.MicroarrayAnnotation(params=None, path=None)[source]

Bases: tmtk.annotation.AnnotationBase.AnnotationBase

Subclass for microarray (mRNA) expression annotation files.

biomarkers

MirnaAnnotation

class tmtk.annotation.MirnaAnnotation.MirnaAnnotation(params=None, path=None)[source]

Bases: tmtk.annotation.AnnotationBase.AnnotationBase

Subclass for micro RNA (miRNA) expression annotation files.

biomarkers

ProteomicsAnnotation

class tmtk.annotation.ProteomicsAnnotation.ProteomicsAnnotation(params=None, path=None)[source]

Bases: tmtk.annotation.AnnotationBase.AnnotationBase

Subclass for proteomics annotation

biomarkers

High Dimensional data

HighDim

class tmtk.highdim.HighDim.HighDim(params_list=None, parent=None)[source]

Bases: tmtk.utils.validate.ValidateMixin

Container class for all High Dimensional data types.

Parameters:params_list – contains a list with Params objects.
high_dim_files
sample_mapping_files
update_high_dim_paths(high_dim_paths)[source]

Update sample mapping if path has been changed.

Parameters:high_dim_paths – dictionary with paths and old concept paths.
validate_all(verbosity=3)[source]

HighDimBase

class tmtk.highdim.HighDimBase.HighDimBase(params=None, path=None, parent=None)[source]

Bases: tmtk.utils.filebase.FileBase, tmtk.utils.validate.ValidateMixin

Base class for high dimensional data structures.

load_to

CopyNumberVariation

class tmtk.highdim.CopyNumberVariation.CopyNumberVariation(params=None, path=None, parent=None)[source]

Bases: tmtk.highdim.HighDimBase.HighDimBase

Base class for copy number variation datatypes (aCGH, qDNAseq)

allowed_header
remap_to(destination=None)[source]
Parameters:destination
Returns:
samples

Expression

class tmtk.highdim.Expression.Expression(params=None, path=None, parent=None)[source]

Bases: tmtk.highdim.HighDimBase.HighDimBase

Base class for microarray mRNA expression data.

samples

Mirna

class tmtk.highdim.Mirna.Mirna(params=None, path=None, parent=None)[source]

Bases: tmtk.highdim.HighDimBase.HighDimBase

Base class for micro RNA (miRNA) data.

samples

Proteomics

class tmtk.highdim.Proteomics.Proteomics(params=None, path=None, parent=None)[source]

Bases: tmtk.highdim.HighDimBase.HighDimBase

Base class for proteomics data.

samples

ReadCounts

class tmtk.highdim.ReadCounts.ReadCounts(params=None, path=None, parent=None)[source]

Bases: tmtk.highdim.HighDimBase.HighDimBase

Subclass for ReadCounts.

allowed_header
remap_to(destination=None)[source]
Parameters:destination
Returns:
samples

SampleMapping

class tmtk.highdim.SampleMapping.SampleMapping(path=None)[source]

Bases: tmtk.utils.filebase.FileBase, tmtk.utils.validate.ValidateMixin

Base class for subject sample mapping

get_concept_paths

Get all concept paths from file, replaces ATTR1 and ATTR2.

Returns:dictionary with md5 hash values as key and paths as value
platform
Returns:the platform id in this sample mapping file.
samples
slice_path(path)[source]

Give a slice of the dataframe where the paths are equal to the given path.

Parameters:path – path (will be converted using global logic).
Returns:slice of dataframe.

study_id
Returns:study_id in sample mapping file
update_concept_paths(path_dict)[source]

Metadata Tags

Tags

class tmtk.tags.Tags.MetaDataTags(params=None, parent=None)[source]

Bases: tmtk.utils.filebase.FileBase, tmtk.utils.validate.ValidateMixin

static create_df()[source]
get_tags()[source]

Generator that yields tags from the tags file.

Returns:tuples (<path>, <title>, <description>)
invalid_paths
load_to
tag_paths

Return tag paths delimited by the path_converter.

Utilities

FileBase

class tmtk.utils.filebase.FileBase[source]

Bases: object

Super class with shared utilities for file objects.

df

The pd.DataFrame for this file object.

df_has_changed
header
name
save()[source]

Overwrite the original file with the current dataframe.

tabs_in_first_line()[source]

Check if file is tab delimited.

write_to(path, overwrite=False)[source]

Wrapper for tmtk.utils.df2file().

Parameters:
  • path – path to write file to.
  • overwrite – write over existing files in the filesystem.

Generic module

tmtk.utils.Generic.clean_for_namespace(path) → str[source]

Converts a path and returns a namespace-safe variant. Characters that cause errors are converted to underscores.

Parameters:path – usually a descriptive subdirectory
Returns:string
tmtk.utils.Generic.column_map_diff(a_column, b_column)[source]
tmtk.utils.Generic.df2file(df=None, path=None, overwrite=False)[source]

Write a dataframe to file safely. Does not overwrite existing files automatically. This function converts concept path delimiters.

Parameters:
  • dfpd.DataFrame
  • path – path to write to
  • overwrite – False (default) or True
tmtk.utils.Generic.file2df(path=None)[source]

Load a file specified by path into a pandas DataFrame. If hashed is True, return a (dataframe, hash) tuple.

Parameters:path – to file to load
Returns:pd.DataFrame
tmtk.utils.Generic.find_fully_unique_columns(df)[source]

Check if a dataframe contains a fully unique column (SUBJ_ID candidate).

Parameters:dfpd.DataFrame
Returns:list of names of unique columns
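The idea can be sketched without pandas: a column qualifies as a SUBJ_ID candidate when all of its values are distinct. This is an illustration using a plain dict of columns, not the tmtk implementation.

```python
def fully_unique_columns(columns):
    """Return names of columns whose values are all distinct."""
    return [name for name, values in columns.items()
            if len(set(values)) == len(values)]

columns = {
    'SUBJ_ID': ['P1', 'P2', 'P3'],  # unique: a SUBJ_ID candidate
    'GENDER': ['M', 'F', 'M'],      # contains duplicates
}
print(fully_unique_columns(columns))  # ['SUBJ_ID']
```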
tmtk.utils.Generic.fix_everything()[source]

Scans over all the data and indicates which errors have been fixed. This function is great for stress relief.

Returns:All your problems fixed by Rick
tmtk.utils.Generic.is_numeric(values)[source]

Check if list of values are numeric.

Parameters:values – iterable
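A numeric check on an iterable can be sketched as follows; this is an illustration of the concept, not the exact tmtk code. Every item has to be convertible to float.

```python
def values_are_numeric(values):
    """Return True if every value in the iterable parses as a number."""
    try:
        for value in values:
            float(value)
    except (TypeError, ValueError):
        return False
    return True

print(values_are_numeric(['1', '2.5', '-3']))  # True
print(values_are_numeric(['1', 'high']))       # False
```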
tmtk.utils.Generic.md5(s: str)[source]

utf-8 encoded md5 hash string of input s.

Parameters:s – string
Returns:md5 hash string
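The documented behaviour maps directly onto Python's standard library:

```python
import hashlib

def md5(s: str) -> str:
    """Return the md5 hash string of the utf-8 encoded input."""
    return hashlib.md5(s.encode('utf-8')).hexdigest()

print(md5('tmtk'))  # a 32-character hex digest
```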
tmtk.utils.Generic.merge_two_dicts(x, y)[source]

Given two dicts, merge them into a new dict as a shallow copy.

tmtk.utils.Generic.numeric(x)[source]
tmtk.utils.Generic.path_converter(path, internal=False)[source]

Convert paths by creating delimiters of backslash "\" and the "+" sign, additionally converting underscores "_" to a single space.

Parameters:
  • path – concept path
  • internal – if path is for internal use delimit with Mappings.PATH_DELIM
Returns:

delimited path
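The conversion described above can be sketched in a few lines. This is an illustration under the stated rules ("+" and "\" act as delimiters, underscores become spaces), not the exact tmtk code.

```python
def convert_path(path, delimiter='\\'):
    """Normalize '+' and '\\' to a single delimiter, and '_' to a space."""
    for delim in ('+', '\\'):
        path = path.replace(delim, delimiter)
    return path.replace('_', ' ')

print(convert_path('Lab_results+Blood'))  # Lab results\Blood
```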

tmtk.utils.Generic.path_join(*args)[source]

Join items with the used path delimiter.

Parameters:args – path items
Returns:path as string
tmtk.utils.Generic.summarise(list_or_dict=None, max_items: int = 7) → str[source]

Takes an iterable and returns a summarized string statement. Picks a random sample if number of items > max_items.

Parameters:
  • list_or_dict – list or dict to summarise
  • max_items – maximum number of items to keep.
Returns:

the items joined as string with end statement.
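A minimal sketch of such a summariser (illustrative names and formatting, not the tmtk implementation): join the items into one statement and fall back to a random sample when there are more than max_items.

```python
import random

def summarise(items, max_items=7):
    """Join items into one string, sampling when there are too many."""
    items = list(items)
    total = len(items)
    if total > max_items:
        items = random.sample(items, max_items)  # pick a random subset
    listed = ', '.join(map(str, sorted(items)))
    return '{} ({} items total)'.format(listed, total)

print(summarise(['age', 'gender', 'bmi']))  # age, bmi, gender (3 items total)
```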

tmtk.utils.Generic.word_map_diff(a_word_map, b_word_map)[source]

utils.CPrint module

utils.Exceptions module

exception tmtk.utils.Exceptions.ClassError(found=None, expected=None)[source]

Bases: BaseException

Error raised when unexpected class is found.

Parameters:
  • found – is the Object class of found
  • expected – is the required Object class
exception tmtk.utils.Exceptions.DatatypeError(found=None, expected=None)[source]

Bases: BaseException

Error raised when incorrect datatype is found.

Parameters:
  • found – is the datatype of object
  • expected – is the required datatype
exception tmtk.utils.Exceptions.NotYetImplemented[source]

Bases: BaseException

exception tmtk.utils.Exceptions.PathError(found=None)[source]

Bases: BaseException

Error raised when an incorrect path is given.

exception tmtk.utils.Exceptions.TooManyValues(found=None, expected=None, id_=None)[source]

Bases: BaseException

Error raised when too many values are found.

utils.HighDimUtils module

utils.mappings module

class tmtk.utils.mappings.Mappings[source]

Bases: object

Collection of statics used in various parts of the code.

EXT_PATH_DELIM = '\\'
PATH_DELIM = '∕'
annotation_data_types = {'rnaseq': 'Messenger RNA data (sequencing)', 'cnv': 'ACGH data', 'expression': 'Messenger RNA data (microarray)', 'mirna': 'micro RNA data (PCR)', 'proteomics': 'Proteomics data (mass spec)', 'vcf': 'Genomic variant data'}
annotation_marker_types = {'proteomics_annotation': 'PROTEOMICS', 'cnv_annotation': 'Chromosomal', 'vcf_annotation': 'VCF', 'mirna_annotation': 'MIRNA_QPCR', 'mrna_annotation': 'Gene expression', 'rnaseq_annotation': 'RNASEQ_RCNT'}
cat_cd = 'Category Code'
cat_cd_s = 'ccd'
col_num = 'Column Number'
col_num_s = 'col'
column_mapping_header = ['Filename', 'Category Code', 'Column Number', 'Data Label', 'Magic 5th', 'Magic 6th', 'Concept Type']
column_mapping_s = ['fn', 'ccd', 'col', 'dl', 'm5', 'm6', 'cty']
concept_type = 'Concept Type'
concept_type_s = 'cty'
data_label = 'Data Label'
data_label_s = 'dl'
df_value = 'Datafile Value'
df_value_s = 'dfv'
filename = 'Filename'
filename_s = 'fn'
static get_annotations(dtype=None)[source]

Return mapping for annotations classes. Return only for datatype if dtype is set. Else return full map.

Parameters:dtype – optional datatype (e.g. cnv_annotation)
Returns:dict with mapping, or class.
static get_highdim(dtype=None)[source]

Return mapping for high dimensional classes. Return only for datatype if dtype is set. Else return full map.

Parameters:dtype – optional datatype (e.g. cnv)
Returns:dict with mapping, or class.
static get_params(dtype=None)[source]

Return the mapping for params classes. If dtype is set, return only the class for that datatype; otherwise return the full map.

Parameters:dtype – optional datatype (e.g. cnv)
Returns:dict with mapping, or class.
magic_5 = 'Magic 5th'
magic_5_s = 'm5'
magic_6 = 'Magic 6th'
magic_6_s = 'm6'
map_value = 'Mapping Value'
map_value_s = 'map'
tags_description = 'Description'
tags_header = ['Concept Path', 'Title', 'Description', 'Weight']
tags_node_name = 'Tags'
tags_path = 'Concept Path'
tags_title = 'Title'
tags_weight = 'Weight'
word_mapping_header = ['Filename', 'Column Number', 'Datafile Value', 'Mapping Value']
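The three get_* statics above share one lookup pattern: a dict keyed by datatype that is returned whole when dtype is None. A minimal self-contained sketch of that pattern (the map contents and names here are illustrative, not tmtk's actual code):

```python
# Sketch of the dtype-lookup pattern behind Mappings.get_annotations(),
# get_highdim() and get_params(). The dict values are placeholders.
_PARAMS_MAP = {
    'cnv': 'CnvParams',
    'rnaseq': 'RnaseqParams',
}

def get_params(dtype=None):
    """Return the full map when dtype is None, else the single entry."""
    if dtype is None:
        return _PARAMS_MAP
    return _PARAMS_MAP[dtype]

full_map = get_params()       # the whole dict
single = get_params('cnv')    # only the entry for 'cnv'
```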

Toolbox package

Generate chromosomal regions file

tmtk.toolbox.generate_chromosomal_regions_file.generate_chromosomal_regions_file(platform_id=None, reference_build='hg19', **kwargs)[source]

This creates a new chromosomal regions annotation file.

Parameters:
  • platform_id – name for the new platform, used to fill the first column
  • reference_build – choose either hg18, hg19 or hg38
Returns:

a pandas dataframe with the new platform
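As a rough illustration of what such a platform table holds, the sketch below assembles rows whose first column is the platform_id. The column layout is hypothetical; consult the transmart-batch documentation for the actual annotation file format.

```python
# Hypothetical column layout: (platform_id, region_name, chromosome, start, end).
def build_region_rows(platform_id, regions):
    """regions: iterable of (region_name, chromosome, start_bp, end_bp)."""
    return [(platform_id, name, chrom, start, end)
            for name, chrom, start, end in regions]

rows = build_region_rows('MY-CNV-PLATFORM',
                         [('TP53', '17', 7565097, 7590856)])
```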

Remap chromosomal regions data

tmtk.toolbox.remap_chromosomal_regions.map_index_to_region_ids(gene, origin_platform, region_origin)[source]
tmtk.toolbox.remap_chromosomal_regions.remap_chromosomal_regions(origin_platform=None, destination_platform=None, datafile=None, flag_indicator='.flag', to_dest=2, start_dest=3, end_dest=4, region_dest=1, chr_origin=2, start_origin=3, end_origin=4, region_origin=1, region_data=0)[source]
tmtk.toolbox.remap_chromosomal_regions.return_mean(datafile, mapping, flag_columns=None)[source]

Study Wizard

tmtk.toolbox.wizard.create_study(path)[source]

Create a study object by pointing to a folder that contains only clinical data files.

Parameters:path – path to folder with files.
Returns:study object.

Create study from templates

tmtk.toolbox.create_study_from_templates(ID, source_dir, output_dir=None, sec_req='Y')[source]

Create tranSMART files in designated output_dir for all data provided in templates in the source_dir.

Parameters:
  • ID – study ID.
  • source_dir – directory containing all the templates.
  • output_dir – directory where the output should be written.
  • sec_req – security required? “Y” or “N”, default=”Y”.
Returns:

None

The Arborist

tmtk.arborist.common module

tmtk.arborist.common.call_boris(to_be_shuffled=None, **kwargs)[source]

This function loads the Arborist if it has been properly installed in your environment.

Parameters:to_be_shuffled – has to be either a tmtk.Study object, a Pandas column mapping dataframe, or a path to a column mapping file.
tmtk.arborist.common.launch_arborist_gui(json_data: str, height=650)[source]
Parameters:
  • json_data
  • height
Returns:

tmtk.arborist.common.update_clinical_from_json(clinical, json_data)[source]
Parameters:
  • clinical
  • json_data
Returns:

tmtk.arborist.common.update_study_from_json(study, json_data)[source]
Parameters:
  • study
  • json_data
Returns:

tmtk.arborist.common.valid_arborist_or_exception(**kwargs)[source]

tmtk.arborist.connect_to_baas module

tmtk.arborist.connect_to_baas.get_instance_url(url)[source]
tmtk.arborist.connect_to_baas.get_json_from_baas(url, username=None)[source]

Get a json file from a Boris as a Service instance.

Parameters:
  • url – url to a BaaS instance.
  • username – if no username is given, you will be prompted for one.
Returns:

the JSON string from BaaS.

tmtk.arborist.connect_to_baas.json_url(url)[source]
tmtk.arborist.connect_to_baas.login_url(url)[source]
tmtk.arborist.connect_to_baas.publish_to_baas(url, json, study_name, username=None)[source]

Publishes a tree on a Boris as a Service instance.

Parameters:
  • url – url to a BaaS instance.
  • json – the stringified json you want to publish.
  • study_name – a nice name.
  • username – if no username is given, you will be prompted for one.
Returns:

the url that points to the study you’ve just uploaded.

tmtk.arborist.connect_to_baas.start_session(url, username)[source]

tmtk.arborist.jstreecontrol module

class tmtk.arborist.jstreecontrol.ConceptNode(path, var_id=None, node_type='numeric', data_args=None)[source]

Bases: object

class tmtk.arborist.jstreecontrol.ConceptTree(json_data=None)[source]

Bases: object

Build a ConceptTree to be used in the graphical tree editor.

add_node(path, var_id=None, node_type=None, data_args=None)[source]

Add a ConceptNode object to the list of nodes.

Parameters:
  • path – Concept path for this node.
  • var_id – Unique ID that allows keeping track of a node.
  • node_type – Explicitly set node type (highdim, numerical, categorical)
  • data_args – Any additional parameters are put in a ‘data’ dictionary.
column_mapping_file
Returns:Column Mapping file based on ConceptTree object.
high_dim_paths

All high dimensional nodes in concept tree as dict

jstree
tags_file
word_mapping
class tmtk.arborist.jstreecontrol.JSNode(path, oid=None, **kwargs)[source]

Bases: object

This class exists as a helper to the JSTree. Its “json_data” method can generate sub-tree JSON without putting the logic directly into the JSTree.

get_child(var_id, text)[source]
json_data()[source]
class tmtk.arborist.jstreecontrol.JSTree(concept_nodes)[source]

Bases: object

A json-like object that converts a list of nodes into something that jQuery jstree can use.

json_data

Convert this object to json ready to be consumed by jstree.

json_data_string
Returns:the json_data properly formatted as a string.
pretty(root=None, depth=0, spacing=2)[source]

Create a pretty representation of the tree.

to_clipboard()[source]
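ConceptTree and JSTree essentially turn delimited concept paths into the nested {'text': …, 'children': […]} structure that jstree consumes. An illustrative stand-alone sketch of that conversion (not tmtk's implementation), using the '\\' external path delimiter listed under Mappings:

```python
def paths_to_jstree(paths, delim='\\'):
    """Fold a list of delimited concept paths into a jstree-style
    nested list of {'text': ..., 'children': [...]} dicts."""
    root = []
    for path in paths:
        level = root
        for part in path.split(delim):
            # Reuse an existing node at this level if one matches.
            for node in level:
                if node['text'] == part:
                    break
            else:
                node = {'text': part, 'children': []}
                level.append(node)
            level = node['children']
    return root

tree = paths_to_jstree(['Subjects\\Demographics\\Age',
                        'Subjects\\Demographics\\Gender'])
```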
class tmtk.arborist.jstreecontrol.MyEncoder(skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)[source]

Bases: json.encoder.JSONEncoder

Overrides the standard JSON encoder to treat numpy ints as native ints.

default(obj)[source]
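Overriding default() is the standard json.JSONEncoder extension point. A self-contained sketch of the MyEncoder idea that coerces any int-like object (such as a numpy integer) to a native int; a stand-in class is used here so the sketch runs without numpy:

```python
import json

class IntCoercingEncoder(json.JSONEncoder):
    """Sketch of the MyEncoder idea: convert non-native integer
    types (e.g. numpy ints) to plain Python ints."""
    def default(self, obj):
        try:
            return int(obj)
        except (TypeError, ValueError):
            # Base class raises TypeError for unserializable objects.
            return super().default(obj)

class FakeNumpyInt:
    """Stand-in for a numpy integer, so this runs without numpy."""
    def __init__(self, value):
        self.value = value
    def __int__(self):
        return self.value

encoded = json.dumps({'n': FakeNumpyInt(7)}, cls=IntCoercingEncoder)
# encoded == '{"n": 7}'
```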
tmtk.arborist.jstreecontrol.create_concept_tree(column_object)[source]
Parameters:column_object – tmtk.Study object, tmtk.Clinical object, or ColumnMapping dataframe
Returns:json string to be interpreted by the JSTree
tmtk.arborist.jstreecontrol.create_tree_from_clinical(clinical_object, concept_tree=None)[source]
Parameters:
  • clinical_object
  • concept_tree
Returns:

tmtk.arborist.jstreecontrol.create_tree_from_study(study, concept_tree=None)[source]
Parameters:
  • study
  • concept_tree
Returns:

Data templates

This document describes how you can use tmtk to read your filled-in templates and write the data to tranSMART-ready files. The templates can be downloaded here.

Create study templates

Using the tmtk.toolbox.create_study_from_templates() function you can process any template you have filled in and output the contents to a format that can be uploaded to tranSMART. It takes the following parameters:

  • ID (Mandatory) Unique identifier of the study. This argument does not define the name of the study; the name will be derived from Level 1 of the clinical data template tree sheet.
  • source_dir (Mandatory) Path to the folder in which the filled in templates are stored. Template files are not searched recursively, so all should be in the same folder.
  • output_dir Path to the folder where the tranSMART files should be written to. If the path doesn’t exist the required folder(s) will be created. Default: ./<STUDY_ID>_transmart_files
  • sec_req Determines whether it should be a public or private study. Use Y for private or N for public. Default: Y

It is important that your source_dir contains just one clinical data template, which is detected by having “clinical” somewhere in the file name (case insensitive). If the template with general study level metadata is present it should have “general study metadata” in its name (case insensitive). All high-dimensional templates are detected by content, so file names are not important, as long as the names don’t conflict with the templates described above.
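The name-based detection rules above can be sketched as a simple case-insensitive filename scan (illustrative only; the real function also inspects file contents to recognise high-dimensional templates):

```python
def classify_templates(filenames):
    """Split template file names into clinical, study metadata and other,
    following the case-insensitive naming rules described above."""
    clinical = [f for f in filenames if 'clinical' in f.lower()]
    metadata = [f for f in filenames if 'general study metadata' in f.lower()]
    other = [f for f in filenames
             if f not in clinical and f not in metadata]
    return clinical, metadata, other

clin, meta, high_dim = classify_templates(
    ['Clinical_template.xlsx',
     'General Study Metadata.xlsx',
     'proteomics_data.xlsx'])
```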

Note: It is possible to run the function with only high-dimensional templates, but keep in mind that in that case the concept paths will have to be manually added to the subject-sample mapping files.

# Load the toolkit
import tmtk
# Read templates and write to tranSMART files
tmtk.toolbox.create_study_from_templates(ID='MY-TEMPLATE-STUDY',
                                         source_dir='./my_templates_folder/',
                                         sec_req='N')

Contributors

  • Stefan Payrable
  • Ward Weistra