Poster Presentation 39th Annual Lorne Genome Conference 2018

Bayesian neural network based modelling of steady-state splicing mechanism. (#260)

KANUPRIYA TIWARI, Lars K Nielsen

Rationale: Most splicing models either assemble a sequence-based “splicing code” in a single tissue or rank sequence features by their importance in different tissues, making implicit assumptions about splice factor availability. From a systems biology perspective, mechanistic modelling of splicing is complicated by the expanding repertoire of core splicing machinery components and auxiliary factors involved. However, given the wealth of publicly available RNA-seq data, machine learning methods can be trained on these data to learn the underlying mechanisms. We propose that, by incorporating splice factor expression signatures and sequence features as inputs, a neural network model can be trained to predict the most likely splicing pattern of a sequence, taking advantage of the method's ability to learn non-linear dependencies among its inputs.

Methods: We have built a three-layer Bayesian neural network for predicting the direction of inclusion of the central exon in a triplet of exons. Each training data point is a set of input features, including 1400 sequence-based features and splice factor expression values (a linear combination of each factor's expression and its binding site counts) corresponding to 1 of 16 human tissues, together with the matched output: the probability of inclusion of the central exon in that tissue. A sparsity prior was assigned to the network weights to regularize the model, and inference was performed using Gibbs sampling.
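
The input construction and prediction step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: the number of splice factors (N_FACTORS), the hidden-layer width (N_HIDDEN), the coefficients a and b of the linear combination, the reading of "three-layer" as input, one hidden layer, and output, and the treatment of the 1400 sequence features as separate from the factor features. Weight samples are drawn from a Laplace distribution as a stand-in for a sparsity prior; they are not posterior samples from Gibbs sampling, which is beyond the scope of this sketch.

```python
import numpy as np

N_SEQ_FEATURES = 1400  # number of sequence-based features stated in the abstract
N_FACTORS = 50         # hypothetical number of splice factors
N_HIDDEN = 30          # hypothetical hidden-layer width


def factor_features(expression, site_counts, a=1.0, b=1.0):
    """Per-factor input as a linear combination of the factor's expression
    and its binding-site count. The coefficients a and b are hypothetical."""
    expression = np.asarray(expression, dtype=float)
    site_counts = np.asarray(site_counts, dtype=float)
    return a * expression + b * site_counts


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward(x, w1, b1, w2, b2):
    """One hidden tanh layer, then a sigmoid output read as the
    probability of inclusion of the central exon."""
    hidden = np.tanh(x @ w1 + b1)
    return sigmoid(hidden @ w2 + b2)


def sample_weights(rng, n_in, n_hidden, scale=0.05):
    """Draw one weight set from a Laplace distribution, used here only as a
    placeholder for a sparsity prior (not a Gibbs posterior sample)."""
    w1 = rng.laplace(0.0, scale, (n_in, n_hidden))
    w2 = rng.laplace(0.0, scale, n_hidden)
    return w1, np.zeros(n_hidden), w2, 0.0


def predictive_mean(x, weight_samples):
    """Average the network output over weight samples, as a Bayesian
    prediction would average over posterior draws."""
    return float(np.mean([forward(x, *w) for w in weight_samples]))


rng = np.random.default_rng(0)
seq = rng.random(N_SEQ_FEATURES)          # synthetic sequence features
expr = rng.random(N_FACTORS)              # synthetic factor expression in one tissue
counts = rng.integers(0, 5, N_FACTORS)    # synthetic binding-site counts

x = np.concatenate([seq, factor_features(expr, counts)])
samples = [sample_weights(rng, x.size, N_HIDDEN) for _ in range(20)]
psi = predictive_mean(x, samples)
print(f"predicted inclusion probability: {psi:.3f}")
```

With a trained model, re-running the prediction after altering expr while holding seq and counts fixed would give a simple in-silico readout of how a change in factor expression shifts the predicted inclusion of an exon.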

Impact: While models exist for predicting the effects of sequence variants on splicing, these variants account for only ca. 15% of disease-causing mutations, whereas several diseases result from aberrant splice factor expression. The current model is a useful tool for assessing the impact of changes in factor expression on the transcriptome of a cell, which can be used to narrow down the number of possibly affected transcripts for further investigation.