==============================
B R A I N
==============================

:Authors: Lan Hui
:Version: 0.1.6 of 2022-10-09

*Revision history: 9 June 2017, 22 June 2017, 4 July 2017, 11 July 2017 (created Section Drawing sub-networks), 18 July 2017 (created Section Uploading peak files), 10 October 2017 ('Enter a set of genes' under Section Exploring the network; created Section uRNA), 24 Jan 2018 (Enter 'AT1G24260 AGL9;SEP3 tissue specific tau 1.9' under Section Exploring the network), 3 August 2019 (design considerations), 5 August 2019 (updated create_edges0.py, added a section FAQs), 7 August 2019 (files needed for drawing a scatterplot online), 11 August 2019 (added information for correlation_per_group_fixed_number.R), 22 November 2019 (review and revise the entire content of this documentation), 4 April 2020 (use html_network.py to generate static html pages that contain a gene's regulators and regulatees; use cron job to copy these files to the web application), 6 February 2021 (write a section describing the running environment for brain), 9 October 2022 (remove PARAMETER_FOR from PARAMETER_FOR_BUILDRMATRIX_RENEW_INTERVAL)*

.. contents::

Preface
----------

This document contains usage and some implementation details for BRAIN, a project envisioned by Philip A. Wigge while I worked as a Research Associate in his lab at SLCU_.

.. _SLCU: https://www.slcu.cam.ac.uk

This document also describes how to run BRAIN locally on your own computer.

.. This document is under construction. It serves as a memo for the authors.

Introduction
----------------

BRAIN (Biological Regulation Analysed via Inferred Networks) grows and refines a direct transcriptional regulatory network whose internal wiring evolves together with the rapidly growing short-read data deposited in the European Nucleotide Archive (ENA) or in the Sequence Read Archive (SRA). As of 28 Feb 2017, there were 13,538 Arabidopsis RNA-seq datasets in ENA, far exceeding the number of microarray datasets AraNet used (1,074).

BRAIN's scatterplot-driven approach produces **interpretable hypotheses on direct regulation** between genes. It seeks simple linear relationships, context-dependent linear relationships and post-translational relationships (wedge-shaped_) in each established scatterplot. Context is defined by tissues, or by clusters of RNA-seq samples. BRAIN maintains historical versions for all edges, enabling us to retain well-supported edges and forget spurious ones. It provides an easy interface for browsing_. It is a long-running process that requires little human intervention (maintenance is still required).

.. It is portable to other organisms (needs effort to achieve that).

BRAIN mainly uses two types of data: gene expression data (RNA-seq samples from ENA or SRA) and protein binding data (ChIP-seq, DAP-seq, etc). Using binding data, we focus on direct regulation, and using many RNA-seq samples, we get statistically significant co-expression.

**A bit of biology.** A phenotype is the result of the expression of certain genes, and the expression of these genes is controlled upstream by certain transcription factors. So if we know which transcription factors control which genes and know the genes determining a phenotype, then in principle, by controlling the expression of transcription factors, we are able to engineer that phenotype. Imagine a tap system. TFs are like valves regulating the fluid coming out of a tap, the target gene.
We have "pioneer" valves (Daphne Ezer's words) to turn the tap on or off and 'modulator' valves to further fine-tune the amount of fluid from that tap. For example, it is the different amounts of fluid from the taps that make different tissues. Connecting these taps makes a network.

.. (Is the tap system a good analogy???)

Though more advanced methods do exist for establishing a relationship among genes (references), a simple yet interpretable approach is using scatterplots, where we just look at one TF and one target gene at a time. We would expect to see a linear trend in the scatterplot if the TF *always* regulates the target's expression, regardless of biological context. In our experience, this linear trend is often weak or absent, because transcriptional regulation is often context-dependent (e.g., tissue-dependent, treatment-dependent, circadian clock-dependent, etc.) or post-translational. Therefore, we identify a subset of points from the scatterplot that could form a trend. These points may come from the same context. One way to do this is using Mixture of Regressions (mixtool.useful.case_). More specifically, if gene A regulates gene B (with the possibility of interacting with gene C), then we expect to see, in certain conditions, that A binds to B and that A's expression is correlated with B's.

.. A useful feature is to detect these conditions (mixtool.useful.case_).

Regulation can also be controlled at the post-translational stage. Whether or not protein A can regulate gene B depends on A's activity. It could be the case that A's gene expression remains high but A's protein product is inactive, unable to regulate B. Protein A can regulate B only when it is activated by another protein C. In this case, there is no clear co-expression between A and B. Instead, A's gene expression level remains high while B's gene expression level fluctuates, resulting in a vertical strip on the right of the scatterplot. So we also look for wedge-shaped_ scatterplots to capture this phenomenon.

.. _browsing:

**Availability**. BRAIN was available at http://172.26.114.34:5000, and is now available at http://118.25.96.118/brain. The source code will be available at https://gitlab/slcu/teamPW.

.. Therefore, C might be one of the genes that are correlated with B.

.. Since we have lots of RNA-seq data, we can set aside a validation set to measure the confidence of an edge, in addition to correlation based solely on the training data. Even better, we can use genuinely "unseen" data that will be generated in the future, enabling us to test the network.

.. Delay effect since a TF needs time to become a functional protein.

.. More-complex models such as multiple linear regression (with various kinds of regularisers) or random forest regression (an ensemble method averaging the output of hundreds of regression trees) could be used, but their results are less interpretable than the simple, linear, pairwise correlation.

.. Their ability to select the most relevant regulating genes is less needed in the context of direct binding. Multicollinearity might also pose a problem for these models.

Getting started
----------------

Copy the whole directory structure (was */home/hui/network/v03*, now */home/lanhui/brain*) to your working directory. The most time-consuming step is to prepare the mandatory files for BRAIN. Conservatively speaking, allow a few days.
The user needs to manage three text files: parameter_for_buildCmatrix.txt (for protein binding), parameter_for_buildRmatrix.txt (for gene expression) and parameter_for_net.txt (for network generation). If the user has a few BED files at hand (e.g., from peak calling on ChIP-seq data) and wants to run a custom gene regulatory network locally, see run.network.locally_.

Preparing mandatory files
~~~~~~~~~~~~~~~~~~~~~~~~~

The user must have the following three text files ready before being able to build the gene regulatory network.

- Data/parameter/parameter_for_buildCmatrix.txt  The main purpose of this file is to help build **binding.txt**. Example: parameter.for.buildCmatrix_.
- Data/parameter/parameter_for_buildRmatrix.txt  The main purpose of this file is to build **TPM.txt**. Example: parameter.for.buildRmatrix_.
- Data/parameter/parameter_for_net.txt  The main purpose of this file is to create edge files. Example: parameter.for.net_.

After these three files are ready, run the following command and you are good to go::

    cd Code && python update_network.py

The above three files contain links to or information about the data that BRAIN consumes. In the future, one only needs to edit these three files to add ChIP-seq, RNA-seq, etc. Of course, it is possible to update these files automatically. Editing parameter_for_buildRmatrix.txt is achieved by executing make.parameter.rnaseq.py_ in update_network.py. Downloading and mapping RNA-seq data is achieved by executing download.and.map.py_. Mapped data are stored in text files (e.g., ERR588039_quant.txt) and put in *Data/R/Mapped/public*. Raw data are stored temporarily under *Data/R/Raw*. Since RNA-seq raw data are large, remember to remove files in Data/R/Raw regularly (now done by default). *Data/log/download_log.txt* contains a list of downloaded RNA-seq IDs. Of course, if we get mapped RNA-seq data from elsewhere (not from running download_and_map.py), then we just need to copy them manually to Data/R/Mapped/public for BRAIN to incorporate them in the next update.

update_network.py periodically checks whether BRAIN has accumulated enough new RNA-seq samples. If yes, it will update parameter_for_buildRmatrix.txt. The user can change the variables BUILDRMATRIX_RENEW_INTERVAL and MIN_RNA_SEQ_INCREASE in configure.py to customize the update frequency.

The user manually updates parameter_for_buildCmatrix.txt. (Automatic updating is possible, but not well tested. See upload.peak.files_ for more details.)

Other mandatory files to prepare
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In the above three parameter files, the user needs to specify paths to certain files, e.g., GENE_LIST, GENE_ID_AND_GENE_NAME and GENE_FILE.

- **GENE_LIST.** Each line contains one gene ID. The gene ID should not contain isoforms. For example, use AT1G01020 but not AT1G01020.1 or AT1G01020.2. Example: *Data/information/gene-list.txt* contains a list of AGI IDs, one per line. Only these IDs will be used when building the regulatory network. This file is used both in parameter_for_net.txt and in parameter_for_buildRmatrix.txt.
- **GENE_ID_AND_GENE_NAME.** Each line contains one gene ID and one gene name. If a gene has multiple names, separate them with semicolons. Example: the file *Data/information/AGI-to-gene-names_v2.txt* has two columns. The left column is the AGI, and the right column is the gene names (separated by semicolons). This file is used in many places where a gene name is needed, for example, in create_edges4.py.
- **GENE_FILE.** Each line contains a gene ID, gene name, gene position, gene orientation and gene annotation (optional). See gene_file_. Example: *Data/information/gene_file.txt*. This file is used to help convert a narrowPeak file into a binding file. An example file is available at gene_file.txt.gz_

  .. _gene_file.txt.gz: http://118.25.96.118/brain/static/info/gene_file.txt.gz

- **rnaseq_info_database.json** -- this file must be put under *Data/information*. All experimental conditions for RNA-seq samples are contained in this file. This file should be updated quarterly. For how to make this file, see rnaseq.info.database_.

Automating downloading and updating the network
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Instead of manually executing python update_network.py from time to time, we can make a *cron job* for it. To do so,

- In the terminal, type::

    crontab -e

- Append the following lines::

    59 23 * * * cd /home/lanhui/brain/Code && 'ps -eF | grep python3 | grep update_network' || python3 update_network.py
    1 6 * * * cd /home/lanhui/brain/Code && python3 download_and_map.py

This makes update_network.py run every day at 23:59, and makes download_and_map.py run every day at 06:01. Don't forget to change '/home/lanhui/brain' to the path to your working directory. The first line in the crontab is long; its purpose is to run update_network.py only if it is not already running. There is a better method (being used now)::

    59 23 * * * cd /home/lanhui/brain/Code && flock /tmp/brain.lockfile ./update_network.py

We should make update_network.py executable (chmod u+x update_network.py) and add a shebang line (#! /usr/bin/python3) at the top of update_network.py.

update_network.py could take a few days to finish if there are lots of (e.g., millions of) pieces of binding evidence to check. In order to harvest updated edges each day, use update_network_by_force.py, which only incorporates newly computed edges without computing any itself (computing them is the job of update_network.py).

The files generated above (i.e., edges.txt, edges.sqlite, and GeneID.html) must be copied to the web application folder for it to work properly::

    15 9,17 * * * sudo lsof /home/lanhui/brain/Data/temp/edges.txt || sudo cp /home/lanhui/brain/Data/temp/edges.txt /var/www/brain/brain/static/edges/edges.txt
    20 9,17 * * * sudo lsof /home/lanhui/brain/Data/temp/html_edges/edges.sqlite || sudo cp /home/lanhui/brain/Data/temp/html_edges/edges.sqlite /var/www/brain/brain/static/edges
    30 9,17 * * * sudo lsof /home/lanhui/brain/Data/temp/edges.txt || sudo find /home/lanhui/brain/Webapp/static/edges/ -name "AT*.html" -exec cp -t /var/www/brain/brain/static/edges {} +

"lsof filename" exits with status zero only if the file is currently open (i.e., still being written). Because of the OR operator (||), the second command runs only when the first one fails, so each copy happens only when the source file is not in use. "AT*.html" matches the html files for all Arabidopsis thaliana genes; Arabidopsis thaliana gene IDs start with the two letters "AT". We use the "find" command to copy the static html files because there are tens of thousands of them (one for each Arabidopsis gene ID), too many to pass to a single conventional "cp" command.

.. _upload.peak.files:

Uploading peak files
----------------------

(WORK IN PROGRESS) Users can upload their own peak files (in BED6+4 format) by visiting this link: http://172.26.114.34:5000/upload-chip. Our team will process your peak files and the results will show up in BRAIN. The user should at least provide protein ID information in the form.
The uploaded file will be stored in a folder specified by UPLOAD_FOLDER in start_webapp.py. We can add these peak files to parameter_for_buildCmatrix.txt::

    python make_upload_chip_parameter.py

Also, users can download BRAIN and run it locally. See run.network.locally_. This requires sufficient computational power (in our case 100GB of memory and a 32-core CPU), mainly for computing large distance matrices.

Running environment
---------------------------

BRAIN can be run on an Ubuntu server. This server has 4 processor cores, 8GB of RAM, and 50GB of disk space. The front-end is located at ``/var/www/brain``. Log in as root to edit its cron jobs. The back-end is located at ``/home/lanhui/brain``. Log in as the normal user lanhui to edit its cron jobs.

.. _run.network.locally:

Running a network locally
---------------------------

A biologist may be interested in building their own custom gene regulatory network. They can run BRAIN locally, for example, when they have a BED file from a ChIP-seq experiment and want to find the protein's potential targets. To do this, they need adequate computational power and certain packages.

Copy the entire directory structure to your working directory. (The directory is prepared by /home/hui/network/copy_dir.sh.) Go to the Code directory. Run ``python local_network.py``. Copy your BED files to the directory Data/C/Mapped. Edit Data/parameter/parameter_for_buildCmatrix.txt (Example: parameter.for.buildCmatrix_). Fill in PROTEIN_ID, PROTEIN_NAME, DATA_NAME, DATA_FORMAT, and LOCATION. Make sure DATA_FORMAT is narrowPeak, and LOCATION is ``../Data/C/Mapped/xxx.narrowPeak``. Also set a sample ID after @. See sample.id.naming.convention_ for how to generate a ChIP-seq sample ID. Run ``python local_network.py`` again. Follow the instructions given by the program. Edges will be generated in the background. You can run ``python local_network.py`` regularly to add more edges to the network. If you later add more BED file entries to Data/parameter/parameter_for_buildCmatrix.txt, then set FORCE_MAKE_EDGES in local_network.py to 'YES' to produce new edges.

To display the network, uncomment app.run(debug=True) in Webapp/start_webapp.py and comment out the previous line app.run(). Type this command: ``cd Webapp && python start_webapp.py``. Enter http://127.0.0.1:5000 in the address bar of your browser.

.. _sample.id.naming.convention:

Sample ID naming convention
---------------------------

Assign each dataset a unique 14-character ID using a consistent naming convention, for easy future reference::

    1st letter:      C=ChIP, R=RNA-seq (data type)
    2nd letter:      0=unspecified, 1=success, 2=failure (data status)
    3rd-5th letter:  lab id, 000=unspecified, 001=Phil's lab, DRR, SRR, ERR for RNA-seq
    6th-14th letter: 9 letters, detailed as follows:

        If lab id is not DRR/SRR/ERR:

            0000      000         00
            reserved  library ID  sample ID

        If the library is a replicate, the left-most digit is 1,2,3,4,...

For example, C0001000008404 means the sample is ChIP-seq data, status unknown, from Phil's lab, library number 84C, and sample number 04. If the lab id is DRR/SRR/ERR, then the 9 letters are filled from left to right, and the remaining positions (if any) are filled with X's. For example, R0000ERR588040XXX is a sample ID for sample ERR588040 from ENA.

Currently, for ChIP-seq data we have C0001... (Phil's lab), C0002... (DAP-seq downloaded by Daphne Ezer), C0003... (inferred peaks from C0002... based on the idea that TFs from the same family tend to have similar binding sites), C0004... (P. Cubas lab in Spain), and C0005...
(Ecker lab).

Building a gene expression matrix
----------------------------------

The purpose of this step is to build a gene expression matrix (TPM.txt), using RNA-seq samples from various sources, e.g., ENA or SRA. The source data for TPM.txt are the quant.sf files produced by Salmon. The most important command is::

    python buildRmatrix.py ../Data/parameter/parameter_for_buildRmatrix.txt

.. _parameter.for.buildRmatrix:

- Input: parameter_for_buildRmatrix.txt, generated by the script make_parameter_rnaseq.py (check its head for usage)::

    %%GENE_LIST=../Data/information/gene-list.txt
    %%HOLDON=NO

    @R0001000014601
    DATA_NAME:LAB001_LIB14601
    DATA_FORMAT:txt
    DESCRIPTION:in-house
    LOCATION:/home/hui/network/v03/Data/R/Mapped/inhouse/146R/A2_S1_quant.txt
    NOTE: non zero ratio is 0.67

  A2_S1_quant.txt is generated by Salmon (see download_and_map.py), and has the following format::

    Name         Length  EffectiveLength  TPM      NumReads
    AT1G01010.1  1688    1282.48          8.13289  135
    AT1G01020.1  1623    1217.48          3.73382  58.8375
    AT1G01020.2  1085    679.483          22.1912  195.163
    ...

  .. make_parameter_rnaseq.py assembles all small TPM files. Specify a gene list at the top of parameter_for_buildRmatrix.txt so that only the TPM values for these genes are collected.

- Output: a matrix of TPM values, stored in the file TPM.txt (under *Data/history/expr*). If a gene has isoforms, the largest TPM value among these isoforms is used.

Getting a final TPM matrix involves first downloading RNA-seq data and mapping them.

- (Obsolete) Download fastq data:

  - Make a download script::

      python /home/hui/SRA/make_download_command.py sample_list.txt > download_2016_12_06.sh

    Input: a list of DRR/ERR/SRR IDs. See ena_DRR.txt, ena_ERR.txt, ena_SRR.txt.

  - Batch download::

      bash download_2016_12_06.sh

    Data will be saved in the directory ftp.sra.ebi.ac.uk

- (Obsolete) Map data (using salmon)::

    python get_TPM_by_salmon.py gz_files.txt

  Quant files are saved in the directory RESULT_DIR (e.g., result), defined in get_TPM_by_salmon.py. The above command takes an argument, gz_files.txt, each line containing the path to a fastq.gz file. To make this file, follow these steps::

    cd /home/hui/SRA/download05/
    find ftp.sra.ebi.ac.uk -name '*gz' -exec mv -t . {} +
    find /home/hui/SRA/download05/ -name "*.gz" > gz_files.txt

.. _download.and.map.py:

- The above two steps are combined into one in download_and_map.py::

    python download_and_map.py

  Edit DAILY_DOWNLOAD_NUMBER in configure.py to change the number of FASTQ files to download each time, then execute the above command. You can automate this process using a crontab entry, for example::

    01 23 * * 1,3,5 cd /home/hui/network/v03/Code && python download_and_map.py

  Add the above line to the cron table using the command *crontab -e*. The tool at https://crontab.guru/ is helpful for choosing schedule times interactively. In the above example, downloading and mapping will happen every Monday, Wednesday and Friday at 11pm. download_and_map.py requires rnaseq.info.database_, so prepare it first.

.. _make.parameter.rnaseq.py:

- make_parameter_rnaseq.py::

    python make_parameter_rnaseq.py [id-list.txt] > ../Data/parameter/parameter_for_buildRmatrix.txt

  Optional input: a list of SRA run IDs, starting with SRR/DRR/ERR. If no input is specified, edit the variable QUANT_PATH, a list of paths to all quant files. A quant file name from the repositories SRA or ENA has this format: SRR1802166_quant.txt. A quant file name from in-house data has this format: \*_S1_quant.txt. The in-house quant files must be organised into folders called #R (e.g., 146R), where '#' is a number.
Output example, parameter_for_buildRmatrix.txt. This is an IMPORTANT file to keep and will be used in the following steps. It contains the following information:: %%GENE_LIST=gene-list.txt %%HOLDON=NO @R0000ERR588039XXX DATA_NAME:ERR588039 DATA_FORMAT:txt DESCRIPTION:SRA LOCATION:/home/hui/network/R/Mapped/public/ERR588039_quant.txt NOTE: .. Manually prepend the line %%GENE_LIST=gene-list.txt. - Download RNA-seq information (meta data), ena_run.xml, ena_study.xml and ena_experiment.xml:: python download_ena_metadata.py Change LIBRARY_STRATEGY to 'RNA-seq' to download metadata for RNA-seq, and to 'ChIP-seq' to download metadata for ChIP-seq. For more information, see the source code. .. _rnaseq.info.database: - Make RNA-seq information database, Data/information/rnaseq_info_database.json:: python parse_ena_xml.py > rnaseq_info_database.txt (obsolete) python parse_ena_xml_test.py Generate an RNA-seq information file required by html_network.py and download_and_map.py. rnaseq_info_database.txt contains 8 columns: run_id, sample_id, experiment_id, study_id, study_id_PRJ, title, alias, and description. At minimum, provide the first column run_id (RNA-seq SRA IDs), and use a dot for other missing information. :: run_id sample_id experiment_id study_id study_id_PRJ title alias description DRR008476 SAMD00009103...DRS007600 Arabidopsis thaliana WT_Col mRNA-seq 30664389 2330493564 DRX007662 DRP001015 PRJDB2180 WT Col_mRNA-seq DRR008476

[Study title] Mechanism for full-length RNA processing over intragenic heterochromatin

[Study description] Mechanism for full-length RNA processing over intragenic heterochromatin

[Sample title] Arabidopsis thaliana WT_Col mRNA-seq

[Sample description] None

[Experiment title] Illumina Genome Analyzer IIx sequencing; Arabidopsis WT-Col mRNA_seq

[Experiment description] None

The program's input is four xml files: run.xml, study.xml, experiment.xml and sample.xml, manually downloaded from the ENA website (https://www.ebi.ac.uk/ena). See the head of parse_ena_xml.py for more detail. To get these files,

- visit http://www.ebi.ac.uk/ena/data/search?query=arabidopsis%20thaliana
- Search for the keyword "arabidopsis" in the upper-right corner of the web page, then click Run (24,237) on the left panel, and click the XML link on the right panel to download.
- Repeat these steps for Study, Experiment and Sample.

You can also download these files using ``download_ena_metadata.py``. However, it is only able to download run.xml, study.xml and experiment.xml. We need to download sample.xml manually. The input also includes SraRunTable.txt downloaded from https://www.ncbi.nlm.nih.gov/sra. (The steps for preparing rnaseq_info_database.txt/json should be much simplified. Future work.)

**rnaseq_info_database.txt** is useful in that it enables us to extract experiment, project, and sample information for the RNA-seq data we downloaded. For example, html_network.py retrieves this information when making an html edge page (e.g., AT3G11440_AT1G01020.html), and words within the information text can be ordered by their frequency. More specifically, for each RNA-seq sample used to establish the edge, we obtain its experimental information text, split it into words, and count the frequency of each word. This gives us a rough idea of the experimental conditions in which these two genes are co-expressed. This file is also used in assign_tissue.py, make_parameter_rnaseq.py and update_network.py. The by-product of the above script is *Data/information/rnaseq_info_database.json*, a database in JSON format where each record contains two elements, *tissue* and *detail*.

- A utility script, ``count_word.py``. This script gets all descriptive words in rnaseq_info_database.txt and orders them by their frequencies. Useful for a quick look at the most prevalent conditions and tissues. Usage::

    python count_word.py rnaseq_info_database.txt

Building a protein binding matrix
-----------------------------------

The purpose of this step is to build a binding matrix (binding.txt) using various types of data, e.g., ChIP-seq data, DAP-seq, etc. The source data for binding.txt are either narrowPeak files or BigWig files. The most important command is::

    python buildCmatrix.py ../Data/parameter/parameter_for_buildCmatrix.txt

.. _parameter.for.buildCmatrix:

Take a look at parameter_for_buildCmatrix.txt::

    %%TARGET_RANGE=4800
    %%INCLUDE_GENE_BODY=YES
    %%BOTH_SIDE=NO
    %%OUTPUT_BINDING_STRENGTH=YES
    %%GENE_FILE=../Data/information/gene_file.txt
    %%CHR_INFO=../Data/information/chromosome.info.txt
    %%REBUILD_LIST=
    %%DESTINATION=../Data/C/Mapped/Columns
    %%HOLDON=NO

    @C0001000007319
    PROTEIN_ID:AT2G25930
    PROTEIN_NAME:ELF3
    DATA_NAME:gELF3-FLAG-elf3-1-12_S19
    DATA_FORMAT:narrowPeak
    DESCRIPTION:SIGNAL=grown in chamber 22C, 9am to 5pm, shift to 12C, start shift 5pm, collect at 6:30pm. PHENOTYPE=NA
    LOCATION:/home/hui/network/C/Mapped/inhouse/73C/LIB73C_gELF3-FLAG-elf3-1-12_S19_peaks.narrowPeak
    NOTE:update:20170101

DATA_FORMAT accepts two values, narrowPeak or bw. The user must set HOLDON to YES while editing this file, and set it back to NO to make the edits take effect. The user adds "SIGNAL=" and "PHENOTYPE=" for each ChIP-seq experiment if available. This way, we know which signals cause which phenotype.
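The ``%%KEY=VALUE`` header lines and ``@SampleID`` stanzas shown above are plain text and easy to read programmatically. The sketch below is only an illustration of the format as it appears in the example; BRAIN's own scripts (e.g., buildCmatrix.py and update_network.py) contain the actual parsing logic, and the ``parse_parameter_file`` helper and the example path are assumptions made for this illustration::

    # Illustration only: read the %%KEY=VALUE header and @SampleID stanzas.
    def parse_parameter_file(path):
        header = {}        # e.g. TARGET_RANGE, GENE_FILE, DESTINATION, HOLDON
        samples = {}       # sample ID -> {FIELD: value}
        current = None
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith('#'):        # skip blanks and ### comments
                    continue
                if line.startswith('%%'):
                    key, _, value = line[2:].partition('=')
                    header[key] = value
                elif line.startswith('@'):
                    current = line[1:]                       # e.g. C0001000007319
                    samples[current] = {}
                elif current is not None:
                    field, _, value = line.partition(':')
                    samples[current][field] = value.strip()
        return header, samples

    if __name__ == '__main__':
        header, samples = parse_parameter_file(
            '../Data/parameter/parameter_for_buildCmatrix.txt')   # example path
        if header.get('HOLDON', 'NO') == 'YES':
            print('File is still being edited (HOLDON=YES); do nothing.')
        for sid, fields in samples.items():
            if fields.get('PROTEIN_ID', '') in ('', 'id_unknown'):
                print(sid, 'would be ignored (no usable PROTEIN_ID).')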
In parameter_for_buildCmatrix.txt, specify GENE_FILE, which contains the position of each gene and guides buildCmatrix.py to find target genes. GENE_FILE (i.e., *Data/information/gene_file.txt*) is prepared by prepare_gene_file.py::

    python prepare_gene_file.py all-ath-gene-position.txt > gene_file.txt

all-ath-gene-position.txt is in BED format, as follows::

    1  3630   5899   AT1G01010  .  +
    1  5927   8737   AT1G01020  .  -
    1  11648  13714  AT1G01030  .  -
    1  23145  31227  AT1G01040  .  +
    1  28499  28706  AT1G01046  .  +
    1  31169  33153  AT1G01050  .  -

.. _gene_file:

The format of gene_file.txt is::

    AT1G01010  ANAC001  1  3630  5899  +  protein_coding  NAC domain containing protein 1
    AT1G01020  ARV1     1  5927  8737  -  protein_coding  Arv1-like protein ARV1; CONTAINS InterPro DOMAIN/s: Arv1-like

Fields are gene id, gene name, chromosome number, gene start position, gene end position, gene orientation and gene annotation (optional). The fields are separated by TAB.

We need a few more steps to create parameter_for_buildCmatrix.txt, as follows. Later, we can just append to parameter_for_buildCmatrix.txt as more ChIP samples are added. DESTINATION specifies the path to individual binding files. Each binding file is generated by get_binding.py and has the following format::

    gene_id    C0003000001450
    AT1G01010  0
    AT1G01020  0
    AT1G01030  0
    AT1G01040  0
    AT1G01046  0
    ...

As mentioned above, parameter_for_buildCmatrix.txt accepts the following types of files:

- **narrowPeak files.** Generated by the peak calling program, MACS2.
- **bw files.** bwfiles.txt --> (UNIX uniq command) --> bwfiles_unique.txt --> make_parameter_bw.py (see the head of this file for usage and input). Here is an example output (the lines beginning with %% need to be added manually)::

    %%TARGET_RANGE=4800
    %%INCLUDE_GENE_BODY=YES
    %%BOTH_SIDE=NO
    %%OUTPUT_BINDING_STRENGTH=YES
    %%GENE_FILE=gene_file.txt

    ################## In-house ChIP-seq ######################
    @C0001100007100
    PROTEIN_ID:AT3G54560
    PROTEIN_NAME:HTA11
    DATA_NAME:2_71C_12_IP-HTA11-17c-zero-BIS
    DATA_FORMAT:bw
    DESCRIPTION:in house ChIP data
    LOCATION:/media/pw_synology3/PW_HiSeq_data/ChIP-seq/Mapped_data/2_71C/20150603_H2AZ_3nd_rep_mapped_bw/2_71C_12_IP-HTA11-17c-zero-BIS_raw_trimmo_paired_truseq3-PE-2_2_10_5_1_bowtie2_TAIR10_ensembl_nomixed_sorted_rmdup_picard_genomenorm.bw
    NOTE:

- DAP-seq files::

    python make_parameter_dapseq2.py
    python make_parameter_dapseq3.py

  Append the result from running the above command(s) to parameter_for_buildCmatrix.txt.

**Infer binding data.** For transcription factors not in the DAP-seq data, infer their binding positions from the DAP-seq data. The idea is that transcription factors within the same family bind to similar places (called conserved binding). So for each TF family, get the shared binding intervals among TFs using the DAP-seq data (common_peak.py). Use these shared intervals as highly likely binding positions for transcription factors not in DAP-seq but in the same TF family.

**Prepare files for make_parameter_dapseq3.py.** gene_families_sep_29_09_update.txt (downloaded from TAIR) --- [grep 'Transcription factor'] ---> tffamily.txt --- [tffamily.py] ---> tffamily.simple.txt. These files are in the directory dapseq_merged. Update the inferred ChIPs C0003.. [make_parameter_dapseq3b.py and dapseq3-output-20170116.txt]

Individual binding files are stored in DESTINATION, defined in parameter_for_buildCmatrix.txt. Produce these files using get_binding.py. This program will get the (maximum) binding strength for each gene in gene_file.txt. See the head of get_binding.py for usage. Other parameter options: %%REBUILD_LIST and %%HOLDON.
REBUILD_LIST is a list of ChIP IDs whose bindings are going to be re-built; the results will be put in DESTINATION. HOLDON will prevent update_network.py from updating the network on changes in parameter_for_buildCmatrix.txt. Don't manually modify REBUILD_LIST; leave it empty or set it to all.

Adding, deleting or updating ChIP-seq data is achieved by editing the NOTE fields in parameter_for_buildCmatrix.txt. The following three actions are allowed and described below.

Deleting BED files
~~~~~~~~~~~~~~~~~~

To prevent a BED file for a ChIP-seq ID from being included in binding.txt, add the word ``obsolete`` to the NOTE field for that ChIP-seq ID::

    NOTE:obsolete

Side effects: this will delete any edges established SOLELY based on this ChIP-seq ID.

Updating old BED files
~~~~~~~~~~~~~~~~~~~~~~

In the NOTE field, add update:yyyymmdd, where yyyymmdd is the date you updated the BED file. For all newly added ChIP-seq data, make sure to put update:yyyymmdd. This information is used by buildCmatrix.py to decide whether a **recent** binding.txt should include this ChIP-seq data (if it is older than one week, it will not get included).

Adding new BED files
~~~~~~~~~~~~~~~~~~~~~~

To add new peak files (in BED format, as returned by the peak calling software MACS2), append to parameter_for_buildCmatrix.txt. The location of the new peak file is important, so make sure you correctly specify its path in the LOCATION field. The positions in BED files don't have to be sorted. The chromosome number in the BED file can be a number, or 'chr' followed by a number. To make the change take effect in the next update, set HOLDON to NO in parameter_for_buildCmatrix.txt. In the NOTE field, add update:yyyymmdd, where yyyymmdd is the date you add the new data. Adding this date is not required, but desirable. If the update date is older than 15 days, the ChIP-seq ID won't be considered new. However, if the ChIP-seq ID is new to parameter_for_buildCmatrix.txt, then it will still be included, regardless of the date. Example::

    @C0005000000101
    PROTEIN_ID:AT1G49720
    PROTEIN_NAME:ABF1
    DATA_FORMAT:narrowPeak
    DESCRIPTION:SIGNAL=ABA PHENOTYPE=unknown
    LOCATION:/home/hui/network/v03/Data/C/Mapped/other/005/ABF1_ABA_optimal_narrowPeak_p16.bed
    NOTE:update:20170407

PROTEIN_ID is in fact a gene ID. If it is empty or equal to ``id_unknown``, then this ChIP-seq will be ignored.

Building an edge file
-------------------------------------------

Run the following command to create an edge file::

    python create_edges.py parameter_for_net.txt > edges.txt

*Note: create_edges.py is quite old, so newer scripts like create_edges4.py should be used.*

Each line in edges.txt has the following fields, separated by a TAB:

- target_id target_name
- source_id source_name
- score
- type of score: all, mix

  - If all, the score is computed using all available RNA-seq samples.
  - If mix, the score is computed using selected RNA-seq samples (e.g., by Mixture of Regressions, samples with the same tissue type, or samples with similar gene expression profiles).

- RNA-seq experiment IDs. A dot if the type of score is all. (To save space, the actual number of IDs is used instead of listing them.)
- ChIP-seq experiment IDs.
- Log likelihood. A negative value. A dot if type is all.
  If edge type is mix, we have the following cases::

    samples are selected using mixtools                         [a value determined by mixtools]
    samples are selected using post.translation.3               -1000.0
    samples are selected using hclust                           -991.0
    samples are selected using hclust (fixed number of groups)  -991.1
    samples are selected using wedge                            -1001.0
    seedling                                                    -999.0
    meristem                                                    -998.0
    root                                                        -997.0
    leaf                                                        -996.0
    flower                                                      -995.0
    shoot                                                       -994.0
    seed                                                        -993.0
    stem                                                        -992.0
    aerial                                                      -990.0

- Date. 4-digit year, 2-digit month, and 2-digit day. For example, 20170404.

Individual edge files will be written to Data/history/edges/one_target/, and merged into a single edges.txt by update_network.py.

There are several ways to create edges.

- create_edges.py (obsolete) -- single-process (slow), and reads both TPM.txt and binding.txt into memory (memory-inefficient).
- create_edges2.py (obsolete) -- multi-process (and hence faster).
- create_edges3.py (obsolete) -- multi-process, reading individual JSON files (made from TPM.txt) and target_tf.txt. It depends on two programs, TPM2JSON.py and make_target_tf.py. JSON files are stored in the directory jsonTPM. edges.txt.date.gene_id files are stored in the directory EDGES_PER_TARGET.
- create_edges4.py -- evolved from create_edges3.py. It emits an R script for each target gene. Additionally, an R script is generated for each of the target gene's successors, so the program computes incoming and outgoing edges for each target gene. We can quickly get results for important genes (by setting HIGH_PRIORITY_GENE in parameter_for_net.txt). K specifies the number of components in Mixture of Regressions. Currently K=2 and K=3 (in function establish_edges). MAX_PROCESS specifies the number of R scripts to run at the same time. This program checks the existing edges.txt to determine high-priority targets (they are genes specified in HIGH_PRIORITY_GENE or genes with few neighbours). To compute high-priority target genes first, add the following line to parameter_for_net.txt::

    %%HIGH_PRIORITY_GENE=AT3G26744 AT1G12860

  MIN_SIZE in create_edges4.py decides the minimum number of RNA-seq samples needed for a significant edge. MIN_SIZE is a number no larger than 100, but no smaller than 10. *Less connected genes will be high-priority targets and therefore processed first*. This program also checks whether an edge to be computed already exists and is recent enough (controlled by EDGE_AGE). If it is recent, the program will not re-compute the edge. This program considers post-translational regulation, too. This is achieved by the R function post.translation.3 and more recently by wedge.R (see below).

.. _mixtool.useful.case:

.. figure:: context.dependent.mixtools.png
   :scale: 60

   A toy example in which TF regulates Target in the green context but not in the red context. The green context may represent the presence of a "pioneer" gene, or certain conditions (the only ones in which TF can bind to Target). We don't know exactly what the context is (though given good annotation of RNA-seq samples, we may get some clue). The red and green points are identified by mixtools_.

In addition, we consider a commonly seen situation in which no significant pairwise correlation exists though TF and Target *do* have a relationship. TF is constitutively expressed (e.g., ICE1, MAPK6), and its Target can be expressed or not expressed, depending on whether or not the TF protein is functional, or whether or not a third interacting protein (if any) is functional.
In this situation TF and Target are often no longer correlated, and Mixture of Regressions may not help here either. The following figure shows such a scatterplot of gene expression. If there is a vertical strip far enough away from 0 that includes the majority of points in the scatterplot, and the remaining points with small X values concentrate in the lower-left corner (while the upper-left corner is virtually empty), then we say that TF is *necessary but not sufficient* for Target (to be expressed).

.. figure:: ICE1_CBF3_Figure1.png
   :scale: 60

   TF is ICE1, and Target is CBF3. The first component from Mixture of Regressions consists of red points, and the second component green points. None of the correlation coefficients are large.

.. figure:: ICE1_CBF3_Figure2.png
   :scale: 60

   The same scatterplot. The red points make the vertical strip. Value=0.85 is returned by post.translation.3.

.. figure:: DIN6_MAPK6_Figure1.png
   :scale: 60

   The horizontal axis is MAPK6, and the vertical axis is DIN6.

- create_edges0.py -- quickly establishes edges by computing correlation using all available RNA-seq samples. TFs and targets are from *Data/information/target_tf.txt*, which is generated by make_target_tf.py. The TPM matrix is specified in parameter_for_net.txt. I have a number of useful TPM matrices: TPM.txt.3130, TPM.travadb.txt, TPM.travadb.stress.txt, TPM.mild.drought.txt and TPM.inhouse.diurnal.txt. All of these could be used to quickly establish edges. The result is saved in *Data/history/edge_pool/edges.txt.simple.correlation.all.conditions.yymmdd_hhmmss*. Each row in this file has **10 fields**, separated by TAB. These fields are Target, TF, correlation coefficient, 'all', number of RNA-seq samples used for computing the correlation coefficient (to save space, I decided not to keep all the RNA-seq IDs), the ChIP-seq IDs, date, strength, and methods. The field methods is a short description of the methods used to get that edge. For example, 3131RNAseq means there are 3131 RNA-seq samples (more information at *Data/information/README.txt*). The folder *edge_pool* contains a number of edge files produced by various methods, which are to be merged by merge_edges.py. The products of merge_edges.py are a final edge file *Data/temp/edges.txt* and a folder *Data/temp/html_edges*, which contains html files for all edges in edges.txt. edges.txt and the HTML pages in html_edges/ will be copied to static/edges/ in the web application. Keep only edges with an absolute correlation coefficient greater than tau (e.g., 0.60) and whose TF and target are in target_tf.txt. Unlike searching for subsets of RNA-seq samples (as create_edges4.py does), which takes much longer, create_edges0.py gives us a basic co-expression network within hours.

  .. A gene is ignored if its number of non-zero log(TPM+1) values and its standard deviation of log(TPM+1) are not above the specified thresholds (see function make_r_script in create_edges0.py for detail).

.. _tissue.specific.correlation:

- correlation_per_tissue.R -- This script computes correlation on tissue-specific samples, i.e., RNA-seq samples with the same tissue. See assign.tissue.name_ on how to assign tissue names to RNA-seq samples. create_edges0B.py calls it.
- correlation_per_group.R -- This script computes correlation on related samples, i.e., samples having similar gene expression profiles. We establish relatedness in N RNA-seq samples using hierarchical clustering. We then cut the dendrogram into different numbers of clusters, ranging from 3 to N^0.35.
  The largest cluster resulting in a significant correlation coefficient is used.

- correlation_per_group_fixed_number.R -- Similar to correlation_per_group.R but less computationally expensive, this script first seeks an optimal number of clusters by calling the function get.optimal.number.of.clusters. The idea is that the optimal cut of the dendrogram should best divide the samples by tissue: RNA-seq samples in one cluster should tend to have the same tissue name. For each TF-target pair in target_tf.txt, we check their correlation coefficient in these optimal clusters. In contrast, correlation_per_group.R checks many sets of clusters, each corresponding to a different number of clusters, and thus takes longer.

.. _wedge-shaped:

- wedge.R -- This script looks for a wedge shape in each gene expression scatterplot, representing a potential post-translational relationship. If TF is low, then Target must be low. If TF is high, then Target can be high or low, depending on whether or not the protein product of TF is active. create_edges4.py also includes an implementation of this idea (see post.translation.3 in create_edges4.py), but wedge.R is more straightforward. wedge.R identifies scatterplots in which the correlation between TF and Target is low but most points (90\%) are below a line passing through the origin.

.. figure:: wedge.png
   :scale: 60

   99\% of points in the scatterplot are below the red line. The slope of the red line is automatically determined. When x is low, y is also low, so x is required for y to be expressed. When x is high, y can be low or high, so x by itself may not be sufficient for y to be expressed. q is the 75th quantile of y when x is less than 0.1. m is the mean of y when x is less than 1. r is the correlation coefficient of x and y using all points in the scatterplot.

Static html summary of the network
-----------------------------------

For each gene, we provide its upstream regulators and downstream regulatees (if any). We generate a static html page for each gene containing its regulators and regulatees, using the following command::

    python html_network.py -f edges.txt -r parameter_for_buildRmatrix.txt -c parameter_for_buildCmatrix.txt

A biologist could view this page by *right-clicking* a gene node in the brain web app. For example, right-clicking the gene node labelled ATPIF4 redirects them to `AT2G43010.html`_.

.. _AT2G43010.html: http://118.25.96.118/static/edges/AT2G43010.html

The above command is called in update_network.py, such that whenever edges.txt is updated, all regulator and regulatee information in these static html pages will be updated too. Therefore, each gene has links to its TFs and its targets (if any), and each edge has a link to its evidence page, which includes a gene expression scatterplot. A JavaScript file, *Code/depend/c3/scatterplot.js*, is used to generate the scatterplot. To display a scatterplot for a TF and its target, we need the gene expression values for each gene. So for each gene, create a JSON file which stores its most recent gene expression values (logarithmised TPM values) in all experiments. These JSON files are generated by::

    python slice_TPM_to_JSON.py parameter_for_net.txt

Ranking edges using frecency
------------------------------

Edges have varying levels of confidence of being true. Frecency is a combination of frequency and recentness. Edges that are never updated will be gradually forgotten, just as in human brains. On the other hand, frequently updated edges will be reinforced.
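The following toy function illustrates how frequency and recentness can be combined into one score: repeated appearances strengthen an edge (a log2(F + 1) term) while the age of the newest version weakens it exponentially (an Ebbinghaus-style exp(-t/S) term). This is a minimal sketch for intuition only; the actual ranking metric used by merge_edges, which also multiplies in the average \|r\| and the log10 of the average number of RNA-seq samples, is spelled out in the next paragraphs::

    from datetime import date
    from math import exp, log2

    def frecency(n_occurrences, most_recent, today, S=365.0):
        """Toy frecency score: reinforced by repeated appearances (log2(F+1))
        and decayed with the age of the newest edge version (exp(-t/S))."""
        age_days = (today - most_recent).days
        return log2(n_occurrences + 1) * exp(-age_days / S)

    # An edge confirmed 5 times but last seen three months ago ...
    print(frecency(5, date(2019, 8, 22), date(2019, 11, 22)))   # about 2.0
    # ... versus an edge seen only once, yesterday.
    print(frecency(1, date(2019, 11, 21), date(2019, 11, 22)))  # about 1.0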
An edge will be updated frequently if new binding evidence is added and the relationship is repeatedly re-established on new gene expression data (TPM.txt). An edge appearing frequently with different dates, different ChIP-seq IDs, and different sets of RNA-seq IDs will rank high. As we maintain historical versions of each edge, gradually an edge may have multiple versions, created on different dates, based on various amounts of RNA-seq data, etc.

.. Distinguish and keep only three types of edges from gene A to gene B, best.mix.plus, best.mix.minus and best.all. An edge from A to B is best mix plus if its type is mix, the correlation coefficient r is positive and it has the largest value of \|r\| * log10(RN), where RN is the number of RNA-seq data used for computing r. Likewise for best mix minus. An edge from A to B is best all if it has the most recent date.

Each kept edge is assigned a ranking metric, defined as the product of the following factors:

- avg \|r\|.
- log10(avg RN), where RN is the number of RNA-seq samples used for computing r.
- log2(F + 1), where F is the number of occurrences of this edge.
- exp( (MostRecentEdgeDate - CurrentDate)/S ), where S=365 controls the memory retention rate.

The last two bullets relate to frecency. Therefore, in addition to having a highly confident correlation coefficient, an edge that appears frequently will be ranked higher (the second-to-last term). An edge that does not re-appear will be gradually "forgotten" (the effect of the last term). An edge that is based on more RNA-seq data is favoured more (the second term).

.. figure:: retention.png
   :scale: 80

   Forgetting curve. Hermann Ebbinghaus's exponential forgetting, exp(-t/S), where t is time and S is the memory retention rate.

See function merge_edges in update_network.py for detail.

Updating the network
---------------------

.. The before link
.. MANUAL
.. After edges.txt has been updated in the application, visit http://118.25.96.118/brain/before to make the update effective (i.e., make a new graph from the updated edges.txt).

Triggering update
~~~~~~~~~~~~~~~~~~

Once any of the following files is modified, update the network using the following command::

    python update_network.py

- parameter_for_buildCmatrix.txt
- parameter_for_buildRmatrix.txt

.. _parameter.for.net:

- parameter_for_net.txt::

    %%GENE_LIST=../Data/information/gene-list.txt
    %%GENE_ID_AND_GENE_NAME=../Data/information/AGI-to-gene-names_v2.txt
    %%BINDING_MATRIX=../Data/history/bind/binding.txt
    %%EXPRESSION_MATRIX=../Data/history/expr/TPM.txt
    %%EXPRESSION_MATRIX_DESCRIPTION=drought
    %%INPUT_MATRIX=../Data/history/bind/INPUT_binding.txt
    %%BINDING_INFO=../Data/parameter/parameter_for_buildCmatrix.txt
    %%EXPRESSION_INFO=../Data/parameter/parameter_for_buildRmatrix.txt
    %%EXISTING_TARGET_TF_PAIRS=../Data/information/target_tf_agris.txt ../Data/information/target_tf.txt.20170629_143000
    %%MAX_NUM_TARGETS=200
    %%OVERFLOW_TARGETS_PERCENTAGE=0.25
    %%HIGH_PRIORITY_GENE=AT3G26744 AT2G43790 AT3G18550
    %%HOLDON=YES

- HIGH_PRIORITY_GENE -- these genes will be considered first. Used in create_edges4.py.
- INPUT_MATRIX -- can be omitted if no bw files are used in building the binding matrix.
- MAX_NUM_TARGETS and OVERFLOW_TARGETS_PERCENTAGE are used in make_target_tf.py.
- EXISTING_TARGET_TF_PAIRS -- specify files containing tf-target pairs from other sources (if any). Sources are separated by a space. The user could create such files to include tf-target interaction hypotheses that need to be further investigated by brain using RNA-seq data.
- HOLDON -- if YES, updates in this file won't take effect. See update_network.py for detail. - EXPRESSION_MATRIX_DESCRIPTION -- if this parameter is specified, then the last field (tissue or method) of edges created by create_edges0.py will be this description (e.g., drought). .. - edges.txt - binding.txt - TPM.txt .. (Not required now.) Run make_timestamp_file.py before first running update_network.py to prepare a time-stamp file containing the last modified times of all the above files. parameter_for_buildCmatrix.txt. Give a list of ChIP-seq sample IDs in REBUILD_LIST and call get_binding.py to update the binding files. If the list is empty, then calling get_binding.py will update the binding files for all sample IDs. The produced binding files are located at DESTINATION. If parameter_for_buildCmatrix.txt is updated, handle the following cases: - If a ChIP-seq sample is marked as 'obsolete' (in Note:) or 'update', then edges.txt will be modified. The edges containing this ChIP-seq sample might be affected. - If a ChIP-seq sample is marked as 'update', then its binding file will be remade. For a particular ChIP-seq ID, put 'update:yyyymmdd' in the 'Note' field to indicate that its narrowPeak file has been updated on yyyymmdd. (To simplify things, update all binding files if parameter_for_buildCmatrix.txt is updated.) If parameter_for_buildRmatrix.txt is updated, new TPM.txt will be made, and new edges will be computed using this new TPM.txt. A final edges.txt is made by combining all edge files from these two directories, HISTORY_DIR and HISTORY_DIR2. .. To prevent the network from updating binding.txt while we are editing parameter_for_buildCmatrix.txt (or parameter_for_buildRmatrix.txt or parameter_for_net.txt), add the following line:: .. %%HOLDON=YES Important log files ~~~~~~~~~~~~~~~~~~~ *Data/log/update.network.log.txt* records the activities while running update_network.py. *Data/log/download_log.txt* records all downloaded and mapped RNA-seq samples. Exploring the network ------------------------ Latest update as of 22 November 2019:: sudo cp /home/lanhui/brain/Data/temp/edges.txt /var/www/brain/brain/static/edges/edges.txt sudo find /home/lanhui/brain/Data/temp/html_edges -name "*.html" -exec mv -t /var/www/brain/brain/static/edges {} + Copy the files in summary directory (generated by html_network.py) to static/edges. Copy the files in json directory (generated by slice_TPM_to_JSON.py) to static/edges/json. Copy edges.txt to the directory where start_webapp.py resides. To explore the network:: python start_webapp.py Open an Internet browser and type http://127.0.0.1:5000 in the address bar for local debugging. Visit http://172.26.114.34:5000/ from other machines. We could explore the network in the following ways: - Enter a gene name or gene ID to start. Each node is clickable, and clicking that node will show its neighbourhood. Each edge is clickable too, and clicking that edge will open a new web page showing strength of association and binding details. TFs are in yellow, and non-TFs in light blue. A selected node is in dark blue. Mouseover a node highlights all incoming edges, and displays the sources of these incoming edges (on the right panel). To ensure responsiveness, only the neighbourhood of a query gene will be shown. If there are too many neighbours, show only a portion of the successors (specified by SHOW_N_NODES_EACH_TIME in start_webapp.py). The percentage is indicated on the right panel. Also displayed are the selected node's regulators. 
- Enter two gene IDs to show their common targets.
- Enter many gene IDs to show a sub-network consisting of these gene IDs. For example::

    AT1G30330 AT1G75080 AT3G59060 AT2G43010 AT5G11260 AT3G12580 AT3G33520 AT2G42870 AT5G39860 AT2G43060 AT2G18300

- Enter one gene ID and *tfs* to show the gene's predecessors and successors which are TFs.

.. - Enter http://172.26.114.34:5000/tfs in the address bar to show a random set of connected TFs. The gene with the highest outgoing degree is highlighted in blue. Refresh this page to see a different set.

- Enter http://172.26.114.34:5000/summary in the address bar to show basic statistics of the network, such as node degree distribution, hub TFs, hub targets, etc.
- Enter one source gene ID, one target gene ID and *path* to show all genes along the shortest paths from the source to the target. The network is made undirected before finding the paths.
- Enter one gene ID and one condition (e.g., root, leaf, meristem, aerial, seedling, seed, shoot, flower, mixtool, all, hclust) to display tissue- or method-specific edges.

.. - Enter a set of genes to get their upstream regulators. A TF that regulates more (in percentage) target genes in the set than in the whole genome is returned. In other words, if the set is enriched with target genes of a given TF, then this TF is considered a significant upstream regulator for the set. A hypergeometric test is used to get enrichment p-values. This functionality is inspired by TF2Network (http://bioinformatics.psb.ugent.be/webtools/TF2Network), but does not impose a restriction on the number of genes in the set (TF2Network requires at least 10 genes).

- Enter 'AT1G24260 AGL9;SEP3 tissue specific tau 1.9' to return TFs that are tissue specific as defined using the tissue-specificity ratio tau=1.9. TFs are classified into 4 categories, i.e., T: Tissue-specific. G: General. I: Inducible by stress. N: Not inducible. Different colours are assigned to TFs belonging to different categories, as follows: Dark green: TI; Light green: TN; Dark yellow (gold): GI; Light yellow: GN.

.. - Zoom in or zoom out of the network using a two-finger up or down gesture.

The most important files for network visualisation are::

    start_webapp.py         [ for building the network using networkx, also for handling routing using Flask ]
    templates/form.html     [ for getting user's input ]
    templates/graph.html    [ for displaying subgraph using cytoscape.js ]
    static/json/genes.json  [ for auto-completing user's input ]

Cytoscape.js is used to visualize the network. W2UI is used to display large tables (invoked by right-clicking a gene). All files for W2UI are in Code/depend/w2ui. Note: this functionality is currently not available.

When an edge link is clicked, evidence for the link, such as selected RNA-seq samples (no longer displayed, for simplicity; the association strength is displayed instead) and ChIP-seq samples, will be displayed. The link *Click for gene expression scatter plot* will open a scatterplot of gene expression for the source gene and the target gene.

**Utilities.** Get all paths and the nodes in the paths between two genes::

    python draw_network.py -f edges.txt -s ELF3 -e LUX -p -i all

The above command draws a network built on edges.txt, including only genes that are in the paths from ELF3 to LUX.

Drawing scatterplots
---------------------

Drawing scatterplots in the web application
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Files in *static/edges* are responsible for displaying gene expression scatterplots.
scatterplot.js does the actual work of displaying scatterplots. This JavaScript file is embedded into all edge html pages generated by merge_edges.py. The command *python3 merge_edges.py* produces edges.txt and *html_folder/*, all placed under Data/temp. We must copy these files to *static/edges* manually:

| cp /home/lanhui/brain/Data/temp/edges.txt /var/www/brain/brain/static/edges/edges.txt
| find /home/lanhui/brain/Data/temp/html_edges -name "\*.html" -exec mv -t /var/www/brain/brain/static/edges {} +

A few other js files are needed: c3.min.js, d3.min.js, cytoscape.min.js, jquery.min.js, jquery-ui.min.js and pace.min.js. These files are stored in *static/js/*. In addition, c3.min.css is needed in *static/css*. Copy all these files to static/edges. c3.min.js depends on d3.min.js. The latest version is significantly faster than the 2017 version. Copy the directory *Data/history/expr/json* (where gene expression levels are stored for all genes) to *static/edges/json*. Copy rnaseq_info_database.json to *static/edges*.

Drawing a gene expression scatterplot for any pair of genes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

As a utility, we can draw a scatterplot for an arbitrary pair of genes. This could be useful when there is no direct binding evidence available for a pair of genes, but a biologist still wants to check their co-expression anyway.

- Go to http://172.26.114.34:5000/
- Enter gene ID1, gene ID2 and scatterplot. Use sc or sp to stand for scatterplot. For example, enter::

    At2g25930 At5g61380 sc

Drawing sub-networks
---------------------

Displaying hundreds of genes and thousands of edges at once does not give much biological insight, as the plot looks messy. I have tried a few layout algorithms, twopi, neato and sfdp, without much success. See result.leaf.pdf in this folder for an example (this plot was generated by make_graphviz_file4.py followed by the dot command).

If we have a gene of interest, we can draw sub-networks surrounding it for all tissues. First, generate a tissue-specific edge file::

    python test_network4.py

The above command produces result.skeleton.txt, result.in.txt, result.out.txt and result.all.txt. Skeleton means only TFs are considered. The last three files are degree files. Generate a plot for a TF in all tissues::

    python make_graphviz_file3B.py AT1G65480 | dot -Tsvg -o result.svg result.gv

result.skeleton.txt is used. A plot result.svg is generated. The regulators and regulatees of AT1G65480 are shown, one group for each tissue. For more details, see the head of make_graphviz_file3B.py.

.. figure:: FT.result.svg
   :width: 300%

   Tissue names are shown in yellow boxes (bottom). A double circle means the gene is both a regulator and a regulatee. An egg means a regulatee. An oval means a regulator. Darker colour means higher relative node degree of a gene. If the colour is azure (white), then the gene has no successors in any tissue.

.. figure:: E2FA.result.svg
   :width: 100%

   Another example for ATE2FA. Command: python make_graphviz_file3B.py AT2G36010 | neato -Goverlap=false -Tsvg -o result.svg result.gv

To plot a TF's neighbours' neighbours (one plot for each tissue), use::

    python make_graphviz_file3C.py AT1G65480 | dot -Tsvg -o result.flower.svg result.flower.gv

Relationship between programs
------------------------------------------------

Most of the source code consists of Python files and R files. update_network.py is the entry point to understand the whole program.
Relationship between programs
------------------------------------------------

Most source code consists of Python and R files. update_network.py is the
entry point for understanding the whole program. We can schedule this Python
script to run every day using Linux's *cron* service, e.g.,
``1 5 * * * cd /home/hui/network/v03/Code && python update_network.py``.
This makes update_network.py run every day at 5 am.

The following plot shows the information flow among the various scripts in
BRAIN. Text in a circle generally refers to a Python script. Text on an edge
refers to the output of that script. The file extension .py is omitted for
brevity. The whole process involves many steps. The central one is
create_edges4, which consumes all files produced in earlier steps and
outputs many edges.txt.* files. These individual edge files are stored in
directories specified by HISTORY_DIR and HISTORY_DIR2 (see update_network.py).

.. image:: flowchart.png

.. Since create_edges4.py takes weeks to finish, update_network.py can be
   scheduled to run every day, so that the network keeps updating without
   having to wait for create_edges4.py to finish.

.. _assign.tissue.name:

Assign tissue names to RNA-seq samples
---------------------------------------

We assign each column (RNA-seq sample) in TPM.txt a tissue name. If the
RNA-seq sample's tissue is known, use it; otherwise, predict it. The tissue
information is used for displaying scatterplots and for computing
tissue-specific correlation coefficients (tissue.specific.correlation_).

Follow these three steps to get tissue names for RNA-seq samples:

- python assign_tissue.py > ../Data/temp/experiment.and.tissue.1.txt
- python refine_tissue.py > ../Data/information/experiment.and.tissue.2.txt
  Check the last column in the output, sugguested.tissue.
- python update_rnaseq_info_json.py
  This script updates Data/information/rnaseq_info_database.json, which is
  used to display scatterplots. It also calls knn_classify.R to predict
  tissues for unknown RNA-seq samples. The result is saved in
  Data/information/experiment.and.tissue.txt.

Correlation and sample size
------------------------------

We get a more stable estimate of the correlation between two variables when
the estimate is based on more samples. That is why having more RNA-seq data
is useful. 100 samples is a reasonable threshold, as demonstrated in the
following figure. Beyond 100, the correlation coefficients show small and
decreasing variation around the red line. The red line represents the
correlation coefficient computed on all 3101 samples. If the total of 3101
RNA-seq samples is our population, then the red line represents the *true*
correlation. So, for example, as we randomly take 100 samples, 200 samples,
500 samples, 1000 samples and 2000 samples from the population, our
confidence in approaching the "truth" increases (because of the smaller
variation). A minimal simulation of this experiment is sketched after the
figure.

.. figure:: corrcoef.vs.sample.size.png
   :scale: 80

   10 curves of correlation coefficients between HSFA1A (At4g17750) and
   HSP70 (At3g12580). We shuffle the 3101 samples, choose the first 10
   samples, 20 samples, 30 samples, etc., and compute correlation
   coefficients using them. Connecting these values draws a curve. We repeat
   this process 10 times to generate 10 curves. Each curve represents a
   possible path on which BRAIN sees the growth of RNA-seq data.

.. Code: Downloads/drawTPM/test_mixtools5b.R (Macbook)
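The subsampling experiment behind the figure can be reproduced with a few
lines of numpy. The sketch below uses synthetic correlated data in place of
the real TPM values of HSFA1A and HSP70, so that it is self-contained; the
numbers are illustrative only::

  # Minimal sketch of the correlation-vs-sample-size experiment.  Real usage
  # would read the two genes' TPM values from TPM.txt; here we simulate
  # correlated expression so the example runs on its own.
  import numpy as np

  rng = np.random.default_rng(0)
  n = 3101                                        # number of RNA-seq samples
  x = rng.normal(size=n)                          # stand-in for HSFA1A TPM values
  y = 0.8 * x + rng.normal(scale=0.6, size=n)     # stand-in for HSP70 TPM values

  for curve in range(10):                         # 10 curves, one per shuffle
      order = rng.permutation(n)
      xs, ys = x[order], y[order]
      sizes = range(10, n + 1, 10)                # first 10, 20, 30, ... samples
      r_values = [np.corrcoef(xs[:k], ys[:k])[0, 1] for k in sizes]
      print('curve %d: r at 100 samples = %.2f, r at all samples = %.2f'
            % (curve, r_values[9], r_values[-1]))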
Context-dependent correlation and false discoveries
-----------------------------------------------------

In the event of low correlation when we use all points in a scatterplot, we
seek a subset of points (amounting to choosing RNA-seq samples) on which we
get high correlation. One way to do this is Mixture of Regressions
(mixtools_). We investigate to what extent this approach introduces false
discoveries. That is, even with completely random, zero-correlation data, how
often does this approach report high correlation? The following plot shows
that we rarely get a correlation coefficient greater than 0.5 on subsets with
at least 146 points. So if we use a correlation threshold of 0.75 and
consider only subsets having more than 146 points, we are unlikely to get
false discoveries.

.. figure:: r.good.by.change.png
   :scale: 80

   Histogram of correlation coefficients on components having a sample size
   of at least 146. We randomly permute two variables having a total of 1464
   samples, apply mixtools_ to get components, and compute correlation
   coefficients on each component. We repeat this process many times and
   check the distribution of correlation coefficients on all the components
   with a sample size of at least 146.

If we obtain these subsets using tissues or clusters of similar RNA-seq
samples, the results should be similar.

uRNA
-----

(WORK IN PROGRESS.) Available at 172.26.114.34:5000/upload-rna

Return a sub-network consisting of TFs and *research* genes most relevant to
the biological context specified by custom transcriptomes. Research genes are
provided by the user as a list of gene IDs separated by TAB, in addition to
the custom transcriptomes. To make a biological context, the user uploads one
transcriptome or two transcriptomes (one control and one treatment) as a
plain text file, with gene expression quantified by TPM. Biological context
is defined as a subspace of transcriptomes centred around the input
transcriptome(s). In the one-transcriptome case, the sub-network is made from
all co-expressed TF-target pairs. In the two-transcriptome case, the
sub-network is further filtered, such that only differentially expressed
genes and research genes (if any) are included. A GO term enrichment analysis
is also provided. Note that a gene regulatory network is highly
context-dependent, so a network algorithm that respects this fact is highly
valuable.

uChIP
------

(WORK IN PROGRESS.) Available at 172.26.114.34:5000/upload-chip

Other notes
---------------

- BRAIN is a long-running process, like a daemon. The gene regulatory network
  is updated daily. New RNA-seq data are downloaded and mapped daily, and
  batches of them are incorporated periodically.

History
---------

Sizes of the TPM table (TPM.txt):

- 5,563 RNA-seq samples [29 Jun 2017]
- 3,130 RNA-seq samples [24 Apr 2017]
- 2,497 RNA-seq samples [20 Mar 2017]

Dependencies
-------------

R is required. In particular, the program Rscript is required. Python3 is
required too.

- R

  - rjson
  - mixtools_

- Python

  - numpy
  - scipy
  - networkx
  - pyBigWig (needed if there are bigwig files in parameter_for_buildCmatrix.txt)
  - flask

- Software

  - Salmon 0.7.2 for mapping RNA-seq data.
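Before running BRAIN locally, it may help to confirm that these dependencies
are actually installed. The following is a small pre-flight sketch, not part
of BRAIN itself; it only checks the Python modules and R packages listed
above (pyBigWig is skipped because it is only needed when bigwig files are
used)::

  # Minimal dependency check (not part of BRAIN): verify that the Python
  # modules and R packages listed above can be loaded.
  import importlib
  import subprocess

  for name in ['numpy', 'scipy', 'networkx', 'flask']:
      try:
          importlib.import_module(name)
          print('OK      python module', name)
      except ImportError:
          print('MISSING python module', name)

  # Rscript exits with a non-zero status if library() fails.
  for package in ['rjson', 'mixtools']:
      try:
          status = subprocess.call(['Rscript', '-e', 'library(%s)' % package],
                                   stdout=subprocess.DEVNULL,
                                   stderr=subprocess.DEVNULL)
          print('OK      R package' if status == 0 else 'MISSING R package', package)
      except FileNotFoundError:
          print('MISSING Rscript itself')
          break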
Directory structure
-------------------

- Code

  - depend

- Data

  - C
  - R
  - history

    - bind
    - edges

      - one_target
      - many_targets
      - edges.txt

    - expr

  - information
  - parameter
  - temp
  - log
  - upload

- Documentation
- Webapp

  - start_webapp.py
  - static

    - edges
    - js

  - templates

- Test

Other organisms
---------------

The program is developed for Arabidopsis. With a reasonable amount of effort,
it can be adapted to work for other organisms.

Design considerations
-----------------------

- Simple and FIXED input. It helps everyone who uses this software: tell
  users that if you give me input like this, I guarantee I will give you
  output like that. For ChIP-seq, narrowPeak files. For RNA-seq, quant.sf
  files.
- Clear pipeline for developers.
- Compute edges.txt in BRAIN's core engine and copy this file to the web
  application.
- Don't let the machine stay idle while we sleep. Let it work day and night.

Useful links
-------------

I find the following links useful while developing this program.

- SRA Run Selector: https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRR2014765&go=go
- ENA BioSample: http://www.ebi.ac.uk/biosamples/search?searchTerm=SAMN03995029&rows=10&start=0&useFuzzySearch=false&filters=
- ENA Run information: http://www.ebi.ac.uk/ena

Future work
-----------

- Thus far, context is defined by tissue, clustering, or components returned
  by Mixture of Regressions. Context can be more flexible. For example, the
  expression level of a particular gene can be a context. If that gene is
  highly expressed, this is a gene-specific high-expression context. We could
  then have a query such as *Gene ID1 when Gene ID2 is high*.

.. - Complete the information in parameter_for_buildCmatrix.txt.
   - Check the update status of the ENA website and download new data
     automatically.

FAQs
-----------

- When should make_target_tf.py be called and how?

  Answer: The purpose of make_target_tf.py is to generate a target_tf.txt
  file, which is used by edge-generating scripts (e.g., create_edges0.py).
  So make_target_tf.py should be called whenever a new target_tf.txt is
  needed. This could happen when binding.txt is updated. In fact,
  make_target_tf.py takes binding.txt as an input (specified in
  parameter_for_net.txt). Call it in the following way:

  | python3 make_target_tf.py ../Data/parameter/parameter_for_net.txt > ../Data/information/target_tf.txt

- When should create_edges0.py be called and how?

  Answer: Call it when either EXPRESSION_MATRIX or BINDING_MATRIX (specified
  in parameter_for_net.txt) is updated. Its product will be placed in
  ../Data/history/edge_pool. Call it in the following way:

  | python3 create_edges0.py ../Data/parameter/parameter_for_net.txt

Contact
-----------

.. Hui Lan (hui.lan@slcu.cam.ac.uk) in Wigge's group, the Sainsbury
   Laboratory, University of Cambridge, UK.

Please contact Hui Lan (lanhui@zjnu.edu.cn) for any questions about the
internal workings of BRAIN.

Maintenance stories
--------------------

[2020-08-10]. G.pickle contains the entire networkx network. G.pickle could
not be saved as a pickle file because of a memory error. Check file
/var/www/brain/brain/start_webapp.py::

  1134     with open('/var/www/brain/brain/static/G.pickle', 'wb') as p:
  1135         pickle.dump(G, p)

8 GB of memory in the server seems to be not enough. Solution: keep an edge
line with probability 0.7::

  161      if line != '' and random.randint(1,10) < 8:  # keep this line with probability 0.7; limited memory caused a memory error while writing the pickle file

References
-----------

.. _mixtools:

[mixtools] Benaglia, T., Chauveau, D., Hunter, D., & Young, D. (2009).
mixtools: An R package for analyzing finite mixture models. Journal of
Statistical Software, 32(6), 1-29.