Sunday, February 18, 2024

Using the Visual Genome Browser to compare the genome of the Tasmanian Devil (Sarcophilus harrisii) with that of the extinct Tasmanian Tiger/Wolf (Thylacine or Thylacinus cynocephalus)

It was early spring, 7 September 1936, and Benjamin, the last remaining Tasmanian Tiger/Wolf (scientific name: Thylacinus cynocephalus), was breathing his last breath in an enclosure of the Beaumaris Zoo in Hobart, in the Australian island state of Tasmania.  It was a sad day for Australia, so much so that 7 September is still commemorated as National Threatened Species Day every year.

Below are two photos of the Thylacine:



You can also watch a colourised YouTube video of the Tasmanian Tiger over here.

That was almost 88 years ago, but the Thylacine's voice is not completely silent.  With the power of DNA sequencing one can still discover its secrets by comparing its genome sequence against the genome of other living Australian mammals like the Tasmanian Devil (Sarcophilus harrisii).

Here is an image of a Tasmanian Devil


Both of these animals are carnivorous, and both are marsupial mammals.  The gestation period of the Thylacine was 28 days while that of the Tasmanian Devil is 21 days.  The young are born in a relatively immature state and have to crawl into the mother's pouch, where they must find one of the available teats to complete their development. Both animals have (or had) a rear-facing pouch with four teats.

Looking at the images one would be excused for thinking that the Thylacine/Tasmanian Tiger is more closely related to the Dog or the Wolf, but the Thylacine and the Tasmanian Devil only have 7 pairs of chromosomes (12 autosomal and 2 sex chromosomes) while the Dog and the Wolf have 39 pairs of chromosomes (76 autosomal and 2 sex chromosomes) which are considerably different from that of the Thylacine and the Tasmanian Devil.

You can read more on the relatedness of the Thylacine with Australian animals in the following articles:

Genome sequence expands on the story of the extinct Tasmanian tiger
We’ve decoded the numbat genome – and it could bring the thylacine’s resurrection a step closer
The mitochondrial genome sequence of the Tasmanian tiger (Thylacinus cynocephalus): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2652203/



What I really want to get to is to show you how to do this comparison for yourself with the use of my Visual Genome Browser.

The first step in doing the comparison is to install the genomes of both of these marsupial mammals in the browser.

Start by clicking on the following button, which will take you to the NCBI GenBank repository:


For the Tasmanian Devil you will be downloading the genome at NCBI.







For the Thylacine you will download its genome from the University of Melbourne's Thylacine genome repository here.


Both of these files need to be unzipped into a folder such as:
FastaThylacine or FastaTasmanianTiger or FastaTasmanianWolf, placed below the data folder where your genomes are stored.

The next step is to ask the Genome Browser to refresh its detected folders:



This action will rebuild the 2bit genome file from the fasta/fsa file and the GFF annotation file.  On my PC it looks as follows after this step:





When you have completed these steps you will have the 2 animals' genomes available for selection in the folder combo box:


When you select the genome of the Tasmanian Wolf and also click on the Load genes button, the chromosomes will be shown as scaffolds and you will see a view like this:


Now you need to open the Control panel by clicking on the following button:



Next step is to set the Thylacine genome as the "Other genome" which you will be comparing by double clicking on the bottom Overlay text box.


It should set the position as well as turn pink.


Next step is to load the Tasmanian Devil genome by selecting it from the Folder combo box.

Then, after it is loaded and you have clicked the "Load Genes" button, you double click in the top overlay text box next to the red "This".

Now both genomes will be loaded and ready for further comparison.

You should see something like this:



 Now that both genomes are ready, we are going to use the KmerDb tool to do a k-mer comparison between all the chromosomes of the Thylacine and all the chromosomes of the Tasmanian Devil to determine which chromosome from each animal corresponds to the same chromosome in the other animal.  It will essentially calculate the Jaccard similarity between each chromosome and then provide the results in an Overlay table.
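To make the idea of a k-mer Jaccard comparison concrete, here is a minimal Python sketch. It is not the KmerDb tool's own code: the function names, the tiny toy sequences and the choice of k are my assumptions for illustration, and a real tool would typically also handle reverse complements and use far more compact data structures.

```python
def kmers(seq, k):
    """Return the set of all k-mers in seq, skipping any that contain N."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)
            if "N" not in seq[i:i + k]}


def jaccard(a, b):
    """Jaccard similarity: size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_table(this_genome, other_genome, k):
    """Rows are chromosomes of this genome, columns are those of the other genome."""
    other_sets = {name: kmers(seq, k) for name, seq in other_genome.items()}
    table = {}
    for name, seq in this_genome.items():
        row_set = kmers(seq, k)
        table[name] = {o: jaccard(row_set, s) for o, s in other_sets.items()}
    return table


# Toy sequences, invented for illustration only.
devil = {"chr1": "ACGTACGTGGCA", "chr2": "TTTTCCCCGGGG"}
thylacine = {"scaffoldA": "TTTTCCCCGGGA", "scaffoldB": "ACGTACGTGGCC"}

table = similarity_table(devil, thylacine, k=4)
# For each row, pick the column with the highest similarity.
best = {row: max(cols, key=cols.get) for row, cols in table.items()}
print(best)
```

The highest value in each row points to the most likely corresponding chromosome in the other genome, which is the same way the Overlay table is meant to be read.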


This might take a considerable time, and when it is finished it will show you the table with the corresponding chromosomes of this genome (Tasmanian Devil) in the Rows and the Other genome (Thylacine) in the Columns.



The next time you click on the "Create or Read table" button it should take much less time, as the table is saved in a CSV file for later use.

Next step is to create a DOT Plot display using Minimap2 for all the likely matching chromosomes between the 2 animals.  This can be done in 2 ways:

If you want to do it manually, you can simply double click on the matching column and then click on Full Align which will proceed to use Minimap2 to create a PAF file that is used to create a DOT plot for the full alignment between the 2 sequences.  Make sure you have enough RAM, otherwise you will need to select the "Low mem" option which runs much longer but uses less memory.




When it is finished building a dot plot, you will see it in the Dot Plot tab of the bottom panel




You can move the mouse over the DOT Plot to see which genes in both genomes are found at that position as well as get a local alignment of the DNA at that position.  

The Current (This) genome runs along the Y axis and the Other genome sequence runs along the X-axis.

Green is used to indicate a match in the forward direction and purple indicates a match in the reverse complement.  
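If you want to draw a similar dot plot outside the browser, the PAF file can be turned into line segments with a few lines of Python. This is only a sketch under my own assumptions: the Segment type and function name are hypothetical, the sample lines are invented, and I have arbitrarily put the PAF query on the Y axis and the PAF target on the X axis (I am not stating which genome the browser passes to Minimap2 as the query). The column positions follow the standard PAF layout.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    query: str
    target: str
    x0: int  # start on the X axis (target coordinate)
    y0: int  # start on the Y axis (query coordinate)
    x1: int  # end on the X axis
    y1: int  # end on the Y axis
    colour: str


def paf_to_segments(lines):
    """Convert PAF records into coloured dot plot line segments."""
    segments = []
    for line in lines:
        f = line.rstrip("\n").split("\t")
        if len(f) < 12:
            continue  # not a complete PAF record
        qname, qstart, qend, strand = f[0], int(f[2]), int(f[3]), f[4]
        tname, tstart, tend = f[5], int(f[7]), int(f[8])
        if strand == "+":
            # Forward match: query and target advance together.
            segments.append(Segment(qname, tname, tstart, qstart, tend, qend, "green"))
        else:
            # Reverse-complement match: the query runs backwards as the target advances.
            segments.append(Segment(qname, tname, tstart, qend, tend, qstart, "purple"))
    return segments


# Two invented PAF lines, for illustration only.
sample = [
    "devil_chr1\t1000\t100\t400\t+\tthy_scaf1\t2000\t500\t800\t290\t300\t60",
    "devil_chr1\t1000\t600\t900\t-\tthy_scaf1\t2000\t1200\t1500\t280\t300\t60",
]
segments = paf_to_segments(sample)
for s in segments:
    print(s)
```

Forward matches come out as lines rising from left to right and reverse-complement matches as lines falling from left to right, which is why inversions stand out so clearly in a dot plot.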



When you double click in the DOT Plot at the start of a region and CTRL + double click at the end of the region it will display the genes from both genome sequences found in that region.




The dot plot display also allows you to "ALIGN" the sequences from the 2 genomes based on the CIGAR strings for the alignments found in the PAF file generated by the Minimap2 process.

You can do this by holding in the SHIFT key and double clicking at a specific position in the DOT plot.

This action will correctly set the positions in both genomes in order that the Zoom DNA view is at the spot where the 2 sequences can be exactly overlaid on top of each other.  You can switch the Zoom DNA view between 3 modes: Overlay, No Overlay and Aligned Overlay (where local alignment is used to display the 2 genome sequences with colour coded letters). But the gene annotations will only be displayed in the Overlay mode.  
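The idea of using a CIGAR string to find the matching position in the other genome can be sketched as follows. This is my own simplified illustration, not the browser's code: it assumes a forward-strand alignment, it assumes the CIGAR string comes from the cg:Z: tag that Minimap2 writes when asked to output CIGARs, and the function name and example values are hypothetical.

```python
import re

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDN=X])")


def target_to_query(tpos, tstart, qstart, cigar):
    """Map a target position to its aligned query position (forward strand only).

    Returns None when tpos lies outside the alignment or falls inside a
    deletion, where the target base has no partner in the query.
    """
    t, q = tstart, qstart
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op in "M=X":        # consumes both target and query
            if t <= tpos < t + n:
                return q + (tpos - t)
            t += n
            q += n
        elif op in "DN":       # consumes target only
            if t <= tpos < t + n:
                return None
            t += n
        elif op == "I":        # consumes query only
            q += n
    return None


# Invented alignment: target starts at 100, query starts at 10.
cigar = "5M2D3M1I4M"
for pos in (102, 105, 107, 110, 114):
    print(pos, target_to_query(pos, 100, 10, cigar))
```

Once the pair of positions is known, both genome views can be scrolled so that the two bases sit on top of each other, which is the effect the SHIFT + double click gives you.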



When you are in the Overlay mode of the Zoom DNA view, you can even right click to access the genes in both genomes at that position.



If you now select the "Copy Protein to clipboard + Comparisons" option for both "Zinc finger proteins" in the 2 genomes it will copy the protein sequences to the Comparisons so you can do further analysis on them.

A very powerful feature after you have calculated the Minimap2 Full alignment Dot Plot between the 2 sequences is to Switch to the Comparisons Tab and then 


Next select Fast Global Alignment


It is important to select this Fast option, as it uses optimised code to align sequences at speeds which are orders of magnitude faster than the other options.  The other options are only used to align individual sequences against each other when you want to obtain colour coded output or want to use PAM substitution matrices with protein sequences. 

Then check the "Use full pairwise" checkbox and after that click the "Alignments" button to find the genes from both genomes from the positions with best alignment.


The Identity and Coverage text boxes can be used to set the stringency of the matches. (Coverage is only used when you selected Fast Local Alignment in the previous selection instead of Fast Global Alignment)

The Alignments button is usually used when you have copied several Comparison entries and want to align them against each other (when you have unchecked Use Layers) or when you want to align the currently selected Comparison entries against the genes in the currently displayed genome sequence (when you have checked Use Layers). This normally does an ALL to ALL comparison to find the best alignments across the entire sequence's annotated genes. The Use full pairwise option attempts to speed up this search by only searching within the Minimap2 alignment output, where we have already determined that there is a high degree of similarity.

You will get the following popup:


The search process will take several minutes, but still much quicker than if you had to do a full brute force search between all of the annotated genes of the Tasmanian Devil and all of the comparisons loaded from the Thylacine.

The result is a list of 1240 genes from the Tasmanian Devil with high similarity with genes in the Thylacine.  The output is displayed in the Gene Search Results tree.

You can zoom the tree by holding down the CTRL key and rolling the mouse wheel.  The results are ordered in descending order from highest score to lowest score. (The score is a measure of both the alignment length and the identity %)


When you double click on the main tree node, it will position the Zoom DNA view at the position of the gene in the current Tasmanian Devil's genome. 


You can also right click on the entries to get various menu options:


And most importantly, when you double-click on the entry containing the =>98.4%, the Browser will take you to a pairwise alignment between the 2 protein sequences in the Comparisons Tab.
This is the output of the fast method.



This is now where you can select one of the slower alignment methods such as Blosum90Global which will relax the matching criteria to take into account that amino acids can normally be substituted in nature for others and still result in a similar folding pattern in the protein.

The BLOSUM90 matrix assigns scores to amino acid substitutions based on their observed frequencies in related protein sequences. It provides higher scores for more similar amino acids and lower scores for less similar ones. Substitution matrices come in various forms, with the most common ones being BLOSUM (BLOcks SUbstitution Matrix) and PAM (Point Accepted Mutation) matrices.

When you want to re-run the alignment, you can click on the Edit Distance button or simply double click on the =>98.4% node again.

Now you get a colour coded pairwise alignment where Hydrophobic amino acids are pink, Polar (Uncharged) ones are blue, Polar (+) ones are a Cyan colour and Polar (-) ones are green. It will sometimes also show the start of the Exons in Red. It helps you to see when similar amino acids were substituted for others.


Another quick way to get an alignment is to simply start typing the gene name in the Gene Search box and the Genome Browser will match the corresponding genes in both genomes.




You can now just press CTRL + ENTER to do the same alignment with different genes.

After you have navigated to any gene position you can instantly overlay the sequences by clicking on the 'A' button, which will use the CIGAR strings in the Minimap2 alignment to position the genomes so that the two sequences line up in the Zoom DNA view.


When you now switch to the "Align" view, you can see the DNA regions on the corresponding chromosomes aligned in the Zoom DNA view.


Getting back to the DOT Plot images for the Minimap2 full pairwise alignments, they can be found in the following temporary folder for the Tasmanian devil in case you want to use them elsewhere:

Genomes\FastaTasmanianDevil\Temp_FastaTasmanianDevil\Minimap


 

When you need to create all of the DOT PLOT images for all of the likely matching sequences, you can use a batch mode which will take some time but will eventually pre-calculate them all for you.



This concludes what I want to explain on how to use the Visual Genome Browser to compare the genomes of the Tasmanian Tiger/Wolf with that of the Tasmanian Devil.

You can download the Visual Genome Browser at this link.

To read more about this go to this link about the University of Melbourne's project to de-extinct the Thylacine.



Friday, April 21, 2023

Visual Genome Browser. The navigation markers and highlighting matching DNA patterns

 For the best experience you should view this on a 4K screen. (Click on the YouTube Full screen button)


The controls panel, navigation in the Zoom DNA View using markers. Copying snippets of DNA. Highlighting matching DNA based on a pattern. Telomeres of chromosomes. Number of allowable mismatches. Searching through entire chromosomes for ALU sequences. Jumping between search results.


ALU Sequence: GGCCGGGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGCGGGCGGATCACCTGAGGTCAGGAGTTCGAGACCAGCCTGGCCAACATGGTGAAACCCCGTCTCTACTAAAAATACAAAAATTAGCCGGGCGTGGTGGCGCGCGCCTGTAATCCCAGCTACTCGGGAGGCTGAGGCAGGAGAATCGCTTGAACCCGGGAGGCGGAGGTTGCAGTGAGCCGAGATCGCGCCACTGCACTCCAGCCTGGGCGACAGAGCGAGACTCCGTCTCAAAAAA




Wednesday, April 19, 2023

How to use the Visual genome Browser to navigate to a human gene RNA Polymerase II Subunit A (POLR2A)

 For the best experience you should view this on a 4K screen. (Click on the YouTube Full screen button)


Navigating to a specific gene RNA Polymerase II Subunit A (POLR2A) using the Gene Lookup. Gene Information. Working with genes in the Zoom DNA View, Transcript of the gene, The Protein View, The Genetic Code table. Colour coding of the codons. Amino Acid Properties, Codon Bias colouring, GC Content, Copying transcripts, spliced transcripts, Splice Information, Protein sequence. Protein statistics


Thursday, April 13, 2023

Introductory Video on how to install the Visual genome Browser

 For the best experience this needs to be viewed on a 4K screen.


While exploring genome data provided by the University of California, Santa Cruz (UCSC), I developed a concept to visualize the human genome in a two-dimensional space. As a programmer, I wanted to gain a better understanding of the digital data contained within the genome, which codes for protein machines such as enzymes and structural and control proteins like transcription factors, enhancers, and suppressors. However, the standard linear axis representation did not provide sufficient insight into the genome's structure, particularly the repetitive patterns in regions like the centromeres and telomeres of chromosomes.


Drawing on my experience in GIS, I sought to represent large amounts of map data in two dimensions with multiple layers of information overlaid on top of each other. Applying the same approach to the genome, I aimed to depict the chromosomes as compactly as possible by using coloured blocks running from left to right and top to bottom. By adjusting the line width, any repetitive patterns in the data could become visible.


To achieve this, I decided to use the GC content as a metric and map it onto an RGB colour scale, with green reminding me of GFP, which is often used in fluorescent reporter assays. I discovered that simply mapping nucleotides to pixels resulted in noise, so I applied averaging over bases to obtain visually identifiable blocks of a certain colour intensity. By taking the average number of G's and C's in blocks of approximately 20-50 bases and mapping it to the 0-255 green intensity, I generated a structure of the genome that distinguished areas exhibiting distinct GC content structure, similar to how origins of replication occur in areas with higher A's and T's, where the DNA helix is more easily opened up by a helicase for DNA replication to begin.
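The block averaging described above can be sketched in a few lines of Python. This is a simplified illustration rather than the browser's actual code: the function names, the block size of 32 (picked from the 20-50 range mentioned above) and the toy sequence are my assumptions.

```python
def gc_intensities(seq, block=32):
    """Average the GC content over fixed-size blocks and scale it to 0-255."""
    seq = seq.upper()
    values = []
    for i in range(0, len(seq), block):
        chunk = seq[i:i + block]
        gc = sum(1 for base in chunk if base in "GC")
        values.append(int(255 * gc / len(chunk)))
    return values


def to_rows(values, width):
    """Wrap the intensities left to right, top to bottom, at the given line width."""
    return [values[i:i + width] for i in range(0, len(values), width)]


# Toy sequence with a block size of 4, for illustration only.
greens = gc_intensities("GGCC" + "AATT" + "GCAT", block=4)
print(greens)
print(to_rows(greens, width=2))
```

Each value becomes the green channel of one coloured block, and changing the width passed to to_rows is what makes repetitive patterns line up vertically and become visible.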

The Visual Genome Browser is the culmination of more than 7 years of experimentation and it is now available in a Beta for people to start exploring genomes for themselves in this 2 Dimensional format.

Initially it will not be fully clear how this way to visualize the genome would be of benefit, but as I release more videos on the use cases it will become evident.

Friday, September 16, 2022

Reverse engineering the T-Cell Receptor proteins for a T-Cell that can kill virus infected cells.

Today I want to continue my discussion on our fascinating adaptive immune system by considering a specific Killer-T-Cell able to bind to cells that are infected with the Human T-lymphotropic Virus (HTLV1) and I want to then show you how one would go about to reverse engineer the receptor proteins of this specific T-Cell to find out which of the gene segments in the T-Cell Receptor gene cluster was actually used to build it. 

Please start by reading my previous post, where I discussed how these gene segments work.

The term T-lymphotropic means that it is a virus which targets T-Cells.  It is also known as Human T-cell leukaemia virus type I (viralzone link).  It is a reverse transcribing virus like HIV and it sometimes causes blood cancer.

Remember how Killer/Cytotoxic T-Cells are the sidekicks of B-Cells and Natural Killer cells, and how they fill a very important role in fighting against pathogens:
  • B-Cells produce antibodies which can bind to matching antigens of pathogens (outside of cells) and disable them OR allow other cells to easily find and bind to them by providing a convenient Fc receptor that those cells (like innate immune system neutrophils and macrophages) can attach to in order to more easily kill those enemies.
  • Killer T-Cells are able to determine if a cell is infected with a known pathogen by inspecting the MHC1 proteins on the surface of cells that are continually displaying small peptide parts (small amino acid segments are called peptides) of the pathogens and checking if it can bind to their T-Cell receptors.  When they have determined that their unique receptors can bind the antigen, and they receive the required confirmatory signals that it is indeed a pathogen that they have previously been warned about, they then instruct the cells to kill themselves gracefully.
  • Natural Killer cells check which cells are not displaying any peptides in MHC1 proteins and if not, they send similar commands (as T-Cells do) to the cells to self destruct.
T-Cells will not start killing unless they have been properly trained for the job.  During development in the Thymus (where T-Cells derive their name from), they are checked to make sure:
  • That they have T-Cell receptor proteins on their surface that have been properly formed from the Alpha and Beta chains of the TRA and TRB gene segments.
  • That their T-Cell receptors are able to bind to peptides presented to them in your own body's MHC1 proteins. This is partly achieved by the CD8 protein which will bind to a part of the MHC1 protein during docking.
  • That these T-Cells will not react to the body's own proteins (if they do, they are immediately ordered to self destruct) 
  • That they do not bind too weakly or too strongly to their specific antigen, but just the correct amount (like in Goldilocks and the three bears).
When they have passed all of these tests they are allowed to leave the thymus and move to the lymph nodes where they need to be activated by professional antigen presenting cells (APC) like the dendritic cells of the innate immune system.  Dendritic cells are like the spies of the immune system, collecting samples of dangerous pathogens and reporting back to headquarters (the lymph nodes) in order to identify the T-Cells with the correct receptors which are able to neutralize the enemy.  Once they find this T-Cell "operative" (out of billions of possible ones), they use a second "danger" signal to stimulate the T-Cell's CD28 protein receptor to indicate that the peptide which is currently being displayed to the T-Cell agent on its MHC1 protein is indeed part of an enemy's make-up and that this special "operative" has the correct skills for this job (i.e. the correct shuffling of its gene segments).  It will then use its CD8 surface protein to connect to a part of the MHC1 protein on the dendritic cell surface and pull it closer.  There is a very close relationship between your own CD8 and MHC1 (as well as your CD4 and MHC2 proteins in the case of Helper T-Cells).  This activates the Killer-T-Cell, which now has a license to kill the body's own cells if it again comes across a cell presenting this antigen peptide.  Once activated, they multiply and leave the lymph nodes to patrol our body in search of the pathogens that they alone can recognise.

Footnote: MHC (which is known as the Major Histocompatibility Complex) is also known for causing organ rejection between mismatched recipients because your body will recognise somebody else's MHC as foreign and try to kill it via the Natural Killer Cells.

When cells are infected by viruses, they start emitting an alarm signal in the form of a chemical messenger called Type I interferon (IFN-alpha and IFN-beta).  This will stimulate surrounding cells to start producing more MHC1 display proteins and they will present all kinds of protein parts they have digested on the surface MHC1 receptors for T-Cells to clearly "smell".  Antibodies are generally not able to enter cells, and that is why Killer-T-Cells are so important.  When a Killer T Cell bumps into an MHC1 receptor on the surface of an infected cell, it will immediately spring into action to exercise its "license-to-kill" and get rid of the infected cell.  All of the cell's machinery, including the viral proteins will be digested and destroyed, lock-stock-and-barrel, eliminating the spread of the virus via that cell.

Short peptides from a pathogen (of around 9 amino acids in length) are carried from the inside of a cell and presented in a groove of the MHC1 protein molecule.  In the following image you can see how a part of the Human T-lymphotropic Virus protein is held very tightly in a groove of the MHC1 protein.

Different people have MHC1 proteins which are slightly better or slightly worse at presenting peptides from different pathogens, which is why some people are genetically able to better cope with different viruses than other people.  Each person's MHC1 is simply better or worse with binding to different parts of a broken down virus protein.



The following shows the secondary structure of the MHC1 groove in which the peptide amino acids are presented to the T-Cell Receptors.  The amino acids are labelled and colour coded.  You can see alpha helices as well as beta sheets.


The specific amino acids in the TAX protein peptide (depicted in grey) are: LLFGYPVYV

Leucine-Leucine-Phenylalanine-Glycine-Tyrosine-Proline-Valine-Tyrosine-Valine

This specific sequence can be found in the genome of the HTLV1 virus at base position: 6977-7003 of the 8507 sized genome.
In the following diagrammatic 2D representation of the complete genome of the HTLV1 virus genome, I have indicated where this specific peptide (which is part of the TAX protein) can be found in the virus genome. 

I have used the Visual Genome Browser to depict the amino acids coded for by the genome bases in the TAX protein with the specific peptide that the T-Cell receptor can bind to highlighted in white.



The genome exists inside the virus particle as single stranded RNA which is reverse transcribed into DNA and then integrates into the cell DNA (sometimes causing leukaemia when it breaks an important gene during integration).  When the infected cell's ribosome protein "printing machines" are hijacked to manufacture this TAX protein, the 3 letter coding bases in the viral genome are translated into a protein of 359 amino acids.  The TAX protein is a transcription activator which will "awaken" the virus that has integrated into the human DNA and cause it to be transcribed into messenger RNA, thus resurrecting the virus from its "slumber".

But when the viral TAX protein gets digested into small peptides, the MHC1 protein will pick up pieces of the protein and "present" these pieces on the cell surface to Killer T-Cells.

In the following image you can see how snugly the TAX protein peptide "fits" like a puzzle piece to the T-Cell Receptor Alpha chain (depicted in pink) and the Beta chain (depicted in green).  The tight binding to the viral peptide is brought about by the very serendipitous arrangement of the V, D and J segments in the T-Cell Receptor alpha and beta gene clusters.  Exactly the right gene segments were included in the recombination "shuffling" during development of this T-Cell to make this all possible.  

This Killer T-Cell that has these receptors existed in the body all along, but it took an antigen presenting dendritic cell (APC) to identify the correct T-Cell for the job in the lymph node. 

The peptide is ALWAYS presented inside an MHC1 protein groove, but I have left it out in the image above to show you the close binding of the virus peptide to the T-Cell receptor CDR (Complementarity Determining Region), which consists of the amino acids of the T-Cell receptor proteins that are responsible for recognising and binding to the presented peptide.

Normally, it looks as follows:
The MHC1 is bound to the peptide from the bottom and the T-Cell receptors are bound at the top.  The T-Cell Alpha and Beta receptors are attached to the surface of the T-Cell and the MHC1 surface receptor is attached to the infected cell.

This is a match like a KEY IN A LOCK. It then sets in motion the "gears" inside the T-Cell that will eventually lead to the self destruction of the cell with this virus peptide on its surface.
The presence of part of the virus, presented "on a platter" in the MHC1 groove, is a tell-tale sign that the cell has been infected with the T-Cell Leukaemia virus and that it is silently making thousands of copies of the virus inside the cell.



If you want to explore the 3D structure of the above complex for yourself, you can find it here on the Protein Databank Website for the entry: 1AO7



In this same way, all kinds of proteins, from our own body as well as from invading viruses, are being digested into small peptide sequences and then presented in MHC1 receptors on the surface of all of our body cells (except red blood cells) to the Killer-T-Cells for inspection.  There would normally not be Killer-T-Cells that target normal body proteins, because they would have been eliminated by the strict screening process in the thymus.

If you are interested in more T-Cell to MHC "docking" examples, have a look at this data.  It comes from an article on the topic.

Next, I will show you how one would go about finding the specific gene segments which were stitched together to produce these specific T-Cell receptor proteins that are able to recognise this virus peptide so elegantly.

The first step is to download the actual sequences of amino acids that make up the different chains of the T-Cell Receptors. This is done by clicking on the Download-FASTA menu item:



This will provide you with a FASTA text file containing the following sequences:

>1AO7_1|Chain A|HLA-A 0201|Homo sapiens (9606)

GSHSMRYFFTSVSRPGRGEPRFIAVGYVDDTQFVRFDSDAASQRMEPRAPWIEQEGPEYWDGETRKVKAHSQTHRVDLGTLRGYYNQSEAGSHTVQRMYGCDVGSDWRFLRGYHQYAYDGKDYIALKEDLRSWTAADMAAQTTKHKWEAAHVAEQLRAYLEGTCVEWLRRYLENGKETLQRTDAPKTHMTHHAVSDHEATLRCWALSFYPAEITLTWQRDGEDQTQDTELVETRPAGDGTFQKWAAVVVPSGQEQRYTCHVQHEGLPKPLTLRWE

(This is the main chain of the MHC1 protein, which interacts with the T-Cell receptor and also holds the peptide being presented. HLA stands for Human Leukocyte Antigen. This is also the protein that differs so much between different people, making organ transplantation very difficult.)



>1AO7_2|Chain B|BETA-2 MICROGLOBULIN|Homo sapiens (9606)

MIQRTPKIQVYSRHPAENGKSNFLNCYVSGFHPSDIEVDLLKNGERIEKVEHSDLSFSKDWSFYLLYCTEFTPTEKDEYACRVNHVTLSQPCIVKWDRDM

(This is part of the MHC1 protein complex, but it does not interact directly with the T-Cell receptor.) See this Wikipedia article.


>1AO7_3|Chain C|TAX PEPTIDE|Human T-lymphotropic virus 1 (11908)

LLFGYPVYV

(This is the Viral Peptide sequence)



>1AO7_4|Chain D|T CELL RECEPTOR ALPHA|Homo sapiens (9606)

KEVEQNSGPLSVPEGAIASLNCTYSDRGSQSFFWYRQYSGKSPELIMSIYSNGDKEDGRFTAQLNKASQYVSLLIRDSQPSDSATYLCAVTTDSWGKLQFGAGTQVVVTPDIQNPDPAVYQLRDSKSSDKSVCLFTDFDSQTNVSQSKDSDVYITDKTVLDMRSMDFKSNSAVAWSNKSDFACANAFNNSIIPEDTFFPSPESS

(This is the T-Cell Alpha chain protein sequence)


>1AO7_5|Chain E|T CELL RECEPTOR BETA|Homo sapiens (9606)

NAGVTQTPKFQVLKTGQSMTLQCAQDMNHEYMSWYRQDPGMGLRLIHYSVGAGITDQGEVPNGYNVSRSTTEDFPLRLLSAAPSQTSVYFCASRPGLAGGRPEQYFGPGTRLTVTEDLKNVFPPEVAVFEPSEAEISHTQKATLVCLATGFYPDHVELSWWVNGKEVHSGVSTDPQPLKEQPALNDSRYALSSRLRVSATFWQNPRNHFRCQVQFYGLSENDEWTQDRAKPVTQIVSAEAWGRAD

(This is the T-Cell Beta chain protein sequence)

The sequences highlighted in blue are the ones we want to try and reverse engineer on the human genome.
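If you would rather pull the chains out of the downloaded file with a script, a small FASTA reader is enough. The following Python sketch is my own illustration: the function names are hypothetical, and the header layout (fields separated by "|") is simply what the file shown above looks like, not a documented guarantee. Only two of the five records are pasted in to keep the example short.

```python
def parse_fasta(text):
    """Return a dict mapping each FASTA header (without '>') to its sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # the file shown above has blank lines between records
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line
    return records


def chain_description(header):
    """Pick the third '|' separated field, e.g. 'TAX PEPTIDE'."""
    return header.split("|")[2]


fasta_text = """
>1AO7_2|Chain B|BETA-2 MICROGLOBULIN|Homo sapiens (9606)

MIQRTPKIQVYSRHPAENGKSNFLNCYVSGFHPSDIEVDLLKNGERIEKVEHSDLSFSKDWSFYLLYCTEFTPTEKDEYACRVNHVTLSQPCIVKWDRDM

>1AO7_3|Chain C|TAX PEPTIDE|Human T-lymphotropic virus 1 (11908)

LLFGYPVYV
"""

chains = {chain_description(h): seq for h, seq in parse_fasta(fasta_text).items()}
for name, seq in chains.items():
    print(name, len(seq))
```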


We start off by locating where the T-Cell Receptor Alpha gene is located on the human genome:

I just type in TRAV in the GENES field. (Also obtained by just pressing CTRL-G)


Selecting any of them and pressing Enter immediately jumps to Chromosome 14 and highlights where the TRAV genes can be found.


You can also filter the display to only show the required genes starting with TRA or TRB.



Now make sure you have built the local BLAST search database for chromosome 14:



Next step is to paste the T-Cell Receptor Alpha sequence into the first align search box:

KEVEQNSGPLSVPEGAIASLNCTYSDRGSQSFFWYRQYSGKSPELIMSIYSNGDKEDGRFTAQLNKASQYVSLLIRDSQPSDSATYLCAVTTDSWGKLQFGAGTQVVVTPDIQNPDPAVYQLRDSKSSDKSVCLFTDFDSQTNVSQSKDSDVYITDKTVLDMRSMDFKSNSAVAWSNKSDFACANAFNNSIIPEDTFFPSPESS


This will execute the BLAST (Basic Local Alignment Search Tool) command:

tblastn.exe -task tblastn -evalue 1 -num_threads 4 -max_target_seqs 10 -outfmt "6 qaccver saccver pident length mismatch gapopen qstart qend sstart send evalue bitscore frames sseq" -db "E:\Genomes\hg38\Blast\hg38" -query "E:\Genomes\hg38\Temp_hg38\Query.fa" -out "E:\Genomes\hg38\Temp_hg38\QueryResults.txt" -seqidlist "E:\Genomes\hg38\Temp_hg38\QuerySequenceIds.txt"

It will give you the BLAST output:

The Visual Genome Browser will then interpret the matches and provide an output where the gene segment names are presented against the highest matching entries from BLAST.
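The tabular output requested with -outfmt 6 is plain tab-separated text, with one column per field name listed in the command above, so it is easy to read back in a script. The Python sketch below is only an illustration of that: the function name is hypothetical and the two sample rows are invented, not real BLAST results.

```python
# Field names exactly as listed in the -outfmt string of the command above.
COLUMNS = ("qaccver saccver pident length mismatch gapopen qstart qend "
           "sstart send evalue bitscore frames sseq").split()
INT_FIELDS = ("length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send")
FLOAT_FIELDS = ("pident", "evalue", "bitscore")


def parse_hits(text):
    """Parse tab-separated BLAST rows into dictionaries with typed values."""
    hits = []
    for line in text.splitlines():
        if not line.strip():
            continue
        row = dict(zip(COLUMNS, line.split("\t")))
        for key in INT_FIELDS:
            row[key] = int(row[key])
        for key in FLOAT_FIELDS:
            row[key] = float(row[key])
        hits.append(row)
    return hits


# Two invented rows, for illustration only (not real BLAST output).
sample = (
    "Query\tchr14\t100.0\t20\t0\t0\t92\t111\t9000\t9059\t3e-07\t45.1\t0/3\tDSWGK\n"
    "Query\tchr14\t97.5\t40\t1\t0\t1\t40\t5000\t5119\t2e-18\t88.2\t0/2\tKEVEQ\n"
)

# Sort by position in the query to see which part of the protein each hit covers.
hits = sorted(parse_hits(sample), key=lambda h: h["qstart"])
for h in hits:
    print(h["qstart"], h["qend"], h["sstart"], h["send"], h["pident"], h["frames"])
```

Ordering the hits by their position in the query is what makes it possible to say which stretch of the receptor protein is best explained by which region of the chromosome.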


The ALPHABET letters will indicate which gene segments are most likely to match that part of the T-Cell Receptor sequence.


This indicates that this T-Cell Alpha sequence is highly likely made up of:

TRAV12-2  (Variable gene segment)

TRAJ24 (Joining gene segment)

TRAC (Constant gene segment)

And this is indeed the case. When I align the query sequence from the Protein Databank with the protein from the HG38 Human genome sequence I get:

The top sequence represents the query sequence while the bottom sequence represents the actual amino acids obtained from the human reference genome HG38.

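From this output on the "Comparisons" tab you can see that there is 99.02% identity with the query sequence for this combination of V, J and C segments.

202 of the 204 query amino acids match exactly (202/204 = 99.02%).  The assembled reference protein is 275 amino acids long, so 73 of its positions are reported as different; most of these are the leading and trailing amino acids of the reference protein, which are not part of the query sequence.

The join between V and J happens after amino acid 113 and the constant segment starts at amino acid 135.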

We get the best match when we select:
TRAV12-2 => 1-113  (Bases=340/3  remaining bases=1)
TRAJ24 => 114-134  (Bases=63/3   remaining bases=0)
TRAC => 135-276    (Bases=425/3  remaining bases=2)

Similarity = 99.02 %   (Exact:202 + Similar:0)/Total:275 Diff:73  (202/275)


1         11        21        31        41        51        61        71

MKSLRVLLVILWLQLSWVWSQQKEVEQNSGPLSVPEGAIASLNCTYSDRGSQSFFWYRQYSGKSPELIMFIYSNGDKEDG

81        91        101       111       121       131       141       151       

RFTAQLNKASQYVSLLIRDSQPSDSATYLCAVNMTTDSWGKFQFGAGTQVVVTPDIQNPDPAVYQLRDSKSSDKSVCLFT

161       171       181       191       201       211       221       231       

DFDSQTNVSQSKDSDVYITDKTVLDMRSMDFKSNSAVAWSNKSDFACANAFNNSIIPEDTFFPSPESSCDVKLVEKSFET

241       251       261       271        

DTNLNFQNLSVIGFRILLLKVAGFNLLMTLRLWSS

There is no Diversity (D) segment in the Alpha chain.

This provides the used bases as well as the resulting amino acids:
>chr14:22519968-22520030 (Bases=63, Codons=21) 
GTGACAACTGACAGCTGGGGGAAATTCCAGTTTGGAGCAGGGACCCAGGTTGTGGTCACCCCA

>Protein of chr14:22519968-22520030 (L=21)
VTTDSWGKFQFGAGTQVVVTP
which matches the sequence we are looking for, apart from the first amino acid:
MTTDSWGKFQFGAGTQVVVTP
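The translation shown above can be reproduced with a few lines of Python. This is only a sketch using the standard genetic code; the function name translate is my own and has nothing to do with the browser's internals.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino
    for codon, amino in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate DNA in its first reading frame; trailing 1-2 bases are ignored."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    # Unknown codons (for example containing N) are shown as X.
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


# TRAJ24 bases as listed above for chr14:22519968-22520030.
TRAJ24 = "GTGACAACTGACAGCTGGGGGAAATTCCAGTTTGGAGCAGGGACCCAGGTTGTGGTCACCCCA"

protein = translate(TRAJ24)
print(len(TRAJ24), "bases ->", protein)
```

This prints the same 21 amino acids as the browser output, VTTDSWGKFQFGAGTQVVVTP.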




When we go to the TRAJ24 gene segment in the genome:


Make sure the display settings are as follows:



We can search for the protein, allowing 1 mismatch in the amino acid sequence, by putting the MTTDSWGKFQFGAGTQVVVTP sequence in the Search box. This will search for the protein in all 6 reading frames.
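To make the idea of a six-frame search with mismatches concrete, here is a small Python sketch. It is an assumption about how such a search can work in general, not a description of the browser's actual algorithm; the function names and the (frame, position, mismatches) result format are my own.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino
    for codon, amino in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate DNA in its first reading frame, ignoring trailing bases."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


def six_frames(dna):
    """Return {frame: protein} for frames +1, +2, +3 and -1, -2, -3."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(reverse[offset:])
    return frames


def search(dna, query, max_mismatches=0):
    """Return (frame, protein_index, mismatches) for each window within the limit."""
    found = []
    for frame, protein in six_frames(dna).items():
        for start in range(len(protein) - len(query) + 1):
            window = protein[start:start + len(query)]
            mismatches = sum(a != b for a, b in zip(window, query))
            if mismatches <= max_mismatches:
                found.append((frame, start, mismatches))
    return found


TRAJ24 = "GTGACAACTGACAGCTGGGGGAAATTCCAGTTTGGAGCAGGGACCCAGGTTGTGGTCACCCCA"
QUERY = "MTTDSWGKFQFGAGTQVVVTP"

print(search(TRAJ24, QUERY, max_mismatches=0))
print(search(TRAJ24, QUERY, max_mismatches=1))
```

With 0 mismatches nothing is found in these 63 bases, while allowing 1 mismatch gives a single hit in frame +1, which is why the mismatches field is needed here.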


Because we know one or more amino acids might not match when a gene segment's length is not a multiple of 3 bases (so that codons straddle the joins), we put 1 in the mismatches field as depicted above.  This highlights the matching sequence in the genome:


You can actually use a tool to determine what amino acid sequence is coded for by demarcating it in the genome as follows: 

After using the protein coding tool, clear the search field to reveal the newly encoded protein:




Also change the display settings as shown in order to have the browser show protein letters on the genome.

You can now observe the protein in the new reading frame: TTDSWGKFQFGAGTQVVVT


There is a tool in the Human Genome Browser that allows you to play around with selecting different gene segments to see how good a match you can get.








Now let us do the same with the T-Cell Receptor Beta chain:


>1AO7_5|Chain E|T CELL RECEPTOR BETA|Homo sapiens (9606)

NAGVTQTPKFQVLKTGQSMTLQCAQDMNHEYMSWYRQDPGMGLRLIHYSVGAGITDQGEVPNGYNVSRSTTEDFPLRLLSAAPSQTSVYFCASRPGLAGGRPEQYFGPGTRLTVTEDLKNVFPPEVAVFEPSEAEISHTQKATLVCLATGFYPDHVELSWWVNGKEVHSGVSTDPQPLKEQPALNDSRYALSSRLRVSATFWQNPRNHFRCQVQFYGLSENDEWTQDRAKPVTQIVSAEAWGRAD


We navigate to Chromosome 7 and again use a filter:   op_gene^TRB





After again running "Search Align", it has found:

TRBV6-5 to be the Variable segment

TRBC2 to be the constant segment

But we are not sure which makes up the Diversity (D) and Joining (J) segments.

This time we start by selecting the TRBV6-5 and TRBC2 segments, of which we are more than 96% certain.

Then we hold CTRL while selecting any joining segment.  This goes through all of the joining segments, compares each resulting protein with the query sequence entered in the top text box on the Comparisons tab, and keeps the best-matching one.
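The "try every joining segment and keep the best one" idea can be sketched in Python as follows. This is a simplification and an assumption on my part: the browser compares the whole assembled protein, whereas this sketch only scores each J fragment against an excerpt of the query. The TRBJ2-7 fragment comes from this post; candidate_A and candidate_B are invented placeholders, not real gene segments.

```python
# Excerpt of the beta chain query (1AO7 chain E) around the V-D-J-C joins.
QUERY_EXCERPT = "QTSVYFCASRPGLAGGRPEQYFGPGTRLTVTEDLKNVFPPEVAVFEPSEAEISHTQK"

CANDIDATES = {
    "TRBJ2-7": "SYEQYFGPGTRLTVT",       # fragment shown in this post
    "candidate_A": "NTEAFFGQGTRLTVV",   # invented placeholder
    "candidate_B": "SYNEQFFGPGTRLTVL",  # invented placeholder
}


def best_ungapped_identity(fragment, query):
    """Slide the fragment along the query and return the highest identity count."""
    best = 0
    for start in range(len(query) - len(fragment) + 1):
        window = query[start:start + len(fragment)]
        best = max(best, sum(a == b for a, b in zip(fragment, window)))
    return best


def pick_best(candidates, query):
    """Score every candidate and return (best_name, all_scores)."""
    scores = {
        name: best_ungapped_identity(fragment, query)
        for name, fragment in candidates.items()
    }
    best_name = max(scores, key=scores.get)
    return best_name, scores


best_name, scores = pick_best(CANDIDATES, QUERY_EXCERPT)
print(best_name, scores)
```

In this toy example TRBJ2-7 wins with 13 of its 15 amino acids identical to the query; the two that differ sit at the start of the fragment, next to the join.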

After doing this we find that the diversity and joining segments are likely:

TRBD1

and

TRBJ2-7



The following protein sequence is assembled from:

TRBV6-5 => 1-114      (Bases=344/3  remaining bases=2)
TRBD1 => 115-118      (Bases=12/3   remaining bases=0)
TRBJ2-7 => 119-134    (Bases=47/3   remaining bases=2)
TRBC2 => 135-314      (Bases=539/3  remaining bases=2)

1         11        21        31        41        51        61        71          

MSIGLLCCAALSLLWAGPVNAGVTQTPKFQVLKTGQSMTLQCAQDMNHEYMSWYRQDPGMGLRLIHYSVGAGITDQGEVP

81        91        101       111       121       131       141       151  

NGYNVSRSTTEDFPLRLLSAAPSQTSVYFCASSYSGQGASYEQYFGPGTRLTVTEDLKNVFPPKVAVFEPSEAEISHTQK

161       171       181       191       201       211       221       231    

ATLVCLATGFYPDHVELSWWVNGKEVHSGVSTDPQPLKEQPALNDSRYCLSSRLRVSATFWQNPRNHFRCQVQFYGLSEN

241       251       261       271       281       291       301       311    

DEWTQDRAKPVTQIVSAEAWGRADCGFTSESYQQGVLSATILYEILLGKATLYAVLVSALVLMAMVKRKDSRG

If you look at the remaining bases, you can see that bases left over from one segment still contribute to the next segment. This is because segments are not always a multiple of 3 bases long, so they do not always end on a full codon.
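The arithmetic behind these ranges can be checked with a short Python sketch. It assumes that a codon is counted for the segment in which its third base falls, and that "remaining bases" means the segment length modulo 3; both assumptions are mine, but they reproduce the ranges listed in this post for both the Alpha and the Beta chain.

```python
def segment_ranges(segments):
    """Map (name, base_count) pairs to (name, first_aa, last_aa, remaining_bases).

    A codon is attributed to the segment that supplies its third base, so
    leftover bases from one segment are completed by the next segment.
    """
    ranges = []
    total_bases = 0
    for name, bases in segments:
        first_aa = total_bases // 3 + 1   # first codon completed in this segment
        total_bases += bases
        last_aa = total_bases // 3        # last codon completed so far
        ranges.append((name, first_aa, last_aa, bases % 3))
    return ranges


ALPHA = [("TRAV12-2", 340), ("TRAJ24", 63), ("TRAC", 425)]
BETA = [("TRBV6-5", 344), ("TRBD1", 12), ("TRBJ2-7", 47), ("TRBC2", 539)]

for chain in (ALPHA, BETA):
    for name, first_aa, last_aa, remaining in segment_ranges(chain):
        print(f"{name} => {first_aa}-{last_aa}  (remaining bases={remaining})")
    print()
```

For example, TRBV6-5 has 344 bases, which is 114 full codons plus 2 bases; those 2 bases need 1 base from TRBD1 to form codon 115.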

When we jump to the TRBJ2-7 gene segment we can see the Joining gene segment:



The browser will now show the amino acids that are encoded in the normal reading frame:


We want to see how this segment produces: ASYEQYFGPGTRLTVT


This will then show where the protein sequence matches on the genome in one of the 6 reading frames and we see that there is a protein which starts 2 bases earlier:


We now use the feature that will look for proteins coded on the genome:

This will generate the protein that is formed by this reading frame:

>chr7:142797454-142797501 (Bases=48, Codons=16)
TGCTCCTACGAGCAGTACTTCGGGCCGGGCACCAGGCTCACGGTCACA

>Protein of chr7:142797454-142797501 (L=16)
CSYEQYFGPGTRLTVT



When we then compare it with the protein sequence we are looking for:
ASYEQYFGPGTRLTVT  
CSYEQYFGPGTRLTVT

(The first letter mismatches because the number of bases at the join between D and J is not a multiple of 3, so they do not make up a full codon of 3 bases.)

We follow the same procedure by going to gene segment TRBD1:


We again put SGQGA in the search box:


>chr7:142786211-142786225 (Bases=15, Codons=5)
TGGGGACAGGGGGCC

>Protein of chr7:142786211-142786225 (L=5)
WGQGA
SGQGA
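The two comparisons above (for the J and the D segment) can be checked position by position with a tiny Python helper. The function name and the 1-based position convention are my own choices for this sketch.

```python
def find_mismatches(expected, observed):
    """Return (position, expected_aa, observed_aa) for each difference (1-based)."""
    if len(expected) != len(observed):
        raise ValueError("sequences must have the same length")
    return [
        (position, a, b)
        for position, (a, b) in enumerate(zip(expected, observed), start=1)
        if a != b
    ]


# (wanted sequence from the assembled protein, sequence encoded by the segment alone)
COMPARISONS = {
    "TRBJ2-7": ("ASYEQYFGPGTRLTVT", "CSYEQYFGPGTRLTVT"),
    "TRBD1": ("SGQGA", "WGQGA"),
}

for name, (wanted, encoded) in COMPARISONS.items():
    print(name, find_mismatches(wanted, encoded))
```

In both cases only the first amino acid differs, which is the codon that straddles the join with the preceding segment.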


So in summary: we have now used a local BLAST search in addition to a method which constructed proteins by searching for the best match to the reference sequence.  This allowed us to get very close matches to the Protein Data Bank proteins:

T-Cell Receptor Alpha:
TRAV12-2 => 1-113  
TRAJ24 => 114-134  
TRAC => 135-276    
Similarity = 99.02 %  

T-Cell Receptor Beta:
TRBV6-5 => 1-114   
TRBD1 => 115-118      
TRBJ2-7 => 119-134    
TRBC2 => 135-314     
Similarity = 96.122 % 

If you want to learn more about the activation of Cytotoxic (CD8 positive or Killer) T-Cells via Dendritic cells, please read the following excellent article on the topic:

Activation of CD8 T Lymphocytes during Viral Infections