# Topics:
DNA.java focuses on arrays and file/text processing.
DNA requires two input files, dna.txt and ecoli.txt.
These files are saved in the spec/ folder. The folder can be changed with INPUT\_PATH.
DNA processes the input files, parses the genomic information, and saves the output in a new file.
## Learning Outcomes:
# Background
DNA consists of long chains of chemical compounds called nucleotides.
Four nucleotides are present in DNA: Adenine (A), Cytosine (C), Guanine (G), and Thymine (T).
These are organized into subunits called "codons", each made up of three nucleotides.
Most genes contain codons for building proteins; some do not.
Each codon (e.g. TAC or GGA) uniquely encodes a single amino acid, a building block of proteins.
The sequences of DNA that encode proteins occur between a start codon (which we will assume to be ATG) and a stop codon (which is any of TAA, TAG, or TGA).
Not all regions of DNA are genes; large portions do not lie between a start codon and a stop codon.
Start: ATG.
Stop: TAA, TAG, TGA.
Often high percentages of Cytosine (C) and Guanine (G) are indicators of important genetic data.
# Input
- The DNA input data consists of line pairs.
- The input file exists, is readable, and contains valid input.
- The first line has the name of the nucleotide sequence.
- The second is the nucleotide sequence itself.
- Each character in a sequence of nucleotides will be A, C, G, T, or a dash character, "-", either upper or lowercase.
- You may assume that each sequence's number of nucleotides (without dashes) will be a multiple of 3.
## Example
> cure for cancer protein
> ATGCCACTATGGTAG
> captain picard hair growth protein
> ATgCCAACATGgATGCCcGATAtGGATTgA
> bogus protein
> CCATt-AATgATCa-CAGTt
> ...
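
The name/sequence line pairs described above could be read along these lines. This is a minimal sketch, not the project's actual code: the class `LinePairSketch` and the method `readPairs` are hypothetical names, and the `Scanner` wraps a `String` so the example runs without dna.txt.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical helper class, not part of DNA.java.
class LinePairSketch {
    // Reads (name, sequence) line pairs; each pair is returned as {name, SEQUENCE}.
    static List<String[]> readPairs(Scanner input) {
        List<String[]> pairs = new ArrayList<>();
        while (input.hasNextLine()) {
            String name = input.nextLine();
            if (!input.hasNextLine()) {
                break; // a trailing name without a sequence line is ignored
            }
            // Sequences may mix upper and lowercase, so normalize here.
            String sequence = input.nextLine().toUpperCase();
            pairs.add(new String[] {name, sequence});
        }
        return pairs;
    }

    public static void main(String[] args) {
        String sample = "cure for cancer protein\nATGCCACTATGGTAG\n"
                + "bogus protein\nCCATt-AATgATCa-CAGTt\n";
        for (String[] pair : readPairs(new Scanner(sample))) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```

To read a real input file instead, the same method can be given `new Scanner(new File(path))`, which requires handling `FileNotFoundException`.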
# Process:
- Convert the original nucleotide sequence string to nucleotide counts.
- Convert the nucleotide counts to mass percentages.
- Convert the original nucleotide sequence string to codon triplets.
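
The first two conversions can be sketched as follows. This is an illustration under stated assumptions rather than the project's implementation: `MassPercentSketch` and its method names are hypothetical, the masses are the ones listed in this README (with 100.000 for each junk dash), and rounding is applied only to the reported percentages.

```java
import java.util.Arrays;

// Hypothetical helper class, not part of DNA.java.
class MassPercentSketch {
    // Masses in grams/mol, in A, C, G, T order.
    static final double[] MASSES = {135.128, 111.103, 151.128, 125.107};
    // Junk ('-') is not counted as a nucleotide but adds to the total mass.
    static final double JUNK_MASS = 100.000;

    // Counts As, Cs, Gs, and Ts (case-insensitive); dashes are skipped.
    static int[] counts(String sequence) {
        int[] counts = new int[4];
        for (int i = 0; i < sequence.length(); i++) {
            int index = "ACGT".indexOf(Character.toUpperCase(sequence.charAt(i)));
            if (index >= 0) {
                counts[index]++;
            }
        }
        return counts;
    }

    // Total mass of the sequence, including junk.
    static double totalMass(String sequence) {
        int[] counts = counts(sequence);
        int nucleotides = 0;
        double total = 0.0;
        for (int i = 0; i < counts.length; i++) {
            nucleotides += counts[i];
            total += counts[i] * MASSES[i];
        }
        // Assumes valid input, so every remaining character is a dash.
        int junk = sequence.length() - nucleotides;
        return total + junk * JUNK_MASS;
    }

    // Percentage of the total mass per nucleotide, rounded to one decimal place.
    static double[] massPercentages(String sequence) {
        int[] counts = counts(sequence);
        double total = totalMass(sequence);
        double[] percentages = new double[counts.length];
        for (int i = 0; i < counts.length; i++) {
            double percent = counts[i] * MASSES[i] / total * 100.0;
            percentages[i] = Math.round(percent * 10.0) / 10.0;
        }
        return percentages;
    }

    public static void main(String[] args) {
        String sequence = "ATGCCACTATGGTAG";
        System.out.println(Arrays.toString(counts(sequence)));           // [4, 3, 4, 4]
        System.out.println(Arrays.toString(massPercentages(sequence)));  // [27.3, 16.8, 30.6, 25.3]
        System.out.println(Math.round(totalMass(sequence) * 10.0) / 10.0); // 1978.8
    }
}
```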
# Output:
For each nucleotide sequence:
- nucleotide counts
  - Counts the occurrences of each of the four nucleotides ['A', 'C', 'G', 'T'].
- mass percentages
  - Calculates the mass percentage occupied by each nucleotide type, rounded to one digit past the decimal point.
- codons
  - Reports the codons (e.g. TAG) present in each sequence.
- protein-coding gene
  - Reports whether the sequence is a protein-coding gene, meaning it:
    - begins with a valid start codon ["ATG"]
    - ends with a valid stop codon ["TAA", "TAG", or "TGA"]
    - contains at least 5 total codons (including its initial start codon and final stop codon)
    - has Cytosine (C) and Guanine (G) combined accounting for at least 30% of its total mass
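
The four protein-coding criteria above can be checked in a few lines. The sketch below is only one possible reading: `ProteinCheckSketch` and `isProtein` are hypothetical names, the codons are assumed to be uppercase triplets with junk already removed, and the C and G percentages are assumed to be passed in as already computed values.

```java
// Hypothetical helper class, not part of DNA.java.
class ProteinCheckSketch {
    static final int MIN_CODONS = 5;        // minimum codons in a valid protein
    static final int MIN_MASS_PERCENT = 30; // minimum combined mass percentage of C and G

    // codons: uppercase triplets; cPercent and gPercent: mass percentages of C and G.
    static boolean isProtein(String[] codons, double cPercent, double gPercent) {
        if (codons.length < MIN_CODONS) {
            return false; // also guards the first/last lookups below
        }
        String first = codons[0];
        String last = codons[codons.length - 1];
        boolean startsOk = first.equals("ATG");
        boolean stopsOk = last.equals("TAA") || last.equals("TAG") || last.equals("TGA");
        return startsOk && stopsOk && cPercent + gPercent >= MIN_MASS_PERCENT;
    }

    public static void main(String[] args) {
        String[] codons = {"ATG", "CCA", "CTA", "TGG", "TAG"};
        // C and G percentages taken from the "cure for cancer protein" example.
        System.out.println(isProtein(codons, 16.8, 30.6)); // true
    }
}
```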
## Example
> This program reports information about DNA
> nucleotide sequences that may encode proteins.
> Input file name? dna.txt
> Output file name? output.txt
> Output file output.txt after above execution (partial):
> Region Name: cure for cancer protein
> Nucleotides: ATGCCACTATGGTAG
> Nuc. Counts: [4, 3, 4, 4]
> Total Mass%: [27.3, 16.8, 30.6, 25.3] of 1978.8
> Codons List: [ATG, CCA, CTA, TGG, TAG]
> Is Protein?: YES
> ...
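
One record in the format above could be written with a `PrintStream` roughly as follows. This is a sketch under assumptions: `ReportSketch` and `printRecord` are hypothetical names, the values are passed in already computed, and the `NO` label for non-proteins is a guess, since the example only shows `YES`.

```java
import java.io.PrintStream;
import java.util.Arrays;

// Hypothetical helper class, not part of DNA.java.
class ReportSketch {
    // Prints one region in the layout of the example output.
    static void printRecord(PrintStream out, String name, String nucleotides,
                            int[] counts, double[] percentages, double totalMass,
                            String[] codons, boolean isProtein) {
        double roundedTotal = Math.round(totalMass * 10.0) / 10.0;
        out.println("Region Name: " + name);
        out.println("Nucleotides: " + nucleotides.toUpperCase());
        out.println("Nuc. Counts: " + Arrays.toString(counts));
        out.println("Total Mass%: " + Arrays.toString(percentages) + " of " + roundedTotal);
        out.println("Codons List: " + Arrays.toString(codons));
        // "NO" is an assumed label; only "YES" appears in the example.
        out.println("Is Protein?: " + (isProtein ? "YES" : "NO"));
    }

    public static void main(String[] args) {
        printRecord(System.out, "cure for cancer protein", "ATGCCACTATGGTAG",
                new int[] {4, 3, 4, 4}, new double[] {27.3, 16.8, 30.6, 25.3}, 1978.761,
                new String[] {"ATG", "CCA", "CTA", "TGG", "TAG"}, true);
    }
}
```

Passing `new PrintStream(new File("output.txt"))` instead of `System.out` writes the same lines to a file and overwrites any existing content.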
# Approach
## Constants
```java
public static final int MIN_CODONS = 5; // the minimum number of codons a valid protein must have
public static final int MIN_MASS_PERCENT = 30; // the minimum percentage of mass from C and G for a protein to be valid
public static final int UNIQUE_NUCLEOTIDES = 4; // the number of unique nucleotides (representing A, C, G, and T)
public static final int NUCLEOTIDES_PER_CODON = 3; // the number of nucleotides per codon
```
```java
/*
*
* Begins with an introduction and prompts for input and output file names.
* Reads the input file to process its nucleotide sequences.
* Outputs the results into the given output file.
*
*/
public static void main(String[] args) {
}
```
```java
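import java.util.Scanner;

/*
*
* Prompts for the input and output file names.
* Assumes the user will type the name of an existing input file
* whose contents are in the proper format.
*
* @param console Scanner reading the user's answers
* @return files array of input and output files for processing and saving
*
*/
public static String[] promptForFileNames(Scanner console) {
    System.out.print("Input file name? ");
    String input = console.nextLine();
    System.out.print("Output file name? ");
    String output = console.nextLine();
    return new String[] {input, output};
}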
```
```java
/*
*
* Converts a nucleotide char to its molar mass.
*
* To compute mass percentages, use the following as the mass of each nucleotide (grams/mol).
* "Junk" regions are excluded from many parts of the computations, but they do contribute mass to the total.
* Adenine (A): 135.128
* Cytosine (C): 111.103
* Guanine (G): 151.128
* Thymine (T): 125.107
* Junk (-): 100.000
*
* @param nucleotide char of nucleotide.
* @return molarMass in grams/mol, unrounded.
*
*/
public static double nucleotideToMolarMass(char nucleotide) {
    // Masses in A, C, G, T order.
    double[] masses = {135.128, 111.103, 151.128, 125.107};
    double junkMass = 100.000;
    int index = "ACGT".indexOf(Character.toUpperCase(nucleotide));
    // Not rounded here: rounding each mass to a tenth would turn the example
    // total of 1978.8 into 1978.5. Round when reporting instead.
    return index >= 0 ? masses[index] : junkMass;
}
```
```java
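/*
*
* Passes over a nucleotide sequence and counts the number of As, Cs, Gs, and Ts.
* Uses String's charAt method to get individual characters.
*
* @param sequence sequence of nucleotides
* @return counts array of size 4, in A, C, G, T order.
*
*/
public static int[] nucleotideCountFromSequence(String sequence) {
    int[] counts = new int[UNIQUE_NUCLEOTIDES];
    for (int i = 0; i < sequence.length(); i++) {
        char nucleotide = Character.toUpperCase(sequence.charAt(i));
        if (nucleotide != '-') { // junk is not counted
            counts[nucleotideIndexFromChar(nucleotide)]++;
        }
    }
    return counts;
}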
```
```java
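/*
*
* Converts a single character (A, C, G, or T, in either case) into an index (0 to 3).
*
* @param nucleotide char of nucleotide.
* @return index int mapping of nucleotide, or -1 for any other character.
*
*/
public static int nucleotideIndexFromChar(char nucleotide) {
    // The A, C, G, T order matches the order used for counts and masses.
    return "ACGT".indexOf(Character.toUpperCase(nucleotide));
}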
```
```java
/*
*
* Eliminates the junk characters ('-') from a sequence.
*
* @param sequence raw sequence
* @return sequence without junk DNA '-'.
*
*/
public static String nucleotidesOnlyFromSequence(String sequence) {
    return sequence.replace("-", "");
}
```
```java
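/*
*
* Breaks apart the sequence into codons so that each codon can be examined.
*
* @param sequence String of nucleotides, either raw or with junk already removed.
* @return codons Array of strings of codons, in uppercase.
*
*/
public static String[] codonsFromSequence(String sequence) {
    String nucleotides = nucleotidesOnlyFromSequence(sequence).toUpperCase();
    int size = 3; // nucleotides per codon
    String[] codons = new String[nucleotides.length() / size];
    for (int i = 0; i < codons.length; i++) {
        codons[i] = nucleotides.substring(i * size, (i + 1) * size);
    }
    return codons;
}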
```
```java
/*
*
* Saves output to file using PrintStream as described in Section 6.4 of the textbook.
* Overwrites any existing data in the output file (this is the default PrintStream behavior).
* The nucleotide sequence is output in uppercase.
* The nucleotide counts and mass percentages are shown in A, C, G, T order.
*
* @param path String path of output filename
*
*/
public static void saveOutput(String path) {
}
```