CS50 PSet 6: DNA

JR
4 min readDec 26, 2020

A guide to the ‘DNA’ problem in CS50 Week 6.

Goal: To write a python script that can identify someone from a database, based on their DNA sequence.

The script must be called with two additional command line arguments; a csv database containing the number of times that particular sequences of characters repeat in a list of people’s DNA sequences, and a DNA sequence .txt file that we will be analysing and assigning an owner to based on the database in the first argument.

The particular sequences in question are known as ‘Short Tandem Requests’ or STRs. These are given in the top row of the database csv files, for example ‘AATG’ or ‘TATC’.

Just like in C, our Python script must begin with the importing of any modules we will be using throughout. The syntax is slightly different, with import being used rather than #include. For this script we want to import reader and DictReader from the csv library for processing of .csv files, and argv and exit from the sys library for handling command line arguments and exit codes respectively.

Within the main() function, we begin by ensuring that the correct number of command line arguments are given using len(argv). In python the script itself counts as a command line argument, plus the database and the DNA sequence described earlier. If an incorrect amount are specified, the exit() method that was imported earlier can be called to generate an exit code.

If this check passes, variable names can be assigned to the database and sequence arguments, for ease of use throughout the script.

from csv import reader, DictReader
from sys import argv, exit
def main(): # Handle command line arguments
if len(argv) != 3:
print("Usage: python dna.py data.csv sequence.txt")
exit(1)
db_path = argv[1]
seq_path = argv[2]

Now that the command line arguments have been validated, the next step is to open each of the files and convert them into formats that can be worked on.

First the database .csv file. I used the DictReader() method here, which converts each row in the file to a dictionary of key value pairs. By default the keys will be taken from the top row of the file unless specified, which is what we want as the database is in a table format. This reader object is also stored here as a list of dictionaries, which will be accessed later.

The sequence file, which is a .txt file with a single row, can be read using the read() method on the file object. This returns the contents of the file as a string.

    # Open csv file and convert to dict
with open(db_path, "r") as csvfile:
reader = DictReader(csvfile)
dict_list = list(reader)
# Open sequences file and convert to list
with open(seq_path, "r") as file:
sequence = file.read()

Now that the input files have been parsed into data structures that can be read and operated on, we can begin to find the longest consecutive sequence of the STRs given at the top of the database file in the sequence file we are assessing.

Firstly define an empty array that will store the max consecutive sequence of each STR. Next we can loop through each of the STRs and check them in sequence. These are stored in the fieldnames attribute of the reader object defined previously. Note that this should start from 1 as column 0 stores the names of the people in the database. Here the STR being checked is stored in a variable and a new value is being added to to the array storing the lengths of the longest sequences for each STR.

The inner loop here iterates through the sequence being assessed and for each character checks if the STR follows. If it does, the following characters are checked to see if the STR repeats, adding one to the count for each repetition. If the count is larger than the current maximum, then the current maximum for that STR is overwritten.

With an extra value being added to the max counts array for each STR in the database, the final populated array can be checked against the rows in the database.

    # For each STR, compute longest run of consecutive repeats in      sequence
max_counts = []
for i in range(1, len(reader.fieldnames)):
STR = reader.fieldnames[i]
max_counts.append(0)
# Loop through sequence to find STR
for j in range(len(sequence)):
STR_count = 0
# If match found, start counting repeats
if sequence[j:(j + len(STR))] == STR:
k = 0
while sequence[(j + k):(j + k + len(STR))] == STR:
STR_count += 1
k += len(STR)
# If new maximum of repeats, update max_counts
if STR_count > max_counts[i - 1]:
max_counts[i - 1] = STR_count

The list of dictionaries defined earlier is employed here, as there will be one dictionary for each person in the database. For each person in the database, the list of STRs is cycled through and the max counts array is compared to the max STR sequences for that person. Here dict_list[i] represents a dictionary and [reader.fieldnames[j]] returns the key to be looked up in that dictionary. If there is a match, the number of matches is incremented. If the number of matches equals the number of STRs in the database then the sequence being checked macthes someone in the database and the name of the person can be returned. Otherwise, ‘No match’ is printed.

    # Compare against data
for i in range(len(dict_list)):
matches = 0
for j in range(1, len(reader.fieldnames)):
if int(max_counts[j - 1]) == int(dict_list[i] [reader.fieldnames[j]]):
matches += 1
if matches == (len(reader.fieldnames) - 1):
print(dict_list[i]['name'])
exit(0)
print("No match")main()

Hopefully that was an eye opener to the power of Python having spent 5 weeks on C, and a great introduction to the syntax and data structures used.

--

--