문제해결(PS)/ROSALIND

Global Alignment with Scoring Matrix

곰탱이장 2024. 11. 18. 13:55

https://rosalind.info/problems/glob/

 

ROSALIND | Global Alignment with Scoring Matrix

It appears that your browser has JavaScript disabled. Rosalind requires your browser to be JavaScript enabled. Global Alignment with Scoring Matrix solved by 868 Generalizing the Alignment Score The edit alignment score in “Edit Distance Alignment” cou

rosalind.info

Problem

To penalize symbol substitutions differently depending on which two symbols are involved in the substitution, we obtain a scoring matrix SS in which Si,jSi,j represents the (negative) score assigned to a substitution of the iith symbol of our alphabet AA with the jjth symbol of AA.

A gap penalty is the component deducted from alignment score due to the presence of a gap. A gap penalty may be a function of the length of the gap; for example, a linear gap penalty is a constant gg such that each inserted or deleted symbol is charged gg; as a result, the cost of a gap of length LL is equal to gLgL.

Given: Two protein strings ss and tt in FASTA format (each of length at most 1000 aa).

Return: The maximum alignment score between ss and tt. Use:

Sample Dataset

>Rosalind_67
PLEASANTLY
>Rosalind_17
MEANLY

Sample Output

8

 

 이 문제는 scoring matrix를 참고하여 두 서열 사이의 얼라이먼트 중 최고 점수를 출력하는 문제이다. scoring matrix는 alignment시 두 문자가 align할 때의 점수를 매기는 식이다. PP와 PA로 할때 P-P는 몇 점 P-A는 몇 점해서 총 점수로 더해진다. 

 이 문제의 scoring matrix는 BLOSUM62이고 gap penalty는 -5.0이다. 우리가 edit distance alignment를 할 때 edit distance가 최소가 되도록 2차원 배열 안에서 돌았듯이, 이 문제에서는 score가 최대가 되도록 다이나믹 프로그래밍으로 풀면된다.

from Bio import SeqIO
from Bio.Align import substitution_matrices

matrix = substitution_matrices.load('BLOSUM62')
gap_penalty = -5.0

def edit_distance(s1,s2):
    array=[[[0,'',''] for _ in range(len(s2))] for _ in range(len(s1))]

    for i in range(len(s1)):
        for j in range(len(s2)):
            if i==0:
                array[i][j][0] = -5.0 * j
            elif j==0:
                array[i][j][0] = -5.0 * i
            elif s1[i] == s2[j]:
                array[i][j][0] = array[i-1][j-1][0]+matrix[s1[i]][s2[j]]
                array[i][j][1] = array[i-1][j-1][1] + s1[i]
                array[i][j][2] = array[i-1][j-1][2] + s2[j]
            else:
                temp=max(
                    [array[i-1][j][0]+gap_penalty,
                     array[i][j-1][0]+gap_penalty,
                     array[i-1][j-1][0]+matrix[s1[i]][s2[j]]])
                array[i][j][0] = temp
                if array[i-1][j][0]+gap_penalty == temp:#1에 add
                    array[i][j][1] = array[i-1][j][1] + s1[i]
                    array[i][j][2] = array[i-1][j][2] + '-'
                elif array[i][j-1][0]+gap_penalty == temp:#2에 add
                    array[i][j][1] = array[i][j-1][1] + '-'
                    array[i][j][2] = array[i][j-1][2] + s2[j]
                elif array[i-1][j-1][0]+matrix[s1[i]][s2[j]] == temp:#치환이므로 각각
                    array[i][j][1] = array[i-1][j-1][1] + s1[i]
                    array[i][j][2] = array[i-1][j-1][2] + s2[j]


    return array[len(s1)-1][len(s2)-1]

if __name__ == '__main__':
    with open(r"파일경로",'r') as fa:
        seqs=[]
        for s in SeqIO.parse(fa,'fasta'):
            seqs.append(s.seq)
        S1=' '+str(seqs[0])
        S2=' '+str(seqs[1])

print(int(edit_distance(S1,S2)[0]))