Counting Optimal Alignments

문제해결(PS)/ROSALIND

Counting Optimal Alignments

곰탱이장 2024. 11. 16. 09:42

ROSALIND | Counting Optimal Alignments

It appears that your browser has JavaScript disabled. Rosalind requires your browser to be JavaScript enabled. Counting Optimal Alignments solved by 460 2013년 1월 18일 4:24:30 오후 by Rosalind Team Topics: Alignment, Combinatorics Beware of Alignment

rosalind.info

Problem

Recall from “Edit Distance Alignment” that if $s^{'}$ and $t^{'}$ are the augmented strings corresponding to an alignment of strings $s$ and $t$ , then the edit alignment score of $s^{'}$ and $t^{'}$ was given by the Hamming distance $d_{H} (s^{'}, t^{'})$ (because $s^{'}$ and $t^{'}$ have the same length and already include gap symbols to denote insertions/deletions).

As a result, we obtain $d_{E} (s, t) = min_{s^{'}, t^{'}} d_{H} (s^{'}, t^{'})$ , where the minimum is taken over all alignments of $s$ and $t$ . Strings $s^{'}$ and $t^{'}$ achieving this minimum correspond to an optimal alignment with respect to edit alignment score.

Given: Two protein strings $s$ and $t$ in FASTA format, each of length at most 1000 aa.

Return: The total number of optimal alignments of $s$ and $t$ with respect to edit alignment score, modulo 134,217,727 (227-1).

Sample Dataset

>Rosalind_78
PLEASANTLY
>Rosalind_33
MEANLY

Sample Output

이 문제는 edit distance가 가장 작은 optimal alignment의 개수를 구하는 문제이다. 이 문제는 optimal alignment를 구하는 코드에서 조금만 더 변형해서 푼다면 쉽게 풀 수 있다.

우리가 optimal alignment를 구할 때에는 배열에서 [i,j]의 값을 구하기 위해서 [i-1,j],[i,j-1],[i-1,j-1]의 값을 비교하였다. 지금 비교하는 문자열이 같다면 i-1,j-1의 값을 그대로 아니라면 저 3개 값 중 최솟값을 취하는 식으로 구했다. 하지만 제거나 치환을 해도 그대로 두는 경우랑 edit distance가 같을 수 있는 경우가 있기에, 같다면 i-1,j-1 값 그대로 아니라면 +1 하는식으로 한다음 그 중 최솟값을 구했다. 그리고 optimal alignment의 개수는 일종의 여기까지 올 수 있는 배열에서의 경로를 이야기한다. 예를 들어 2x2의 정사각형에서 제2사분면에서 제4사분면으로 향하는 경우의 수는 3개이다. 이는 2->1->4, 2->3->4, 2->4 같은 식이다. 이는 맞닿아 있는 3개의 배열에서의 경로의 수를 합한 것이다.

그러니 우리는 최솟값을 가지고 있는 배열의 경로의 수를 계속해서 더해가는 식으로 optimal alignment의 개수를 구할 것이다.

from Bio import SeqIO

def edit_distance(s1,s2):
    array=[[0 for _ in range(len(s2))] for _ in range(len(s1))]#[s1][s2]로 넣기
    dp=[[0 for _ in range(len(s2))] for _ in range(len(s1))]
    modulus = 2**27 - 1

    for i in range(len(s1)):
        for j in range(len(s2)):
            if i==0:#i가 0이면
                array[i][j] = j
                dp[i][j] = 1
            elif j==0:#j가 0이라
                array[i][j] = i
                dp[i][j] =1
            else:#그외
                temp=[array[i-1][j]+1,array[i][j-1]+1,array[i-1][j-1]+(s1[i]!=s2[j])]#맞닿이있는 3개의 배열     
                array[i][j] = min(temp)#edit distance = 최솟값

                dp[i][j] += [0,dp[i-1][j]][temp[0]==array[i][j]]#다르면0, 같다면 경로의 수 더하기
                dp[i][j] += [0,dp[i][j-1]][temp[1]==array[i][j]]
                dp[i][j] += [0,dp[i-1][j-1]][temp[2]==array[i][j]]

                dp[i][j] = dp[i][j] % modulus#지속적으로 나머지로

    return dp[len(s1)-1][len(s2)-1]#최종

if __name__ == '__main__':
    with open(r"파일경로",'r') as fa:
        seqs=[]
        for s in SeqIO.parse(fa,'fasta'):
            seqs.append(s.seq)
        S1=' '+str(seqs[0])
        S2=' '+str(seqs[1])

print(edit_distance(S1,S2))

'문제해결(PS) > ROSALIND' 카테고리의 다른 글

Global Alignment with Scoring Matrix (0)	2024.11.18
Counting Unrooted Binary Trees (2)	2024.11.17
Creating a Character Table from Genetic Strings (2)	2024.11.11
Counting Disease Carriers (2)	2024.11.10
Wobble Bonding and RNA Secondary Structures (0)	2024.11.09

현재글Counting Optimal Alignments

곰탱이의 공부정리

공부한 것 정리하기

문자열, dfs, 조합, 백준, 그래프, phylogeny, 순열, DP, 데이크스트라, 백트랙킹, rosalin, 브루트포스, 확률, 트리, 생물정보학, BFS, 다이나믹 프로그래밍, Rosalind, 계통학, 바이오파이썬,

Today :
Yesterday :

곰탱이의 공부정리