Finding a Shared Spliced Motif

곰탱이장 2024. 9. 11. 19:38

ROSALIND | Finding a Shared Spliced Motif

It appears that your browser has JavaScript disabled. Rosalind requires your browser to be JavaScript enabled. Finding a Shared Spliced Motif solved by 2456 Locating Motifs Despite Introns In “Finding a Shared Motif”, we discussed searching through a d

rosalind.info

Problem

A string $u$ is a common subsequence of strings $s$ and $t$ if the symbols of $u$ appear in order as a subsequence of both $s$ and $t$ . For example, "ACTG" is a common subsequence of "AACCTTGG" and "ACACTGTGA".

Analogously to the definition of longest common substring, $u$ is a longest common subsequence of $s$ and $t$ if there does not exist a longer common subsequence of the two strings. Continuing our above example, "ACCTTG" is a longest common subsequence of "AACCTTGG" and "ACACTGTGA", as is "AACTGG".

Given: Two DNA strings $s$ and $t$ (each having length at most 1 kbp) in FASTA format.

Return: A longest common subsequence of $s$ and $t$ . (If more than one solution exists, you may return any one.)

Sample Dataset

>Rosalind_23
AACCTTGG
>Rosalind_64
ACACTGTGA

Sample Output

AACTGG

https://en.wikipedia.org/wiki/Longest_common_subsequence

Longest common subsequence - Wikipedia

From Wikipedia, the free encyclopedia Algorithmic problem on pairs of sequences Comparison of two revisions of an example file, based on their longest common subsequence (black) A longest common subsequence (LCS) is the longest subsequence common to all se

en.wikipedia.org

이 문제는 가장 긴 공통서열(longest common sequence)를 구하는 문제이다.필자는 문제에 링크되어 있는 위키백과를 참조하여 문제의 풀이를 진행하였다.

먼저 가장 긴 공통서열을 아웃풋으로 하는 함수 LCS()를 가정하자. 이 LCS라는 함수는 두 가지 속성을 가지고 있다.

첫 번째 속성은 LCS(서열1+'x,서열2+'x') = LCS(서열1,서열2)+'x' 인 속성이다. 맨 끝의 문자가 동일하다면 가장 긴 공통서열에 무조건적으로 포함되기에 뺄 수 있다. 예시로 들어보자면 LCS(ABCD,FCD)는 CD이다. LCS(ABC,FC)+D도 CD이고 LCS(AB,F)+C+D도 CD이다.

두번째 속성은 LCS(서열1+'A',서열2+'B') = max(LCS(서열1+'A',서열2),LCS(서열1,서열2+'B'))이다. 이는 맨 끝의 문자가 다르다면 서로 맨끝의 문자를 빼고 그 중 더 긴 것이 가장 긴 공통서열이란 뜻이다. 예시로 들어보자면 LCS(ABCD,ABDF)는 ABD이다. 이는 LCS(ABCD,ABD)와 LCS(ABC,ABDF) 중 앞의 것이 ABD로 더 길어서 ABD가 된 것이다.

Xi를 X[0:i+1]라 하고 Yj를 Y[0:j+1]이라 하면 LCS(Xi,Yj)는 X[i+1]==Y[j+1]이라면 LCS(Xi-1,Yi-1)+X[i+1]이고(첫번째 속성), X[i+1] != Y[j+1]이라면 LCS(Xi-1,Yj)와 LCS(Xi,Yj-1) 중 더 긴 것이다.(두번째 속성) 필자는 계산의 용이를 위해 두 서열 앞에 공백을 넣어 구현했다.

from Bio import SeqIO

def LCS(S1,S2):
    lcs_map=[['' for _ in range(len(S2))] for _ in range(len(S1))]#lcs 저장

    for i in range(len(S1)):
        for j in range(len(S2)):
            if i==0 or j==0:#0이라면
                continue
            elif S1[i]==S2[j]:#첫번째 속성
                lcs_map[i][j]=lcs_map[i-1][j-1]+S1[i]
            elif S1[i] != S2[j]:#두번째 속성
                lcs_map[i][j]=max([lcs_map[i-1][j],lcs_map[i][j-1]],key=len)

    return lcs_map[len(S1)-1][len(S2)-1]#맨 끝




if __name__=='__main__':
    with open(r'파일경로','r') as fa:
        l=[]
        for i in SeqIO.parse(fa,'fasta'):
            l.append(i.seq)
        S1=' '+l[0]
        S2=' '+l[1]

print(LCS(S1,S2))