Genome Assembly as Shortest Superstring(가장 짧은 문자열로 게놈합성)

문제해결(PS)/ROSALIND

Genome Assembly as Shortest Superstring(가장 짧은 문자열로 게놈합성)

곰탱이장 2024. 8. 29. 18:34

ROSALIND | Genome Assembly as Shortest Superstring

It appears that your browser has JavaScript disabled. Rosalind requires your browser to be JavaScript enabled. Genome Assembly as Shortest Superstring solved by 3727 2012년 7월 19일 12:00:00 오전 by Mikhail Dvorkin Topics: Genome Assembly Introduction

rosalind.info

Problem

For a collection of strings, a larger string containing every one of the smaller strings as a substring is called a superstring.

By the assumption of parsimony, a shortest possible superstring over a collection of reads serves as a candidate chromosome.

Given: At most 50 DNA strings of approximately equal length, not exceeding 1 kbp, in FASTA format (which represent reads deriving from the same strand of a single linear chromosome).

The dataset is guaranteed to satisfy the following condition: there exists a unique way to reconstruct the entire chromosome from these reads by gluing together pairs of reads that overlap by more than half their length.

Return: A shortest superstring containing all the given strings (thus corresponding to a reconstructed chromosome).

Sample Dataset

>Rosalind_56
ATTAGACCTG
>Rosalind_57
CCTGCCGGAA
>Rosalind_58
AGACCTGCCG
>Rosalind_59
GCCGGAATAC

Sample Output

ATTAGACCTGCCGGAATAC

이 문제는 주어진 문자열을 모두 포함하는 가장 짧은 슈퍼스트링을 출력하는 문제이다. 필자는 처음에는 앞뒤가 얼마나 겹치는지를 다 측정하여 이중리스트를 이용하여 겹치는 정도를 각 노드간의 거리로 표현한 그래프를 만들어, 그 그래프에서의 모든 노드를 지나는 최대 거리를 구하는 알고리즘을 생각했었다. 하지만 생각보다 그래프 탐색에 시간복잡도가 높고 DFS로 하자니 시간이 걸리고 그 외 다른 알고리즘으로 하자니 난해해서 생각을 바꾸었다.

그래서 다이나믹 프로그래밍으로 풀기로 했다. 각 시도마다 가장 길게 오버랩 되는 문자열 두개를 합치고, 그 문자열을 리스트에서 제거하고 합친 문자열을 다시 리스트에 넣는 식으로 했다

from Bio import SeqIO

def get_overlap_string(s1,s2):#가장 길게 오버랩된 문자열 찾기
    overlap=[]
    combine=[]

    for i in range(len(s1)):
        if s1[i:] == s2[:len(s1)-i]:
            overlap.append(s1[i:])
            combine.append(s1+s2[len(s1)-i:])
            break
    for i in range(len(s2)):
        if s2[i:] == s1[:len(s2)-i]:
            overlap.append(s2[i:])
            combine.append(s2+s1[len(s2)-i:])
            break
    if not overlap:
        return '',''
    overlap=max(overlap,key=len)#큰게 앞으로
    combine=min(combine,key=len)#작은게 앞으로

    return overlap,combine

def shortest(seqs):
    temp = seqs#임시 리스트

    while len(temp) > 1#리스트에 2개 이상 있을시
        longest_overlap_length=0
        longest_overlap=''
        longest_overlap_pair=[]

        for i in range(len(temp)-1):
            for j in range(i+1,len(temp)):
                overlap,combine=get_overlap_string(temp[i],temp[j])
                if len(overlap) > longest_overlap_length:#지금게 에전거보다 큼
                    longest_overlap=combine
                    longest_overlap_length=len(overlap)
                    longest_overlap_pair=[temp[i],temp[j]]
        temp.remove(longest_overlap_pair[0])#제거
        temp.remove(longest_overlap_pair[1])#제거
        temp.append(longest_overlap)#합친거 넣기
        
    return temp[0]
        


handle = SeqIO.parse(r'파일경로','fasta')
seqs=[]
for i in handle:
    seqs.append(str(i.seq))

print(shortest(seqs))

'문제해결(PS) > ROSALIND' 카테고리의 다른 글

Introduction to Random Strings(무작위 문자열) (0)	2024.08.31
Partial Permutations(부분 순열) (2)	2024.08.30
Longest Increasing Subsequence(가장 긴 증가 서열) (1)	2024.08.22
Locating Restriction Sites(제한부위 위치) (0)	2024.08.16
Enumerating Gene Orders(유전자 순서 열거하기) (0)	2024.08.15

현재글Genome Assembly as Shortest Superstring(가장 짧은 문자열로 게놈합성)

곰탱이의 공부정리

공부한 것 정리하기

Rosalind, 다이나믹 프로그래밍, 확률, BFS, 계통학, 바이오파이썬, rosalin, DP, 트리, dfs, 조합, 생물정보학, 그래프, 백트랙킹, 순열, 브루트포스, 데이크스트라, 문자열, phylogeny, 백준,

Today :
Yesterday :

곰탱이의 공부정리