Constructing a De Bruijn Graph

곰탱이장 2024. 10. 20. 13:34

ROSALIND | Constructing a De Bruijn Graph

rosalind.info

Problem

Consider a set $S$ of $(k + 1)$-mers of some unknown DNA string. Let $S^{rc}$ denote the set containing all reverse complements of the elements of $S$ (recall from “Counting Subsets” that sets are not allowed to contain duplicate elements).

The de Bruijn graph $B_{k}$ of order $k$ corresponding to $S \cup S^{rc}$ is a digraph defined in the following way:

Nodes of $B_{k}$ correspond to all $k$-mers that are present as a substring of a $(k + 1)$-mer from $S \cup S^{rc}$.
Edges of $B_{k}$ are encoded by the $(k + 1)$-mers of $S \cup S^{rc}$ in the following way: for each $(k + 1)$-mer $r$ in $S \cup S^{rc}$, form a directed edge ($r[1 : k]$, $r[2 : k + 1]$).
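
The indices in the edge definition are 1-based and inclusive, so $r[1 : k]$ is the length-$k$ prefix of $r$ and $r[2 : k + 1]$ is its length-$k$ suffix. A minimal Python sketch of that mapping follows; the helper name `edge_of` is a hypothetical name introduced here for illustration only.

```python
def edge_of(r: str) -> tuple[str, str]:
    """Return the de Bruijn edge (prefix, suffix) encoded by a (k+1)-mer r."""
    k = len(r) - 1
    # Problem notation r[1:k] (1-based, inclusive) is r[0:k] in Python slicing.
    prefix = r[0:k]
    # Problem notation r[2:k+1] (1-based, inclusive) is r[1:k+1] in Python slicing.
    suffix = r[1:k + 1]
    return prefix, suffix


print(edge_of("TGAT"))  # ('TGA', 'GAT')
```

Both slices have length $k$, so every node of the graph is a $k$-mer, as the definition requires.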

Given: A collection of up to 1000 (possibly repeating) DNA strings of equal length (not exceeding 50 bp) corresponding to a set $S$ of $(k + 1)$-mers.

Return: The adjacency list corresponding to the de Bruijn graph corresponding to $S \cup S^{rc}$.

Sample Dataset

TGAT
CATG
TCAT
ATGC
CATC
CATC

Sample Output

(ATC, TCA)
(ATG, TGA)
(ATG, TGC)
(CAT, ATC)
(CAT, ATG)
(GAT, ATG)
(GCA, CAT)
(TCA, CAT)
(TGA, GAT)
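
The sample can be checked with a short script. The sketch below uses only the standard library; the function names `reverse_complement` and `de_bruijn_edges` are assumptions made for this illustration and are not part of the problem statement.

```python
# Complement table for the four DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(s: str) -> str:
    """Complement every base, then reverse the string."""
    return s.translate(_COMPLEMENT)[::-1]


def de_bruijn_edges(reads):
    """Return the sorted edges of the de Bruijn graph of S ∪ S^rc."""
    # A set removes repeated reads such as the duplicated CATC.
    kmers = set(reads) | {reverse_complement(r) for r in reads}
    # Each (k+1)-mer contributes one edge: (length-k prefix, length-k suffix).
    return sorted((r[:-1], r[1:]) for r in kmers)


sample = ["TGAT", "CATG", "TCAT", "ATGC", "CATC", "CATC"]
for a, b in de_bruijn_edges(sample):
    print(f"({a}, {b})")
```

In this dataset CATG is its own reverse complement and CATC appears twice, so the six input lines yield nine distinct $(k + 1)$-mers and therefore nine edges.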

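This problem asks us to print the edges of the de Bruijn graph built from the given (k+1)-mers and their reverse complements. Let S be the set of given (k+1)-mers and Src the set of their reverse complements. The nodes of the graph are the k-mers that occur in the elements of the union of S and Src, and each (k+1)-mer r in that union contributes one directed edge r[1:k] -> r[2:k+1], that is, from its length-k prefix to its length-k suffix.

Translating this definition into code step by step gives the answer directly.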

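from Bio.Seq import Seq

if __name__ == "__main__":
    # Build S ∪ Src; a set drops repeated reads automatically.
    seqs = set()
    with open(r"file path", 'r') as f:  # placeholder: path to the input dataset
        for line in f:
            read = line.strip()
            if not read:  # skip blank lines so no empty k-mer is added
                continue
            seqs.add(read)
            seqs.add(str(Seq(read).reverse_complement()))

    # Each (k+1)-mer r encodes the edge (r[1:k], r[2:k+1]) = (prefix, suffix).
    anss = sorted((r[:-1], r[1:]) for r in seqs)

    # Write the edges in sorted order; the with-block closes the file.
    with open(r"file path", 'w') as wf:  # placeholder: path to the output file
        for ans in anss:
            print(f'({ans[0]}, {ans[1]})', file=wf)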
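
The printed pairs form an edge list. If the graph is needed for further processing, the same edges can be grouped by their source node. A minimal sketch, assuming the edges are already available as (source, target) tuples; `to_adjacency` is a hypothetical helper name and is not part of the solution above.

```python
from collections import defaultdict


def to_adjacency(edges):
    """Group (source, target) edges into {source: sorted list of targets}."""
    adjacency = defaultdict(list)
    for source, target in edges:
        adjacency[source].append(target)
    # Convert to a plain dict with sorted target lists for stable output.
    return {node: sorted(targets) for node, targets in adjacency.items()}


edges = [("ATG", "TGA"), ("ATG", "TGC"), ("CAT", "ATG")]
print(to_adjacency(edges))
# {'ATG': ['TGA', 'TGC'], 'CAT': ['ATG']}
```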
