Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cCRE dELS definition #5

Open
zhangpicb opened this issue Nov 23, 2021 · 4 comments
Open

cCRE dELS definition #5

zhangpicb opened this issue Nov 23, 2021 · 4 comments

Comments

@zhangpicb
Copy link

zhangpicb commented Nov 23, 2021

Hi,

Thanks for this fantastic resource!
When I use the GRCh38-ccREs.bed download from SCREEN,there are some dELS,CTCF-bound and dELS to their near protein coding gene TSS are less than 2kb.

GENCODE v24

chr2 176104216 176109754 5539 + HAVANA gene ENSG00000128713.12 protein_coding HOXD11
chr2_176104215_176104556_dELS,CTCF-bound

Thanks a ton!

@camelest
Copy link

Hi, I have a similar question and regarding this, do we have access to either

~/Lab/Reference/Mouse/GENCODEM18/TSS.Basic.4K.bed
~/Lab/Reference/Human/$genome/GENCODE24/TSS.Basic.4K.bed
to reproduce the analysis of classification of CREs? Or it would be really appreciated if you could share the script to create these files.

Thank you so much for your effort on this wonderful resource.

@zhangpicb
Copy link
Author

Hi @camelest @Jill-Moore

I downloaded GENCODE v24 from GENCODE website,and found this question.
This dELS defination used TSS.Basic.bed.gz(by the way,no TSS.Basic.4K.bed in the github repo).TSS.Basic.bed.gz removes some transcripts in GENCODE v24,that's the reason why I found some dELS near TSS.
How to filter transcripts in GENCODE v24 to produce TSS.Basic.bed.gz?

Thanks a ton!

@camelest
Copy link

camelest commented Nov 28, 2021

Hi, @zhangpicb, thank you for your information. Regarding their filtering, I believe they took the basic tag transcripts from the GENCODE gtf, based on the method part from the paper.

@Jill-Moore, I found the TSS.Basic.bed.gz. The start and the end position of the bed file is the same, which I think is not typical for a bed file. Did you intentionally do this to ensure TSS.Basic.4K.bed has 4000bp-length regions instead of 4001bp? Thank you for your help.

@sayangsep
Copy link

Here's a solution to get TSS.basic.bed

#!/bin/bash

# Input GTF file
input_gtf="gencode.v24.basic.annotation.gtf"
# Output BED file (unsorted)
output_bed="TSS.Basic.bed"
# Output sorted BED file
sorted_bed="TSS.Basic.sorted.bed"

# Create or overwrite the output file
> $output_bed

# Process the GTF file
awk '
BEGIN {FS="\t";OFS="\t" }
$3 == "transcript" { 
    # Extract chromosome, start, end, strand
    chrom = $1
    start = $4
    end = $5
    strand = $7

    # For the '+' strand, the TSS is the start position
    if (strand == "+") {
        tss_start = start - 1  # BED is 0-based
        tss_end = start         # BED expects half-open intervals
    }
    # For the '-' strand, the TSS is the end position
    else if (strand == "-") {
        tss_start = end - 1     # For '-' strand, TSS is the end of the transcript
        tss_end = end
    }

    # Extract gene_id and transcript_id from $9 manually using split
    gene_id = "unknown_id"
    transcript_id="unknown_id"
    n = split($9, a, ";")
    for (i = 1; i <= n; i++) {
        if (a[i] ~ /gene_id/) {
            split(a[i], b, "\"")
            gene_id = b[2]
        }
        if (a[i] ~ /transcript_id/) {
            split(a[i], b, "\"")
            transcript_id = b[2]
        }
    }

    # Print the BED line (chromosome, new_start, new_end, transcript_id, ".", strand, gene_id)
    print chrom, tss_start, tss_end, transcript_id, ".", strand, gene_id
}
' $input_gtf > $output_bed

# Sort the BED file by chromosome and position
sort -k1,1 -k2,2n $output_bed > $sorted_bed

# Report completion
echo "TSS.Basic.sorted.bed created and sorted by chromosome and position."

Here's a solution to get TSS.basic.4K.bed

#!/bin/bash

# Input GTF file
input_gtf="gencode.v24.basic.annotation.gtf"
# Output unsorted 4K BED file
unsorted_bed="TSS.Basic.4K.bed"
# Output sorted 4K BED file
sorted_bed="TSS.Basic.4K.sorted.bed"

# Create or overwrite the unsorted output file
> $unsorted_bed

# Process the GTF file
awk '
BEGIN { FS="\t"; OFS="\t" }
$3 == "transcript" { 
    # Extract chromosome, start, end, strand
    chrom = $1
    start = $4
    end = $5
    strand = $7

    # For the '+' strand, the TSS is the start position
    if (strand == "+") {
        tss_start = start - 1  # BED is 0-based
        tss_end = start         # BED expects half-open intervals
        new_start = tss_start - 2000  # 2K upstream for the '+' strand
        new_end = tss_end + 2000      # 2K downstream for the '+' strand
    }
    # For the '-' strand, the TSS is the end position
    else if (strand == "-") {
        tss_start = end - 1     # For '-' strand, TSS is the end of the transcript
        tss_end = end
        new_start = tss_start - 2000  # 2K upstream for the '-' strand
        new_end = tss_end + 2000      # 2K downstream for the '-' strand
    }

    # Ensure the start position is not negative
    if (new_start < 0) {
        new_start = 0
    }

    # Extract gene_id and transcript_id from $9 manually using split
    gene_id = "unknown_id"
    transcript_id="unknown_id"
    n = split($9, a, ";")
    for (i = 1; i <= n; i++) {
        if (a[i] ~ /gene_id/) {
            split(a[i], b, "\"")
            gene_id = b[2]
        }
        if (a[i] ~ /transcript_id/) {
            split(a[i], b, "\"")
            transcript_id = b[2]
        }
    }

    # Print the BED line (chromosome, new_start, new_end, gene_id, ".", strand, transcript_id)
    print chrom, new_start, new_end, transcript_id, ".", strand, gene_id
}
' $input_gtf > $unsorted_bed

# Sort the BED file by chromosome and position
sort -k1,1 -k2,2n $unsorted_bed > $sorted_bed

# Remove the unsorted file to keep only the sorted one (optional)
rm $unsorted_bed

# Report completion
echo "TSS.Basic.4K.bed created and sorted by chromosome and position."

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants