cCRE dELS definition #5

zhangpicb · 2021-11-23T12:07:23Z

Hi,

Thanks for this fantastic resource!
When I use the GRCh38-ccREs.bed download from SCREEN,there are some dELS,CTCF-bound and dELS to their near protein coding gene TSS are less than 2kb.

GENCODE v24

chr2	176104216	176109754	5539	+	HAVANA	gene	ENSG00000128713.12	protein_coding	HOXD11

chr2_176104215_176104556_dELS,CTCF-bound

Thanks a ton!

camelest · 2021-11-25T03:54:46Z

Hi, I have a similar question and regarding this, do we have access to either

~/Lab/Reference/Mouse/GENCODEM18/TSS.Basic.4K.bed
~/Lab/Reference/Human/$genome/GENCODE24/TSS.Basic.4K.bed
to reproduce the analysis of classification of CREs? Or it would be really appreciated if you could share the script to create these files.

Thank you so much for your effort on this wonderful resource.

zhangpicb · 2021-11-26T03:10:57Z

Hi @camelest @Jill-Moore

I downloaded GENCODE v24 from GENCODE website,and found this question.
This dELS defination used TSS.Basic.bed.gz（by the way,no TSS.Basic.4K.bed in the github repo).TSS.Basic.bed.gz removes some transcripts in GENCODE v24,that's the reason why I found some dELS near TSS.
How to filter transcripts in GENCODE v24 to produce TSS.Basic.bed.gz?

Thanks a ton!

camelest · 2021-11-28T05:05:00Z

Hi, @zhangpicb, thank you for your information. Regarding their filtering, I believe they took the basic tag transcripts from the GENCODE gtf, based on the method part from the paper.

@Jill-Moore, I found the TSS.Basic.bed.gz. The start and the end position of the bed file is the same, which I think is not typical for a bed file. Did you intentionally do this to ensure TSS.Basic.4K.bed has 4000bp-length regions instead of 4001bp? Thank you for your help.

sayangsep · 2024-09-13T21:40:51Z

Here's a solution to get TSS.basic.bed

#!/bin/bash

# Input GTF file
input_gtf="gencode.v24.basic.annotation.gtf"
# Output BED file (unsorted)
output_bed="TSS.Basic.bed"
# Output sorted BED file
sorted_bed="TSS.Basic.sorted.bed"

# Create or overwrite the output file
> $output_bed

# Process the GTF file
awk '
BEGIN {FS="\t";OFS="\t" }
$3 == "transcript" { 
    # Extract chromosome, start, end, strand
    chrom = $1
    start = $4
    end = $5
    strand = $7

    # For the '+' strand, the TSS is the start position
    if (strand == "+") {
        tss_start = start - 1  # BED is 0-based
        tss_end = start         # BED expects half-open intervals
    }
    # For the '-' strand, the TSS is the end position
    else if (strand == "-") {
        tss_start = end - 1     # For '-' strand, TSS is the end of the transcript
        tss_end = end
    }

    # Extract gene_id and transcript_id from $9 manually using split
    gene_id = "unknown_id"
    transcript_id="unknown_id"
    n = split($9, a, ";")
    for (i = 1; i <= n; i++) {
        if (a[i] ~ /gene_id/) {
            split(a[i], b, "\"")
            gene_id = b[2]
        }
        if (a[i] ~ /transcript_id/) {
            split(a[i], b, "\"")
            transcript_id = b[2]
        }
    }

    # Print the BED line (chromosome, new_start, new_end, transcript_id, ".", strand, gene_id)
    print chrom, tss_start, tss_end, transcript_id, ".", strand, gene_id
}
' $input_gtf > $output_bed

# Sort the BED file by chromosome and position
sort -k1,1 -k2,2n $output_bed > $sorted_bed

# Report completion
echo "TSS.Basic.sorted.bed created and sorted by chromosome and position."

Here's a solution to get TSS.basic.4K.bed

#!/bin/bash

# Input GTF file
input_gtf="gencode.v24.basic.annotation.gtf"
# Output unsorted 4K BED file
unsorted_bed="TSS.Basic.4K.bed"
# Output sorted 4K BED file
sorted_bed="TSS.Basic.4K.sorted.bed"

# Create or overwrite the unsorted output file
> $unsorted_bed

# Process the GTF file
awk '
BEGIN { FS="\t"; OFS="\t" }
$3 == "transcript" { 
    # Extract chromosome, start, end, strand
    chrom = $1
    start = $4
    end = $5
    strand = $7

    # For the '+' strand, the TSS is the start position
    if (strand == "+") {
        tss_start = start - 1  # BED is 0-based
        tss_end = start         # BED expects half-open intervals
        new_start = tss_start - 2000  # 2K upstream for the '+' strand
        new_end = tss_end + 2000      # 2K downstream for the '+' strand
    }
    # For the '-' strand, the TSS is the end position
    else if (strand == "-") {
        tss_start = end - 1     # For '-' strand, TSS is the end of the transcript
        tss_end = end
        new_start = tss_start - 2000  # 2K upstream for the '-' strand
        new_end = tss_end + 2000      # 2K downstream for the '-' strand
    }

    # Ensure the start position is not negative
    if (new_start < 0) {
        new_start = 0
    }

    # Extract gene_id and transcript_id from $9 manually using split
    gene_id = "unknown_id"
    transcript_id="unknown_id"
    n = split($9, a, ";")
    for (i = 1; i <= n; i++) {
        if (a[i] ~ /gene_id/) {
            split(a[i], b, "\"")
            gene_id = b[2]
        }
        if (a[i] ~ /transcript_id/) {
            split(a[i], b, "\"")
            transcript_id = b[2]
        }
    }

    # Print the BED line (chromosome, new_start, new_end, gene_id, ".", strand, transcript_id)
    print chrom, new_start, new_end, transcript_id, ".", strand, gene_id
}
' $input_gtf > $unsorted_bed

# Sort the BED file by chromosome and position
sort -k1,1 -k2,2n $unsorted_bed > $sorted_bed

# Remove the unsorted file to keep only the sorted one (optional)
rm $unsorted_bed

# Report completion
echo "TSS.Basic.4K.bed created and sorted by chromosome and position."

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cCRE dELS definition #5

cCRE dELS definition #5

zhangpicb commented Nov 23, 2021 •

edited

Loading

camelest commented Nov 25, 2021

zhangpicb commented Nov 26, 2021

camelest commented Nov 28, 2021 •

edited

Loading

sayangsep commented Sep 13, 2024

cCRE dELS definition #5

cCRE dELS definition #5

Comments

zhangpicb commented Nov 23, 2021 • edited Loading

camelest commented Nov 25, 2021

zhangpicb commented Nov 26, 2021

camelest commented Nov 28, 2021 • edited Loading

sayangsep commented Sep 13, 2024

Here's a solution to get TSS.basic.bed

Here's a solution to get TSS.basic.4K.bed

zhangpicb commented Nov 23, 2021 •

edited

Loading

camelest commented Nov 28, 2021 •

edited

Loading