Querying iGEM Registry SQL Dump Intro

Author

Samuel Ortion

Published

September 24, 2023

Loading in MariaDB/MySQL

gunzip sql_parts.sql.gz
CREATE DATABASE igem_registry; 
mysql igem_registry < sql_parts.sql

Preliminary Exploration

show tables;
Tables_in_igem_registry
parts
parts_seq_features
describe parts;
select label from parts_seq_features where label like "%OMPR%";
select label from parts_seq_features where label like "%OMPC%";

Select the part labels containing “OmpR”, and count how many parts contains this label, ordering by decreasing count:

select label, count(label) from parts_seq_features where lab
el like "%OMPR%" group by label order by count(label) desc;
Results Table

label count(label)
C1 OmpR 209
C3 OmpR 184
C2 OmpR 184
F3 OmpR 33
F2 OmpR 33
F1 OmpR 33
ompR 11
OmpR binding site 8
OmpR binding 5
ompR(sph) 4
OmpR dependent promoter 2
OmpR site b 1
OmpR Scrambled promoter 1
Location of wt C3 OmpR 1
Location of wt C2 OmpR 1
Strong OmpR 1
Location of wt C1 OmpR 1
OmpR (R0082) 1
OmpR site a 1
BBa_R0082 - pOmpR 1
OmpR site c 1

What are the available part types:

select part_type from parts group by part_type;
Results Table

part_type
Cell
Coding
Composite
Conjugation
Device
DNA
Generator
Intermediate
Inverter
Measurement
Other
Plasmid
Plasmid_Backbone
Primer
Project
Promoter
Protein_Domain
RBS
Regulatory
Reporter
RNA
Scar
Signalling
T7
Tag
Temporary
Terminator
Translational_Unit

How many part has the part_type Promoter?

select count(part_id) from parts where part_type = "Promoter"

3

Idem for Regulatory part type:

select count(part_id) from parts where part_type = "Regulatory"

6981

Interested Queries

What are the part biobicks registry IDs associated with “OmpR binding site” label?

select parts.part_name 
from parts
inner join parts_seq_features
on parts.part_id = parts_seq_features.part_id
where
    label = "OmpR binding site";
Results Table

part_name
BBa_K136004
BBa_K2845017
BBa_K2845019
BBa_K2845021
BBa_K2845023
BBa_K2845027
BBa_K2845029
BBa_K2845031

What’s the length of those feature sequence?

select parts.part_name, parts.sequence_length, 
    parts_seq_features.start_pos, parts_seq_features.end_pos, 
    parts_seq_features.end_pos - parts_seq_features.start_pos as feature_length
from parts
inner join parts_seq_features
on parts.part_id = parts_seq_features.part_id
where
    label = "OmpR binding site";
Results Table

part_name sequence_length start_pos end_pos feature_length
BBa_K136004 230 43 63 20
BBa_K2845017 58 11 28 17
BBa_K2845019 63 10 28 18
BBa_K2845021 68 11 28 17
BBa_K2845023 108 11 28 17
BBa_K2845027 63 36 53 17
BBa_K2845029 108 41 68 27
BBa_K2845031 73 46 73 27

Extracting the feature sequence from the part into a fasta file:

select 
    parts.part_name, 
    substring(parts.sequence, 
        parts_seq_features.start_pos, 
        parts_seq_features.end_pos
    ) as feature_sequence
from parts, parts_seq_features
where parts.part_id = parts_seq_features.part_id
and parts_seq_features.label = "OmpR binding site"
into outfile "/tmp/OmpR_binding_site_part_features.csv";
sudo cp /tmp/systemd-private-90c6e33e66b14aad94f288d77b339520-mariadb.service-0LgiF0/tmp/OmpR_binding_site_part_features.csv ./data/pOmpR/OmpR_binding_site_part_features.csv
sudo chown $USER:$USER /home/sortion/Documents/Projects/asso/igem/2023/bioinfo/OmpR_binding_site_part_features.csv
head ./data/pOmpR/OmpR_binding_site_part_features.csv
BBa_K136004     gaaaaatcttagataagtgtaaagacccatttctatttgtaaggacatattaaaccaaaaagg
BBa_K2845017    acattttgaaacatctatagcgatctct
BBa_K2845019    tacattttgaaacatctatagcgatctc
BBa_K2845021    acattttgaaacatctatagcgatctct
BBa_K2845023    ttacattttgaaacatctatagcgataa
BBa_K2845027    acattttgaaacatctatagcgatctct
BBa_K2845029    gaaacatcttaaaagttttagtatcatattcgtgttggattattctgcatttttggggagaatggact
BBa_K2845031    acattttgaaacatctatagcgatctct

*Converting this TSV into a FASTA file`:

awk '{ printf ">%s\n", $1; print $2}' ./data/pOmpR/OmpR_binding_site_part_features.csv > ./data/pOmpR/OmpR_binding_site_part_features.fasta

Multiple Sequence Alignment

muscle 

Alignment with color

musclor() {
    local source_file
    source_file="$1"
    muscle -align ./data/${source_file}.fasta -output ./results/${source_file}.afa 
    muscle -align ./data/${source_file}.fasta -stratified -output ./results/${source_file}.efa
    muscle -letterconf ./results/${source_file}.efa -ref ./results/${source_file}.afa -output ./results/${source_file}.lc.afa
    muscle -letterconf ./results/${source_file}.efa -ref ./results/${source_file}.afa \
    -output ./results/${source_file}.lc.afa \
    -html ./results/${source_file}.lc.html \
    -jalview ./results/${source_file}/jalview.features
}
musclor "pOmpR/OmpR_binding_site_part_features"

Two sequence seems to contain relatively large insert.

awk 'length($2) < 30 { printf ">%s\n", $1; print $2}' ./data/pOmpR/OmpR_binding_site_part_features.csv > ./data/pOmpR/OmpR_binding_site_part_features_without_inserted.fasta
musclor "pOmpR/OmpR_binding_site_part_features_without_inserted"

Extract the interesting html content into a .qmd file, for import in this notebook:

for file in OmpR_binding_site_part_features_without_inserted.lc.html OmpR_binding_site_part_features.lc.html 
do
    ./src/extract_inner_muscle_html.sh ./results/pOmpR/$file > ./media/$file.qmd
done
With all reported OmpR binding sites:



BBa_K2845027  acattttgaaa——————————–catc-tatagc———————–
 BBa_K136004  ga-aaaatcttagataagtgtaaagacccatttctatttgtaaggacatattaaacc———————–
BBa_K2845019  tacattttgaaa——————————–catc-tatagc———————–
BBa_K2845029  gaaacatcttaaaa——————————–gttt-tagtatcatattcgtgttggattattctg
BBa_K2845031  acattttgaaa——————————–catc-tatagc———————–
BBa_K2845017  acattttgaaa——————————–catc-tatagc———————–
BBa_K2845023  tt-acattttgaaa——————————–catc-tatagc———————–
BBa_K2845021  acattttgaaa——————————–catc-tatagc———————–


BBa_K2845027  ————–gatctct
 BBa_K136004  ————–aaaaagg
BBa_K2845019  ————–gatct-c
BBa_K2845029  catttttggggagaatggact
BBa_K2845031  ————–gatctct
BBa_K2845017  ————–gatctct
BBa_K2845023  ————–gataa
BBa_K2845021  ————–gatctct



Confidence high 9876543210 low

With the reported OmpR binding sites with length below 30 bp:



BBa_K2845031  acattttgaaacatctatagcgatctct
BBa_K2845023  ttacattttgaaacatctatagcgataa
BBa_K2845017  acattttgaaacatctatagcgatctct
BBa_K2845019  -tacattttgaaacatctatagcgatct-c
BBa_K2845021  acattttgaaacatctatagcgatctct
BBa_K2845027  acattttgaaacatctatagcgatctct



Confidence high 9876543210 low

Appendix

Easier connection to the database

To connect to the MariaDB database without using sudo all the time, one can create a user:

GRANT ALL PRIVILEGES ON igem_registry.* TO 'user'@'localhost' identified by 'secret';
FLUSH PRIVILEGES;

Then, to avoid having to type the login / password at each connection, he can add the following to the ~/.my.cnf config file:

~/.my.cnf
[client-server]
user=user
password=secret
host=localhost

Thus, the connection is straightforward:

mysql igem_registry

Also, to enable file system access, from an administrator mariadb account:

GRANT FILE ON *.* TO 'user'@'localhost';

References