Querying iGEM Registry SQL Dump Intro

Author

Samuel Ortion

Published

September 24, 2023

Loading in MariaDB/MySQL

gunzip sql_parts.sql.gz

CREATE DATABASE igem_registry;

mysql igem_registry < sql_parts.sql

Preliminary Exploration

show tables;

Tables_in_igem_registry
parts
parts_seq_features

describe parts;

select label from parts_seq_features where label like "%OMPR%";

select label from parts_seq_features where label like "%OMPC%";

Select the part labels containing “OmpR”, and count how many parts contains this label, ordering by decreasing count:

select label, count(label) from parts_seq_features where lab
el like "%OMPR%" group by label order by count(label) desc;

Results Table

label	count(label)
C1 OmpR	209
C3 OmpR	184
C2 OmpR	184
F3 OmpR	33
F2 OmpR	33
F1 OmpR	33
ompR	11
OmpR binding site	8
OmpR binding	5
ompR(sph)	4
OmpR dependent promoter	2
OmpR site b	1
OmpR Scrambled promoter	1
Location of wt C3 OmpR	1
Location of wt C2 OmpR	1
Strong OmpR	1
Location of wt C1 OmpR	1
OmpR (R0082)	1
OmpR site a	1
BBa_R0082 - pOmpR	1
OmpR site c	1

What are the available part types:

select part_type from parts group by part_type;

Results Table

part_type
Cell
Coding
Composite
Conjugation
Device
DNA
Generator
Intermediate
Inverter
Measurement
Other
Plasmid
Plasmid_Backbone
Primer
Project
Promoter
Protein_Domain
RBS
Regulatory
Reporter
RNA
Scar
Signalling
T7
Tag
Temporary
Terminator
Translational_Unit

How many part has the part_type Promoter?

select count(part_id) from parts where part_type = "Promoter"

3

Idem for Regulatory part type:

select count(part_id) from parts where part_type = "Regulatory"

6981

Interested Queries

What are the part biobicks registry IDs associated with “OmpR binding site” label?

select parts.part_name 
from parts
inner join parts_seq_features
on parts.part_id = parts_seq_features.part_id
where
    label = "OmpR binding site";

Results Table

part_name
BBa_K136004
BBa_K2845017
BBa_K2845019
BBa_K2845021
BBa_K2845023
BBa_K2845027
BBa_K2845029
BBa_K2845031

What’s the length of those feature sequence?

select parts.part_name, parts.sequence_length, 
    parts_seq_features.start_pos, parts_seq_features.end_pos, 
    parts_seq_features.end_pos - parts_seq_features.start_pos as feature_length
from parts
inner join parts_seq_features
on parts.part_id = parts_seq_features.part_id
where
    label = "OmpR binding site";

Results Table

part_name	sequence_length	start_pos	end_pos	feature_length
BBa_K136004	230	43	63	20
BBa_K2845017	58	11	28	17
BBa_K2845019	63	10	28	18
BBa_K2845021	68	11	28	17
BBa_K2845023	108	11	28	17
BBa_K2845027	63	36	53	17
BBa_K2845029	108	41	68	27
BBa_K2845031	73	46	73	27

Extracting the feature sequence from the part into a fasta file:

select 
    parts.part_name, 
    substring(parts.sequence, 
        parts_seq_features.start_pos, 
        parts_seq_features.end_pos
    ) as feature_sequence
from parts, parts_seq_features
where parts.part_id = parts_seq_features.part_id
and parts_seq_features.label = "OmpR binding site"
into outfile "/tmp/OmpR_binding_site_part_features.csv";

sudo cp /tmp/systemd-private-90c6e33e66b14aad94f288d77b339520-mariadb.service-0LgiF0/tmp/OmpR_binding_site_part_features.csv ./data/pOmpR/OmpR_binding_site_part_features.csv
sudo chown $USER:$USER /home/sortion/Documents/Projects/asso/igem/2023/bioinfo/OmpR_binding_site_part_features.csv

head ./data/pOmpR/OmpR_binding_site_part_features.csv

BBa_K136004     gaaaaatcttagataagtgtaaagacccatttctatttgtaaggacatattaaaccaaaaagg
BBa_K2845017    acattttgaaacatctatagcgatctct
BBa_K2845019    tacattttgaaacatctatagcgatctc
BBa_K2845021    acattttgaaacatctatagcgatctct
BBa_K2845023    ttacattttgaaacatctatagcgataa
BBa_K2845027    acattttgaaacatctatagcgatctct
BBa_K2845029    gaaacatcttaaaagttttagtatcatattcgtgttggattattctgcatttttggggagaatggact
BBa_K2845031    acattttgaaacatctatagcgatctct

*Converting this TSV into a FASTA file`:

awk '{ printf ">%s\n", $1; print $2}' ./data/pOmpR/OmpR_binding_site_part_features.csv > ./data/pOmpR/OmpR_binding_site_part_features.fasta

Multiple Sequence Alignment

muscle

Alignment with color

musclor() {
    local source_file
    source_file="$1"
    muscle -align ./data/${source_file}.fasta -output ./results/${source_file}.afa 
    muscle -align ./data/${source_file}.fasta -stratified -output ./results/${source_file}.efa
    muscle -letterconf ./results/${source_file}.efa -ref ./results/${source_file}.afa -output ./results/${source_file}.lc.afa
    muscle -letterconf ./results/${source_file}.efa -ref ./results/${source_file}.afa \
    -output ./results/${source_file}.lc.afa \
    -html ./results/${source_file}.lc.html \
    -jalview ./results/${source_file}/jalview.features
}
musclor "pOmpR/OmpR_binding_site_part_features"

Two sequence seems to contain relatively large insert.

awk 'length($2) < 30 { printf ">%s\n", $1; print $2}' ./data/pOmpR/OmpR_binding_site_part_features.csv > ./data/pOmpR/OmpR_binding_site_part_features_without_inserted.fasta

musclor "pOmpR/OmpR_binding_site_part_features_without_inserted"

Extract the interesting html content into a .qmd file, for import in this notebook:

for file in OmpR_binding_site_part_features_without_inserted.lc.html OmpR_binding_site_part_features.lc.html 
do
    ./src/extract_inner_muscle_html.sh ./results/pOmpR/$file > ./media/$file.qmd
done

With all reported OmpR binding sites:

BBa_K2845027  —acattttgaaa——————————–catc-tatagc———————–
BBa_K136004  ga-aaaatcttagataagtgtaaagacccatttctatttgtaaggacatattaaacc———————–
BBa_K2845019  t–acattttgaaa——————————–catc-tatagc———————–
BBa_K2845029  gaaacatcttaaaa——————————–gttt-tagtatcatattcgtgttggattattctg
BBa_K2845031  —acattttgaaa——————————–catc-tatagc———————–
BBa_K2845017  —acattttgaaa——————————–catc-tatagc———————–
BBa_K2845023  tt-acattttgaaa——————————–catc-tatagc———————–
BBa_K2845021  —acattttgaaa——————————–catc-tatagc———————–

BBa_K2845027  ————–gatctct
BBa_K136004  ————–aaaaagg
BBa_K2845019  ————–gatct-c
BBa_K2845029  catttttggggagaatggact
BBa_K2845031  ————–gatctct
BBa_K2845017  ————–gatctct
BBa_K2845023  ————–gat–aa
BBa_K2845021  ————–gatctct

Confidence high 9876543210 low

With the reported OmpR binding sites with length below 30 bp:

BBa_K2845031  –acattttgaaacatctatagcgatctct
BBa_K2845023  ttacattttgaaacatctatagcgata–a
BBa_K2845017  –acattttgaaacatctatagcgatctct
BBa_K2845019  -tacattttgaaacatctatagcgatct-c
BBa_K2845021  –acattttgaaacatctatagcgatctct
BBa_K2845027  –acattttgaaacatctatagcgatctct

Confidence high 9876543210 low

Appendix

Easier connection to the database

To connect to the MariaDB database without using sudo all the time, one can create a user:

GRANT ALL PRIVILEGES ON igem_registry.* TO 'user'@'localhost' identified by 'secret';
FLUSH PRIVILEGES;

Then, to avoid having to type the login / password at each connection, he can add the following to the ~/.my.cnf config file:

~/.my.cnf

[client-server]
user=user
password=secret
host=localhost

Thus, the connection is straightforward:

mysql igem_registry

Also, to enable file system access, from an administrator mariadb account:

GRANT FILE ON *.* TO 'user'@'localhost';

References

https://parts.igem.org/Registry_API (database dumps available there).