Curated proteome databases for phylogenetics
TL;DR
A taxonomically balanced proteome dataset curator for bacterial cell biologists and computational biologists. It auto-filters NCBI/UniProt genomes by taxonomic redundancy at the genus or family level (e.g., one genome per taxon) and exports FASTA files for downstream phylogenetic workflows (e.g., RAxML, PhyML). The goal is to cut manual curation time from 10+ hours/week to under 1 hour while removing the phylogenetic bias introduced by overrepresented taxa.
Target Audience
Bacterial cell biologists and computational biologists in academia or biotech, who study protein evolution and need non-redundant genome datasets for phylogenetics.
The Problem
Problem Context
Researchers in bacterial cell biology need to study proteins in their evolutionary context. This requires searching for homologs across genomes using tools like BLAST or HMMER. However, public databases (e.g., NCBI RefSeq, UniProt) contain many near-identical genomes from heavily sequenced taxa, so search results are dominated by redundant hits and are difficult to work with efficiently.
Pain Points
Manually selecting non-redundant genomes is time-consuming and unsystematic. Researchers either pick genomes at random or keep one per genus by hand, which is slow, hard to reproduce, and may miss key evolutionary relationships. Because BLAST and HMMER search whatever database they are given, running them against uncurated databases returns many redundant sequences and slows down analysis.
Impact
Wasted time (10+ hours/week) delays research progress and publication timelines. Redundant data increases computational costs and complicates phylogenetic analysis. Researchers may miss critical insights due to incomplete or biased genome selection.
Urgency
This problem arises every time a researcher starts a new project, making it a recurring bottleneck. Without a curated database, researchers cannot efficiently study protein evolution, which is core to their work. Delayed insights could impact grant funding or competitive advantage in industry settings.
Target Audience
Bacterial cell biologists, computational biologists, and researchers in phylogenetics. These users work in academia, biotech companies, or pharmaceutical labs. They rely on genome databases for comparative analysis but struggle with redundancy and inefficient curation methods.
Proposed AI Solution
Solution Approach
A web-based tool that provides pre-curated, taxonomically balanced proteome datasets for common taxa (e.g., bacteria, archaea). Users can filter genomes by taxonomy, redundancy, or custom criteria, then export the results for downstream analysis. The tool integrates with BLAST/HMMER to streamline homolog searches without manual curation.
Key Features
- Custom Filtering: Upload proteomes or use built-in datasets, then filter by taxonomy, redundancy, or user-defined rules.
- BLAST/HMMER Integration: Search curated datasets directly without downloading raw files, saving time.
- Export Tools: Download filtered proteomes in FASTA format for homolog search, alignment, and downstream phylogenetic software (e.g., RAxML, PhyML).
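The taxonomy and redundancy filter described above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual implementation: the `GenomeRecord` fields, the `completeness` score, and the `pick_representatives` function are all hypothetical names introduced here, and the rule "keep the most complete genome per taxon" is one assumed selection criterion among many possible ones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenomeRecord:
    # All fields are illustrative assumptions, not a real database schema.
    accession: str       # hypothetical genome identifier
    genus: str
    family: str
    completeness: float  # hypothetical quality score in [0, 1]


def pick_representatives(records, rank="genus", per_taxon=1):
    """Keep at most `per_taxon` genomes for each distinct value of `rank`."""
    if rank not in ("genus", "family"):
        raise ValueError(f"unsupported rank: {rank}")
    # Group genomes by the chosen taxonomic rank.
    groups = {}
    for rec in records:
        groups.setdefault(getattr(rec, rank), []).append(rec)
    selected = []
    for taxon in sorted(groups):
        # Highest completeness first; accession breaks ties deterministically.
        ranked = sorted(groups[taxon], key=lambda r: (-r.completeness, r.accession))
        selected.extend(ranked[:per_taxon])
    return selected


# Toy input with made-up accessions and scores.
genomes = [
    GenomeRecord("G1", "Escherichia", "Enterobacteriaceae", 0.99),
    GenomeRecord("G2", "Escherichia", "Enterobacteriaceae", 0.95),
    GenomeRecord("G3", "Salmonella", "Enterobacteriaceae", 0.98),
    GenomeRecord("G4", "Bacillus", "Bacillaceae", 0.97),
]
by_genus = [g.accession for g in pick_representatives(genomes, rank="genus")]
by_family = [g.accession for g in pick_representatives(genomes, rank="family")]
print(by_genus)   # one genome per genus
print(by_family)  # one genome per family
```

Grouping at the family level yields a smaller, coarser dataset than grouping at the genus level, so the choice of rank is a trade-off between dataset size and taxonomic resolution.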
User Experience
Users start by selecting a taxon (e.g., E. coli) or uploading their own proteomes. The tool displays a pre-filtered, non-redundant dataset. Users can refine the selection using taxonomy or redundancy filters, then export the results for analysis. The process replaces manual curation with a few clicks, saving hours per project.
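The export step at the end of this flow could look roughly like the sketch below, which writes selected protein sequences in FASTA format using only the Python standard library. The `write_fasta` function, the identifier scheme (`G1|protA`), and the short sequences are assumptions made for illustration; a real export would take its headers and sequences from the curated dataset.

```python
import io


def write_fasta(proteins, handle, width=60):
    """Write (identifier, sequence) pairs as FASTA, wrapping sequence lines."""
    for identifier, sequence in proteins:
        handle.write(f">{identifier}\n")
        # Split the sequence into fixed-width lines.
        for start in range(0, len(sequence), width):
            handle.write(sequence[start:start + width] + "\n")


# Toy records with invented identifiers and sequences.
buffer = io.StringIO()
write_fasta([("G1|protA", "MKTAYIAKQR"), ("G4|protA", "MSLNE")], buffer, width=5)
fasta_text = buffer.getvalue()
print(fasta_text)
```

Passing an open file object instead of the `io.StringIO` buffer would write the same content to disk.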
Differentiation
Unlike uncurated databases (e.g., NCBI, UniProt), this tool provides *systematically selected, non-redundant genomes* tailored for phylogenetics. It avoids the need for manual filtering, which is error-prone and time-consuming. The taxonomically balanced datasets ensure comprehensive coverage without redundancy, improving analysis accuracy.
Scalability
Start with bacterial genomes, then expand to archaea, fungi, and metagenomes. Add premium datasets (e.g., clinical isolates) or integrations (e.g., with R or Biopython) for advanced users. Pricing can scale with dataset size or user seats, accommodating labs of all sizes.
Expected Impact
Researchers cut manual curation from 10+ hours/week to under 1 hour, accelerating project timelines. Curated datasets improve phylogenetic analysis accuracy and reduce computational costs. Labs can standardize their genome selection process, ensuring reproducibility across studies.