Heuristic Mining of Hierarchical Genotypes and Accessory Genome Loci in Bacterial Populations

December 7th, 2021

DOI: 10.3791/63115-v

Natasha Pavlovikj*¹, Joao Carlos Gomes-Neto*²,³, Andrew K. Benson²,³

¹Department of Computer Science and Engineering, University of Nebraska-Lincoln, ²Department of Food Science and Technology, University of Nebraska-Lincoln, ³Nebraska Food for Health Center, University of Nebraska-Lincoln

Transcript

This analytical protocol allows for the study of pathogenic populations of bacteria at large scale. That's very important because it enhances how ecological and epidemiological investigations can be done. But for that to happen, what we need is an automated and scalable tool, or a computational platform, that allows for many thousands of genome sequences to be analyzed at once.

ProkEvo fits that niche, and it allows for practical bacterial population analysis to be done at scale while mapping pan-genomic content that reveals genotypes and unique features of those genotypes for ecological and epidemiological investigation. The main advantage of this protocol is the use of powerful, automated, and scalable computational platforms, such as ProkEvo, to do heuristic mining of hierarchical genotypes in bacterial populations. The analytical protocol being presented here today has several practical implications.

One of them is to facilitate diagnostics, in the sense that it would allow bacterial genotypes to be mapped and tracked in real time and in a scalable fashion, so that pathogenic lineages can be discerned, defined, and tracked in different settings. Another application is to enhance routine surveillance by public health laboratories and regulatory agencies, which is done to facilitate the tracking of pathogens in different commercial settings. The protocol presented here provides practical guidance for microbiologists, ecologists, epidemiologists, and anyone interested in bacterial population genomics.

ProkEvo is an open-source and publicly available platform, and its GitHub page provides detailed usage instructions. The protocol explained here can be found on GitHub as well. With the provided instructions, we want to make ProkEvo and this protocol easy to use for both novice and advanced researchers.

Start conducting the analyses using ggtree to plot a phylogenetic tree along with genotypic information. To do so, optimize the ggtree figure size, including the diameter and width of rings, by changing the numerical values inside the xlim and gheatmap functions. When plotting multiple layers of data with the phylogenetic tree, aggregate all metadata into the lowest possible number of categories to facilitate the choice of coloring panel.

Conduct the data aggregation based on the question of interest and domain knowledge. Once done, use a bar plot to assess the relative frequencies by aggregating data for the sequence type, or ST, lineages and the core genome multilocus sequence typing, or cgMLST, variants to facilitate visualizations. Choose an empirical or statistical threshold to use for data aggregation.

The example code can be used to inspect the frequency distribution of the ST lineages and determine the cutoff. The example code shows how minor or low frequency STs are aggregated. The STs that are not numbered can be grouped as other STs.
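As an illustration of the aggregation step just described, here is a minimal Python sketch that uses only the standard library. It is not the code from the protocol's GitHub page. The function name `aggregate_minor_sts`, the 20% cutoff, and the ST calls are all hypothetical and chosen only to make the example small.

```python
from collections import Counter


def aggregate_minor_sts(st_calls, min_fraction=0.01, other_label="Other STs"):
    """Relabel every ST whose relative frequency is below min_fraction."""
    counts = Counter(st_calls)
    total = len(st_calls)
    # STs at or above the cutoff keep their own label.
    keep = {st for st, n in counts.items() if n / total >= min_fraction}
    return [st if st in keep else other_label for st in st_calls]


# Hypothetical ST calls for ten genomes (illustrative only, not real data).
st_calls = ["ST5"] * 4 + ["ST45"] * 3 + ["ST118"] * 2 + ["ST350"]

# Inspect the frequency distribution first, then pick a cutoff.
raw_counts = Counter(st_calls)

# Assumed empirical cutoff: STs below 20% of genomes become "Other STs".
aggregated = aggregate_minor_sts(st_calls, min_fraction=0.2)
relative_freq = {st: n / len(aggregated) for st, n in Counter(aggregated).items()}
```

The same function could be applied to cgMLST variant calls by passing a different label, for example `other_label="Other cgMLSTs"`.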

Use a similar code for the cgMLST variants. Use the nested approach to calculate the proportion of each ST lineage within each BAPS1 subgroup to identify the STs that belong to the same BAPS1 subgroup. The code exemplifies how the ST-based proportion can be calculated across the BAPS1 subgroups.
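The nested calculation can be sketched as follows, again in standard-library Python rather than the protocol's own code. The function name `st_proportions_within_baps1` and the (BAPS1 subgroup, ST) pairs are hypothetical, and each proportion is taken relative to the number of genomes in that BAPS1 subgroup.

```python
from collections import Counter, defaultdict


def st_proportions_within_baps1(records):
    """Return {baps1_subgroup: {st: share of that subgroup's genomes}}."""
    counts = defaultdict(Counter)
    for baps1, st in records:
        counts[baps1][st] += 1
    return {
        baps1: {st: n / sum(st_counts.values()) for st, n in st_counts.items()}
        for baps1, st_counts in counts.items()
    }


# Hypothetical (BAPS1 subgroup, ST) pairs, one per genome (not real data).
records = [
    ("1", "ST5"), ("1", "ST5"), ("1", "ST118"), ("1", "ST118"),
    ("2", "ST45"), ("2", "ST45"), ("2", "ST45"), ("2", "ST5"),
]

nested = st_proportions_within_baps1(records)

# STs that occur in the same BAPS1 subgroup can then be listed directly.
sts_in_subgroup_1 = sorted(nested["1"])
```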

To plot the distribution of antimicrobial resistance, or AMR, loci across the ST lineages, use an empirical or statistical threshold to filter out the most important AMR loci to facilitate visualizations. Provide a raw .csv file containing the calculated proportions of all AMR loci across all the ST lineages.

Next, calculate the AMR proportion for each ST using the code. After calculations are done for all STs, combine the data sets as one data frame using the code, and then export the .csv file containing the calculated proportions with the code. Before plotting the AMR-based distribution across the ST lineages, filter the data based on a threshold to facilitate visualizations.
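The calculate, combine, export, and filter sequence can be sketched in standard-library Python as below. This is an assumption-laden illustration, not the protocol's code: the function names, the column names `st`, `locus`, and `proportion`, the placeholder loci `locus_a` and `locus_b`, and the 0.6 threshold are all hypothetical. The sketch writes the table to an in-memory buffer; an open file handle could be passed instead.

```python
import csv
import io
from collections import Counter, defaultdict


def amr_proportions_by_st(genomes):
    """genomes: iterable of (st, loci). Returns one combined long table."""
    genomes_per_st = Counter()
    hits = defaultdict(Counter)
    for st, loci in genomes:
        genomes_per_st[st] += 1
        for locus in set(loci):  # count each locus once per genome
            hits[st][locus] += 1
    table = []
    for st in sorted(genomes_per_st):
        for locus in sorted(hits[st]):
            proportion = hits[st][locus] / genomes_per_st[st]
            table.append({"st": st, "locus": locus, "proportion": proportion})
    return table


def write_csv(table, handle):
    """Export the combined table as CSV to any writable text handle."""
    writer = csv.DictWriter(handle, fieldnames=["st", "locus", "proportion"])
    writer.writeheader()
    writer.writerows(table)


def filter_by_threshold(table, min_proportion):
    """Keep only rows at or above the chosen plotting threshold."""
    return [row for row in table if row["proportion"] >= min_proportion]


# Hypothetical genomes with placeholder locus names (not real data).
genomes = [
    ("ST45", ["locus_a", "locus_b"]),
    ("ST45", ["locus_a"]),
    ("ST5", ["locus_a"]),
    ("ST5", []),
]

table = amr_proportions_by_st(genomes)
buffer = io.StringIO()
write_csv(table, buffer)
csv_text = buffer.getvalue()
filtered = filter_by_threshold(table, min_proportion=0.6)
```

Keeping the unfiltered table as the exported file and filtering only for the plot means the threshold can be changed later without recomputing the proportions.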

Next, plot the core genome phylogeny along with the hierarchical genotypic classifications and AMR data in a single plot using ggtree. Then optimize the figure size inside ggtree using the parameters mentioned earlier. Optimize the visualizations by aggregating the variables or using binary classification, such as the gene presence or absence.
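The binary classification mentioned here can be prepared before plotting. The following standard-library Python sketch shows one possible way to turn per-genome locus calls into a presence or absence matrix; it does not draw the tree. The function name, genome identifiers, and locus names are hypothetical.

```python
def presence_absence_matrix(genome_loci, loci):
    """Return {genome: [1 or 0 per locus]} with columns in the order of loci."""
    return {
        genome: [1 if locus in found else 0 for locus in loci]
        for genome, found in genome_loci.items()
    }


# Hypothetical locus calls per genome (not real data).
genome_loci = {
    "genome_1": {"locus_a", "locus_b"},
    "genome_2": {"locus_a"},
    "genome_3": set(),
}
loci = ["locus_a", "locus_b"]

matrix = presence_absence_matrix(genome_loci, loci)
```

A matrix of this shape, with one row per tree tip and one column per locus, is the kind of input a heat map layer alongside a phylogeny typically expects.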

The hierarchical population structure of Salmonella enterica lineage one in the context of a core genome phylogeny was examined. The relative frequencies of all hierarchical genotypes were then used to evaluate the overall distribution and most frequently observed classifications. Less frequent ST lineages were aggregated as other STs to facilitate data visualization.

Similarly, less frequent cgMLST variants were aggregated as other cgMLSTs. The ancestral relationships between the STs were examined using a nested approach by assessing the relative frequency of ST lineages by the BAPS1 subgroups or haplotypes. The relative frequency of the ST lineage differentiating AMR loci was assessed to identify unique accessory genomic signatures linked to the serovar Newport population structure.

In the results, the mdf(A) and aac(6')-Iaa loci appeared to be ancestrally acquired by the serovar Newport population, whereas ST45 is predicted to be multi-drug resistant. When compared to ST45, the other major ST lineages, such as ST5 and ST118, are more likely to be multi-drug susceptible. Additionally, a phylogeny-anchored visualization was used to integrate the hierarchical population structure data systematically.

This analytical protocol presents a basis for data mining of bacterial populations at scale. What it allows is for genotypes to be mapped and tracked at scale using ProkEvo, but it also can be expanded to answer other questions, such as exploring the distribution of metabolic pathways and virulence factors associated with genotypic information. That is, we can predict the phenotypes that are associated with specific genotypes of interest.

The protocol described here definitely paves the way for researchers to explore new questions in the field of population genomics and infer evolutionary and ecological patterns for pathogenic as well as non-pathogenic bacterial species.
