Metagenomics Scientist

Metagenomics Scientist - Second Genome, Inc.

San Mateo, CA


I have over 7 years of experience in Meta*omic data analysis, algorithm design and pipeline development. I have developed a number of high-throughput data-mining pipelines and biological databases with the goal of answering the ‘Who?’, ‘Why?’ and ‘What?’ about microbial communities in various environments. More recently, I have focused on identifying and nominating proteins from these communities for various areas of interest related to human and plant health. I am always interested in opportunities to apply my skills in various areas of ‘–omics’ research.

Authorized to work in the US for any employer

Work Experience

Metagenomics Scientist

Second Genome
San Francisco Bay Area, CA

December 2015 to Present

• Develop and validate (experimentally and in-silico) a scalable pipeline for de novo assembly and differential abundance of meta-omic data for the cloud (AWS). This will now be a service that Second Genome will be able to offer to external clients. (Perl, BASH, R) 
• Develop a graph database (Neo4j) pipeline solution as a means to integrate various types of omics data and nominate protein/peptide candidates to be tested as therapeutics in various disease assays. (BASH, R)
• Mine the microbial dark matter and nominate novel protein/peptide drug candidates to be tested as therapeutics in various disease assays. About 60 peptide and over 100 protein candidates for immunooncology. About 1000 protein candidates for insecticidal activity. About 200 protein candidates for various metabolic diseases.
• Developed an algorithm to look for insecticides with novel modes of action. Paper in the works. (Neo4j, R) 
• Developed a RF model to predict the ‘expressibility’ of a nominated protein. Paper in the works. (R) 
• Developed a pipeline to continuously check public databases for new genomes, perform QC on said genomes and assimilate this information into a database for later use. This also worked with the bins acquired from the de novo assembly pipeline mentioned earlier. (Perl, BASH, R)
• Update the proprietary ‘GreenGenes’ and ‘StrainSelect’ databases with the latest publicly available genomes and help automate this process. I was also directly responsible for a 5x increase in the database size.
• Collaborated with talented microbiologists, immunologists and protein scientists in-house and externally to fine-tune our protein selection, and nomination algorithms. 
• Hired and (fully or partially) managed talented bioinformatics developers and scientists to build and extend the Bioinformatics Discovery & Development, Cloud Infrastructure Management and Omics Data Science teams.

Research Bioinformatics Specialist

Department of Earth and Environmental Sciences
Ann Arbor, MI

May 2011 to December 2015

• Manage and oversee all bioinformatics projects for the lab. 
• Perform binning, metagenomic and metatranscriptomic analysis on various metagenomic datasets that directly led to the discovery of numerous novel microorganisms. See ‘Publications’ for more.
• Developed a pipeline for de-novo metatranscriptome assembly. See Baker et al 2013 in Publications. 
• Build custom analysis and de-novo assembly tools/pipelines for the various versions of Illumina and PacBio datasets. 
• Develop database solutions for handling meta-omics data. 
• Provide bioinformatics and HPC support to the students, researchers and collaborators of the lab. 
• Developed the graph database (Neo4j) back-end to integrate microbial genomics, environmental chemistry, and ecosystem processes data to understand harmful algal blooms. Still under active development: Cyanohub.
• Developed high-throughput methods to find, classify and score novel secondary metabolites from meta-omics data as potential natural products of interest. Building on my U.S. Patent No. 20150361470.
• Helped develop algorithms to identify and characterize viruses from metagenomic datasets. Publication in-prep.
• Mined meta-omic data from some of the deepest hydrothermal vents in the world. This was challenging especially because of the amount of data and novel organisms present in the datasets. Publications in-prep. 
• Teach 'bioinformatic analysis' portions of the Earth 513 and 523 courses offered by the department in Winter and Fall respectively.

Graduate Student Research Assistant/Part-time Researcher

Department of Earth and Environmental Sciences
Ann Arbor, MI

January 2011 to May 2011

Michigan Geomicrobiology Lab, University of Michigan 
• Creation and application of tools and database infrastructure to the analysis of new metagenomic and metatranscriptomic datasets. This involved analysis of large next-generation DNA and RNA sequence datasets from deep-sea waters of the Gulf of California.
• Functional and genomic study of a cryptic and ubiquitous deltaproteobacterial clade: SAR324.

Bioinformatics Analyst (Internship)

Department of Earth and Environmental Sciences
Ann Arbor, MI

May 2010 to December 2010

Michigan Geomicrobiology Lab, University of Michigan 
• Designed and developed Metagenome and Metatranscriptome NGS data analysis pipelines 
• Comparative Genomics study on the Cyanobacteria found at the bottom of Lake Huron. This region is believed to be an analog to study the Proterozoic Eon.


Inter-University Consortium for Political and Social Research
Ann Arbor, MI

January 2010 to April 2010

University of Michigan 
• 'The Barrier's Project': An NLP (Natural Language Processing) based approach to classify scientific articles into relevant clusters. This tool is currently being used by the ICPSR as a means to easily curate the deluge of publications in a database.


MS in Bioinformatics

University of Michigan
Ann Arbor, MI

2009 to 2011

B.Tech in Bioinformatics

Amity University
Noida, Uttar Pradesh

2004 to 2008


Perl (9 years), R (3 years), Neo4j (3 years), Shell Scripting (7 years), MySQL (3 years), AWS (2 years)

Computer Skills
• Languages of choice: Perl, R, PL/pgSQL, Python 3
• Familiar with: Python 2, C/C++, Java, SAS, SPSS, Tanagra.
• RDBMS: PostgreSQL, BioSQL, MySQL, Oracle, SQLite.
• Platforms: Unix/Linux, Windows, Mac.

Bioinformatics
• Public Databases: CAMERA, EMBL, JGI, KBase, NCBI, KEGG and more.
• Tools and Pipelines: AMOS, Arb, Blast, Bowtie, BWA, Consed, ESOM, Galaxy, GATK, GMOD, Megan, Meta-AMOS, MG-RAST, Mothur, PrinSeq, samtools.
• Assemblers: Celera, Newbler, Velvet, Oases, Trinity, Meta-Velvet, IDBA, SOAP, SPAdes, Ray. (7 years)


Highest number of `Impact Awards` in Q3 2017.


Impact Awards are peer-to-peer recognition for colleagues who made a difference.



August 2009 to December 2015

This material is based upon work supported by the National Science Foundation under Grant Number EAR-1035955. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.


Nonribosomal Peptide Synthetases (#20150361470)

December 2017

The present disclosure is directed to the biosynthetic pathway for a nonribosomal peptide synthetase (NRPS)-derived drug and analogs thereof. The invention provides polynucleotide sequences useful for heterologous expression in a convenient microbial host for the synthesis of the NRPS-derived drug, the polypeptides encoded by such polynucleotides, expression vectors comprising the polynucleotides, host cells comprising the polynucleotides or expression vectors, and kits comprising a host cell. Also provided is a method for the production of ET-743, the NRPS-derived drug.


Cyanobacterial life at low O2: Community genomics and function reveal metabolic versatility and extremely low diversity of a cyanobacterial mat.

March 2012

Voorhies, A.A., B.A. Biddanda, S.T. Kendall, S. Jain, D.N. Marcus, S.C. Nold, N.D. Sheldon and G.J. Dick. Geobiology 
Cyanobacteria are renowned as the mediators of Earth’s oxygenation. However, little is known about the cyanobacterial communities that flourished under the low-O2 conditions that characterized most of their evolutionary history. Microbial mats in the submerged Middle Island Sinkhole of Lake Huron provide opportunities to investigate cyanobacteria under such persistent low-O2 conditions. Here, venting groundwater rich in sulfate and low in O2 supports a unique benthic ecosystem of purple-colored cyanobacterial mats. Beneath the mat is a layer of carbonate that is enriched in calcite and to a lesser extent dolomite. In situ benthic metabolism chambers revealed that the mats are net sinks for O2, suggesting primary production mechanisms other than oxygenic photosynthesis. Indeed, 14C-bicarbonate uptake studies of autotrophic production show variable contributions from oxygenic and anoxygenic photosynthesis and chemosynthesis, presumably because of supply of sulfide. These results suggest the presence of either facultatively anoxygenic cyanobacteria or a mix of oxygenic/anoxygenic types of cyanobacteria. Shotgun metagenomic sequencing revealed a remarkably low-diversity mat community dominated by just one genotype most closely related to the cyanobacterium Phormidium autumnale, for which an essentially complete genome was reconstructed. Also recovered were partial genomes from a second genotype of Phormidium and several Oscillatoria. Despite the taxonomic simplicity, diverse cyanobacterial genes putatively involved in sulfur oxidation were identified, suggesting a diversity of sulfide physiologies. The dominant Phormidium genome reflects versatile metabolism and physiology that is specialized for a communal lifestyle under fluctuating redox conditions and light availability. Overall, this study provides genomic and physiologic insights into low-O2 cyanobacterial mat ecosystems that played crucial geobiological roles over long stretches of Earth history.

The metatranscriptome of a deep-sea hydrothermal plume is dominated by water column methanotrophs and lithotrophs

May 2012

Lesniewski, R.A., S. Jain, P.D. Schloss, K. Anantharaman, and G.J. Dick. The ISME Journal.
Microorganisms mediate geochemical processes in deep-sea hydrothermal vent plumes, which are a conduit for transfer of elements and energy from the subsurface to the oceans. Despite this important microbial influence on marine geochemistry, the ecology and activity of microbial communities in hydrothermal plumes is largely unexplored. Here, we use a coordinated metagenomic and metatranscriptomic approach to compare microbial communities in Guaymas Basin hydrothermal plumes to background waters above the plume and in the adjacent Carmen Basin. Despite marked increases in plume total RNA concentrations (3–4 times) and microbially mediated manganese oxidation rates (15–125 times), plume and background metatranscriptomes were dominated by the same groups of methanotrophs and chemolithoautotrophs. Abundant community members of Guaymas Basin seafloor environments (hydrothermal sediments and chimneys) were not prevalent in the plume metatranscriptome. De novo metagenomic assembly was used to reconstruct genomes of abundant populations, including Marine Group I archaea, Methylococcaceae, SAR324 Deltaproteobacteria and SUP05 Gammaproteobacteria. Mapping transcripts to these genomes revealed abundant expression of genes involved in the chemolithotrophic oxidation of ammonia (amo), methane (pmo) and sulfur (sox). Whereas amo and pmo gene transcripts were abundant in both plume and background, transcripts of sox genes for sulfur oxidation from SUP05 groups displayed a 10–20-fold increase in plumes. We conclude that the biogeochemistry of Guaymas Basin hydrothermal plumes is mediated by microorganisms that are derived from seawater rather than from seafloor hydrothermal environments such as chimneys or sediments, and that hydrothermal inputs serve as important electron donors for primary production in the deep Gulf of California.

Community transcriptomic assembly reveals microbes that contribute to deep-sea carbon and nitrogen cycling.

April 2013

Baker, B.J., C. S. Sheik, C. A. Taylor, S. Jain, A. Bhasi, J.D. Cavalcoli and G.J. Dick. The ISME Journal. 
The deep ocean is an important component of global biogeochemical cycles because it contains one of the largest pools of reactive carbon and nitrogen on earth. However, the microbial communities that drive deep-sea geochemistry are vastly unexplored. Metatranscriptomics offers new windows into these communities, but it has been hampered by reliance on genome databases for interpretation. We reconstructed the transcriptomes of microbial populations from Guaymas Basin, in the deep Gulf of California, through shotgun sequencing and de novo assembly of total community RNA. Many of the resulting messenger RNA (mRNA) contiguous sequences contain multiple genes, reflecting co-transcription of operons, including those from dominant members. Also prevalent were transcripts with only limited representation (2.8 times coverage) in a corresponding metagenome, including a considerable portion (1.2 Mb total assembled mRNA sequence) with similarity (96%) to a marine heterotroph, Alteromonas macleodii. This Alteromonas and euryarchaeal marine group II populations displayed abundant transcripts from amino-acid transporters, suggesting recycling of organic carbon and nitrogen from amino acids. Also among the most abundant mRNAs were catalytic subunits of the nitrite oxidoreductase complex and electron transfer components involved in nitrite oxidation. These and other novel genes are related to novel Nitrospirae and have limited representation in accompanying metagenomic data. High throughput sequencing of 16S ribosomal RNA (rRNA) genes and rRNA read counts confirmed that Nitrospirae are minor yet widespread members of deep-sea communities. These results implicate a novel bacterial group in deep-sea nitrite oxidation, the second step of nitrification. This study highlights metatranscriptomic assembly as a valuable approach to study microbial communities.

Metabolic flexibility of deep-sea SAR324 revealed through metagenomic and transcriptomic analysis.


Sheik, C, S. Jain and G. J. Dick. Environmental Microbiology. 
Chemolithotrophy is a pervasive metabolic lifestyle for microorganisms in the dark ocean. The SAR324 group of Deltaproteobacteria is ubiquitous in the ocean and has been implicated in sulfur oxidation and carbon fixation, but also contains genomic signatures of C1 utilization and heterotrophy. Here we reconstructed the metagenome and metatranscriptome of a population of SAR324 from a hydrothermal plume and surrounding waters in the deep Gulf of California to gain insight into the genetic capability and transcriptional dynamics of this enigmatic group. SAR324's metabolism is signified by genes that encode a novel particulate hydrocarbon monooxygenase (pHMO), degradation pathways for corresponding alcohols and short chain fatty acids, dissimilatory sulfur oxidation (DsrAB), formate dehydrogenase (FDH) and a nitrite reductase (NirK). Transcripts of the pHMO, NirK, FDH and transporters for exogenous carbon and amino acid uptake were highly abundant in plume waters. Sulfur oxidation genes were also abundant in the plume metatranscriptome, indicating SAR324 may also utilize reduced sulfur species in hydrothermal fluids. These results suggest that aspects of SAR324's versatile metabolism (lithotrophy, heterotrophy and alkane oxidation) operate simultaneously, and may explain SAR324's ubiquity in the deep Gulf of California and in the global marine biosphere.

Novel hydrocarbon monooxygenase genes in the metatranscriptome of a natural deep-sea hydrocarbon plume.


Li, M., S. Jain, B.J. Baker, C.A. Taylor and G.J. Dick. Environmental Microbiology. 
Particulate membrane-associated hydrocarbon monooxygenases (pHMOs) are critical components of the aerobic degradation pathway for low molecular weight hydrocarbons, including the potent greenhouse gas methane. Here, we analyzed pHMO gene diversity in metagenomes and metatranscriptomes of hydrocarbon-rich hydrothermal plumes in the Guaymas Basin (GB) and nearby background waters in the deep Gulf of California. Seven distinct phylogenetic groups of pHMO were present and transcriptionally active in both plume and background waters, including several that are undetectable with currently available PCR primers. The seven groups of pHMOs included those related to a putative ethane oxidizing Methylococcaceae-like group, a group of the SAR324 Deltaproteobacteria, three deep-sea clades (Deep sea-1/symbiont-like, Deep sea-2/PS-80 and Deep sea-3/OPU3) within gammaproteobacterial methanotrophs, one clade related to Group Z, and one unknown group. Differential abundance of pHMO gene transcripts in plume and background suggests niche differentiation between groups. Corresponding 16S rRNA genes reflected similar phylogenetic and transcriptomic abundance trends. The novelty of transcriptionally active pHMOs we recovered from a hydrocarbon-rich hydrothermal plume suggests there are significant gaps in our knowledge of the diversity and function of these enzymes in the environment.

How Do Facultative Methanotrophs Utilize Multi-Carbon Compounds for Growth? Genomic and Transcriptomic Analysis of Methylocystis Strain SB2 Grown on Methane and on Ethanol


Vorobev, A, S. Jagadevan, S. Jain, K. Anantharaman, G. J. Dick, S Vuilleumier, and J Semrau. 
A minority of methanotrophs are able to utilize multi-carbon compounds as growth substrates in addition to methane. The pathways utilized by these microorganisms for assimilation of multi-carbon compounds, however, have not been explicitly examined. Here, we report the draft genome of the facultative methanotroph Methylocystis strain SB2 and perform a detailed transcriptomic analysis of cultures grown with either methane or ethanol. Evidence for use of the canonical methane oxidation pathway and the serine cycle for carbon assimilation from methane was obtained, and also for operation of the complete tricarboxylic acid (TCA) cycle and the ethylmalonyl-CoA (EMC) pathway. Experiments with Methylocystis strain SB2 grown on methane revealed that genes responsible for the first step of methane oxidation, the conversion of methane to methanol, were expressed at a significantly higher level than downstream oxidative transformations, suggesting that this step may be rate-limiting for growth of this strain with methane. Further, transcriptomic analyses of Methylocystis strain SB2 grown with ethanol as compared to methane revealed that on ethanol (1) expression of the pathway of methane oxidation and the serine cycle was significantly reduced, (2) expression of the TCA cycle dramatically increased, and (3) expression of the EMC pathway was similar. Based on these data, it appears Methylocystis strain SB2 converts ethanol to acetyl-CoA, which is then funneled into the TCA cycle for energy generation, or incorporated into biomass via the EMC pathway. This suggests that some methanotrophs have greater metabolic flexibility than previously thought, and that operation of multiple pathways in these microorganisms is highly controlled and integrated.

Identification and analysis of the bacterial endosymbiont specialized for production of the chemotherapeutic natural product ET-743


Schofield, M. M.*, S. Jain*, D. Porat, G. J. Dick and D. H. Sherman
(*co-first authors)
Ecteinascidin 743 (ET-743, Yondelis) is a clinically approved chemotherapeutic natural product isolated from the Caribbean mangrove tunicate Ecteinascidia turbinata. Researchers have long suspected that a microorganism may be the true producer of the anti-cancer drug, but its genome has remained elusive due to our inability to culture the bacterium in the laboratory using standard techniques. Here, we sequenced and assembled the complete genome of the ET-743 producer, Candidatus Endoecteinascidia frumentensis, directly from metagenomic DNA isolated from the tunicate. Analysis of the ∼631 kb microbial genome revealed strong evidence of an endosymbiotic lifestyle and extreme genome reduction. Phylogenetic analysis suggested that the producer of the anti-cancer drug is taxonomically distinct from other sequenced microorganisms and could represent a new family of Gammaproteobacteria. The complete genome has also greatly expanded our understanding of ET-743 production and revealed new biosynthetic genes dispersed across more than 173 kb of the small genome. The gene cluster's architecture and its preservation demonstrate that the drug is likely essential to the interactions of the microorganism with its mangrove tunicate host. Taken together, these studies elucidate the lifestyle of a unique and pharmaceutically important microorganism and highlight the wide diversity of bacteria capable of making potent natural products.

Genomic and transcriptomic evidence for scavenging of diverse organic compounds by widespread deep-sea archaea

November 17, 2015

Meng Li, Brett J. Baker, Karthik Anantharaman, Sunit Jain, John A. Breier & Gregory J. Dick
Microbial activity is one of the most important processes to mediate the flux of organic carbon from the ocean surface to the seafloor. However, little is known about the microorganisms that underpin this key step of the global carbon cycle in the deep oceans. Here we present genomic and transcriptomic evidence that five ubiquitous archaeal groups actively use proteins, carbohydrates, fatty acids and lipids as sources of carbon and energy at depths ranging from 800 to 4,950 m in hydrothermal vent plumes and pelagic background seawater across three different ocean basins. Genome-enabled metabolic reconstructions and gene expression patterns show that these marine archaea are motile heterotrophs with extensive mechanisms for scavenging organic matter. Our results shed light on the ecological and physiological properties of ubiquitous marine archaea and highlight their versatile metabolic strategies in deep oceans that might play a critical role in global carbon cycling.

Single-Cell (Meta-) Genomics of a Dimorphic Candidatus Thiomargarita nelsonii Reveals Genomic Plasticity

April 2016

Unraveling the physiological roles of the cyanobacterium Geitlerinema sp. BBD and other black band disease community members through genomic analysis of a mixed culture.

June 2016

Genomic and Transcriptomic Resolution of Organic Matter Utilization Among Deep-Sea Bacteria in Guaymas Basin Hydrothermal Plumes.

July 2017

Are oligotypes meaningful ecological and phylogenetic units: a case study of Microcystis in freshwater lakes

March 2017

Additional Information

Software Development. 
Complex Systems Modeling 
Data Organization, Management and Analysis. 
• Languages of choice: Perl, R, Cypher 
• Familiar with: Python 2, C/C++, Java, SAS, SPSS, Tanagra. 
• RDBMS: PostgreSQL, BioSQL, MySQL, Oracle, SQLite, Neo4j. 
• Platforms: Unix/Linux, Windows, Mac. 
• Public Databases: CAMERA, EMBL, JGI, KBase, NCBI, KEGG and more. 
• Tools and Pipelines: AMOS, Arb, Blast, Bowtie, BWA, Consed, ESOM, Galaxy, GATK, GMOD, Megan, Meta-AMOS, MG-RAST, Mothur, PrinSeq, samtools. 
• Assemblers: Celera, Newbler, Velvet, Oases, Trinity, Meta-Velvet, IDBA, SOAP, SPAdes, Ray.