Glycan Structure Extraction from Scientific Literature

Author: Nhat Duong

Mentor: Dr. Nathan Edwards, Department of Biochemistry and Molecular & Cellular Biology, Georgetown University Medical Center.

The extraction and representation of accepted glycobiology knowledge is challenging due to the widespread use of images to represent glycans in published literature. In the absence of an explicit computer-readable glycan sequence or accession number, human curation is required to extract published glycosylation knowledge for our glycomics data resources. Glycan structure images in published literature are highly stylized but poorly standardized, despite efforts by the Symbol Nomenclature for Glycans (SNFG) group to make them more consistent. Automated extraction of glycan structures from figures in published manuscripts will ease the curation effort.

We use a combination of open-source Python modules for parsing and manipulating PDF files; neural network-based object classification to locate glycans in manuscripts’ figures; and OpenCV-based image analysis to extract glycan structure details. Figures from GlyGen and UniCarbKB manuscript annotations were curated to construct a training set of figures with glycan bounding-box locations. This object classification approach successfully identifies glycan structures in manuscript figures for subsequent detailed glycan image analysis and highlights the glycan structures in the manuscript.
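Highlighting a detected glycan in the manuscript requires mapping its bounding box from figure pixels back to page coordinates. The sketch below illustrates that arithmetic under stated assumptions: the page was rasterized at a known resolution, pixel coordinates have their origin at the top-left, and the PDF page uses points (1/72 inch) with the origin at the bottom-left. The function name `pixel_box_to_pdf_rect` and its parameters are hypothetical and are not taken from the project's code.

```python
def pixel_box_to_pdf_rect(box, dpi, page_height_pt):
    """Map a pixel bounding box to a PDF rectangle in points.

    box: (x0, y0, x1, y1) in pixels, origin at the top-left of the page image.
    dpi: resolution at which the page was rasterized (an assumption).
    page_height_pt: page height in PDF points, needed to flip the y axis.
    Returns (left, bottom, right, top) with the origin at the bottom-left.
    """
    x0, y0, x1, y1 = box
    # 72 points per inch, so pixels * 72 / dpi gives points.
    left = x0 * 72.0 / dpi
    right = x1 * 72.0 / dpi
    # Image y grows downward, PDF y grows upward, hence the subtraction.
    top = page_height_pt - y0 * 72.0 / dpi
    bottom = page_height_pt - y1 * 72.0 / dpi
    return (left, bottom, right, top)


if __name__ == "__main__":
    # A US Letter page (792 pt tall) rendered at 150 dpi.
    print(pixel_box_to_pdf_rect((150, 300, 450, 600), dpi=150, page_height_pt=792.0))
```

Some PDF libraries already expose page rectangles with a top-left origin, in which case the y flip would not be needed.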

Glycan structure details are extracted using the open-source OpenCV library. Using color masks, monosaccharides' distinctive colors and shapes are recognized, establishing the monosaccharide composition of the glycan. Monosaccharide linkage is then determined, where possible. The glycan structure’s orientation and the reducing-end monosaccharide are then established by identifying common glycan cores. Together, this information is sufficient to construct a GlycoCT-format sequence for the structure’s topology and to then search for a matching glycan in GlyTouCan. Once matched, the GlyTouCan accession is used to embed a targeted, clickable link to the GNOme Structure Browser on top of the glycan structure in the original PDF file.
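As an illustration of how recognized colors and shapes can yield a composition, here is a minimal sketch. It does not use OpenCV color masks as the pipeline does; it swaps in a plain-Python nearest-color lookup so the example stays self-contained. The RGB values are approximate stand-ins for the SNFG palette, the symbol table covers only a few common monosaccharides, and the names `nearest_colour` and `composition` are hypothetical.

```python
from collections import Counter

# Approximate RGB values for a few SNFG colors (illustrative assumption).
SNFG_COLOURS = {
    "blue": (0, 114, 188),
    "green": (0, 166, 81),
    "yellow": (255, 212, 0),
    "red": (237, 28, 36),
    "purple": (165, 67, 153),
}

# A small subset of SNFG symbols: (color, shape) -> monosaccharide.
SNFG_SYMBOLS = {
    ("blue", "circle"): "Glc",
    ("green", "circle"): "Man",
    ("yellow", "circle"): "Gal",
    ("blue", "square"): "GlcNAc",
    ("yellow", "square"): "GalNAc",
    ("red", "triangle"): "Fuc",
    ("purple", "diamond"): "Neu5Ac",
}


def nearest_colour(rgb):
    """Return the palette name with the smallest squared RGB distance."""
    return min(
        SNFG_COLOURS,
        key=lambda name: sum((a - b) ** 2 for a, b in zip(rgb, SNFG_COLOURS[name])),
    )


def composition(detections):
    """Count monosaccharides from (mean_rgb, shape) detections."""
    counts = Counter()
    for rgb, shape in detections:
        # Symbols outside the small table above are counted as "unknown".
        counts[SNFG_SYMBOLS.get((nearest_colour(rgb), shape), "unknown")] += 1
    return dict(counts)


if __name__ == "__main__":
    # An N-glycan core as it might be detected: two squares, three circles.
    core = [
        ((10, 110, 190), "square"),
        ((0, 114, 188), "square"),
        ((5, 160, 85), "circle"),
        ((0, 166, 81), "circle"),
        ((2, 170, 80), "circle"),
    ]
    print(composition(core))
```

Because the lookup always returns the closest palette entry, a real implementation would likely need a distance threshold so that background or text pixels are rejected instead of being forced into a color class.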

This infrastructure provides a surprisingly effective tool for extracting glycan structures from the figures of published glycobiology manuscripts. The method successfully identifies glycans' positions on all pages, extracts their topological structure in GlycoCT format, and annotates them in place with deep links to GNOme so the curator can verify or refine the specific details of the structure. This prototype demonstrates the potential utility of automated extraction of glycan structures from published manuscript figures, significantly lowering the curation burden for the representation of glycosylation knowledge in glycomics resources.