Introducing Nougat: Academic PDF Text Extraction Beyond OCR

Hello Learners…

Welcome to the blog…

Table Of Contents

  • Introduction
  • Introducing Nougat: Academic PDF Text Extraction Beyond OCR
  • Summary
  • References

Introduction

In this post, we introduce Nougat: Academic PDF Text Extraction Beyond OCR.

Nougat (Neural Optical Understanding for Academic Documents) is a new generative model from Meta AI, trained to extract text from academic PDFs without needing traditional OCR engines.

Nougat: Neural Optical Understanding for Academic Documents

Introducing Nougat: Academic PDF Text Extraction Beyond OCR

We know how important data quality is for training LLMs. Now, Nougat can convert scanned documents and textbooks into high-quality data for pretraining!

Model Details:

  • Nougat is an encoder-decoder transformer that can be trained end to end and uses the same architecture as Donut. The model has a Swin Transformer encoder and an mBART decoder.
  • The Swin Transformer encodes the input document image into latent embeddings.
  • The mBART decoder turns the encoded image embeddings into a sequence of tokens in an auto-regressive way.
  • The model was trained on datasets of PDF pages and LaTeX source code. Data augmentations like noise, blurring, and distortions were used to simulate scanned documents.
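
To make the auto-regressive decoding step more concrete, here is a small toy loop. Treat it as a sketch only: `toy_decoder_scores` is a hypothetical stand-in for the mBART decoder (it ignores the image embeddings and simply replays a fixed target), greedy selection is assumed for simplicity, and none of this is Nougat's actual code.

```python
from typing import Callable, Dict, List

BOS, EOS = "<s>", "</s>"  # start and end markers (illustrative names)


def greedy_decode(
    score_fn: Callable[[List[float], List[str]], Dict[str, float]],
    image_embeddings: List[float],
    max_new_tokens: int = 4096,  # mirrors the context window quoted in the post
) -> List[str]:
    """Generate tokens one at a time, feeding each choice back as input."""
    tokens = [BOS]
    for _ in range(max_new_tokens):
        # The scores depend on the image embeddings AND all tokens so far.
        scores = score_fn(image_embeddings, tokens)
        next_token = max(scores, key=scores.get)
        if next_token == EOS:
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the start marker


def toy_decoder_scores(image_embeddings: List[float], prefix: List[str]) -> Dict[str, float]:
    """Hypothetical decoder stand-in: always prefers the next token of a fixed target."""
    target = ["# ", "Title", "\n", "$E=mc^2$", EOS]
    position = len(prefix) - 1  # number of tokens generated so far
    expected = target[min(position, len(target) - 1)]
    return {tok: (1.0 if tok == expected else 0.0) for tok in set(target)}


output = greedy_decode(toy_decoder_scores, image_embeddings=[0.1, 0.2, 0.3])
print("".join(output))
```

The point of the loop is that every new token is chosen with the full prefix in view, which is what "auto-regressive" means here. A real decoder would return scores over its whole vocabulary instead of five toy tokens.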

Paper insights:

  • The input image size is 896×672, an aspect ratio between US Letter and DIN A4
  • The mBART decoder context window is 4096 tokens
  • The model has only 350M parameters
  • Data augmentation was key to simulating scanned documents
  • The training dataset totaled 8,204,754 pages
  • Evaluated only against "non-ML" methods and not against existing ML models like Donut
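
The aspect-ratio claim is easy to check with a few lines of arithmetic. This is a quick sketch that assumes 896 is the height and 672 the width, and uses the standard portrait dimensions of US Letter (8.5 × 11 in) and DIN A4 (210 × 297 mm).

```python
# Height / width ratios; 896 is assumed to be the height of the input image.
nougat_input = 896 / 672   # model input size in pixels
us_letter = 11.0 / 8.5     # US Letter, inches
din_a4 = 297 / 210         # DIN A4, millimetres

print(f"US Letter: {us_letter:.4f}")     # 1.2941
print(f"Nougat:    {nougat_input:.4f}")  # 1.3333
print(f"DIN A4:    {din_a4:.4f}")        # 1.4143

# True when the model's ratio sits between the two paper formats.
is_between = us_letter < nougat_input < din_a4
print(is_between)  # True
```

So a 4:3 input sits between the two common page formats, which means pages of either size only need a small amount of padding or rescaling to fit.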

GitHub:

Full Research paper:

Demo by Hugging Face:

Nougat represents a significant advancement in this domain, as it goes beyond traditional Optical Character Recognition (OCR) by converting scientific documents into a machine-readable markup language.

The model has been rigorously tested and validated on a comprehensive dataset of scientific documents, demonstrating its impressive performance and accuracy.

Summary

Happy learning, and keep learning…

Thank you…

References
