The team uses a custom-tuned, transformer-encoder-based network that converts web pages to text in order to retrieve generic information available on product pages, such as price, title, description, and image URLs.
The network can extract information from nested tables and complex textual structures because the model understands both language and the HTML DOM.

Another way to extract information from web pages, PDFs, or screenshots is visual scraping. When crawling is not an option, the analytics and data science team often uses a custom-built, AI-based visual crawling solution.
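As a rough illustration of what converting a web page to text while keeping some awareness of the HTML DOM could look like, the sketch below flattens a page into lines that pair each text node with its tag path. This is not the team's model or pipeline, which the article does not describe in detail. It is a hypothetical preprocessing step written with Python's standard `html.parser`, and the names `DomLinearizer`, `linearize`, and `sample_page` are made up for this example.

```python
from html.parser import HTMLParser


class DomLinearizer(HTMLParser):
    """Hypothetical helper: flatten HTML into 'tag/path: text' lines."""

    SKIP = {"script", "style"}  # text inside these tags is not content
    VOID = {"br", "img", "hr", "meta", "link", "input"}  # tags with no end tag

    def __init__(self):
        super().__init__()
        self.path = []   # stack of currently open tags
        self.lines = []  # linearized output

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            # Keep image URLs, since they are one of the target fields.
            src = dict(attrs).get("src")
            if src:
                self.lines.append("/".join(self.path + ["img"]) + ": " + src)
        if tag not in self.VOID:
            self.path.append(tag)

    def handle_endtag(self, tag):
        if tag in self.path:
            # Pop up to and including the innermost matching open tag.
            while self.path.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if text and not (self.SKIP & set(self.path)):
            self.lines.append("/".join(self.path) + ": " + text)


def linearize(html):
    """Return the page as a list of 'tag/path: text' strings."""
    parser = DomLinearizer()
    parser.feed(html)
    parser.close()
    return parser.lines


# Made-up product page with a nested table, for illustration only.
sample_page = """
<div class="product">
  <h1>Blue Mug</h1>
  <img src="https://example.com/mug.jpg">
  <table><tr><td>Price</td>
    <td><table><tr><td>$12.50</td></tr></table></td></tr></table>
</div>
"""

for line in linearize(sample_page):
    print(line)
```

Keeping the tag path next to each text fragment is one simple way to let a downstream text encoder see that a value sits inside a nested table cell. A production system would likely use a richer encoding of the DOM than this.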