Welcome Blind Vision

Forget Computer Vision

 

What is Blind Vision?

Blind Vision is a proprietary data extraction and labelling technology that is applied to web pages. It brings a new, AI-driven approach to extracting meaningful data across the web, significantly improving on existing methods with a huge uptick in accuracy and an even larger reduction in costs. In this article, we’re going to explain what Blind Vision is, how it works and why it’s not just important but a game-changer.

We’ve primarily written this article for a non-technical, business audience, so if you already know your neural networks from your nodes then please bear with us. We also explore some of the context behind the need for robust web data extraction and labelling technologies, as well as existing approaches such as computer vision. If this background isn’t of interest, please skip to the later sections, which focus more specifically on Blind Vision.

 

Why do we need to make more sense of content on the web?

First, some context on why extracting data from the web has become critical.

In the bigger picture, we are drowning in data and content on the internet and beyond. Estimates suggest that there are currently around 4.2 billion web pages in existence spread across approximately 2 billion websites, with the majority of these not being actively updated.

And the volume of data is still expanding at near-crazy levels. We’re on a perpetual curve of explosive data growth. Estimates suggest that by the end of 2021, we will have created, copied and consumed approximately 79 zettabytes of data, but by 2025 the data produced will have more than doubled to 181 zettabytes. Although the vast majority of this data is not kept, it is a mind-bogglingly huge amount of data – one zettabyte is one trillion gigabytes!

Alongside the explosion in the volume of data, organisations across multiple sectors increasingly need to make sense of web-based content for many of their core processes. Organisations ranging from governments to global businesses to start-ups may all rely on monitoring, leveraging and extracting web content for decision-making, managing risk, delivering services and more.

Multiple teams within organisations may be extracting data from web content for a variety of different use cases, including:

    • Monitoring the latest news and developments on specific topics that impact the organisation and its customers
    • Monitoring web pages to protect brand and reputation
    • Gathering data to carry out equity research and make investment decisions
    • Carrying out research and development activity across multiple sectors
    • Gathering essential business intelligence to inform decision-making
    • Ensuring compliance around their brand with a supplier or partner network
    • Combining data from different sources to support e-commerce operations
    • Creating datasets to support marketing campaigns and business development initiatives
    • Validating and improving data quality in their core systems
    • Creating datasets in the right format to support AI and automation
    • Creating and delivering essential products and services to customers who rely on web data integration
    • And many more!

As individuals, we also want to make sense of data spread across web pages; for example, we may need to aggregate news that is reliable and of interest, or compare the prices of products to help make buying decisions.

The challenge

But there is a problem. As the amount of web content continues to grow with little to no control or standards set across it, it is getting harder for organisations to monitor, manage and extract data from web pages to use in core processes.

In practice, to get value out of data that is spread across web pages, you need some kind of “web scraping” or “web data integration” capability that automatically and intelligently extracts the data and gives it structure, so that it can be inserted seamlessly into another system with reporting and workflow. But this is far from straightforward.

Consider this use case. Let’s say you are in financial services and need to carry out research to support investment decisions, identify acquisition targets or publish analyst reports. The web offers essential “alternative data” on companies that teams will rely on for decision-making; some of this data may need to be consumed in real-time, especially if informing daily trading activity.

Teams are extremely busy and work in high-pressure environments. They need information that is instantly digestible and reliable, ideally presented in a system and a format they already use. You’ll need to have that data structured and codified to make sense of it; for example, by the company the data relates to, the source, the date, the language the content is written in and so on.

How do you actively monitor the ever-increasing mountain of sources and content that is available on the web, and then extract the critical information in a format that is ready to be presented to decision-makers who need to digest and reference it at breakneck speed?

There are some key challenges:

    • Unstructured data: web content is unstructured and chaotic. There is no convenient standard that means valuable data within web pages can be easily identified and extracted.
    • Variety of content and layout: web content comes in different and disparate formats, styles and layouts, which again make automatically processing data very difficult.
    • Exponential growth: web content continues to grow and grow. How do you keep on top of the essential information as the universe of sources continues to expand?
    • Frequent changes: pages constantly change and then change some more. How do you keep on top of these changes?
    • Human judgement is required: Often, extracting the right data from pages relies on judgements that are best made by a person looking at the web page, making automation and scalability hard to achieve.
    • Filtering: If you do extract data, how do you then filter out the noise, avoid repetition and ignore insignificant data to ensure your data has value?
    • Limitation of current approaches: Current approaches to web data extraction, including ‘computer vision’, have limitations with scalability, high costs and low accuracy rates.

The value of web data integration software

The lack of standards, the uncontrolled universe of sources, the disparity of layout across websites, the frequent editing and updating of content and the continuing expansion of web pages make web scraping an increasingly difficult and expensive task.

Even if you do implement an automated approach to extract data from the web, all too often the effort ends up being manual, inefficient, inaccurate and inconsistent. Up to now, automated approaches have often required a substantial effort to clean up and standardise the data, possibly by hand, before it can be used in another system or in some kind of automation or workflow.

More often than not, organisations simply fail and get it wrong. The result is bad data – a phenomenon which IBM calculated back in 2016 as costing the US alone USD 3.1tn per year!

Technology that can successfully bring some sense, order, meaning, standardisation and value to this chaotic universe of web pages is becoming increasingly important. More specifically, web data integration technologies that extract data from web pages in intelligent ways, so it can be leveraged for a variety of core business processes, have even greater value.

In fact, analysts at Grand View Research have suggested that the data annotation tools market (of which web scraping technologies are a part) is set to be valued at USD 3.4bn by 2028, representing an estimated CAGR of 27.1% from the value of the market in 2021.

How do current web data integration approaches work?

Currently, the main approaches to web data integration that harness automation and AI are less than satisfactory and are holding us back.

Before we explore these and their limitations around accuracy, scalability and cost, we need to consider the kind of judgements a person would need to make if they were extracting information from a web page manually.

Let’s say you have been provided with the URL of a web page with an article on it, and you need to find out who the author of the article is and the date it was published.

Web pages are visual assets and come with lots of different elements: titles, headings, blocks of text, buttons, images, banners, adverts, videos, forms and more.

To find the right information, you need to view the page. Although it might only take you a few seconds to find the right author and date information, you are continually exercising judgement as you scan the page:

    • You locate the right information on the page based on your experience of the layout. Based on your previous experience, you’ll look to the places the author and date are likely to be – probably underneath the title, but sometimes at the bottom of the page.
    • You ignore the “noise” – those elements of the page that do not contain the information you need, such as an image or an advert.
    • You might make a judgement call about what information you need, such as distinguishing between an article author and a photographer who is credited, or the date an article was originally written or updated.
    • You might make a judgement call interpreting the information; for example, is the date 7.6.21 in the US or European date format?
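
The date ambiguity in the last point is easy to show in code. The following is a minimal Python sketch using only the standard library; the function name `candidate_dates` is ours, chosen purely for illustration. It parses the same string under both conventions and returns every reading that is a valid date.

```python
from __future__ import annotations

from datetime import date, datetime


def candidate_dates(text: str) -> dict[str, date]:
    """Parse a dotted numeric date under both common conventions."""
    formats = {
        "european (day.month.year)": "%d.%m.%y",
        "us (month.day.year)": "%m.%d.%y",
    }
    results = {}
    for label, fmt in formats.items():
        try:
            results[label] = datetime.strptime(text, fmt).date()
        except ValueError:
            # The string is not a valid date under this convention.
            continue
    return results


for label, parsed in candidate_dates("7.6.21").items():
    print(f"{label}: {parsed.isoformat()}")
# european (day.month.year): 2021-06-07
# us (month.day.year): 2021-07-06
```

Both readings are valid, so the string alone cannot settle the question. A person resolves it from context, such as the country of the publisher; a string like 13.6.21 is only unambiguous because there is no thirteenth month.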

Why web scraping is complicated

This is all super easy for human beings with our visual and cognitive abilities, but it’s actually really tricky for computers, even with all the advancements we have made with AI and machine learning (ML). Of course, computers can’t see like humans so we do need to rely on AI, ML and ingenious algorithms, but the nature of web content nevertheless makes it difficult for computers to extract meaning from visual assets in a consistent and accurate way.

One reason for this is the variation of layouts in web content. The nature of web design, with different stylings and coding, means there are endless variations of how a page might be laid out. A computer will require a HUGE amount of data to learn common layouts, and even then, there will still be exceptions that won’t be recognised.

At the same time, digital teams are continually changing the layout of their pages to drive both user engagement and SEO, so if a computer has learnt to extract data from a page in one way, it may need to re-learn as a page changes.

There are further practical difficulties in “web scraping”, including:

    • The use of JavaScript for rendering the elements of a page, which can make scraping more difficult
    • Invisible elements on a page which can confuse some algorithms
    • Anti-scraping approaches put in place by digital teams to prevent scraping.

‘Raw data processing’ and ‘computer vision’

The two primary approaches to web scraping currently in use are raw data processing and computer vision. Raw data processing takes the raw code from web pages and applies a range of algorithms to try to make sense of it, attempting to turn unstructured data into more structured data. It leverages different methods from the world of statistical analysis, including well-established frameworks (such as “Block-o-Matic”) as well as some more complex models associated with machine learning (such as Support Vector Machines).

The main problem with raw data processing approaches is that because of the lack of structure and a huge variation in the content, a huge dataset is required to get half-decent results, and even then, the accuracy of the data is disappointing. It’s simply not scalable.
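
To give a feel for what working on raw code means, here is a deliberately simple Python sketch. It is not Block-o-Matic, a Support Vector Machine or any vendor's method; it is just a standard-library parser that reads the author and publication date from `<meta>` tags. The two sample pages and the names `MetaTagExtractor` and `extract_author_and_date` are made up for illustration.

```python
from __future__ import annotations

from html.parser import HTMLParser


class MetaTagExtractor(HTMLParser):
    """Collect <meta> name/property -> content pairs from raw HTML."""

    def __init__(self) -> None:
        super().__init__()
        self.meta: dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        key = attributes.get("name") or attributes.get("property")
        content = attributes.get("content")
        if key and content:
            self.meta[key.lower()] = content


def extract_author_and_date(html: str) -> dict[str, str | None]:
    """Return the author and publication date if the page declares them."""
    parser = MetaTagExtractor()
    parser.feed(html)
    return {
        "author": parser.meta.get("author"),
        "published": parser.meta.get("article:published_time"),
    }


# Hypothetical page that declares its metadata in the markup.
tagged_page = """
<html><head>
<meta name="author" content="Jane Doe">
<meta property="article:published_time" content="2021-06-07">
</head><body><h1>Title</h1></body></html>
"""

# Hypothetical page that only shows the same facts in visible text.
untagged_page = "<html><body><h1>Title</h1><p>By Jane Doe, 7.6.21</p></body></html>"

print(extract_author_and_date(tagged_page))
print(extract_author_and_date(untagged_page))
```

The first page yields both values; the second yields nothing, even though a person would spot the byline immediately. Rules written against one markup convention break as soon as a site does things differently, which is the variation problem described above.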

The other approach – usually regarded as the better option – is computer vision. Computer vision is a wholly AI- and ML-driven approach that tries to replicate human vision – it is trying to “see” the web page. Computer vision ultimately learns to recognise images so it can extract meaningful data, relying on deep learning (a branch of machine learning) and neural networks to predict what it is “seeing”.

Whilst this approach might deliver better accuracy than raw data processing, it is even less scalable, with an even larger dataset required, more time spent training the algorithms to be effective and greater associated costs.

Why is Blind Vision revolutionary?

Blind Vision is a brand new approach that combines the best elements of raw code processing and computer vision but manages to swerve the associated pitfalls that limit their success and practical application.

Blind Vision is revolutionary not just because it’s a completely new way of carrying out web scraping, but also due to the quality of the results it returns.

Blind Vision produces exceptional results when compared to computer vision, the current “best” approach followed by the industry.

In a controlled experiment, our data science team tested Blind Vision against computer vision, processing the same 1,200 randomly selected web articles, and attempting to extract title, content, main image, author and publication date. We repeated this exercise with several different batches of URLs.

The results are stunning – compared to computer vision, Blind Vision produces 35% better accuracy at a 65% lower cost. This represents a potential sea-change in the way web data extraction and labelling can be applied across a wide variety of use cases; it’s exciting news for data providers, technology vendors, businesses, institutions and users.
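
For readers who want to picture what a comparison like this involves, the sketch below shows one common way to score extraction accuracy: the share of pages on which each extracted field exactly matches a hand-checked reference. This is an assumption on our part for illustration; the scoring method, the field names and the two toy records are invented here and are not the experiment's data or results.

```python
from __future__ import annotations

# Field names are illustrative, mirroring the five elements listed above.
FIELDS = ("title", "content", "main_image", "author", "published")


def field_accuracy(predictions: list[dict], references: list[dict]) -> dict[str, float]:
    """Share of pages on which each extracted field exactly matches the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must cover the same pages")
    if not references:
        raise ValueError("at least one page is required")
    scores = {}
    for field in FIELDS:
        correct = sum(
            1
            for pred, ref in zip(predictions, references)
            if pred.get(field) == ref.get(field)
        )
        scores[field] = correct / len(references)
    return scores


# Toy reference labels and extractor output, invented for this example.
references = [
    {"title": "A", "content": "text a", "main_image": "a.jpg",
     "author": "Jane Doe", "published": "2021-06-07"},
    {"title": "B", "content": "text b", "main_image": "b.jpg",
     "author": "John Roe", "published": "2021-07-06"},
]
predictions = [
    dict(references[0]),
    # The second page credits the photo desk instead of the article author.
    {**references[1], "author": "Photo Desk"},
]

scores = field_accuracy(predictions, references)
print(scores)
```

In this toy run the author field scores 0.5 and every other field scores 1.0. Running the same scoring over the output of two extractors on the same pages gives a like-for-like comparison.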

How Blind Vision works

We gave our proprietary technology the name “Blind Vision” because the results simulate vision, despite the fact that computers cannot see. Here are the main steps to achieve this:

Step One: It replicates how a person might see a web page

Blind Vision replicates how a person sees a web page using a set of extremely advanced raw data processing algorithms that produce the outcomes of computer vision without the pitfalls, such as those around scalability.

Step Two: It understands the layout

The data is pushed through a layout understanding pipeline that is constructed from deep neural networks and other algorithms, allowing Blind Vision to eliminate the elements of the web page that aren’t important and focus on those that are. The platform has a deep understanding of popular layout formats such as news, articles, and more.

Step Three: Clustering and segmentation

Blind Vision extracts data from those important page elements, and clusters and segments the data.

Step Four: It applies the right labels to the right data

Blind Vision then takes the clustering/segmentation data and applies the right labels to it – author, date, main image, address etc. – for whatever type of data you have defined to extract. This means that the data now has the right structure to be used, for example, in an AI solution or for importation into another system.

Step Five: It adds value to the dataset

Blind Vision and the HIPSTO platform add value to the data, for example, by removing duplicate information across the dataset based on intelligent analysis.
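
As a rough illustration of duplicate removal, the Python sketch below drops records whose title and content are identical once case and spacing are normalised. This is the simplest possible technique and an assumption on our part; it is not a description of the analysis the HIPSTO platform performs, and the record fields and function names are hypothetical.

```python
from __future__ import annotations

import hashlib
import re


def normalise(text: str) -> str:
    """Lower-case the text and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text).strip().lower()


def drop_duplicates(records: list[dict]) -> list[dict]:
    """Keep the first record for each distinct normalised title and content."""
    seen: set[str] = set()
    unique = []
    for record in records:
        key_text = (
            normalise(record.get("title", ""))
            + "\n"
            + normalise(record.get("content", ""))
        )
        fingerprint = hashlib.sha256(key_text.encode("utf-8")).hexdigest()
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        unique.append(record)
    return unique


# Toy records invented for this example.
records = [
    {"title": "Markets rally", "content": "Shares rose on Monday."},
    {"title": "MARKETS  RALLY", "content": "Shares rose  on Monday. "},
    {"title": "Rates held", "content": "The central bank kept rates steady."},
]

print(len(drop_duplicates(records)))  # 2
```

Exact matching after normalisation only catches near-verbatim copies. Spotting the same story rewritten by another outlet needs similarity measures rather than a hash, which is beyond this sketch.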

How is Blind Vision different to other approaches?

Here’s a summary of the differences between the three approaches.

| | Raw code processing | Computer vision | Blind Vision |
|---|---|---|---|
| Basic approach | Examines raw code on web pages using a range of algorithms leveraging different approaches | Uses AI and ML approaches to understand visual context of data on web pages | A proprietary approach from HIPSTO that takes the best elements from other approaches and leverages a deep understanding of layouts |
| Dataset required to get worthwhile results | Huge | Huge | Much smaller (up to 100X less than other approaches) |
| Resources required during run-time | Fewer resources required than computer vision but disappointing results | Extensive resources needed for basic results | Lightweight approach that delivers the best results currently possible |
| Scalable? | No | No | Yes |
| Accuracy (retrieving key elements) | Very low | Low | 35% more accurate than computer vision |
| Associated costs (continued monitoring) | Expensive | Very expensive, prohibitive | 65% lower costs than computer vision |

Looking to the future

Blind Vision and its underlying approach evolved through the process of continual improvement, innovation, experimentation and rigorous testing that we apply to HIPSTO’s AI platform. We looked at computer vision’s issues around scalability, conceptualised an alternative approach, and then tested it.

We admit that the quality of the results that emerged from these tests surprised and excited us. We ran the tests again and again with different datasets to ensure the results were robust. It was a turning point.

We’re still working tirelessly on continual improvement, and we expect Blind Vision to keep getting better, particularly in terms of accuracy. We also have our sights on keeping costs low so that Blind Vision can be successfully applied across key use cases at scale.

Part of this continual underlying improvement is down to the machine learning inherent in the platform, as well as the improvements that occur naturally as we broaden our client base and expand the dataset we work with. We’re also working hard to train Blind Vision to understand the layouts within particular content types even better. Next on our roadmap are e-commerce product pages, with more to follow.

To return to our first point, the demand for game-changing technologies like Blind Vision is only going to get greater. The sheer volume of existing web content is overwhelming – we are drowning in data. Organisations, teams and individuals need assistance to make sense of the web to render content usable and data actionable. AI and automation are waiting in the wings to support better ways of doing things, but in order to work, they need to be fed the right data at the right time. Blind Vision is going to be one of the key technologies that help here.

Want to find out more about Blind Vision? Get in touch!