The Neudata Winter Online Summit

Q&A: Analysis of Long-Form Documents


This week HIPSTO was delighted to participate in the Neudata Winter Summit (Wednesday 9 December 2020). It was a particularly exciting event for us: there was a huge amount of buzz around our CEO’s presentation, ‘Blind Vision – Presenting the New Benchmark in Web Scraping Technology’, something of a world premiere, and our CTO offered his thoughts on the current market in a panel discussion on the subject of ‘Analysis of Long-Form Documents’.

Andrii, our CTO, is always excited to talk about his passion, and the discussion highlights included him explaining why NLP has no future! There was lots of interest and not enough time to answer all of the questions, so we thought we’d cover everything here for you:

In many situations, NLP is the unsung hero of data structuring (e.g. even transactional datasets leverage NLP for transaction tagging). How does HIPSTO leverage NLP throughout the data acquisition and structuring process?

Several years ago, HIPSTO was formed out of the realisation that although the universe of web information is huge, it is so unstructured that an individual, or a business, can’t properly digest it. To resolve this problem, HIPSTO has been building a funnel of microservices that clean and purify information for the specific user.

The challenge starts at the first stage of the funnel. In order to extract data from a web page, it has to be downloaded and rendered as a web browser would render it. However, this is just part of the process. The actual data is extracted via XPath locators in the Document Object Model (DOM) tree, in either absolute or relative form. The XPaths can be configured for the scraper either manually or with some kind of algorithm.
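
As a concrete illustration, using Python’s standard library rather than HIPSTO’s own stack, and an invented page snippet, locating data items by absolute and relative paths looks like this (note that `ElementTree` supports only a subset of full XPath):

```python
# Locate data items in a parsed DOM tree by absolute-style and relative paths.
# The page snippet and locators below are purely illustrative.
import xml.etree.ElementTree as ET

page = """<html><body>
  <div class="article">
    <h1>Quarterly results announced</h1>
    <div class="body"><p>Revenue grew strongly.</p></div>
  </div>
</body></html>"""

root = ET.fromstring(page)  # root is the <html> element

# Absolute-style locator: the full path from the document root.
title_abs = root.find("./body/div/h1").text

# Relative locator anchored on an attribute: survives layout shuffles
# as long as the class name stays stable.
title_rel = root.find('.//div[@class="article"]/h1').text
body_text = root.find('.//div[@class="body"]/p').text
```

Relative locators are generally preferred in scraper configuration precisely because a small layout change breaks an absolute path but often leaves an attribute-anchored one intact.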

The most intuitive configuration option is a heuristics approach, in other words utilising ‘if-else’ clauses. This requires a developer to foresee the majority of situations in which a data item (e.g. a news text or title) can appear in the DOM structure of a particular HTML page. The number of variations is endless: even a single HTML developer can combine tags in any order and any number of combinations, and CSS styling can obscure the structure further. As a result, this approach performs poorly.
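
To make the brittleness concrete, here is a toy ‘if-else’ extractor (the branch patterns are invented, not production rules): each branch hard-codes one DOM layout the developer has foreseen, and any page that deviates falls through.

```python
import xml.etree.ElementTree as ET

def extract_title(page: str):
    """Heuristic title extraction: one branch per foreseen layout."""
    root = ET.fromstring(page)
    if root.find(".//h1") is not None:
        return root.find(".//h1").text
    elif root.find('.//div[@class="title"]') is not None:
        return root.find('.//div[@class="title"]').text
    elif root.find(".//title") is not None:
        return root.find(".//title").text
    return None  # unforeseen markup: extraction silently fails

print(extract_title("<html><body><h1>Breaking news</h1></body></html>"))
print(extract_title("<html><body><b>Breaking news</b></body></html>"))
```

The second page carries exactly the same title, but in markup none of the branches anticipated, so the heuristic returns nothing.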

The refined path lies in the field of machine and deep learning. Currently, there are two main approaches to content extraction: raw code processing using methods based on statistical analysis (e.g. Block-o-Matic), or complex models such as SVMs, neural networks and computer-vision-based algorithms. The latter are very resource-intensive: they require extended run-time, huge datasets and long training times, and still do not produce very accurate results. Raw code processing has a much faster run-time but still requires big datasets and has even worse accuracy. The problem with both is that website design variance is very high, limited only by its creator’s creativity. Theoretically, enormous datasets would be required to learn all variants, and there would still be a chance of coming across a website that doesn’t fit the learned ‘features’, causing extraction to fail.

HIPSTO’s proprietary Blind Vision employs sophisticated raw code processing algorithms that produce the valuable outcomes of the computer vision approach without its drawbacks. Our approach pushes data through a pipeline constructed from deep neural networks and other algorithms. However, this is just the first step of many.

Once the data has been cleaned, the long-form text can be pushed through a pipeline of NLP microservices tailored to the specific business case. Let’s take, for example, a financial monitoring case. Imagine you are watching a listed company and want to monitor the news and reports on it to assess its risk. In this instance you would need a web scraper to grab the news and reports on that company. The ideal scenario would be to grab all related content on the topic… and this is where HIPSTO Related Content Detection steps in. Our service is able to recognise when two pieces of content describe the same story or event semantically, even if they are written by different people in a hundred different languages.
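
HIPSTO’s service is proprietary, but the common pattern behind semantic related-content detection can be sketched: embed each document with a multilingual encoder, then compare embeddings by cosine similarity. The toy vectors below stand in for real encoder output, and the threshold is illustrative.

```python
# Minimal sketch of related-content detection via embedding similarity.
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: the same story in English and German should map
# close together in a multilingual embedding space; an unrelated item should not.
story_en = [0.9, 0.1, 0.3]
story_de = [0.85, 0.15, 0.35]
unrelated = [0.1, 0.9, -0.2]

def same_story(u, v, threshold=0.9):
    return cosine(u, v) >= threshold

print(same_story(story_en, story_de))   # True
print(same_story(story_en, unrelated))  # False
```

Because the encoder, not the comparison, carries the multilingual ability, the same similarity check works regardless of which language each article was written in.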

Some grabbed articles may be confusing; however, there are several ways to make sure a content item is worthy of analysis. One is to assign it to a content category and then determine whether that category is relevant. Taking this approach in the financial monitoring case, IAB1 ‘Entertainment’ would not be the key point of interest; instead the setup should watch for IAB3 ‘Business’. HIPSTO’s technology can intelligently determine the correct category using our off-the-shelf, multilingual solution, and it will cater for all IAB categories by Q1 2021.
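
The gating step itself is simple once a classifier exists. In this sketch a keyword lookup stands in for HIPSTO’s multilingual model (the keywords and articles are invented); the pipeline keeps only items that land in IAB3 ‘Business’.

```python
# Toy category gate for a financial-monitoring pipeline.
# A real system would use a trained multilingual classifier here.
IAB_KEYWORDS = {
    "IAB1": {"film", "music", "celebrity"},      # Entertainment
    "IAB3": {"earnings", "merger", "revenue"},   # Business
}

def classify(text: str) -> str:
    """Return the first IAB category whose keywords appear in the text."""
    words = set(text.lower().split())
    for category, keywords in IAB_KEYWORDS.items():
        if words & keywords:
            return category
    return "unknown"

articles = [
    "Quarterly earnings beat expectations",
    "Celebrity spotted at film premiere",
]

# Keep only business content for the monitoring report.
business_only = [a for a in articles if classify(a) == "IAB3"]
print(business_only)
```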

At this point, all of the entities in the content can be discovered so that a report can be formed. Named Entity Recognition (NER), one of the benchmark NLP tasks, has been achieved by HIPSTO, and the team is currently pushing its accuracy even further. And, yet again, it is multilingual and consistent!
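
Whatever model sits underneath, NER maps raw text to labelled spans. The regex toy below only illustrates that input/output shape with invented patterns and labels; real systems, HIPSTO’s included, use trained sequence models, not hand-written rules.

```python
# Toy illustration of the NER task shape: text in, (span, label) pairs out.
import re

def toy_ner(text):
    entities = []
    # Capitalised multi-word runs as a crude organisation/person proxy.
    for m in re.finditer(r"\b(?:[A-Z][a-z]+ )+[A-Z][a-z]+\b", text):
        entities.append((m.group(), "ENT"))
    # Currency amounts as a crude money proxy.
    for m in re.finditer(r"\$\d+(?:\.\d+)?[mbk]?", text):
        entities.append((m.group(), "MONEY"))
    return entities

print(toy_ner("Acme Corp reported $4.2b in revenue."))
```

The extracted (span, label) pairs are what the later reporting and knowledge-graph stages consume.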

At this point the pipeline can already deliver a report; however, humans typically want it in story form, combining multiple pieces of news and entities into a short, effective and informative summary. This is exactly what the HIPSTO team is currently refining, utilising cutting-edge technologies and architectures.

Future developments: What can be done to improve current NLP solutions?

Nothing can be done to improve NLP solutions. Seriously, NLP is almost dead. Long live NLU and NLG! Natural Language Understanding and Natural Language Generation are the areas where development will be focused in the near future. Statistical approaches are already being displaced by Transformers, Reformers and (as we are talking about long-form documents) Longformers, which will improve quality on the main NLP tasks. With Google’s recent announcement of mT5, NLG will become available with zero- and few-shot learning in multiple languages, which will definitely push the boundaries in that direction too.

Another major stream, mostly fuelled by NLU, is knowledge graph construction. A better understanding of the text and access to a variety of insight-extraction tools enable deeper dynamic entity mapping. Modern data storage and querying tools have already helped us jump from purely scientific ontologies to knowledge graphs. Now, knowledge graphs should evolve into what could be described as a cognitive map.

HIPSTO’s variety of AI NLU services is extremely helpful in deep knowledge graph building, as HIPSTO’s knowledge graph is built on the same patterns a human being applies while acquiring and accepting new information through text.

Let’s discuss some of the common challenges around utilising NLP

There are two main challenges in using NLP commercially. The first is exactly the fact that most documents are long form. Of course, some of the texts used as a source of information and insights are short, e.g. tweets and other social media posts and comments. However, even standard news items on a publisher’s website are already a problem for current tools, not to mention corporate documents, scientific papers and books. This is exactly the problem we encountered and successfully solved at HIPSTO whilst developing our related content detection microservice. The main difficulty comes from attempting to preserve the context of the document as a whole: transformer-based models are limited in input sequence size and are unable to process long sequences because the cost of their self-attention mechanism grows quadratically with sequence length.
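
A common workaround, which does not by itself solve the whole-document context problem, is to split a long document into overlapping windows that fit the model’s input limit, process each window, and re-aggregate the results. A minimal chunker sketch (the window and overlap sizes are illustrative, not HIPSTO’s settings):

```python
# Split a token sequence into overlapping windows of at most max_len tokens,
# so each window fits a transformer's input limit.
def chunk(tokens, max_len=512, overlap=128):
    step = max_len - overlap  # consecutive windows share `overlap` tokens
    windows = []
    start = 0
    while start < len(tokens):
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # last window reaches the end of the document
        start += step
    return windows

doc = list(range(1000))  # stand-in for a 1000-token document
windows = chunk(doc)
print(len(windows))  # 3 windows: [0:512], [384:896], [768:1000]
```

The overlap keeps sentences near window boundaries visible to two windows, which softens but does not remove the loss of whole-document context, hence the interest in architectures such as the Longformer mentioned above.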

The other challenge is to perform multilingual and cross-lingual tasks. Although some of the state-of-the-art architectures are multilingual, they are extremely heavy and require a lot of resources to retrain them for a specific task. The challenge is therefore to utilize them through zero- or few-shot learning approaches.

1. Use of Alt Data is limited by the shortage of historical data. 

You would find it hard to build any reliable model or prediction without proper backtesting. That is why historical data in general, and historical Alt Data in particular, are of vital importance in quant trading. And this is why HIPSTO preserves its archive and applies it to every new AI microservice: nearly a million news and tweet items collected during our three years of operation.

2. Dynamic Entity Mapping 

Entity relations are very changeable in the modern world. Manual ‘relation management’ is too slow and goes stale faster than it can be processed. Entity mapping has to be automatic to keep up with the number of information items published globally every single day.

HIPSTO FalconV delivers new insights as part of its life cycle. Every entity mined from the hundreds of news and social media items processed every hour can be automatically and dynamically mapped to existing knowledge as part of the HIPSTO Knowledge Graph initiative.

3. Golden Event Triangle 

Alt Data analysts are often expected to analyse data to extract some sort of signal, compiling what Prof. Peter Kolm calls ‘golden triangles’ and the HIPSTO team investigates as ‘cross-correlation detection’. It is very labour-intensive, and valuable only if done properly. We approach the task by using the output of FalconV to dynamically map information into a knowledge graph that includes predictions based on historical patterns and correlations.