Extracting information from web pages with machine learning

First, your task fits into the information extraction (IE) area of research. There are mainly two levels of complexity for this task:

  • Extract from a given HTML page or a website with a fixed template (like Amazon). In this case the best way is to look at the HTML code of the pages and craft the corresponding XPath or DOM selectors to get to the right info (see the sketch after this list). The disadvantage of this approach is that it does not generalize to new websites, since you have to do it for each website one by one.
  • Create a model that extracts the same information from many websites within one domain (assuming there is some inherent regularity in the way web designers present the corresponding attribute, like a zip code or a phone number). In this case you should create features, so that the ML/IE algorithm can "understand the content of pages". The most common features are: the DOM path, the format of the value (attribute) to be extracted, layout (bold, italic, etc.), and surrounding context words. You label some values (you need at least 100-300 pages, depending on the domain, to do it with reasonable quality) and then train a model on the labelled pages. There is also an alternative: doing IE in an unsupervised manner, leveraging the idea of information regularity across pages. In this case your algorithm tries to find repetitive patterns across pages (without labelling) and considers the most frequent ones as valid.
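
For the first (fixed-template) case, here is a minimal sketch using requests and lxml; the URL and the XPath expressions are placeholders you would replace after inspecting the real pages:

    # Minimal sketch of the fixed-template approach: hand-crafted XPath selectors.
    # The URL and the XPath expressions below are illustrative placeholders, not
    # real selectors for any particular site - you derive the real ones by
    # inspecting the HTML of the pages you want to scrape.
    import requests
    from lxml import html

    page = requests.get("https://example.com/product/123")
    tree = html.fromstring(page.content)

    # Hypothetical selectors for a product title and price
    title = tree.xpath('//h1[@id="title"]/text()')
    price = tree.xpath('//span[@class="price"]/text()')
    print(title, price)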

The most challenging part overall will be working with the DOM tree and generating the right features. Labelling the data in the right way is also a tedious task. For ML models, have a look at CRF, 2D CRF, and semi-Markov CRF.
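
To make the feature idea concrete, here is a rough sketch using sklearn-crfsuite (one CRF implementation for Python); the feature names, the toy page, and the labels are invented purely to show the expected input shape:

    # Rough sketch: turning DOM text nodes into feature dicts for a CRF.
    # sklearn-crfsuite expects one list of feature dicts per page (sequence)
    # and one list of labels per page. The features mirror the ones described
    # above (DOM path, value format, layout, context), but their names and the
    # toy training data are invented for this example.
    import sklearn_crfsuite

    def node_features(text, dom_path, is_bold, prev_word):
        return {
            "dom_path": dom_path,                                 # e.g. "body/div/span"
            "looks_like_zip": text.isdigit() and len(text) == 5,  # value format
            "is_bold": is_bold,                                   # layout
            "prev_word": prev_word.lower(),                       # context
        }

    # One tiny "page" with two text nodes: a label and a zip code.
    X_train = [[
        node_features("Zip:", "body/div/b", True, ""),
        node_features("90210", "body/div/span", False, "Zip:"),
    ]]
    y_train = [["O", "ZIP"]]  # one label per node

    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
    crf.fit(X_train, y_train)
    print(crf.predict(X_train))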

And finally, this is in the general case cutting-edge IE research, not a hack that you can do in a few evenings.

P.S. I also think NLTK will not be very helpful here - it is an NLP library, not a web-IE library.


As far as I know, there are two ways to do this task using a machine learning approach.

1. Using computer vision to train a model and then extract the content based on your use case. This has already been implemented by diffbot.com, and they have not open-sourced their solution.

2. The other way to approach this problem is to use supervised machine learning to train a binary classifier that separates content from boilerplate, and then extract the content. This approach is used in dragnet and in other research in this area. You can have a look at benchmark comparisons of different content extraction techniques.
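
A stripped-down sketch of that content-vs-boilerplate idea, assuming scikit-learn is available; the two features (block length and link density) and the tiny hand-labelled blocks are made up to show the shape of the approach, not what dragnet actually uses:

    # Sketch of the content-vs-boilerplate idea: represent each text block by a
    # few simple features and train a binary classifier. The two features here
    # (word count, link density) and the tiny hand-labelled blocks are made up
    # purely to show the shape of the approach.
    from sklearn.linear_model import LogisticRegression

    def block_features(text, n_links):
        words = text.split()
        return [len(words), n_links / max(len(words), 1)]  # length, link density

    # (features, label) pairs: 1 = main content, 0 = boilerplate
    X = [block_features("A long article paragraph with many words in it ...", 0),
         block_features("Home | About | Contact", 3),
         block_features("Another substantial paragraph of real body text here", 0),
         block_features("Privacy Terms Login", 3)]
    y = [1, 0, 1, 0]

    clf = LogisticRegression().fit(X, y)
    print(clf.predict([block_features("Yet another full sentence of content", 0)]))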


Firstly, machine learning is not magic. These algorithms perform specific tasks, even if those tasks can sometimes be a bit complex.

The basic approach to any such task is to generate some reasonably representative labelled data, so that you can evaluate how well you are doing. "BIO" tags could work, where you assign each word a label: "O" (outside) if it is not something you're looking for, "B" (beginning) if it is the start of an address, and "I" (inside) for all subsequent words (or numbers or whatever) in the address.
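
For example, a hand-labelled snippet might look like this (the address and labels are made up for illustration):

    # Hand-labelled toy example of the BIO scheme (address chosen arbitrarily):
    tokens = ["Contact", "us", "at", "221B", "Baker", "Street", ",", "London", "."]
    labels = ["O",       "O",  "O",  "B",    "I",     "I",      "I", "I",      "O"]
    for tok, lab in zip(tokens, labels):
        print(f"{lab}\t{tok}")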

The second step is to think about how you want to evaluate your success. Is it important that you discover most of an address, or do you also need to know exactly what each piece is (postcode, street, city, etc.)? This changes what you count as an error.
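
One simple option is token-level precision and recall over the address labels; here is a sketch on toy predictions (stricter, span-level scoring would change the numbers):

    # Token-level precision/recall on toy gold vs. predicted labels.
    gold = ["O", "B", "I", "I", "O", "O"]
    pred = ["O", "B", "I", "O", "O", "B"]

    tp = sum(1 for g, p in zip(gold, pred) if p != "O" and g != "O")
    fp = sum(1 for g, p in zip(gold, pred) if p != "O" and g == "O")
    fn = sum(1 for g, p in zip(gold, pred) if p == "O" and g != "O")

    print("precision:", tp / (tp + fp))  # 0.67
    print("recall:   ", tp / (tp + fn))  # 0.67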

If you want your named-entity recogniser to work well, you have to know your data well and decide on the best tool for the job. This may very well be a series of regular expressions with some rules for combining the results. I expect you'll be able to find most of the data with relatively simple programmes. Once you have something simple that works, you check the false positives (things that turned out not to be what you were looking for) and the false negatives (things that you missed), and look for patterns. If you see something that you can fix easily, try it out. A huge advantage of regex is that it makes it much easier not only to recognise something as part of an address, but also to detect which part it is.
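
As a concrete baseline, assuming US-style ZIP codes (swap in the pattern for whatever postcode format your data actually uses):

    # Simple regex baseline, assuming US-style ZIP codes (5 digits, optional
    # +4 extension). Inspect the false positives/negatives it produces and
    # refine the pattern or add combination rules from there.
    import re

    ZIP_RE = re.compile(r"\b\d{5}(?:-\d{4})?\b")

    text = "Ship to 742 Evergreen Terrace, Springfield, IL 62704-1234."
    print(ZIP_RE.findall(text))  # ['62704-1234']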

If you want to move beyond that, you may find that many NLP methods don't perform well on your data, since "Natural Language Processing" usually needs something that looks like (you guessed it) Natural Language to recognise what something is.

Alternatively, since you can view this as a chunking problem, you might use Maximum Entropy Markov Models (MEMMs). These use the probabilities of transitioning from one type of word to another to chunk text, in this case into "part of an address" and "not part of an address".
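
A very reduced sketch of that idea, assuming scikit-learn: a maximum-entropy (logistic regression) classifier whose features include the previous tag, decoded greedily here rather than with Viterbi, on invented toy data:

    # Reduced MEMM-style sketch: a logistic regression (maximum entropy)
    # classifier whose features include the previous tag, decoded greedily.
    # The feature set and the toy training sentence are invented.
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression

    def feats(word, prev_tag):
        return {"word": word.lower(), "is_digit": word.isdigit(), "prev_tag": prev_tag}

    train_sents = [(["Office", "at", "10", "Downing", "Street", "today"],
                    ["O",      "O",  "ADDR", "ADDR",  "ADDR",   "O"])]

    X, y = [], []
    for words, tags in train_sents:
        prev = "START"
        for w, t in zip(words, tags):
            X.append(feats(w, prev))
            y.append(t)
            prev = t

    vec = DictVectorizer()
    clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X), y)

    # Greedy decoding of a new sentence
    words, prev, out = ["Visit", "10", "Downing", "Street"], "START", []
    for w in words:
        tag = clf.predict(vec.transform([feats(w, prev)]))[0]
        out.append(tag)
        prev = tag
    print(list(zip(words, out)))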

Good luck!


tl;dr: The problem might be solvable using ML, but it's not straightforward if you're new to the topic.


There are a lot of machine learning libraries for Python:

  • Scikit-learn is a very popular general-purpose library, good for beginners and great for simple problems with smallish datasets.
  • Natural Language Toolkit (NLTK) has implementations of lots of algorithms, many of which are language-agnostic (e.g., n-grams).
  • Gensim is great for text topic modelling.
  • OpenCV implements some common algorithms (but is usually used for images).
  • spaCy and Transformers implement modern (state-of-the-art, as of 2020) text NLU (Natural Language Understanding) techniques, but require more familiarity with the underlying techniques.

Usually you pick a library that suits your problem and the technique you want to use.

Machine learning is a vast area. Just for the supervised-learning classification subproblem, and considering only "simple" classifiers, there are Naive Bayes, KNN, Decision Trees, Support Vector Machines, feed-forward neural networks... the list goes on and on. This is why, as you say, there are no "quickstarts" or tutorials for machine learning in general. My advice is, firstly, to understand the basic ML terminology; secondly, to understand a subproblem (I'd suggest classification within supervised learning); and thirdly, to study a simple algorithm that solves this subproblem (KNN relies on high-school-level math).
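
As a taste of how small such a first project can be, here is a minimal KNN example on scikit-learn's bundled iris dataset (nothing to do with postal codes, just the end-to-end workflow):

    # Minimal KNN example on scikit-learn's bundled iris dataset, just to show
    # the end-to-end supervised-classification workflow.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
    print("accuracy:", knn.score(X_test, y_test))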

About your problem in particular: it seems you want to detect the presence of a piece of data (a postal code) inside a huge dataset (text). A classic classification algorithm expects a relatively small feature vector. To obtain that, you will need to do what's called dimensionality reduction: that is, isolate the parts that look like potential postal codes. Only then does the classification algorithm classify them (as "postal code" or "not postal code", for example).

Thus, you need to find a way to isolate potential matches before you even think about using ML to approach this problem. This will most certainly entail natural language processing, as you said, if you don't or can't use regex or parsing.
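
Here is a sketch of that two-stage pipeline, assuming scikit-learn: a regex proposes digit runs as candidate postal codes, and a classifier then accepts or rejects each candidate; the features and the hand-labelled examples are invented for illustration:

    # Sketch of "isolate candidates first, classify second": a regex proposes
    # digit runs as candidate postal codes, then each candidate becomes a tiny
    # feature vector for a classifier. Features and labels are invented.
    import re
    from sklearn.tree import DecisionTreeClassifier

    def candidate_features(text, match):
        word_before = (text[:match.start()].split()[-1:] or [""])[0].lower()
        return [len(match.group()), int(word_before in {"zip", "postcode", "code"})]

    def candidates(text):
        return [(m.group(), candidate_features(text, m)) for m in re.finditer(r"\d{4,6}", text)]

    # Hand-labelled toy examples: 1 = postal code, 0 = not
    train = [("Zip 90210 on file", 1), ("Order 123456 shipped", 0)]
    X = [feat for text, _ in train for _, feat in candidates(text)]
    y = [label for text, label in train for _ in candidates(text)]

    clf = DecisionTreeClassifier().fit(X, y)

    for value, feat in candidates("Send it to postcode 10115 please"):
        print(value, clf.predict([feat])[0])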

More advanced models in NLU could potentially parse your whole text, but they might require very large amounts of pre-classified data, and explaining them is outside of the scope of this question. The libraries I've mentioned earlier are a good start.