dinostheo/sqrap

sqrap

A configurable web scraper that can map information from a website using a JSON schema.

Installation

npm i sqrap

Usage

The sqrap module exports a function that accepts two parameters: the URL of the resource to extract information from, and a configuration object. The configuration object should contain the custom selectors used to extract values from that resource and, optionally, HTTP options based on the request module.

Selectors

Selectors define which information to extract from a page. Each property of the selectors object names a value you want to extract and holds an array of selector items. The names of the properties are up to you.

e.g.

const selectors = {
  author: [
    {
      selector: 'span[itemprop="author"] > span[itemprop="name"]',
      text: true
    }
  ],
  title: [
    {
      selector: 'h1',
      text: true
    }
  ],
  text: [
    {
      selector: 'h1',
      text: true
    },
    {
      selector: '.field-name-summary',
      text: true
    },
    {
      selector: 'div[itemprop="articleBody"]',
      text: true
    }
  ],
  image: [
    {
      selector: 'meta[property="og:image"]',
      attribute: 'content'
    }
  ],
  htmlText: [
    {
      selector: 'div.group-left',
      html: true
    }
  ]
};

Every selector item has two properties. The first is always selector, and the second is one of text, attribute, or html.

text

It will extract all the text included in the selected DOM element.

attribute

It will extract the value of an attribute of the selected DOM element.

html

It will extract all the HTML included in the selected DOM element.
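The shape described above can be checked before a configuration is passed to sqrap. The following is a minimal sketch, not part of sqrap: `validateSelectors` and `EXTRACTION_KEYS` are hypothetical names, and the "exactly one extraction key per item" rule is an assumption taken from the description above rather than from sqrap's own validation.

```javascript
'use strict';

// Extraction keys described above; each item is assumed to use exactly one.
const EXTRACTION_KEYS = ['text', 'attribute', 'html'];

// Hypothetical helper (not part of sqrap): returns a list of problems found
// in a selectors object, or an empty array when the shape looks right.
function validateSelectors(selectors) {
  const errors = [];
  for (const [property, items] of Object.entries(selectors)) {
    if (!Array.isArray(items)) {
      errors.push(`${property}: expected an array of selector items`);
      continue;
    }
    items.forEach((item, index) => {
      const where = `${property}[${index}]`;
      if (item === null || typeof item !== 'object') {
        errors.push(`${where}: expected an object`);
        return;
      }
      if (typeof item.selector !== 'string' || item.selector === '') {
        errors.push(`${where}: "selector" must be a non-empty string`);
      }
      // Count how many of the extraction keys this item declares.
      const used = EXTRACTION_KEYS.filter(key => key in item);
      if (used.length !== 1) {
        errors.push(`${where}: expected exactly one of ${EXTRACTION_KEYS.join(', ')}`);
      }
    });
  }
  return errors;
}

const problems = validateSelectors({
  title: [{ selector: 'h1', text: true }],
  image: [{ selector: 'meta[property="og:image"]' }]
});
console.log(problems);
```

Here the `image` item is reported because it names a selector but none of text, attribute, or html.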

Example usage

'use strict';

const sqrap = require('sqrap');

const selectors = {
  logo: [
    {
      selector: '#hplogo',
      attribute: 'src'
    }
  ],
  title: [
    {
      selector: 'title',
      text: true
    }
  ],
  content: [
    {
      selector: '#SIvCob',
      html: true
    }
  ]
};

const url = 'http://www.google.com';

sqrap(url, { selectors })
  .then(result => {
    console.log(result);
  })
  .catch(console.error);
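The same call can also be written with async/await. The sketch below is only an illustration under stated assumptions: `scrapeOrNull` and `fakeScrape` are hypothetical names, and the scraper is passed in as an argument so the example runs offline. In real use you would pass `require('sqrap')` as the first argument.

```javascript
'use strict';

// `scrape` is any function with the (url, config) => Promise shape described
// in the Usage section; this wrapper is hypothetical and not part of sqrap.
async function scrapeOrNull(scrape, url, selectors) {
  try {
    return await scrape(url, { selectors });
  } catch (error) {
    // Report the failure on stderr and hand back null instead of rejecting.
    console.error(`Scraping ${url} failed: ${error.message}`);
    return null;
  }
}

// Stand-in scraper so the sketch runs without network access.
// It does not reproduce sqrap's behaviour; it only echoes its inputs.
const fakeScrape = async (url, config) => ({
  title: `stub for ${url}`,
  keys: Object.keys(config.selectors)
});

const pending = scrapeOrNull(fakeScrape, 'http://www.google.com', {
  title: [{ selector: 'title', text: true }]
});
pending.then(result => console.log(result));
```

Returning null on failure is a design choice for this sketch; rethrowing the error is just as reasonable when the caller needs to distinguish failure modes.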

Response

{
  "logo": "/images/branding/googlelogo/1x/googlelogo_white_background_color_272x92dp.png",
  "title": "Google",
  "content": "Google offered in:  <a href=\"http://www.google.com/setprefs?sig=0_66pRjBrpofhOEMhxHuwX235zuS4%3D&amp;hl=fy&amp;source=homepage&amp;sa=X&amp;ved=0ahUKEwiazsS12JzeAhUD2KQKHT_CBmQQ2ZgBCAU\">Frysk</a>  "
}
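Note that the `logo` value in the response above is a relative path, exactly as it appears in the `src` attribute. If an absolute URL is needed, it can be resolved against the page URL with the standard `URL` class. This is a small sketch; `toAbsoluteUrl` is a hypothetical helper and not part of sqrap.

```javascript
'use strict';

// Page URL and extracted value copied from the example above.
const pageUrl = 'http://www.google.com';
const result = {
  logo: '/images/branding/googlelogo/1x/googlelogo_white_background_color_272x92dp.png'
};

// Hypothetical helper (not part of sqrap): resolve a possibly relative URL
// against the page it was scraped from. Absolute URLs pass through unchanged.
function toAbsoluteUrl(value, base) {
  return new URL(value, base).href;
}

const logoUrl = toAbsoluteUrl(result.logo, pageUrl);
console.log(logoUrl);
```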

License

FOSSA Status
