sqrap

A configurable web scraper that maps information from a website using a JSON schema.

Installation

npm i sqrap

Usage

The sqrap module exports a function that accepts two parameters: the URL of the resource to extract information from, and a configuration object. The configuration object should contain the custom selectors used to extract values from that resource and, optionally, HTTP options based on the request module.

Selectors

Selectors extract information from a page into properties that you define. For each property you can specify a list of selectors. The property names are up to you.

For example:

const selectors = {
  author: [
    {
      selector: 'span[itemprop="author"] > span[itemprop="name"]',
      text: true
    }
  ],
  title: [
    {
      selector: 'h1',
      text: true
    }
  ],
  text: [
    {
      selector: 'h1',
      text: true
    },
    {
      selector: '.field-name-summary',
      text: true
    },
    {
      selector: 'div[itemprop="articleBody"]',
      text: true
    }
  ],
  image: [
    {
      selector: 'meta[property="og:image"]',
      attribute: 'content'
    }
  ],
  htmlText: [
    {
      selector: 'div.group-left',
      html: true
    }
  ]
};

Every selector item has two properties. The first is always selector, and the second is one of text, attribute, or html.
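
The shape described above can be checked before a configuration is passed to sqrap. The following is a minimal sketch, not part of the sqrap API: validateSelectors and EXTRACTORS are hypothetical names, and the sketch assumes each item should carry exactly one of text, attribute, or html.

```javascript
'use strict';

// Hypothetical helper, not part of sqrap: checks the selector item shape.
const EXTRACTORS = ['text', 'attribute', 'html'];

function validateSelectors(selectors) {
  const errors = [];
  for (const [property, items] of Object.entries(selectors)) {
    if (!Array.isArray(items)) {
      errors.push(`${property}: expected an array of selector items`);
      continue;
    }
    items.forEach((item, index) => {
      if (item === null || typeof item !== 'object') {
        errors.push(`${property}[${index}]: expected an object`);
        return;
      }
      // Every item needs a non-empty CSS selector string.
      if (typeof item.selector !== 'string' || item.selector === '') {
        errors.push(`${property}[${index}]: missing selector`);
      }
      // Assumption: exactly one extraction key per item.
      const used = EXTRACTORS.filter(key => key in item);
      if (used.length !== 1) {
        errors.push(`${property}[${index}]: expected exactly one of ${EXTRACTORS.join(', ')}`);
      }
    });
  }
  return errors;
}
```

An empty array means the configuration matches the expected shape. Otherwise each entry names the property and item index that needs attention.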

text

Extracts all the text contained in the selected DOM element.

attribute

Extracts the value of an attribute of the selected DOM element. Set it to the name of the attribute, for example 'content' or 'src'.

html

Extracts all the HTML contained in the selected DOM element.
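
To make the three extraction modes concrete, here is a rough sketch of how a single selector item could be resolved against an already selected element. This is not sqrap's actual implementation: extractValue is a hypothetical function, and element is assumed to expose text(), attr(name), and html() the way a cheerio or jQuery selection does. A small stand-in object replaces the real element so the sketch runs without a DOM library.

```javascript
'use strict';

// Hypothetical sketch, not sqrap's code: picks the extraction mode
// from the selector item and applies it to the selected element.
function extractValue(element, item) {
  if (item.attribute) {
    // `attribute` holds the attribute name, e.g. 'content' or 'src'.
    return element.attr(item.attribute);
  }
  if (item.html) {
    return element.html();
  }
  if (item.text) {
    return element.text();
  }
  return undefined;
}

// Minimal stand-in for a selected <meta> element (assumption for the demo).
const fakeMeta = {
  text: () => '',
  html: () => '',
  attr: name => (name === 'content' ? 'https://example.com/cover.png' : undefined)
};

const item = { selector: 'meta[property="og:image"]', attribute: 'content' };
console.log(extractValue(fakeMeta, item)); // prints "https://example.com/cover.png"
```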

Example usage

'use strict';

const sqrap = require('sqrap');

const selectors = {
  logo: [
    {
      selector: '#hplogo',
      attribute: 'src'
    }
  ],
  title: [
    {
      selector: 'title',
      text: true
    }
  ],
  content: [
    {
      selector: '#SIvCob',
      html: true
    }
  ]
};

const url = 'http://www.google.com';

sqrap(url, { selectors })
  .then(result => {
    console.log(result);
  })
  .catch(console.log);

Response

{
  "logo": "/images/branding/googlelogo/1x/googlelogo_white_background_color_272x92dp.png",
  "title": "Google",
  "content": "Google offered in:  <a href=\"http://www.google.com/setprefs?sig=0_66pRjBrpofhOEMhxHuwX235zuS4%3D&amp;hl=fy&amp;source=homepage&amp;sa=X&amp;ved=0ahUKEwiazsS12JzeAhUD2KQKHT_CBmQQ2ZgBCAU\">Frysk</a>  "
}
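
Note that the logo value in the response above is a relative path, exactly as it appears in the page's src attribute. If an absolute URL is needed, it can be resolved against the scraped page's URL afterwards. The sketch below is a hypothetical post-processing step, not something sqrap does for you; toAbsoluteUrl is an illustrative name, and it relies on the standard URL class available globally in Node.js.

```javascript
'use strict';

// Hypothetical post-processing helper, not part of sqrap.
function toAbsoluteUrl(value, pageUrl) {
  // The WHATWG URL constructor resolves `value` relative to `pageUrl`;
  // values that are already absolute are returned unchanged.
  return new URL(value, pageUrl).href;
}

const result = {
  logo: '/images/branding/googlelogo/1x/googlelogo_white_background_color_272x92dp.png'
};

const absoluteLogo = toAbsoluteUrl(result.logo, 'http://www.google.com');
console.log(absoluteLogo);
// prints "http://www.google.com/images/branding/googlelogo/1x/googlelogo_white_background_color_272x92dp.png"
```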
