Help for package boilerpipeR

Version:

1.3.2

Date:

2021-05-19

Title:

Interface to the Boilerpipe Java Library

Author:

See AUTHORS file.

Maintainer:

Mario Annau <mario.annau@gmail.com>

Imports:

rJava

Suggests:

RCurl

Description:

Generic extraction of main text content from HTML files; removal of ads, sidebars and headers using the boilerpipe <https://github.com/kohlschutter/boilerpipe> Java library. The extraction heuristics from boilerpipe perform robustly across a wide range of web site templates.

License:

Apache License (== 2.0)

URL:

https://github.com/mannau/boilerpipeR

BugReports:

https://github.com/mannau/boilerpipeR/issues

RoxygenNote:

7.1.1

Encoding:

UTF-8

NeedsCompilation:

Packaged:

2021-05-19 09:05:37 UTC; marioannau

Repository:

CRAN

Date/Publication:

2021-05-19 09:20:02 UTC

Extract the main content from HTML files

Description

boilerpipeR interfaces with the boilerpipe Java library, created by Christian Kohlschutter (https://github.com/kohlschutter/boilerpipe). It implements robust heuristics to extract the main content from HTML files, removing unnecessary elements such as ads, banners, headers and footers.

Author(s)

Mario Annau <mario.annau@gmail.com>

Examples

## Not run: 
data(content)
extract <- DefaultExtractor(content)
cat(extract)

## End(Not run)

A full-text extractor which is tuned towards news articles.

Description

In this scenario it achieves higher accuracy than DefaultExtractor.

Usage

ArticleExtractor(content, ...)

Arguments

content

Text content as character

...

additional parameters

Value

extracted text as character

Author(s)

Mario Annau

Examples

data(content)
extract <- ArticleExtractor(content)

A full-text extractor which is tuned towards extracting sentences from news articles.

Description

A full-text extractor which is tuned towards extracting sentences from news articles.

Usage

ArticleSentencesExtractor(content, ...)

Arguments

content

Text content as character

...

additional parameters

Value

extracted text as character

Author(s)

Mario Annau

Examples

data(content)
extract <- ArticleSentencesExtractor(content)

A full-text extractor trained on the 'krdwrd' Canola corpus (see https://krdwrd.org/trac/attachment/wiki/Corpora/Canola/CANOLA.pdf).

Description

A full-text extractor trained on the 'krdwrd' Canola corpus (see https://krdwrd.org/trac/attachment/wiki/Corpora/Canola/CANOLA.pdf).

Usage

CanolaExtractor(content, ...)

Arguments

content

Text content as character

...

additional parameters

Value

extracted text as character

Author(s)

Mario Annau

Examples

data(content)
extract <- CanolaExtractor(content)

A quite generic full-text extractor.

Description

A quite generic full-text extractor.

Usage

DefaultExtractor(content, ...)

Arguments

content

Text content as character

...

additional parameters

Value

extracted text as character

Author(s)

Mario Annau

Examples

data(content)
extract <- DefaultExtractor(content)

Generic extraction function which calls boilerpipe extractors

Description

This is the workhorse function that calls the boilerpipe Java library directly. It is typically invoked through the wrapper functions named after the values listed for the exname parameter.

Usage

Extractor(exname, content, asText = TRUE, ...)

Arguments

exname

character specifying the extractor to be used. It can take one of the following values:

ArticleExtractor: A full-text extractor which is tuned towards news articles.
ArticleSentencesExtractor: A full-text extractor which is tuned towards extracting sentences from news articles.
CanolaExtractor: A full-text extractor trained on the 'krdwrd' Canola corpus.
DefaultExtractor: A quite generic full-text extractor.
KeepEverythingExtractor: Marks everything as content.
LargestContentExtractor: A full-text extractor which extracts the largest text component of a page.
NumWordsRulesExtractor: A quite generic full-text extractor solely based upon the number of words per block.

content

Text content or URL as character

asText

logical; if TRUE (the default), content is treated as the actual text to be extracted. If FALSE, content is treated as a URL: the HTML document is downloaded first and extracted afterwards.

...

additional parameters

Value

extracted text as character

Author(s)

Mario Annau

References

https://github.com/kohlschutter/boilerpipe
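
The calling convention described in this section can be sketched as follows. This is a minimal illustration, not code from the package: extractor_names, looks_like_url and run_extractor are hypothetical helpers introduced here, and the URL check is only a simple assumption about what counts as a URL. The call to Extractor() itself needs boilerpipeR together with a working Java and rJava setup, so the usage lines are left commented out.

```r
# Values accepted by the exname argument, as listed in this section.
extractor_names <- c(
  "ArticleExtractor", "ArticleSentencesExtractor", "CanolaExtractor",
  "DefaultExtractor", "KeepEverythingExtractor", "LargestContentExtractor",
  "NumWordsRulesExtractor"
)

# Hypothetical helper: TRUE if x starts with http:// or https://.
looks_like_url <- function(x) {
  grepl("^https?://", x)
}

# Hypothetical wrapper (not part of boilerpipeR): reject unknown extractor
# names, choose asText from the input, then delegate to Extractor().
run_extractor <- function(exname, content) {
  if (!exname %in% extractor_names) {
    stop("unknown extractor: ", exname)
  }
  if (!requireNamespace("boilerpipeR", quietly = TRUE)) {
    stop("boilerpipeR is not installed")
  }
  boilerpipeR::Extractor(exname, content, asText = !looks_like_url(content))
}

## Not run:
# data(content, package = "boilerpipeR")
# text <- run_extractor("ArticleExtractor", content)
## End(Not run)
```

Checking the name up front gives a readable R error instead of a failure inside the Java call; whether that matters depends on how the extractor name is supplied.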

Marks everything as content.

Description

Marks everything as content.

Usage

KeepEverythingExtractor(content, ...)

Arguments

content

Text content as character

...

additional parameters

Value

extracted text as character

Author(s)

Mario Annau

Examples

data(content)
extract <- KeepEverythingExtractor(content)

A full-text extractor which extracts the largest text component of a page.

Description

For news articles, it may perform better than the DefaultExtractor, but usually worse than ArticleExtractor.

Usage

LargestContentExtractor(content, ...)

Arguments

content

Text content as character

...

additional parameters

Value

extracted text as character

Author(s)

Mario Annau

Examples

data(content)
extract <- LargestContentExtractor(content)
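
One rough way to compare extractors on the same input is to look at how much text each one returns. The sketch below uses a hypothetical helper, compare_extractors, which is not part of boilerpipeR. Character counts only indicate how aggressively each extractor trims the page; they say nothing about accuracy.

```r
# Hypothetical helper: apply each extractor function to the same input and
# return the number of characters in each result, named after the list entries.
compare_extractors <- function(content, extractors) {
  vapply(extractors, function(f) nchar(f(content)), integer(1))
}

## Not run:
# library(boilerpipeR)
# data(content)
# compare_extractors(content, list(
#   Default = DefaultExtractor,
#   Article = ArticleExtractor,
#   Largest = LargestContentExtractor
# ))
## End(Not run)
```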

A quite generic full-text extractor solely based upon the number of words per block (the current, the previous and the next block).

Description

A quite generic full-text extractor solely based upon the number of words per block (the current, the previous and the next block).

Usage

NumWordsRulesExtractor(content, ...)

Arguments

content

Text content as character

...

additional parameters

Value

extracted text as character

Author(s)

Mario Annau

Examples

data(content)
extract <- NumWordsRulesExtractor(content)

WordPress-generated web page (retrieved from the Quantivity blog, https://quantivity.wordpress.com). The content is saved as a character string, ready for extraction.

Description

WordPress-generated web page (retrieved from the Quantivity blog, https://quantivity.wordpress.com). The content is saved as a character string, ready for extraction.

Author(s)

Mario Annau

References

https://quantivity.wordpress.com

Examples

# The data set was generated as follows:
## Not run: 
library(RCurl)
url <- "https://quantivity.wordpress.com/2012/11/09/multi-asset-market-regimes/"
content <- getURL(url)
content <- iconv(content, "UTF-8", "ASCII//TRANSLIT")
save(content, file = "content.rda")

## End(Not run)
