| Version: | 1.3.2 |
| Date: | 2021-05-19 |
| Title: | Interface to the Boilerpipe Java Library |
| Author: | See AUTHORS file. |
| Maintainer: | Mario Annau <mario.annau@gmail.com> |
| Imports: | rJava |
| Suggests: | RCurl |
| Description: | Generic extraction of main text content from HTML files; removal of ads, sidebars and headers using the boilerpipe (https://github.com/kohlschutter/boilerpipe) Java library. The extraction heuristics from boilerpipe perform robustly across a wide range of web site templates. |
| License: | Apache License (== 2.0) |
| URL: | https://github.com/mannau/boilerpipeR |
| BugReports: | https://github.com/mannau/boilerpipeR/issues |
| RoxygenNote: | 7.1.1 |
| Encoding: | UTF-8 |
| NeedsCompilation: | no |
| Packaged: | 2021-05-19 09:05:37 UTC; marioannau |
| Repository: | CRAN |
| Date/Publication: | 2021-05-19 09:20:02 UTC |
Extract the main content from HTML files
Description
boilerpipeR interfaces with the boilerpipe Java library, created by Christian Kohlschutter (https://github.com/kohlschutter/boilerpipe). It implements robust heuristics to extract the main content from HTML files, removing unnecessary elements such as ads, banners and headers/footers.
Author(s)
Mario Annau <mario.annau@gmail.com>
See Also
Extractor, DefaultExtractor, ArticleExtractor
Examples
## Not run:
data(content)
extract <- DefaultExtractor(content)
cat(extract)
## End(Not run)
A full-text extractor which is tuned towards news articles.
Description
For news articles, it achieves higher accuracy than DefaultExtractor.
Usage
ArticleExtractor(content, ...)
Arguments
content |
Text content as character |
... |
additional parameters |
Value
extracted text as character
Author(s)
Mario Annau
See Also
Examples
data(content)
extract <- ArticleExtractor(content)
A full-text extractor which is tuned towards extracting sentences from news articles.
Description
A full-text extractor which is tuned towards extracting sentences from news articles.
Usage
ArticleSentencesExtractor(content, ...)
Arguments
content |
Text content as character |
... |
additional parameters |
Value
extracted text as character
Author(s)
Mario Annau
See Also
Examples
data(content)
extract <- ArticleSentencesExtractor(content)
A full-text extractor trained on the 'krdwrd' Canola corpus (see https://krdwrd.org/trac/attachment/wiki/Corpora/Canola/CANOLA.pdf).
Description
A full-text extractor trained on the 'krdwrd' Canola corpus (see https://krdwrd.org/trac/attachment/wiki/Corpora/Canola/CANOLA.pdf).
Usage
CanolaExtractor(content, ...)
Arguments
content |
Text content as character |
... |
additional parameters |
Value
extracted text as character
Author(s)
Mario Annau
See Also
Examples
data(content)
extract <- CanolaExtractor(content)
A quite generic full-text extractor.
Description
A quite generic full-text extractor.
Usage
DefaultExtractor(content, ...)
Arguments
content |
Text content as character |
... |
additional parameters |
Value
extracted text as character
Author(s)
Mario Annau
See Also
Examples
data(content)
extract <- DefaultExtractor(content)
Generic extraction function which calls boilerpipe extractors
Description
This is the workhorse function that calls the boilerpipe Java library directly. It is typically called through the wrapper functions whose names are the values accepted by the parameter exname.
Usage
Extractor(exname, content, asText = TRUE, ...)
Arguments
exname |
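character specifying the extractor to be used. It can take one of the following values: ArticleExtractor, ArticleSentencesExtractor, CanolaExtractor, DefaultExtractor, KeepEverythingExtractor, LargestContentExtractor, NumWordsRulesExtractor |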
content |
Text content or URL as character |
asText |
logical; if TRUE (the default), content is treated as the actual text to be extracted; if FALSE, content is treated as a URL from which the HTML document is first downloaded and then extracted |
... |
additional parameters |
Value
extracted text as character
Author(s)
Mario Annau
References
https://github.com/kohlschutter/boilerpipe
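Examples
The following is a minimal sketch rather than an official example. It assumes that boilerpipeR is installed together with a working Java/rJava setup, and it guards on that assumption. It illustrates the two calling styles described above: selecting an extractor by name through Extractor, and calling the wrapper function of the same name. The asText = FALSE call is shown only as a comment because it needs network access; the URL is the one used to build the content data set.

```r
# Sketch only: assumes boilerpipeR is installed with a working Java/rJava setup.
if (requireNamespace("boilerpipeR", quietly = TRUE)) {
  library(boilerpipeR)
  data(content)  # sample page shipped with the package

  # Select the extractor by name through the generic function ...
  by_name <- Extractor("ArticleExtractor", content)

  # ... or call the wrapper function of the same name.
  by_wrapper <- ArticleExtractor(content)

  # Preview the first 200 characters of the extracted text.
  cat(substr(by_name, 1, 200), "\n")

  # With asText = FALSE, content is treated as a URL to download first
  # (left as a comment because it needs network access):
  # from_url <- Extractor("ArticleExtractor",
  #   "https://quantivity.wordpress.com/2012/11/09/multi-asset-market-regimes/",
  #   asText = FALSE)
}
```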
Marks everything as content.
Description
Marks everything as content.
Usage
KeepEverythingExtractor(content, ...)
Arguments
content |
Text content as character |
... |
additional parameters |
Value
extracted text as character
Author(s)
Mario Annau
See Also
Examples
data(content)
extract <- KeepEverythingExtractor(content)
A full-text extractor which extracts the largest text component of a page.
Description
For news articles, it may perform better than the DefaultExtractor,
but usually worse than ArticleExtractor.
Usage
LargestContentExtractor(content, ...)
Arguments
content |
Text content as character |
... |
additional parameters |
Value
extracted text as character
Author(s)
Mario Annau
See Also
Examples
data(content)
extract <- LargestContentExtractor(content)
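## A rough way to compare extractors on the same page is to look at how much
## text each one returns. The helper below is a hypothetical sketch and not
## part of boilerpipeR: compare_lengths and its extract argument are names
## introduced here for illustration. It assumes that the extractor names are
## valid values for exname, and it says nothing about extraction quality.

```r
# Hypothetical helper (not part of boilerpipeR): number of characters
# returned by each named extractor for the same input.
# `extract` defaults to boilerpipeR::Extractor and is only evaluated when
# the function is called without a replacement.
compare_lengths <- function(content, exnames, extract = boilerpipeR::Extractor) {
  vapply(
    exnames,
    function(nm) sum(nchar(extract(nm, content))),
    integer(1)
  )
}

# Usage, guarded on the assumption that boilerpipeR and Java are available.
if (requireNamespace("boilerpipeR", quietly = TRUE)) {
  data(content, package = "boilerpipeR")
  print(compare_lengths(
    content,
    c("DefaultExtractor", "ArticleExtractor", "LargestContentExtractor")
  ))
}
```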
A quite generic full-text extractor solely based upon the number of words per block (the current, the previous and the next block).
Description
A quite generic full-text extractor solely based upon the number of words per block (the current, the previous and the next block).
Usage
NumWordsRulesExtractor(content, ...)
Arguments
content |
Text content as character |
... |
additional parameters |
Value
extracted text as character
Author(s)
Mario Annau
See Also
Examples
data(content)
extract <- NumWordsRulesExtractor(content)
WordPress-generated web page (retrieved from the Quantivity blog, https://quantivity.wordpress.com). The content is saved as character and is ready to be extracted.
Description
WordPress-generated web page (retrieved from the Quantivity blog, https://quantivity.wordpress.com). The content is saved as character and is ready to be extracted.
Author(s)
Mario Annau
References
https://quantivity.wordpress.com
Examples
# The data set was generated as follows:
## Not run:
library(RCurl)
url <- "https://quantivity.wordpress.com/2012/11/09/multi-asset-market-regimes/"
content <- getURL(url)
content <- iconv(content, "UTF-8", "ASCII//TRANSLIT")
save(content, file = "content.rda")
## End(Not run)