Class SingleDocumentExtraction


  • public class SingleDocumentExtraction
    extends Object
    This class acts as a facade where all extractors (for a given MIMEType) can be called on a single document. Extractors are automatically filtered by MIMEType.
    • Constructor Summary

      Constructors 
      Constructor Description
      SingleDocumentExtraction​(org.apache.any23.configuration.Configuration configuration, org.apache.any23.source.DocumentSource in, org.apache.any23.extractor.ExtractorFactory<?> factory, org.apache.any23.writer.TripleHandler output)
      Builds an extractor by the specification of document source, extractors factory and output triple handler.
      SingleDocumentExtraction​(org.apache.any23.configuration.Configuration configuration, org.apache.any23.source.DocumentSource in, org.apache.any23.extractor.ExtractorGroup extractors, org.apache.any23.writer.TripleHandler output)
      Builds an extractor by the specification of document source, list of extractors and output triple handler.
      SingleDocumentExtraction​(org.apache.any23.source.DocumentSource in, org.apache.any23.extractor.ExtractorFactory<?> factory, org.apache.any23.writer.TripleHandler output)
      Builds an extractor by the specification of document source, extractors factory and output triple handler, using the DefaultConfiguration.
    • Constructor Detail

      • SingleDocumentExtraction

        public SingleDocumentExtraction​(org.apache.any23.configuration.Configuration configuration,
                                        org.apache.any23.source.DocumentSource in,
                                        org.apache.any23.extractor.ExtractorGroup extractors,
                                        org.apache.any23.writer.TripleHandler output)
        Builds an extractor by the specification of document source, list of extractors and output triple handler.
        Parameters:
        configuration - configuration applied during extraction.
        in - input document source.
        extractors - list of extractors to be applied.
        output - output triple handler.
      • SingleDocumentExtraction

        public SingleDocumentExtraction​(org.apache.any23.configuration.Configuration configuration,
                                        org.apache.any23.source.DocumentSource in,
                                        org.apache.any23.extractor.ExtractorFactory<?> factory,
                                        org.apache.any23.writer.TripleHandler output)
        Builds an extractor by the specification of document source, extractors factory and output triple handler.
        Parameters:
        configuration - configuration applied during extraction.
        in - input document source.
        factory - the extractors factory.
        output - output triple handler.
      • SingleDocumentExtraction

        public SingleDocumentExtraction​(org.apache.any23.source.DocumentSource in,
                                        org.apache.any23.extractor.ExtractorFactory<?> factory,
                                        org.apache.any23.writer.TripleHandler output)
        Builds an extractor by the specification of document source, extractors factory and output triple handler, using the DefaultConfiguration.
        Parameters:
        in - input document source.
        factory - the extractors factory.
        output - output triple handler.
    • Method Detail

      • setLocalCopyFactory

        public void setLocalCopyFactory​(LocalCopyFactory copyFactory)
        Sets the internal factory used to generate the local copy of the document. If null, the MemCopyFactory is used.
        Parameters:
        copyFactory - local copy factory.
        See Also:
        DocumentSource
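The null-fallback contract described for setLocalCopyFactory can be sketched in isolation. This is a simplified model, not the Any23 implementation: `CopyFactoryFallbackSketch`, `CopyFactory`, `InMemoryCopyFactory`, `setCopyFactory` and `getCopyFactory` are hypothetical names introduced only for illustration, and the in-memory class merely plays the role the documentation assigns to MemCopyFactory.

```java
import java.nio.charset.StandardCharsets;

public class CopyFactoryFallbackSketch {

    // Hypothetical stand-in for a local copy factory (not an Any23 type).
    interface CopyFactory {
        byte[] createLocalCopy(byte[] source);
    }

    // Hypothetical in-memory default, analogous in role to MemCopyFactory.
    static final class InMemoryCopyFactory implements CopyFactory {
        @Override
        public byte[] createLocalCopy(byte[] source) {
            return source.clone(); // keep an independent copy in memory
        }
    }

    private CopyFactory copyFactory = new InMemoryCopyFactory();

    // Mirrors the documented rule: passing null selects the in-memory default.
    public void setCopyFactory(CopyFactory factory) {
        this.copyFactory = (factory != null) ? factory : new InMemoryCopyFactory();
    }

    CopyFactory getCopyFactory() {
        return copyFactory;
    }

    public static void main(String[] args) {
        CopyFactoryFallbackSketch sketch = new CopyFactoryFallbackSketch();
        sketch.setCopyFactory(null); // falls back to the in-memory factory
        byte[] original = "<html/>".getBytes(StandardCharsets.UTF_8);
        byte[] copy = sketch.getCopyFactory().createLocalCopy(original);
        System.out.println("copy is independent: " + (copy != original));
    }
}
```

Treating null as "use the default" keeps callers from having to know which default class exists, at the cost of making null a meaningful argument.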
      • setMIMETypeDetector

        public void setMIMETypeDetector​(org.apache.any23.mime.MIMETypeDetector detector)
        Sets the internal MIME type detector. If null, MIME type detection is skipped and all extractors are activated.
        Parameters:
        detector - detector instance.
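The filtering rule above can be illustrated with a small self-contained model. The sketch does not use Any23 classes: `MimeFilterSketch`, `SimpleExtractor`, `selectExtractors`, `sampleExtractors` and the extractor names are hypothetical, and a `Function<String, String>` stands in for a MIME type detector. It shows only the documented behaviour that extractors are filtered by the detected MIME type and that a null detector activates all of them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

public class MimeFilterSketch {

    // Hypothetical stand-in for an extractor: a name plus the MIME types it accepts.
    static final class SimpleExtractor {
        final String name;
        final Set<String> supportedMimeTypes;

        SimpleExtractor(String name, String... supportedMimeTypes) {
            this.name = name;
            this.supportedMimeTypes = new HashSet<>(Arrays.asList(supportedMimeTypes));
        }
    }

    // Returns the extractors activated for the content.
    // A null detector skips detection and activates every extractor.
    static List<SimpleExtractor> selectExtractors(List<SimpleExtractor> all,
                                                  Function<String, String> detector,
                                                  String content) {
        if (detector == null) {
            return new ArrayList<>(all);
        }
        String detected = detector.apply(content);
        List<SimpleExtractor> matching = new ArrayList<>();
        for (SimpleExtractor extractor : all) {
            if (extractor.supportedMimeTypes.contains(detected)) {
                matching.add(extractor);
            }
        }
        return matching;
    }

    // Two illustrative extractors with made-up names.
    static List<SimpleExtractor> sampleExtractors() {
        return Arrays.asList(
                new SimpleExtractor("html-microdata", "text/html"),
                new SimpleExtractor("turtle", "text/turtle"));
    }

    public static void main(String[] args) {
        List<SimpleExtractor> all = sampleExtractors();
        // Toy detector (an assumption): markup starting with '<' is treated as HTML.
        Function<String, String> detector =
                content -> content.startsWith("<") ? "text/html" : "text/turtle";

        for (SimpleExtractor e : selectExtractors(all, detector, "<html></html>")) {
            System.out.println("with detector: " + e.name);
        }
        for (SimpleExtractor e : selectExtractors(all, null, "<html></html>")) {
            System.out.println("without detector: " + e.name);
        }
    }
}
```

In this model, a check in the style of hasMatchingExtractors would amount to testing whether the returned list is non-empty.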
      • run

        public SingleDocumentExtractionReport run​(org.apache.any23.extractor.ExtractionParameters extractionParameters)
                                           throws org.apache.any23.extractor.ExtractionException,
                                                  IOException
        Triggers the execution of all the extractors registered to this class, using the specified extraction parameters.
        Parameters:
        extractionParameters - the parameters applied to this extraction run.
        Returns:
        the report generated by the extraction.
        Throws:
        org.apache.any23.extractor.ExtractionException - if an error occurred during the data extraction.
        IOException - if an error occurred during the data access.
      • run

        public SingleDocumentExtractionReport run()
                                           throws IOException,
                                                  org.apache.any23.extractor.ExtractionException
        Triggers the execution of all the extractors registered to this class, using the default extraction parameters.
        Returns:
        the extraction report.
        Throws:
        IOException - if there is an error reading input from the document source.
        org.apache.any23.extractor.ExtractionException - if there is an error during extraction.
      • getDetectedMIMEType

        public String getDetectedMIMEType()
                                   throws IOException
        Returns the detected MIME type for the given DocumentSource.
        Returns:
        string containing the detected MIME type.
        Throws:
        IOException - if an error occurred while accessing the data.
      • hasMatchingExtractors

        public boolean hasMatchingExtractors()
                                      throws IOException
        Checks whether the content of the given DocumentSource activates at least one extractor.
        Returns:
        true if at least one extractor is activated, false otherwise.
        Throws:
        IOException - if there is an error locating matching extractors.
      • getMatchingExtractors

        public List<org.apache.any23.extractor.Extractor> getMatchingExtractors()
        Returns:
        the list of all the activated extractors for the given DocumentSource.
      • getParserEncoding

        public String getParserEncoding()
        Returns:
        the configured parsing encoding.
      • setParserEncoding

        public void setParserEncoding​(String encoding)
        Sets the document parser encoding.
        Parameters:
        encoding - parser encoding.