Overview ¶
Package gowarc provides a framework for handling WARC files, enabling their parsing, creation, and validation.
WARC Overview ¶
The WARC format offers a standard way to structure, manage and store billions of resources collected from the web and elsewhere. It is used to build applications for harvesting, managing, accessing, mining and exchanging content.
For more details, visit the WARC specification: https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/
WARC record creation ¶
The WarcRecordBuilder, initialized via NewRecordBuilder, is the primary tool for creating WARC records. By default, the WarcRecordBuilder generates a record id and calculates the 'Content-Length' and 'WARC-Block-Digest'.
Use WarcFileWriter, initialized with NewWarcFileWriter, to write WARC files.
WARC record parsing ¶
To parse single WARC records, use the Unmarshaler initialized with NewUnmarshaler.
To read entire WARC files, employ the WarcFileReader initialized through NewWarcFileReader.
Validation and repair ¶
The gowarc package supports validation during both the creation and parsing of WARC records. The scope of validation and the handling of validation errors are controlled by options passed to the WarcRecordBuilder, Unmarshaler, or WarcFileReader.
Index ¶
- Constants
- Variables
- type Block
- type ContentLengthError
- type DigestEncoding
- type DigestError
- type ErrorPolicy
- type HeaderFieldError
- type HttpRequestBlock
- type HttpResponseBlock
- type Marshaler
- type PatternNameGenerator
- type PayloadBlock
- type ProtocolHeaderBlock
- type Record
- type RecordType
- type RevisitRef
- type SyntaxError
- type Unmarshaler
- type WarcFields
- func (wf *WarcFields) Add(name string, value string)
- func (wf *WarcFields) AddId(name, value string)
- func (wf *WarcFields) AddInt(name string, value int)
- func (wf *WarcFields) AddInt64(name string, value int64)
- func (wf *WarcFields) AddTime(name string, value time.Time)
- func (wf *WarcFields) AddTimeNano(name string, value time.Time)
- func (wf *WarcFields) CanonicalHeaderKey(s string) string
- func (wf *WarcFields) Delete(key string)
- func (wf *WarcFields) Get(key string) string
- func (wf *WarcFields) GetAll(name string) []string
- func (wf *WarcFields) GetId(name string) string
- func (wf *WarcFields) GetInt(key string) (int, error)
- func (wf *WarcFields) GetInt64(name string) (int64, error)
- func (wf *WarcFields) GetTime(name string) (time.Time, error)
- func (wf *WarcFields) Has(name string) bool
- func (wf *WarcFields) Set(name string, value string)
- func (wf *WarcFields) SetId(name, value string)
- func (wf *WarcFields) SetInt(name string, value int)
- func (wf *WarcFields) SetInt64(name string, value int64)
- func (wf *WarcFields) SetTime(name string, value time.Time)
- func (wf *WarcFields) Sort()
- func (wf *WarcFields) String() string
- func (wf *WarcFields) Write(w io.Writer) (n int64, err error)
- type WarcFieldsBlock
- type WarcFileNameGenerator
- type WarcFileReader
- type WarcFileWriter
- type WarcFileWriterOption
- func WithAddWarcConcurrentToHeader(addConcurrentHeader bool) WarcFileWriterOption
- func WithAfterFileCreationHook(f func(fileName string, size int64, warcInfoId string) error) WarcFileWriterOption
- func WithBeforeFileCreationHook(f func(fileName string) error) WarcFileWriterOption
- func WithCompressedFileSuffix(suffix string) WarcFileWriterOption
- func WithCompression(compress bool) WarcFileWriterOption
- func WithCompressionLevel(gzipLevel int) WarcFileWriterOption
- func WithExpectedCompressionRatio(ratio float64) WarcFileWriterOption
- func WithFileNameGenerator(generator WarcFileNameGenerator) WarcFileWriterOption
- func WithFlush(flush bool) WarcFileWriterOption
- func WithMarshaler(marshaler Marshaler) WarcFileWriterOption
- func WithMaxConcurrentWriters(count int) WarcFileWriterOption
- func WithMaxFileSize(size int64) WarcFileWriterOption
- func WithOpenFileSuffix(suffix string) WarcFileWriterOption
- func WithRecordOptions(opts ...WarcRecordOption) WarcFileWriterOption
- func WithSegmentation() WarcFileWriterOption
- func WithWarcInfoFunc(f func(recordBuilder WarcRecordBuilder) error) WarcFileWriterOption
- type WarcRecord
- type WarcRecordBuilder
- type WarcRecordOption
- func WithAddMissingContentLength(addMissingContentLength bool) WarcRecordOption
- func WithAddMissingDigest(addMissingDigest bool) WarcRecordOption
- func WithAddMissingRecordId(addMissingRecordId bool) WarcRecordOption
- func WithBlockErrorPolicy(policy ErrorPolicy) WarcRecordOption
- func WithBufferMaxMemBytes(size int64) WarcRecordOption
- func WithBufferTmpDir(dir string) WarcRecordOption
- func WithDefaultDigestAlgorithm(defaultDigestAlgorithm string) WarcRecordOption
- func WithDefaultDigestEncoding(defaultDigestEncoding DigestEncoding) WarcRecordOption
- func WithFixContentLength(fixContentLength bool) WarcRecordOption
- func WithFixDigest(fixDigest bool) WarcRecordOption
- func WithFixSyntaxErrors(fixSyntaxErrors bool) WarcRecordOption
- func WithFixWarcFieldsBlockErrors(fixWarcFieldsBlockErrors bool) WarcRecordOption
- func WithNoValidation() WarcRecordOption
- func WithRecordIdFunc(recordIdFunc func() (string, error)) WarcRecordOption
- func WithSkipParseBlock() WarcRecordOption
- func WithSpecViolationPolicy(policy ErrorPolicy) WarcRecordOption
- func WithStrictValidation() WarcRecordOption
- func WithSyntaxErrorPolicy(policy ErrorPolicy) WarcRecordOption
- func WithUnknownRecordTypePolicy(policy ErrorPolicy) WarcRecordOption
- func WithUrlParserOptions(opts ...url.ParserOption) WarcRecordOption
- func WithVersion(version *WarcVersion) WarcRecordOption
- type WarcVersion
- type WriteResponse
Constants ¶
const (
	// WARC header field name constants
	ContentLength             = "Content-Length"
	ContentType               = "Content-Type"
	WarcBlockDigest           = "WARC-Block-Digest"
	WarcConcurrentTo          = "WARC-Concurrent-To"
	WarcDate                  = "WARC-Date"
	WarcFilename              = "WARC-Filename"
	WarcIPAddress             = "WARC-IP-Address"
	WarcIdentifiedPayloadType = "WARC-Identified-Payload-Type"
	WarcPayloadDigest         = "WARC-Payload-Digest"
	WarcProfile               = "WARC-Profile"
	WarcRecordID              = "WARC-Record-ID"
	WarcRefersTo              = "WARC-Refers-To"
	WarcRefersToDate          = "WARC-Refers-To-Date"
	WarcRefersToTargetURI     = "WARC-Refers-To-Target-URI"
	WarcSegmentNumber         = "WARC-Segment-Number"
	WarcSegmentOriginID       = "WARC-Segment-Origin-ID"
	WarcSegmentTotalLength    = "WARC-Segment-Total-Length"
	WarcTargetURI             = "WARC-Target-URI"
	WarcTruncated             = "WARC-Truncated"
	WarcType                  = "WARC-Type"
	WarcWarcinfoID            = "WARC-Warcinfo-ID"
	WarcPageID                = "WARC-Page-ID"       // Browsertrix extension field
	WarcResourceType          = "WARC-Resource-Type" // Browsertrix extension field
	WarcJSONMetadata          = "WARC-JSON-Metadata" // Browsertrix extension field
)
const (
	// Well known content types
	ApplicationWarcFields = "application/warc-fields"
	ApplicationHttp       = "application/http"
)
const (
	// Well known revisit profiles
	ProfileIdenticalPayloadDigestV1_1 = "http://netpreserve.org/warc/1.1/revisit/identical-payload-digest"
	ProfileServerNotModifiedV1_1      = "http://netpreserve.org/warc/1.1/revisit/server-not-modified"
	ProfileIdenticalPayloadDigestV1_0 = "http://netpreserve.org/warc/1.0/revisit/identical-payload-digest"
	ProfileServerNotModifiedV1_0      = "http://netpreserve.org/warc/1.0/revisit/server-not-modified"
)
Variables ¶
var (
	// ErrNotRevisitRecord is returned when a revisit-only operation is attempted on a non-revisit record.
	ErrNotRevisitRecord = errors.New("gowarc: not a revisit record")
	// ErrIsRevisitRecord is returned when attempting to create a revisit reference from a revisit record.
	ErrIsRevisitRecord = errors.New("gowarc: cannot reference a revisit record")
	// ErrUnknownRevisitProfile is returned when a revisit record references an unrecognized profile URI.
	ErrUnknownRevisitProfile = errors.New("gowarc: unknown revisit profile")
	// ErrMissingPayloadDigest is returned when the identical-payload-digest profile is used but no payload digest is available.
	ErrMissingPayloadDigest = errors.New("gowarc: payload digest required for identical-payload-digest profile")
	// ErrMergeRequiresOneRecord is returned when Merge is called with zero or more than one referenced record.
	ErrMergeRequiresOneRecord = errors.New("gowarc: revisit merge requires exactly one referenced record")
	// ErrMergeNotSupported is returned when merging is attempted on a record type that does not support it.
	ErrMergeNotSupported = errors.New("gowarc: merging is only possible for revisit records or segmented records")
	// ErrMergeSegmentedNotImplemented is returned when merging of segmented records is attempted.
	ErrMergeSegmentedNotImplemented = errors.New("gowarc: merging of segmented records is not implemented")
	// ErrMergeWrongBlockType is returned when a revisit record's block type is incompatible with merging
	// (typically because the record was parsed with SkipParseBlock).
	ErrMergeWrongBlockType = errors.New("gowarc: revisit block type incompatible with merge; record must be parsed with SkipParseBlock=false")
	// ErrMergeUnsupportedBlock is returned when merging a revisit with a non-HTTP block type.
	ErrMergeUnsupportedBlock = errors.New("gowarc: merge only supports http request and response blocks")
	// ErrUnsupportedDigestAlgorithm is returned when an unrecognized digest algorithm is encountered.
	ErrUnsupportedDigestAlgorithm = errors.New("gowarc: unsupported digest algorithm")
	// ErrNoRecord is returned by [Unmarshaler.Unmarshal] and [WarcFileReader.Next]
	// when the reader scans past one or more bytes without finding a WARC record
	// before reaching end-of-file. This distinguishes "stream contained only
	// unrecognizable data" from a clean EOF on an empty or fully-consumed stream.
	ErrNoRecord = errors.New("gowarc: no WARC record found")
)
Sentinel errors for common conditions. These can be matched with errors.Is.
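The sketch below illustrates matching a sentinel with errors.Is after it has been wrapped. To stay self-contained it does not import gowarc: errNoRecord is a local stand-in for ErrNoRecord, and classify is a hypothetical helper, not part of the package.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// errNoRecord is a local stand-in for gowarc.ErrNoRecord so that the
// sketch compiles without the gowarc module.
var errNoRecord = errors.New("gowarc: no WARC record found")

// classify is a hypothetical helper showing how a read loop might
// tell a clean end of stream apart from a stream without records.
func classify(err error) string {
	switch {
	case err == nil:
		return "ok"
	case errors.Is(err, io.EOF):
		return "clean end of stream"
	case errors.Is(err, errNoRecord):
		return "only unrecognizable data"
	default:
		return "other failure"
	}
}

func main() {
	// Wrapping with %w keeps the sentinel reachable for errors.Is.
	wrapped := fmt.Errorf("reading crawl.warc: %w", errNoRecord)
	fmt.Println(classify(wrapped)) // only unrecognizable data
	fmt.Println(classify(io.EOF))  // clean end of stream
}
```

Comparing with == would miss the wrapped case, which is why errors.Is is the safer choice when errors pass through several layers.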
var (
	// WARC versions
	V1_0 = &WarcVersion{id: 1, txt: "1.0", major: 1, minor: 0} // WARC 1.0
	V1_1 = &WarcVersion{id: 2, txt: "1.1", major: 1, minor: 1} // WARC 1.1
)
Functions ¶
This section is empty.
Types ¶
type Block ¶
type Block interface {
// RawBytes returns the bytes of the Block
RawBytes() (io.Reader, error)
BlockDigest() string
Size() int64
IsCached() bool
Cache() error
io.Closer
}
Block is the interface used to represent the content of a WARC record as specified by the WARC specification: https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#warc-record-content-block
A Block might be cached or non-cached. Calling RawBytes or BlockDigest more than once will fail if the block is not cached.
NOTE: Blocks are not required to be thread safe.
type ContentLengthError ¶
type ContentLengthError struct {
// Expected is the Content-Length value declared in the WARC header.
Expected int64
// Actual is the measured size of the content.
Actual int64
}
ContentLengthError is returned when the actual content size does not match the Content-Length header value. Use errors.As to extract the expected and actual lengths programmatically.
func (*ContentLengthError) Error ¶
func (e *ContentLengthError) Error() string
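As a rough illustration of the errors.As advice, the following sketch extracts the two lengths from a wrapped error. The type contentLengthError is a local stand-in that mirrors the documented fields; it is not the library type, and missingBytes is a hypothetical helper.

```go
package main

import (
	"errors"
	"fmt"
)

// contentLengthError is a local stand-in mirroring the fields of
// gowarc.ContentLengthError; it is not the library type.
type contentLengthError struct {
	Expected int64 // value declared in the header
	Actual   int64 // measured size of the content
}

func (e *contentLengthError) Error() string {
	return fmt.Sprintf("content length mismatch: header %d, measured %d", e.Expected, e.Actual)
}

// missingBytes is a hypothetical helper: it reports the difference between
// the declared and measured length, and whether err carried a mismatch at all.
func missingBytes(err error) (int64, bool) {
	var cle *contentLengthError
	if errors.As(err, &cle) {
		return cle.Expected - cle.Actual, true
	}
	return 0, false
}

func main() {
	err := fmt.Errorf("record 3: %w", &contentLengthError{Expected: 34, Actual: 30})
	if n, ok := missingBytes(err); ok {
		fmt.Printf("short by %d bytes\n", n) // short by 4 bytes
	}
}
```

The same pattern should apply to the other structured error types on this page, assuming they are returned as pointers as their method receivers suggest.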
type DigestEncoding ¶
type DigestEncoding uint8
DigestEncoding represents the encoding used for WARC digest values.
const (
	Base16 DigestEncoding = 1
	Base32 DigestEncoding = 2
	Base64 DigestEncoding = 3
)
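To make the three encodings concrete, here is a standard-library sketch that renders one SHA-1 sum in each form. It assumes that Base16, Base32 and Base64 correspond to hex, RFC 4648 base32 and RFC 4648 base64 respectively; the exact alphabets and padding gowarc uses are not stated here. The names encodeDigest and sha1Of are hypothetical.

```go
package main

import (
	"crypto/sha1"
	"encoding/base32"
	"encoding/base64"
	"encoding/hex"
	"fmt"
)

// sha1Of returns the SHA-1 sum of data as a slice.
func sha1Of(data []byte) []byte {
	sum := sha1.Sum(data)
	return sum[:]
}

// encodeDigest renders sum in the named encoding. The encoding names are
// illustrative labels, not gowarc identifiers.
func encodeDigest(sum []byte, encoding string) string {
	switch encoding {
	case "base16":
		return hex.EncodeToString(sum)
	case "base32":
		return base32.StdEncoding.EncodeToString(sum)
	case "base64":
		return base64.StdEncoding.EncodeToString(sum)
	}
	return ""
}

func main() {
	sum := sha1Of([]byte("format: WARC File Format 1.1\r\n"))
	for _, enc := range []string{"base16", "base32", "base64"} {
		// The "sha1:" prefix follows the algorithm:value form used in WARC digest headers.
		fmt.Printf("%s -> sha1:%s\n", enc, encodeDigest(sum, enc))
	}
}
```

A 20-byte SHA-1 sum becomes 40 characters in base16, 32 in base32 and 28 in base64, which is one reason the encoding has to be known before two digest strings can be compared.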
type DigestError ¶
type DigestError struct {
// Algorithm is the digest algorithm name (e.g. "sha1", "sha256").
Algorithm string
// Expected is the digest value from the WARC header.
Expected string
// Computed is the digest value calculated from the record content.
Computed string
}
DigestError is returned when a computed digest does not match the expected value from a WARC-Block-Digest or WARC-Payload-Digest header. Use errors.As to extract the algorithm, expected, and computed values programmatically.
func (*DigestError) Error ¶
func (e *DigestError) Error() string
type ErrorPolicy ¶
type ErrorPolicy int8
ErrorPolicy describes how to handle WARC record errors.
const (
	ErrIgnore ErrorPolicy = 0 // Ignore the given error.
	ErrWarn   ErrorPolicy = 1 // Ignore given error, but submit a warning.
	ErrFail   ErrorPolicy = 2 // Fail on given error.
)
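One way to picture the three policies is as a router for validation findings: drop, collect, or abort. The sketch below is an assumption about the general idea, not gowarc's implementation; the type policy, its constants and the function apply are local stand-ins.

```go
package main

import (
	"errors"
	"fmt"
)

// policy is a local stand-in for gowarc.ErrorPolicy.
type policy int8

const (
	ignore policy = 0 // drop the finding
	warn   policy = 1 // keep the finding as a warning
	fail   policy = 2 // turn the finding into a fatal error
)

// errSample is an arbitrary finding used for demonstration.
var errSample = errors.New("wrong line ending")

// apply routes a finding according to p. It returns the (possibly extended)
// warnings slice and a fatal error if the policy demands one.
func apply(p policy, finding error, warnings []error) ([]error, error) {
	switch p {
	case ignore:
		return warnings, nil
	case warn:
		return append(warnings, finding), nil
	default:
		return warnings, finding
	}
}

func main() {
	for _, p := range []policy{ignore, warn, fail} {
		warnings, err := apply(p, errSample, nil)
		fmt.Printf("policy=%d warnings=%d fatal=%v\n", p, len(warnings), err != nil)
	}
}
```

Under this reading, a caller using the warn policy has to inspect the collected warnings, because a nil error alone does not mean the record was clean.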
type HeaderFieldError ¶
type HeaderFieldError struct {
// FieldName is the WARC header field that caused the error (e.g. "WARC-Date").
// May be empty for structural errors like missing required fields.
FieldName string
// Msg describes the violation.
Msg string
}
HeaderFieldError is used for violations of the WARC header specification. Use errors.As to extract the field name and message programmatically.
func (*HeaderFieldError) Error ¶
func (e *HeaderFieldError) Error() string
type HttpRequestBlock ¶
type HttpRequestBlock interface {
PayloadBlock
ProtocolHeaderBlock
HttpRequestLine() string
HttpHeader() *http.Header
}
type HttpResponseBlock ¶
type HttpResponseBlock interface {
PayloadBlock
ProtocolHeaderBlock
HttpStatusLine() string
HttpStatusCode() int
HttpHeader() *http.Header
}
type Marshaler ¶
type Marshaler interface {
Marshal(w io.Writer, record WarcRecord, maxSize int64) (WarcRecord, int64, error)
}
Marshaler is the interface that wraps the Marshal function.
Marshal converts a WARC record to its serialized form and returns the size of the marshalled record or any error encountered.
Depending on implementation, Marshal might return a WarcRecord which is the continuation of the record being written. See the description of record segmentation at https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#record-segmentation
func NewMarshaler ¶
func NewMarshaler() Marshaler
type PatternNameGenerator ¶
type PatternNameGenerator struct {
Directory string // Directory to store WARC files. Defaults to the empty string
Prefix string // Prefix available to be used in pattern. Defaults to the empty string
Serial int32 // Serial number available for use in pattern. It is atomically increased with every generated file name.
Pattern string // Pattern for generated file name. Defaults to: "%{prefix}s%{ts}s-%04{serial}d-%{hostOrIp}s.%{ext}s"
Extension string // Extension for file name. Defaults to: "warc"
Params map[string]any // Parameters available to be used in pattern. If a custom parameter has the same key as a predefined field (prefix, ext, etc), the predefined field will take precedence
}
PatternNameGenerator implements the WarcFileNameGenerator.
New filenames are generated based on a pattern which defaults to the recommendation in the WARC 1.1 standard (https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#annex-c-informative-warc-file-size-and-name-recommendations). The pattern syntax is like that of Go's fmt package (https://pkg.go.dev/fmt), but allows for named fields in curly braces. The available predefined names are:
- prefix - content of the Prefix field
- ext - content of the Extension field
- ts - current time as 14-digit GMT Time-stamp
- serial - atomically increased serial number for every generated file name. Initial value is 0 if Serial field is not set
- ip - primary IP address of the node
- host - host name of the node
- hostOrIp - host name of the node, falling back to IP address if host name could not be resolved
func (*PatternNameGenerator) NewWarcfileName ¶
func (g *PatternNameGenerator) NewWarcfileName() (string, string)
NewWarcfileName returns a directory (might be the empty string for current directory) and a file name
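The named-field pattern syntax can be sketched with a small expander that rewrites each "%{name}verb" into an ordinary fmt verb. This is an independent illustration of the syntax described above, not the generator's actual code; expand, namedVerb and the parameter values are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// namedVerb matches "%", optional fmt flags and width, a field name in
// curly braces, and a verb, e.g. "%{prefix}s" or "%04{serial}d".
var namedVerb = regexp.MustCompile(`%([-+# 0-9.]*)\{(\w+)\}([a-zA-Z])`)

// expand replaces every named field in pattern with the formatted value
// looked up in params.
func expand(pattern string, params map[string]any) string {
	return namedVerb.ReplaceAllStringFunc(pattern, func(m string) string {
		p := namedVerb.FindStringSubmatch(m)
		// p[1] = flags and width, p[2] = field name, p[3] = verb.
		return fmt.Sprintf("%"+p[1]+p[3], params[p[2]])
	})
}

func main() {
	// Illustrative values; the real generator derives ts, serial and
	// hostOrIp itself.
	params := map[string]any{
		"prefix":   "crawl-",
		"ts":       "20170306040353",
		"serial":   int32(7),
		"hostOrIp": "node1",
		"ext":      "warc",
	}
	fmt.Println(expand("%{prefix}s%{ts}s-%04{serial}d-%{hostOrIp}s.%{ext}s", params))
	// Prints: crawl-20170306040353-0007-node1.warc
}
```

The "%04{serial}d" field shows why the flags sit between "%" and the braces: they are passed through to fmt unchanged, so the serial number is zero-padded to four digits.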
type PayloadBlock ¶
PayloadBlock is a Block with a well-defined payload.
Ref: https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#warc-record-payload
type ProtocolHeaderBlock ¶
type ProtocolHeaderBlock interface {
// ProtocolHeaderBytes returns the raw bytes from the protocol's header.
ProtocolHeaderBytes() []byte
}
ProtocolHeaderBlock is a Block with a well-defined protocol header, e.g. an HTTP response.
type Record ¶
type Record struct {
// WarcRecord is the parsed WARC record.
WarcRecord WarcRecord
// Offset is the byte offset of the record within the file.
Offset int64
// Size is the number of bytes consumed by the record in the file,
// including headers, payload, and framing (e.g. gzip envelope).
Size int64
// Validation contains non-fatal validation findings (populated when
// an [ErrorPolicy] is set to [ErrWarn]). It is nil when clean.
Validation []error
}
Record represents a WARC record as read from a WARC file, including its position within the file and any validation findings.
func (Record) Close ¶
Close closes the underlying WarcRecord, releasing any resources.
type RecordType ¶
type RecordType uint16
RecordType represents the type of a WARC record.
const (
	// WARC record types
	Warcinfo     RecordType = 1
	Response     RecordType = 2
	Resource     RecordType = 4
	Request      RecordType = 8
	Metadata     RecordType = 16
	Revisit      RecordType = 32
	Conversion   RecordType = 64
	Continuation RecordType = 128
)
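Because the record type values are distinct powers of two, several types can be combined into one set and tested with bitwise operations. Whether gowarc relies on this internally is not stated here; the sketch below only shows what the numbering makes possible, using local stand-in constants and a hypothetical helper named in.

```go
package main

import "fmt"

// recordType is a local stand-in for gowarc.RecordType with the same values.
type recordType uint16

const (
	warcinfo     recordType = 1
	response     recordType = 2
	resource     recordType = 4
	request      recordType = 8
	metadata     recordType = 16
	revisit      recordType = 32
	conversion   recordType = 64
	continuation recordType = 128
)

// in reports whether rt is a member of set.
func in(set, rt recordType) bool {
	return set&rt != 0
}

func main() {
	// A filter that accepts only response and revisit records.
	wanted := response | revisit
	for _, rt := range []recordType{warcinfo, response, request, revisit} {
		fmt.Printf("type %d wanted: %v\n", rt, in(wanted, rt))
	}
}
```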
func (RecordType) String ¶
func (rt RecordType) String() string
String returns a string representation of the record type.
type RevisitRef ¶
type SyntaxError ¶
type SyntaxError struct {
// Msg describes the syntax violation.
Msg string
// Line is the 1-based line number where the error occurred, or 0 if unknown.
Line int
// Wrapped is the underlying cause, if any. Use [errors.As] or [errors.Is]
// to inspect it, or access it directly.
Wrapped error
}
SyntaxError is used for syntactical errors like wrong line endings. Use errors.As to extract position information and wrapped cause programmatically.
func (*SyntaxError) Error ¶
func (e *SyntaxError) Error() string
func (*SyntaxError) Unwrap ¶
func (e *SyntaxError) Unwrap() error
type Unmarshaler ¶
type Unmarshaler interface {
Unmarshal(b *bufio.Reader) (record WarcRecord, offset int64, validation []error, err error)
}
Unmarshaler is the interface implemented by types that can unmarshal a WARC record. A new instance of Unmarshaler is created by calling NewUnmarshaler. NewUnmarshaler accepts a number of options that can be used to control the unmarshalling process. See WarcRecordOption for details.
Unmarshal parses the WARC record from the given reader and returns:
- record: the parsed WarcRecord. May be nil if a fatal error occurred.
- offset: the number of bytes that were discarded before the start of the record was found.
- validation: a slice of non-fatal errors discovered during parsing (populated when an ErrorPolicy is set to ErrWarn).
- err: a fatal error, if any. A nil err does not imply the record is fully valid; check the validation slice for warnings.
If the reader contains multiple records, Unmarshal parses the first record and returns. If the reader contains no records, Unmarshal returns an io.EOF error.
Example ¶
data := bytes.NewBufferString("  WARC/1.1\r\n" +
"WARC-Date: 2017-03-06T04:03:53Z\r\n" +
"WARC-Record-ID: <urn:uuid:e9a0cecc-0221-11e7-adb1-0242ac120008>\r\n" +
"WARC-Filename: temp-20170306040353.warc.gz\r\n" +
"WARC-Type: warcinfo\r\n" +
"Content-Type: application/warc-fields\r\n" +
"Warc-Block-Digest: sha1:af4d582b4ffc017d07a947d841e392a821f754f3\r\n" +
"Content-Length: 34\r\n" +
"\r\n" +
"format: WARC File Format 1.1\r\n" +
"\r\n\r\n")
input := bufio.NewReader(data)
// Create a new unmarshaler
unmarshaler := gowarc.NewUnmarshaler(gowarc.WithSpecViolationPolicy(gowarc.ErrWarn), gowarc.WithSyntaxErrorPolicy(gowarc.ErrWarn))
wr, off, validation, err := unmarshaler.Unmarshal(input)
if err == nil {
	fmt.Printf("Offset: %d, %s\n", off, wr)
	if len(validation) > 0 {
		fmt.Println("Validation errors:")
		for i, e := range validation {
			fmt.Printf(" %d: %s\n", i+1, e)
		}
	}
}
Output:
Offset: 2, WARC record: version: WARC/1.1, type: warcinfo, id: urn:uuid:e9a0cecc-0221-11e7-adb1-0242ac120008
Validation errors:
 1: gowarc: record was found 2 bytes after expected offset
 2: block: wrong digest: expected sha1:af4d582b4ffc017d07a947d841e392a821f754f3, computed: sha1:8a936f9fd60d664cf95b1ffb40f1c4093e65bb40
 3: too few bytes in end of record marker. Expected "\r\n\r\n", was ""
func NewUnmarshaler ¶
func NewUnmarshaler(opts ...WarcRecordOption) Unmarshaler
type WarcFields ¶
type WarcFields []*nameValue
WarcFields represents the key value pairs in a WARC-record header.
It is also used for representing the record block of records with content-type "application/warc-fields".
All key-manipulating functions take case-insensitive keys and modify them to their canonical form.
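The difference between appending (Add), replacing (Set) and reading one or all values (Get, GetAll) can be sketched with a tiny stand-in type. This is not the WarcFields implementation: fields and pair are hypothetical, names are matched case-insensitively with strings.EqualFold, and canonicalization of names is left out because its rules are not described here.

```go
package main

import (
	"fmt"
	"strings"
)

// pair and fields are local stand-ins for the name/value list behind WarcFields.
type pair struct{ name, value string }
type fields []pair

// add appends a value, keeping any existing values for the same name.
func (f *fields) add(name, value string) {
	*f = append(*f, pair{name, value})
}

// get returns the first value for name, or "" if there is none.
func (f *fields) get(name string) string {
	for _, p := range *f {
		if strings.EqualFold(p.name, name) {
			return p.value
		}
	}
	return ""
}

// getAll returns every value for name in insertion order.
func (f *fields) getAll(name string) []string {
	var out []string
	for _, p := range *f {
		if strings.EqualFold(p.name, name) {
			out = append(out, p.value)
		}
	}
	return out
}

// set removes all existing values for name and stores a single new one.
func (f *fields) set(name, value string) {
	kept := (*f)[:0]
	for _, p := range *f {
		if !strings.EqualFold(p.name, name) {
			kept = append(kept, p)
		}
	}
	*f = append(kept, pair{name, value})
}

func main() {
	var f fields
	f.add("WARC-Concurrent-To", "<urn:uuid:a>")
	f.add("warc-concurrent-to", "<urn:uuid:b>") // same field, different case
	fmt.Println(f.get("WARC-CONCURRENT-TO"), len(f.getAll("WARC-Concurrent-To")))
	f.set("WARC-Concurrent-To", "<urn:uuid:c>")
	fmt.Println(f.getAll("WARC-Concurrent-To"))
}
```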
func (*WarcFields) Add ¶
func (wf *WarcFields) Add(name string, value string)
Add adds the key, value pair to the header. It appends to any existing values associated with key. The key is case-insensitive.
func (*WarcFields) AddId ¶
func (wf *WarcFields) AddId(name, value string)
AddId adds the key, value pair to the header. It appends to any existing values associated with key. The key is case-insensitive.
The value is surrounded with '<' and '>' if not already present.
func (*WarcFields) AddInt ¶
func (wf *WarcFields) AddInt(name string, value int)
AddInt adds the key, value pair to the header. It appends to any existing values associated with key. The key is case-insensitive.
func (*WarcFields) AddInt64 ¶
func (wf *WarcFields) AddInt64(name string, value int64)
AddInt64 adds the key, value pair to the header. It appends to any existing values associated with key. The key is case-insensitive.
func (*WarcFields) AddTime ¶
func (wf *WarcFields) AddTime(name string, value time.Time)
AddTime adds the key, value pair to the header. It appends to any existing values associated with key. The key is case-insensitive.
The value is converted to RFC 3339 format.
func (*WarcFields) AddTimeNano ¶
func (wf *WarcFields) AddTimeNano(name string, value time.Time)
AddTimeNano adds the key, value pair to the header. It appends to any existing values associated with key. The key is case-insensitive.
The value is formatted as RFC 3339 with up to nanosecond precision.
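The two time formats can be compared with the layouts from the standard time package. The sketch assumes that AddTime corresponds to time.RFC3339 and AddTimeNano to time.RFC3339Nano, and that values are written in UTC; neither detail is confirmed here. formatSeconds and formatNano are hypothetical helpers.

```go
package main

import (
	"fmt"
	"time"
)

// sample is an arbitrary timestamp with a fractional second.
var sample = time.Date(2017, 3, 6, 4, 3, 53, 120000000, time.UTC)

// formatSeconds renders t with second precision.
func formatSeconds(t time.Time) string {
	return t.UTC().Format(time.RFC3339)
}

// formatNano renders t with up to nanosecond precision; trailing zeros
// in the fraction are dropped by the RFC3339Nano layout.
func formatNano(t time.Time) string {
	return t.UTC().Format(time.RFC3339Nano)
}

func main() {
	fmt.Println(formatSeconds(sample)) // 2017-03-06T04:03:53Z
	fmt.Println(formatNano(sample))    // 2017-03-06T04:03:53.12Z
}
```

Note that "up to nanosecond precision" means the fraction only appears when it is non-zero, so two values written this way can differ in length.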
func (*WarcFields) CanonicalHeaderKey ¶
func (wf *WarcFields) CanonicalHeaderKey(s string) string
func (*WarcFields) Delete ¶
func (wf *WarcFields) Delete(key string)
Delete deletes the values associated with key. The key is case-insensitive.
func (*WarcFields) Get ¶
func (wf *WarcFields) Get(key string) string
Get gets the first value associated with the given key. It is case-insensitive. If the key doesn't exist or there are no values associated with the key, Get returns the empty string. To access multiple values of a key, use GetAll.
func (*WarcFields) GetAll ¶
func (wf *WarcFields) GetAll(name string) []string
GetAll returns all values associated with the given key. It is case-insensitive.
func (*WarcFields) GetId ¶
func (wf *WarcFields) GetId(name string) string
GetId is like Get, but removes the surrounding '<' and '>' from the field value.
func (*WarcFields) GetInt ¶
func (wf *WarcFields) GetInt(key string) (int, error)
GetInt is like Get, but converts the field value to int.
func (*WarcFields) GetInt64 ¶
func (wf *WarcFields) GetInt64(name string) (int64, error)
GetInt64 is like Get, but converts the field value to int64.
func (*WarcFields) GetTime ¶
func (wf *WarcFields) GetTime(name string) (time.Time, error)
GetTime is like Get, but converts the field value to time.Time. The field is expected to be in RFC 3339 format.
func (*WarcFields) Has ¶
func (wf *WarcFields) Has(name string) bool
Has reports whether the field exists. This distinguishes a missing field from a field whose value is the empty string.
func (*WarcFields) Set ¶
func (wf *WarcFields) Set(name string, value string)
Set sets the header entries associated with key to the single element value. It replaces any existing values associated with key. The key is case-insensitive.
func (*WarcFields) SetId ¶
func (wf *WarcFields) SetId(name, value string)
SetId sets the header entries associated with key to the single element value. It replaces any existing values associated with key. The key is case-insensitive.
The value is surrounded with '<' and '>' if not already present.
func (*WarcFields) SetInt ¶
func (wf *WarcFields) SetInt(name string, value int)
SetInt sets the header entries associated with key to the single element value. It replaces any existing values associated with key. The key is case-insensitive.
func (*WarcFields) SetInt64 ¶
func (wf *WarcFields) SetInt64(name string, value int64)
SetInt64 sets the header entries associated with key to the single element value. It replaces any existing values associated with key. The key is case-insensitive.
func (*WarcFields) SetTime ¶
func (wf *WarcFields) SetTime(name string, value time.Time)
SetTime sets the header entries associated with key to the single element value. It replaces any existing values associated with key. The key is case-insensitive.
The value is converted to RFC 3339 format.
func (*WarcFields) Sort ¶
func (wf *WarcFields) Sort()
Sort sorts the fields in lexicographical order.
Only field names are sorted. Order of values for a repeated field is kept as is.
func (*WarcFields) String ¶
func (wf *WarcFields) String() string
type WarcFieldsBlock ¶
type WarcFieldsBlock interface {
Block
WarcFields() *WarcFields
}
type WarcFileNameGenerator ¶
type WarcFileNameGenerator interface {
// NewWarcfileName returns a directory (might be the empty string for current directory) and a file name
NewWarcfileName() (string, string)
}
WarcFileNameGenerator is the interface that wraps the NewWarcfileName function.
type WarcFileReader ¶
type WarcFileReader struct {
// contains filtered or unexported fields
}
WarcFileReader is used to read WARC files. Use NewWarcFileReader to create a new instance.
func NewWarcFileReader ¶
func NewWarcFileReader(filename string, offset int64, opts ...WarcRecordOption) (*WarcFileReader, error)
NewWarcFileReader creates a new WarcFileReader from the supplied filename. If offset is > 0, the reader will start reading from that offset. The WarcFileReader can be configured with options. See WarcRecordOption.
Example ¶
reader, err := gowarc.NewWarcFileReader("test.warc.gz", 0, gowarc.WithStrictValidation())
if err != nil {
	fmt.Println("Error creating warc reader:", err)
	return
}
// Release the underlying file when done.
defer func() {
	_ = reader.Close()
}()
for {
	rec, err := reader.Next()
	if err == io.EOF {
		break
	}
	if err != nil {
		fmt.Println("Error reading record:", err)
		return
	}
	fmt.Println("Record type:", rec.WarcRecord.Type().String())
	fmt.Println("Record version:", rec.WarcRecord.Version())
	// Do more with record as per needs
	// Release resources held by the record before reading the next one.
	rec.Close()
}
func NewWarcFileReaderFromStream ¶
func NewWarcFileReaderFromStream(r io.Reader, offset int64, opts ...WarcRecordOption) (*WarcFileReader, error)
NewWarcFileReaderFromStream creates a new WarcFileReader from the supplied io.Reader. The WarcFileReader can be configured with options. See WarcRecordOption.
It is the responsibility of the caller to close the io.Reader.
func (*WarcFileReader) Close ¶
func (wf *WarcFileReader) Close() error
Close closes the WarcFileReader.
func (*WarcFileReader) Next ¶
func (wf *WarcFileReader) Next() (Record, error)
Next reads the next Record from the WarcFileReader.
The returned Record contains the parsed WarcRecord, its byte offset and size within the file, and any non-fatal validation findings.
The returned values depend on the ErrorPolicy options set on the WarcFileReader:
ErrIgnore: errors are suppressed. A Record is returned without any validation. An error is only returned if the file is so badly formatted that nothing meaningful can be parsed.
ErrWarn: a Record is returned. Non-fatal validation findings are collected in the [Record.Validation] slice, which should be inspected by the caller.
ErrFail: the first validation failure is returned as err, and [Record.WarcRecord] may be nil.
Mixed Policies: different ErrorPolicy values may be set per error category with WithSyntaxErrorPolicy, WithSpecViolationPolicy and WithUnknownRecordTypePolicy. The return values of Next are a mix of the above based on the configured policies.
When at end of file, [Record.WarcRecord] is nil and err is io.EOF.
func (*WarcFileReader) Records ¶
func (wf *WarcFileReader) Records() iter.Seq2[Record, error]
Records returns an iterator over all records in the WARC file.
Each iteration yields a Record and an error. The iterator stops automatically at EOF. Fatal errors are yielded and the iterator stops.
Usage:
for rec, err := range reader.Records() {
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(rec.WarcRecord.Type())
	rec.Close()
}
type WarcFileWriter ¶
type WarcFileWriter struct {
// contains filtered or unexported fields
}
WarcFileWriter writes WARC records using a pool of independent file writers. Each worker owns one singleWarcFileWriter and thus one "current file" at a time.
Close drains queued work and stops workers. Writes after Close return nil. Rotate is ordered with respect to queued writes: each worker closes its current file only after it has processed all requests that were queued before Rotate.
func NewWarcFileWriter ¶
func NewWarcFileWriter(opts ...WarcFileWriterOption) *WarcFileWriter
Example ¶
nameGenerator := &gowarc.PatternNameGenerator{Directory: "directory-name"}
w := gowarc.NewWarcFileWriter(gowarc.WithFileNameGenerator(nameGenerator))
defer func() {
	_ = w.Close()
}()
builder := gowarc.NewRecordBuilder(gowarc.Response, gowarc.WithStrictValidation())
_, err := builder.WriteString("HTTP/1.1 200 OK\r\nDate: Tue, 19 Sep 2016 17:18:40 GMT\r\nContent-Length: 19 ....")
if err != nil {
	panic(err)
}
builder.AddWarcHeader(gowarc.WarcRecordID, "<urn:uuid:e9a0cecc-0221-11e7-adb1-0242ac120008>")
builder.AddWarcHeader(gowarc.WarcDate, "2006-01-02T15:04:05Z")
builder.AddWarcHeader(gowarc.ContentType, "application/http;msgtype=response")
if wr, _, err := builder.Build(); err == nil {
	w.Write(wr)
}
func (*WarcFileWriter) Close ¶
func (w *WarcFileWriter) Close() error
Close drains queued work and stops workers.
func (*WarcFileWriter) Rotate ¶
func (w *WarcFileWriter) Rotate() error
Rotate closes the current file of each worker, ordered after all previously queued requests.
func (*WarcFileWriter) String ¶
func (w *WarcFileWriter) String() string
func (*WarcFileWriter) Write ¶
func (w *WarcFileWriter) Write(records ...WarcRecord) []WriteResponse
Write marshals one or more WarcRecords to file. If addConcurrentHeader is enabled, records in the same call cross-reference each other.
Returns nil if writer is closed.
type WarcFileWriterOption ¶
type WarcFileWriterOption func(*warcFileWriterOptions)
WarcFileWriterOption configures how to write WARC files.
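The option type follows Go's functional options pattern: each With... function returns a closure that mutates a private options struct, and the constructor applies the closures over a set of defaults. The sketch below reproduces the pattern with local stand-in names (writerOptions, writerOption, newOptions); the defaults are taken from the option descriptions on this page, while the struct layout itself is an assumption.

```go
package main

import "fmt"

// writerOptions is a stand-in for the unexported options struct.
type writerOptions struct {
	compress       bool
	gzipLevel      int
	maxFileSize    int64
	openFileSuffix string
}

// writerOption is a stand-in for WarcFileWriterOption.
type writerOption func(*writerOptions)

// withCompression mirrors the shape of WithCompression.
func withCompression(compress bool) writerOption {
	return func(o *writerOptions) { o.compress = compress }
}

// withMaxFileSize mirrors the shape of WithMaxFileSize.
func withMaxFileSize(size int64) writerOption {
	return func(o *writerOptions) { o.maxFileSize = size }
}

// newOptions starts from the documented defaults and applies opts in order,
// so later options override earlier ones.
func newOptions(opts ...writerOption) writerOptions {
	o := writerOptions{
		compress:       true,    // documented default: true
		gzipLevel:      5,       // documented default: 5
		maxFileSize:    1 << 30, // documented default: 1 GiB
		openFileSuffix: ".open", // documented default: ".open"
	}
	for _, opt := range opts {
		opt(&o)
	}
	return o
}

func main() {
	o := newOptions(withCompression(false), withMaxFileSize(100<<20))
	fmt.Printf("%+v\n", o)
}
```

A benefit of this design is that new options can be added without changing the constructor's signature, and callers only mention the settings they want to change.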
func WithAddWarcConcurrentToHeader ¶
func WithAddWarcConcurrentToHeader(addConcurrentHeader bool) WarcFileWriterOption
WithAddWarcConcurrentToHeader configures if records written in the same call to Write should have WARC-Concurrent-To headers added for cross-reference.
defaults to false
func WithAfterFileCreationHook ¶
func WithAfterFileCreationHook(f func(fileName string, size int64, warcInfoId string) error) WarcFileWriterOption
WithAfterFileCreationHook sets a function to be called after a new file is created.
The function receives the file name of the new file, the size of the file and the WARC-Warcinfo-ID.
func WithBeforeFileCreationHook ¶
func WithBeforeFileCreationHook(f func(fileName string) error) WarcFileWriterOption
WithBeforeFileCreationHook sets a function to be called before a new file is created.
The function receives the file name of the new file.
func WithCompressedFileSuffix ¶
func WithCompressedFileSuffix(suffix string) WarcFileWriterOption
WithCompressedFileSuffix sets a suffix to be added after the name generated by the WarcFileNameGenerator if compression is on.
defaults to ".gz"
func WithCompression ¶
func WithCompression(compress bool) WarcFileWriterOption
WithCompression sets if writer should write gzip compressed WARC files.
defaults to true
func WithCompressionLevel ¶
func WithCompressionLevel(gzipLevel int) WarcFileWriterOption
WithCompressionLevel sets the gzip level (1-9) to use for compression.
defaults to 5
func WithExpectedCompressionRatio ¶
func WithExpectedCompressionRatio(ratio float64) WarcFileWriterOption
WithExpectedCompressionRatio sets the expected reduction in size when using compression.
This value is used to decide if a record will fit into a Warcfile's MaxFileSize when using compression since it's not possible to know this before the record is written. If the value is far from the actual size reduction, an under- or overfilled file might be the result.
defaults to .5 (half the uncompressed size)
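A rough sketch of the kind of estimate this ratio enables is shown below. It is not the library's implementation. It assumes the ratio is the fraction of the uncompressed size that remains after compression (at the default of 0.5 this coincides with the reduction) and that a record fits when the current file size plus the estimate does not exceed the maximum file size. The function fits and its parameters are hypothetical.

```go
package main

import "fmt"

// fits reports whether a record of uncompressedSize bytes is expected to fit
// into a file that already holds currentSize bytes, given a size limit and an
// expected compression ratio. This is an illustrative assumption about how
// the ratio could be applied, not gowarc's actual logic.
func fits(currentSize, uncompressedSize, maxFileSize int64, ratio float64) bool {
	estimated := int64(float64(uncompressedSize) * ratio)
	return currentSize+estimated <= maxFileSize
}

func main() {
	const maxFileSize = 1 << 30 // 1 GiB, the documented default
	current := int64(900 << 20) // 900 MiB already written
	record := int64(200 << 20)  // 200 MiB uncompressed record

	// 900 MiB + 100 MiB estimated = 1000 MiB, which is below 1024 MiB.
	fmt.Println(fits(current, record, maxFileSize, 0.5)) // true
	// 900 MiB + 200 MiB estimated = 1100 MiB, which exceeds 1024 MiB.
	fmt.Println(fits(current, record, maxFileSize, 1.0)) // false
}
```

If real data compresses much better or worse than the configured ratio, an estimate like this is off in the corresponding direction, which is how under- or overfilled files arise.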
func WithFileNameGenerator ¶
func WithFileNameGenerator(generator WarcFileNameGenerator) WarcFileWriterOption
WithFileNameGenerator sets the WarcFileNameGenerator to use for generating new Warc file names.
Default is to use a PatternNameGenerator with the default pattern.
func WithFlush ¶
func WithFlush(flush bool) WarcFileWriterOption
WithFlush sets if writer should commit each record to stable storage.
defaults to false
func WithMarshaler ¶
func WithMarshaler(marshaler Marshaler) WarcFileWriterOption
WithMarshaler sets the Warc record marshaler to use.
defaults to defaultMarshaler
func WithMaxConcurrentWriters ¶
func WithMaxConcurrentWriters(count int) WarcFileWriterOption
WithMaxConcurrentWriters sets the maximum number of Warc files that can be written simultaneously.
defaults to one
func WithMaxFileSize ¶
func WithMaxFileSize(size int64) WarcFileWriterOption
WithMaxFileSize sets the max size of the Warc file before creating a new one.
defaults to 1 GiB
func WithOpenFileSuffix ¶
func WithOpenFileSuffix(suffix string) WarcFileWriterOption
WithOpenFileSuffix sets a suffix to be added to the file name while the file is open for writing.
The suffix is automatically removed when the file is closed.
defaults to ".open"
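One way to picture the open-file suffix is a rename on close. The sketch below uses only the standard library and assumes the suffix is removed by renaming the file; the source only states that the suffix is removed when the file is closed, so the mechanism is an assumption. The helper finalName and the file name example.warc.gz are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// finalName strips the open-file suffix from name, if present.
func finalName(name, openSuffix string) string {
	return strings.TrimSuffix(name, openSuffix)
}

func main() {
	dir, err := os.MkdirTemp("", "warc-sketch")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	// While being written, the file carries the ".open" suffix.
	open := filepath.Join(dir, "example.warc.gz.open")
	if err := os.WriteFile(open, []byte("placeholder"), 0o644); err != nil {
		panic(err)
	}

	// On close, the suffix is dropped (assumed here to be a rename).
	closed := finalName(open, ".open")
	if err := os.Rename(open, closed); err != nil {
		panic(err)
	}
	fmt.Println(filepath.Base(closed)) // example.warc.gz
}
```

A suffix like this lets downstream tools skip files that are still being written.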
func WithRecordOptions ¶
func WithRecordOptions(opts ...WarcRecordOption) WarcFileWriterOption
WithRecordOptions sets the options to use for creating WarcInfo records.
See WithWarcInfoFunc
func WithSegmentation ¶
func WithSegmentation() WarcFileWriterOption
WithSegmentation sets if writer should use segmentation for large WARC records.
defaults to false
func WithWarcInfoFunc ¶
func WithWarcInfoFunc(f func(recordBuilder WarcRecordBuilder) error) WarcFileWriterOption
WithWarcInfoFunc sets a warcinfo-record generator function to be called for every new WARC-file created.
The function receives a WarcRecordBuilder which is prepopulated with WARC-Record-ID, WARC-Type, WARC-Date and Content-Type. After the submitted function returns, Content-Length and WARC-Block-Digest fields are calculated.
When this option is set, records written to the warcfile will have the WARC-Warcinfo-ID automatically set to point to the generated warcinfo record.
Use WithRecordOptions to modify the options used to create the WarcInfo record.
defaults to nil (no warcinfo record is generated)
type WarcRecord ¶
type WarcRecord interface {
// Version returns the WARC version of the record.
Version() *WarcVersion
// Type returns the WARC record type.
Type() RecordType
// WarcHeader returns the WARC header fields.
WarcHeader() *WarcFields
// Block returns the content block of the record.
Block() Block
// RecordId returns the WARC-Record-ID header field.
RecordId() string
// ContentLength returns the Content-Length header field.
ContentLength() (int64, error)
// Date returns the WARC-Date header field.
Date() (time.Time, error)
// String returns a string representation of the record.
String() string
// Closer closes the record and releases any resources associated with it.
io.Closer
// ToRevisitRecord takes a RevisitRef referencing the record we want to make a revisit of and returns a revisit record.
ToRevisitRecord(ref *RevisitRef) (WarcRecord, error)
// RevisitRef extracts a RevisitRef from the current record if it is a revisit record.
RevisitRef() (*RevisitRef, error)
// CreateRevisitRef creates a RevisitRef which references the current record.
//
// The RevisitRef might be used by another record's ToRevisitRecord to create a revisit record referencing this record.
CreateRevisitRef(profile string) (*RevisitRef, error)
// Merge merges this record with its referenced record(s)
//
// It is implemented only for revisit records, but this function will be enhanced to also support segmented records.
Merge(record ...WarcRecord) (WarcRecord, error)
// ValidateDigest validates block and payload digests if present.
//
// If option FixDigest is set, an invalid or missing digest will be corrected in the header.
// Digest validation requires the whole content block to be read. As a side effect the
// Content-Length field is also validated, and if option FixContentLength is set, a wrong
// content length will be corrected in the header.
//
// If the record is not cached, it might not be possible to read any content from this
// record after validation.
//
// The returned values depend on the [ErrorPolicy] options:
// - [ErrIgnore]: only fatal errors are returned via err.
// - [ErrWarn]: non-fatal findings are collected in validation; err is nil.
// - [ErrFail]: the first validation failure is returned via err.
ValidateDigest() (validation []error, err error)
}
WarcRecord is the interface implemented by types that can represent a WARC record. A new instance of WarcRecord is created by a WarcRecordBuilder.
type WarcRecordBuilder ¶
type WarcRecordBuilder interface {
io.Writer
io.StringWriter
io.ReaderFrom
io.Closer
AddWarcHeader(name string, value string)
AddWarcHeaderInt(name string, value int)
AddWarcHeaderInt64(name string, value int64)
AddWarcHeaderTime(name string, value time.Time)
Build() (record WarcRecord, validation []error, err error)
Size() int64
SetRecordType(recordType RecordType)
}
func NewRecordBuilder ¶
func NewRecordBuilder(recordType RecordType, opts ...WarcRecordOption) WarcRecordBuilder
NewRecordBuilder initializes a WarcRecordBuilder used for creating a new record.
WarcRecordBuilder implements io.Writer for adding the content block. recordType might be 0, but then SetRecordType or AddWarcHeader(WarcType, "myRecordType") must be called before Build is called.
When finished with adding headers and writing content, call Build on the WarcRecordBuilder to create a WarcRecord.
Example ¶
builder := gowarc.NewRecordBuilder(gowarc.Response)
// Write the content block: an HTTP response header followed by its payload.
_, err := builder.WriteString("HTTP/1.1 200 OK\nDate: Tue, 19 Sep 2016 17:18:40 GMT\nServer: Apache/2.0.54 (Ubuntu)\n" +
	"Last-Modified: Mon, 16 Jun 2013 22:28:51 GMT\nETag: \"3e45-67e-2ed02ec0\"\nAccept-Ranges: bytes\n" +
	"Content-Length: 19\nConnection: close\nContent-Type: text/plain\n\nThis is the content")
if err != nil {
	panic(err)
}
// Set the WARC header fields explicitly instead of relying on generated values.
builder.AddWarcHeader(gowarc.WarcRecordID, "<urn:uuid:e9a0cecc-0221-11e7-adb1-0242ac120008>")
builder.AddWarcHeader(gowarc.WarcDate, "2006-01-02T15:04:05Z")
builder.AddWarcHeader(gowarc.ContentLength, "257")
builder.AddWarcHeader(gowarc.ContentType, "application/http;msgtype=response")
builder.AddWarcHeader(gowarc.WarcBlockDigest, "sha1:B285747AD7CC57AA74BCE2E30B453C8D1CB71BA4")
// Build returns the record, any validation findings, and an error.
if wr, _, err := builder.Build(); err == nil {
	fmt.Println(wr)
}
Output: WARC record: version: WARC/1.1, type: response, id: urn:uuid:e9a0cecc-0221-11e7-adb1-0242ac120008
type WarcRecordOption ¶
type WarcRecordOption func(*warcRecordOptions)
WarcRecordOption configures validation, marshaling and unmarshaling of WARC records.
func WithAddMissingContentLength ¶
func WithAddMissingContentLength(addMissingContentLength bool) WarcRecordOption
WithAddMissingContentLength sets if missing Content-Length header should be calculated.
When creating records with NewRecordBuilder, missing Content-Length is always set. This option primarily affects parsing/unmarshalling behavior.
defaults to false
func WithAddMissingDigest ¶
func WithAddMissingDigest(addMissingDigest bool) WarcRecordOption
WithAddMissingDigest sets if missing Block digest and, where applicable, Payload digest header fields should be calculated.
Only digest fields are controlled by this option. Record ID and Content-Length are always set for records created with NewRecordBuilder when missing.
defaults to false
func WithAddMissingRecordId ¶
func WithAddMissingRecordId(addMissingRecordId bool) WarcRecordOption
WithAddMissingRecordId sets if missing WARC-Record-ID header should be generated.
When creating records with NewRecordBuilder, missing WARC-Record-ID is always generated. This option primarily affects parsing/unmarshalling behavior.
defaults to false
func WithBlockErrorPolicy ¶
func WithBlockErrorPolicy(policy ErrorPolicy) WarcRecordOption
WithBlockErrorPolicy sets the policy for handling errors in block parsing.
For most records this is the content fetched from the original source and errors here should be ignored.
defaults to ErrIgnore
func WithBufferMaxMemBytes ¶
func WithBufferMaxMemBytes(size int64) WarcRecordOption
WithBufferMaxMemBytes sets the maximum amount of memory a buffer is allowed to use before overflowing to disk.
defaults to 1 MiB
func WithBufferTmpDir ¶
func WithBufferTmpDir(dir string) WarcRecordOption
WithBufferTmpDir sets the directory to use for temporary files.
If not set or dir is the empty string then the default directory for temporary files is used (see os.TempDir).
func WithDefaultDigestAlgorithm ¶
func WithDefaultDigestAlgorithm(defaultDigestAlgorithm string) WarcRecordOption
WithDefaultDigestAlgorithm sets which algorithm to use for digest generation.
Valid values: 'md5', 'sha1', 'sha256' and 'sha512'.
defaults to sha1
func WithDefaultDigestEncoding ¶
func WithDefaultDigestEncoding(defaultDigestEncoding DigestEncoding) WarcRecordOption
WithDefaultDigestEncoding sets which encoding to use for digest generation.
Valid values: Base16, Base32 and Base64.
defaults to Base32
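To make the difference between the encodings concrete, the sketch below computes a SHA-1 digest with the standard library and renders it in Base16, Base32 and Base64, each prefixed with the algorithm name as in the WARC-Block-Digest value of the NewRecordBuilder example. The function digests is hypothetical, and the uppercase hex and padded Base64 forms are assumptions; the library's exact output format may differ.

```go
package main

import (
	"crypto/sha1"
	"encoding/base32"
	"encoding/base64"
	"encoding/hex"
	"fmt"
	"strings"
)

// digests returns the SHA-1 digest of content in three textual encodings,
// each prefixed with "sha1:". The formatting details are assumptions.
func digests(content []byte) (b16, b32, b64 string) {
	sum := sha1.Sum(content)
	b16 = "sha1:" + strings.ToUpper(hex.EncodeToString(sum[:]))
	b32 = "sha1:" + base32.StdEncoding.EncodeToString(sum[:])
	b64 = "sha1:" + base64.StdEncoding.EncodeToString(sum[:])
	return b16, b32, b64
}

func main() {
	b16, b32, b64 := digests([]byte("This is the content"))
	fmt.Println(b16) // 40 hex characters after the prefix
	fmt.Println(b32) // 32 characters after the prefix
	fmt.Println(b64) // 28 characters after the prefix
}
```

A 20-byte SHA-1 digest encodes to 40, 32 and 28 characters respectively, so Base32 is more compact than Base16 while staying case-insensitive.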
func WithFixContentLength ¶
func WithFixContentLength(fixContentLength bool) WarcRecordOption
WithFixContentLength sets if a Content-Length header whose value does not match the actual content length should be set to the real value.
This will not have any impact if SpecViolationPolicy is ErrIgnore.
defaults to false
func WithFixDigest ¶
func WithFixDigest(fixDigest bool) WarcRecordOption
WithFixDigest sets if a BlockDigest header or a PayloadDigest header whose value does not match the actual content should be recalculated.
This will not have any impact if SpecViolationPolicy is ErrIgnore.
defaults to false
func WithFixSyntaxErrors ¶
func WithFixSyntaxErrors(fixSyntaxErrors bool) WarcRecordOption
WithFixSyntaxErrors sets if an attempt to fix syntax errors should be done when those are detected.
This will not have any impact if SyntaxErrorPolicy is ErrIgnore.
defaults to false
func WithFixWarcFieldsBlockErrors ¶
func WithFixWarcFieldsBlockErrors(fixWarcFieldsBlockErrors bool) WarcRecordOption
WithFixWarcFieldsBlockErrors sets if an attempt to fix syntax errors in warcfields block should be done when those are detected.
A warcfields block is typically generated by a web crawler. An error in this context suggests a potential bug in the crawler's WARC writer.
defaults to false
func WithNoValidation ¶
func WithNoValidation() WarcRecordOption
WithNoValidation sets the parser to do as little validation as possible.
This option is for parsing as fast as possible and being as lenient as possible. Settings implied by this option are:
SyntaxErrorPolicy = ErrIgnore
SpecViolationPolicy = ErrIgnore
UnknownRecordPolicy = ErrIgnore
SkipParseBlock = true
func WithRecordIdFunc ¶
func WithRecordIdFunc(recordIdFunc func() (string, error)) WarcRecordOption
WithRecordIdFunc sets a function for generating WARC-Record-ID if AddMissingRecordId is true.
Expected output is a valid URI without the surrounding '<' and '>' as described in the WARC spec (https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#warc-record-id-mandatory)
defaults to generating uuid
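A function with the signature expected by WithRecordIdFunc could look like the sketch below. It builds a random (version 4) UUID URN with the standard library and returns it without the surrounding '<' and '>'. The name newRecordID is hypothetical, and this is not the library's default generator.

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// newRecordID returns a random UUID URN such as "urn:uuid:xxxxxxxx-xxxx-4xxx-xxxx-xxxxxxxxxxxx".
// It has the shape func() (string, error) expected by WithRecordIdFunc.
func newRecordID() (string, error) {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	b[6] = (b[6] & 0x0f) | 0x40 // set version 4
	b[8] = (b[8] & 0x3f) | 0x80 // set the RFC 4122 variant
	return fmt.Sprintf("urn:uuid:%x-%x-%x-%x-%x", b[0:4], b[4:6], b[6:8], b[8:10], b[10:16]), nil
}

func main() {
	id, err := newRecordID()
	if err != nil {
		panic(err)
	}
	fmt.Println(id)
}
```

Such a function would be passed as WithRecordIdFunc(newRecordID); the angle brackets are not part of the returned value.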
func WithSkipParseBlock ¶
func WithSkipParseBlock() WarcRecordOption
WithSkipParseBlock sets parser to skip detecting known block types.
This implies that no payload digest can be computed.
func WithSpecViolationPolicy ¶
func WithSpecViolationPolicy(policy ErrorPolicy) WarcRecordOption
WithSpecViolationPolicy sets the policy for handling violations of the WARC specification in WARC records.
defaults to ErrWarn
func WithStrictValidation ¶
func WithStrictValidation() WarcRecordOption
WithStrictValidation sets the parser to fail on first error or violation of WARC specification.
Settings implied by this option are:
SyntaxErrorPolicy = ErrFail
SpecViolationPolicy = ErrFail
UnknownRecordPolicy = ErrFail
SkipParseBlock = false
func WithSyntaxErrorPolicy ¶
func WithSyntaxErrorPolicy(policy ErrorPolicy) WarcRecordOption
WithSyntaxErrorPolicy sets the policy for handling syntax errors in WARC records.
defaults to ErrWarn
func WithUnknownRecordTypePolicy ¶
func WithUnknownRecordTypePolicy(policy ErrorPolicy) WarcRecordOption
WithUnknownRecordTypePolicy sets the policy for handling unknown record types.
defaults to ErrWarn
func WithUrlParserOptions ¶
func WithUrlParserOptions(opts ...url.ParserOption) WarcRecordOption
func WithVersion ¶
func WithVersion(version *WarcVersion) WarcRecordOption
WithVersion sets the WARC version to use for new records.
defaults to WARC/1.1
type WarcVersion ¶
type WarcVersion struct {
// contains filtered or unexported fields
}
WarcVersion represents a WARC specification version.
For record creation, only WARC 1.0 and 1.1 are supported, represented by the constants V1_0 and V1_1. During parsing of a record, the WarcVersion will take on the version value found in the record itself.
func (*WarcVersion) Major ¶
func (v *WarcVersion) Major() uint8
func (*WarcVersion) Minor ¶
func (v *WarcVersion) Minor() uint8
func (*WarcVersion) String ¶
func (v *WarcVersion) String() string
String returns a string representation of the WARC version in the format used by WARC files i.e. 'WARC/1.0' or 'WARC/1.1'.
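The version format can be illustrated with a small round trip between a version line and its major and minor parts. The type version and the function parseVersion below are hypothetical and independent of gowarc's WarcVersion; they only show the 'WARC/major.minor' layout described above.

```go
package main

import "fmt"

// version is a hypothetical stand-in holding a major and minor number.
type version struct {
	major, minor uint8
}

// String renders the version in the form used by WARC files, e.g. "WARC/1.1".
func (v version) String() string {
	return fmt.Sprintf("WARC/%d.%d", v.major, v.minor)
}

// parseVersion reads a version line such as "WARC/1.0".
func parseVersion(s string) (version, error) {
	var v version
	if _, err := fmt.Sscanf(s, "WARC/%d.%d", &v.major, &v.minor); err != nil {
		return version{}, err
	}
	return v, nil
}

func main() {
	v, err := parseVersion("WARC/1.1")
	if err != nil {
		panic(err)
	}
	fmt.Println(v, v.major, v.minor) // WARC/1.1 1 1
}
```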