gowarc

Version: v3.0.0
Published: Feb 26, 2026 License: Apache-2.0 Imports: 38 Imported by: 0

Documentation

Overview

Package gowarc provides a framework for handling WARC files, enabling their parsing, creation, and validation.

WARC Overview

The WARC format offers a standard way to structure, manage and store billions of resources collected from the web and elsewhere. It is used to build applications for harvesting, managing, accessing, mining and exchanging content.

For more details, visit the WARC specification: https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/

WARC record creation

The WarcRecordBuilder, initialized via NewRecordBuilder, is the primary tool for creating WARC records. By default, it generates a record ID and calculates the 'Content-Length' and 'WARC-Block-Digest' fields.

Use WarcFileWriter, initialized with NewWarcFileWriter, to write WARC files.

WARC record parsing

To parse single WARC records, use the Unmarshaler initialized with NewUnmarshaler.

To read entire WARC files, employ the WarcFileReader initialized through NewWarcFileReader.

Validation and repair

The gowarc package supports validation during both the creation and parsing of WARC records. Control over the scope of validation and the handling of validation errors can be achieved by setting the appropriate options in the WarcRecordBuilder, Unmarshaler, or WarcFileReader.

Constants

const (
	// WARC header field name constants
	ContentLength             = "Content-Length"
	ContentType               = "Content-Type"
	WarcBlockDigest           = "WARC-Block-Digest"
	WarcConcurrentTo          = "WARC-Concurrent-To"
	WarcDate                  = "WARC-Date"
	WarcFilename              = "WARC-Filename"
	WarcIPAddress             = "WARC-IP-Address"
	WarcIdentifiedPayloadType = "WARC-Identified-Payload-Type"
	WarcPayloadDigest         = "WARC-Payload-Digest"
	WarcProfile               = "WARC-Profile"
	WarcRecordID              = "WARC-Record-ID"
	WarcRefersTo              = "WARC-Refers-To"
	WarcRefersToDate          = "WARC-Refers-To-Date"
	WarcRefersToTargetURI     = "WARC-Refers-To-Target-URI"
	WarcSegmentNumber         = "WARC-Segment-Number"
	WarcSegmentOriginID       = "WARC-Segment-Origin-ID"
	WarcSegmentTotalLength    = "WARC-Segment-Total-Length"
	WarcTargetURI             = "WARC-Target-URI"
	WarcTruncated             = "WARC-Truncated"
	WarcType                  = "WARC-Type"
	WarcWarcinfoID            = "WARC-Warcinfo-ID"
	WarcPageID                = "WARC-Page-ID"       // Browsertrix extension field
	WarcResourceType          = "WARC-Resource-Type" // Browsertrix extension field
	WarcJSONMetadata          = "WARC-JSON-Metadata" // Browsertrix extension field
)
const (
	// Well known content types
	ApplicationWarcFields = "application/warc-fields"
	ApplicationHttp       = "application/http"
)
const (
	// Well known revisit profiles
	ProfileIdenticalPayloadDigestV1_1 = "http://netpreserve.org/warc/1.1/revisit/identical-payload-digest"
	ProfileServerNotModifiedV1_1      = "http://netpreserve.org/warc/1.1/revisit/server-not-modified"
	ProfileIdenticalPayloadDigestV1_0 = "http://netpreserve.org/warc/1.0/revisit/identical-payload-digest"
	ProfileServerNotModifiedV1_0      = "http://netpreserve.org/warc/1.0/revisit/server-not-modified"
)

Variables

var (
	// ErrNotRevisitRecord is returned when a revisit-only operation is attempted on a non-revisit record.
	ErrNotRevisitRecord = errors.New("gowarc: not a revisit record")

	// ErrIsRevisitRecord is returned when attempting to create a revisit reference from a revisit record.
	ErrIsRevisitRecord = errors.New("gowarc: cannot reference a revisit record")

	// ErrUnknownRevisitProfile is returned when a revisit record references an unrecognized profile URI.
	ErrUnknownRevisitProfile = errors.New("gowarc: unknown revisit profile")

	// ErrMissingPayloadDigest is returned when the identical-payload-digest profile is used but no payload digest is available.
	ErrMissingPayloadDigest = errors.New("gowarc: payload digest required for identical-payload-digest profile")

	// ErrMergeRequiresOneRecord is returned when Merge is called with zero or more than one referenced record.
	ErrMergeRequiresOneRecord = errors.New("gowarc: revisit merge requires exactly one referenced record")

	// ErrMergeNotSupported is returned when merging is attempted on a record type that does not support it.
	ErrMergeNotSupported = errors.New("gowarc: merging is only possible for revisit records or segmented records")

	// ErrMergeSegmentedNotImplemented is returned when merging of segmented records is attempted.
	ErrMergeSegmentedNotImplemented = errors.New("gowarc: merging of segmented records is not implemented")

	// ErrMergeWrongBlockType is returned when a revisit record's block type is incompatible with merging
	// (typically because the record was parsed with SkipParseBlock).
	ErrMergeWrongBlockType = errors.New("gowarc: revisit block type incompatible with merge; record must be parsed with SkipParseBlock=false")

	// ErrMergeUnsupportedBlock is returned when merging a revisit with a non-HTTP block type.
	ErrMergeUnsupportedBlock = errors.New("gowarc: merge only supports http request and response blocks")

	// ErrUnsupportedDigestAlgorithm is returned when an unrecognized digest algorithm is encountered.
	ErrUnsupportedDigestAlgorithm = errors.New("gowarc: unsupported digest algorithm")

	// ErrNoRecord is returned by [Unmarshaler.Unmarshal] and [WarcFileReader.Next]
	// when the reader scans past one or more bytes without finding a WARC record
	// before reaching end-of-file. This distinguishes "stream contained only
	// unrecognizable data" from a clean EOF on an empty or fully-consumed stream.
	ErrNoRecord = errors.New("gowarc: no WARC record found")
)

Sentinel errors for common conditions. These can be matched with errors.Is.

var (
	// WARC versions
	V1_0 = &WarcVersion{id: 1, txt: "1.0", major: 1, minor: 0} // WARC 1.0
	V1_1 = &WarcVersion{id: 2, txt: "1.1", major: 1, minor: 1} // WARC 1.1
)

Functions

This section is empty.

Types

type Block

type Block interface {
	// RawBytes returns the bytes of the Block
	RawBytes() (io.Reader, error)
	BlockDigest() string
	Size() int64
	IsCached() bool
	Cache() error
	io.Closer
}

Block is the interface used to represent the content of a WARC record as specified by the WARC specification: https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#warc-record-content-block

A Block might be cached or non-cached. Calling RawBytes or BlockDigest more than once will fail if the block is not cached.

NOTE: Blocks are not required to be thread safe.

type ContentLengthError

type ContentLengthError struct {
	// Expected is the Content-Length value declared in the WARC header.
	Expected int64
	// Actual is the measured size of the content.
	Actual int64
}

ContentLengthError is returned when the actual content size does not match the Content-Length header value. Use errors.As to extract the expected and actual lengths programmatically.

func (*ContentLengthError) Error

func (e *ContentLengthError) Error() string

type DigestEncoding

type DigestEncoding uint8

DigestEncoding represents the encoding used for WARC digest values.

const (
	Base16 DigestEncoding = 1
	Base32 DigestEncoding = 2
	Base64 DigestEncoding = 3
)

type DigestError

type DigestError struct {
	// Algorithm is the digest algorithm name (e.g. "sha1", "sha256").
	Algorithm string
	// Expected is the digest value from the WARC header.
	Expected string
	// Computed is the digest value calculated from the record content.
	Computed string
}

DigestError is returned when a computed digest does not match the expected value from a WARC-Block-Digest or WARC-Payload-Digest header. Use errors.As to extract the algorithm, expected, and computed values programmatically.
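To make the comparison concrete, the following sketch computes a SHA-1 digest in the "sha1:<hex>" form that appears in the Unmarshaler example output in this documentation and compares it with a header value. The helper names are hypothetical, only Base16 (hex) encoding is covered, and the way gowarc performs the check internally may differ.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
	"strings"
)

// blockDigest returns a label of the form "sha1:<hex>" for the given bytes.
func blockDigest(block []byte) string {
	sum := sha1.Sum(block)
	return "sha1:" + hex.EncodeToString(sum[:])
}

// matches compares a header value with the digest computed from block.
// Ignoring case is a choice made for this sketch; it is only meaningful
// for hex-encoded digests.
func matches(header string, block []byte) bool {
	return strings.EqualFold(header, blockDigest(block))
}

func main() {
	fmt.Println(blockDigest([]byte("abc"))) // sha1:a9993e364706816aba3e25717850c26c9cd0d89d
	fmt.Println(matches("sha1:0000", []byte("abc")))
}
```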

func (*DigestError) Error

func (e *DigestError) Error() string

type ErrorPolicy

type ErrorPolicy int8

ErrorPolicy describes how to handle WARC record errors.

const (
	ErrIgnore ErrorPolicy = 0 // Ignore the given error.
	ErrWarn   ErrorPolicy = 1 // Ignore given error, but submit a warning.
	ErrFail   ErrorPolicy = 2 // Fail on given error.
)

type HeaderFieldError

type HeaderFieldError struct {
	// FieldName is the WARC header field that caused the error (e.g. "WARC-Date").
	// May be empty for structural errors like missing required fields.
	FieldName string
	// Msg describes the violation.
	Msg string
}

HeaderFieldError is used for violations of the WARC header specification. Use errors.As to extract the field name and message programmatically.

func (*HeaderFieldError) Error

func (e *HeaderFieldError) Error() string

type HttpRequestBlock

type HttpRequestBlock interface {
	PayloadBlock
	ProtocolHeaderBlock
	HttpRequestLine() string
	HttpHeader() *http.Header
}

type HttpResponseBlock

type HttpResponseBlock interface {
	PayloadBlock
	ProtocolHeaderBlock
	HttpStatusLine() string
	HttpStatusCode() int
	HttpHeader() *http.Header
}

type Marshaler

type Marshaler interface {
	Marshal(w io.Writer, record WarcRecord, maxSize int64) (WarcRecord, int64, error)
}

Marshaler is the interface that wraps the Marshal function.

Marshal converts a WARC record to its serialized form and returns the size of the marshalled record or any error encountered.

Depending on implementation, Marshal might return a WarcRecord which is the continuation of the record being written. See the description of record segmentation at https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#record-segmentation

func NewMarshaler

func NewMarshaler() Marshaler

type PatternNameGenerator

type PatternNameGenerator struct {
	Directory string         // Directory to store WARC files. Defaults to the empty string
	Prefix    string         // Prefix available to be used in pattern. Defaults to the empty string
	Serial    int32          // Serial number available for use in pattern. It is atomically increased with every generated file name.
	Pattern   string         // Pattern for generated file name. Defaults to: "%{prefix}s%{ts}s-%04{serial}d-%{hostOrIp}s.%{ext}s"
	Extension string         // Extension for file name. Defaults to: "warc"
	Params    map[string]any // Parameters available to be used in pattern. If a custom parameter has the same key as a predefined field (prefix, ext, etc), the predefined field will take precedence
}

PatternNameGenerator implements the WarcFileNameGenerator.

New file names are generated from a pattern which defaults to the recommendation in the WARC 1.1 standard (https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#annex-c-informative-warc-file-size-and-name-recommendations). The pattern syntax follows Go's fmt package (https://pkg.go.dev/fmt), but allows named fields in curly braces. The available predefined names are:

  • prefix - content of the Prefix field
  • ext - content of the Extension field
  • ts - current time as a 14-digit GMT timestamp
  • serial - atomically increased serial number for every generated file name. Initial value is 0 if Serial field is not set
  • ip - primary IP address of the node
  • host - host name of the node
  • hostOrIp - host name of the node, falling back to IP address if host name could not be resolved
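To illustrate how a named-field pattern of this kind can map onto fmt, the sketch below rewrites each %{name}verb field into a plain fmt verb and collects the matching arguments. The expand function and its regular expression are written for this example and are not gowarc's implementation; a name missing from the map is passed to fmt as nil.

```go
package main

import (
	"fmt"
	"regexp"
)

// field matches "%", optional fmt flags and width, "{name}", and a verb.
var field = regexp.MustCompile(`%([-+# 0-9.]*)\{(\w+)\}([a-zA-Z])`)

// expand converts named fields to positional fmt verbs and formats the result.
func expand(pattern string, params map[string]any) string {
	var args []any
	format := field.ReplaceAllStringFunc(pattern, func(m string) string {
		sub := field.FindStringSubmatch(m)
		args = append(args, params[sub[2]]) // look up the value by field name
		return "%" + sub[1] + sub[3]        // e.g. "%04{serial}d" becomes "%04d"
	})
	return fmt.Sprintf(format, args...)
}

func main() {
	name := expand("%{prefix}s%{ts}s-%04{serial}d-%{hostOrIp}s.%{ext}s", map[string]any{
		"prefix":   "crawl-",
		"ts":       "20170306040353",
		"serial":   7,
		"hostOrIp": "node1",
		"ext":      "warc",
	})
	fmt.Println(name) // crawl-20170306040353-0007-node1.warc
}
```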

func (*PatternNameGenerator) NewWarcfileName

func (g *PatternNameGenerator) NewWarcfileName() (string, string)

NewWarcfileName returns a directory (which might be the empty string for the current directory) and a file name.

type PayloadBlock

type PayloadBlock interface {
	Block
	PayloadBytes() (io.Reader, error)
	PayloadDigest() string
}

PayloadBlock is a Block with a well-defined payload.

Ref: https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#warc-record-payload

type ProtocolHeaderBlock

type ProtocolHeaderBlock interface {
	// ProtocolHeaderBytes returns the raw bytes from the protocol's header.
	ProtocolHeaderBytes() []byte
}

ProtocolHeaderBlock is a Block with a well-defined protocol header, e.g. an HTTP response.

type Record

type Record struct {
	// WarcRecord is the parsed WARC record.
	WarcRecord WarcRecord
	// Offset is the byte offset of the record within the file.
	Offset int64
	// Size is the number of bytes consumed by the record in the file,
	// including headers, payload, and framing (e.g. gzip envelope).
	Size int64
	// Validation contains non-fatal validation findings (populated when
	// an [ErrorPolicy] is set to [ErrWarn]). It is nil when clean.
	Validation []error
}

Record represents a WARC record as read from a WARC file, including its position within the file and any validation findings.

func (Record) Close

func (r Record) Close() error

Close closes the underlying WarcRecord, releasing any resources.

type RecordType

type RecordType uint16

RecordType represents the type of a WARC record.

const (
	// WARC record types
	Warcinfo     RecordType = 1
	Response     RecordType = 2
	Resource     RecordType = 4
	Request      RecordType = 8
	Metadata     RecordType = 16
	Revisit      RecordType = 32
	Conversion   RecordType = 64
	Continuation RecordType = 128
)
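The constants are distinct powers of two, so callers can combine them into a bit set with bitwise OR, for example to describe a filter over several record types. The sketch below uses a local stand-in type and a hypothetical has helper; whether gowarc itself accepts combined values anywhere is not stated in this documentation.

```go
package main

import "fmt"

// recordType is a local stand-in with the same values as RecordType.
type recordType uint16

const (
	warcinfo recordType = 1
	response recordType = 2
	resource recordType = 4
	request  recordType = 8
	revisit  recordType = 32
)

// has reports whether the bit set contains record type t.
func has(set, t recordType) bool {
	return set&t != 0
}

func main() {
	wanted := response | revisit // a set of two record types
	fmt.Println(has(wanted, response), has(wanted, request))
}
```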

func (RecordType) String

func (rt RecordType) String() string

String returns a string representation of the record type.

type RevisitRef

type RevisitRef struct {
	Profile        string
	TargetRecordId string
	TargetUri      string
	TargetDate     string
}

type SyntaxError

type SyntaxError struct {
	// Msg describes the syntax violation.
	Msg string
	// Line is the 1-based line number where the error occurred, or 0 if unknown.
	Line int
	// Wrapped is the underlying cause, if any. Use [errors.As] or [errors.Is]
	// to inspect it, or access it directly.
	Wrapped error
}

SyntaxError is used for syntactical errors like wrong line endings. Use errors.As to extract position information and wrapped cause programmatically.

func (*SyntaxError) Error

func (e *SyntaxError) Error() string

func (*SyntaxError) Unwrap

func (e *SyntaxError) Unwrap() error

type Unmarshaler

type Unmarshaler interface {
	Unmarshal(b *bufio.Reader) (record WarcRecord, offset int64, validation []error, err error)
}

Unmarshaler is the interface implemented by types that can unmarshal a WARC record. A new instance of Unmarshaler is created by calling NewUnmarshaler. NewUnmarshaler accepts a number of options that can be used to control the unmarshalling process. See WarcRecordOption for details.

Unmarshal parses the WARC record from the given reader and returns:

  • record: the parsed WarcRecord. May be nil if a fatal error occurred.
  • offset: the number of bytes that were discarded before the start of the record was found.
  • validation: a slice of non-fatal errors discovered during parsing (populated when an ErrorPolicy is set to ErrWarn).
  • err: a fatal error, if any. A nil err does not imply the record is fully valid; check the validation slice for warnings.

If the reader contains multiple records, Unmarshal parses the first record and returns. If the reader contains no records, Unmarshal returns an io.EOF error.

Example
data := bytes.NewBufferString("  WARC/1.1\r\n" +
	"WARC-Date: 2017-03-06T04:03:53Z\r\n" +
	"WARC-Record-ID: <urn:uuid:e9a0cecc-0221-11e7-adb1-0242ac120008>\r\n" +
	"WARC-Filename: temp-20170306040353.warc.gz\r\n" +
	"WARC-Type: warcinfo\r\n" +
	"Content-Type: application/warc-fields\r\n" +
	"Warc-Block-Digest: sha1:af4d582b4ffc017d07a947d841e392a821f754f3\r\n" +
	"Content-Length: 34\r\n" +
	"\r\n" +
	"format: WARC File Format 1.1\r\n" +
	"\r\n\r\n")
input := bufio.NewReader(data)

// Create a new unmarshaler
unmarshaler := gowarc.NewUnmarshaler(gowarc.WithSpecViolationPolicy(gowarc.ErrWarn), gowarc.WithSyntaxErrorPolicy(gowarc.ErrWarn))
wr, off, validation, err := unmarshaler.Unmarshal(input)
if err == nil {
	fmt.Printf("Offset: %d, %s\n", off, wr)
	if len(validation) > 0 {
		fmt.Println("Validation errors:")
		for i, e := range validation {
			fmt.Printf("  %d: %s\n", i+1, e)
		}
	}
}
Output:

Offset: 2, WARC record: version: WARC/1.1, type: warcinfo, id: urn:uuid:e9a0cecc-0221-11e7-adb1-0242ac120008
Validation errors:
  1: gowarc: record was found 2 bytes after expected offset
  2: block: wrong digest: expected sha1:af4d582b4ffc017d07a947d841e392a821f754f3, computed: sha1:8a936f9fd60d664cf95b1ffb40f1c4093e65bb40
  3: too few bytes in end of record marker. Expected "\r\n\r\n", was ""

func NewUnmarshaler

func NewUnmarshaler(opts ...WarcRecordOption) Unmarshaler

type WarcFields

type WarcFields []*nameValue

WarcFields represents the key value pairs in a WARC-record header.

It is also used for representing the record block of records with content-type "application/warc-fields".

All key-manipulating functions take case-insensitive keys and modify them to their canonical form.

func (*WarcFields) Add

func (wf *WarcFields) Add(name string, value string)

Add adds the key, value pair to the header. It appends to any existing values associated with key. The key is case-insensitive.

func (*WarcFields) AddId

func (wf *WarcFields) AddId(name, value string)

AddId adds the key, value pair to the header. It appends to any existing values associated with key. The key is case-insensitive.

The value is surrounded with '<' and '>' if not already present.
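The bracket handling can be pictured with two small helpers, shown here as a sketch. The names bracket and unbracket are hypothetical, and how gowarc treats a value with only one of the two brackets is not documented, so this version wraps such values as-is.

```go
package main

import (
	"fmt"
	"strings"
)

// bracket surrounds id with '<' and '>' unless both are already present.
func bracket(id string) string {
	if strings.HasPrefix(id, "<") && strings.HasSuffix(id, ">") {
		return id
	}
	return "<" + id + ">"
}

// unbracket removes one leading '<' and one trailing '>', as GetId is
// described to do for field values.
func unbracket(id string) string {
	return strings.TrimSuffix(strings.TrimPrefix(id, "<"), ">")
}

func main() {
	fmt.Println(bracket("urn:uuid:e9a0cecc-0221-11e7-adb1-0242ac120008"))
	fmt.Println(unbracket("<urn:uuid:e9a0cecc-0221-11e7-adb1-0242ac120008>"))
}
```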

func (*WarcFields) AddInt

func (wf *WarcFields) AddInt(name string, value int)

AddInt adds the key, value pair to the header. It appends to any existing values associated with key. The key is case-insensitive.

func (*WarcFields) AddInt64

func (wf *WarcFields) AddInt64(name string, value int64)

AddInt64 adds the key, value pair to the header. It appends to any existing values associated with key. The key is case-insensitive.

func (*WarcFields) AddTime

func (wf *WarcFields) AddTime(name string, value time.Time)

AddTime adds the key, value pair to the header. It appends to any existing values associated with key. The key is case-insensitive.

The value is converted to RFC 3339 format.

func (*WarcFields) AddTimeNano

func (wf *WarcFields) AddTimeNano(name string, value time.Time)

AddTimeNano adds the key, value pair to the header. It appends to any existing values associated with key. The key is case-insensitive.

The value is formatted as RFC 3339 with up to nanosecond precision.

func (*WarcFields) CanonicalHeaderKey

func (wf *WarcFields) CanonicalHeaderKey(s string) string

func (*WarcFields) Delete

func (wf *WarcFields) Delete(key string)

Delete deletes the values associated with key. The key is case-insensitive.

func (*WarcFields) Get

func (wf *WarcFields) Get(key string) string

Get gets the first value associated with the given key. It is case-insensitive. If the key doesn't exist or there are no values associated with the key, Get returns the empty string. To access multiple values of a key, use GetAll.

func (*WarcFields) GetAll

func (wf *WarcFields) GetAll(name string) []string

GetAll returns all values associated with the given key. It is case-insensitive.

func (*WarcFields) GetId

func (wf *WarcFields) GetId(name string) string

GetId is like Get, but removes the surrounding '<' and '>' from the field value.

func (*WarcFields) GetInt

func (wf *WarcFields) GetInt(key string) (int, error)

GetInt is like Get, but converts the field value to int.

func (*WarcFields) GetInt64

func (wf *WarcFields) GetInt64(name string) (int64, error)

GetInt64 is like Get, but converts the field value to int64.

func (*WarcFields) GetTime

func (wf *WarcFields) GetTime(name string) (time.Time, error)

GetTime is like Get, but converts the field value to time.Time. The field is expected to be in RFC 3339 format.

func (*WarcFields) Has

func (wf *WarcFields) Has(name string) bool

Has reports whether the field exists. Use it to distinguish a missing field from a field whose value is the empty string.
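The relationship between Get, GetAll and Has can be sketched with a small ordered list of name/value pairs. The fields type and its lower-case methods below are local stand-ins written for illustration; they compare names case-insensitively but do not reproduce gowarc's canonicalization.

```go
package main

import (
	"fmt"
	"strings"
)

type nameValue struct{ name, value string }

// fields is a local stand-in for an ordered, multi-valued field list.
type fields []nameValue

// getAll returns every value stored under name, in insertion order.
func (f fields) getAll(name string) []string {
	var out []string
	for _, nv := range f {
		if strings.EqualFold(nv.name, name) {
			out = append(out, nv.value)
		}
	}
	return out
}

// get returns the first value for name, or "" if there is none.
func (f fields) get(name string) string {
	if all := f.getAll(name); len(all) > 0 {
		return all[0]
	}
	return ""
}

// has reports whether at least one entry exists for name.
func (f fields) has(name string) bool { return len(f.getAll(name)) > 0 }

func main() {
	f := fields{
		{"WARC-Concurrent-To", "<urn:uuid:a>"},
		{"WARC-Concurrent-To", "<urn:uuid:b>"},
		{"WARC-Filename", ""},
	}
	fmt.Println(f.get("warc-concurrent-to"), len(f.getAll("WARC-Concurrent-To")))
	// Both lookups return "", but only WARC-Filename is present.
	fmt.Println(f.has("WARC-Filename"), f.has("WARC-Date"))
}
```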

func (*WarcFields) Set

func (wf *WarcFields) Set(name string, value string)

Set sets the header entries associated with key to the single element value. It replaces any existing values associated with key. The key is case-insensitive.

func (*WarcFields) SetId

func (wf *WarcFields) SetId(name, value string)

SetId sets the header entries associated with key to the single element value. It replaces any existing values associated with key. The key is case-insensitive.

The value is surrounded with '<' and '>' if not already present.

func (*WarcFields) SetInt

func (wf *WarcFields) SetInt(name string, value int)

SetInt sets the header entries associated with key to the single element value. It replaces any existing values associated with key. The key is case-insensitive.

func (*WarcFields) SetInt64

func (wf *WarcFields) SetInt64(name string, value int64)

SetInt64 sets the header entries associated with key to the single element value. It replaces any existing values associated with key. The key is case-insensitive.

func (*WarcFields) SetTime

func (wf *WarcFields) SetTime(name string, value time.Time)

SetTime sets the header entries associated with key to the single element value. It replaces any existing values associated with key. The key is case-insensitive.

The value is converted to RFC 3339 format.

func (*WarcFields) Sort

func (wf *WarcFields) Sort()

Sort sorts the fields in lexicographical order.

Only field names are sorted. Order of values for a repeated field is kept as is.

func (*WarcFields) String

func (wf *WarcFields) String() string

func (*WarcFields) Write

func (wf *WarcFields) Write(w io.Writer) (n int64, err error)

Write writes the fields to w and returns the number of bytes written. Note that this signature differs from the Write method of io.Writer.

type WarcFieldsBlock

type WarcFieldsBlock interface {
	Block
	WarcFields() *WarcFields
}

type WarcFileNameGenerator

type WarcFileNameGenerator interface {
	// NewWarcfileName returns a directory (might be the empty string for current directory) and a file name
	NewWarcfileName() (string, string)
}

WarcFileNameGenerator is the interface that wraps the NewWarcfileName function.

type WarcFileReader

type WarcFileReader struct {
	// contains filtered or unexported fields
}

WarcFileReader is used to read WARC files. Use NewWarcFileReader to create a new instance.

func NewWarcFileReader

func NewWarcFileReader(filename string, offset int64, opts ...WarcRecordOption) (*WarcFileReader, error)

NewWarcFileReader creates a new WarcFileReader from the supplied filename. If offset is > 0, the reader will start reading from that offset. The WarcFileReader can be configured with options. See WarcRecordOption.

Example
reader, err := gowarc.NewWarcFileReader("test.warc.gz", 0, gowarc.WithStrictValidation())
if err != nil {
	fmt.Println("Error creating warc reader:", err)
	return
}
// Release the underlying file when done.
defer func() {
	_ = reader.Close()
}()

for {
	rec, err := reader.Next()
	if err == io.EOF {
		break
	}
	if err != nil {
		fmt.Println("Error reading record:", err)
		return
	}
	fmt.Println("Record type:", rec.WarcRecord.Type().String())
	fmt.Println("Record version:", rec.WarcRecord.Version())
	// Do more with record as per needs

	// Close the record to release its resources before reading the next one.
	_ = rec.Close()
}

func NewWarcFileReaderFromStream

func NewWarcFileReaderFromStream(r io.Reader, offset int64, opts ...WarcRecordOption) (*WarcFileReader, error)

NewWarcFileReaderFromStream creates a new WarcFileReader from the supplied io.Reader. The WarcFileReader can be configured with options. See WarcRecordOption.

It is the responsibility of the caller to close the io.Reader.

func (*WarcFileReader) Close

func (wf *WarcFileReader) Close() error

Close closes the WarcFileReader.

func (*WarcFileReader) Next

func (wf *WarcFileReader) Next() (Record, error)

Next reads the next Record from the WarcFileReader.

The returned Record contains the parsed WarcRecord, its byte offset and size within the file, and any non-fatal validation findings.

The returned values depend on the ErrorPolicy options set on the WarcFileReader:

  • ErrIgnore: errors are suppressed. A Record is returned without any validation. An error is only returned if the file is so badly formatted that nothing meaningful can be parsed.

  • ErrWarn: a Record is returned. Non-fatal validation findings are collected in the [Record.Validation] slice, which should be inspected by the caller.

  • ErrFail: the first validation failure is returned as err, and [Record.WarcRecord] may be nil.

  • Mixed Policies: different ErrorPolicy values may be set per error category with WithSyntaxErrorPolicy, WithSpecViolationPolicy and WithUnknownRecordTypePolicy. The return values of Next are a mix of the above based on the configured policies.

When at end of file, [Record.WarcRecord] is nil and err is io.EOF.

func (*WarcFileReader) Records

func (wf *WarcFileReader) Records() iter.Seq2[Record, error]

Records returns an iterator over all records in the WARC file.

Each iteration yields a Record and an error. The iterator stops automatically at EOF. Fatal errors are yielded and the iterator stops.

Usage:

for rec, err := range reader.Records() {
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(rec.WarcRecord.Type())
    rec.Close()
}

type WarcFileWriter

type WarcFileWriter struct {
	// contains filtered or unexported fields
}

WarcFileWriter writes WARC records using a pool of independent file writers. Each worker owns one singleWarcFileWriter and thus one "current file" at a time.

Close drains queued work and stops workers. Writes after Close return nil. Rotate is ordered with respect to queued writes: each worker closes its current file only after it has processed all requests that were queued before Rotate.

func NewWarcFileWriter

func NewWarcFileWriter(opts ...WarcFileWriterOption) *WarcFileWriter
Example
nameGenerator := &gowarc.PatternNameGenerator{Directory: "directory-name"}

w := gowarc.NewWarcFileWriter(gowarc.WithFileNameGenerator(nameGenerator))
defer func() {
	_ = w.Close()
}()

builder := gowarc.NewRecordBuilder(gowarc.Response, gowarc.WithStrictValidation())
_, err := builder.WriteString("HTTP/1.1 200 OK\r\nDate: Tue, 19 Sep 2016 17:18:40 GMT\r\nContent-Length: 19 ....")
if err != nil {
	panic(err)
}
builder.AddWarcHeader(gowarc.WarcRecordID, "<urn:uuid:e9a0cecc-0221-11e7-adb1-0242ac120008>")
builder.AddWarcHeader(gowarc.WarcDate, "2006-01-02T15:04:05Z")
builder.AddWarcHeader(gowarc.ContentType, "application/http;msgtype=response")

wr, _, err := builder.Build()
if err != nil {
	panic(err)
}
w.Write(wr)

func (*WarcFileWriter) Close

func (w *WarcFileWriter) Close() error

Close drains queued work and stops workers.

func (*WarcFileWriter) Rotate

func (w *WarcFileWriter) Rotate() error

Rotate closes the current file of each worker, ordered after all previously queued requests.

func (*WarcFileWriter) String

func (w *WarcFileWriter) String() string

func (*WarcFileWriter) Write

func (w *WarcFileWriter) Write(records ...WarcRecord) []WriteResponse

Write marshals one or more WarcRecords to file. If addConcurrentHeader is enabled, records in the same call cross-reference each other.

Returns nil if writer is closed.

type WarcFileWriterOption

type WarcFileWriterOption func(*warcFileWriterOptions)

WarcFileWriterOption configures how to write WARC files.

func WithAddWarcConcurrentToHeader

func WithAddWarcConcurrentToHeader(addConcurrentHeader bool) WarcFileWriterOption

WithAddWarcConcurrentToHeader configures if records written in the same call to Write should have WARC-Concurrent-To headers added for cross-reference.

defaults to false

func WithAfterFileCreationHook

func WithAfterFileCreationHook(f func(fileName string, size int64, warcInfoId string) error) WarcFileWriterOption

WithAfterFileCreationHook sets a function to be called after a new file is created.

The function receives the file name of the new file, the size of the file and the WARC-Warcinfo-ID.

func WithBeforeFileCreationHook

func WithBeforeFileCreationHook(f func(fileName string) error) WarcFileWriterOption

WithBeforeFileCreationHook sets a function to be called before a new file is created.

The function receives the file name of the new file.

func WithCompressedFileSuffix

func WithCompressedFileSuffix(suffix string) WarcFileWriterOption

WithCompressedFileSuffix sets a suffix to be added after the name generated by the WarcFileNameGenerator if compression is on.

defaults to ".gz"

func WithCompression

func WithCompression(compress bool) WarcFileWriterOption

WithCompression sets if writer should write gzip compressed WARC files.

defaults to true

func WithCompressionLevel

func WithCompressionLevel(gzipLevel int) WarcFileWriterOption

WithCompressionLevel sets the gzip level (1-9) to use for compression.

defaults to 5

func WithExpectedCompressionRatio

func WithExpectedCompressionRatio(ratio float64) WarcFileWriterOption

WithExpectedCompressionRatio sets the expected reduction in size when using compression.

This value is used to decide if a record will fit into a Warcfile's MaxFileSize when using compression since it's not possible to know this before the record is written. If the value is far from the actual size reduction, an under- or overfilled file might be the result.

defaults to .5 (half the uncompressed size)
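As a worked example of how such an estimate might be used, the sketch below checks whether a record is expected to fit into the current file. The fits function and its exact formula are assumptions for illustration; gowarc's internal calculation may differ.

```go
package main

import "fmt"

// fits estimates whether a record of uncompressed size recordSize can be
// added to a file that currently holds currentSize bytes without exceeding
// maxFileSize. With compression, the record size is scaled by ratio.
func fits(currentSize, recordSize, maxFileSize int64, compressed bool, ratio float64) bool {
	estimate := recordSize
	if compressed {
		estimate = int64(float64(recordSize) * ratio)
	}
	return currentSize+estimate <= maxFileSize
}

func main() {
	const mib = int64(1) << 20
	const gib = int64(1) << 30
	// 900 MiB already written, 200 MiB record, 1 GiB limit.
	fmt.Println(fits(900*mib, 200*mib, gib, true, 0.5))  // true: estimated 100 MiB, total 1000 MiB
	fmt.Println(fits(900*mib, 200*mib, gib, false, 0.5)) // false: total 1100 MiB exceeds 1024 MiB
}
```

If the real ratio is closer to 0.9, the same record would compress to about 180 MiB and the file would end up larger than the limit, which is the overfilled case the option description warns about.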

func WithFileNameGenerator

func WithFileNameGenerator(generator WarcFileNameGenerator) WarcFileWriterOption

WithFileNameGenerator sets the WarcFileNameGenerator to use for generating new Warc file names.

Default is to use a PatternNameGenerator with the default pattern.

func WithFlush

func WithFlush(flush bool) WarcFileWriterOption

WithFlush sets if writer should commit each record to stable storage.

defaults to false

func WithMarshaler

func WithMarshaler(marshaler Marshaler) WarcFileWriterOption

WithMarshaler sets the Warc record marshaler to use.

defaults to defaultMarshaler

func WithMaxConcurrentWriters

func WithMaxConcurrentWriters(count int) WarcFileWriterOption

WithMaxConcurrentWriters sets the maximum number of Warc files that can be written simultaneously.

defaults to one

func WithMaxFileSize

func WithMaxFileSize(size int64) WarcFileWriterOption

WithMaxFileSize sets the max size of the Warc file before creating a new one.

defaults to 1 GiB

func WithOpenFileSuffix

func WithOpenFileSuffix(suffix string) WarcFileWriterOption

WithOpenFileSuffix sets a suffix to be added to the file name while the file is open for writing.

The suffix is automatically removed when the file is closed.

defaults to ".open"

func WithRecordOptions

func WithRecordOptions(opts ...WarcRecordOption) WarcFileWriterOption

WithRecordOptions sets the options to use for creating WarcInfo records.

See WithWarcInfoFunc

func WithSegmentation

func WithSegmentation() WarcFileWriterOption

WithSegmentation sets if writer should use segmentation for large WARC records.

defaults to false

func WithWarcInfoFunc

func WithWarcInfoFunc(f func(recordBuilder WarcRecordBuilder) error) WarcFileWriterOption

WithWarcInfoFunc sets a warcinfo-record generator function to be called for every new WARC-file created.

The function receives a WarcRecordBuilder which is prepopulated with WARC-Record-ID, WARC-Type, WARC-Date and Content-Type. After the submitted function returns, Content-Length and WARC-Block-Digest fields are calculated.

When this option is set, records written to the warcfile will have the WARC-Warcinfo-ID automatically set to point to the generated warcinfo record.

Use WithRecordOptions to modify the options used to create the WarcInfo record.

defaults to nil (no generation of warcinfo record)

type WarcRecord

type WarcRecord interface {
	// Version returns the WARC version of the record.
	Version() *WarcVersion

	// Type returns the WARC record type.
	Type() RecordType

	// WarcHeader returns the WARC header fields.
	WarcHeader() *WarcFields

	// Block returns the content block of the record.
	Block() Block

	// RecordId returns the WARC-Record-ID header field.
	RecordId() string

	// ContentLength returns the Content-Length header field.
	ContentLength() (int64, error)

	// Date returns the WARC-Date header field.
	Date() (time.Time, error)

	// String returns a string representation of the record.
	String() string

	// Closer closes the record and releases any resources associated with it.
	io.Closer

	// ToRevisitRecord takes a RevisitRef referencing the record to be revisited and returns a revisit record.
	ToRevisitRecord(ref *RevisitRef) (WarcRecord, error)

	// RevisitRef extracts a RevisitRef from the current record if it is a revisit record.
	RevisitRef() (*RevisitRef, error)

	// CreateRevisitRef creates a RevisitRef which references the current record.
	//
	// The RevisitRef might be used by another record's ToRevisitRecord to create a revisit record referencing this record.
	CreateRevisitRef(profile string) (*RevisitRef, error)

	// Merge merges this record with its referenced record(s).
	//
	// It is currently implemented only for revisit records; support for segmented records is planned.
	Merge(record ...WarcRecord) (WarcRecord, error)

	// ValidateDigest validates block and payload digests if present.
	//
	// If option FixDigest is set, an invalid or missing digest will be corrected in the header.
	// Digest validation requires the whole content block to be read. As a side effect the
	// Content-Length field is also validated, and if option FixContentLength is set, a wrong
	// content length will be corrected in the header.
	//
	// If the record is not cached, it might not be possible to read any content from this
	// record after validation.
	//
	// The returned values depend on the [ErrorPolicy] options:
	//   - [ErrIgnore]: only fatal errors are returned via err.
	//   - [ErrWarn]: non-fatal findings are collected in validation; err is nil.
	//   - [ErrFail]: the first validation failure is returned via err.
	ValidateDigest() (validation []error, err error)
}

WarcRecord is the interface implemented by types that can represent a WARC record. A new instance of WarcRecord is created by a WarcRecordBuilder.
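The way the ErrorPolicy values shape the two return values of ValidateDigest can be illustrated with a small sketch. It is not the package's implementation: the `policy` type, its constants, and the `report` and `sampleFindings` helpers are local stand-ins that only mirror the documented contract for non-fatal findings.

```go
package main

import (
	"errors"
	"fmt"
)

// policy mirrors the three documented ErrorPolicy values. The type and the
// constants are local stand-ins, not the gowarc definitions.
type policy int

const (
	ignore policy = iota
	warn
	fail
)

// report is a hypothetical helper showing how non-fatal findings could be
// routed to the two return values depending on the policy.
func report(p policy, findings []error) (validation []error, err error) {
	switch p {
	case warn:
		// Findings are collected; the call itself succeeds.
		return findings, nil
	case fail:
		// The first finding aborts the operation.
		if len(findings) > 0 {
			return nil, findings[0]
		}
	}
	// ignore (or nothing found): non-fatal findings are dropped.
	return nil, nil
}

// sampleFindings returns two made-up findings for demonstration.
func sampleFindings() []error {
	return []error{
		errors.New("block digest mismatch"),
		errors.New("content length mismatch"),
	}
}

func main() {
	for _, p := range []policy{ignore, warn, fail} {
		validation, err := report(p, sampleFindings())
		fmt.Printf("policy=%d findings=%d err=%v\n", p, len(validation), err)
	}
}
```

A caller that wants to log problems without aborting would therefore inspect the validation slice, while a caller that wants to stop on the first problem only needs to check err.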

type WarcRecordBuilder

type WarcRecordBuilder interface {
	io.Writer
	io.StringWriter
	io.ReaderFrom
	io.Closer
	AddWarcHeader(name string, value string)
	AddWarcHeaderInt(name string, value int)
	AddWarcHeaderInt64(name string, value int64)
	AddWarcHeaderTime(name string, value time.Time)
	Build() (record WarcRecord, validation []error, err error)
	Size() int64
	SetRecordType(recordType RecordType)
}

func NewRecordBuilder

func NewRecordBuilder(recordType RecordType, opts ...WarcRecordOption) WarcRecordBuilder

NewRecordBuilder initializes a WarcRecordBuilder used for creating a new record.

WarcRecordBuilder implements io.Writer for adding the content block. recordType may be 0, in which case SetRecordType or AddWarcHeader(WarcType, "myRecordType") must be called before Build is called.

When finished with adding headers and writing content, call Build on the WarcRecordBuilder to create a WarcRecord.

Example
builder := gowarc.NewRecordBuilder(gowarc.Response)
_, err := builder.WriteString("HTTP/1.1 200 OK\nDate: Tue, 19 Sep 2016 17:18:40 GMT\nServer: Apache/2.0.54 (Ubuntu)\n" +
	"Last-Modified: Mon, 16 Jun 2013 22:28:51 GMT\nETag: \"3e45-67e-2ed02ec0\"\nAccept-Ranges: bytes\n" +
	"Content-Length: 19\nConnection: close\nContent-Type: text/plain\n\nThis is the content")
if err != nil {
	panic(err)
}
builder.AddWarcHeader(gowarc.WarcRecordID, "<urn:uuid:e9a0cecc-0221-11e7-adb1-0242ac120008>")
builder.AddWarcHeader(gowarc.WarcDate, "2006-01-02T15:04:05Z")
builder.AddWarcHeader(gowarc.ContentLength, "257")
builder.AddWarcHeader(gowarc.ContentType, "application/http;msgtype=response")
builder.AddWarcHeader(gowarc.WarcBlockDigest, "sha1:B285747AD7CC57AA74BCE2E30B453C8D1CB71BA4")

if wr, _, err := builder.Build(); err == nil {
	fmt.Println(wr)
}
Output:

WARC record: version: WARC/1.1, type: response, id: urn:uuid:e9a0cecc-0221-11e7-adb1-0242ac120008

type WarcRecordOption

type WarcRecordOption func(*warcRecordOptions)

WarcRecordOption configures validation, marshaling and unmarshaling of WARC records.
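Because WarcRecordOption is a function that mutates an options struct, options compose in the usual functional-options style. The sketch below shows that pattern with local stand-ins: `recordOptions`, `recordOption`, `withDigestAlgorithm`, `withFixDigest` and `newRecordOptions` are hypothetical names, and the struct fields are assumptions about what such a struct might hold. Only the default values are taken from this documentation.

```go
package main

import "fmt"

// recordOptions is a hypothetical stand-in for gowarc's unexported
// warcRecordOptions; only the pattern is the point.
type recordOptions struct {
	digestAlgorithm   string
	bufferMaxMemBytes int64
	fixDigest         bool
}

// recordOption has the same shape as WarcRecordOption: a function that
// mutates an options struct.
type recordOption func(*recordOptions)

func withDigestAlgorithm(name string) recordOption {
	return func(o *recordOptions) { o.digestAlgorithm = name }
}

func withFixDigest(fix bool) recordOption {
	return func(o *recordOptions) { o.fixDigest = fix }
}

// newRecordOptions starts from defaults and applies the options in order,
// so a later option overrides an earlier one.
func newRecordOptions(opts ...recordOption) recordOptions {
	o := recordOptions{
		digestAlgorithm:   "sha1",  // documented default
		bufferMaxMemBytes: 1 << 20, // documented default of 1 MiB
	}
	for _, opt := range opts {
		opt(&o)
	}
	return o
}

func main() {
	o := newRecordOptions(withDigestAlgorithm("sha256"), withFixDigest(true))
	fmt.Printf("%+v\n", o)
}
```

The same variadic shape is what lets NewRecordBuilder accept any number of options after the record type.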

func WithAddMissingContentLength

func WithAddMissingContentLength(addMissingContentLength bool) WarcRecordOption

WithAddMissingContentLength sets whether a missing Content-Length header should be calculated.

When creating records with NewRecordBuilder, missing Content-Length is always set. This option primarily affects parsing/unmarshalling behavior.

defaults to false

func WithAddMissingDigest

func WithAddMissingDigest(addMissingDigest bool) WarcRecordOption

WithAddMissingDigest sets whether missing Block digest and, where applicable, Payload digest header fields should be calculated.

Only digest fields are controlled by this option. Record ID and Content-Length are always set for records created with NewRecordBuilder when missing.

defaults to false

func WithAddMissingRecordId

func WithAddMissingRecordId(addMissingRecordId bool) WarcRecordOption

WithAddMissingRecordId sets whether a missing WARC-Record-ID header should be generated.

When creating records with NewRecordBuilder, missing WARC-Record-ID is always generated. This option primarily affects parsing/unmarshalling behavior.

defaults to false

func WithBlockErrorPolicy

func WithBlockErrorPolicy(policy ErrorPolicy) WarcRecordOption

WithBlockErrorPolicy sets the policy for handling errors in block parsing.

For most records this is the content fetched from the original source and errors here should be ignored.

defaults to ErrIgnore

func WithBufferMaxMemBytes

func WithBufferMaxMemBytes(size int64) WarcRecordOption

WithBufferMaxMemBytes sets the maximum amount of memory a buffer is allowed to use before overflowing to disk.

defaults to 1 MiB

func WithBufferTmpDir

func WithBufferTmpDir(dir string) WarcRecordOption

WithBufferTmpDir sets the directory to use for temporary files.

If not set or dir is the empty string then the default directory for temporary files is used (see os.TempDir).
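How a memory limit and a temporary directory interact can be sketched with a small buffer that spills to disk. This is an illustration under assumptions, not the package's buffer: `spillBuffer` and its methods are hypothetical, and the rule of moving everything to a file once the limit would be exceeded is a guess at one reasonable behaviour.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
)

// spillBuffer is a hypothetical type, not gowarc's buffer implementation.
// It keeps data in memory until a write would exceed maxMem bytes and then
// moves everything to a temporary file in tmpDir.
type spillBuffer struct {
	maxMem int64
	tmpDir string // "" selects os.TempDir(), as os.CreateTemp does
	mem    bytes.Buffer
	file   *os.File
}

func (b *spillBuffer) Write(p []byte) (int, error) {
	if b.file == nil && int64(b.mem.Len()+len(p)) > b.maxMem {
		f, err := os.CreateTemp(b.tmpDir, "spill-*")
		if err != nil {
			return 0, err
		}
		// Move what was buffered so far to disk.
		if _, err := f.Write(b.mem.Bytes()); err != nil {
			f.Close()
			os.Remove(f.Name())
			return 0, err
		}
		b.mem.Reset()
		b.file = f
	}
	if b.file != nil {
		return b.file.Write(p)
	}
	return b.mem.Write(p)
}

// onDisk reports whether the buffer has overflowed to a temporary file.
func (b *spillBuffer) onDisk() bool { return b.file != nil }

// Close removes the temporary file, if one was created.
func (b *spillBuffer) Close() error {
	if b.file == nil {
		return nil
	}
	name := b.file.Name()
	if err := b.file.Close(); err != nil {
		return err
	}
	return os.Remove(name)
}

func main() {
	buf := &spillBuffer{maxMem: 8}
	defer buf.Close()
	for _, chunk := range []string{"12345", "67890"} {
		if _, err := buf.Write([]byte(chunk)); err != nil {
			panic(err)
		}
		fmt.Println("on disk:", buf.onDisk())
	}
}
```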

func WithDefaultDigestAlgorithm

func WithDefaultDigestAlgorithm(defaultDigestAlgorithm string) WarcRecordOption

WithDefaultDigestAlgorithm sets which algorithm to use for digest generation.

Valid values: 'md5', 'sha1', 'sha256' and 'sha512'.

defaults to sha1

func WithDefaultDigestEncoding

func WithDefaultDigestEncoding(defaultDigestEncoding DigestEncoding) WarcRecordOption

WithDefaultDigestEncoding sets which encoding to use for digest generation.

Valid values: Base16, Base32 and Base64.

defaults to Base32
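The three encodings produce visibly different digest strings for the same content. The sketch below uses only the Go standard library. `digestField` is a hypothetical helper, and the "sha1:" prefix layout and the letter case of the Base16 output are assumptions made for illustration.

```go
package main

import (
	"crypto/sha1"
	"encoding/base32"
	"encoding/base64"
	"encoding/hex"
	"fmt"
)

// digestField is a hypothetical helper, not part of gowarc. It formats a
// SHA-1 digest of content as "sha1:<encoded>" using one of three encodings.
// The lower-case Base16 output is an assumption of this sketch.
func digestField(content []byte, encoding string) (string, error) {
	sum := sha1.Sum(content)
	switch encoding {
	case "base16":
		return "sha1:" + hex.EncodeToString(sum[:]), nil
	case "base32":
		return "sha1:" + base32.StdEncoding.EncodeToString(sum[:]), nil
	case "base64":
		return "sha1:" + base64.StdEncoding.EncodeToString(sum[:]), nil
	}
	return "", fmt.Errorf("unknown encoding %q", encoding)
}

func main() {
	for _, enc := range []string{"base16", "base32", "base64"} {
		field, err := digestField([]byte("This is the content"), enc)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%-6s %s\n", enc, field)
	}
}
```

A 20-byte SHA-1 digest takes 40 characters in Base16, 32 in Base32 and 28 in Base64, so the choice of encoding affects header size as well as compatibility with other tools.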

func WithFixContentLength

func WithFixContentLength(fixContentLength bool) WarcRecordOption

WithFixContentLength sets whether a Content-Length header whose value does not match the actual content length should be set to the real value.

This has no effect if SpecViolationPolicy is ErrIgnore.

defaults to false

func WithFixDigest

func WithFixDigest(fixDigest bool) WarcRecordOption

WithFixDigest sets whether a WARC-Block-Digest or WARC-Payload-Digest header whose value does not match the actual content should be recalculated.

This has no effect if SpecViolationPolicy is ErrIgnore.

defaults to false

func WithFixSyntaxErrors

func WithFixSyntaxErrors(fixSyntaxErrors bool) WarcRecordOption

WithFixSyntaxErrors sets whether the parser should attempt to fix syntax errors when they are detected.

This has no effect if SyntaxErrorPolicy is ErrIgnore.

defaults to false

func WithFixWarcFieldsBlockErrors

func WithFixWarcFieldsBlockErrors(fixWarcFieldsBlockErrors bool) WarcRecordOption

WithFixWarcFieldsBlockErrors sets whether the parser should attempt to fix syntax errors in a warcfields block when they are detected.

A warcfields block is typically generated by a web crawler. An error in this context suggests a potential bug in the crawler's WARC writer.

defaults to false

func WithNoValidation

func WithNoValidation() WarcRecordOption

WithNoValidation sets the parser to do as little validation as possible.

This option is for parsing as fast as possible and being as lenient as possible. Settings implied by this option are:

SyntaxErrorPolicy = ErrIgnore
SpecViolationPolicy = ErrIgnore
UnknownRecordPolicy = ErrIgnore
SkipParseBlock = true

func WithRecordIdFunc

func WithRecordIdFunc(recordIdFunc func() (string, error)) WarcRecordOption

WithRecordIdFunc sets a function for generating WARC-Record-ID if AddMissingRecordId is true.

Expected output is a valid URI without the surrounding '<' and '>' as described in the WARC spec (https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#warc-record-id-mandatory)

defaults to generating a UUID
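A custom generator only has to match the func() (string, error) signature and return a URI without the angle brackets. The sketch below builds a random (version 4) UUID URN with the standard library. `newRecordID` is a hypothetical name, and this is one possible generator rather than the one the package uses by default.

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// newRecordID is a hypothetical generator with the signature expected by
// WithRecordIdFunc. It returns a random (version 4) UUID as a URN, without
// the surrounding '<' and '>'.
func newRecordID() (string, error) {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	b[6] = (b[6] & 0x0f) | 0x40 // set version 4
	b[8] = (b[8] & 0x3f) | 0x80 // set RFC 4122 variant
	return fmt.Sprintf("urn:uuid:%x-%x-%x-%x-%x",
		b[0:4], b[4:6], b[6:8], b[8:10], b[10:16]), nil
}

func main() {
	id, err := newRecordID()
	if err != nil {
		panic(err)
	}
	fmt.Println(id)
}
```

Such a function could be passed as WithRecordIdFunc(newRecordID), combined with WithAddMissingRecordId(true) when parsing records that lack an ID.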

func WithSkipParseBlock

func WithSkipParseBlock() WarcRecordOption

WithSkipParseBlock sets parser to skip detecting known block types.

This implies that no payload digest can be computed.

func WithSpecViolationPolicy

func WithSpecViolationPolicy(policy ErrorPolicy) WarcRecordOption

WithSpecViolationPolicy sets the policy for handling violations of the WARC specification in WARC records.

defaults to ErrWarn

func WithStrictValidation

func WithStrictValidation() WarcRecordOption

WithStrictValidation sets the parser to fail on first error or violation of WARC specification.

Settings implied by this option are:

SyntaxErrorPolicy = ErrFail
SpecViolationPolicy = ErrFail
UnknownRecordPolicy = ErrFail
SkipParseBlock = false

func WithSyntaxErrorPolicy

func WithSyntaxErrorPolicy(policy ErrorPolicy) WarcRecordOption

WithSyntaxErrorPolicy sets the policy for handling syntax errors in WARC records.

defaults to ErrWarn

func WithUnknownRecordTypePolicy

func WithUnknownRecordTypePolicy(policy ErrorPolicy) WarcRecordOption

WithUnknownRecordTypePolicy sets the policy for handling unknown record types.

defaults to ErrWarn

func WithUrlParserOptions

func WithUrlParserOptions(opts ...url.ParserOption) WarcRecordOption

func WithVersion

func WithVersion(version *WarcVersion) WarcRecordOption

WithVersion sets the WARC version to use for new records.

defaults to WARC/1.1

type WarcVersion

type WarcVersion struct {
	// contains filtered or unexported fields
}

WarcVersion represents a WARC specification version.

For record creation, only WARC 1.0 and 1.1 are supported, represented by the constants V1_0 and V1_1. During parsing of a record, the WarcVersion takes the version value found in the record itself.

func (*WarcVersion) Major

func (v *WarcVersion) Major() uint8

func (*WarcVersion) Minor

func (v *WarcVersion) Minor() uint8

func (*WarcVersion) String

func (v *WarcVersion) String() string

String returns a string representation of the WARC version in the format used by WARC files, i.e. 'WARC/1.0' or 'WARC/1.1'.

type WriteResponse

type WriteResponse struct {
	FileName     string // filename
	FileOffset   int64  // the offset in file
	BytesWritten int64  // number of uncompressed bytes written
	Err          error  // error, if any
}
