newsdoc

Version: v1.0.0
Published: Feb 23, 2026 License: MIT Imports: 9 Imported by: 9

README

NewsDoc

This package provides type declarations for NewsDoc as Go types, protobuf messages, and a JSON schema. Protobuf and JSON schemas are generated from the Go type declarations.

NewsDoc was created to be a convenient and type-safe document format for editorial data, such as articles and concept metadata, that minimises the need to evolve the schema to accommodate new types of data. It does this by not encoding relationships ({categories:['a', 'b'], seeAlso:['c', 'd']}) or the type/identity of the data ({articleMetadata:{teaserHeadline:"v", teaserText:"w"}, headline:"x", "lead_in":"y", paragraphs:["z"]}) in the shape of the data structure itself. An example of a hypothetical format that does encode them this way:

{
    "categories": [
        "28b94216-77d7-41e9-be08-a6bfbe59f1d5",
        "a23528b7-31af-4ae2-bbca-0c78f1cbc959"
    ],
    "readMore": [
        "6dd826dd-d866-459b-a07e-0da4bad7bce0",
        "043c248f-92ac-4e0b-b0ec-76cc26323634"
    ],
    "articleMetadata": {
        "teaserHeadline": "v",
        "teaserText": "w"
    },
    "headline": "x",
    "lead_in": "y",
    "paragraphs": ["z"],
    "image": "https://example.com/an-image.jpg",
    "image_width": 128,
    "image_height": 128,
    "image_alt_text": "desc"
}

Instead it adopts a view of documents as a set of links expressing relationships to other entities, a set of typed metadata blocks, and a list of typed content blocks that represent the actual content of, for example, an article. The article sketched in the example above would instead look like this:

{
    "type": "example/article",
    "links": [
        {"rel":"category", "uuid":"28b94216-77d7-41e9-be08-a6bfbe59f1d5"},
        {"rel":"category", "uuid":"a23528b7-31af-4ae2-bbca-0c78f1cbc959"},
        {
            "rel":"see-also", "type":"example/article",
            "uuid":"6dd826dd-d866-459b-a07e-0da4bad7bce0"
        },
        {
            "rel":"see-also", "type":"example/article",
            "uuid":"043c248f-92ac-4e0b-b0ec-76cc26323634"
        }
    ],
    "meta": [
        {
            "type": "example/teaser",
            "title": "v",
            "data": {
                "text": "w"
            }
        }
    ],
    "content": [
        {
            "type": "example/headline",
            "data": {
                "text": "x"
            }
        },
        {
            "type": "example/image",
            "url": "https://example.com/an-image.jpg",
            "data": {
                "width": "128",
                "height": "128",
                "alt": "desc"
            }
        },
        {
            "type": "example/lead-in",
            "data": {
                "text": "y"
            }
        },
        {
            "type": "example/paragraph",
            "data": {
                "text": "z"
            }
        }
    ]
}

This kind of structure allows a system that uses NewsDoc to recognise, for example, that there is a link to another entity, or a content element with text, without knowing the specific type of relationship or content. On the flip side, it is also easy to ignore, say, a metadata block with a type that you don't recognise.
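
As a rough illustration of that point, the following Go sketch decodes a shortened version of the document above and reads links and text without looking at any specific rel or type. The block and document types are trimmed-down, hypothetical stand-ins for this package's Block and Document types, and parse, linkedUUIDs and textOf are illustrative helpers that are not part of the package.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// block is a hypothetical, trimmed-down stand-in for a NewsDoc block.
type block struct {
	Type string            `json:"type,omitempty"`
	Rel  string            `json:"rel,omitempty"`
	UUID string            `json:"uuid,omitempty"`
	Data map[string]string `json:"data,omitempty"`
}

// document is a hypothetical, trimmed-down stand-in for a NewsDoc document.
type document struct {
	Type    string  `json:"type"`
	Links   []block `json:"links,omitempty"`
	Meta    []block `json:"meta,omitempty"`
	Content []block `json:"content,omitempty"`
}

// sample is a shortened version of the example article.
const sample = `{
  "type": "example/article",
  "links": [
    {"rel": "category", "uuid": "28b94216-77d7-41e9-be08-a6bfbe59f1d5"},
    {"rel": "see-also", "type": "example/article",
     "uuid": "6dd826dd-d866-459b-a07e-0da4bad7bce0"}
  ],
  "content": [
    {"type": "example/headline", "data": {"text": "x"}},
    {"type": "example/image", "data": {"width": "128", "height": "128"}},
    {"type": "example/paragraph", "data": {"text": "z"}}
  ]
}`

// parse decodes a JSON document into the stand-in types.
func parse(raw string) (document, error) {
	var doc document
	err := json.Unmarshal([]byte(raw), &doc)
	return doc, err
}

// linkedUUIDs returns the UUID of every link, whatever its rel or type.
func linkedUUIDs(doc document) []string {
	var out []string
	for _, l := range doc.Links {
		if l.UUID != "" {
			out = append(out, l.UUID)
		}
	}
	return out
}

// textOf returns the "text" data value of every content block that has
// one, and silently skips blocks (like the image) that do not.
func textOf(doc document) []string {
	var out []string
	for _, b := range doc.Content {
		if t, ok := b.Data["text"]; ok {
			out = append(out, t)
		}
	}
	return out
}

func main() {
	doc, err := parse(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(linkedUUIDs(doc))
	fmt.Println(textOf(doc))
}
```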

One thing is lost in translation here: the "data" object of a block is a string->string key-value structure, so the width 128 becomes "128". We sacrifice the specific types of some data to get a largely static type system. The "type contract" between content producers and consumers in a system like this is still that "width" and "height" must always be integers. Revisor is our attempt to formalise and enforce these type contracts.
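
On the consumer side, honouring that contract means parsing the string back into an integer. A minimal sketch, assuming a local dataMap type that mirrors the string->string shape of the "data" object; intValue is a hypothetical helper and not part of this package:

```go
package main

import (
	"fmt"
	"strconv"
)

// dataMap mirrors the string -> string shape of a block's data object.
type dataMap map[string]string

// intValue reads key from data and parses it as a base-10 integer.
// It fails both when the key is missing and when the value is not an
// integer, which is the situation a schema check is meant to rule out.
func intValue(data dataMap, key string) (int, error) {
	raw, ok := data[key]
	if !ok {
		return 0, fmt.Errorf("missing data key %q", key)
	}
	n, err := strconv.Atoi(raw)
	if err != nil {
		return 0, fmt.Errorf("data key %q is not an integer: %w", key, err)
	}
	return n, nil
}

func main() {
	image := dataMap{"width": "128", "height": "128", "alt": "desc"}

	width, err := intValue(image, "width")
	fmt.Println(width, err)

	_, err = intValue(image, "alt")
	fmt.Println(err)
}
```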

A revisor schema for the above format could look like this:

{"documents":[{
  "name": "News article",
  "description": "A basic news article example",
  "declares": "example/article",
  "links": [
    {
      "name": "Category",
      "description": "A category assigned to the article",
      "declares": {"rel":"category"},
      "attributes": {"uuid": {}}
    },
    {
      "name": "Read more",
      "description": "A link to other articles that are interesting",
      "declares": {"rel":"see-also", "type": "example/article"},
      "attributes": {"uuid": {}}
    }
  ],
  "meta": [
    {
      "name": "Teaser",
      "declares": {"type":"example/teaser"},
      "attributes": {"title": {}},
      "data": {"text": {}},
      "count": 1
    }
  ],
  "content": [
    {
      "name": "Headline",
      "declares": {"type":"example/headline"},
      "data": {"text": {}}
    },
    {
      "name": "Lead-in",
      "declares": {"type":"example/lead-in"},
      "data": {"text": {}}
    },
    {
      "name": "Paragraph",
      "declares": {"type":"example/paragraph"},
      "data": {"text": {}}
    },
    {
      "name": "Image",
      "declares": {"type":"example/image"},
      "attributes": {
        "url": {"glob":"https://**"}
      },
      "data": {
        "width": {"format":"int"},
        "height": {"format":"int"},
        "alt": {}
      }
    }
  ]
}]}

This schema can then be used to validate documents and so ensure the data quality of stored documents. It also serves as documentation, and can give automated systems, such as a full-text index, a hint about the correct way to index the data.

Value extractor expressions

The ValueExtractor provides a way to extract values from documents using a selector expression language. An expression consists of a chain of block selectors followed by a value specifier that determines what to extract from the matched blocks.

Selectors

Selectors navigate the block hierarchy of a document. Each selector targets a block list (meta, links, or content) and can optionally filter by block attributes:

.meta                              -- all meta blocks
.links(rel='category')             -- links with rel "category"
.meta(type='core/note').links      -- links inside meta blocks of type "core/note"
.content(type='core/text' role='heading')  -- content blocks matching both type and role

Selectors can be chained to navigate into nested blocks. The available filter attributes are: id, uuid, uri, url, type, rel, role, name, value, contenttype, and sensitivity. Attribute values are single-quoted; use \' to escape a literal quote inside a value.

Data filters

In addition to block attributes, selectors can filter on values in the block's data map using the data. prefix inside the parentheses. Three modes are supported:

data.key='value'   -- exact match: the data key must exist with this value
data.key?          -- exists: the data key must be present (even if empty)
data.key??         -- non-empty: the data key must be present and non-empty
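
The three modes can be read as three small predicates over the data map. The sketch below uses local stand-in types whose mode values mirror the DataFilterMode constants listed in the reference further down; the matches method is an illustration of the described semantics, not the package's implementation.

```go
package main

import "fmt"

// filterMode mirrors the documented DataFilterMode values.
type filterMode string

const (
	modeExact    filterMode = "exact"     // data.key='value'
	modeExists   filterMode = "exists"    // data.key?
	modeNonEmpty filterMode = "non-empty" // data.key??
)

// dataFilter is a local stand-in for a single data filter condition.
type dataFilter struct {
	Key   string
	Value string
	Mode  filterMode
}

// matches applies the filter to a data map. Reading from a nil map is
// safe in Go, so blocks without data simply never match.
func (f dataFilter) matches(data map[string]string) bool {
	v, ok := data[f.Key]
	if !ok {
		return false
	}
	switch f.Mode {
	case modeExact:
		return v == f.Value
	case modeExists:
		return true
	case modeNonEmpty:
		return v != ""
	default:
		return false
	}
}

func main() {
	data := map[string]string{"status": "confirmed", "date": ""}

	exact := dataFilter{Key: "status", Value: "confirmed", Mode: modeExact}
	exists := dataFilter{Key: "date", Mode: modeExists}
	nonEmpty := dataFilter{Key: "date", Mode: modeNonEmpty}

	fmt.Println(exact.matches(data), exists.matches(data), nonEmpty.matches(data))
}
```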

Data filters can be mixed freely with attribute filters:

.meta(type='core/event' data.date?? data.status='confirmed').data{date}
.links(rel='item' data.date_timezone='Asia/Shanghai').data{date}

Combining conditions with or and grouping

By default, multiple conditions inside a selector are combined with implicit AND — a block must satisfy all of them. Use the or keyword to match blocks satisfying at least one alternative:

.meta(value='text' or value='picture')

AND binds tighter than or, so conditions separated by spaces are grouped together before or is applied. To control precedence, use parentheses:

.meta(type='core/thing' (value='a' or value='b'))

This matches meta blocks with type='core/thing' AND either value='a' or value='b'. Without the inner parentheses, the expression would be parsed as (type='core/thing' value='a') or value='b'.
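
The difference between the two parses can be made concrete with a tiny boolean expression tree. This is a sketch under assumptions: node is a simplified, hypothetical stand-in for a filter tree that only knows the type and value attributes, and the two trees are built by hand rather than by the package's parser.

```go
package main

import "fmt"

// block is a hypothetical stand-in with only the attributes used here.
type block struct {
	Type  string
	Value string
}

// node is a simplified filter tree: branch nodes set Op and Children,
// leaf nodes set Attr and Value.
type node struct {
	Op       string // "and", "or", or "" for a leaf
	Children []node
	Attr     string
	Value    string
}

// matches evaluates the tree against a block.
func (n node) matches(b block) bool {
	switch n.Op {
	case "and":
		for _, c := range n.Children {
			if !c.matches(b) {
				return false
			}
		}
		return true
	case "or":
		for _, c := range n.Children {
			if c.matches(b) {
				return true
			}
		}
		return false
	}
	switch n.Attr {
	case "type":
		return b.Type == n.Value
	case "value":
		return b.Value == n.Value
	}
	return false
}

func leaf(attr, value string) node { return node{Attr: attr, Value: value} }

// grouped models: type='core/thing' (value='a' or value='b')
var grouped = node{Op: "and", Children: []node{
	leaf("type", "core/thing"),
	{Op: "or", Children: []node{leaf("value", "a"), leaf("value", "b")}},
}}

// ungrouped models: type='core/thing' value='a' or value='b'
var ungrouped = node{Op: "or", Children: []node{
	{Op: "and", Children: []node{leaf("type", "core/thing"), leaf("value", "a")}},
	leaf("value", "b"),
}}

func main() {
	// A block of a different type that happens to have value "b".
	other := block{Type: "core/other", Value: "b"}
	fmt.Println(grouped.matches(other), ungrouped.matches(other))
}
```

A block with value='b' but another type is rejected by the grouped form and accepted by the ungrouped one, which is usually the reason to add the inner parentheses.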

Parenthesized groups can be nested and combined freely with attribute and data filters:

-- OR between two AND groups
.meta((type='a' value='x') or (type='b' value='y'))

-- OR between data filters
.meta(data.status='draft' or data.status='review')

-- Nested groups
.meta((type='a' (value='x' or value='y')) or (type='b' value='z'))

-- Three-way OR
.meta(value='text' or value='picture' or value='video')

Child selectors

Use # to filter blocks by their descendants without navigating into them. The selectors after # form a child selector chain — the parent block is only matched if it has descendants satisfying the chain. The extraction targets the parent block, not the descendants:

.meta(type='core/assignment')#.links(rel='deliverable' uuid='...')

This selects core/assignment meta blocks that contain a link with rel='deliverable' and the given UUID. The result is the assignment block itself. Compare with the non-child version which would navigate into and return the link:

.meta(type='core/assignment').links(rel='deliverable' uuid='...')
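
The contrast between the two forms can be sketched in plain Go. The block type, the helper names and the "doc-1"/"doc-2" identifiers are all hypothetical, and the sketch only looks at direct links, whereas the text above speaks of descendants in general.

```go
package main

import "fmt"

// block is a hypothetical, trimmed-down stand-in for a NewsDoc block.
type block struct {
	Type  string
	Rel   string
	UUID  string
	Title string
	Links []block
}

// hasLink reports whether b carries a direct link with the given rel
// and UUID.
func hasLink(b block, rel, uuid string) bool {
	for _, l := range b.Links {
		if l.Rel == rel && l.UUID == uuid {
			return true
		}
	}
	return false
}

// selectParents models the '#' form: the link only acts as a condition
// and the parent meta block is what gets returned.
func selectParents(meta []block, typ, rel, uuid string) []block {
	var out []block
	for _, m := range meta {
		if m.Type == typ && hasLink(m, rel, uuid) {
			out = append(out, m)
		}
	}
	return out
}

// selectLinks models the plain chained form: navigate into the meta
// block and return the matching links themselves.
func selectLinks(meta []block, typ, rel, uuid string) []block {
	var out []block
	for _, m := range meta {
		if m.Type != typ {
			continue
		}
		for _, l := range m.Links {
			if l.Rel == rel && l.UUID == uuid {
				out = append(out, l)
			}
		}
	}
	return out
}

// sampleMeta uses placeholder identifiers instead of real UUIDs.
var sampleMeta = []block{
	{Type: "core/assignment", Title: "Cover the match", Links: []block{
		{Rel: "deliverable", UUID: "doc-1"},
	}},
	{Type: "core/assignment", Title: "Write the recap", Links: []block{
		{Rel: "deliverable", UUID: "doc-2"},
	}},
}

func main() {
	parents := selectParents(sampleMeta, "core/assignment", "deliverable", "doc-2")
	links := selectLinks(sampleMeta, "core/assignment", "deliverable", "doc-2")
	fmt.Println(parents[0].Title, links[0].UUID)
}
```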

Child selectors can be chained to match deeper descendants, and support the same attribute and data filters as regular selectors:

assignment=.meta(type='core/assignment')#.links(rel='deliverable' data.status='active'):label

Extracting data values

Use .data{} to extract values from the matched blocks' data maps. Values are space-separated (commas are also accepted):

.meta(type='core/planning-item').data{start_date end_date}

Each matched block must have all specified data keys for the extraction to succeed. Append ? to make a value optional:

.meta(type='core/planning-item').data{start_date date_tz?}
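
The required/optional rule can be sketched as a small function over a data map. valueSpec and extractData are hypothetical local names, and the sketch assumes that "missing" means the key is absent from the map; how the package treats present-but-empty values is not stated here.

```go
package main

import "fmt"

// valueSpec is a local stand-in for one requested data key.
type valueSpec struct {
	Name     string
	Optional bool
}

// extractData returns the requested values for one block's data map.
// If a required key is absent the whole block is skipped (ok == false);
// absent optional keys are simply left out of the result.
func extractData(data map[string]string, specs []valueSpec) (map[string]string, bool) {
	out := make(map[string]string, len(specs))
	for _, s := range specs {
		v, ok := data[s.Name]
		if !ok {
			if s.Optional {
				continue
			}
			return nil, false
		}
		out[s.Name] = v
	}
	return out, true
}

func main() {
	// Corresponds to .data{start_date date_tz?}
	specs := []valueSpec{{Name: "start_date"}, {Name: "date_tz", Optional: true}}

	values, ok := extractData(map[string]string{"start_date": "2026-02-23"}, specs)
	fmt.Println(values, ok)

	values, ok = extractData(map[string]string{"date_tz": "Europe/Stockholm"}, specs)
	fmt.Println(values, ok)
}
```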

Extracting block attributes

Use @{} to extract block attribute values:

.content(type='core/text')@{value}
.links(rel='author')@{uuid title}

When no selectors are provided, @{} extracts document-level attributes (uuid, type, uri, url, title, language):

@{title language}

Combining attribute and data extraction

An expression can combine @{} and .data{} to extract both block attributes and data values from the same matched blocks:

.meta(type='core/assignment')@{title}.data{start_date date_tz}

This extracts the title attribute and the start_date and date_tz data values from each matched block. The same all-or-nothing semantics apply: if any required value is missing, the block is skipped.

Annotations and roles

Values can be annotated with a type hint using :, and given a role using = as a prefix:

.meta(type='core/event').data{date:date tz=date_timezone?}

Here date has the annotation date, and date_timezone is extracted with the role tz. Annotations and roles are passed through in the extracted results and can be used by the caller to interpret the values.
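
To make the reading of the two value specifiers above explicit, here is a small sketch that splits a single specifier into its parts. parseSpec and valueSpec are hypothetical and cover only the forms shown in this README (role= prefix, :annotation suffix, trailing ?); how the real parser orders or combines them in other cases is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// valueSpec is a local stand-in for a parsed value specifier.
type valueSpec struct {
	Name       string
	Annotation string
	Role       string
	Optional   bool
}

// parseSpec splits one specifier such as "date:date" or
// "tz=date_timezone?" into name, annotation, role and optional flag.
func parseSpec(token string) valueSpec {
	var spec valueSpec

	// A trailing '?' marks the value as optional.
	if strings.HasSuffix(token, "?") {
		spec.Optional = true
		token = strings.TrimSuffix(token, "?")
	}

	// "role=name" gives the value a role.
	if role, rest, ok := strings.Cut(token, "="); ok {
		spec.Role = role
		token = rest
	}

	// "name:annotation" attaches a type hint.
	if name, annotation, ok := strings.Cut(token, ":"); ok {
		spec.Annotation = annotation
		token = name
	}

	spec.Name = token

	return spec
}

func main() {
	fmt.Printf("%+v\n", parseSpec("date:date"))
	fmt.Printf("%+v\n", parseSpec("tz=date_timezone?"))
}
```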

Extracting full blocks

If no .data{} or @{} value specifier is present, the expression extracts the full matched blocks. Block extraction requires a name prefix and optionally accepts an annotation:

name=.selectors
name=.selectors:annotation

Examples:

items=.meta(type='core/collection').links(rel='item')
event=.links(rel='event' type='core/event'):calendar

The name is used as the key in the extracted results and populates the Name field of the ExtractedValue. The matched block is available in the Block field.

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

func AlterBlocks added in v0.7.0

func AlterBlocks(list []Block, selector BlockMatcher, fn func(*Block))

AlterBlocks calls fn for each block matching the selector.

func AlterFirstBlock added in v0.7.2

func AlterFirstBlock(list []Block, selector BlockMatcher, fn func(*Block))

AlterFirstBlock calls fn for the first block matching the selector.

func JSONSchema

func JSONSchema() []byte

JSONSchema returns the NewsDoc JSON schema.

Types

type Block

type Block struct {
	// ID is the block ID.
	ID string `json:"id,omitempty" proto:"1"`
	// UUID is used to reference another Document in a block.
	UUID string `json:"uuid,omitempty" jsonschema_extras:"format=uuid" proto:"2"`
	// URI is used to reference another entity in a document.
	URI string `json:"uri,omitempty"  jsonschema_extras:"format=uri" proto:"3"`
	// URL is a browseable URL for the block.
	URL string `json:"url,omitempty" jsonschema_extras:"format=uri" proto:"4"`
	// Type is the type of the block
	Type string `json:"type,omitempty" proto:"5"`
	// Title is the title/headline of the block, typically used in the
	// presentation of the block.
	Title string `json:"title,omitempty" proto:"6"`
	// Data contains block data.
	Data DataMap `json:"data,omitempty" proto:"7"`
	// Rel describes the relationship to the document/parent entity.
	Rel string `json:"rel,omitempty" proto:"8"`
	// Role is used either as an alternative to rel, or for nuancing the
	// relationship.
	Role string `json:"role,omitempty" proto:"9"`
	// Name is a name for the block. An alternative to "rel" when
	// relationship is a term that doesn't fit.
	Name string `json:"name,omitempty" proto:"10"`
	// Value is a value for the block. Useful when we want to store a
	// primitive value.
	Value string `json:"value,omitempty" proto:"11"`
	// Contenttype is used to describe the content type of the block/linked
	// entity if it differs from the type of the block.
	Contenttype string `json:"contenttype,omitempty" proto:"12"`
	// Links are used to link to other resources and documents.
	Links []Block `json:"links,omitempty" proto:"13"`
	// Content is used to embed content blocks.
	Content []Block `json:"content,omitempty" proto:"14"`
	// Meta is used to embed metadata
	Meta []Block `json:"meta,omitempty" proto:"15"`
	// Sensitivity can be used to communicate how the information in a block
	// can be handled. It could f.ex. be set to "internal", to show that it
	// contains information that must be removed or transformed before
	// publishing.
	Sensitivity string `json:"sensitivity,omitempty" proto:"16"`
}

Block is the building block for data embedded in documents. It is used for content, links, and metadata alike. Blocks can be nested, but that is nothing to strive for; keep it simple.

func AddOrReplaceBlock added in v0.7.0

func AddOrReplaceBlock(
	list []Block, selector BlockMatcher, insert Block,
) []Block

AddOrReplaceBlock inserts a new block into the list, or replaces the first block matching the selector.

func AllBlocks added in v0.7.1

func AllBlocks(list []Block, selector BlockMatcher) []Block

AllBlocks returns all blocks matching the selector.

func DedupeBlocks added in v0.7.0

func DedupeBlocks(list []Block, selector BlockMatcher) []Block

DedupeBlocks removes all but the first block matching the selector.

func DropBlocks added in v0.7.0

func DropBlocks(list []Block, selector BlockMatcher) []Block

DropBlocks removes all blocks matching the selector.

func FirstBlock added in v0.7.0

func FirstBlock(list []Block, selector BlockMatcher) (Block, bool)

FirstBlock returns the first block matching the selector.

func UpsertBlock added in v0.7.0

func UpsertBlock(
	list []Block, selector BlockMatcher, insert Block,
	fn func(b Block) Block,
) []Block

UpsertBlock inserts a new block or updates an existing block if it matches the selector. The function fn will be called on the inserted or existing block.
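
A minimal sketch of the upsert behaviour described above, using local stand-in types instead of Block and BlockMatcher. It assumes that only the first matching block is updated and that a new block is appended at the end of the list; neither detail is confirmed by the description.

```go
package main

import "fmt"

// block is a hypothetical, trimmed-down stand-in for a NewsDoc block.
type block struct {
	Type  string
	Title string
}

// matcher is a local stand-in for a block matcher.
type matcher func(b block) bool

// upsert calls fn on the first block that matches and stores the
// result in place. If nothing matches, fn is applied to insert and the
// result is appended.
func upsert(list []block, match matcher, insert block, fn func(block) block) []block {
	for i := range list {
		if match(list[i]) {
			list[i] = fn(list[i])
			return list
		}
	}
	return append(list, fn(insert))
}

func main() {
	isTeaser := func(b block) bool { return b.Type == "example/teaser" }

	withTitle := func(title string) func(block) block {
		return func(b block) block {
			b.Title = title
			return b
		}
	}

	var meta []block

	// First call inserts, second call updates the existing block.
	meta = upsert(meta, isTeaser, block{Type: "example/teaser"}, withTitle("v"))
	meta = upsert(meta, isTeaser, block{Type: "example/teaser"}, withTitle("w"))

	fmt.Println(len(meta), meta[0].Title)
}
```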

func WithBlockOfType added in v0.7.0

func WithBlockOfType(
	list []Block, blockType string, fn func(b Block) Block,
) []Block

WithBlockOfType upserts a block with the given type, see UpsertBlock().

func (Block) Clone added in v0.7.3

func (b Block) Clone() Block

Clone returns a deep copy of the block.

type BlockKind added in v0.8.0

type BlockKind string
const (
	BlockKindMeta    BlockKind = "meta"
	BlockKindLinks   BlockKind = "links"
	BlockKindContent BlockKind = "content"
)

type BlockMatchFunc added in v0.7.0

type BlockMatchFunc func(block Block) bool

BlockMatchFunc is a custom BlockMatcher function.

func (BlockMatchFunc) Match added in v0.7.0

func (fn BlockMatchFunc) Match(block Block) bool

Implements BlockMatcher.

type BlockMatcher added in v0.7.0

type BlockMatcher interface {
	// Match returns true if the block matches the condition.
	Match(block Block) bool
}

BlockMatcher checks if a block matches a condition.

func BlockDoesntMatch added in v0.7.0

func BlockDoesntMatch(selector BlockMatcher) BlockMatcher

BlockDoesntMatch returns a block matcher that negates the selector.

func BlockMatchesAll added in v0.7.0

func BlockMatchesAll(matchers ...BlockMatcher) BlockMatcher

BlockMatchesAll returns a block matcher that returns true if a block matches all the conditions.

func BlockMatchesAny added in v0.7.0

func BlockMatchesAny(matchers ...BlockMatcher) BlockMatcher

BlockMatchesAny returns a block matcher that returns true if a block matches any of the conditions.
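
The combinators compose in the usual way for predicate interfaces. The sketch below re-creates the pattern with local, lower-case stand-ins (matcher, matchFunc, matchesAll, matchesAny, doesntMatch, withType, withRel, allBlocks); it illustrates the idea and is not the package's implementation. Treating an empty matchesAll as true and an empty matchesAny as false is an assumption.

```go
package main

import "fmt"

// block is a hypothetical, trimmed-down stand-in for a NewsDoc block.
type block struct{ Type, Rel, Role string }

// matcher mirrors the shape of the documented BlockMatcher interface.
type matcher interface {
	Match(b block) bool
}

// matchFunc adapts a plain function to the matcher interface.
type matchFunc func(b block) bool

func (fn matchFunc) Match(b block) bool { return fn(b) }

// matchesAll is true when every matcher accepts the block.
func matchesAll(ms ...matcher) matcher {
	return matchFunc(func(b block) bool {
		for _, m := range ms {
			if !m.Match(b) {
				return false
			}
		}
		return true
	})
}

// matchesAny is true when at least one matcher accepts the block.
func matchesAny(ms ...matcher) matcher {
	return matchFunc(func(b block) bool {
		for _, m := range ms {
			if m.Match(b) {
				return true
			}
		}
		return false
	})
}

// doesntMatch negates a matcher.
func doesntMatch(m matcher) matcher {
	return matchFunc(func(b block) bool { return !m.Match(b) })
}

func withType(t string) matcher {
	return matchFunc(func(b block) bool { return b.Type == t })
}

func withRel(r string) matcher {
	return matchFunc(func(b block) bool { return b.Rel == r })
}

// allBlocks returns the blocks accepted by m, in their original order.
func allBlocks(list []block, m matcher) []block {
	var out []block
	for _, b := range list {
		if m.Match(b) {
			out = append(out, b)
		}
	}
	return out
}

func main() {
	links := []block{
		{Type: "core/author", Rel: "author"},
		{Type: "core/author", Rel: "editor"},
		{Type: "core/category", Rel: "category"},
	}

	// Authors that are not linked as editors.
	sel := matchesAll(withType("core/author"), doesntMatch(withRel("editor")))
	fmt.Println(allBlocks(links, sel))
}
```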

func BlocksWithRel added in v0.7.0

func BlocksWithRel(rel string) BlockMatcher

BlocksWithRel returns a BlockMatcher that matches blocks with the given rel.

func BlocksWithType added in v0.7.0

func BlocksWithType(blockType string) BlockMatcher

BlocksWithType returns a BlockMatcher that matches blocks with the given type.

func BlocksWithTypeAndRel added in v0.7.0

func BlocksWithTypeAndRel(blockType string, rel string) BlockMatcher

BlocksWithTypeAndRel returns a BlockMatcher that matches blocks with the given type and rel.

func BlocksWithTypeAndRole added in v0.7.0

func BlocksWithTypeAndRole(blockType string, role string) BlockMatcher

BlocksWithTypeAndRole returns a BlockMatcher that matches blocks with the given type and role.

type BlockRole added in v0.7.0

type BlockRole string

BlockRole can be used to check that a block has a specific role.

func (BlockRole) Match added in v0.7.0

func (role BlockRole) Match(block Block) bool

Implements BlockMatcher.

type BlockSelector added in v0.8.0

type BlockSelector struct {
	Kind   BlockKind
	Filter *FilterNode `json:",omitempty"`
}

BlockSelector selects blocks by kind and optional attribute/data filters.

func (BlockSelector) FilterBlocks added in v0.9.0

func (bs BlockSelector) FilterBlocks(blocks []Block) []Block

func (BlockSelector) Iterator added in v0.8.0

func (bs BlockSelector) Iterator(blocks iter.Seq[Block]) iter.Seq[Block]

func (BlockSelector) Matches added in v0.8.0

func (bs BlockSelector) Matches(b Block) bool

type DataFilter added in v0.9.0

type DataFilter struct {
	Key   string
	Value string `json:",omitempty"`
	Mode  DataFilterMode
}

DataFilter is a filter condition on a block's data map.

type DataFilterMode added in v0.9.0

type DataFilterMode string

DataFilterMode describes the comparison mode for a data filter.

const (
	// DataFilterExact matches when the data key exists with the exact
	// value.
	DataFilterExact DataFilterMode = "exact"
	// DataFilterExists matches when the data key exists, even if empty.
	DataFilterExists DataFilterMode = "exists"
	// DataFilterNonEmpty matches when the data key exists and is non-empty.
	DataFilterNonEmpty DataFilterMode = "non-empty"
)

type DataMap

type DataMap map[string]string

DataMap is used as key -> (string) value data for blocks.

func CopyData added in v0.7.0

func CopyData(dst DataMap, src DataMap, keys ...string) DataMap

CopyData copies the given keys from the source data map to the destination. Keys are only copied if they actually exist in the source, and it is safe to call the function with nil DataMaps. The result is always a non-nil DataMap.

func DataWithDefaults added in v0.7.0

func DataWithDefaults(data DataMap, defaults DataMap) DataMap

DataWithDefaults adds the values from defaults into data if the value for the corresponding key is unset or empty. If data is nil a new DataMap will be created.

func UpsertData added in v0.7.0

func UpsertData(data DataMap, newData DataMap) DataMap

UpsertData adds the values from newData into data. If data is nil a new DataMap will be created.
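
The nil handling described for CopyData, DataWithDefaults and UpsertData follows a common Go pattern: allocate the destination lazily and rely on reads from a nil map being safe. The sketch below re-creates that pattern with local stand-ins (dataMap, copyData, withDefaults, upsertData). It is not the package's code, and having upsertData overwrite existing keys is an assumption based on the name.

```go
package main

import "fmt"

// dataMap mirrors the string -> string shape of DataMap.
type dataMap map[string]string

// copyData copies the listed keys that exist in src into dst and
// always returns a non-nil map.
func copyData(dst, src dataMap, keys ...string) dataMap {
	if dst == nil {
		dst = make(dataMap)
	}
	for _, k := range keys {
		if v, ok := src[k]; ok { // reading a nil src is safe
			dst[k] = v
		}
	}
	return dst
}

// withDefaults fills in keys that are unset or empty in data.
func withDefaults(data, defaults dataMap) dataMap {
	if data == nil {
		data = make(dataMap)
	}
	for k, v := range defaults {
		if data[k] == "" {
			data[k] = v
		}
	}
	return data
}

// upsertData writes every entry of newData into data, replacing
// existing values (assumed behaviour).
func upsertData(data, newData dataMap) dataMap {
	if data == nil {
		data = make(dataMap)
	}
	for k, v := range newData {
		data[k] = v
	}
	return data
}

func main() {
	src := dataMap{"start_date": "2026-02-23", "status": "draft"}

	d := copyData(nil, src, "start_date", "end_date")
	d = withDefaults(d, dataMap{"date_tz": "UTC"})
	d = upsertData(d, dataMap{"status": "confirmed"})

	fmt.Println(len(d), d["start_date"], d["date_tz"], d["status"])
}
```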

func (DataMap) Delete added in v0.7.0

func (bd DataMap) Delete(keys ...string)

Delete the values with the given keys. This is safe to use on nil DataMaps.

func (DataMap) DropEmpty added in v0.7.0

func (bd DataMap) DropEmpty()

DropEmpty removes all entries with empty values. This is safe to use on nil DataMaps.

func (DataMap) Get added in v0.7.0

func (bd DataMap) Get(key string, defaultValue string) string

Get the value with the given key. This is safe to use on nil DataMaps.

func (DataMap) MarshalJSON

func (bd DataMap) MarshalJSON() ([]byte, error)

MarshalJSON implements a custom marshaler to make the JSON output of a document deterministic, since maps are unordered.

type Document

type Document struct {
	// UUID is a unique ID for the document, this can for example be a
	// random v4 UUID, or a URI-derived v5 UUID.
	UUID string `json:"uuid,omitempty" jsonschema_extras:"format=uuid" proto:"1"`
	// Type is the content type of the document.
	Type string `json:"type,omitempty"  proto:"2"`
	// URI identifies the document (in a more human-readable way than the
	// UUID).
	URI string `json:"uri,omitempty" jsonschema_extras:"format=uri" proto:"3"`
	// URL is the browseable location of the document (if any).
	URL string `json:"url,omitempty" jsonschema_extras:"format=uri" proto:"4"`
	// Title is the title of the document, can be used as the document name,
	// or the headline when the document is displayed.
	Title string `json:"title,omitempty" proto:"5"`
	// Content is the content of the document, this is essentially what gets
	// rendered on the page when you view a document.
	Content []Block `json:"content,omitempty" proto:"6"`
	// Meta is the metadata for a document, this could be things like
	// teasers, open graph data, newsvalues.
	Meta []Block `json:"meta,omitempty" proto:"7"`
	// Links are links to other resources and entities. This could be links
	// to topics, categories and subjects for the document, or credited
	// authors.
	Links []Block `json:"links,omitempty" proto:"8"`
	// Language is the language used in the document as an IETF language
	// tag. F.ex. "en", "en-UK", "es", or "sv-SE".
	Language string `json:"language,omitempty" proto:"9"`
}

Document is a NewsDoc document.

func (Document) Clone added in v0.7.3

func (d Document) Clone() Document

Clone returns a deep copy of the document.

type ExtractedItems added in v0.8.0

type ExtractedItems map[string]ExtractedValue

type ExtractedValue added in v0.8.0

type ExtractedValue struct {
	Name       string
	Value      string `json:",omitempty"`
	Block      *Block `json:",omitempty"`
	Annotation string `json:",omitempty"`
	Role       string `json:",omitempty"`
}

type FilterNode added in v0.9.0

type FilterNode struct {
	Op       FilterOp     `json:",omitempty"`
	Children []FilterNode `json:",omitempty"`
	Attr     string       `json:",omitempty"`
	Value    string       `json:",omitempty"`
	Data     *DataFilter  `json:",omitempty"`
}

FilterNode is a node in a boolean filter expression tree. Branch nodes have Op and Children set; leaf nodes have either Attr+Value (attribute match) or Data (data filter) set.

func (*FilterNode) Matches added in v0.9.0

func (fn *FilterNode) Matches(b Block) bool

Matches reports whether the filter node matches the given block. A nil node matches all blocks.

type FilterOp added in v0.9.0

type FilterOp string

FilterOp is the boolean operator for a filter node.

const (
	// FilterOpAnd combines children with logical AND.
	FilterOpAnd FilterOp = "and"
	// FilterOpOr combines children with logical OR.
	FilterOpOr FilterOp = "or"
)

type ValueExtractor added in v0.8.0

type ValueExtractor struct {
	Selectors      []BlockSelector
	ChildSelectors []BlockSelector `json:",omitempty"`
	ValueKind      ValueKind
	Values         []ValueSpec
}

func ValueExtractorFromBytes added in v0.8.0

func ValueExtractorFromBytes(text []byte) (*ValueExtractor, error)

func ValueExtractorFromString added in v0.8.0

func ValueExtractorFromString(text string) (*ValueExtractor, error)

func (*ValueExtractor) Collect added in v0.8.0

func (ve *ValueExtractor) Collect(doc Document) []ExtractedItems

type ValueKind added in v0.8.0

type ValueKind string

ValueKind describes what kind of values a ValueExtractor produces.

const (
	// ValueKindAttributes extracts block attribute values using @{}.
	ValueKindAttributes ValueKind = "attributes"
	// ValueKindData extracts block data map values using .data{}.
	ValueKindData ValueKind = "data"
	// ValueKindBlock extracts matched blocks themselves (name=.selectors).
	ValueKindBlock ValueKind = "block"
	// ValueKindCombined extracts both attribute and data values from matched
	// blocks using @{}.data{} in a single expression.
	ValueKindCombined ValueKind = "combined"
)

type ValueSource added in v0.9.0

type ValueSource string

ValueSource identifies whether a value spec in a combined extraction targets block attributes or block data. It is only populated for ValueKindCombined expressions.

const (
	ValueSourceData       ValueSource = "data"
	ValueSourceAttributes ValueSource = "attributes"
)

type ValueSpec added in v0.8.0

type ValueSpec struct {
	Name       string
	Source     ValueSource `json:",omitempty"`
	Optional   bool        `json:",omitempty"`
	Annotation string      `json:",omitempty"`
	Role       string      `json:",omitempty"`
}

Directories

Path Synopsis
cmd
    newsdoc (command)
internal
