Search Optimisation Notes

1 Topics

1.1 How can we make things faster?

  • Change the physics
  • Reduce the amount of work done

In distributed Splunk environments particularly:

  • How can we ensure as much work as possible is distributed?
  • How can we ensure as little data as possible is moved?

1.2 Search Scoping

  • If a search is a pipeline, reducing the volume of data entering the pipe has the greatest effect, as every subsequent step does less work

1.2.1 Time range

  • Description
    Inside every Splunk index are buckets, which contain events. Buckets are organised by start and end times, so the shorter your time range, the fewer buckets are read and the faster the search will be.
  • Diagnostic
    Look for searches with All Time search ranges
  • Good practice
    Scope to an appropriate time range (Last 60 minutes, Today, Week to Date). Use earliest=/latest= on the search bar to search against _time (the extracted timestamp of the event), or _index_earliest=/_index_latest= to search against _indextime (the time when Splunk received the event); the time picker affects _time. See the sketch after this list.
  • Speedup Metric
    Divisor: 30-365. Percent of original runtime: 3%. Speedup: 97%.
  • Example
    Change Time Picker from All Time to Week to Date.
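
For instance, a minimal sketch of time scoping on the search bar (the index and sourcetype names are illustrative):

    index=web sourcetype=access_combined earliest=-60m@m latest=now

Or, to scope against _indextime instead:

    index=web sourcetype=access_combined _index_earliest=-60m _index_latest=now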

1.2.2 Scope to required index, sourcetype or source

  • Description
    index is a special field which controls which physical disk locations will be read to find results. Default index setups often include 'All non-internal indexes', which is convenient for interactive use, but reports, dashboards, and forms should be scoped more tightly. All events in Splunk have sourcetype and source fields, and including these in your searches improves speed and precision.
  • Diagnostic
    Look for searches without an explicit index= clause
  • Good practice
    Scope to a specific set of indexes and sourcetypes. Eventtypes can be used to group multiple sourcetypes.
  • Speedup metric
    Divisor: 2-10. Percent of original runtime: 50%. Speedup: 50%.
  • Example
    MID=*

    -> index=cisco sourcetype=cisco:esa:textmail MID=*

    Or, using an eventtype:

    index=cisco eventtype=cisco_esa_email

    where the eventtype is defined as:

    (sourcetype="cisco:esa:textmail" OR sourcetype=cisco:esa:legacy) AND (MID OR ICID OR DCID)
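
    A minimal eventtypes.conf sketch for the definition above (file placement is assumed):

    [cisco_esa_email]
    search = (sourcetype="cisco:esa:textmail" OR sourcetype=cisco:esa:legacy) AND (MID OR ICID OR DCID)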

1.2.3 Search Modes

Use Verbose Mode sparingly, while developing searches. Fast Mode is good for event searches where you know exactly what you are looking for. Splunk does less work in Fast Mode: it extracts only the fields you have requested and uses only the lookups explicitly required.

  • Diagnostic
    Search job property request.custom.display.page.search.mode = verbose (see the sketch after this list)
  • Good Practice
    Use Smart or Fast Mode (dashboards and reports do this automatically)
  • Example
    Verbose Mode -> Fast Mode
  • Speedup Metric
    Divisor: 2-5. Percent of original runtime: 50%. Speedup: 50%.
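
A hedged diagnostic sketch using the rest command to list verbose-mode search jobs (the job property is as named above; exact field availability varies by Splunk version):

    | rest /services/search/jobs
    | search request.custom.display.page.search.mode=verbose
    | table sid label request.custom.display.page.search.mode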

1.2.4 Inclusionary search terms

Splunk maintains an index of the terms contained inside events. Inclusionary search terms specify which events to select, letting Splunk read just those specific events; exclusionary search terms require selecting the inverse, which is harder. It's OK to have a series of inclusionary terms followed by some specific exclusions, but it's better to avoid a series of exclusionary terms where possible. (A good counterexample is the exclusion of known errors, where you have to go broad with inclusion and specific with exclusion.) Tagged eventtypes are often useful for maintaining commonly used inclusions or exclusions (see below).
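
As a sketch of that eventtype approach, a commonly used set of known-error exclusions could be maintained once in eventtypes.conf (the names here are hypothetical):

    [known_benign_errors]
    search = sourcetype=app:logs ("ERR-1001" OR "ERR-1002")

and then applied as index=app sourcetype=app:logs error NOT eventtype=known_benign_errors.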

  • Diagnostic
    A large scan count compared to the final event count
  • Good Practice
    Mostly inclusionary terms, with few or no exclusionary terms
  • Example
    sourcetype=cisco:esa:textmail  NOT rewritten  NOT ACCEPT  NOT RELAY
    NOT Response  NOT matched  NOT From  NOT "Message-ID"  NOT New  NOT
    RID  NOT done  NOT Logfile  NOT Quarantine  NOT Outbreak  NOT Subject
    NOT engine  NOT close  NOT generated  NOT attachment  NOT AV
    

    -> sourcetype=cisco:esa:textmail NXDOMAIN

  • Speedup Metric
    Divisor: 2-20. Percent of original runtime: 50%. Speedup: 50%.

1.2.5 Field usage

  • Description
    Define fields on segmentation boundaries where possible. Splunk breaks events up into tokens, and will try to turn field=value in a search into the bareword token value (which is then verified to belong to the field). More complicated scenarios include fields that come from other fields or that are partial tokens; these can be optimised manually, or configured in fields.conf when the optimisation is global.
  • Diagnostic
    Check that the base lispy in search.log contains the values you are searching for.
  • Good practice
    Repeat tokens as barewords if required, e.g. mid=1234 -> 1234 mid=1234. Configure fields.conf to give Splunk a hint (see the sketch at the end of this section).
  • Example
    Consider an application where client logs are stored in a file named with a session guid that doesn't appear in the events themselves (raw), while server logs have a guid= field that comes from inside the event.
    guid=942032a0-4fd3-11e5-acd9-0002a5d5c5
    

    ->

    (index=server sourcetype=logins 942032a0-4fd3-11e5-acd9-0002a5d5c5
    guid=942032a0-4fd3-11e5-acd9-0002a5d5c5) OR (index=client
    eventtype=client-login
    source=/var/log/client/942032a0-4fd3-11e5-acd9-0002a5d5c5)
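
    Where the optimisation is global, fields.conf can carry the hint instead. A minimal sketch, assuming a field whose values are never valid terms in the index (for example, a value extracted from part of a token; the field name is hypothetical):

    [session_token]
    INDEXED_VALUE = false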
    

1.3 Aside: Efficient Field Extraction

Fields in Splunk are made with:

  • Regular Expressions
  • Indexed extractions (CSV, JSON, W3C)
  • Structured Search Time Parsing (Key=Value, XML, CSV, JSON)
  • Lookups
  • Calculated Fields

1.3.1 Duplicate Structured Fields

  • Description
    Sometimes both indexed extractions and search-time parsing are enabled for a CSV or JSON sourcetype. This repeats work unnecessarily.
  • Diagnostic
    Searches against the data return multivalued fields where only single values are expected
  • Good practice
    Disable the search-time parsing
  • Example
    [my_custom_indexed_csv]
    # required on SH
    KV_MODE=csv
    # required on forwarder
    INDEXED_EXTRACTIONS = CSV
    
    

    ->

    [my_custom_indexed_csv]
    # required on SH
    KV_MODE=none
    # required on forwarder
    INDEXED_EXTRACTIONS = CSV
    

1.3.2 Basic Regular Expression Best Practice

  • Description
    Most fields are extracted by regular expressions. Some regular expression operations perform much better than others (backtracking, for example, is very expensive). Regular expressions run against every single event (sometimes more than once), so even a moderate speedup is a gain.
  • Diagnostic
    kv time in Job Inspector.
  • Good practice
    • Prefer + to * if you know the item must exist (they both have their uses)
    • Extract multiple fields together when they appear together and are most frequently used together
    • Simple expressions are usually better
    • Anchor cleanly on both sides, or at least on one side, preferably the beginning
    • Test and benchmark for accuracy and speed with a regex debugger over typical data
  • Example
    ^[^\]\n]*\]\s+\w+\s+'\d+\.\d+\.\d+\s+(?P<messageid>[^ ]+)

    ->

    ^\S+\s+\d+\s+\d\d:\d\d:\d\d\s+\w+\[\d+\]\s+\w+\s+'\d+\.\d+\.\d+\s+(?P<messageid>[^ ]+)
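
    The same principles apply to ad-hoc extraction; an illustrative rex sketch using the improved pattern above (the surrounding search and final stats are assumptions):

    sourcetype=cisco:esa:textmail
    | rex "^\S+\s+\d+\s+\d\d:\d\d:\d\d\s+\w+\[\d+\]\s+\w+\s+'\d+\.\d+\.\d+\s+(?<messageid>[^ ]+)"
    | stats count by messageid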

1.4 Joining Data

1.4.1 Description

Splunk has a join command, which is often used by people with two kinds of data that they wish to analyse together. It's often less efficient than alternative approaches:

  • join involves setting up a subsearch
  • join joins all the data from search A and search B, when usually we only need a subset
  • join often requires all the data to be brought back to the search head

1.4.2 Diagnostic

Look for searches that use the join command

1.4.3 Good practice

Use lookup tables for reference data (a sketch follows the list below). Use stats to join data by common fields:

  • values(field_name) is great
  • range(_time) is often a good duration
  • dc(sourcetype) is a good way of knowing whether you actually joined things up or only have one end
  • eval can be nested inside your stats expression
  • searchmatch is nice for ad-hoc grouping; you could also use eventtypes if disciplined
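
A minimal sketch of the lookup approach for reference data (the lookup and field names are hypothetical):

    index=web sourcetype=access_combined
    | lookup asset_info ip AS clientip OUTPUT owner department
    | stats count by department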

1.4.4 Example

A | fields TxnId,Queue
| join TxnId [ search B OR C
    | stats min(_time) as start_time, max(_time) as end_time by TxnId
    | eval total_time = end_time - start_time ]
| table total_time,Queue

-> A OR B OR C | stats values(Queue) as Queue range(_time) as duration by TxnId

or with exact semantics: A OR B OR C | stats values(Queue) as Queue range(eval(if(searchmatch("B OR C"), _time, null()))) as duration by TxnId

(from http://answers.splunk.com/answers/204782/how-to-join-two-search-events-that-have-a-common-f.html#answer-204058)

1.5 Transactions - Simple / Replace

1.5.1 Description

Many searches use transaction to join up events that relate to a particular 'transaction'. Often these make use of common unique identifiers and are followed by a table or stats command. (This doesn't cover transactions that use maxspan and maxpause or startswith and endswith, or those with transitory identifiers, which require more thought.)

1.5.2 Diagnostic

Look for searches that use transaction with only a list of fields specified (we are looking for those that don't require keeping state)

1.5.3 Good Practice

Use stats to simulate transaction:

  • Move fields from table into values() or list() expressions
  • Move fields from the transaction command into by clauses
  • Use range(_time) as duration to compute duration

You can use stats ... | stats if required (often easier if sessionising then counting, e.g. unique users per page); see the sketch below.
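
A minimal sketch of that stats ... | stats pattern for unique users per page (field names are hypothetical):

    index=weblogs
    | stats count by sessionid userid uri
    | stats dc(userid) as unique_users by uri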

1.5.4 Example

index=weblogs 
| transaction sessionid 
| stats max(duration) as longest_session min(duration) as shortest_session
  max(eventcount) as max_pages_hit min(eventcount) as min_pages_hit
  count as num_sessions

->

index=weblogs
| stats count as eventcount min(_time) as min_time max(_time) as max_time by sessionid
| eval duration=max_time-min_time
| stats max(duration) as longest_session min(duration) as shortest_session
  max(eventcount) as max_pages_hit min(eventcount) as min_pages_hit
  count as num_sessions

1.6 Transactions - Optimise in place

1.6.1 Description

Not every transaction can easily be replaced with a stats command. Instead of replacing its usage, we can make the transaction easier for Splunk to compute.

1.6.2 Diagnostic

Look for searches that use transaction with span options and/or multiple transitory fields, but without a prior fields command

1.6.3 Good practice

  • Add as many restrictions to the transaction as possible (maxspan, maxpause, maxevents)
  • Prefilter the fields available to the transaction to just those required
  • Be specific in required outputs (use a post-transaction table or stats command)

1.6.4 Example

index=oidemo (sourcetype=radius OR sourcetype=access_custom)
| transaction maxspan=1m clientip mdn

->

index=oidemo (sourcetype=radius OR sourcetype=access_custom)
| fields bcuri mdn clientip
| transaction maxspan=1m clientip mdn
| table bcuri mdn

1.7 Using tstats against indexed fields and data models

1.7.1 Description

  • When using indexed extractions, data can be queried with tstats, allowing you to produce stats directly without a prior search
  • Similarly, data models can be queried with tstats (with a large speedup on accelerated data models)
  • Bonus: tstats is available against host, source, sourcetype, and _time for all data (see also the metadata command)

1.7.2 Diagnostic

Build accelerated data models where:

  • the data is commonly examined as an entire data set, as opposed to searched for rare terms
  • the structure is well known in advance

Indexed extractions will consume additional disk space

1.7.3 Good Practice

  • Build searches directly with tstats where possible (see the sketch after this list)
  • Use tstats, then pipe to other commands as required
  • tstats goes faster the fewer columns it reads
  • It is usually more efficient and simpler to run multiple queries than one datacube-style search (the accelerated data model already acts as your datacube)
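
A minimal tstats sketch, assuming the CIM Web data model is built and accelerated:

    | tstats count from datamodel=Web by Web.status
    | sort - count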

1.7.4 Speedup metric

10-100x

1.7.5 Example

Finding disappearing source/sourcetype/host combinations:

index=* | stats count earliest(_time) as et latest(_time) as lt by host source sourcetype
| where lt < relative_time(now(),"-1d")

->

| tstats count earliest(_time) as et latest(_time) as lt by host source sourcetype
| where lt < relative_time(now(),"-1d")

Date: 2015-09-22T16:30-0700

Author: Duncan Turnbull
