A storage place for samples of all kinds. Typically focused around large-scale collection of phishing kits and other email/website artifacts.

The basic plan at the moment is to treat each submitted archive as a “job”. There’s “known good” jobs and “other” jobs. “Known good” job lots would be something like the Wordpress installer, while “other” would be a backup of a compromised/phishing site. There’s likely lots of commonality between them, but the interesting parts are the differences.
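The job concept above could be sketched as a minimal record type. This is an illustrative sketch only — the class and field names here are assumptions, not dewar's actual data model:

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class Job:
    """One submitted archive. known_good marks reference material
    (e.g. a stock Wordpress installer) vs. suspect submissions."""
    archive: bytes
    known_good: bool = False
    notes: str = ""
    sha256: str = field(init=False)

    def __post_init__(self):
        # Hash the archive up front so every job is addressable by digest.
        self.sha256 = hashlib.sha256(self.archive).hexdigest()

# A "known good" reference job and an "other" job share the same shape;
# the interesting analysis is in diffing their contents.
reference = Job(archive=b"wordpress-installer bytes", known_good=True)
suspect = Job(archive=b"compromised-site-backup bytes")
```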

[![Build Status](https://travis-ci.org/…)](…)
If you have any kind of suggestion or issue, please create a github issue - I’ll gladly discuss it. Pull requests for features or fixes are even better :)


Starting the web interface:

pipenv install 
pipenv run python -m dewar web

Starting the ingestor (not really working yet):

pipenv run python -m dewar ingestor

Internal “element” types

Random thoughts

Various bits to build

  1. ingestion methods:
    • watch a bucket:
      1. “known_good” - automatically tagged as good
      2. “other” - known_good = False
    • [ ] have a simple API for submitting files, part of the frontend
  2. ingestion pipelines:
    • [ ] simple single-threaded widget
    • [ ] pubsub queue with multiple nodes doing things
  3. storage backends
    • s3
    • local filesystem
  4. metadata backends
    • tinydb
    • postgresql? (not on my )
    • other?
  5. processing of samples
    • extraction of IOCs like urls, emails, IP addresses etc.
    • hilariously simple tokenization
    • image normalisation? (phistOfFury?)
    • ssdeep?
    • words/phrases etc
  6. processing pipelines
    • single job queue, processing tasks
    • pubsub multithreaded clustered hilarity
  7. Data interaction
    • website frontend for:
      • seeing the incoming file bucket contents
      • manually processing incoming jobs - in case you want to insert notes as you go, etc.
      • seeing the list of historical jobs
      • editing job data (typically only notes?)
    • upload jobs:
      • [ ] HTTP API
      • shoving files into the job buckets
      • submitting jobs
    • querying job data?
    • querying hashes:
      • have we seen this?
      • extended - which jobs was this seen in, for correlation
  8. AAA…
    • is scary bizness
    • Flask HTTP Basic auth on the frontends
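The IOC-extraction step under “processing of samples” could start as simple regex passes. A rough sketch — the patterns are deliberately naive first-pass matchers, not dewar's real extractor:

```python
import re

# Deliberately simple patterns: good enough for a first pass,
# not RFC-compliant parsers.
URL_RE = re.compile(r"https?://[^\s\"'<>,]+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def extract_iocs(text: str) -> dict:
    """Pull URLs, email addresses, and IPv4 addresses out of sample text."""
    return {
        "urls": sorted(set(URL_RE.findall(text))),
        "emails": sorted(set(EMAIL_RE.findall(text))),
        "ips": sorted(set(IPV4_RE.findall(text))),
    }

sample = "POST to http://evil.example/gate.php, mail admin@example.com from 203.0.113.7"
iocs = extract_iocs(sample)
```

Deduplicating and sorting keeps the output stable, which matters once results are stored as job metadata and diffed between runs.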
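The “have we seen this hash” and “which jobs was this seen in” queries could both be served by one inverted index from digest to job ids. A stdlib-only sketch — class and method names are hypothetical, and a real metadata backend would persist this rather than hold it in memory:

```python
import hashlib
from collections import defaultdict

class HashIndex:
    """Maps sha256 digest -> set of job ids that contained that file."""

    def __init__(self):
        self._index = defaultdict(set)

    def add(self, job_id: str, file_bytes: bytes) -> str:
        """Record that job_id contained this file; return its digest."""
        digest = hashlib.sha256(file_bytes).hexdigest()
        self._index[digest].add(job_id)
        return digest

    def seen(self, digest: str) -> bool:
        """Basic query: have we seen this hash at all?"""
        return digest in self._index

    def jobs_for(self, digest: str) -> set:
        """Extended query: which jobs shared this file, for correlation."""
        return set(self._index.get(digest, set()))

idx = HashIndex()
d = idx.add("job-001", b"kit/index.php contents")
idx.add("job-007", b"kit/index.php contents")  # same file in a second kit
```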

Starting a new backend implementation

An example would be a Storage backend. The “base” template defines the interface, and every Storage backend should expose a class named Storage, imported as from <backend> import Storage, so backends can be used interchangeably. The S3 implementation is then imported the same way.

Methods that storage backends should support (inspired by HTTP verbs)
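One way to pin that verb-inspired contract down is an abstract base class. This is a sketch of what such a base template might look like — the method names are assumptions drawn from the HTTP-verb analogy, not dewar's actual interface:

```python
from abc import ABC, abstractmethod

class BaseStorage(ABC):
    """Contract for storage backends, loosely following HTTP verbs.
    Concrete backends (s3, local filesystem) subclass this and
    export their class as `Storage` from their own module."""

    @abstractmethod
    def get(self, key: str) -> bytes:
        """Fetch a stored object's bytes."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None:
        """Store (or overwrite) an object."""

    @abstractmethod
    def delete(self, key: str) -> None:
        """Remove an object."""

    @abstractmethod
    def head(self, key: str) -> bool:
        """Cheap existence check, without fetching the body."""

class Storage(BaseStorage):
    """In-memory backend, handy for tests."""

    def __init__(self):
        self._blobs = {}

    def get(self, key: str) -> bytes:
        return self._blobs[key]

    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data

    def delete(self, key: str) -> None:
        del self._blobs[key]

    def head(self, key: str) -> bool:
        return key in self._blobs

store = Storage()
store.put("jobs/abc123/kit.zip", b"archive bytes")
```

Keeping the class name Storage in every backend module is what makes the from <backend> import Storage convention above work.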

Metadata backends should support