Rewriting the queue #78

Merged
merged 107 commits into from
Aug 4, 2024
107 commits
fd106db
Initial commit for frontier rewrite
CorentinB Jul 12, 2024
96a4e38
Replace mmap files with "regular" files
CorentinB Jul 12, 2024
1090690
Remove useless stat
CorentinB Jul 12, 2024
1c41124
Remove dequeue recursion
CorentinB Jul 12, 2024
a0045d3
Fix: dequeue timeout mechanism
CorentinB Jul 12, 2024
487319d
Big brain modifications of queue & enqueue with signal
CorentinB Jul 12, 2024
3a70670
Add: enqueue tests
CorentinB Jul 12, 2024
a2fb09d
Increment Prometheus counter only when --api is specified
CorentinB Jul 12, 2024
52d751a
WIP: Dequeue test
CorentinB Jul 12, 2024
5bc73f1
Remove timeout for Dequeue
CorentinB Jul 12, 2024
307cc9c
Remove any mention of 'frontier'
CorentinB Jul 12, 2024
2abaf8e
Move stats updates to separate functions
CorentinB Jul 12, 2024
ea748ea
Remove a test that doesn't make sense in a root env (like in a Docker…
CorentinB Jul 12, 2024
2da5d78
Add: large scale test (doesn't pass yet)
CorentinB Jul 12, 2024
61b7b9c
Restore previous version of Dequeue
CorentinB Jul 12, 2024
d81a5ff
chore: align structures to save space
equals215 Jul 13, 2024
a1f0efd
feat: remove the dequeue recusion
equals215 Jul 13, 2024
de41b1d
fix: crash when --api is not set (#79)
NGTmeaty Jul 15, 2024
7423728
Increment Prometheus counter only when --api is specified
CorentinB Jul 12, 2024
cc9eed1
WIP - Switching to Protobuf
CorentinB Jul 16, 2024
2786d62
Fix: dequeue properly
CorentinB Jul 16, 2024
bbdfa78
Use new NewItem function
CorentinB Jul 16, 2024
cea09f0
Remove Gob queue encoder/decoder
CorentinB Jul 16, 2024
7cb11da
Move item stuff to a separate file
CorentinB Jul 16, 2024
9427e39
chore: moved protobuf to a new package to respect conventions
equals215 Jul 16, 2024
9c1a579
Fix: cond wait lock problem
CorentinB Jul 16, 2024
25602e7
WIP fix parallel test
CorentinB Jul 16, 2024
42386e5
item encoding revamped
equals215 Jul 16, 2024
1ea1232
reflect changes on crawl.go
equals215 Jul 16, 2024
c07731a
Removing useless boolean comparison
CorentinB Jul 19, 2024
90bc8bd
Merge branch 'main' into queue
equals215 Jul 19, 2024
53b329c
chore: second pass for merge
equals215 Jul 19, 2024
708fd86
Merge branch 'queue' of github.com:internetarchive/Zeno into queue
equals215 Jul 19, 2024
5ccc9e1
chore: removed relicate
equals215 Jul 19, 2024
19cb96d
Fix TestParallelEnqueueDequeue
CorentinB Jul 19, 2024
ab683ee
Remove the very useless LoggingChan
CorentinB Jul 19, 2024
c0ad1c9
Remove sync.Cond & rewrite parallel test
CorentinB Jul 20, 2024
a7ac043
Fix: TestNewItem
CorentinB Jul 20, 2024
2081aa3
Typo: TestNewItem
CorentinB Jul 20, 2024
56ab6ef
Merge branch 'main' into queue
equals215 Jul 21, 2024
e50b979
fix: some leftovers unseen errors from main merge
equals215 Jul 21, 2024
48a2817
WIP on index
equals215 Jul 22, 2024
ac520c4
index MVP - to be tested __thoroughly__
equals215 Jul 22, 2024
2998afc
WIP changes
equals215 Jul 23, 2024
1e0db1a
enhancements for WAL + start of tests
equals215 Jul 23, 2024
2a3075b
fix: corrected WAL fd inconsistencies thanks to new tests
equals215 Jul 23, 2024
a39c0d5
add more tests and corrected truncate accordingly
equals215 Jul 23, 2024
7c6b106
feat: index can recover after non-graceful exit (still need to check …
equals215 Jul 23, 2024
0812bed
prepare to merge and apply Corentin live review
equals215 Jul 23, 2024
d0092e4
feat: remove index/file_io.go:performDump() printf
equals215 Jul 23, 2024
ed7f6cf
implemented requested changes
equals215 Jul 23, 2024
f11170c
feat: enhance periodicDump routine
equals215 Jul 23, 2024
3221141
fix: periodicDump() missing a stat check
equals215 Jul 23, 2024
c602dcf
Persist & load queue stats (#93)
CorentinB Jul 23, 2024
7ce49a5
fix: queue-stats merge typo
equals215 Jul 24, 2024
85682e7
Merge remote-tracking branch 'origin/main' into queue
CorentinB Jul 25, 2024
3000dfa
tests: add some benchmarking for queue and index
equals215 Jul 25, 2024
11d67e9
fix: temp files not contained in the same directory as Zeno & index f…
equals215 Jul 25, 2024
2d56d81
fix: edge-case of queue fatigue: queue depletes over time when not en…
equals215 Jul 26, 2024
8e724c4
add queue empty bool state to live stats
equals215 Jul 26, 2024
67c3384
Implement host rotation and Enqueue/Dequeue access regulation via ato…
equals215 Jul 26, 2024
98ec9fa
fix: open the WAL file as O_SYNC and get rid of file.Sync() calls con…
equals215 Jul 27, 2024
ccb6ac4
fix: reduce sleeps in worker
equals215 Jul 27, 2024
65c0cf3
feat: make Dequeue non-blocking and manage the wait on the worker sid…
equals215 Jul 27, 2024
73ac185
fix: TestDequeue expected return of a deprecated error
equals215 Jul 27, 2024
1f4d8a6
fix: TestPersistentGroupedQueue_Close expected return of a deprecated…
equals215 Jul 27, 2024
49f4989
feat: add `TEMP_DIR` env variable support for queue benchmarks
equals215 Jul 29, 2024
2a8aa48
add: TestBatchEnqueue
CorentinB Jul 29, 2024
d2e0503
modify HQConsumer to batch enqueue
CorentinB Jul 29, 2024
97c64ef
change HQ consumption logic
CorentinB Jul 29, 2024
cb68078
fix: TestEnqueue & TestBatchEnqueue
CorentinB Jul 29, 2024
d78a187
rework HQConsumer pause logic a bit
CorentinB Jul 29, 2024
63bafd7
fix: many tests
CorentinB Jul 29, 2024
2ed7b9c
fix: resuming queue couldn't start due to broken logic: queue.Empty w…
equals215 Jul 29, 2024
6c0006d
enhancement: close all workers in parallel
CorentinB Jul 29, 2024
ae2056d
fix: TestQueueEmptyBool
CorentinB Jul 29, 2024
fae5403
fix: index.Pop logic error: WAL writing unnamed goroutine couldn't re…
equals215 Jul 29, 2024
ca10b08
feat: implemented a handover process so that batch enqueue process ca…
equals215 Jul 30, 2024
d6480c9
fix: infinite for loop on BatchEnqueue()
equals215 Jul 30, 2024
12e9aea
make workers yield more frequently
equals215 Jul 30, 2024
01a0e22
feat: make outlinks batch enqueueable
equals215 Jul 30, 2024
e8ed999
feat: add handover stats
equals215 Jul 30, 2024
c9c2a2c
optimize `get list` loading performance (#104)
yzqzss Aug 1, 2024
4ce3cde
Remove `runtime.Gosched()` in polling (#105)
yzqzss Aug 1, 2024
09ddbc7
Queue handover v2 (#106)
equals215 Aug 1, 2024
7029d7e
slight enhancements of handover and adjustments of workers sleeps
equals215 Aug 2, 2024
6a34cac
fix: infinite for loop when finishing
equals215 Aug 2, 2024
beac4a4
chore: Set `Content-Type` header for json response
yzqzss Aug 1, 2024
bc6f381
Improve WAL concurrency performance through the commit mechanism
yzqzss Aug 2, 2024
538e2ab
tests: update
yzqzss Aug 2, 2024
37744a7
fix: QueueStats data race
yzqzss Aug 2, 2024
7a64613
fix: deadlock :(
yzqzss Aug 3, 2024
1c80c19
push the seed list to the queue in batches
yzqzss Aug 3, 2024
be8b9b4
typo: committed
yzqzss Aug 3, 2024
04e02e6
Gracefully shutdown walCommitsSyncer()
yzqzss Aug 3, 2024
5f81a4d
chore
yzqzss Aug 3, 2024
f56c463
Merge pull request #107 from saveweb/queue-commit-await
equals215 Aug 3, 2024
9d179ab
feat: made @yzqzss commit feature optional by flag
equals215 Aug 3, 2024
f633a8f
fix: forgot to separate Dequeue() function
equals215 Aug 3, 2024
5eaefe1
chore: rm uint64 init to 0
equals215 Aug 3, 2024
48dc14b
feat: check if items have been written before trying to commit WAL
equals215 Aug 3, 2024
c1ff8f0
forgot to save before commit
equals215 Aug 3, 2024
03b2877
fix: changes asked by yzqzss
equals215 Aug 3, 2024
15df42d
Merge pull request #109 from internetarchive/queue-WAL-write-batches
equals215 Aug 3, 2024
2c68252
fix: hanging on index.Close()
yzqzss Aug 3, 2024
e09bf35
Merge pull request #110 from yzqzss/patch-3
equals215 Aug 4, 2024
b404d69
Merge branch 'main' into queue
equals215 Aug 4, 2024
1 change: 0 additions & 1 deletion .github/workflows/go.yml
@@ -22,7 +22,6 @@ jobs:
 run: go build -v ./...

 - name: Test
-continue-on-error: true
 run: go test -race -v ./...

 - name: Goroutine leak detector
3 changes: 2 additions & 1 deletion .gitignore
@@ -4,4 +4,5 @@ Zeno
*.txt
*.sh
zeno.log
-.vscode/
+.vscode/
+*.py
6 changes: 4 additions & 2 deletions cmd/get.go
@@ -30,7 +30,7 @@ func getCMDsFlags(getCmd *cobra.Command) {
getCmd.PersistentFlags().String("job", "", "Job name to use, will determine the path for the persistent queue, seencheck database, and WARC files.")
getCmd.PersistentFlags().IntP("workers", "w", 1, "Number of concurrent workers to run.")
getCmd.PersistentFlags().Int("max-concurrent-assets", 8, "Max number of concurrent assets to fetch PER worker. E.g. if you have 100 workers and this setting at 8, Zeno could do up to 800 concurrent requests at any time.")
-getCmd.PersistentFlags().Uint("max-hops", 0, "Maximum number of hops to execute.")
+getCmd.PersistentFlags().Uint8("max-hops", 0, "Maximum number of hops to execute.")
getCmd.PersistentFlags().String("cookies", "", "File containing cookies that will be used for requests.")
getCmd.PersistentFlags().Bool("keep-cookies", false, "Keep a global cookie jar")
getCmd.PersistentFlags().Bool("headless", false, "Use headless browsers instead of standard GET requests.")
@@ -56,6 +56,8 @@ func getCMDsFlags(getCmd *cobra.Command) {
getCmd.PersistentFlags().StringSlice("exclude-string", []string{}, "Discard any (discovered) URLs containing this string.")
getCmd.PersistentFlags().Bool("random-local-ip", false, "Use random local IP for requests. (will be ignored if a proxy is set)")
getCmd.PersistentFlags().Int("min-space-required", 20, "Minimum space required in GB to continue the crawl.")
+getCmd.PersistentFlags().Bool("handover", false, "Use handover mechanism to dispatch URLs via a buffer before enqueuing on disk.")
+getCmd.PersistentFlags().Bool("batch-write-WAL", false, "Use commited batch writes to the WAL. (can cause more seed loss on crash)")

// Proxy flags
getCmd.PersistentFlags().String("proxy", "", "Proxy to use when requesting pages.")
@@ -88,7 +90,7 @@ func getCMDsFlags(getCmd *cobra.Command) {
// Aliases shouldn't be used as proper flags nor declared in the config struct
// Aliases should be marked as deprecated to inform the user base
// Aliases values should be copied to the proper flag in the config/config.go:handleFlagsAliases() function
-getCmd.PersistentFlags().Uint("hops", 0, "Maximum number of hops to execute.")
+getCmd.PersistentFlags().Uint8("hops", 0, "Maximum number of hops to execute.")
getCmd.PersistentFlags().MarkDeprecated("hops", "use --max-hops instead")
getCmd.PersistentFlags().MarkHidden("hops")
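
The `--handover` flag is described only as dispatching URLs "via a buffer before enqueuing on disk". A minimal sketch of that idea follows, assuming a one-slot channel as the buffer. The `handover` type, `tryPut`, `tryGet`, `dispatch`, and the in-memory `diskQueue` slice are hypothetical stand-ins for illustration, not Zeno's actual implementation.

```go
package main

import "fmt"

// handover is a hypothetical one-slot buffer that lets a producer pass
// a URL straight to a consumer instead of writing it to disk first.
type handover struct {
	slot chan string
}

func newHandover() *handover {
	return &handover{slot: make(chan string, 1)}
}

// tryPut offers a URL to the buffer without blocking.
// It reports false when the slot is already occupied.
func (h *handover) tryPut(u string) bool {
	select {
	case h.slot <- u:
		return true
	default:
		return false
	}
}

// tryGet takes a URL from the buffer without blocking.
// It reports false when the slot is empty.
func (h *handover) tryGet() (string, bool) {
	select {
	case u := <-h.slot:
		return u, true
	default:
		return "", false
	}
}

// dispatch sends each URL through the buffer when there is room and
// falls back to the disk queue (an in-memory slice here) otherwise.
func dispatch(h *handover, urls []string, diskQueue *[]string) {
	for _, u := range urls {
		if !h.tryPut(u) {
			*diskQueue = append(*diskQueue, u)
		}
	}
}

func main() {
	h := newHandover()
	var disk []string
	dispatch(h, []string{"https://a.example", "https://b.example"}, &disk)

	u, ok := h.tryGet()
	fmt.Println("handed over:", u, ok)
	fmt.Println("written to disk queue:", disk)
}
```

The non-blocking `select` keeps the producer from stalling when no consumer is ready, which is one plausible reason to pair a buffer with an on-disk fallback.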

4 changes: 2 additions & 2 deletions cmd/get_list.go
@@ -4,7 +4,7 @@ import (
"fmt"

"github.com/internetarchive/Zeno/internal/pkg/crawl"
-"github.com/internetarchive/Zeno/internal/pkg/frontier"
+"github.com/internetarchive/Zeno/internal/pkg/queue"
"github.com/spf13/cobra"
)

@@ -32,7 +32,7 @@ var getListCmd = &cobra.Command{
}

// Initialize initial seed list
-crawl.SeedList, err = frontier.IsSeedList(args[0])
+crawl.SeedList, err = queue.FileToItems(args[0])
if err != nil || len(crawl.SeedList) <= 0 {
crawl.Log.WithFields(map[string]interface{}{
"input": args[0],
12 changes: 10 additions & 2 deletions cmd/get_url.go
@@ -5,7 +5,7 @@ import (
"net/url"

"github.com/internetarchive/Zeno/internal/pkg/crawl"
-"github.com/internetarchive/Zeno/internal/pkg/frontier"
+"github.com/internetarchive/Zeno/internal/pkg/queue"
"github.com/spf13/cobra"
)

@@ -43,7 +43,15 @@ var getURLCmd = &cobra.Command{
return err
}

-crawl.SeedList = append(crawl.SeedList, *frontier.NewItem(input, nil, "seed", 0, "", false))
+item, err := queue.NewItem(input, nil, "seed", 0, "", false)
+if err != nil {
+crawl.Log.WithFields(map[string]interface{}{
+"input_url": arg,
+"err": err.Error(),
+}).Error("Failed to create new item")
+return err
+}
+crawl.SeedList = append(crawl.SeedList, *item)
}

// Start crawl
8 changes: 5 additions & 3 deletions config/config.go
@@ -43,9 +43,9 @@ type Config struct {
ElasticSearchURLs []string `mapstructure:"es-url"`
WorkersCount int `mapstructure:"workers"`
MaxConcurrentAssets int `mapstructure:"max-concurrent-assets"`
-MaxHops uint `mapstructure:"max-hops"`
-MaxRedirect int `mapstructure:"max-redirect"`
-MaxRetry int `mapstructure:"max-retry"`
+MaxHops uint8 `mapstructure:"max-hops"`
+MaxRedirect uint8 `mapstructure:"max-redirect"`
+MaxRetry uint8 `mapstructure:"max-retry"`
HTTPTimeout int `mapstructure:"http-timeout"`
MaxConcurrentRequestsPerDomain int `mapstructure:"max-concurrent-per-domain"`
ConcurrentSleepLength int `mapstructure:"concurrent-sleep-length"`
@@ -74,6 +74,8 @@ type Config struct {
HQContinuousPull bool `mapstructure:"hq-continuous-pull"`
HQRateLimitSendBack bool `mapstructure:"hq-rate-limiting-send-back"`
NoStdoutLogging bool `mapstructure:"no-stdout-log"`
+Handover bool `mapstructure:"handover"`
+BatchWriteWAL bool `mapstructure:"batch-write-WAL"`
}

var (
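
`MaxHops`, `MaxRedirect`, and `MaxRetry` are narrowed to `uint8` in this file, so each setting can only hold values from 0 to 255, and the two that were `int` can no longer be negative. A small sketch of the range check this implies, using `strconv.ParseUint` with a bit size of 8, follows. `parseLimit` is a hypothetical helper written for illustration and is not part of Zeno or its flag library.

```go
package main

import (
	"fmt"
	"strconv"
)

// parseLimit converts a decimal string into a uint8, returning an error
// for negative numbers and for values above 255.
func parseLimit(s string) (uint8, error) {
	v, err := strconv.ParseUint(s, 10, 8)
	if err != nil {
		return 0, err
	}
	return uint8(v), nil
}

func main() {
	for _, s := range []string{"0", "255", "256", "-1"} {
		v, err := parseLimit(s)
		if err != nil {
			fmt.Printf("%q rejected: %v\n", s, err)
			continue
		}
		fmt.Printf("%q accepted as %d\n", s, v)
	}
}
```

Any configuration that previously relied on a value above 255 or on a negative sentinel for these settings would need to be revisited after this change.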
11 changes: 5 additions & 6 deletions go.mod
@@ -7,7 +7,6 @@ require (
github.com/CorentinB/warc v0.8.40
github.com/PuerkitoBio/goquery v1.9.2
github.com/asaskevich/govalidator v0.0.0-20230301143203-a9d515a09cc2
github.com/beeker1121/goque v2.1.0+incompatible
github.com/clbanning/mxj/v2 v2.7.0
github.com/dustin/go-humanize v1.0.1
github.com/elastic/go-elasticsearch/v8 v8.14.0
@@ -25,9 +24,11 @@ require (
github.com/spf13/viper v1.19.0
github.com/stretchr/testify v1.9.0
github.com/telanflow/cookiejar v0.0.0-20190719062046-114449e86aa5
github.com/tomnomnom/linkheader v0.0.0-20180905144013-02ca5825eb80
github.com/zeebo/xxh3 v1.0.2
go.uber.org/goleak v1.3.0
golang.org/x/net v0.26.0
golang.org/x/net v0.27.0
google.golang.org/protobuf v1.34.2
mvdan.cc/xurls/v2 v2.5.0
)

@@ -52,7 +53,6 @@ require (
github.com/inconshreveable/mousetrap v1.1.0 // indirect
github.com/json-iterator/go v1.1.12 // indirect
github.com/klauspost/compress v1.17.9 // indirect
github.com/klauspost/cpuid/v2 v2.2.8 // indirect
github.com/klauspost/pgzip v1.2.6 // indirect
github.com/magiconair/properties v1.8.7 // indirect
github.com/mattn/go-colorable v0.1.13 // indirect
@@ -84,12 +84,11 @@ require (
go.opentelemetry.io/otel/trace v1.28.0 // indirect
go.uber.org/atomic v1.9.0 // indirect
go.uber.org/multierr v1.9.0 // indirect
golang.org/x/crypto v0.24.0 // indirect
golang.org/x/crypto v0.25.0 // indirect
golang.org/x/exp v0.0.0-20240506185415-9bf2ced13842 // indirect
golang.org/x/sync v0.7.0 // indirect
golang.org/x/sys v0.21.0 // indirect
golang.org/x/sys v0.22.0 // indirect
golang.org/x/text v0.16.0 // indirect
google.golang.org/protobuf v1.34.2 // indirect
gopkg.in/ini.v1 v1.67.0 // indirect
gopkg.in/yaml.v3 v3.0.1 // indirect
)
20 changes: 6 additions & 14 deletions go.sum
@@ -12,8 +12,6 @@ github.com/armon/go-socks5 v0.0.0-20160902184237-e75332964ef5 h1:0CwZNZbxp69SHPd
github.com/armon/go-socks5 v0.0.0-20160902184237-e75332964ef5/go.mod h1:wHh0iHkYZB8zMSxRWpUBQtwG5a7fFgvEO+odwuTv2gs=
github.com/asaskevich/govalidator v0.0.0-20230301143203-a9d515a09cc2 h1:DklsrG3dyBCFEj5IhUbnKptjxatkF07cF2ak3yi77so=
github.com/asaskevich/govalidator v0.0.0-20230301143203-a9d515a09cc2/go.mod h1:WaHUgvxTVq04UNunO+XhnAqY/wQc+bxr74GqbsZ/Jqw=
github.com/beeker1121/goque v2.1.0+incompatible h1:m5pZ5b8nqzojS2DF2ioZphFYQUqGYsDORq6uefUItPM=
github.com/beeker1121/goque v2.1.0+incompatible/go.mod h1:L6dOWBhDOnxUVQsb0wkLve0VCnt2xJW/MI8pdRX4ANw=
github.com/beorn7/perks v1.0.1 h1:VlbKKnNfV8bJzeqoa4cOKqO6bYr3WgKZxO8Z16+hsOM=
github.com/beorn7/perks v1.0.1/go.mod h1:G2ZrVWU2WbWT9wwq4/hrbKbnv/1ERSJQ0ibhJ6rlkpw=
github.com/cespare/xxhash/v2 v2.3.0 h1:UL815xU9SqsFlibzuggzjXhog7bL6oX9BbNZnL2UFvs=
@@ -78,8 +76,6 @@ github.com/json-iterator/go v1.1.12 h1:PV8peI4a0ysnczrg+LtxykD8LfKY9ML6u2jnxaEnr
github.com/json-iterator/go v1.1.12/go.mod h1:e30LSqwooZae/UwlEbR2852Gd8hjQvJoHmT4TnhNGBo=
github.com/klauspost/compress v1.17.9 h1:6KIumPrER1LHsvBVuDa0r5xaG0Es51mhhB9BQB2qeMA=
github.com/klauspost/compress v1.17.9/go.mod h1:Di0epgTjJY877eYKx5yC51cX2A2Vl2ibi7bDH9ttBbw=
github.com/klauspost/cpuid/v2 v2.2.8 h1:+StwCXwm9PdpiEkPyzBXIy+M9KUb4ODm0Zarf1kS5BM=
github.com/klauspost/cpuid/v2 v2.2.8/go.mod h1:Lcz8mBdAVJIBVzewtcLocK12l3Y+JytZYpaMropDUws=
github.com/klauspost/pgzip v1.2.6 h1:8RXeL5crjEUFnR2/Sn6GJNWtSQ3Dk8pq4CL3jvdDyjU=
github.com/klauspost/pgzip v1.2.6/go.mod h1:Ch1tH69qFZu15pkjo5kYi6mth2Zzwzt50oCQKQE9RUs=
github.com/kr/pretty v0.3.1 h1:flRD4NNwYAUpkphVc1HcthR4KEIFJ65n8Mw5qdRn3LE=
@@ -183,10 +179,6 @@ github.com/syndtr/goleveldb v1.0.0/go.mod h1:ZVVdQEZoIme9iO1Ch2Jdy24qqXrMMOU6lpP
github.com/telanflow/cookiejar v0.0.0-20190719062046-114449e86aa5 h1:gTQl5nPlc9B53vFOKM8aJHwxB2BW2kM49PVR5526GBg=
github.com/telanflow/cookiejar v0.0.0-20190719062046-114449e86aa5/go.mod h1:qNgA5MKwTh103SxGTooqZMiKxZTaV9UV3KjN7I7Drig=
github.com/yuin/goldmark v1.4.13/go.mod h1:6yULJ656Px+3vBD8DxQVa3kxgyrAnzto9xy5taEt/CY=
github.com/zeebo/assert v1.3.0 h1:g7C04CbJuIDKNPFHmsk4hwZDO5O+kntRxzaUoNXj+IQ=
github.com/zeebo/assert v1.3.0/go.mod h1:Pq9JiuJQpG8JLJdtkwrJESF0Foym2/D9XMU5ciN/wJ0=
github.com/zeebo/xxh3 v1.0.2 h1:xZmwmqxHZA8AI603jOQ0tMqmBr9lPeFwGg6d+xy9DC0=
github.com/zeebo/xxh3 v1.0.2/go.mod h1:5NWz9Sef7zIDm2JHfFlcQvNekmcEl9ekUZQQKCYaDcA=
go.opentelemetry.io/otel v1.28.0 h1:/SqNcYk+idO0CxKEUOtKQClMK/MimZihKYMruSMViUo=
go.opentelemetry.io/otel v1.28.0/go.mod h1:q68ijF8Fc8CnMHKyzqL6akLO46ePnjkgfIMIjUIX9z4=
go.opentelemetry.io/otel/metric v1.28.0 h1:f0HGvSl1KRAU1DLgLGFjrwVyismPlnuU6JD6bOeuA5Q=
@@ -203,8 +195,8 @@ go.uber.org/multierr v1.9.0 h1:7fIwc/ZtS0q++VgcfqFDxSBZVv/Xo49/SYnDFupUwlI=
go.uber.org/multierr v1.9.0/go.mod h1:X2jQV1h+kxSjClGpnseKVIxpmcjrj7MNnI0bnlfKTVQ=
golang.org/x/crypto v0.0.0-20190308221718-c2843e01d9a2/go.mod h1:djNgcEr1/C05ACkg1iLfiJU5Ep61QUkGW8qpdssI0+w=
golang.org/x/crypto v0.0.0-20210921155107-089bfa567519/go.mod h1:GvvjBRRGRdwPK5ydBHafDWAxML/pGHZbMvKqRZ5+Abc=
golang.org/x/crypto v0.24.0 h1:mnl8DM0o513X8fdIkmyFE/5hTYxbwYOjDS/+rK6qpRI=
golang.org/x/crypto v0.24.0/go.mod h1:Z1PMYSOR5nyMcyAVAIQSKCDwalqy85Aqn1x3Ws4L5DM=
golang.org/x/crypto v0.25.0 h1:ypSNr+bnYL2YhwoMt2zPxHFmbAN1KZs/njMG3hxUp30=
golang.org/x/crypto v0.25.0/go.mod h1:T+wALwcMOSE0kXgUAnPAHqTLW+XHgcELELW8VaDgm/M=
golang.org/x/exp v0.0.0-20240506185415-9bf2ced13842 h1:vr/HnozRka3pE4EsMEg1lgkXJkTFJCVUX+S/ZT6wYzM=
golang.org/x/exp v0.0.0-20240506185415-9bf2ced13842/go.mod h1:XtvwrStGgqGPLc4cjQfWqZHG1YFdYs6swckp8vpsjnc=
golang.org/x/mod v0.6.0-dev.0.20220419223038-86c51ed26bb4/go.mod h1:jJ57K6gSWd91VN4djpZkiMVwK6gcyfeH4XE8wZrZaV4=
@@ -215,8 +207,8 @@ golang.org/x/net v0.0.0-20210226172049-e18ecbb05110/go.mod h1:m0MpNAwzfU5UDzcl9v
golang.org/x/net v0.0.0-20220722155237-a158d28d115b/go.mod h1:XRhObCWvk6IyKnWLug+ECip1KBveYUHfp+8e9klMJ9c=
golang.org/x/net v0.6.0/go.mod h1:2Tu9+aMcznHK/AK1HMvgo6xiTLG5rD5rZLDS+rp2Bjs=
golang.org/x/net v0.9.0/go.mod h1:d48xBJpPfHeWQsugry2m+kC02ZBRGRgulfHnEXEuWns=
golang.org/x/net v0.26.0 h1:soB7SVo0PWrY4vPW/+ay0jKDNScG2X9wFeYlXIvJsOQ=
golang.org/x/net v0.26.0/go.mod h1:5YKkiSynbBIh3p6iOc/vibscux0x38BZDkn8sCUPxHE=
golang.org/x/net v0.27.0 h1:5K3Njcw06/l2y9vpGCSdcxWOYHOUk3dVNGDXN+FvAys=
golang.org/x/net v0.27.0/go.mod h1:dDi0PyhWNoiUOrAS8uXv/vnScO4wnHQO4mj9fn/RytE=
golang.org/x/sync v0.0.0-20180314180146-1d60e4601c6f/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM=
golang.org/x/sync v0.0.0-20190423024810-112230192c58/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM=
golang.org/x/sync v0.0.0-20220722155255-886fb9371eb4/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM=
@@ -234,8 +226,8 @@ golang.org/x/sys v0.0.0-20220811171246-fbc7d0a398ab/go.mod h1:oPkhp1MJrh7nUepCBc
golang.org/x/sys v0.5.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/sys v0.6.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/sys v0.7.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/sys v0.21.0 h1:rF+pYz3DAGSQAxAu1CbC7catZg4ebC4UIeIhKxBZvws=
golang.org/x/sys v0.21.0/go.mod h1:/VUhepiaJMQUp4+oa/7Zr1D23ma6VTLIYjOOTFZPUcA=
golang.org/x/sys v0.22.0 h1:RI27ohtqKCnwULzJLqkv897zojh5/DwS/ENaMzUOaWI=
golang.org/x/sys v0.22.0/go.mod h1:/VUhepiaJMQUp4+oa/7Zr1D23ma6VTLIYjOOTFZPUcA=
golang.org/x/term v0.0.0-20201126162022-7de9c90e9dd1/go.mod h1:bj7SfCRtBDWHUb9snDiAeCFNEtKQo2Wmx5Cou7ajbmo=
golang.org/x/term v0.0.0-20210927222741-03fcf44c2211/go.mod h1:jbD1KX2456YbFQfuXm/mYQcufACuNUgVhRMnK/tPxf8=
golang.org/x/term v0.5.0/go.mod h1:jMB1sMXY+tzblOD4FWmEbocvup2/aLOaQEp7JmGp78k=
20 changes: 14 additions & 6 deletions internal/pkg/crawl/api.go
@@ -21,11 +21,12 @@ type APIWorkersState struct {

// APIWorkerState represents the state of an API worker.
type APIWorkerState struct {
-WorkerID string `json:"worker_id"`
-Status string `json:"status"`
-LastError string `json:"last_error"`
-LastSeen string `json:"last_seen"`
-Locked bool `json:"locked"`
+WorkerID string `json:"worker_id"`
+Status string `json:"status"`
+LastError string `json:"last_error"`
+LastSeen string `json:"last_seen"`
+LastAction string `json:"last_action"`
+Locked bool `json:"locked"`
}

// startAPI starts the API server for the crawl
@@ -42,7 +43,7 @@ func (crawl *Crawl) startAPI() {
"crawled": crawledSeeds + crawledAssets,
"crawledSeeds": crawledSeeds,
"crawledAssets": crawledAssets,
-"queued": crawl.Frontier.QueueCount.Value(),
+"queued": crawl.Queue.GetStats().TotalElements,
"uptime": time.Since(crawl.StartTime).String(),
}

@@ -53,12 +54,19 @@ func (crawl *Crawl) startAPI() {
http.HandleFunc("/metrics", setupPrometheus(crawl).ServeHTTP)
}

+http.HandleFunc("/queue", func(w http.ResponseWriter, r *http.Request) {
+w.Header().Set("Content-Type", "application/json")
+json.NewEncoder(w).Encode(crawl.Queue.GetStats())
+})
+
http.HandleFunc("/workers", func(w http.ResponseWriter, r *http.Request) {
+w.Header().Set("Content-Type", "application/json")
workersState := crawl.Workers.GetWorkerStateFromPool("")
json.NewEncoder(w).Encode(workersState)
})

http.HandleFunc("/worker/", func(w http.ResponseWriter, r *http.Request) {
+w.Header().Set("Content-Type", "application/json")
workerID := strings.TrimPrefix(r.URL.Path, "/worker/")
workersState := crawl.Workers.GetWorkerStateFromPool(workerID)
if workersState == nil {
4 changes: 2 additions & 2 deletions internal/pkg/crawl/assets.go
@@ -7,11 +7,11 @@ import (

"github.com/PuerkitoBio/goquery"
"github.com/internetarchive/Zeno/internal/pkg/crawl/sitespecific/cloudflarestream"
-"github.com/internetarchive/Zeno/internal/pkg/frontier"
+"github.com/internetarchive/Zeno/internal/pkg/queue"
"github.com/internetarchive/Zeno/internal/pkg/utils"
)

-func (c *Crawl) extractAssets(base *url.URL, item *frontier.Item, doc *goquery.Document) (assets []*url.URL, err error) {
+func (c *Crawl) extractAssets(base *url.URL, item *queue.Item, doc *goquery.Document) (assets []*url.URL, err error) {
var rawAssets []string

// Execute plugins on the response