Releases: ropensci/targets
Speed gains for large pipelines (with many up-to-date targets)
targets 1.10.0
Invalidating changes
These changes invalidate certain targets in a pipeline and cause them to rerun on the next `tar_make()`.
- Exclude function signatures from `tar_repository_cas()` output strings to reduce the size of pipeline metadata (#1390).
- Exclude function signatures from `tar_format()` output strings to reduce the size of pipeline metadata (#1390).
Summary of performance gains
`tar_make()` and `tar_outdated()` run much faster in this release. Extensive profiling was done on a real-world simulation pipeline with 66002 up-to-date targets. For `tar_make()` using all the default settings:
Machine | Before (seconds) | After (seconds) | Speedup |
---|---|---|---|
M2 Macbook | 413.16 | 35.538 | 11.62587 |
RHEL9 | 450.66 | 94.08 | 4.790 |
And for `tar_outdated()` using all the default settings:
Machine | Before (seconds) | After (seconds) | Speedup |
---|---|---|---|
M2 Macbook | 91.314 | 16.636 | 5.48894 |
RHEL9 | 167.809 | 37.395 | 4.487472 |
To take advantage of these speed gains for an existing pipeline, you may have to run `tar_make()` to convert the time stamps and file sizes to a new format. This initial `tar_make()` is slow, but subsequent `tar_make()` calls should be much faster than before the upgrade.
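As a quick sanity check on the tables above, the speedup column is the "before" time divided by the "after" time. A minimal base R sketch, with the timings copied from the tables (the data frame and its column names are only illustrative):

```r
# Timings in seconds, copied from the two tables above.
timings <- data.frame(
  fun = c("tar_make()", "tar_make()", "tar_outdated()", "tar_outdated()"),
  machine = c("M2 Macbook", "RHEL9", "M2 Macbook", "RHEL9"),
  before = c(413.16, 450.66, 91.314, 167.809),
  after = c(35.538, 94.08, 16.636, 37.395)
)

# Speedup = time before the upgrade / time after the upgrade.
timings$speedup <- timings$before / timings$after
print(timings)
```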
Other/specific changes
- Speed up `tar_make()` and `tar_outdated()` by avoiding excessive buffering and disk writes for metadata and reporters when the pipeline is just skipping targets.
- Use a more lookup-efficient data structure for `tar_runtime$file_info` (#1398).
- Fall back on vector aggregation without names (#1401, @guglicap).
- Speed up representation of file sizes in metadata (#1408).
- Add a new `"forecast_interactive"` reporter to `tar_outdated()` to choose `"forecast"` for interactive sessions and `"silent"` for non-interactive ones.
- Add a new `seconds_reporter_outdated` argument to `tar_config_set()` with a default of 1 to control the time interval of the reporter of `tar_outdated()` and other passive algorithm functions.
- Remove target descriptions from the default labels of graph visualizations.
igraph compatibility
targets 1.9.1
Bug fixes
- Allow branch references to contain multi-element `path` vectors with cloud metadata (#1382, @n8layman).
- Avoid partial matches in internal code (#1384, @olivroy).
- Add error handling around calls to `ps::ps_disk_partitions()` and `ps::ps_fs_mount_point()`.
- Do not store `_targets/objects/` paths in metadata for CAS repositories (#1391).
Compatibility
- Ensure compatibility with `igraph` >= 2.1.2.
Memory efficiency
targets 1.9.0
Improvements
- Un-break workflows that use `format = "file_fast"` (#1339, @koefoeden).
- Fix deadlock in `error = "trim"` (#1340, @koefoeden).
- Remove tailored debugging message (#1341, @koefoeden).
- Store warnings while writing to storage (#1345, @Aariq).
- Allow `garbage_collection` to be a non-negative integer to control the frequency of garbage collection in a performant, convenient, unified way (#1351).
- Deprecate the `garbage_collection` argument of `tar_make()`, `tar_make_future()`, and `tar_make_clustermq()` (#1351).
- Instrument `target_run()`, `target_prepare()`, and `target_conclude()` using `autometric`.
- Avoid sending problematic error classes such as `"vctrs_error_subscript_oob"` to `rlang::abort()` (#1354, @Jiefei-Wang).
- Reduce memory consumption by ~23% in large pipelines by avoiding the accumulation of promise objects (#1352).
- Avoid `store_assert_format()` and `store_convert_object()` if `storage` is `"none"`.
- Add a `list()` method to `tar_repository_cas()` to make it easier and more efficient to specify custom CAS repositories (#1366).
- Improve speed and reduce memory consumption by avoiding deep copies of inner environments of target definition objects (#1368).
- Reduce memory consumption by storing buds and branches as lightweight references when `memory` is `"transient"` (#1364).
- Replace the `memory` class with the new `lookup` class.
- Implement `memory = "auto"` to select transient memory for dynamic branches and persistent memory for other targets (#1371).
- Omit whole pattern targets from branch subpipelines when possible. Should reduce memory consumption in some cases.
- Omit whole stem targets from branch subpipelines when `retrieval` is `"main"` and only a bud is actually used. The same cannot be done with branches because each branch may need to be (un)marshaled individually.
- Compress branches into references when `retrieval` is `"worker"` and the whole pattern is part of the subpipeline.
- Avoid duplicated branch aggregation: just send the branches over the network.
- Back-compatibly switch `format = "qs"` from `qs` to `qs2` (#1373).
- Add `tar_unblock_process()`.
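A hedged sketch of how the memory-related options from this release might be combined in a `_targets.R` file. The target names and commands are hypothetical, and the comment on `garbage_collection = 10` reflects an assumed reading of "frequency" (roughly, collect garbage every 10th target); check the `tar_option_set()` help file for the exact semantics.

```r
# _targets.R (illustrative sketch; target names and commands are made up)
library(targets)

tar_option_set(
  memory = "auto",         # transient memory for dynamic branches, persistent otherwise
  garbage_collection = 10  # assumption: run garbage collection about every 10th target
)

list(
  tar_target(index, seq_len(100)),
  # One dynamic branch per element of `index`.
  tar_target(draws, rnorm(1000, mean = index), pattern = map(index))
)
```

Setting these once in `tar_option_set()` applies them to every target, which fits the deprecation of the `garbage_collection` argument of the `tar_make*()` functions.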
Potentially invalidating changes
- Add `"keepNA"` and `"keepInteger"` to `.deparseOpts()` (#1375). This may cause existing pipelines to rerun, but it makes add-ons like `tarchetypes::tar_map()` much easier to use.
Content addressable storage
targets 1.8.0
- Wrap the `tar_watch()` UI module in `bslib::page()` (#1302, @kwbyron-lilly).
- Remove `callr_function` from the `tar_make_as_job()` argument list.
- Ensure `storage = "worker"` is respected when the process of storing an object generates an error (#1304, @multimeric).
- Default to the `_targets.R` pattern in `tar_branches()` (#1306, @multimeric, @mattwarkentin).
- Remove superfluous functions and globals from metadata with `tar_prune()` (#1312, @benzipperer).
- Change the default `workspace_on_error` option to `TRUE` (#1310, @hadley).
- Enhance and organize the `error = "stop"` error message.
- Avoid saving a file in `_targets/objects` for `error = "null"`. Instead, switch to a special `"null"` storage format class if `error` is `"null"` and the target throws an error. This should allow users to more freely create new formats with `tar_format()` without worrying about how to handle `NULL` objects created by `error = "null"`.
- Implement `format = "auto"` (#1311, @hadley).
- Replace `pingr` dependency with `base::socketConnection()` for local URL utilities (#1317, #1318, @Adafede).
- Implement `tar_repository_cas()`, `tar_repository_cas_local()`, and `tar_repository_cas_local_gc()` for content-addressable storage (#1232, #1314, @noamross).
- Add `tar_format_get()` to make implementing CAS systems easier.
- Implement `error = "trim"` in `tar_target()` and `tar_option_set()` (#1310, #1311, @hadley).
- Use the file system type to decide whether to trust time stamps (#1315, @hadley, @gaborcsardi).
- Deprecate `format = "file_fast"` in favor of the above (#1315).
- Deprecate `trust_object_timestamps` in favor of the more unified `trust_timestamps` in `tar_option_set()` (#1315).
- Print storage size of each target in verbose reporters (#1337, @psychelzh).
- Combine help files of `tar_target()` and `tar_target_raw()`. Same with `tar_load()` and `tar_load_raw()`.
- Add a `substitute` argument to `tar_format()` to make it easier to write custom storage formats without metaprogramming.
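A minimal sketch of opting into two of the options introduced in this release, `format = "auto"` and `error = "trim"`, via `tar_option_set()`. The target names and commands are hypothetical, and the comments paraphrase the intended behavior rather than quote the help files.

```r
# _targets.R (illustrative sketch; target names and commands are made up)
library(targets)

tar_option_set(
  format = "auto",  # let targets choose a storage format for each target
  error = "trim"    # roughly: skip what depends on a failed target, keep running the rest
)

list(
  tar_target(raw, data.frame(x = 1:3)),
  tar_target(fit, lm(x ~ 1, data = raw))
)
```

Either option can also be set for a single target through the matching argument of `tar_target()`.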
bslib and speed
targets 1.7.1
- Use `bslib` in `tar_watch()`.
- Speed up `target_upstream_edges()` and `pipeline_upstream_edges()` by avoiding data frames until the last minute (17% speedup for certain kinds of large pipelines).
- Automatically set `as_job` to `FALSE` in `tar_make()` if `rstudioapi` and/or RStudio is not available.
secretbase
targets 1.7.0
Invalidating changes
- Use `secretbase::siphash13()` instead of `digest(algo = "xxhash64", serializationVersion = 3)` so hashes of in-memory objects no longer depend on serialization version 3 headers (#1244, @shikokuchuo). Unfortunately, pipelines built with earlier versions of `targets` will need to rerun.
Other improvements
- Ensure patterns marshal properly (#1266, #1264, njtierney/geotargets#52, @Aariq, @njtierney).
- Inform and prompt the user when the pipeline was built with an old version of `targets` and changes to the package will cause the current work to rerun (#1244). For the `tar_make*()` functions, `utils::menu()` prompts the user to give people a chance to downgrade if necessary.
- For type safety in the internal database class, read all columns as character vectors in `data.table::fread()`, then convert them to the correct types afterwards.
- Add a new `tar_resources_custom_format()` function which can pass environment variables to customize the behavior of custom `tar_format()` storage formats (#1263, #1232, @Aariq, @noamross).
- Only marshal dependencies if actually sending the target to a parallel worker.
Custom descriptions
targets 1.6.0
- Modernize `extras` in `tar_renv()`.
- `tar_target()` gains a `description` argument for free-form text describing what the target is about (#1230, #1235, #1236, @tjmahr).
- `tar_visnetwork()`, `tar_glimpse()`, `tar_network()`, `tar_mermaid()`, and `tar_manifest()` now optionally show target descriptions (#1230, #1235, #1236, @tjmahr).
- `tar_described_as()` is a new wrapper around `tidyselect::any_of()` to select specific subsets of targets based on the description rather than the name (#1136, #1196, @noamross, @mattmoo).
- Fix the documentation of the `names` argument (nudge users toward `tidyselect` expressions).
- Make assertions on the pipeline process more robust (to check if two processes are trying to access the same data store).
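The description features above can be tried with a small throwaway pipeline. This is a sketch rather than a canonical recipe: the target names, commands, and description strings are hypothetical, and the pipeline is written to a temporary directory so nothing is left behind.

```r
library(targets)

# Work in a throwaway directory so the sketch leaves no files behind.
dir <- tempfile("targets_demo_")
dir.create(dir)
old <- setwd(dir)

# Write a small _targets.R whose targets carry free-form descriptions.
tar_script({
  list(
    tar_target(air, airquality, description = "raw airquality data"),
    tar_target(model, lm(Ozone ~ Temp, data = air), description = "ozone model")
  )
})

# Select targets by description rather than by name.
selected <- tar_manifest(names = tar_described_as(starts_with("ozone")))
print(selected$name)

setwd(old)
```

The same `tar_described_as()` selection can be passed to other functions with a `names` argument, such as `tar_make()`.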
CRAN patch
targets 1.5.1
- Avoid `arrow`-related CRAN check NOTE.
- `use_targets()` only writes the `_targets.R` script. The `run.sh` and `run.R` scripts are superseded by the `as_job` argument of `tar_make()`. Users not using the RStudio IDE can call `tar_make()` with `callr_function = callr::r_bg` to run the pipeline as a background process.
- `tar_make_clustermq()` and `tar_make_future()` are superseded in favor of `tar_make(use_crew = TRUE)`, so template files are no longer written for the former automatically.
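A rough sketch of the two replacements mentioned above. It assumes a `_targets.R` script already exists in the working directory. For the `crew` variant it also assumes that script registers a controller, for example with `tar_option_set(controller = crew::crew_controller_local(workers = 2))`; that setup is an illustration and is not taken from these notes.

```r
library(targets)

# Replacement for run.sh / run.R outside the RStudio IDE:
# run the pipeline in a background R process.
# With callr::r_bg, tar_make() returns a callr process handle right away.
process <- tar_make(callr_function = callr::r_bg)
process$wait()             # block until the background pipeline finishes
process$get_exit_status()  # 0 means the background process exited cleanly

# Replacement for tar_make_clustermq() / tar_make_future():
# dispatch targets through the crew controller declared in _targets.R.
tar_make(use_crew = TRUE)
```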
Small fixes
targets 1.4.1
- Print "errored pipeline" when at least one target errors.
- Bump minimum `clustermq` version to 0.9.2.
- Repair the `tar_debug_instructions()` tips for when commands are long.
- Do not look for dependencies of primitive functions (#1200, @smwindecker, @joelnitta).
AWS/crew efficiency, random number safety
targets 1.4.0
Invalidating changes
Because of the changes below, upgrading to this version of `targets` will unavoidably invalidate previously built targets in existing pipelines. Your pipeline code should still work, but any targets you ran before will most likely need to rerun after the upgrade.
- Use SHA512 during the creation of target-specific pseudo-random number generator seeds (#1139). This change decreases the risk of overlapping/correlated random number generator streams. See the "RNG overlap" section of the `tar_seed_create()` help file for details and justification. Unfortunately, this change will invalidate all currently built targets because the seeds will be different. To avoid rerunning your whole pipeline, set `cue = tar_cue(seed = FALSE)` in `tar_target()`.
- For cloud storage: instead of the hash of the local file, use the ETag for AWS S3 targets and the MD5 hash for GCP GCS targets (#1172). Sanitize with `targets:::digest_chr64()` in both cases before storing the result in the metadata.
- For a cloud target to be truly up to date, the hash in the metadata now needs to match the current object in the bucket, not the version recorded in the metadata (#1172). In other words, `targets` now tries to ensure that the up-to-date data objects in the cloud are in their newest versions. So if you roll back the metadata to an older version, you will still be able to access historical data versions with e.g. `tar_read()`, but the pipeline will no longer be up to date.
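The seed workaround from the first bullet might look like this in a `_targets.R` file. The target name and command are hypothetical. Note that a target kept up to date this way still holds the result generated under its old seed.

```r
# _targets.R (illustrative sketch; the target name and command are made up)
library(targets)

list(
  tar_target(
    simulation,
    rnorm(100),
    # Do not rerun this target just because its seed changed.
    cue = tar_cue(seed = FALSE)
  )
)
```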
Other changes to seeds
- Add a new exported function `tar_seed_create()` which creates target-specific pseudo-random number generator seeds.
- Add an "RNG overlap" section in the `tar_seed_create()` help file to justify and defend how `targets` and `tarchetypes` approach pseudo-random numbers.
- Add function `tar_seed_set()` which sets a seed and sets all the RNG algorithms to their defaults in the R installation of the user. Each target now uses the `tar_seed_set()` function to set its seed before running its R command (#1139).
- Deprecate `tar_seed()` in favor of the new `tar_seed_get()` function.
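A small sketch of the two seed helpers used outside a pipeline. The target name `"my_target"` and the `global_seed` value are hypothetical, and the call assumes the `tar_seed_create(name, global_seed)` signature from the help file.

```r
library(targets)

# Derive a target-specific seed from a target name and a pipeline-level seed.
# "my_target" and global_seed = 1L are made-up values for illustration.
seed <- tar_seed_create("my_target", global_seed = 1L)

# Setting the same seed twice should reproduce the same random draws.
tar_seed_set(seed)
first <- runif(3)
tar_seed_set(seed)
second <- runif(3)

identical(first, second)
```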
Other cloud storage improvements
- For all cloud targets, check hashes in batched LIST requests instead of individual HEAD requests (#1172). Dramatically speeds up the process of checking if cloud targets are up to date.
- For AWS S3 targets, `tar_delete()`, `tar_destroy()`, and `tar_prune()` now use efficient batched calls to `delete_objects()` instead of costly individual calls to `delete_object()` (#1171).
- Add a new `verbose` argument to `tar_delete()`, `tar_destroy()`, and `tar_prune()`.
- Add a new `batch_size` argument to `tar_delete()`, `tar_destroy()`, and `tar_prune()`.
- Add new arguments `page_size` and `verbose` to `tar_resources_aws()` (#1172).
- Add a new `tar_unversion()` function to remove version IDs from the metadata of cloud targets. This makes it easier to interact with just the current version of each target, as opposed to the version ID recorded in the local metadata.
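A hedged sketch of where the new arguments might go. The bucket and prefix are placeholders, the numeric values are arbitrary, and running this for real requires AWS credentials and an existing cloud-backed pipeline. The first call belongs in `_targets.R`; the second is meant for the R console.

```r
library(targets)

# In _targets.R: paginate listing requests and print progress messages.
tar_option_set(
  repository = "aws",
  resources = tar_resources(
    aws = tar_resources_aws(
      bucket = "my-bucket",   # placeholder bucket name
      prefix = "my-project",  # placeholder object key prefix
      page_size = 1000L,      # arbitrary illustrative value
      verbose = TRUE
    )
  )
)

# At the console: delete the stored data of all targets, locally and in the
# bucket, using batched delete requests.
tar_delete(names = everything(), cloud = TRUE, batch_size = 1000L, verbose = TRUE)
```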
Other improvements
- Migrate to the changes in `clustermq` 0.9.0 (@mschubert).
- In progress statuses, change "started" to "dispatched" and change "built" to "completed" (#1192).
- Deprecate `tar_started()` in favor of `tar_dispatched()` (#1192).
- Deprecate `tar_built()` in favor of `tar_completed()` (#1192).
- Console messages from reporters say "dispatched" and "completed" instead of "started" and "built" (#1192).
- The `crew` scheduling algorithm no longer waits on saturated controllers, and targets that are ready are greedily dispatched to `crew` even if all workers are busy (#1182, #1192). To appropriately set expectations for users, reporters print "dispatched (pending)" instead of "dispatched" if the task load is backlogged at the moment.
- In the `crew` scheduling algorithm, waiting for tasks is now a truly event-driven process and consumes 5-10x less CPU resources (#1183). Only the auto-scaling of workers uses polling (with an inexpensive default polling interval of 0.5 seconds, configurable through `seconds_interval` in the controller).
in the controller). - Simplify stored target tracebacks.
- Print the traceback on error.