add interleaved versions of phase/cartToPolar/polarToCart #3607
Conversation
This PR is for performance only (at the cost of more template code and increased GPU code size). The additional variants let the caller skip the creation of temporary GPU mats (where memory is more likely to be a critical resource) and can even allow in-place processing. Magnitude/angle/x/y data are often already interleaved when dealing with DFTs.
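For illustration, a minimal sketch of the calling pattern these variants enable. The interleaved overload signatures shown here are assumptions based on this description, not the exact API added by the PR:

```cpp
// Planar API: two temporary single-channel GpuMats are needed even when the
// data starts out interleaved (e.g. the CV_32FC2 output of cv::cuda::dft).
cv::cuda::GpuMat x, y, mag, angle;
cv::cuda::cartToPolar(x, y, mag, angle);

// Assumed interleaved variant: the CV_32FC2 mat is consumed directly, and the
// output could even alias the input for in-place processing.
cv::cuda::GpuMat xy;       // CV_32FC2, e.g. straight out of a DFT
cv::cuda::GpuMat magAngle; // CV_32FC2 interleaved result
cv::cuda::cartToPolar(xy, magAngle);
```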
@cudawarped could you take a look?
Of course, but I may not have time before the release of 4.9.0.
additional "typename" disambiguifiers are required by some compilers
```cpp
GpuMat dst = getOutputMat(_dst, xy.size(), CV_32FC1, stream);

GpuMat_<float2> xyc(xy.reshape(2));
```
I know the existing functions use reshape to convert a `GpuMat` to a `GpuMat_` before being passed to `gridTransformxxx`, but I find this confusing because nothing is being reshaped. Would it not be better to use `globPtr<>` directly on the `GpuMat`s? e.g.

```cpp
if (angleInDegrees)
    gridTransformUnary(globPtr<float2>(xy), globPtr<float>(dst), direction_interleaved_func<float2, true>(), stream);
```

If so, the existing routines could be updated to remove the bloat.
```cpp
GpuMat mag = getOutputMat(_mag, xy.size(), CV_32FC1, stream);
GpuMat angle = getOutputMat(_angle, xy.size(), CV_32FC1, stream);

GpuMat_<float2> xyc(xy.reshape(2));
```
Same here.
```cpp
GpuMat magAngle = getOutputMat(_magAngle, xy.size(), CV_32FC2, stream);

GpuMat_<float2> xyc(xy.reshape(2));
```
And here.
```cpp
@@ -192,6 +276,49 @@ namespace
        ymat(y, x) = mag_val * sin_a;
    }

    template <typename T, bool useMag>
    __global__ void polarToCartDstInterleavedImpl_(const GlobPtr<T> mag, const GlobPtr<T> angle, GlobPtr<typename MakeVec<T, 2>::type> xymat, const T scale, const int rows, const int cols)
```
If you use `PtrStep<T>` for `mag`, `angle` and `xymat` then you can pass them directly to this function instead of using the intermediate `GpuMat_<T>` with reshape. Additionally, if `angle` is a `PtrStepSz<T>` then you don't need to pass `cols` and `rows` separately. e.g.

```cpp
__global__ void polarToCartDstInterleavedImpl_(const PtrStep<T> mag, const PtrStepSz<T> angle, PtrStep<typename MakeVec<T, 2>::type> xymat, const T scale, const int rows, const int cols)
{
    typedef typename MakeVec<T, 2>::type T2;
    const int x = blockDim.x * blockIdx.x + threadIdx.x;
    const int y = blockDim.y * blockIdx.y + threadIdx.y;
    if (x >= angle.cols || y >= angle.rows)
        return;
```

You can also make this adjustment to all the other `polarToCart` kernel calls, including the existing one.
```cpp
void polarToCartDstInterleavedImpl(const GpuMat& mag, const GpuMat& angle, GpuMat& xy, bool angleInDegrees, cudaStream_t& stream)
{
    typedef typename MakeVec<T, 2>::type T2;
    GpuMat_<T2> xyc(xy.reshape(2));
```
If you switch to `PtrStep` and `PtrStepSz` inside `polarToCartDstInterleavedImpl_` then this can be simplified to

```cpp
template <typename T>
void polarToCartDstInterleavedImpl(const GpuMat& mag, const GpuMat& angle, GpuMat& xy, bool angleInDegrees, cudaStream_t& stream)
{
    const dim3 block(32, 8);
    const dim3 grid(divUp(angle.cols, block.x), divUp(angle.rows, block.y));
    const T scale = angleInDegrees ? static_cast<T>(CV_PI / 180.0) : static_cast<T>(1.0);

    if (mag.empty())
        polarToCartDstInterleavedImpl_<T, false><<<grid, block, 0, stream>>>(mag, angle, xy, scale, angle.rows, angle.cols);
    else
        polarToCartDstInterleavedImpl_<T, true><<<grid, block, 0, stream>>>(mag, angle, xy, scale, angle.rows, angle.cols);
}
```
```cpp
cv::cuda::GpuMat dstX1Y1 = createMat(size, CV_32FC1, useRoi);
cv::cuda::GpuMat dstXY2 = createMat(size, CV_32FC1, useRoi);
cv::cuda::phase(loadMat(x, useRoi), loadMat(y, useRoi), dstX1Y1, angleInDegrees);
```
If you have a test case per function and compare the results to the CPU version, it will make it easier to maintain going forward.
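A sketch of what such a per-function test might look like, reusing the helpers already present in this test suite (`randomMat`, `createMat`, `loadMat`, `EXPECT_MAT_NEAR`); the interleaved `phase` overload and the tolerances are assumptions:

```cpp
CUDA_TEST_P(PhaseInterleaved, Accuracy)
{
    // Build an interleaved CV_32FC2 input from two random planes.
    cv::Mat x = randomMat(size, CV_32FC1);
    cv::Mat y = randomMat(size, CV_32FC1);
    cv::Mat xy;
    cv::merge(std::vector<cv::Mat>{x, y}, xy);

    // Run the (assumed) interleaved CUDA overload.
    cv::cuda::GpuMat dst = createMat(size, CV_32FC1, useRoi);
    cv::cuda::phase(loadMat(xy, useRoi), dst, angleInDegrees);

    // Compare directly against the CPU reference.
    cv::Mat dstGold;
    cv::phase(x, y, dstGold, angleInDegrees);

    EXPECT_MAT_NEAR(dstGold, dst, angleInDegrees ? 0.2 : 0.001);
}
```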
```cpp
angle = randomMat(size, type);
cv::Mat magnitudeAngle;
cv::merge(magnitudeAngleChannels, 2, magnitudeAngle);
const double tol = (type == CV_32FC1 ? 1.6e-4 : 1e-4) * (angleInDegrees ? 1.0 : 19.47);
```
Again I would suggest a test case per function, comparing to `cv::polarToCart`, especially if you are going to use these tolerances. In this function you could now be 2*tol away from the CPU result.
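A sketch of the direct comparison being suggested, so the result is at most one `tol` from the CPU reference instead of accumulating error through a round trip; the interleaved `polarToCart` overload here is an assumption:

```cpp
// CPU reference for the same magnitude/angle input.
cv::Mat xGold, yGold, xyGold;
cv::polarToCart(magnitude, angle, xGold, yGold, angleInDegrees);
cv::merge(std::vector<cv::Mat>{xGold, yGold}, xyGold);

// CUDA result via the (assumed) interleaved overload, checked with one tolerance.
cv::cuda::GpuMat xyGpu = createMat(size, CV_32FC2, useRoi);
cv::cuda::polarToCart(loadMat(magnitudeAngle, useRoi), xyGpu, angleInDegrees);
EXPECT_MAT_NEAR(xyGold, xyGpu, tol);
```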
use globPtr() and PtrStepSz<> to bypass confusing reshape() refactor tests
…opencv_contrib into cuda_phase_interleaved
```cpp
const int x = blockDim.x * blockIdx.x + threadIdx.x;
const int y = blockDim.y * blockIdx.y + threadIdx.y;

if (x >= xymat.cols || y >= xymat.rows)
```
Try to keep the out-of-range check consistent. I realize `mag` can be empty, but you're using `xymat`, `angle` and `magAngle`. Maybe stick with `angle` and `magAngle`. Then you only need to use `PtrStepSz` for `angle`/`magAngle`; the other inputs can be `PtrStep`.
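A sketch of the consistent prologue this implies, using a hypothetical interleaved-output `cartToPolar` kernel where `magAngle` is the single `PtrStepSz` parameter that carries the size:

```cpp
// Only the sized output (magAngle) is a PtrStepSz; the interleaved input can
// stay a plain PtrStep, and every kernel bounds its threads the same way.
template <typename T>
__global__ void cartToPolarInterleavedImpl_(const PtrStep<typename MakeVec<T, 2>::type> xymat,
                                            PtrStepSz<typename MakeVec<T, 2>::type> magAngle,
                                            const T scale)
{
    const int x = blockDim.x * blockIdx.x + threadIdx.x;
    const int y = blockDim.y * blockIdx.y + threadIdx.y;
    if (x >= magAngle.cols || y >= magAngle.rows)  // consistent out-of-range check
        return;
    // ... compute magnitude/angle from xymat(y, x) into magAngle(y, x) ...
}
```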
```cpp
GpuMat mag = getOutputMat(_mag, xy.size(), CV_32FC1, stream);
GpuMat angle = getOutputMat(_angle, xy.size(), CV_32FC1, stream);

GpuMat_<float> magc(mag.reshape(1));
```
Can we remove the reshape completely? Looking at it again, it doesn't do anything. i.e.

```cpp
GpuMat_<float> magc(mag);
```
```cpp
@@ -2809,6 +2850,97 @@ INSTANTIATE_TEST_CASE_P(CUDA_Arithm, CartToPolar, testing::Combine(
    testing::Values(AngleInDegrees(false), AngleInDegrees(true)),
    WHOLE_SUBMAT));

PARAM_TEST_CASE(CartToPolarInterleaved1, cv::cuda::DeviceInfo, cv::Size, AngleInDegrees, UseRoi)
```
Can you give these test cases more informative names? Worst case scenario you could use `CartToPolarInputInterleaved`, `CartToPolarInputOutputInterleaved`, `PolarToCartOutputInterleaved`, `PolarToCartInputOutputInterleaved` if you can't think of anything better.
code style and simplifications
LGTM 👍
Passing tests on Windows 11, VS 2022 with CUDA 12.3.
@asmorkalov can you take a look?
@chacha21 You will need to squash and rebase this onto the tip of the 4.x branch, as the CUDA CMake configuration has changed in the main repo since you submitted this PR, so I think it will fail on the CI.
Is this OK after "Merge branch '4.x'"? My brain has never accepted git terminology, so I am not sure it was the right operation (done with GitHub Desktop).
I rebased the patch to the current 4.x and got a build error with CUDA 11.8 on Ubuntu 18.04:
I don't have such a problem with CUDA 12.4 under Visual Studio 2022.
Another issue with CUDA 12.5 on Ubuntu 20.04:
I rebased the local branch to 4.x to include all patches for CUDA 12.x.
I think it must be related to some "tuple" name clashing in my calls to
The `make_tuple` or `tie()` helper returns a `cuda::std::tuple`, but `cuda` is then ambiguous between `::cuda` and `cv::cuda`. Removing `using cv::cuda` will help.
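A minimal reconstruction of the clash being described, with mock namespaces standing in for libcu++'s `::cuda::std` and OpenCV's `cv::cuda` (the exact call site in the PR is not shown here):

```cpp
namespace cuda { namespace std { template <class... Ts> struct tuple {}; } }  // stand-in for libcu++
namespace cv { namespace cuda { struct GpuMat {}; } }                          // stand-in for OpenCV

using namespace cv;  // now the unqualified name "cuda" denotes both ::cuda and cv::cuda

int main()
{
    // cuda::std::tuple<int, float> t;  // error: reference to 'cuda' is ambiguous
    ::cuda::std::tuple<int, float> t;   // fully qualifying (or dropping the using) resolves it
    (void)t;
    return 0;
}
```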
The usage of `typename` seems different among compilers
Pull Request Readiness Checklist
See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request
Patch to opencv_extra has the same branch name.