< 2021-06-25 >

2,784,740 events, 1,424,969 push events, 2,289,253 commit messages, 160,307,086 characters

Friday 2021-06-25 01:57:41 by Sam Lantinga

Fixed bug 4996 - Mac: XBoxOne Bluetooth rumble isn't working

rofferom

I have an annoying issue on macOS with Xbox One Bluetooth rumble (Vendor: 0x045e, Product: 0x02fd).

When 360controller is installed, USB rumble works correctly. Bluetooth rumble, however, isn't working at all, with or without 360controller installed (although it does work with Chrome + https://html5gamepad.com).

I looked at the code, and it seems that Xbox controllers on macOS are handled in SDL_hidapi_xbox360.c; the Xbox One file is disabled for macOS in SDL_hidjoystick_c.h.

The function HIDAPI_DriverXbox360_Rumble() is called correctly, and hid_write() returns no error.

I tried a crude test: I took the rumble packet from 360controller: https://github.com/360Controller/360Controller/blob/ec4e88eb2d2535e9b32561c702f42fb22b0a7f99/XBOBTFF/FFDriver.cpp#L620. With the patch I have attached, I managed to get rumble working over Bluetooth (at an arbitrary vibration level, but it proves rumble can work if the packet is changed).

But it breaks USB rumble with 360controller. A comment in the function makes an explicit reference to 360controller; I think that's why I have broken this specific use case.

I don't know the correct way to fix this, but it seems the current implementation is missing a case for Bluetooth support.
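
To make the report concrete, here is a minimal sketch of the kind of transport-dependent branch a fix might need. The packet layouts, report IDs, and the is_bluetooth flag are illustrative assumptions, not SDL's or 360controller's actual values:

#include <hidapi/hidapi.h>

// Hypothetical sketch: pick the rumble report by transport, since the
// Bluetooth controller expects a different packet than the USB path.
// All report IDs and field offsets here are assumptions for illustration.
static int SendRumble(hid_device *dev, bool is_bluetooth,
                      unsigned char low, unsigned char high)
{
    if (is_bluetooth) {
        // Assumed Bluetooth-style report: different ID and field layout.
        unsigned char pkt[9] = { 0x03, 0x0F, 0x00, 0x00, low, high, 0xFF, 0x00, 0x00 };
        return hid_write(dev, pkt, sizeof(pkt));
    }
    // Assumed USB/360controller-style report.
    unsigned char pkt[9] = { 0x09, 0x00, 0x00, 0x09, 0x00, 0x0F, 0x00, low, high };
    return hid_write(dev, pkt, sizeof(pkt));
}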

Note that I also tested master this morning, and I hit another issue: the if (!device->ffservice) { return SDL_Unsupported(); } check fails in DARWIN_JoystickRumble(). That test was done quickly, so I'm not totally confident in its accuracy.


Friday 2021-06-25 04:27:32 by Tom Gillespie

implement more elements of the racket reader for sxpyr cross testing

reading symbols with colons in them is by far the biggest issue

in the process I uncovered some variant behavior in the sbcl and ccl readers. sbcl's read-list is implemented in such a way that it breaks setting #. as a macro character. ccl has a nasty issue where reading a symbol that ends in a colon, e.g. scribble:, fails without any way to know what the offending symbol was. it is possible to work around most of the other ccl issues with evil hacks that inspect the contents of a condition, but that one requires deeper modification; in theory, implementing a version of sbcl's unintern restart might be a way forward

my understanding of the cl reader is significantly better than it was when I first started working on this, so I now understand why it is a bad idea for loading a package to change the readtable; therefore we no longer call enable-clacket in enable.lisp

other old changes related to testing that may or may not be correct


Friday 2021-06-25 08:34:54 by Andrew Jeffery

hw: Introduce models for IBM's Flexible Service Interface

Firstly, I'll split this patch up in the future, but wanted to get the integrated setup sent out for any initial thoughts. Anyway:

Time for some fun with inter-processor buses. FSI allows a service processor access to the internal buses of a host POWER processor to perform configuration or debugging.

FSI has long existed in POWER processors and so comes with some baggage, including how it has been integrated into the ASPEED SoC.

Working backwards from the POWER processor, the fundamental pieces of interest for the implementation are:

  1. The Common FRU Access Macro (CFAM), an address space containing various "engines" that drive accesses on buses internal and external to the POWER chip. Examples include the SBEFIFO and I2C masters. The engines hang off of an internal Local Bus (LBUS) which is described by the CFAM configuration block.

  2. The FSI slave: The slave is the terminal point of the FSI bus for FSI symbols addressed to it. Slaves can be cascaded off of one another. The slave's configuration registers appear in the address space of the CFAM to which it is attached.

  3. The FSI master: A controller in the platform service processor (e.g. BMC) driving CFAM engine accesses into the POWER chip. At the hardware level FSI is a bit-based protocol supporting synchronous and DMA-driven accesses of engines in a CFAM.

  4. The On-Chip Peripheral Bus (OPB): A low-speed bus typically found in POWER processors. This now makes an appearance in the ASPEED SoC due to tight integration of the FSI master IP with the OPB, mainly the existence of an MMIO mapping of the CFAM address space straight onto a sub-region of the OPB address space.

  5. An APB-to-OPB bridge enabling access to the OPB from the ARM core in the AST2600. Hardware limitations prevent the OPB from being directly mapped into APB, so all accesses are indirect through the bridge.

The implementation appears as follows in the qemu device tree:

(qemu) info qtree
bus: main-system-bus
  type System
  ...
  dev: aspeed.apb2opb, id ""
    gpio-out "sysbus-irq" 1
    mmio 000000001e79b000/0000000000001000
    bus: opb.1
      type opb
      dev: fsi.master, id ""
        bus: fsi.bus.1
          type fsi.bus
          dev: cfam.config, id ""
          dev: cfam, id ""
            bus: lbus.1
              type lbus
              dev: scratchpad, id ""
                address = 0 (0x0)
    bus: opb.0
      type opb
      dev: fsi.master, id ""
        bus: fsi.bus.0
          type fsi.bus
          dev: cfam.config, id ""
          dev: cfam, id ""
            bus: lbus.0
              type lbus
              dev: scratchpad, id ""
                address = 0 (0x0)

The LBUS is modelled to maintain the qdev bus hierarchy and to take advantage of the object model to automatically generate the CFAM configuration block. The configuration block presents engines in the order they are attached to the CFAM's LBUS. Engine implementations should subclass the LBusDevice and set the 'config' member of LBusDeviceClass to match the engine's type.
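
As a minimal sketch of that contract, an engine model would look roughly like the following; the TYPE_LBUS_DEVICE and LBUS_DEVICE_CLASS names and the config value are assumptions extrapolated from this description, not the final code:

/* Hypothetical engine sketch: subclass LBusDevice and set the 'config'
 * member of LBusDeviceClass so the CFAM configuration block can
 * advertise the engine. All identifiers here are assumptions. */
static void scratchpad_class_init(ObjectClass *klass, void *data)
{
    LBusDeviceClass *lbc = LBUS_DEVICE_CLASS(klass);

    lbc->config = 0x80010006; /* assumed engine ID word for the config block */
}

static const TypeInfo scratchpad_info = {
    .name       = "scratchpad",
    .parent     = TYPE_LBUS_DEVICE,
    .class_init = scratchpad_class_init,
};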

CFAM designs offer a lot of flexibility; for instance, it is possible for a CFAM to be simultaneously driven from multiple FSI links. The modelling is not so complete; it's assumed that each CFAM is attached to a single FSI slave (as a consequence the CFAM subclasses the FSI slave).

As for FSI, its symbols and wire-protocol are not modelled at all. This is not necessary to get FSI off the ground thanks to the mapping of the CFAM address space onto the OPB address space - the models follow this directly and map the CFAM memory region into the OPB's memory region. Future work includes supporting more advanced accesses that drive the FSI master directly rather than indirectly via the CFAM mapping, which will require implementing the FSI state machine and methods for each of the FSI symbols on the slave. Further down the track we can also look at supporting the bitbanged SoftFSI drivers in Linux by extending the FSI slave model to resolve sequences of GPIO IRQs into FSI symbols, and calling the associated symbol method on the slave to map the access onto the CFAM.

Signed-off-by: Andrew Jeffery [email protected]
[clg: tons of fixes, lots of love and care and time / we need to split this patch!]
Signed-off-by: Cédric Le Goater [email protected]


Friday 2021-06-25 11:14:11 by Long Do

Merge pull request #6 from Longsans/coom

oh my fucking god i suck at UI so much


Friday 2021-06-25 12:38:55 by Vera Aguilera Puerto

HOLY SHIT I HATE CEF AND I HATE WEBDEVS AAAAAAAAAAAAAAAAAAA


Friday 2021-06-25 13:53:27 by Ayman Ahmed Nafi

Update README.md

Ayman Ahmed Nafi is a Bangladeshi author, blogger, online activist, digital marketer, and entrepreneur. He was born in Chittagong. He focused on his studies, but alongside them he also took part in many social activities. He is also a volunteer at UNICEF.

He is an independent writer who writes blog posts and poems on many platforms. He is also a member of the Bangladesh Department of Archives and Library. He has already written many poems, which poetry lovers have received quite well. He is also an activist for women's safety and the prevention of harassment generally. His writings have played a very important role in eradicating various social prejudices.

Ayman Ahmed dreams of building a great library for the world, where people will be enlightened by reading books. He believes that books are the only means by which people can create themselves, understand themselves, and learn about themselves. At the same time, he wants to work for the betterment of the lives of the underprivileged. He is also the spokesperson of a book competition, where the best reader is awarded the best prize.


Friday 2021-06-25 17:59:49 by shutterberg

Create 272A.cpp

A. Dima and Friends
time limit per test: 2 seconds
memory limit per test: 256 megabytes
input: standard input
output: standard output

Dima and his friends have been playing hide and seek at Dima's place all night. As a result, Dima's place got messy. In the morning they decided that they need to clean the place.

To decide who exactly will clean the apartment, the friends want to play a counting-out game. First, all the guys stand in a circle, and each of them shows some number of fingers on one hand (one to five). Then the boys count around the circle, starting from Dima, up to the total number of fingers shown. The person on whom the count stops will clean the apartment.

For example, if Dima and one of his friends played hide and seek, and 7 fingers were shown during the counting-out, then Dima would clean the place. If there were 2 or, say, 8 fingers shown, then his friend would clean the place.

Dima knows how many fingers each of his friends will show during the counting-out. Now he is interested in the number of ways he can show some number of fingers on one hand (one to five) so that he does not have to clean the place. Help Dima.

Input

The first line contains integer n (1 ≤ n ≤ 100) — the number of Dima's friends. Dima himself isn't considered to be his own friend. The second line contains n positive integers, not exceeding 5, representing how many fingers Dima's friends will show.

The numbers in the lines are separated by a single space.

Output

In a single line print the answer to the problem.
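
One straightforward solution sketch (mine, not necessarily the committed 272A.cpp): Dima is position 1 in the circle of n+1 people and the count starts from him, so the count stops on Dima exactly when the total number of fingers is congruent to 1 modulo n+1. Count the choices k in 1..5 that avoid that:

#include <iostream>

int main() {
    int n;
    std::cin >> n;
    int sum = 0;
    for (int i = 0; i < n; ++i) {
        int fingers;
        std::cin >> fingers;
        sum += fingers;
    }
    // Dima cleans iff (total fingers) % (n + 1) == 1, since he is
    // position 1 in a circle of n + 1 people and counting starts at him.
    int ways = 0;
    for (int k = 1; k <= 5; ++k) {
        if ((sum + k) % (n + 1) != 1) {
            ++ways;
        }
    }
    std::cout << ways << '\n';
    return 0;
}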


Friday 2021-06-25 18:29:41 by Brian Hirsh

add a boxed CPU fallback kernel

Pull Request resolved: pytorch/pytorch#58065

This PR replaces the existing code-generated CPU fallback kernels that XLA uses with a single boxed CPU fallback.

Current state: there are a couple of different design ideas that I want to point out, but the logic for the actual kernel is mostly done and passing tests.

Design

To preface, I'm not 100% tied to the current design; I'm putting the PR up now for opinions and am totally open to alternatives, some of which I've listed below. Actually, after writing this description, I'm leaning toward the following changes:

  • Confirm whether or not we can remove all C++ logging info directly in the yaml.

Current Design

All of the CPU fallback codegen is deleted. In its place, XLA (and other external backends, later) can choose to opt into a CPU fallback by adding the following code in a C++ file. I have a corresponding xla-side PR with the xla changes.

There's no actual requirement to split up the code into a .h and .cpp file, but that's necessary in the XLA case because they sometimes need to call the fallback directly from their handcrafted kernels.

// xla_cpu_fallback.h
#include <ATen/native/CPUFallback.h>
...
void xla_cpu_fallback(const c10::OperatorHandle& op, torch::jit::Stack* stack);
...
// xla_cpu_fallback.cpp
#include "torch_xla/csrc/aten_cpu_fallback.h"
...
void xla_cpu_fallback(const c10::OperatorHandle& op, torch::jit::Stack* stack) {
  // Do custom logging here
  ...
  // Call the actual boxed CPU fallback.
  at::native::cpu_fallback(op, stack);
}

TORCH_LIBRARY_IMPL(_, XLA, m) {
  m.fallback(torch::CppFunction::makeFromBoxedFunction<&xla_cpu_fallback>());
}

Now that the fallback is exposed in the backend, they can call it directly. Doing so requires converting from an unboxed to a boxed context, for which we provide a utility function. E.g.:

#include <ATen/native/CPUFallback.h>

at::Tensor addmm(const at::Tensor& self,const at::Tensor& mat1,const at::Tensor& mat2,const at::Scalar& beta,const at::Scalar& alpha) {
  ....
  if (...call_fallback...) {
    return at::native::call_fallback_fn<&xla_cpu_fallback, decltype(at::addmm)>::call("aten::addmm", self, mat1, mat2, beta, alpha);
  }
  ...
}

That decltype(at::addmm) logic isn't actually used everywhere in the xla-side PR yet, since you hit issues with overloads. I could use it everywhere once #58092 lands.

Alternatives: The API for calling the CPU fallback directly is ugly; can we make it nicer? We could change the API to use at::redispatch, which would make it look something like this:

at::Tensor addmm(const at::Tensor& self,const at::Tensor& mat1,const at::Tensor& mat2,const at::Scalar& beta,const at::Scalar& alpha) {
  ....
  if (...call_fallback...) {
    return at::redispatch::addmm(c10::DispatchKeySet(c10::DispatchKey::CPUFallback), self, mat1, mat2, beta, alpha);
  }
  ...
}

That definitely feels cleaner, but it also requires adding a new DispatchKey just for this use case. Conditionally calling the CPU fallback doesn't sound like a hugely important use case, so I don't know whether giving up one of our 64 dispatch key slots is worth the API improvement. Totally open to other opinions though!

Another, milder improvement that would avoid having to pass operator string names (including overloads) around would be to codegen (yet another) namespaced API. Something like this:

at::Tensor addmm(const at::Tensor& self,const at::Tensor& mat1,const at::Tensor& mat2,const at::Scalar& beta,const at::Scalar& alpha) {
  ....
  if (...call_fallback...) {
    return at::fallback::addmm<&xla_cpu_fallback>(self, mat1, mat2, beta, alpha);
  }
  ...
}

Writing that out, I actually like it more (I think it'll let us get rid of decltype(...)). Maybe that is nice enough to warrant a new codegen API - I haven't tried adding it yet, but if people like it I'm happy to try it out.

More alternatives: The current design also involves the backend manually writing and registering the boxed fallback themselves, but an alternative would be for us to do that in codegen too: they would just need to pass all of the C++ logging that they want done in the fallback directly through the yaml. The main downsides:

  • Backend code that wants to call the fallback needs to abide by whatever convention our codegen uses to name the generated boxed fallback.
  • Passing custom C++ logging through yaml is just more fragile: right now xla uses an iostream to log each tensor arg in the operator, so we'd have to either force other backends into the same convention or figure something else out later.

To be fair, we actually already do that: XLA has custom per-tensor-arg logging for all of the generated out wrappers in the codegen, which we do by passing their C++ logging info through the yaml. This seems unnecessary though, since out wrappers just call into a functional kernel, which is hand written with its own custom logging. So my take is: try to remove custom C++ logging from the yaml, and if it turns out to be really necessary, then we may as well take advantage of that to codegen the fallback.

Performance impact

While ops that fall back to CPU aren't exactly a hot path, we probably don't want to use a boxed fallback if it turns out to be an absolute perf killer.

I ran my benchmarks using callgrind, benchmarking both at::add and at::add_out run on XLA. My callgrind benchmark for at::add can be found here (the add_out benchmark looks basically the same): https://www.internalfb.com/phabricator/paste/view/P415418587. I created the benchmark by hacking the existing xla C++ test build scripts and throwing in a reference to callgrind.

I also attached the full callgrind output for each benchmark; the full output is actually pretty noisy and hard to parse, so I focused on everything underneath the at::add() call in the output, which was much more stable. My guess is that the noise is due to some heavyweight async startup processing that xla does.

at::add:
  before: 88,505,130 instructions (full output: https://www.internalfb.com/phabricator/paste/view/P415421001)
  after: 102,185,654 instructions (full output: https://www.internalfb.com/phabricator/paste/view/P415421273)
  delta: ~15.5% increase

at::add_out:
  before: 63,897,395 instructions (full output: https://www.internalfb.com/intern/everpaste/?handle=GBrrKwtAPlix9wUEAOZtrFXpdO5UbsIXAAAz)
  after: 73,170,346 instructions (full output: https://www.internalfb.com/phabricator/paste/view/P415423227)
  delta: ~14.5% increase

High level takeaway: A framework overhead increase of 10-20% doesn't seem too horrible for the CPU fallback use case.

For structured, functional ops that require a CPU fallback, we're actually in an unfortunate situation: we're doing even more work than necessary. Our codegen automatically creates a CompositeExplicitAutograd kernel which calls into the out operator. So the extra work that we end up doing is:

  • An extra dispatcher hop: (at::add -> CompositeExplicitAutograd -> CPUFallback -> at::native::add) instead of (at::add -> CPUFallback -> at::native::add)
  • An unnecessary tensor allocation (the CompositeExplicitAutograd kernel uses at::empty() to create an output tensor, which is immediately overwritten by the CPU fallback)
  • An unnecessary meta() call (the CompositeExplicitAutograd kernel calls it to create the output tensor, but we call it again in the CPU kernel).
  • unboxing->boxing->unboxing logic (this is the only strictly required piece)

There are definitely ways to avoid the unnecessary work described above: one would be to give the boxed fallback higher priority than composite keys (there's an issue for it here) and to codegen fallthroughs for all composite ops. That would require more infra to set up, so I see it as a perf knob we can apply later if we need it.
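
For reference, a per-op fallthrough registration could look roughly like this; the op choice and key below are my illustration, not part of this PR (torch::CppFunction::makeFallthrough() is the existing registration utility):

#include <torch/library.h>

// Hypothetical sketch: make dispatch skip the codegen'd composite kernel
// for one op, so it falls through to the backend's boxed CPU fallback.
TORCH_LIBRARY_IMPL(aten, CompositeExplicitAutograd, m) {
  m.impl("add.Tensor", torch::CppFunction::makeFallthrough());
}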

Unfortunately I couldn't dig much deeper into the differences aside from the aggregate change in instructions, since it looks like callgrind fudged some of the instruction attribution (at::to_cpu takes up a ton of instructions, but I don't see any attribution for the at::native::add kernel anywhere).

Differential Revision: D28833085

NOTE FOR REVIEWERS: This PR has internal Facebook specific changes or comments, please review them on Phabricator! ghstack-source-id: 132409766


Friday 2021-06-25 18:53:13 by AnirudhaAgrawal1

Create README.md

TMJRock

A JavaScript framework that lets web developers create responsive websites easily. Users need not worry about writing AJAX calls or creating and managing components.

Features

  1. AJAX Request
  2. Form Validation
  3. Accordian Pane
  4. Modal View

How to use this framework

Download the TMJRock.js and TMJRock.css files and place them in your working directory, then include both files in your html file.

<script src='js/TMJRock.js'></script>
<link rel='stylesheet' type='text/css' href='css/TMJRock.css'>

Examples

AJAX Request

<!DOCTYPE html>
<html lang='en'>
<head>
  <meta charset='utf-8'>
  <title>Ajax Example</title>
  <script src='js/TMJRock.js'></script>
  <script>
    function populateDesignations()
    {
      $$$.ajax({
        "url": "servletOne",
        "methodType": "GET",
        "success": function(responseData) {
          alert(responseData);
        },
        "failure": function() {
          alert("Some problem");
        }
      });
    }
    window.addEventListener('load', populateDesignations);
  </script>
</head>
<body>
  <h1>Get type request</h1>
  <br>
  <br>
</body>
</html>

Form Validation

<!DOCTYPE html>
<html lang='en'>
<head>
  <title>TMJRock Validation</title>
  <script src='js/TMJRock.js'></script>
  <script>
    function doSomething()
    {
      return $$$("someForm").isValid({
        "nm": {
          "required": true,
          "maxLength": 20,
          "errorPane": "nmErrorSection",
          "errors": {
            "required": "Name required",
            "maxLength": "Name cannot exceed 20 characters"
          }
        },
        "ad": {
          "required": true,
          "errorPane": "adErrorSection",
          "errors": {
            "required": "Address required"
          }
        },
        "ct": {
          "invalid": -1,
          "errorPane": "ctErrorSection",
          "errors": {
            "invalid": "Select city"
          }
        },
        "gender": {
          "required": true,
          "errorPane": "genderErrorSection",
          "errors": {
            "required": "Select gender"
          }
        },
        "agree": {
          "requiredState": true,
          "displayAlert": true,
          "errors": {
            "requiredState": "Select the I agree checkbox"
          }
        }
      });
    }
  </script>
</head>
<body>
  <h1>TMJRock validations</h1>
  <form id='someForm' onsubmit='return doSomething()'>
    Name <input type='text' name='nm' id='nm'><span id='nmErrorSection' style='color:red'></span><br>
    Address <textarea id='ad' name='ad'></textarea><span id='adErrorSection' style='color:red'></span><br>
    Select city
    <select id='ct' name='ct'>
      <option value='-1'>select city</option>
      <option value='1'>Ujjain</option>
      <option value='2'>Dewas</option>
      <option value='3'>Indore</option>
    </select>
    <span id='ctErrorSection' style='color:red'></span><br>
    <br>
    Gender &nbsp;&nbsp;&nbsp;
    Male <input type='radio' name='gender' id='ml' value='M'>&nbsp;&nbsp;&nbsp;
    Female <input type='radio' name='gender' id='fe' value='F'>&nbsp;&nbsp;&nbsp;
    <span id='genderErrorSection' style='color:red'></span>
    <br>
    <input type='checkbox' name='agree' id='ag' value='y'> I agree?
    <br>
    <button type='submit'>Register</button>
  </form>
</body>
</html>


Accordian Pane


<!DOCTYPE html>
<html lang='en'>
<head>
  <title>TMJRock Accordian Pane</title>
  <script src='TMJRock.js'></script>
</head>
<body>
  <h1>TMJRock Accordian Pane</h1>
  <div accordian="true">
    <h3>Heading 1</h3>
    <div>
      1 Whatever Whatever Whatever Whatever
      2 Whatever Whatever Whatever Whatever
      3 Whatever Whatever Whatever Whatever
      4 Whatever Whatever Whatever Whatever
      5 Whatever Whatever Whatever Whatever
      6 Whatever Whatever Whatever Whatever
    </div>
    <h3>Heading 2</h3>
    <div>
      11 Whatever Whatever Whatever Whatever
      22 Whatever Whatever Whatever Whatever
      33 Whatever Whatever Whatever Whatever
      44 Whatever Whatever Whatever Whatever
      55 Whatever Whatever Whatever Whatever
      66 Whatever Whatever Whatever Whatever
    </div>
    <h3>Heading 3</h3>
    <div>
      111 Whatever Whatever Whatever Whatever
      222 Whatever Whatever Whatever Whatever
      333 Whatever Whatever Whatever Whatever
      444 Whatever Whatever Whatever Whatever
      555 Whatever Whatever Whatever Whatever
      666 Whatever Whatever Whatever Whatever
    </div>
  </div>
  <div id='gogo' accordian="true">
    <h3>Heading 1000</h3>
    <div>
      1 Whatever Whatever Whatever Whatever
      2 Whatever Whatever Whatever Whatever
      3 Whatever Whatever Whatever Whatever
      4 Whatever Whatever Whatever Whatever
      5 Whatever Whatever Whatever Whatever
      6 Whatever Whatever Whatever Whatever
    </div>
    <h3>Heading 2000</h3>
    <div>
      11 Whatever Whatever Whatever Whatever
      22 Whatever Whatever Whatever Whatever
      33 Whatever Whatever Whatever Whatever
      44 Whatever Whatever Whatever Whatever
      55 Whatever Whatever Whatever Whatever
      66 Whatever Whatever Whatever Whatever
    </div>
    <h3>Heading 3000</h3>
    <div>
      111 Whatever Whatever Whatever Whatever
      222 Whatever Whatever Whatever Whatever
      333 Whatever Whatever Whatever Whatever
      444 Whatever Whatever Whatever Whatever
      555 Whatever Whatever Whatever Whatever
      666 Whatever Whatever Whatever Whatever
    </div>
  </div>
</body>
</html>


Modal View

<!DOCTYPE html>
<html lang='en'>
<head>
  <title>TMJRock Modal View</title>
  <script src='TMJRock.js'></script>
  <link rel='stylesheet' type='text/css' href='TMJRock.css'>
  <script>
    function onOpened()
    {
      alert("Modal with id ab opened");
    }
    function onClosed()
    {
      alert("Modal with id ab closed");
    }
    function abBeforeOpening()
    {
      alert("ab is ready to be opened");
      return true;
    }
    function abBeforeClosing()
    {
      alert("ab is going to be closed");
      return true;
    }
    function createModal1()
    {
      $$$.modals.show('ab');
    }
    function createModal2()
    {
      $$$.modals.show('pq');
    }
  </script>
</head>
<body>
  <h1>TMJRock Modal view</h1>
  <button type='button' onclick='createModal1()'>show first modal</button>
  <button type='button' onclick='createModal2()'>show second modal</button>
  <div id='ab' forModal='true' size='200x300' header='Some Heading' footer='Some footer' maskColor='#808080' modalBackgroundColor='#FFFFFF' closeButton='true' afterOpening='onOpened()' afterClosing='onClosed()' beforeOpening='abBeforeOpening()' beforeClosing='abBeforeClosing()'>
    GOD is great
  </div>
  <div id='pq' forModal='true'>
    Ujjain is the city of GODS
  </div>
</body>
</html>


Friday 2021-06-25 19:02:32 by sephirothbahamut

I hereby present to you the end result of two weeks of pain and suffering: Client and goddamn Server


Friday 2021-06-25 21:42:44 by Anthony

LESSS GOOO I DID IT

Maybe if I put this much work into my social life I would have friends


Friday 2021-06-25 22:55:55 by Christopher Besch

yeah, I know, it doesn't work, but I kinda had to help my brother get his wifi working...


Friday 2021-06-25 23:28:23 by Brian Hirsh

add a boxed CPU fallback kernel (#58065)

Summary: Pull Request resolved: pytorch/pytorch#58065

This PR replaces the existing code-generated CPU fallback kernels that XLA uses with a single boxed CPU fallback.

Current state: there are a couple of different design ideas that I want to point out, but the logic for the actual kernel is mostly done and passing tests.

Design

To preface, I'm not 100% tied to the current design; I'm putting the PR up now for opinions and am totally open to alternatives, some of which I've listed below. Actually, after writing this description, I'm leaning toward the following changes:

  • Confirm whether or not we can remove all C++ logging info directly in the yaml.

Current Design

All of the CPU fallback codegen is deleted. In its place, XLA (and other external backends, later) can choose to opt into a CPU fallback by adding the following code in a C++ file. I have a corresponding xla-side PR with the xla changes.

There's no actual requirement to split up the code into a .h and .cpp file, but that's necessary in the XLA case because they sometimes need to call the fallback directly from their handcrafted kernels.

// xla_cpu_fallback.h
#include <ATen/native/CPUFallback.h>
...
void xla_cpu_fallback(const c10::OperatorHandle& op, torch::jit::Stack* stack);
...
// xla_cpu_fallback.cpp
#include "torch_xla/csrc/aten_cpu_fallback.h"
...
void xla_cpu_fallback(const c10::OperatorHandle& op, torch::jit::Stack* stack) {
  // Do custom logging here
  ...
  // Call the actual boxed CPU fallback.
  at::native::cpu_fallback(op, stack);
}

TORCH_LIBRARY_IMPL(_, XLA, m) {
  m.fallback(torch::CppFunction::makeFromBoxedFunction<&xla_cpu_fallback>());
}

Now that the fallback is exposed in the backend, they can call it directly. Doing so requires converting from an unboxed to a boxed context, for which we provide a utility function. E.g.:

#include <ATen/native/CPUFallback.h>

at::Tensor addmm(const at::Tensor& self,const at::Tensor& mat1,const at::Tensor& mat2,const at::Scalar& beta,const at::Scalar& alpha) {
  ....
  if (...call_fallback...) {
    return at::native::call_fallback_fn<&xla_cpu_fallback, decltype(at::addmm)>::call("aten::addmm", self, mat1, mat2, beta, alpha);
  }
  ...
}

That decltype(at::addmm) logic isn't actually used everywhere in the xla-side PR yet, since you hit issues with overloads. I could use it everywhere once #58092 lands.

Alternatives: The API for calling the CPU fallback directly is ugly; can we make it nicer? We could change the API to use at::redispatch, which would make it look something like this:

at::Tensor addmm(const at::Tensor& self,const at::Tensor& mat1,const at::Tensor& mat2,const at::Scalar& beta,const at::Scalar& alpha) {
  ....
  if (...call_fallback...) {
    return at::redispatch::addmm(c10::DispatchKeySet(c10::DispatchKey::CPUFallback), self, mat1, mat2, beta, alpha);
  }
  ...
}

That definitely feels cleaner, but it also requires adding a new DispatchKey just for this use case. Conditionally calling the CPU fallback doesn't sound like a hugely important use case, so I don't know whether giving up one of our 64 dispatch key slots is worth the API improvement. Totally open to other opinions though!

Another, milder improvement that would avoid having to pass operator string names (including overloads) around would be to codegen (yet another) namespaced API. Something like this:

at::Tensor addmm(const at::Tensor& self,const at::Tensor& mat1,const at::Tensor& mat2,const at::Scalar& beta,const at::Scalar& alpha) {
  ....
  if (...call_fallback...) {
    return at::fallback::addmm<&xla_cpu_fallback>(self, mat1, mat2, beta, alpha);
  }
  ...
}

Writing that out, I actually like it more (I think it'll let us get rid of decltype(...)). Maybe that is nice enough to warrant a new codegen API - I haven't tried adding it yet, but if people like it I'm happy to try it out.

More alternatives: The current design also involves the backend manually writing and registering the boxed fallback themselves, but an alternative would be for us to do that in codegen too: they would just need to pass all of the C++ logging that they want done in the fallback directly through the yaml. The main downsides:

  • Backend code that wants to call the fallback needs to abide by whatever convention our codegen uses to name the generated boxed fallback.
  • Passing custom C++ logging through yaml is just more fragile: right now xla uses an iostream to log each tensor arg in the operator, so we'd have to either force other backends into the same convention or figure something else out later.

To be fair, we actually already do that: XLA has custom per-tensor-arg logging for all of the generated out wrappers in the codegen, which we do by passing their C++ logging info through the yaml. This seems unnecessary though, since out wrappers just call into a functional kernel, which is hand written with its own custom logging. So my take is: try to remove custom C++ logging from the yaml, and if it turns out to be really necessary, then we may as well take advantage of that to codegen the fallback.

Performance impact

While ops that fall back to CPU aren't exactly a hot path, we probably don't want to use a boxed fallback if it turns out to be an absolute perf killer.

I ran my benchmarks using callgrind, benchmarking both at::add and at::add_out run on XLA. My callgrind benchmark for at::add can be found here (the add_out benchmark looks basically the same): https://www.internalfb.com/phabricator/paste/view/P415418587. I created the benchmark by hacking the existing xla C++ test build scripts and throwing in a reference to callgrind.

I also attached the full callgrind output for each benchmark; the full output is actually pretty noisy and hard to parse, so I focused on everything underneath the at::add() call in the output, which was much more stable. My guess is that the noise is due to some heavyweight async startup processing that xla does.

at::add:
  before: 88,505,130 instructions (full output: https://www.internalfb.com/phabricator/paste/view/P415421001)
  after: 102,185,654 instructions (full output: https://www.internalfb.com/phabricator/paste/view/P415421273)
  delta: ~15.5% increase

at::add_out:
  before: 63,897,395 instructions (full output: https://www.internalfb.com/intern/everpaste/?handle=GBrrKwtAPlix9wUEAOZtrFXpdO5UbsIXAAAz)
  after: 73,170,346 instructions (full output: https://www.internalfb.com/phabricator/paste/view/P415423227)
  delta: ~14.5% increase

High level takeaway: A framework overhead increase of 10-20% doesn't seem too horrible for the CPU fallback use case.

For structured, functional ops that require a CPU fallback, we're actually in an unfortunate situation: we're doing even more work than necessary. Our codegen automatically creates a CompositeExplicitAutograd kernel which calls into the out operator. So the extra work that we end up doing is:

  • An extra dispatcher hop: (at::add -> CompositeExplicitAutograd -> CPUFallback -> at::native::add) instead of (at::add -> CPUFallback -> at::native::add)
  • An unnecessary tensor allocation (the CompositeExplicitAutograd kernel uses at::empty() to create an output tensor, which is immediately overwritten by the CPU fallback)
  • An unnecessary meta() call (the CompositeExplicitAutograd kernel calls it to create the output tensor, but we call it again in the CPU kernel).
  • unboxing->boxing->unboxing logic (this is the only strictly required piece)

There are definitely ways to avoid the unnecessary work described above: one would be to give the boxed fallback higher priority than composite keys (there's an issue for it here) and to codegen fallthroughs for all composite ops. That would require more infra to set up, so I see it as a perf knob we can apply later if we need it.
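
For reference, a per-op fallthrough registration could look roughly like this; the op choice and key below are my illustration, not part of this PR (torch::CppFunction::makeFallthrough() is the existing registration utility):

#include <torch/library.h>

// Hypothetical sketch: make dispatch skip the codegen'd composite kernel
// for one op, so it falls through to the backend's boxed CPU fallback.
TORCH_LIBRARY_IMPL(aten, CompositeExplicitAutograd, m) {
  m.impl("add.Tensor", torch::CppFunction::makeFallthrough());
}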

Unfortunately I couldn't dig much deeper into the differences aside from the aggregate change in instructions, since it looks like callgrind fudged some of the instruction attribution (at::to_cpu takes up a ton of instructions, but I don't see any attribution for the at::native::add kernel anywhere).

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D28833085

Pulled By: bdhirsh

fbshipit-source-id: 537ebd5d7fb5858f1158764ff47132d503c3b92b


< 2021-06-25 >