Semantics integration, per-function throwaway databases for heavyweight dataflow analyses #13

pgoodman · 2021-01-17T02:44:02Z

This issue serves to track ideas and goals related to integrating instruction semantics.

Basic example:

EA_LEA:   lea eax, [edi + esi * 2 + 0xf00]
EA_MOV:   mov ecx, dword ptr [eax]

From this, we'd like to create the following approximate relations:

raw_operation(0, FuncEA, EA_LEA, READ_REGISTER_32, REG_EDI, _, _)  ; read edi
raw_operation(1, FuncEA, EA_LEA, READ_REGISTER_32, REG_ESI, _, _)  ; read esi
raw_operation(2, FuncEA, EA_LEA, CONSTANT_32, 0xf00, _, _)
raw_operation(3, FuncEA, EA_LEA, BASE_2xINDEX_DISP, 0, 1, 2) ; edi + (esi * 2) + 0xf00
raw_operation(4, EA_LEA, WRITE_REGISTER_32, REG_EAX, 3, _)

raw_operation(5, FuncEA, EA_MOV, READ_REGISTER_32, REG_EAX, _, _)
raw_operation(6, FuncEA, EA_MOV, READ_MEMORY_32, 5, _, _)  ; `dword ptr [eax]`
raw_operation(7, FuncEA, EA_MOV, WRITE_REGISTER_32, REG_ECX, 6, _)

In the above, things like REG_EDI would be the unique IDs of values for the registers. So probably raw_operation wouldn't start at 0.

From here, we'd want rules that do some basic things like copy propagation. This means definition the post-instruction register state. These rules would exist in a separare datalog db, with one instance per function. The idea would be to run these, have them publish messages that we'd store back into the main db, then destroy these instances until they're needed again. Key idea: throwaway databases.

; Orchestrated by creator of throwaway db, unique values for `ValOnFuncEntry`.
#message func_entry_reg_state(FuncEA, RegName, ValOnFuncEntry)

; Sent from the users of the main DB to us, based off of `transfer`.
#message local_transfer(FromEA, ToEA)

; !!!!
; Could have a bunch of messages, e.g. for constants or symbolic values that are
; user-provided, for interactive constant folding / symexec during a speculative execution
; or speculative specialization of a function.
; !!!!

; Base case for beginning of a function: each register has a tombstone `VAL_ON_FUNC_ENTRY` value
; on entry to a function.
post_inst_reg_state(FuncEA, RegName, ValOnFuncEntry)
    : func_entry_reg_state(FuncEA, RegName, ValOnFuncEntry).
    , !raw_operation(_, FuncEA, FuncEA, WRITE_REGISTER_32, RegName, _, _).

; Any operation that writes out the register value submits its value to the post-register state
; of that instruction. 
post_inst_reg_state(InsnEA, RegName, WrittenId)
    : raw_operation(_, FuncEA, InsnEA, WRITE_REGISTER_32, RegName, WrittenId, _).

; If we have a flow from `PredInsnEA` to `InsnEA`, and if `InsnEA` doesn't write to
; `RegName`, then propagate the value of `WrittenId` for `RegName`.
post_inst_reg_state(FuncEA, InsnEA, RegName, WrittenId)
    : local_transfer(PredInsnEA, InsnEA)
    , !raw_operation(_, FuncEA, InsnEA, WRITE_REGISTER_32, RegName, _, _)
    , post_inst_reg_state(FuncEA, PredInsnEA, RegName, WrittenId).

I think with the simple rules above, execution would converge toward the following:

the datalog system would explore all possible paths through the function
the datalog system /should/ converge because the above doesn't do any actual constant folding, i.e. it never invents or introduces new values, just copied some around. It's tantamount to repeated matrix multiplication.

For every point in a function, we would be able to express register values in terms of register values from another place in the function. This is something that the GrammaTech people mentioned in their ddisasm paper. Basically, we could say: there exists a path such that the register written at instruction EA1 is read by the instruction at EA2.

I think one thing that becomes apparent from this type of "raw operation" representation is that we'd want to represent conditional register writes that either preserve the register contents or alter them, that way we can model those data-centric flows.

The text was updated successfully, but these errors were encountered:

pgoodman · 2021-01-17T03:25:45Z

post_inst_reg_state is basically reaching definitions :-)

pgoodman · 2021-01-21T23:30:41Z

I think it'd also be interesting to do a form of equality saturation over the IRs. This is be beneficial for loads/stores being converted into things like typed field accesses.

pgoodman added the discussion label Jan 17, 2021

pgoodman self-assigned this Jan 17, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Semantics integration, per-function throwaway databases for heavyweight dataflow analyses #13

Semantics integration, per-function throwaway databases for heavyweight dataflow analyses #13

pgoodman commented Jan 17, 2021

pgoodman commented Jan 17, 2021

pgoodman commented Jan 21, 2021

Semantics integration, per-function throwaway databases for heavyweight dataflow analyses #13

Semantics integration, per-function throwaway databases for heavyweight dataflow analyses #13

Comments

pgoodman commented Jan 17, 2021

pgoodman commented Jan 17, 2021

pgoodman commented Jan 21, 2021