Skip to content
This repository has been archived by the owner on Jan 6, 2022. It is now read-only.

Semantics integration, per-function throwaway databases for heavyweight dataflow analyses #13

Open
pgoodman opened this issue Jan 17, 2021 · 2 comments
Assignees

Comments

@pgoodman
Copy link
Contributor

This issue serves to track ideas and goals related to integrating instruction semantics.

Basic example:

EA_LEA:   lea eax, [edi + esi * 2 + 0xf00]
EA_MOV:   mov ecx, dword ptr [eax]

From this, we'd like to create the following approximate relations:

raw_operation(0, FuncEA, EA_LEA, READ_REGISTER_32, REG_EDI, _, _)  ; read edi
raw_operation(1, FuncEA, EA_LEA, READ_REGISTER_32, REG_ESI, _, _)  ; read esi
raw_operation(2, FuncEA, EA_LEA, CONSTANT_32, 0xf00, _, _)
raw_operation(3, FuncEA, EA_LEA, BASE_2xINDEX_DISP, 0, 1, 2) ; edi + (esi * 2) + 0xf00
raw_operation(4, EA_LEA, WRITE_REGISTER_32, REG_EAX, 3, _)

raw_operation(5, FuncEA, EA_MOV, READ_REGISTER_32, REG_EAX, _, _)
raw_operation(6, FuncEA, EA_MOV, READ_MEMORY_32, 5, _, _)  ; `dword ptr [eax]`
raw_operation(7, FuncEA, EA_MOV, WRITE_REGISTER_32, REG_ECX, 6, _)

In the above, things like REG_EDI would be the unique IDs of values for the registers. So probably raw_operation wouldn't start at 0.

From here, we'd want rules that do some basic things like copy propagation. This means definition the post-instruction register state. These rules would exist in a separare datalog db, with one instance per function. The idea would be to run these, have them publish messages that we'd store back into the main db, then destroy these instances until they're needed again. Key idea: throwaway databases.

; Orchestrated by creator of throwaway db, unique values for `ValOnFuncEntry`.
#message func_entry_reg_state(FuncEA, RegName, ValOnFuncEntry)

; Sent from the users of the main DB to us, based off of `transfer`.
#message local_transfer(FromEA, ToEA)

; !!!!
; Could have a bunch of messages, e.g. for constants or symbolic values that are
; user-provided, for interactive constant folding / symexec during a speculative execution
; or speculative specialization of a function.
; !!!!

; Base case for beginning of a function: each register has a tombstone `VAL_ON_FUNC_ENTRY` value
; on entry to a function.
post_inst_reg_state(FuncEA, RegName, ValOnFuncEntry)
    : func_entry_reg_state(FuncEA, RegName, ValOnFuncEntry).
    , !raw_operation(_, FuncEA, FuncEA, WRITE_REGISTER_32, RegName, _, _).

; Any operation that writes out the register value submits its value to the post-register state
; of that instruction. 
post_inst_reg_state(InsnEA, RegName, WrittenId)
    : raw_operation(_, FuncEA, InsnEA, WRITE_REGISTER_32, RegName, WrittenId, _).

; If we have a flow from `PredInsnEA` to `InsnEA`, and if `InsnEA` doesn't write to
; `RegName`, then propagate the value of `WrittenId` for `RegName`.
post_inst_reg_state(FuncEA, InsnEA, RegName, WrittenId)
    : local_transfer(PredInsnEA, InsnEA)
    , !raw_operation(_, FuncEA, InsnEA, WRITE_REGISTER_32, RegName, _, _)
    , post_inst_reg_state(FuncEA, PredInsnEA, RegName, WrittenId).

I think with the simple rules above, execution would converge toward the following:

  • the datalog system would explore all possible paths through the function
  • the datalog system /should/ converge because the above doesn't do any actual constant folding, i.e. it never invents or introduces new values, just copied some around. It's tantamount to repeated matrix multiplication.

For every point in a function, we would be able to express register values in terms of register values from another place in the function. This is something that the GrammaTech people mentioned in their ddisasm paper. Basically, we could say: there exists a path such that the register written at instruction EA1 is read by the instruction at EA2.

I think one thing that becomes apparent from this type of "raw operation" representation is that we'd want to represent conditional register writes that either preserve the register contents or alter them, that way we can model those data-centric flows.

@pgoodman pgoodman self-assigned this Jan 17, 2021
@pgoodman
Copy link
Contributor Author

post_inst_reg_state is basically reaching definitions :-)

@pgoodman
Copy link
Contributor Author

I think it'd also be interesting to do a form of equality saturation over the IRs. This is be beneficial for loads/stores being converted into things like typed field accesses.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

1 participant