Question about multi-threading #2

MolVlad · 2025-01-22T08:13:36Z

I wonder whether it is possible to achieve further acceleration of decoding one frame by using multi-threading. Have you studied this question? Thanks in advance!

williamyang98 · 2025-01-27T04:07:17Z

Parallelisation inside one symbol

It should be possible to parallelise this for loop inside the butterfly.

ViterbiDecoderCpp/include/viterbi/viterbi_decoder_scalar.h

Line 64 in 1758604

    
           for (size_t curr_state = 0u; curr_state < Base::BranchTable::NUMSTATES; curr_state++) {

For codes with a large constraint length like Cassini (K=15, NUMSTATES = 2^(K-2) = 8192), there should be performance improvements with multithreading if the overhead is low enough. (Probably wouldn't be beneficial K is small and the NUMSTATES is small).

Parallelisation between all symbols

As for parallising the entire decoding of the symbol frame

ViterbiDecoderCpp/include/viterbi/viterbi_branch_table.h

Line 45 in 1758604

for (size_t state = 0u; state < NUMSTATES; state++) {

Branch error metrics for each symbol is calculated from the error metrics from the previous symbol. I'm not sure how you would eliminate or work around this data dependency to make parallelisation possible.

ViterbiDecoderCpp/include/viterbi/viterbi_decoder_scalar.h

Lines 109 to 134 in 1758604

    
           // r = leading bit of previous state 
        
           // s = input bit 
        
           // g = parity bit corresponding to output symbol 
        
           //  next_error_r_s = r^s^g 
        
           const error_t next_error_0_0 = old_metric[curr_state_0] + total_error; 
        
           const error_t next_error_1_0 = old_metric[curr_state_1] + inverted_error; 
        
           const error_t next_error_0_1 = old_metric[curr_state_0] + inverted_error; 
        
           const error_t next_error_1_1 = old_metric[curr_state_1] + total_error; 
        
           // TODO: When adding our error metrics we may cause an overflow to happen  
        
           //       if the renormalisation step was not performed in time 
        
           //       Our intrinsics implementations use saturated arithmetic to prevent an overflow 
        
           //       Perhaps it is possible to use something like GCC's builtin saturated add here? 
        
           // Select the previous state r with a lower error for an input bit s 
        
           const decision_bits_t decision_0 = next_error_0_0 > next_error_1_0; 
        
           const decision_bits_t decision_1 = next_error_0_1 > next_error_1_1; 
        
           // Update metrics 
        
           new_metric[next_state_0] = decision_0 ? next_error_1_0 : next_error_0_0; 
        
           new_metric[next_state_1] = decision_1 ? next_error_1_1 : next_error_0_1; 
        
           // Store the leading bit for the previous state for the next states (X|0) and (X|1) 
        
           const decision_bits_t bits = decision_0 | (decision_1 << 1); 
        
           const size_t curr_pack_index = next_state_0 / Base::Decisions::TOTAL_BITS_PER_BLOCK; 
        
           const size_t curr_pack_bit   = next_state_0 % Base::Decisions::TOTAL_BITS_PER_BLOCK; 
        
           decision[curr_pack_index] |= (bits << curr_pack_bit);

You could calculate the branch errors for each symbol in parallel

ViterbiDecoderCpp/include/viterbi/viterbi_decoder_scalar.h

Lines 67 to 83 in 1758604

    
           for (size_t i = 0; i < Base::R; i++) { 
        
               const soft_t sym = symbols[i]; 
        
               const soft_t expected_sym = base.m_branch_table[i][curr_state]; 
        
               const soft_t error = expected_sym - sym; 
        
               const error_t abs_error = error_t(get_abs(error)); 
        
               total_error += abs_error; 
        
           } 
        
           assert(total_error <= base.m_config.soft_decision_max_error); 
        
           // We only store half the states in the branch table, but here we expand it out to explore the other unstored half 
        
           // Both state 0 and state 1 when shifted give the same next state (for the same input bit) 
        
           // Most significant bit for state 1 is shifted out when considering the next state 
        
           // since it is shifted out of the (K-1) bits considered for the next state 
        
           const size_t curr_state_0 = curr_state;                               // Goes up to K-2 bits 
        
           const size_t curr_state_1 = curr_state + Base::Metrics::NUMSTATES/2;  // Adds (K-1)th bit 
        
           const size_t next_state_0 = (curr_state << 1) | 0; 
        
           const size_t next_state_1 = (curr_state << 1) | 1;

and then do some kind of reduction (add error, find min error, repeat) to get the final decison bits (this part would not be parallel).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Question about multi-threading #2

Question about multi-threading #2

MolVlad commented Jan 22, 2025

williamyang98 commented Jan 27, 2025

Question about multi-threading #2

Question about multi-threading #2

Comments

MolVlad commented Jan 22, 2025

williamyang98 commented Jan 27, 2025

Parallelisation inside one symbol

Parallelisation between all symbols