Barrier implementation using SPE signals

ouasse
Posts: 80
Joined: Mon Jul 30, 2007 5:58 am
Location: Paris, France

Post by ouasse »

Hello,

As I've had little support on this problem from the IBM forum, maybe somebody here can help me.

I want to implement a barrier between the SPEs, with no PPE intervention.

A barrier is a synchronization object which can be used to create meeting points between different processors. A processor which has called the barrier remains blocked until all other processors have also called the barrier.
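
As a point of reference, the behaviour described above can be sketched on a host machine with plain POSIX threads. This is only an illustration of what a barrier does, not the SPE implementation discussed in this thread; `simple_barrier`, `run_barrier_demo` and the other names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>

#define N_THREADS 4

/* Hypothetical host-side barrier: blocks callers until `count` have arrived. */
typedef struct {
  pthread_mutex_t lock;
  pthread_cond_t  all_here;
  int count;            /* number of participants */
  int waiting;          /* participants currently blocked */
  unsigned generation;  /* incremented each time the barrier opens */
} simple_barrier;

static void simple_barrier_init(simple_barrier *b, int count) {
  pthread_mutex_init(&b->lock, NULL);
  pthread_cond_init(&b->all_here, NULL);
  b->count = count;
  b->waiting = 0;
  b->generation = 0;
}

static void simple_barrier_wait(simple_barrier *b) {
  pthread_mutex_lock(&b->lock);
  unsigned gen = b->generation;
  if (++b->waiting == b->count) {
    /* Last arrival: open the barrier for everybody. */
    b->waiting = 0;
    b->generation++;
    pthread_cond_broadcast(&b->all_here);
  } else {
    /* Earlier arrivals sleep until the generation changes. */
    while (gen == b->generation)
      pthread_cond_wait(&b->all_here, &b->lock);
  }
  pthread_mutex_unlock(&b->lock);
}

static simple_barrier demo_barrier;
static pthread_mutex_t arrivals_lock = PTHREAD_MUTEX_INITIALIZER;
static int arrivals;                 /* guarded by arrivals_lock */
static int seen_after[N_THREADS];    /* arrivals observed after the barrier */

static void *demo_worker(void *arg) {
  int rank = *(int *)arg;
  pthread_mutex_lock(&arrivals_lock);
  arrivals++;
  pthread_mutex_unlock(&arrivals_lock);

  simple_barrier_wait(&demo_barrier);

  pthread_mutex_lock(&arrivals_lock);
  seen_after[rank] = arrivals;
  pthread_mutex_unlock(&arrivals_lock);
  return NULL;
}

/* Returns 1 if every thread saw all N_THREADS arrivals once past the barrier. */
static int run_barrier_demo(void) {
  pthread_t threads[N_THREADS];
  int ranks[N_THREADS];
  int i, ok = 1;

  arrivals = 0;
  simple_barrier_init(&demo_barrier, N_THREADS);
  for (i = 0; i < N_THREADS; i++) {
    ranks[i] = i;
    pthread_create(&threads[i], NULL, demo_worker, &ranks[i]);
  }
  for (i = 0; i < N_THREADS; i++)
    pthread_join(threads[i], NULL);
  for (i = 0; i < N_THREADS; i++)
    if (seen_after[i] != N_THREADS)
      ok = 0;
  pthread_cond_destroy(&demo_barrier.all_here);
  pthread_mutex_destroy(&demo_barrier.lock);
  return ok;
}
```

Calling `run_barrier_demo()` from a `main` function (built with `-pthread`) exercises the barrier once with four threads.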

My approach uses signals in ORed mode. My code looks okay to me, but unfortunately it hangs on my PS3. The program runs as expected on the Cell simulator (SDK 3.0).

The test code follows. First, a global definition (shared by the PPE and SPE code):

Code: Select all

typedef struct {
  int spe_rank;
  int spe_count;
  int __dummy[2];
  uint64_t sig1[6];
} spe_data;
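
Since this structure is later fetched with a single `mfc_get` of `sizeof(d)` bytes, and MFC transfers larger than 8 bytes need a size that is a multiple of 16, the layout can be checked at compile time. A minimal sketch, assuming a 4-byte `int` (true on the PPE, the SPU and common hosts):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
  int spe_rank;
  int spe_count;
  int __dummy[2];     /* padding: places sig1 at offset 16 */
  uint64_t sig1[6];   /* one effective address per SPE */
} spe_data;

/* 4 + 4 + 2*4 + 6*8 = 64 bytes */
_Static_assert(sizeof(spe_data) == 64, "unexpected spe_data size");
_Static_assert(sizeof(spe_data) % 16 == 0, "DMA size must be a multiple of 16");
_Static_assert(offsetof(spe_data, sig1) == 16, "sig1 should start at offset 16");
```

If `SPE_COUNT` changed, the `sig1` array length and these checks would need to change with it.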

Then, the PPE program (includes removed):

Code: Select all

#define SPE_COUNT 6
 
extern spe_program_handle_t barrier_sig_spu_handle;
 
typedef struct {
  spe_data d __attribute((aligned(16)));
  spe_context_ptr_t context;
  pthread_t thread;
} spe_id;
 
 
void *main_thr(void *ptr) {
  spe_id *id = (spe_id*)ptr;
  unsigned int entry_point = SPE_DEFAULT_ENTRY;
  int retval;
  do
    retval = spe_context_run(id->context, &entry_point, 0, &id->d, NULL, NULL);
  while (retval > 0);                  /* Run until exit or error */
  if(retval)
    perror("An error occurred running the SPE program");
  return NULL;
}
 
 
 
int main(int argc, char *argv[])
{
  spe_id spe[SPE_COUNT] __attribute((aligned(16)));
  int i, j;
  uint32_t dummy;
  uint64_t sig1_ea;
 
  for (i=0; i<SPE_COUNT; i++) {
    spe[i].d.spe_rank = i;
    spe[i].d.spe_count = SPE_COUNT;
    spe[i].context = spe_context_create(SPE_EVENTS_ENABLE
        | SPE_CFG_SIGNOTIFY1_OR | SPE_MAP_PS, NULL);
    spe_program_load(spe[i].context, &barrier_sig_spu_handle);
    /* Address of this SPE's signal notification register 1
       (offset 12 inside the mapped area) */
    sig1_ea = (unsigned int)spe_ps_area_get(spe[i].context,
        SPE_SIG_NOTIFY_1_AREA) + 12;
    /* Every SPE gets the signal address of every SPE */
    for (j=0; j<SPE_COUNT; j++)
      spe[j].d.sig1[i] = sig1_ea;
  }

  for (i=0; i<SPE_COUNT; i++) {
    pthread_create(&spe[i].thread, NULL, main_thr, &spe[i]);
  }

  /* Wait until every SPE reports that it has fetched its data */
  for (i=0; i<SPE_COUNT; i++) {
    while (!spe_out_mbox_status(spe[i].context));
    spe_out_mbox_read(spe[i].context, &dummy, 1);
  }
  /* Release all SPEs */
  dummy = 0;
  for (i=0; i<SPE_COUNT; i++)
    spe_in_mbox_write(spe[i].context, &dummy, 1, SPE_MBOX_ALL_BLOCKING);

  for (i=0; i<SPE_COUNT; i++) {
    pthread_join(spe[i].thread, NULL);
  }
  return 0;
}


And finally, the SPE code:

Code: Select all

spe_data d __attribute((aligned(16)));

void spe_barrier(void) {
  volatile vec_uint4 signal;
  int i;
  void *ls = ((char*)&signal)+12;
  uint32_t expected = (1<<d.spe_count)-1;
  uint32_t received = 1<<d.spe_rank;
  signal = spu_promote(received, 3);
  for (i=0; i<d.spe_count; i++)
    if (i != d.spe_rank) {
      mfc_sndsig(ls, d.sig1[i], 4, 0, 0);
    }
  while (received != expected) {
    received |= spu_read_signal1();
    /*printf("spe%d received = %d\n", d.spe_rank, received);*/
  }
}

int main(unsigned long long spe_id, unsigned long long pdata)
{
  mfc_get(&d, pdata, sizeof(d), 0, 0, 0);
  mfc_write_tag_mask(1<<0);
  spu_mfcstat(MFC_TAG_UPDATE_ALL);

  spu_write_out_mbox(0);
  spu_read_in_mbox();

  printf("spe%d/%d: ready\n", d.spe_rank, d.spe_count);
  spe_barrier();
  printf("spe%d passed the barrier\n", d.spe_rank);

  return 0;
}
The basic idea of the algorithm is that every SPE waits until all the other SPEs have reached the barrier. To that end, the signal 1 channel is configured in OR mode, so what an SPE reads from its signal 1 channel is a bit field identifying the SPEs that have already reached the barrier. Each SPE then loops until the bits of all remaining SPEs have arrived, and returns.
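
The bit-field bookkeeping described above can be followed step by step with a small host-side model. This is a sequential sketch only: the `model_*` functions are hypothetical stand-ins for the signal notification hardware, and no real concurrency or DMA is involved.

```c
#include <assert.h>
#include <stdint.h>

#define SPE_COUNT 6

/* Stand-in for each SPE's signal notification register 1 in OR mode. */
static uint32_t sig1[SPE_COUNT];
/* Bits each SPE has accumulated so far inside its barrier call. */
static uint32_t received[SPE_COUNT];
/* Whether each SPE has entered the barrier yet. */
static int arrived[SPE_COUNT];

/* OR mode: incoming bits accumulate until the register is read. */
static void model_send_signal(int target, uint32_t bits) {
  sig1[target] |= bits;
}

/* Reading returns the accumulated bits and clears the register. */
static uint32_t model_read_signal(int spe) {
  uint32_t value = sig1[spe];
  sig1[spe] = 0;
  return value;
}

/* SPE `rank` reaches the barrier and notifies every other SPE. */
static void model_arrive(int rank) {
  int i;
  arrived[rank] = 1;
  received[rank] = 1u << rank;
  for (i = 0; i < SPE_COUNT; i++)
    if (i != rank)
      model_send_signal(i, 1u << rank);
}

/* One polling step for `rank`: returns 1 once all bits have been seen. */
static int model_poll(int rank) {
  const uint32_t expected = (1u << SPE_COUNT) - 1;
  if (!arrived[rank])
    return 0;
  received[rank] |= model_read_signal(rank);
  return received[rank] == expected;
}

/* Number of SPEs that would currently be allowed past the barrier. */
static int model_count_passed(void) {
  int i, passed = 0;
  for (i = 0; i < SPE_COUNT; i++)
    passed += model_poll(i);
  return passed;
}
```

In this model, nobody passes while only five SPEs have arrived, and all six pass once the last one arrives. Signals sent to an SPE before it enters the barrier simply stay in its register until its first read.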

On PS3, there are several cases:
- If the SPE code was compiled using -O2, only one SPE passes the barrier, all others hang
- If the SPE code was compiled without optimization, one, two or three SPE's pass the barrier
- If I comment out the printf in the spe_barrier function, the code works whatever the optimization level.

On the Cell simulator, every case works.

This optimization / printf difference makes me suspect a synchronization problem, but my lack of experience really does not help me find its source.

I really wonder why an SPE might fail to receive all the signals that have been sent to it.

What do you people think about this? Could it be a Linux kernel problem?

thank you in advance,

François
IronPeter
Posts: 207
Joined: Mon Aug 06, 2007 12:46 am

Post by IronPeter »

Hmm... Probably you need the following code

while( !spu_stat_signal1( ) );
spu_read_signal1( );
ps2devman
Posts: 259
Joined: Mon Oct 09, 2006 3:56 pm

Post by ps2devman »

Just for additional useless info..., MS and Nvidia call this 'barrier' mechanism a 'fence'... (may help to google and find more info about it)

Post by ouasse »

IronPeter wrote: Hmm... Probably you need the following code

while( !spu_stat_signal1( ) );
spu_read_signal1( );
There should be no reason to do it this way.

When spu_read_signal1() is called, the SPU is supposed to stall until a signal notification has been received. I tried your solution anyway, and it doesn't work.

In any case, I have serious doubts about the integrity of my PS3 kernel. This program definitely should work on the PS3, just as it does on the Cell simulator. Could some of you try this program on your PS3s and tell me whether it terminates correctly? If it works, could you please tell me which kernel you are using (version and kernel config file)?

thank you very much.

Post by ouasse »

I have found a working solution. Putting

Code: Select all

  mfc_write_tag_mask(1<<4);
  spu_mfcstat(MFC_TAG_UPDATE_ALL);
after the mfc_sndsig loop fixed the problem.

It is still quite strange that I had to do this to fix the problem. The CBE handbook states that there is no need to wait for sndsig DMA commands to complete, since they are guaranteed to do so.

Post by ouasse »

Shame on me. The previous post was a workaround, not a solution.
The definitive solution was to declare the 'signal' variable as static in spe_barrier().

Post by IronPeter »

Cool :). DMA from stack.

Yes, nice error.
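
The "DMA from stack" diagnosis can be illustrated with a host-side model: a queued command only records where the data lives, and the data is read later, possibly after the function that issued the command has returned. The sketch below uses hypothetical names (`model_enqueue`, `model_drain`, `notify_static`, `notify_wait`) and is not SPU code. It shows the two safe patterns from this thread: a static source buffer, and waiting for completion before returning.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of an asynchronous command queue: a command stores
   pointers only; the source is read when the queue is drained. */
typedef struct {
  const uint32_t *src;
  uint32_t *dst;
} dma_cmd;

static dma_cmd queue[8];
static size_t queued;

static void model_enqueue(const uint32_t *src, uint32_t *dst) {
  assert(queued < 8);
  queue[queued].src = src;
  queue[queued].dst = dst;
  queued++;
}

/* Executes all pending commands, OR-ing the source into the destination. */
static void model_drain(void) {
  size_t i;
  for (i = 0; i < queued; i++) {
    *queue[i].dst |= *queue[i].src;
    queue[i].src = NULL;
    queue[i].dst = NULL;
  }
  queued = 0;
}

/* The fix: static storage outlives the call, so a late read is still valid. */
static void notify_static(uint32_t *target, uint32_t bits) {
  static uint32_t signal;
  signal = bits;
  model_enqueue(&signal, target);
}

/* The workaround: a stack buffer is safe only if the transfer has completed
   before the function returns (model_drain stands in for the tag wait). */
static void notify_wait(uint32_t *target, uint32_t bits) {
  uint32_t signal = bits;
  model_enqueue(&signal, target);
  model_drain();
}
```

With `notify_static`, the destination is still unchanged when the function returns and receives the value once `model_drain()` runs. With a stack buffer and no wait, the later read would see whatever the stack holds by then, which matches the symptoms changing with optimization level and printf calls.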