Experiencing disruptions

Cloud WN network problem →

Computing Elements

Investigating - We are investigating a potential issue that might affect the uptime of one of our services. We are sorry for any inconvenience this may cause you. This incident post will be updated once we have more information.


HPC - Broken Omnipath switch →

Batch System

We have two Omnipath switches, and the one serving the HPC nodes has failed. These switches carry the message exchange for parallel jobs (the Slurm compute queue). We are waiting for a replacement to be sent to us.


Altamira supercomputer: Altamira supercomputer related systems.
Batch System: Slurm batch system for Altamira. Status: Maintenance
Login nodes: Altamira login nodes (login1, login2). Status: Operational
Cloud Infrastructure: OpenStack Cloud infrastructure.
Grid and HTC: General purpose batch system and high throughput compute system.
Web and miscellaneous services: Web services, wiki pages and other services.
AAI: Authentication, Authorization and Identity systems.
Networking: Internal and external networking.
Storage systems: Distributed storage systems.

Incident history


September 7, 2023 at 9:15 AM UTC

Cloud network problem

Resolved after 100h 0m of downtime
April 10, 2023 at 3:34 AM UTC

General power failure

Resolved after 3h 54m of downtime
March 6, 2023 at 8:00 AM UTC

Router upgrade

Resolved after 4h 0m of downtime
February 8, 2023 at 11:20 AM UTC

External network VLAN change

Resolved after 20h 40m of downtime
February 1, 2023 at 9:15 AM UTC

Spectrum Scale low performance

Resolved after 239h 59m of downtime
January 19, 2023 at 7:30 AM UTC

Nextcloud Upgrade

Resolved after 60m of downtime
December 14, 2022 at 8:55 AM UTC

Cloud DHCP failed

Resolved after 1h 4m of downtime
November 22, 2022 at 7:00 PM UTC

Data center (CPD) status update

Resolved after 208h 13m of downtime
November 14, 2022 at 8:00 AM UTC

Login nodes kernel update

Resolved in under a minute
October 27, 2022 at 7:00 AM UTC

Network Link Upgrade

Resolved after 1h 30m of downtime
