Events:

2017-12-28 at 12:12 [tegner]
The machine hosting the Tegner slurm master controller is currently unavailable. Batch system operations won't respond. Given the holiday season, the time to fix is hard to estimate.
2017-12-20 at 13:45 [tegner]
At about 11:38 CET (10:38 UTC), the Tegner login node tegner-login-1 crashed because of a failed memory module. tegner-login-2 is currently used instead.
2017-12-18 at 10:07 [beskow]
The involved batch front-ends have been restarted and things should work as usual. Note that many jobs running on the affected front-end have been impacted.
2017-12-18 at 07:46 [beskow]
One batch front-end seems to have run out of memory (OOM) overnight and will likely be restarted later this morning. Until then, batch-system related operations will be affected.
2017-12-14 at 13:49 [tegner]
At 14:00 today, 2017-12-14, we will start a rolling upgrade of the compute nodes and login nodes of tegner. Apart from the login node restarts, there should be no impact for users. We plan to restart tegner.pdc.kth.se on Monday 2017-12-18 at 10:00 and tegner-login-2.pdc.kth.se at 16:00 today.
2017-12-09 at 10:52 [beskow]
Informational: last night a batch front-end was unintentionally emptied rather than drained of jobs. Some jobs might have been affected.
2017-11-28 at 11:59 [beskow]
Cooling/backup cooling fail-over tests completed. Jobs running again.
2017-11-27 at 14:10 [beskow]
Tomorrow, 2017-11-28, starting ~10:00, we will stress-test the cooling/backup cooling fail-over. The system will be up, but no user jobs will be allowed to execute during testing. The tests are expected to be completed the same day.
2017-11-14 at 14:28 [beskow]
The system has been restarted and has been running jobs again for a while.
2017-11-14 at 12:07 [beskow]
The system will be restarted ~13:00 today.
2017-11-14 at 09:55 [beskow]
During maintenance work on the machine room environment control system, backup cooling did not work as anticipated. Two cabinets out of eleven shut down due to insufficient cooling. We will assess the situation and then decide whether to do a full system restart or let stray running jobs complete.
2017-11-02 at 20:05 [tegner]
The tegner cluster should now be back and accepting jobs. However, we have identified an issue with the new setup that, while hopefully mitigated, will most likely still require an additional, shorter service window as soon as a proper solution can be found.
2017-11-01 at 18:31 [beskow]
Preventive maintenance of beskow and the neighbouring file-system klemming completed. The systems have been serving jobs again for a while.
2017-10-27 at 11:27 [beskow]
Preventive maintenance on beskow will start Wednesday November 1st, 08:00. The system will be unavailable during parts of that day.
2017-10-27 at 07:57 [tegner]
The Tegner cluster will have a service window on the 1st and possibly the 2nd of November 2017. System software will be updated and the system interconnect will be restructured to allow for additional nodes. We hope to complete the work in one day, but the 2nd of November is reserved as well in case unforeseen delays develop.
2017-10-17 at 14:46 [tegner]
tegner-login-1, one of the login nodes of the Tegner cluster, was unresponsive for about 30 minutes around 14:00. It should be back now and no running jobs should have been affected. Further diagnostics are in progress.
2017-10-13 at 15:38 [tegner]
Tegner is up and running again.
2017-10-13 at 13:35 [tegner]
Tegner - switch failure, all 32 nodes in t02 rack crashed.
2017-09-27 at 19:21 [beskow]
What seems to be a tripped circuit breaker caused ~90 nodes in one cabinet to shut down due to under-voltage earlier this afternoon. Power was serviced in another cabinet of the system this morning; it remains to be seen whether the two events are correlated.
2017-09-01 at 19:56 [beskow]
Most parts of the Beskow upgrade complete. Jobs are running, and login is enabled again.
2017-08-21 at 13:22 [beskow]
The Beskow upgrade will start on Monday 2017-08-28. Preparation and physical installation are estimated to take one work week. Integration, customisation & stabilisation will take place the following work week.
2017-07-17 at 13:44 [beskow]
The system has been restarted and is running jobs again. One cabinet, 1/9 of the system, is locked out from running jobs to reduce its power draw.
2017-07-17 at 11:16 [beskow]
One cabinet developed power problems and one row of cabinets was automatically shut down overnight. All running jobs were affected. The system will now be shut down. Given the time of year, a restart might take somewhat longer than we would like.
2017-07-10 at 08:05 [tegner]
The tegner login node tegner-login-1 is now back after having been restarted.
2017-07-09 at 18:41 [tegner]
tegner-login-1, one of the login nodes of the tegner cluster, will be restarted at 08:00 CEST on Monday, July 10th 2017, for system software updates. Please consider using tegner-login-2 until then. If you have any questions, please send an email to support@pdc.kth.se.
2017-07-04 at 13:27 [beskow]
Maintenance completed. The system is on-line and running jobs again.
2017-07-03 at 10:46 [beskow]
Preventive maintenance to take place tomorrow, 2017-07-04. The system will be taken off-line after 08:00.
2017-06-01 at 21:22 [beskow]
Earlier this afternoon/evening the PDC network backbone was misbehaving. As a precaution, jobs were prevented from starting while we investigated. The block has now been lifted and jobs are starting again.
2017-05-26 at 19:19 [beskow]
The system was restarted earlier in the afternoon. Batch jobs were released and allowed to run a short while ago.
2017-05-26 at 09:58 [beskow]
Several cabinets reported power-related problems this morning and many nodes have shut down. As today is a non-working day (klämdag/Brückentag), the response time and the time to fix will be affected.
2017-03-13 at 20:30 [beskow]
Power disruptions in cabinet c2-1 caused nodes 1536-1727 (nids) to fail. Jobs on the affected nodes will also fail.
2017-03-10 at 17:30 [milner]
The restart is finished, milner is on-line again.
2017-03-10 at 09:49 [milner]
Some critical milner file-systems went off-line overnight and the system needs to be restarted. The restart will begin late this Friday afternoon at the earliest.
2017-02-07 at 16:54 [tegner]
The service window for tegner has now ended. Thanks for your patience.
2017-02-06 at 09:30 [tegner]
Tegner will have a scheduling pause for maintenance and upgrades on the 6th and 7th of February. Access to login and transfer nodes may be affected for parts of the service window.