On Thu, Sep 27, 2012 at 3:54 PM, Francesco Pietra <chiendarret@gmail.com> wrote:
> There are for me two ways of getting cuda at work: (a) install the
> driver according to nvidia (as probably is implied in what you
> suggested); (b) rely on Debian amd64, which furnishes precompiled
> nvidia driver. I adopted (b) because upgrading is automatic and Debian
> is notoriously highly reliable.
>
> I did not note which CUDA driver I had just before the "fatal"
> upgrade, but I had not upgraded for months. The version noted
> on my amd64 notebook is 295.53; I probably upgraded from that version.
>
> Now, on amd64, version 304.48.1 is available, while on my system
> version 302.17-3 is installed, along with the basic
> nvidia-kernel-dkms, as I posted initially. All under cuda-toolkit
> version 4 (although this is not used in the "Debian way" of my
> installation).
>
> The output of
>
> dpkg -l |grep nvidia
>
> modinfo nvidia
>
> which I posted initially, indicates, in my experience, that everything
> is working correctly. On this basis, I suspected that 302.17-3 is too
> new for current NAMD builds, although everything is under toolkit
> 4 (or an equivalent).
>
> I could try to install the 295 driver in place of 302, but someone
> probably knows better than I do what to expect. Moving forward is easy;
> going back, with any OS, is a matter for experts.
>
> I am not sure that all I said is correct. I am a biochemist, not a
> software expert.
>
> Thanks for your kind attention.
>
> francesco pietra
>
> On Thu, Sep 27, 2012 at 2:58 PM, Aron Broom <broomsday@gmail.com> wrote:
>> So one potential problem here: is 302.17 a development driver, or just the
>> one Linux installs itself from the proprietary drivers? It looks to me like
>> the absolutely newest development driver is ver 295.41. I'm not confident
>> that you'd be able to run NAMD without the development driver installed.
>> The installation is manual, and it should overwrite whatever driver you have
>> there. I recommend a trip to the CUDA development zone webpage.
>>
>> ~Aron
>>
>> On Thu, Sep 27, 2012 at 3:52 AM, Francesco Pietra <chiendarret@gmail.com> wrote:
>>>
>>> Hello:
>>> I have tried the NAMD_CVS-2012-09-26_Linux-x86_64-multicore-CUDA with
>>> nvidia version 302.17:
>>>
>>> Running command: namd2 heat-01.conf +p6 +idlepoll
>>>
>>> Charm++: standalone mode (not using charmrun)
>>> Converse/Charm++ Commit ID: v6.4.0-beta1-0-g5776d21
>>> CharmLB> Load balancer assumes all CPUs are same.
>>> Charm++> Running on 1 unique compute nodes (12-way SMP).
>>> Charm++> cpu topology info is gathered in 0.001 seconds.
>>> Info: NAMD CVS-2012-09-26 for Linux-x86_64-multicore-CUDA
>>> Info:
>>> Info: Please visit http://www.ks.uiuc.edu/Research/namd/
>>> Info: for updates, documentation, and support information.
>>> Info:
>>> Info: Please cite Phillips et al., J. Comp. Chem. 26:1781-1802 (2005)
>>> Info: in all publications reporting results obtained with NAMD.
>>> Info:
>>> Info: Based on Charm++/Converse 60400 for multicore-linux64-iccstatic
>>> Info: Built Wed Sep 26 02:25:08 CDT 2012 by jim on lisboa.ks.uiuc.edu
>>> Info: 1 NAMD CVS-2012-09-26 Linux-x86_64-multicore-CUDA 6 gig64
>>> francesco
>>> Info: Running on 6 processors, 1 nodes, 1 physical nodes.
>>> Info: CPU topology information available.
>>> Info: Charm++/Converse parallel runtime startup completed at 0.085423 s
>>> FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 3 (gig64):
>>> initialization error
>>> FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 1 (gig64):
>>> initialization error
>>> ------------- Processor 3 Exiting: Called CmiAbort ------------
>>> Reason: FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 3 (gig64):
>>> initialization error
>>>
>>> Program finished.
>>> FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 4 (gig64):
>>> initialization error
>>> FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 2 (gig64):
>>> initialization error
>>>
>>>
>>> As I received (nearly) no comments on these failures, I can only imagine
>>> that either (i) my question - disregarding obvious issues - was too silly
>>> to merit attention, or (ii) it is well known that nvidia version 302.17
>>> is incompatible with current NAMD builds for GNU/Linux.
>>>
>>> In any event, given the metapackage framework, it is probably impossible
>>> within Debian GNU/Linux wheezy to go back to a previous version of
>>> nvidia. On the other hand, the stable version of the OS furnishes a
>>> much too old version of nvidia. Therefore, my question is:
>>>
>>> Is there any chance of compiling NAMD against the installed nvidia version 302.17?
>>>
>>> Thanks for any advice. Without access to namd-cuda I am currently unable
>>> to answer a question raised by the reviewers of a manuscript (the CPU
>>> cluster was shut down long ago, as it became too expensive for
>>> our budget).
>>>
>>> francesco pietra
>>>
>>> On Wed, Sep 26, 2012 at 4:08 PM, Francesco Pietra <chiendarret@gmail.com> wrote:
>>> > I forgot to mention that I am at the final version 2.9 of NAMD.
>>> > f.
>>> >
>>> > On Wed, Sep 26, 2012 at 4:05 PM, Aron Broom <broomsday@gmail.com> wrote:
>>> >> I'm not certain, but I think the driver version needs to match the CUDA
>>> >> toolkit version that NAMD uses, and I think the library file NAMD comes
>>> >> with is toolkit 4.0 or something of that sort.
>>> >>
>>> >> ~Aron
>>> >>
>>> >>
>>> >> On Wed, Sep 26, 2012 at 9:58 AM, Francesco Pietra <chiendarret@gmail.com> wrote:
>>> >>>
>>> >>> Hi:
>>> >>> After updating/upgrading Debian GNU/Linux amd64 wheezy,
>>> >>> minimizations no longer run on the GTX-680:
>>> >>>
>>> >>> CUDA error in cudaGetDeviceCount on Pe 3, Pe 4, Pe 6: initialization
>>> >>> error.
>>> >>>
>>> >>> The two GTX cards are activated as usual with
>>> >>> nvidia-smi -L
>>> >>> nvidia-smi -pm 1
>>> >>>
>>> >>> Server and nvidia are the same version:
>>> >>>
>>> >>> francesco@gig64:~$ dpkg -l |grep nvidia
>>> >>> ii  glx-alternative-nvidia      0.2.2       amd64  allows the selection of NVIDIA as GLX provider
>>> >>> ii  libgl1-nvidia-alternatives  302.17-3    amd64  transition libGL.so* diversions to glx-alternative-nvidia
>>> >>> ii  libgl1-nvidia-glx:amd64     302.17-3    amd64  NVIDIA binary OpenGL libraries
>>> >>> ii  libglx-nvidia-alternatives  302.17-3    amd64  transition libgl.so diversions to glx-alternative-nvidia
>>> >>> ii  libnvidia-ml1:amd64         302.17-3    amd64  NVIDIA management library (NVML) runtime library
>>> >>> ii  nvidia-alternative          302.17-3    amd64  allows the selection of NVIDIA as GLX provider
>>> >>> ii  nvidia-glx                  302.17-3    amd64  NVIDIA metapackage
>>> >>> ii  nvidia-installer-cleanup    20120630+3  amd64  Cleanup after driver installation with the nvidia-installer
>>> >>> ii  nvidia-kernel-common        20120630+3  amd64  NVIDIA binary kernel module support files
>>> >>> ii  nvidia-kernel-dkms          302.17-3    amd64  NVIDIA binary kernel module DKMS source
>>> >>> ii  nvidia-smi                  302.17-3    amd64  NVIDIA System Management Interface
>>> >>> ii  nvidia-support              20120630+3  amd64  NVIDIA binary graphics driver support files
>>> >>> ii  nvidia-vdpau-driver:amd64   302.17-3    amd64  NVIDIA vdpau driver
>>> >>> ii  nvidia-xconfig              302.17-2    amd64  X configuration tool for non-free NVIDIA drivers
>>> >>> ii  xserver-xorg-video-nvidia   302.17-3    amd64  NVIDIA binary Xorg driver
>>> >>> francesco@gig64:~$
>>> >>>
>>> >>>
>>> >>> root@gig64:/home/francesco# modinfo nvidia
>>> >>> filename: /lib/modules/3.2.0-2-amd64/updates/dkms/nvidia.ko
>>> >>> alias: char-major-195-*
>>> >>> version: 302.17
>>> >>> supported: external
>>> >>> license: NVIDIA
>>> >>> alias: pci:v000010DEd00000E00sv*sd*bc04sc80i00*
>>> >>> alias: pci:v000010DEd00000AA3sv*sd*bc0Bsc40i00*
>>> >>> alias: pci:v000010DEd*sv*sd*bc03sc02i00*
>>> >>> alias: pci:v000010DEd*sv*sd*bc03sc00i00*
>>> >>> depends: i2c-core
>>> >>> vermagic: 3.2.0-2-amd64 SMP mod_unload modversions
>>> >>> parm: NVreg_EnableVia4x:int
>>> >>> parm: NVreg_EnableALiAGP:int
>>> >>> parm: NVreg_ReqAGPRate:int
>>> >>> parm: NVreg_EnableAGPSBA:int
>>> >>> parm: NVreg_EnableAGPFW:int
>>> >>> parm: NVreg_Mobile:int
>>> >>> parm: NVreg_ResmanDebugLevel:int
>>> >>> parm: NVreg_RmLogonRC:int
>>> >>> parm: NVreg_ModifyDeviceFiles:int
>>> >>> parm: NVreg_DeviceFileUID:int
>>> >>> parm: NVreg_DeviceFileGID:int
>>> >>> parm: NVreg_DeviceFileMode:int
>>> >>> parm: NVreg_RemapLimit:int
>>> >>> parm: NVreg_UpdateMemoryTypes:int
>>> >>> parm: NVreg_InitializeSystemMemoryAllocations:int
>>> >>> parm: NVreg_UseVBios:int
>>> >>> parm: NVreg_RMEdgeIntrCheck:int
>>> >>> parm: NVreg_UsePageAttributeTable:int
>>> >>> parm: NVreg_EnableMSI:int
>>> >>> parm: NVreg_MapRegistersEarly:int
>>> >>> parm: NVreg_RegisterForACPIEvents:int
>>> >>> parm: NVreg_RegistryDwords:charp
>>> >>> parm: NVreg_RmMsg:charp
>>> >>> parm: NVreg_NvAGP:int
>>> >>> root@gig64:/home/francesco#
>>> >>>
>>> >>> I have also tried with recently used MD files, same problem:
>>> >>> francesco@gig64:~/tmp$ charmrun namd2 heat-01.conf +p6 +idlepoll 2>&1 | tee heat-01.log
>>> >>> Running command: namd2 heat-01.conf +p6 +idlepoll
>>> >>>
>>> >>> Charm++: standalone mode (not using charmrun)
>>> >>> Converse/Charm++ Commit ID: v6.4.0-beta1-0-g5776d21
>>> >>> CharmLB> Load balancer assumes all CPUs are same.
>>> >>> Charm++> Running on 1 unique compute nodes (12-way SMP).
>>> >>> Charm++> cpu topology info is gathered in 0.001 seconds.
>>> >>> Info: NAMD CVS-2012-06-20 for Linux-x86_64-multicore-CUDA
>>> >>> Info:
>>> >>> Info: Please visit http://www.ks.uiuc.edu/Research/namd/
>>> >>> Info: for updates, documentation, and support information.
>>> >>> Info:
>>> >>> Info: Please cite Phillips et al., J. Comp. Chem. 26:1781-1802 (2005)
>>> >>> Info: in all publications reporting results obtained with NAMD.
>>> >>> Info:
>>> >>> Info: Based on Charm++/Converse 60400 for multicore-linux64-iccstatic
>>> >>> Info: Built Wed Jun 20 02:24:32 CDT 2012 by jim on lisboa.ks.uiuc.edu
>>> >>> Info: 1 NAMD CVS-2012-06-20 Linux-x86_64-multicore-CUDA 6 gig64
>>> >>> francesco
>>> >>> Info: Running on 6 processors, 1 nodes, 1 physical nodes.
>>> >>> Info: CPU topology information available.
>>> >>> Info: Charm++/Converse parallel runtime startup completed at
>>> >>> 0.00989199 s
>>> >>> FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 5 (gig64):
>>> >>> initialization error
>>> >>> ------------- Processor 5 Exiting: Called CmiAbort ------------
>>> >>> Reason: FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 5 (gig64):
>>> >>> initialization error
>>> >>>
>>> >>> FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 1 (gig64):
>>> >>> initialization error
>>> >>> Program finished.
>>> >>> FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 3 (gig64):
>>> >>> initialization error
>>> >>> FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 2 (gig64):
>>> >>> initialization error
>>> >>> FATAL ERROR: CUDA error in cudaGetDeviceCount on Pe 4 (gig64):
>>> >>> initialization error
>>> >>> francesco@gig64:~/tmp$
>>> >>>
>>> >>>
>>> >>> This is a shared-mem machine.
>>> >>> Does the version 302.17 work for you?
>>> >>>
>>> >>> Thanks
>>> >>> francesco pietra
>>> >>>
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Aron Broom M.Sc
>>> >> PhD Student
>>> >> Department of Chemistry
>>> >> University of Waterloo
>>> >>
>>>
>>
>>
>>
>> --
>> Aron Broom M.Sc
>> PhD Student
>> Department of Chemistry
>> University of Waterloo
>>
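
The thread turns on whether the loaded kernel module (302.17 in the modinfo output) and the installed Debian packages (302.17-3 in the dpkg output) are really at the same version. A minimal sketch of that comparison follows. The helper names upstream_version and versions_match are hypothetical, and the package name libcuda1 in the commented usage is an assumption about where Debian ships the CUDA driver library; it does not appear in the listing above, where the case-sensitive "grep nvidia" would not necessarily show it. The sketch ignores Debian epochs and only strips the revision suffix.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of any NVIDIA or Debian tool.

# Strip the Debian revision suffix, e.g. "302.17-3" -> "302.17".
# A version without a revision is returned unchanged.
upstream_version() {
    printf '%s\n' "${1%%-*}"
}

# Succeed (exit status 0) when both arguments carry the same upstream version.
versions_match() {
    [ "$(upstream_version "$1")" = "$(upstream_version "$2")" ]
}

# Possible use on a live system (assumes modinfo and dpkg-query are present,
# and that the CUDA driver library is packaged as libcuda1):
#   mod=$(modinfo -F version nvidia)
#   pkg=$(dpkg-query -W -f='${Version}' libcuda1)
#   versions_match "$mod" "$pkg" || echo "mismatch: module $mod, package $pkg"
```

A match here only rules out one cause of the cudaGetDeviceCount initialization error; it says nothing about whether a given NAMD build supports that driver.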