[personal profile] barking_iguana
I have large CSVs I need to analyze. Some have over 100,000 lines, some have over 100 columns, so a spreadsheet is not going to cut it. The first row is always column titles. I need to choose a database back end and preferably a good GUI front end (unless they come together, which I'm now very doubtful of). I'd spend $40 for something that gets the job done. I'd spend $80 for something that has everything I want with a good, intuitive UI and documentation. But free is better if it's not too much of a hassle.

I'm running a Windows laptop with 4GB of memory. I don't need client-server, multiple users, optimization for read-only access, or anything that web developers choose a database for. But I do need to crunch a lot of data without using more memory than I have. So it seems SQLite > PostgreSQL > MySQL. But I also need a facility to import the CSVs, and with those 100+ columns I want a wizard that will create the table and use the column names in the first row. It seems there are only two SQLite tools that do that, except one has that particular functionality currently broken and the other is very clunky and costs more than I want to pay for clunky.
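For what it's worth, the "wizard" behavior I want (read the header row, create the table, load the rest) is only a few lines of Python with the bundled csv and sqlite3 modules. A minimal sketch; the file, database, and table names here are made up:

```python
import csv
import sqlite3

def import_csv(db_path, csv_path, table):
    """Create `table` using the CSV's first row as column names,
    then stream the remaining rows in (so memory use stays small)."""
    con = sqlite3.connect(db_path)
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)  # first row holds the column titles
        cols = ", ".join(f'"{c}"' for c in header)
        placeholders = ", ".join("?" for _ in header)
        con.execute(f'CREATE TABLE "{table}" ({cols})')
        # executemany consumes the reader row by row; the whole
        # file is never held in memory at once.
        con.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    con.commit()
    con.close()
```

Since the rows are streamed rather than read into a list, even a 100,000-line file shouldn't strain a 4GB machine.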

LibreOffice Base's documentation regarding importing CSVs is also on the fritz, and as near as I can tell the functionality itself went missing after a reorganization of the UI. But OpenOffice/LibreOffice (if it works) is a fourth option for the back end, I suppose.

Learn ...

Date: 2011-12-20 03:12 pm (UTC)
From: [identity profile] freelikebeer.livejournal.com
python? Once you crunch the initial 100k lines, how often will those lines change?

Re: Learn ...

Date: 2011-12-20 03:35 pm (UTC)
From: [identity profile] barking-iguana.livejournal.com
I'll leave my original tables alone, but I'll be calculating a bunch of intermediate values and then several final values for export for each row. I may be coming back to repeat the procedure a few times a year, but I'm not going to worry about that for now.

How much ...

Date: 2011-12-20 05:36 pm (UTC)
From: [identity profile] freelikebeer.livejournal.com
of every row do you use in an individual calculation? You can view the tables as your database, so dumping them into a SQL-served database is redundant. If you want to move from flat files to a SQL database, that is fine, but then the SQL database should become your authoritative data source [and so, maintained].

If you are mostly interested in focusing on the calculations themselves, then I would prefer using Python to express the calculations, rather than SQL to re-express the data.

Re: How much ...

Date: 2011-12-20 08:56 pm (UTC)
From: [identity profile] barking-iguana.livejournal.com
The tables are meant to be used relationally. The user-created primary key in one table is often part of the identifying data in another table. There will be some calculations that involve quite a few columns in each row, and there will be many queries that must return calculations from more than one table.

Someone at Google+ said
Load it into a MySQL database, and use DbVisualizer to do SQL queries, reports, and analysis on it. Or, if you want more fuzzy bits, import it into an Access database and use its happy fun tools.

'mysqlimport' will auto-create the tables for you (see http://dev.mysql.com/doc/refman/5.0/en/mysqlimport.html ).

I'm not understanding the "Crunch a lot of data without using more memory than I have" statement. From a database standpoint, your database is tiny. 100,000 rows with 100 columns? I'm running far larger databases in 2gig of ram on a linux host, and get instantaneous results.
which looks good to me, but you're not the only person to suggest Python, either. If Python can handle several of these tables simultaneously in memory on my machine, then it might be worth learning a few lines of it. I know some SQL and once wrote a proc that people kept trying to replace for years, except whatever they wrote ran more slowly. But I never used it much, and it's been many years since I had the month where it was 20% of my job.
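On the "several tables in memory" question: the plain-Python approach would be to load each CSV into a dict keyed on its primary key, so rows from a second table can be matched against it. A rough sketch (file and column names are hypothetical):

```python
import csv

def load_keyed(path, key_col):
    """Load a CSV into a dict of rows keyed on one column.
    csv.DictReader turns the header row into the field names,
    so each row comes back as a {column: value} mapping."""
    table = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            table[row[key_col]] = row
    return table
```

One caveat: CPython strings and per-row dicts carry real overhead, so a 100,000-row, 100-column table can approach a gigabyte in this representation. Holding several at once on a 4GB machine would be tight; streaming one file while keying only the smaller lookup tables (or just using a database) is safer.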

I'm good ...

Date: 2011-12-20 09:48 pm (UTC)
From: [identity profile] freelikebeer.livejournal.com
with loading it into MySQL, but you should commit to it and give up your tables once they are in the dbms.

Python is a useful language to have in the toolbox, but if this is a 'production' job and not a learning opportunity, use what you are fluent in.

Re: I'm good ...

Date: 2011-12-20 10:02 pm (UTC)
From: [identity profile] barking-iguana.livejournal.com
Even before I saw your reply, I was returning here to say that I think the Group By functionality in SQL will be very well suited to what I need, so I should use MySQL.

This data, once used to determine other constants, will not be part of the application itself. It is also provided by other people. I think keeping the CSVs as backup makes sense: if I accidentally do something to my database, I still have the original data in its original condition. The database will hold both the original and the derived data.
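The GROUP BY pattern I have in mind is the same SQL whether the engine is MySQL or SQLite; a minimal sketch using Python's bundled sqlite3 (the table, columns, and values here are invented for illustration):

```python
import sqlite3

# Toy table standing in for one of the real CSVs.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE readings (station TEXT, value REAL)")
con.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("A", 1.0), ("A", 3.0), ("B", 2.0)],
)

# One derived value per group; the same pattern scales to 100k rows.
rows = con.execute(
    "SELECT station, AVG(value) FROM readings "
    "GROUP BY station ORDER BY station"
).fetchall()
print(rows)  # [('A', 2.0), ('B', 2.0)]
```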
