Working with XAMPP

I. Install XAMPP's MySQL as a Windows service.
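
A minimal sketch, assuming XAMPP lives at D:\xampp; MySQL's --install flag registers mysqld as a Windows service (run from an elevated command prompt):

    D:\xampp\mysql\bin\mysqld.exe --install mysql
    net start mysql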

II. Create virtual host in xampp
1. Edit the file D:\xampp\apache\conf\extra\httpd-vhosts.conf and add a VirtualHost entry (see the sketch below)

2. Edit the file C:\Windows\System32\drivers\etc\hosts and add your domain
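
For example (the myproject.local domain and DocumentRoot path are placeholders), in httpd-vhosts.conf:

    <VirtualHost *:80>
        ServerName myproject.local
        DocumentRoot "D:/xampp/htdocs/myproject"
    </VirtualHost>

and in the hosts file:

    127.0.0.1    myproject.local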

Restart Apache, DONE!

AWS Redshift

I. Loading data from S3 with Gzip
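
A minimal sketch of a gzip load; the table name, bucket path, and IAM role are placeholders:

    COPY my_table
    FROM 's3://my-bucket/data/part_'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    DELIMITER '|'
    GZIP;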

II. Unloading data to S3 with Gzip
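
A minimal sketch of a gzip unload; again the query, bucket path, and IAM role are placeholders:

    UNLOAD ('SELECT * FROM my_table')
    TO 's3://my-bucket/unload/my_table_'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    DELIMITER '|'
    GZIP;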

III. Show which queries are running
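
One way to do this is to query the STV_RECENTS system table:

    SELECT pid, user_name, starttime, duration, query
    FROM stv_recents
    WHERE status = 'Running';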

IV. How to get the size of a database in Redshift
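
One common approach is to sum per-table sizes from SVV_TABLE_INFO (the size column is reported in 1 MB blocks):

    SELECT "database", SUM(size) AS size_mb
    FROM svv_table_info
    GROUP BY "database";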

V. Knowledge about Redshift

1. What is VACUUM?

VACUUM reorganizes data: run the VACUUM command on the tables in the query to reclaim space and re-sort rows. Running VACUUM helps if the unsorted region is large and the query uses the sort key in a join or in the predicate.

When you run UPDATE in Redshift, it actually deletes the old row and inserts a new one.
The old rows are only marked for deletion, not physically removed.
A SELECT then scans both the live and the deleted rows, so query performance degrades over time.
a. VACUUM to remove the unnecessary rows that were marked for deletion.
b. VACUUM to re-index records (VACUUM REINDEX, for interleaved sort keys).
c. VACUUM to re-sort records.
All of these increase performance when querying the table.
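
A sketch of the VACUUM variants, using a hypothetical table named sales:

    VACUUM FULL sales;        -- reclaim space and re-sort rows (the default)
    VACUUM DELETE ONLY sales; -- reclaim space without re-sorting
    VACUUM SORT ONLY sales;   -- re-sort rows without reclaiming space
    VACUUM REINDEX sales;     -- re-analyze interleaved sort keys, then run a full vacuum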

2. What is ANALYZE?

The ANALYZE command obtains a sample of rows from the table, does some calculations, and saves resulting column statistics. By default, Amazon Redshift runs a sample pass for the DISTKEY column and another sample pass for all of the other columns in the table. If you want to generate statistics for a subset of columns, you can specify a comma-separated column list.

Run the ANALYZE command on the tables in the query to make sure statistics are up to date. Running ANALYZE helps if any of the tables in the query have recently changed a lot in size. If running a full ANALYZE command will take too long, run ANALYZE on a single column to reduce processing time. This approach will still update the table size statistics; table size is a significant factor in query planning.

STATUPDATE ON: by default the COPY command performs an ANALYZE only after it loads data into an empty table; specifying STATUPDATE ON makes COPY update the statistics on every load.
When analyzing manually, focus on the columns that are frequently used in the following: sorting and grouping operations, joins, and query predicates.
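
A sketch putting both together; the table, columns, bucket, and IAM role are placeholders:

    -- generate statistics for a subset of columns only
    ANALYZE sales (sold_date, customer_id);

    -- force statistics collection even though sales is not empty
    COPY sales
    FROM 's3://my-bucket/sales/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    GZIP
    STATUPDATE ON;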

 

Amazon Redshift, an independent review

As stated on the Amazon AWS front page, Amazon Redshift is "a fast, fully managed, petabyte-scale data warehouse that makes it simple and cost-effective to efficiently analyze all your data using your existing business intelligence tools."

In this independent review I will set up an Amazon Redshift data warehouse and run some basic tests to check, without going into too much depth, what can be achieved with a simple Redshift installation.

Fast Inserts to PostgreSQL with JDBC and COPY FROM

I was reading some materials on how to make database inserts as efficient as possible from Java. It was motivated by an existing application that stores measurements into PostgreSQL. So I decided to compare the known approaches to see whether there is some way to improve the application, which already uses batched inserts.

For the purpose of the test I created the following table:
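
The table definition itself is missing from this copy; a minimal stand-in with hypothetical column names (the sketches further below assume it):

    CREATE TABLE measurement (
        id    bigint,
        time  timestamp,
        value double precision
    );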

I decided to test the insertion of 1000 records into the table. The data for the records was generated before running any of the test methods. Four test methods were created to reflect the usual approaches:

  • VSI (Very Stupid Inserts) – executing queries made of concatenated Strings one by one
  • SPI (Stupid Prepared Inserts) – similar to VSI but using prepared statements
  • BPI (Batched Prepared Inserts) – prepared inserts, executed in batches of various length
  • CPI (Copy Inserts) – inserts based on COPY FROM, executed in batches of various length

Prior to each test the table is cleared, and it is cleared again after all data are successfully inserted. Commit is called only once in each test method, following all the insert calls. The labeled excerpts below illustrate the approaches listed above:

VSI

SPI

BPI

CPI
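
The original code excerpts did not survive in this copy. The following is a minimal Java sketch of the two most interesting approaches, BPI and CPI, not the author's original code; it assumes the measurement table above, a PostgreSQL connection with autocommit disabled, and the CopyManager/BaseConnection classes shipped with the PostgreSQL JDBC driver:

    import java.io.StringReader;
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.Timestamp;
    import org.postgresql.copy.CopyManager;
    import org.postgresql.core.BaseConnection;

    public class InsertSketches {

        // BPI: prepared inserts executed in batches of batchSize
        static void batchedInserts(Connection conn, int n, int batchSize) throws Exception {
            try (PreparedStatement ps =
                     conn.prepareStatement("INSERT INTO measurement VALUES (?, ?, ?)")) {
                for (int i = 0; i < n; i++) {
                    ps.setLong(1, i);
                    ps.setTimestamp(2, new Timestamp(System.currentTimeMillis()));
                    ps.setDouble(3, Math.random());
                    ps.addBatch();
                    if ((i + 1) % batchSize == 0) ps.executeBatch();
                }
                ps.executeBatch();   // flush the remaining rows
            }
            conn.commit();           // single commit, as in the tests
        }

        // CPI: COPY FROM STDIN via the driver's CopyManager
        static void copyInserts(Connection conn, int n) throws Exception {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < n; i++) {
                sb.append(i).append('\t')                            // id
                  .append(new Timestamp(System.currentTimeMillis())) // time
                  .append('\t').append(Math.random()).append('\n');  // value
            }
            CopyManager cm = new CopyManager((BaseConnection) conn);
            cm.copyIn("COPY measurement FROM STDIN", new StringReader(sb.toString()));
            conn.commit();
        }
    }

COPY's default text format is tab-delimited, and Timestamp.toString() happens to produce a literal PostgreSQL accepts, which keeps the CPI sketch short.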

I hoped to get some improvement from using COPY FROM instead of batched inserts, but did not expect a big gain. The results were a pleasant surprise: for a batch of size 50 (as defined in the original application I wanted to improve), COPY FROM gave a 40% improvement. I expect further improvement when the data comes from a stream, skipping the StringBuffer-with-PushbackReader exercise.

See the graphs yourself – the number following the method abbreviation is the size of the batch.

[inserts-barchart image: average time in milliseconds]

[inserts-linechart image: all 200 runs individually]

Resource : http://rostislav-matl.blogspot.co.uk/2011/08/fast-inserts-to-postgresql-with-jdbc.html

Adding external/custom jars into a Maven project

I. Using a system dependency

The first method is to add the required dependency with the system scope, providing an absolute path to a jar file placed somewhere on the local disk:

For example, you could deploy the library to /usr/lib/jvm/jre/lib/ext/RedshiftJDBC41-1.1.2.0002.jar and point systemPath at that location.
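
A sketch of the corresponding pom.xml entry; the groupId and artifactId coordinates are assumptions, only the systemPath must match the real file:

    <dependency>
        <groupId>com.amazon.redshift</groupId>
        <artifactId>redshift-jdbc41</artifactId>
        <version>1.1.2.0002</version>
        <scope>system</scope>
        <systemPath>/usr/lib/jvm/jre/lib/ext/RedshiftJDBC41-1.1.2.0002.jar</systemPath>
    </dependency>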

II. Installing the jar into the local Maven repository

  • First we need to install Maven on the server; follow these steps (a consolidated sketch of the commands follows the list):
  1. Run the wget command from the directory you want to extract Maven to.
  2. Run the following to extract the tar.
  3. Move Maven to /usr/local/apache-maven.
  4. Next, add the environment variables to your ~/.bashrc file.
  5. Execute source ~/.bashrc.
  6. Verify everything is working with the following command.
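
A consolidated sketch of steps 1 through 6; the Maven version and download URL are assumptions, so substitute the release you actually want:

    wget https://archive.apache.org/dist/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz
    tar xzf apache-maven-3.6.3-bin.tar.gz
    sudo mv apache-maven-3.6.3 /usr/local/apache-maven

    # append to ~/.bashrc
    export M2_HOME=/usr/local/apache-maven
    export PATH=$M2_HOME/bin:$PATH

    source ~/.bashrc
    mvn -version    # should print the Maven and Java versions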

  • A much better solution is to add the required dependency manually to the local repository using the mvn install:install-file command:

  • For example, adding the external jar evalpostfix-1.0.jar to the local repository could look like this:
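
A minimal sketch; the com.example coordinates are placeholders chosen for illustration:

    mvn install:install-file -Dfile=evalpostfix-1.0.jar \
        -DgroupId=com.example -DartifactId=evalpostfix \
        -Dversion=1.0 -Dpackaging=jar

Once installed, the jar can be referenced like any other dependency:

    <dependency>
        <groupId>com.example</groupId>
        <artifactId>evalpostfix</artifactId>
        <version>1.0</version>
    </dependency>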

III. Adding the library manually

  • Install 7-Zip on your Windows machine, open the jar file with it, and copy your library into that jar.